Spark with Azure HDInsight - Tampa Bay Data Science - Adnan Masood, PhD

Post on 22-Jan-2018


Adnan Masood, PhD

Systems Architect / Data Scientist

adnan.masood@owasp.org

Blog (http://blog.adnanmasood.com),

GitHub (github.com/adnanmasood),

Twitter (@adnanmasood).

Presented at Microsoft Data Science Group – Tampa Bay Data Science Professionals

http://www.meetup.com/data-scientists-tampa-bay/events/231293077/

Spark with Azure HDInsight

About the Speaker

Adnan Masood, Ph.D., is a developer, software architect, and researcher who specializes in FinTech, machine learning, and Bayesian belief networks. Before joining PDS Health care and GDC (a leading prepaid financial technology institution), he enjoyed life as a principal engineer of a start-up and worked for a leading UK-based nonprofit organization as a solutions architect.

A strong believer in the development community, Adnan is an active member of the Open Web Application Security Project (OWASP), an organization dedicated to software security. In the .NET community, he is a cofounder and president of the Pasadena .NET Developers group, which he has been successfully leading for 8 years. He has led a number of successful enterprise solutions and consulted on projects for several Fortune 500 companies.

Adnan devotes himself to his own continual, practical education. He holds certifications in big data, machine learning, and systems architecture from Massachusetts Institute of Technology; an Application Security certification from Stanford University; an SOA Smarts certification from Carnegie Mellon University; and certifications as a ScrumMaster, Microsoft Certified Trainer, Microsoft Certified Solutions Developer, and Sun Certified Java Developer.

For more details, visit Adnan's blog (http://blog.adnanmasood.com), GitHub repository (http://github.com/adnanmasood), and Twitter (@adnanmasood). Adnan can be reached at adnan.masood@owasp.org.

Spark 101

Spark is a unified framework for big data analytics. It provides one integrated API that developers, data scientists, and analysts can use for diverse tasks, such as batch analytics, stream processing, and statistical modeling, that previously required separate processing engines. Spark supports a wide range of popular languages, including Python, R, Scala, SQL, and Java. It can read from diverse data sources and scale to thousands of nodes.


Deployment Models

Big Data Deployment – Public Cloud

• Hadoop-as-a-Service

- Amazon Web Services EC2 and EMR

- Microsoft Azure HDInsight

- Google Cloud Dataproc

- IBM Bluemix ... and others

• Spark-as-a-Service

- All of the above

- Databricks

Big Data Deployment – On-Premises

• Bare-Metal

• Virtual Machines

- VMware Big Data Extensions

- OpenStack Sahara

• Containers

- BlueData

- Mesos

HDInsight as Part of Azure Portal

Spark – Benefits

Spark – Use Cases

Spark is Fast!


Demo - Creating an HDInsight Spark Cluster

HDInsight Spark Streaming

“Along with traditional Hadoop technologies, HDInsight also provides Spark as a cloud service. Spark is an integrated set of open source technologies that can run on a Hadoop cluster. The Spark family includes options for analyzing large amounts of operational data, doing machine learning, and more. It also includes Spark Streaming, a technology for working with streaming data. Spark Streaming is similar to Storm in some ways. Like Storm, it’s a general-purpose technology for processing streaming data. Unlike Storm, Spark Streaming is implemented as an extension to the basic Spark engine—it’s not an add-on technology. This tight connection can make Spark applications faster, since there’s less need to move data between components, and easier to create, since everything uses the same core Spark technology. Because of this, Spark Streaming (and Spark in general) are getting more popular by the day”

David Chappell, "Streaming Scenarios Using the Microsoft Data Platform: A Guide for IT Leaders"

HDInsight Spark Streaming

• What is it?

- Distributed compute framework, an extension of the core Apache Spark API

- Allows users to integrate real-time data from disparate event streams (e.g. Kafka, HDFS, Twitter) in event-driven, asynchronous, scalable, type-safe, and fault-tolerant applications

• When to use it?

- When organizations need real-time decision making

- When you are working with streams of continuous data

• Why Spark Streaming?

- Enables high-throughput and reliable processing of live data streams

- Batch, iterative, and streaming analysis on the same platform

- Easily add machine learning for streaming data pathways
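
The "same platform" point above can be illustrated with a small sketch. This is plain Python rather than the Spark Streaming API, and every name in it (count_words, micro_batches, streaming_word_count, the sample events) is hypothetical. It assumes the micro-batch model, in which a live stream is cut into small finite batches and each batch is handed to ordinary batch code. Spark Streaming cuts batches by time interval; the sketch cuts by record count to stay deterministic.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def count_words(lines: Iterable[str]) -> Counter:
    """Plain batch job: count words in a finite collection of lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly unbounded) stream into small finite batches.

    Assumption: batches are cut by record count, not by time interval.
    """
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def streaming_word_count(stream: Iterable[str], batch_size: int) -> Iterator[Counter]:
    """Run the batch job on each micro-batch and keep a running total."""
    running: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        running.update(count_words(batch))  # same code path as the batch job
        yield Counter(running)              # snapshot of the totals so far


# Hypothetical sample events standing in for a live feed.
events = [
    "spark on hdinsight",
    "spark streaming",
    "storm and spark",
    "hdinsight cluster",
]

snapshots = list(streaming_word_count(events, batch_size=2))
final = snapshots[-1]
```

Because the streaming path reuses count_words unchanged, the final running total matches what the batch job returns over the full input, which is the practical benefit of running batch and streaming analysis on one engine.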

Getting Started

References & Further Reading

Use MapReduce in Hadoop on HDInsight https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce

Get started: Create Apache Spark cluster on HDInsight Linux and run interactive queries using Spark SQL https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/

Azure Machine Learning https://azure.microsoft.com/en-us/services/machine-learning/


Announcing Apache Spark on Azure HDInsight https://channel9.msdn.com/Shows/Azure-Friday/Announcing-Apache-Spark-on-Azure-HDInsight

Apache Zeppelin https://zeppelin.incubator.apache.org

Project Jupyter http://jupyter.org/

https://azure.microsoft.com/en-us/services/hdinsight/

https://azure.microsoft.com/en-us/blog/apache-spark-for-azure-hdinsight-now-generally-available/

Microsoft expands its commitment to Apache Spark big-data framework

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-zeppelin-notebook/

http://www.c-sharpcorner.com/UploadFile/aa700f/jumpstart-into-big-data-with-hdinsight/


Get started: Create Apache Spark cluster on HDInsight Linux and run interactive queries using Spark SQL https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-jupyter-spark-sql/

edX Course: Processing Big Data with Azure HDInsight. Learn how to use Hadoop technologies in Microsoft Azure HDInsight to process big data in this five-week, hands-on course. https://www.edx.org/course/processing-big-data-azure-hdinsight-microsoft-dat202-1x-0

Apache Spark for Azure HDInsight https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/

Build Machine Learning applications to run on Apache Spark clusters on HDInsight Linux https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-ipython-notebook-machine-learning/


Questions