Big Data Solutions in Azure
![Page 1: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/1.jpg)
Big Data Solutions in Azure
Mostafa Elzoghbi
Sr. Technical Evangelist - Microsoft
@MostafaElzoghbi
![Page 2: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/2.jpg)
Session Objectives and Takeaways
• Understanding HDInsight cluster types in Azure
• HBase as a storage option in Hadoop
• Understanding data processing options in the Hadoop ecosystem using Storm and Spark
![Page 3: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/3.jpg)
What is HDInsight
• HDInsight is the Microsoft implementation of Hadoop ecosystem components in the cloud.
• Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability.
• HDInsight is available on Windows and Linux:
  • HDInsight on Linux: a Hadoop cluster on Ubuntu
  • HDInsight on Windows: a Hadoop cluster on Windows Server 2012 R2
![Page 4: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/4.jpg)
HDInsight clusters on Azure
• HDInsight provides cluster types and configurations for:
  • Hadoop (HDFS)
  • HBase
  • Storm
  • Spark (Preview)
• No hardware to purchase or maintain.
• HDInsight has powerful programming extensions for languages and platforms including C#, Java, and .NET. Use your programming language of choice on Hadoop to create, configure, submit, and monitor Hadoop jobs.
![Page 5: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/5.jpg)
![Page 6: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/6.jpg)
What is HBase
• Apache HBase is an open-source NoSQL database that is built on Hadoop and modeled after Google Bigtable.
• HBase provides random access and strong consistency for large amounts of unstructured and semi-structured data in a schemaless database organized by column families.
• Data is stored in the rows of a table, and data within a row is grouped by column family.
• The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.
![Page 7: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/7.jpg)
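A record for the Order table in an RDBMS:

| Order No | Customer Name | Customer Phone | Company Name | Company Address |
|---|---|---|---|---|
| 12012015 | Mostafa | 101-232-2345 | Microsoft | Redmond, WA |

The same record for the Order table in HBase, with the columns grouped into the Customer and Company column families:

| Order No | Customer: Customer Name | Customer: Customer Phone | Company: Company Name | Company: Company Address |
|---|---|---|---|---|
| 12012015 | Mostafa | 101-232-2345 | Microsoft | Redmond, WA |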
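The regrouping of a flat record into column families can be sketched in plain Python. This is only an illustration of the data layout, not an HBase client: the `to_hbase_row` helper and the `COLUMN_FAMILIES` mapping are hypothetical names, and using Order No as the row key is an assumption based on the slide.

```python
# Illustration only: plain Python dictionaries, not an HBase client.

# The order record as a flat RDBMS-style row.
flat_row = {
    "Order No": "12012015",
    "Customer Name": "Mostafa",
    "Customer Phone": "101-232-2345",
    "Company Name": "Microsoft",
    "Company Address": "Redmond, WA",
}

# Assumed mapping of each non-key column to a column family.
COLUMN_FAMILIES = {
    "Customer Name": "Customer",
    "Customer Phone": "Customer",
    "Company Name": "Company",
    "Company Address": "Company",
}


def to_hbase_row(row, key_column, families):
    """Return (row_key, {family: {qualifier: value}}) for a flat row."""
    cells = {}
    for column, value in row.items():
        if column == key_column:
            continue  # the key identifies the row; it is not stored in a family
        family = families[column]
        cells.setdefault(family, {})[column] = value
    return row[key_column], cells


row_key, cells = to_hbase_row(flat_row, "Order No", COLUMN_FAMILIES)
print(row_key, cells)
```

The row key selects the row, and each column family holds its own set of qualifier/value pairs, which is why rows in the same table do not all need the same columns.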
![Page 8: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/8.jpg)
What is HBase
• HBase commands:
  • create: equivalent to CREATE TABLE in T-SQL
  • get: equivalent to a SELECT statement in T-SQL
  • put: equivalent to an UPDATE or INSERT statement in T-SQL
  • scan: equivalent to a SELECT with no WHERE condition in T-SQL
• The HBase shell is the query tool for running CRUD commands against an HBase cluster.
• Data can also be managed using the HBase C# API, which provides a client library on top of the HBase REST API.
• An HBase database can also be queried by using Hive.
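To make the semantics of the four commands concrete, here is a toy in-memory model in Python. It is not the HBase shell or any HBase client API: `ToyTable` is a hypothetical class, and it ignores cell versions, timestamps, and distribution, all of which a real cluster handles.

```python
class ToyTable:
    """In-memory stand-in for an HBase table (illustration only)."""

    def __init__(self, name, families):
        # 'create': a table is defined by its name and its column families.
        self.name = name
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, column, value):
        # 'put' inserts a new cell or overwrites an existing one,
        # so it covers both INSERT and UPDATE.
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        # 'get' reads a single row by its key.
        return dict(self.rows.get(row_key, {}))

    def scan(self):
        # 'scan' walks every row, in row-key order.
        for row_key in sorted(self.rows):
            yield row_key, dict(self.rows[row_key])


orders = ToyTable("Order", ["Customer", "Company"])
orders.put("12012015", "Customer:Name", "Mostafa")
orders.put("12012015", "Company:Name", "Microsoft")
orders.put("12012015", "Customer:Name", "Mostafa E.")  # same cell: acts as an update

print(orders.get("12012015"))
print(list(orders.scan()))
```

Columns are addressed as `family:qualifier`, and only the column families have to be declared when the table is created.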
![Page 9: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/9.jpg)
What is Hive
• Apache Hive is a data warehouse system for Hadoop, which enables data summarization, querying, and analysis of data by using HiveQL (a query language similar to SQL).
• Hive understands how to work with structured and semi-structured data, such as text files where the fields are delimited by specific characters.
• Hive also supports custom serializer/deserializers (SerDe) for complex or irregularly structured data.
• Hive can also be extended through user-defined functions (UDF).
• A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL.
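One way to run custom logic from HiveQL is to stream rows through an external script with the TRANSFORM clause. The Python sketch below shows the row-level part of such a script. The table name, column names, and script name in the comments are hypothetical, and the sample input stands in for the tab-delimited rows Hive would send on standard input.

```python
# Hypothetical HiveQL that could call a script like this one
# (table, column, and file names are assumptions):
#   ADD FILE normalize_names.py;
#   SELECT TRANSFORM (order_no, customer_name)
#   USING 'python normalize_names.py' AS (order_no, customer_name)
#   FROM orders;


def transform_line(line):
    """Turn one tab-delimited input row into one tab-delimited output row."""
    order_no, customer_name = line.rstrip("\n").split("\t")
    # Logic that is awkward to express in HiveQL would go here;
    # this example only trims and capitalizes the name.
    return "\t".join([order_no, customer_name.strip().title()])


def transform_stream(lines):
    """Apply transform_line to every non-empty line of an input stream."""
    for line in lines:
        if line.strip():
            yield transform_line(line)


# Sample rows in place of sys.stdin; under Hive the script would iterate
# over sys.stdin and print each result instead.
sample = ["12012015\t mostafa \n", "\n", "12012016\tSARA\n"]
results = list(transform_stream(sample))
print(results)
```

Hive passes the selected columns to the script as tab-separated text, one row per line, and reads the script's output back in the same shape.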
![Page 10: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/10.jpg)
Demo: Working with HDInsight HBase cluster
![Page 11: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/11.jpg)
What is Apache Storm
• Apache Storm is a distributed, fault-tolerant, open-source computation system that allows you to process data in real time with Hadoop.
• Apache Storm on HDInsight allows you to create distributed, real-time analytics solutions in the Azure environment by using Apache Hadoop.
• Storm solutions can also provide guaranteed processing of data, with the ability to replay data that was not successfully processed the first time.
• Storm components can be written in C#, Java, and Python.
• Clusters can be scaled up or down in Azure without impacting running Storm topologies.
• Easy to provision and use in Azure.
• Visual Studio project templates for Storm apps.
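The idea of guaranteed processing with replay can be sketched in a few lines of Python. This is a single-process toy, not Storm's acknowledgement mechanism: `process_with_replay` is a hypothetical helper, and the `max_attempts` cap is an assumption added only so the loop always terminates.

```python
from collections import deque


def process_with_replay(tuples, handler, max_attempts=3):
    """Process each tuple; put it back on the queue when the handler raises."""
    pending = deque((item, 1) for item in tuples)
    done, dropped = [], []
    while pending:
        item, attempt = pending.popleft()
        try:
            done.append(handler(item))  # success plays the role of an "ack"
        except Exception:
            if attempt < max_attempts:
                pending.append((item, attempt + 1))  # "fail": replay later
            else:
                dropped.append(item)  # gave up after max_attempts tries
    return done, dropped


seen = set()


def flaky_double(x):
    # Fails the first time it sees 2, then succeeds on the replay.
    if x == 2 and x not in seen:
        seen.add(x)
        raise RuntimeError("transient failure")
    return x * 2


done, dropped = process_with_replay([1, 2, 3], flaky_double)
print(done, dropped)
```

Because a failed tuple is processed again, a handler may see the same tuple more than once, so downstream writes are easier to reason about when they are idempotent.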
![Page 12: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/12.jpg)
Apache Storm Components
• Apache Storm apps are submitted as topologies.
• A topology is a graph of computation that processes streams.
• Stream: An unbounded collection of tuples. Streams are produced by spouts and bolts, and they are consumed by bolts.
• Tuple: A named list of dynamically typed values.
• Spout: Consumes data from a data source and emits one or more streams.
• Bolt: Consumes streams, performs processing on tuples, and may emit streams. Bolts are also responsible for writing data to external storage, such as a queue, HDInsight, HBase, a blob, or other data store.
• Nimbus: Plays a role similar to the JobTracker in Hadoop; it distributes jobs and monitors for failures.
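The relationship between spouts, bolts, tuples, and a topology can be illustrated with Python generators. This is a conceptual sketch rather than the Storm API: the function names are hypothetical, the stream is a short fixed list instead of an unbounded source, and everything runs in one process.

```python
from collections import Counter, namedtuple

# Tuple: a named list of values.
Sentence = namedtuple("Sentence", ["text"])
Word = namedtuple("Word", ["word"])


def sentence_spout():
    """Spout: reads from a data source (here a fixed list) and emits a stream."""
    for text in ["big data on azure", "big data with storm"]:
        yield Sentence(text)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for sentence in stream:
        for word in sentence.text.split():
            yield Word(word)


def count_bolt(stream):
    """Bolt: consumes word tuples and keeps a count per word."""
    counts = Counter()
    for tup in stream:
        counts[tup.word] += 1
    return counts


# Topology: spout -> split bolt -> count bolt.
word_counts = count_bolt(split_bolt(sentence_spout()))
print(word_counts)
```

In a real topology the last bolt would write its results to external storage rather than return them, and each component would run as parallel tasks across the cluster.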
![Page 13: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/13.jpg)
Demo: Storm App in VS 2015
![Page 14: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/14.jpg)
What is Apache Spark
• Apache Spark™ is a fast and general engine for large-scale data processing.
• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Write applications quickly in Java, Scala, Python, R.
• Combine SQL, streaming, and complex analytics.
• Spark's in-memory computation capabilities make it a good choice for iterative algorithms in ML and graph computations.
• Spark is also compatible with Azure Blob storage (WASB), so your existing data stored in Azure can easily be processed via Spark.
• Easy to provision on Azure as an HDInsight Spark cluster.
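Data in Azure Blob storage is addressed from the cluster with a `wasb://` (or `wasbs://` for TLS) URI. The small helper below builds such a URI. `wasb_uri` is a hypothetical convenience function, and the container, storage account, and file names are placeholders.

```python
def wasb_uri(container, account, path, secure=False):
    """Build a WASB URI for a blob path in an Azure Storage container."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical container, storage account, and file names.
uri = wasb_uri("mycontainer", "mystorageaccount", "/data/orders.csv")
print(uri)

# In a Spark notebook on the cluster, a path like this could be handed to a
# reader, for example sc.textFile(uri) or spark.read.csv(uri), depending on
# the Spark version installed.
```

Since the data stays in Blob storage, it can be read by a cluster created later without being copied first.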
![Page 15: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/15.jpg)
Demo: Working with Spark Notebooks
![Page 16: Big data solutions in azure](https://reader036.fdocuments.in/reader036/viewer/2022083106/5873ac931a28aba3548b640b/html5/thumbnails/16.jpg)
Session Objectives and Takeaways
• Understanding HDInsight cluster types in Azure
• HBase as a storage option in Hadoop
• Understanding data processing options in the Hadoop ecosystem using Storm and Spark