· 2018-10-30 · Run the same PageRank application (in Task 2) on Azure Databricks to compare the...



Prepare data for the next iteration

[Figure: Spark Web UI showing the status of RDD actions being computed, info about cached RDDs and memory usage, and in-depth job info]

Resilient Distributed Datasets (RDD):

● Distributed collection of JVM objects

● Functional operators (map, filter, etc.)

● Slower than DataFrames

DataFrame:

● Distributed collection of Row objects

● Expression-based operations

● Fast, efficient internal representations

Dataset:

● Internally rows, externally JVM objects

● Type safe and fast


[Figure: an RDD operation (e.g. map, filter) applied in parallel across Machines A, B, and C, transforming partitions RDD1, RDD2, RDD3 into RDD1’, RDD2’, RDD3’]

>>> input_RDD = sc.textFile("text.file")

>>> transform_RDD = input_RDD.filter(lambda x: "abcd" in x)

>>> print("Number of \"abcd\": " + str(transform_RDD.count()))

>>> transform_RDD.saveAsTextFile("hdfs:///output")


people.json:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

val df = spark.read.json("people.json")

df.filter($"age" > 20).show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

df.filter($"age" > 20).select("name").write.format("parquet").save("output")

Note: Parquet is a column-based storage format for Hadoop. You will need special dependencies to read this file

[Table: Task | Points | Description | Language]

● Key transformations: map, filter

● Aggregations: reduceByKey, aggregateByKey, groupByKey

● How do we measure influence?
○ Intuitively, it should be the node with the most followers

● Influence scores are initialized to 1.0/number of vertices

[Figure: three-node example; every node starts with score 0.333]

● In each iteration of the algorithm, scores of each user are redistributed between the users they are following

[Figure: a node following two users sends 0.333/2 = 0.167 along each edge; a node receiving from Node 0 and Node 1 ends up with 0.333 + 0.333/2 = 0.500, while the third node keeps 0.333]

● Convergence is achieved when the scores of nodes do not change between iterations

● PageRank is guaranteed to converge

[Figure: converged scores: 0.208, 0.396, 0.396]
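The redistribution step can be sketched in plain Python, without Spark. The three-node graph below is an assumed shape chosen to reproduce the numbers in the example above; it is illustrative, not the project dataset.

```python
def redistribute(scores, following):
    """One redistribution step: each node splits its score equally
    among the nodes it follows."""
    new_scores = {v: 0.0 for v in scores}
    for u, follows in following.items():
        for v in follows:
            new_scores[v] += scores[u] / len(follows)
    return new_scores

# Three users, each initialized to 1/3 = 0.333.
scores = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
# Node 1 follows two users, so it sends 0.333/2 = 0.167 to each.
following = {0: [2], 1: [0, 2], 2: [1]}

new = redistribute(scores, following)
# Node 2 receives 0.333 (from node 0) + 0.333/2 = 0.500,
# and the total score still sums to 1.0.
```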

val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x, y) => x + y)
                  .mapValues(sum => a/N + (1-a)*sum)
}

● Dangling or sink vertex
○ No outgoing edges
○ Redistribute contribution equally among all vertices

● Isolated vertex
○ No incoming and outgoing edges
○ No isolated nodes in Project 4.1 dataset

● Damping factor d
○ Represents the probability that a user clicking on links will continue clicking on them, traveling down an edge
○ Use d = 0.85

[Figure: graph highlighting a dangling vertex and an isolated vertex]

● Adjacency matrix: A, where A[i][j] = 1 if there is an edge from vertex i to vertex j

● Transition matrix: M, the adjacency matrix with each row normalized by its out-degree (rows sum to 1)

Formula for calculating rank (d = 0.85):

PR(v) = (1 - d)/N + d * ( Σ_{u → v} PR(u)/outdegree(u) + Σ_{w dangling} PR(w)/N )

Note: contributions from isolated and dangling vertices are constant in an iteration.

Let

c = (1 - d)/N + d * Σ_{w dangling} PR(w)/N

This simplifies the formula to

PR(v) = c + d * Σ_{u → v} PR(u)/outdegree(u)
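The simplified update can be sketched in plain Python (no Spark), assuming the graph is given as adjacency lists; the toy graph and node names below are illustrative. Dangling mass is folded into the constant c exactly as in the formula.

```python
def pagerank(graph, nodes, d=0.85, iterations=10):
    """graph: dict mapping each node to the list of nodes it links to."""
    N = len(nodes)
    ranks = {v: 1.0 / N for v in nodes}
    for _ in range(iterations):
        # Mass held by dangling vertices (no outgoing edges)
        dangling = sum(ranks[v] for v in nodes if not graph.get(v))
        # Constant part: random jump plus redistributed dangling mass
        c = (1 - d) / N + d * dangling / N
        new_ranks = {v: c for v in nodes}
        for u in nodes:
            out = graph.get(u, [])
            for dest in out:
                new_ranks[dest] += d * ranks[u] / len(out)
        ranks = new_ranks
    return ranks

nodes = ["a", "b", "c"]
graph = {"a": ["b", "c"], "b": ["c"], "c": []}  # "c" is a dangling vertex
ranks = pagerank(graph, nodes)
# The total score sums to 1.0 after every iteration
```

Checking that the scores sum to 1.0 each iteration (as the correctness tips below require) is a cheap invariant to assert while debugging.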

● Run the same PageRank application (in Task 2) on Azure Databricks to compare the differences with Azure HDInsight

● Databricks is an Apache Spark-based analytics platform optimized for Azure

● One-click setup, an interactive workspace, and an optimized Databricks runtime

● Optimized connectors to Azure storage platforms for fast data access

● Software-as-a-Service

● reduceByKey vs. groupByKey vs. aggregateByKey

● Interactive shells:
○ ./spark/bin/spark-shell
○ ./spark/bin/pyspark

● Ensuring correctness
○ Make sure total scores sum to 1.0 in every iteration
○ Understand closures in Spark

■ Do not do something like this:

val data = Array(1, 2, 3, 4, 5)
var counter = 0
var rdd = sc.parallelize(data)
rdd.foreach(x => counter += x)
println("Counter value: " + counter) // prints 0: each executor updates its own copy

○ Graph representation
■ Adjacency lists use less memory than matrices

○ More detailed walkthroughs and sample calculations can be found here

● Optimization
○ Eliminate repeated calculations
○ Use the Spark Web UI
■ Monitor your instances to make sure they are fully utilized
■ Identify bottlenecks
○ Understand RDD manipulations
■ Actions vs. transformations
■ Lazy transformations
○ Explore parameter tuning to optimize resource usage
○ Be careful with repartition on your RDDs

Twitter Data Analytics: Team Project

Team Project


● Phase 1:
○ Q1
○ Q2 (MySQL AND HBase)

● Phase 2:
○ Q1
○ Q2 & Q3 (MySQL AND HBase)

● Phase 3:
○ Q1
○ Q2 & Q3 (MySQL OR HBase)

Team Project Deadlines

● Writeup and queries were released on Monday, October 29th, 2018.

● Phase 2 milestones:
○ Q2: on scoreboard, due on Sunday, 11/11
○ Phase 2, Live test: Q1, Q2 and Q3, on Sunday, 11/11
○ Phase 2, code and report: due on Tuesday, 11/13

Query 3, Definitions

● time_start, time_end: in Unix time / Epoch time format, e.g. time_start=1480000000

● uid_start, uid_end: marks the user search boundary, e.g. uid_end=492600000

● n1: the maximum number of topic words that should be included in the response

● n2: the maximum number of tweets that should be included in the response
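Unix/Epoch timestamps like the ones above can be inspected with Python's standard library; the value below is the example time_start from the definition.

```python
from datetime import datetime, timezone

# time_start=1480000000 as a UTC datetime
ts = datetime.fromtimestamp(1480000000, tz=timezone.utc)
print(ts.isoformat())  # 2016-11-24T15:06:40+00:00
```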


Query 3: Effective Word Count (EWC)

● EWC: one or more consecutive alphanumeric characters (A through Z, a through z, 0 through 9) with zero or more ' and/or - characters.

● Example: "Query 3 is su-per-b! I'mmmm lovin' it!" ⇐ 6 EWC

● Don't forget to remove the short URLs and stop words before calculation.
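One possible reading of the EWC definition as a Python regex; the pattern and function name are assumptions, not the reference implementation, and the stop-word/URL removal is elided.

```python
import re

# An alphanumeric character followed by any mix of alphanumerics,
# apostrophes, and hyphens (e.g. "su-per-b", "I'mmmm", "lovin'").
EWC_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9'-]*")

def effective_words(text):
    """Tokenize into effective words; short URLs and stop words
    should already have been removed."""
    return EWC_PATTERN.findall(text)

print(effective_words("su-per-b! I'mmmm lovin'"))
# ['su-per-b', "I'mmmm", "lovin'"]
```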


Query 3, Impact Score

Impact Score = EWC*(favorite_count+retweet_count+followers_count)

Treat a negative impact_score as 0.
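As a one-line helper (parameter names assumed from the formula above), including the clamp to zero:

```python
def impact_score(ewc, favorite_count, retweet_count, followers_count):
    # A negative score is clamped to 0
    return max(0, ewc * (favorite_count + retweet_count + followers_count))

print(impact_score(6, 1, 2, 3))  # 36
```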


Query 3, Topic Words

Topic words:
● After filtering short URLs
● Exclude stop words
● Before censoring
● Case insensitive (lower case)

TF-IDF:
● TF: term frequency of a topic word w
● IDF: ln(Total number of tweets in range / Number of tweets with w in it)

Query 3, Topic Score

Topic Score = sum over i = 1..n of (x_i * ln(y_i + 1))

● n: the total number of tweets in the given time and uid range

● x_i: TF-IDF score of word w in tweet T_i

● y_i: the impact score of T_i
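The IDF and topic score can be sketched as follows, assuming the per-tweet TF-IDF values x_i and impact scores y_i have already been computed; the function names are illustrative.

```python
import math

def idf(total_tweets, tweets_with_word):
    """IDF = ln(total tweets in range / tweets containing the word)."""
    return math.log(total_tweets / tweets_with_word)

def topic_score(tfidf_scores, impact_scores):
    """Topic Score = sum over tweets of x_i * ln(y_i + 1)."""
    return sum(x * math.log(y + 1)
               for x, y in zip(tfidf_scores, impact_scores))

print(idf(10, 10))  # 0.0 (a word appearing in every tweet carries no information)
```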


Query 3 Example

Response format:
word1:score1\tword2:score2...\twordn1:scoren1
Impactscore1\ttid1\ttext1
…..

Example:
channel:2270.04 amp:1586.31 new:1166.24 just:1153.70 love:1063.31 like:1015.71 good:937.63
26200650 461159182406672384 I just buyed the comedy album of my bestest friend in the entire world @briangaar. https://t.co/hwDB4veaYG #RacesAsToad...

Don't forget to censor the tweets.

Warning!!! Any Hadoop Cluster

For any Hadoop cluster on AWS, Azure or GCP:

● Don't open ports to the public, except:
○ Ports 22, 80, 25, 443, or 465

● Follow the HBase Primer and use an SSH tunnel to communicate with your YARN UI


Note:
● There will be a report due at the end of each phase, where you are expected to discuss optimizations
● WARNING: Check your AWS instance limits on the new account (should be > 10 instances)

[Table: Phase (and query due) | Start | Deadline | Code and Report Due]



Questions?