· 2018-10-30 · Run the same PageRank application (in Task 2) on Azure Databricks to compare the...



Prepare data for the next iteration

[Figure: Spark Web UI showing the status of RDD actions being computed, info about cached RDDs and memory usage, and in-depth job info]

Resilient Distributed Datasets (RDD):

● Distributed collection of JVM objects

● Functional operators (map, filter, etc.)

● Slower than DataFrames

DataFrame:

● Distributed collection of Row objects

● Expression-based operations

● Fast, efficient internal representations

Dataset:

● Internally rows, externally JVM objects

● Type safe and fast


[Figure: an RDD operation (e.g. map, filter) applied in parallel across Machines A, B, and C, transforming partitions RDD1, RDD2, RDD3 into RDD1’, RDD2’, RDD3’]

>>> input_RDD = sc.textFile("text.file")

>>> transform_RDD = input_RDD.filter(lambda x: "abcd" in x)

>>> print("Number of \"abcd\": " + str(transform_RDD.count()))

>>> transform_RDD.saveAsTextFile("hdfs:///output")


people.json:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

val df = spark.read.json("people.json")

df.filter($"age" > 20).show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

df.filter($"age" > 20).select("name").write.format("parquet").save("output")

Note: Parquet is a column-based storage format for Hadoop. You will need special dependencies to read this file

[Table: Task | Points | Description | Language]

● Key transformations: map, filter

● Aggregations: reduceByKey, aggregateByKey, groupByKey

● How do we measure influence?
○ Intuitively, it should be the node with the most followers

● Influence scores are initialized to 1.0/number of vertices

[Figure: three-node example; every node starts with score 0.333]

● In each iteration of the algorithm, scores of each user are redistributed between the users they are following

[Figure: a node following two users sends 0.333/2 = 0.167 along each edge; a node receiving from Node 0 and Node 1 ends up with 0.333 + 0.333/2 = 0.500, while the third node keeps 0.333]

● Convergence is achieved when the scores of nodes do not change between iterations

● PageRank is guaranteed to converge

[Figure: converged scores: 0.208, 0.396, 0.396]
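The redistribution step can be sketched in plain Python, without Spark. The three-node graph below is an assumed shape chosen to reproduce the numbers in the example above; it is illustrative, not the project dataset.

```python
def redistribute(scores, following):
    """One redistribution step: each node splits its score equally
    among the nodes it follows."""
    new_scores = {v: 0.0 for v in scores}
    for u, follows in following.items():
        for v in follows:
            new_scores[v] += scores[u] / len(follows)
    return new_scores

# Three users, each initialized to 1/3 = 0.333.
scores = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
# Node 1 follows two users, so it sends 0.333/2 = 0.167 to each.
following = {0: [2], 1: [0, 2], 2: [1]}

new = redistribute(scores, following)
# Node 2 receives 0.333 (from node 0) + 0.333/2 = 0.500,
# and the total score still sums to 1.0.
```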

val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x, y) => x + y)
                  .mapValues(sum => a/N + (1-a)*sum)
}

● Dangling or sink vertex
○ No outgoing edges
○ Redistribute contribution equally among all vertices

● Isolated vertex
○ No incoming and outgoing edges
○ No isolated nodes in Project 4.1 dataset

● Damping factor d
○ Represents the probability that a user clicking on links will continue clicking on them, traveling down an edge
○ Use d = 0.85

[Figure: graph highlighting a dangling vertex and an isolated vertex]

● Adjacency matrix: A, where A[i][j] = 1 if there is an edge from vertex i to vertex j

● Transition matrix: M, the adjacency matrix with each row normalized by its out-degree (rows sum to 1)

Formula for calculating rank (d = 0.85):

PR(v) = (1 - d)/N + d * ( Σ_{u → v} PR(u)/outdegree(u) + Σ_{w dangling} PR(w)/N )

Note: contributions from isolated and dangling vertices are constant in an iteration.

Let

c = (1 - d)/N + d * Σ_{w dangling} PR(w)/N

This simplifies the formula to

PR(v) = c + d * Σ_{u → v} PR(u)/outdegree(u)
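The simplified update can be sketched in plain Python (no Spark), assuming the graph is given as adjacency lists; the toy graph and node names below are illustrative. Dangling mass is folded into the constant c exactly as in the formula.

```python
def pagerank(graph, nodes, d=0.85, iterations=10):
    """graph: dict mapping each node to the list of nodes it links to."""
    N = len(nodes)
    ranks = {v: 1.0 / N for v in nodes}
    for _ in range(iterations):
        # Mass held by dangling vertices (no outgoing edges)
        dangling = sum(ranks[v] for v in nodes if not graph.get(v))
        # Constant part: random jump plus redistributed dangling mass
        c = (1 - d) / N + d * dangling / N
        new_ranks = {v: c for v in nodes}
        for u in nodes:
            out = graph.get(u, [])
            for dest in out:
                new_ranks[dest] += d * ranks[u] / len(out)
        ranks = new_ranks
    return ranks

nodes = ["a", "b", "c"]
graph = {"a": ["b", "c"], "b": ["c"], "c": []}  # "c" is a dangling vertex
ranks = pagerank(graph, nodes)
# The total score sums to 1.0 after every iteration
```

Checking that the scores sum to 1.0 each iteration (as the correctness tips below require) is a cheap invariant to assert while debugging.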

● Run the same PageRank application (in Task 2) on Azure Databricks to compare the differences with Azure HDInsight

● Databricks is an Apache Spark-based analytics platform optimized for Azure

● One-click setup, an interactive workspace, and an optimized Databricks runtime

● Optimized connectors to Azure storage platforms for fast data access

● Software-as-a-Service

● reduceByKey vs. groupByKey vs. aggregateByKey

● Interactive shells:
○ ./spark/bin/spark-shell
○ ./spark/bin/pyspark

● Ensuring correctness
○ Make sure total scores sum to 1.0 in every iteration
○ Understand closures in Spark

■ Do not do something like this:

val data = Array(1, 2, 3, 4, 5)
var counter = 0
var rdd = sc.parallelize(data)
rdd.foreach(x => counter += x)
println("Counter value: " + counter) // prints 0: each executor updates its own copy

○ Graph representation
■ Adjacency lists use less memory than matrices

○ More detailed walkthroughs and sample calculations can be found here

● Optimization
○ Eliminate repeated calculations
○ Use the Spark Web UI
■ Monitor your instances to make sure they are fully utilized
■ Identify bottlenecks
○ Understand RDD manipulations
■ Actions vs. transformations
■ Lazy transformations
○ Explore parameter tuning to optimize resource usage
○ Be careful with repartition on your RDDs

Twitter Data Analytics: Team Project

Team Project


● Phase 1:
○ Q1
○ Q2 (MySQL AND HBase)

● Phase 2:
○ Q1
○ Q2 & Q3 (MySQL AND HBase)

● Phase 3:
○ Q1
○ Q2 & Q3 (MySQL OR HBase)

Team Project Deadlines

● Writeup and queries were released on Monday, October 29th, 2018.

● Phase 2 milestones:
○ Q2: on scoreboard, due on Sunday, 11/11
○ Phase 2, Live test: Q1, Q2 and Q3, on Sunday, 11/11
○ Phase 2, code and report: due on Tuesday, 11/13

Query 3, Definitions

● time_start, time_end: in Unix time / Epoch time format, e.g. time_start=1480000000

● uid_start, uid_end: marks the user search boundary, e.g. uid_end=492600000

● n1: the maximum number of topic words that should be included in the response

● n2: the maximum number of tweets that should be included in the response
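Unix/Epoch timestamps like the ones above can be inspected with Python's standard library; the value below is the example time_start from the definition.

```python
from datetime import datetime, timezone

# time_start=1480000000 as a UTC datetime
ts = datetime.fromtimestamp(1480000000, tz=timezone.utc)
print(ts.isoformat())  # 2016-11-24T15:06:40+00:00
```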


Query 3: Effective Word Count (EWC)

● EWC: one or more consecutive alphanumeric characters (A through Z, a through z, 0 through 9) with zero or more ' and/or - characters.

● Example: "Query 3 is su-per-b! I'mmmm lovin' it!" ⇐ 6 EWC

● Don't forget to remove the short URLs and stop words before calculation.
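One possible reading of the EWC definition as a Python regex; the pattern and function name are assumptions, not the reference implementation, and the stop-word/URL removal is elided.

```python
import re

# An alphanumeric character followed by any mix of alphanumerics,
# apostrophes, and hyphens (e.g. "su-per-b", "I'mmmm", "lovin'").
EWC_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9'-]*")

def effective_words(text):
    """Tokenize into effective words; short URLs and stop words
    should already have been removed."""
    return EWC_PATTERN.findall(text)

print(effective_words("su-per-b! I'mmmm lovin'"))
# ['su-per-b', "I'mmmm", "lovin'"]
```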


Query 3, Impact Score

Impact Score = EWC*(favorite_count+retweet_count+followers_count)

Treat a negative impact_score as 0.
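As a one-line helper (parameter names assumed from the formula above), including the clamp to zero:

```python
def impact_score(ewc, favorite_count, retweet_count, followers_count):
    # A negative score is clamped to 0
    return max(0, ewc * (favorite_count + retweet_count + followers_count))

print(impact_score(6, 1, 2, 3))  # 36
```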


Query 3, Topic Words

Topic words:
● After filtering short URLs
● Exclude stop words
● Before censoring
● Case insensitive (lower case)

TF-IDF:
● TF: term frequency of a topic word w
● IDF: ln(Total number of tweets in range / Number of tweets with w in it)

Query 3, Topic Score

Topic Score = sum over i = 1..n of (x_i * ln(y_i + 1))

● n: the total number of tweets in the given time and uid range

● x_i: TF-IDF score of word w in tweet T_i

● y_i: the impact score of T_i
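The IDF and topic score can be sketched as follows, assuming the per-tweet TF-IDF values x_i and impact scores y_i have already been computed; the function names are illustrative.

```python
import math

def idf(total_tweets, tweets_with_word):
    """IDF = ln(total tweets in range / tweets containing the word)."""
    return math.log(total_tweets / tweets_with_word)

def topic_score(tfidf_scores, impact_scores):
    """Topic Score = sum over tweets of x_i * ln(y_i + 1)."""
    return sum(x * math.log(y + 1)
               for x, y in zip(tfidf_scores, impact_scores))

print(idf(10, 10))  # 0.0 (a word appearing in every tweet carries no information)
```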


Query 3 Example

Response format:
word1:score1\tword2:score2...\twordn1:scoren1
Impactscore1\ttid1\ttext1
…..

Example:
channel:2270.04 amp:1586.31 new:1166.24 just:1153.70 love:1063.31 like:1015.71 good:937.63
26200650 461159182406672384 I just buyed the comedy album of my bestest friend in the entire world @briangaar. https://t.co/hwDB4veaYG #RacesAsToad...

Don't forget to censor the tweets.

Warning!!! Any Hadoop Cluster

For any Hadoop cluster on AWS, Azure or GCP:

● Don't open ports to the public, except:
○ Ports 22, 80, 25, 443, or 465

● Follow the HBase Primer and use an SSH tunnel to communicate with your YARN UI


Note:
● There will be a report due at the end of each phase, where you are expected to discuss optimizations
● WARNING: Check your AWS instance limits on the new account (should be > 10 instances)

[Table: Phase (and query due) | Start | Deadline | Code and Report Due]



Questions?