Big Data Trend with Open Platform
![Page 1: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/1.jpg)
Jongwook Woo
HiPIC
CalStateLA
SWRC 2017
San Diego, CA, Feb 25 2017
Jongwook Woo, PhD
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
Big Data Trend with Open Platform
![Page 2: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/2.jpg)
High Performance Information Computing Center, Jongwook Woo
CalStateLA
Contents
Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
![Page 3: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/3.jpg)
Myself
Experience:
Since 2002: Professor at California State University, Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information search and integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present: Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
![Page 4: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/4.jpg)
Myself
Experience (Cont'd):
Bringing Big Data R&D and training to Korea since 2009
Collaborating with LA city in 2016
– Collect, search, and analyze city data
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introduced Hadoop Big Data and education to universities and research centers
• Yonsei, Gachon
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg
![Page 5: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/5.jpg)
Experience in Big Data
Collaboration:
Council Member of IBM Spark Technology Center
City of Los Angeles for OpenHub and Open Data
Startup companies in Los Angeles
External Collaborator and Advisor in Big Data
– IMSC of USC
– Pennsylvania State University
Grants:
IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
Partnership:
Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
![Page 6: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/6.jpg)
Contents
Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
![Page 7: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/7.jpg)
Data Issues
Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data, smart phones, online games…
Cannot be handled with the legacy approach: too big, non-/semi-structured data, too expensive
Need new systems: inexpensive
![Page 8: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/8.jpg)
Two Cores in Big Data
Two questions: how to store Big Data, and how to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Own super computers
Published papers in 2003 (GFS) and 2004 (MapReduce)
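The MapReduce idea named on this slide can be sketched on a single machine. The following is a minimal illustration in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are assumptions made for this example. On a real cluster the framework would run the map and reduce calls in parallel on many inexpensive machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical input; on a cluster each line could live on a different node.
lines = ["big data on hadoop", "big data on spark"]
counts = word_count(lines)
```

Each map call sees only its own line and each reduce call sees only one key, which is what lets the work be spread across machines.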
![Page 9: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/9.jpg)
What is Hadoop?
Hadoop founder: Doug Cutting
Apache committer: Lucene, Nutch, …
![Page 10: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/10.jpg)
Super Computer vs Hadoop
[Figure: "Parallel vs. Distributed file systems" by Michael Malak. Super computer: one cluster for compute and a separate cluster for storage. Hadoop: one cluster for both compute and storage.]
![Page 11: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/11.jpg)
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– Inexpensive super computer
– More accessible to the public than traditional super computers
• You can store and process your applications
– In your university labs, small companies, research centers
![Page 12: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/12.jpg)
Hadoop Cluster: Logical Diagram
[Diagram: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari). Every node in the cluster runs an agent, Hadoop, and HDFS. Services shown: Hive, ZooKeeper, Impala.]
![Page 13: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/13.jpg)
Hadoop Ecosystems
http://dawn.dbsdataprojects.com/tag/hadoop/
![Page 14: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/14.jpg)
Contents
Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
![Page 15: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/15.jpg)
Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Only Map and Reduce
– Limited parallelization
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue

Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
20 ~ 100 times faster than network (N/W) and disk based processing
– MapReduce
![Page 16: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/16.jpg)
Spark
In-memory data computing: faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems: HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New programming with faster data sharing
Good in complex multi-stage applications
– Iterative graph algorithms, machine learning
Interactive query
![Page 17: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/17.jpg)
Spark: RDDs, Transformations, and Actions
[Diagram: the Spark stack built on Spark Core]
– Spark Streaming (real-time): DStreams, streams of RDDs
– Spark SQL: SchemaRDDs, DataFrames
– ML / MLlib (machine learning): RDD-based matrices
– GraphX (graph): RDD-based matrices
– SparkR: RDD-based matrices
![Page 18: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/18.jpg)
Spark Drivers and Workers
Drivers
Client
– with SparkContext
• Communicates with Spark workers
Workers
Spark Executor
Run on cluster nodes
– Production
Run in local threads
– Development and test
![Page 19: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/19.jpg)
RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
– that can be cached in memory
Immutable
– RDD, DStream, SchemaRDD, PairRDD
Lineage
– History of the objects
– Automatically and efficiently re-compute lost data
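The immutability and lineage points can be illustrated with a toy class. This is a sketch in plain Python, not the Spark API: `TinyRDD`, `lose_cache`, and the sample numbers are hypothetical names chosen for this example. It shows a dataset that never changes in place, remembers how it was derived, and can rebuild a lost in-memory copy from that history.

```python
class TinyRDD:
    """Toy stand-in for an RDD: immutable data plus the lineage to rebuild it."""

    def __init__(self, source=None, parent=None, func=None):
        self._source = tuple(source) if source is not None else None
        self._parent = parent   # lineage: which dataset this one came from
        self._func = func       # lineage: how it was derived from the parent
        self._cache = None      # optional in-memory copy

    def map(self, func):
        # Never mutates self; returns a new dataset that remembers its parent.
        return TinyRDD(parent=self, func=func)

    def compute(self):
        if self._cache is not None:
            return self._cache
        if self._parent is None:
            return self._source
        return tuple(self._func(x) for x in self._parent.compute())

    def cache(self):
        self._cache = self.compute()
        return self

    def lose_cache(self):
        # Simulates losing the in-memory copy, for example when a node fails.
        self._cache = None

base = TinyRDD(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2).cache()
before = doubled.compute()
doubled.lose_cache()
after = doubled.compute()   # rebuilt from lineage, not from a saved copy
```

Because the parent data and the function are both kept, nothing has to be replicated up front to survive the loss.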
![Page 20: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/20.jpg)
RDD and Data Frame Operations
Transformations
Define new RDDs and Data Frames from the current ones
– Lazy: not computed immediately
map(), filter(), join(), select(), groupBy()

Actions
Return values
count(), collect(), take(), save()
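The difference between lazy transformations and actions can be seen in a small plain-Python sketch. `LazyDataset` is a hypothetical class written for this example, not Spark's implementation; it only borrows the method names from the slide. Transformations record a step and return a new object, and nothing runs until an action is called.

```python
from itertools import islice

class LazyDataset:
    """Toy dataset: transformations only record work; actions run it."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = steps   # recorded transformations, not yet applied

    # Transformations: return a new dataset and compute nothing.
    def map(self, func):
        return LazyDataset(self._rows, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._rows, self._steps + (("filter", pred),))

    def _run(self):
        for row in self._rows:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    row = func(row)
                elif not func(row):   # a filter step rejected this row
                    keep = False
                    break
            if keep:
                yield row

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))

calls = []

def square(x):
    calls.append(x)   # records that the function really ran
    return x * x

ds = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)   # no action yet, so nothing has run
result = ds.collect()
```

Deferring the work lets a scheduler see the whole chain before running it, and `take(n)` can stop early instead of processing every row.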
![Page 21: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/21.jpg)
Programming in Spark
Scala
Functional programming
– The fundamental unit of programming is the function
• Input/output is a function
No side effects
– No states
Python
Legacy, large libraries
Java
R
![Page 22: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/22.jpg)
![Page 23: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/23.jpg)
Spark
Spark SQL: querying using SQL and HiveQL; Data Frame
ML: machine learning on Data Frame; pipelining
MLlib
– On RDD
– Sparse vector support, decision trees, linear/logistic regression, PCA, SVM
Spark Streaming: DStream
– RDD in streaming
– Windows
• To select DStream from streaming data
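The windowing idea can be sketched without Spark. The example below treats a stream as a sequence of small batches and keeps only the most recent ones in view. The function name `windowed_counts`, the parameter `window_length`, and the sample batches are assumptions for illustration. Spark Streaming windows also have a slide interval; this sketch simply slides by one batch at a time.

```python
from collections import deque

def windowed_counts(batches, window_length):
    """Yield the number of records in the last `window_length` batches."""
    window = deque(maxlen=window_length)   # old batches fall out automatically
    for batch in batches:
        window.append(batch)
        yield sum(len(b) for b in window)

# Hypothetical micro-batches, one list of records per batch interval.
batches = [["a", "b"], ["c"], [], ["d", "e", "f"]]
totals = list(windowed_counts(batches, window_length=2))
```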
![Page 24: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/24.jpg)
Scheduling Process
[Diagram: scheduling flow, left to right]
RDD Objects / Optimizer: build the operator DAG, e.g. rdd1.join(rdd2).groupBy(…).filter(…)
– passes the DAG to the DAGScheduler
DAGScheduler: split the graph into stages of tasks; submit each stage as ready
– agnostic to operators
– passes a TaskSet to the TaskScheduler; is told when a stage failed
TaskScheduler: launch tasks via the cluster manager; retry failed or straggling tasks
– doesn't know about stages
Worker: execute tasks in threads; store and serve blocks through the block manager
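The "split graph into stages" step can be approximated with a short sketch. This is a deliberate simplification: it handles a linear chain of operator names rather than a real dependency graph, and it assumes that `join` and `groupBy` always need a shuffle, which is not true in every case. `WIDE_OPS` and `split_into_stages` are hypothetical names for this example.

```python
# Operations assumed to need a shuffle in this sketch.
WIDE_OPS = {"join", "groupBy"}

def split_into_stages(ops):
    """Start a new stage whenever an operation needs shuffled input."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages

stages = split_into_stages(["map", "filter", "groupBy", "map", "join", "filter"])
```

Operations inside one stage can be pipelined over each partition; a new stage has to wait until the previous stage's output has been shuffled.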
![Page 25: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/25.jpg)
During Scheduling Process
https://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797
![Page 26: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/26.jpg)
Contents
Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
![Page 27: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/27.jpg)
Spark
Spark
File systems: Tachyon
Resource manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into Hadoop cluster
Cloud computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS
– No Hadoop ecosystems
![Page 28: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/28.jpg)
Spark Components
Your program:
sc = new SparkContext
f = sc.textFile("…")
f.filter(…).count()
...
Spark Driver/Client (app master): RDD graph, scheduler, block tracker, shuffle tracker
Cluster manager
Spark worker(s): block manager, task threads
Data sources: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
![Page 29: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/29.jpg)
Spark with Hadoop YARN
Spark Client
Master Node
– ResourceManager (RM): per cluster; creates the Spark AM and allocates containers for the Spark AM
Slave Nodes
– NodeManager (NM): per node; Spark workers
– ApplicationMaster (AM): per application; containers for Spark Executors
[Diagram: the Spark client, the ResourceManager on the master node, and NodeManagers on the slave nodes hosting the Spark AM and the containers that run Spark Executors.]
![Page 30: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/30.jpg)
Databricks cluster at CalStateLA
![Page 31: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/31.jpg)
Contents
Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
![Page 32: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/32.jpg)
Open Platform
Open Source
Open Conference
Open Data: public data
![Page 33: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/33.jpg)
Open Source
Hadoop: http://hadoop.apache.org/
Spark: http://spark.apache.org/
NoSQL: http://hbase.apache.org/
Search Engine: http://lucene.apache.org/solr/
![Page 34: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/34.jpg)
Open Conference
Hadoop Summit
Live streaming
– http://siliconangle.tv/hadoop-summit-2016/
Spark Summit: https://spark-summit.org/east-2017/
Live streaming
– http://go.spark-summit.org/east-2017/live-stream?_ga=1.62160364.1150099959.1484851457
![Page 35: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/35.jpg)
Open Data
USA government
Federal, state, and city governments
Expose data to the public
USA business
Twitter, Yelp, …
Expose data to the public with APIs
– Some restrictions on downloads
City government
New York
– Taxi, Uber, …
Los Angeles
– Open Data, Open Hub with geo info
![Page 36: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/36.jpg)
Contents
Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
![Page 37: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/37.jpg)
Databricks Partners
![Page 38: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/38.jpg)
Industrial Collaboration: Cloudera visits to interview Jongwook Woo
![Page 39: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/39.jpg)
Industrial Collaboration: IBM Bluemix at CalStateLA
![Page 40: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/40.jpg)
Big Data Analysis and Prediction Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage: HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering: Hive, Pig
Data Analysis and Science: Hive, Pig, Spark, BI Tools (Tableau, Qlik, …)
Data Visualization: Qlik, Datameer, Excel PowerView
![Page 41: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/41.jpg)
Databricks cluster at CalStateLA
![Page 42: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/42.jpg)
LOCAL BUSINESS DATA ANALYSIS
Yashaswi Ananth, Ruchi Singh, Mahsa Tayer Farahani
![Page 43: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/43.jpg)
LOCAL BUSINESS DATA ANALYSIS
Using local business data from Yelp and Google Local
Grad students at CalStateLA Symposium, Feb 24 2017: Yashaswi Ananth, Ruchi Singh, Mahsa Tayer Farahani
![Page 44: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/44.jpg)
REVIEW COUNT FOR BUSINESS TYPES
• Food
• Services
• Entertainment
• Shopping
• Medical
![Page 45: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/45.jpg)
TOP BUSINESS IN THE SIX CATEGORIES
![Page 46: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/46.jpg)
Review count of popular sub-categories of business
![Page 47: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/47.jpg)
Sentiment Analysis of Services category
![Page 48: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/48.jpg)
Top business
Top 5 most popular local businesses on Yelp between 2006 and 2016 in the selected cities
![Page 49: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/49.jpg)
Popular businesses within 5 miles of CalStateLA, USC, and UCLA
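A "within 5 miles" filter like the one behind this map could be sketched with the haversine great-circle distance. The slides do not say how the distance was computed, so this is only one possible approach. The coordinates, place names, and function names below are hypothetical and are not taken from the Yelp or Google Local data.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8   # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def within_radius(center, places, miles=5.0):
    """Return the names of places no farther than `miles` from `center`."""
    return [name for name, lat, lon in places
            if haversine_miles(center[0], center[1], lat, lon) <= miles]

# Hypothetical campus location and businesses.
campus = (34.0, -118.0)
places = [("near cafe", 34.01, -118.01), ("far diner", 34.5, -118.5)]
nearby = within_radius(campus, places)
```

The same filter applies unchanged to any record that carries a latitude and longitude.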
![Page 50: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/50.jpg)
Historical Analysis Of College Scorecard
CalStateLA Symposium Feb 24 2017
Kunal Pritwani, Atinder Singh, Dharmesh Soni, Mounika Vallabhaneni
![Page 51: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/51.jpg)
Specification of Data Set
Data is collected from https://www.kaggle.com/kaggle/college-scorecard
We have historical data of over 100,000 colleges in the US spanning over 14 years.
Data size: 1.33 GB
File format: CSV (comma-separated values)
![Page 52: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/52.jpg)
Mean Income
Medical College of Wisconsin: 250K
Upstate Medical University: 152.7K
CalTech: 103K
Washington and Lee University: 100K
![Page 53: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/53.jpg)
Comparing Average Net Price of Two States (Annual Tuition)
UCLA: $13,817
CalStateLA: $4,370
Fashion Inst of Tech: $11.5K
CUNY: $5K
![Page 54: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/54.jpg)
SAT Scores in Different Colleges
Math (Blue), Verbal (Orange), Mean Earning (Purple)
• CalTech: 800, 778.9, $98.7K
• MIT: 800, 764.4, $124.4K
• Harvard: 791, 795.6, $133K
• Princeton: 793, 791, $115.6K
• Yale: 788, 794.4, $97.8K
![Page 55: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/55.jpg)
Comparing Average Undergraduates Receiving PELL GRANT
Universal Career Community College: 100% PELL grant scholarship
![Page 56: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/56.jpg)
Average Undergraduates Receiving PELL GRANT in Each College
East Georgia State College: $2,854
Avg. PELL grant: 97.285%
![Page 57: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/57.jpg)
Alphago vs Lee using Twitter Data
Systems
Azure HDInsight Spark
8 nodes
– 40 cores: 2.4GHz Intel Xeon
– Memory per node: 28 GB
Data source
Keyword 'alphago' from Twitter via Apache NiFi
Data size
63,193 tweets
Real-time data collection period
03/12 – 03/17/2016
– No data collected on 03/13
![Page 58: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/58.jpg)
Top 10 Countries that Tweet “Alphago”
![Page 59: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/59.jpg)
Top 10 Countries
# of tweets per country
USA: > 11,000
Japan: > 9,000
Korea: > 1,900
Russia, UK: > 1,600
Thailand, France: > 1,000
Netherlands, Spain, Ukraine: > 600
![Page 60: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/60.jpg)
Top 10 Countries Sentiment
Positive Negative
![Page 61: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/61.jpg)
Top 10 Countries
Most tweeted countries
All countries show more positive tweets
– Korea, Japan, USA
Country Positive Negative
USA 5070 3567
Japan 8118 217
…
Korea 1053 407
…
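The counts in the table above can be turned into a share of positive tweets per country, which makes the three rows easier to compare. A small sketch using only the three rows shown on the slide; the function name `positive_share` is an assumption for this example.

```python
# (positive, negative) tweet counts copied from the table above.
counts = {"USA": (5070, 3567), "Japan": (8118, 217), "Korea": (1053, 407)}

def positive_share(positive, negative):
    """Fraction of sentiment-labeled tweets that are positive."""
    return positive / (positive + negative)

shares = {country: positive_share(*pn) for country, pn in counts.items()}

# Countries ordered from the most to the least positive share.
ranked = sorted(shares, key=shares.get, reverse=True)
```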
![Page 62: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/62.jpg)
Daily Tweets in 03/12 – 03/17/2016
[Bar chart: "Alphago vs Lee Sedol", tweets per day from 3/12/2016 to 3/17/2016, y-axis from 0 to 50,000. Annotations: Game 3: Mar 12; Game 4: Mar 13, Lee Se-Dol win; Game 5: Mar 15.]
![Page 63: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/63.jpg)
Ngram words
3 words in a row right after the Go champion's name, “sedol” and “se-dol”
3-grams: Frequency
Again-to-win: 1,187
Is-something-I’ll: 369
Is-something-i: 199
In-go-tournament: 168
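Counting the three words that follow a keyword can be sketched in a few lines. This is an illustration only: the slides do not show the tokenization that was used, so a simple lowercase whitespace split is assumed here, and the sample tweets are invented rather than taken from the collected data.

```python
from collections import Counter

KEYWORDS = {"sedol", "se-dol"}

def trigrams_after(tokens, keywords):
    """Yield the 3 words that directly follow each keyword occurrence."""
    for i, token in enumerate(tokens):
        following = tokens[i + 1:i + 4]
        if token in keywords and len(following) == 3:
            yield "-".join(following)

def count_trigrams(tweets, keywords=KEYWORDS):
    counter = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()   # assumed tokenization
        counter.update(trigrams_after(tokens, keywords))
    return counter

# Invented sample tweets for illustration.
tweets = [
    "Lee Sedol again to win game four",
    "go lee se-dol again to win",
    "alphago beats sedol in go tournament",
    "sedol wins",
]
frequencies = count_trigrams(tweets)
```

A keyword with fewer than three words after it, as in the last sample tweet, contributes nothing.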
![Page 64: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/64.jpg)
Sentiment Map of Alphago
Legend: Positive, Negative
![Page 65: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/65.jpg)
Sentiment Map of Lee Se-Dol vs Alphago
YouTube video: “alphago sentiment” by Google
The sentiment of the world in geo and time: https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
![Page 66: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/66.jpg)
Airline Data Set
Government Open Data
Airline data set in 2012 – 2014
– US Dept of Transportation
Cluster by Nillohit at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of data nodes: 4
– CPU: 4 cores; memory: 7 GB
– Windows Server 2012 R2 Datacenter
![Page 67: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/67.jpg)
Airline Data Set
![Page 68: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/68.jpg)
Airline Data Set
![Page 69: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/69.jpg)
Airline Data Set
![Page 70: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/70.jpg)
City Government: Crime Data Set
Open Data in City of Los Angeles
Crime data set in 2014
Ram Dharan and Sridhar Reddy at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of data nodes: 4
– CPU: 4 cores; memory: 14 GB
– Windows Server 2012 R2 Datacenter
– Extending to last 10 years of data set
![Page 71: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/71.jpg)
Crime Data, Los Angeles 2014
[Pie chart: "Total occurrences of each crime". Slice labels: CRIMINAL, VANDALISM, OTHERS, BURGLARY, ASSAULT, TRAFFIC, THEFT. Slice values: 2%, 8%, 9%, 12%, 17%, 19%, 33%.]
![Page 72: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/72.jpg)
Total No. of Crimes in 2014
No. of crimes per month (month: crimes)
1: 19169
2: 17384
3: 19730
4: 19413
5: 20645
6: 20494
7: 21480
8: 21280
9: 21287
10: 21669
11: 19844
12: 21355
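The monthly figures on this slide can be summarized with a few lines of arithmetic. The sketch assumes the twelve values are listed in calendar order, month 1 through month 12, which is how they appear in the chart data; the variable names are chosen for this example.

```python
# Crimes per month in 2014, in the order they appear on the slide
# (assumed to be months 1 through 12).
monthly = [19169, 17384, 19730, 19413, 20645, 20494,
           21480, 21280, 21287, 21669, 19844, 21355]

total = sum(monthly)
average = total / len(monthly)

# Month numbers (1-based) with the most and the fewest recorded crimes.
peak_month = max(range(1, 13), key=lambda m: monthly[m - 1])
low_month = min(range(1, 13), key=lambda m: monthly[m - 1])
```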
![Page 73: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/73.jpg)
Raw Data Projection on Map
![Page 74: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/74.jpg)
Mapping of Crimes that Occurred within 5 Miles of CalStateLA
![Page 75: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/75.jpg)
Mapping of Crimes that Occurred within 5 Miles of UCLA
![Page 76: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/76.jpg)
Mapping of Crimes that Occurred within 5 Miles of USC
![Page 77: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/77.jpg)
Mapping of Crimes that Occurred within 5 Miles of CalStateLA, UCLA, and USC in 2015
![Page 78: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/78.jpg)
No. of crimes within 5 miles of CSULA, UCLA, and USC by crime type
[Bar chart: number of crimes (y-axis from 0 to 30,000) for each crime type (ASSAULT, BURGLARY, CRIMINAL, THEFT, TRAFFIC, VANDALISM, others), with one series each for csula, ucla, and usc.]
![Page 79: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/79.jpg)
Contents
Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
![Page 80: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/80.jpg)
Future Research Trend
Deep Learning
TensorFlow and Spark
– Yahoo, Intel, Google
– Image recognition, prediction analysis
ChatBot
Amazon Alexa API
IBM Watson ChatBot API
Google Home API
More into in-memory processing
– Spark DataFrame, Data Set, ML
Cloud computing
– IBM Bluemix, MS Azure, Google Cloud, Amazon AWS
![Page 81: Big Data Trend with Open Platform](https://reader037.fdocuments.in/reader037/viewer/2022092801/58ce9dbf1a28abb26e8b48e3/html5/thumbnails/81.jpg)
Questions?