Spark in the Big Data dark by Sergey Levandovskiy
Apache Hadoop
Pros:
• Batch operations
• Scalability
• User-defined methods
Cons:
• The problem must be solved within the context of a single job
• Filesystem-based
Tez, Pig, Hive, etc.
Pros:
• Batch operations
• Run on top of Hadoop
• Faster than MapReduce
• DAG (directed acyclic graph) execution
Cons:
• Filesystem-based
HDD vs MEMORY?
• Memory access time is in nanoseconds
• 10GbE network access time is in microseconds (~50)
• Flash access time is in microseconds (20 to 500+)
• Disk access time is in milliseconds (4 to 7)
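To make the gap between these tiers concrete, the figures above can be put on a single scale. The sketch below is only illustrative: the 100 ns memory figure is an assumption (the slide says only "nanoseconds"), and the dictionary keys and the function name `slowdown_vs_memory` are made up for this example. The other values are the endpoints of the ranges quoted above.

```python
# Representative access times in nanoseconds.
# "memory" = 100 ns is an ASSUMPTION; the slide only says "nanoseconds".
# All other values are endpoints of the ranges quoted on the slide.
LATENCY_NS = {
    "memory": 100,                    # assumed value
    "10GbE network": 50_000,          # ~50 microseconds
    "flash (low end)": 20_000,        # 20 microseconds
    "flash (high end)": 500_000,      # 500 microseconds (slide says 500+)
    "disk (low end)": 4_000_000,      # 4 milliseconds
    "disk (high end)": 7_000_000,     # 7 milliseconds
}


def slowdown_vs_memory(latencies):
    """Return how many times slower each tier is than the memory entry."""
    base = latencies["memory"]
    return {name: ns / base for name, ns in latencies.items()}


if __name__ == "__main__":
    for name, factor in slowdown_vs_memory(LATENCY_NS).items():
        print(f"{name:18s} {factor:>10,.0f}x")
```

Under that assumption, a single disk access costs as much as tens of thousands of memory accesses, which is the motivation for keeping working data in memory.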
Apache Spark
Pros:
• In-memory operations up to 100x faster than Hadoop MapReduce
• On-disk operations up to 10x faster than Hadoop MapReduce
• In-memory
• Batch operations and near-real-time processing
• Interactive
• Not bound to Hadoop
• Easy for developers to get started
Lazy RDD
Transformations:
• map, filter, flatMap, mapPartitions, mapPartitionsWithIndex
• union, intersection, distinct, groupByKey, reduceByKey, join
Actions:
• collect, count, first, take(n), reduce, countByKey, foreach, takeOrdered, takeSample, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile
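The split between transformations and actions is what makes an RDD lazy: a transformation only records what should be done, and work happens when an action asks for a result. The following is a plain-Python toy model of that idea, not the Spark API. The class name `LazyDataset` and its internals are hypothetical, and only a handful of the operations listed above are mimicked.

```python
import itertools


class LazyDataset:
    """Toy stand-in for an RDD (NOT Spark): transformations are recorded,
    and nothing is computed until an action is called."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable, e.g. a list or range
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    # --- transformations: return a new dataset, do no work ---
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        """Build the iterator chain from the recorded steps."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # --- actions: trigger the computation and return a result ---
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(itertools.islice(self._run(), n))


if __name__ == "__main__":
    seen = []  # records which inputs the mapped function has actually touched

    def traced_square(x):
        seen.append(x)
        return x * x

    ds = LazyDataset(range(1, 6)).map(traced_square).filter(lambda v: v % 2 == 1)
    print(seen)          # still empty: no action has run yet
    print(ds.collect())  # the action triggers the whole pipeline
    print(seen)          # now every input has been processed
```

Each action here re-runs the recorded steps from the source, which loosely mirrors how an uncached RDD is recomputed; real Spark additionally distributes the work across partitions and can cache intermediate results.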
DataFrame
• Distributed collection of data organized into named columns
• SQL-like syntax
• Catalyst Optimizer