Using Apache Spark to fight world hunger - Spark Meetup
noam-barkai
![Page 1: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/1.jpg)
Spark Meetup, December 2015
Noam Barkai
![Page 2: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/2.jpg)
Overview
Food shortage: new problems, new solutions
Intermezzo: how DNA works
Tach’les (down to business): what we do with Apache Spark
![Page 3: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/3.jpg)
The planet has gotten very populous
And it’s the only one we got
![Page 4: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/4.jpg)
World Population
Annual Growth Rate:
Peak - 2.1% (1962)
Current - 1.1% (2009)
source: https://en.wikipedia.org/wiki/World_population#/media/File:World-Population-1800-2100.svg
![Page 5: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/5.jpg)
Food intake
source: http://www.coolgeography.co.uk/A-level/AQA/Year%2012/Food%20supply/Patterns%20and%20intro/Food_consumption.gif
![Page 6: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/6.jpg)
Upscale: Same area, more crops
![Page 7: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/7.jpg)
Plant breeding
An ancient art
Incremental changes
Slow but considerable
source: https://en.wikipedia.org/wiki/Zea_%28genus%29#/media/File:Maize-teosinte.jpg
![Page 8: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/8.jpg)
How long does it take today?
Maize: 10-15 years
source: http://www.cropj.com/shimelis_6_11_2012_1542_1549.pdf
![Page 9: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/9.jpg)
How breeding works
[Diagram: numbered steps 1 to 5]
![Page 10: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/10.jpg)
Computational genomics
⬇ Prices of DNA sequencing
⬆ Number of samples per crop sequenced and analyzed
⬆ Amount and quality of genomic data
⬇ Prices of computation
⬇ Prices of storage
We’re entering a new era
BIG DATA Genomics
![Page 11: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/11.jpg)
Food security - a computational problem?
The plant’s potential lies in its DNA.
We analyze and compare sequences from many plants.
Resulting in better predictions for breeding.
Faster rate of crop improvement.
![Page 12: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/12.jpg)
Intermezzo: DNA - how does it work?
Four “letters”:
cytosine(C), guanine(G),
adenine(A), thymine(T)
Encode 20 amino acids
Combine to make:
100K+ proteins
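A rough illustration of why four letters are enough: DNA is read in fixed-length "words", and a word of length n over four letters can take 4^n values. The sketch below finds the shortest word length that can label 20 amino acids. The function names and the single reserved stop signal are assumptions made for the example, not something from the talk.

```python
# Minimal sketch: how long must a DNA "word" be to encode 20 amino acids?
ALPHABET = "CGAT"      # cytosine, guanine, adenine, thymine
AMINO_ACIDS = 20
STOP_SIGNALS = 1       # assumption: reserve one value for "end of protein"


def words_of_length(n: int) -> int:
    """Number of distinct words of length n over the DNA alphabet."""
    return len(ALPHABET) ** n


def shortest_word_length(needed: int) -> int:
    """Smallest n such that 4**n >= needed."""
    n = 1
    while words_of_length(n) < needed:
        n += 1
    return n


codon_length = shortest_word_length(AMINO_ACIDS + STOP_SIGNALS)
print(codon_length, words_of_length(codon_length))
```

Length 3 gives 64 possible words, which matches the triplet "codons" of the genetic code and leaves room for several codons to map to the same amino acid.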
![Page 13: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/13.jpg)
Conceptually we can think of this as a “pipeline”:
“The Central Dogma”
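The pipeline view can be sketched as two chained stages, DNA to RNA and RNA to protein. This is a simplified illustration: it treats the input as the coding strand, so transcription is just T replaced by U, and it uses the standard genetic code table. The function names `transcribe`, `translate` and `express` are made up for the example.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# RNA uses U instead of T, so the lookup table is keyed by U-based codons.
CODON_TABLE = {
    "".join(codon).replace("T", "U"): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def transcribe(dna: str) -> str:
    """Stage 1: DNA -> messenger RNA (coding strand, T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Stage 2: mRNA -> protein, one letter per codon, stopping at '*'."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


def express(dna: str) -> str:
    """The whole pipeline: DNA -> RNA -> protein."""
    return translate(transcribe(dna))


print(express("ATGGCCTGGTAA"))  # prints "MAW"
```

Real cells add regulation, splicing and folding on top of this, so the sketch only covers the data-flow idea.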
![Page 14: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/14.jpg)
DNA as storage
Durable
Supports random access
Efficient sequential reads
Easily replicated
Contains error correction mechanisms
Maximally “data local”
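One way to picture "easily replicated" and "error correction" is base pairing: each strand determines its partner (A with T, C with G), so one strand can rebuild the other and a bad pairing can be located. The sketch below is a toy under that assumption. It ignores strand direction and the actual repair machinery, and the helper names are hypothetical.

```python
# Toy model of complementary strands (strand direction ignored).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Build the partner strand base by base (A<->T, C<->G)."""
    return "".join(COMPLEMENT[base] for base in strand)


def mismatches(strand: str, partner: str) -> list:
    """Positions where the partner no longer pairs with the strand."""
    return [i for i, (a, b) in enumerate(zip(strand, partner))
            if COMPLEMENT[a] != b]


original = "GATTACA"
partner = complement(original)

# Replication: the partner alone is enough to rebuild the original.
rebuilt = complement(partner)

# Error detection: corrupt one base of the partner and locate it.
damaged = partner[:3] + "C" + partner[4:]
print(partner, rebuilt, mismatches(original, damaged))
```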
![Page 15: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/15.jpg)
Part 2: What we do with Apache Spark
Analyze lots of genome sequences.
Apply similarity algorithms, find where they match.
Finally, assist the breeding program.
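The talk does not say which similarity algorithms are used. As one common and simple stand-in, the sketch below compares two sequences by their shared k-mers (substrings of length k): a Jaccard score says how similar they are, and an index lookup says where they match. All names and the sample sequences are illustrative.

```python
from collections import defaultdict


def kmers(seq: str, k: int):
    """Yield (position, k-mer) for every window of length k."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def jaccard(a: str, b: str, k: int) -> float:
    """Share of distinct k-mers the two sequences have in common."""
    set_a = {kmer for _, kmer in kmers(a, k)}
    set_b = {kmer for _, kmer in kmers(b, k)}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def match_positions(a: str, b: str, k: int):
    """Pairs (pos_in_a, pos_in_b) where the same k-mer occurs in both."""
    index = defaultdict(list)
    for pos, kmer in kmers(a, k):
        index[kmer].append(pos)
    return [(pa, pb) for pb, kmer in kmers(b, k) for pa in index.get(kmer, [])]


sample_1 = "ACGTACGGA"
sample_2 = "TTACGGAC"
print(jaccard(sample_1, sample_2, k=4))          # 0.375
print(match_positions(sample_1, sample_2, k=4))  # [(3, 1), (4, 2), (5, 3)]
```

Because each k-mer can be processed independently, this kind of computation is a natural fit for a map-and-group style of parallelism, although how it is distributed in practice is outside this sketch.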
![Page 16: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/16.jpg)
Input data is “noisy”
Contains errors and gaps.
Is fragmented.
All due to sequencing technology.
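To make "gaps" and "fragmented" concrete, here is a toy sketch that takes fragments with known start positions and reports which stretches of the genome no fragment actually read. It assumes positions are already known (real reads need alignment first) and that `N` marks an unread base. The function name and sample reads are made up, and the per-base list would not scale to giga-base genomes.

```python
def uncovered_regions(genome_length, fragments):
    """Return [start, end) regions that no fragment covers.

    fragments: iterable of (start, sequence) pairs; 'N' marks an unread base.
    """
    covered = [False] * genome_length
    for start, seq in fragments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if base != "N" and 0 <= pos < genome_length:
                covered[pos] = True

    gaps, gap_start = [], None
    for pos, is_covered in enumerate(covered):
        if not is_covered and gap_start is None:
            gap_start = pos                      # a gap opens here
        elif is_covered and gap_start is not None:
            gaps.append((gap_start, pos))        # the gap closes here
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, genome_length))  # gap runs to the end
    return gaps


reads = [(0, "ACGT"), (6, "GGNNA"), (9, "TAC")]
print(uncovered_regions(15, reads))  # [(4, 6), (8, 9), (12, 15)]
```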
![Page 17: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/17.jpg)
Our setup
Hadoop clusters on both private cloud and AWS
Textual files, using Parquet.
MapR 5 Hadoop distro
Spark 1.4.1
SparkSQL and Hive (JDBC)
Instances: ~150GB RAM, 40 cores.
Provisioning: Ansible
![Page 18: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/18.jpg)
Our data
A dozen or so different crops, going for hundreds.
Each crop: potentially ~1K fully sequenced samples
~100K “markers”.
Each sequence: 1Gbp - 10Gbp (giga base-pairs = characters) long
Current: several terabytes, aiming at petabytes
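A back-of-the-envelope check of these numbers, assuming one byte per base pair, uncompressed, and ignoring markers and metadata (the byte size and constant names are assumptions for the estimate):

```python
# Rough data volume from the figures on the slide.
BYTES_PER_BASE = 1               # assumption: one character = one byte
GBP = 10**9                      # one giga base-pair
TB = 10**12                      # decimal terabyte

SAMPLES_PER_CROP = 1_000         # "~1K fully sequenced samples"
SEQ_LENGTH_RANGE = (1 * GBP, 10 * GBP)


def crop_volume_tb(seq_length: int) -> float:
    """Raw sequence volume for one fully sampled crop, in terabytes."""
    return SAMPLES_PER_CROP * seq_length * BYTES_PER_BASE / TB


low, high = (crop_volume_tb(n) for n in SEQ_LENGTH_RANGE)
print(f"one crop: {low:.0f} to {high:.0f} TB")
for crops in (12, 100):
    print(f"{crops} crops: {crops * low:.0f} to {crops * high:.0f} TB")
```

Under these assumptions a single fully sampled crop is 1 to 10 TB, and a hundred crops reach up to 1,000 TB, which is where the petabyte range starts.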
![Page 19: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/19.jpg)
Working with Spark and Scala
Scala’s type system is your friend
Thinking functional takes time - and can be “overdone”
Remember to add @tailrec when needed
Scala case classes - great
Nested structure: keeps you DRY, but sluggish.
Scala has its pitfalls - profile.
Spark as the “ultimate Scala collection” - Martin Odersky.
![Page 20: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/20.jpg)
Integrations with Spark
Complex unmanaged framework - the usual 20/80 rule:
20% fun algorithmic stuff,
80% integration/devops/tuning/black-voodoo
Integration with Hive - doable but cumbersome
DataFrames API - very clean
Parquet in Spark 1.4 - seamless; Parquet with SparkSQL < 1.3 - rather sucks.
![Page 21: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/21.jpg)
Performance tuning with Spark
If RDD objects need high RAM → memory gets tricky.
Spark UI in 1.4.1 - very nice
PairRDD - need to be your own “query optimizer”
repartition / coalesce - very useful, but gets tricky if data variability is high (a dynamic real-time optimizer would be great).
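A plain-Python illustration (not Spark code, and not a description of Spark internals) of why high data variability makes partitioning tricky: spreading records evenly by count can still leave one partition holding most of the bytes. The record sizes are hypothetical, and the greedy alternative is shown only for contrast.

```python
def round_robin(sizes, n_partitions):
    """Deal records out in order; balances record counts, not bytes."""
    loads = [0] * n_partitions
    for i, size in enumerate(sizes):
        loads[i % n_partitions] += size
    return loads


def largest_first(sizes, n_partitions):
    """Greedy alternative: biggest record goes to the lightest partition."""
    loads = [0] * n_partitions
    for size in sorted(sizes, reverse=True):
        loads[loads.index(min(loads))] += size
    return loads


# Hypothetical record sizes in MB: mostly small, a few very large.
record_sizes = [900, 5, 5, 5, 800, 5, 5, 5, 700, 5, 5, 5]
print(round_robin(record_sizes, 4))    # [2400, 15, 15, 15]
print(largest_first(record_sizes, 4))  # [900, 800, 700, 45]
```

Both layouts hold three records per partition in the first case, yet one partition carries 2,400 MB. A size-aware layout needs to know record sizes up front, which is the kind of information a dynamic optimizer would have to collect.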
![Page 22: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/22.jpg)
Testing, packaging and extending Spark
Testing: “local” is great, but means no unit-test :-(
sbt-pack - good alternative to sbt-assembly.
Spark packages: spark-csv, spark-notebook and more.
Speaking of open-source packages...
![Page 23: Using apache spark to fight world hunger - spark meetup](https://reader035.fdocuments.in/reader035/viewer/2022062522/589bc6901a28ab082b8b62e9/html5/thumbnails/23.jpg)
ADAM Project - Genomics using Spark
Fully open sourced from
Similarity algorithms
Population clustering
Predictive analysis using Deep Learning
And more