Hadoop 101 for bioinformaticians
[Slide 1]
Hadoop 101 for bioinformaticians
Attila Csordas
Anybody used Hadoop before?
[Slide 2]
Aim (1 hour):
- theoretical introduction -> be able to assess whether your problem fits MR/Hadoop
- practical session -> be able to start developing code
[Slide 3]
Hadoop & me
- mostly non-coding bioinformatician at PRIDE
- Cloudera Certified Hadoop Developer
- ~1300 Hadoop jobs, ~4000 MR jobs
- 3+ yrs of operator experience
[Slide 4]
Linear scalability
[Charts: Db build time = f(# of peptides), with # of proteins at 4 mill (quarter nr), 8 mill (half nr), 16 mill (nr); Search time = f(job complexity) for 5,000 / 100,000 / 200,000 spectra; 6+ days on a 4-core local machine]
[Slide 5]
Theoretical introduction
- Big Data
- Data Operating System
- Hadoop 1.0
- MapReduce
- Hadoop 2.0
[Slide 6]
Big Data: difficult to process on a single machine
- Volume: low TB - low PB
- Velocity: generation rate
- Variety: tab separated, machine data, documents
Biological repositories fit the bill: e.g. PRIDE
[Slide 7]
Data Operating System
- features: distributed, scalable, fault-tolerant, redundant
- components: storage, resource management, processing engine 1, processing engine 2
[Slide 8]
Hadoop 1.0
- storage: Hadoop Distributed File System (aka HDFS) - redundant, reliable, distributed
- processing engine / cluster resource management: MapReduce - distributed, fault-tolerant resource management, scheduling & processing engine
- single-use data platform
[Slide 9]
MapReduce
- programming model for large-scale, distributed, fault-tolerant, batch data processing
- execution framework, the "runtime"
- software implementation
[Slide 10]
Stateless algorithms
- output depends only on the current input, not on previous inputs
- state-dependent counter-example: Fibonacci series, F(k+2) = F(k+1) + F(k)
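The distinction can be shown in a few lines of plain Python (a sketch of the idea, not from the deck):

```python
# Stateless: each output depends only on its own input, so elements can be
# processed independently, in any order -- exactly what map needs.
def square(x):
    return x * x

squares = [square(x) for x in [1, 2, 3, 4]]

# Stateful counter-example: every Fibonacci term depends on the previous two,
# so terms cannot be computed as independent per-element tasks.
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```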
[Slide 11]
Parallelizable problems
- easy to separate into parallel tasks
- might need to maintain a global, shared state
[Slide 12]
Embarrassingly/pleasingly parallel problems
- easy to separate into parallel tasks
- no dependency or communication between those tasks
[Slide 13]
Functional programming roots
- higher-order functions accept other functions as arguments
- map: applies its argument f() to all elements of a list
- fold: takes g() + an initial value -> final value
- sum of squares: map(x^2), fold(+) with 0 as initial value
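The sum-of-squares example maps directly onto Python's built-ins (`functools.reduce` is Python's fold):

```python
from functools import reduce  # Python's fold

nums = [1, 2, 3, 4]
squared = map(lambda x: x * x, nums)                 # map: apply f() to every element
total = reduce(lambda acc, x: acc + x, squared, 0)   # fold: g() with 0 as initial value
```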
[Slide 14]
MapReduce
- map(key, value) -> [(key, value)]
- shuffle & sort: values grouped by key
- reduce(key, iterator<value>) -> [(key, value)]
[Slide 15]
Word Count

Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);

source: Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.
[Slide 16]
Count amino acids in peptides

Map(byte offset of the line, peptide sequence):
  for each amino acid in peptide:
    Emit(amino acid, 1);

Reduce(amino acid, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(amino acid, sum);
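Running the same map / shuffle / reduce steps by hand over three of the deck's example peptides (a local Python sketch, not the repo's Java implementation):

```python
from collections import defaultdict

# three short peptides from the deck's headpeptides.txt example
peptides = ["AGELTEDEVER", "DTNGSQFFITTVK", "IEVEKPFAIAKE"]

# map: Emit(amino acid, 1) for every residue
pairs = [(aa, 1) for seq in peptides for aa in seq]

# shuffle & sort: group the 1s by amino acid
grouped = defaultdict(list)
for aa, one in pairs:
    grouped[aa].append(one)

# reduce: sum the grouped values per amino acid
counts = {aa: sum(ones) for aa, ones in grouped.items()}
```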
[Slide 17]
[Diagram: Mappers -> Partitioners -> Intermediates -> Reducers; combiners omitted here]
Source: redrawn from a slide by Cloudera, cc-licensed
source: Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.
[Slide 18]
[Diagram: (k, v) pairs flow through map -> combine -> partition; "Shuffle and Sort: aggregate values by keys"; reducers emit the final (r, s) pairs]
source: Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.
[Slide 19]
[Diagram: amino acid counting over the example peptides - map over (0, AGELTEDEVER), (12, DTNGSQFFITTVK), (26, IEVEKPFAIAKE) emits (amino acid, 1) pairs; shuffle & sort aggregates values by keys; reduce emits e.g. E 2, G 2, N 1]
[Slide 20]
Hadoop 1.0: single use
Pig: scripting for Hadoop, Yahoo!, 2006
Hive: SQL queries for Hadoop, Facebook

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
[Slide 21]
Hadoop 2.0
- processing engines: MapReduce, Tez, HBase, Storm, Giraph, Spark … (batch, interactive, online, streaming, graph …)
- cluster resource management: YARN - distributed, fault-tolerant resource management & scheduling
- storage: HDFS2 - redundant, reliable, distributed
- multi-use data platform
[Slide 22]
Practical session
Objective: up & running w/ Hadoop in 30 mins
[Slide 23]
requirements
- Java
- IntelliJ IDEA
- Maven
[Slide 24]
Outline
- check out https://github.com/attilacsordas/hadoop_introduction
- elements of a MapReduce app, basic Hadoop API & data types
- bioinformatics toy example: counting amino acids in sequences
- executing a Hadoop job locally
- 2 more examples building on top of AminoAcidCounter
[Slide 25]
https://github.com/attilacsordas/hadoop_for_bioinformatics
[Slide 26]
MapReduce app components
• driver
• mapper
• reducer
repo: bioinformatics.hadoop.AminoAcidCounter.java
[Slide 27]
InputFormats
- specify the data passed to Mappers; specified in the driver
- determine the input splits and the RecordReaders extracting key-value pairs
- default formats: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat …
- e.g. TextInputFormat yields (LongWritable byte offset, Text line) records
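What a TextInputFormat-style RecordReader produces can be sketched in plain Python (an illustration of the behaviour, not Hadoop's implementation) -- note the keys are the byte offsets where each line starts, exactly the mapper input key used in the amino-acid counter:

```python
def text_input_records(data: bytes):
    # Mimics a TextInputFormat record reader: key = byte offset of the
    # line start (a LongWritable in Hadoop), value = the line's text.
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\n").decode()
        offset += len(raw)

records = list(text_input_records(b"AGELTEDEVER\nDTNGSQFFITTVK\nIEVEKPFAIAKE\n"))
```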
[Slide 28]
Data types
- everything implements Writable: de/serialization
- all keys are WritableComparable: sorting keys
- box classes for primitive data types: Text, IntWritable, LongWritable …
[Slide 29]
bioinformatics toy example
Our task is to prepare data for a statistical model of which physico-chemical properties of the constituent amino acid residues affect the visibility of the high-flyer peptides to a mass spectrometer.

AGELTEDEVER
QSVHLVENEIQASIDQIFSHLER
DTNGSQFFITTVK
IEVEKPFAIAKE

repo: headpeptides.txt in input folder
[Slide 30]
Run the code
- local
- pseudo-distributed
- cluster
[Slide 31]
Debugging: do it locally if possible
- print statements
- write to map output
- mapred.map.child.log.level=DEBUG, mapred.reduce.child.log.level=DEBUG -> syslog task logfile
- running debuggers remotely is hard
- JVM debugging options
- task profilers
[Slide 32]
Counters
- track records for statistics, malformed records
- repo: AminoAcidCounterwithMalformedCounter
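The idea behind the malformed-record counter can be simulated in plain Python (names like `VALID_RESIDUES` and `map_with_counters` are mine; the repo class uses Hadoop's Counter API instead of a dict):

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino-acid letters

def map_with_counters(lines):
    # Skip malformed input lines and count them instead of failing the job --
    # the role counters play in AminoAcidCounterwithMalformedCounter.
    counters = {"malformed": 0}
    pairs = []
    for line in lines:
        seq = line.strip()
        if not seq or any(ch not in VALID_RESIDUES for ch in seq):
            counters["malformed"] += 1
            continue
        pairs.extend((aa, 1) for aa in seq)
    return pairs, counters

pairs, counters = map_with_counters(["AGELTEDEVER", "AGXLTE", ""])
```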
[Slide 33]
- Task 1: modify the mapper to count positions too (R_2) -> AminoAcidPositionCounter
- Task 2: count positions & normalise to peptide length -> AminoAcidPositionCounterNormalizedToPeptideLength
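One possible reading of the two tasks, sketched as plain-Python mappers (the repo classes may structure keys differently):

```python
def position_mapper(peptide):
    # Task 1 sketch: key on residue plus its 1-based position, e.g. "R_2"
    return [(f"{aa}_{pos}", 1) for pos, aa in enumerate(peptide, start=1)]

def normalized_position_mapper(peptide):
    # Task 2 sketch: normalise the position to the peptide length
    n = len(peptide)
    return [(aa, pos / n) for pos, aa in enumerate(peptide, start=1)]
```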
[Slide 34]
Run the jar on the cluster
- set up an account on a Hadoop cluster
- move input data into HDFS
- set mainClass in pom.xml
- mvn clean install
- cp the jar from target to the servers
- ssh into Hadoop
- run the hadoop command line
[Slide 35]
homework, next steps
- count all the 3 outputs for 1 input k,v pair
- count all amino acid pairs
- run FastaAminoAcidCounter & check FastaInputFormat
- run the jar on a cluster
- https://github.com/lintool/MapReduce-course-2013s/tree/master/slides

n * (trial, error) -> success