Seqpig script language for large bioinformatic datasets
![Page 1: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/1.jpg)
SeqPig
A simple and scalable scripting language for large sequencing data sets in Hadoop
Arian Pasquali
June 6, 2014
![Page 2: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/2.jpg)
/me
Arian Pasquali
Master's student in Data Mining
Data engineer at Semasio
background
- engineering
- cloud computing
- data mining on big data
- social networks
![Page 3: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/3.jpg)
case study
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.
Schumacher A, Pireddu L, Niemenmaa M, Kallio A, Korpelainen E, Zanetti G, Heljanko K.
Bioinformatics. 2014 Jan 1;30(1):119-20. doi: 10.1093/bioinformatics/btt601. Epub 2013 Oct 22.
http://www.ncbi.nlm.nih.gov/pubmed/24149054
![Page 4: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/4.jpg)
but first, some background
● Real-world bioinformatics datasets are huge
● Gigabytes/petabytes are hard to handle on a single computer
● To handle big data sets we have to master parallel programming models
![Page 5: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/5.jpg)
Parallel programming models
some high-performance programming models
- Serial (doesn't scale)
- MPI (expensive)
- MapReduce
  - Hadoop (cheap and scalable)
![Page 6: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/6.jpg)
hadoop
Hadoop is an open source framework that enables you to run MapReduce programs.
It is designed to process huge volumes of data, on the order of terabytes or petabytes, which fits many bioinformatics scenarios well.
http://hadoop.apache.org/
![Page 7: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/7.jpg)
how mapreduce works on hadoop
Provides a framework for MapReduce, a fault-tolerant parallel programming model
- easier to write programs than other paradigms
- easier means cheaper
- runs on clusters with commodity hardware
- scales horizontally
  - need more power? just add more nodes
![Page 8: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/8.jpg)
an application: BLAST algorithm
MapReduce Tasks
- load data
- map sequences
- partition
- reduce (merge)
- output results
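The five tasks listed above can be sketched in plain Python on a single machine. This is only an illustration of the stage order, not BLAST and not Hadoop: the sequences are made-up stand-ins, and the score (number of shared 3-mers) is a hypothetical toy replacement for a real alignment score.

```python
from collections import defaultdict

K = 3  # k-mer length for the toy score (hypothetical choice, not a BLAST parameter)


def kmers(seq, k=K):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def load_data():
    """Stage 1, load data: tiny in-memory stand-ins for real input files."""
    queries = {"q1": "ACGTAC", "q2": "TTGACA"}
    database = {"s1": "ACGTACGT", "s2": "GGTTGACA", "s3": "CCCCCCCC"}
    return queries, database


def map_sequences(queries, database):
    """Stage 2, map: emit (query_id, (subject_id, score)) for matching pairs."""
    for qid, qseq in queries.items():
        for sid, sseq in database.items():
            score = len(kmers(qseq) & kmers(sseq))  # toy score: shared k-mers
            if score > 0:
                yield qid, (sid, score)


def partition(pairs):
    """Stage 3, partition: group mapped values by their key (the query id)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_hits(groups):
    """Stage 4, reduce (merge): keep the best-scoring subject for each query."""
    return {qid: max(hits, key=lambda hit: hit[1]) for qid, hits in groups.items()}


def run_pipeline():
    queries, database = load_data()
    return reduce_hits(partition(map_sequences(queries, database)))


if __name__ == "__main__":
    # Stage 5, output results.
    for qid, (sid, score) in sorted(run_pipeline().items()):
        print(qid, sid, score)
```

On a cluster, the map and reduce functions would run on many nodes at once and the framework would take care of the partition step; here everything runs in one process so the data flow is easy to follow.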
![Page 9: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/9.jpg)
MapReduce is easier, but not trivial
![Page 10: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/10.jpg)
Apache Pig tries to solve that
Under the hood it applies the MapReduce paradigm.
It hides the pitfalls of writing MapReduce code.
![Page 11: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/11.jpg)
Pig version of the same code
![Page 12: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/12.jpg)
Apache Pig in Bioinformatics
It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
It can be easier than writing MapReduce code by hand.
![Page 13: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/13.jpg)
SeqPig
Scalable scripting language based on Apache Pig for large-scale sequence analysis
![Page 14: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/14.jpg)
SeqPig
● a script language,
● a library,
● and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner
http://seqpig.sourceforge.net/
![Page 15: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/15.jpg)
SeqPig and data format support
Currently it supports
- BAM, SAM, FastQ and Qseq input and output
- FASTA input
![Page 16: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/16.jpg)
possible use cases
● converting data formats
● filtering regions of a chromosome
● computing base frequencies
● alignments
● collecting read mapping quality statistics
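Two of the use cases above, converting data formats and computing base frequencies, are small enough to sketch in plain Python. This is a single-machine illustration, not SeqPig code: the function names and the sample reads are hypothetical, and the parser assumes the simple four-lines-per-record FASTQ layout.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@' from the name


def fastq_to_fasta(records):
    """Convert FASTQ records to FASTA text (quality values are discarded)."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in records)


def base_frequencies(records):
    """Return the relative frequency of each base over all reads."""
    counts = Counter()
    for _, seq, _ in records:
        counts.update(seq.upper())
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


# Made-up sample input with two short reads.
SAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nAACC\n+\nIIII\n"

if __name__ == "__main__":
    records = list(read_fastq(StringIO(SAMPLE)))
    print(fastq_to_fasta(records), end="")
    print(base_frequencies(records))
```

The same two operations become worthwhile in Pig once the input no longer fits on one machine, because the per-read work is independent and can be spread across nodes.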
![Page 17: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/17.jpg)
code example
-- load the filter definitions used below
run scripts/filter_defs.pig
-- load the reads from a BAM file
A = load 'input.bam' using BamLoader('yes');
-- keep only mapped, non-duplicate reads
B = FILTER A BY not ReadUnmapped(flags) and not IsDuplicate(flags);
-- split each read into its individual bases
C = FOREACH B GENERATE ReadSplit(name,start,read,cigar,basequal,flags,mapqual,refindex,refname,attributes#'MD');
D = FOREACH C GENERATE FLATTEN($0);
base_stats_data = FOREACH D GENERATE refbase, basepos, UPPER(readbase) AS readbase;
-- count occurrences of each (reference base, position, read base) combination
base_stats_grouped = GROUP base_stats_data BY (refbase, basepos, readbase);
base_stats_grouped_count = FOREACH base_stats_grouped GENERATE group.$0 AS refbase, group.$1 AS basepos, group.$2 as readbase, COUNT($1) AS bcount;
-- regroup by (reference base, position) and order the read bases by count
base_stats_grouped = GROUP base_stats_grouped_count by (refbase, basepos);
base_stats = FOREACH base_stats_grouped {
    TMP1 = FOREACH base_stats_grouped_count GENERATE readbase, bcount;
    TMP2 = ORDER TMP1 BY bcount desc;
    GENERATE group.$0, group.$1, TMP2;
}
STORE base_stats into 'outputfile_readstats.txt';
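The two GROUP steps of the script above can be mirrored in plain Python to show what the output rows contain. This is a single-machine sketch under assumptions: `base_stats` and the sample observations are hypothetical, and the input is taken to be already split into (refbase, basepos, readbase) tuples, which is the job ReadSplit and FLATTEN do in the script.

```python
from collections import Counter, defaultdict


def base_stats(observations):
    """Summarise (refbase, basepos, readbase) observations.

    Returns a dict mapping (refbase, basepos) to a list of (readbase, count)
    pairs, ordered by count in descending order.
    """
    # First grouping: count each (refbase, basepos, readbase) combination.
    counts = Counter((ref, pos, read.upper()) for ref, pos, read in observations)

    # Second grouping: collect the counted read bases per (refbase, basepos).
    grouped = defaultdict(list)
    for (ref, pos, read), n in counts.items():
        grouped[(ref, pos)].append((read, n))

    # Order each list by descending count; ties are broken by base letter.
    return {
        key: sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
        for key, pairs in grouped.items()
    }


if __name__ == "__main__":
    # Made-up observations: reference base, position in the read, base seen.
    sample = [("A", 0, "A")] * 3 + [("A", 0, "g")] + [("C", 0, "C")] * 2
    for (ref, pos), pairs in sorted(base_stats(sample).items()):
        print(ref, pos, pairs)
```

Each printed row has the same shape as a line of the script's output: a reference base, a position, and the read bases observed there with their counts.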
![Page 18: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/18.jpg)
results
A 0 {(A,19),(G,2)}
A 1 {(A,10)}
A 2 {(A,18)}
A 3 {(A,16)}
A 4 {(A,14)}
A 5 {(A,15)}
A 6 {(A,16),(G,2)}
...
A 98 {(A,7)}
A 99 {(A,14)}
C 0 {(C,6)}
C 1 {(C,11)}
C 2 {(C,9)}
![Page 19: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/19.jpg)
results plotted
![Page 20: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/20.jpg)
scalability test
● 61 GB dataset
● running some FastQC stats
* times in minutes
![Page 21: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/21.jpg)
related work
Biodoop: Bioinformatics on Hadoop
http://dl.acm.org/citation.cfm?id=1679817
BioPig: A Hadoop-based Analytic Toolkit for Large-Scale Sequence Data, Oxford Journals
http://bioinformatics.oxfordjournals.org/content/early/2013/09/10/bioinformatics.btt528
![Page 22: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/22.jpg)
some cloud computing solutions
Amazon AWS, general purpose
http://aws.amazon.com/
Mortar Data, focused on data science
http://www.mortardata.com/
CloudGene, focused on bioinformatics users
http://cloudgene.uibk.ac.at/
![Page 23: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/23.jpg)
cloudgene, mapreduce for bioinformatics
![Page 24: Seqpig script language for large bioinformatic datasets](https://reader033.fdocuments.in/reader033/viewer/2022052903/5576ca80d8b42ae3108b4fe9/html5/thumbnails/24.jpg)
conclusions
Bioinformatics has produced innovative algorithms and solutions that are sometimes adopted in other fields of computer science.
Neural networks in artificial intelligence and machine learning are one example.
Now, scalable, large-scale approaches from data mining are helping bioinformatics move forward, faster and cheaper.