%Week%12,%Lecture%24% - personal.psu.edu

5
11/14/13 1 2013 ( BMMB 597D: Analyzing Next Genera>on Sequencing Data Week 12, Lecture 24 István Albert Biochemistry and Molecular Biology and Bioinforma>cs Consul>ng Center Penn State Topics to be covered in the next lectures 1. Assembly – building new genomes/ transcriptomes 2. Metagenomics – characterizing mul>ple genomes 3. Chip<Seq – quan>fying loca>ons in genomes 4. RNA<Seq – quan>fying gene expressions Genome Assembly The path to a whole genome Mul>step process 1. Assemble short reads into longer sequences ! conBgs 2. Scaffolding – arrange con>gs rela>ve to one another based on external informa>on 3. Genome finishing (gap closure) ! fill in gaps with directed sequencing procedures and manual cura>on AMOS – A Modular Open(Source whole genome assembler It is not a single so[ware rather than a collec>on of interoperable tools, standards and techniques

Transcript of %Week%12,%Lecture%24% - personal.psu.edu

Page 1: %Week%12,%Lecture%24% - personal.psu.edu

11/14/13%

1%

2013%(%BMMB%597D:%Analyzing%Next%Genera>on%Sequencing%Data%

%%Week%12,%Lecture%24%

István'Albert''

Biochemistry%and%Molecular%Biology%%and%Bioinforma>cs%Consul>ng%Center%

%Penn%State%

Topics%to%be%covered%in%the%next%lectures%

%1.   Assembly%–%building%new%genomes/

transcriptomes%%%

2.   Metagenomics%–%characterizing%mul>ple%genomes%%

3.   Chip<Seq%–%quan>fying%loca>ons%in%genomes%%

4.   RNA<Seq'–%quan>fying%gene%expressions%

Genome%Assembly%%%

The%path%to%a%whole%genome%

Mul>step%process%%

1.   Assemble%short%reads%into%longer%sequences%!%conBgs'

2.   Scaffolding%–%arrange%con>gs%rela>ve%to%one%another%based%on%external%informa>on%'

3.  Genome%finishing%(gap%closure)%!%fill%in%gaps%with%%directed%sequencing%procedures%and%manual%cura>on%

AMOS%–%A%Modular%Open(Source%whole%genome%assembler%

It%is%not%a%single%so[ware%rather%than%a%collec>on%of%%interoperable%tools,%standards%and%techniques%

Page 2: %Week%12,%Lecture%24% - personal.psu.edu

11/14/13%

2%

Read%assembly%challenges%repeated%elements:%RPT%A1%and%RPT%A2%

A%valid%assembly%of%two%con>gs%instead%of%one%

Other%challenges:%%

• %%genomic%varia>on,%heterozygosity,%copy%number%varia>on%• %%misassembly%due%to%sequencing%errors%• %chimeric%sequences%

%

Scaffolding%•  Orien>ng%con>gs%via%paired%end%(or%mate(pair)%informa>on%

Addi>onal%informa>on%to%assist%the%process:%%• %%use%alignment%posi>ons%in%related%genomes%• %%use%gene%synteny%(co(localiza>on%of%gene>c%loci)%%There%are%fewer%automated%pipelines:%%%'BAMBUS'–%Hierarchical%Scaffolding%With%Bambus%by%M.%Pop,%D.%Kosack%and%S.%Salzberg%%

Finishing%genomes% Genome%assembly%is%an%art'

Many%different%approaches%–%substan>al%supervision/evalua>on%required%at%each%step%of%the%process.%%

%Genomes'can'vary'greatly'in'complexity'–'genome'size/repeBBveness''

is'usually'the'limiBng'factor%%

Constant%tuning%and%evalua>on%is%needed.%%

Page 3: %Week%12,%Lecture%24% - personal.psu.edu

11/14/13%

3%

The%N50%sta>s>c%

•  N50%length%is%defined%as%the%con>g%length%L%for%which%50%%of%all%bases%in%the%sequences%are%in%con>gs%of%length%longer%than%L.%%

1.  Sort%all%con>gs%by%size%from%highest%to%lowest%%

2.  Compute%cumula>ve%sum%of%lengths%%

3.  Smallest%number%of%con>gs%that%add%up%to%the%half%of%the%assembled'length'

'The%NG50%sta>s>c%uses%the%genome%size%instead%of%con>g%size%

Using%the%Velvet%Assembler%

Download,%unpack,%and%make%Velvet%%Download%the%23.tar.gz%dataset%from%the%webpage%%Velvet%Assembly%is%a%two%step%process:%%• %velveth%!%builds%the%hashtable'• %velvetg%!%run%the%from%the%hashtable'

Running%velvet%single%end% Running%velvet%paired%end%mode%

Same%data,%radically%improved%assembly,%the%paired%end%informa>on%allows%the%assembler%to%resolve%repe>>ve%data%

Page 4: %Week%12,%Lecture%24% - personal.psu.edu

11/14/13%

4%

A%few%observa>ons%

•  Paired%end%assembly%leads%to%radically%beder%assembly%!%N50%of%667%vs%49270%

•  Hash%size%maders%!%how%to%pick%the%right%one?%Experts%say%to%try%“explore”%the%parameters.%

•  VelvetOp>mizer.pl%(in%the%velvet%contrib)%

Other%resources:%quite%a%few%review%papers%

•  A'PracBcal'Comparison'of'De'Novo'Genome'Assembly'SoSware'Tools'for'Next<GeneraBon'Sequencing'Technologies'(PLoS%ONE%2011)%

•  GAGE:'A'criBcal'evaluaBon'of'genome'assemblies'and'assembly'algorithms'(Genome%Research,%2011)%

Heng%Li’s%Fermi%

Bioinforma>cs%(2012)%28%(14):%

Assembly%evalua>on%

•  O[en%feels%surprisingly%ad(hoc%(people%write%home%grown%scripts%to%fetch%sta>s>cs/subselect%con>gs%etc)%

•  AMOS%–%contains%visualizers%hawkeye%

•  To%compare%to%related%genomes%we%need%op>mal%aligners%not%short%read%mappers!%

%

Page 5: %Week%12,%Lecture%24% - personal.psu.edu

11/14/13%

5%

Aligning%con>gs%

download%and%install%

MUMmer%tools%

•  nucmer%!%(NUCleo>de%MUMmer)%%DNA%sequence%alignment%

•  promer%!%PROmer%(PROtein%MUMmer)%(%all%matching%and%alignment%rou>nes%%performed%on%the%six%frame%amino%acid%transla>on%of%the%DNA%input%sequence%

Older%but%exceedingly%useful%tools%Their%formats%are%somewhat%complicated%%Blast'is'designed'to'search'target'databases'Mummer'is'designed'to'align'genomes!'

Homework%24%

%Simulate%reads%and%generate%assemblies%with%three%different%hash%sizes.%%%Which%assembly%produces%the%best%N50%sta>s>c?%%Which%assembly%produces%the%longest%con>gs%and%how%does%that%compare%to%the%expecta>ons?%