Abstract Sample prep and sequencing - lanl.gov



Los Alamos National Laboratory

PacBio-based Assemblies of Small Eukaryotes

Yuliya Kunde*, Karen Davenport, Cheryl Gleasner, Kim McMurry, Olga Chertkov, Shawn Starkenburg

Los Alamos National Laboratory, Los Alamos, NM

Abstract

Photosynthetic microalgae are a promising source of feedstock material for biofuels. The mechanism(s) of lipid production are not fully understood, but the widely accepted hypothesis is that under stress conditions, microalgae convert excess energy from light into storage compounds like starch and lipids. Currently, high-quality genome assemblies of microalgal production strains are not available; existing assemblies are fragmented into thousands of contigs. Access to nearly finished genomes for these organisms will significantly improve our understanding of key metabolic pathways and inform rational genetic engineering approaches. For this purpose, genomic DNA from two top-ranking candidates, Chlorella sp. (strains 1228 and 1230) and Scenedesmus obliquus, was converted into 20 kb libraries for sequencing and assembly with PacBio and HGAP, respectively. The PacBio-based assemblies were further improved with Illumina short reads or OpGen optical maps. Here we compare these assembly methods and present a cost-benefit analysis of generating hybrid assemblies with PacBio and OpGen.

Introduction

The ability of microalgae to store lipids under stress makes them an attractive source for the production of biofuels. Unfortunately, little is known about the metabolic pathways that lead to lipid storage. Shedding light on these biochemical transformations will significantly improve our understanding of the life cycle of these organisms and will allow for efficient manipulations in growing and harvesting algal biomass. DNA sequencing is a powerful tool for reaching this goal. While typical NGS technologies can provide plenty of sequencing data, their short reads and amplification biases produce fragmented genome assemblies whenever a complex repeat or a poorly amplified (high- or low-GC) region is encountered. Resolving fragmented assemblies requires more costly laboratory and computational resources. Using PacBio long reads reduces the time required and simplifies the genome finishing process.

Data analysis and genome assembly

We used various PacBio-based approaches to analyze data and perform genome assembly:

• C. sorokiniana 1228: long-read PacBio data (HGAP assembly) were supplemented with OpGen optical maps, and manual work brought the genome to the final assembly.

• C. sorokiniana 1230: long-read PacBio data (HGAP assembly) were used alone or in combination with short-read assemblies. Manual work, interspersed through the process and applied again after merging the assemblies, brought the genome to the final assembly.

• S. obliquus: long-read PacBio data only (HGAP assembly) were used to analyze and assemble the genome.

Figure 4 bar-chart values for assemblies A-G (panels: Contig Number, Contig N50, Maximum Contig, Assembly Size):

Assembly | Number of contigs | Contig N50 (bp) | Maximum contig (bp) | Assembly size (bp)
A | 9,652 | 11,009 | 84,533 | 54,829,407
B | 7,477 | 15,504 | 116,504 | 56,036,303
C | 2,753 | 44,845 | 236,940 | 56,020,885
D | 1,147 | 98,320 | 435,074 | 56,521,810
E | 64 | 2,415,094 | 4,567,720 | 61,391,260
F | 82 | 3,828,126 | 5,130,349 | 59,715,620
G | 20 | 4,091,730 | 5,120,617 | 58,534,920

Figure 4. Algal genome assembly and improvement processes for C. sorokiniana 1230: A. Illumina data, single k-mer-based assembly; B. Assembly A improved with manual computational work; C. Assembly from a collaborator (unknown short-read data type, assembly type and improvement process); D. Assemblies B and C merged; E. Assembly D improved with manual computational work; F. PacBio data, overlap assembly; G. Assemblies E and F merged and improved with manual computational work. Illumina-only assemblies are shown in red. PacBio-only assemblies are shown in blue. The collaborator's assembly is shown in gray. Merged assemblies are shown striped. The final assembly was a near-finished genome (20 contigs, with two completed circular organellar chromosomes and the algal chromosome completed telomere to telomere with only a few gaps).
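The contig N50 reported in Figure 4 is a standard summary statistic: the contig length at which the contigs of that length or longer together cover at least half of the assembly. As a minimal sketch of that definition (the function name `n50` and the toy contig lengths are hypothetical and are not taken from the poster's data or pipeline):

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2*running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in bp), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20]
print("assembly size:", sum(toy_lengths))
print("largest contig:", max(toy_lengths))
print("N50:", n50(toy_lengths))
```

Because N50 depends only on the sorted length distribution, merging many short contigs into a few long ones raises it sharply even when the total assembly size barely changes, which is the pattern the Figure 4 panels show between the short-read and PacBio-based assemblies.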

Conclusions

Three algal candidates for the production of biofuels were sequenced at LANL using the Pacific Biosciences RS II NGS platform. PacBio is the current standard for long-read data and is certainly advantageous for genomic analysis of eukaryotes. As shown in Figure 3, the OpGen map provided excellent scaffolding for the PacBio contigs, but the OpGen/PacBio combination for C. sorokiniana 1228 was not as complete (64 contigs) as the short-read/long-read merged assemblies of C. sorokiniana 1230 (20 contigs). While the OpGen technology is expensive, the manual work on the short-read assemblies is also expensive and is likely of similar cost. The HGAP assembly of PacBio-only data for the S. obliquus genome was problematic, since the genome size produced was double what was expected. Because of the heterozygosity in this genome, we are experimenting with FALCON (an assembler designed for higher-ploidy/larger genomes) for the assembly of S. obliquus. Initial results with FALCON give a reduced genome size (closer to the 100 Mb expected) and a 25-30% reduction in contig number. These assemblies indicate that 15-20% of the genome is heterozygous. All of these PacBio data were produced with the P5/C3 chemistry; we believe that if they were generated with the current P6/C4 chemistry (or the soon-to-be-released chemistry with longer movie times), the initial PacBio-only assemblies would be better and the costs would be significantly reduced.

Algae species | Estimated genome size (Mbp) | # of template preps | # of SMRT cells sequenced | # of polished contigs in HGAP | Genome coverage by PacBio | # of Mbp of data | Template prep and sequencing cost
C. sorokiniana 1228 | 61 | 3 | 41 | 322 | 38x | 6,083 | $14,860.97
C. sorokiniana 1230 | 58 | 1 | 16 | 85 | 85x | 5,866 | $7,341.15
S. obliquus | 210 | 2 | 46 | 3,081 | 57x | 18,238 | $15,870.94

Figure 1. Sequencing results and costs for three algal genomes. All three libraries were 20 kbp preps.

Algae species | Technologies | Estimated genome size (Mbp) | Final contigs | Final N50 (bp) | Final largest contig (bp) | Final assembly size (bp)
C. sorokiniana 1228 | PacBio, OpGen | 60 | 64 | 2,415,094 | 4,567,720 | 61,391,260
C. sorokiniana 1230 | PacBio, Illumina | 55 | 20 | 4,091,730 | 5,120,617 | 58,534,920
S. obliquus | PacBio only | 100* | 2,812 | 152,111 | 2,334,183 | 210,263,644

Figure 2. Final assembly results for three algal genomes. *While the S. obliquus genome size is thought to be 100 Mbp, this assembly has a size of about 210 Mbp, which could be a result of heterozygosity.

Figure 3. Optical maps for C. sorokiniana 1228 with HGAP contigs (>40 kb) scaffolded into 12 algal chromosomes (telomere to telomere). Some manual work on the 12 chromosomal scaffolds produced a final assembly of 64 contigs and an assembly size of 61 Mbp. This included two smaller circular organellar chromosomes, which were completed in the HGAP assembly.
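The sequencing costs in Figure 1 are easier to compare across genomes when normalized per SMRT cell and per Mbp of data. The following is a rough sketch of that arithmetic using the SMRT cell counts, data yields and costs listed in Figure 1; the class `SequencingRun` and its property names are hypothetical, and the derived figures are simple ratios rather than results reported on the poster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequencingRun:
    """One row of Figure 1 (hypothetical container for illustration)."""
    species: str
    smrt_cells: int
    data_mbp: float
    cost_usd: float  # template prep plus sequencing cost

    @property
    def cost_per_cell(self) -> float:
        return self.cost_usd / self.smrt_cells

    @property
    def cost_per_mbp(self) -> float:
        return self.cost_usd / self.data_mbp

    @property
    def yield_per_cell_mbp(self) -> float:
        return self.data_mbp / self.smrt_cells


# Values copied from Figure 1.
RUNS = [
    SequencingRun("C. sorokiniana 1228", 41, 6083, 14860.97),
    SequencingRun("C. sorokiniana 1230", 16, 5866, 7341.15),
    SequencingRun("S. obliquus", 46, 18238, 15870.94),
]

for run in RUNS:
    print(
        f"{run.species}: "
        f"${run.cost_per_cell:.2f} per SMRT cell, "
        f"${run.cost_per_mbp:.2f} per Mbp, "
        f"{run.yield_per_cell_mbp:.0f} Mbp per SMRT cell"
    )
```

These ratios only describe data generation. They do not include the OpGen mapping or the manual finishing work, both of which the poster describes as expensive.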

Sample prep and sequencing

Algal cells of C. sorokiniana 1228 were grown in culture, harvested and embedded in agarose plugs for extraction of high-molecular-weight (HMW) DNA. The DNA was extracted through enzymatic removal (destruction) of the cell wall and subsequent lysis of the cells. Some of the algal plugs were sent to OpGen for optical mapping; the others were used for extraction of HMW genomic DNA and preparation of a 20 kb SMRTbell library according to the standard PacBio protocol. Genomic DNA for S. obliquus and C. sorokiniana 1230 was received from an outside collaborator and processed for 20 kb SMRTbell library prep. The long-insert libraries were size-selected using a BluePippin instrument. The sequencing primer was annealed to the selected SMRTbells. DNA polymerase was bound to the SMRTbell template, and the samples were loaded onto the SMRT cells using MagBeads. The samples were sequenced with P5/C3 chemistry and 3-hour movies. The short-read data were generated both at LANL and by an outside collaborator.