Generation, annotation and archiving of NGS data: Laboratory Information Management Systems (LIMS) and Distributed Annotation Server architecture
Advanced genome browsers: The Integrated Genome Browser
Heiko Muller
Computational Research IIT@SEMM
[email protected]@iit.it
Genomic Computing, DEIB, 16‐20 March 2015
Illumina
HiSeq
Each lane contains more than one sample (multiplexing)
180 million clusters per lane
NGS data flow
The current situation:
1. Biologist fills in request form and sends it to service‐[email protected]
2. Samples are inserted into LIMS and request IDs are sent back to biologist
3. Samples are sequenced and run data are inserted into LIMS
4. LIMS prepares sample sheets that are used for demultiplexing and bcl‐>fastq conversion
5. FastQC is run for quality control
6. FASTQ data are saved on IIT‐Isilon device and hard links are produced in user folders
7. Group bioinformaticians align and analyze data
8. Group bioinformaticians interact with biologists to interpret results
Request LIMS‐>FASTQ bioinformaticians Elaborated data sets
homogeneous heterogeneous
NGS usage on campus
LIMS 1.0: NGS requests
http://hilt.iit.ieo.eu:8080/NGSSampleInfo/
LIMS = Laboratory Information Management System
LIMS 1.0: NGS requests
http://hilt.iit.ieo.eu:8080/NGSSampleInfo/
filter
Data delivery
LIMS 1.0 LIMS 2.0
Data delivery
http://hilt.iit.ieo.eu/data/delivery_stats.xlsx
users
facility
Illumina
HiSeq2000
LIMS frontend
SGE‐HPC: blades and GPU blade
Storage Isilon
LIMS DB
Quality control (FastQC)
data
Genome browsers
UCSC
IGB, DAS/2, Quickload
Application servers: Apache2, Glassfish, UCSC, DAS/2, Quickload, data listings
Infrastructure
Application server, blades, GPU
Isilon storage (250 TB, 300.000 Euro)
Request LIMS‐>FASTQ bioinformaticians Elaborated data sets
homogeneous heterogeneous
NGS data flow
Can we improve it?
Raw data: 27.8 TB
FASTQ data: 25.5 TB
Elaborated data: > 57 TB
Scratch: 13 TB
Elaborated data + scratch: > 70 TB
Limitations of LIMS 1.0
• No roles
• Sample – lane relationship N : 1, N : N desirable
• No projects
• No sample annotation compatible with GMQL
• No workflows
• ‐> developed LIMS 2.0 together with PoliMi
Venco, Francesco, et al. "SMITH: A LIMS for handling next‐generation sequencing workflows." BMC Bioinformatics 15.Suppl 14 (2014): S3.
GMQL Compatible Laboratory Information Management System
Demo available: https://cru.genomics.iit.it/smith/
SMITH: Sequencing Machine Information Tracking and Handling
Francesco Venco, Yuriy Vaskin, Arnaud Ceol, Marco Masseroli, Stefano Ceri, Heiko Muller
Controller (FacesServlet)
Model (Managed beans)
View (XHTML facelets)
Hibernate (ORM)
Java EE7 web server
MySQL
SGE‐HPC
File system
Web clients
Sample submission
Sample annotation
Sample analysis
Run folder monitor
Reagent store
Role based access
Virtual flow cell
Index compatibility
Email alerts
Sample tracking
Quality control
Project awareness
SMITH features
requested
queued
confirmed
analyzed
user
technician
Principal investigator
SMITH, HPC, Galaxy
SMITH Sample states
SMITH database schema
Request submission etc, stand‐alone DB client
SMITH Context parameter (configurable)
Request form
SMITH Sample search (role‐based)
Roles:
Admin: everything
Guest: look, no sample details
Group leader: define projects, collaborators, track group samples
User: submit and track samples
Technician: start NGS runs
SMITH NGS runs
Mindex (Mindful Index) to support multiplexing in flow cell assembly
SMITH NGS runs assembly: Mindex
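The index compatibility check behind flow cell assembly can be sketched as follows. This is a minimal Python sketch and not SMITH's actual Mindex code: it assumes that samples may share a lane only if every pair of index sequences differs in at least `min_distance` bases. The function names and the threshold are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length index sequences."""
    if len(a) != len(b):
        raise ValueError("index sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def lane_is_compatible(indexes, min_distance=3):
    """Assumed rule: every pair of indexes in one lane differs in >= min_distance bases."""
    return all(hamming(a, b) >= min_distance for a, b in combinations(indexes, 2))


# Two well separated indexes can share a lane; two near-identical ones cannot.
print(lane_is_compatible(["ATCACG", "CGATGT"]))
print(lane_is_compatible(["ATCACG", "ATCACT"]))
```

A stricter or looser threshold only changes `min_distance`; the pairwise check stays the same.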
SMITH Samplesheets
SMITH NGS analysis trigger
From BCL (base call format) to FASTQ: Demultiplexing
Samplesheet
Script generator
Run on IIT blades (Process proc = Runtime.getRuntime().exec(command);)
SMITH NGS reagents
SMITH Project aware
SMITH Sample annotation with attribute‐value pairs ‐> GMQL
Attributes:
search samples
do statistics on attribute values (GQL)
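Searching samples and computing statistics over attribute-value pairs could look roughly like this. The sample records, attribute names and helper functions are made up for illustration; they do not reflect the SMITH database schema or GMQL syntax.

```python
from collections import Counter

# Hypothetical samples annotated with free attribute-value pairs.
samples = [
    {"id": "S1", "attributes": {"antibody": "PU1", "cell_line": "EML1"}},
    {"id": "S2", "attributes": {"antibody": "Input", "cell_line": "EML1"}},
    {"id": "S3", "attributes": {"antibody": "PU1", "cell_line": "NIH3T3"}},
]


def search(records, **criteria):
    """Return ids of samples whose attributes match all attribute=value pairs."""
    return [
        r["id"]
        for r in records
        if all(r["attributes"].get(k) == v for k, v in criteria.items())
    ]


def value_counts(records, attribute):
    """Count how often each value of one attribute occurs across samples."""
    return Counter(
        r["attributes"][attribute] for r in records if attribute in r["attributes"]
    )


print(search(samples, antibody="PU1"))
print(value_counts(samples, "cell_line"))
```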
SMITH workflows (Data tab)
Path to BigWig/Bam data
Path to FASTQ data
SMITH News
SMITH users
Automatic email communications
By the end of analysis we get big files
fastq
bam
bigWig
SMITH simplifies analysis workflow
Request LIMS‐>FASTQ CRU Elaborated data sets
homogeneous heterogeneous
Previous situation
FASTQ file
User folder
FASTQ folder, backed up
Current situation for bam files
bam file
User folder
BAM folder, backed up
Quickload DAS2
bigWig file
User folder
bigWig folder, backed up
Quickload DAS2
Current situation for bigWig files
Request LIMS‐>FASTQ CRU Elaborated data sets
homogeneous
Homogeneous,
Less space consuming,
Accessible, sharable,
Bioinformaticians can do more science,
Biologists get tracks instantly,
GQL meta‐analysis of ENCODE data
Collaborative (analyses and pipelines)
Advantages
View and share data immediately
Data sources
Data Sources
http://bioserver.iit.ieo.eu/genopub/ http://hilt.iit.ieo.eu/quickload/
Share your data, in the lab or worldwide, by setting access levels, use plug‐ins
DAS2 manages access levels
Plugins
View side‐by‐side with UCSC tracks
Visualizing NGS data: Genome Browsers
Visualizing genomic data: What is a “Genome Browser”
• linear representation of a genome
• position‐based annotations, each called a track
– continuous annotations: e.g. conservation
– interval annotations: e.g. gene, read alignment
– point annotations: e.g. SNPs
• user specifies a subsection of genome to look at
Comparison of Genome Browsers
Browser: UCSC | Ensembl | IGV | IGB
Reference: http://genome.ucsc.edu/ | http://www.ensembl.org/index.html | http://www.broadinstitute.org/igv/ | http://bioviz.org/igb/
Model: Server | Server | Client | Client
Rated per browser (ratings shown graphically on the slide): Interactive, HTS support, Database of tracks, Plugins
Rating scale: No support, Some support, Good support
Server model: server holds the central data store, renders images and sends them to the client; client requests images and displays them
Client model: server stores data; client keeps a local HTS store, renders images and displays them
Limitations:
do not support multiple genomes simultaneously
do not capture 3‐dimensional conformation
do not capture spatial or temporal information
do not integrate well with analytics
• Browse many eukaryotic genomes (yeast to human)
• Most annotations are there
• Important evolutionary and variation data representation.
• Very flexible and configurable views
• Graphical and table views
• Upload your data into custom tracks and share with colleagues
• Client/server application with its issues, but a great app!
About UCSC Genome Browser
http://genome.ucsc.edu
Integrated Genome Browser and IIT DAS2 server
Integrated Genome Browser and published genome annotations
Genome browser view: ChIP‐seq
.bam.bed .bigWig
Genome browser view: sequencing errors
Integrated Genome Browser and the Distributed Annotation System (DAS)
Outline
Genome Browsing: Why was DAS developed?
DAS: history, usage, and specification, reference implementation
Integrated Genome Browser
Examples
Frederick Sanger
Genbank
Centralized repository, sequences owned by submitter,
Genbank
LOCUS       NM_053056               4304 bp    mRNA    linear   PRI 27-MAY-2012
DEFINITION  Homo sapiens cyclin D1 (CCND1), mRNA.
ACCESSION   NM_053056 NM_001758
VERSION     NM_053056.2  GI:77628152
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 4304)
  AUTHORS   Li,Q., Dong,Q. and Wang,E.
  TITLE     Rsf-1 is overexpressed in non-small cell lung cancers and regulates
            cyclinD1 expression and ERK activity
  JOURNAL   Biochem. Biophys. Res. Commun. 420 (1), 6-10 (2012)
   PUBMED   22387541
  REMARK    GeneRIF: Rsf-1 is overexpressed in non-small cell lung cancers and
            contributes to malignant cell growth by cyclin D1 and ERK
            modulation.
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-138               BM796500.1         1-138
            139-1278            BC001501.2         73-1212
            1279-4077           AP001888.4         12952-15750
            4078-4304           X59798.1           4018-4244
FEATURES             Location/Qualifiers
     source          1..4304
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="11"
                     /map="11q13"
     gene            1..4304
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /note="cyclin D1"
                     /db_xref="GeneID:595"
                     /db_xref="HGNC:1582"
                     /db_xref="HPRD:01346"
                     /db_xref="MIM:168461"
     exon            1..407
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /inference="alignment:Splign"
                     /number=1
     CDS             210..1097
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /note="B-cell CLL/lymphoma 1; BCL-1 oncogene; PRAD1
                     oncogene; B-cell lymphoma 1 protein"
                     /codon_start=1
                     /product="G1/S-specific cyclin-D1"
                     /protein_id="NP_444284.1"
                     /db_xref="GI:16950655"
                     /db_xref="CCDS:CCDS8191.1"
                     /db_xref="GeneID:595"
                     /db_xref="HGNC:1582"
                     /db_xref="HPRD:01346"
                     /db_xref="MIM:168461"
                     /translation="MEHQLLCCEVETIRRAYPDANLLNDRVLRAMLKAEETCAPSVSYFKCVQKEVLPSMRKIVATWMLEVCEEQKCEEEVFPLAMNYLDRFLSLEPVKKSRLQLLGATCMFVASKMKETIPLTAEKLCIYTDNSIRPEELLQMELLLVNKLKWNLAAMTPHDFIEHFLSKMPEAEENKQIIRKHAQTFVALCATDVKFISNPPSMVAAGSVVAAVQGLNLRSPNNFLSYYRLTRFLSRVIKCDPDCLRACQEQIEALLESSLRQAQQNMDPKAAEEEEEEEEEVDLACTPTDVRDVDI"
     misc_feature    885..887
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /experiment="experimental evidence, no additional details
                     recorded"
                     /note="Phosphotyrosine; propagated from
                     UniProtKB/Swiss-Prot (P24385.1); phosphorylation site"
ORIGIN
        1 cacacggact acaggggagt tttgttgaag ttgcaaagtc ctggagcctc cagagggctg
       61 tcggcgcagt agcagcgagc agcagagtcc gcacgctccg gcgaggggca gaagagcgcg
      121 agggagcgcg gggcagcaga agcgagagcc gagcgcggac ccagccagga cccacagccc
      181 tccccagctg cccaggaaga gccccagcca tggaacacca gctcctgtgc tgcgaagtgg
      241 aaaccatccg ccgcgcgtac cccgatgcca acctcctcaa cgaccgggtg ctgcgggcca
      301 tgctgaaggc ggaggagacc tgcgcgccct cggtgtccta cttcaaatgt gtgcagaagg
      361 aggtcctgcc gtccatgcgg aagatcgtcg ccacctggat gctggaggtc tgcgaggaac
      421 agaagtgcga ggaggaggtc ttcccgctgg ccatgaacta cctggaccgc ttcctgtcgc
      481 tggagcccgt gaaaaagagc cgcctgcagc tgctgggggc cacttgcatg ttcgtggcct
      541 ctaagatgaa ggagaccatc cccctgacgg ccgagaagct gtgcatctac accgacaact
      601 ccatccggcc cgaggagctg ctgcaaatgg agctgctcct ggtgaacaag ctcaagtgga
      661 acctggccgc aatgaccccg cacgatttca ttgaacactt cctctccaaa atgccagagg
      721 cggaggagaa caaacagatc atccgcaaac acgcgcagac cttcgttgcc ctctgtgcca
//
A Genbank
entry
By design, annotations are nearly impossible to incorporate
Since 1989, centrally curated, annotations provided by the community ‐> curation bottleneck
AceDB: A C.elegans
database
2001
2002
‐ To view massive amounts of sequencing data, genome browsers were developed.
‐ Annotations developed in “Annotation Jamborees”
‐ Human Genome Project Analysis Group: concept of annotation tracks
‐ Tracks produced and curated by different groups but stored on centralized server
‐> Bandwidth bottleneck
HUGO
Integrated Genome Browser and the Distributed Annotation System (DAS)
Outline
Genome Browsing: Why was DAS developed?
DAS: history, usage, and specification, reference implementation
Integrated Genome Browser
Examples
Decentralized curation of annotation tracks
Decentralized storage of annotation tracks
Distributed Annotation System: DAS
The distributed annotation system components:
1 Reference genome server (provides coordinates and sequence)
2 Annotation server(s) (provides annotation tracks)
3 Client (view annotations mapped onto reference)
DAS basics
reference
Client (web or stand alone)
annotations
Dowell et al. 2001
Geodesic: Standalone client by Dowell et al. 2001
Source code: http://www.biodas.org/geodesic/
Glyphs: Graphic elements used for track display
DAS/2 (not listed in registry)
http://india907.server4you.de:8080/das2/genome (epigenome.at)
http://www.bioviz.org/das2/genome (Bioviz)
http://bioserver.hci.utah.edu:8080/DAS2DB/genome (UofUtahBioinfoCore)
http://netaffxdas.affymetrix.com/das2/genome (NetAffx)
Currently 1600 DAS/1 entries
Clients:
DAS registry (www.dasregistry.org)
http://www.biodas.org/wiki/DAS/2
Main difference: DAS/2 supports non‐XML file formats
DAS/2 clients support DAS/1 but not vice versa
DAS/1 != DAS/2
2004‐2007, NIH grant for DAS/2 development, partners: Affymetrix, Cold Spring Harbor Lab, the EBI/Sanger Center, Dalke Scientific
DAS specification (www.biodas.org)
Sources: list available genomes
Segments: lists chromosomes per genome
Types: list types of annotation (file format etc.)
Features: list annotation details in specific region
DAS: Basic Query types: sources, segments, types, features
<?xml version="1.0" encoding="UTF-8"?>
<SOURCES xmlns="http://biodas.org/documents/das2" xml:base="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/" >
  <MAINTAINER email="ivan.lago@ifom-ieo-campus.it" />
  <SOURCE uri="D_rerio" title="D_rerio" >
    <VERSION uri="danRer7" title="danRer7" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="danRer7/segments" />
      <CAPABILITY type="types" query_uri="danRer7/types" />
      <CAPABILITY type="features" query_uri="danRer7/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="H_sapiens" title="H_sapiens" >
    <VERSION uri="H_sapiens_Mar_2006" title="H_sapiens_Mar_2006" created="2012-05-05T16:47:27+0200" >
      <COORDINATES uri="http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/" authority="NCBI" taxid="9606" version="36" source="Chromosome" />
      <CAPABILITY type="segments" query_uri="H_sapiens_Mar_2006/segments" />
      <CAPABILITY type="types" query_uri="H_sapiens_Mar_2006/types" />
      <CAPABILITY type="features" query_uri="H_sapiens_Mar_2006/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="M_musculus" title="M_musculus" >
    <VERSION uri="M_musculus_Jul_2007" title="M_musculus_Jul_2007" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="M_musculus_Jul_2007/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Jul_2007/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Jul_2007/features" />
    </VERSION>
    <VERSION uri="M_musculus_Mar_2006" title="M_musculus_Mar_2006" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="M_musculus_Mar_2006/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Mar_2006/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Mar_2006/features" />
    </VERSION>
  </SOURCE>
</SOURCES>
http://bioserver.iit.ieo.eu/genopub/genome
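A client could read such a sources response with Python's standard XML parser, as in the sketch below. The XML is a shortened copy of the response above. The function name `list_capabilities` and the returned dictionary layout are illustrative choices, not part of the DAS/2 specification.

```python
import xml.etree.ElementTree as ET

# Namespace used by DAS/2 documents, as declared in the response above.
NS = {"das": "http://biodas.org/documents/das2"}

SOURCES_XML = """<?xml version="1.0" encoding="UTF-8"?>
<SOURCES xmlns="http://biodas.org/documents/das2">
  <SOURCE uri="M_musculus" title="M_musculus">
    <VERSION uri="M_musculus_Jul_2007" title="M_musculus_Jul_2007">
      <CAPABILITY type="segments" query_uri="M_musculus_Jul_2007/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Jul_2007/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Jul_2007/features" />
    </VERSION>
  </SOURCE>
</SOURCES>"""


def list_capabilities(xml_text):
    """Map each genome version URI to its {capability type: query URI} table."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    result = {}
    for version in root.findall("das:SOURCE/das:VERSION", NS):
        result[version.get("uri")] = {
            cap.get("type"): cap.get("query_uri")
            for cap in version.findall("das:CAPABILITY", NS)
        }
    return result


print(list_capabilities(SOURCES_XML))
```

The query URIs are relative; a client resolves them against the `xml:base` of the response before issuing segments, types or features requests.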
<?xml version="1.0" encoding="UTF-8"?>
<SEGMENTS xmlns="http://biodas.org/documents/das2" xml:base="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/" uri="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/segments" >
  <SEGMENT uri="chr1" title="chr1" length="197195432" />
  <SEGMENT uri="chr2" title="chr2" length="181748087" />
  <SEGMENT uri="chr3" title="chr3" length="159599783" />
  <SEGMENT uri="chr4" title="chr4" length="155630120" />
  <SEGMENT uri="chr5" title="chr5" length="152537259" />
  <SEGMENT uri="chr6" title="chr6" length="149517037" />
  <SEGMENT uri="chr7" title="chr7" length="152524553" />
  <SEGMENT uri="chr8" title="chr8" length="131738871" />
  <SEGMENT uri="chr9" title="chr9" length="124076172" />
  <SEGMENT uri="chr10" title="chr10" length="129993255" />
  <SEGMENT uri="chr11" title="chr11" length="121843856" />
  <SEGMENT uri="chr12" title="chr12" length="121257530" />
  <SEGMENT uri="chr13" title="chr13" length="120284312" />
  <SEGMENT uri="chr14" title="chr14" length="125194864" />
  <SEGMENT uri="chr15" title="chr15" length="103494974" />
  <SEGMENT uri="chr16" title="chr16" length="98319150" />
  <SEGMENT uri="chr17" title="chr17" length="95272651" />
  <SEGMENT uri="chr18" title="chr18" length="90772031" />
  <SEGMENT uri="chr19" title="chr19" length="61342430" />
  <SEGMENT uri="chrX" title="chrX" length="166650296" />
  <SEGMENT uri="chrY" title="chrY" length="15902555" />
  <SEGMENT uri="chrM" title="chrM" length="16299" />
  <SEGMENT uri="chr1_random" title="chr1_random" length="1231697" />
  <SEGMENT uri="chr3_random" title="chr3_random" length="41899" />
  <SEGMENT uri="chr4_random" title="chr4_random" length="160594" />
  <SEGMENT uri="chr5_random" title="chr5_random" length="357350" />
  <SEGMENT uri="chr7_random" title="chr7_random" length="362490" />
  <SEGMENT uri="chr8_random" title="chr8_random" length="849593" />
  <SEGMENT uri="chr9_random" title="chr9_random" length="449403" />
  <SEGMENT uri="chr13_random" title="chr13_random" length="400311" />
  <SEGMENT uri="chr16_random" title="chr16_random" length="3994" />
  <SEGMENT uri="chr17_random" title="chr17_random" length="628739" />
  <SEGMENT uri="chrUn_random" title="chrUn_random" length="5900358" />
  <SEGMENT uri="chrX_random" title="chrX_random" length="1785075" />
  <SEGMENT uri="chrY_random" title="chrY_random" length="58682461" />
</SEGMENTS>
http://bioserver.iit.ieo.eu/genopub/genome/M_musculus_Jul_2007/segments
<?xml version="1.0" encoding="UTF-8"?>
<TYPES xmlns="http://biodas.org/documents/das2" xml:base="http://localhost:8080/genopub/genome/M_musculus_Jul_2007/" >
  <TYPE uri="EML1/PU1_ChIP/Input" title="EML1/PU1_ChIP/Input" >
    <FORMAT name="useq" />
    <PROP key="Normalization" value="N" />
    <PROP key="group" value="Alcalay" />
    <PROP key="group_contact" value="Myriam Alcalay" />
    <PROP key="group_email" value="myriam.alcalay@ifom-ieo-campus.it" />
    <PROP key="name" value="Input" />
    <PROP key="owner" value="Alcalay, Myriam" />
    <PROP key="owner_email" value="IEO" />
    <PROP key="owner_institute" value="myriam.alcalay@ifom-ieo-campus.it" />
    <PROP key="url" value="http://localhost:8080/genopub/genopub?idAnnotation=11" />
    <PROP key="visibility" value="Members" />
  </TYPE>
  <TYPE uri="EML1/PU1_ChIP/PU1_A3" title="EML1/PU1_ChIP/PU1_A3" >
    <FORMAT name="useq" />
    <PROP key="Normalization" value="N" />
    <PROP key="group" value="Alcalay" />
    <PROP key="group_contact" value="Myriam Alcalay" />
    <PROP key="group_email" value="myriam.alcalay@ifom-ieo-campus.it" />
    <PROP key="name" value="PU1_A3" />
    <PROP key="owner" value="Alcalay, Myriam" />
    <PROP key="owner_email" value="IEO" />
    <PROP key="owner_institute" value="myriam.alcalay@ifom-ieo-campus.it" />
    <PROP key="url" value="http://localhost:8080/genopub/genopub?idAnnotation=7" />
    <PROP key="visibility" value="Members" />
  </TYPE>
</TYPES>
http://bioserver.iit.ieo.eu/genopub/genome/M_musculus_Jul_2007/types
http://localhost:8080/genopub/genome/M_musculus_Jul_2007/features?segment=http%3A%2F%2Flocalhost%3A8080%2Fgenopub%2Fgenome%2FM_musculus_Jul_2007%2Fchr1;overlaps=79374747%3A81152999;type=http%3A%2F%2Flocalhost%3A8080%2Fgenopub%2Fgenome%2FM_musculus_Jul_2007%2FEML1%2FPU1_ChIP%2FPU1_B2;format=useq
Returns a file in useq format, essentially a zip file, preferred format in IGB
Contains an archiveReadMe.txt and one or more “slice files”
Observations can be textual or numerical
http://useq.sourceforge.net/useqArchiveFormat.html
Features
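The features query shown above is a set of `key=value` filters joined by semicolons, with each value percent-encoded. A minimal sketch of how a client might assemble it is given below; the helper name `features_url` and its parameters are hypothetical, and only the standard library function `urllib.parse.quote` is used.

```python
from urllib.parse import quote


def features_url(base, segment, start, end, type_path, fmt):
    """Assemble a DAS/2 features query for one region, one type and one format."""
    params = [
        ("segment", f"{base}/{segment}"),   # segment is identified by its full URI
        ("overlaps", f"{start}:{end}"),     # region of interest
        ("type", f"{base}/{type_path}"),    # annotation type is also a full URI
        ("format", fmt),
    ]
    # safe="" forces ':' and '/' inside the values to be percent-encoded.
    query = ";".join(f"{key}={quote(value, safe='')}" for key, value in params)
    return f"{base}/features?{query}"


url = features_url(
    "http://localhost:8080/genopub/genome/M_musculus_Jul_2007",
    "chr1", 79374747, 81152999, "EML1/PU1_ChIP/PU1_B2", "useq",
)
print(url)
```

Called with these arguments, the helper reproduces the query URL of the slide.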
A BED file (.bed) is a tab‐delimited text file that defines a feature track. File extension .bed is recommended. The BED file format is described on the UCSC Genome Bioinformatics web site: http://genome.ucsc.edu/FAQ/FAQformat. Tracks in the UCSC Genome Browser (http://genome.ucsc.edu/) can be downloaded to BED files and loaded into IGB/IGV.
Notes: Zero‐based index: Start and end positions are identified using a zero‐based index. The end position is excluded. For example, setting start‐end to 1‐2 describes exactly one base, the second base in the sequence (ACGT).
track name=pairedReads description="Clone Paired Reads"
Chr22 1000 5000 cloneA
Chr22 2000 6000 cloneB
Other important file formats: BED (textual)
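The zero-based, end-exclusive convention of BED can be checked with a few lines of Python. This is a minimal sketch that reads only the first four columns and splits on any whitespace (real BED files are tab-delimited); the function names are illustrative.

```python
BED_TEXT = """track name=pairedReads description="Clone Paired Reads"
Chr22\t1000\t5000\tcloneA
Chr22\t2000\t6000\tcloneB
"""


def parse_bed(text):
    """Return (chrom, start, end, name) tuples, skipping track/browser/comment lines."""
    features = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        name = fields[3] if len(fields) > 3 else None
        features.append((fields[0], int(fields[1]), int(fields[2]), name))
    return features


def extract(sequence, start, end):
    """BED coordinates map directly onto a Python slice: zero-based, end excluded."""
    return sequence[start:end]


print(parse_bed(BED_TEXT))
print(extract("ACGT", 1, 2))  # start-end 1-2 selects the second base
```

Because the end is excluded, the length of a feature is simply `end - start`.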
The bedGraph format is line‐oriented. BedGraph data are preceded by a track definition line, which adds a number of options for controlling the default display of this track. The track type is REQUIRED, and must be “bedGraph”.
BedGraph track data values can be integer or real, positive or negative values. Chromosome positions are specified as 0‐relative. The first chromosome position is 0. The last position in a chromosome of length N would be N ‐ 1. Only positions specified have data. Positions not specified do not have data and will not be graphed. All positions specified in the input data must be in numerical order. The bedGraph format has four columns of data:
track type=bedGraph name="BedGraph Format"
chr19 49302000 49302300 10
chr19 49302300 49302600 20
chr19 49302600 49302900 25
Intervals can be of any length and overlapping.
Other important file formats: BEDGraph
(numerical)
The wiggle (WIG) format is for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. Wiggle data elements must be equally sized. If you need to display continuous data that is sparse or contains elements of varying size, use the BedGraph format instead. If you have a very large data set and you would like to keep it on your own server, you should use the bigWig data format. Chromosome positions are specified as 1‐relative.
variableStep is for data with irregular intervals between new data points and is the more commonly used wiggle format. It begins with a declaration line and is followed by two columns containing chromosome positions and data values:
variableStep chrom=chrN [span=windowSize]
StartA dataValueA
StartB dataValueB
For example:
variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5
is equivalent to:
variableStep chrom=chr2 span=5
300701 12.5
Both versions display a value of 12.5 at position 300701‐300705 on chromosome 2.
Other important file formats: Wig (“wiggle”)
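The equivalence of the two variableStep spellings can be verified by expanding each record into per-base values. The parser below is a minimal sketch that handles only variableStep sections with `chrom` and optional `span`; the function name is an illustrative choice.

```python
def expand_variable_step(text):
    """Expand variableStep wiggle text into (chrom, position, value), 1-based positions."""
    values = []
    chrom, span = None, 1
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "variableStep":
            options = dict(field.split("=", 1) for field in fields[1:])
            chrom = options["chrom"]
            span = int(options.get("span", 1))  # span defaults to a single base
        else:
            start, value = int(fields[0]), float(fields[1])
            values.extend((chrom, pos, value) for pos in range(start, start + span))
    return values


explicit = """variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5
"""
with_span = """variableStep chrom=chr2 span=5
300701 12.5
"""
print(expand_variable_step(explicit) == expand_variable_step(with_span))
```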
fixedStep is for data with regular intervals between new data values and is the more compact wiggle format. It begins with a declaration line and is followed by a single column of data values.
The declaration line starts with the word fixedStep and includes specifications for chromosome, start coordinate, and step size. The span specification has the same meaning as in variableStep format. For example, this fixedStep specification:
fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33
displays the values 11, 22, and 33 as 5‐base regions on chromosome 3 at positions 400601‐400605, 400701‐400705, and 400801‐400805, respectively. Step and span are fixed for entire data set.
Other important file formats: Wig (“wiggle”)
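How start, step and span interact in fixedStep can be made explicit with a small expansion routine. This is a sketch under the wiggle conventions described above (1-based positions, span defaulting to 1); the function name and the tuple layout are illustrative. With `span=5`, each value covers five bases starting at its position.

```python
def expand_fixed_step(text):
    """Return (chrom, first_base, last_base, value) regions, 1-based inclusive."""
    regions = []
    chrom = None
    position = step = span = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "fixedStep":
            options = dict(field.split("=", 1) for field in fields[1:])
            chrom = options["chrom"]
            position = int(options["start"])
            step = int(options.get("step", 1))
            span = int(options.get("span", 1))
        else:
            regions.append((chrom, position, position + span - 1, float(fields[0])))
            position += step  # the next value starts one step further along
    return regions


WIG_TEXT = """fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33
"""
for region in expand_fixed_step(WIG_TEXT):
    print(region)
```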
Data transfer level
Required: random access. At the lowest layer, we take advantage of the byte‐range protocols of HTTP and HTTPS, and the protocols associated with resuming interrupted FTP transfers, to achieve random access to binary files over the web.
URL data cache layer
A cache layer on top of the data transfer layer. Data are fetched in blocks of 8 Kb, and each block is kept in a cache.
Indexing
Based on a single dimensional version of the R tree that is commonly used for indexing geographical data. The index size is typically less than 1% of the size of the data itself. Because the stored data are sorted by chromosome and start position, not every item in the file must be indexed; in fact by default only every 512th item is indexed.
Compression
Regions between indexed items (containing 512 items by default) are individually compressed (gzip).
BigWig and BigBed
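The URL data cache layer can be illustrated with a small block cache. This is a conceptual sketch, not the bigWig/bigBed library code: `fetch_range` is a hypothetical callable standing in for an HTTP byte-range request, and the class name and `fetch_count` counter are made up for the example.

```python
BLOCK_SIZE = 8 * 1024  # data are fetched in blocks of 8 Kb


class BlockCache:
    """Serve arbitrary byte ranges from fixed-size blocks that are fetched once."""

    def __init__(self, fetch_range):
        # fetch_range(offset, length) -> bytes; assumed stand-in for a byte-range request
        self._fetch = fetch_range
        self._blocks = {}
        self.fetch_count = 0

    def _block(self, index):
        if index not in self._blocks:
            self._blocks[index] = self._fetch(index * BLOCK_SIZE, BLOCK_SIZE)
            self.fetch_count += 1
        return self._blocks[index]

    def read(self, offset, length):
        if length <= 0:
            return b""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        data = b"".join(self._block(i) for i in range(first, last + 1))
        skip = offset - first * BLOCK_SIZE
        return data[skip:skip + length]


# Simulate a remote file with an in-memory byte string.
payload = bytes(range(256)) * 100
cache = BlockCache(lambda offset, length: payload[offset:offset + length])
print(cache.read(8190, 4) == payload[8190:8194], cache.fetch_count)
```

A read that straddles a block boundary costs two fetches; later reads inside already cached blocks cost none.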
Basic architecture
Object relational mapping via Hibernate
Flex
Apache Tomcat 6
Glassfish
mySQL
DAS/2 server reference implementation: http://sourceforge.net/projects/genoviz
Database tables
User table
Annotation table
User role table
Message digest 5 (MD5) hashing from java.security package
Table views
Each file gets its own folder (automatically assigned folder names). No filenames to store in DB, which may contain non‐supported characters.
Data storage directory
Visibility levels:
DAS2 administration user interface
If you want to access data with restricted visibility, you must be inserted in the user table and be part of a group that is headed by the owner of the data.
Users and groups setup
Every user, admin or non‐admin, can change their password, load new data, add data descriptions, and set visibility levels.
Non‐administrator users interface
IGB user identification
jdbcRealm
ldapRealm
Both work
NetAffx and UCSC hg19 annotations
All these annotations are one click away from the user
Conclusions
DAS2 servers provide distributed genome annotations
Support fine grained security model
Perform parsing of data for custom genome views
List of Genome Browsers
Alamut
Annmap
Apollo Genome Annotation Curation Tool
Argo Genome Browser
Artemis Genome Browser
Avadis NGS
BugView
Celera Genome Browser
Dalliance Javascript‐based genome browser
DiProGB
DNAnexus Flash‐based interactive genome browser
Ensembl The Ensembl Genome Browser
Gaggle Genome Browser
GBrowse
Genome Wowser
The Genomic HyperBrowser
Integrative Genomics Viewer
Genostar GenoBrowser
Genoverse interactive genome browser
GenPlay
Golden Helix GenomeBrowse
Integrated Genome Browser
Integrated Microbial Genomes
JBrowse
MGV ‐ Microbial Genome Viewer
MochiView Genome Browser
NextBio Genome Browser
Pathway Tools Genome Browser
Savant Genome Browser
SEED viewer
UCSC Genome Bioinformatics Genome Browser
Viral Genome Organizer (VGO)
VISTA genome browser
WashU Genome Browser
Integrated Genome Browser: reference implementation of a DAS/2 client
IGB: Integrated Genome Browser (http://www.bioviz.org/igb/)
The Integrated Genome Browser (IGB, pronounced Ig‐Bee) is an interactive, zoomable, scrollable software program you can use to visualize and explore genome‐scale data sets, such as tiling array data, next‐generation sequencing results, genome annotations, microarray designs, and the sequence itself. IGB is implemented using the Java programming language and should run on any computer with a Java runtime.
IGB is an open source, publicly‐funded project, but it did not start out that way. Initial development of the software was largely funded by Affymetrix, Inc., which donated the IGB software to the community in 2005. Since then, community developers have continued to contribute their time and efforts to improving the software. Since 2008, funding from the National Science Foundation has allowed us to speed up the pace of development.
IGB interacts with Distributed Annotation System (DAS) servers
DAS (http://www.biodas.org/wiki/Main_Page) defines a communication protocol used to exchange annotations on genomic or protein sequences. It is motivated by the idea that such annotations should not be provided by single centralized databases, but should instead be spread over multiple sites.
DAS/2 was built to address the needs of distributing massive genomic data sets derived from high density microarray applications and Next (and Next Next) Generation Sequencing. Unlike DAS/1, DAS/2 does not require data exchange through text based XML but allows for data distribution using any text or binary format.
Genometry model
Central concept: SeqSymmetry: breadth (SeqSpans) and depth (hierarchy, parents, children)
Hierarchical annotations
URL: http://www.bioviz.org/igb/download.shtml
How to launch IGB
Refseq and cytoband annotations automatically loaded from NetAffx DAS2
IGB after startup
Data access tab
Search tab
Selection info tab
Sliced view to interrogate alternative splice variants, ORF analysis.
Sliced view tab
Graph adjuster tab
External view tab
To load data: click desired data set, choose region in view or whole chromosome, click refresh data.
Data access tab
Load Affy probesets in View
NetAffx and UCSC mm8 annotations
NetAffx and UCSC mm9 annotations
ChIPchip
ChIPseq
Exon array
DNAseI
RNA seq
ChIP pet
RNA pet
Methyl seq
Cage tags
...
Km of data
new server
NetAffx and UCSC hg18 annotations
Data sources: Quickload, DAS, DAS2
Server registration (data source) tab
1. Single files
file type: extension
BAM: .bam
BED: .bed
Binary: .bps, .bgn, .brs, .bsnp, .brpt, .bnib, .bp1, .bp2, .ead, .useq
GFF: .gff, .gtf, .gff3
FASTA: .fa, .fasta, .fas
PSL: .psl, .psl3
DAS: .das, .dasxml, .das2xml
Graph: .gr, .bgr, .sgr, .bar, .chp, .wig
Scored Interval: .sin, .egr, .egr.txt
Copy number: .cnt
Copy number chp: .cnchp, .lohchp
Genomic variation (Toronto DB): .var
Region (genotype console segmenter): SegmenterRptParser.CN_REGION_FILE_EXT, SegmenterRptParser.LOH_REGION_FILE_EXT
FishClones: .fsh, FishClonesParser.FILE_EXT
Scored map: .map
2. Quickload (local directory with auxiliary files)
Easy to set up but can load data only into entire genome.
(example: http://www.bioviz.org/quickload/)
Four types of data sources (files)
3. DAS(1) (example: UCSC; software: http://code.google.com/p/mydas/)
Can load data into view of interest
Response is XML (problematic for large datasets)
4. DAS2 (example: NetAffx; software: http://genoviz.sourceforge.net/)
Unlike DAS1, DAS2 does not require data exchange through text based XML but allows for data distribution using any text or binary format. The two versions are not natively compatible.
Can load data into view of interest in a range of different formats.
Four types of data sources (servers)
Loading BAM files from http listing (no need to move them)
2. Text based annotations (e.g. .bed, .bam, .psl, .gff, .fasta files)
1. Graph based annotations (.gr, .bgr, .sgr, .bar, .chp, .wig, .sin, .egr, .egr.txt)
text
graph
Permit different types of operations
graph
graph
Two basic types of annotation
Logical: intersect, union, A not B, B not A, Xor, Not
antisense transcription
all transcribed regions
Select tracks, right‐click to access context‐menu
Operations on text based annotations
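The logical track operations listed above can be sketched on plain interval lists. The code below is an illustration, not IGB's implementation: it assumes one chromosome, half-open `(start, end)` intervals, and it merges touching intervals. "Not" is left out because it needs the chromosome length.

```python
def merge(intervals):
    """Sort intervals and fuse those that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union(a, b):
    return merge(a + b)


def intersect(a, b):
    result = []
    for s1, e1 in merge(a):
        for s2, e2 in merge(b):
            start, end = max(s1, s2), min(e1, e2)
            if start < end:
                result.append((start, end))
    return result


def a_not_b(a, b):
    """Parts of track a that are not covered by track b."""
    result = []
    for start, end in merge(a):
        cursor = start
        for s2, e2 in merge(b):
            if e2 <= cursor or s2 >= end:
                continue
            if s2 > cursor:
                result.append((cursor, s2))
            cursor = max(cursor, e2)
        if cursor < end:
            result.append((cursor, end))
    return result


def xor(a, b):
    return merge(a_not_b(a, b) + a_not_b(b, a))


track_a = [(100, 200), (300, 400)]
track_b = [(150, 350)]
print(intersect(track_a, track_b), union(track_a, track_b))
print(a_not_b(track_a, track_b), a_not_b(track_b, track_a), xor(track_a, track_b))
```

Intersecting a sense-strand track with an antisense-strand track in this way would yield candidate regions of antisense transcription, and the union of both yields all transcribed regions.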
Scale: filter displayed values by value or by percentile
Height: adjust display height
Style: bar, line, dot, min/max/avg, heatmap, stairstep, color
Transform: log10, log2, loge, and inverses thereof
Join/split: display all graphs as one
Arithmetic (requires identical X‐values): sum, difference, product, division
Thresholding: transforms regions meeting given criterion into text‐based annotation (can then be used in logical operations)
Operations on graph based annotations
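Thresholding turns a graph into interval annotations. The sketch below is a simplified illustration and not IGB's algorithm: it assumes single-base graph points sorted by position, a plain `value >= cutoff` criterion, and it starts a new region whenever positions are not consecutive. The function name is hypothetical.

```python
def threshold(points, cutoff):
    """Convert (position, value) points into half-open regions where value >= cutoff."""
    regions = []
    run_start = previous = None
    for position, value in points:
        if value >= cutoff:
            if run_start is None:
                run_start = position
            elif position != previous + 1:
                # A gap in positions closes the current region and opens a new one.
                regions.append((run_start, previous + 1))
                run_start = position
            previous = position
        elif run_start is not None:
            regions.append((run_start, previous + 1))
            run_start = None
    if run_start is not None:
        regions.append((run_start, previous + 1))
    return regions


graph = [(10, 1.0), (11, 5.0), (12, 6.0), (13, 2.0), (14, 7.0), (20, 9.0)]
print(threshold(graph, 5.0))
```

The resulting regions are ordinary intervals, so they can be fed into logical operations such as intersect or union.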
Plugins
Based on Open Services Gateway initiative (OSGI)
Implement Activator interface
Needed to display plugin in tab
Access tracks from Genometry model
Can perform arbitrary manipulations on tracks
E‐μ myc mouse model, Amati/Faga
Example: myc bound and differentially regulated gene
External view
Molecular Interaction plugin
Molecular Interaction plugin: visualize molecular interactions
Molecular Interaction plugin: visualize interactions with small molecules (drugs)
Our plugin repository (by Arnaud Ceol): http://cru.genomics.iit.it/igb/plugins‐test/
Highly interactive
Excellent logarithmic zooming around hairline
Integrated with UCSC/campus browser
Can do logical/arithmetic operations on annotations
Can create custom annotations on the fly
Can incorporate distributed annotations
Easily customizable display options
Open‐source: new features can be added according to our needs
IGB summary
Acknowledgements
Arnaud Ceol, Luca Zammataro, Jole Costanza, Anna Biressi, Sofia Cappellari, Bruno Amati
Francesco Venco, Yuriy Vaskin, Marco Masseroli, Stefano Ceri, et al.
Pier Giuseppe Pelicci, Domenico Triarico, Roberta Carbone, Annalisa Ariesi, Daniela Rossi