
Generation, annotation and archiving of NGS data: Laboratory Information Management Systems (LIMS) and Distributed Annotation Server architecture

Advanced genome browsers: The Integrated Genome Browser

Heiko Muller
Computational Research IIT@SEMM

[email protected]@iit.it

Genomic Computing, DEIB, 16‐20 March 2015

Illumina HiSeq

Each lane contains more than one sample (multiplexing)

180 million clusters per lane

NGS data flow

The current situation:
1. Biologist fills in request form and sends it to service‐[email protected]
2. Requests are inserted into LIMS and request IDs are sent back to the biologist
3. Samples are sequenced and run data are inserted into LIMS
4. LIMS prepares sample sheets that are used for demultiplexing and bcl‐>fastq conversion
5. FastQC is run for quality control
6. FASTQ data are saved on the IIT‐Isilon device and hard links are produced in user folders
7. Group bioinformaticians align and analyze data
8. Group bioinformaticians interact with biologists to interpret results

Request | LIMS‐>FASTQ (homogeneous) | bioinformaticians | Elaborated data sets (heterogeneous)

NGS usage on campus

LIMS 1.0: NGS requests

http://hilt.iit.ieo.eu:8080/NGSSampleInfo/

LIMS = Laboratory Information Management System

LIMS 1.0: NGS requests

http://hilt.iit.ieo.eu:8080/NGSSampleInfo/

filter

Data delivery

LIMS 1.0 LIMS 2.0

Data delivery

http://hilt.iit.ieo.eu/data/delivery_stats.xlsx

users

facility

[Figure: infrastructure overview]
Illumina HiSeq2000
LIMS frontend, LIMS DB
SGE‐HPC: blades and GPU blade
Storage: Isilon
Quality control (FastQC)
Genome browsers: UCSC; IGB, DAS/2, Quickload
Application servers: Apache2, Glassfish, UCSC, DAS/2, Quickload, data listings

Infrastructure

Application server, blades, GPU

Isilon storage (250 TB, 300,000 Euro)

Request | LIMS‐>FASTQ | bioinformaticians | Elaborated data sets

NGS data flow

Can we improve it?

Raw data: 27.8 TB
FASTQ data: 25.5 TB
Elaborated data: > 57 TB
Scratch: 13 TB
Elaborated data + scratch: > 70 TB

Limitations of LIMS 1.0

• No roles
• Sample – lane relationship is N : 1; N : N desirable
• No projects
• No sample annotation compatible with GMQL
• No workflows

‐> developed LIMS 2.0 together with PoliMi

Venco, Francesco, et al. "SMITH: A LIMS for handling next‐generation sequencing workflows." BMC Bioinformatics 15 (Suppl 14) (2014): S3.

GMQL Compatible Laboratory Information Management System

Demo available: https://cru.genomics.iit.it/smith/

SMITH: Sequencing Machine Information Tracking and Handling
Francesco Venco, Yuriy Vaskin, Arnaud Ceol, Marco Masseroli, Stefano Ceri, Heiko Muller

[Figure: SMITH architecture]
Java EE7 web server: Controller (FacesServlet), Model (managed beans), View (xhtml facelets), Hibernate (ORM)
Connected components: MySQL, SGE‐HPC, file system, web clients

Sample submission

Sample annotation

Sample analysis

Run folder monitor

Reagent store

Role based access

Virtual flow cell

Index compatibility

Email alerts

Sample tracking

Quality control

Project awareness

SMITH features

requested

queued

confirmed

analyzed

user

technician

Principal investigator

SMITH, HPC, Galaxy

SMITH Sample states

SMITH database schema

Request submission etc, stand‐alone DB client

SMITH Context parameter (configurable)

Request form

SMITH Sample search (role‐based)

Roles:
Admin: everything
Guest: look, no sample details
Group leader: define projects, collaborators, track group samples
User: submit and track samples
Technician: start NGS runs

SMITH NGS runs

Mindex (Mindful Index) to support multiplexing in flow cell assembly

SMITH NGS runs assembly: Mindex

SMITH Samplesheets

SMITH NGS analysis trigger

From BCL (base call format) to FASTQ: Demultiplexing

Samplesheet

Script generator

Run on IIT blades (Process proc = Runtime.getRuntime().exec(command);)
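Conceptually, demultiplexing assigns each read to a sample through its index (barcode) sequence, as listed in the sample sheet. The following is a minimal Python sketch of that idea only. It assumes exact index matches and a hypothetical in-memory sample sheet; it is not the bcl‐>fastq converter that SMITH launches on the blades, and the names `demultiplex` and `sample_sheet` are illustrative.

```python
from collections import defaultdict


def demultiplex(reads, sample_sheet):
    """Group reads by sample using their index sequence.

    reads: iterable of (index_sequence, read_sequence) tuples.
    sample_sheet: dict mapping index sequence -> sample name
                  (hypothetical layout, chosen for illustration).
    Reads whose index is not in the sample sheet go to "Undetermined".
    """
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = sample_sheet.get(index_seq, "Undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)


# Made-up indices and reads, one lane shared by two samples.
sample_sheet = {"ATCACG": "sampleA", "CGATGT": "sampleB"}
reads = [
    ("ATCACG", "ACGTAC"),
    ("CGATGT", "TTGACA"),
    ("NNNNNN", "GGGCCC"),
    ("ATCACG", "CATGCA"),
]
result = demultiplex(reads, sample_sheet)
```

Real demultiplexers usually tolerate a small number of index mismatches, which is one reason index compatibility within a lane matters.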

SMITH NGS reagents

SMITH Project aware

SMITH Sample annotation with attribute‐value pairs ‐> GMQL

Attributes:
• search samples
• do statistics on attribute values (GQL)
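A rough sketch of what attribute‐value annotation makes possible, written in Python. The attribute names, sample IDs and helper functions are hypothetical and do not reflect the SMITH database schema.

```python
from collections import Counter

# Hypothetical samples annotated with free attribute-value pairs.
samples = [
    {"id": "S1", "attributes": {"antibody": "Myc", "cell_type": "B cell"}},
    {"id": "S2", "attributes": {"antibody": "Myc", "cell_type": "fibroblast"}},
    {"id": "S3", "attributes": {"antibody": "Input", "cell_type": "B cell"}},
]


def search_samples(samples, attribute, value):
    """Return the IDs of samples whose attribute equals the given value."""
    return [s["id"] for s in samples if s["attributes"].get(attribute) == value]


def attribute_statistics(samples, attribute):
    """Count how often each value of an attribute occurs across samples."""
    return Counter(
        s["attributes"][attribute] for s in samples if attribute in s["attributes"]
    )


myc_samples = search_samples(samples, "antibody", "Myc")
cell_type_counts = attribute_statistics(samples, "cell_type")
```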

SMITH workflows (Data tab)

Path to BigWig/Bam data

Path to FASTQ data

SMITH News

SMITH users

Automatic email communications

By the end of the analysis we get big files: fastq, bam, bigWig

SMITH simplifies analysis workflow

Request | LIMS‐>FASTQ | CRU | Elaborated data sets

Previous situation
FASTQ file; user folder; FASTQ folder, backed up

Current situation for bam files
bam file; user folder; BAM folder, backed up; Quickload, DAS2

Current situation for bigWig files
bigWig file; user folder; bigWig folder, backed up; Quickload, DAS2

Request | LIMS‐>FASTQ (homogeneous) | CRU | Elaborated data sets (homogeneous)

• Homogeneous
• Less space consuming
• Accessible, sharable
• Bioinformaticians can do more science
• Biologists get tracks instantly
• GQL meta‐analysis of ENCODE data
• Collaborative (analyses and pipelines)

Advantages

View and share data immediately

Data sources

Data Sources

http://bioserver.iit.ieo.eu/genopub/
http://hilt.iit.ieo.eu/quickload/

Share your data, in the lab or worldwide, by setting access levels, use plug‐ins

DAS2 manages access levels

Plugins

View side‐by‐side with UCSC tracks

Visualizing NGS data: Genome Browsers

Visualizing genomic data: What is a “Genome Browser”

• linear representation of a genome
• position‐based annotations, each called a track
  – continuous annotations: e.g. conservation
  – interval annotations: e.g. gene, read alignment
  – point annotations: e.g. SNPs
• user specifies a subsection of genome to look at

Comparison of Genome Browsers

            UCSC                      Ensembl                             IGV                                  IGB
Reference   http://genome.ucsc.edu/   http://www.ensembl.org/index.html   http://www.broadinstitute.org/igv/   http://bioviz.org/igb/
Model       Server                    Server                              Client                               Client

Further rows (cell ratings not recoverable from the extracted text): Interactive, HTS support, Database of tracks, Plugins
Rating legend: No support, Some support, Good support

Server model: the server is the central data store, renders images, and sends them to the client; the client requests images and displays them.
Client model: the server stores data; the client keeps a local HTS store, renders images, and displays them.

Limitations:
• do not support multiple genomes simultaneously
• do not capture 3‐dimensional conformation
• do not capture spatial or temporal information
• do not integrate well with analytics

• Browse many eukaryotic genomes (yeast to human)

• Most annotations are there

• Important evolutionary and variation data representation.

• Very flexible and configurable views

• Graphical and table views

• Upload your data into custom tracks and share with colleagues

• Client/server application with its issues, but a great app!

About UCSC Genome Browser

http://genome.ucsc.edu

Integrated Genome Browser and IIT DAS2 server

Integrated Genome Browser and published genome annotations

Genome browser view: ChIP‐seq

.bam, .bed, .bigWig

Genome browser view: sequencing errors

Integrated Genome Browser and the Distributed Annotation System (DAS)

Outline

Genome Browsing: Why was DAS developed?
DAS: history, usage, and specification, reference implementation
Integrated Genome Browser
Examples

Frederick Sanger

GenBank

Centralized repository, sequences owned by submitter

Genbank

LOCUS       NM_053056               4304 bp    mRNA    linear   PRI 27-MAY-2012
DEFINITION  Homo sapiens cyclin D1 (CCND1), mRNA.
ACCESSION   NM_053056 NM_001758
VERSION     NM_053056.2  GI:77628152
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 4304)
  AUTHORS   Li,Q., Dong,Q. and Wang,E.
  TITLE     Rsf-1 is overexpressed in non-small cell lung cancers and regulates
            cyclinD1 expression and ERK activity
  JOURNAL   Biochem. Biophys. Res. Commun. 420 (1), 6-10 (2012)
   PUBMED   22387541
  REMARK    GeneRIF: Rsf-1 is overexpressed in non-small cell lung cancers and
            contributes to malignant cell growth by cyclin D1 and ERK
            modulation.
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-138               BM796500.1         1-138
            139-1278            BC001501.2         73-1212
            1279-4077           AP001888.4         12952-15750
            4078-4304           X59798.1           4018-4244
FEATURES             Location/Qualifiers
     source          1..4304
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="11"
                     /map="11q13"
     gene            1..4304
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /note="cyclin D1"
                     /db_xref="GeneID:595"
                     /db_xref="HGNC:1582"
                     /db_xref="HPRD:01346"
                     /db_xref="MIM:168461"
     exon            1..407
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /inference="alignment:Splign"
                     /number=1
     CDS             210..1097
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /note="B-cell CLL/lymphoma 1; BCL-1 oncogene; PRAD1
                     oncogene; B-cell lymphoma 1 protein"
                     /codon_start=1
                     /product="G1/S-specific cyclin-D1"
                     /protein_id="NP_444284.1"
                     /db_xref="GI:16950655"
                     /db_xref="CCDS:CCDS8191.1"
                     /db_xref="GeneID:595"
                     /db_xref="HGNC:1582"
                     /db_xref="HPRD:01346"
                     /db_xref="MIM:168461"
                     /translation="MEHQLLCCEVETIRRAYPDANLLNDRVLRAMLKAEETCAPSVSYFKCVQKEVLPSMRKIVATWMLEVCEEQKCEEEVFPLAMNYLDRFLSLEPVKKSRLQLLGATCMFVASKMKETIPLTAEKLCIYTDNSIRPEELLQMELLLVNKLKWNLAAMTPHDFIEHFLSKMPEAEENKQIIRKHAQTFVALCATDVKFISNPPSMVAAGSVVAAVQGLNLRSPNNFLSYYRLTRFLSRVIKCDPDCLRACQEQIEALLESSLRQAQQNMDPKAAEEEEEEEEEVDLACTPTDVRDVDI"
     misc_feature    885..887
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /experiment="experimental evidence, no additional details
                     recorded"
                     /note="Phosphotyrosine; propagated from
                     UniProtKB/Swiss-Prot (P24385.1); phosphorylation site"
ORIGIN

        1 cacacggact acaggggagt tttgttgaag ttgcaaagtc ctggagcctc cagagggctg
       61 tcggcgcagt agcagcgagc agcagagtcc gcacgctccg gcgaggggca gaagagcgcg
      121 agggagcgcg gggcagcaga agcgagagcc gagcgcggac ccagccagga cccacagccc
      181 tccccagctg cccaggaaga gccccagcca tggaacacca gctcctgtgc tgcgaagtgg
      241 aaaccatccg ccgcgcgtac cccgatgcca acctcctcaa cgaccgggtg ctgcgggcca
      301 tgctgaaggc ggaggagacc tgcgcgccct cggtgtccta cttcaaatgt gtgcagaagg
      361 aggtcctgcc gtccatgcgg aagatcgtcg ccacctggat gctggaggtc tgcgaggaac
      421 agaagtgcga ggaggaggtc ttcccgctgg ccatgaacta cctggaccgc ttcctgtcgc
      481 tggagcccgt gaaaaagagc cgcctgcagc tgctgggggc cacttgcatg ttcgtggcct
      541 ctaagatgaa ggagaccatc cccctgacgg ccgagaagct gtgcatctac accgacaact
      601 ccatccggcc cgaggagctg ctgcaaatgg agctgctcct ggtgaacaag ctcaagtgga
      661 acctggccgc aatgaccccg cacgatttca ttgaacactt cctctccaaa atgccagagg
      721 cggaggagaa caaacagatc atccgcaaac acgcgcagac cttcgttgcc ctctgtgcca
//

A GenBank entry

By design, annotations are nearly impossible to incorporate

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606


http://www.ncbi.nlm.nih.gov/nuccore/77628152?from=1&to=4304

http://www.ncbi.nlm.nih.gov/protein/16950655

http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=CCDS&DATA=CCDS8191.1

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=595

http://www.genenames.org/data/hgnc_data.php?hgnc_id=1582

http://www.hprd.org/protein/01346

http://www.ncbi.nlm.nih.gov/omim/168461

http://www.ncbi.nlm.nih.gov/protein/16950655?from=226&to=226

Since 1989: centrally curated, annotations provided by the community ‐> curation bottleneck

AceDB: A C.elegans

database

2001

2002

‐ To view massive amounts of sequencing data, genome browsers were developed.
‐ Annotations developed in “Annotation Jamborees”
‐ Human Genome Project Analysis Group: concept of annotation tracks
‐ Tracks produced and curated by different groups but stored on centralized server

‐> Bandwidth bottleneck

HUGO

Integrated Genome Browser and the Distributed Annotation System (DAS)

Outline

Genome Browsing: Why was DAS developed?
DAS: history, usage, and specification, reference implementation
Integrated Genome Browser
Examples

Decentralized curation of annotation tracks
Decentralized storage of annotation tracks

Distributed Annotation System: DAS

The distributed annotation system components:
1. Reference genome server (provides coordinates and sequence)
2. Annotation server(s) (provide annotation tracks)
3. Client (views annotations mapped onto reference)

DAS basics

reference

Client (web or stand alone)

annotations

Dowell et al. 2001

Geodesic: Standalone client by Dowell et al. 2001

Source code: http://www.biodas.org/geodesic/

Glyphs: Graphic elements used for track display

DAS/2 (not listed in registry)
http://india907.server4you.de:8080/das2/genome (epigenome.at)
http://www.bioviz.org/das2/genome (Bioviz)
http://bioserver.hci.utah.edu:8080/DAS2DB/genome (UofUtahBioinfoCore)
http://netaffxdas.affymetrix.com/das2/genome (NetAffx)

DAS registry (www.dasregistry.org)
Currently 1600 DAS/1 entries
Clients:

http://www.biodas.org/wiki/DAS/2

DAS/1 != DAS/2
Main difference: DAS/2 supports non‐XML file formats
DAS/2 clients support DAS/1 but not vice versa

2004‐2007: NIH grant for DAS/2 development; partners: Affymetrix, Cold Spring Harbor Lab, the EBI/Sanger Center, Dalke Scientific

DAS specification (www.biodas.org)

Sources: list available genomes
Segments: list chromosomes per genome
Types: list types of annotation (file format etc.)
Features: list annotation details in specific region

DAS: Basic query types: sources, segments, types, features

<?xml version="1.0" encoding="UTF-8"?>
<SOURCES xmlns="http://biodas.org/documents/das2"
         xml:base="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/">
  <MAINTAINER email="ivan.lago@ifom-ieo-campus.it" />
  <SOURCE uri="D_rerio" title="D_rerio">
    <VERSION uri="danRer7" title="danRer7" created="2012-05-05T16:47:27+0200">
      <CAPABILITY type="segments" query_uri="danRer7/segments" />
      <CAPABILITY type="types" query_uri="danRer7/types" />
      <CAPABILITY type="features" query_uri="danRer7/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="H_sapiens" title="H_sapiens">
    <VERSION uri="H_sapiens_Mar_2006" title="H_sapiens_Mar_2006" created="2012-05-05T16:47:27+0200">
      <COORDINATES uri="http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/" authority="NCBI" taxid="9606" version="36" source="Chromosome" />
      <CAPABILITY type="segments" query_uri="H_sapiens_Mar_2006/segments" />
      <CAPABILITY type="types" query_uri="H_sapiens_Mar_2006/types" />
      <CAPABILITY type="features" query_uri="H_sapiens_Mar_2006/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="M_musculus" title="M_musculus">
    <VERSION uri="M_musculus_Jul_2007" title="M_musculus_Jul_2007" created="2012-05-05T16:47:27+0200">
      <CAPABILITY type="segments" query_uri="M_musculus_Jul_2007/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Jul_2007/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Jul_2007/features" />
    </VERSION>
    <VERSION uri="M_musculus_Mar_2006" title="M_musculus_Mar_2006" created="2012-05-05T16:47:27+0200">
      <CAPABILITY type="segments" query_uri="M_musculus_Mar_2006/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Mar_2006/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Mar_2006/features" />
    </VERSION>
  </SOURCE>
</SOURCES>

http://bioserver.iit.ieo.eu/genopub/genome
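A client can read the sources document with any namespace‐aware XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened excerpt of the response shown on this slide; the function name `list_versions` is illustrative and not part of any DAS/2 library.

```python
import xml.etree.ElementTree as ET

# Namespace used by DAS/2 documents, as seen in the response on the slide.
DAS2_NS = {"das": "http://biodas.org/documents/das2"}

# Shortened excerpt of the sources response (one genome version only).
EXAMPLE_SOURCES = """
<SOURCES xmlns="http://biodas.org/documents/das2">
  <SOURCE uri="D_rerio" title="D_rerio">
    <VERSION uri="danRer7" title="danRer7">
      <CAPABILITY type="segments" query_uri="danRer7/segments" />
      <CAPABILITY type="types" query_uri="danRer7/types" />
      <CAPABILITY type="features" query_uri="danRer7/features" />
    </VERSION>
  </SOURCE>
</SOURCES>
"""


def list_versions(sources_xml):
    """Map each genome version URI to its {capability type: query URI}."""
    root = ET.fromstring(sources_xml)
    versions = {}
    for source in root.findall("das:SOURCE", DAS2_NS):
        for version in source.findall("das:VERSION", DAS2_NS):
            versions[version.get("uri")] = {
                cap.get("type"): cap.get("query_uri")
                for cap in version.findall("das:CAPABILITY", DAS2_NS)
            }
    return versions


versions = list_versions(EXAMPLE_SOURCES)
```

The query URIs are relative, so a client would resolve them against the `xml:base` of the response before issuing the segments, types, or features queries.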

<?xml version="1.0" encoding="UTF-8"?>
<SEGMENTS xmlns="http://biodas.org/documents/das2"
          xml:base="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/"
          uri="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/segments">
  <SEGMENT uri="chr1" title="chr1" length="197195432" />
  <SEGMENT uri="chr2" title="chr2" length="181748087" />
  <SEGMENT uri="chr3" title="chr3" length="159599783" />
  <SEGMENT uri="chr4" title="chr4" length="155630120" />
  <SEGMENT uri="chr5" title="chr5" length="152537259" />
  <SEGMENT uri="chr6" title="chr6" length="149517037" />
  <SEGMENT uri="chr7" title="chr7" length="152524553" />
  <SEGMENT uri="chr8" title="chr8" length="131738871" />
  <SEGMENT uri="chr9" title="chr9" length="124076172" />
  <SEGMENT uri="chr10" title="chr10" length="129993255" />
  <SEGMENT uri="chr11" title="chr11" length="121843856" />
  <SEGMENT uri="chr12" title="chr12" length="121257530" />
  <SEGMENT uri="chr13" title="chr13" length="120284312" />
  <SEGMENT uri="chr14" title="chr14" length="125194864" />
  <SEGMENT uri="chr15" title="chr15" length="103494974" />
  <SEGMENT uri="chr16" title="chr16" length="98319150" />
  <SEGMENT uri="chr17" title="chr17" length="95272651" />
  <SEGMENT uri="chr18" title="chr18" length="90772031" />
  <SEGMENT uri="chr19" title="chr19" length="61342430" />
  <SEGMENT uri="chrX" title="chrX" length="166650296" />
  <SEGMENT uri="chrY" title="chrY" length="15902555" />
  <SEGMENT uri="chrM" title="chrM" length="16299" />
  <SEGMENT uri="chr1_random" title="chr1_random" length="1231697" />
  <SEGMENT uri="chr3_random" title="chr3_random" length="41899" />
  <SEGMENT uri="chr4_random" title="chr4_random" length="160594" />
  <SEGMENT uri="chr5_random" title="chr5_random" length="357350" />
  <SEGMENT uri="chr7_random" title="chr7_random" length="362490" />
  <SEGMENT uri="chr8_random" title="chr8_random" length="849593" />
  <SEGMENT uri="chr9_random" title="chr9_random" length="449403" />
  <SEGMENT uri="chr13_random" title="chr13_random" length="400311" />
  <SEGMENT uri="chr16_random" title="chr16_random" length="3994" />
  <SEGMENT uri="chr17_random" title="chr17_random" length="628739" />
  <SEGMENT uri="chrUn_random" title="chrUn_random" length="5900358" />
  <SEGMENT uri="chrX_random" title="chrX_random" length="1785075" />
  <SEGMENT uri="chrY_random" title="chrY_random" length="58682461" />
</SEGMENTS>

http://bioserver.iit.ieo.eu/genopub/genome/M_musculus_Jul_2007/segments

<?xml version="1.0" encoding="UTF-8"?>
<TYPES xmlns="http://biodas.org/documents/das2"
       xml:base="http://localhost:8080/genopub/genome/M_musculus_Jul_2007/">
  <TYPE uri="EML1/PU1_ChIP/Input" title="EML1/PU1_ChIP/Input">
    <FORMAT name="useq" />
    <PROP key="Normalization" value="N" />
    <PROP key="group" value="Alcalay" />
    <PROP key="group_contact" value="Myriam Alcalay" />
    <PROP key="group_email" value="myriam.alcalay@ifom-ieo-campus.it" />
    <PROP key="name" value="Input" />
    <PROP key="owner" value="Alcalay, Myriam" />
    <PROP key="owner_email" value="IEO" />
    <PROP key="owner_institute" value="myriam.alcalay@ifom-ieo-campus.it" />
    <PROP key="url" value="http://localhost:8080/genopub/genopub?idAnnotation=11" />
    <PROP key="visibility" value="Members" />
  </TYPE>
  <TYPE uri="EML1/PU1_ChIP/PU1_A3" title="EML1/PU1_ChIP/PU1_A3">
    <FORMAT name="useq" />
    <PROP key="Normalization" value="N" />
    <PROP key="group" value="Alcalay" />
    <PROP key="group_contact" value="Myriam Alcalay" />
    <PROP key="group_email" value="myriam.alcalay@ifom-ieo-campus.it" />
    <PROP key="name" value="PU1_A3" />
    <PROP key="owner" value="Alcalay, Myriam" />
    <PROP key="owner_email" value="IEO" />
    <PROP key="owner_institute" value="myriam.alcalay@ifom-ieo-campus.it" />
    <PROP key="url" value="http://localhost:8080/genopub/genopub?idAnnotation=7" />
    <PROP key="visibility" value="Members" />
  </TYPE>
</TYPES>

http://bioserver.iit.ieo.eu/genopub/genome/M_musculus_Jul_2007/types

http://localhost:8080/genopub/genome/M_musculus_Jul_2007/features?segment=http%3A%2F%2Flocalhost%3A8080%2Fgenopub%2Fgenome%2FM_musculus_Jul_2007%2Fchr1;overlaps=79374747%3A81152999;type=http%3A%2F%2Flocalhost%3A8080%2Fgenopub%2Fgenome%2FM_musculus_Jul_2007%2FEML1%2FPU1_ChIP%2FPU1_B2;format=useq

Returns a file in useq format, essentially a zip file, the preferred format in IGB
Contains an archiveReadMe.txt and one or more “slice files”
Observations can be textual or numerical

http://useq.sourceforge.net/useqArchiveFormat.html

Features

http://localhost:8080/genopub/genome/M_musculus_Jul_2007/features?segment=http://localhost:8080/genopub/genome/M_
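The features query shown on this slide can be assembled from its parts: the segment and type are full URIs that are percent‐encoded, the region is given as `start:end`, and the parameters are separated by semicolons. A small Python sketch that reproduces the example URL follows; the helper name `das2_features_url` and its parameter names are illustrative.

```python
from urllib.parse import quote


def das2_features_url(version_url, segment, start, end, annotation_type, fmt):
    """Build a DAS/2 features query in the style of the slide's example.

    version_url: URL of the genome version on the server.
    segment / annotation_type: names relative to version_url.
    """
    params = [
        ("segment", f"{version_url}/{segment}"),
        ("overlaps", f"{start}:{end}"),
        ("type", f"{version_url}/{annotation_type}"),
        ("format", fmt),
    ]
    # safe="" percent-encodes ':' and '/' inside the parameter values.
    query = ";".join(f"{key}={quote(value, safe='')}" for key, value in params)
    return f"{version_url}/features?{query}"


url = das2_features_url(
    "http://localhost:8080/genopub/genome/M_musculus_Jul_2007",
    "chr1",
    79374747,
    81152999,
    "EML1/PU1_ChIP/PU1_B2",
    "useq",
)
```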

A BED file (.bed) is a tab‐delimited text file that defines a feature track. File extension .bed is recommended. The BED file format is described on the UCSC Genome Bioinformatics web site: http://genome.ucsc.edu/FAQ/FAQformat. Tracks in the UCSC Genome Browser (http://genome.ucsc.edu/) can be downloaded to BED files and loaded into IGB/IGV.

Notes: Zero‐based index: Start and end positions are identified using a zero‐based index. The end position is excluded. For example, setting start‐end to 1‐2 describes exactly one base, the second base in the sequence (ACGT).

track name=pairedReads description="Clone Paired Reads"
Chr22  1000  5000  cloneA
Chr22  2000  6000  cloneB

Other important file formats: BED (textual)
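BED's zero‐based, end‐exclusive coordinates behave exactly like Python slices, which makes the convention easy to check. A minimal sketch, reading only the first four BED columns; the function names are illustrative.

```python
def parse_bed_line(line):
    """Parse chrom, start, end and (optional) name from one BED data line."""
    fields = line.split()
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name


def bed_slice(sequence, start, end):
    """Return the bases covered by a BED interval (0-based, end excluded)."""
    return sequence[start:end]


clone_a = parse_bed_line("Chr22\t1000\t5000\tcloneA")
clone_a_length = clone_a[2] - clone_a[1]   # feature length is simply end - start
second_base = bed_slice("ACGT", 1, 2)      # start-end 1-2: the second base only
```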

The bedGraph format is line‐oriented. bedGraph data are preceded by a track definition line, which adds a number of options for controlling the default display of this track. The track type is REQUIRED, and must be “bedGraph”.

bedGraph track data values can be integer or real, positive or negative values. Chromosome positions are specified as 0‐relative. The first chromosome position is 0. The last position in a chromosome of length N would be N ‐ 1. Only positions specified have data. Positions not specified do not have data and will not be graphed. All positions specified in the input data must be in numerical order. The bedGraph format has four columns of data:

track type=bedGraph name="BedGraph Format"
chr19 49302000 49302300 10
chr19 49302300 49302600 20
chr19 49302600 49302900 25

Intervals can be of any length and overlapping.

Other important file formats: bedGraph (numerical)
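A short Python sketch of reading the four bedGraph columns and looking up the value at a position, using the example data from this slide. The function names are illustrative, and the lookup is a plain linear scan rather than an indexed search.

```python
def parse_bedgraph(text):
    """Return (chrom, start, end, value) tuples; track/browser/comment lines are skipped."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()
        records.append((chrom, int(start), int(end), float(value)))
    return records


def value_at(records, chrom, pos):
    """Value covering a 0-based position, or None where no data is specified."""
    for rec_chrom, start, end, value in records:
        if rec_chrom == chrom and start <= pos < end:
            return value
    return None


EXAMPLE = """track type=bedGraph name="BedGraph Format"
chr19 49302000 49302300 10
chr19 49302300 49302600 20
chr19 49302600 49302900 25
"""
records = parse_bedgraph(EXAMPLE)
```

Because the end coordinate is excluded, position 49302300 belongs to the second interval, and position 49302900 has no data.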

http://genome.ucsc.edu/goldenPath/help/customTrack.html

The wiggle (WIG) format is for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. Wiggle data elements must be equally sized. If you need to display continuous data that is sparse or contains elements of varying size, use the bedGraph format instead. If you have a very large data set and you would like to keep it on your own server, you should use the bigWig data format. Chromosome positions are specified as 1‐relative.

variableStep is for data with irregular intervals between new data points and is the more commonly used wiggle format. It begins with a declaration line and is followed by two columns containing chromosome positions and data values:

variableStep chrom=chrN [span=windowSize]
StartA dataValueA
StartB dataValueB

For example:

variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5

is equivalent to:

variableStep chrom=chr2 span=5
300701 12.5

Both versions display a value of 12.5 at positions 300701‐300705 on chromosome 2.

Other important file formats: Wig (“wiggle”)
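The equivalence of the two variableStep forms can be checked by expanding both into per‐base values. A minimal Python sketch, handling only the `chrom` and `span` options; the function name is illustrative.

```python
def expand_variable_step(lines):
    """Expand variableStep wiggle lines into {(chrom, position): value}.

    Positions are 1-based, as in the wiggle format. Each data line covers
    `span` consecutive bases starting at its position (span defaults to 1).
    """
    values = {}
    chrom, span = None, 1
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "variableStep":
            options = dict(field.split("=", 1) for field in fields[1:])
            chrom = options["chrom"]
            span = int(options.get("span", 1))
        else:
            start, value = int(fields[0]), float(fields[1])
            for pos in range(start, start + span):
                values[(chrom, pos)] = value
    return values


explicit = expand_variable_step([
    "variableStep chrom=chr2",
    "300701 12.5",
    "300702 12.5",
    "300703 12.5",
    "300704 12.5",
    "300705 12.5",
])
with_span = expand_variable_step([
    "variableStep chrom=chr2 span=5",
    "300701 12.5",
])
```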

http://genome.ucsc.edu/goldenPath/help/bedgraph.html


http://genome.ucsc.edu/goldenPath/help/bigWig.html


fixedStep is for data with regular intervals between new data values and is the more compact wiggle format. It begins with a declaration line and is followed by a single column of data values.

The declaration line starts with the word fixedStep and includes specifications for chromosome, start coordinate, and step size. The span specification has the same meaning as in variableStep format. For example, this fixedStep specification:

fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33

displays the values 11, 22, and 33 as 5‐base regions on chromosome 3 at positions 400601‐400605, 400701‐400705, and 400801‐400805, respectively. Without span=5, the same values would be shown as single‐base regions at positions 400601, 400701, and 400801. Step and span are fixed for the entire data set.

Other important file formats: Wig (“wiggle”)
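How start, step and span interact in fixedStep can be made explicit by expanding the declaration into regions. A minimal Python sketch, with coordinates reported as 1‐based inclusive ranges; the function name is illustrative.

```python
def expand_fixed_step(lines):
    """Expand fixedStep wiggle lines into (chrom, first_base, last_base, value).

    Each value starts `step` bases after the previous one and covers
    `span` bases (span defaults to 1). Coordinates are 1-based, inclusive.
    """
    regions = []
    chrom = position = step = span = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "fixedStep":
            options = dict(field.split("=", 1) for field in fields[1:])
            chrom = options["chrom"]
            position = int(options["start"])
            step = int(options["step"])
            span = int(options.get("span", 1))
        else:
            regions.append((chrom, position, position + span - 1, float(fields[0])))
            position += step
    return regions


regions = expand_fixed_step([
    "fixedStep chrom=chr3 start=400601 step=100 span=5",
    "11",
    "22",
    "33",
])
```

With span=5 each value covers five bases; dropping the span option from the declaration line yields single‐base regions at the same start positions.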

http://www.biodas.org/wiki/Main_Page


http://genome.ucsc.edu/goldenPath/help/bigWig.html

Data transfer level
Required: random access. At the lowest layer, we take advantage of the byte‐range protocols of HTTP and HTTPS, and the protocols associated with resuming interrupted FTP transfers, to achieve random access to binary files over the web.

URL data cache layer
A cache layer on top of the data transfer layer. Data are fetched in blocks of 8 Kb, and each block is kept in a cache.

Indexing
Based on a single dimensional version of the R tree that is commonly used for indexing geographical data. The index size is typically less than 1% of the size of the data itself. Because the stored data are sorted by chromosome and start position, not every item in the file must be indexed; in fact by default only every 512th item is indexed.

Compression
Regions between indexed items (containing 512 items by default) are individually compressed (gzip).

BigWig and BigBed
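The cache layer can be pictured as a dictionary of fixed‐size blocks sitting on top of a function that fetches a byte range. The Python sketch below is illustrative only: `BlockCache` and `fetch_range` are hypothetical names, and the fetch function here reads from an in‐memory byte string where a real client would issue an HTTP byte‐range request.

```python
BLOCK_SIZE = 8 * 1024  # blocks of 8 Kb, as described on the slide


class BlockCache:
    """Serve random-access reads from cached fixed-size blocks."""

    def __init__(self, fetch_range):
        # fetch_range(start, end) must return the bytes in [start, end).
        self._fetch_range = fetch_range
        self._blocks = {}

    def read(self, offset, size):
        if size <= 0:
            return b""
        first = offset // BLOCK_SIZE
        last = (offset + size - 1) // BLOCK_SIZE
        chunks = []
        for block in range(first, last + 1):
            if block not in self._blocks:  # fetch each block at most once
                start = block * BLOCK_SIZE
                self._blocks[block] = self._fetch_range(start, start + BLOCK_SIZE)
            chunks.append(self._blocks[block])
        data = b"".join(chunks)
        begin = offset - first * BLOCK_SIZE
        return data[begin:begin + size]


# Stand-in for a remote file; every "remote" access is recorded in `calls`.
remote = bytes(range(256)) * 100
calls = []


def fake_fetch(start, end):
    calls.append((start, end))
    return remote[start:end]


cache = BlockCache(fake_fetch)
first_read = cache.read(8190, 4)    # spans blocks 0 and 1: two fetches
second_read = cache.read(100, 10)   # block 0 is already cached: no fetch
```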

Basic architecture
Flex
Apache Tomcat 6 / Glassfish
Object relational mapping via Hibernate
MySQL

DAS/2 server reference implementation: http://sourceforge.net/projects/genoviz

Database tables

User table

Annotation table

User role table

Message Digest 5 (MD5) hashing from the java.security package

Table views

Each file gets its own folder (automatically assigned folder names). File names, which may contain non‐supported characters, do not need to be stored in the DB.

Data storage directory

Visibility levels:

DAS2 administration user interface

If you want to access data with restricted visibility, you must be inserted in the user table and be part of a group that is headed by the owner of the data.

Users and groups setup

Every user, admin or non‐admin, can change their password, load new data, add data descriptions, and set visibility levels.

Non‐administrator users interface

IGB user identification

jdbcRealm, ldapRealm: both work

NetAffx and UCSC hg19 annotations

All these annotations are one click away from the user

Conclusions

DAS2 servers provide distributed genome annotations

Support fine grained security model

Perform parsing of data for custom genome views

List of Genome Browsers

Alamut
Annmap
Apollo Genome Annotation Curation Tool
Argo Genome Browser
Artemis Genome Browser
Avadis NGS
BugView
Celera Genome Browser
Dalliance Javascript‐based genome browser
DiProGB
DNAnexus Flash‐based interactive genome browser
Ensembl The Ensembl Genome Browser
Gaggle Genome Browser
GBrowse
Genome Wowser
The Genomic HyperBrowser
Integrative Genomics Viewer
Genostar GenoBrowser
Genoverse interactive genome browser
GenPlay
Golden Helix GenomeBrowse
Integrated Genome Browser
Integrated Microbial Genomes
JBrowse
MGV ‐ Microbial Genome Viewer
MochiView Genome Browser
NextBio Genome Browser
Pathway Tools Genome Browser
Savant Genome Browser
SEED viewer
UCSC Genome Bioinformatics Genome Browser
Viral Genome Organizer (VGO)
VISTA genome browser
WashU Genome Browser

Integrated Genome Browser: reference implementation of a DAS/2 client

IGB: Integrated Genome Browser (http://www.bioviz.org/igb/)

The Integrated Genome Browser (IGB, pronounced Ig‐Bee) is an interactive, zoomable, scrollable software program you can use to visualize and explore genome‐scale data sets, such as tiling array data, next‐generation sequencing results, genome annotations, microarray designs, and the sequence itself. IGB is implemented using the Java programming language and should run on any computer.

IGB is an open source, publicly‐funded project, but it did not start out that way. Initial development of the software was largely funded by Affymetrix, Inc., which donated the IGB software to the community in 2005. Since then, community developers have continued to contribute their time and efforts to improving the software. In 2008, funding from the National Science Foundation allowed us to speed up the pace of development.

IGB interacts with DAS (distributed annotation system servers)

The Distributed Annotation System (DAS, http://www.biodas.org/wiki/Main_Page) defines a communication protocol used to exchange annotations on genomic or protein sequences. It is motivated by the idea that such annotations should not be provided by single centralized databases, but should instead be spread over multiple sites.

DAS/2 was built to address the needs of distributing massive genomic data sets derived from high density microarray applications and Next (and Next Next) Generation Sequencing. Unlike DAS/1, DAS/2 does not require data exchange through text based XML but allows for data distribution using any text or binary format.

http://www.bioviz.org/igb/

http://www.biodas.org/wiki/Main_Page

Genometry model

Central concept: SeqSymmetry: breadth (SeqSpans) and depth (hierarchy, parents, children)

Hierarchical annotations

URL: http://www.bioviz.org/igb/download.shtml

How to launch IGB

Refseq and cytoband annotations automatically loaded from NetAffx DAS2

IGB after startup

Data access tab

Search tab

Selection info tab

Sliced view to interrogate alternative splice variants, ORF analysis.

Sliced view tab

Graph adjuster tab

External view tab

To load data: click the desired data set, choose region in view or whole chromosome, then click refresh data.

Data access tab

Load Affy probesets in View

NetAffx and UCSC mm8 annotations

NetAffx and UCSC mm9 annotations

ChIP‐chip, ChIP‐seq, Exon array, DNaseI, RNA‐seq, ChIP‐PET, RNA‐PET, Methyl‐seq, CAGE tags, ...
Km of data

new server

NetAffx and UCSC hg18 annotations

Data sources: Quickload, DAS, DAS2

Server registration (data source) tab

1. Single files

File type                              Extension
BAM                                    .bam
BED                                    .bed
Binary                                 .bps, .bgn, .brs, .bsnp, .brpt, .bnib, .bp1, .bp2, .ead, .useq
GFF                                    .gff, .gtf, .gff3
FASTA                                  .fa, .fasta, .fas
PSL                                    .psl, .psl3
DAS                                    .das, .dasxml, .das2xml
Graph                                  .gr, .bgr, .sgr, .bar, .chp, .wig
Scored Interval                        .sin, .egr, .egr.txt
Copy number                            .cnt
Copy number chp                        .cnchp, .lohchp
Genomic variation (Toronto DB)         .var
Region (genotype console segmenter)    SegmenterRptParser.CN_REGION_FILE_EXT, SegmenterRptParser.LOH_REGION_FILE_EXT
FishClones                             .fsh, FishClonesParser.FILE_EXT
Scored map                             .map

2. Quickload (local directory with auxiliary files)
Easy to set up but can load data only into entire genome.
(example http://www.bioviz.org/quickload/)

Four types of data sources (files)

3. DAS(1) (example UCSC; software http://code.google.com/p/mydas/)
Can load data into view of interest; the response is XML (problematic for large datasets).

4. DAS2 (example NetAffx; software http://genoviz.sourceforge.net/)
Unlike DAS1, DAS2 does not require data exchange through text based XML but allows for data distribution using any text or binary format. The two versions are not natively compatible.
Can load data into view of interest in a range of different formats.

Four types of data sources (servers)

Loading BAM files from http listing (no need to move them)

1. Graph based annotations (.gr, .bgr, .sgr, .bar, .chp, .wig, .sin, .egr, .egr.txt)
2. Text based annotations (e.g. .bed, .bam, .psl, .gff, .fasta files)

Permit different types of operations

Two basic types of annotation

Logical: intersect, union, A not B, B not A, Xor, Not

antisense transcription

all transcribed regions

Select tracks, right‐click to access context‐menu

Operations on text based annotations
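The logical operations can be pictured as set operations on intervals. The Python sketch below works on lists of half‐open (start, end) intervals from one chromosome and ignores strand; it is meant to illustrate the operations, not to describe how IGB implements them.

```python
def merge(intervals):
    """Sort intervals and fuse those that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union(a, b):
    return merge(a + b)


def intersect(a, b):
    a, b = merge(a), merge(b)
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def a_not_b(a, b):
    b = merge(b)
    result = []
    for start, end in merge(a):
        cursor = start
        for b_start, b_end in b:
            if b_end <= cursor or b_start >= end:
                continue  # no overlap with the remaining part of this interval
            if b_start > cursor:
                result.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < end:
            result.append((cursor, end))
    return result


def xor(a, b):
    return union(a_not_b(a, b), a_not_b(b, a))


track_a = [(100, 200), (300, 400)]
track_b = [(150, 350)]
```

For example, intersecting a track of forward‐strand transcripts with a track of reverse‐strand transcripts would highlight candidate regions of antisense transcription, and their union would give all transcribed regions.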

Scale: filter displayed values by value or by percentile
Height: adjust display height
Style: bar, line, dot, min/max/avg, heatmap, stairstep, color
Transform: log10, log2, loge, and inverses thereof
Join/split: display all graphs as one
Arithmetic (requires identical X‐values): sum, difference, product, division
Thresholding: transforms regions meeting a given criterion into text‐based annotation (can then be used in logical operations)

Operations on graph based annotations
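Thresholding turns a graph into intervals, which can then enter the logical operations. A minimal Python sketch, assuming a graph given as (position, value) points sorted by position and a simple "value >= threshold" criterion; IGB offers further criteria that are not modelled here.

```python
def threshold_to_intervals(points, threshold):
    """Convert graph points into half-open (start, end) intervals.

    points: list of (position, value), sorted by position.
    A run is a stretch of consecutive positions whose values are all
    >= threshold; each run becomes one interval.
    """
    intervals = []
    run_start = previous = None
    for position, value in points:
        passes = value >= threshold
        contiguous = previous is not None and position == previous + 1
        # Close the open run if this point fails or leaves a gap.
        if run_start is not None and (not passes or not contiguous):
            intervals.append((run_start, previous + 1))
            run_start = None
        if passes and run_start is None:
            run_start = position
        previous = position
    if run_start is not None:
        intervals.append((run_start, previous + 1))
    return intervals


graph = [(10, 1.0), (11, 5.0), (12, 6.0), (13, 2.0), (14, 7.0), (20, 8.0)]
peaks = threshold_to_intervals(graph, 5.0)
```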

Plugins

Based on Open Services Gateway initiative (OSGI)

Implement Activator interface

Needed to display plugin in tab

Access tracks from Genometry model

Can perform arbitrary manipulations on tracks

E‐μ myc mouse model, Amati/Faga

Example: myc bound and differentially regulated gene

External view

Molecular Interaction plugin

Molecular Interaction plugin: visualize molecular interactions

Molecular Interaction plugin: visualize interactions with small molecules (drugs)

Our plugin repository (by Arnaud Ceol): http://cru.genomics.iit.it/igb/plugins‐test/

Highly interactive

Excellent logarithmic zooming around hairline

Integrated with UCSC/campus browser

Can do logical/arithmetic operations on annotations

Can create custom annotations on the fly

Can incorporate distributed annotations

Easily customizable display options

Open‐source: new features can be added according to our needs

IGB summary

Acknowledgements

Arnaud Ceol, Luca Zammataro, Jole Costanza, Anna Biressi, Sofia Cappellari, Bruno Amati

Francesco Venco, Yuriy Vaskin, Marco Masseroli, Stefano Ceri et al.

Pier Giuseppe Pelicci, Domenico Triarico, Roberta Carbone, Annalisa Ariesi, Daniela Rossi