Nature Methods GeneProf: analysis of high-throughput sequencing experiments Florian Halbritter, Harsh J Vaidya & Simon R Tomlinson

Supplementary Figure 1 Screenshots from the GeneProf web interface

Supplementary Figure 2 Further screenshots from the GeneProf web interface

Supplementary Figure 3 GeneProf system architecture

Supplementary Figure 4 Average quality score per sequencing cycle and read library before quality control

Supplementary Figure 5 Nucleotide distribution along sequencing cycles

Supplementary Figure 6 Read alignment ambiguity

Supplementary Figure 7 GeneProf analysis workflow

Supplementary Figure 8 Alignment coverage plots

Supplementary Figure 9 Visualization of the Spearman correlation matrix of binary, gene-wise DNA-protein binding patterns

Supplementary Note Comparison of Assorted Software Tools for HTS Data Analysis

Supplementary Discussion GeneProf Software Design and Information about GeneProf Usage

Supplementary Methods Data Analysis Examples

Supplementary Data Experiment PDF Report

Nature Methods: doi:10.1038/nmeth.1809

Supplementary Figure 1 | Screenshots from the GeneProf web interface

Supplementary Figure 1 Screenshots from the GeneProf web interface. (a) The homepage provides quick access to all parts of the system and summarizes popular tasks. (b) Much of the public data stored in GeneProf's databases is included in gene-centric summary pages. (c) An extensive online and offline help system details all components of the system and gives many examples.


Supplementary Figure 2 | Further screenshots from the GeneProf web interface

Supplementary Figure 2 Further screenshots from the GeneProf web interface. (a) The graphical workflow designer is a powerful tool for composing complex analysis pipelines intuitively, using drag & drop and a range of versatile components. (b) Output datasets can be assessed in detail: tabular results can be browsed, filtered and sorted to quickly find the information of interest. (c) GeneProf can create many customizable, publication-quality plots for any dataset in the system. Supported plot types include heatmaps, scatter plots, Venn diagrams, pie charts and more. The plots shown here are all examples from the manual.


Supplementary Figure 3 | GeneProf system architecture

Supplementary Figure 3 GeneProf system architecture. GeneProf is split into three major components (Supplementary Discussion). A web server manages all client-side interactions, provides interface components and acts as the primary access point for job management. A combination of a relational database and a file server stores all experimental and internal data in a space- and time-efficient manner. Lastly, a flexible network of compute nodes (“job agencies” and “workers”) deals with computationally demanding tasks.


Supplementary Figure 4 | Average quality score per sequencing cycle and read library before quality control

Supplementary Figure 4 Average quality score per sequencing cycle and read library before quality control. Quality scores usually range from 0 (bad) to 40 (excellent). The quality of base-calls drops along the length of the reads (which is normal for HTS libraries), but remains of an overall good quality (Supplementary Methods). Note that both members of a mate-pair sequence have been concatenated to create this plot. The start of the second mate-pair is in the center of this plot, where a striking change is evident.
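The per-cycle averages plotted in this figure can be illustrated with a minimal sketch. It is not GeneProf's implementation: the function name, the toy quality strings and the Phred+33 encoding are assumptions made for illustration only. As in the figure, the two mates of each pair are concatenated before averaging.

```python
from collections import defaultdict


def mean_quality_per_cycle(quality_strings, offset=33):
    """Return the mean Phred score at each sequencing cycle (0-based position).

    Assumes Phred+offset ASCII encoding (offset=33 is an assumption here).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for qual in quality_strings:
        for cycle, char in enumerate(qual):
            totals[cycle] += ord(char) - offset
            counts[cycle] += 1
    return [totals[c] / counts[c] for c in sorted(totals)]


# Toy quality strings for two mate pairs (hypothetical data).
mate1 = ["IIIH", "IIHG"]
mate2 = ["IHGF", "HHGE"]

# Concatenate mate 1 and mate 2 of each pair, as done for the figure.
concatenated = [a + b for a, b in zip(mate1, mate2)]
profile = mean_quality_per_cycle(concatenated)
```

In the resulting profile, the first half of the positions belongs to the first mate and the second half to the second mate, which is why a jump can appear in the center of such a plot.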


Supplementary Figure 5 | Nucleotide distribution along sequencing cycles

Supplementary Figure 5 Nucleotide distribution along sequencing cycles. As in Supplementary Figure 4, mate-pair sequences have been concatenated (Supplementary Methods). Green = Adenine, Red = Thymine, Blue = Cytosine, Yellow = Guanine, Light Grey = N (not known / uncertain).
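A nucleotide distribution of this kind can be computed along the following lines. This is a hedged sketch with hypothetical names and toy reads, not the code used by GeneProf; any character outside A, T, C and G is counted as N.

```python
from collections import Counter


def nucleotide_distribution(reads, alphabet="ATCGN"):
    """Return, for each cycle, the fraction of reads showing each nucleotide."""
    length = max(len(read) for read in reads)
    per_cycle = [Counter() for _ in range(length)]
    for read in reads:
        for cycle, base in enumerate(read.upper()):
            # Anything that is not a known base is treated as uncertain (N).
            per_cycle[cycle][base if base in alphabet else "N"] += 1
    return [
        {base: counts[base] / sum(counts.values()) for base in alphabet}
        for counts in per_cycle
    ]


# Toy reads (hypothetical data).
reads = ["ACGT", "ACGN", "TCGA", "ACCA"]
distribution = nucleotide_distribution(reads)
```

Each element of `distribution` corresponds to one sequencing cycle and its values sum to 1, so the elements can be drawn directly as stacked bars or as one line per nucleotide.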


Supplementary Figure 6 | Read alignment ambiguity

Supplementary Figure 6 Read alignment ambiguity. Approximately 60.41-69.81% of all reads in example 1 (Supplementary Methods) could be aligned uniquely to one position in the mouse reference genome (NCBIM37). Only 9.91-11.55% could not be aligned at all or mapped to highly repetitive regions of the genome.


Supplementary Figure 7 | GeneProf analysis workflow

Supplementary Figure 7 GeneProf analysis workflow. GeneProf's analysis wizards create a complete data processing pipeline within a few steps, relieving users of the demanding task of setting up these procedures themselves. At the same time, the methodology remains fully transparent and flexible, so that users can assess and modify the workflows in detail if required. (a) A screenshot from GeneProf's dynamic workflow designer component showing the complete workflow used in example 1 (Supplementary Methods). The workflow can be easily modified using drag & drop and a range of versatile components (workflow "modules"). Valid input and output connections are color-coded, e.g. nucleotide sequence data is green, genomic data is red. (b) For summary purposes, a simplified version of the workflow, collapsing multiple processes of the same type into one, may be used.


Supplementary Figure 8 | Alignment coverage plots

Supplementary Figure 8 Alignment coverage plots. Genomic loci surrounding the genes Pou5f1 (a), Nanog (b) and Sox2 (c) with coverage plots for aligned RNA-seq reads in embryonic stem cells, lung fibroblasts and neural progenitors and transcription factor binding profiles for the same genes (Nanog, Sox2, Pou5f1) in two independent studies (Supplementary Methods). The expression patterns, in general, follow the exon structure of the genes. All genes at hand are highly expressed in embryonic stem cells, but not in the other cell types, with the exception of Sox2, which is known to be active in neural cells. Peaks in the binding profiles indicate putative binding sites and it is evident that many of these sites appear to be shared between different factors.


Supplementary Figure 9 | Visualization of the Spearman correlation matrix of binary, gene-wise DNA-protein binding patterns

Supplementary Figure 9 Visualization of the Spearman correlation matrix of binary, gene-wise DNA-protein binding patterns. We calculated the Spearman correlation matrix for a 2-dimensional matrix containing a value of 1 or 0 for each gene-binding factor combination (Supplementary Methods). A value of 1 means that the gene has a (putative) binding site for this factor in proximity of its transcription start site (TSS; here, 20 kb upstream and 1 kb downstream of the TSS); a value of 0 means that no such binding site exists. The numbers in the matrix are the correlation coefficients calculated for each pairwise comparison and range from 1 (perfectly correlated) through 0 (no correlation) to -1 (perfectly anti-correlated). Proteins that cluster together in this matrix often bind near the same genes.
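The calculation described in this caption can be sketched as follows. The factor names and binding flags are hypothetical, and the code is an illustration rather than the procedure used to produce the figure. Spearman correlation is computed here as the Pearson correlation of average ranks, which handles the many ties that binary data contain. The sketch assumes every factor binds some genes but not all; a constant column would lead to a division by zero.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_matrix(binding):
    """binding maps a factor name to a list of 0/1 flags, one per gene."""
    names = sorted(binding)
    ranked = {name: average_ranks(binding[name]) for name in names}
    return {a: {b: pearson(ranked[a], ranked[b]) for b in names} for a in names}


# Hypothetical binding flags for six genes (1 = putative site near the TSS).
binding = {
    "FactorA": [1, 1, 0, 0, 1, 0],
    "FactorB": [1, 1, 0, 0, 1, 0],
    "FactorC": [0, 0, 1, 1, 0, 1],
    "FactorD": [1, 0, 1, 0, 1, 0],
}
matrix = spearman_matrix(binding)
```

In this toy example, FactorA and FactorB bind near exactly the same genes, FactorC binds near the complementary set, and FactorD overlaps only partially with FactorA.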


Sheet1 - Software Comparison Table

Supplementary Note | Comparison of Assorted Software Tools for HTS Data Analysis

Contents:
Sheet 1.a/b Overview Comparison Table / Explanation of Criteria
Sheet 2. Additional Comparison: GeneProf – Galaxy
Sheet 3. Supplementary References

1.a Overview Comparison Table

(a spreadsheet version of this document is available online at: http://www.geneprof.org/_pub/GeneProf-SoftwareComparisonTable.xls.zip)

Color Code: Good / advanced; Incomplete / insufficient; Missing / unsatisfactory

Software compared: DSAP, miRNAkey, Myrna, W-ChIPeaks, CisGenome, SeqBuster, GATK, QIIME, HTSeq, R/Bioconductor, RSEQtools, TAVERNA, Galaxy, GeneProf

General Information

DSAP: Fixed analysis pipeline for small RNA deep-sequencing data; interface type: Web; user-side operating system: any; reference: [1]; version: 10/03/20; software installation: None; URL: http://dsap.cgu.edu.tw
miRNAkey: Software pipeline for the analysis of miRNA sequencing data; interface type: GUI; user-side operating system: Unix, Mac; reference: [2]; version: v1.2; software installation: Required; URL: http://ibis.tau.ac.il/miRNAkey
Myrna: Analysis of RNA-seq data; interface type: Web; user-side operating system: any; reference: [3]; version: v1.1.2; software installation: None; URL: http://bowtie-bio.sourceforge.net/myrna
W-ChIPeaks: ChIP-seq binding peak detection; interface type: Web; user-side operating system: any; reference: [4]; version: 1.0; software installation: None; URL: http://motif.bmi.ohio-state.edu/W-ChIPeaks
CisGenome: Software for ChIP-seq and ChIP-on-chip peak detection and analysis; interface type: GUI or scripts; user-side operating system: Windows (GUI), Linux, Mac (scripts); reference: [5]; version: v2.0; software installation: Required; URL: http://www.biostat.jhsph.edu/~hji/cisgenome
SeqBuster: Toolkit for the analysis and visualization of small RNA sequencing data; interface type: GUI; user-side operating system: Unix, Mac; reference: [6]; version: 2.5; software installation: Required; URL: http://estivill_lab.crg.es/seqbuster/index2.html
GATK: Structured tools for HTS analysis, focus on variants / SNP; interface type: Scripts; user-side operating system: Unix; reference: [7]; version: 1.0.5777; software installation: Required; URL: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
QIIME: Pipeline for microbial community analysis; interface type: Scripts; user-side operating system: Unix, Mac **; reference: [8]; version: 1.2.1; software installation: Required; URL: http://qiime.sourceforge.net
HTSeq: Programming framework for deep sequencing data analysis; interface type: Scripts; user-side operating system: any; reference: none listed; version: 0.5.1p2; software installation: Required; URL: http://www-huber.embl.de/users/anders/HTSeq
R/Bioconductor: Generic data analysis framework with many packages, e.g. [10]; interface type: Scripts; user-side operating system: any; reference: [9]; version: various; software installation: Required; URL: http://www.bioconductor.org/help/bioc-views/release/bioc/html/ShortRead.html
RSEQtools: Analysis of RNA-seq data; interface type: Scripts; user-side operating system: Unix, Mac; reference: [11]; version: 0.6; software installation: Required; URL: http://archive.gersteinlab.org/proj/rnaseq/rseqtools
TAVERNA: Customizable general-purpose workflow engine; interface type: GUI; user-side operating system: any; reference: [12]; version: 2.2.0; software installation: Required; URL: http://www.taverna.org.uk
Galaxy: Customizable workflow-system for genomic data; interface type: Web; user-side operating system: any; reference: [13,14]; version: public server (09-June-2011); software installation: None; URL: http://galaxy.psu.edu
GeneProf: Customizable workflow-system and resource for gene expression and regulation; interface type: Web; user-side operating system: any; reference: this paper; version: 1.1106089; software installation: None; URL: http://www.geneprof.org

Core Functionality (TAVERNA: see note ***)

Quality control
Yes: DSAP, miRNAkey, SeqBuster, GATK, QIIME, HTSeq, R/Bioconductor, Galaxy, GeneProf
No: Myrna, W-ChIPeaks, CisGenome, RSEQtools, TAVERNA

Alignment
Yes: miRNAkey, Myrna, QIIME, Galaxy, GeneProf
Limited: GATK, R/Bioconductor, RSEQtools
No: DSAP, W-ChIPeaks, CisGenome, SeqBuster, HTSeq, TAVERNA

Expression & statistics
Yes: DSAP, miRNAkey, Myrna, SeqBuster, HTSeq, R/Bioconductor, RSEQtools, GeneProf
Limited: GATK
No: W-ChIPeaks, CisGenome, QIIME, TAVERNA
No differential expression: Galaxy

Peak detection & related
Yes: W-ChIPeaks, CisGenome, R/Bioconductor, Galaxy, GeneProf
No: DSAP, miRNAkey, Myrna, SeqBuster, GATK, QIIME, HTSeq, RSEQtools, TAVERNA

Further downstream processing
Many: GATK, QIIME, HTSeq, R/Bioconductor, TAVERNA, Galaxy, GeneProf
Some: DSAP, Myrna, CisGenome, SeqBuster, RSEQtools
None: miRNAkey, W-ChIPeaks

Organism support
Many; Unlimited (9 tools); Various (incl. human, mouse) (2 tools); Many (unlimited with local install); Human and mouse only

Workflow Design

Workflow Type
Fixed: DSAP, miRNAkey, Myrna, W-ChIPeaks
Semi-flexible: CisGenome, SeqBuster
Flexible: GATK, QIIME, HTSeq, R/Bioconductor, RSEQtools, TAVERNA, Galaxy, GeneProf

Design Methodology
N/A (4 tools); One-step processes (2 tools); Scripted pipelines (5 tools); Graphical, module-based workflow design

Assisted Workflow Creation *
N/A: DSAP, miRNAkey, Myrna, W-ChIPeaks, CisGenome, SeqBuster, GATK, QIIME, HTSeq, R/Bioconductor, RSEQtools
None for HTS: TAVERNA
None: Galaxy
GeneProf: Analysis wizards assist users in quickly setting up workflows

Exploratory Data Analysis *
N/A (6 tools); Components adaptable, dependent processes can re-run; Changes to parameters require the entire analysis to be re-run; No way to adapt parameters of processes without repeating whole analyses; Components / parameters can be easily changed and (only) dependent processes will be re-run automatically

Presentation of Results

Full output data available interactively? *
Yes; Static files (10 tools); Dynamic tables (3 tools)

Graphs & plots *
Few: DSAP, miRNAkey, Myrna, W-ChIPeaks, CisGenome, SeqBuster, GATK, QIIME, RSEQtools, Galaxy
Yes: HTSeq, R/Bioconductor
No: TAVERNA
GeneProf: Many, highly customizable, dynamic plots

Genome-browser integrated / linked
No: DSAP, miRNAkey, Myrna, W-ChIPeaks, SeqBuster, GATK, QIIME, HTSeq, TAVERNA
Yes: CisGenome
Possible: R/Bioconductor
Track export: RSEQtools
Galaxy: Limited, no easy way to create coverage plots (wiggle plots)
GeneProf: Yes, genomic data can be directly visualized without further ado


Data Providence / Integration

Integration of public data *
No: DSAP, miRNAkey, Myrna, W-ChIPeaks, CisGenome, SeqBuster, GATK, QIIME, HTSeq, RSEQtools, TAVERNA
Possible: R/Bioconductor
Galaxy: Limited availability (mostly demo data, fewer than 10 published studies)
GeneProf: Yes, considerable collection (more than 60 re-analysed published studies)

Meta-analysis support *
Yes: GeneProf
Limited: CisGenome, GATK, QIIME, HTSeq, R/Bioconductor, Galaxy
No: DSAP, miRNAkey, Myrna, W-ChIPeaks, SeqBuster, RSEQtools, TAVERNA

Workflows & results linkable
Yes: TAVERNA, Galaxy, GeneProf
No: DSAP, miRNAkey, Myrna, W-ChIPeaks, CisGenome, SeqBuster, GATK, QIIME, HTSeq, R/Bioconductor, RSEQtools

Transparent and reproducible analysis integrated with results *
No: DSAP, miRNAkey, Myrna, W-ChIPeaks, CisGenome, SeqBuster, GATK, QIIME, HTSeq, R/Bioconductor, RSEQtools
TAVERNA: Workflows can be saved reproducibly, but they are not integrated with the experimental data used
Galaxy: Histories integrate data with analysis, logical flow can be difficult to recapitulate in large histories
GeneProf: Full workflow integrated with data, history of individual process outputs, intermediate data available

Gene-centric summaries *
No: DSAP, miRNAkey, Myrna, W-ChIPeaks, CisGenome, SeqBuster, GATK, QIIME, HTSeq, R/Bioconductor, RSEQtools, TAVERNA, Galaxy
GeneProf: Yes, large amounts of HTS data summarized on a per-gene basis

Secure Data Sharing
Yes: Galaxy, GeneProf
No: DSAP, miRNAkey, Myrna, W-ChIPeaks, CisGenome, SeqBuster, GATK, QIIME, HTSeq, R/Bioconductor, RSEQtools, TAVERNA

* We believe that GeneProf presents a considerable and valuable advance in this category.

** QIIME provides a virtual box installation package, which makes it simple to set up the system on any operating system.

*** Taverna does not (yet) include components for HTS data analysis by default, but offers facilities to include web services available from arbitrary providers, so theoretically, if a provider for suitable services was to be found, Taverna could indeed support these processes. According to announcements made by the Taverna team, a next-generation sequencing module is under development.

1.b Explanation of Criteria

N.B. Software has been compared in terms of their analysis capabilities for transcriptomic and regulatory next-generation sequencing data and their general usability. We have only included software that is free for academic use and that we thought was addressing these issues. Some software might have additional features, which have not been considered for the sake of this comparison. We have made every effort to be objective, but unfortunately comparisons of this type are inherently biased and we acknowledge that this table might be subject to differences in opinion. Some software is constantly being updated and extended, so the list of supported features might have changed since we composed this comparison (09 June 2011).

General Information

Interface Type

We categorize software into command-line based tools (scripts), graphical stand-alone software (GUI) and web-based applications (Web).

Operating System(s) – User-Side

Command-line tools and stand-alone programs might be dependent on the operating system used and if such a dependency is likely, we mark it here. Where parts of the software are running on a remote computer (“server”), only operating system requirements on the user-side (“client”) are considered.

Software Installation

Installing and setting up software can be difficult and time-consuming. “None” if the software does not require any installation or “required” if the software and / or other components that the software depends on need to be installed before use.

Core Functionality

Quality control

Does the software support quality control of raw short read sequence data in any way (such as removal of sequence artefacts and erroneous reads)? An assessment of the quality of the data at hand is essential to any sensible data analysis. “Yes” or “no” only. We do not distinguish different methods for quality control here.

Alignment

Does the software support short read sequence alignment? Aligning short reads to a genome or transcriptome is a crucial step for virtually any sort of HTS data analysis. “Yes” if an alignment tool is directly integrated in the system, “limited” if an alignment tool has to be installed / integrated separately or “no” if alignment cannot be performed from within the software.

Expression & statistics

Does the software support any sort of gene expression and / or differential expression analysis? “Limited” means some sort of analysis is supported, but only for a very constrained type of application or that expression analysis could be integrated using other external software.

Peak detection & related

Does the software support binding peak detection for ChIP-seq data and / or further downstream analysis of these peaks (e.g. association of peaks with genes, motif discovery, etc.)? “Yes” / “no” / “limited” as before.

Further downstream processing

Does the software offer any other features for further analysis of the results of next-gen sequencing data (e.g. comparison of results from different experiments, applications other than gene expression and DNA-binding, etc.)? “None”, “some” or “many”, where the distinctions between “some” and “many” are, admittedly, rather arbitrary.

Organism support

Can the software only be used with data from certain organisms? We believe that at least human and the major model organisms such as mouse, rat, yeast, etc. should be supported.

Workflow Design

Workflow Type

The type of analyses supported by the system: “Fixed” means the system runs a pre-defined series of tasks with some degree of parameterization, “semi-flexible” means some steps might be exchangeable and “flexible” means the entire analysis workflow can be defined in every detail from a range of provided components.

Design Methodology

This category is only applicable to systems that use “flexible” workflows. Workflow design can be either graphical (the preferred way, by our definition) or based on concatenating command-line scripts in a programmatic fashion (less accessible to most users).

Assisted Workflow Creation

Even if workflows can be designed graphically, the process can be very daunting and time-consuming for non-expert users. It is therefore desirable to have a means of assisting users in the creation of common processing pipelines.

Exploratory Data Analysis

Often, it is not possible beforehand to know exactly which programs (and parameters) will be best suited for a particular dataset at hand. It is therefore beneficial to have an easy means to adapt certain steps of the analysis without losing track of what one has done and (ideally) without having to run all (time-consuming) processes again.

Presentation of Results

Full output data available interactively?

In which way can users access the results of analyses? A convenient way to access, browse and visualize analysis results is an important component of an analysis software. “Static” files refers to data that is fixed at the point the analysis finishes as opposed to dynamic data, that can be further filtered, sorted, etc. In this context, we consider dynamic results superior to static ones.

Graphs & plots

Does the software automatically create plots that facilitate the understanding of the outputs? Numeric results and large tables, especially when dealing with genome-scale data, can be overwhelming and difficult to grasp. Sensible plots can greatly help to interpret experimental measurements. Since no software could ever predict every type of plot suitable for the visualization of highly specialized data, it would also be beneficial if domain experts have a way to create additional, customizable plots dynamically from output data.

Genome-browser integrated / linked

Genome browsers are popular tools for looking at genomic data (such as coverage profiles created from aligned short read data). Creating the data necessary for existing software can be difficult. Ideally, the analysis software should be able to output tracks compatible with popular genome browsers or even to visualize genomic data directly within the software itself.

Data Providence and Integration

Integration of public data

Is there an easy way to integrate published data analyzed using the same software? Many researchers want to verify published data and / or look for additional information not discussed in publications. “Yes”, “No” or “Yes, limited”, if data can be integrated, but only a limited range is available.

Meta-analysis support

Can data from several experiments be compared easily? For example, can output datasets of two separate studies be juxtaposed in a straightforward manner? Is it possible to visualize genomic data from different studies together in one place? “Limited” if such comparisons are possible, but require in-depth understanding of the format of the structure of the datasets and / or multiple components are required for one such comparison.

Workflows & results linkable

Can data analysis results and methodologies / workflows be linked to in reports, publications, etc.? Making analysis results and procedures readily available to others makes research more transparent, reproducible and reusable and has been reported to increase the impact of publications.

Transparent and reproducible analysis integrated with results

A fully transparent and understandable workflow is important, but in order to completely reproduce the data analysis presented, for example, in a paper, it is not sufficient to store workflows alone, but it is essential to also connect these with the input data used. The importance of this becomes evident, when data from public resources is used, which frequently changes or becomes unavailable.

Gene-centric summaries

Having a convenient way to access the masses of data that might be stored in a system makes it possible for a wider range of researchers to utilize a tool – even if they do not have any own HTS data to analyze – thus drastically improving the usefulness of a tool.

Secure Data Sharing

In cutting-edge science, often researchers from different parts of the world work together on a joint investigation. It would be highly beneficial to have a convenient way of sharing data and analysis results “as they happen” (so before publication) securely and quickly. A data analysis tool that supports such functionality will be considered superior in this category for the sake of our comparison.
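The Exploratory Data Analysis criterion rewards systems in which changing one step re-runs only the processes that depend on it. A minimal sketch of that idea is given below. The process names and the dictionary layout are hypothetical and are not taken from GeneProf; the sketch only shows how the set of stale processes can be found by walking a workflow's dependency graph.

```python
from collections import deque


def processes_to_rerun(dependencies, changed):
    """Return the changed process plus everything downstream of it.

    dependencies maps each process to the list of processes it reads from.
    """
    # Invert the mapping: for each process, which processes consume its output?
    downstream = {}
    for process, inputs in dependencies.items():
        for parent in inputs:
            downstream.setdefault(parent, []).append(process)

    # Breadth-first walk from the changed process through its consumers.
    stale, queue = {changed}, deque([changed])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in stale:
                stale.add(child)
                queue.append(child)
    return stale


# Hypothetical RNA-seq workflow: each process lists the processes it depends on.
workflow = {
    "import_reads": [],
    "quality_filter": ["import_reads"],
    "quality_plots": ["import_reads"],
    "align": ["quality_filter"],
    "count_expression": ["align"],
    "coverage_tracks": ["align"],
}
rerun = processes_to_rerun(workflow, "align")
```

Changing an alignment parameter in this toy workflow marks only the alignment and its two consumers as stale; the import, filtering and quality plots keep their existing outputs.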


Sheet2 - Comparison GeneProf - Galaxy

2. Additional Comparison: GeneProf – Galaxy

Galaxy [13,14] is a well-established, versatile analysis toolkit. Initially designed primarily for the analysis of genomic data and sequences, the framework has more recently been extended with more HTS-specific functionality. Despite its popularity and broad user base, only comparatively few publications have appeared that exploit Galaxy to its full potential, harnessing the software to complete an entire analysis from start to finish. From personal communication with data analysis experts and experimental biologists, and from our own experience, we have learned that this might be partly due to the difficulty of setting up workflows consisting of many fine-grained components and to the fact that it remains difficult to keep track of large collections of datasets and where they fit into the current stage of the processing.

The ease of use of different software applications is not easily compared in a table such as the one presented above; therefore, we have looked at how easily some analyses can be set up with both systems. For this purpose, we attempted to repeat an analysis comparable to the one presented in our paper (gpXP_000168) in Galaxy and took note of which steps were necessary at every stage.

Steps involved

1. Set up a new analysis environment
GeneProf:
- create a new GeneProf experiment
- enter information about name, description, citations, external accession numbers, ..
Galaxy:
- create a new history
- rename history

2. Obtain raw input data
GeneProf:
- open SRA importer tool (accession number transferred automatically from experiment) and click "import"
(mouse gene reference data provided by GeneProf)
Galaxy:
- find file download locations in external European Nucleotide Archive (FASTQ format, Galaxy does not currently support the NCBI SRA's own file format)
- enter FTP download locations one by one into the "Upload File" tool
- obtain gene reference data (known genes) using "UCSC main table browser" tool: find the right group and table, select GTF format and "send output to Galaxy"

3. Annotate input data
GeneProf:
- open sample annotation tool
- augment sample annotation (some is imported from the SRA; steps described in methods section)
Galaxy:
- look up which dataset (SRRxxx) corresponds to which cell type / replicate in the ENA or SRA annotations
- rename each imported dataset individually to make it possible to track which data we are dealing with

4. Define analysis process
GeneProf:
- open RNA-seq wizard
- check sample grouping and adjust 4 parameter settings (as described in methods section)
Galaxy:
- define a new workflow to draw quality score and nucleotide distribution plots, filter reads, align reads to the genome by connecting analysis components and defining parameters, used components: "Input Data", "FASTQ Summary Statistics", "Compute Quality Statistics", "Draw Nucleotide Distribution Chart", "Draw Quality Score Plot", "Filter FASTQ", "Tophat"
- use this workflow (8 times) on the individual input datasets
- define a second workflow to merge replicates and calculate expression counts and run it on the output of the previous processes, used components: "Input Dataset", "Merge BAM", "Cufflinks"
- change type of cufflinks outputs to CSV, join datasets (2x "Join two datasets"), cut out expression count columns ("Cut columns")
- plot histograms of expression counts (3x "Histogram")
- calculate fold changes (3x "Compute")

5. Prepare genome browser tracks
GeneProf:
- open built-in genome browser
- select data tracks
- (optional) choose display parameters (dataset labels, colours, scaling, ...)
Galaxy:
- create new visualization in Trackster
- add datasets from history (wait for indexing..)
- optional: change track name, order, ..
- change location (look up coordinates somewhere else)
- no "coverage"-plots supported, so we use track type "squish" (suggested by default)

6. Prepare analysis for inclusion in publication
GeneProf:
- export a PDF summary report
(all data, outputs and analysis are automatically summarized in the experiment main page)
Galaxy:
- create a new Galaxy page
- organize conceptual workflow into steps and describe them textually
- locate noteworthy datasets in history and link them in the Galaxy page


Summary:

Complexity of Analysis Workflow
GeneProf: Number of components in workflow: 24; number of parameters changed manually: 4
Galaxy: Number of analysis components used: 65; number of parameters changed manually: 28

Main Outputs:
The analyses outlined above are as similar as achievable, but differ in the amount and detail of the “main” outputs produced:

Overview Statistics & Plots:
- Read quality, nucleotide distribution before and after quality control: GeneProf ✓, Galaxy ✓
- Additional information about read number, lengths, quality scores: GeneProf ✓, Galaxy X
- Success rate and ambiguity of alignment: GeneProf ✓, Galaxy X
- Additional information about genomic distribution of alignments, lengths, etc.: GeneProf ✓, Galaxy X
- Distribution of gene expression counts in libraries: GeneProf ✓, Galaxy ✓
- Heatmap, correlation & PCA: GeneProf ✓, Galaxy X
- Summary of most expressed genes and feature types: GeneProf ✓, Galaxy X

Genome Browser Tracks:
- Track of aligned reads and reference genes: GeneProf ✓, Galaxy ✓
- Customized genomic coverage plots: GeneProf ✓, Galaxy X

Gene Expression Data:
- Integrated list of gene expression counts in all libraries: GeneProf ✓, Galaxy ✓
- Supplementary annotation available (gene names, descriptions, …): GeneProf ✓, Galaxy X
- Fold changes between cell types: GeneProf ✓, Galaxy ✓
- Differential expression statistics: GeneProf ✓, Galaxy X
- Filtered lists of differentially expressed genes: GeneProf ✓, Galaxy X
- Output dataset can be explored dynamically (filters, sorting, etc.): GeneProf ✓, Galaxy X

The Galaxy project created can be accessed at: http://main.g2.bx.psu.edu/u/galaxy_guttman_reanalysis/p/reanalysis-of-guttman-et-al-data

The GeneProf experiment is available at:

http://www.geneprof.org/show?id=gpXP_000168


References

3. Supplementary References

[1] Huang PJ et al. Nucleic Acids Res. 38 (web server issue), W385-W391 (2010). PMID: 20478825

[2] Ronen R et al. Bioinformatics 26 (20), 2615-2616 (2010). PMID: 20801911

[3] Langmead B, Hansen KD, Leek JT. Genome Biol. 11 (8), R83 (2010). PMID: 20701754

[4] Lan X, Bonneville R, Apostolos J, Wu W, Jin VX. Bioinformatics (2010). PMID: 21138948

[5] Ji H et al. Nat. Biotechnol. 26 (11), 1293-1300 (2008). PMID: 18978777

[6] Pantano L, Estivill X, Martí E. Nucleic Acids Res. 38 (5), e35 (2009). PMID: 20008100

[7] McKenna A et al. Genome Res. 20 (9), 1297-1298 (2010). PMID: 20644199

[8] Caporaso JG et al. Nat. Methods 7 (5), 335-336 (2010). PMID: 20383131

[9] Gentleman RC et al. Genome Biol. 5 (10), R80 (2004). PMID: 15461798

[10] Morgan M et al. Bioinformatics 25 (19), 2607-2608 (2009). PMID: 19654119

[11] Habegger L et al. Bioinformatics 27 (2), 281-283 (2010). PMID: 21134889

[12] Hull D et al. Nucleic Acids Res. 34 (web server issue), W729-W732 (2006). PMID: 16845108

[13] Goecks J, Nekrutenko A, Taylor J. Genome Biol. 11 (8), R86 (2010). PMID: 21278189

[14] Blankenberg D et al. Curr. Protoc. Mol. Biol. Chapter 19, Unit 19.10.1-21 (2010). PMID: 20069535
















Supplementary Discussion

Overview of System Architecture

An overview of the system architecture of GeneProf is illustrated in Supplementary Figure 3. The entire application is organized into three major components:

GeneProf Web Server

We aimed to make GeneProf a comprehensive, yet user-friendly data analysis suite and data hub accessible to and usable by any academic researcher from any part of the world, regardless of technical know-how and computing equipment. A web-based interface seemed to be the most straightforward and effective means to deliver a software solution to a broad audience of scientists world-wide: Anybody familiar with the use of a modern web browser can get started right away (and from anywhere with a reliable internet connection) without needing to worry about installing software or moving data from one place to another.

The GeneProf web server hosts all of the application's web pages and dynamic components and constitutes the only part of the application exposed to direct user interaction. The web server also manages essential aspects of user management and the confidentiality of user data, acts as the primary interface between web front-end components and the GeneProf databases (see next section), converts data between different formats on demand, and creates plots, data representations and summaries for the interface. Crucially, it acts as an interface between the experiment (processing job) queue and the user, allowing users to submit new jobs and to track (or cancel) existing ones. Recently we have also added an alternative access layer, called the “GeneProf Web API”, which allows computer programmers and data analysis experts to programmatically retrieve data from GeneProf for use in external web sites or programs.

The server-side software is implemented in the popular Java programming language and makes use of standard web-programming technologies such as JSP, Servlets, HTML, CSS, JavaScript and AJAX, exploiting a variety of open frameworks and libraries where appropriate. It can be deployed on a standard Java web container, such as Apache Tomcat, which is used to run the public instance of GeneProf at http://www.geneprof.org. All GeneProf servers run on a CentOS Enterprise Linux operating system.

At various points, the web application interacts with external software components, such as R (http://www.r-project.org) and LaTeX (we currently use the TeX Live distribution, http://www.tug.org/texlive). These programs are not an essential requirement for the functioning of the GeneProf web application, but help to enrich the user experience (e.g. by rendering plots). Further technical details and up-to-date lists of used and recommended external software, libraries and tools can be found in the GeneProf manual. A list of external software components used in the manuscript version of GeneProf is given in section 'Third-Party Software Components'.

GeneProf Databases

Arguably, the most important part of a scientific software application should always be the data. GeneProf stores all its data in a combination of a relational database system and a file server. Other than user-submitted scientific data, such as short read sequences and genomic data, which make up the core of what GeneProf is all about, these data comprise user records and other internal information such as, for example, the experiment (job) execution queue.

We found that combining a well-established relational database management system (RDBMS) with a conventional file server offered the ideal solution for our application: Smaller units of data and information that requires quick, random-access retrieval as well as dynamic filtering, sorting and the like can conveniently be stored in a relational database. In GeneProf, this means we store all internal data as well as gene-centric data and reference annotations (called 'Feature Data' and 'Reference Data', respectively, throughout the GeneProf interface) in this part of the database. Large chunks of data and data that usually requires only sequential access, on the other hand, ought to be stored on a file server. Here, we make use of a variety of compressed binary data formats to efficiently store and retrieve bulky data, such as short read sequences and genomic data (e.g. from alignments), effectively saving (disk) space and time (data access), which are both of major concern when dealing with the volume of data produced by modern functional genomics technologies.

From a technical point of view, we currently use MySQL (http://www.mysql.com) as an RDBMS on http://www.geneprof.org, although an equivalent system may be substituted in its place. The file server may either be a simple shared network drive accessible directly by any computer retrieving data from it (that is, the web server and processing nodes) or can use any standard FTP (file transfer protocol) server.

GeneProf Job Agencies & Workers

A powerful computer is of paramount importance to much of the data analysis performed in state-of-the-art bioinformatics workflows. It is not uncommon for individual processes to take several hours until completion and to require an amount of memory not currently available on most standard desktop workstations. GeneProf has therefore been designed to exploit a flexible network of compute nodes to perform all processing steps required. We call these compute nodes “job agencies”. Each job agency independently and constantly monitors the current experiment queue and waits for new jobs pending execution. When a new experiment is entered into the processing queue, one job agency will pick up this experiment and spawn a new “worker” process for the next step of the experiment's analysis workflow. This is repeated step by step until the experiment is completed, the user decides to interrupt the process or the job agency shuts down in a controlled fashion, for example, to retrieve an update. Updates are dynamically retrieved from the web server to ensure consistent GeneProf versions on all nodes.

Individual worker processes are simple Java programs (the workflow 'modules') following a well-defined specification (in programming terms, they all “inherit” from a common “superclass”) and may call external programs as required: In fact, some modules are merely wrappers for established tools implemented in a programming language other than Java. Importantly, computer programmers and algorithm developers can (comparatively easily) implement new modules without having to worry about, or modify, the mechanics of how they will be executed (also see section 'Extending GeneProf, Building on GeneProf & Availability of Source Code' of this document for more information).
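The module contract described above can be pictured with a short sketch. The actual modules are Java classes; Python is used here only for brevity, and every name in the sketch (`WorkflowModule`, `execute`, `run`, `UppercaseModule`, `ExternalToolModule`) is hypothetical and not taken from the GeneProf codebase.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class WorkflowModule(ABC):
    """Hypothetical common superclass: the scheduler only ever calls execute()."""

    def execute(self, inputs):
        # Shared bookkeeping lives here, so module authors never touch it.
        print(f"[{type(self).__name__}] starting", file=sys.stderr)
        outputs = self.run(inputs)
        print(f"[{type(self).__name__}] finished", file=sys.stderr)
        return outputs

    @abstractmethod
    def run(self, inputs):
        """Module-specific logic, supplied by each subclass."""


class UppercaseModule(WorkflowModule):
    """A toy module implemented directly in the host language."""

    def run(self, inputs):
        return [s.upper() for s in inputs]


class ExternalToolModule(WorkflowModule):
    """A module that merely wraps an external program."""

    def __init__(self, command):
        self.command = command  # argv list of the wrapped tool (assumed)

    def run(self, inputs):
        # Feed the inputs on stdin and collect the tool's stdout lines.
        result = subprocess.run(
            self.command,
            input="\n".join(inputs),
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.splitlines()
```

Under these assumptions, a new module only has to override `run`; how and where `execute` is called stays the concern of the job agency.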

GeneProf's 'job agencies' are small Java programs enclosed in a simple shell script that will retrieve updated versions of the software as required. Importantly, since we decided to implement the job and workflow management by retrieving jobs from a central database (“PULL” rather than “PUSH”) it is possible to easily extend the compute cluster with alternative types of nodes such as high-performance compute clusters as well as grid and cloud computing services.
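The “PULL” design can be illustrated with a minimal sketch in which each agency polls a central job table and claims at most one pending job at a time. This is not GeneProf's implementation: an in-memory SQLite database stands in for the central database, and the table layout, status values and function names are all assumptions made for illustration.

```python
import sqlite3


def create_queue(conn):
    # Hypothetical queue table; the real schema is not described in the text.
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, experiment TEXT, "
        "status TEXT DEFAULT 'PENDING', agency TEXT)"
    )


def submit(conn, experiment):
    conn.execute("INSERT INTO jobs (experiment) VALUES (?)", (experiment,))
    conn.commit()


def claim_next_job(conn, agency):
    """Pull one pending job; return (id, experiment) or None if nothing is pending."""
    row = conn.execute(
        "SELECT id, experiment FROM jobs "
        "WHERE status = 'PENDING' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # Conditional update: succeeds only if no other agency claimed the job first.
    cur = conn.execute(
        "UPDATE jobs SET status = 'RUNNING', agency = ? "
        "WHERE id = ? AND status = 'PENDING'",
        (agency, row[0]),
    )
    conn.commit()
    return row if cur.rowcount == 1 else None


def poll_once(conn, agency, worker):
    """One polling cycle of an agency: claim a job, run the worker, mark it done."""
    job = claim_next_job(conn, agency)
    if job is None:
        return False
    job_id, experiment = job
    worker(experiment)
    conn.execute("UPDATE jobs SET status = 'DONE' WHERE id = ?", (job_id,))
    conn.commit()
    return True
```

Because the agencies ask for work rather than being assigned it, adding another node in this sketch amounts to starting another process that calls `poll_once` in a loop against the same database.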

Third-Party Software Components

GeneProf provides a wide range of data analysis functionality both within workflows and in other components of the web interface. Many of these tools are custom-built (such as modules for quality control, summary statistics, gene expression quantization and various tools for manipulating datasets and the conversion of inputs / outputs between individual components), but several benefit from publicly available third-party software packages and resources, which are not part of the GeneProf codebase itself, but are called from various points in the application (in no particular order):

| Name | Type / Use in GeneProf | Reference or URL |
|---|---|---|
| SRA Toolkit | Short Read Processing | http://www.ncbi.nlm.nih.gov/books/NBK47540/ |
| Picard | Short Read / Alignment Processing | http://picard.sourceforge.net/ |
| SAMTools | Short Read / Alignment Processing | http://samtools.sourceforge.net/ |
| BEDTools | Genomic Data Manipulation | [1] |
| FASTX Toolkit | Short Read Processing | http://hannonlab.cshl.edu/fastx_toolkit |
| Bowtie | Short Read Alignment | [2] |
| TopHat | Short Read Alignment | [3] |
| ChIPSeqPeakFinder | ChIP Peak Detection | [4] |
| CCAT | ChIP Peak Detection | [5] |
| MACS | ChIP Peak Detection | [6] |
| CisGenome | ChIP Peak Detection | [7] |
| SISSRs | ChIP Peak Detection | [8] |
| TFAS Algorithm | Peak-Gene Association | [9] |
| MEME | Motif Discovery | [10] |
| DESeq | Differential Expression Analysis | [11] |
| EdgeR | Differential Expression Analysis | [12] |
| GenomeGraphs | Visualization | [13] |
| Vennerable | Visualization | http://r-forge.r-project.org/projects/vennerable/ |
| Ensembl | Gene Annotation / Data Source | [14] |
| BioGRID REST API | Protein-Protein Interactions / Data Source | [15] |
| Gene Ontology | Functional Annotation / Data Source | [16] |
| goseq | Functional Enrichment | [17] |
| R | Statistics / Maths / Visualization | http://www.r-project.org |

Apart from these, the GeneProf code makes use of many third-party code libraries. Please refer to the online GeneProf manual for an up-to-date list: http://www.geneprof.org/help_and_tutorials.jsp.

Additional Information about the Use of GeneProf Services and Software

Summary of Motivation and Overview of Terms & Conditions

The GeneProf software has been designed to facilitate life-science research by overcoming technical barriers that prevent optimal use of data analysis methods and by making experimental data and results more easily accessible and reusable to a wide range of researchers.

In addition to developing the software, we have set up a public instance of GeneProf on a network of high-performance computers hosted at our department, which is accessible and free to use for academic research purposes at http://www.geneprof.org. We encourage researchers to browse the GeneProf archives and to upload their own data to the system. All data kept in the system remains the intellectual property of the uploader and will be kept strictly confidential unless or until the owner decides to make it public. After the data has been made public, any GeneProf user may view, export and reuse any part of it, and it ought to be removed from the system only in exceptional circumstances. In order to enable fair use of the system by a wide range of users, we reserve the right to restrict the amount of disk space and compute time occupied by any individual user. Researchers handling particularly large amounts of data (high-volume users) may, however, obtain and set up a copy of the software in their local department.

GeneProf has been developed as an academic piece of software and we provide our services freely for research purposes. Users interested in a potential commercial or in any way chargeable use of the software and/or services are encouraged to get in touch with us to discuss further options.

For a detailed and up-to-date version of the terms and conditions please refer to http://www.geneprof.org/terms_and_conditions.jsp.

GeneProf Documentation & Tutorials

GeneProf has an extensive online manual at http://www.geneprof.org/help_and_tutorials.jsp. The manual contains detailed explanations of many fundamental concepts, of all pages and of all modules that are part of the GeneProf system. The manual also contains several helpful tutorials and examples.

An up-to-date PDF version of the manual (suitable for printing) is also available from this source; however, we recommend checking the online version from time to time, as the documentation is constantly being updated and improved.

Lastly, we have recorded a collection of short video clips introducing GeneProf's most important features. All screencasts are available at this URL: http://www.geneprof.org/screencasts.jsp.



Extending GeneProf, Building on GeneProf & Availability of Source Code

Bioinformaticians, computer programmers and data analysis experts are invited to contribute to the further development of the GeneProf software by developing their software and tools in a GeneProf-compliant manner or by integrating existing solutions into the GeneProf framework following a simple, well-defined specification. GeneProf binaries and code libraries for the Java programming language can be obtained from the website (initially by request only). Source code for existing GeneProf workflow modules can also be requested.

Another way in which advanced users may benefit from GeneProf is the GeneProf Web API, which allows them to retrieve data from GeneProf's vast databases programmatically and securely from external websites or computer programs.

Rather than providing further details here, we would like to refer interested readers to the 'Advanced Topics' section of the GeneProf manual (http://www.geneprof.org/help_advancedtopics.jsp), which will be constantly updated to reflect the most recent changes to the software. We encourage prospective developers of GeneProf modules to get in touch with us, so we can try to provide additional help and suggestions as appropriate.

References

1. Quinlan, A.R. & Hall, I.M. Bioinformatics 26, 841-842 (2010).
2. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Genome Biol. 10, R25 (2009).
3. Trapnell, C., Pachter, L. & Salzberg, S.L. Bioinformatics 25, 1105-1111 (2009).
4. Chen, X. et al. Cell 133, 1106-1117 (2008).
5. Xu, H. et al. Bioinformatics 26, 1199-1204 (2010).
6. Zhang, Y. et al. Genome Biol. 9, R137 (2008).
7. Ji, H. et al. Nat. Biotechnol. 26, 1293-1300 (2008).
8. Jothi, R., Cuddapah, S., Barski, A., Cui, K. & Zhao, K. Nucleic Acids Res. 36, 5221-5231 (2008).
9. Ouyang, Z., Zhou, Q. & Wong, W.H. Proc. Natl. Acad. Sci. U.S.A. 106, 21521-21526 (2009).
10. Bailey, T.L., Williams, N., Misleh, C. & Li, W.W. Nucleic Acids Res. 34, W369-73 (2006).
11. Anders, S. & Huber, W. Genome Biol. 11, R106 (2010).
12. Robinson, M.D., McCarthy, D.J. & Smyth, G.K. Bioinformatics 26, 139-140 (2010).
13. Durinck, S., Bullard, J., Spellman, P.T. & Dudoit, S. BMC Bioinformatics 10 (2009).
14. Flicek, P. et al. Nucleic Acids Res. 39, D800-806 (2011).
15. Winter, A.G., Wildenhain, J. & Tyers, M. Bioinformatics 27, 1043-1044 (2011).
16. Ashburner, M. et al. Nat. Genet. 25, 25-29 (2000).
17. Young, M.D., Wakefield, M.J., Smyth, G.K. & Oshlack, A. Genome Biol. 11, R14 (2010).



Supplementary Methods

The following examples demonstrate how GeneProf can be used to analyze experimental data and how it can help to answer actual biological questions.

Example 1: RNA-seq Gene Expression Analysis

Step 1: Experiment Creation and Data Upload

We first created a new GeneProf experiment (and called it 'Guttman2010 Data: Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs'). The experiment, with GeneProf accession number gpXP_000168, can be accessed at http://www.geneprof.org/show?id=gpXP_000168. It should be noted that we did not repeat the analysis presented in the original research publication, but simply took the sequencing data and interpreted it as an RNA-seq experiment looking for differential gene expression between different cell types.

We obtained the raw short read data [1] from the Sequence Read Archive [2] (accession: SRP002325). GeneProf provides a tool for downloading and importing data from this resource, so all we needed to do was to enter the above SRA accession number into the SRA importer tool and initiate the import.

Imports from the SRA are handled on the compute cluster, so, at this stage, we had to wait until the process was finished before we could proceed with the analysis. Once the large download was complete we were notified by email that the data was ready.

Step 2: Sample Annotation and Workflow Creation

We then had a look at the sample annotation imported from the SRA and extended it a little by marking “Sample Groups” to categorize all short read datasets belonging to the same cell type together (i.e. we annotated SRR039999 - SRR040001 as “ESC” = embryonic stem cell, SRR040002 & SRR040003 as “MLF” = mouse lung fibroblast and SRR040004 & SRR040005 as “NPC” = neural progenitor cell). We also changed the “Cell Type” descriptions imported from the SRA to the singular form to conform with the annotations used in the rest of the system and removed the redundant “Description” column (which only contained information about the cell type).

Having augmented the sample annotation, we could now use the “All-in-one RNA-seq Analysis Wizard” to draw up an analysis workflow for these data. The wizard automatically suggested the correct reference organism and sample groupings (using the annotation provided earlier). Since the replicates in this experiment were technical (the same biological sample split across several lanes of the flow cell), we decided to merge individual datasets for each sample group prior to analysis. To do so, we only needed to tick the checkbox labeled 'Merge datasets for the same group' (1 click).
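The effect of the 'Merge datasets for the same group' option can be pictured as follows. The accession-to-group mapping is the annotation given above; the function name and the representation of a dataset as a plain list of reads are assumptions made for this sketch, not a description of GeneProf's internals.

```python
from collections import defaultdict

# Sample-group annotation as described in the text.
SAMPLE_GROUPS = {
    "SRR039999": "ESC", "SRR040000": "ESC", "SRR040001": "ESC",
    "SRR040002": "MLF", "SRR040003": "MLF",
    "SRR040004": "NPC", "SRR040005": "NPC",
}


def merge_by_group(reads_by_run, groups):
    """Concatenate the read lists of all runs that share a sample group."""
    merged = defaultdict(list)
    for run in sorted(reads_by_run):  # sorted for a reproducible order
        merged[groups[run]].extend(reads_by_run[run])
    return dict(merged)
```

Merging like this is only appropriate for technical replicates, as in this experiment; biological replicates would normally be kept separate so that their variability can inform the differential expression statistics.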

We were satisfied with the default settings for most of the other options, but changed the alignment software to be used and the insert size: We wanted to use TopHat [3] for aligning short reads to the mouse genome (the wizard recommends this option for long and mate-pair reads, such as the sequence data at hand), so we clicked the checkbox next to TopHat in the configuration page (+1 click). We also changed the “average / expected insert size” to 140 (the insert size is stated as ranging from 110 to 170 [1]; +1 click and 3 key strokes) and the standard deviation to 30 (rather arbitrarily, since we did not know the exact distribution; +1 click and 2 key strokes). We then confirmed the other settings by clicking the big “Accept Settings & Create Workflow” button (+1 click).

The wizard now created a workflow (including quality control, sequence alignment, gene expression quantization and differential expression analysis) based on the parameters provided. The proposed workflow was summarized on the next page (Supplementary Fig. 7). Since we were satisfied with this workflow, we chose to save the workflow and execute it (+1 click).

Thanks to the easy to use workflow wizard and the availability of sensible, pre-defined default options, it took us only six clicks and five key strokes to create a complete data analysis pipeline for an RNA-seq experiment with three different experimental conditions!

At this point, the experiment was entered into the job scheduler queue and soon afterwards the experiment started being processed on the compute cluster.

Step 3: Examination of Results

When we were notified (by email) that the data processing had been completed, we went ahead to have a look at the outputs (we have included a PDF report, exported directly from the GeneProf experiment, in the supplementary online material of this publication, see Supplementary Data).




In particular, we looked at the different summary statistics calculated at various stages of the workflow: Sequence statistics before and after quality control (http://www.geneprof.org/show?id=gpDS_11_168_12_1 and http://www.geneprof.org/show?id=gpDS_11_168_16_1, respectively), read alignment (http://www.geneprof.org/show?id=gpDS_11_168_17_1) and gene expression statistics (http://www.geneprof.org/show?id=gpDS_11_168_19_1).

We found that the sequence data in this experiment was of good quality, as evidenced by overall high base-call quality scores (Supplementary Fig. 4) and a (rather) uniform nucleotide distribution across all sequencing cycles (Supplementary Fig. 5). Consequently, the quality control step removed only a small portion of each library (see table below). In this step we were 'lenient' in the terminology of the RNA-seq analysis wizard, i.e. we required the average quality of each read to be at least 8 and discarded all other reads.

| SRA Accessions | GeneProf Name | Read Length, 1st Mate (bases) | Read Length, 2nd Mate (bases) | Mean Quality Score per Read, Before QC | Mean Quality Score per Read, After QC | Number of Reads, Before QC | Number of Reads, After QC | Kept |
|---|---|---|---|---|---|---|---|---|
| SRR039999, SRR040000, SRR040001 | ESC | 76 | 76 | 26.05 | 26.64 | 116,130,790 | 109,131,950 | 94.0% |
| SRR040002, SRR040003 | MLF | 76 | 64 | 31.89 | 32.41 | 133,312,662 | 128,630,414 | 96.5% |
| SRR040004, SRR040005 | NPC | 76 | 64-76 | 31.47 | 31.95 | 120,055,966 | 113,792,534 | 94.8% |

In the read counts, each paired-end read counts as 2.
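A filter of the kind used in the quality control step, together with the "Kept" percentage reported in the table, can be sketched as below. Only the threshold (mean quality of at least 8) comes from the text; the function names and the representation of a read as a (sequence, quality list) pair are assumptions, and the way GeneProf handles the two mates of a pair is not specified here.

```python
def mean_quality(qualities):
    """Average base-call quality of one read."""
    return sum(qualities) / len(qualities)


def filter_reads(reads, min_mean_quality=8):
    """Keep reads whose mean quality reaches the threshold.

    reads: iterable of (sequence, [per-base quality scores]) pairs.
    """
    return [r for r in reads if mean_quality(r[1]) >= min_mean_quality]


def percent_kept(before, after):
    """Share of reads surviving quality control, in percent (one decimal)."""
    return round(100.0 * after / before, 1)
```

Applied to the read counts in the table, `percent_kept(116130790, 109131950)` reproduces the 94.0% reported for the ESC library.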

The read alignment statistics told us that about 89% of all short read sequences could be aligned to the mouse reference genome (Supplementary Fig. 6; however, not all of these sequences had a valid alignment for both ends). As was to be expected, thanks to comparatively long read sequences and the availability of paired-end reads, the majority of reads (60.4 – 69.8%) could be aligned uniquely to one position in the genome.

In this case, we were satisfied with the results of the basic analysis steps and did not see any reason for further action. In other experiments, we might decide at this point to alter the parameters of the quality control step (for example, we often observe a strongly skewed nucleotide distribution in the initial sequencing cycles or drastically decreasing quality scores in later cycles, which suggests that it might be a good idea to trim reads before alignment) or of the alignment procedure (increasing or decreasing the number of mismatches allowed is a good way to balance the number of unaligned reads against the number of highly ambiguously aligned reads).

Lastly, we had a look at the gene expression read counts calculated from the coverage of aligned reads to known gene models. More specifically, we looked at the expression of four key stem cell regulators (Pou5f1, Nanog, Sox2 and Klf4), which were expected to be expressed at high levels in embryonic stem cells (Sox2 and Klf4 are also expressed in other cell types). The table below shows the gene expression estimates expressed as reads per kilobase per million mapped reads (RPKM) [4], confirming our expectations. The complete table with values for all genes can be accessed at http://www.geneprof.org/show?id=gpDS_11_168_20_1.

Gene Expression Estimates (as RPKM = Reads per Kilobase Million)

| Name | Ensembl ID | GeneProf ID | ESC | NPC | MLF |
|---|---|---|---|---|---|
| Pou5f1 | ENSMUSG00000024406 | gpFT_pub_mm_ens58_ncbim37_29219 | 771.1 | 0.0 | 0.0 |
| Sox2 | ENSMUSG00000074637 | | 147.1 | 82.4 | 0.0 |
| Nanog | ENSMUSG00000012396 | | 334.3 | 0.1 | 0.0 |
| Klf4 | ENSMUSG00000003032 | | 55.8 | 9.7 | 3.6 |
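For readers unfamiliar with the unit, RPKM normalizes a gene's read count by both the gene length and the sequencing depth. A minimal sketch of the standard formula follows; the function name and the example numbers are illustrative only, and how GeneProf assigns reads to genes or measures gene length is not covered by this sketch.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene model per million mapped reads.

    RPKM = 10^9 * C / (N * L), where C is the number of reads assigned to
    the gene, N the total number of mapped reads in the library and L the
    length of the gene model in base pairs.
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Illustrative numbers (not from the experiment): 500 reads on a 2 kb gene
# in a library of 10 million mapped reads.
example = rpkm(500, 2000, 10_000_000)
```

Because of the two normalizations, values are comparable between genes of different length and between libraries of different size, which is what allows the ESC, NPC and MLF columns to be read side by side.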


http://www.geneprof.org/show?id=gpFT_pub_mm_ens58_ncbim37_29285






Example 2: Comparing Transcriptional Activity and Transcription Factor Binding

We were interested in investigating the DNA-binding activity of the aforementioned stem cell master regulators, Pou5f1, Nanog and Sox2, in their own upstream regions and to contrast these with the transcriptional activity in different cell types.

For the sake of this example, we have picked out some datasets that might be of interest in such a case: RNA-seq expression analysis from ESC, NPC and MLF (from Guttman et al. [1], see the first example above) and two independent studies [5,6] that had profiled transcription factor binding activity by the three factors mentioned (data has been re-analyzed in GeneProf experiments gpXP_000012 and gpXP_000028).

We opened GeneProf's in-built Genome Browser for the mouse reference genome (which makes use of the R package GenomeGraphs [7] to plot tracks along genomic coordinates) and selected the tracks for the selected datasets (which are all part of the public repository) in the “Choose Tracks & Other Display Options” dialog. We kept all the default options.

We then selected the regions we wanted to display by typing a gene name (e.g. “sox2”) into the text box labeled “Gene:” and selecting one of the suggested genes from the list (there can be multiple matches; in the case of Sox2, for example, the text we entered also matched the genes Qsox2 and Sox21). Subsequently clicking the “Update Display” button showed the requested region, including the data from all selected tracks and the genes located in the respective interval. We then extended the region some 3-10kb upstream and ~1kb downstream of the gene and exported the plot in high-quality PDF format.

Supplementary Figure 8 shows the exported plots for the Pou5f1, Nanog and Sox2 loci (a, b and c, respectively).

Example 3: Similarity of Global DNA-Protein Binding Profiles

Lastly, we performed an exploratory comparison between DNA-protein binding activity on a global scale using GeneProf's Visual Data Explorer (VDE). The VDE makes it possible to quickly discover patterns in and relations between large collections of datasets by exploiting the public data available in GeneProf and categorizing it by virtue of their annotation into groups such as different cell types, tissues or developmental stages. At the time of writing this manuscript, only histograms, correlation matrices and principal component analysis are supported, but more plot types will be added in future.

We opened the VDE and selected the mouse reference dataset (default), picked “correlation matrix” as the plot type, and “Has Binding Site? (Yes/No)” as data type (this kind of data reports a boolean flag for each gene indicating “true” if a ChIP binding peak for a given protein was found in the proximity of the gene's start site or “false” otherwise; “proximity of the TSS”, here, was defined as 20kb upstream and 1kb downstream of the TSS). We then clicked “Select Dataset(s)..” and picked out 22 datasets from 5 studies that profiled the binding of transcription factors and other regulatory proteins in embryonic stem cells [5,6,8,9,10]. We chose to group the data by “Gene”, which means that we wanted to group the profiles by the protein profiled, merging datasets belonging to the same factor. This resulted in 17 different groups corresponding to 17 proteins.

Subsequently clicking the “Update Plot” button produced a plot like the one shown in Supplementary Figure 9.
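The two ingredients of this plot, the per-gene binding flag and the Spearman correlation between flag vectors, can be sketched in a few lines. The window sizes (20kb upstream, 1kb downstream of the TSS) come from the text; everything else is an assumption for illustration: peaks are reduced to single positions on the same chromosome as the gene, strands are written as "+" and "-", and the function names are hypothetical. Spearman correlation is computed here as the Pearson correlation of average ranks.

```python
def has_binding_site(tss, strand, peaks, upstream=20_000, downstream=1_000):
    """True if any peak position lies within the strand-aware TSS window."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    else:  # on the minus strand, "upstream" means higher coordinates
        start, end = tss - downstream, tss + upstream
    return any(start <= p <= end for p in peaks)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    # Undefined (division by zero) if either vector is constant.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation of two equally long vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Computing `spearman` for every pair of proteins over the gene-wise flag vectors yields a symmetric matrix of the kind visualized in the correlation plot.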

Preparation of Manuscript Figures

The plots in Figure 1 of the paper have been exported directly from GeneProf as a set of R scripts, which were subsequently modified manually to comply with the style of Nature Methods. In particular, we modified the scripts so that they would use the same colors for the same sequence libraries in each figure, in order to improve consistency and avoid a replication of figure legends. The figures exported from R (in PDF format) were imported into the Inkscape (http://inkscape.org) vector graphics software and further modified to create the complete figure with its flowchart-like elements.



References

1. Guttman, M. et al. Nat. Biotechnol. 28, 503-510 (2010).
2. Leinonen, R., Sugawara, H. & Shumway, M. Nucleic Acids Res. 39, D19-21 (2011).
3. Trapnell, C., Pachter, L. & Salzberg, S.L. Bioinformatics 25, 1105-1111 (2009).
4. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621-628 (2008).
5. Chen, X. et al. Cell 133, 1106-1117 (2008).
6. Marson, A. et al. Cell 134, 521-533 (2008).
7. Durinck, S., Bullard, J., Spellman, P.T. & Dudoit, S. BMC Bioinformatics 10 (2009).
8. Heng, J.C. et al. Cell Stem Cell 6, 167-174 (2010).
9. Li, G. et al. Genes Dev. 24, 368-380 (2010).
10. Walker, E. et al. Cell Stem Cell 6, 153-166 (2010).


Supplementary Data

Automatically created report covering the data, analysis procedure and main outputs of the GeneProf experiment gpXP_000168 (Supplementary Methods).

We suggest including a report like this in any publication that uses GeneProf for a major part of the data analysis presented.


Guttman2010 Data: Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs

Experiment Creator: Florian Halbritter

Last Updated: 03-May-2011

Downloaded by Florian Halbritter ([email protected]) on May 3, 2011.

N.B. The creator / owner of the GeneProf experiment is not necessarily the author of the original data used in the experiment.

Report created by GeneProf, (c) 2009-2011 Florian Halbritter and Simon Tomlinson, Institute for Stem Cell Research / MRC Centre for Regenerative Medicine, University of Edinburgh, Edinburgh, UK.


Contents

1 Experiment
  1.1 General Information
  1.2 Input Datasets Overview
  1.3 Analysis Workflow
  1.4 Main Results Overview
2 Input Datasets
  2.1 Ensembl 58 Mouse Genes, NCBIM37 Assembly
  2.2 SRR039999
  2.3 SRR040005
  2.4 SRR040002
  2.5 SRR040004
  2.6 SRR040001
  2.7 SRR040000
  2.8 SRR040003
3 Genome Snapshots
  3.1 Pluripotency Genes: Nanog
  3.2 Pluripotency Genes: Pou5f1
  3.3 Pluripotency Genes: Sox2
4 Main Result Datasets
  4.1 Sequence Data Statistics before Quality Control
  4.2 Sequence Data Statistics after Quality Control
  4.3 Read Alignment Statistics
  4.4 Gene Expression Statistics
  4.5 Gene Expression with Differential Expression Statistics

Preamble

This document reports an overview about the GeneProf experiment ’Guttman2010 Data: Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs’ (gpXP_000168). It contains basic statistics and excerpts from the data and results that the authors considered most relevant as well as an outline of the analysis workflow employed. Please visit the experiment’s main page online for full details (http://www.geneprof.org/show?id=gpXP_000168).



Chapter 1

Experiment

1.1 General Information

| Field | Value |
|---|---|
| Name | Guttman2010 Data: Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs |
| Description | This is an independent re-analysis of the data published by Guttman et al. and does not necessarily agree with the results presented in the paper. Original description from SRA: RNA-Seq provides an unbiased way to study a transcriptome, including both coding and non-coding genes. To date, most RNA-Seq studies have critically depended on existing annotations, and thus focused on studying expression levels and variation in known transcripts. Here, we present Scripture, a method to reconstruct the transcriptome of a mammalian cell using only RNA-Seq reads and the genome sequence. We apply this approach to mouse embryonic stem cells, neuronal precursor cells, and lung fibroblasts to accurately reconstruct the full-length gene structures for the vast majority of known genes. We identify novel biological variation in protein-coding genes, including thousands of novel 5’-start sites, 3’-ends, and internal coding exons. We then determine the gene structures of over a thousand lincRNA loci. Our results open the way to direct experimental manipulation of thousands of non-coding RNAs, and demonstrate the power of ab initio reconstruction to provide a comprehensive picture of mammalian transcriptomes. Overall Design: RNA-Seq experiments of poly-A selected total RNA from embryonic stem cells, lung fibroblasts, and neural progenitor cells. |
| External Project Accession | |
| Data Ownership | This experiment is based on public data. |
| Rigid Identifier | gpXP_000168 |
| GeneProf URL | http://www.geneprof.org/show?id=gpXP_000168 |
| Platform(s) | Illumina Genome Analyzer II |
| Reference(s) | Ensembl 58 Mouse Genes, NCBIM37 Assembly |
| Owner | Florian Halbritter ([email protected]) |
| Status | PUBLIC |
| Date Created | 18-Apr-2011 |
| Last Modified | 03-May-2011 |

1.2 Input Datasets Overview

This is a list of all datasets used as an input for this experiment.

| ID | Name | Type |
|---|---|---|
| gpDS_pub_mm_ens58_ncbim37 | Ensembl 58 Mouse Genes, NCBIM37 Assembly | REFERENCE |
| gpDS_11_168_1_0 | SRR039999 | SEQUENCES |
| gpDS_11_168_2_0 | SRR040005 | SEQUENCES |
| gpDS_11_168_3_0 | SRR040002 | SEQUENCES |
| gpDS_11_168_4_0 | SRR040004 | SEQUENCES |
| gpDS_11_168_5_0 | SRR040001 | SEQUENCES |
| gpDS_11_168_6_0 | SRR040000 | SEQUENCES |
| gpDS_11_168_7_0 | SRR040003 | SEQUENCES |

1.3 Analysis Workflow

This is a simplified, schematic representation of the analysis workflow used in this experiment.


1.4 Main Results Overview

This is a list of all datasets containing the main results of the experiment (as chosen by the creator).

ID | Name | Type
gpDS_11_168_12_1 | Sequence Data Statistics before Quality Control | SPECIAL
gpDS_11_168_16_1 | Sequence Data Statistics after Quality Control | SPECIAL
gpDS_11_168_17_1 | Read Alignment Statistics | SPECIAL
gpDS_11_168_19_1 | Gene Expression Statistics | SPECIAL
gpDS_11_168_20_1 | Gene Expression with Differential Expression Statistics | FEATURES



Chapter 2

Input Datasets

2.1 Ensembl 58 Mouse Genes, NCBIM37 Assembly

Name: Ensembl 58 Mouse Genes, NCBIM37 Assembly
Rigid Identifier: gpDS_pub_mm_ens58_ncbim37
GeneProf URL: http://www.geneprof.org/show?id=gpDS_pub_mm_ens58_ncbim37
Data Type: REFERENCE

Data Sample

Top 10 rows sorted in ascending order by Internal ID. Showing only the first 7 columns.

Strand | Chromosome | Feature Type | End | Description | Name | Start
false | 12 | protein_coding | 3791750 | | CR974568.2 | 3791316
true | 19 | protein_coding | 20497275 | RIKEN cDNA 1500015L24 gene Gene [Source:MGI Symbol;Acc:MGI:1916244] | 1500015L24Rik | 20479778
false | 12 | protein_coding | 56300832 | Putative uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:Q9D1U8] | AC174776.2 | 56300440
true | 18 | protein_coding | 33634364 | Putative uncharacterized protein Fragment [Source:UniProtKB/TrEMBL;Acc:Q3URY4] | AC115117.1 | 33623817
true | 8 | protein_coding | 13461369 | family with sequence similarity 70, member B Gene [Source:MGI Symbol;Acc:MGI:2685533] | Fam70b | 13435426
false | 14 | protein_coding | 57676782 | gap junction protein, alpha 3 Gene [Source:MGI Symbol;Acc:MGI:95714] | Gja3 | 57654497
true | 13 | protein_coding | 95612852 | tubulin cofactor A Gene [Source:MGI Symbol;Acc:MGI:107549] | Tbca | 95558898
false | 14 | protein_coding | 50901926 | olfactory receptor 732 Gene [Source:MGI Symbol;Acc:MGI:3030566] | Olfr732 | 50901000
true | 9 | protein_coding | 3021593 | | AC131780.5 | 3020108
true | 9 | protein_coding | 21224721 | queuine tRNA-ribosyltransferase 1 Gene [Source:MGI Symbol;Acc:MGI:1931441] | Qtrt1 | 21216289

2.2 SRR039999

Name: SRR039999
Rigid Identifier: gpDS_11_168_1_0
GeneProf URL: http://www.geneprof.org/show?id=gpDS_11_168_1_0
Data Type: SEQUENCES

There are 19,147,701 sequences in this dataset.

2.3 SRR040005






2.4 SRR040002



2.5 SRR040004



2.6 SRR040001



2.7 SRR040000



2.8 SRR040003










Chapter 3

Genome Snapshots

These are snapshots of genomic data overlaid on a linearized version of the genome that the owner of the experiment considered particularly relevant or interesting.

3.1 Pluripotency Genes: Nanog

Genomic coverage profiles from RNA-seq (Guttman et al. (2010) RNA-seq data) and the binding patterns of three stem cell master regulators (Pou5f1, Sox2 and Nanog) from two independent ChIP-seq studies (data from Chen et al. (2008) and Marson et al. (2008)). Shown here is the region surrounding the Nanog locus.

3.2 Pluripotency Genes: Pou5f1

Genomic coverage profiles from RNA-seq (Guttman et al. (2010) RNA-seq data) and the binding patterns of three stem cell master regulators (Pou5f1, Sox2 and Nanog) from two independent ChIP-seq studies (data from Chen et al. (2008) and Marson et al. (2008)). Shown here is the region surrounding the Pou5f1 locus.



3.3 Pluripotency Genes: Sox2

Genomic coverage profiles from RNA-seq (Guttman et al. (2010) RNA-seq data) and the binding patterns of three stem cell master regulators (Pou5f1, Sox2 and Nanog) from two independent ChIP-seq studies (data from Chen et al. (2008) and Marson et al. (2008)). Shown here is the region surrounding the Sox2 locus.



Chapter 4

Main Result Datasets

4.1 Sequence Data Statistics before Quality Control

Rigid Identifier: gpDS_11_168_12_1
GeneProf URL: http://www.geneprof.org/show?id=gpDS_11_168_12_1

[Plots: left, Number of Sequences per dataset (NPC, MLF, ESC), distinct only vs. total; right, Nucleotide Counts per dataset (NPC, MLF, ESC) for N, A, T, C, G.]

Figure: Left: Overall library size of the sequence datasets examined. The distinct size refers to the number of elements in the set with differences in at least one nucleotide. Right: The overall nucleotide composition of the datasets.
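The two quantities in this figure can be illustrated with a minimal Python sketch. It assumes the reads are available as plain strings; the function name `library_summary` is hypothetical and is not part of GeneProf.

```python
from collections import Counter


def library_summary(reads):
    """Return (total, distinct, nucleotide_counts) for an iterable of reads."""
    reads = list(reads)
    total = len(reads)            # every read counts, duplicates included
    distinct = len(set(reads))    # reads differing in at least one nucleotide
    composition = Counter()
    for read in reads:
        composition.update(read.upper())  # tally A, C, G, T and N
    return total, distinct, composition


total, distinct, composition = library_summary(["ACGT", "ACGT", "ACGN", "TTGA"])
print(total, distinct, dict(composition))
```

Holding every distinct read in a set is only practical for small inputs; libraries of tens of millions of reads would need a more memory-conscious approach.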

[Plots: one panel per dataset (NPC, MLF, ESC); x-axis: base position.]

Figure: The nucleotide distribution per dataset and base position. If dealing with short read sequencing datasets, each base position may correspond to a cycle in the sequencing process.
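A per-position nucleotide tally of the kind described in this caption could be computed roughly as follows. This is an illustrative sketch that assumes reads are plain strings of possibly different lengths; `per_position_counts` is a hypothetical helper name.

```python
from collections import Counter


def per_position_counts(reads):
    """Return a list with one Counter of nucleotides per base position."""
    counts = []
    for read in reads:
        for pos, base in enumerate(read.upper()):
            if pos == len(counts):
                counts.append(Counter())  # first read reaching this position
            counts[pos][base] += 1
    return counts


for pos, tally in enumerate(per_position_counts(["ACG", "AC", "TCGA"]), start=1):
    print(pos, dict(tally))
```

Shorter reads simply stop contributing at later positions, which is why the totals per position can shrink towards the end of the longest reads.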

[Plot: x-axis: Base Position / Sequencing Cycle; y-axis: Quality Score (Phred-like); series: NPC, MLF, ESC.]

Figure: The average quality score per sequencing cycle. This might drop a little over the full length of the reads. Significant dips in the average quality, however, might be a reason for concern.
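As a rough illustration of the per-cycle average, the sketch below decodes FASTQ-style quality strings and averages the scores at each position. The ASCII offset of 33 is an assumption (other encodings use 64), and `mean_quality_per_cycle` is a hypothetical name rather than a GeneProf function.

```python
def mean_quality_per_cycle(quality_strings, offset=33):
    """Average Phred-like score at each base position across all reads."""
    sums, counts = [], []
    for quality in quality_strings:
        for pos, char in enumerate(quality):
            if pos == len(sums):
                sums.append(0)
                counts.append(0)
            sums[pos] += ord(char) - offset  # decode one quality character
            counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]


# 'I' decodes to 40 and '5' to 20 under the assumed offset of 33.
print(mean_quality_per_cycle(["II", "5"]))
```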

[Plots: left, x-axis: Mean Quality Score, y-axis: Number of Reads [log10]; right, x-axis: Cumulative Quality Score, y-axis: Number of Reads [log10]; series: NPC, MLF, ESC.]

Figure: Left: The distribution of average quality scores. This plot can be useful to assess how many sequences pass a certain quality score threshold. Right: The distribution of cumulative quality scores. Cumulative quality scores have been calculated by simply adding up all individual quality scores per read.
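The two per-read summaries in this caption reduce to a sum and a mean over the decoded scores of one read. The following sketch assumes Phred+33 encoded quality strings; the helper names and the threshold used in the example are illustrative only.

```python
def read_quality_scores(quality_string, offset=33):
    """Return (mean, cumulative) quality for a single read."""
    scores = [ord(char) - offset for char in quality_string]
    cumulative = sum(scores)  # all individual scores added up
    mean = cumulative / len(scores) if scores else 0.0
    return mean, cumulative


def count_passing(quality_strings, min_mean, offset=33):
    """Number of reads whose mean quality reaches the given threshold."""
    return sum(
        1 for q in quality_strings
        if read_quality_scores(q, offset)[0] >= min_mean
    )


print(read_quality_scores("I5"))
print(count_passing(["II", "55", "I5"], min_mean=30))
```

Because the cumulative score grows with read length, it is only comparable between reads of similar length, whereas the mean is not affected in that way.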

[Plot: x-axis: Number of Occurrences of the Same Sequence [log10]; y-axis: Frequency [log10]; series: NPC, MLF, ESC.]

Figure: The abundance of identical reads. In short read sequencing libraries, most reads will only be sequenced a few times, but there will be some highly abundant sequences as well.
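The abundance distribution in this figure is a "count of counts": first tally how often each sequence occurs, then tally how many sequences share each occurrence number. A minimal sketch, with the hypothetical name `duplication_histogram`:

```python
from collections import Counter


def duplication_histogram(reads):
    """Map 'times a sequence was seen' to 'number of such sequences'."""
    occurrences = Counter(reads)          # sequence -> times seen
    return Counter(occurrences.values())  # times seen -> number of sequences


histogram = duplication_histogram(["A", "A", "B", "C", "C", "C", "D"])
for times_seen in sorted(histogram):
    print(times_seen, histogram[times_seen])
```

The plot applies a log10 transform to both axes; the sketch returns raw counts and leaves any transformation to the plotting step.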

4.2 Sequence Data Statistics after Quality Control


[Plots: left, Number of Sequences per dataset (NPC, MLF, ESC), distinct only vs. total; right, Nucleotide Counts per dataset (NPC, MLF, ESC) for N, A, T, C, G.]

Figure: Left: Overall library size of the sequence datasets examined. The distinct size refers to the number of elements in the set with differences in at least one nucleotide. Right: The overall nucleotide composition of the datasets.

[Plots: one panel per dataset (NPC, MLF, ESC); x-axis: base position.]

Figure: The nucleotide distribution per dataset and base position. If dealing with short read sequencing datasets, each base position may correspond to a cycle in the sequencing process.

[Plot: x-axis: Base Position / Sequencing Cycle; y-axis: Quality Score (Phred-like); series: NPC, MLF, ESC.]

Figure: The average quality score per sequencing cycle. This might drop a little over the full length of the reads. Significant dips in the average quality, however, might be a reason for concern.

[Plots: left, x-axis: Mean Quality Score, y-axis: Number of Reads [log10]; right, x-axis: Cumulative Quality Score, y-axis: Number of Reads [log10]; series: NPC, MLF, ESC.]

Figure: Left: The distribution of average quality scores. This plot can be useful to assess how many sequences pass a certain quality score threshold. Right: The distribution of cumulative quality scores. Cumulative quality scores have been calculated by simply adding up all individual quality scores per read.

[Plot: x-axis: Number of Occurrences of the Same Sequence [log10]; y-axis: Frequency [log10]; series: NPC, MLF, ESC.]

Figure: The abundance of identical reads. In short read sequencing libraries, most reads will only be sequenced a few times, but there will be some highly abundant sequences as well.



4.3 Read Alignment Statistics

[Plot: Number of Distinct Regions per dataset (NPC, MLF, ESC).]

Figure: Overview of the size of all datasets. The numbers visualised here correspond to the total number of distinct genomic intervals covered in the individual datasets.

[Plot: x-axis: Chromosome (1 to 19, MT, Y); y-axis: Percent of Distinct Regions; series: NPC, MLF, ESC.]

Figure: The distribution of distinct genomic regions over chromosomes.
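One plausible way to obtain such a distribution is to collapse identical alignment intervals and then count the remaining intervals per chromosome. The sketch below assumes each alignment is a `(chromosome, start, end)` tuple; that representation and the name `chromosome_distribution` are assumptions made for illustration.

```python
from collections import Counter


def chromosome_distribution(alignments):
    """Fraction of distinct genomic intervals that fall on each chromosome."""
    distinct = set(alignments)  # identical intervals are counted once
    per_chromosome = Counter(region[0] for region in distinct)
    total = len(distinct)
    return {chrom: n / total for chrom, n in per_chromosome.items()}


alignments = [
    ("1", 10, 20), ("1", 10, 20), ("1", 30, 40),
    ("MT", 5, 15), ("2", 1, 9),
]
print(chromosome_distribution(alignments))
```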

[Plot: x-axis: Number of Possible Matches in Reference (None, then up to 20); y-axis: Percent of Aligned Reads; series: NPC, MLF, ESC.]

Figure: The alignment ambiguity. Some reads might not map uniquely to one position in the genome; this graph displays how ambiguous the alignment was overall. All reads that did not occur in any of the other categories count toward 'unaligned', i.e. these are reads that could either not be aligned at all or that are too ambiguous.
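The binning described in this caption can be sketched as follows, given the number of reference matches reported for each read. The cut-off of 20 matches is an assumption taken from the range of the plot axis, and `ambiguity_profile` is a hypothetical name.

```python
from collections import Counter


def ambiguity_profile(hit_counts, max_hits=20):
    """Fraction of reads per number of possible matches in the reference.

    Reads with no match, or with more than `max_hits` matches, are pooled
    under the key 'unaligned'.
    """
    hit_counts = list(hit_counts)
    profile = Counter()
    for hits in hit_counts:
        key = hits if 1 <= hits <= max_hits else "unaligned"
        profile[key] += 1
    total = len(hit_counts)
    return {key: n / total for key, n in profile.items()}


print(ambiguity_profile([1, 1, 0, 2, 50]))
```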

4.4 Gene Expression Statistics


Top 5 Most Highly Expressed Genes

Rank | 1 | 2 | 3 | 4 | 5
NPC [RPM] | mt-Co1: 16,480.2 | mt-Cytb: 11,607.6 | mt-Atp6: 11,164.3 | mt-Nd4: 8,646.6 | mt-Co2: 8,584.3
MLF [RPM] | Col3a1: 18,558.7 | Col1a2: 11,909.8 | Col1a1: 8,634.8 | mmu-mir-2133-2: 7,668.6 | Bgn: 7,352.9
ESC [RPM] | Eef1a1: 8,591.1 | Hsp90ab1: 8,121.6 | mt-Co1: 6,682.9 | AC092404.2: 6,434.9 | AL840626.1: 5,599.9
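For orientation, reads per million (RPM) is commonly defined as a gene's read count divided by the library's total read count, multiplied by one million. The sketch below uses that common definition; whether GeneProf normalises against all reads or only aligned reads is not stated here, so the `total_reads` parameter and both function names are assumptions.

```python
def reads_per_million(gene_counts, total_reads=None):
    """Scale raw per-gene read counts to reads per million (RPM)."""
    if total_reads is None:
        total_reads = sum(gene_counts.values())  # assumed denominator
    return {
        gene: count * 1_000_000 / total_reads
        for gene, count in gene_counts.items()
    }


def top_genes(rpm, n=5):
    """The n most highly expressed genes, highest first."""
    return sorted(rpm.items(), key=lambda item: item[1], reverse=True)[:n]


rpm = reads_per_million({"geneA": 10, "geneB": 30, "geneC": 60})
print(top_genes(rpm, n=2))
```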

[Plot: one bar per dataset (NPC [RPM], MLF [RPM], ESC [RPM]), split by feature category: IG_C_gene, IG_D_gene, IG_J_gene, IG_V_gene, Mt_rRNA, Mt_tRNA, lincRNA, miRNA, misc_RNA, polymorphic_pseudogene, processed_transcript, protein_coding, pseudogene, rRNA, snRNA, snoRNA.]

Figure: The number of expressed features by category.

[Plot: one bar per dataset (NPC [RPM], MLF [RPM], ESC [RPM]), split by feature category: IG_C_gene, IG_D_gene, IG_J_gene, IG_V_gene, Mt_rRNA, Mt_tRNA, lincRNA, miRNA, misc_RNA, polymorphic_pseudogene, processed_transcript, protein_coding, pseudogene, rRNA, snRNA, snoRNA.]

Figure: The total amount of expression (i.e. sum of all expression values) observed per category and dataset.

[Plot: heatmap with columns ESC [RPM], MLF [RPM], NPC [RPM]; color key and histogram: Row Z-Score from -1 to 1.]

Figure: Clustered heatmap of expression values. Only 1000 randomly selected features are shown.

[Plot: correlation heatmap of NPC [RPM], ESC [RPM] and MLF [RPM]; color key and histogram: Value from 0.5 to 1.]

Figure: Pearson correlation matrix of expression patterns.
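A Pearson correlation matrix of this kind compares the expression profiles of the datasets pairwise. The following pure-Python sketch shows the standard formula; the profile values in the example are made up, and nothing is implied about how GeneProf computes the matrix internally.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_matrix(profiles):
    """Pairwise correlations for a dict of name -> expression values."""
    names = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in names} for a in names}


# Hypothetical expression values for three genes in each dataset.
profiles = {"NPC": [1.0, 2.0, 3.0], "MLF": [2.0, 4.0, 6.5], "ESC": [3.0, 2.0, 1.0]}
for name, row in correlation_matrix(profiles).items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

The function raises a `ZeroDivisionError` for a constant profile, for which the coefficient is undefined.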

[Plots: (1) x-axis: Number of Principal Components, y-axis: Percent of Total Variance, series: simple and cumulative; (2) PC1 vs. PC2; (3) PC3 vs. PC2; (4) PC1 vs. PC3. Points in panels 2 to 4: NPC [RPM], MLF [RPM], ESC [RPM].]

Figure: Principal component analysis. Only the first three principal components are being compared.
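With only three datasets treated as observations, centring removes one degree of freedom, so at most two principal components can carry variance and the third is numerically zero. The sketch below illustrates this for exactly three samples using the 3x3 Gram matrix of the centred data; it is an illustration of the general principle, not GeneProf's implementation, and `explained_variance` is a hypothetical name.

```python
import math


def explained_variance(samples):
    """Fraction of total variance on PC1..PC3 for exactly three samples.

    `samples` holds three equally long lists of expression values.
    """
    n_genes = len(samples[0])
    # Centre every gene (column) across the three samples.
    means = [sum(s[g] for s in samples) / 3 for g in range(n_genes)]
    centred = [[s[g] - means[g] for g in range(n_genes)] for s in samples]
    # Gram matrix of the centred samples; its eigenvalues are proportional
    # to the variances along the principal components.
    gram = [[sum(a * b for a, b in zip(r, c)) for c in centred] for r in centred]
    trace = gram[0][0] + gram[1][1] + gram[2][2]
    if trace == 0:
        return [0.0, 0.0, 0.0]  # all samples identical
    # Each row of `gram` sums to zero, so one eigenvalue is 0. The other two
    # are the roots of x^2 - trace * x + minors = 0.
    minors = (
        gram[0][0] * gram[1][1] - gram[0][1] ** 2
        + gram[0][0] * gram[2][2] - gram[0][2] ** 2
        + gram[1][1] * gram[2][2] - gram[1][2] ** 2
    )
    root = math.sqrt(max(trace * trace - 4 * minors, 0.0))
    return [(trace + root) / (2 * trace), (trace - root) / (2 * trace), 0.0]


print(explained_variance([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]))
```

For more than three samples, a general eigenvalue or singular value decomposition routine would be needed instead of the closed-form quadratic used here.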

4.5 Gene Expression with Differential Expression Statistics

Name: Gene Expression with Differential Expression Statistics
Rigid Identifier: gpDS_11_168_20_1
GeneProf URL: http://www.geneprof.org/show?id=gpDS_11_168_20_1
Data Type: FEATURES

Data Sample

Top 10 rows sorted in ascending order by log2FC MLF / ESC. Showing only the first 7 columns.

Name | NPC [RPKM] | MLF [RPKM] | ESC [RPKM] | log2FC NPC / ESC | adj. P NPC / ESC | log2FC MLF / ESC
Podxl | 4.573 | 0.004 | 17.257 | -1.937 | 1.000 | -11.918
Emb | 1.026 | 0.028 | 74.274 | -6.199 | 0.115 | -11.151
Chchd10 | 0.000 | 0.168 | 319.210 | 0.000 | 0.000 | -10.694
Klhl13 | 40.705 | 0.050 | 50.253 | -0.325 | 1.000 | -9.770
Mdk | 0.700 | 0.112 | 110.960 | -7.329 | 0.040 | -9.754
Gas6 | 5.308 | 0.008 | 7.614 | -0.542 | 1.000 | -9.645
Fbxo15 | 0.104 | 0.254 | 224.701 | -11.093 | 0.001 | -9.589
Rpl39l | 0.000 | 0.026 | 21.917 | 0.000 | 1.000 | -9.537
Fabp3 | 0.072 | 0.104 | 83.857 | -10.224 | 0.013 | -9.457
Aim1l | 0.000 | 0.003 | 2.478 | 0.000 | 1.000 | -9.456
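As a reading aid for the log2FC columns, a log2 fold change is the base-2 logarithm of the ratio of two expression values, so a value of -1 means half the expression and -10 roughly a thousandth. The sketch below is a naive ratio on two already normalised values and is not the procedure used to produce the table, whose entries need not match it exactly. Returning 0.0 when a value is zero is an assumption modelled on the rows above in which an expression of 0.000 is paired with a log2FC of 0.000.

```python
import math


def log2_fold_change(numerator, denominator):
    """Naive log2 ratio of two expression values.

    Returns 0.0 when either value is not positive (assumed convention).
    """
    if numerator <= 0 or denominator <= 0:
        return 0.0
    return math.log2(numerator / denominator)


print(log2_fold_change(8.0, 2.0))
print(log2_fold_change(1.0, 4.0))
print(log2_fold_change(0.0, 5.0))
```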
