Applied Bioinformatics Week 9 Jens Allmer. Theory I Gene Expression Microarray.
Applied Bioinformatics
Week 9
Jens Allmer
Theory I
• Gene Expression
• Microarray
Gene Expression
• Is there a transcript?
• How much transcript is made?
• Is there any difference to the DNA?
• Is there any difference to the annotation?
Measure Expression
• Northern/Western Blot
• qPCR
• Next generation sequencing
• Microarray
http://www.nature.com/leu/journal/v17/n7/images/2402974f1.jpg
Chip Construction
http://www.dkfz.de/gpcf/fileadmin/_migrated/RTE/RTEmagicC_AffyChipProduction.jpg.jpg
https://www.bcm.edu/cms_web/110/affy3.jpg
http://www.scq.ubc.ca/wp-content/GeneChip.gif
Chip Construction
http://angerer.swissbrain.org/expression_oveview.gif
http://www.scq.ubc.ca/wp-content/cDNAarray.gif
Bioinformatics Analysis
• Experimental design
• Standardization
• Data Analysis
  – Image processing, normalization
  – ...
  – Clustering, Visualization
• Data Storage
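The normalization step in the list above can be illustrated with a minimal sketch. It assumes one simple choice (log2 transform followed by per-array median centering), which is not necessarily the method used in this course; the intensity values and the function name log2_median_center are hypothetical.

```python
import math
from statistics import median

def log2_median_center(intensities):
    """Log2-transform raw spot intensities and subtract the array median."""
    # Non-positive intensities cannot be log-transformed; they are dropped here
    # (a simplification made for this sketch).
    logged = [math.log2(x) for x in intensities if x > 0]
    center = median(logged)
    return [value - center for value in logged]

# Two hypothetical arrays; the second is uniformly twice as bright as the first.
array_a = [100.0, 200.0, 400.0, 800.0, 1600.0]
array_b = [200.0, 400.0, 800.0, 1600.0, 3200.0]

normalized_a = log2_median_center(array_a)
normalized_b = log2_median_center(array_b)
print(normalized_a)
print(normalized_b)
```

After centering, both arrays have approximately the same values, so an overall brightness difference between chips no longer looks like a difference in expression.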
List of MA Data Sources
Database | Scope | Microarray experiment sets | Sample profiles | As of date
The Cancer Genome Atlas (TCGA) | collection of expression data for different cancers | 21229 | ? | 30-Aug-13
Stanford Microarray database | private and published microarray and molecule abundance database | 82542 | ? | 23-Oct-11
ArrayExpress at EBI | Any curated MIAME or MINSEQE compliant transcriptomics data | 24838 | 708914 | 28-Oct-11
Gene Expression Omnibus - NCBI | any curated MIAME compliant molecular abundance study | 25859 | 641770 | 28-Oct-11
Genevestigator | Gene expression search engine based on manually curated, well annotated public and proprietary microarray and RNA-seq datasets | 2615 | 119,400 | Aug-14
NCI mAdb | Hosts NCI data with integrated analysis and statistics tools | ? | 105,000 | Mar-12
ArrayTrack | ArrayTrack hosts both public and private data, including MAQC benchmark data, with integrated analysis tools | 1622 | 50,093 | Feb-12
ImmGen database | Open access across all immune system cells; expression data, differential expression, coregulated clusters, regulation | 267 | 1059 | Jan-12
UPSC-BASE | data generated by microarray analysis within Umeå Plant Science Centre (UPSC) | ~100 | ? | 15-Nov-07
UPenn RAD database | MIAME compliant public and private studies, associated with ArrayExpress | ~100 | ~2500 | Sept. 1, 2007
GeneNetwork system | Open access standard arrays, exons arrays, and RNA-seq data for genetic analysis (eQTL studies) with analysis suite | ~100 | ~10000 | July, 2010
caArray at NCI | Cancer data, prepared for analysis on caBIG | 41 | 1741 | 15-Nov-06
UNC Microarray database | provides the service for microarray data storage, retrieval, analysis, and visualization | ~31 | 2093 | 1-Apr-07
MUSC database | The database is a repository for DNA microarray data generated by MUSC investigators as well as researchers in the global research community. | ~45 | 555 | 1-Apr-07
UNC modENCODE Microarray database | Nimblegen customer 2.1 million array | ~6 | 180 | 17-Jul-09
A public repository for the archiving and distribution of gene expression data submitted by the scientific community.
MIAME-compliant data (MIAME: Minimum Information About a Microarray Experiment).
http://www.mged.org/Workgroups/MIAME/miame.html
Convenient for deposition of gene expression data, as required by funding agencies and journals.
Curated, online resource for gene expression data browsing, query, analysis and retrieval.
Gene Expression Omnibus (GEO): Gene Expression and Molecular Abundance Data Repository
GEO Gene Expression Omnibus - TeachLine
GEO Architecture
GEO has four kinds of data records:
Platform (GPL) = the technology used and the features detected.
Sample (GSM) = preparation and description of the sample.
Series (GSE) = a set of samples and how they are related.
DataSets (GDS) = sample data collections assembled by GEO staff.
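As a minimal sketch, the four record kinds above can be recognized from the accession prefix. The function name classify_accession and the regular expression are illustrative assumptions, not part of GEO itself; the example accessions are the ones used in the practice sessions.

```python
import re

# Prefixes taken from the four record kinds listed above.
RECORD_KINDS = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "DataSet",
}

# Assumed shape of an accession: a known prefix followed by digits only.
ACCESSION_PATTERN = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")

def classify_accession(accession):
    """Return the GEO record kind for an accession, or None if it does not match."""
    match = ACCESSION_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return RECORD_KINDS[match.group(1)]

for acc in ["GSE48874", "GSM1186226", "ERX628533"]:
    print(acc, "->", classify_accession(acc))
```

ERX628533 is not a GEO accession (it is searched for in the sequence read archive practice), so the sketch returns None for it.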
Submitters may provide raw data:
• Original microarray scans
• Raw quantification data
GEO Architecture
GPL: Platform descriptions (submitted by manufacturer*)
GSM: Raw/processed spot intensities from a single slide/chip (submitted by experimentalists)
GSE: Grouping of slide/chip data, “a single experiment” (submitted by experimentalists)
GDS: Grouping of experiments (curated by NCBI)
GEO Home Page
Simple interface to:
• show status
• find documentation
• query data
• browse data
• submit data
Basic Search: Repository Browser
Selecting the total public data or Repository Browser links on the GEO home page takes you to the Repository Browser, which lists:
• the number of each type of submitted file, both public and unreleased
• the total number of each technology type under Platforms
• the total number of each Sample type
Basic Search: Browse Platforms
All GEO submissions need to be associated with a platform file. These describe the features on a given platform, required to understand the data.
A platform file must be submitted if one is not already present in GEO. Commercial array platform files are submitted to GEO by the manufacturer.
Basic Search: Browse Platforms
Accession: GEO ID
Title: brief description of platform
Contact: submitter
Samples: number of samples in GEO associated with platform ID
Technology: platform type
Release date: when file is publicly accessible
The table can be sorted on any field except organism by clicking on the header. Specific platform files can be found using the ‘Find Platform’ option.
Basic Search: Find Platforms
• Select ‘Find Platform’
• Select company
• Select distribution
• Select species
• Enter title keyword
Basic Search: Find Platforms (continued)
• Start the platform search
• Select the accession for the U133 plus 2.0 array
• Scroll down to find data table information
Data Retrieval: Browse Series
Data is submitted to GEO as a Series, which represents the experiment design.
Selecting Browse>Series brings up a list sorted by release date. Selecting a Series ID brings up the Series file summary.
Data Retrieval: Series Accession Page
GEO Accession Results Display Options
Scope controls what information is displayed: Self, Platform, Samples, Series, or Family
Format controls how information is displayed: HTML, SOFT (Simple Omnibus Format in Text), or MINiML (MIAME Notation in Markup Language)
Amount controls how much information is displayed: Brief, Quick, Full, or Data
All GEO accession results pages have the same header that allows different views and formats for the data to be displayed
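The same display options can also be expressed as URL query parameters. The sketch below only builds the URL and makes no network request. It assumes the acc.cgi endpoint and the parameter names acc, targ, form, and view from GEO's programmatic-access documentation, with form=text meaning SOFT; check these against the current GEO documentation before relying on them. The function name geo_accession_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the GEO accession display.
GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"

def geo_accession_url(accession, scope="self", fmt="text", amount="quick"):
    """Build a GEO accession URL from Scope, Format, and Amount settings."""
    # Assumed mapping: Scope -> targ, Format -> form, Amount -> view.
    query = urlencode({"acc": accession, "targ": scope, "form": fmt, "view": amount})
    return f"{GEO_ACC_URL}?{query}"

url = geo_accession_url("GSE48874")
print(url)
```

Opening the printed URL in a browser should correspond to choosing Scope "Self", Format "SOFT", and Amount "Quick" in the page header, if the assumed parameter names are right.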
Data Retrieval: Series Accession Page
Biological sample summary
Design summary
Publication information
Platform (total)
Samples (total)
Data Retrieval: Sample File Summary
Sample preparation
Hybridization and data processing
Platform Series
Data Retrieval: Sample File Data Table
Data table field descriptions
Truncated data table from Quick view
Total data rows and file size
Supplementary raw data file
Querying GEO with IDs from Papers
A common way to access GEO data is through accessions cited in papers. Online journals include hyperlinks to the GEO accession page. Alternatively, enter the accession into the Query>GEO accession text box on the GEO home page.
GEO Links in PubMed Search Results
One option for displaying PubMed search results is GEO DataSet links. When present, the results page is actually from Entrez GEO DataSets.
Advanced Searches
GEO data can be queried as:
• Datasets: experiment-centric view using Entrez GEO DataSets
• Gene profiles: gene-centric view using Entrez GEO Profiles
Selecting either takes you to a similar Entrez introduction page
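Both views can also be queried programmatically through NCBI's E-utilities. The sketch below only constructs search URLs and sends no request. It assumes the esearch endpoint and the Entrez database names "gds" (GEO DataSets) and "geoprofiles" (GEO Profiles); the search terms and the function name geo_search_url are hypothetical.

```python
from urllib.parse import urlencode

# Assumed E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed Entrez database names for the two GEO views.
ENTREZ_DATABASES = {"datasets": "gds", "profiles": "geoprofiles"}

def geo_search_url(view, term, max_results=20):
    """Build an esearch URL for the experiment-centric or gene-centric view."""
    query = urlencode({"db": ENTREZ_DATABASES[view], "term": term, "retmax": max_results})
    return f"{ESEARCH_URL}?{query}"

print(geo_search_url("datasets", "breast cancer"))
print(geo_search_url("profiles", "TP53"))
```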
Querying GEO DataSets
Start a GEO DataSets search with the Query>DataSets text box. This brings up an Entrez GEO DataSets results form.
Total results
Number of DataSets
Number of Platforms
Number of Series
DataSet Search Result
DataSet ID
Description
Platform
Reference Series
Supplementary files
Number of Samples and truncated list
Cluster image
Select the DataSet ID or click on the cluster image to go to the DataSet record.
GEO DataSet Record
Experiment design and DataSet information.
Sample and analysis information. Data retrieval.
Selecting analysis takes you to the data clustering interface.
Selecting the cluster image takes you to the clustering page
GEO Gene Profiles
GEO DataSet ID
Platform ID, Platform Feature ID
Gene description
Target sequence accession
Expression profile
GEO Gene Profiles use gene IDs from Platform files to show the expression of a gene across DataSets.
Entering a gene ID into the Query>Gene profiles text box takes you to the Entrez results page.
GEO BLAST
On the GEO BLAST page, enter sequences in FASTA format or GenBank accessions, or select sequence files on a local disk, for blastn comparisons.
These are compared to the GenBank sequences listed in Platform files associated with GEO DataSets.
From the BLAST result page, select the ‘E’ option to the right of an alignment to show GEO Gene Profiles for that sequence in GEO DataSets.
E button
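GEO BLAST accepts FASTA input, so it can help to see what a FASTA record looks like. The sketch below parses a small FASTA string: each record starts with a ">" header line followed by one or more sequence lines. The sequences and the function name parse_fasta are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines belonging to the most recent header.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

# Hypothetical input with two records; the first sequence spans two lines.
example = """>query_1 hypothetical example sequence
ACGTACGTAC
GGTTAACC
>query_2 another hypothetical sequence
TTGACA
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```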
End Theory I
• Mindmapping
• 10 min break
Practice I
• Gene Expression Omnibus
  – http://www.ncbi.nlm.nih.gov/geo/
NCBI GEO
• Take 15 minutes to browse the website
Repository
• Go to the repository browser
  – http://www.ncbi.nlm.nih.gov/geo/summary/
• Explore the available tabs
• What kind of different data is available?
Where is the actual data?
• Try to find the following accessions:
  – GSE48874
  – GSM1186226
End Practice I
• 15 min break
Theory II
• Next generation sequencing
Microarray vs NGS
[Figure: number of papers in PubMed per publication year, 1990 to 2015, for Microarray and NGS; y-axis from 0 to 1400]
Doug Brutlag 2011
The Human Genome
How fast is the cost going down?
• 2006: $50 million
• 2008: $500,000
• 2009: $50,000
• 2010: $20,000
• 2011: $5,000
• 2012: ??? $1,000
Thanks to Serafim Batzoglou
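To put numbers on "how fast", the fold reduction between consecutive entries can be computed directly from the figures on the slide. This is only arithmetic on the values listed above; the 2012 value is marked "???" on the slide and is treated here as a projection.

```python
# Cost estimates copied from the slide; the 2012 entry is a projection ("???").
costs = [
    (2006, 50_000_000),
    (2008, 500_000),
    (2009, 50_000),
    (2010, 20_000),
    (2011, 5_000),
    (2012, 1_000),
]

def fold_reductions(entries):
    """Return (start_year, end_year, fold) for each consecutive pair of entries."""
    return [
        (y0, y1, c0 / c1)
        for (y0, c0), (y1, c1) in zip(entries, entries[1:])
    ]

steps = fold_reductions(costs)
for start, end, fold in steps:
    print(f"{start} -> {end}: {fold:g}-fold cheaper")

# Overall reduction from the first to the last entry.
total_fold = costs[0][1] / costs[-1][1]
print(f"{costs[0][0]} -> {costs[-1][0]}: {total_fold:g}-fold overall")
```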
Platforms
• Roche/454 FLX: 2004
• Illumina Solexa Genome Analyzer: 2006
• Applied Biosystems SOLiD™ System: 2007
• Helicos HeliScope™: recently available
• Pacific Biosciences SMRT: launching 2010
• And many more
Doug Brutlag 2011
Illumina Solexa Sequencing Technology
Doug Brutlag 2011
Pacific Biosciences Sequencing
Doug Brutlag 2011
Phospholinked Fluorophores
Doug Brutlag 2011
Processive Synthesis
Applications of next-generation sequencing
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
OK, but where is the data?
• http://www.1000genomes.org/
• http://trace.ddbj.nig.ac.jp/dra/index_e.html
• http://www.ebi.ac.uk/ena
• http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies
End Theory II
• Mindmapping
• 10 min break
Practice II
NCBI
• http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi
• Browse the webpage for 15 minutes
Available Data
• Search for human data
• How much data is available?
• Find accession ERX628533
• How large is the dataset?
• Why is it so large?
End Practice II
Homework
• Select one next generation sequencing platform and give a step-by-step description of how it works
• Max 500 words and at most 5 figures.
http://dx.doi.org/10.6084/m9.figshare.100940
http://www.nature.com/nbt/journal/v26/n10/fig_tab/nbt1486_F1.html
http://www.nature.com/nbt/journal/v26/n10/fig_tab/nbt1486_F2.html
http://www.nature.com/nbt/journal/v26/n10/fig_tab/nbt1486_F3.html