Brief introduction to Bioinformatics

Cynthia Alexander Rascón [email protected] @4chromat

June 11th, 2014

Brief Intro to Bioinformatics: Data Analysis on Biological Data.

Circular representation of a R. etli chromosome (CCG-UNAM)

What is Bioinformatics?

Data analysis of terabytes of next-generation, high-throughput nucleotide sequences, done by cyborgs in an automated laboratory in the basement of Stark’s Tower.

NO.

However, THERE IS a lot of “next-generation”, “high-throughput” nucleotide sequence analysis involved, i.e. analysis of DNA or RNA obtained through classic and cutting-edge sequencing technologies.*

*And that is just one fraction of this fast-advancing field...

Cynthia Alexander Rascón June, 2014.

Throwback: Refresher on DNA

Images from http://www.aboutthemcat.org/ Vids: WEHImovies

About 37 trillion cells compose an adult human.

Each has a nucleus containing an *almost* exact copy of the same DNA.

DNA to RNA

RNA to Protein

Then Proteins to Complexes... to Tissues... to Organs... to Full organism.
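The "DNA to RNA" and "RNA to Protein" steps can be sketched in a few lines of Python. This is a simplified illustration, not a full model of the cell: it assumes the input is the coding strand, uses the standard genetic code, and the example sequence is made up.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """DNA to RNA: rewrite the coding strand with U in place of T."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """RNA to protein: read one codon at a time until a stop codon ('*')."""
    dna = rna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGGCCAAGTAA")  # made-up example sequence
print(rna)             # AUGGCCAAGUAA
print(translate(rna))  # MAK
```
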

DNA: All living forms rely on it.

Similarity at the DNA level with Piqué: 98.7%, 60%, 50%.

DNA ~ Cookbook with all the recipes needed for a fully formed organism.

RNA ~ The reader.

Proteins ~ The ingredients.

Then, what is Bioinformatics?

Most popular languages:

- R, Perl, Python, Java, C and C++

Some types of data analyzed:

- DNA, RNA and protein sequences, in small sets or at much larger scales.

- Data representing their interactions, within a cell, an organism, a population or even across species.

Storage and organization:

- Databases and ontologies

Retrieval:

- DNA sequencing

- RNA sequencing

- Protein sequencing

- Protein-DNA interaction maps

- Protein-RNA interaction maps

- Protein-Protein interaction maps

… and beyond

Frequent analyses:

- Comparative analysis (1/0 + similarity).

- Expression studies (1/0 + time and/or condition).

- Regulation studies (how the presence of X changes expression).

- Structure of RNAs and proteins.

- Network and systems biology.
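As a rough illustration of the "1/0 + similarity" idea behind comparative analysis, the sketch below scores presence or absence of genes and percent identity between two sequences. It assumes the sequences are already aligned to the same length; the gene names and sequences are hypothetical.

```python
def presence_absence(genes_of_interest, genome_genes):
    """The '1/0' part: 1 if a gene is present in the genome, 0 if it is absent."""
    return {gene: int(gene in genome_genes) for gene in genes_of_interest}


def percent_identity(seq_a, seq_b):
    """The 'similarity' part: share of matching positions in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 0.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical gene names and sequences, for illustration only.
print(presence_absence(["geneA", "geneB"], {"geneA", "geneC"}))  # {'geneA': 1, 'geneB': 0}
print(percent_identity("ACGTACGT", "ACGTTCGA"))                  # 75.0
```

Real comparisons first need an alignment step, which this sketch deliberately leaves out.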

Going back to the human genome...

Draft published in 2001.

‘Completed’ in 2003.

~3 billion base pairs sequenced.

Where to Download it? http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/

A free and up-to-date complete human (hg38) genome sequence in the 2bit file format (797 MB)... save 3.2 GB of disk space for it.
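Why is the 2bit file so much smaller than the space you need to set aside? With four possible bases, each one fits in two bits instead of one text character. The sketch below packs bases that way and does the back-of-the-envelope arithmetic for ~3 billion base pairs. The bit codes follow my reading of the UCSC 2bit description (T=00, C=01, A=10, G=11) and should be treated as an assumption. The real format also stores headers, masking and runs of unknown bases, which this sketch ignores.

```python
# Assumed base codes (per the UCSC 2bit description): T=00, C=01, A=10, G=11.
CODES = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}


def pack_2bit(seq):
    """Pack four bases per byte, first base in the most significant bits."""
    packed = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # zero-pad an incomplete final byte
        packed.append(byte)
    return bytes(packed)


GENOME_BASES = 3_000_000_000         # ~3 billion base pairs
packed_bytes = GENOME_BASES * 2 // 8  # two bits per base
text_bytes = GENOME_BASES             # one character (one byte) per base

print(pack_2bit("TCAG").hex())  # 1b
print(packed_bytes / 1e6)       # 750.0 (MB, packed)
print(text_bytes / 1e9)         # 3.0 (GB, plain text)
```
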

So, DNA sequencing.

More info? http://www.genome.gov/

First methods were developed in the mid-1970s (Sanger sequencing).

The Human Genome Project took 13 years, and was an international collaboration between universities and research centers in the US, UK, France, Germany, Japan and China, alongside a parallel effort by a private biotech company, Celera Genomics.

However, nowadays…

Example of Bioinformatic Success

Analysis of 5M mutations from 7K cancers, obtaining 20 distinct mutational signatures.

* Mutations, and any accessory characteristics, are incorporated into the set of features by which a mutational signature is defined.

LB Alexandrov et al. Nature 000, 1-7 (2013)

Each ‘rainfall’ plot represents an individual cancer sample, and each dot a mutation placed at its position in the human genome.

Annotation and correlation of each of the signatures across samples and patient information gives great insights into cancer prevention and therapy.
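To give a feel for the data behind a ‘rainfall’ plot, here is a minimal sketch that turns a list of mutation positions into plot coordinates. It assumes the common convention of putting genomic position on the x-axis and the log-scaled distance to the previous mutation on the y-axis; that convention and the positions below are assumptions for illustration, not taken from the slides.

```python
import math


def rainfall_points(positions):
    """Return (position, log10 distance to the previous mutation) pairs.

    Assumes all positions lie on one chromosome; the first mutation has
    no predecessor and is skipped, as are exact duplicates (distance 0).
    """
    ordered = sorted(positions)
    points = []
    for previous, current in zip(ordered, ordered[1:]):
        distance = current - previous
        if distance > 0:
            points.append((current, math.log10(distance)))
    return points


# Hypothetical mutation positions (base-pair coordinates).
mutations = [1_500, 100, 1_100, 1_600]
for position, log_distance in rainfall_points(mutations):
    print(position, round(log_distance, 2))
```

Clusters of mutations that sit close together show up as dots falling toward the bottom of such a plot.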

Machine Learning (ML)

Adapted from http://cardsagainsthumanity.com/ and Wikipedia

Tone of a highly likely winner response:

Intellectual. Goofy. Witty. Nonsense.

How is ML used in Bioinformatics?

Jensen and Bateman Bioinformatics. 2011

#7!!

Random Forests, Support Vector Machines and Artificial Neural Networks have been successfully used for:

- Protein structure prediction

- Protein-protein relationships at structural and evolutionary levels

- Pairwise sequence similarity

- Microarray gene expression data

- Genome-wide association studies

RF: http://www.cs.cmu.edu/~qyj/papersA08/11-rfbook.pdf

ANN: Manning T, Sleator RD, Walsh P. Bioengineered. 2014

SVM: Yang ZR. Brief Bioinform. 2004

Bioinformatics and Bayes

Bayesian approaches are desirable when data is subject to many sources of variation.

For example, a typical gene expression analysis…

1. A normalisation process corrects gene expression levels across a sample.

2. The normalised data is then processed to identify differentially expressed genes, ignoring any uncertainty in (1).

3. The differentially expressed genes are then used for further analysis, ignoring the uncertainty in (2).

* However, it is also possible to develop models for the analysis of unnormalised data that correctly propagate uncertainty across the various levels of analysis.

Like the other methods, Bayesian approaches are also used for prediction of protein-protein interactions, sequence similarity and network inference.

Wilkinson DJ. Brief Bioinform. 2007
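The three-step gene expression pipeline described on this slide can be sketched as follows. This is a toy version under stated assumptions: counts-per-million normalisation, a log2 fold-change cut-off of 1, a pseudocount of 1, and made-up gene names and counts are all illustrative choices, not taken from the talk. Note how each step hands a single number to the next one. That is exactly where the uncertainty gets dropped, and where a Bayesian model would carry a distribution forward instead.

```python
import math


def counts_per_million(counts):
    """Step 1: normalise raw counts by the sample's total (a point estimate)."""
    total = sum(counts.values())
    return {gene: 1e6 * count / total for gene, count in counts.items()}


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Step 2a: compare normalised values, ignoring any uncertainty from step 1."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
    }


def differentially_expressed(fold_changes, threshold=1.0):
    """Step 2b: hard yes/no call per gene; step 3 would only see this list."""
    return sorted(gene for gene, fc in fold_changes.items() if abs(fc) >= threshold)


# Hypothetical raw read counts for two samples.
control = {"geneA": 100, "geneB": 300, "geneC": 600}
treated = {"geneA": 500, "geneB": 300, "geneC": 200}

fold_changes = log2_fold_changes(counts_per_million(control), counts_per_million(treated))
print(differentially_expressed(fold_changes))  # ['geneA', 'geneC']
```
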

Big Problems and Emerging Avenues

Some of the current biggest problems are STORAGE, TRANSPORTABILITY and REPRODUCIBILITY. Currently, we have way more than we can chew.

The National Center for Biotechnology Information.

IBM Watson is being used to find better personalized cancer treatments at MSKCC (NY). Also in NY, there is the recently opened NY Genome Center... And you have probably heard about companies like 23andMe.

Startups in Biotech?

“The Human Genome project spurred a revolution in biotechnology innovation around the world and played a key role in making the U.S. the global leader in the new biotechnology sector.” --NIH website.

http://lifescivc.com/2014/04/startup-tech-incubator-announces-biotech-experiments/

P.S. Some future goals of the HGP:

➔ The Cancer Genome Atlas (http://cancergenome.nih.gov/).

➔ More effective drugs with fewer side effects than those available today.

➔ Access to high-throughput screening of small-molecule sets to explore tons of proteins.

➔ Cut the cost of sequencing an individual’s genome to $1,000 or less.

➔ Personalised genome analysis as a powerful form of preventive medicine.

➔ And beyond medical conditions and data analysis: the possibility of connecting DNA variation with non-medical traits (such as personality) will challenge society, making ethical, legal and social research REALLY important!!

NIH FactSheets

Any questions? Feel free to contact me: [email protected] @4chromat

Thanks to Ed for inviting!