Intro to field_of_bioinformatics


Page 1: Intro to field_of_bioinformatics

09/05/13 K-INBRE Bioinformatics Core KSU

Bioinformatics

Introduction to the field of bioinformatics

Sept, 2013
Jennifer Shelton

K-INBRE Bioinformatics Core KSU

Page 2: Intro to field_of_bioinformatics

Outline

I. Basic concepts
   i. Definition of bioinformatics
   ii. Databases (flat-file and relational)
   iii. Assembly (Overlap-layout-consensus)
II. Steps you can take on your own

Page 3: Intro to field_of_bioinformatics

Definition of bioinformatics

[Diagram: Acquire data, Store/archive data, Organize data, Analyze data, Visualize data. Data types: Biological, Medical, Behavioral, or Health]

“Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.”

-NIH Biomedical Information Science and Technology Initiative Consortium 2000

Page 5: Intro to field_of_bioinformatics

Problem with volume

“We believe the field of bioinformatics for genetic analysis will be one of the biggest areas of disruptive innovation in life science tools over the next few years,”

-Isaac Ro, Goldman Sachs

Mark Smiciklas, Flickr.com/photos/intersectionconsulting

Per year worldwide we can generate ~13,000,000,000,000,000 bp of data
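To give a feel for that number, here is a rough back-of-the-envelope sketch. It assumes the most compact plain encoding (2 bits per base, no quality scores or read names) and a hypothetical sustained 1 Gbit/s network link. Both are illustrative assumptions that do not come from the slides, and real sequencing files are considerably larger.

```python
BASES_PER_YEAR = 13_000_000_000_000_000  # figure quoted on the slide
BITS_PER_BASE = 2            # assumption: A, C, G, T packed into 2 bits each
LINK_BITS_PER_SECOND = 1e9   # assumption: hypothetical sustained 1 Gbit/s link

# Storage needed if every base were packed as tightly as possible.
packed_bytes = BASES_PER_YEAR * BITS_PER_BASE // 8
petabytes = packed_bytes / 1e15

# Time to move that much data over the assumed link.
transfer_days = packed_bytes * 8 / LINK_BITS_PER_SECOND / 86_400

print(f"{petabytes:.2f} PB packed, roughly {transfer_days:.0f} days to transfer")
```

Even under these optimistic assumptions, a year of data takes months to move over a single link.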

Page 6: Intro to field_of_bioinformatics

Problem with volume

“This unprecedented amount of sequencing information poses bottlenecks that vary, depending on application, at the level of data extraction, analysis, and interpretation” “These challenges have become part and parcel of the biomedical research community where investigators have increasingly needed to incorporate bioinformatics and biostatistics into their armamentarium.”

-Opportunities and Challenges Associated with Clinical Diagnostic Genome Sequencing: A Report of the Association for Molecular Pathology. The Journal of Molecular Diagnostics - November 2012

Mark Smiciklas, Flickr.com/photos/intersectionconsulting

Page 7: Intro to field_of_bioinformatics

Problem with volume

“It sounds like an analog solution in a digital age,”
-Sifei He, head of cloud computing for BGI (referring to FedExing disks of data because internet connections are often too slow)

NY Times 2011 article: DNA Sequencing Caught in Deluge of Data http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html?pagewanted=all&_r=0

Page 8: Intro to field_of_bioinformatics

Examples of bioinformatics tools

Page 9: Intro to field_of_bioinformatics

Outline

I. Basic concepts
   i. Definition of bioinformatics
   ii. Databases (flat-file and relational)
   iii. Assembly (Overlap-layout-consensus)
II. Steps you can take on your own

Page 10: Intro to field_of_bioinformatics

Flat-file databases

GenBank: http://www.ncbi.nlm.nih.gov/genbank/

‘records’: the data about one unique object
‘fields’: the same kind of data about different objects

Page 11: Intro to field_of_bioinformatics


Flat-file databases

Any flat-file database, like GenBank, can be thought of as a single spreadsheet, called a ‘table’, made up of ‘fields’ and ‘records’.
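The spreadsheet picture can be sketched in a few lines of Python. This is only an illustration: the field names and the three rows are made-up placeholders, not real GenBank entries, and GenBank's own flat-file format is richer than a tab-delimited table.

```python
import csv
import io

# Hypothetical flat-file table: a header row of field names, then one row per record.
# The accessions and lengths below are invented placeholders for illustration only.
FLAT_FILE = """accession\torganism\tlength_bp
EX000001\tHomo sapiens\t1500
EX000002\tTribolium castaneum\t980
EX000003\tHomo sapiens\t2200
"""

def load_table(text):
    """Parse tab-delimited text into a list of records (one dict per row)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def field_values(table, field):
    """Return one field (a column): the same kind of data for every record."""
    return [record[field] for record in table]

table = load_table(FLAT_FILE)
first_record = table[0]                      # a record: the data about one object
organisms = field_values(table, "organism")  # a field: one attribute across objects
print(first_record)
print(organisms)
```

Reading across a row gives a record; reading down a column gives a field.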

Page 12: Intro to field_of_bioinformatics

Relational databases

Have multiple tables with some shared fields and some different fields.

** ‘fields’: the same kind of data about different objects

KEGG: http://www.genome.jp/kegg/pathway.html

Page 13: Intro to field_of_bioinformatics

Relational databases

Relational databases are like multiple tables that are linked by a shared field. The shared field acts like a “key” between them.

Example: KEGG PATHWAY: hsa05204 (www.genome.jp/dbget-bin/www_bget?pathway+hsa05204)

Organism: Homo sapiens (human) [GN:hsa]

Gene:
1543 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 (EC:1.14.14.1) [KO:K07408] [EC:1.14.14.1]
1576 CYP3A4; cytochrome P450, family 3, subfamily A, polypeptide 4 (EC:1.14.13.67 1.14.13.97 1.14.13.32) [KO:K07424] [EC:1.14.14.1]
1577 CYP3A5; cytochrome P450, family 3, subfamily A, polypeptide 5 (EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
1551 CYP3A7; cytochrome P450, family 3, subfamily A, polypeptide 7 (EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
64816 CYP3A43; cytochrome P450, family 3, subfamily A, polypeptide 43 (EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
5743 PTGS2; prostaglandin-endoperoxide synthase 2 (prostaglandin G/H synthase and cyclooxygenase) (EC:1.14.99.1) [KO:K11987] [EC:1.14.99.1]
10 NAT2; N-acetyltransferase 2 (arylamine N-acetyltransferase) (EC:2.3.1.5) [KO:K00622] [EC:2.3.1.5]
9 NAT1; N-acetyltransferase 1 (arylamine N-acetyltransferase) (EC:2.3.1.5) [KO:K00622] [EC:2.3.1.5]
1544 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 (EC:1.14.14.1) [KO:K07409] [EC:1.14.14.1]
6799 SULT1A2; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 2 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
6817 SULT1A1; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 1 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
6818 SULT1A3; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 3 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
445329 SULT1A4; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 4 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
1545 CYP1B1; cytochrome P450, family 1, subfamily B, polypeptide 1 (EC:1.14.14.1) [KO:K07410] [EC:1.14.14.1]
1558 CYP2C8; cytochrome P450, family 2, subfamily C, polypeptide 8 (EC:1.14.14.1) [KO:K07413] [EC:1.14.14.1]
1562 CYP2C18; cytochrome P450, family 2, subfamily C, polypeptide 18 (EC:1.14.14.1) [KO:K07413] [EC:1.14.14.1]
1557 CYP2C19; cytochrome P450, family 2, subfamily C, polypeptide 19 (EC:1.14.13.48 1.14.13.49 1.14.13.80) [KO:K07413] [EC:1.14.14.1]
1559 CYP2C9; cytochrome P450, family 2, subfamily C, polypeptide 9 (EC:1.14.13.48 1.14.13.49 1.14.13.80) [KO:K07413] [EC:1.14.14.1]
2052 EPHX1; epoxide hydrolase 1, microsomal (xenobiotic)
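The idea of a shared field acting as a key can be sketched with Python's built-in sqlite3 module. The gene IDs, symbols, and KO numbers are copied from the hsa05204 listing above, but the two-table layout and the table and column names are assumptions made for illustration. This is not KEGG's actual schema.

```python
import sqlite3

# A few rows copied from the hsa05204 listing; the table layout is hypothetical.
GENES = [
    (1543, "CYP1A1", "K07408"),
    (1576, "CYP3A4", "K07424"),
    (1577, "CYP3A5", "K07424"),
    (10, "NAT2", "K00622"),
]
PATHWAY_GENES = [("hsa05204", 1543), ("hsa05204", 1576), ("hsa05204", 10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, ko TEXT)")
conn.execute("CREATE TABLE pathway_gene (pathway_id TEXT, gene_id INTEGER)")
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", GENES)
conn.executemany("INSERT INTO pathway_gene VALUES (?, ?)", PATHWAY_GENES)

# gene_id appears in both tables, so it can act as the key that links them.
rows = conn.execute(
    """
    SELECT pg.pathway_id, g.symbol, g.ko
    FROM pathway_gene AS pg
    JOIN gene AS g ON g.gene_id = pg.gene_id
    ORDER BY g.gene_id
    """
).fetchall()
conn.close()
print(rows)
```

Each table stays small and focused, and the join rebuilds the combined view only when it is needed.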

Page 14: Intro to field_of_bioinformatics

Outline

I. Basic concepts
   i. Definition of bioinformatics
   ii. Databases (flat-file and relational)
   iii. Assembly (Overlap-layout-consensus)
II. Steps you can take on your own

Page 15: Intro to field_of_bioinformatics

Assembly

Of the ~13,000,000,000,000,000 bp of sequence data we can generate each year, most does not cover the full length of a molecule of DNA or RNA.

Instead, scientists get back multiple copies of their genome (or transcriptome), but all in short segments (between 50 bp and several kb).

Steps of Overlap-Layout-Consensus (OLC):

1) Let's think of a genome like the text of a book. We get back multiple copies of the book

Page 16: Intro to field_of_bioinformatics

OLC Assembly

1) Instead of being nicely bound, we get randomly shredded text all mixed together from our multiple copies

ice was beginning to get very tired of
sitting by her tister on the bank, and of
having nothing to do
Alice was beginning to get vory tired of sitting by her sister on
the bank, and of having nothing to do: once
lice was beginning to get
very tired of sitting by her sister on the bank, and
of having nothing

Page 17: Intro to field_of_bioinformatics

OLC Assembly

2) We look for lines that overlap for more than some minimum number of letters (in these programs all overlaps are found, then a single “path” is found through this “graph” of overlaps)

ice was beginning to get very tired of
sitting by her tister on the bank, and of
having nothing to do
Alice was beginning to get vory tired of sitting by her sister on
the bank, and of having nothing to do: once
lice was beginning to get
very tired of sitting by her sister on the bank, and
of having nothing

Page 18: Intro to field_of_bioinformatics

OLC Assembly

2) We look for lines that overlap for more than some minimum number of letters (in these programs overlaps are found, then a single “path” is found through this “graph” of overlaps)

ice was beginning to get very tired of
sitting by her tister on the bank, and of
having nothing to do
Alice was beginning to get vory tired of sitting by her sister on
the bank, and of having nothing to do: once
lice was beginning to get
very tired of sitting by her sister on the bank, and
of having nothing
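The overlap step can be sketched in Python using three of the lines from the slide. This is a simplified illustration and not how any particular assembler is implemented: it only accepts exact suffix-to-prefix matches, so reads with errors (such as "vory" or "tister") would need a mismatch-tolerant comparison. The function names, the `min_overlap` default, and the hand-picked path are all assumptions.

```python
def overlap_length(left, right, min_overlap=5):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_overlap` letters exists.
    """
    longest = min(len(left), len(right))
    for size in range(longest, max(min_overlap, 1) - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0

def overlap_graph(reads, min_overlap=5):
    """Map each ordered pair of read indices (i, j) to its overlap length."""
    edges = {}
    for i, left in enumerate(reads):
        for j, right in enumerate(reads):
            if i != j:
                size = overlap_length(left, right, min_overlap)
                if size:
                    edges[(i, j)] = size
    return edges

def merge_along_path(reads, path, edges):
    """Join reads in path order, writing each overlapping stretch only once."""
    merged = reads[path[0]]
    for prev, nxt in zip(path, path[1:]):
        merged += reads[nxt][edges[(prev, nxt)]:]
    return merged

reads = [
    "ice was beginning to get very tired of",
    "very tired of sitting by her sister on the bank, and",
    "the bank, and of having nothing to do: once",
]
edges = overlap_graph(reads)
merged = merge_along_path(reads, [0, 1, 2], edges)  # path chosen by hand here
print(edges)
print(merged)
```

Choosing the path automatically (the layout step) is the hard part in practice, and this sketch leaves it out.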

Page 19: Intro to field_of_bioinformatics

OLC Assembly

3) We move column by column, counting the letters in each column, and make a note of the most common letter (take the consensus)

ice was beginning to get very tired of
sitting by her tister on the bank, and of
having nothing to do
Alice was beginning to get vory tired of sitting by her sister on
the bank, and of having nothing to do: once
lice was beginning to get
very tired of sitting by her sister on the bank, and
of having nothing

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do
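The column-by-column vote can be sketched as follows, using five of the lines from the slide. It assumes the layout step has already worked out where each read starts. Those start offsets were counted by hand for this example, and the function name and the (offset, text) representation are illustrative choices.

```python
from collections import Counter

def consensus(placed_reads):
    """Majority vote per column over reads given as (offset, text) pairs."""
    width = max(offset + len(text) for offset, text in placed_reads)
    columns = [Counter() for _ in range(width)]
    for offset, text in placed_reads:
        for i, letter in enumerate(text):
            columns[offset + i][letter] += 1
    # Pick the most common letter in each column; "-" marks an uncovered column.
    return "".join(col.most_common(1)[0][0] if col else "-" for col in columns)

# Offsets are assumed to come from the layout step (hand-counted here).
placed_reads = [
    (0, "Alice was beginning to get vory tired of sitting by her sister on"),
    (1, "lice was beginning to get"),
    (2, "ice was beginning to get very tired of"),
    (27, "very tired of sitting by her sister on the bank, and"),
    (41, "sitting by her tister on the bank, and of"),
]
result = consensus(placed_reads)
print(result)
```

The misspellings "vory" and "tister" are each outvoted two to one by the other reads covering the same column, which is why deeper coverage gives a more reliable consensus.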

Page 24: Intro to field_of_bioinformatics

Assemblies

[Figure: Relative reflectance of EWC vs. wavelength (nm), 400 to 800 nm, for big bluestem and sand bluestem. Series: Sand bluestem (removed), Sand bluestem (intact), Big bluestem (removed), Big bluestem (intact). Credit: Bischof B.]

[Photos: Bittersweet and Balsam (homenursery.com, gardeninginsomnia.com); Red flour beetle (credit: Day E.)]

[Charts for Sand bluestem, Balsam, and Bittersweet: "assembly length and number of contigs" (cumulative length of sequences (Mb) and number of sequences, by assembly k-mer value or name) and "N values" (N75, N50, and N25 contig length (kb), by assembly k-mer value or name). Sand bluestem assemblies: k-mer 23 to 61, MIRA (454), MIRA cluster. Balsam and Bittersweet assemblies: k-mer 27, 37, 47, 57, merge, CDH cluster, MIRA cluster.]

Assembly statistics tables (in the order they appear on the slide)

k-mer          N75 (kb)  N50 (kb)  N25 (kb)  Cumulative length of sequences (Mb)  Number of sequences x 10^5
27             1.219     2.028     3.126     142.633358                           1.28113
37             1.206     2.008     3.087     128.100083                           1.1091
47             1.195     1.977     3.051     113.176134                           0.93839
57             1.271     2.035     3.096     102.507455                           0.82755
merge          1.41      2.211     3.331     345.752982                           2.31102
CDH cluster    1.44      2.27      3.422     84.202533                            0.59174
MIRA cluster   1.804     2.69      3.941     105.920843                           0.50279

k-mer          N75 (kb)  N50 (kb)  N25 (kb)  Cumulative length of sequences (Mb)  Number of sequences x 10^5
27             1.213     2.11      3.221     175.505163                           1.61952
37             1.176     2.026     3.068     154.222168                           1.36947
47             1.168     1.948     2.932     129.331497                           1.07545
57             1.218     1.974     2.95      111.672465                           0.90385
merge          1.404     2.23      3.299     418.762352                           2.77833
CDH cluster    1.399     2.274     3.339     96.411479                            0.70852
MIRA cluster   1.825     2.676     3.856     123.666263                           0.59598
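The N25, N50, and N75 values reported for these assemblies follow the usual definition: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches that percentage of the whole assembly. A minimal sketch is below. The contig lengths are made-up numbers for illustration and are not taken from the assemblies on this slide.

```python
def n_value(contig_lengths, percent=50):
    """Return the Nx contig length (N50 by default).

    Contigs are added longest first; the answer is the length of the contig
    that brings the running total to at least `percent` of the total length.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 100 >= total * percent:
            return length
    raise ValueError("contig_lengths must not be empty")

# Hypothetical contig lengths in bp (total 5,500 bp).
lengths = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
n25, n50, n75 = (n_value(lengths, p) for p in (25, 50, 75))
print(n25, n50, n75)
```

With this definition N25 is always at least as large as N50, which in turn is at least as large as N75, matching the ordering in the tables.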

Page 25: Intro to field_of_bioinformatics

Outline

I. Basic concepts
   i. Definition of bioinformatics
   ii. Databases (flat-file and relational)
   iii. Assembly (Overlap-layout-consensus)
II. Steps you can take on your own

Page 26: Intro to field_of_bioinformatics

What can you do to get prepared?

If you are interested in studying bioinformatics, here is an outline of increasingly complex levels of skills you might work towards:

•Layer 1 – Using web to analyze biological data
•Layer 2 – Ability to install and run new programs
•Layer 3 – Writing own scripts for analysis in PERL, python or R
•Layer 4 – High level coding in C/C++/Java for implementing existing algorithms or modifying existing codes for new functionality
•Layer 5 – Thinking mathematically, developing own algorithms and implementing in C/C++/Java

-Manoj Samanta http://www.homolog.us/blogs/2011/07/22/a-beginners-guide-to-bioinformatics-part-i/

Page 27: Intro to field_of_bioinformatics

K-INBRE resources

Over the fall semester, the Bioinformatics Core and Virginia Rider from Pittsburg State University will be hosting an undergraduate bioinformatics club.

Our first topic will be command-line BLAST. Students will get an account on Beocat (Kansas’ largest compute cluster).

http://bioinformaticsk-state-undergrad.blogspot.com

Page 28: Intro to field_of_bioinformatics

K-INBRE resources

K-INBRE hosts a journal club on Wednesdays at noon, via Polycom, to discuss current bioinformatics tools.

http://bioinformaticsk-state.blogspot.com/

Page 29: Intro to field_of_bioinformatics

K-INBRE resources

Bradley Olson and K-INBRE – Perl
Justin Blumenstiel et al. – Python

http://bioinformaticskstateperl.blogspot.com/

Page 30: Intro to field_of_bioinformatics

K-INBRE resources

K-INBRE and i5K have started a GitHub organization to archive and share scripts.

https://github.com/i5K-KINBRE-script-share

[Diagram: layout of the script-sharing organization]
GitHub organization: i5K-KINBRE-script-share
Category of ‘omics’ tool: RNA-Seq annotation and comparison; genome annotation and comparison; genome and transcriptome assembly; read cleaning and format conversion
Lab or research group: KSU bioinfo lab, Olson lab
List and description of scripts: readme

Page 31: Intro to field_of_bioinformatics

K-INBRE resources

-Git has very well developed version control built in http://git-scm.com/video/what-is-version-control
-Easy to search
-More advantages are reviewed in this quick introduction http://git-scm.com/video/quick-wins
-Provides continuity within labs (as students and post docs rotate out)
-Increases collaboration and sharing of workflows within our community
-It is also a good way to distribute the code you describe in a publication.
-Git is also widely used by beginners as well as developers of technology and software in the omics community, including:
https://github.com/broadinstitute (The Broad Institute)
https://github.com/lh3 (Li H., developer of BWA etc.)
https://github.com/dzerbino (Daniel Zerbino, developer of Oases and Velvet)
https://github.com/PacificBiosciences

Page 32: Intro to field_of_bioinformatics

Questions?

Contact information: [email protected]

K-INBRE Bioinformatics Core:

http://www.kumc.edu/kinbre/bioinformatics.html

http://bioinformatics.k-state.edu/