
Post on 18-Jan-2016


Bioinformatics for biologists

Dr. Habil Zare, PhD, PI of Oncinfo Lab

Assistant Professor, Department of Computer Science, Texas State University

Presented at the University of Texas Health Science Center – San Antonio, 20 November 2015

Part 1

- BioLinux

- Mapping RNAseq data to transcriptome (Salmon)

Bioinformatics for biologists workshop, Dr. Habil Zare, Oncinfo Lab UTHSC San Antonio, 20 Nov 2015


Bioinformatics: Computational and statistical analysis of biological data

Data

Biologists

Results

Genotypes / Phenotypes


In this workshop: a compact demo of a bioinformatics analysis that starts from raw data and produces useful plots and a meaningful interpretation of the data.

RNAseq

Biologists

Pathway and Network Analysis


Goals of the workshop

- A practical introduction to some basic bioinformatics tools for biologists.

- Hands-on experience with simple toy-example data.


Bio-Linux

Bio-Linux is a free workstation platform that facilitates running hundreds of bioinformatics tools without the corresponding installation hassles.

An easy way to install it on Mac OS X and Windows computers is described here: http://oncinfo.org/file/view/BioLinux_VM.pptx/564155065/BioLinux_VM.pptx


Browsing files and folders

A file whose name ends in .tar.gz is a compressed archive in Linux. Let’s practice decompressing such a file with an example. Follow the next steps in BioLinux.
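For reference, the same kind of file can also be decompressed from a terminal with the tar command. The sketch below is only an illustration: it builds its own small archive in a temporary folder, and the names intro_demo and readme.txt are hypothetical, not files from the workshop.

```shell
# Work in a throwaway directory so nothing in your home folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small example archive (hypothetical names) to practice on.
mkdir intro_demo
echo "hello from inside the archive" > intro_demo/readme.txt
tar -czf intro_demo.tar.gz intro_demo
rm -r intro_demo

# List the contents without extracting (t = list, z = gzip, f = file).
tar -tzf intro_demo.tar.gz

# Extract (x = extract) the archive into the current directory,
# which is the terminal counterpart of "Extract Here" in the file browser.
tar -xzf intro_demo.tar.gz
cat intro_demo/readme.txt
```

Listing the archive first is a cheap way to see which folder it will create before anything is written to disk.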


Double-click on Bio-Linux Documentation to open it.


Double-click on Introductory Tutorial


Click on File > New Tab.


Select the second tab and click on Home.


Drag and drop this file from intro_course tab to Home tab.


Right-click on the file and then Extract Here…


This folder will appear. Open it and have a look inside.


Downloading and installing

Most useful bioinformatics tools are publicly available. You can download, install, and use them easily.

Let’s practice with an example. Follow the next steps in BioLinux.


This is the “Dash”. Use it to launch and organize applications.


E.g., use “Firefox” to browse the web.


Type oncinfo.org in the address bar and press Enter.


From the right menu, click on the workshop link.


Click on “zipped” to download the folder.


Choose to save the file.


1- Click on the Files icon.

2- Click on Downloads.

The file that you just downloaded was saved in the Downloads folder.


This is the file you just downloaded.


Extract (decompress) the file that you just downloaded.

Right-click on the file and then Extract Here…


Salmon

Salmon, a successor of Sailfish, is a useful tool for mapping RNAseq data. It is faster and easier to run than alternatives such as TopHat.


Installing Salmon software

We will run a script provided in the zipped file using a terminal.

A terminal is an interface that uses only text to communicate between the user and the computer.


How to open a terminal?

Click on the black rectangle to open a terminal.


Try a few simple Linux commands, e.g., echo, date, cal, …


Type “cd” in the terminal to “change directory”.


Drag the folder to the terminal.


Now press Enter.


What is in the folder?

Double-click on the folder to open it.


Equivalently, “ls” shows you the list of files in this folder.
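The two commands used so far, cd and ls, can be tried safely on a practice folder. This is a minimal sketch: the folder is created in a temporary location, and the file names install_script.sh and notes.txt are hypothetical placeholders, not the actual contents of the workshop folder.

```shell
# Create a practice folder (hypothetical contents) with two empty files in it.
practice=$(mktemp -d)
touch "$practice/install_script.sh" "$practice/notes.txt"

# cd changes the current directory; quoting the path protects spaces in names.
cd "$practice"

# ls lists the files in the current directory; -1 prints one name per line.
ls -1
```

Dragging a folder onto the terminal, as in the earlier slides, simply types its path for you after cd.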


This script will install Salmon for you.


How to run the script?

Type the name of the script and then press Enter.


Type your password, which is “manager” by default.


How to make sure Salmon is installed?

The script should download and install Salmon. Type “salmon -v” to test whether it is installed. If Salmon responds by printing its version, the installation was OK.
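A more general way to check whether any tool is available in the terminal is to ask the shell where it would find it. The helper name is_installed below is hypothetical; it only wraps the standard command -v lookup and does not run Salmon itself.

```shell
# is_installed NAME: succeeds (exit status 0) if NAME is found on the PATH.
is_installed() {
    command -v "$1" >/dev/null 2>&1
}

if is_installed salmon; then
    echo "salmon found at: $(command -v salmon)"
else
    echo "salmon is not on the PATH; try running the installation script again"
fi
```

The same check works for any other program name, which is handy when a workflow depends on several tools.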


Input for Salmon

1- A FASTA file, which has the sequence information of the transcriptome of the species of interest.

2- One or more FASTQ files, which are provided by the sequencing instrument and contain the read information from the samples.


Toy examples of FASTA and FASTQ files

Open the sample_data folder


Next-generation sequencing

A sequencer produces millions of short reads (50-200 bp).

[Figure: biological sample, sequencer, short reads]


Toy example of a FASTQ file

Double-click on the reads_1.fastq file.


This is a read of length 50 with nucleotide and (Phred) quality information.
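In a FASTQ file each read occupies four lines: an identifier starting with @, the nucleotide sequence, a separator line starting with +, and one quality character per nucleotide. The sketch below assumes this plain four-line layout and uses two invented reads of length 10 (not the workshop data); the function names are hypothetical.

```shell
# Write a tiny FASTQ file with two invented reads of length 10.
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTGCAAGGCC
+
IIIIHHHHGG
EOF

# count_reads FILE: each record spans 4 lines, so reads = lines / 4.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# read_lengths FILE: print the length of every sequence line
# (the 2nd line of each 4-line record).
read_lengths() {
    awk 'NR % 4 == 2 { print length($0) }' "$1"
}

count_reads "$fastq"
read_lengths "$fastq"
```

Running the same two functions on reads_1.fastq should report the number of reads in the toy data and confirm their length.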


Toy example of a FASTA file

Double-click on the transcripts.fasta file.


This is a transcript.


It is an mRNA with RefSeq ID NM_001168316
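In a FASTA file every transcript begins with a header line that starts with >, followed by one or more lines of sequence. The sketch below counts and lists the transcripts in a tiny file; the sequences are invented (they are not the real sequence of NM_001168316), the second ID is a made-up placeholder, and list_ids is a hypothetical helper name.

```shell
# Write a tiny FASTA file. Sequences are invented for illustration only.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>NM_001168316
ACGTTGCAACGT
GGCCTTAA
>NM_HYPOTHETICAL_2
TTGACCA
EOF

# Each transcript starts with a ">" header line, so counting those lines
# counts the transcripts.
grep -c '^>' "$fasta"

# list_ids FILE: print the IDs only, i.e. the header without the leading ">"
# and without anything after the first space.
list_ids() {
    sed -n 's/^>\([^ ]*\).*/\1/p' "$1"
}
list_ids "$fasta"
```

The same two commands can be pointed at transcripts.fasta to see how many transcripts the toy transcriptome contains.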


More information on the transcript

Search in the NCBI database: http://www.ncbi.nlm.nih.gov/nuccore/

Type the RefSeq ID, e.g., NM_001168316


Visualize the transcript on the genome

Search in the UCSC Genome Browser: https://genome.ucsc.edu/cgi-bin/hgGateway

Type the RefSeq ID, e.g., NM_001168316


This is the transcript


More information on this region is available.


Quantify the level of expression

The level of expression of each transcript can be quantified by counting the number of reads that are aligned to it.


Only exons are present in mRNA

[Figure: a gene with exon 1, exon 2, exon 3, and exon 4 marked]


Alignment

Alignment determines which transcript (i.e., where on the genome) each read originated from.

[Figure: short reads in a FASTQ file aligned to Gene 1 and Gene 2]


Count the number of reads aligned (mapped) to each region.


[Figure: Gene 1 (high expression) and Gene 2 (low expression)]

Compare the level of expression between genes.
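The counting idea can be illustrated with a toy table in which each line names a read and the gene it was aligned to. This is only a sketch of the principle, assuming such a two-column table already exists; it is not how Salmon stores its results, and the read and gene names are invented.

```shell
# A toy table: one line per read, with the gene it was aligned to.
assignments=$(mktemp)
cat > "$assignments" <<'EOF'
read1 Gene1
read2 Gene1
read3 Gene2
read4 Gene1
EOF

# count_per_gene FILE: tally how many reads were assigned to each gene,
# then sort the result by gene name.
count_per_gene() {
    awk '{ n[$2]++ } END { for (g in n) print g, n[g] }' "$1" | sort
}

count_per_gene "$assignments"
```

Here Gene1 collects three reads and Gene2 one, so Gene1 would be called the more highly expressed of the two, before any normalization.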


Quantifying expression from RNAseq data

Salmon processes raw data and quantifies expression levels in 2 steps: http://salmon.readthedocs.org/en/latest/salmon.html#using-salmon

Step 1- Building an index for the transcriptome.

Step 2- Aligning the reads to the transcriptome.


Are you in the right directory?

Before you start, make sure you are in the correct directory. The pwd command in Linux shows the current directory.

Typing “pwd” and then “Enter” will “print the working directory”, i.e., your current path.


Always make sure that the files are stored where you expect them to be.
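One way to make this check explicit is a small helper that reports which expected files are missing from the current directory. The function name check_inputs is hypothetical; the three file names are the ones used in this workshop.

```shell
# check_inputs FILE...: print each missing file; fail if any is missing.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Show where we are, then look for the workshop input files here.
pwd
if check_inputs transcripts.fasta reads_1.fastq reads_2.fastq; then
    echo "all input files are here"
fi
```

If any file is reported as missing, use cd to move into the sample_data folder and run the check again.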


Step 1- Building an index for the transcriptome.

Run the following command in the terminal in BioLinux:

salmon index -t transcripts.fasta -i transcripts_index --type fmd


Type the command here.


For now, ignore this warning.


The index is built.


Salmon created a new folder.


Step 2- Aligning the reads to the transcriptome.

Run the following command in the terminal in BioLinux:

salmon quant -i transcripts_index -l IU -1 reads_1.fastq -2 reads_2.fastq -o transcripts_quanton

The parts of the step 2 command:

- salmon quant: the command

- -i transcripts_index: the index built in step 1

- -1 reads_1.fastq: the first input file

- -2 reads_2.fastq: the second input file

- -o transcripts_quanton: the output folder
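The two steps are linked by the index name: the folder given to -i in step 1 must be the same one given to -i in step 2. The sketch below only prints the two commands from the slides (a dry run, it does not call Salmon), with the shared names held in variables; print_commands is a hypothetical helper. Note that every option starts with a plain hyphen (-), which matters when commands are copied from slides.

```shell
# Names taken from the slides; change them here if your files differ.
transcripts=transcripts.fasta
index=transcripts_index
out=transcripts_quanton

# print_commands: show the two commands instead of running them (a dry run).
print_commands() {
    echo "salmon index -t $transcripts -i $index --type fmd"
    echo "salmon quant -i $index -l IU -1 reads_1.fastq -2 reads_2.fastq -o $out"
}

print_commands
```

Keeping the names in one place avoids the common mistake of building the index under one name and then quantifying against another.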


Salmon created a new folder and stored the results there.


quant.sf is the main output file. It reports the number of mapped reads and the estimated expression of each transcript. Double-click on it.


The names of the transcripts (RefSeq IDs) and their lengths are in the first 2 columns.


The number of mapped reads is reported in the last column.
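Because quant.sf is a plain-text table, it can also be inspected from the terminal. The sketch below assumes a layout with the transcript name in the first column, the number of reads in the last column, and comment lines starting with #; the exact set of columns may differ between Salmon versions. All values, and the IDs other than NM_001168316, are invented, and top_by_reads is a hypothetical helper name.

```shell
# A toy table in an assumed quant.sf-like layout (all values invented).
quant=$(mktemp)
cat > "$quant" <<'EOF'
# Name Length TPM NumReads
NM_HYPO_A 1500 300000 15
NM_001168316 2283 500000 40
NM_HYPO_B 900 200000 6
EOF

# top_by_reads FILE: drop "#" comment lines, keep the name and the last
# column (number of reads), and sort with the largest read count first.
top_by_reads() {
    grep -v '^#' "$1" | awk '{ print $1, $NF }' | sort -k2,2nr
}

top_by_reads "$quant"
```

Using the last column ($NF) rather than a fixed column number keeps the helper usable even if extra columns appear between the name and the read count.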


Transcripts per million (TPM) is the estimated expression.


TPM values correspond to counts normalized by the length of transcripts and also the depth of sequencing. There are other normalization methods such as RPKM and FPKM.
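As a rough illustration of this normalization, each count can be divided by the transcript length to get a per-base rate, and the rates can then be rescaled so that they sum to one million. The sketch below applies this textbook formula to three invented transcripts given as name, length, and read count; Salmon's own estimates are more refined, so this is not its exact procedure, and compute_tpm is a hypothetical helper name.

```shell
# Toy input: transcript name, length, read count (all values invented).
counts=$(mktemp)
cat > "$counts" <<'EOF'
tA 1000 100
tB 2000 100
tC 1000 200
EOF

# compute_tpm FILE: rate = count / length for each transcript, then
# TPM = rate / (sum of all rates) * 1,000,000, rounded to a whole number.
compute_tpm() {
    awk '{ name[NR] = $1; rate[NR] = $3 / $2; total += rate[NR] }
         END { for (i = 1; i <= NR; i++)
                   printf "%s %.0f\n", name[i], rate[i] / total * 1e6 }' "$1"
}

compute_tpm "$counts"
# tA 285714
# tB 142857
# tC 571429
```

tA and tB have the same number of reads, but tB is twice as long, so its TPM is half that of tA. Dividing by the total makes the values comparable across samples sequenced at different depths.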


This transcript is highly expressed


These transcripts have low expression.


References:

• Some of the slides are based on Introduction to Biolinux http://nebc.nerc.ac.uk/downloads/courses/Bio-Linux/bl8_latest.pdf

• Salmon is a useful tool for mapping and analyzing RNAseq data. https://combine-lab.github.io/salmon/

• I prepared these guidelines to facilitate the “Bioinformatics for biologists workshop”, 20 Nov 2015, UTHSC – San Antonio. http://oncinfo.org/Bioinformatics+for+biologist+workshop