Andy Kudlicki Office: BSB 547 Phone: 772-2253, 771-1011 cell [email protected] BMB 6216 – Algorithms for Biology - Class 1

Andy Kudlicki

Office: BSB 547

Phone: 772-2253, 771-1011 cell

[email protected]

BMB 6216 – Algorithms for Biology - Class 1

Welcome!

Imagine doing science without computers. (Almost) all of it can be done:

– Paper file folders

– Xeroxing

– Photographs on film

– Actually going to the library to browse journals

– Abstract collections

– Telephone, Snail-mail, Telegrams

– Typewriters

BMB 6216 – Algorithms for Biology

The one exception:

Science is quantitative, and has always been.

This course:

– Using computers for computing.

– Aspects useful in biology / bioinformatics

• Simple tasks ( 2 * 71.12 = ? )

• Simple repetitive tasks (few or many repetitions)

• Somewhat complicated tasks

• Typical problems of high complexity

– BLAST, genome assembly, motif discovery, ...

spreadsheets

( Solved, software available )

Course Overview

Class 1     Introduction to the course and to the Perl programming language

Class 2     Computational complexity and numerical stability of algorithms

Class 3     Data Structures and Containers in PERL and other languages

1.     Tables, lists, queues, hashes and when to use them

2.     When PERL is not enough: A quick look at R and C++

Class 4     Matrix operations; Principal Component Analysis; ICA

Class 5     Network / graph algorithms

1.     Interaction Networks

2.     Regulation networks

3.     Graphs for enumerating hypotheses

Class 6     Strings and Regular Expressions

1.     In silico enzyme digestion

2.     Gene translation

Class 7     Randomization and Monte Carlo simulations

1.     Randomization by permutation

2.     Modeling the null-hypothesis probability distribution

Class 8     Custom vector graphics: generating SVG from your data

1.     Create and re-create the killer graph for your paper

Class 9     Visualization of multidimensional data

Class 10     Web tools

1.     The components of a web page, elements of HTML.

2.     Extracting data from webpages and other documents.

3.     Connect to GenBank using BioPerl

Class 11     Cgi-bin: Creating dynamic web-based tools for data analysis.

Class 12     Relational databases and SQL

1.     Relational Model, normalization

2.     Basic SQL

3.     Examples: Experimental results,

Class 13     Databases and WWW

Class 14     Clustering

1.     Hierarchical

2.     K-means

3.     Friends-of-friends

Class 15     Timecourses and spectral analysis; Convolution.

Format:

Mixed – lecture with hands-on assignments.

Computer environment:

Linux

Perl, also C/C++, R, shell, awk, sed, ..., when needed

Supplementary reading:

Larry Wall et al.: Programming Perl

Wing-Kin Sung: Algorithms in Bioinformatics

James Tisdall: Beginning Perl for Bioinformatics

James Tisdall: Mastering Perl for Bioinformatics

Stroustrup: The C++ Programming Language

Special requests: Welcome !

Computer environment:

Linux

* Rich in standard tools, mostly open-source

* Industry standard

* Very similar to macOS, Android, iOS, BSD, ChromeOS, etc.

* Has many flavors created for specific purposes

Using your laptop in class:

To get a *nix environment:

* Linux laptop (or Unix console on Mac)

– Live CD distribution

* Cygwin

* virtual machine

* remote session (preferred, guaranteed to work)

Remote session:

Use

– “Remote Desktop Connection” from win*

– Server: 129.109.88.185

From mac – install “Remote Desktop Connection Client for Mac”

From Linux “rdesktop 129.109.88.185”

Also works from off campus

• (mycitrix.utmb.edu -> remote desktop session)

Other options:

– ssh (puTTY on windows) , no graphics though, only on-campus

– NX NoMachine

Login to: 129.109.54.80

Username:

Password:

Unix / linux shell / command line:

– List files: ls ls -a ls -1 ls -l ls -lrt

– Directory: cd pwd

– Copy, move, delete, link: cp mv rm ln

– Machine status: ps w uptime top df du whoami /sbin/ifconfig date

– Text editors: joe nano emacs (c-x c-f) vi

– Pager: more less; also: cat, head, tail, tac

– Misc: echo tr sed man wc chmod

Simple data flow / spreadsheet-like

• Find in file : grep [grep -v; grep -f; egrep]

• Select top/bottom lines from file: head, tail

• Select columns: awk, e.g. awk '{print $2, $3, $5+$6}'

• Merge lines: cat

• Merge columns: paste

• Sort

• Data flow: > >> < | tee tac
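
A minimal sketch of how these tools chain together. The file genes.txt, its three whitespace-separated columns and the placeholder name "none" are invented for illustration; they are not the course data.

```shell
# Hypothetical toy table: id, name, expression value (not the course data).
workdir=$(mktemp -d)
cd "$workdir"
printf 'g1 CDC28 5.31\ng2 none 0.40\ng3 RPL12 1.23\ng4 none 2.75\n' > genes.txt

# grep -v drops the lines matching a pattern; wc -l counts what is left.
named=$(grep -v none genes.txt | wc -l)

# awk selects columns; sort -k2,2nr sorts numerically on column 2, largest first;
# head -n 1 keeps only the top line.
top=$(awk '{print $2, $3}' genes.txt | sort -k2,2nr | head -n 1)

# tee writes a copy to a file while the data keeps flowing down the pipe;
# paste -s joins the resulting lines into one comma-separated line.
sorted_ids=$(sort -k3,3n genes.txt | tee sorted.txt | awk '{print $1}' | paste -s -d, -)

echo "named genes: $named"
echo "top gene: $top"
echo "ids by expression: $sorted_ids"
```

Each step reads standard input and writes standard output, which is why the pieces can be rearranged freely, much like chaining spreadsheet operations.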

Exercise:

The file /data/students/classes/remastercycle.csv contains gene expression data arranged as time-series in columns. (affy-id, name, gene-id, data*36)

• How many named genes are there?

• What is the average expression at timepoint 1? In how many genes is it above average?

• What is the average expression at t1 of named genes, unnamed genes, non-genes? (genes have systematic names like YLR405W)

• List 200 named genes that have the highest (t7+t19+t31)-(t1+t13+t25)

Log in to your account (on 129.109.88.185)

– Make a fresh directory, e.g.

mkdir bmb6216

cd bmb6216

mkdir class_1; cd class_1

cp /data/students/classes/hello.pl .

* Cat it. * Less it. * Run it.

• Backup: cp hello.pl hello-0.pl

• Edit it: vi hello.pl

Editing with vi

– I / i (insert)

– A / a (append)

– X / x / dd (delete)

– R (eplace) / r (eplace 1 character)

– {n} W / w / B / b / hjkl -move around

– [ESC] – back from insert to command

– ZZ / :w / :q / :wq / :x / :q! - exit / save / quit

– xp – swap chars. ddp – swap lines

Exercise:

The file /home/students/classes/Class_1/remastercycle.csv contains gene expression data arranged as time-series in columns. (affy-id, name, gene-id, data*36)

• How many named genes are there?

• What is the average expression at timepoint 1? In how many genes is it above average?

• What is the average expression at t1 of named genes, unnamed genes, non-genes? (Genes have systematic names like YLR405W; named genes also have a common name in column 2.)

• List 200 named genes that have the highest (t7+t19+t31)-(t1+t13+t25)
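
One possible way to approach the averaging questions with awk, shown on a three-line stand-in file. The comma delimiter, the column positions and the values are assumptions; check them against the real file (for example with head) before reusing any of this.

```shell
# Hypothetical miniature of the exercise file: affy-id, name, gene-id, then data.
# The comma delimiter and the column positions are assumptions.
csv=$(mktemp)
printf 'a1,CDC28,YBR160W,2.0\na2,,YLR405W,4.0\na3,,,9.0\n' > "$csv"

# Pass 1: average of column 4 (taken here to be timepoint 1). -F, sets the field separator.
avg=$(awk -F, '{ sum += $4 } END { print sum / NR }' "$csv")

# Pass 2: count the rows whose column 4 exceeds that average.
above=$(awk -F, -v avg="$avg" '$4 > avg { n++ } END { print n + 0 }' "$csv")

# Rows with a non-empty column 2 have a common name.
named=$(awk -F, '$2 != "" { n++ } END { print n + 0 }' "$csv")

echo "average=$avg above=$above named=$named"
```

Two passes are used because the average is only known once the whole file has been read.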

PERL

Why PERL?

Practical Extraction and Report Language

Pathologically Eclectic Rubbish Lister

• Versatile, portable

• Widely used in bioinformatics and web applications

• There's more than one way to do it

• Not the most elegant language, great for dirty hacks

• Easily integrated with anything

Warning: PERL6 ain't PERL

PERL

HELLO WORLD:

print "Hello \n";

PERL

HELLO WORLD:

> perl

print "Hello \n";

^D

PERL

HELLO WORLD:

> perl -e 'print "Hello \n";'

PERL

HELLO WORLD:

hello.pl

==================

#!/usr/bin/perl

print "Hello \n";

==================

> perl hello.pl

Or

> ./hello.pl

(after chmod +x hello.pl)

VARIABLES:

Scalar:

$dna = 'ATTTGCCCTGCCCATT';

$mouse_tail_inches = 2.13;

$RNA = "GGGUUCAAUAUAUGGC";

$seven = -6;

Default variable: $_

No need to declare variables. If not specified, $_ is assumed.

VARIABLES:

No need to declare variables.

Risky though:

$my_variable = 51;

$something = $my_variable + 3;

$something_else = $myvariable + 4;

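Remedy: use strict; (every variable must then be declared with my, so the mistyped $myvariable is reported at compile time)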

OPERATIONS:

String:

$dna = "ATAGAGGTA" . "CATATC";

$at_repeat = "AT" x 50;

substr() sub-string

length()

Binding: print $dna if $dna =~ /ATA/;

chop (removes the last character)

chomp (removes a trailing end of line)

Special characters: \t (tab) \n (newline)

The different quotations

$x=6;

print "x= $x \n";   # double quotes: $x and \n are interpreted

print 'x= $x \n';   # single quotes: everything is printed literally

OPERATIONS:

Arithmetic:

$a + $b

$a - $b

$a * $b

$a % $b

$a ** $b

OPERATIONS:

Incrementation (C-like)

$a ++

$a *= 4

$repeat = 'AT'; $repeat x=36;

LISTS/TABLES:

@a = (4, 6, 3.21, 7, 'cat', "dog");

$a[0] = 6;

$#a index of last element

@a + 0 size of array

OPERATIONS:

* join / split

* push / pop / shift / unshift

HASHES:

The most important data type in biology!

$expression{"RPS16"} = 4.65;

%expression = (

RPL12 => 1.23,

CDC28 => 5.31,

STAT1 => "experiment gone south"

);

FLOW CONTROL:

if ( $a > 4 ) { print sqrt ($a), "\n"; };

while ( $x > 0 ) { print --$x , "\n"};

$x>0 or $x = 6;

for $z (1..333) {print $z, ' ';};

for ($i=0; $i<=1000; ++$i)

{

next unless $a[$i] > 0

};

TRUE or FALSE

false strings:

– "0"

– ""

Every other string is true!

"0.00" is true

"0.00" + 0 is false

– if ( 'Elvis is alive' ) { print 4+5, "\n"; };

– undef() is false

SUBROUTINES

sub addit {

my ($x1, $x2) = @_;

return $x1 + $x2;

};

Input / Output:

while (<>)

{

chomp;

$sum += $_;

};

Input:

open(BLABLA, "data.csv") or die "Cannot open data.csv: $!";

$firstline = <BLABLA>;

@headers = split "\t", $firstline;

while (<BLABLA>) {something};   # each remaining line arrives in $_

close BLABLA;

Output:

– print $x, "\n";

– printf "format", $x;

– print + join " ", @list;

open BLABLA, ">outdata.csv";

print BLABLA $x, $y, "\n"; # no comma after the filehandle!

close BLABLA;

Exercises:

1. Repeat in PERL the awk/sort exercise from last hour.

2. a-S_cer_TANAY_1000upstream.fasta contains the sequences of UTRs of genes. What is the correlation between the position of the GATGAGA sequence and the average expression of the gene?

C / C++ -> for total control

=========================== Hello.C ======

#include <iostream>

using namespace std;

int main ()

{

cout << "Hello :) " << 5+4 << endl;

return 0;

}
