1 Parsing Analyze text: split it into meaningful units, tokens Extract relevant information,...

Post on 19-Dec-2015


1

Parsing

• Analyze text: split it into meaningful units, tokens

• Extract relevant information, disregard irrelevant information

• ‘Meaningful’ and ‘relevant’ depend on application: what are we looking for?

2

Blast

• Program package for finding similarities between biological sequences

• blastn compares DNA sequences with DNA sequences

• Input:
  – Fasta file with query sequences
  – Formatted Fasta file with database sequences
  – Sensitivity parameter (and more)

• Output:
  – Result of comparing each query to each database sequence

3

Example run

Query file: arachis.fasta

Database file: arabidopsis_nucleotides.fasta

Format the database: formatdb -i arabidopsis.fasta -p F -o T

Command:

/users/chili/usr/blast-2.2.13/bin/blastall -p blastn -e 0.000000002 -d arabidopsis.fasta -i arachis.fasta -o arachis_arab.bn

4

Example output – query with no match

BLASTN 2.2.6 [Apr-09-2003]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= CL5Contig1 (797 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

***** No hits found ******

..

5

Example output – query with matches

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70  3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70  3e-10
gb|AV519674.1|AV519674 AV519674 Arabidopsis           68  1e-09
gb|AV557401.1|AV557401 AV557401 Arabidopsis           42  3e-05
gb|BP670151.1|BP670151 BP670151 RAFL21 Arabidopsis    43  1e-04

..

6

Example output – match alignment

>gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis Length = 1009

Score = 69.9 bits (35), Expect = 3e-10
Identities = 47/51 (92%)
Strand = Plus / Plus

Query: 67  gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117
           ||||||||||||||||| ||||| ||||| |||||||| ||||||||||||
Sbjct: 776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826

..

General form of output:

Repetitions of (query, subject matches, alignments)

7

Extract information from blast output

• Extract the best hit for each query sequence

Query= CL69Contig1 (372 letters)
..

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70  3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70  3e-10
..

8

Algorithm

• Read blast output file line by line

• Introduce two states:
  1. Looking for next query
  2. Looking for hit list

• Return a dictionary that maps each query to its best hit
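The two-state idea above can be sketched as follows. This is a minimal illustration, not the course's blastparser.py: the function name parse_blast_lines, the state constants, and the embedded sample are assumptions made for the example. It also keys on the "Sequences producing significant alignments" header to locate the hit list, which is one of several possible ways to recognise it.

```python
LOOKING_FOR_QUERY = 1
LOOKING_FOR_HIT_LIST = 2


def parse_blast_lines(lines):
    """Return {query ID: best hit ID}; queries without hits map to 'none'."""
    best_hits = {}
    state = LOOKING_FOR_QUERY
    query_id = None
    header_seen = False

    for line in lines:
        line = line.strip()
        if state == LOOKING_FOR_QUERY:
            # The '=' matters: alignment lines start with 'Query:'.
            if line.startswith("Query="):
                query_id = line.split()[1]
                header_seen = False
                state = LOOKING_FOR_HIT_LIST
        elif "No hits found" in line:
            best_hits[query_id] = "none"
            state = LOOKING_FOR_QUERY
        elif header_seen and line:
            # First non-empty line after the header is the best hit.
            best_hits[query_id] = line.split()[0]
            state = LOOKING_FOR_QUERY
        elif line.startswith("Sequences producing significant alignments"):
            header_seen = True
    return best_hits


# Hypothetical sample assembled from the example output shown earlier.
SAMPLE = """\
Query= CL5Contig1 (797 letters)

Searching..................................................done

 ***** No hits found ******

Query= CL69Contig1 (372 letters)

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70  3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70  3e-10

Query: 67  gagctattaacaggtaagggtc 88
"""

best = parse_blast_lines(SAMPLE.splitlines())
print(best)
```

Passing an open file object instead of SAMPLE.splitlines() should work the same way, since the function only iterates over lines.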

9

First state: Looking for next query

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70  3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70  3e-10
..

Look for a line starting with Query= (the = is important!)

10

Why we look for Query= and not just Query

>gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis Length = 1009

Score = 69.9 bits (35), Expect = 3e-10
Identities = 47/51 (92%)
Strand = Plus / Plus

Query: 67  gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117
           ||||||||||||||||| ||||| ||||| |||||||| ||||||||||||
Sbjct: 776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826

..
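A short check makes the point concrete: alignment lines also begin with the word Query, so a test for the bare prefix would mistake them for the start of a new query. The list below is a hypothetical excerpt built from the output shown on these slides.

```python
lines = [
    "Query= CL69Contig1 (372 letters)",
    "Query: 67  gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117",
    "Sbjct: 776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826",
]

# Too loose: also matches the alignment line that starts with 'Query:'.
loose = [line for line in lines if line.startswith("Query")]

# Strict: matches only the line that introduces a new query.
strict = [line for line in lines if line.startswith("Query=")]

print(len(loose), len(strict))
```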

11

Second state: Looking for hit list

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70  3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70  3e-10
..

Case A: hits were found

12

Case B: no hits were found

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70  3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70  3e-10
..

Query= CL5Contig1 (797 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

***** No hits found ******

13

Second state: Looking for hit list


Look for a line starting with Searching, then read a few more lines to distinguish case A from case B.


14

blastparser.py (part 1)

Find the query ID:

Query= CL69Contig1

15

blastparser.py (part 2)

Find the best match ID:

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70  3e-10

Find the best match ID:

Searching..................................................done

***** No hits found ******
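Once the first hit line has been located, the match ID is simply its first whitespace-separated field. The helper below is a hypothetical sketch (the name parse_hit_line is not from blastparser.py) that also pulls out the score and E-value from the last two fields. The handling of E-values written without a leading digit, such as e-100, is an assumption about how some BLAST versions print very small values.

```python
def parse_hit_line(line):
    """Split one hit-list line into (hit ID, description, score, E-value)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("not a hit line: %r" % line)

    e_text = fields[-1]
    if e_text.startswith("e"):
        # Assumption: a value such as 'e-100' means '1e-100'.
        e_text = "1" + e_text

    hit_id = fields[0]                    # e.g. gb|AV791675.1|AV791675
    description = " ".join(fields[1:-2])  # free text between ID and score
    score = float(fields[-2])             # bit score
    e_value = float(e_text)
    return hit_id, description, score, e_value


hit = parse_hit_line(
    "gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70  3e-10"
)
print(hit[0])
```

Taking the score and E-value from the end of the line rather than from fixed positions avoids depending on how many words the description contains.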

16

Test

>>> from blastparser import parseBlastallOutput

>>> d = parseBlastallOutput("arachis_arab.bn")

>>> d["gi|30419745"]

'gb|BP625785.1|BP625785'

>>> d["gi|30419753"]

'none'

25

Evolutionary tree of life (animal kingdom)

• Huge hierarchy of groups and subgroups

• Each node in the tree has a name and a (possibly empty) list of descendant trees (sons)

Two-pass parsing

Source: The origin and evolution of model organisms, Nature Reviews Genetics, Nov. 2002, vol. 3.

26

Abstract data structure to represent a general tree (not necessarily binary)

tree.py
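A general tree node needs little more than a label and a list of sons. The class below is a minimal sketch of such a structure, not the contents of tree.py: the names Node, add_son, and siblings are assumptions, and the father pointer is included only because it makes moving upwards in the tree easy.

```python
class Node:
    """A tree node with a label, a father, and any number of sons."""

    def __init__(self, label, father=None):
        self.label = label
        self.father = father   # None for the root
        self.sons = []         # empty list for a leaf

    def add_son(self, label):
        """Create a new node below this one and return it."""
        son = Node(label, father=self)
        self.sons.append(son)
        return son

    def siblings(self):
        """Return the other sons of this node's father."""
        if self.father is None:
            return []
        return [node for node in self.father.sons if node is not self]


insects = Node("Insects")
beetles = insects.add_son("Beetles")
flies = insects.add_son("Flies")
print([node.label for node in beetles.siblings()])
```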

27

How can we write a tree to a sequential file?

(File format should be readable by other systems, so we can’t use cPickle)

– A tree is a labeled node containing a (possibly empty) list of other (sub)trees

– Write tree node using start and end tags: <N="Insects"> [sons] </N>

• Formally (context-free grammar):

T → <N="L"> S </N>
S → λ | T S
L → string label

[Figure: example tree with root Insects, subtrees Beetles and Flies, and leaves A, B, C, D, E]

28

Recursive method: string representation of tree

tree.py

First obtain the string representation of the sons (empty string if there are no sons) by calling the function recursively, then create a string with start tag, label, sons’ representation, and end tag.


.. <N="Beetles"><N="C"></N><N="D"></N><N="E"></N></N> ..
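The recursive method described on this slide can be sketched in a few lines. The Node class and the function name to_string are stand-ins chosen for this example and may differ from what tree.py actually defines.

```python
class Node:
    """Minimal stand-in for a tree node: a label and a list of sons."""

    def __init__(self, label, sons=None):
        self.label = label
        self.sons = sons if sons is not None else []


def to_string(node):
    """Return the tagged string representation of the tree rooted at node."""
    # First the sons: an empty string if there are none.
    sons_repr = "".join(to_string(son) for son in node.sons)
    # Then wrap them in this node's start tag and end tag.
    return '<N="%s">%s</N>' % (node.label, sons_repr)


beetles = Node("Beetles", [Node("C"), Node("D"), Node("E")])
print(to_string(beetles))
# prints <N="Beetles"><N="C"></N><N="D"></N><N="E"></N></N>
```

The recursion ends at the leaves, where the join over an empty list of sons yields the empty string.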

29

Larger tree – How can we read a tree from a sequential file?

<N="Terrestrial vertebrates"><N="Synapsida"><N="Therapsida"><N="Mammalia"><N="Marsupialia"><N="Kangaroo"></N><N="Koala"></N></N><N="Eutheria"><N="Primates"><N="Human"></N><N="Gorilla"></N><N="Chimpanzee"></N></N><N="Carnivora"><N="Walrus"></N><N="Wolf"></N></N><N="Proboscidea"><N="Elephant"></N></N></N></N></N></N><N="Reptilia"><N="Diapsida"><N="Archosauromorpha"><N="Tyrannosaurus"></N><N="Penguin"></N><N="Owl"></N></N><N="Lepidosauromorpha"><N="Lizard"></N><N="Snake"></N></N></N><N="Testudines"><N="Turtle"></N></N></N></N>

We need a parser!

part_of_the_tree_of_life.txt

30

Two-pass parsing

Complex parsing is often split in two passes:

1. Lexical analysis
   • Identify and assemble tokens: logical units of text

2. Structural analysis
   • Determine the structural hierarchy of the tokens

In our case, the tokens are the two kinds of tag: start tags such as <N="Kangaroo"> and end tags </N>.

31

Lexical analysis

phylogenyparser.py (part 1)

Match either a start tag or an end tag

Define a group containing the start tag’s label

Search text from index pointer

Create token of right type

Move index pointer
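The five annotations above map naturally onto a loop around a regular expression. The sketch below is one possible reading of them, not the code of phylogenyparser.py: the token representation as (kind, label) tuples and the name tokenize are assumptions.

```python
import re

# Match either a start tag or an end tag; group 1 holds the start tag's label.
TAG = re.compile(r'<N="([^"]*)">|</N>')


def tokenize(text):
    """Return a list of ('start', label) and ('end', None) tokens."""
    tokens = []
    index = 0
    while True:
        # Search the text from the index pointer onwards.
        match = TAG.search(text, index)
        if match is None:
            break
        # Create a token of the right type.
        if match.group(1) is not None:
            tokens.append(("start", match.group(1)))
        else:
            tokens.append(("end", None))
        # Move the index pointer past the tag just consumed.
        index = match.end()
    return tokens


tokens = tokenize('<N="Kangaroo"></N><N="Koala"></N>')
print(tokens)
```

Because search is used rather than match, any text between tags is silently skipped; a stricter tokenizer could reject it instead.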

32

Structural analysis

phylogenyparser.py (part 2)

[Figure: parsing .. <N="Kangaroo"></N><N="Koala"></N> .. in numbered steps 1 to 3, with pointers current_node and new_node; the node Kangaroo is shown]

Real root will be first son of this node
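The structural pass can be sketched with a current_node pointer and a dummy top node whose first son becomes the real root. This is an illustrative version under assumptions: the Node class, the (kind, label) token tuples, and the name build_tree are invented for the example and are not taken from phylogenyparser.py.

```python
class Node:
    """Minimal tree node with a label, a father, and a list of sons."""

    def __init__(self, label, father=None):
        self.label = label
        self.father = father
        self.sons = []


def build_tree(tokens):
    """Build a tree from ('start', label) / ('end', None) tokens."""
    dummy = Node("dummy")       # real root will be first son of this node
    current_node = dummy
    for kind, label in tokens:
        if kind == "start":
            # Create a node, attach it below current_node, then descend.
            new_node = Node(label, father=current_node)
            current_node.sons.append(new_node)
            current_node = new_node
        else:
            # An end tag closes current_node: climb back to its father.
            if current_node is dummy:
                raise ValueError("end tag without matching start tag")
            current_node = current_node.father
    if current_node is not dummy or len(dummy.sons) != 1:
        raise ValueError("tags do not describe exactly one tree")
    root = dummy.sons[0]
    root.father = None
    return root


tokens = [
    ("start", "Marsupialia"),
    ("start", "Kangaroo"), ("end", None),
    ("start", "Koala"), ("end", None),
    ("end", None),
]
root = build_tree(tokens)
print(root.label, [son.label for son in root.sons])
```

The dummy node removes the need for a special case when the first start tag arrives, since there is always a current_node to attach to.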

33

Terrestrial vertebrates
  Synapsida
    Therapsida
      Mammalia
        Marsupialia
          Kangaroo
          Koala
        Eutheria
          Primates
            Human
            Gorilla
            Chimpanzee
          Carnivora
            Walrus
            Wolf
          Proboscidea
            Elephant
  Reptilia
    Diapsida
      Archosauromorpha
        Tyrannosaurus
        Penguin
        Owl
      Lepidosauromorpha
        Lizard
        Snake
    Testudines
      Turtle

34

phylogenyparser.py

Test program

35

Navigating in the tree

Name: Diapsida
Father: Reptilia
Siblings: Testudines
Sons: Archosauromorpha Lepidosauromorpha

(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? b
Number of sibling (0-0)? 0

Name: Testudines
Father: Reptilia
Siblings: Diapsida
Sons: Turtle

(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? p
<N="Testudines"><N="Turtle"></N></N>

Name: Testudines
Father: Reptilia
Siblings: Diapsida
Sons: Turtle

(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? f

Name: Reptilia
Father: Terrestrial vertebrates
Siblings: Synapsida
Sons: Diapsida Testudines

[Figure: the Reptilia subtree, with Diapsida (sons Archosauromorpha and Lepidosauromorpha) and Testudines (son Turtle)]

36

.. on to the exercises