1 Parsing Analyze text: split it into meaningful units, tokens Extract relevant information,...

  • date post

    19-Dec-2015


Parsing

• Analyze text: split it into meaningful units, tokens

• Extract relevant information, disregard irrelevant information

• ‘Meaningful’ and ‘relevant’ depend on application: what are we looking for?


Blast

• Program package for finding similarities between biological sequences

• blastn compares DNA sequences with DNA sequences

• Input:
  – Fasta file with query sequences
  – Formatted Fasta file with database sequences
  – Sensitivity parameter (and more)

• Output:
  – Result of comparing each query to each database sequence


Example run

Query file: arachis.fasta
Database file: arabidopsis_nucleotides.fasta

Format the database: formatdb -i arabidopsis.fasta -p F -o T

Command:

/users/chili/usr/blast-2.2.13/bin/blastall -p blastn -e 0.000000002 -d arabidopsis.fasta -i arachis.fasta -o arachis_arab.bn


Example output – query with no match

BLASTN 2.2.6 [Apr-09-2003]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= CL5Contig1 (797 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

***** No hits found ******

..


Example output – query with matches

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70   3e-10
gb|AV519674.1|AV519674 AV519674 Arabidopsis           68   1e-09
gb|AV557401.1|AV557401 AV557401 Arabidopsis           42   3e-05
gb|BP670151.1|BP670151 BP670151 RAFL21 Arabidopsis    43   1e-04

..


Example output – match alignment

>gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis
 Length = 1009

 Score = 69.9 bits (35), Expect = 3e-10
 Identities = 47/51 (92%)
 Strand = Plus / Plus

Query: 67  gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117
           ||||||||||||||||| ||||| ||||| |||||||| ||||||||||||
Sbjct: 776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826

..

General form of output:

Repetitions of (query, subject matches, alignments)


Extract information from blast output

• Extract the best hit for each query sequence

Query= CL69Contig1 (372 letters)
..

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70   3e-10
..


Algorithm

• Read blast output file line by line

• Introduce two states:
  1. Looking for next query
  2. Looking for hit list

• Return a dictionary mapping each query to its best hit


First state: Looking for next query

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70   3e-10
..

Look for a line starting with Query= (the = is important!)


Why we look for Query= and not just Query

>gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis
 Length = 1009

 Score = 69.9 bits (35), Expect = 3e-10
 Identities = 47/51 (92%)
 Strand = Plus / Plus

Query: 67  gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117
           ||||||||||||||||| ||||| ||||| |||||||| ||||||||||||
Sbjct: 776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826

..
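A small sketch of why the test string matters: alignment lines also begin with the word Query, so a looser prefix test would mistake them for the start of a new query. The helper name is_query_header is hypothetical.

```python
lines = [
    "Query= CL69Contig1 (372 letters)",        # start of a new query
    "Query: 67  gagctattaacaggtaagggtc 117",   # alignment line, not a new query
]


def is_query_header(line):
    """True only for lines that introduce a new query."""
    return line.startswith("Query=")


# Too loose: matches the alignment line as well.
loose = [line for line in lines if line.startswith("Query")]

# Strict: matches only the query header.
strict = [line for line in lines if is_query_header(line)]
```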


Second state: Looking for hit list

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70   3e-10
..

Case A: hits were found


Case B: no hits were found

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70   3e-10
..

Query= CL5Contig1 (797 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

***** No hits found ******


Second state: Looking for hit list

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis     70   3e-10
..

Query= CL5Contig1 (797 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

***** No hits found ******

Look for a line starting with Searching, then read a few more lines to distinguish case A from case B.


blastparser.py (part 1)

Find the query ID:

Query= CL69Contig1


blastparser.py (part 2)

Find the best match ID:

Searching..................................................done

                                                    Score    E
Sequences producing significant alignments:         (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis     70   3e-10

Find the best match ID:

Searching..................................................done

***** No hits found ******
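Once the first line of the hit list has been located, its fields can be pulled apart. The sketch below is one possible way to do it, under the assumption that the bit score and E-value are always the last two whitespace-separated fields and are printed in a form that float() accepts; parse_hit_line is a hypothetical helper, not part of blastparser.py.

```python
def parse_hit_line(line):
    """Split one hit-list line into (hit_id, bit_score, e_value)."""
    # The description may contain spaces, so split from the right:
    # the last two fields are the bit score and the E-value.
    description, bits, e_value = line.rsplit(None, 2)
    hit_id = description.split()[0]   # first word is the subject ID
    return hit_id, float(bits), float(e_value)
```

Only the hit ID is needed for the best-hit dictionary; the score and E-value are extracted here merely to show that the same line carries them.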


Test

>>> from blastparser import parseBlastallOutput
>>> d = parseBlastallOutput("arachis_arab.bn")
>>> d["gi|30419745"]
'gb|BP625785.1|BP625785'
>>> d["gi|30419753"]
'none'


Evolutionary tree of life (animal kingdom)

• Huge hierarchy of groups and subgroups

• Each node in the tree has a name and a (possibly empty) list of descendant trees (sons)

Two-pass parsing

Source: The origin and evolution of model organisms, Nature Reviews Genetics, Nov. 2002, vol. 3.


Abstract data structure to represent a general tree (not necessarily binary)

tree.py
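The contents of tree.py are not reproduced in this transcript. As a rough sketch of what such a data structure might look like, the class below stores a name, a reference to the father, and a list of sons; the class name Tree and the method add_son are assumptions made for illustration.

```python
class Tree:
    """A general tree node: a name plus a (possibly empty) list of sons."""

    def __init__(self, name):
        self.name = name
        self.father = None   # None for the root
        self.sons = []       # any number of subtrees, not just two

    def add_son(self, son):
        """Attach an existing tree as the last son of this node."""
        son.father = self
        self.sons.append(son)
        return son

    def is_leaf(self):
        return not self.sons


# Example: a small tree with root Insects.
insects = Tree("Insects")
beetles = insects.add_son(Tree("Beetles"))
flies = insects.add_son(Tree("Flies"))
```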


How can we write a tree to a sequential file?

(File format should be readable by other systems, so we can’t use cPickle)

– A tree is a labeled node containing a (possibly empty) list of other (sub)trees

– Write tree node using start and end tags: <N="Insects"> [sons] </N>

• Formally (context-free grammar):

T → <N="L">S</N>
S → λ | TS
L → string label

[Figure: example tree with root Insects, its sons Beetles and Flies, and leaves A to E]


Recursive method: string representation of tree

tree.py

First obtain the string representation of the sons (empty string if there are no sons) by calling the function recursively, then create a string with the start tag, label, sons' representation, and end tag.

[Figure: example tree with root Insects, its sons Beetles and Flies, and leaves A to E]

.. <N="Beetles"><N="C"></N><N="D"></N><N="E"></N></N> ..
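The recursive method described above might look roughly like this. It is a sketch only: the class is a stripped-down stand-in for tree.py, and the method name to_string is an assumption.

```python
class Tree:
    """Minimal stand-in for the tree class, just enough to serialize."""

    def __init__(self, name, sons=None):
        self.name = name
        self.sons = list(sons or [])

    def to_string(self):
        # First the sons: empty string if there are none (recursion ends there).
        sons_repr = "".join(son.to_string() for son in self.sons)
        # Then start tag with label, sons' representation, and end tag.
        return '<N="%s">%s</N>' % (self.name, sons_repr)


beetles = Tree("Beetles", [Tree("C"), Tree("D"), Tree("E")])
serialized = beetles.to_string()
```

Here serialized holds the same Beetles fragment shown on the slide.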


Larger tree – How can we read a tree from a sequential file?

<N="Terrestrial vertebrates"><N="Synapsida"><N="Therapsida"><N="Mammalia"><N="Marsupialia"><N="Kangaroo"></N><N="Koala"></N></N><N="Eutheria"><N="Primates"><N="Human"></N><N="Gorilla"></N><N="Chimpanzee"></N></N><N="Carnivora"><N="Walrus"></N><N="Wolf"></N></N><N="Proboscidea"><N="Elephant"></N></N></N></N></N></N><N="Reptilia"><N="Diapsida"><N="Archosauromorpha"><N="Tyrannosaurus"></N><N="Penguin"></N><N="Owl"></N></N><N="Lepidosauromorpha"><N="Lizard"></N><N="Snake"></N></N></N><N="Testudines"><N="Turtle"></N></N></N></N>

We need a parser!

part_of_the_tree_of_life.txt


Two-pass parsing

Complex parsing is often split into two passes:

1. Lexical analysis
   • Identify and assemble tokens: logical units of text

2. Structural analysis
   • Determine the structural hierarchy of the tokens

In our case, the tokens are the two kinds of tag: start tags such as <N="Kangaroo"> and end tags </N>.


Lexical analysis

phylogenyparser.py (part 1)

Match either a start tag or an end tag

Define a group containing the start tag’s label

Search text from index pointer

Create token of right type

Move index pointer
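The code these annotations point at is missing from the transcript. A minimal sketch of a lexical pass following the same five steps could look like this; the token format (a pair of kind and label) and the name tokenize are assumptions, not the original phylogenyparser.py.

```python
import re

# Match either a start tag or an end tag.
# The group captures the start tag's label; it is None for an end tag.
TAG = re.compile(r'<N="([^"]*)">|</N>')


def tokenize(text):
    """Return a list of ("START", label) and ("END", None) tokens."""
    tokens = []
    index = 0                               # index pointer into the text
    while True:
        match = TAG.search(text, index)     # search text from index pointer
        if match is None:
            break                           # no more tags
        if match.group(1) is not None:      # create token of the right type
            tokens.append(("START", match.group(1)))
        else:
            tokens.append(("END", None))
        index = match.end()                 # move index pointer past the tag
    return tokens
```

Anything between tags is silently skipped in this sketch, which is convenient for whitespace but also hides malformed input.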


Structural analysis

phylogenyparser.py (part 2)

[Figure: step-by-step diagram (steps 1 to 3) of how current_node and new_node change while processing the tokens of .. <N="Kangaroo"></N><N="Koala"></N> ..]

Real root will be first son of this node
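One way the structural pass could be written is sketched below, using the current_node and new_node names from the slide and a dummy node whose first son becomes the real root. The token format (pairs such as ("START", "Kangaroo") and ("END", None)) and the names Node and build_tree are assumptions for illustration.

```python
class Node:
    def __init__(self, name, father=None):
        self.name = name
        self.father = father
        self.sons = []


def build_tree(tokens):
    """Turn a flat token list into a tree and return its root."""
    dummy = Node("dummy")          # real root will be first son of this node
    current_node = dummy
    for kind, label in tokens:
        if kind == "START":
            # A start tag opens a new son of the current node and descends.
            new_node = Node(label, father=current_node)
            current_node.sons.append(new_node)
            current_node = new_node
        else:
            # An end tag closes the current node: go back up to its father.
            if current_node is dummy:
                raise ValueError("end tag without matching start tag")
            current_node = current_node.father
    if current_node is not dummy or len(dummy.sons) != 1:
        raise ValueError("tags are unbalanced or there is not exactly one root")
    root = dummy.sons[0]
    root.father = None             # detach the real root from the dummy
    return root
```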


Terrestrial vertebrates
  Synapsida
    Therapsida
      Mammalia
        Marsupialia
          Kangaroo
          Koala
        Eutheria
          Primates
            Human
            Gorilla
            Chimpanzee
          Carnivora
            Walrus
            Wolf
          Proboscidea
            Elephant
  Reptilia
    Diapsida
      Archosauromorpha
        Tyrannosaurus
        Penguin
        Owl
      Lepidosauromorpha
        Lizard
        Snake
    Testudines
      Turtle


Test program

phylogenyparser.py


Navigating in the tree

Name: Diapsida
Father: Reptilia
Siblings: Testudines
Sons: Archosauromorpha Lepidosauromorpha

(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? b
Number of sibling (0-0)? 0

Name: Testudines
Father: Reptilia
Siblings: Diapsida
Sons: Turtle

(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? p
<N="Testudines"><N="Turtle"></N></N>

Name: Testudines
Father: Reptilia
Siblings: Diapsida
Sons: Turtle

(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? f

Name: Reptilia
Father: Terrestrial vertebrates
Siblings: Synapsida
Sons: Diapsida Testudines

[Figure: the Reptilia subtree, with Diapsida (sons Archosauromorpha and Lepidosauromorpha) and Testudines (son Turtle)]
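The four-line summary printed at each step of the session can be derived from the father and sons references alone. The sketch below shows one way to do it; Node, add_son and describe are hypothetical names, and the interactive menu itself is left out.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.father = None
        self.sons = []

    def add_son(self, son):
        son.father = self
        self.sons.append(son)
        return son


def describe(node):
    """Return the Name/Father/Siblings/Sons summary for one node."""
    if node.father is None:
        father_name = "(none)"     # the root has no father and no siblings
        siblings = []
    else:
        father_name = node.father.name
        # Siblings are the father's other sons.
        siblings = [s.name for s in node.father.sons if s is not node]
    return "\n".join([
        "Name: " + node.name,
        "Father: " + father_name,
        "Siblings: " + " ".join(siblings),
        "Sons: " + " ".join(son.name for son in node.sons),
    ])


# Rebuild the Reptilia subtree from the slide.
reptilia = Node("Reptilia")
diapsida = reptilia.add_son(Node("Diapsida"))
testudines = reptilia.add_son(Node("Testudines"))
diapsida.add_son(Node("Archosauromorpha"))
diapsida.add_son(Node("Lepidosauromorpha"))
testudines.add_son(Node("Turtle"))
```

Calling describe(diapsida) then yields the same four lines as the first step of the session above.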


.. on to the exercises