FASTA and BLAST

Chitta Baral

FASTA: Basic Steps

• Step 1:
– Set a word size (usually 6 for DNA and 2 for proteins).
– Make a plot.
– Find the long diagonals (or high-scoring regions).

• Step 2:
– Score the 10 best diagonal runs using a scoring matrix (allow mismatches, end extensions, and joining of two diagonals, but no gaps).
– init1: the single best sub-alignment found in this stage.

• Step 3:
– Merge non-overlapping diagonal runs to allow gaps (ins/del).
– Score of joined regions = sum of individual scores - penalty.
– The score of the highest-scoring region at the end of this step is called initn.

• Step 4:
– Use a variant of the Smith-Waterman algorithm on a narrow band around initn and construct an optimal alignment of this region.

• Modifications:
– In Step 4, use a band around init1.
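Step 1 of FASTA can be illustrated with a short Python sketch. It assumes exact word matches only and labels each diagonal by (target position - query position). The names kmer_index, diagonal_hits, and best_diagonals are hypothetical, chosen for this illustration; they are not part of the FASTA program. Rescoring with a matrix, joining runs, and the banded alignment of Steps 2 to 4 are not shown.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map each k-letter word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k):
    """Count exact word matches on each diagonal.

    A match of query position i with target position j lies on
    diagonal j - i (an assumed labelling for this sketch).
    """
    index = kmer_index(query, k)
    counts = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            counts[j - i] += 1
    return dict(counts)


def best_diagonals(query, target, k, top=10):
    """Return up to `top` (diagonal, hit count) pairs, most hits first."""
    counts = diagonal_hits(query, target, k)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top]
```

For example, best_diagonals("ACGTACGTGA", "TTACGTACGTCC", 4) ranks diagonal 2 first, because five consecutive 4-letter words match along it. The default top=10 mirrors the "10 best diagonal runs" of Step 2, although here the ranking is by raw hit count rather than by a scoring matrix.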

BLAST: Basic Steps

• Step 1:
– Set a word size (3 for protein and 11 for DNA) and create a word list for the query sequence.
– E.g. qlnfsagw gives {ql, ln, nf, fs, sa, ag, gw} (two-letter words are used here to keep the example short).
– Expand the list with similar words that score at least a threshold T (say 8):
  • ql: ql, qm, hl, zl
  • ln: ln, lb
  • nf: nf, af, ny, df, qf, ef, gf, hf, kf, sf, tf, bf, zf
  • fs: fs, fa, fn, fd, fg, fp, ft, fb, ys
  • sa: none
  • ag: ag
  • gw: gw, aw, rw, nw, dw, qw, ew, hw, iw, kw, mw, pw, sw, tw, vw, bw, zw, xw

• Step 2:
– Scan through the string being searched, and whenever a word in the list is found, try to extend the match in both directions (no gaps) to reach a score beyond a threshold S. While extending, use a parameter L that defines how long an extension will be tried to raise the score over S.

• Modifications of Step 2:
– Original BLAST: extension is continued as long as the score continues to increase.
– Another version: extension is stopped when the accumulated score stops increasing and has fallen a certain amount below the best score found.
– Blast2 (gapped BLAST):
  • A lower value of T is used.
  • After extension, try to combine the extended segments (allowing gaps).
  • Find the maximal scoring segment, then use the Smith-Waterman algorithm in a band around this segment (as in FASTA).
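The word list, the threshold-T expansion, and the ungapped extension can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not BLAST's actual implementation: toy_score (match +5, mismatch -4) is a made-up stand-in for a real substitution matrix, so the neighborhoods it produces differ from the lists above, and the names word_list, neighborhood, and extend_hit are hypothetical. The extension follows the drop-off variant: it stops once the running score has fallen more than `drop` below the best score seen.

```python
from itertools import product


def toy_score(a, b):
    """Assumed toy scoring: +5 for a match, -4 for a mismatch."""
    return 5 if a == b else -4


def word_list(query, k):
    """All overlapping k-letter words of the query, in order."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def neighborhood(word, alphabet, threshold, score=toy_score):
    """Words over `alphabet` that score at least `threshold` against `word`."""
    return ["".join(c) for c in product(alphabet, repeat=len(word))
            if sum(score(a, b) for a, b in zip(word, c)) >= threshold]


def _extend(query, target, qi, ti, step, score, drop):
    """Extend without gaps in one direction; return (best gain, its length)."""
    best = total = 0
    best_len = n = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        total += score(query[qi], target[ti])
        n += 1
        if total > best:
            best, best_len = total, n
        elif best - total > drop:
            break  # fell too far below the best score: stop extending
        qi += step
        ti += step
    return best, best_len


def extend_hit(query, target, qi, ti, k, drop, score=toy_score):
    """Extend a k-letter hit at query[qi], target[ti] in both directions.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    seed = sum(score(query[qi + d], target[ti + d]) for d in range(k))
    right, rlen = _extend(query, target, qi + k, ti + k, 1, score, drop)
    left, llen = _extend(query, target, qi - 1, ti - 1, -1, score, drop)
    return seed + left + right, qi - llen, qi + k + rlen
```

Here word_list("qlnfsagw", 2) reproduces the seven words of the example above. A caller would keep an extended hit only if its score exceeds the threshold S; that comparison is left out of the sketch.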

Homework (due 3/31/03)

• Compare BLAST and FASTA.

(Hint: Read the external pointers in the class notes page.)