9/4/10
Text Algorithms (4AP)
Jaak Vilo 2010 fall
MTAT.03.190 Text Algorithms, Jaak Vilo
Topic
• Algorithms on strings – sequences, texts, documents, images, ...
– Strings, texts, … (Estonian: stringid, sõned, tekstid)
Search (e.g. grep)
vilo@mac:~/L/Text/Data$ grep ATCGATC Yeast_01.fa
TGTTGAAGATCTAAAGGTATCGATCAAATATGTTGCTAGAGAGTGACTGAGTGTTACATT
TCTACAAAGCACAGAGATCGATCTGGGGCAAGAAGAGCCAGTAGAGGAAGAGACTGTCAT
TTCAAATATGTCGTTTCATTATCTGTATGACTGTCGTAACTTTGAATCGATCTAATGTGT
TAAAAACCTGCCCATCTTGATCGATCTGTTGACTCAAAATTTGGGATTATCTACAGACGA
AAGCAAGAGGAGGCGCATCGATCGTGGCAGATGAGTCAGCAAACACCACAGGAAAGTGAA
ATTCAACTTGAATTTGAGGCTTACCGTCAACATCGATCAACTTGAATGGGAAGTGCTTCA
GCATCGTTCATACAAGTAATTATGCTATATTATCGATCCTCGGATTTCAGCTTCCGTTAT
vilo@mac:~/L/Text/Data$
Approximate search
vilo@mac:~/L/Text/Data$ agrep -B -x kohanimi kohanimed.txt
best match has 3 errors, there are 4 matches, output them? (y/n)y
johani
kaimi
kompanii
kopanitsi
vilo@mac:~/L/Text/Data$
Indexing and compression
• Making searches much faster
• Compressing data
Information retrieval
• Google …
Ingredients
• S = s1 s2 … sn (text), |S| = n (length)
• P = p1 p2 … pm (pattern), |P| = m
• Σ: alphabet, |Σ| = c
Questions: Pattern matching
• Does S contain P?
– Does S = S' P S" for some strings S' and S"?
– Usually m << n and n can be (very) large
– Exact match: O(nm), O(n), O(n/m)
– Approximate match
  • Hamming distance
  • Edit distance
  • Generalised edit distance
• Why?
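The two simplest questions above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function names `contains` and `hamming_matches` and the mismatch threshold `k` are assumptions introduced here.

```python
def contains(S, P):
    """Exact question: are there strings S', S'' such that S = S' P S''?"""
    return P in S  # Python's built-in substring test


def hamming_matches(S, P, k):
    """0-based start positions where P matches S with at most k mismatches."""
    n, m = len(S), len(P)
    hits = []
    for i in range(n - m + 1):
        # Hamming distance between P and the window S[i:i+m]
        mismatches = sum(1 for a, b in zip(S[i:i + m], P) if a != b)
        if mismatches <= k:
            hits.append(i)
    return hits
```

With k = 0 this reduces to exact matching in O(nm) time; the Hamming distance only allows substitutions, so it cannot model inserted or deleted characters.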
Example
• S = text that contains characters
• P = racter
• P= obtain, cotain, ota, hate, harra
Multiple occurrences in text
[Figure: text S with occurrences of pattern P]
Multiple patterns
[Figure: text S with a set of patterns {P}]
Dictionary lookup
• Given D= { T1, T2, .., Tn }
• Does D contain P?
• Data Structures & Algorithms?
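One possible answer to the data-structure question is a trie, sketched below under stated assumptions: the class name `Trie` and its nested-dict layout are illustrative choices, and a hash set would be another reasonable option.

```python
class Trie:
    """Character trie; each node is a dict from character to child node."""
    _END = object()  # sentinel key marking the end of a stored string

    def __init__(self, words=()):
        self.root = {}
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[Trie._END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:      # no stored string continues with ch
                return False
        return Trie._END in node  # a stored string must end exactly here
```

A lookup walks at most |P| nodes, so its cost depends on the length of P rather than on the number of strings in D.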
CPM
• Combinatorial Pattern Matching addresses issues of searching and matching strings and more complicated patterns such as trees, regular expressions, graphs, point sets, and arrays. The goal is to derive non-trivial combinatorial properties for such structures and then to exploit these properties in order to achieve improved performance for the corresponding computational problem.
• A steady flow of high-quality research on this subject has changed a sparse set of isolated results into a full-fledged area of algorithmics with important applications. This area is expected to grow even further due to the increasing demand for speed and efficiency that comes from molecular biology, but also from areas such as information retrieval, pattern recognition, compiling, data compression, program analysis and security.
Algorithms
• Brute force
• Knuth-Morris-Pratt
• Rabin-Karp
• Boyer-Moore
• …
Brute Force
[Figure: pattern P aligned with text S at position i; P[j] is compared with S[i+j-1]]
Identify the first mismatch!
Question: Problems of this method? Ideas to improve the search?
Brute force
Algorithm Naive
Input: Text S[1..n] and pattern P[1..m]
Output: All positions i where P occurs in S

for ( i = 1 ; i <= n-m+1 ; i++ ) {
    for ( j = 1 ; j <= m ; j++ )
        if ( S[i+j-1] != P[j] ) break ;   // first mismatch at pattern position j
    if ( j > m ) print i ;                // all m characters matched
}
attempt 1: gcatcgcagagagtatacagtacg
           GCAg....
attempt 2: gcatcgcagagagtatacagtacg
            g.......
attempt 3: gcatcgcagagagtatacagtacg
             g.......
attempt 4: gcatcgcagagagtatacagtacg
              g.......
attempt 5: gcatcgcagagagtatacagtacg
               g.......
attempt 6: gcatcgcagagagtatacagtacg
                GCAGAGAG
attempt 7: gcatcgcagagagtatacagtacg
                 g.......
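The same brute-force search can be written in Python to reproduce the attempts above. This is a sketch with 0-based positions; the function name `naive_search` and the comparison counter are additions for illustration only.

```python
def naive_search(S, P):
    """Return (0-based match positions, number of character comparisons)."""
    n, m = len(S), len(P)
    positions, comparisons = [], 0
    for i in range(n - m + 1):        # one attempt per alignment i
        j = 0
        while j < m:
            comparisons += 1
            if S[i + j] != P[j]:      # first mismatch ends this attempt
                break
            j += 1
        if j == m:                    # all m characters matched
            positions.append(i)
    return positions, comparisons
```

For the text `gcatcgcagagagtatacagtacg` and pattern `gcagagag`, the call returns the single position 5 (attempt 6 above). Counting comparisons makes the O(nm) worst case visible on inputs such as a long run of one repeated character.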
Animations
• http://www-igm.univ-mlv.fr/~lecroq/string/
• EXACT STRING MATCHING ALGORITHMS, Animation in Java
• Christian Charras, Thierry Lecroq. Laboratoire d'Informatique de Rouen, Université de Rouen, Faculté des Sciences et des Techniques, 76821 Mont-Saint-Aignan Cedex, FRANCE
• e-mails: {Christian.Charras, Thierry.Lecroq}@laposte.net
Time analysis
• Worst case
• Average case
• Practical measurements
• Preprocessing vs analysis
How does search depend on
• |S|
• |P| or set of patterns, ||P||
• |Σ|
• Similarity measure and distance k
Space complexity
• Memory usage
– Preprocessing
– Matching
– …
Approximate search
• Similarity measures
– Edit distance
– …
• Dynamic programming
– Memorizing intermediate results
• Bit-parallel algorithms and practical efficiency
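As an illustration of dynamic programming with memorized intermediate results, here is a minimal sketch of the unit-cost edit (Levenshtein) distance. The function name `edit_distance` and the row-by-row layout are choices made for this example; generalised edit distances would use other cost functions.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))           # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # match or substitute
        prev = curr                          # only the previous row is kept
    return prev[-1]
```

The table has (|a|+1)(|b|+1) cells, so the time is O(|a||b|), while keeping only one previous row reduces the space to O(|b|). For the agrep example earlier in this lecture, `edit_distance("kohanimi", "johani")` evaluates to 3, in line with the reported "3 errors".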
Questions
• Exact vs approximate
• (sub) string ACGTAG
• 1D vs 2D …
• Regular expressions A([CG]A*T)+T
• Probabilistic
• Multiple patterns
• Online vs offline (indexed)
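The regular expression from the list above can be tried directly with Python's standard `re` module. This is a small sketch; the helper name `find_motifs` is an assumption made for illustration.

```python
import re

# A, then one or more groups of ([C or G], any number of A's, T), then T
motif = re.compile(r"A([CG]A*T)+T")


def find_motifs(S):
    """Return (start, matched text) for each non-overlapping match in S."""
    return [(m.start(), m.group(0)) for m in motif.finditer(S)]
```

For example, `find_motifs("TTACATTGG")` reports the match `ACATT` starting at position 2, while the plain substring `ACGTAG` from the list contains no match for this expression.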
Knowledge Explosion: PubMed
• Average number of new citations appearing in PubMed
– In 1980: 746/day
– In 2004: 1,640/day
[Figure: No. of New Publications per year, 1980 to 2003 (y-axis from 300,000 to 600,000)]
[Figure: Accumulated New Publications per year, 1980 to 2003 (y-axis from 2,000,000 to 12,000,000)]
Indexing
• Suffix tree
– O(n) time and space
– O(m) query
• Suffix array
• Compressed suffix trees
• Inverted index (Estonian: pöördindex)
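To make the idea of an offline index concrete, here is a deliberately naive suffix array sketch. The names `build_suffix_array` and `find_occurrences` are assumptions for this example, and the construction simply sorts all suffixes, which is much slower than the linear-time methods for suffix trees listed above.

```python
def build_suffix_array(S):
    """Start positions of all suffixes of S in lexicographic order.
    Naive construction by sorting suffixes; simple, but not linear time."""
    return sorted(range(len(S)), key=lambda i: S[i:])


def find_occurrences(S, sa, P):
    """All 0-based positions of P in S, via binary search on the suffix array."""
    m = len(P)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose m-prefix is >= P
        mid = (lo + hi) // 2
        if S[sa[mid]:sa[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose m-prefix is > P
        mid = (lo + hi) // 2
        if S[sa[mid]:sa[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])         # suffixes in between all start with P
```

Once the array is built, each query needs O(log n) string comparisons of length at most m, so the text itself is no longer scanned. A suffix tree would bring the query down to the O(m) stated above.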
Compression
[Figure: Text → Encoder (with Model) → Compressed Text → Decoder (with Model) → Text]
Compression
• Run-length encoding
• Shannon-Fano
• Huffman codes
• Arithmetic Coding
• “Memorizing”: Lempel-Ziv
• Burrows-Wheeler
• Algorithmic complexity, Kolmogorov …
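Run-length encoding, the first method in the list, is small enough to sketch in full. The function names `rle_encode` and `rle_decode` and the list-of-pairs output format are assumptions for illustration; real formats pack the pairs into bytes.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of equal characters into a (character, length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Inverse of rle_encode: expand each pair back into a run."""
    return "".join(ch * count for ch, count in pairs)
```

The scheme only pays off when the text has long runs; on text without repeated neighbouring characters the encoded form is larger than the input.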
Information retrieval
• Google, Yahoo!, …
• How can this be made possible?
• How to find relevant documents?
• Similar documents
Text mining
• Analyzing texts
• Finding trends
• Finding regularities, patterns, motifs
• Mining web usage, links, etc…
• …
[Figure: word cloud of Estonian words, 250 words by relative frequency (2004)]
[Figure: word cloud of Estonian words associated with Eesti Energia in June 2004, showing change, rise and fall (rise in orange, fall in blue)]
[Figure: word cloud of Estonian words, a selection of the fastest-rising words in May 2004]
TGTTCTTTCTTCTTTCATACATCCTTTTCCTTTTTTTCC TTCTCCTTTCATTTCCTGACTTTTAATATAGGCTTACCA TCCTTCTTCTCTTCAATAACCTTCTTACATTGCTTCTTC TTCGATTGCTTCAAAGTAGTTCGTGAATCATCCTTCAAT GCCTCAGCACCTTCAGCACTTGCACTTCATTCTCTGGAA GTGCTGCACCTGCGCTGTCTTGCTAATGGATTTGGAGTT GGCGTGGCACTGATTTCTTCGACATGGGCGGCGTCTTCT TCGAATTCCATCAGTCCTCATAGTTCTGTTGGTTCTTTT CTCTGATGATCGTCATCTTTCACTGATCTGATGTTCCTG TGCCCTATCTATATCATCTCAAAGTTCACCTTTGCCACT TTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTAC TTTTTTCACTCGATGAGCTATAAGAGTTTTCCACTTTTA GATCGTGGCTGGGCTTATATTACGGTGTGATGAGGGCGC TTGAAAAGATTTTTTCATCTCACAAGCGACGAGGGCCCG AGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGT CTTAGAAGATAAAGTAGTGAATTACAATAGATTCGATAC
Overview
• Matching (exact, approximate, multiple P, regular expressions, … )
• Indexing and data structures
• Text compression
• Information Retrieval
• Probabilistic motifs: HMM, SCFG, …
• Pattern discovery & data mining
Books
Practical assignment 1
• Propose some real-life use cases for string searching
– Exact
– Approximate
– Regular expression
– Multiple patterns
• Estimate the size of “interesting texts” for various typical use cases
• Study the Unix tool grep family (grep, ggrep, egrep, fgrep, agrep, … )
– What functionality is being offered?
• Run grep, and perform practical measurements
– Speed: how many characters are searched per second?
– Analyze dependence on m
– Dependence on alphabet size
• Create script(s) for evaluating grep at different text and pattern sizes
– E.g. Perl, bash, python, …
– Unix time, redirection of output ( > , >>, 2> , … ) …
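One possible starting point for such a measurement script is sketched below in Python. Everything here is an assumption for illustration: the helper names `random_text` and `time_grep`, the 60-character line length, and the choice of `grep -c -F` (count matching lines, fixed-string pattern) as the command being timed. It is not a prescribed solution to the assignment.

```python
import random
import shutil
import subprocess
import tempfile
import time


def random_text(n, alphabet="ACGT", seed=0):
    """Random text of n characters over the given alphabet, in lines of 60."""
    rng = random.Random(seed)
    chars = "".join(rng.choice(alphabet) for _ in range(n))
    return "\n".join(chars[i:i + 60] for i in range(0, n, 60)) + "\n"


def time_grep(text, pattern, repeats=3):
    """Best wall-clock seconds for one grep run, or None if grep is missing."""
    if shutil.which("grep") is None:
        return None
    with tempfile.NamedTemporaryFile("w", suffix=".txt") as f:
        f.write(text)
        f.flush()
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            # check=False: grep exits with status 1 when nothing matches
            subprocess.run(["grep", "-c", "-F", "-e", pattern, f.name],
                           stdout=subprocess.DEVNULL, check=False)
            best = min(best, time.perf_counter() - t0)
    return best
```

Calling `time_grep(random_text(n, alphabet), pattern)` in a loop over n, over pattern lengths m, and over alphabets of different sizes gives the raw numbers for the dependence questions above. Process start-up time is included in each measurement, so small texts will mostly measure that overhead.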
Course
• ~12 Lectures 24h + 20h
• ~10 Practicals 20h + 40h
• Project work 40h
• Exam 4h + 12h
• --------------------------------
• Total 48h + 112h = 160h (6EAP)
Grade
• Homework 50 + bonus points
• Project work 20
• Exam 30
• --------------------------------
• Total 100p
Homework
• Essential part of the course
• It is obligatory to complete at least 50% of the tasks
• Solutions must be in before 12:15 (web upload)
• Presentations orally during the practicals
Project
• A practical algorithm development task plus analysis and comparisons of efficiency
• Presentation of results as a poster
• A poster session – everybody presenting!
Exam
• Will be based exactly or nearly on the questions of the homework assignments
• Knowledge of the basic principles of algorithms
• Creative use of the algorithms
Contact
• Lectures, practicals – active hours
• http://courses.cs.ut.ee/2010/text/
• Email ([email protected])
• Office hour (TBD); room 327
– Other times: knock on door or when door open
• Upon agreement