Page 1

Concept Modeling in Bio-informatics

Sanida Omerovic*, Saso Tomazic*, Mateo Valero**, Milos Milovanovic**, David Torrents**

*University of Ljubljana, Slovenia
** UPC, Barcelona, Catalonia

IPSI Firence-2007

Page 2

WHAT IS A CONCEPT?

Page 3

Decision Making Algorithm

Page 4

Concept Modeling Layer

What is a concept? How is it modeled? How is it built? How is it exploited? How is it updated?

Page 5

Classification of concept modeling (CM) and decision making systems (DMS)

This classification is made based on the following assumption:

Any decision making system, regardless of whether the process is performed entirely by humans, supported by machines, or fully automated, is a layered process, with one layer (explicit or implicit) that can be called the Concept Modeling Layer.

Page 6

Purpose (DMS)

General

Specialized (Bio-informatics)

Page 7

Bio-informatics

Genomic researchers mostly deal with questions of similarity between genomic sequences. Genomic sequences are treated as long sequences of letters:

A (Adenine) G (Guanine) C (Cytosine) T (Thymine)

which represent the nitrogenous bases in the structure of DNA.
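Treating a sequence as a string of letters makes simple checks easy to automate. The following is a minimal Python sketch, assuming sequences are stored as plain strings; the name `base_composition` is illustrative and does not come from the slides.

```python
from collections import Counter

# The four DNA letters named on the slide.
DNA_ALPHABET = {"A": "Adenine", "G": "Guanine", "C": "Cytosine", "T": "Thymine"}

def base_composition(sequence: str) -> dict:
    """Count each base, rejecting letters outside A/G/C/T."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(DNA_ALPHABET)
    if unknown:
        raise ValueError(f"unexpected letters: {sorted(unknown)}")
    counts = Counter(sequence)
    # Report all four bases, including those that never occur.
    return {base: counts[base] for base in DNA_ALPHABET}

print(base_composition("GATTCATCGA"))
```

Rejecting unknown letters early is a design choice for this sketch; real sequence files often contain extra symbols (for example `N` for an undetermined base) that would need their own handling.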

Page 8

DNA sequence

[Figure: DNA chemical compound, with bases (A), (T), (G), (C), and the corresponding DNA sequence]

A DNA sequence is presented as an array of letters that map to the nucleotides in DNA (each consisting of one of four types of nitrogenous bases A/G/C/T, a five-carbon sugar, and a molecule of phosphoric acid).

Page 9

DNA sequence analysis

GATTCATCGA CCATCAAAT GATT

[Figure labels: Start sequence, End sequence, Start sequence; Useful data, Noisy data]
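Locating marker sequences is the first step in separating useful data from noisy data. The sketch below finds every occurrence of a marker in the slide's example sequence. It assumes, for illustration only, that `GATT` plays the role of the start sequence; the figure does not state which letters form each marker, and the name `find_marker` is hypothetical.

```python
def find_marker(sequence: str, marker: str) -> list:
    """Return the 0-based start index of every occurrence of marker (overlaps included)."""
    if not marker:
        raise ValueError("marker must not be empty")
    positions = []
    start = sequence.find(marker)
    while start != -1:
        positions.append(start)
        # Advance by one letter so overlapping occurrences are also found.
        start = sequence.find(marker, start + 1)
    return positions

# The three groups of letters shown on the slide, joined into one sequence.
sequence = "GATTCATCGA" + "CCATCAAAT" + "GATT"
print(find_marker(sequence, "GATT"))
```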

Page 10

Bio-informatics in DMS

Sequence concept (still impossible; there is no protein conceptual model)

Sequence analysis (software: BLAST, Smith-Waterman, FASTA, etc.)

Sequence retrieval (easy; available for free on the Web: ENSEMBL.ORG, NCBI, UCSC, etc.)

Sequencing (hard; laboratory work at the level of chemical reactions to determine whether C/T/G/A is present at a given position in the DNA chain)
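To give a feel for what the sequence analysis layer computes, here is a minimal sketch of Smith-Waterman local alignment scoring in Python. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration; tools such as BLAST score proteins with substitution matrices and use heuristics rather than this full dynamic-programming table.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # row 0 of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("GATTACA", "ATTAC"))
```

Only two rows of the table are kept in memory because the sketch returns the score alone; recovering the aligned letters would require the full table and a traceback.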

Page 11

Sequence analysis

In the example shown in the next two figures, one can see a fraction of the results obtained from a BLAST comparison of the protein SLC7A7 (human) against the SwissProt database of proteins.

We selected two illustrative examples, ranging from a perfect (word) match to a similar match.

Page 12

BLAST Sample session, perfect match

>gi|12643348|sp|Q9UHI5|LAT2_HUMAN <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=12643348&dopt=GenPept>
Gene info <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=search&term=12643348%5BPUID%5D>
Large neutral amino acids transporter small subunit 2 (L-type amino acid transporter 2) (hLAT2)
Length=535

Score = 665 bits (1717), Expect = 0.0, Method: Composition-based stats.
Identities = 332/332 (100%), Positives = 332/332 (100%), Gaps = 0/332 (0%)

Query  1    MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK  60
            MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK
Sbjct  204  MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK  263

Query  61   NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA  120
            NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA
Sbjct  264  NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA  323

Query  121  LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD  180
            LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD
Sbjct  324  LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD  383

Query  181  MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW  240
            MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW
Sbjct  384  MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW  443

Query  241  SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG  300
            SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG
Sbjct  444  SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG  503

Query  301  TEEANEDMEEQQQPMYQPTPTKDKDVAGQPQP  332
            TEEANEDMEEQQQPMYQPTPTKDKDVAGQPQP
Sbjct  504  TEEANEDMEEQQQPMYQPTPTKDKDVAGQPQP  535

Page 13

BLAST Sample session, similar match

>gi|12643378|sp|Q9UM01|YLA1_HUMAN <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=12643378&dopt=GenPept>
Gene info <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=search&term=12643378%5BPUID%5D>
Y+L amino acid transporter 1 (y(+)L-type amino acid transporter 1) (y+LAT-1) (Y+LAT1) (Monocyte amino acid permease 2) (MOP-2)
Length=511

Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats.
Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%)

Query 2 GIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYKN 61
GIV++ +G E N+FE +G +ALA F+Y GW+ LNYVTEE+ +P +N
Sbjct 202 GIVRLGQGASTHFE--NSFEG-SSFAVGDIALALYSALFSYSGWDTLNYVTEEIKNPERN 258

Query 62 LPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVAL 121
LP +I IS+P+VT +Y+ NVAY T + +++LAS+AVAVTF +++ G+ WI+P+SVAL
Sbjct 259 LPLSIGISMPIVTIIYILTNVAYYTVLDMRDILASDAVAVTFADQIFGIFNWIIPLSVAL 318

Query 122 STFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSDM 181
S FGG+N S+ +SRLFF G+REGHLP + MIHV+R TP+P+LLF I L+ L D+
Sbjct 319 SCFGGLNASIVAASRLFFVGSREGHLPDAICMIHVERFTPVPSLLFNGIMALIYLCVEDI 378

Query 182 YTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLWS 241
+ LINY F + F G+++ GQ+ LRWK+PD PRP+K+++ FPI++ L FL+ L+S
Sbjct 379 FQLINYYSFSYWFFVGLSIVGQLYLRWKEPDRPRPLKLSVFFPIVFCLCTIFLVAVPLYS 438

Query 242 EPVVCGIGLAIMLTGVPVYFL--GVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGS 299
+ + IG+AI L+G+P YFL V +P + T Q +C+ V E++
Sbjct 439 DTINSLIGIAIALSGLPFYFLIIRVPEHKRPLYLRRIVGSATRYLQVLCMSVAAEMDLED 498

Query 300 GTEEANEDMEEQQQP 314
G E M +Q+ P
Sbjct 499 GGE-----MPKQRDP 508

Page 14

BLAST output:

Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats.
Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%)

BLAST expresses the level of similarity between the query sequence and a database sequence in terms of Score, Expect, Method, Identities, Positives, and Gaps. This is where our DMA layer ends; from this point on, inference has to be done by researchers on the basis of the software output (e.g. BLAST) and knowledge gathered elsewhere (books, computers, brains…).
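Before a researcher, or a later layer, can reason about these fields, they have to be read out of the text report. The sketch below parses the summary line shown above with regular expressions. It assumes the exact textual layout printed on this slide; `parse_blast_stats` is a hypothetical helper, not part of BLAST.

```python
import re

def parse_blast_stats(text: str) -> dict:
    """Extract the summary fields of one BLAST hit from its text report."""
    stats = {}
    score = re.search(r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\)", text)
    if score:
        stats["bit_score"] = float(score.group(1))
        stats["raw_score"] = int(score.group(2))
    expect = re.search(r"Expect\s*=\s*([\d.eE+-]+)", text)
    if expect:
        stats["expect"] = float(expect.group(1))
    # Identities, Positives and Gaps are all printed as "count/length (percent%)".
    for name in ("Identities", "Positives", "Gaps"):
        m = re.search(rf"{name}\s*=\s*(\d+)/(\d+)", text)
        if m:
            stats[name.lower()] = (int(m.group(1)), int(m.group(2)))
    return stats

line = ("Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats. "
        "Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%)")
stats = parse_blast_stats(line)
print(stats)
```

From the parsed pair `stats["identities"]`, the identity fraction is 138/315, roughly 0.438, which the report shows as 43%.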

Also, a forthcoming challenge in the field of comparative genomic analysis is to compare large amounts of genomic data (letters). For example, if one wants to compare one mammalian genomic sequence against all existing mammalian sequences, one would need a database with 60 GB of storage (Sargasso Sea project).

Page 15

Application for text analysis:

Frequency (number of occurrences)

Distance

Exclude stop words (and, if, or, etc.)

Stemming (traveling => travel; traveled => travel)

Synonyms (sick = ill)

Visual Basic
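The preprocessing steps listed above can be sketched as follows. The original application was written in Visual Basic and its code is not shown, so this Python version is only an illustration: the stop word list, the suffix-stripping rule, and the choice of "sick" as the canonical synonym are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative resources; real lists would be much larger.
STOP_WORDS = {"and", "if", "or", "the", "a", "is", "of", "in"}
SYNONYMS = {"ill": "sick"}      # map each synonym to one canonical word
SUFFIXES = ("ing", "ed", "s")   # naive suffix stripping, not a real stemmer

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text: str) -> list:
    """Lowercase, tokenize, drop stop words, stem, then merge synonyms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [stem(t) for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in kept]

frequencies = Counter(normalize("Traveling and traveled: he is ill or sick"))
print(frequencies)
```

After normalization, word frequency is a plain count of the remaining tokens, so "traveling" and "traveled" contribute to the same entry.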

Page 16

Home-made Brandy Production

Grape-gathering is the first phase in the production of brandy, though it might also be made from plums, figs, pears or cornel berries. The gathered grapes are crushed and then poured into wooden barrels. They are mixed several times a day, the more often the better. The obtained mass is called wine-marc. The process of alcoholic fermentation usually lasts fifteen or thirty days. When it is finished, or, as people usually say, when the marc is still, distillation begins, i.e. the making of brandy, which is done in special copper cauldrons. Handmade copper cauldrons can still be found in Tuscany households…

Page 17

word      word          frequency  distance
brandy    grape         10         0
brandy    alcohol       4          1
brandy    distillation  3          3
brandy    strength      3          5
brandy    making        2          5
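The slides do not define how frequency and distance are measured for a word pair, so the following Python sketch fixes one possible reading as an explicit assumption: frequency is the number of sentences in which both words occur, and distance is the smallest number of words standing between them anywhere in the text (0 means adjacent). The name `pair_stats` is hypothetical, and the two words are assumed to be different.

```python
import re
from typing import Optional, Tuple

def pair_stats(text: str, word_a: str, word_b: str) -> Tuple[int, Optional[int]]:
    """Return (frequency, distance) for a pair of distinct words under the assumed definitions."""
    sentences = [re.findall(r"[a-z]+", s) for s in re.split(r"[.!?]", text.lower())]
    # Frequency: sentences that contain both words.
    frequency = sum(1 for s in sentences if word_a in s and word_b in s)
    tokens = [t for s in sentences for t in s]
    pos_a = [i for i, t in enumerate(tokens) if t == word_a]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b]
    if not pos_a or not pos_b:
        return frequency, None  # one of the words never occurs
    # Distance: fewest intervening words between any two occurrences.
    distance = min(abs(i - j) - 1 for i in pos_a for j in pos_b)
    return frequency, distance

sample = "Brandy is made from grape juice. Grape brandy is strong. Plums also work."
print(pair_stats(sample, "brandy", "grape"))
```

Under other definitions (for example counting occurrences within a fixed window) the numbers would differ, so the values in the table above should not be expected to follow from this sketch.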

Page 18

Concept criteria:

Frequency > 5

Distance < 2

Concepts:

brandy grape

brandy alcohol

Transcription:

brandy - made of - grapes

brandy - kind of - alcohol
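The selection step can be sketched in Python using the word-pair table from the previous slide. The slide does not say how the two criteria are combined. In this sketch, accepting a pair when either criterion holds reproduces both listed concepts, whereas requiring both would keep only brandy/grape (brandy/alcohol has frequency 4). The function name and the `require_both` switch are illustrative assumptions.

```python
# (word, word, frequency, distance) rows copied from the slide's table.
PAIRS = [
    ("brandy", "grape", 10, 0),
    ("brandy", "alcohol", 4, 1),
    ("brandy", "distillation", 3, 3),
    ("brandy", "strength", 3, 5),
    ("brandy", "making", 2, 5),
]

def select_concepts(pairs, min_frequency=5, max_distance=2, require_both=False):
    """Keep pairs with frequency > min_frequency and/or distance < max_distance."""
    combine = all if require_both else any
    return [
        (a, b)
        for a, b, frequency, distance in pairs
        if combine((frequency > min_frequency, distance < max_distance))
    ]

print(select_concepts(PAIRS))
print(select_concepts(PAIRS, require_both=True))
```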

Page 19

Concept Modeling layer

Implicit (concepts are not explicitly mentioned):

Protein conceptual model

Explicit (concepts are explicitly mentioned and/or defined):

Frequency > 5

Distance < 2

Page 20

Concept definition (CM)

Node in concept network (semantic web)

Node in concept web

Page 21

Concept definitions

A structure that carries meaning. It needs other concepts and relations among them to be defined. Without relations, a concept cannot exist.

Relations between concepts can also be regarded as concepts. All concepts can be related to each other, forming either:

1. concept web (where relations are concepts also)

2. concept network (where relations are not concepts)
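The difference between the two structures can be made concrete with a small data-structure sketch in Python. The class and method names are assumptions for illustration; the slides do not prescribe any representation.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptNetwork:
    """Relations are plain edge labels, not concepts."""
    concepts: set = field(default_factory=set)
    edges: list = field(default_factory=list)  # (source, label, target)

    def relate(self, source: str, label: str, target: str) -> None:
        self.concepts.update((source, target))
        self.edges.append((source, label, target))

@dataclass
class ConceptWeb:
    """Relations are concepts too, so each relation becomes a node of its own."""
    concepts: set = field(default_factory=set)
    links: list = field(default_factory=list)  # unlabeled (node, node) pairs

    def relate(self, source: str, relation: str, target: str) -> None:
        self.concepts.update((source, relation, target))
        self.links.append((source, relation))
        self.links.append((relation, target))

network, web = ConceptNetwork(), ConceptWeb()
for structure in (network, web):
    structure.relate("brandy", "made of", "grapes")
print(sorted(network.concepts), sorted(web.concepts))
```

In the web, "made of" can itself take part in further relations (for instance to other relations), which the network representation cannot express without changing its edge labels into nodes.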

Page 22

Concept Network Concept Web

Page 23

Concept Modeling Learning Module

Page 24

Thank you for your attention! Questions?