Eagle Bioinformatics Symposium: 8. Steve Gardner, The Importance of Data Representation: New Tools...

Post on 11-May-2015

description

The volume and diversity of life science and healthcare data have created huge data integration challenges. Technologies such as federation and warehousing allowed us to manage volume, but they did not give us the ability to respond flexibly to change or to routinely create novel insights from those data. In part this is due to imperfect recording and understanding of the context of the data, and in part to the representations we use. This talk will explore some of the historical and future approaches to large-scale semantic data integration and look at new graph and geometrical approaches to large-scale knowledge modelling.

Transcript of Eagle Bioinformatics Symposium: 8. Steve Gardner, The Importance of Data Representation: New Tools...

select aminoid, seq1[0:6], xss[0:6] from amino a where seq1='R[2,4]+polar++hydroxyl+'

GO:0003673 : Gene_Ontology (28348)

GO:0008150 : biological_process (21805)

GO:0005575 : cellular_component (13866)

GO:0003674 : molecular_function (20801)

GO:0008369 : obsolete (289)

GO:0004432 : 1-phosphatidylinositol-4-phosphate kinase, class IA (0)

GO:0003824 : enzyme (7162)

GO:0016301 : kinase (1027)

GO:0004428 : inositol/phosphatidylinositol kinase (37)

GO:0016307 : phosphatidylinositol phosphate kinase (9)

GO:0000285 : 1-phosphatidylinositol-3-phosphate 5-kinase (1)

GO:0016740 : transferase (2130)

GO:0016772 : transferase, transferring phosphorus-containing groups (1239)

GO:0016773 : phosphotransferase, alcohol group as acceptor (969)

GO:0004428 : inositol/phosphatidylinositol kinase (37)

GO:0016307 : phosphatidylinositol phosphate kinase (9)

GO:0000285 : 1-phosphatidylinositol-3-phosphate 5-kinase (1)
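
The listing above shows the same term, GO:0004428, under both the kinase branch and the transferase branch, so a term can be reached along more than one path and the ontology behaves as a graph rather than a simple tree. A minimal Python sketch of that idea follows. The parent links are an assumption inferred from the order of the listing (the slide's indentation did not survive), and `PARENTS` and `paths_to_root` are hypothetical names used only for illustration.

```python
# Parent links inferred from the order of the slide listing (an assumption:
# the original indentation is lost). A term may have more than one parent.
PARENTS = {
    "GO:0000285": ["GO:0016307"],
    "GO:0016307": ["GO:0004428"],
    "GO:0004428": ["GO:0016301", "GO:0016773"],  # kinase and phosphotransferase
    "GO:0016301": ["GO:0003824"],
    "GO:0016773": ["GO:0016772"],
    "GO:0016772": ["GO:0016740"],
}


def paths_to_root(term, parents=PARENTS):
    """Return every path from `term` up to a term with no recorded parent."""
    ups = parents.get(term, [])
    if not ups:
        return [[term]]
    return [[term] + rest for up in ups for rest in paths_to_root(up, parents)]


paths = paths_to_root("GO:0000285")
for path in paths:
    print(" -> ".join(path))
```

Storing parents as a list per term is what allows the two routes (via enzyme and via transferase) to coexist; a tree with a single parent per term could only hold one of them.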

Ontology

Structured Data Sources | Unstructured Data Sources

Context Vectors | Term Vectors (each with components 1, 2, 3 ... n)

Term vectors: ‘Zinc’, ‘Finger’

Query vector: ‘Zinc finger’ (OR, by addition)

Dot-product comparisons of the query vector against the term/context vectors give the semantic distance
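
The slide's idea of building a query vector by adding term vectors and then comparing it with other vectors by dot product can be sketched in a few lines of Python. This is only an illustration: the four-dimensional vectors are invented, the names `TERM_VECTORS`, `add`, `dot` and `cosine` are hypothetical, and scaling the dot product to unit length (cosine similarity) is an assumption of this sketch rather than something the slide states.

```python
import math

# Toy 4-dimensional term vectors; the values are invented for illustration
# and do not come from the talk.
TERM_VECTORS = {
    "zinc":        [0.9, 0.1, 0.0, 0.2],
    "finger":      [0.1, 0.8, 0.1, 0.0],
    "tachycardia": [0.0, 0.0, 0.9, 0.4],
}


def add(u, v):
    """Component-wise vector addition."""
    return [a + b for a, b in zip(u, v)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Dot product after scaling both vectors to unit length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# OR query: add the term vectors for 'zinc' and 'finger'.
query = add(TERM_VECTORS["zinc"], TERM_VECTORS["finger"])

# Higher score means semantically closer to the query in this toy space.
scores = {term: cosine(query, vec) for term, vec in TERM_VECTORS.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With these toy values the two query terms score well above the unrelated term, which is the behaviour the slide describes for semantic distance.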

‘Tachycardia’ search (untrained, no starting vocabulary provided)

400K clinical trials (500 MB of XML), unfiltered result set

Approx. 1.2M ‘terms’ in corpus

Vector length = semantic distance (in corpus)

Colour = term density in corpus

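$$
\begin{pmatrix} x' & y' & z' \end{pmatrix}
=
\begin{pmatrix} x & y & z \end{pmatrix}
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos\phi & \sin\phi \\
0 & -\sin\phi & \cos\phi
\end{pmatrix}
$$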
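
The matrix on this slide has the form of a rotation by an angle φ about the x-axis. A minimal Python sketch of applying it to a point follows, assuming the row-vector reading (the point is written as a row and multiplied on the left of the matrix); `rotate_about_x` is a hypothetical name chosen for illustration.

```python
import math


def rotate_about_x(point, phi):
    """Rotate a row vector (x, y, z) by angle phi (radians) about the x-axis."""
    x, y, z = point
    c, s = math.cos(phi), math.sin(phi)
    # Row vector times the matrix [[1, 0, 0], [0, c, s], [0, -s, c]].
    return (x, y * c - z * s, y * s + z * c)


# A quarter turn carries the y-axis onto the z-axis (up to rounding error).
rotated = rotate_about_x((0.0, 1.0, 0.0), math.pi / 2)
print(rotated)
```

The x component is untouched, and the length of the vector is preserved, which is what makes rotations useful for re-expressing the same data in a different frame.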

A B C D
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0 1 0 1
0 1 1 0
0 1 1 1
1 0 0 1
1 0 1 0
1 0 1 1
1 1 0 1
1 1 1 0
1 1 1 1

A B C D

0 0 0 0

0 1 0 1 0 1 1

0 1 0 1 1 0

1 0 1 0 1 0 1 1
