Fostering Serendipity through Big Linked Data

Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo

Semantic Web Challenge at ISWC 2013, October 21-25, 2013, Sydney, Australia

Agenda

• Motivation
• Datasets
• Architecture
• Evaluation
• Requirements
• Demo
• Conclusion and Future Work

Motivation

Fostering Serendipity through Big Data Triplification, Continuous Integration, and Visualization

Triplification: Linked TCGA

• TCGA is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
– 9000 patients
– 33 cancer types
– 147,645 raw data files
– 12.7 TB
• This is only 46% of the total expected data; new data is being submitted every day
• The goal is to enable cancer researchers to make and validate important discoveries
• Total Linked TCGA > 30 billion triples (largest LOD dataset)

Triplification: PubMed

• Collection of publications from the biomedical domain
• Large amount of metadata (MeSH terms)
• 23+ million publications
• 10,000 new publications/month

Big Data Continuous Integration

[Figure: architecture. Components shown: TopFed (Parser, Federator Optimizer, Integrator), PubMed, Entrez Utilities, RDFizer, Auto Loader, TCGA Data Portal, SPARQL endpoint. Flow labels: SPARQL Query, Sub-query, Results.]

[Figure: TopFed source selection over SPARQL endpoints partitioned by tumour.]

SPARQL endpoints and the tumours they hold:
– C-1 category, colour = blue (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical): endpoints b1-b2, tumours 1-16, 17-33
– C-2 category, colour = pink (Exon-Expression): endpoints p1-p6, tumours 1-5, 6-11, 12-16, 17-22, 23-27, 28-33
– C-3 category, colour = green (Methylation): endpoints g1-g9, tumours 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-27, 28-30, 31-33

A = {chromosome, result, bcr_patient_barcode}
B = {DNA-Methylation}
C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
D = {seg_mean, rpmmm, scaled_est, p_exp_val}
E = {RPKM}
F = {Expression-Exon}
G = {start, stop}
M = {beta_value, position}

For each query triple t(s, p, o) ∈ T:

C-1 = {{{p ∈ D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}}

C-2 = {{{p ∈ E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}}

C-3 = {{{p ∈ M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}}

If the tumour lookup is successful, forward to the corresponding leaf; otherwise broadcast to every leaf.

Highly Scalable
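
The C-1, C-2 and C-3 conditions and the tumour lookup on this slide can be read as a small decision procedure. The following is a minimal Python sketch of that reading, not the TopFed implementation. Several things are assumptions made only for illustration: S-Join is taken to mean "shares a subject with another triple pattern that belongs to the given sets", P-Join is taken to mean "forms a path (object to subject) with such a pattern", tumours are identified by an integer index 1-33, and the pairing of endpoint labels (b, p, g) with tumour ranges is read off the figure. The names `Triple`, `touches`, `rule`, `categories`, `LEAVES` and `route` are hypothetical.

```python
from typing import NamedTuple, Optional

class Triple(NamedTuple):
    s: str
    p: str
    o: str

# Predicate and class sets as listed on the slide.
A = {"chromosome", "result", "bcr_patient_barcode"}
B = {"DNA-Methylation"}
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
E = {"RPKM"}
F = {"Expression-Exon"}
G = {"start", "stop"}
M = {"beta_value", "position"}
RDF_TYPE = "rdf:type"

def touches(t, names):
    """True if t uses a predicate in names, or types its subject with a class in names."""
    return t.p in names or (t.p == RDF_TYPE and t.o in names)

def s_join(t, query, names):
    # ASSUMPTION: S-Join = another pattern with the same subject belongs to names.
    return any(u is not t and u.s == t.s and touches(u, names) for u in query)

def p_join(t, query, names):
    # ASSUMPTION: P-Join = another pattern in names is chained to t (object to subject).
    return any(u is not t and (u.o == t.s or u.s == t.o) and touches(u, names)
               for u in query)

def rule(t, query, preds, classes, own, other):
    """One C-n condition: membership AND (joined with own sets OR not joined with the others)."""
    local = t.p in preds or (t.p == RDF_TYPE and t.o in classes)
    joined = s_join(t, query, own) or p_join(t, query, own)
    clean = not s_join(t, query, other) and not p_join(t, query, other)
    return local and (joined or clean)

def categories(t, query):
    """Return the endpoint categories whose condition holds for triple pattern t."""
    checks = {
        "C-1": (D | A | G, C, D | C, M | B | E | F),
        "C-2": (E | A | G, F, E | F, M | B | D | C),
        "C-3": (M | A, B, M | B, E | F | D | C),
    }
    return [name for name, args in checks.items() if rule(t, query, *args)]

# ASSUMPTION: endpoint label -> inclusive tumour index range, as read from the figure.
LEAVES = {
    "C-1": [("b1", 1, 16), ("b2", 17, 33)],
    "C-2": [("p1", 1, 5), ("p2", 6, 11), ("p3", 12, 16),
            ("p4", 17, 22), ("p5", 23, 27), ("p6", 28, 33)],
    "C-3": [("g1", 1, 4), ("g2", 5, 8), ("g3", 9, 12), ("g4", 13, 16), ("g5", 17, 20),
            ("g6", 21, 24), ("g7", 25, 27), ("g8", 28, 30), ("g9", 31, 33)],
}

def route(category: str, tumour: Optional[int] = None):
    """Forward to the matching leaf if the tumour lookup succeeds, else broadcast."""
    leaves = LEAVES[category]
    if tumour is not None:
        hit = [name for name, lo, hi in leaves if lo <= tumour <= hi]
        if hit:
            return hit
    return [name for name, _, _ in leaves]
```

Under these assumptions, a `chromosome` pattern that shares its subject with an `rdf:type Expression-Exon` pattern is assigned only to C-2, while the same pattern on its own matches all three categories because `chromosome` is in the shared set A. `route("C-2", 20)` returns `["p4"]`, and `route("C-3")` with no tumour returns all nine green endpoints.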

Evaluation: Number of Sub-Query Submissions

• TopFed submits 1/3 as many sub-queries as FedX
• Number of ASK requests
– FedX: 480
– TopFed: 10

[Figure: number of sub-query submissions for queries 1-10 and the average, FedX vs. TopFed]

Evaluation: Query Runtime

[Figure: query runtime for queries 1-10 and the average, FedX vs. TopFed]

• TopFed outperforms FedX significantly on 90% of the queries
• On average, the query runtime of TopFed is about 1/3 of that of FedX
• TopFed's best runtime (query 2, query 3) is more than 75 times smaller than that of FedX

Big Data Track Requirements

• Data Volume
– 7.36 billion triples from Linked TCGA
– 23 million publications from PubMed
• Data Variety
– The Linked TCGA data was extracted from raw text files of different structures
– The metadata associated with PubMed publications was processed and transformed into RDF
– Unstructured data (publication abstracts) is processed to extract mentions of gene names and cancers
• Data Velocity
– TCGA data doubles every 2 months
– 10,000 new PubMed publications per month
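
The Data Variety steps above (publication metadata transformed into RDF, abstracts scanned for gene and cancer mentions) can be illustrated with a short Python sketch. It is not the project's RDFizer: the namespaces `http://example.org/pubmed/` and `http://example.org/vocab#`, the record layout, the property names, and the simple dictionary lookup used for mention extraction are all assumptions chosen to keep the example self-contained.

```python
import re

# ASSUMPTION: hypothetical namespaces, not the ones used by the actual datasets.
BASE = "http://example.org/pubmed/"
VOCAB = "http://example.org/vocab#"

def literal(value):
    """Quote a string as a plain literal (escapes backslashes and double quotes only)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def find_mentions(text, terms):
    """ASSUMPTION: mentions are found by case-insensitive whole-word dictionary lookup."""
    return [term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)]

def rdfize(record, genes, cancers):
    """Turn one publication record (a dict) into a list of N-Triples-style lines."""
    subject = f"<{BASE}{record['pmid']}>"
    triples = [f"{subject} <{VOCAB}title> {literal(record['title'])} ."]
    for mesh in record.get("mesh_terms", []):
        triples.append(f"{subject} <{VOCAB}meshTerm> {literal(mesh)} .")
    abstract = record.get("abstract", "")
    for gene in find_mentions(abstract, genes):
        triples.append(f"{subject} <{VOCAB}mentionsGene> {literal(gene)} .")
    for cancer in find_mentions(abstract, cancers):
        triples.append(f"{subject} <{VOCAB}mentionsCancer> {literal(cancer)} .")
    return triples

# Illustrative record with made-up identifier and text.
example = {
    "pmid": "0000001",
    "title": "Example title",
    "mesh_terms": ["Neoplasms"],
    "abstract": "TP53 mutations are frequent in glioblastoma.",
}
for line in rdfize(example, genes=["TP53", "EGFR"], cancers=["glioblastoma"]):
    print(line)
```

For the illustrative record this yields four statements: the title, one MeSH term, one gene mention (TP53) and one cancer mention (glioblastoma). A production pipeline would need full literal escaping and a curated gene and cancer vocabulary.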

Big Data Visualization

Tumor-wise Visualization

PubMed Paper-wise Visualization

Genome-wise Patients Results Visualization

Everything is Public

• Demo: http://srvgal78.deri.ie/tcga-pubmed/
• TopFed: https://code.google.com/p/topfed/
• TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ
• Utilities: http://goo.gl/kNrFdI
• Linked TCGA: http://tcga.deri.ie/

saleem@informatik.uni-leipzig.de AKSW, University of Leipzig, Germany
