Linked Cancer Genome Atlas Database

Linked Cancer Genome Atlas Database Muhammad Saleem, Shanmukha Sampath Padmanabhuni, Axel-Cyrille Ngonga Ngomo, Jonas S. Almeida, Stefan Decker, Helena F. Deus. Linked Data Cup, I-Semantics 2013, September 04 - 06 2013, Graz, Austria

description

Linked Cancer Genome Atlas Database, Linked Data Cup Award Winner at I-Semantics 2013. http://tcga.deri.ie/

Transcript of Linked Cancer Genome Atlas Database

Page 1: Linked Cancer Genome Atlas Database

Linked Cancer Genome Atlas Database

Muhammad Saleem, Shanmukha Sampath Padmanabhuni, Axel-Cyrille Ngonga Ngomo, Jonas S. Almeida, Stefan Decker, Helena F. Deus.

Linked Data Cup, I-Semantics 2013, September 04 - 06 2013, Graz, Austria

Page 2: Linked Cancer Genome Atlas Database

Agenda

• Cancer Genome Atlas (TCGA) introduction
• Problem statement
• Linked TCGA, a scalable solution
• Cancer treatment using Linked TCGA
• Demo of the use cases
• Conclusion

Page 3: Linked Cancer Genome Atlas Database

TCGA Introduction

• A publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
  – 9000 patients
  – 33 cancer types
  – 147,645 raw data files
  – 12.7 terabytes of data in total

• This is only 46% of the total expected data; new data is submitted every day

• Goal is to enable cancer researchers to make and validate important discoveries

Page 4: Linked Cancer Genome Atlas Database

Problem Statement

• Data in the TCGA is organized as text archives with no remote querying interface, so researchers must:
  – Download very large archives and wait in queues
  – Parse the relevant text
  – Collect the critical covariates necessary for analysis

• Various types of experimental results are not connected biologically

• TCGA data should be made publicly available for remote querying and virtual integration

Page 5: Linked Cancer Genome Atlas Database

Linked TCGA, a Scalable Solution: RDFization

Page 6: Linked Cancer Genome Atlas Database

Text to RDF Conversion

Raw -> Data Refiner -> Refined

Raw:

composite element REF  gene_symbol  chromosome  position   beta_value
cg00000292             ATP2A1       16          28890100   0.439271303584937
cg00002426             SLMAP        3           57743543   0.245147665381461
cg00003994             MEOX2        7           15725862   0.0440161061196347
cg00005847             HOXD3        2           177029073  0.741342927038953
cg00006414             ZNF425       7           148822837  NA
cg00007981             PANX1        11          93862594   0.0290713821114479
cg00008493             COX8C        14          93813777   0.985555436681019
cg00008713             IMPA2        18          11980953   0.0109832005732912
cg00009407             TTC8         14          89290921   0.0104525957219692

Refined:

chromosome  position   beta_value
16          28890100   0.439271303584937
3           57743543   0.245147665381461
7           15725862   0.0440161061196347
2           177029073  0.741342927038953
11          93862594   0.0290713821114479
14          93813777   0.985555436681019
18          11980953   0.0109832005732912
14          89290921   0.0104525957219692
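The refinement step shown on this slide can be sketched in a few lines of Python. This is only an illustration, not the project's Data Refiner: it assumes the raw file is tab-delimited with the header shown above, and the names `refine`, `KEPT_COLUMNS`, and `RAW_SAMPLE` are hypothetical. It keeps the three refined columns and skips rows whose beta_value is NA, which is what the two tables on the slide suggest.

```python
import csv
import io

# Columns that survive refinement, as shown in the "Refined" table.
KEPT_COLUMNS = ["chromosome", "position", "beta_value"]


def refine(raw_text):
    """Keep only the refined columns and drop rows whose beta_value is NA."""
    # Assumption: the raw archive file is tab-delimited.
    reader = csv.DictReader(io.StringIO(raw_text), delimiter="\t")
    rows = []
    for record in reader:
        if record["beta_value"] == "NA":
            continue
        rows.append([record[name] for name in KEPT_COLUMNS])
    return rows


# Three rows copied from the slide, including the NA row for ZNF425.
RAW_SAMPLE = (
    "composite element REF\tgene_symbol\tchromosome\tposition\tbeta_value\n"
    "cg00003994\tMEOX2\t7\t15725862\t0.0440161061196347\n"
    "cg00006414\tZNF425\t7\t148822837\tNA\n"
    "cg00007981\tPANX1\t11\t93862594\t0.0290713821114479\n"
)

refined = refine(RAW_SAMPLE)
for row in refined:
    print("\t".join(row))
```

Dropping the probe identifier and gene symbol is what shrinks each file before RDFization; the NA row disappears because it carries no measurement.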

Page 9: Linked Cancer Genome Atlas Database

Text to RDF Conversion

Raw -> Data Refiner -> Refined -> RDFizer -> RDFized

RDFized:

@prefix b:<http://tcga.deri.ie/>.
@prefix d:<http://tcga.deri.ie/schema/bcr_patient_barcode>.
@prefix r:<http://tcga.deri.ie/schema/result>.
@prefix c:<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.
@prefix w:<http://tcga.deri.ie/schema/dna_methylation_result>.
@prefix m:<http://tcga.deri.ie/schema/chromosome>.
@prefix v:<http://tcga.deri.ie/schema/position>.
@prefix u:<http://tcga.deri.ie/schema/beta_value>.
b:TCGA-A2-A0CX d: "TCGA-A2-A0CX".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d1 .
b:TCGA-A2-A0CX-d1 c: w: ; m: "16"; v: "28890100"; u: "0.439271303584937".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d2 .
b:TCGA-A2-A0CX-d2 c: w: ; m: "3"; v: "57743543"; u: "0.245147665381461".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d3 .
b:TCGA-A2-A0CX-d3 c: w: ; m: "7"; v: "15725862"; u: "0.0440161061196347".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d4 .
b:TCGA-A2-A0CX-d4 c: w: ; m: "2"; v: "177029073"; u: "0.741342927038953".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d5 .
b:TCGA-A2-A0CX-d5 c: w: ; m: "11"; v: "93862594"; u: "0.0290713821114479".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d6 .
b:TCGA-A2-A0CX-d6 c: w: ; m: "14"; v: "93813777"; u: "0.985555436681019".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d7 .
b:TCGA-A2-A0CX-d7 c: w: ; m: "18"; v: "11980953"; u: "0.0109832005732912".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d8 .
b:TCGA-A2-A0CX-d8 c: w: ; m: "14"; v: "89290921"; u: "0.0104525957219692".
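To make the RDFization step concrete, here is a minimal Python sketch that emits Turtle in the same shape as the slide: one short prefix per full predicate IRI, one result node per refined row, and a link from the patient to each result. The prefix IRIs are taken from the slide; the function name `rdfize` and the way result nodes are numbered (`-d1`, `-d2`, ...) are assumptions read off the example output, not the project's actual RDFizer.

```python
# Prefix table copied from the slide. Each prefix stands for a complete IRI,
# so a term such as "m:" (empty local name) denotes the chromosome predicate.
PREFIXES = {
    "b": "http://tcga.deri.ie/",
    "d": "http://tcga.deri.ie/schema/bcr_patient_barcode",
    "r": "http://tcga.deri.ie/schema/result",
    "c": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "w": "http://tcga.deri.ie/schema/dna_methylation_result",
    "m": "http://tcga.deri.ie/schema/chromosome",
    "v": "http://tcga.deri.ie/schema/position",
    "u": "http://tcga.deri.ie/schema/beta_value",
}


def rdfize(barcode, refined_rows):
    """Serialize refined (chromosome, position, beta_value) rows as Turtle text."""
    lines = [f"@prefix {prefix}:<{iri}>." for prefix, iri in PREFIXES.items()]
    lines.append(f'b:{barcode} d: "{barcode}".')
    for index, (chromosome, position, beta_value) in enumerate(refined_rows, start=1):
        result = f"b:{barcode}-d{index}"
        # Link the patient to the result node, then describe the result.
        lines.append(f"b:{barcode} r: {result} .")
        lines.append(
            f'{result} c: w: ; m: "{chromosome}"; v: "{position}"; u: "{beta_value}".'
        )
    return "\n".join(lines)


# Two refined rows copied from the slide.
turtle = rdfize(
    "TCGA-A2-A0CX",
    [
        ("16", "28890100", "0.439271303584937"),
        ("3", "57743543", "0.245147665381461"),
    ],
)
print(turtle)
```

Every refined row becomes one link triple plus four triples describing the result (type, chromosome, position, beta value), so the triple count grows at roughly five triples per measurement.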

Page 10: Linked Cancer Genome Atlas Database

Linked TCGA Data Workflow

Page 11: Linked Cancer Genome Atlas Database

Linked TCGA Tumors Statistics

Tumor Type                          Original Size (GB)  Refined Size (GB)  RDFized Size (GB)  Triples (Million)
Cervical (CESC)                     8.75                2.44               8.86               400.19
Rectal adenocarcinoma (READ)        8.07                2.25               9.04               413.31
Papillary Kidney (KIRP)             10.40               2.90               10.4               469.65
Bladder cancer (BLCA)               12.16               3.39               12.3               556.38
Acute Myeloid Leukemia (LAML)       14.85               4.14               15.1               684.05
Lower Grade Glioma (LGG)            17.08               4.76               17.1               778.82
Prostate adenocarcinoma (PRAD)      18.05               5.03               18.1               821.01
Lung squamous carcinoma (LUSC)      20.63               5.75               20.5               927.08
Cutaneous melanoma (SKCM)           23.22               6.47               23.2               1050.94
Head and neck squamous cell (HNSC)  27.6                7.69               27.5               1245.37

• A total of 7.36 billion triples for 10 small tumors
• Total Linked TCGA > 30 billion triples (largest dataset of LOD)

Page 12: Linked Cancer Genome Atlas Database

Linking to Linked Open Data

Source           Target      Class       #Links
DNA27            HGNC        Gene        23181
DNA27            Homologene  Gene        27654
DNA27            HGNC        Gene        15171
DNA450           Homologene  Gene        489643
DNA450           OMIM        Gene        212284
DNA27            HGNC        Chromosome  108662
DNA27            OMIM        Chromosome  16039535
Methylation      HGNC        Chromosome  97530
Methylation      OMIM        Chromosome  14407269
Gene Expression  HGNC        Chromosome  86052
Gene Expression  OMIM        Chromosome  12535829

• Links are generated using LIMES http://aksw.org/Projects/LIMES.html
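The idea behind link discovery can be illustrated with a small Python sketch: compare a property value of every source resource with that of every target resource and keep the pairs whose similarity reaches a threshold. This is not the LIMES API and not its algorithm (LIMES is designed to avoid this kind of exhaustive comparison); the function `discover_links`, the `example.org` URIs, and the choice of `difflib` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def discover_links(source, target, threshold=1.0):
    """Return (source_uri, target_uri, score) for label pairs scoring >= threshold.

    Both arguments map a resource URI to the label compared, for example
    a gene symbol. The comparison is case-insensitive.
    """
    links = []
    for source_uri, source_label in source.items():
        for target_uri, target_label in target.items():
            score = SequenceMatcher(
                None, source_label.lower(), target_label.lower()
            ).ratio()
            if score >= threshold:
                links.append((source_uri, target_uri, score))
    return links


# Hypothetical resources; only the gene symbols come from the slides.
tcga_results = {
    "http://example.org/tcga/result-1": "ATP2A1",
    "http://example.org/tcga/result-2": "SLMAP",
}
gene_catalogue = {
    "http://example.org/genes/atp2a1": "ATP2A1",
    "http://example.org/genes/meox2": "MEOX2",
}

links = discover_links(tcga_results, gene_catalogue)
for source_uri, target_uri, score in links:
    print(source_uri, target_uri, score)
```

With the default threshold of 1.0 only exact (case-insensitive) matches are linked; lowering it would admit near matches at the cost of false links.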

Page 13: Linked Cancer Genome Atlas Database

Cancer Treatment using Linked TCGA

Page 14: Linked Cancer Genome Atlas Database

Linked TCGA Use Cases

1. Targeted cancer treatment
   – Whether a specific drug can be used to treat a tumor, using the genomic data of patients with the same tumor

2. Mechanism-based treatment
   – Whether a combination of drugs can be applied to treat a specific tumor, using similar patients' data

3. Survival outcome
   – Using a mathematical model to predict future signs, such as the survival outcome for a new patient

Page 15: Linked Cancer Genome Atlas Database

Use Case 1, 2: SPARQL Query

SELECT ?patient ?mean
WHERE {
  ?uri tcga:tumour_type "BRCA" .
  ?uri tcga:bcr_patient_barcode ?patient .
  ?patient rdf:type tcga:expression_gene_results .
  ?patient tcga:gene_symbol "HER2", "ER" .
  ?patient tcga:scaled_estimate ?mean
}

Page 16: Linked Cancer Genome Atlas Database

Use Case 1, 2: Querying LOD DrugBank

SELECT ?drugname
WHERE {
  ?patient rdf:type tcga:expression_gene_results .
  ?patient tcga:gene_symbol ?targetname .
  ?patient tcga:scaled_estimate ?mean .
  FILTER (?mean > Threshold)
  ?drug drugbank:target ?target .
  ?drug drugbank:genericName ?drugname .
  ?target drugbank:synonym ?targetname .
  FILTER REGEX (?targetname, "HER2|estrogenreceptor|ERBB2", "i")
}
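The logic of this query can be mimicked in memory with a short Python sketch: keep the expression results above a threshold whose gene name matches the pattern, then join them with drug-target records on the shared name. All records, the drug names `drug-A` and `drug-B`, the threshold value, and the function `candidate_drugs` are hypothetical placeholders; nothing here is real TCGA or DrugBank data. The pattern uses single `|` alternation so that only the listed names match.

```python
import re

# Hypothetical expression results: (patient, gene symbol, scaled estimate).
EXPRESSION = [
    ("patient-1", "ERBB2", 0.9),
    ("patient-2", "ERBB2", 0.1),
    ("patient-3", "MEOX2", 0.8),
]

# Hypothetical drug-target records: (generic drug name, target synonym).
DRUG_TARGETS = [
    ("drug-A", "ERBB2"),
    ("drug-B", "MEOX2"),
]

# Case-insensitive pattern, mirroring the "i" flag of the SPARQL REGEX filter.
TARGET_PATTERN = re.compile(r"HER2|estrogenreceptor|ERBB2", re.IGNORECASE)


def candidate_drugs(expression, drug_targets, threshold):
    """Return sorted drug names whose target matches a highly expressed gene."""
    names = set()
    for _patient, gene, estimate in expression:
        # Corresponds to FILTER (?mean > Threshold) and FILTER REGEX (...).
        if estimate <= threshold or not TARGET_PATTERN.search(gene):
            continue
        # Corresponds to joining on the shared ?targetname variable.
        for drug, synonym in drug_targets:
            if synonym == gene:
                names.add(drug)
    return sorted(names)


drugs = candidate_drugs(EXPRESSION, DRUG_TARGETS, threshold=0.5)
print(drugs)
```

In the federated setting the two lists live in different SPARQL endpoints (Linked TCGA and DrugBank), and the shared `?targetname` variable plays the role of the equality test in the inner loop.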

Page 17: Linked Cancer Genome Atlas Database

Use Case 3 Query

SELECT ?patient ?mean
WHERE {
  ?uri tcga:tumour_type "BRCA" .
  ?uri tcga:bcr_patient_barcode ?patient .
  ?patient rdf:type tcga:clinical .
  ?patient tcga:tumour_stage ?tumour_stage .
  ?patient tcga:age_at_initial_patalogical_diagnosis ?age .
  ?patient tcga:relevant_biomarker "BRCA1", "CDKN2A", "CDH1" .
  ?patient tcga:beta_value ?mean
}

Page 19: Linked Cancer Genome Atlas Database

Everything is Public

• TopFed: https://code.google.com/p/topfed/
• Linked TCGA: http://tcga.deri.ie/

[email protected] AKSW, University of Leipzig, Germany

Page 20: Linked Cancer Genome Atlas Database

Thanks

Muhammad Saleem [email protected]