Linked Cancer Genome Atlas Database
Muhammad Saleem, Shanmukha Sampath Padmanabhuni, Axel-Cyrille Ngonga Ngomo, Jonas S. Almeida, Stefan Decker, Helena F. Deus
Linked Data Cup, I-Semantics 2013, September 04 - 06 2013, Graz, Austria
Agenda
• Cancer Genome Atlas (TCGA) introduction
• Problem statement
• Linked TCGA, a scalable solution
• Cancer treatment using Linked TCGA
• Demo of the use cases
• Conclusion
TCGA Introduction
• A publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
  – 9000 patients
  – 33 cancer types
  – 147,645 raw data files
  – 12.7 terabytes of data in total
• This is only 46% of the total expected data; new data is submitted every day
• Goal is to enable cancer researchers to make and validate important discoveries
Problem Statement
• Data in TCGA is organized as text archives with no remote querying interface, so researchers must:
  – Download very large archives and wait in queues
  – Parse the relevant text
  – Collect the critical covariates necessary for analysis
• Various types of experimental results are not connected biologically
• TCGA data should be made publicly available for remote querying and virtual integration
Linked TCGA, a Scalable Solution: RDFization
Text to RDF Conversion: Raw -> Data Refiner -> Refined

Raw
composite element REF  gene_symbol  chromosome  position   beta_value
cg00000292             ATP2A1       16          28890100   0.439271303584937
cg00002426             SLMAP        3           57743543   0.245147665381461
cg00003994             MEOX2        7           15725862   0.0440161061196347
cg00005847             HOXD3        2           177029073  0.741342927038953
cg00006414             ZNF425       7           148822837  NA
cg00007981             PANX1        11          93862594   0.0290713821114479
cg00008493             COX8C        14          93813777   0.985555436681019
cg00008713             IMPA2        18          11980953   0.0109832005732912
cg00009407             TTC8         14          89290921   0.0104525957219692

Refined
chromosome  position   beta_value
16          28890100   0.439271303584937
3           57743543   0.245147665381461
7           15725862   0.0440161061196347
2           177029073  0.741342927038953
11          93862594   0.0290713821114479
14          93813777   0.985555436681019
18          11980953   0.0109832005732912
14          89290921   0.0104525957219692
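The Data Refiner step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the raw file is tab-separated with the five columns of the raw sample, and the names `refine`, `KEEP`, and `RAW_SAMPLE` are hypothetical. It keeps only the chromosome, position, and beta_value columns and drops rows whose beta value is NA, which is what the raw and refined samples suggest.

```python
import csv
import io

# Columns that survive refinement, in output order (assumed from the slide samples).
KEEP = ("chromosome", "position", "beta_value")


def refine(raw_text, delimiter="\t"):
    """Return (chromosome, position, beta_value) tuples, skipping NA beta values."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    refined = []
    for row in reader:
        if row["beta_value"] == "NA":
            continue  # rows without a measurement are dropped
        refined.append(tuple(row[col] for col in KEEP))
    return refined


# Three rows copied from the raw sample, joined with tabs (the delimiter is an assumption).
RAW_SAMPLE = "\n".join([
    "composite element REF\tgene_symbol\tchromosome\tposition\tbeta_value",
    "cg00000292\tATP2A1\t16\t28890100\t0.439271303584937",
    "cg00006414\tZNF425\t7\t148822837\tNA",
    "cg00007981\tPANX1\t11\t93862594\t0.0290713821114479",
])

if __name__ == "__main__":
    for refined_row in refine(RAW_SAMPLE):
        print("\t".join(refined_row))
```

Values are kept as strings so that the refined output reproduces the original digits exactly.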
RDFized
@prefix b: <http://tcga.deri.ie/> .
@prefix d: <http://tcga.deri.ie/schema/bcr_patient_barcode> .
@prefix r: <http://tcga.deri.ie/schema/result> .
@prefix c: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> .
@prefix w: <http://tcga.deri.ie/schema/dna_methylation_result> .
@prefix m: <http://tcga.deri.ie/schema/chromosome> .
@prefix v: <http://tcga.deri.ie/schema/position> .
@prefix u: <http://tcga.deri.ie/schema/beta_value> .

b:TCGA-A2-A0CX d: "TCGA-A2-A0CX".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d1 .
b:TCGA-A2-A0CX-d1 c: w: ; m: "16"; v: "28890100"; u: "0.439271303584937".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d2 .
b:TCGA-A2-A0CX-d2 c: w: ; m: "3"; v: "57743543"; u: "0.245147665381461".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d3 .
b:TCGA-A2-A0CX-d3 c: w: ; m: "7"; v: "15725862"; u: "0.0440161061196347".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d4 .
b:TCGA-A2-A0CX-d4 c: w: ; m: "2"; v: "177029073"; u: "0.741342927038953".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d5 .
b:TCGA-A2-A0CX-d5 c: w: ; m: "11"; v: "93862594"; u: "0.0290713821114479".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d6 .
b:TCGA-A2-A0CX-d6 c: w: ; m: "14"; v: "93813777"; u: "0.985555436681019".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d7 .
b:TCGA-A2-A0CX-d7 c: w: ; m: "18"; v: "11980953"; u: "0.0109832005732912".
b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d8 .
b:TCGA-A2-A0CX-d8 c: w: ; m: "14"; v: "89290921"; u: "0.0104525957219692".
Text to RDF Conversion: Raw -> Data Refiner -> Refined -> RDFizer -> RDFized
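The RDFizer step can be sketched in the same spirit. The function below is a hypothetical illustration (the name `rdfize` is not from the slides); it only reproduces the triple pattern visible in the RDFized sample: one barcode triple per patient, then one result node per refined row, reusing the single-letter prefixes (b:, d:, r:, c:, w:, m:, v:, u:) declared in that sample.

```python
def rdfize(barcode, refined_rows):
    """Serialize refined (chromosome, position, beta_value) rows as Turtle statements.

    The prefixes b:, d:, r:, c:, w:, m:, v:, u: are assumed to be declared
    exactly as in the RDFized sample on the slide.
    """
    patient = f"b:{barcode}"
    lines = [f'{patient} d: "{barcode}".']
    for index, (chromosome, position, beta_value) in enumerate(refined_rows, start=1):
        result = f"{patient}-d{index}"  # one result node per refined row
        lines.append(f"{patient} r: {result} .")
        lines.append(
            f'{result} c: w: ; m: "{chromosome}"; v: "{position}"; u: "{beta_value}".'
        )
    return lines


if __name__ == "__main__":
    sample_rows = [("16", "28890100", "0.439271303584937")]
    print("\n".join(rdfize("TCGA-A2-A0CX", sample_rows)))
```

Each refined row yields two statements: a link from the patient to the result node, and the typed result node carrying the chromosome, position, and beta value as literals.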
Linked TCGA Data Workflow
Linked TCGA Tumor Statistics
Tumor Type                          Original Size (GB)  Refined Size (GB)  RDFized Size (GB)  Triples (Million)
Cervical (CESC)                     8.75                2.44               8.86               400.19
Rectal adenocarcinoma (READ)        8.07                2.25               9.04               413.31
Papillary Kidney (KIRP)             10.40               2.90               10.4               469.65
Bladder cancer (BLCA)               12.16               3.39               12.3               556.38
Acute Myeloid Leukemia (LAML)       14.85               4.14               15.1               684.05
Lower Grade Glioma (LGG)            17.08               4.76               17.1               778.82
Prostate adenocarcinoma (PRAD)      18.05               5.03               18.1               821.01
Lung squamous carcinoma (LUSC)      20.63               5.75               20.5               927.08
Cutaneous melanoma (SKCM)           23.22               6.47               23.2               1050.94
Head and neck squamous cell (HNSC)  27.6                7.69               27.5               1245.37
• A total of 7.36 billion triples for 10 small tumors
• Total Linked TCGA > 30 billion triples (largest dataset of LOD)
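As a quick arithmetic check, the per-tumor triple counts from the statistics table can be summed directly. The snippet below is only a worked calculation; the dictionary and function names are hypothetical, and the values are copied from the Triples (Million) column. The sum comes to roughly 7.35 billion, close to the 7.36 billion quoted on the slide.

```python
# Triples (in millions) per tumor type, copied from the statistics table.
TRIPLES_MILLION = {
    "CESC": 400.19,
    "READ": 413.31,
    "KIRP": 469.65,
    "BLCA": 556.38,
    "LAML": 684.05,
    "LGG": 778.82,
    "PRAD": 821.01,
    "LUSC": 927.08,
    "SKCM": 1050.94,
    "HNSC": 1245.37,
}


def total_triples_billion(triples_million):
    """Sum per-tumor triple counts (millions) and convert to billions."""
    return sum(triples_million.values()) / 1000.0


if __name__ == "__main__":
    print(f"{total_triples_billion(TRIPLES_MILLION):.2f} billion triples")
```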
Linking to Linked Open Data
Source           Target      Class       #Links
DNA27            HGNC        Gene        23181
DNA27            Homologene  Gene        27654
DNA27            HGNC        Gene        15171
DNA450           Homologene  Gene        489643
DNA450           OMIM        Gene        212284
DNA27            HGNC        Chromosome  108662
DNA27            OMIM        Chromosome  16039535
Methylation      HGNC        Chromosome  97530
Methylation      OMIM        Chromosome  14407269
Gene Expression  HGNC        Chromosome  86052
Gene Expression  OMIM        Chromosome  12535829
• Links are generated using LIMES http://aksw.org/Projects/LIMES.html
Cancer Treatment using Linked TCGA
Linked TCGA Use Cases
1. Targeted cancer treatment
  – Whether a specific drug can be used to treat a tumor, using the genomic data of patients with the same tumor
2. Mechanism-based treatment
  – Whether a combination of drugs can be applied to treat a specific tumor, using similar patients' data
3. Survival outcome
  – Using a mathematical model to predict future signs, such as survival outcome, for a new patient
Use Case 1,2 SPARQL Query
SELECT ?patient ?mean
WHERE {
  ?uri tcga:tumour_type "BRCA" .
  ?uri tcga:bcr_patient_barcode ?patient .
  ?patient rdf:type tcga:expression_gene_results .
  ?patient tcga:gene_symbol "HER2", "ER" .
  ?patient tcga:scaled_estimate ?mean
}
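A query like this is normally sent to an endpoint through the SPARQL Protocol, which accepts the query text in a `query` parameter of an HTTP GET request. The sketch below only builds such a request URL with the standard library and does not contact any server. Two things are assumptions rather than facts from the slides: the endpoint address (`http://example.org/sparql` is a placeholder) and the `tcga:` namespace, guessed from the schema IRIs in the RDFized sample.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: tcga: maps to the schema namespace seen in the RDFized sample.
QUERY = """\
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX tcga: <http://tcga.deri.ie/schema/>
SELECT ?patient ?mean
WHERE {
  ?uri tcga:tumour_type "BRCA" .
  ?uri tcga:bcr_patient_barcode ?patient .
  ?patient rdf:type tcga:expression_gene_results .
  ?patient tcga:gene_symbol "HER2", "ER" .
  ?patient tcga:scaled_estimate ?mean
}
"""


def build_query_url(endpoint, query):
    """Return a SPARQL Protocol GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def read_back_query(url):
    """Decode the 'query' parameter again, e.g. for logging or debugging."""
    return parse_qs(urlsplit(url).query)["query"][0]


if __name__ == "__main__":
    # Placeholder endpoint; replace with the address of a real SPARQL service.
    print(build_query_url("http://example.org/sparql", QUERY))
```

The resulting URL could then be fetched with `urllib.request.urlopen`, ideally with an `Accept` header such as `application/sparql-results+json`.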
Use Case 1,2 Querying LOD DrugBank
SELECT ?drugname
WHERE {
  ?patient rdf:type tcga:expression_gene_results .
  ?patient tcga:gene_symbol ?targetname .
  ?patient tcga:scaled_estimate ?mean .
  # Threshold is a placeholder for a numeric expression cutoff
  FILTER (?mean > Threshold)
  ?drug drugbank:target ?target .
  ?drug drugbank:genericName ?drugname .
  ?target drugbank:synonym ?targetname .
  FILTER REGEX (?targetname, "HER2|estrogen receptor|ERBB2", "i")
}
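The logic of this query can be mimicked in memory to make the join explicit: keep genes whose expression exceeds the threshold, join them to drug targets by name, and keep only targets matching the pattern of interest. The sketch below is an illustration under stated assumptions, not the Linked TCGA implementation. The function name and all expression values are hypothetical toy data, and the pattern writes the alternation with a single `|`, the usual regular-expression form.

```python
import re

# Alternatives separated by a single "|"; matching is case-insensitive ("i" flag).
TARGET_PATTERN = re.compile(r"HER2|estrogen receptor|ERBB2", re.IGNORECASE)


def candidate_drugs(expression, drug_targets, threshold):
    """Return drug names whose target synonym equals an over-expressed gene symbol
    and also matches TARGET_PATTERN.

    expression:   list of (gene_symbol, scaled_estimate) pairs
    drug_targets: list of (drug_name, [target synonyms]) pairs
    """
    over_expressed = {gene for gene, mean in expression if mean > threshold}
    names = []
    for drug_name, synonyms in drug_targets:
        if any(s in over_expressed and TARGET_PATTERN.search(s) for s in synonyms):
            names.append(drug_name)
    return names


# Hypothetical toy values, for illustration only (not TCGA or DrugBank records).
EXPRESSION = [("ERBB2", 0.9), ("ESR1", 0.2), ("TP53", 0.8)]
DRUG_TARGETS = [
    ("Trastuzumab", ["ERBB2", "HER2"]),
    ("Tamoxifen", ["ESR1", "Estrogen receptor"]),
    ("Hypothetical drug X", ["TP53"]),
]

if __name__ == "__main__":
    print(candidate_drugs(EXPRESSION, DRUG_TARGETS, threshold=0.5))
```

As in the SPARQL query, the join between gene symbol and target synonym is an exact string match; only the final filter is case-insensitive.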
Use Case 3 Query
SELECT ?patient ?mean
WHERE {
  ?uri tcga:tumour_type "BRCA" .
  ?uri tcga:bcr_patient_barcode ?patient .
  ?patient rdf:type tcga:clinical .
  ?patient tcga:tumour_stage ?tumour_stage .
  ?patient tcga:age_at_initial_pathologic_diagnosis ?age .
  ?patient tcga:relevant_biomarker "BRCA1", "CDKN2A", "CDH1" .
  ?patient tcga:beta_value ?mean
}
Demo1 Demo2
Everything is Public
• TopFed: https://code.google.com/p/topfed/
• Linked TCGA: http://tcga.deri.ie/
[email protected] AKSW, University of Leipzig, Germany
Thanks
Muhammad Saleem [email protected]