Phenotype Information for Existing GWAS Studies

Phenotype Information Retrieval for Existing GWAS Studies Neda Alipanah, Ph.D. University of California San Diego March 2013


2013 Summit on Clinical Research Informatics


Page 1: Phenotype Information for Existing GWAS Studies

Phenotype Information Retrieval for Existing GWAS Studies

Neda Alipanah, Ph.D. University of California San Diego

March 2013

Page 2: Phenotype Information for Existing GWAS Studies

Motivation

•  The database of Genotypes and Phenotypes (dbGaP) archives the results of different Genome-Wide Association Studies (GWAS).

•  Phenotype variables are not harmonized across studies.

•  The same phenotype has redundant identifiers.

•  dbGaP lacks semantic relations among its variables.

•  Phenotype search is not accurate.

Goals

•  Standardize dbGaP information to allow accurate, reusable, and quick retrieval of information.

Page 3: Phenotype Information for Existing GWAS Studies

Problem Statement (Example of Redundant Variables)

dbGaP Structure

dbGaP Study phs000284.v1.pht001903.v1.CFS_CARe_ECG.data_dict_2011_02_07

id="phv00122015", Description="Age at time of Study", name="age", version="1", Logical Max="65", Logical Minimum="18", unit="Years", type="decimal"


dbGaP Study phs000007.v18.p7

id="phv00003636.v1", Description="HEART: HYPERTENSIVE HEART DISEASE", name="FK414", version="1", Logical Max="--", Logical Minimum="--", unit="--", type="string"

id="phv00008678.v3", Description="CDI: HYPERTENSIVE HEART DISEASE", name="C334", version="3", Logical Max="--", Logical Minimum="--", unit="--", type="text"

N. Alipanah, H. Kim, L. Ohno-Machado. Building an Ontology of Phenotypes for Existing GWAS Studies. 2012 IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology (HISB), p. 111, 27-28 Sept. 2012.

Page 4: Phenotype Information for Existing GWAS Studies

Problem Statement (Example of Semantic Relation)

dbGaP Structure


dbGaP Study phs000284.v1.p1

id="phv00123020.v1", Description="CVD: self report of MD dx of cvd", name="cvd", version="1", value="Yes, No, Not assessed"

id="phv00123021.v1", Description="CVD: self report of MD dx of cvd (missing recoded as no)", name="cvdx", version="1", value="Yes, No, Not assessed"

Page 5: Phenotype Information for Existing GWAS Studies

Proposed Solution

•  Build an information model
◦  Index the phenotype variables semantically
◦  No redundancy

Example:


[Diagram: the concepts Heart Disease and Cardiovascular Disease, with the variables phv00124261.v1, phv00008678.v3, phv00123021.v1, phv00123020.v1, ……. attached as instances]

Page 6: Phenotype Information for Existing GWAS Studies

Methods

I. String-based Variables Distance Calculation

II. Semantic Hierarchy Extraction on Revised Clusters

III. Classification and Ontology Creation

IV. sdGaP Information Retrieval


Page 7: Phenotype Information for Existing GWAS Studies

I. String-based Variables Distance Calculation

1- Property Extraction: Name, Description, Type, Unit, …, and (Max-Min) values

2- UMLS Expansion: Expand Variable Description with MetaMap
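The property extraction step can be illustrated with a minimal sketch. It assumes the variable record arrives as the flattened `key="value"` text shown on the earlier slides; the `Variable` class, the `parse_variable` helper, and the regular expression are hypothetical names for illustration, not part of dbGaP or of the authors' pipeline. UMLS expansion with MetaMap is not covered here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches key="value" pairs; keys may contain spaces (e.g. "Logical Max").
PAIR = re.compile(r'([A-Za-z][A-Za-z ]*?)\s*=\s*"([^"]*)"')


@dataclass
class Variable:
    """Hypothetical container for the properties used in distance calculation."""
    id: str
    name: str
    description: str
    type: Optional[str] = None
    unit: Optional[str] = None
    logical_min: Optional[float] = None
    logical_max: Optional[float] = None


def _number(text: Optional[str]) -> Optional[float]:
    """Return a float, or None for missing values and placeholders such as "--"."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def parse_variable(record: str) -> Variable:
    # Lower-case the keys so "Description" and "description" are treated alike.
    props = {k.strip().lower(): v.strip() for k, v in PAIR.findall(record)}
    return Variable(
        id=props["id"],
        name=props["name"],
        description=props["description"],
        type=props.get("type"),
        unit=props.get("unit"),
        logical_min=_number(props.get("logical minimum")),
        logical_max=_number(props.get("logical max")),
    )


age = parse_variable(
    'id="phv00122015", Description="Age at time of Study", name="age", '
    'version="1", Logical Max="65", Logical Minimum="18", unit="Years", type="decimal"'
)
print(age)
```

Keeping the (Max-Min) values as numbers rather than strings makes the later range comparison straightforward.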


Page 8: Phenotype Information for Existing GWAS Studies

I. String-based Variables Distance Calculation

3- Distance Computation
◦  Description: Vector Space Model Matching
◦  Name Similarity: Jaro-Winkler String Matcher
◦  Type: Exact String Match
◦  Unit: Exact String Match
◦  (Max-Min) values: Subset Matching

4- Build Distance Matrix: Compute the distance between every pair of variables.

5- Cluster based on Distance Matrix: Variables with the same distance to other variables are clustered together.
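The per-property comparisons and the distance matrix can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: descriptions are compared by cosine similarity of raw term-frequency vectors, "subset matching" is read as one (min, max) range lying inside the other, and the five similarities are combined with equal weights. The slides do not specify any of these details. The `cvd` and `cvdx` records reuse names and descriptions from the slides; their missing type, unit, and range are set to `None`.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Vector space model: cosine of the term-frequency vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def jaro(s: str, t: str) -> float:
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(x != y for x, y in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t) + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (at most 4)."""
    j = jaro(s, t)
    prefix = 0
    for x, y in zip(s[:4], t[:4]):
        if x != y:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def exact(a, b) -> float:
    return 1.0 if a == b else 0.0


def _inside(x, y) -> bool:
    return y[0] <= x[0] and x[1] <= y[1]


def range_subset(a, b) -> float:
    """1.0 when one (min, max) range lies inside the other (assumed reading)."""
    if a is None or b is None:
        return exact(a, b)
    return 1.0 if _inside(a, b) or _inside(b, a) else 0.0


def distance(u: dict, v: dict) -> float:
    sims = [
        cosine_similarity(u["description"], v["description"]),
        jaro_winkler(u["name"], v["name"]),
        exact(u["type"], v["type"]),
        exact(u["unit"], v["unit"]),
        range_subset(u["range"], v["range"]),
    ]
    return 1.0 - sum(sims) / len(sims)  # equal weights are an assumption


def distance_matrix(variables):
    return [[distance(u, v) for v in variables] for u in variables]


variables = [
    {"name": "cvd", "description": "CVD: self report of MD dx of cvd",
     "type": None, "unit": None, "range": None},
    {"name": "cvdx", "description": "CVD: self report of MD dx of cvd (missing recoded as no)",
     "type": None, "unit": None, "range": None},
    {"name": "age", "description": "Age at time of Study",
     "type": "decimal", "unit": "Years", "range": (18, 65)},
]
matrix = distance_matrix(variables)
for row in matrix:
    print(["%.3f" % d for d in row])
```

In this toy input the two CVD variables end up much closer to each other than either is to the age variable, which is the behavior the clustering step relies on.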

Page 9: Phenotype Information for Existing GWAS Studies

II. Semantic Hierarchy Extraction on Revised Clusters

1.  Build a string-based distance matrix for the variables in a single assigned cluster.

2.  Sub-cluster the variables and identify semantically relevant (similar) variables.

3.  Assign labels to sub-clusters based on the relevant UMLS Concept Unique Identifier (CUI).

4.  Re-cluster to find smaller groups of relevant variables.
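The slides do not say how the "relevant" CUI for a sub-cluster is chosen. One plausible rule, shown here purely as an assumption, is a majority vote over the CUIs that the variables in the sub-cluster were mapped to. The variable identifiers and CUI strings below are placeholders, not real UMLS identifiers.

```python
from collections import Counter


def label_subcluster(variable_ids, cui_of):
    """Return the most frequent CUI among the mapped variables, or None if none is mapped."""
    counts = Counter(cui_of[v] for v in variable_ids if v in cui_of)
    return counts.most_common(1)[0][0] if counts else None


# Placeholder mapping from variable id to CUI (hypothetical values).
cui_of = {"var1": "CUI_A", "var2": "CUI_A", "var3": "CUI_B"}

print(label_subcluster(["var1", "var2", "var3"], cui_of))
print(label_subcluster(["var9"], cui_of))
```

A sub-cluster with no mapped variable gets no label, which flags it for the domain expert review mentioned under the limitations.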


Page 10: Phenotype Information for Existing GWAS Studies

III. Classification and (sdGaP) Ontology Creation

1.  Start with the UMLS Semantic Network.

2.  Extract the corresponding subclass (PAR/CHD) hierarchy using the UMLS hierarchy table (MRREL table).

3.  Instantiate the phenotype variables to the UMLS CUIs. (Not for higher levels)

4.  Populate the related constraints in sdGaP.
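Step 2 can be sketched as a pass over MRREL rows that keeps only PAR and CHD relations. The sketch assumes pipe-delimited rows with CUI1 in the first field, REL in the fourth, and CUI2 in the fifth, and it assumes REL describes the second concept relative to the first; both should be checked against the UMLS release in use. The sample rows use placeholder CUIs, and `build_hierarchy` and `descendants` are hypothetical helper names.

```python
from collections import defaultdict

# Assumed column positions in a pipe-delimited MRREL row.
CUI1, REL, CUI2 = 0, 3, 4


def build_hierarchy(lines):
    """Map each parent CUI to the set of its child CUIs, using PAR/CHD rows only."""
    children = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= CUI2:
            continue  # skip malformed rows
        first, rel, second = fields[CUI1], fields[REL], fields[CUI2]
        if rel == "CHD":    # assumed reading: CUI2 is a child of CUI1
            children[first].add(second)
        elif rel == "PAR":  # assumed reading: CUI2 is a parent of CUI1
            children[second].add(first)
    return dict(children)


def descendants(children, root):
    """All CUIs reachable from root by following child links."""
    seen, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


rows = [
    "CUI_A|||CHD|CUI_B|||||||||||",
    "CUI_C|||PAR|CUI_B|||||||||||",
    "CUI_A|||RO|CUI_D|||||||||||",   # non-hierarchical relation, ignored
]
hierarchy = build_hierarchy(rows)
print(sorted(descendants(hierarchy, "CUI_A")))
```

Phenotype variables would then be attached as instances to the lower-level CUIs of this hierarchy, as step 3 describes.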

Page 11: Phenotype Information for Existing GWAS Studies

IV. sdGaP Information Retrieval

1.  Use the sdGaP ontology structure to expand the query.

•  Density Measure (DM)

[Figure: two example hierarchies annotated with node densities. First: Density(A)=3, Density(B)=0, Density(D)=0. Second: Density(A)=2, Density(D)=1, Density(B)=1, Density(E)=1, Density(C)=0.]
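Query expansion over the ontology can be sketched as a walk from the query concept down to its subclasses, followed by a lookup of the variables instantiated under each visited concept. The two concept names come from the slides; the subclass link and the assignment of variables to concepts below are hypothetical and chosen only to make the example run. The density measure is not implemented, since the slides do not define it.

```python
def expand_query(concept, subclasses):
    """Return the concept followed by all of its transitive subclasses."""
    expanded, stack = [concept], [concept]
    while stack:
        for child in subclasses.get(stack.pop(), []):
            if child not in expanded:
                expanded.append(child)
                stack.append(child)
    return expanded


def retrieve(concept, subclasses, instances):
    """Collect variables instantiated under any concept of the expanded query."""
    hits = []
    for c in expand_query(concept, subclasses):
        for variable in instances.get(c, []):
            if variable not in hits:
                hits.append(variable)
    return hits


# Hypothetical ontology fragment and instance assignment.
subclasses = {"Cardiovascular Disease": ["Heart Disease"]}
instances = {
    "Cardiovascular Disease": ["phv00123020.v1", "phv00123021.v1"],
    "Heart Disease": ["phv00008678.v3"],
}

print(expand_query("Cardiovascular Disease", subclasses))
print(retrieve("Cardiovascular Disease", subclasses, instances))
```

Without expansion only the variables attached directly to the query concept would be returned; the expanded query also reaches those attached to its subclasses.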

Page 12: Phenotype Information for Existing GWAS Studies

IV. Result

•  Dataset: Cleveland Family Study (CFS), with 5 data sets and 2,339 phenotype variables (phs000284.v1.p1).

•  The Weka tool was used for X-means clustering.

•  X-means clustering produced 35 clusters of relevant variables.

•  Domain expert reviewers reorganized these into 23 clusters.

Page 13: Phenotype Information for Existing GWAS Studies

IV. Result of Concept-based Retrieval (Improvement of Subclass Expansion)

•  Query = "Cardiovascular Disease"

•  Query Expansion = {Heart} in "Disease Cluster"

•  Recall improvement: from 2/45 = 0.04 to 18/45 = 0.4

[Figure: hierarchy with the concepts Cardiovascular Disease and Heart Disease and the variables phv00123021.v1, phv00122274.v1, phv00122277.v1, phv00123020.v1, phv00122280.v1, phv00122281.v1, phv00122283.v1, phv00122284.v1, phv00122285.v1, phv00122286.v1, ……]
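The recall figures on this slide follow from the usual definition, relevant variables retrieved divided by all relevant variables. A short check of the arithmetic, using only the counts given on the slide (the `recall` helper is a name chosen for illustration):

```python
def recall(retrieved_relevant: int, total_relevant: int) -> float:
    """Fraction of the relevant variables that the query retrieved."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    return retrieved_relevant / total_relevant


before = recall(2, 45)    # query without subclass expansion
after = recall(18, 45)    # query expanded with subclasses
print(f"{before:.2f} -> {after:.2f}")  # 0.04 -> 0.40
```

Expansion retrieves nine times as many relevant variables (18 instead of 2), while 27 of the 45 relevant variables are still missed.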

Page 14: Phenotype Information for Existing GWAS Studies

Conclusion

•  Extracting a standard, reusable information model from UMLS

•  Improving information retrieval by organizing phenotype variables and instantiating them in the data model

Page 15: Phenotype Information for Existing GWAS Studies

Limitation

•  Clustering based on distance calculation is semi-automated.

•  Instantiating variables to lower levels of the hierarchy needs domain expert review.

•  Only instances at the lower levels of the hierarchy are considered in ontology building.

•  For large data, distance calculation and clustering need more advanced algorithms.

Page 16: Phenotype Information for Existing GWAS Studies

Acknowledgement

•  Supported by grants
◦  UH2HL108785 (NHLBI)
◦  R01HS019913 (AHRQ)

•  Supervision of Dr. Lucila Ohno-Machado.