Cross-species gene normalization by species inference
![Page 1: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/1.jpg)
Raunak Shrestha
24th November 2011
![Page 2: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/2.jpg)
http://reginanuzzo.com/wp-content/iHOP_chimp.gif
![Page 3: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/3.jpg)
BioCreative
“BioCreative ….. consists of a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. … ”
http://biocreative.sourceforge.net/
![Page 4: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/4.jpg)
BioCreative III
• 3 tasks:
  • Gene Normalization (GN)
  • Protein-protein interactions (PPI)
  • Interactive demonstration task for gene indexing and retrieval (IAT)
• Goal
  • Map genes and proteins mentioned in biomedical literature to their standard database identifiers
• BioCreative III
  • No species information was provided
  • Produce a list of EntrezGene identifiers, across all species, for the gene mentions in full-text biomedical articles
![Page 5: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/5.jpg)
Relation Extraction
Named Entity Normalization
genes, proteins, disease → specific database identifier
![Page 6: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/6.jpg)
GN task in BioCreative III
• 3 major challenges:
  • Gene mention variations
    • Orthographical variation: “TLR7” vs. “TLR-7”
    • Morphological variation: “GHF-1 transcriptional factor” vs. “GHF-1 transcription factor”
    • Enumeration variation: “TLR7/8” vs. “TLR7, TLR8”
    • Variation with abbreviation: “SLC11A1” vs. “NRAMP1”
  • Orthologous gene ambiguity
    • Orthologous genes belong to different species and should be mapped to different database identifiers
  • Intra-species gene ambiguity
    • Different genes have the same name
    • “CAS” -> multiple database identifiers
      • EntrezGene Id:1434 (“Cellular apoptosis susceptibility protein”)
      • EntrezGene Id:9564 (“Breast cancer antiestrogen resistance 1”)
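The orthographical and enumeration variations listed on this slide can be illustrated with a small sketch. This is not GenNorm's actual rule set; the function names `expand_enumeration` and `normalize_orthography` and the regular expressions are assumptions chosen only to show the idea on the slide's own examples.

```python
import re


def expand_enumeration(mention: str) -> list[str]:
    """Expand an enumerated mention such as 'TLR7/8' into ['TLR7', 'TLR8'].

    Mentions without a slash-separated number list are returned unchanged
    (as a one-element list).
    """
    match = re.fullmatch(r"([A-Za-z]+)(\d+)((?:/\d+)+)", mention)
    if not match:
        return [mention]
    stem, first, rest = match.groups()
    numbers = [first] + rest.strip("/").split("/")
    return [stem + number for number in numbers]


def normalize_orthography(mention: str) -> str:
    """Drop hyphens and whitespace and upper-case the mention,
    so that 'TLR-7' and 'TLR7' compare equal."""
    return re.sub(r"[-\s]", "", mention).upper()


print(expand_enumeration("TLR7/8"))
print(normalize_orthography("TLR-7") == normalize_orthography("TLR7"))
```

Abbreviation variation (“SLC11A1” vs. “NRAMP1”) cannot be handled by string rules like these; it needs a synonym dictionary.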
![Page 7: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/7.jpg)
Method
• GenNorm
  • An integrative method to handle the three issues of the GN task
• GenNorm uses 3 modules
  • Gene Name Recognition (GNR) module
  • Species Assignation (SA) module
  • Species-specific Gene Normalization (SGN) module
Figure: Architecture of GenNorm.
![Page 8: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/8.jpg)
Gene Name Recognition (GNR) module
• AIIA-GMT is an XML-RPC client of a web server that recognizes named entities in biomedical literature
![Page 9: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/9.jpg)
Gene Name Recognition (GNR) module
• Identifier extraction
  • Applied if gene mentions cannot be matched with a particular database identifier
  • Queries the EntrezGene database
  • Attaches all the associated IDs (swissprot_id, SGD_id, etc.)
• Distillation
![Page 10: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/10.jpg)
Species Assignation (SA) module
• Aggregates three different species name lexicons:
  • NCBI taxonomy
  • list of cell lines from Wikipedia
  • the corpus of Linnaeus
![Page 11: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/11.jpg)
Species Assignation (SA) module
Guaranteed Inference
Co-occurrence Inference
Species sub-type can disambiguate the species name
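The slide only names the two inference strategies, so the following is a rough sketch under stated assumptions rather than GenNorm's rules. It assumes that "guaranteed inference" means a species cue appears inside the gene mention itself, and that "co-occurrence inference" means a species cue appears in the same sentence. The toy lexicon and the names `SPECIES_LEXICON`, `find_species` and `assign_species` are hypothetical; the numbers are NCBI Taxonomy IDs.

```python
import re

# Hypothetical toy lexicon: species cue word -> NCBI Taxonomy ID.
SPECIES_LEXICON = {
    "human": 9606,
    "mouse": 10090,
    "murine": 10090,
    "yeast": 4932,
}


def find_species(text):
    """Return the taxonomy IDs of all lexicon words found in the text, in order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [SPECIES_LEXICON[token] for token in tokens if token in SPECIES_LEXICON]


def assign_species(mention, sentence):
    """Assign a species to a gene mention.

    Assumed order of precedence: a cue inside the mention ("guaranteed"),
    then a cue elsewhere in the same sentence ("co-occurrence").
    """
    in_mention = find_species(mention)
    if in_mention:
        return in_mention[0], "guaranteed"
    in_sentence = find_species(sentence)
    if in_sentence:
        return in_sentence[0], "co-occurrence"
    return None, "unassigned"


print(assign_species("human TLR7", "We cloned human TLR7 from mouse cells."))
print(assign_species("Nramp1", "Nramp1 was knocked out in mouse macrophages."))
```

A real system would also need a fallback for sentences with no species cue at all, which this sketch simply reports as unassigned.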
![Page 12: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/12.jpg)
Species Assignation (SA) module
5,933,419 Entrez Gene IDs belonging to more than 6,000 species
![Page 13: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/13.jpg)
Species-specific Gene Normalization (SGN) module
• Measures the inference scores of candidate Entrez Gene IDs in articles
• Entity inference: inference by exact match
  • Tagged entity = gene name entity
• Bag of words: inference by partial match
  • Tagged entity has at least one word matching the gene name entity from the bag of words
• Examples
  • “Hypoxia-inducible factor-1 alpha” → “Hypoxia”, “inducible”, “factor”, “1”, “alpha”
  • “CCRL1” = “chemokine receptor like 1”
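The difference between exact-match and bag-of-words inference can be sketched as follows. This is an illustrative simplification, not GenNorm's scoring function: it only classifies the kind of match and does not compute inference scores, and the names `bag_of_words` and `match_type` are assumptions.

```python
import re


def bag_of_words(name):
    """Split a gene name into a set of lower-cased alphanumeric words."""
    return {word for word in re.split(r"[^A-Za-z0-9]+", name.lower()) if word}


def match_type(tagged_entity, gene_name):
    """Classify how a tagged entity relates to a dictionary gene name.

    'exact'   : the strings are identical, ignoring case
    'partial' : the two bags of words share at least one word
    'none'    : no words in common
    """
    if tagged_entity.lower() == gene_name.lower():
        return "exact"
    if bag_of_words(tagged_entity) & bag_of_words(gene_name):
        return "partial"
    return "none"


print(sorted(bag_of_words("Hypoxia-inducible factor-1 alpha")))
print(match_type("HIF-1 alpha", "Hypoxia-inducible factor-1 alpha"))
```

Note that a single shared word such as “1” already counts as a partial match here, which hints at why partial matching produces many false-positive candidates that have to be filtered afterwards.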
![Page 14: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/14.jpg)
Test Data
• Two sets of test data
  • A: fully annotated by a group of trained and experienced curators from different model organism databases
  • B: partially annotated by curators at NLM
• 507 full-text articles from various BMC and PLoS journals
![Page 15: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/15.jpg)
Result
• The post-processing step was the most useful
• Identifier extraction was the most efficient
![Page 16: Cross-species gene normalization by species inference](https://reader033.fdocuments.in/reader033/viewer/2022052909/559864f41a28abce2f8b4661/html5/thumbnails/16.jpg)
Conclusion and Critique
• GenNorm tries to address key problems in cross-species gene normalization
  • Orthologous gene ambiguity is still a challenging task
  • An even more challenging step is filtering out the high-false-positive set described in Table 5
• The paper briefly describes some of the systems available in biomedical text mining to date
• GenNorm seems to be a highly integrated system
  • Grabs all the “cream” out of the best available resources in the field of biomedical literature text mining