II-SDV 2015, 20 - 21 April, in Nice
Richard Resnick CEO
II-SDV 2015, Nice, France
Integrated Keyword and Biological Sequence Searching in the Life Sciences
KEYWORD SEARCHING IN THE LIFE SCIENCES IS CHALLENGING
How do you spell “somatostatin”?
Ala-Gly-Cys-Lys-Asn-Phe-Phe-Trp-Lys-Thr-Phe-Thr-Ser-Cys
somato* AND (Mus musculus OR mouse)
TGAACCTCACAGCATGGAGCCCCTCTCTTTGGCTTCCACACCTAGCTGGAATGCCTCAGCTGCT 100%/4.2%/100%
Bubble chart: relevance of results to life sciences versus completeness of patent authority coverage
Size of bubble corresponds to the number of hits returned
GOAL: HIGHLY RELEVANT RESULTS FROM BROAD PATENT AUTHORITY COVERAGE
SEQUENCE SEARCHING PRESENTS CHALLENGES
CCCTCCATCATTTCACCATCCACACTCATAATAATCATATATATTCATCAATCATCTATATAAGTAGTGGCAGGAGCAATGAGAGGGAGGGTTTCTCCACTGATGCTGTTGCTAGGGATCCTTGTCCTGGCTTCAGTTTCTGCAACGCATGCCAAGTCATCACCTTACCAGAAGAAAACAGAGAACCCCTGCGCCCAGAGGTGCCTCCAGAGTTGTCAACAGGAACCGGATGACTTGAAGCAAAAGGCATGCGAGTCTCGCTGCACCAAGCTCGAGTATGATCCTCGTTGTGTCTATGATCCTCGAGGACACACTGGCACCACCAACCAACGTTCCCCTCCAGGGGAGCGGACACGTGGCCGCCAACCCGGAGACTACGATGATGACCGCCGTCAACCCCGAAGAGAGGAAGGAGGCCGATGGGGACCAGCTGGACCGAGGGAGCGTGAAAGAGAAGAAGACTGGAGACAACCAAGAGAAGATTGGAGGCGACCAAGTCATCAGCAGCCACGGAAAATAAGGCCCGAAGGAAGAGAAGGAGAACAAGAGTGGGGAACACCAGGTAGCCATGTGAGGGAAGAAACATCTCGGAACAACCCTTTCTACTTCCCGTCAAGGCGGTTTAGCACCCGCTACGGGAACCAAAACGGTAGGATCCGGGTCCTGCAGAGGTTTGACCAAAGGTCAAGGCAGTTTCAGAATCTCCAGAATCACCGTATTGTGCAGATCGAGGCCAAACCTAACACTCTTGTTCTTCCCAAGCACGCTGATGCTGATAACATCCTTGTTATCCAGCAAGGTATCAAATCTAATTCTATTCTAAACTACATATATTTTGTTGCTTGATACATATGATTCATTGGATTGCAGGGCAAGCCACCGTGACCGTAGCAAATGGCAATAACAGAAGAGCTTTAATCTTGACGAGGGCCATGCACTCAGAATCCCATCCGTTTCATTTCCTACATCTTGACGACATGACACCAGAACTCAGAGTAGCTAAATCTCATGCCGTTAACACACCCGGCCAGTTTGAGGTAGGTACCTCTTTCTTCTCACATATATATTCAATTCTCAATTATCATCTTACATGTTGTGGGTGTTGCTTCACAGGATTTCTTCCCGGCGAGCAGCCGAGACCAATCATCCTACTTGCAGGGATTCAGCAGGAATACTTTGGAGGCCGCCTTCAATGTAAGCAAATGTGTCATAATTATGGAATTAAAAGAACGATCATGTTATAAACTTATAATATATATATACATAGGCGGAATTCAATGAGATACGGAGGGTGCTGTTAGAAGAGAATGCAGGAGGTGAGCAAGAGGAGAGAGGGCAGAGGCGATGGAGTACTCGGAGTAGTGAGAACAATGAAGGAGTGATAGTCGAAGTGTCAAAGGAGCACGTTGAAGAACTTACTAAGCACGCTAAATCCGTCTCAAAGAAAGGCTCCGAAGAAGAGGGAGATATCACCAACCCAATCAACTTGAGAGAAGGCGAGCCCGATCTTTCTGACAACTTTGGGAGGTTATTTGAGGTGAAGCCAGACAAGAAGAACCCCCAGCTTCAGGACCTGGACATGATGCTCACCTGTGTAGAGATCAAAGAAGGAGCTTTGATGCTCCCACACTTCAACTCAAAGGCCATGGTCATCGTCGTCATCAACAAAGGAACTGGAAACCTTGAACTCGTAGCTGTAAGAAAAGAGCAACAACAGAGGGGACGGCGGGAACAAGAGTGGGAAGAAGAGGAGGAAGATGAAGAAGAGGAGGGAAGTAACAGAGAGGTGCGTAGGTACACAGCGAGGTTGAAGGAAGGCGATGTGTTCATCATGCCAGCAGCTCATCCAGTAGCCATCAACGCTTCCTCCGAACTCCATCTGCTTGGCTTCGGTATCAACGCTGAAAACAACCACAGAATCTTCCTTGCAGGTGATAAGGACAATGTGGTAGACCAGATAGAGAAGCAAGCGAAGGATTTAGCATTCCCTGGTTCGGGTG
AACAAGTTGAGAAGCTCATCAAAAACCAGAGGGAGTCTCACTTTGTGAGTGCTCGTCCTCAATCTCAATCTCCGTCGTCTCCTGAAAAAGAGGACCAAGAGGAGGAAAACCAGGGAGGGAAGGGTCCACTCCTTTCAATTTTGAAGGCTTTTAACTGAGAATGGAGGAAACTTGTTATGTATCCATAATAAGATCACGCTTTTGTAATCTACTATCCAAAAACTTATCAATAAATAAAAACGTTTGTGCGTTGTTTCTCCAAGAAATACGGGTGGCGCTTATGGTTGTTTATTTATACGAAACTAATTAAATACATCATAACGGCAACGACCTCTTATTTTGTAATTTTCTT
BLAST?
90% ID?
Do I want total query coverage or total subject coverage?
Global alignment?
What word size?
How do my sequence hits relate to my text search results?
Fragment? Motif?
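The identity and coverage questions above largely come down to which length goes in the denominator. The following is a minimal sketch, assuming a hit is summarized by its number of identical positions and its aligned length; the `Hit` class, its field names, and the example numbers are illustrative and are not taken from any particular search tool.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    matches: int      # identical positions in the alignment
    aligned_len: int  # number of aligned positions
    query_len: int    # full length of the query sequence
    subject_len: int  # full length of the subject (database) sequence


def percent(numerator: int, denominator: int) -> float:
    """Express numerator/denominator as a percentage."""
    return 100.0 * numerator / denominator


def summarize(hit: Hit) -> dict:
    """Report identity and coverage against each possible reference length."""
    return {
        "identity_over_alignment": percent(hit.matches, hit.aligned_len),
        "identity_over_query": percent(hit.matches, hit.query_len),
        "identity_over_subject": percent(hit.matches, hit.subject_len),
        "query_coverage": percent(hit.aligned_len, hit.query_len),
        "subject_coverage": percent(hit.aligned_len, hit.subject_len),
    }


# Hypothetical case: a short query that matches perfectly inside a long subject.
fragment_hit = Hit(matches=64, aligned_len=64, query_len=64, subject_len=1600)
stats = summarize(fragment_hit)
```

In this made-up case the same hit is a perfect match measured against the query but covers only a small fraction of the subject, which is why a threshold such as "90% ID" needs to state which length it is measured over.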
pn:EP* AND somato*^5 AND [mus musculus] AND clm:[transgenic animal ~3] AND pd:[19950101 TO 20140215]
KEYWORD SEARCHING IN THE LIFE SCIENCES PRESENTS CHALLENGES
How do my text search results relate to my sequence hits?
How do I figure out this system’s query syntax?
What if a keyword is misspelled in a patent claim?
How can I exclude patents unrelated to my domain easily?
How do I build and maintain reliable synonym lists?
Can I be sure that all of the documents I need to review exist in the underlying database?
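One way to picture the misspelling problem is approximate string matching. The sketch below uses Python's standard `difflib` module to find tokens in a claim that closely resemble a keyword. The claim text, the `fuzzy_hits` helper, and the 0.85 cutoff are assumptions made for illustration; this is not how any specific platform handles misspellings.

```python
import difflib
import re


def fuzzy_hits(keyword: str, text: str, cutoff: float = 0.85) -> list[str]:
    """Return distinct tokens in `text` that closely resemble `keyword`."""
    tokens = sorted(set(re.findall(r"[a-z]+", text.lower())))
    # get_close_matches ranks candidates by similarity ratio, best first.
    return difflib.get_close_matches(keyword.lower(), tokens, n=10, cutoff=cutoff)


# Made-up claim text containing one correct and one misspelled form.
claim = "A transgenic mouse expressing somatostatine, wherein the somatostatin analog ..."
matches = fuzzy_hits("somatostatin", claim)
```

An exact keyword search would find only the correctly spelled form here, while the approximate match also picks up the variant with the trailing "e".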
BUILDING A REPORT FROM DIFFERENT PLATFORMS IS CHALLENGING
Lack of life science specificity in search platforms creates multiple false-positive hits that require additional user review
Varying underlying algorithms can create an apples-to-oranges comparison
Different output formats make it difficult to analyze and compare results
Little cross-platform integration necessitates downloading multiple files for manual collation
Identify prior art surrounding gene modification in peanut for gene families implicated in food allergies.
“Ara h 1” is a seed storage protein from Arachis hypogaea. It is notable because sensitization to it was found in 95% of peanut-allergic patients from North America. We’re seeking prior art that describes vaccines related to these allergies, or sequences that hit the Ara h 1 gene.
CASE STUDY
Text Search: Identify relevant documents related to peanuts and claiming transgenic modification of plants that decrease allergy risks, and limited to the documents published after January 1st 2010
Sequence Search: Run a sequence search against the prior art for the peanut “ara h 1” gene sequence:
Arachis hypogaea cultivar LUHUA 8 Ara h 1 allergen (ara h 1) gene (cds)
CCCTCCATCATTTCACCATCCACACTCATAATAATCATATATATTCATCAATCATCTATATAAGTAGTGGCAGGAGCAATGAGAGGGAGGGTTTCTCCACTGATGCTGTTGCT…
SOLUTION: INTEGRATED LIFE SCIENCE SEARCH PLATFORMS
Union
Combine into a single, unique workfile
A COMPLETE REPORT FOR ANALYSIS
Claims contain vaccin* in green
Bioinformatics-related patents in red
Sequence search results in blue
A single, unified report for analyzing results.
STANDARD KEYWORDS AND BOOLEAN SYNTAX AREN’T ENOUGH
Life science applications are more than collections of discrete, specific keywords. They include field-specific ontological terms that can have synonyms, alternate spellings, and varying word order.
Building a single query that addresses all of these issues, plus allows the flexibility of Boolean, proximity, wildcard, field grouping, range searches, and term boosting, can be difficult.
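To make the combination of features concrete, the sketch below assembles a query string in the same shape as the example query shown earlier in the deck (field prefix, wildcard, boost, bracketed group, proximity, date range). The helper names are hypothetical, and the meaning of each piece of syntax is inferred from that one example rather than from any platform documentation.

```python
def boost(term: str, weight: int) -> str:
    """Term boosting, written as term^weight in the example query."""
    return f"{term}^{weight}"


def group(words: str) -> str:
    """Bracketed multi-word term, as in [mus musculus]."""
    return f"[{words}]"


def proximity(words: str, distance: int) -> str:
    """Bracketed words with a ~distance suffix, as in [transgenic animal ~3]."""
    return f"[{words} ~{distance}]"


def date_range(start: str, end: str) -> str:
    """Range written as [YYYYMMDD TO YYYYMMDD]."""
    return f"[{start} TO {end}]"


def field(name: str, expression: str) -> str:
    """Restrict an expression to a field, e.g. pn: or clm: or pd:."""
    return f"{name}:{expression}"


def all_of(*clauses: str) -> str:
    """Join clauses with Boolean AND."""
    return " AND ".join(clauses)


query = all_of(
    field("pn", "EP*"),
    boost("somato*", 5),
    group("mus musculus"),
    field("clm", proximity("transgenic animal", 3)),
    field("pd", date_range("19950101", "20140215")),
)
```

Composing the string from small pieces keeps each feature visible, but it also shows how much syntax a searcher has to hold in mind to write one such query by hand.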
USE EXISTING ONTOLOGY TERMS OR DEFINE YOUR OWN
As you type, suggested matching terms appear, based on the ontologies you choose
Simply typing “transgenic” with the NCBI ontology list allows “Transgenic Plants” as one option
At any time, type in the ? symbol for a complete list of field choices
Specify words in claims, date ranges, and many more options to further refine your query
Define your own ontologies and synonyms that are relevant for your specific search area
Includes synonyms and alternate spellings for the genus and species of peanut
Hit “Search” or <return> to run the search
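A user-defined synonym list can be thought of as a mapping from one term to all the variants that should be searched together. The sketch below expands a term into an OR group; the `SYNONYMS` entries, the `expand` helper, and the quoting of multi-word variants are illustrative assumptions, not the platform's own ontology format.

```python
# Hypothetical user-defined synonym list for the peanut example.
SYNONYMS = {
    "peanut": ["peanut", "peanuts", "groundnut", "Arachis hypogaea", "A. hypogaea"],
}


def expand(term: str, synonyms: dict[str, list[str]]) -> str:
    """Expand a term into an OR group of its known variants."""
    variants = synonyms.get(term.lower(), [term])
    # Quote multi-word variants so they stay together as a phrase.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + " OR ".join(quoted) + ")"


clause = expand("peanut", SYNONYMS)
unchanged = expand("mouse", SYNONYMS)
```

Keeping the list in one place means a new spelling is added once and then applies to every query that uses the term.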
THE “TEXT SEARCH” WORKFILE
Sort by any column
Rank for priority
Color code to categorize
Quickly assign colors/ranks using keyboard shortcuts
3 (for 3 stars)
O (for orange)
All the results seem relevant, but we want to annotate the documents talking about vaccines in the claims with a green color.
NAVIGATE A WORKFILE
Easily apply bulk annotations for future workfile manipulation
Keyboard shortcuts allow fast workfile evaluation
(next record)
(close preview)
(previous record)
FILTER A WORKFILE
Type in free text, use wildcards, or type in “?” to filter by terms in a specific field
FILTER A WORKFILE
Apply the filter to pull out the subset of documents that match your query.
12 documents contain the word “vaccine”, or related terms, in the claims.
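The filtering step can be sketched as a wildcard match applied to a single field of each record. The records, field names, and the `filter_workfile` helper below are made up for illustration; the standard-library `fnmatchcase` function stands in for whatever wildcard matching the platform actually uses.

```python
import re
from fnmatch import fnmatchcase


def field_matches(record: dict, field: str, pattern: str) -> bool:
    """True if any word in the given field matches the wildcard pattern."""
    words = re.findall(r"\w+", record.get(field, "").lower())
    return any(fnmatchcase(word, pattern.lower()) for word in words)


def filter_workfile(records: list[dict], field: str, pattern: str) -> list[dict]:
    """Return the subset of records whose field matches the pattern."""
    return [r for r in records if field_matches(r, field, pattern)]


# Made-up records standing in for a text search workfile.
workfile = [
    {"id": "DOC-1", "claims": "A vaccine composition comprising ..."},
    {"id": "DOC-2", "claims": "A method of vaccinating a subject ..."},
    {"id": "DOC-3", "claims": "A transgenic plant with reduced allergen content"},
]
subset = filter_workfile(workfile, "claims", "vaccin*")
subset_ids = [r["id"] for r in subset]
```

The pattern `vaccin*` catches "vaccine" and "vaccinating" alike, and the original workfile is left untouched so the filter can be reset.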
MAKING DOCUMENTS WITH VACCINES IN THE CLAIMS GREEN
Here is what our subset (vaccine in claims) looks like. You can reset the filter to see other documents that are in the workfile.
Let’s annotate in red the documents that are probably not relevant. Notice that “Bio-informatics” is a synonym list and includes multiple spellings.
MAKING BIOINFORMATICS DOCUMENTS RED
40 documents relate to bioinformatics methods.
Now it’s time to complete the analysis with sequence search results.
ara h 1 CDS sequence, GenePast, 90% identity over the length of the query or the subject (1000 results)
PREPARE YOUR SEQUENCE SEARCH RESULTS
We export these results to a LifeQuest workfile.
Apply a filter to keep only the hits whose patent sequence location is in the claims: that leaves 81 results in 25 patents.
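The difference between the number of results and the number of patents comes from one patent containing several matching sequences. A minimal sketch with made-up hits follows; the record layout and the `keep_claimed` helper are assumptions for illustration only.

```python
# Made-up sequence hits: one row per matching sequence, not per patent.
hits = [
    {"patent": "PAT-A", "seq_id": 1, "location": "claims"},
    {"patent": "PAT-A", "seq_id": 2, "location": "claims"},
    {"patent": "PAT-B", "seq_id": 7, "location": "description"},
    {"patent": "PAT-C", "seq_id": 3, "location": "claims"},
]


def keep_claimed(rows: list[dict]) -> list[dict]:
    """Keep only hits whose sequence appears in the claims."""
    return [row for row in rows if row["location"] == "claims"]


claimed = keep_claimed(hits)
n_results = len(claimed)                            # matching sequences
n_patents = len({row["patent"] for row in claimed})  # distinct patents
```

Here three claimed sequence hits fall in two distinct patents, the same kind of relationship as 81 results in 25 patents.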
FILTER YOUR SEQUENCE SEARCH RESULTS & EXPORT
Save it as a new “SEQ search” workfile, and open to analyze.
EXPORT YOUR SEARCH RESULTS TO A WORKFILE
In the “SEQ search” workfile, color code all as blue.
MARK ALL OF THE SEQUENCE SEARCH DOCUMENTS BLUE
CONSOLIDATE TEXT SEARCH AND SEQUENCE SEARCH RESULTS
Merge the two workfiles together (union) to get a complete set for final analysis.
Sort, filter, analyze, and export!
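The union step can be sketched as a merge keyed on a document identifier, so that a document found by both searches appears once. The records and the `union` helper below are hypothetical; in particular, keeping every color a document received is an assumption made here, and the actual tool may resolve overlapping annotations differently.

```python
def union(*workfiles: list[dict]) -> list[dict]:
    """Merge workfiles into one list with a single entry per document id."""
    merged: dict[str, dict] = {}
    for workfile in workfiles:
        for record in workfile:
            entry = merged.setdefault(
                record["id"], {"id": record["id"], "colors": set()}
            )
            # Accumulate annotations from every workfile the document is in.
            entry["colors"] |= set(record.get("colors", ()))
    return list(merged.values())


# Made-up workfiles: DOC-1 was found by both searches.
text_search = [
    {"id": "DOC-1", "colors": {"green"}},
    {"id": "DOC-2", "colors": {"red"}},
]
seq_search = [
    {"id": "DOC-1", "colors": {"blue"}},
    {"id": "DOC-3", "colors": {"blue"}},
]
combined = union(text_search, seq_search)
by_id = {record["id"]: record for record in combined}
```

Documents carrying both a text search color and the sequence search color are the ones supported by both kinds of evidence, which makes them natural candidates for a higher rank.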
EVALUATE THE MERGED DATA SETS
vaccin* in claims
bioinformatics related
sequence hit in claims
GENERATE A COMPLETE REPORT FOR ANALYSIS
Includes results from both sequence & text searches
Create color codes for your specific categories
Merge with other outputs or export to any format
Sort or filter by any field
Rank hits (1, 2, 3 stars) to easily identify priority
Claims contain vaccin*
bioinformatics related
found using the “ara h 1” DNA sequence
A single, unified report for analyzing results.