II-SDV 2015, 20 - 21 April, in Nice

Richard Resnick, CEO

II-SDV 2015, Nice, France

Integrated Keyword and Biological Sequence Searching in the Life Sciences


KEYWORD SEARCHING IN THE LIFE SCIENCES IS CHALLENGING

How do you spell “somatostatin”?

Ala-Gly-Cys-Lys-Asn-Phe-Phe-Trp-Lys-Thr-Phe-Thr-Ser-Cys

somato* AND (Mus musculus OR mouse)

TGAACCTCACAGCATGGAGCCCCTCTCTTTGGCTTCCACACCTAGCTGGAATGCCTCAGCTGCT 100%/4.2%/100%


[Figure: bubble chart. Axes: relevance of results to life sciences; completeness of patent authority coverage. The size of each bubble corresponds to the number of hits returned.]

GOAL: HIGHLY RELEVANT RESULTS FROM BROAD PATENT AUTHORITY COVERAGE

SEQUENCE SEARCHING PRESENTS CHALLENGES

CCCTCCATCATTTCACCATCCACACTCATAATAATCATATATATTCATCAATCATCTATATAAGTAGTGGCAGGAGCAATGAGAGGGAGGGTTTCTCCACTGATGCTGTTGCTAGGGATCCTTGTCCTGGCTTCAGTTTCTGCAACGCATGCCAAGTCATCACCTTACCAGAAGAAAACAGAGAACCCCTGCGCCCAGAGGTGCCTCCAGAGTTGTCAACAGGAACCGGATGACTTGAAGCAAAAGGCATGCGAGTCTCGCTGCACCAAGCTCGAGTATGATCCTCGTTGTGTCTATGATCCTCGAGGACACACTGGCACCACCAACCAACGTTCCCCTCCAGGGGAGCGGACACGTGGCCGCCAACCCGGAGACTACGATGATGACCGCCGTCAACCCCGAAGAGAGGAAGGAGGCCGATGGGGACCAGCTGGACCGAGGGAGCGTGAAAGAGAAGAAGACTGGAGACAACCAAGAGAAGATTGGAGGCGACCAAGTCATCAGCAGCCACGGAAAATAAGGCCCGAAGGAAGAGAAGGAGAACAAGAGTGGGGAACACCAGGTAGCCATGTGAGGGAAGAAACATCTCGGAACAACCCTTTCTACTTCCCGTCAAGGCGGTTTAGCACCCGCTACGGGAACCAAAACGGTAGGATCCGGGTCCTGCAGAGGTTTGACCAAAGGTCAAGGCAGTTTCAGAATCTCCAGAATCACCGTATTGTGCAGATCGAGGCCAAACCTAACACTCTTGTTCTTCCCAAGCACGCTGATGCTGATAACATCCTTGTTATCCAGCAAGGTATCAAATCTAATTCTATTCTAAACTACATATATTTTGTTGCTTGATACATATGATTCATTGGATTGCAGGGCAAGCCACCGTGACCGTAGCAAATGGCAATAACAGAAGAGCTTTAATCTTGACGAGGGCCATGCACTCAGAATCCCATCCGTTTCATTTCCTACATCTTGACGACATGACACCAGAACTCAGAGTAGCTAAATCTCATGCCGTTAACACACCCGGCCAGTTTGAGGTAGGTACCTCTTTCTTCTCACATATATATTCAATTCTCAATTATCATCTTACATGTTGTGGGTGTTGCTTCACAGGATTTCTTCCCGGCGAGCAGCCGAGACCAATCATCCTACTTGCAGGGATTCAGCAGGAATACTTTGGAGGCCGCCTTCAATGTAAGCAAATGTGTCATAATTATGGAATTAAAAGAACGATCATGTTATAAACTTATAATATATATATACATAGGCGGAATTCAATGAGATACGGAGGGTGCTGTTAGAAGAGAATGCAGGAGGTGAGCAAGAGGAGAGAGGGCAGAGGCGATGGAGTACTCGGAGTAGTGAGAACAATGAAGGAGTGATAGTCGAAGTGTCAAAGGAGCACGTTGAAGAACTTACTAAGCACGCTAAATCCGTCTCAAAGAAAGGCTCCGAAGAAGAGGGAGATATCACCAACCCAATCAACTTGAGAGAAGGCGAGCCCGATCTTTCTGACAACTTTGGGAGGTTATTTGAGGTGAAGCCAGACAAGAAGAACCCCCAGCTTCAGGACCTGGACATGATGCTCACCTGTGTAGAGATCAAAGAAGGAGCTTTGATGCTCCCACACTTCAACTCAAAGGCCATGGTCATCGTCGTCATCAACAAAGGAACTGGAAACCTTGAACTCGTAGCTGTAAGAAAAGAGCAACAACAGAGGGGACGGCGGGAACAAGAGTGGGAAGAAGAGGAGGAAGATGAAGAAGAGGAGGGAAGTAACAGAGAGGTGCGTAGGTACACAGCGAGGTTGAAGGAAGGCGATGTGTTCATCATGCCAGCAGCTCATCCAGTAGCCATCAACGCTTCCTCCGAACTCCATCTGCTTGGCTTCGGTATCAACGCTGAAAACAACCACAGAATCTTCCTTGCAGGTGATAAGGACAATGTGGTAGACCAGATAGAGAAGCAAGCGAAGGATTTAGCATTCCCTGGTTCGGGTG
AACAAGTTGAGAAGCTCATCAAAAACCAGAGGGAGTCTCACTTTGTGAGTGCTCGTCCTCAATCTCAATCTCCGTCGTCTCCTGAAAAAGAGGACCAAGAGGAGGAAAACCAGGGAGGGAAGGGTCCACTCCTTTCAATTTTGAAGGCTTTTAACTGAGAATGGAGGAAACTTGTTATGTATCCATAATAAGATCACGCTTTTGTAATCTACTATCCAAAAACTTATCAATAAATAAAAACGTTTGTGCGTTGTTTCTCCAAGAAATACGGGTGGCGCTTATGGTTGTTTATTTATACGAAACTAATTAAATACATCATAACGGCAACGACCTCTTATTTTGTAATTTTCTT  

BLAST?

90% ID?

Do I want total query coverage or total subject coverage?

Global alignment?

What word size?

How do my sequence hits relate to my text search results?

Fragment? Motif?
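The identity and coverage questions above can be made concrete with a small sketch. It assumes a pairwise alignment is already available as two equal-length gapped strings; the function name `alignment_stats` and the toy sequences are illustrative only and do not come from any particular search platform.

```python
def alignment_stats(aligned_query, aligned_subject, query_len, subject_len):
    """Return percent identity measured three different ways.

    aligned_query / aligned_subject: equal-length strings, '-' marks a gap.
    query_len / subject_len: full (unaligned) lengths of each sequence.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have the same length")
    # Count columns where both sequences carry the same residue.
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return {
        "identity_over_query": 100.0 * matches / query_len,
        "identity_over_subject": 100.0 * matches / subject_len,
        "identity_over_alignment": 100.0 * matches / len(aligned_query),
    }

# Toy example: a 10-base query fully contained in a 40-base subject.
stats = alignment_stats("ACGTACGTAC", "ACGTACGTAC", query_len=10, subject_len=40)
print(stats)
```

The same hit scores 100% over the query but only 25% over the subject, which is why the choice of denominator matters when searching with fragments or motifs.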

pn:EP* AND somato*^5 AND [mus musculus] AND clm:[transgenic animal ~3] AND pd:[19950101 TO 20140215]
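The query above combines several kinds of clause. A minimal sketch of how they fit together follows; the helper `build_query` is hypothetical, and the field meanings given in the comments (pn, clm, pd) are inferred from the query itself rather than from any documented syntax.

```python
def build_query(clauses):
    """Join individual clauses with AND, keeping each clause verbatim."""
    return " AND ".join(clauses)

clauses = [
    "pn:EP*",                      # wildcard on a field (assumed: publication number, EP documents)
    "somato*^5",                   # wildcard term followed by what appears to be a boost of 5
    "[mus musculus]",              # bracketed multi-word term
    "clm:[transgenic animal ~3]",  # proximity within a field (assumed: claims)
    "pd:[19950101 TO 20140215]",   # range on a field (assumed: publication date)
]

query = build_query(clauses)
print(query)
```

Keeping the clauses in a list makes it easier to add, remove, or reorder one restriction at a time while experimenting with a search.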

KEYWORD SEARCHING IN THE LIFE SCIENCES PRESENTS CHALLENGES

How do my text search results relate to my sequence hits?

How do I figure out this system’s query syntax?

What if a keyword is misspelled in a patent claim?

How can I exclude patents unrelated to my domain easily?

How do I build and maintain reliable synonym lists?

Can I be sure that all of the documents I need to review exist in the underlying database?

BUILDING A REPORT FROM DIFFERENT PLATFORMS IS CHALLENGING

Lack of life science specificity in search platforms creates multiple false-positive hits that require additional user review

Varying underlying algorithms can create an apples-to-oranges comparison

Different output formats make it difficult to analyze and compare results

Little cross-platform integration necessitates downloading multiple files for manual collation

Identify prior art surrounding gene modification in peanut for gene families implicated in food allergies.

“Ara h 1” is a seed storage protein from Arachis hypogaea. It is notable because sensitization to it was found in 95% of peanut-allergic patients from North America. We’re seeking prior art that describes vaccines related to these allergies, or that contains sequences matching the Ara h 1 gene.

CASE STUDY

Run a sequence search against the prior art for the peanut “ara h 1” gene sequence:

Arachis hypogaea cultivar LUHUA 8 Ara h 1 allergen (ara h 1) gene (cds)

Identify relevant documents related to peanuts that claim transgenic modification of plants to decrease allergy risks, limited to documents published after January 1st 2010

Text Search Sequence Search

CCCTCCATCATTTCACCATCCACACTCATAATAATCATATATATTCATCAATCATCTATATAAGTAGTGGCAGGAGCAATGAGAGGGAGGGTTTCTCCACTGATGCTGTTGCT…

SOLUTION: INTEGRATED LIFE SCIENCE SEARCH PLATFORMS

Union

Combine into a single, unique workfile

A COMPLETE REPORT FOR ANALYSIS

Claims containing vaccin* in green

Bioinformatics-related patents in red

Sequence search results in blue

A single, unified report for analyzing results.

STANDARD KEYWORDS AND BOOLEAN SYNTAX AREN’T ENOUGH

Life science applications are more than collections of discrete, specific keywords. They include field-specific ontological terms that can have synonyms, alternate spellings, and varying word order.

Building a single query that addresses all of these issues, plus allows the flexibility of Boolean, proximity, wildcard, field grouping, range searches, and term boosting, can be difficult.
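One piece of this problem, synonym handling, can be sketched as expanding a term into an OR group before it is placed in a Boolean query. The function `expand_term` and the synonym list below are hypothetical and only illustrate the idea; they do not reflect any platform's actual ontology data.

```python
def expand_term(term, synonyms):
    """Return an OR group covering a term and its known synonyms.

    Multi-word variants are wrapped in double quotes so they stay together.
    """
    variants = [term] + [s for s in synonyms.get(term, []) if s != term]
    quoted = ['"%s"' % v if " " in v else v for v in variants]
    return "(" + " OR ".join(quoted) + ")"

# Hypothetical user-defined synonym list for the peanut case study.
synonyms = {"peanut": ["groundnut", "Arachis hypogaea"]}

clause = expand_term("peanut", synonyms)
print(clause)
```

A term with no entry in the synonym list simply comes back as a one-item group, so the same function can be applied to every keyword in a query.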

USE EXISTING ONTOLOGY TERMS OR DEFINE YOUR OWN

As you type, suggested matching terms appear, based on the ontologies you choose

Simply typing “transgenic” with the NCBI ontology list allows “Transgenic Plants” as one option

At any time, type in the ? symbol for a complete list of field choices

Specify words in claims, date ranges, and many more options to further refine your query

Define your own ontologies and synonyms that are relevant for your specific search area

Includes synonyms and alternate spellings for the genus and species of peanut

Hit “Search” or <return> to run the search
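The type-ahead behaviour described on this slide can be approximated with a simple prefix match against a list of ontology terms. This is only a sketch: the function `suggest` and the small term list are hypothetical, with “Transgenic Plants” taken from the slide and the other terms invented for illustration.

```python
def suggest(prefix, ontology_terms, limit=5):
    """Return terms containing a word that starts with the typed prefix.

    Matching is case-insensitive; results are sorted alphabetically.
    """
    p = prefix.lower()
    hits = [
        term for term in ontology_terms
        if any(word.lower().startswith(p) for word in term.split())
    ]
    return sorted(hits)[:limit]

# Hypothetical ontology term list.
terms = ["Transgenic Plants", "Transgenic Animals", "Plant Proteins", "Allergens"]

suggestions = suggest("transgenic", terms)
print(suggestions)
```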

INSTANT RESULTS

A result preview is shown, and we save it as a workfile called “TEXT SEARCH”

THE “TEXT SEARCH” WORKFILE

Sort by any column

Rank for priority

Color code to categorize

Quickly assign colors/ranks using keyboard shortcuts

3 (for 3 stars)

O (for orange)

All the results seem relevant, but we want to annotate the documents talking about vaccines in the claims with a green color.
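The sort, rank, and color-code operations can be pictured as edits to a list of records. The sketch below assumes a workfile is just a list of rows; the `Record` class, its field names, and the document identifiers are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    patent_number: str
    title: str
    stars: int = 0               # 0 = unranked, 1 to 3 = priority
    color: Optional[str] = None  # category label such as "orange"

# Hypothetical workfile contents.
workfile = [
    Record("DOC-A", "Hypoallergenic peanut plants"),
    Record("DOC-B", "Peanut allergen vaccine"),
    Record("DOC-C", "Allergen sequence database"),
]

workfile[1].stars = 3          # equivalent of pressing "3"
workfile[0].color = "orange"   # equivalent of pressing "O"

# Sorting by any column is a sort on the matching attribute.
by_priority = sorted(workfile, key=lambda r: r.stars, reverse=True)
by_title = sorted(workfile, key=lambda r: r.title)
print([r.patent_number for r in by_priority])
```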

NAVIGATE A WORKFILE

Easily apply bulk annotations for future workfile manipulation

Keyboard shortcuts allow fast workfile evaluation

(next record)

(close preview)

(previous record)

FILTER A WORKFILE

Type in free text, use wildcards, or type in “?” to filter by terms in a specific field

FILTER A WORKFILE

Apply the filter to pull out the subset of documents that match your query.

12 documents contain the word “vaccine”, or related terms, in the claims.

Let’s annotate these in green.
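The filter-then-annotate step can be sketched as a wildcard match on the claims text followed by a bulk color assignment. The helper `matches_wildcard`, the record layout, and the claim texts are all hypothetical; a real platform may interpret wildcards differently.

```python
import re

def matches_wildcard(text, pattern):
    """True if any word in text matches pattern, where '*' means any run of word characters."""
    body = re.escape(pattern).replace(r"\*", r"\w*")
    regex = re.compile(r"\b" + body + r"\b", re.IGNORECASE)
    return regex.search(text) is not None

# Hypothetical workfile rows with claims text.
workfile = [
    {"id": "DOC-A", "claims": "A vaccine composition comprising a modified allergen.", "color": None},
    {"id": "DOC-B", "claims": "A method of vaccinating a subject against peanut allergy.", "color": None},
    {"id": "DOC-C", "claims": "A transgenic plant with reduced allergen content.", "color": None},
]

# Filter to the subset whose claims match vaccin*, then annotate it in bulk.
subset = [row for row in workfile if matches_wildcard(row["claims"], "vaccin*")]
for row in subset:
    row["color"] = "green"

print([row["id"] for row in subset])
```

Because the annotation is stored on the rows themselves, it survives when the filter is reset and the full workfile is shown again.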

MAKING DOCUMENTS WITH VACCINES IN THE CLAIMS GREEN


Here is what our subset (vaccine in claims) looks like. You can reset the filter to see other documents that are in the workfile.

Let’s annotate in red the documents that are probably not relevant. Notice that “Bio-informatics” is a synonym list that includes multiple spellings.

MAKING BIOINFORMATICS DOCUMENTS RED

40 documents relate to bioinformatics methods.

vaccin* in claims

bioinformatics related

HERE IS OUR TEXT SEARCH WORKFILE

Now it’s time to complete the analysis with sequence search results.

ara h 1 CDS sequence: GenePast, 90% ID over the length of the query or the subject (1000 results)

PREPARE YOUR SEQUENCE SEARCH RESULTS

We export these results to a LifeQuest workfile.

Apply a filter to keep only the hits whose patent sequence location is in the claims; this leaves 81 results in 25 patents.
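The difference between results and patents in this step comes from one patent contributing several sequence hits. A minimal sketch of the filter, using invented hit records (the field names and values are hypothetical, not actual search output):

```python
# Hypothetical sequence-search hits: one row per matching patent sequence.
hits = [
    {"patent": "DOC-1", "seq_id": 2, "location": "claims", "identity": 98.5},
    {"patent": "DOC-1", "seq_id": 7, "location": "claims", "identity": 91.0},
    {"patent": "DOC-2", "seq_id": 1, "location": "description", "identity": 95.2},
    {"patent": "DOC-3", "seq_id": 4, "location": "claims", "identity": 90.3},
]

# Keep only hits whose sequence appears in the claims.
in_claims = [h for h in hits if h["location"] == "claims"]

# Several hits can belong to the same patent, so count distinct patents separately.
patents = {h["patent"] for h in in_claims}

print(len(in_claims), "results in", len(patents), "patents")
```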

FILTER YOUR SEQUENCE SEARCH RESULTS & EXPORT

Save it as a new “SEQ search” workfile, and open to analyze.

EXPORT YOUR SEARCH RESULTS TO A WORKFILE

In the “SEQ search” workfile, color-code all documents blue.

MARK ALL OF THE SEQUENCE SEARCH DOCUMENTS BLUE


CONSOLIDATE TEXT SEARCH AND SEQUENCE SEARCH RESULTS

Merge the two workfiles together (union) to get a complete set for final analysis.

Sort, filter, analyze, and export!
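The union step can be sketched as merging rows keyed by document, so that a document found by both searches appears once. How a platform reconciles different annotations on the same document is not stated here; collecting every color per document is an assumption of this sketch, and the function name and identifiers are hypothetical.

```python
def union_workfiles(*workfiles):
    """Merge workfiles by document id, one row per document, collecting all colors."""
    merged = {}
    for workfile in workfiles:
        for row in workfile:
            entry = merged.setdefault(row["id"], {"id": row["id"], "colors": set()})
            if row.get("color"):
                entry["colors"].add(row["color"])
    return list(merged.values())

# Hypothetical annotated workfiles.
text_search = [
    {"id": "DOC-A", "color": "green"},
    {"id": "DOC-B", "color": "red"},
    {"id": "DOC-C", "color": None},
]
seq_search = [
    {"id": "DOC-A", "color": "blue"},
    {"id": "DOC-D", "color": "blue"},
]

merged = union_workfiles(text_search, seq_search)
for row in merged:
    print(row["id"], sorted(row["colors"]))
```

A document carrying both green and blue was found by the text search and by the sequence search, which makes it an obvious candidate for early review.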

EVALUATE THE MERGED DATA SETS

vaccin* in claims

bioinformatics related

sequence hit in claims

GENERATE A COMPLETE REPORT

GENERATE A COMPLETE REPORT FOR ANALYSIS

Includes results from both sequence & text searches

Create color codes for your specific categories

Merge with other outputs or export to any format

Sort or filter by any field

Rank hits (1, 2, 3 stars) to easily identify priority

Claims contain vaccin*

bioinformatics related

found using the “ara h 1” DNA sequence

A single, unified report for analyzing results.
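Sorting by rank and exporting the unified report can be sketched with the Python standard library. CSV is used here only as one example format; the rows, field names, and identifiers are hypothetical.

```python
import csv
import io

# Hypothetical merged report rows.
report = [
    {"id": "DOC-B", "stars": 1, "category": "bioinformatics related"},
    {"id": "DOC-A", "stars": 3, "category": "vaccin* in claims"},
    {"id": "DOC-D", "stars": 2, "category": "sequence hit in claims"},
]

# Highest-priority documents first.
rows = sorted(report, key=lambda r: r["stars"], reverse=True)

# Write to an in-memory buffer; a file opened with newline="" works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "stars", "category"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```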

PLEASE COME BY OUR BOOTH FOR MORE INFORMATION.