Multifile Patent Sequence Searching on STN®
Agenda
• Sequence searchable databases on STN®
• Step-by-step through a multifile BLAST search
• Multifile post-processing using STN Express
• Overview of the search results
• Recent database enhancements
• Summary and resources
See also: Sequence Basics e-Seminar: http://www.stn-international.com/Sequence_Basics_Seminar.html
STN sequence searchable databases
• DGENE
– Thomson Reuters GENESEQ™
– Value-added patent sequence data from around the globe
• USGENE
– The USPTO Genetic Sequence Database
– All available sequence data from the USPTO
• PCTGEN
– WIPO/PCT Patent Application Biosequences
– All available e-published sequence data from WIPO
• CAS REGISTRY℠
– Chemical Abstracts Service (CAS) REGISTRY
– Worldwide value-added patent and non-patent sequences
CAS REGISTRY/CAplus offers two sequence search modes
• NCBI BLAST similarity
– Using a separate graphical user interface
• Sequence Code Match (motif) searching
– Using the Search (=> S) command
DGENE, USGENE and PCTGEN offer three sequence search modes
• NCBI BLAST similarity
=> RUN BLAST
• FASTA-based similarity
=> RUN GETSIM
• Sequence Code Match (motif) search
=> RUN GETSEQ
Learn more in the DGENE Workshop Manual: http://www.stn-international.com/dgene_wm.html
Multifile patent sequence searching
Search Question: Find all patents that disclose Homo sapiens D-amino-acid oxidase (NCBI NP_001908), or similar sequences (≥ 80%):
MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPNNPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPRELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVNCTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQTVTLGGIFQLGNWSELNNIQDHNTIWEGCCRLEPTLKNARIIGERTGFRPVRPQIRLEREQLRTGPSNTEVIHNYGHGGYGLTIHWGCALEAAKLFGRILEEKKLSRMPPSHL
(Search conducted on 7th July 2010)
Multifile search strategy
1) RUN BLAST in DGENE, USGENE and PCTGEN using offline BATCH mode
2) Merge, organize by patent family, and display DGENE, USGENE and PCTGEN results
3) Repeat the search using CAS REGISTRY BLAST
4) Retrieve, identify, and display unique CAS REGISTRY BLAST CAplus records
5) Post-process DGENE, USGENE and PCTGEN results using the STN Express Table Tool
6) Post-process unique REGISTRY BLAST results using the BLAST Report Tool
SAVE, UPLOAD and VERIFY the query
• Prepare and save the query as a plain text file in a suitable text editor, e.g. Windows Notepad
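As a minimal sketch of this preparation step, the following Python normalizes a pasted sequence (dropping position numbers and whitespace) and writes it to a plain text file. The function names, the accepted-letter set, and the idea of validating before saving are illustrative assumptions, not requirements of STN Express.

```python
from pathlib import Path

# One-letter amino-acid codes plus common ambiguity codes. This set is an
# assumption for protein queries; a nucleotide query would need another set.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYBXZ")


def clean_sequence(raw: str) -> str:
    """Drop digits and whitespace, upper-case the rest, reject other symbols."""
    letters = [ch.upper() for ch in raw if not (ch.isdigit() or ch.isspace())]
    unexpected = sorted(set(letters) - PROTEIN_LETTERS)
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {unexpected}")
    return "".join(letters)


def save_query(raw: str, path: str) -> str:
    """Write the cleaned sequence as plain ASCII text and return it."""
    sequence = clean_sequence(raw)
    Path(path).write_text(sequence + "\n", encoding="ascii")
    return sequence


# The first 30 residues of the example query, pasted with position numbers.
print(clean_sequence("1 mrvvvigagv iglstalcih\n21 eryhsvlqpl"))
```

Calling `save_query(raw_text, "query.txt")` (a hypothetical file name) would then produce a file suitable for the upload step.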
From the Discover! button menu.
SAVE, UPLOAD and VERIFY the query (cont.)
(a) Click Upload Sequence
(b) Choose the query file
(c) Select the STN database
The sequence becomes a Query L-number in the database of choice for use with RUN BLAST.
SAVE, UPLOAD and VERIFY the query (cont.)
=> FILE USGENE
=> UPL R BLAST
Uploading C:\. . . .\NP_001908 Homo sapiens DAO.txt
UPLOAD SUCCESSFULLY COMPLETED
L1 GENERATED
=> D L1 LQUE
L1 ANSWER 1 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
LQUE MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSD
PNNPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPRELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVNCTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQTVTLGGIFQLGNWSELNNIQDHNTIWEGCCRLEPTLKNARIIGERTGFRPVRPQIRLEREQLRTGPSNTEVIHNYGHGGYGLTIHWGCALEAAKLFGRILEEKKLSRMPPSHL
The sequence query is now ready for searching directly in DGENE, USGENE, or PCTGEN using the L-number (L1).
Commands in red are automatically run by the STN Express Sequence Query Upload wizard.
Verify the sequence was uploaded successfully with D LQUE.
RUN the DGENE, USGENE and PCTGEN BLAST searches in BATCH mode
=> FILE DGENE
FILE 'DGENE' ENTERED AT 17:05:31 ON 07 JUL 2010
COPYRIGHT (C) 2010 THOMSON REUTERS
=> RUN BLAST L1 /SQP -F F BATCH
PLEASE ENTER BATCH IDENTIFIER (MAX. 8 CHARS):DAOP
TO BE NOTIFIED WHEN THIS BATCH SEARCH IS COMPLETE, PLEASE ENTER YOUR EMAIL ADDRESS (MAX. 50 CHARS) OR "NONE"
INPUT: OR (END):[email protected]
BLAST Version 2.2
The BLAST software is used herein with permission of the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM). . . .
BATCH PROCESSING STARTED FOR DAOP
Add BATCH to the end of a RUN BLAST command to search in offline batch search mode.
Enter a valid email address to be notified when the BATCH search is completed.
Tip: BATCH mode BLAST searches may be run concurrently in each database.
RUN the DGENE, USGENE and PCTGEN BLAST searches in BATCH mode (cont.)
=> FILE USGENE
=> RUN BLAST L1 /SQP -F F BATCH
. . . .
PLEASE ENTER BATCH IDENTIFIER (MAX. 8 CHARS):DAOP
. . . .
=> FILE PCTGEN
=> RUN BLAST L1 /SQP -F F BATCH
. . . .
PLEASE ENTER BATCH IDENTIFIER (MAX. 8 CHARS):DAOP
. . . .
=> LOG H
SESSION WILL BE HELD FOR 120 MINUTES
STN INTERNATIONAL SESSION SUSPENDED AT 17:07:14 ON 07 JUL 2010
RUN the USGENE and PCTGEN searches concurrently using BATCH.
Reminder: Turn the Low Complexity Filter off with the syntax: /SQP -F F
Tip: use LOGOFF HOLD (LOG H) to be able to return to the same STN session within two hours (optional).
=> FILE DGENE
FILE 'DGENE' ENTERED AT 17:11:25 ON 07 JUL 2010
COPYRIGHT (C) 2010 THOMSON REUTERS
=> RUN GETBATCH DAOP
Please enter your batch identifier
or enter # for batch id list
or enter * for batch id at top of list
or enter - before batch id to delete
or enter . for (end)
Database DGENE AA
Posted date: Jun 25, 2010 11:33 PM
. . . .
ENTER EITHER THE NUMBER OF ANSWERS YOU WISH TO KEEP
OR ENTER MINIMUM PERCENT OF SELF SCORE FOLLOWED BY %
(BEST ANSWER PERCENTAGE OF SELF SCORE IS 100%) ENTER (ALL) OR ? :80%
L2 RUN STATEMENT CREATED
L2 19 MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYAD. . . MPPSHL/SQP.-F F
Answer set arranged by accession number; to sort by descending similarity score, enter at an arrow prompt (=>) "sor score d".
Retrieve the BATCH search results
In this example, 80% of the Query Self Score is used to select out just the most relevant results (L2).
Use RUN GETBATCH to retrieve completed BATCH search results.
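The percent-of-self-score cut-off can be illustrated with a short Python sketch. The function names and the third record are hypothetical; the truncation to a whole percentage is an assumption that happens to agree with the values displayed in this example (a score of 728 against a self score of 731 is shown as 99%).

```python
def percent_of_self_score(score: int, self_score: int) -> int:
    """Whole-number percentage of the query self score (truncated)."""
    return score * 100 // self_score


def keep_hits(hits, self_score, min_percent):
    """Keep (accession, score) pairs at or above min_percent of self score."""
    return [(acc, s) for acc, s in hits if s * 100 >= min_percent * self_score]


SELF_SCORE = 731  # query self score reported in this example

# Two scores from the example plus one hypothetical low-scoring record.
hits = [("AAO23074", 731), ("AEL25470", 726), ("HYPOTHETICAL-LOW", 500)]

for accession, score in keep_hits(hits, SELF_SCORE, 80):
    print(accession, score, f"{percent_of_self_score(score, SELF_SCORE)}%")
```

With a self score of 731, an 80% threshold corresponds to a minimum score of 584.8, so only hits scoring 585 or more are kept.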
=> FILE USGENE
=> RUN GETBATCH DAOP
. . . .
ENTER EITHER THE NUMBER OF ANSWERS YOU WISH TO KEEP
OR ENTER MINIMUM PERCENT OF SELF SCORE FOLLOWED BY %
(BEST ANSWER PERCENTAGE OF SELF SCORE IS 100%) ENTER (ALL) OR ? :80%
L3 RUN STATEMENT CREATED
L3 14 MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYAD. . . MPPSHL/SQP.-F F
=> FILE PCTGEN
=> RUN GETBATCH DAOP
. . . .
ENTER EITHER THE NUMBER OF ANSWERS YOU WISH TO KEEP
OR ENTER MINIMUM PERCENT OF SELF SCORE FOLLOWED BY %
(BEST ANSWER PERCENTAGE OF SELF SCORE IS 100%) ENTER (ALL) OR ? :80%
L4 RUN STATEMENT CREATED
L4 3 MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYAD. . . MPPSHL/SQP.-F F
Retrieve the BATCH search results (cont.)
Use RUN GETBATCH to retrieve completed BATCH search results.
Merge the results into a single L-number
=> SET DUPORDER FILE
SET COMMAND COMPLETED
=> DUP IDE L2 L3 L4
FILE 'DGENE' ENTERED AT 17:16:56 ON 07 JUL 2010
COPYRIGHT (C) 2010 THOMSON REUTERS
FILE 'USGENE' ENTERED AT 17:16:56 ON 07 JUL 2010
COPYRIGHT (C) 2010 SEQUENCEBASE CORP
FILE 'PCTGEN' ENTERED AT 17:16:56 ON 07 JUL 2010
COPYRIGHT (C) 2010 WIPO
PROCESSING COMPLETED FOR L2
PROCESSING COMPLETED FOR L3
PROCESSING COMPLETED FOR L4
L5 36 DUP IDE L2 L3 L4 (INCLUDES 0 SETS OF DUPLICATES)
ANSWERS '1-19' FROM FILE DGENE
ANSWERS '20-33' FROM FILE USGENE
ANSWERS '34-36' FROM FILE PCTGEN
=> SOR IDENT D
PROCESSING COMPLETED FOR L5
L6 36 SOR L5 IDENT D
SET DUPORDER FILE ensures that multifile records merged using DUP IDE are organized by database (file).
DUPLICATE IDENTIFY (DUP IDE) is used here to create a single multifile L-number (L5).
The multifile L-number (L5) can be sorted by BLAST SCORE, or Percent Identity (IDENT).
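The merge-then-sort idea can be pictured with a small Python sketch. It is only an analogy for what DUP IDE and SOR IDENT D achieve, not a description of how STN implements them; the PCTGEN accession is a placeholder, and keeping tied records in file order is an assumption of the sketch.

```python
from operator import itemgetter

# (file, accession, percent identity). The PCTGEN accession is a placeholder;
# the other values are taken from records shown in this example.
l2 = [("DGENE", "AAO23074", 100), ("DGENE", "AEL25470", 99)]
l3 = [("USGENE", "20060275794.63099", 100)]
l4 = [("PCTGEN", "PCTGEN-PLACEHOLDER", 99)]


def merge_by_file(*answer_sets):
    """Concatenate answer sets in the order given, i.e. file by file."""
    merged = []
    for answers in answer_sets:
        merged.extend(answers)
    return merged


def sort_by_identity(answers):
    """Stable descending sort on percent identity; ties keep file order."""
    return sorted(answers, key=itemgetter(2), reverse=True)


l5 = merge_by_file(l2, l3, l4)   # analogous to the merged set L5
l6 = sort_by_identity(l5)        # analogous to the sorted set L6
for record in l6:
    print(record)
```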
Review multifile answers with a free-of-charge format including alignment
=> D L6 TRIAL SCORE ALIGN 1-36; FILE STNGUIDE
L6 ANSWER 1 OF 36 DGENE COPYRIGHT 2010 THOMSON REUTERS on STN
AN AAO23074 Protein DGENE
TI Determining a genotype of an individual for preparing a composition for treating schizophrenia by determining the identity of a nucleotide at a biallelic marker of the D-amino acid oxidase gene of the polynucleotide in a sample -
DESC Human D-amino acid oxidase wild-type protein.
KW Biallelic marker; D-amino acid oxidase; DAO; neuroleptic; CNS disorder; movement; Parkinson's disease; Huntington's; motor neurone; Alzheimer's; mood; unipolar depression; bipolar; . . . .
SQL 347
SCORE 731 100% of query self score 731
BLASTALIGN
Query = 347 letters
Length = 347
Score = 731 bits (1886), Expect = 0.0
Identities = 347/347 (100%), Positives = 347/347 (100%)
Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQP . . .
           MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQP
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQP . . .
Query Self Score and percentage.
Review answers with a free-of-charge format including alignment (cont.)
L6 ANSWER 4 OF 36 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
TI Collections of matched biological reagents and methods for identifying matched reagents (PublishedApplication)
MTY Protein
SQL 347
SCORE 731 100% of query self score 731
BLASTALIGN
Query = 347 letters
Length = 347
Score = 731 bits (1886), Expect = 0.0
Identities = 347/347 (100%), Positives = 347/347 (100%)
Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
           MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
Query: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
           NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Sbjct: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Query: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
           ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Sbjct: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Query: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .
           CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ
Sbjct: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .
BLAST Percent Identity (IDENT).
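For readers who post-process transcripts themselves, the score and identity lines of a BLASTALIGN display can be parsed with a couple of regular expressions. The sketch below is an illustration only: the function and key names are invented here, and it assumes the two summary lines appear in the wording shown in this example. The sample text is taken from the PCTGEN answer in this search.

```python
import re

SCORE_RE = re.compile(r"Score\s*=\s*(\d+)\s*bits\s*\((\d+)\)")
IDENT_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)\s*\((\d+)%\)")


def parse_blast_summary(text: str) -> dict:
    """Extract bit score, raw score, and identity counts from summary text."""
    score = SCORE_RE.search(text)
    ident = IDENT_RE.search(text)
    if score is None or ident is None:
        raise ValueError("no BLAST score/identity lines found")
    identical, length = int(ident.group(1)), int(ident.group(2))
    return {
        "bits": int(score.group(1)),
        "raw_score": int(score.group(2)),
        "identical": identical,
        "aligned_length": length,
        "reported_percent": int(ident.group(3)),
        # Recomputed with decimals, since the display shows whole percentages.
        "computed_percent": 100.0 * identical / length,
    }


summary = parse_blast_summary(
    "Score = 728 bits (1879), Expect = 0.0\n"
    "Identities = 346/347 (99%), Positives = 346/347 (99%)"
)
print(summary)
```

Recomputing the percentage from the identity counts is useful when sorting many hits that all display the same whole-number value.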
Review answers with a free-of-charge format including alignment (cont.)
L6 ANSWER 28 OF 36 PCTGEN COPYRIGHT 2010 WIPO on STN
TI ORGAN-SPECIFIC PROTEINS AND METHODS OF THEIR USE
MTY PRT
SQL 347
SCORE 728 99% of query self score 731
BLASTALIGN
Query = 347 letters
Length = 347
Score = 728 bits (1879), Expect = 0.0
Identities = 346/347 (99%), Positives = 346/347 (99%)
Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
           MRVVVIGAGVIGLSTALCIHERYHSVLQPL IKVYADRFTPLTTTDVAAGLWQPYLSDPN
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLHIKVYADRFTPLTTTDVAAGLWQPYLSDPN
Query: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
           NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Sbjct: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Query: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
           ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Sbjct: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Query: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .
           CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ
Sbjct: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .
Ensure Capture Session is on to record a transcript for use in post-processing
Note: Check the Capture Retrospectively box to capture the session so far, as well as the session from this point forwards.
Use the STN Express 8.4 Patent Family Manager wizard to display the results
Access the patent family manager wizard from the Discover! Menu.
Choose a bibliographic display format with alignment for the first (best) hit, and a free-of-charge format with alignment for the rest of the sequences in each patent family group.
=> FSORT L6
. . . .
L7 36 FSO L6
11 Multi-record Families Answers 1-33
Family 1 Answers 1-5
Family 2 Answers 6-8
Family 3 Answers 9-10
Family 4 Answers 11-12
Family 5 Answers 13-14
Family 6 Answers 15-16
Family 7 Answers 17-18
Family 8 Answers 19-25
Family 9 Answers 26-27
Family 10 Answers 28-31
Family 11 Answers 32-33
3 Individual Records Answers 34-36
0 Non-patent Records
The patent family manager begins by organising the results using FSORT...
In this example, 14 patent family groups (i.e. 11 + 3) are retrieved.
Commands in RED are those issued automatically by the STN Express Patent Family Manager.
FSORT organizes the patent sequence records by Publication, Application, Related, and Priority numbers.
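Grouping records that share patent numbers can be sketched in Python with a small union-find structure. This illustrates the general idea of family grouping only; it is not how FSORT is implemented, and every identifier below is a placeholder rather than a real patent number.

```python
from collections import defaultdict


def group_into_families(records):
    """Group record ids that share at least one patent number, transitively.

    records: dict mapping a record id to a set of patent-number strings
    (publication, application, related, or priority numbers).
    Returns a sorted list of sorted id lists, one per family.
    """
    parent = {rid: rid for rid in records}

    def find(rid):
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]  # path halving
            rid = parent[rid]
        return rid

    first_seen = {}  # patent number -> first record id that carried it
    for rid, numbers in records.items():
        for number in numbers:
            if number in first_seen:
                parent[find(rid)] = find(first_seen[number])
            else:
                first_seen[number] = rid

    families = defaultdict(list)
    for rid in records:
        families[find(rid)].append(rid)
    return sorted(sorted(members) for members in families.values())


# Placeholder identifiers only; none of these are real patent numbers.
records = {
    "rec-A": {"PUB-1", "PRI-1"},
    "rec-B": {"PUB-2", "PRI-1"},           # shares a priority with rec-A
    "rec-C": {"PUB-3", "APP-2"},
    "rec-D": {"PUB-4", "APP-2", "PRI-9"},  # shares an application with rec-C
    "rec-E": {"PUB-5"},                    # an individual record
}
print(group_into_families(records))
```

Here two multi-record families and one individual record result, mirroring the split between multi-record families and individual records in the FSORT output.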
=> DIS L7 PFAM=7 1 BIB,SQL,SCORE,IDENT,ALIGN
L7 ANSWER 17 OF 36 DGENE COPYRIGHT 2010 THOMSON REUTERS on STN
FAMILY 7
AN AEL25470 protein DGENE
TI Identifying compound that reduce/inhibit internal ribosome . . . .
IN Fear M
PA (TELE-N) TELETHON INST CHILD HEALTH RES.
PI WO 2006102720 A1 20061005 197
AI WO 2006-AU435 20060331
PRAI AU 2005-901574 20050331
PSL Disclosure; SEQ ID NO 18
LA English
OS 2006-747347 [76]
CR N-PSDB: AEL25469
PC-NCBI: gi30446
PC-SWISSPROT: P14920
DESC Reporter protein SEQ ID NO:18.
SQL 347
SCORE 726 99% of query self score 731
IDENT 99%
BLASTALIGN
Query = 347 letters
Length = 347
Score = 726 bits (1873), Expect = 0.0
Identities = 345/347 (99%), Positives = 345/347 (99%)
. . . .
...and then continues by displaying the family groups in the specified formats
Commands in RED are those issued automatically by the STN Express Patent Family Manager.
=> DIS L7 PFAM=7 2-TOT TRIAL,SCORE,IDENT,ALIGN
L7 ANSWER 18 OF 36 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
FAMILY 7
TI Isolation of Inhibitors of IRES-Mediated Translation (PublishedApplication)
DESC Homo Sapiens Protein; sequence 18 of 148
MTY Protein
SQL 347
SCORE 726 99% of query self score 731
IDENT 99%
BLASTALIGN
Query = 347 letters
Length = 347
Score = 726 bits (1873), Expect = 0.0
Identities = 345/347 (99%), Positives = 345/347 (99%)
Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
           MRVVVIGAGVIGLSTALCIHERYHSVLQPL IKVYADRFTPLTTTDVAAGLWQPYLSDPN
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLHIKVYADRFTPLTTTDVAAGLWQPYLSDPN
Query: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
           NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Sbjct: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Query: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
           ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Sbjct: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
. . . .
...and then continues by displaying the family groups in the specified formats (cont.)
This USGENE hit is in the same family as the DGENE record on the previous slide (FAMILY 7).
=> DIS L7 34-36 BIB,SQL,SCORE,IDENT,ALIGN
L7 ANSWER 34 OF 36 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
AN 20060275794.63099 Protein USGENE
TI Collections of matched biological reagents and methods for identifying matched reagents (PublishedApplication)
IN Carrino John (San Diego, CA); Liang Feng (San Diego, CA)
PA Invitrogen Corporation (Carlsbad CA)
PI US 20060275794 A1 20061207
AI US 2006-371354 20060307
PRAI WO 2005-US13914 20050422
     US 2005-673045P 20050419
     US 2005-665199P 20050325
     US 2005-665200P 20050325
     US 2005-659492P 20050307
     US 2005-659493P 20050307
PSL SEQ ID NO 63099
DESC Homo sapiens protein; sequence 63099
DT Patent
SQL 347
SCORE 731 100% of query self score 731
IDENT 100%
BLASTALIGN
Query = 347 letters
Length = 347
Score = 731 bits (1886), Expect = 0.0
Identities = 347/347 (100%), Positives = 347/347 (100%)
. . . .
...and then continues by displaying the family groups in the specified formats (cont.)
This USGENE record is the first of the 3 “individual records” in the FSORT answer set (L7).
Typical steps of CAS REGISTRY BLAST
1. Launch BLAST
2. Search the sequence
3. Examine and evaluate alignment/relevance of sequence answers
4. Display STN data on sequences – REGISTRY
5. Display STN data on sequences – CAplus℠
– Limit CAplus results, if necessary
– Display CAplus data (references and HITRN)
6. Post-process BLAST alignment data
Launch CAS REGISTRY BLAST
• The Result Set Manager is the starting point
– To begin a new sequence search
– To review results of previous sequence searches
Input the search query
• Sequences can be input by:
– Copy/paste
– Read from a file
– Recall a previously searched sequence within the same session
• Sequence line numbers do not interfere with the search.
Select the BLAST program
The following programs are most typically run:
• BLASTn for nucleotides
• BLASTp for proteins/peptides
Verify BLAST settings
Default values have been set to optimize sequence searches for researchers.
Recommended settings for patent searches:
• Low Complexity Filtering – unchecked
• Max No. of Answers – 1000
Evaluate the alignment report
The negative sign indicates that the alignment details are shown. Detailed information such as the sequence length, score, and percent identity is available.
Select sequences of interest
Sequences can be selected:
• In groups, using the color bar in the Alignment Scores
• Individually, by selecting the check box
• To transfer the sequence data to STN, click the Get STN Data button.
Get STN Data and Save alignments (.xss)
The alignment data is saved in STN Express Saved Sequences (.xss) format.
Alignment data needs to be transferred for post-processing.
Transfer sequences to STN
Display sequences if desired.
• Log on to STN; a REGISTRY search of the sequences then runs automatically.
• Results display can be accomplished using either Discover! wizards or command line input.
• Note: Type END or click Cancel to get out of the “Display Wizard”. You can turn off the “Display Wizard” in Preferences.
Display additional CAplus answers including the HITRN for alignment post-processing
=> FILE HCAPLUS
FILE 'HCAPLUS' ENTERED AT 17:25:10 ON 07 JUL 2010
COPYRIGHT (C) 2010 AMERICAN CHEMICAL SOCIETY (ACS)
=> S L12 AND PATENT/DT
L13 12 L12 AND PATENT/DT
=> TRANSFER L6 PN 1-
L14 TRANSFER L6 1- PN : 20 TERMS
L15 29 L14
ALL TERMS IN L14 RETRIEVED.
=> S L13 NOT L15
L16 2 L13 NOT L15
=> D BIB HITRN 1-2
The 44 REGISTRY records (L12) correspond to 12 HCAplus patent records (L13).
Transfer Publication Numbers (PN) from DGENE/USGENE/PCTGEN (L6) to find corresponding HCAplus records (L15).
In this example, 2 additional, highly relevant references have been found by including the REGISTRY/HCAplus search (L16).
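The NOT step is a plain set difference, which the following Python sketch mirrors. The accession strings are placeholders chosen only to reproduce the answer counts from this example (12, 29, and 2); they are not real HCAplus accession numbers.

```python
def records_not_in(candidates: set, already_found: set) -> set:
    """Return the candidates that are absent from already_found."""
    return candidates - already_found


# Placeholder accession strings sized to match the counts in this example.
l13 = {f"CA-{n}" for n in range(1, 13)}        # 12 patent records via REGISTRY
l15 = ({f"CA-{n}" for n in range(3, 13)}       # 10 records found by both routes
       | {f"OTHER-{n}" for n in range(19)})    # 19 found via PN transfer only

l16 = records_not_in(l13, l15)                 # analogous to S L13 NOT L15
print(len(l13), len(l15), sorted(l16))
```

Only the two records reachable through the REGISTRY route alone remain, which is why running the REGISTRY/HCAplus search in addition to the three patent sequence files is worthwhile.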
Example: Unique REGISTRY/CAplus result
L16 ANSWER 1 OF 2 HCAPLUS COPYRIGHT 2010 ACS on STN
AN 2002:391912 HCAPLUS
DN 137:1836
TI Measurement of DNA methylation for analysis of the toxicology . . . .
IN Olek, Alexander; Piepenbrock, Christian; Berlin, Kurt
PA Epigenomics Ag, Germany
SO PCT Int. Appl., 113 pp.
CODEN: PIXXD2
LA German
FAN.CNT 1
PATENT NO. KIND DATE APPLICATION NO. DATE
--------------- ---- -------- -------------------- --------
PI WO 2002040710 A2 20020523 WO 2001-EP12951 20011108
. . . .
PRAI DE 2000-10056802 A 20001114
     WO 2001-EP12951 W 20011108
IT 391975-30-7, Protein (human 347-amino acid)
RL: BSU (Biological study, unclassified); PRP (Properties); BIOL (Biological study)
(amino acid sequence; measurement of DNA methylation for anal. of the toxicol. of substances)
Note: HITRN must be included, so that the CAS REGISTRY BLAST alignments can be merged into the BLAST Report.
Access the Table Tool and select the multifile search Transcript file
The most recent STN session Transcript is usually listed here.
Choose a template and select content
Option: choose a pre-defined custom template from a previous project.
L7 is the DGENE, USGENE and PCTGEN FSORTed answer set.
Select fields, column order, headings, fonts and spacing for the table
The pre-defined custom template included a list of fields. These can be further customized and the template re-saved.
Explore the results further in Microsoft Excel
Some tips for Microsoft Excel:
• Resize columns and rows as desired – especially the BLAST alignment column to approx 77
• View, Freeze panes – holds the top row fixed when scrolling down
• Add Filters – provides a great way to navigate results – for example by BLAST percent identity (above)
Post-process REGISTRY BLAST alignments
Download the post-processing template (.PRF) files used in this seminar: http://www.stn-international.com/stn_biosequence_searching_mfs.html
Select BLAST alignment report
• The first step is to select the XSS file to include in the BLAST report.
• Important: If your BLAST query is fairly long, or a nucleic acid, or the answers may exceed 1000 characters, make sure you change the value in the Do not include alignments longer than box.
Post-processing then continues via standard STN Express Custom Report Tool steps.
Select the session Transcript and template
The most recent STN session Transcript is usually listed here.
Option: choose a pre-defined custom template from a previous project.
Select fields, fonts and spacing for the report
The pre-defined custom template included a list of fields. These can be further customized and the template re-saved.
Overview of search results for Homo sapiens D-amino-acid oxidase – unique in (red)
              SEQs ≥ 80%   PNs   Patent Families*
DGENE         19           10    8 (1)
USGENE        14           10    7 (2)
PCTGEN        3            3     3 (1)
REGISTRY      18           12    9 (2)
NCBI          6            4     4 (0)
Total Unique  -            -     14
(* Patent families = INPADOC Patent Families. Specifically, family records in INPAFAMDB.)
Recent database enhancements
• Simultaneous left and right truncation added to the basic index of DGENE and PCTGEN
• Recent backfile enhancements
– Thomson Reuters GENESEQ (DGENE)
– USPTO Genetic Sequence Database (USGENE)
– World patent application biosequences (PCTGEN)
Simultaneous left and right truncation (SLART) added to the Basic Index of DGENE & PCTGEN
• Left and right truncation provides improved text search capabilities to refine sequence searches
=> S L2 AND INFLAMMAT?
L3 7494 L2 AND INFLAMMAT?
=> S L2 AND ?INFLAMMAT?
L4 7525 L2 AND ?INFLAMMAT?
=> S L4 NOT L3
L5 31 L4 NOT L3
=> D TI DESC KW
L5 ANSWER 1 OF 31 DGENE COPYRIGHT 2011 THOMSON REUTERS on STN
TI New lipolytic enzyme, useful for treating digestive disorders, pancreatic insufficiency, pancreatitis or cystic fibrosis.
DESC Aspergillus niger lipolytic enzyme, SEQ ID 2.
KW LPY; Lipase; antiinflammatory; cystic fibrosis; gastrointestinal function disorder; gastrointestinal-gen.; lipolytic enzyme; . . . .
L2 = BLAST search results.
SLART may help retrieve additional answers (L5).
Note: SLART was already available in USGENE.
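The effect of adding left truncation can be pictured with a short Python sketch that translates a truncated term into a regular expression. The function names are invented for illustration, only a leading or trailing `?` is handled, and matching a single word case-insensitively is a simplifying assumption rather than a description of how the Basic Index works.

```python
import re


def truncation_to_regex(term: str) -> re.Pattern:
    """Translate a term with optional leading/trailing '?' into a regex.

    Only '?' at either end is handled; other truncation symbols are
    deliberately left out of this sketch.
    """
    left = term.startswith("?")
    right = term.endswith("?") and len(term) > 1
    core = term[1 if left else 0 : len(term) - (1 if right else 0)]
    pattern = (".*" if left else "") + re.escape(core) + (".*" if right else "")
    return re.compile(pattern, re.IGNORECASE)


def matches(term: str, word: str) -> bool:
    """True when the whole word satisfies the truncated term."""
    return truncation_to_regex(term).fullmatch(word) is not None


for word in ("inflammation", "antiinflammatory"):
    print(word, matches("INFLAMMAT?", word), matches("?INFLAMMAT?", word))
```

The keyword "antiinflammatory" in the displayed record is matched only by the term with both left and right truncation, which is the kind of record that makes up the 31 additional answers (7525 minus 7494).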
DGENE backfile enhancements
• A backfile of “mega publication” sequence data continues to be added to DGENE:
Entry year     Pub. years    Number of pubs.   Number of seqs.
2007           2002 – 2006   20                844,962
2008           2001 – 2008   52                4,575,648
2009           2004 – 2009   109               7,360,824
2010           2006 – 2009   34                2,354,859
2011 (-July)   2003 – 2008   13                3,386,264
Total                        228               18,522,557
Status: 22 July 2011.
Example: DGENE backfile “mega publication”
L1 ANSWER 4 OF 197024 DGENE COPYRIGHT 2011 THOMSON REUTERS on STN
AN AUK86054 DNA DGENE
TI Identifying a target protein of yeast or a gene encoding the target protein by identifying target protein and gene encoding the protein, and analyzing functions of the gene to identify characters given to the yeast by the gene.
IN Nakao Y; Kodama Y; Shimonaga T; Kanamori T
PA (SUNR) SUNTORY LTD.
PI WO 2007099451 A1 20070907 292
AI WO 2007-IB551 20070226
PRAI JP 2006-117198 20060228
PSL Disclosure; SEQ ID NO 98512
DED 03 FEB 2011 (first entry)
DT Patent
LA English
OS 2007-739784 [69]
DESC Saccharomyces pastorianus oligonucleotide, SEQ ID 98512.
KW Protein detection; protein purification; brewing; ss.
ORGN Saccharomyces pastorianus.
AB The present invention relates to a method for identifying a target protein of brewery yeast or a gene encoding the target protein. The method comprises cultivating yeast under a predetermined . . . .
NA 6 A; 7 C; 4 G; 8 T; 0 U; 0 Other
SQL 25
SEQ
1 taacccggtc cacgattttg aatct
197,024 backfile sequence records were recently added into DGENE, from WO2007099451 A1.
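The NA (nucleic acid composition) and SQL (sequence length) values in the record above can be cross-checked against the displayed sequence with a few lines of Python. This is an illustrative sketch; the function name and the layout of the returned dictionary are assumptions made here.

```python
from collections import Counter


def base_composition(display_sequence: str) -> dict:
    """Count bases in a displayed sequence, ignoring numbers and spaces."""
    cleaned = [ch.lower() for ch in display_sequence if ch.isalpha()]
    counts = Counter(cleaned)
    result = {base.upper(): counts.get(base, 0) for base in "acgtu"}
    result["Other"] = len(cleaned) - sum(result.values())
    result["SQL"] = len(cleaned)
    return result


# The SEQ field of AN AUK86054 exactly as displayed, including the position number.
print(base_composition("1 taacccggtc cacgattttg aatct"))
```

The counts agree with the record's NA field (6 A; 7 C; 4 G; 8 T; 0 U; 0 Other) and with SQL 25.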
USGENE backfile enhancements
• The following data/fields are now available for all records published from April 2006 onwards
– U.S. related application information (RLI)
– Priority application information (PRAI)
– Calculated patent expiration date (XPD)
– Patent term adjustment details (NTE, PTA)
– Patent Sequence Location (PSL)
– Sequence description (DESC)
See: The USGENE Workshop Manual (page 63): http://www.stn-international.com/usgene_wm.html
Example: USGENE backfile record with RLI, PRAI, XPD, NTE, PSL and DESC fields
L1 ANSWER 1 OF 1 USGENE COPYRIGHT 2012 SEQUENCEBASE CORP on STN
AN 6838433.6 Protein USGENE
TI IL-6 antagonist peptides (Patent)
IN Serlupi-Crescenzi Ottaviano (Rome, IT); Bressan Alessandro (Rome, IT); Della Pietra Linda (Rome, IT); Pezzotti Anna Rita (Rome, IT)
PA Applied Research Systems ARS Holding NV (Curacao NL)
PI US 6838433 B2 20050104
   US 20030186876 A1 20031002
AI US 2003-357479 20030204
RLI US 2000-715923 20001120
    WO 1999-EP3421 19980518
PRAI EP 1998-108997 19980518
XPD 20180518 (calculated)
NTE Subject to any Disclaimer, the term of this patent is extended or adjusted under 35 USC 154(b) by 54 days.
PSL Claim 1; SEQ ID NO 6
DESC Artificial protein; Synthetic; sequence 6 of 19
DT Patent
AB The present invention relates to IL-6 antagonist peptides, isolatable from a peptide library through the two-hybrids system by . . . .
ECLM US6838433 B2: 1. An IL-6 antagonist peptide, isolatable from a peptide library by binding to the intracellular domain of gp130 in a two-hybrid system for detecting protein-protein interaction, said peptide comprising SEQ ID NO:6, as well as salts, functional derivatives, and conservatively substituted analogs thereof having IL-6 antagonist activity.
. . . .
AN 6838433.6 is displayed here in BRIEF format, which includes all of the additional backfile content.
PSL indexing identifies if and where a SEQ ID NO is referred to in the claims.
USGENE organism name standardization
• Original typographical errors for “Homo sapiens” in the organism name field (ORGN) have now been corrected throughout the USGENE file
– Including converting “Human” to “Homo sapiens”
• From May 2011 onwards, similar standardization is applied for a list of top organisms in USGENE
– E.g.: Zea mays, Glycine max, Oryza sativa, Mus musculus, Arabidopsis thaliana, Streptococcus pneumoniae, Gossypium hirsutum, Triticum aestivum
Example: organism name standardization
=> D AN ORGN SEQ
L1 ANSWER 1 OF 1 USGENE COPYRIGHT 2012 SEQUENCEBASE CORP on STN
AN 20090232771.1 USGENE
ORGN Homo sapiens
SEQ
1 mdvvdsllvn gsnitppcel glenetlfcl dqprpskewq pavqillysl
51 ifllsvlgnt lvitvlirnk rmrtvtnifl lslavsdlml clfcmpfnli
101 pnllkdfifg savcktttyf mgtsvsvstf nlvaislery gaickplqsr
151 vwqtkshalk viaatwclsf timtpypiys nlvpftknnn qtanmcrfll
201 pndvmqqswh tflllilfli pgivmmvayg lislelyqgi kfeasqkksa
251 kerkpsttss gkyedsdgcy lqktrpprkl elrqlstgss sranrirsns
301 saanlmakkr virmlivivv lfflcwmpif sanawraydt asaerrlsgt
351 pisfilllsy tsscvnpiiy cfmnkrfrlg fmatfpccpn pgppgargev
401 geeeeggttg aslsrfsysh msasvppq
AN 20090232771.1 is SEQ ID NO 1 from US20090232771.
Direct links to view original NCBI source data have been added to USGENE records
=> D SQIDE
L1 ANSWER 1 OF 1 USGENE COPYRIGHT 2012 SEQUENCEBASE CORP on STN
TI Recombinant viral nucleic acids (Patent)
DESC DNA; sequence 12 of 13
SQL 109
SEQ
1 gttttaaata cgctcgagga tgatcagatt cttagtcctc tctttgctaa
51 ttctcaccct cttcctaaca actcctgctg tggagggcga tgttagcttc
101 cgtttatca
FEATURE TABLE:
Key       |Location|
==========+========+===============================================
USGENE    |1..109  |http://www.sequencebase.com/usgene.php?d=7192740.12
NCBI      |1..109  |http://www.sequencebase.com/ncbi.php?d=EA095311
source    |1..109  |/organism='unknown'
source    |1..109  |/mol_type='genomic DNA'
AN 7192740.12 is displayed here in SQIDE format, which includes sequence-specific fields.
Click this link to access the original sequence data as published via NCBI (next slide).
Direct links to view original NCBI source data have been added to USGENE records (cont.)
The source sequence data for AN 7192740.12 in USGENE (previous slide).
Note: This sequence was not included in the original published sequence listing.
PCTGEN backfile enhancements
• WIPO recently made a backfile of sequence data available, for the time period 1999 – 2007
– The majority are in image form (TIF, PDF, etc)
• Work is in progress by FIZ Karlsruhe Editorial to add this new backfile data into PCTGEN
– Using Optical Character Recognition (OCR)
– Including Quality Control and intellectual work
L1 ANSWER 1 OF 1 PCTGEN COPYRIGHT 2012 WIPO on STN
AN 2006030220.1 PRT PCTGEN
TI Compositions monovalent for CD4OL binding and methods of use [File created by using OCR software]
PA Grant et al., S.
PI WO 2006030220 20060323
RLI US 2004-610819P 20040917; US 2005-102512 20050408
ED 20120112
DT Patent
ORGN Homo sapiens
SQL 116
SEQ
1 evqllesggg lvqpggslrl scaasgftfs syamswvrqa pgkglewvsa
51 isgsggstyy adsvkgrfti srdnskntly lqmnslraed tavyycaksy
101 gafdywgqgt lvtvss
Example: New backfile data for PCTGEN
Records created from image format sequence listings are clearly marked.
Work is in progress on the new backfile data.
Summary
• RUN BLAST is available for searching DGENE, USGENE and PCTGEN directly on STN
• CAS REGISTRY BLAST provides BLAST searching options for the REGISTRY database
• DGENE, USGENE and PCTGEN multifile search results can be post-processed into tables, and exported to Microsoft Excel, using STN Express
• CAS REGISTRY BLAST alignment data can be merged with CAplus records and exported into RTF format to form a single unified report
• All four STN sequence databases are required for a comprehensive patent sequence search
Resources for sequence searching on STN
• DGENE Workshop Manual
http://www.stn-international.com/dgene_wm.html
• USGENE Workshop Manual
http://www.stn-international.com/usgene_wm.html
• CAS REGISTRY sequence searching resources
http://www.cas.org/support/stngen/stndoc/sequences.html
• Multifile BLAST searching (step-by-step guide)
http://www.stn-international.com/usgene_wm_mfs.html
Recorded STN e-Seminars are available to watch at your own pace….
• FIZ Karlsruhe recorded e-Seminars: http://www.stn-international.com/recorded_events.html
– Sequence Basics (all databases)
– Multifile patent sequence searching (all databases)
• CAS recorded e-Seminars: http://www.cas.org/support/stngen/stntraining/recorded.html
– Sequence motif searching (all databases)
– Processing sequence data (REGISTRY)
– Unmasking the World of Antibodies (REGISTRY)
For more information …
FIZ [email protected]
Support and Training: www.stn-international.de
CAS
E-mail: [email protected]
Support and Training: www.cas.org