NCBI API - Integration into analysis code

NCBI API – Integration into analysis code QBRC Tech Talk Jiwoong Kim


QBRC Tech Talk on April 1st, 2014



Outline

• Introduction

• Usage Guidelines of the E-utilities

• Sample Applications of the E-utilities


NCBI & Entrez

• The National Center for Biotechnology Information (NCBI) advances science and health by providing access to biomedical and genomic information.

• Entrez is NCBI's primary text search and retrieval system. It integrates the PubMed database of biomedical literature with 39 other literature and molecular databases, including DNA and protein sequence, structure, gene, genome, genetic variation and gene expression.


E-utilities

• Entrez Programming Utilities

– The Entrez Programming Utilities (E-utilities) are a set of eight server-side programs that provide a stable interface into the Entrez query and database system at the NCBI.

– The E-utilities use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data.

[Diagram: Input (URL) → E-utilities → Output (XML, FASTA, Text, …)]
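The fixed URL syntax can be illustrated with a short Python sketch. The helper name `build_eutils_url` is hypothetical and not part of any NCBI library; it simply joins the base URL, the program name and the URL-encoded input parameters. The characters `[`, `]`, `:` and `,` are left unescaped here only so that the result looks like the URLs shown in these slides.

```python
from urllib.parse import urlencode

# Base URL as given in the talk.
BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, **params):
    """Return the request URL for one E-utility (hypothetical helper).

    utility: program name without the ".fcgi" suffix, e.g. "esearch".
    params:  input parameters such as db, term, id, retmode.
    """
    # urlencode escapes spaces as "+" and double quotes as %22.
    query = urlencode(params, safe="[]:,")
    return f"{BASE_URL}{utility}.fcgi?{query}"


url = build_eutils_url("esearch", db="pubmed", term='"osteosarcoma"[majr:noexp]')
print(url)
```

The returned string can then be passed to `urllib.request.urlopen`, and the response body parsed according to the requested output format.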


Usage Guidelines and Requirements

• Use the E-utility URL

– baseURL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ …

– Python urllib/urlopen, Perl LWP::Simple, Linux wget, …

• Frequency, Timing and Registration of E-utility URL Requests

– Make no more than 3 requests per second → sleep(0.5)

– Run large jobs on weekends or between 9 PM and 5 AM Eastern time on weekdays

– Include &tool and &email in all requests

• Minimizing the Number of Requests

– &retmax=500

• Handling Special Characters Within URLs

– Space → +, " → %22, # → %23


ESearch (text searches)

• Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query.

• Syntax: esearch.fcgi?db=<database>&term=<query>

– Input: Entrez database (&db); Any Entrez text query (&term)

– Output: List of UIDs matching the Entrez query

• Example: Get the PubMed IDs (PMIDs) for articles about osteosarcoma

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22osteosarcoma%22[majr:noexp]
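Extracting the UID list from an ESearch response might look like the sketch below. It assumes the usual XML layout with `Count` and `IdList/Id` elements; the sample document and its UIDs are made up for illustration and are not real search results.

```python
import xml.etree.ElementTree as ET

# Made-up response for illustration; the UIDs are placeholders, not real hits.
SAMPLE_XML = """<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Return the total hit count and the list of UIDs from ESearch XML."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count", default="0"))
    uids = [node.text for node in root.findall("./IdList/Id")]
    return count, uids


count, uids = parse_esearch(SAMPLE_XML)
print(count, uids)
```

The count can exceed the number of returned UIDs, because only up to &retmax UIDs are listed per request.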

[Diagram: ESearch returns UIDs, which are passed to ESummary and EFetch]

ESummary (document summary downloads)

• Responds to a list of UIDs from a given database with the corresponding document summaries.

• Syntax: esummary.fcgi?db=<database>&id=<uid_list>

– Input: List of UIDs (&id); Entrez database (&db)

– Output: XML DocSums

• Example: Download DocSums for these PubMed IDs: 24450072, 24333720, 24333432

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=24450072,24333720,24333432
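A DocSum response can be turned into a dictionary keyed by UID. This sketch assumes the classic DocSum layout, in which each `DocSum` holds an `Id` and a series of `Item` elements with a `Name` attribute; the titles and journal names in the sample are placeholders, not the real records.

```python
import xml.etree.ElementTree as ET

# Illustrative response; field values are placeholders, not the real records.
SAMPLE_XML = """<eSummaryResult>
  <DocSum>
    <Id>24333432</Id>
    <Item Name="Title" Type="String">Placeholder title A</Item>
    <Item Name="Source" Type="String">Placeholder journal</Item>
  </DocSum>
  <DocSum>
    <Id>24333720</Id>
    <Item Name="Title" Type="String">Placeholder title B</Item>
    <Item Name="Source" Type="String">Placeholder journal</Item>
  </DocSum>
</eSummaryResult>"""


def parse_docsums(xml_text, fields=("Title", "Source")):
    """Map each UID to the selected top-level DocSum fields."""
    summaries = {}
    for docsum in ET.fromstring(xml_text).findall("DocSum"):
        uid = docsum.findtext("Id")
        items = {item.get("Name"): item.text for item in docsum.findall("Item")}
        summaries[uid] = {name: items.get(name) for name in fields}
    return summaries


summaries = parse_docsums(SAMPLE_XML)
print(summaries["24333432"]["Title"])
```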


EFetch (data record downloads)

• Responds to a list of UIDs in a given database with the corresponding data records in a specified format.

• Syntax: efetch.fcgi?db=<database>&id=<uid_list>&rettype=<retrieval_type>&retmode=<retrieval_mode>

– Input: List of UIDs (&id); Entrez database (&db); Retrieval type (&rettype); Retrieval mode (&retmode)

– Output: Formatted data records as specified

• Example: Download the abstract of PubMed ID 24333432

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=24333432&rettype=abstract&retmode=text


ELink (Entrez links)

• Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database

• Checks for the existence of a specified link from a list of one or more UIDs

• Creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs.


ELink (Entrez links)

• Syntax: elink.fcgi?dbfrom=<source_db>&db=<destination_db>&id=<uid_list>

– Input: List of UIDs (&id); Source Entrez database (&dbfrom); Destination Entrez database (&db)

– Output: XML containing linked UIDs from source and destination databases

• Example: Find one set or separate sets of Gene IDs linked to PubMed IDs 24333432 and 24314238

– One set (comma-separated &id): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432,24314238

– Separate sets (repeated &id): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432&id=24314238
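The difference between the two URL forms, and one way to read the result, can be sketched in Python. The parser assumes the usual `LinkSet` / `IdList` / `LinkSetDb` / `Link` layout of ELink XML; the Gene IDs in the sample response are placeholders, not the real links for these articles.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ELINK = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
pmids = ["24333432", "24314238"]

# One comma-separated &id value: the linked UIDs come back as one set.
one_set = ELINK + "?" + urlencode(
    {"dbfrom": "pubmed", "db": "gene", "id": ",".join(pmids)}, safe=",")

# One &id parameter per UID: each input UID gets its own LinkSet.
separate_sets = ELINK + "?" + urlencode(
    [("dbfrom", "pubmed"), ("db", "gene")] + [("id", p) for p in pmids])

# Illustrative response for the "separate sets" form; Gene IDs are placeholders.
SAMPLE_XML = """<eLinkResult>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList><Id>24333432</Id></IdList>
    <LinkSetDb>
      <DbTo>gene</DbTo>
      <LinkName>pubmed_gene</LinkName>
      <Link><Id>1001</Id></Link>
      <Link><Id>1002</Id></Link>
    </LinkSetDb>
  </LinkSet>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList><Id>24314238</Id></IdList>
    <LinkSetDb>
      <DbTo>gene</DbTo>
      <LinkName>pubmed_gene</LinkName>
      <Link><Id>1003</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""


def parse_linksets(xml_text):
    """Map each group of source UIDs to its linked destination UIDs."""
    links = {}
    for linkset in ET.fromstring(xml_text).findall("LinkSet"):
        sources = tuple(n.text for n in linkset.findall("./IdList/Id"))
        targets = [n.text for n in linkset.findall("./LinkSetDb/Link/Id")]
        links[sources] = targets
    return links


links = parse_linksets(SAMPLE_XML)
print(one_set)
print(separate_sets)
print(links)
```

Separate sets are the form to use when the script needs to know which Gene ID belongs to which article.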


EGQuery (global query)

• Responds to a text query with the number of records matching the query in each Entrez database.

• Syntax: egquery.fcgi?term=<query>

– Input: Entrez text query (&term)

– Output: XML containing the number of hits in each database.

• Example: Determine the number of records for mouse in Entrez.

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=mouse[orgn]&retmode=xml
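Reading the per-database counts could look like the sketch below. It assumes that each database is reported in a `ResultItem` element with `DbName` and `Count` children; that layout is an assumption here, and the counts in the sample are placeholders rather than real numbers.

```python
import xml.etree.ElementTree as ET

# Illustrative response; the counts are placeholders.
SAMPLE_XML = """<Result>
  <Term>mouse[orgn]</Term>
  <eGQueryResult>
    <ResultItem>
      <DbName>pubmed</DbName>
      <MenuName>PubMed</MenuName>
      <Count>10</Count>
    </ResultItem>
    <ResultItem>
      <DbName>gene</DbName>
      <MenuName>Gene</MenuName>
      <Count>20</Count>
    </ResultItem>
  </eGQueryResult>
</Result>"""


def hits_per_database(xml_text):
    """Return {database name: hit count}, skipping non-numeric counts."""
    counts = {}
    for item in ET.fromstring(xml_text).iter("ResultItem"):
        text = item.findtext("Count", default="")
        if text.isdigit():
            counts[item.findtext("DbName")] = int(text)
    return counts


counts = hits_per_database(SAMPLE_XML)
print(counts)
```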


ESpell (spelling suggestions)

• Retrieves spelling suggestions for a text query in a given database.

• Syntax: espell.fcgi?term=<query>&db=<database>

– Input: Entrez text query (&term); Entrez database (&db)

– Output: XML containing the original query and spelling suggestions.

• Example: Find spelling suggestions for the query "osteosacoma" in PubMed Central (db=pmc).

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?term=osteosacoma&db=pmc


EInfo (database statistics)

• Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases.

• Syntax: einfo.fcgi?db=<database>

– Input: Entrez database (&db)

– Output: XML containing database statistics

• Example: Find database statistics for Entrez Protein.

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=protein


EPost (UID uploads)

• Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset.

• Syntax: epost.fcgi?db=<database>&id=<uid_list>

– Input: List of UIDs (&id); Entrez database (&db)

– Output: Web environment (&WebEnv) and query key (&query_key) parameters specifying the location on the Entrez history server of the list of uploaded UIDs

• Example: Upload five Gene IDs (7173, 22018, 54314, 403521, 525013) for later processing.

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=gene&id=7173,22018,54314,403521,525013
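For long UID lists, the same upload can be sent as an HTTP POST so the IDs travel in the request body rather than in the URL. The sketch below only builds the request object; `build_epost_request` is a hypothetical helper name, and actually sending the request is left as a comment.

```python
from urllib.parse import urlencode
from urllib.request import Request

EPOST = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"


def build_epost_request(db, uids):
    """Build a POST request that uploads uids to the History Server."""
    body = urlencode({"db": db, "id": ",".join(str(u) for u in uids)}, safe=",")
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(EPOST, data=body.encode("ascii"))


request = build_epost_request("gene", [7173, 22018, 54314, 403521, 525013])
print(request.get_method(), request.data)
# To send it: urllib.request.urlopen(request).read()
```

The XML that comes back carries the WebEnv and query key to use in later ESummary, EFetch or ELink calls.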


Application 1

• Find human genes related to articles retrieved with the MeSH major topic "Osteosarcoma", without explosion ([majr:noexp]) (PubMed → Gene)

1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22osteosarcoma%22[majr:noexp]&usehistory=y

2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&query_key=1&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266&term=%22homo+sapiens%22[organism]&cmd=neighbor_history

3. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&query_key=3&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266
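In a script, the WebEnv and query key would not be copied by hand but read from each response and passed to the next call. The following is a minimal sketch of that hand-off, assuming the response carries `QueryKey` and `WebEnv` elements; the helper names and the `PLACEHOLDER_WEBENV` value are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Illustrative ESearch response with usehistory=y; values are placeholders.
SAMPLE_XML = """<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>PLACEHOLDER_WEBENV</WebEnv>
</eSearchResult>"""


def history_params(xml_text):
    """Extract the History Server coordinates from a response."""
    root = ET.fromstring(xml_text)
    return {"query_key": root.findtext("QueryKey"),
            "WebEnv": root.findtext("WebEnv")}


def next_step_url(utility, history, **params):
    """Build the URL of the next pipeline step from the stored history."""
    query = urlencode({**params, **history}, safe="[]:,")
    return f"{BASE_URL}{utility}.fcgi?{query}"


history = history_params(SAMPLE_XML)
url = next_step_url("elink", history,
                    dbfrom="pubmed", db="gene", cmd="neighbor_history")
print(url)
```

The same two helpers can be reused for step 3 by parsing the ELink response and calling `next_step_url("esummary", ...)` with `db="gene"`.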


Application 1

• Find human genes related to articles retrieved with the MeSH major topic "Osteosarcoma", without explosion ([majr:noexp]) (PubMed → Gene)

– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz can be used instead of "ELink".

– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz can be used instead of "ESummary".


Application 2

• Find nucleotide sequences of "Burkholderia cepacia complex" and download in GenBank format

1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=%22burkholderia+cepacia+complex%22[organism]&usehistory=y

2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&query_key=1&WebEnv=NCID_1_264773253_130.14.22.215_9001_1396244608_457974498&rettype=gb&retmode=text
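When the search matches many records, the download in step 2 is usually split into batches with &retstart and &retmax. The sketch below only generates the batch URLs; `efetch_batch_urls` is a hypothetical helper, the WebEnv value is a placeholder, and the total of 1200 records is an arbitrary example rather than the real hit count.

```python
from urllib.parse import urlencode

EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_batch_urls(db, query_key, webenv, total,
                      batch_size=500, rettype="gb", retmode="text"):
    """Return one EFetch URL per batch of records stored on the History Server."""
    urls = []
    for retstart in range(0, total, batch_size):
        params = {
            "db": db,
            "query_key": query_key,
            "WebEnv": webenv,
            "rettype": rettype,
            "retmode": retmode,
            "retstart": retstart,   # index of the first record in this batch
            "retmax": batch_size,   # maximum number of records in this batch
        }
        urls.append(EFETCH + "?" + urlencode(params))
    return urls


urls = efetch_batch_urls("nuccore", 1, "PLACEHOLDER_WEBENV", total=1200)
for u in urls:
    print(u)
```

Each URL would be fetched in turn, with a pause between requests, and the text appended to a single GenBank file.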


Application 3

• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets

[Workflow diagram; the order of the steps is reconstructed from the slide labels]

– PubMed branch: cancer "copy number" → esearch.fcgi?db=pubmed → WebEnv, query_key → esummary.fcgi?db=pubmed → elink.fcgi?dbfrom=pubmed&db=gds

– GEO branch: Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter] → esearch.fcgi?db=gds → WebEnv, query_key → esummary.fcgi?db=gds → GPL9704, GPL8226, GPL6804, GPL6801 → esearch.fcgi?db=gds

– Parsing → Common → Result table (PubMed title)


Make custom scripts with an XML parser


EBot

• EBot is an interactive web tool that first allows users to construct an arbitrary E-utility analysis pipeline and then generates a Perl script to execute the pipeline. The Perl script can be downloaded and executed on any computer with a Perl installation. For more details, see the EBot page linked below.

– http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi


Entrez Direct

• E-utilities on the UNIX Command Line

• Download from ftp://ftp.ncbi.nih.gov/entrez/entrezdirect/

• Entrez Direct Functions

– esearch performs a new Entrez search using terms in indexed fields.

– elink looks up neighbors (within a database) or links (between databases).

– efilter filters or restricts the results of a previous query.

– efetch downloads records or reports in a designated format.

– xtract converts XML into a table of data values.

– einfo obtains information on indexed fields in an Entrez database.

– epost uploads unique identifiers (UIDs) or sequence accession numbers.

– nquire sends a URL request to a web page or CGI service.

• Entering Query Commands

– esearch -db pubmed -query "opsin gene conversion" | elink -related


Links

• References

– Entrez Programming Utilities Help: http://www.ncbi.nlm.nih.gov/books/NBK25501/

– Entrez Help: http://www.ncbi.nlm.nih.gov/books/NBK3836/

• Useful Links

– Entrez Unique Identifiers (UIDs) for selected databases: http://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.chapter2_table1/?report=objectonly

– Valid values of &retmode and &rettype for EFetch (null = empty string): http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1/?report=objectonly

– The full list of Entrez links: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html


NCBI databases

• Literature: PubMed, PubMed Central, NLM Catalog, MeSH, Books, Site Search

• Health: PubMed Health, MedGen, GTR, dbGaP, ClinVar, OMIM, OMIA

• Organisms: Taxonomy

• Nucleotide Sequences: Nucleotide, GSS, EST, SRA, PopSet, Probe

• Genomes: Genome, Assembly, Epigenomics, UniSTS, SNP, dbVar, BioProject, BioSample, Clone

• Genes: Gene, HomoloGene, UniGene, GEO Profiles, GEO DataSets

• Proteins: Protein, Conserved Domains, Protein Clusters, Structure

• Chemicals: PubChem Compound, PubChem Substance, PubChem BioAssay

• Pathways: BioSystems


E-utilities

• Eight server-side programs

– ESearch : Searching a Database

– EPost : Uploading UIDs to Entrez

– ESummary : Downloading Document Summaries

– EFetch : Downloading Full Records

– ELink : Finding Related Data Through Entrez Links

– EInfo : Getting Database Statistics and Search Fields

– EGQuery : Performing a Global Entrez Search

– ESpell : Retrieving Spelling Suggestions


Sample Applications of the E-utilities

• Basic pipelines

– ESearch - ESummary/EFetch

– EPost - ESummary/EFetch

– ELink - ESummary/EFetch

– ESearch - ELink - ESummary/EFetch

– EPost - ELink - ESummary/EFetch

– EPost - ESearch

– ELink - ESearch


Application 3

• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets

1. tr '\n' '\t' < cancer_copy_number.pubmed_result.txt | sed 's/\t\t/\n/g' | sed 's/^\t[0-9]*: //' | sed 's/\t/ /g' > cancer_copy_number.pubmed_result.oneLine.txt

2. sed 's/^.* PubMed *PMID: *//' cancer_copy_number.pubmed_result.oneLine.txt | sed 's/; .*//' | sed 's/\.$//' > cancer_copy_number.pubmed_ids.txt

3. for id in $(cat cancer_copy_number.pubmed_ids.txt); do perl ~/scripts/elink.pl pubmed gds $id pubmed_gds | sed "s/^/$id\t/"; done > cancer_copy_number.pubmed_gds_ids.txt

4. awk -F'\t' '($1 == "Platform")' Affymetrix_Genome-Wide_Human_SNP_Array.gds_result.txt | cut -f2 | sed 's/^Accession: //' > Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt

5. for platform in $(cat Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt); do perl ~/scripts/esearch.pl gds $platform; done | sort -nu > Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt

6. paste cancer_copy_number.pubmed_ids.txt cancer_copy_number.pubmed_result.oneLine.txt | perl ~/scripts/table.addColumns.pl cancer_copy_number.pubmed_gds_ids.txt 0 - 0 1 | perl ~/scripts/table.search.pl Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt 0 - 1 | perl ~/scripts/table.mergeLines.pl -d ', ' - 0,2 > cancer_copy_number.Affymetrix_Genome-Wide_Human_SNP_Array.pubmed_gds.txt
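The final join in step 6 relies on custom Perl scripts that are not shown in the slides. As a rough approximation of what that step appears to do, the Python sketch below keeps only the PubMed-to-GDS links whose GDS ID also appears in the platform list, then merges the remaining GDS IDs per article. The function name and the toy IDs are made up; the real inputs would be read from the files produced in steps 3 and 5.

```python
from collections import defaultdict


def articles_with_platform_datasets(pmid_gds_pairs, platform_gds_ids):
    """Return {PMID: "gds1, gds2"} for links that hit the platform GDS IDs.

    pmid_gds_pairs:   iterable of (PMID, GDS ID) pairs, as in step 3.
    platform_gds_ids: iterable of GDS IDs for the platforms, as in step 5.
    """
    platform = set(platform_gds_ids)
    common = defaultdict(list)
    for pmid, gds_id in pmid_gds_pairs:
        # Keep a link only if the dataset belongs to one of the platforms.
        if gds_id in platform and gds_id not in common[pmid]:
            common[pmid].append(gds_id)
    # Drop articles with no matching dataset and merge IDs like step 6 does.
    return {pmid: ", ".join(ids) for pmid, ids in common.items() if ids}


# Toy data for illustration only.
pairs = [("100", "1"), ("100", "2"), ("100", "5"), ("200", "3")]
result = articles_with_platform_datasets(pairs, ["2", "5"])
print(result)
```

Joining the article titles from step 1 onto this dictionary by PMID would then give the result table.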