Managing Bioinformatics Data With Oracle
Jansen Lim
Bioinformatics
Bristol-Myers Squibb Pharmaceutical Research Institute

Topics
• Who we are
• Variety of biological data
• Roadblocks encountered
• Overcoming barriers
• Providing information to users

Bioinformatics
• Manages nucleic/protein and expression data
• Provides expert bioinformatics assistance that impacts drug discovery programs
• Develops tools for
  • analyzing sequences and expression profiling data
  • searching biological data among different DBs
  • generating information-rich reports (genome browser)
  • annotating proprietary sequences

Data Sources
• Genbank
  • Nucleotides and ESTs
  • Contigs
  • RefSeq Records
• SwissProt/TrEmbl
• UniProt
• NRDB
• Derwent Patents
• Affymetrix
• Licensed content
• Expression profiling
• LocusLink
• Incyte "Foundation"
• ENSEMBL
• WIPO

Lots of Databases
• Sequence repository (DS SeqStore from Accelrys)
• Expression profiling
• Incyte Foundation
• Gene annotation
• LocusLink
• SNPs
• …and many more

Hardware Configuration
• Production and development servers on Sun Microsystems SunFire (Solaris 8)
• Processors: 16 x 500 MHz UltraSparc III
• I/O: private gigabit fiber channel between disk subsystem and CPU, CXFS

Oracle Database
• Oracle 9i (9.2.0.1.0)
  • DS SeqStore (550 GB)
  • Incyte Foundation (130 GB)
  • Gene annotation (1.5 GB)
• Oracle 8i (8.1.7.4.0)
  • LocusLink (0.5 GB)
  • Expression Profiling (200 GB)

Major Challenges
• Large database sizes and frequent updates -> loading and extracting data must be efficient and scalable
• Different sources -> different formats
• Backups/Exports/Imports

Oracle Swiss Army Knife
• Loading and Extraction
  • SQL*Loader
  • PL/SQL
  • Pro*C/OCI
• Transformation/Web Apps/Scripts
  • Perl/BioPerl
• Sharing data
  • Database links
  • Materialized views

SQL*Loader
LocusLink (to be replaced by Entrez Gene)
• Data mirrored nightly from NCBI
• Schema generated from LL_tmpl
• Parse and transform files (LL_tmpl, LL_out, loc2UG, loc2ref, loc2acc, mim2loc)
• Generate control files
• Run SQL*Loader in direct path mode to load into actual and/or temp tables
• Completes in < 30 min

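The "generate control files" step above could look roughly like the sketch below. It is only an illustration: the table name `ll_loc2acc_tmp`, the column names, the `|` delimiter, and the sample values are assumptions, not the schema or files used in the talk. The control-file keywords themselves (LOAD DATA, INFILE, TRUNCATE, INTO TABLE, FIELDS TERMINATED BY, TRAILING NULLCOLS) are standard SQL*Loader syntax.

```python
def make_control_file(table, columns, datafile, delimiter="|"):
    """Return SQL*Loader control-file text for one delimited data file."""
    column_list = ",\n  ".join(columns)
    return (
        "LOAD DATA\n"
        f"INFILE '{datafile}'\n"
        "TRUNCATE\n"
        f"INTO TABLE {table}\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        "TRAILING NULLCOLS\n"
        f"(\n  {column_list}\n)\n"
    )


def to_loader_row(fields, delimiter="|"):
    """Join parsed fields into one delimited line, replacing stray delimiters."""
    cleaned = [str(f).replace(delimiter, " ").strip() for f in fields]
    return delimiter.join(cleaned)


# Hypothetical temp table and made-up values, for illustration only.
ctl = make_control_file(
    "ll_loc2acc_tmp", ["locus_id", "accession", "gi"], "loc2acc.dat"
)
print(ctl)
print(to_loader_row([1, "ACC0001", 12345]))
```

The generated file would then be passed to the loader with the direct path option, for example `sqlldr control=loc2acc.ctl direct=true` plus connection details.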
SQL*Loader
Incyte Foundation
• HUGE XML formatted data organized by chromosome number
  • Full length transcripts, gene function, biological role, SNPs, ESTs
  • Genomic DNA
• Schema generated from DTD
• Parse XML data -> SQL*Loader friendly format
• Load into staging area
• Post SQL*Loader processing

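A minimal sketch of the "parse XML -> SQL*Loader friendly format" step, assuming a streaming parser so that very large files never have to fit in memory. The element and attribute names (`chromosome`, `transcript`, `gene_function`) are invented for the example; the real Incyte DTD is not shown in the slides.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample; the real data follows the vendor DTD.
SAMPLE_XML = """<chromosome number="1">
  <transcript id="T1"><gene_function>kinase</gene_function></transcript>
  <transcript id="T2"><gene_function>transporter</gene_function></transcript>
</chromosome>"""


def xml_to_rows(source, delimiter="|"):
    """Yield one delimited line per <transcript>, streaming through the file."""
    chrom = ""
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "chromosome":
            chrom = elem.get("number", "")
        elif event == "end" and elem.tag == "transcript":
            function = elem.findtext("gene_function", default="")
            yield delimiter.join([chrom, elem.get("id", ""), function])
            elem.clear()  # release the finished element's children


for row in xml_to_rows(io.BytesIO(SAMPLE_XML.encode("utf-8"))):
    print(row)
```

Each yielded line can be written straight to the data file named in a control file and loaded into the staging area.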
What Goes In ...
• Derwent (Patent)
• Incyte (EST & transcripts)
• BMS (via GeneTracker & d2clusters)
• Alliance (Lexicon & Pharmagene)
• Frescobi (non-redundant nucleotide/protein)
• SGD
• WIPO (Patent)
• Genbank (Nucleotide, EST, RefSeq, Contigs)
• SwissProt (Highly annotated protein DB)
• SPTrembl (Translated coding sequences)
• IPI (Human, mouse, & rat)
-> Sequence Database

… Must Come Out
Sequence Database ->
• Harmonizer
• Wisconsin Package
• Fetch
• Blastable files
  • public nucleotides
  • public and patent proteins
  • human refseq
  • Incyte transcripts
  • mouse genomic contigs
  • ...
• GeneTracker
• Frescobi
• SeqWeb
• Ad-hoc query

Sequence Repository
DS SeqStore
• One-stop shop for storing, retrieving, and maintaining public and proprietary sequence data
• Nightly update of Genbank nucleotides and ESTs
• Weekly update of SwissProt/TrEmbl
• Bi-weekly update of Geneseq
• Loading using multiple CPUs and array inserts resulting in fast load times
• Performance dependent on amount of annotation

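The SeqStore loader itself is not shown in the slides. To illustrate why array inserts help, the sketch below sends rows to the database in batches rather than one statement per row. It uses Python's built-in sqlite3 module purely as a stand-in for Oracle; the table, column names, and batch size are assumptions.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_in_batches(conn, records, batch_size=500):
    """Insert records one batch per call; return the number of batch calls."""
    calls = 0
    for chunk in batched(records, batch_size):
        conn.executemany(
            "INSERT INTO seq (accession, length) VALUES (?, ?)", chunk
        )
        calls += 1
    conn.commit()
    return calls


conn = sqlite3.connect(":memory:")  # stand-in database, not Oracle
conn.execute("CREATE TABLE seq (accession TEXT, length INTEGER)")
records = [(f"ACC{i:05d}", 100 + i) for i in range(1200)]  # made-up rows
calls = load_in_batches(conn, records, batch_size=500)
print(calls, "batch calls")
```

With a client/server database the saving comes from fewer round trips per row; splitting the input across several such loaders is one way to use multiple CPUs.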
Creating Blastable Files
Need custom FASTA dumper to embed information in the header line
• Patent records -> patent no, publication date, document detail (e.g. claim, disclosure, etc.)
• RefSeq -> need version
• Proteins -> need /coded_by feature
• taxid

Examples
>REFSEQN:NM_198453.1 /taxid=9606 /descr=Homo sapiens FLJ46675 protein (FLJ46675), mRNA.
>GCGGB:AB003693.1 /ref_text=Patent: US 5589355-A 31-DEC-1996; /taxid=1697 /descr=Corynebacterium ammoniagenes DNA for rib operon, complete cds.
>REFSEQP:NP_940855.1 /coded_by=NM_198453.1:34..2739 /taxid=9606 /descr=FLJ46675 protein [Homo sapiens].

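The header layout in the examples above can be reproduced with a small formatter. This is a sketch inferred from the three example lines only; the function names, the argument names, and the 60-column sequence wrap are assumptions, and the production dumper was written in Pro*C.

```python
def fasta_header(db, accession, version, descr,
                 taxid=None, coded_by=None, ref_text=None):
    """Build a header line with optional /key=value tags, as in the examples."""
    parts = [f">{db}:{accession}.{version}"]
    if ref_text:
        parts.append(f"/ref_text={ref_text}")   # patent records
    if coded_by:
        parts.append(f"/coded_by={coded_by}")   # proteins
    if taxid is not None:
        parts.append(f"/taxid={taxid}")
    parts.append(f"/descr={descr}")
    return " ".join(parts)


def fasta_record(header, sequence, width=60):
    """Return the header followed by the sequence wrapped at `width` columns."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


protein_header = fasta_header(
    "REFSEQP", "NP_940855", 1, "FLJ46675 protein [Homo sapiens].",
    taxid=9606, coded_by="NM_198453.1:34..2739",
)
print(protein_header)
```

Keeping the tags in a fixed order means downstream tools can pick out a tag such as /taxid with a simple pattern match on the Blast hit description.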
Why Pro*C?
• Speed
• Direct interface to Oracle
• Use of host arrays for output variables

seqdata data structure
    struct seqdata {
        int seq_oid[FETCH_ARRAY_SIZE];
        char local_name[FETCH_ARRAY_SIZE][LOCAL_NAME];
        char seq_descr[FETCH_ARRAY_SIZE][DESCR_LEN];
        ub4 seq_len[FETCH_ARRAY_SIZE];
        OCIClobLocator *clob[FETCH_ARRAY_SIZE];
    };

Execution of SQL Query
    …
    EXEC SQL WHENEVER SQLERROR DO dynsql_error( "Oracle Error:\n" );
    EXEC SQL PREPARE s1 FROM :cmd;
    EXEC SQL DECLARE cr1 CURSOR FOR s1;

    /* there has to be a hierarchical structure to the options */
    if ((strcmp(paramptr->qtype, "list") == 0) ||
        (strcmp(paramptr->qtype, "gslist") == 0)) {
        EXEC SQL OPEN cr1 USING :paramptr->list;
    }
    else if ((strcmp(paramptr->qtype, "localname") == 0) ||
             (strcmp(paramptr->qtype, "gslocalname") == 0)) {
        EXEC SQL OPEN cr1 USING :paramptr->container, :paramptr->localname,
                                :paramptr->seqtype;
    }
    else if ((strcmp(paramptr->qtype, "container") == 0) ||
             (strcmp(paramptr->qtype, "gscontainer") == 0)) {
        EXEC SQL OPEN cr1 USING :paramptr->container, :paramptr->seqtype;
    }
    else if ((strcmp(paramptr->qtype, "taxid") == 0) ||
             (strcmp(paramptr->qtype, "gstaxid") == 0)) {
        EXEC SQL OPEN cr1 USING :paramptr->container, :paramptr->taxid,
                                :paramptr->seqtype;
    }
    …

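The if/else chain above maps each query type to an ordered list of bind variables, with the "gs"-prefixed types sharing the binds of their plain counterparts. The same mapping can be written as a lookup table. This is only a sketch of the dispatch logic: `BIND_ORDER` and `bind_values` are hypothetical names, and the sample parameter values are made up.

```python
# Bind variables for each query type, in the order used by OPEN ... USING.
BIND_ORDER = {
    "list": ("list",),
    "localname": ("container", "localname", "seqtype"),
    "container": ("container", "seqtype"),
    "taxid": ("container", "taxid", "seqtype"),
}


def bind_values(params):
    """Return the bind values for params['qtype'], in cursor-open order."""
    qtype = params["qtype"]
    # "gslist", "gstaxid", ... bind the same variables as the plain types.
    base = qtype[2:] if qtype.startswith("gs") else qtype
    if base not in BIND_ORDER:
        raise ValueError(f"unsupported qtype: {qtype}")
    return tuple(params[name] for name in BIND_ORDER[base])


# Made-up parameter values for illustration.
example = {"qtype": "gstaxid", "container": "GENBANK",
           "taxid": 9606, "seqtype": "N"}
print(bind_values(example))
```

A table like this keeps the supported query types in one place, at the cost of hiding the explicit branch order that the Pro*C version spells out.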
Fetching
    /* sqlca.sqlerrd[2] contains the number of rows fetched so far */
    EXEC SQL WHENEVER NOT FOUND DO break;
    for (;;) {
        EXEC SQL FETCH cr1 INTO :seqptr;
        process_recs( outfile, logfile, headr, seqptr, sqlca.sqlerrd[2] - num_ret );
        num_ret = sqlca.sqlerrd[2];
    }
    /* if ARRAY_SIZE to process is > 0, process the partial records retrieved */
    if ( ( sqlca.sqlerrd[2] - num_ret ) > 0 ) {
        process_recs( outfile, logfile, headr, seqptr, sqlca.sqlerrd[2] - num_ret );
    }

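The bookkeeping in the fetch loop is easy to get wrong, so here is a small simulation of it. Because sqlca.sqlerrd[2] is cumulative, the size of each batch is the difference from the previous count, and the final short batch arrives together with the NOT FOUND condition, which is why it is processed after the loop. `FakeCursor` and `fetch_all` are hypothetical stand-ins; no Oracle connection is involved.

```python
class FakeCursor:
    """Stand-in cursor; rows_fetched mimics the cumulative sqlca.sqlerrd[2]."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.rows_fetched = 0

    def fetch(self, array_size):
        """Return (batch, found); found is False once a batch comes up short."""
        start = self.rows_fetched
        batch = self._rows[start:start + array_size]
        self.rows_fetched += len(batch)
        return batch, len(batch) == array_size


def fetch_all(cursor, array_size, process):
    """Process full batches in the loop and the short last batch after it."""
    processed = 0
    while True:
        batch, found = cursor.fetch(array_size)
        if not found:
            break  # mirrors WHENEVER NOT FOUND DO break
        process(batch[:cursor.rows_fetched - processed])
        processed = cursor.rows_fetched
    remaining = cursor.rows_fetched - processed
    if remaining > 0:
        process(batch[:remaining])  # partial batch left in the host arrays


seen = []
fetch_all(FakeCursor(range(7)), 3, seen.append)
print(seen)
```

When the row count is an exact multiple of the array size, the last fetch returns nothing, the difference is zero, and the trailing call is skipped.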
Processing CLOBs
    for(i = 0; i < n; i++) {
        …
        EXEC SQL LOB DESCRIBE :seqptr->clob[i] GET LENGTH INTO :seqlen;
        EXEC SQL WHENEVER NOT FOUND DO break;
        while (TRUE) {
            EXEC SQL LOB READ :amt FROM :seqptr->clob[i] INTO :buffer ;
            buffer->arr[buffer->len] = '\0' ;
            xptr = strcopy(xptr, (char *) buffer->arr) ;
        }
        buffer->arr[buffer->len] = '\0';
        tlen += amt;
        xptr = strcopy(xptr, (char *) buffer->arr);
        /* printf("the string = %.*s\n", seqstring->len, seqstring->arr); */
        output_1_seq( outfile, logfile, seqhdr, seqptr, seqstring, i );
    }

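The CLOB loop reads a sequence piece by piece and appends each piece to a growing string. The sketch below shows the same pattern against an in-memory text stream instead of an Oracle LOB locator; `read_clob` and the 4096-character chunk size are assumptions. In the Pro*C version the last piece is signalled by NOT FOUND and appended after the loop, whereas here the stream signals the end with an empty read.

```python
import io


def read_clob(stream, chunk_size=4096):
    """Read a text stream in fixed-size pieces; return (text, total_length)."""
    pieces = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of data
        pieces.append(chunk)
        total += len(chunk)
    return "".join(pieces), total


sequence, length = read_clob(io.StringIO("ACGT" * 2500))
print(length)
```

Collecting the pieces in a list and joining once avoids repeatedly copying the growing string, which matters for long genomic sequences.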
PL/SQL
• Centralize stored functions and procedures using packages
• Simplify queries by removing explicit joins
• Support post-processing that inserts into custom tables
• Can be invoked from SQL*Plus, Perl, Pro*C, unix shell scripts

Perl/BioPerl
• General scripting needs
• Transforming one format into another
  • Geneseq data in Embl-like format -> gbff and gpff for SeqStore loading
• Web applications development
  • FriendlyFetch - information-rich sequence search engine to SeqStore
  • Harmonizer - database showing sequence relationships from various sources
  • GeneTracker - internal annotation database tracking BMS targets

[Screenshot callouts: header information; visual display of features; cross references to other databases]

DB Links/Materialized Views
• Distributed databases necessitate the creation of database links for remote connections
• In concert with materialized views, we get unified access to data

Harmonizer Content
• Accesses public and in-house resources
• Searches across source boundaries
• Integrates other in-house tools
• Export Fasta files and Excel Workbook

[Screenshot callouts]
• Identical sequences and their annotations
• Similar sequences and their alignments
• Links, cross references, and names for each member of the family
• Annotations for each member of the family
• Active link into source data, Blast, Genome Maps, ...
• Export all sequences, Export to Excel, Link into Harmonizer
• Excel Workbook

Many Thanks To
• Robert Bruccoleri
• Daniel Davison
• Karl-Heinz Ott
• Charles Tilford