Managing Bioinformatics Data With Oracle · Bioinformatics Manages nucleic/protein and expression...

Managing Bioinformatics Data With Oracle

Jansen Lim
Bioinformatics
Bristol-Myers Squibb Pharmaceutical Research Institute

Topics
• Who we are
• Variety of biological data
• Roadblocks encountered
• Overcoming barriers
• Providing information to users

Bioinformatics
• Manages nucleic/protein and expression data
• Provides expert bioinformatics assistance that impacts drug discovery programs
• Develops tools for
    • analyzing sequences and expression profiling data
    • searching biological data among different DBs
    • generating information-rich reports (genome browser)
    • annotating proprietary sequences

Data Sources
• Genbank
    • Nucleotides and ESTs
    • Contigs
    • RefSeq Records
• SwissProt/TrEmbl
• UniProt
• NRDB
• Derwent Patents
• Affymetrix
• Licensed content
• Expression profiling
• LocusLink
• Incyte “Foundation”
• ENSEMBL
• WIPO

Lots of Databases
• Sequence repository (DS SeqStore from Accelrys)
• Expression profiling
• Incyte Foundation
• Gene annotation
• LocusLink
• SNPs
• …and many more

Hardware Configuration
• Production and development servers on Sun Microsystems SunFire (Solaris 8)
• Processors: 16 × 500 MHz UltraSparc III
• I/O: private gigabit fiber channel between disk subsystem and CPU, CXFS

Oracle Database
• Oracle 9i (9.2.0.1.0)
    • DS SeqStore (550 GB)
    • Incyte Foundation (130 GB)
    • Gene annotation (1.5 GB)
• Oracle 8i (8.1.7.4.0)
    • LocusLink (0.5 GB)
    • Expression Profiling (200 GB)

Major Challenges
• Large database sizes and frequent updates -> loading and extracting data must be efficient and scalable
• Different sources -> different formats
• Backups/Exports/Imports

Oracle Swiss Army Knife
• Loading and Extraction
    • SQL*Loader
    • PL/SQL
    • Pro*C/OCI
• Transformation/Web Apps/Scripts
    • Perl/BioPerl
• Sharing data
    • Database links
    • Materialized views

SQL*Loader: LocusLink (to be replaced by Entrez Gene)
• Data mirrored nightly from NCBI
• Schema generated from LL_tmpl
• Parse and transform files (LL_tmpl, LL_out, loc2UG, loc2ref, loc2acc, mim2loc)
• Generate control files
• Run SQL*Loader in direct path mode to load into actual and/or temp tables
• Completes in < 30 min
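The "generate control files" step can be sketched in C as below. This is only an illustration, not the pipeline used in the talk: the function name, the pipe delimiter, and the table and column names passed in are assumptions. The control-file keywords themselves (OPTIONS (DIRECT=TRUE), LOAD DATA, INFILE, APPEND, INTO TABLE, FIELDS TERMINATED BY, TRAILING NULLCOLS) are standard SQL*Loader syntax.

```c
#include <assert.h>
#include <stdio.h>

/* Write a minimal SQL*Loader control file for a pipe-delimited data file
 * into buf. The data file, table, and column names are supplied by the
 * caller and are hypothetical placeholders here.
 * Returns the number of characters written, or -1 if buf is too small. */
int make_control_file(char *buf, size_t size,
                      const char *datafile, const char *table,
                      const char *const *cols, size_t ncols)
{
    assert(buf != NULL && cols != NULL && ncols > 0);

    /* Header: request a direct path load and describe the input file. */
    int n = snprintf(buf, size,
                     "OPTIONS (DIRECT=TRUE)\n"
                     "LOAD DATA\n"
                     "INFILE '%s'\n"
                     "APPEND\n"
                     "INTO TABLE %s\n"
                     "FIELDS TERMINATED BY '|'\n"
                     "TRAILING NULLCOLS\n"
                     "(",
                     datafile, table);
    if (n < 0 || (size_t)n >= size) return -1;
    size_t used = (size_t)n;

    /* Column list, comma separated. */
    for (size_t i = 0; i < ncols; i++) {
        n = snprintf(buf + used, size - used, "%s%s",
                     i ? ", " : "", cols[i]);
        if (n < 0 || (size_t)n >= size - used) return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, size - used, ")\n");
    if (n < 0 || (size_t)n >= size - used) return -1;
    return (int)(used + (size_t)n);
}
```

The resulting text would be written to a .ctl file and passed to sqlldr with its control= parameter; one control file per parsed LocusLink file keeps each load independent.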

SQL*Loader: Incyte Foundation
• HUGE XML formatted data organized by chromosome number
    • Full length transcripts, gene function, biological role, SNPs, ESTs
    • Genomic DNA
• Schema generated from DTD
• Parse XML data -> SQL*Loader friendly format
• Load into staging area
• Post SQL*Loader processing

What Goes In ...
[Diagram: sources feeding the Sequence Database]
• Derwent (Patent)
• Incyte (EST & transcripts)
• BMS (via GeneTracker & d2clusters)
• Alliance (Lexicon & Pharmagene)
• Frescobi (non-redundant nucleotide/protein)
• SGD
• WIPO (Patent)
• Genbank (Nucleotide, EST, RefSeq, Contigs)
• SwissProt (Highly annotated protein DB)
• SPTrembl (Translated coding sequences)
• IPI (Human, mouse, & rat)

… Must Come Out
[Diagram: consumers of the Sequence Database]
• Harmonizer
• Wisconsin Package
• Fetch
• Blastable files
    • public nucleotides
    • public and patent proteins
    • human refseq
    • Incyte transcripts
    • mouse genomic contigs...
• GeneTracker
• Frescobi
• SeqWeb
• Ad-hoc query

Sequence Repository: DS SeqStore
• One-stop shop for storing, retrieving, and maintaining public and proprietary sequence data
• Nightly update of Genbank nucleotides and ESTs
• Weekly update of SwissProt/TrEmbl
• Bi-weekly update of Geneseq
• Loading uses multiple CPUs and array inserts, resulting in fast load times
• Performance depends on the amount of annotation

Creating Blastable Files
• Need custom FASTA dumper to embed information in the header line
    • Patent records -> patent no, publication date, document detail (e.g. claim, disclosure, etc.)
    • RefSeq -> need version
    • Proteins -> need /coded_by feature
    • taxid

Examples

>REFSEQN:NM_198453.1 /taxid=9606 /descr=Homo sapiens FLJ46675 protein (FLJ46675), mRNA.

>GCGGB:AB003693.1 /ref_text=Patent: US 5589355-A 31-DEC-1996; /taxid=1697 /descr=Corynebacterium ammoniagenes DNA for rib operon, complete cds.

>REFSEQP:NP_940855.1 /coded_by=NM_198453.1:34..2739 /taxid=9606 /descr=FLJ46675 protein [Homo sapiens].
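A header line in the style of the examples above could be assembled roughly as follows. This is a sketch only: the struct, its field names, and the rule that /ref_text precedes /coded_by are assumptions made for illustration; only the output layout is taken from the examples.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record layout; field names are illustrative only. */
struct fasta_rec {
    const char *db;        /* source prefix, e.g. "REFSEQP" */
    const char *acc_ver;   /* accession with version, e.g. "NP_940855.1" */
    const char *coded_by;  /* NULL when there is no /coded_by feature */
    const char *ref_text;  /* NULL when the record is not a patent */
    int taxid;             /* NCBI taxonomy id */
    const char *descr;     /* free-text description */
};

/* Format one FASTA header line (without trailing newline) into buf.
 * Optional tags are emitted only when their field is non-NULL.
 * Returns the header length, or -1 if buf is too small. */
int format_header(char *buf, size_t size, const struct fasta_rec *r)
{
    assert(buf != NULL && r != NULL);
    int n = snprintf(buf, size, ">%s:%s%s%s%s%s /taxid=%d /descr=%s",
                     r->db, r->acc_ver,
                     r->ref_text ? " /ref_text=" : "",
                     r->ref_text ? r->ref_text : "",
                     r->coded_by ? " /coded_by=" : "",
                     r->coded_by ? r->coded_by : "",
                     r->taxid, r->descr);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Keeping every tag in a fixed "/key=value" form means downstream tools can recover the version, taxid, or patent reference from a BLAST hit line with a simple scan.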

Why Pro*C?
• Speed
• Direct interface to Oracle
• Use of host arrays for output variables

seqdata data structure

    struct seqdata {
        int seq_oid[FETCH_ARRAY_SIZE];
        char local_name[FETCH_ARRAY_SIZE][LOCAL_NAME];
        char seq_descr[FETCH_ARRAY_SIZE][DESCR_LEN];
        ub4 seq_len[FETCH_ARRAY_SIZE];
        OCIClobLocator *clob[FETCH_ARRAY_SIZE];
    };

Execution of SQL Query

    …
    EXEC SQL WHENEVER SQLERROR DO dynsql_error( "Oracle Error:\n" );
    EXEC SQL PREPARE s1 FROM :cmd;
    EXEC SQL DECLARE cr1 CURSOR FOR s1;

    /* there has to be an hierarchical structure to the options */
    if ((strcmp(paramptr->qtype, "list") == 0) ||
        (strcmp(paramptr->qtype, "gslist") == 0)) {
        EXEC SQL OPEN cr1 USING :paramptr->list;
    }
    else if ((strcmp(paramptr->qtype, "localname") == 0) ||
             (strcmp(paramptr->qtype, "gslocalname") == 0)) {
        EXEC SQL OPEN cr1 USING :paramptr->container, :paramptr->localname,
                                :paramptr->seqtype;
    }
    else if ((strcmp(paramptr->qtype, "container") == 0) ||
             (strcmp(paramptr->qtype, "gscontainer") == 0)) {
        EXEC SQL OPEN cr1 USING :paramptr->container, :paramptr->seqtype;
    }
    else if ((strcmp(paramptr->qtype, "taxid") == 0) ||
             (strcmp(paramptr->qtype, "gstaxid") == 0)) {
        EXEC SQL OPEN cr1 USING :paramptr->container, :paramptr->taxid,
                                :paramptr->seqtype;
    }
    …

Fetching

    /* sqlca.sqlerrd[2] contains the number of rows fetched so far */
    EXEC SQL WHENEVER NOT FOUND DO break;

    for (;;) {
        EXEC SQL FETCH cr1 INTO :seqptr;
        process_recs( outfile, logfile, headr, seqptr, sqlca.sqlerrd[2] - num_ret );
        num_ret = sqlca.sqlerrd[2];
    }

    /* if ARRAY_SIZE to process is > 0, process the partial records retrieved */
    if ( ( sqlca.sqlerrd[2] - num_ret ) > 0 ) {
        process_recs( outfile, logfile, headr, seqptr, sqlca.sqlerrd[2] - num_ret );
    }

Processing CLOBs

    for (i = 0; i < n; i++) {
        …
        EXEC SQL LOB DESCRIBE :seqptr->clob[i] GET LENGTH INTO :seqlen;
        EXEC SQL WHENEVER NOT FOUND DO break;

        while (TRUE) {
            EXEC SQL LOB READ :amt FROM :seqptr->clob[i] INTO :buffer ;
            buffer->arr[buffer->len] = '\0' ;
            xptr = strcopy(xptr, (char *) buffer->arr) ;
        }

        buffer->arr[buffer->len] = '\0';
        tlen += amt;
        xptr = strcopy(xptr, (char *) buffer->arr);

        /* printf("the string = %.*s\n", seqstring->len, seqstring->arr); */
        output_1_seq( outfile, logfile, seqhdr, seqptr, seqstring, i );
    }
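The piecewise pattern above (read a fixed-size piece, append it at the running end of the output, repeat until nothing is left) can be shown in plain C with an in-memory string standing in for the CLOB. Everything here is an assumption for illustration: read_piece plays the role of EXEC SQL LOB READ, CHUNK is an arbitrary buffer size, and read_all is not the talk's strcopy-based code.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 8  /* illustrative piece size, not taken from the slides */

/* Stand-in for one piecewise LOB read: copies up to CHUNK bytes starting at
 * *offset into piece, advances *offset, and returns the byte count
 * (0 once the end has been reached). */
static size_t read_piece(const char *lob, size_t lob_len,
                         size_t *offset, char *piece)
{
    size_t left = lob_len - *offset;
    size_t amt = left < CHUNK ? left : CHUNK;
    memcpy(piece, lob + *offset, amt);
    *offset += amt;
    return amt;
}

/* Reassembles the whole value into out (capacity out_size, terminator
 * included). Returns the total length, or (size_t)-1 if out is too small. */
size_t read_all(const char *lob, size_t lob_len, char *out, size_t out_size)
{
    assert(lob != NULL && out != NULL);
    if (lob_len + 1 > out_size) return (size_t)-1;

    char piece[CHUNK];
    size_t offset = 0, tlen = 0, amt;
    while ((amt = read_piece(lob, lob_len, &offset, piece)) > 0) {
        memcpy(out + tlen, piece, amt);   /* append at the running end */
        tlen += amt;                      /* total length grows per piece */
    }
    out[tlen] = '\0';
    return tlen;
}
```

Checking the declared length against the output capacity before reading corresponds to using the LOB DESCRIBE length to size the sequence buffer up front.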

PL/SQL
• Centralize stored functions and procedures using packages
    • Simplify queries by removing explicit joins
    • Support post-processing that inserts into custom tables
    • Can be invoked from SQL*Plus, Perl, Pro*C, unix shell scripts

Perl/BioPerl
• General scripting needs
• Transforming one format into another
    • Geneseq data in Embl-like format -> gbff and gpff for SeqStore loading
• Web applications development
    • FriendlyFetch: information-rich sequence search engine to SeqStore
    • Harmonizer: database showing sequence relationships from various sources
    • GeneTracker: internal annotation database tracking BMS targets

[Screenshot callouts: header information; visual display of features; cross references to other databases]

DB Links/Materialized Views
• Distributed databases necessitate the creation of database links for remote connections
• In concert with materialized views, we get unified access to data

Harmonizer Content
• Accesses public and in-house resources
• Searches across source boundaries
• Integrates other in-house tools
• Exports Fasta files and Excel Workbook

[Screenshot callouts]
• Identical sequences and their annotations
• Similar sequences and their alignments
• Links, cross references, and names for each member of the family
• Annotations for each member of the family
• Active link into source data, Blast, Genome Maps, ...
• Export all sequences, export to Excel, link into Harmonizer
• Excel Workbook

Many Thanks To
• Robert Bruccoleri
• Daniel Davison
• Karl-Heinz Ott
• Charles Tilford
