Curator Meeting: 10/27/03

Integration of New Data into RGD: Quality Control and Data Submission Tools

Rat Genome Database
http://rgd.mcw.edu

Bioinformatics Research Center
Medical College of Wisconsin, Milwaukee, USA

RGD Pipeline Background

• RGD is a relatively new MOD
• Needed to integrate large amounts of historic data

• Curation staff is limited
• Developed near beginning of RGD Project
• Efficient methods to evaluate and integrate data

• Informatic methods were chosen to address the problems
• Catch up with historic data
• Achieve good productivity with limited staff

• Modular design
• New types of data
• New QC checks and methods

Data Sources

[Figure: data sources feeding RGD 2.0]

• RatMap, Goteborg (Sweden): Genes, markers, QTLs
• MGD, Jackson Labs: Markers, Strains, Genes
• RHdb, EBI (UK): Markers, Primers
• MIT (WI/MIT): Markers, Genetic Map
• NCBI: LocusLink, RefSeq, UniGene, etc.
• Otsuka (Japan): SSLPs
• UI (U. Iowa): ESTs, RH Map
• NIAMS ARB: Maps and SSLPs
• MCW: All Objects
• MCO: SSLPs
• Baylor (HGSC): Sequences
• Literature

Bulk Data Pipeline in the Curation Process

[Figure: Data Pipeline diagram. Labels: Data Sources (Databases, Literature, Websites); Informatic data mining; External Data; Regular Journal Screening and Curation; Internal Data; RGD Database; Ongoing Data Curation.]

RGD Objects

RGD stores information about 11 fundamental data types (Objects)

1. Genes
2. Strains
3. QTLs
4. Traits
5. Sequences
6. ESTs
7. Maps
8. SSLPs
9. References
10. Homologs
11. Phenotypes

Relationships between RGD Objects

Genes -> Genes, ESTs, SSLPs, and QTLs
ESTs -> Genes
SSLPs -> Genes and Strains
QTLs -> Genes, Traits, Strains
Traits -> QTLs
Maps -> Maps Data
Maps Data -> any RGD object
References -> any RGD object
Homologs -> any RGD object
Strains -> any RGD object
Sequences -> any RGD object
Phenotypes -> any RGD object
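
The relationship list above can be held as a small lookup table. The following is a minimal sketch, not RGD's actual schema: the names `RELATIONSHIPS`, `ANY`, and `can_link` are hypothetical and only illustrate how "any RGD object" could be treated as a wildcard.

```python
# Hypothetical encoding of the relationship list; ANY stands for "any RGD object".
ANY = "*"

RELATIONSHIPS = {
    "Genes": {"Genes", "ESTs", "SSLPs", "QTLs"},
    "ESTs": {"Genes"},
    "SSLPs": {"Genes", "Strains"},
    "QTLs": {"Genes", "Traits", "Strains"},
    "Traits": {"QTLs"},
    "Maps": {"Maps Data"},
    "Maps Data": {ANY},
    "References": {ANY},
    "Homologs": {ANY},
    "Strains": {ANY},
    "Sequences": {ANY},
    "Phenotypes": {ANY},
}


def can_link(source: str, target: str) -> bool:
    """Return True if the list allows a link from source to target."""
    targets = RELATIONSHIPS.get(source, set())
    return ANY in targets or target in targets
```

For example, `can_link("Genes", "QTLs")` is allowed by the list, while `can_link("ESTs", "Strains")` is not.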

RGD object Templates

Internal Data Sources

QC functionality on data entry forms

Curation Annotations: Notes Editor

Data Entry Summary Page

Edit Record in Submission Database

RGD Data Flow

[Figure: RGD data flow diagram. Labels: Internal Systems (Curation: Cur_1; Production: owner_1, Owner_2); Public System (rgd.mcw.edu); dss; Curation data; Online: Genes, QTLs, Strains; Bulk data: all objects; Bulk Data (Production-load); QC at each step.]

QC Checks in the Data Flow

BD Pipeline Database
• keep all raw data
• format the data
• track all checking flags
• track all loading status

[Figure: Input raw data -> Check data (with BLAST results) -> Preload data -> Load data -> RGD Database. A web-based interface shows all processing status.]
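
The four responsibilities of the pipeline database (keep raw data, format it, track check flags, track load status) can be pictured as one record per incoming row. This is a hedged sketch only: `PipelineRecord`, `Stage`, and the method names are assumptions for illustration, not the actual tables or the Perl modules used by RGD.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """Hypothetical processing stages, following the slide's data flow."""
    INPUT = 1
    CHECKED = 2
    PRELOADED = 3
    LOADED = 4


@dataclass
class PipelineRecord:
    raw: dict                                      # raw data, never modified
    formatted: dict = field(default_factory=dict)  # cleaned copy of the raw data
    flags: dict = field(default_factory=dict)      # check name -> flag value
    stage: Stage = Stage.INPUT                     # loading status

    def normalize(self) -> None:
        """Format the data without touching the raw copy."""
        self.formatted = {k.strip().lower(): v.strip() for k, v in self.raw.items()}

    def set_flag(self, check: str, flag: str) -> None:
        """Record the outcome of one QC check."""
        self.flags[check] = flag

    def advance(self) -> None:
        """Move to the next stage; LOADED is final."""
        if self.stage is not Stage.LOADED:
            self.stage = Stage(self.stage.value + 1)
```

Keeping `raw` separate from `formatted` means a conflict can always be traced back to what the source actually submitted.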

QC Process Overview

Incoming Dataset

Level One: Integrity Checking
• Internal checking (blast/symbol)

Level Two: Identity Checking
• Blast against RGD database
• Check for identity conflicts
  • Check symbol
  • Check sequence via GB ID
  • Check sequence via BLAST
  • Check alias

Level Three: Attribute Checking
• Preload: check for any attribute conflicts
• Load: values without conflicts -> RGD database

Conflict data files for curation review
• Curators to review flags

Examples of Checks

• New symbol matches an RGD symbol
• New symbol matches an alias in RGD
• New record has a GBID
• New GBID matches the RGD record
• New GBID matches GBID of alias gene
• New GBID matches any other RGD record
• New Sequence matches any RGD
• Every attribute value compared to RGD values

Review Pipeline’s QC Checks

Review Conflicts

Excel Summary Report: Conflict Data

Report lists the bin ID for data that requires further curation (BLAST/BLAT analysis)

Conflict Data Discovered by the Bulk Data Pipeline

Nomenclature conflicts
• Symbols were incorrect

Sequence conflicts
• Sequence reads were unacceptable due to poor quality (many N's)
• Primers were switched
• Sequences in the dataset were associated with different objects in RGD

Alias conflicts
• Dataset aliases were RGD objects
• Dataset symbols were in RGD aliases

Attribute conflicts
• Chromosomes were different in RGD
• Cytological positions were different in RGD
• Expected sizes of PCR products were different in RGD

Redundant data conflicts
• Datasets had duplicate entries
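
Two of the conflict types above are easy to screen for before any comparison with RGD: reads with many N's and duplicate entries. The sketch below is illustrative only. The 5% cutoff in `is_poor_quality` is an assumed value (the slide only says "many N's"), and the `symbol` key is a hypothetical field name.

```python
from collections import Counter


def n_fraction(seq: str) -> float:
    """Fraction of bases in seq that are N; an empty read counts as all N."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def is_poor_quality(seq: str, max_n_fraction: float = 0.05) -> bool:
    """Flag a read whose N content exceeds the (assumed) cutoff."""
    return n_fraction(seq) > max_n_fraction


def duplicate_symbols(records: list) -> list:
    """Return the symbols that occur more than once in a dataset."""
    counts = Counter(r["symbol"] for r in records)
    return sorted(s for s, c in counts.items() if c > 1)
```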

Curation of Conflicting Data

[Flowchart]
1. Checking processes find conflicting data: Nomenclature, Sequence, Alias symbols, Attributes, Redundant records
2. Manual curation to resolve conflicts
3. Resolvable: curated data -> Load into RGD (over-write current data)
   Irresolvable: removed data -> Store data in file (notify source)

After Load

Acknowledgements

Principal Investigators
Howard Jacob, Peter Tonellato, Simon Twigger

RGD Bioinformatics
Dean Pasko, Jiali Chen, Lan Zhao, Henry Fan, Wenhua Wu, Jian Lu, Hanping Long

RGD Curation
Mary Shimoyama, Susan Bromberg, Rajni Nigam, Chin-fu Chen, Gopal Gopinathrao, Charles Wang, Victoria Petri, Dorothy Reilly, Cindy Foote, Angela Zuniga-Meyer, Nataliya Nenasheva

Model Organism Bulk Data Processing Work Flow

New Check Aliases Use Case Diagram (C. Fan)

[Use case diagram: check_data, Pipeline and RGD DB; case1 to case6. Case descriptions, in extraction order:]

• The incoming alias does not match any alias in RGD
• The incoming alias matches an alias but does not match any alias type in RGD
• The incoming alias matches an alias and an alias type in RGD, and the incoming symbol does not match any symbol associated with the matched alias
• The incoming alias matches an alias and an alias type in RGD, and the incoming symbol matches a symbol associated with the matched alias
• The incoming symbol is identical to the incoming alias
• The incoming alias type is not identical to the alias types in name ...

New Check Gene Symbol Use Case Diagram

[Use case diagram: check_data, RGD DB, Pipeline DB; case1 to case13. Case descriptions, in extraction order:]

• Symbols match; no GBID in new data, but there is one in RGD
• Both symbol and GBID match the same RGD ID
• Symbols match; the new symbol has a GBID, but the new GBID doesn't match any RGD ID in RGD
• Symbols match; the new symbol has a GBID, but the new GBID matches a different RGD ID in RGD
• Symbols don't match; no GBID in new file
• Neither symbol nor GBID matches any RGD ID
• Symbols don't match; GBID matches an RGD ID, but the incoming symbol doesn't match any alias value and alias type
• Symbols match; no GBID in new file or in RGD
• Symbols match; no GBID matches this symbol in RGD, but the GBID matches another RGD ID in RGD
• Symbols match; no GBID matches this symbol in RGD, and the new GBID doesn't match any in RGD
• Symbols don't match, but GBID matches an RGD ID, and the incoming symbol matches an alias value and alias type, but the incoming GBID doesn't match any GBID associated with the matched alias
• Symbols don't match; GBID matches an RGD ID, and the incoming symbol matches only one alias value and alias type in the alias table
• Symbols don't match; GBID matches an RGD ID, and the incoming symbol matches more than one alias value and alias type in the alias table

Case 1
Description: New record has an RGD_ID, and both symbol and RGD_ID of new record match an active RGD record
Expected result: Set symbol flag to "IN_RGD_1"
Note: The process will continue through the GenBank ID check

Case 2
Description: New record has an RGD_ID, and new symbol matches an active RGD symbol, but new RGD_ID does not match the RGD_ID of matching symbol
Expected result: Set symbol flag to "DIF_RGD_ID"
Note: The process will continue through the GenBank ID check

Case 3
Description: New record has an RGD_ID, but new symbol does not match an active RGD symbol
Expected result: Set symbol flag to "DIF_SYMBOL"
Note: The process will continue through the GenBank ID check

Case 4
Description: New record does not have an RGD_ID, but new symbol matches an active RGD symbol
Expected result: Set symbol flag to "IN_RGD_2"
Note: The process will continue through the GenBank ID check

Case 5
Description: New record does not have an RGD_ID and new symbol does not match an active RGD symbol
Expected result: Set symbol flag to "NEW"
Note: The process will continue through the GenBank ID check

Case 6
Description: New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene
Expected result: Change symbol flag to "IN_RGD_UPDATED"
Note: Change current symbol flag after changing symbol (GBID check) and continue through GenBank ID check

Case 7
Description: New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; new GBID matches a record in RGD and there is no alias matching that symbol; new symbol matches a retired or withdrawn symbol
Expected result: Change symbol flag to "DIF_NON_ACTIVE_1"
Note: Change current symbol flag and continue through GenBank ID check

Case 8
Description: New record does not have an RGD_ID and new symbol does not match an active RGD symbol; new record does not have a GBID; new symbol matches a retired or withdrawn symbol
Expected result: Change symbol flag to "DIF_NON_ACTIVE_2"
Note: Change current symbol flag and continue through GenBank ID check

Case 9
Description: New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence matches another GBID/seq in RGD
Expected result: Set flag to "DIF_9:RGD_ID"
Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review. Data without GBID match will be compared by BLAST after the PRELOAD step in the pipeline, but NOT loaded until after curation review. RGD_ID is the value that is associated with the GBID/seq in RGD

Case 10
Description: New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence does not match any GBID/seq in RGD
Expected result: Set flag to "DIF_10"
Note: This case will be compared by BLAST after the PRELOAD step in the pipeline, but NOT loaded until after curation review

Case 11
Description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD but new GBID doesn't match the GBID of the gene associated with the matching alias
Expected result: Set flag to "DIF_11"
Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review

Case 12
Description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene
Expected result: Set flag to "IN_RGD_2"
Note: The new symbol is changed to the RGD gene symbol of the gene associated with the matching alias and the data is loaded

Case 13
Description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, but alias is associated with more than one gene
Expected result: Set flag to "DIF_13"
Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review
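
Cases 1 to 5 depend only on whether the new record carries an RGD_ID and whether its symbol matches an active RGD symbol, so they reduce to a short decision function. This is a sketch under assumptions: `symbol_flag` and the `active` mapping (active symbol -> RGD_ID) are hypothetical names, and symbols are compared by exact string match, which the slides do not specify.

```python
from typing import Mapping, Optional


def symbol_flag(new_symbol: str,
                new_rgd_id: Optional[int],
                active: Mapping[str, int]) -> str:
    """Return the symbol flag for cases 1 to 5 of the case table."""
    matched_id = active.get(new_symbol)  # None if no active symbol matches
    if new_rgd_id is not None:
        if matched_id is None:
            return "DIF_SYMBOL"   # case 3: has RGD_ID, symbol not in RGD
        if matched_id == new_rgd_id:
            return "IN_RGD_1"     # case 1: symbol and RGD_ID both match
        return "DIF_RGD_ID"       # case 2: symbol matches, RGD_ID differs
    if matched_id is not None:
        return "IN_RGD_2"         # case 4: no RGD_ID, symbol matches
    return "NEW"                  # case 5: no RGD_ID, symbol not in RGD
```

In every one of these cases the record would then continue to the GenBank ID check, which can change the flag (cases 6 onward).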

Bin table

Columns:
C1 = Symb. match
C2 = GBID match specific RGD record
C3 = GBID in new file
C4 = GBID in RGD
C5 = GBID match any RGD
C6 = Seq match any RGD (BLAST)
C7 = Symb. match alias
C8 = GBID match GBID of alias gene
C9 = Alias of more than one gene

Bin | C1  | C2  | C3  | C4  | C5        | C6  | C7  | C8  | C9  | Flag         | Symbol/GBID/Alias
1   | yes | --  | no  | yes | --        | --  | --  | --  | --  | DIF_1        | New: A/-; RGD: A/1
2   | yes | yes | yes | yes | --        | --  | --  | --  | --  | IN_RGD_1     | New: A/1; RGD: A/1
3   | yes | no  | yes | yes | no        | no  | --  | --  | --  | DIF_3        | New: A/1; RGD: A/2 or -; RGD: B/1
4   | yes | no  | yes | yes | yes or no | yes | --  | --  | --  | DIF_4:RGD_ID | New: A/1; RGD: A/2; RGD: B/1
5   | no  | --  | no  | --  | --        | --  | --  | --  | --  | DIF_5        | New: A/-; RGD: B/2 or -
6   | no  | --  | yes | --  | no        | no  | --  | --  | --  | NEW          | New: A/1; RGD: B/2 or -
7   | no  | --  | yes | yes | yes or no | yes | no  | --  | --  | DIF_7:RGD_ID | New: A/1/C; RGD: B/1
8   | yes | --  | no  | no  |           |     | --  | --  | --  | DIF_8        | New: A/-; RGD: A/-
9   | yes | --  | yes | no  | yes or no | yes | --  | --  | --  | DIF_9:RGD_ID | New: A/1; RGD: A/-
10  | yes | --  | yes | no  | no        | no  | --  | --  | --  | DIF_10       | New: A/1; RGD: A/-
11  | no  | --  | yes | --  | yes       | --  | yes | no  | --  | DIF_11       | New: A/1; RGD: B/2/A
12  | no  | --  | yes | --  | yes       | --  | yes | yes | no  | DIF_12       | New: A/1; RGD: B/1/A
13  | no  | --  | yes | --  | yes       | --  | yes | yes | yes | DIF_13       | New: A/1; RGD: B/1/A; RGD: C/2/A
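
A bin table of this kind can be evaluated as a decision table: each row is a pattern of required yes/no answers, and a record falls into the first bin whose pattern it satisfies. The sketch below is an illustration, not the pipeline's code. It assumes that "--", blank cells, and "yes or no" all mean "not tested", and the names `BIN_TABLE` and `find_bin` are hypothetical. The ":RGD_ID" suffix is kept as the literal placeholder shown on the slide.

```python
# Y/N are required answers; X means the column is not tested for that bin.
Y, N, X = True, False, None

# (bin number, required answers for the nine condition columns, flag)
BIN_TABLE = [
    (1,  (Y, X, N, Y, X, X, X, X, X), "DIF_1"),
    (2,  (Y, Y, Y, Y, X, X, X, X, X), "IN_RGD_1"),
    (3,  (Y, N, Y, Y, N, N, X, X, X), "DIF_3"),
    (4,  (Y, N, Y, Y, X, Y, X, X, X), "DIF_4:RGD_ID"),
    (5,  (N, X, N, X, X, X, X, X, X), "DIF_5"),
    (6,  (N, X, Y, X, N, N, X, X, X), "NEW"),
    (7,  (N, X, Y, Y, X, Y, N, X, X), "DIF_7:RGD_ID"),
    (8,  (Y, X, N, N, X, X, X, X, X), "DIF_8"),
    (9,  (Y, X, Y, N, X, Y, X, X, X), "DIF_9:RGD_ID"),
    (10, (Y, X, Y, N, N, N, X, X, X), "DIF_10"),
    (11, (N, X, Y, X, Y, X, Y, N, X), "DIF_11"),
    (12, (N, X, Y, X, Y, X, Y, Y, N), "DIF_12"),
    (13, (N, X, Y, X, Y, X, Y, Y, Y), "DIF_13"),
]


def find_bin(answers):
    """Return (bin, flag) for a 9-tuple of True/False/None answers, or None."""
    for bin_no, pattern, flag in BIN_TABLE:
        if all(want is None or want == got for want, got in zip(pattern, answers)):
            return bin_no, flag
    return None
```

An answer of `None` (condition not evaluated) never satisfies a required yes or no, so a record with missing information falls through to `None` rather than into a wrong bin.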

Complete Bulkdata Pipeline Process Diagram (C. Fan)

[Flowchart, reconstructed from its labels. Stages: Input data, Check data, Pre-load data, Load data.]

Input data
• Start pipeline process, read file
• Check file header. Valid? no: curate the header
• Check symbol unique. Unique? no: curate the symbol; yes: create raw tables
• Check file exist. Exist? yes: delete file, clear file; stop
• Load file, read raw table, go to check process

Check data
• Start check process
• Gene object? yes: check gene symbol; no: check name
• Check aliases
• Has sequence? yes: check sequence
• Check other attributes
• Dif flag? yes: write DIF file, curate data; no: go to load process

Load data
• Start load process, load data to RGD, end pipeline
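
The first two decisions of the input stage (is the file header valid, are the symbols unique) can be sketched as a single validation pass over a tab-delimited submission. This is a minimal illustration only: the required column names and the function `validate_input` are assumptions, and the returned `action` strings simply reuse the flowchart's labels.

```python
import csv
import io
from collections import Counter

# Hypothetical template header; real object templates define their own columns.
REQUIRED_COLUMNS = ("symbol", "name")


def validate_input(text: str, required=REQUIRED_COLUMNS) -> dict:
    """Check the header, then symbol uniqueness, of a tab-delimited file."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in required if c not in header]
    if missing:
        return {"ok": False, "action": "curate the header", "detail": missing}
    rows = list(reader)
    counts = Counter(row["symbol"] for row in rows)
    dupes = sorted(s for s, n in counts.items() if n > 1)
    if dupes:
        return {"ok": False, "action": "curate the symbol", "detail": dupes}
    return {"ok": True, "action": "create raw tables", "detail": rows}
```

A file that fails either check goes back to a curator before any raw tables are created, which matches the two "no" branches in the diagram.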

Database Object Relationships

[Figure: relationship diagram linking sslps, genes, sequences, references, maps, qtls, traits, homologs, strains, phenotypes, diseases, and ESTs]

RGD Schema Diagram

• 54 Tables
• 10 Views

RGD Schema Word Document

RGD Database Technologies

Platforms
• Database server: Oracle 8.1.6
• Sun Solaris 2.8 Unix operating system
• Sun Enterprise 450's

Programming Language
• Perl 5

Object-oriented Methodology
• Database - object based schema
• Perl modules - object based and globally used across systems
  • DB.pm module
  • PRELOAD.pm module
  • LOAD.pm module

Schema Documentation
• Rational Rose 2000 Enterprise

Bulk Data Database Schema

Review Quality Control Reports

Validation

RGD Data Flow

[Figure: full RGD data flow diagram. Labels:]
• Internal Systems: Development (dev_1), Curation (Cur_1), Production (owner_1, Owner_2); dorado; fuxi
• Public System: rgd.mcw.edu
• dss; Curation data
• Online: Strains, References, Nomenclature, Gene editing, Ontologies (rgdtogo.txt), Notes
• Online: Genes, QTLs, Strains
• Bulk data: all objects; Bulk Data (Test-data); Bulk Data (Production-load)
• alps; Object Templates; Text - tab delimited
• Templates: Homologs, Strains, Genes, QTLs, SSLPs, ESTs, Map Data
• Modify flags (1st, 2nd)

Blast Result Scenarios

Check for LocusLink, Swiss-Prot, RatMap IDs

Bin Number

• New symbol matches RGD symbol
• LL/SP/RM_ID in new file
• LL/SP/RM_ID in specific RGD record
• LL/SP/RM_ID matches specific RGD record
• LL/SP/RM_ID matches any RGD

Checks for Sequence

• Sequence must be over 95% aligned
• Forward and Reverse primer must be over 95% aligned
• Ratio of the aligned bp / length of query sequence >= 95%
• Ratio of the length1 (short seq) / length2 (longer seq) >= 90%
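
The two ratio rules can be written out directly. This is a hedged sketch: the function names are hypothetical, and the aligned base-pair count is assumed to come from a BLAST report that has already been parsed elsewhere.

```python
def aligned_ratio(aligned_bp: int, query_length: int) -> float:
    """Aligned base pairs divided by the length of the query sequence."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return aligned_bp / query_length


def length_ratio(len_a: int, len_b: int) -> float:
    """Length of the shorter sequence divided by the length of the longer one."""
    short, longer = sorted((len_a, len_b))
    if longer <= 0:
        raise ValueError("sequence lengths must be positive")
    return short / longer


def sequences_match(aligned_bp: int, query_length: int, subject_length: int,
                    min_aligned: float = 0.95,
                    min_length_ratio: float = 0.90) -> bool:
    """Apply both thresholds from the slide: >= 95% aligned, >= 90% length ratio."""
    return (aligned_ratio(aligned_bp, query_length) >= min_aligned
            and length_ratio(query_length, subject_length) >= min_length_ratio)
```

The length ratio guards against a short query that aligns perfectly inside a much longer, unrelated sequence.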

Review Pipeline's QC Checks

Before Load

Check Alias

• New alias type matches RGD alias types
• New alias is same as new symbol
• New alias matches any alias in RGD and same alias type
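
The three alias checks can be computed independently and reported together, leaving the choice of flag to a later step. The sketch below is an assumption-laden illustration: `check_alias`, its parameters, and the result keys are hypothetical, and RGD's aliases are modelled simply as a set of (alias, alias type) pairs.

```python
def check_alias(new_symbol: str, new_alias: str, new_alias_type: str,
                rgd_alias_types: set, rgd_aliases: set) -> dict:
    """Evaluate the three alias checks for one incoming alias.

    rgd_aliases is a set of (alias, alias_type) pairs already in RGD.
    """
    return {
        # New alias type matches RGD alias types
        "alias_type_known": new_alias_type in rgd_alias_types,
        # New alias is same as new symbol
        "alias_equals_symbol": new_alias == new_symbol,
        # New alias matches any alias in RGD and same alias type
        "alias_in_rgd": (new_alias, new_alias_type) in rgd_aliases,
    }
```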