Curator Meeting: 10/27/03
Integration of New Data into RGD: Quality Control and Data
Submission Tools
Rat Genome Database
http://rgd.mcw.edu
Bioinformatics Research Center
Medical College of Wisconsin, Milwaukee, USA

RGD Pipeline Background
• RGD is a relatively new MOD (model organism database)
• Needed to integrate large amounts of historic data
• Curation staff is limited
• Developed near beginning of RGD Project
• Efficient methods to evaluate and integrate data
• Informatic methods were chosen to address the problems
• Catch up with historic data
• Achieve good productivity with limited staff
• Modular design
• New types of data
• New QC checks and methods

Data Sources
Central node: RGD 2.0
• RatMap, Goteborg (Sweden): Genes, markers, QTLs
• MGD, Jackson Labs: Markers, Strains, Genes
• RHdb, EBI (UK): Markers, Primers
• WI/MIT: Markers, Genetic Map
• NCBI: LocusLink, RefSeq, UniGene, etc.
• Otsuka (Japan): SSLPs
• U. Iowa (UI): ESTs, RH Map
• NIAMS ARB: Maps and SSLPs
• MCW: All Objects
• MCO: SSLPs
• Baylor (HGSC): Sequences
• RGD: Literature

Data Pipeline: Bulk Data Pipeline in the Curation Process
Diagram labels:
• Data Sources: Databases, Literature, Websites
• External Data: informatic data mining
• Internal Data: regular journal screening and curation
• Ongoing Data Curation
• RGD Database

RGD Objects
RGD stores information about 11 fundamental data types (Objects):
1. Genes
2. Strains
3. QTLs
4. Traits
5. Sequences
6. ESTs
7. Maps
8. SSLPs
9. References
10. Homologs
11. Phenotypes

Relationships between RGD Objects
Genes -> Genes, ESTs, SSLPs, and QTLs
ESTs -> Genes
SSLPs -> Genes and Strains
QTLs -> Genes, Traits, Strains
Traits -> QTLs
Maps -> Maps Data
Maps Data -> any RGD object
References -> any RGD object
Homologs -> any RGD object
Strains -> any RGD object
Sequences -> any RGD object
Phenotypes -> any RGD object
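
The relationship list above can be read as a small lookup table. The sketch below is one possible encoding in Python (the slides name Perl 5 as the pipeline language; Python is used here only for illustration). The names `RELATIONSHIPS`, `ANY`, and `can_link` are hypothetical and do not come from the RGD schema.

```python
# Hypothetical encoding of the slide's relationship list.
# ANY marks object types that the slide says may link to "any RGD object".
ANY = "any RGD object"

RELATIONSHIPS = {
    "Genes": {"Genes", "ESTs", "SSLPs", "QTLs"},
    "ESTs": {"Genes"},
    "SSLPs": {"Genes", "Strains"},
    "QTLs": {"Genes", "Traits", "Strains"},
    "Traits": {"QTLs"},
    "Maps": {"Maps Data"},
    "Maps Data": ANY,
    "References": ANY,
    "Homologs": ANY,
    "Strains": ANY,
    "Sequences": ANY,
    "Phenotypes": ANY,
}


def can_link(source, target):
    """Return True if the slide lists a relationship from source to target."""
    allowed = RELATIONSHIPS.get(source)
    if allowed is None:
        # Object type is not in the slide's list.
        return False
    return allowed == ANY or target in allowed
```

For example, `can_link("Genes", "QTLs")` holds, while `can_link("ESTs", "Strains")` does not.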

RGD Data Flow
Diagram labels:
• owner_1 (Production), Cur_1 (Curation), Owner_2
• rgd.mcw.edu, dss
• Internal Systems; Public System
• Curation data
• Bulkdata: all objects
• Online: Genes, QTLs, Strains
• Bulk Data (Production-load)
• QC (three checkpoints)

BD Pipeline Database
• Keep all raw data
• Format the data
• Track all checking flags
• Track all loading status
QC checks in the data flow: Input raw data -> Check data (BLAST results) -> Preload data -> Load data -> RGD Database
Web-based interface to view all processing status

QC Process Overview
Incoming Dataset
Level One: Integrity Checking
• Internal checking (blast/symbol)
Level Two: Identity Checking
• Blast against RGD database
• Check for identity conflicts: check symbol, check sequence via GB ID, check sequence via BLAST, check alias
Level Three: Attribute Checking
• Preload: check for any attribute conflicts
• Load: values without conflicts go into the RGD database
Conflict data files go to curation review; curators review the flags

Examples of Checks
• New symbol matches an RGD symbol
• New symbol matches an alias in RGD
• New record has a GBID
• New GBID matches the RGD record
• New GBID matches GBID of alias gene
• New GBID matches any other RGD record
• New Sequence matches any RGD
• Every attribute value compared to RGD values
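
The last check above (every attribute value compared to RGD values) can be sketched as a simple field-by-field comparison. This is a minimal illustration in Python, not the pipeline's Perl code. The function name, the dict-based record shape, and the rule that an attribute missing from RGD is not a conflict are all assumptions.

```python
def compare_attributes(new_record, rgd_record):
    """Split incoming attribute values into loadable values and conflicts.

    Both records are plain dicts (hypothetical shape). Assumption: an
    attribute that RGD does not hold yet is loadable, not a conflict.
    """
    loadable, conflicts = {}, {}
    for name, new_value in new_record.items():
        old_value = rgd_record.get(name)
        if old_value is None or old_value == new_value:
            loadable[name] = new_value
        else:
            # Keep both values so a curator can review the difference.
            conflicts[name] = (new_value, old_value)
    return loadable, conflicts
```

Under these assumptions, a record whose chromosome differs from the RGD value would have that one attribute routed to the conflict set while its other values remain loadable.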

Excel Summary Report: Conflict Data
Report lists the bin ID for data that requires further curation (BLAST/BLAT analysis)

Conflict Data Discovered by the Bulk Data Pipeline
Nomenclature conflicts
• Symbols were incorrect
Sequence conflicts
• Sequence reads were unacceptable due to poor quality (many N’s)
• Primers were switched
• Sequences in the dataset were associated with different objects in RGD
Alias conflicts
• Dataset aliases were RGD objects
• Dataset symbols were in RGD aliases
Attribute conflicts
• Chromosomes were different in RGD
• Cytological positions were different in RGD
• Expected sizes of PCR products were different in RGD
Redundant data conflicts
• Datasets had duplicate entries

Curation of Conflicting Data
• Checking processes find conflicting data: Nomenclature, Sequence, Alias symbols, Attributes, Redundant records
• Manual curation to resolve conflicts
• Resolvable -> Curated data -> Load into RGD (over-write current data)
• Irresolvable -> Removed data -> Store data in file (notify source)

Acknowledgements
Principal Investigators: Howard Jacob, Peter Tonellato, Simon Twigger
RGD Bioinformatics: Dean Pasko, Jiali Chen, Lan Zhao, Henry Fan, Wenhua Wu, Jian Lu, Hanping Long
RGD Curation: Mary Shimoyama, Susan Bromberg, Rajni Nigam, Chin-fu Chen, Gopal Gopinathrao, Charles Wang, Victoria Petri, Dorothy Reilly, Cindy Foote, Angela Zuniga-Meyer, Nataliya Nenasheva

Case Number | Case Description | Expected Result | Note
1 | New record has an RGD_ID, and both symbol and RGD_ID of new record match an active RGD record | Set symbol flag to “IN_RGD_1” | The process will continue through the GenBank ID check
2 | New record has an RGD_ID, and new symbol matches an active RGD symbol, but new RGD_ID does not match the RGD_ID of matching symbol | Set symbol flag to “DIF_RGD_ID” | The process will continue through the GenBank ID check
3 | New record has an RGD_ID, but new symbol does not match an active RGD symbol | Set symbol flag to “DIF_SYMBOL” | The process will continue through the GenBank ID check
4 | New record does not have an RGD_ID, but new symbol matches an active RGD symbol | Set symbol flag to “IN_RGD_2” | The process will continue through the GenBank ID check
5 | New record does not have an RGD_ID and new symbol does not match an active RGD symbol | Set symbol flag to “NEW” | The process will continue through the GenBank ID check
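
Cases 1 to 5 above reduce to two questions: does the new record carry an RGD_ID, and does its symbol match an active RGD symbol? The following Python sketch illustrates that decision (the slides name Perl 5 as the pipeline language; this is not the pipeline's code). The function name, the `active_symbols` dict shape, and exact-string symbol matching are assumptions.

```python
def symbol_flag(new_symbol, new_rgd_id, active_symbols):
    """Return the symbol flag for cases 1-5 of the table above.

    active_symbols: hypothetical dict mapping active RGD symbol -> RGD_ID.
    new_rgd_id: the RGD_ID on the incoming record, or None if absent.
    Assumption: symbols are compared as exact strings.
    """
    matched_id = active_symbols.get(new_symbol)
    if new_rgd_id is not None:
        if matched_id is None:
            return "DIF_SYMBOL"   # case 3: has RGD_ID, symbol not active
        if matched_id == new_rgd_id:
            return "IN_RGD_1"     # case 1: symbol and RGD_ID both match
        return "DIF_RGD_ID"       # case 2: symbol matches, RGD_ID differs
    if matched_id is not None:
        return "IN_RGD_2"         # case 4: no RGD_ID, symbol matches
    return "NEW"                  # case 5: no RGD_ID, symbol not active
```

In every case the record would then continue to the GenBank ID check, as the Note column states; that step is not modelled here.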

Case Number | Case Description | Expected Result | Note
6 | New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene | Change symbol flag to “IN_RGD_UPDATED” | Change current symbol flag after changing symbol (GBID check) and continue through GenBank ID check
7 | New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; new GBID matches a record in RGD and there is no alias matching that symbol; new symbol matches a retired or withdrawn symbol | Change symbol flag to “DIF_NON_ACTIVE_1” | Change current symbol flag and continue through GenBank ID check
8 | New record does not have an RGD_ID and new symbol does not match an active RGD symbol; new record does not have a GBID; new symbol matches a retired or withdrawn symbol | Change symbol flag to “DIF_NON_ACTIVE_2” | Change current symbol flag and continue through GenBank ID check

Case Number | Case Description | Expected Result | Note
9 | New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence does match another GBID/seq in RGD | Set flag to “DIF_9:RGD_ID” | This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review. Data without GBID match will be compared by BLAST after PRELOAD step in the pipeline, but NOT loaded until after curation review. RGD_ID is the value that is associated with the GBID/seq in RGD
10 | New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence does not match any GBID/seq in RGD | Set flag to “DIF_10” | This case will be compared by BLAST after PRELOAD step in pipeline, but NOT loaded until after curation review
11 | New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD but new GBID doesn’t match the GBID of the gene associated with the matching alias | Set flag to “DIF_11” | This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review

New Check Aliases Use Case Diagram (C. Fan)
Actors/systems: check_data; Pipeline and RGD DB
Use cases (labels case1 to case6 appear in the diagram):
• The incoming alias does not match any alias in RGD
• The incoming alias matches an alias but does not match any alias type in RGD
• The incoming alias matches an alias and an alias type in RGD, and the incoming symbol does not match any symbol that is associated with the matched alias
• The incoming alias matches an alias and an alias type in RGD, and the incoming symbol matches a symbol that is associated with the matched alias
• The incoming symbol is identical to the incoming alias
• The incoming alias type is not identical to the alias types in name ...

New Check Gene Symbol Use Case Diagram
Actors/systems: check_data, RGD DB, Pipeline DB
case1: Symbols match; no GBID in new data, but there is one in RGD
case2: Both symbol and GBID match the same RGD ID
case3: Symbols match; the new symbol has a GBID, but the new GBID doesn't match any RGD ID in RGD
case4: Symbols match; the new symbol has a GBID, but the new GBID matches a different RGD ID in RGD
case5: Symbols don't match; no GBID in new file
case6: Neither symbol nor GBID matches any RGD ID
case7: Symbols don't match; GBID matches an RGD ID, but the incoming symbol doesn't match any alias value and alias type
case8: Symbols match; no GBID in new file or in RGD
case9: Symbols match; no GBID matches this symbol in RGD, but the new GBID matches another RGD ID in RGD
case10: Symbols match; no GBID matches this symbol in RGD, and the new GBID doesn't match any in RGD
case11: Symbols don't match, but GBID matches an RGD ID, and the incoming symbol matches an alias value and alias type, but the incoming GBID doesn't match any GBID associated with the matched alias
case12: Symbols don't match; GBID matches an RGD ID, and the incoming symbol matches only one alias value and alias type in the aliases table
case13: Symbols don't match; GBID matches an RGD ID, and the incoming symbol matches more than one alias value and alias type in the aliases table

Case Number | Case Description | Expected Result | Note
12 | New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene | Set flag to “IN_RGD_2” | The new symbol is changed to the RGD gene symbol of the gene associated with the matching alias and the data is loaded
13 | New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, but alias is associated with more than one gene | Set flag to “DIF_13” | This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review

Bin # | Symb. match | GBID match specific RGD record | GBID in new file | GBID in RGD | GBID match any RGD | Seq match any RGD (BLAST) | Symb. match alias | GBID match GBID of alias gene | Alias of more than one gene | Flag | Symbol/GBID/Alias
1 | yes | -- | no | yes | -- | -- | -- | -- | -- | DIF_1 | New: A/-; RGD: A/1
2 | yes | yes | yes | yes | -- | -- | -- | -- | -- | IN_RGD_1 | New: A/1; RGD: A/1
3 | yes | no | yes | yes | no | no | -- | -- | -- | DIF_3 | New: A/1; RGD: A/2 or -; RGD: B/1
4 | yes | no | yes | yes | yes or no | yes | -- | -- | -- | DIF_4:RGD_ID | New: A/1; RGD: A/2; RGD: B/1
5 | no | -- | no | -- | -- | -- | -- | -- | -- | DIF_5 | New: A/-; RGD: B/2 or -
6 | no | -- | yes | -- | no | no | -- | -- | -- | NEW | New: A/1; RGD: B/2 or -
7 | no | -- | yes | yes | yes or no | yes | no | -- | -- | DIF_7:RGD_ID | New: A/1/C; RGD: B/1
8 | yes | -- | no | no | -- | -- | -- | -- | -- | DIF_8 | New: A/-; RGD: A/-
9 | yes | -- | yes | no | yes or no | yes | -- | -- | -- | DIF_9:RGD_ID | New: A/1; RGD: A/-
10 | yes | -- | yes | no | no | no | -- | -- | -- | DIF_10 | New: A/1; RGD: A/-
11 | no | -- | yes | -- | yes | -- | yes | no | -- | DIF_11 | New: A/1; RGD: B/2/A
12 | no | -- | yes | -- | yes | -- | yes | yes | no | DIF_12 | New: A/1; RGD: B/1/A
13 | no | -- | yes | -- | yes | -- | yes | yes | yes | DIF_13 | New: A/1; RGD: B/1/A; RGD: C/2/A
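
The bin table above is a decision table, and one way to apply it is to store each row as a pattern and return the first row whose tested cells all agree with what was observed for a record. The Python sketch below illustrates that idea only; it is not the pipeline's Perl implementation. The column names, `BINS`, and `assign_bin` are hypothetical. Cells shown as "--" or "yes or no" are treated as untested, the two cells missing from row 8 in the transcript are assumed to be "--", and the ":RGD_ID" suffix on some flags is kept as a literal placeholder.

```python
# Column order follows the table header; names are hypothetical.
COLUMNS = (
    "symbol_match", "gbid_match_specific", "gbid_in_new", "gbid_in_rgd",
    "gbid_match_any", "seq_match_any", "symbol_match_alias",
    "gbid_match_alias_gene", "alias_multi_gene",
)
Y, N, X = True, False, None  # X: "--" or "yes or no" (cell is not tested)

BINS = [
    (1,  "DIF_1",        (Y, X, N, Y, X, X, X, X, X)),
    (2,  "IN_RGD_1",     (Y, Y, Y, Y, X, X, X, X, X)),
    (3,  "DIF_3",        (Y, N, Y, Y, N, N, X, X, X)),
    (4,  "DIF_4:RGD_ID", (Y, N, Y, Y, X, Y, X, X, X)),
    (5,  "DIF_5",        (N, X, N, X, X, X, X, X, X)),
    (6,  "NEW",          (N, X, Y, X, N, N, X, X, X)),
    (7,  "DIF_7:RGD_ID", (N, X, Y, Y, X, Y, N, X, X)),
    (8,  "DIF_8",        (Y, X, N, N, X, X, X, X, X)),  # trailing cells assumed "--"
    (9,  "DIF_9:RGD_ID", (Y, X, Y, N, X, Y, X, X, X)),
    (10, "DIF_10",       (Y, X, Y, N, N, N, X, X, X)),
    (11, "DIF_11",       (N, X, Y, X, Y, X, Y, N, X)),
    (12, "DIF_12",       (N, X, Y, X, Y, X, Y, Y, N)),
    (13, "DIF_13",       (N, X, Y, X, Y, X, Y, Y, Y)),
]


def assign_bin(observed):
    """Return (bin number, flag) for the first matching row, else None.

    observed: dict mapping column name -> True/False. A column missing
    from the dict never satisfies a tested (yes/no) cell.
    """
    for number, flag, pattern in BINS:
        if all(spec is X or observed.get(col) == spec
               for col, spec in zip(COLUMNS, pattern)):
            return number, flag
    return None
```

A combination that the table does not list (for example symbol match, GBID in both files, GBID matching another record but no sequence match) returns `None` in this sketch, which a caller could treat as a record needing manual review.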

Complete Bulkdata Pipeline Process Diagram (C. Fan)
Stages: Input data -> Check data -> Pre-load data -> Load data
Input
• Start pipeline process; read file
• Check file header. Valid? No: curate the header. Yes: check symbol unique
• Check symbol unique. Unique? No: curate the symbol. Yes: create raw tables
• Check file exist. Exist? Branches: stop; delete file / clear file
• Load file; read raw table; go to check process
Check
• Start check process
• Gene object? Yes: check gene symbol. No: check name
• Check aliases
• Has sequence? Yes: check sequence. No: skip
• Check other attributes
• DIF flag? Yes: write DIF file, then curate data. No: go to load process
Load
• Start load process; load data to RGD; end pipeline
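
The check stage of the process diagram ends in a branch: records with a DIF flag are written to a DIF file for curation, and the rest go on to the load process. A minimal Python sketch of that routing follows (the slides name Perl 5 as the pipeline language; this is an illustration, not the pipeline's code). The function name, the record shape, and the convention that a check returns a flag string or `None` are assumptions.

```python
def run_check_stage(records, checks):
    """Apply each check to each record and route the record.

    checks: list of hypothetical callables; each takes a record and
    returns a DIF flag string, or None when it finds no conflict.
    Returns (to_load, dif_rows), where dif_rows pairs each flagged
    record with the list of flags raised for it.
    """
    to_load, dif_rows = [], []
    for record in records:
        flags = []
        for check in checks:
            flag = check(record)
            if flag is not None:
                flags.append(flag)
        if flags:
            dif_rows.append((record, flags))  # "write DIF file" branch
        else:
            to_load.append(record)            # "go to load process" branch
    return to_load, dif_rows
```

Keeping the checks as a list mirrors the modular design described earlier in the slides: a new QC check is one more callable in the list.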

Database Object Relationships
Diagram nodes: sslps, genes, sequences, references, maps, qtls, traits, homologs, strains, phenotypes, diseases, ESTs

RGD Database Technologies
Platforms
• Database server: Oracle 8.1.6
• Sun Solaris 2.8 Unix operating system
• Sun Enterprise 450’s
Programming Language
• Perl 5
Object-oriented Methodology
• Database - object based schema
• Perl modules - object based and globally used across systems: DB.pm module, PRELOAD.pm module, LOAD.pm module
Schema Documentation
• Rational Rose 2000 Enterprise

RGD Data Flow
Diagram labels:
• owner_1 (Production), Cur_1 (Curation), Owner_2, dev_1 (Development)
• dorado, fuxi, rgd.mcw.edu, dss, alps
• Internal Systems; Public System
• Curation data
• Online: Strains, References, Nomenclature, Gene editing, Ontologies (rgdtogo.txt), Notes
• Online: Genes, QTLs, Strains
• Bulkdata: all objects
• Bulk Data (Test-data); Bulk Data (Production-load); order labels: 1st, 2nd
• Modify flags (appears twice)
• Object Templates (text, tab delimited): Homologs, Strains, Genes, QTLs, SSLPs, ESTs, Map Data

Check for LocusLink, Swiss-Prot, RatMap IDs
Bin Number is determined by:
• New symbol matches RGD symbol
• LL/SP/RM_ID in new file
• LL/SP/RM_ID in specific RGD record
• LL/SP/RM_ID matches specific RGD record
• LL/SP/RM_ID matches any RGD

Checks for Sequence
• Sequence must be over 95% aligned
• Forward and reverse primers must be over 95% aligned
• Ratio of aligned bp / length of query sequence >= 95%
• Ratio of length1 (shorter seq) / length2 (longer seq) >= 90%
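
The two ratio thresholds above can be written as a short predicate. This Python sketch is illustrative only (the slides name Perl 5 as the pipeline language). The function and parameter names are hypothetical, and it assumes the aligned base-pair count has already been taken from a BLAST result. The slide says both "over 95%" and ">= 95%"; the sketch uses ">=".

```python
def passes_sequence_checks(aligned_bp, query_len, other_len,
                           min_aligned=0.95, min_length_ratio=0.90):
    """Apply the two ratio thresholds from the slide.

    aligned_bp: number of aligned base pairs (assumed to come from BLAST).
    query_len:  length of the query sequence.
    other_len:  length of the sequence it is compared against.
    """
    if query_len <= 0 or other_len <= 0:
        return False
    # Ratio of aligned bp to the length of the query sequence.
    aligned_ratio = aligned_bp / query_len
    # Ratio of the shorter sequence length to the longer one.
    length_ratio = min(query_len, other_len) / max(query_len, other_len)
    return aligned_ratio >= min_aligned and length_ratio >= min_length_ratio
```

The same predicate could be applied separately to the forward and reverse primers, which the slide requires to meet the 95% alignment threshold as well.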