ESTminer CHADO adaptor The University of Georgia Alan Gingle, Yecheng Huang,

ESTminer CHADO adaptor The University of Georgia Alan Gingle, [email protected] Yecheng Huang, [email protected] http://cggc.agtec.uga.edu/ Nov 1, 2004


ESTminer CHADO adaptor

The University of Georgia
Alan Gingle, [email protected]
Yecheng Huang, [email protected]
http://cggc.agtec.uga.edu/

Nov 1, 2004


Introduction

• The purpose of this presentation is to draft an EST CHADO schema that is open for community comments.
• Examples are used to demonstrate our approach to applying CHADO to EST data.

Contents:
• ESTMiner_CHADO schema overview
• Controlled vocabulary: ontology and definitions
• Feature and its properties, relationships and locations
• Appendix (examples used in the slides, minor tables)


ESTminer CHADO schema overview

• Major part of CHADO that is relevant to the ESTMiner project

EST_CHADO V0.1

dbxrefprop
  PK            dbxrefprop_id
  FK1,U1,I1     dbxref_id
  FK2,U1,I2     type_id
  U1            value
  U1            rank

featureprop
  PK            featureprop_id
  FK1,U1,I1     feature_id
  FK2,U1,I2     type_id
  U1            value
  U1            rank

cvterm
  PK            cvterm_id
  FK1,U1,I1     cv_id
  U1            name
                definition
  FK2           dbxref_id

feature
  PK            feature_id
  FK1,I1        dbxref_id
  I2,U1         organism_id
  I5,I6         name
  I4,U1         uniquename
                residues
                seqlen
                md5checksum
  FK2,I3,U1     type_id
                is_analysis
                timeaccessioned
                timelastmodified

analysis
  PK            analysis_id
                name
                description
  U1            program
  U1            programversion
                algorithm
  U1            sourcename
                sourceversion
                sourceuri
                timeexecuted

cvterm_dbxref
  PK            cvterm_dbxref_id
  FK2,U1,I1     cvterm_id
  FK1,U1,I2     dbxref_id

featurerange
  PK            featurerange_id
  I1            featuremap_id
  FK5,I2        feature_id
  FK4,I3        leftstartf_id
  FK3,I4        leftendf_id
  FK2,I5        rightstartf_id
  FK1,I6        rightendf_id
                rangestr

cvterm_relationship
  PK            cvterm_relationship_id
  FK3,I1,U1     type_id
  FK2,I2,U1     subject_id
  FK1,U1,I3     object_id

analysisfeature
  PK            analysisfeature_id
  FK2,U1,I1     feature_id
  FK1,U1,I2     analysis_id
                rawscore
                normscore
                significance
                identity

db
  PK            db_id
  U1,I1         name
                contact_id
                description
                urlprefix
                url

feature_cvterm
  PK            feature_cvterm_id
  FK1,I1,U1     feature_id
  FK2,I2,U1     cvterm_id
  I3,U1         pub_id

dbxref
  PK            dbxref_id
  FK1,I1,U1     db_id
  U1,I2         accession
  U1,I3         version
                description

analysisprop
  PK            analysisprop_id
  FK1,U1,I1     analysis_id
  FK2,U1,I2     type_id
  U1            value

feature_relationship
  PK            feature_relationship_id
  FK2,I1,U1     subject_id
  FK1,I2,U1     object_id
  FK3,U1,I3     type_id
                rank

cv
  PK            cv_id
  U1            name
                definition

featureloc
  PK            featureloc_id
  PK,FK1,U1,I2  feature_id
  FK2,I4,I3     srcfeature_id
  I4,I1         fmin
                is_fmin_partial
  I4,I1         fmax
                is_fmax_partial
                strand
                phase
                residue_info
  U1            locgroup
  U1            rank


EST Controlled Vocabulary I - Ontology

cvterm_relationship
  PK            cvterm_relationship_id
  FK3,I1,U1     type_id
  FK2,I2,U1     subject_id
  FK1,U1,I3     object_id

[Figure: ontology graph of the EST controlled vocabulary. Nodes, listed as cvterm_id: name]

1: Read3
2: Sequence
3: Contig
4: Cluster
5: ESTName
6: Library
7: GB_ACC_#
8: Scr1o
9: Scr1e
10: Scr2o
11: Scr2e
12: QUAL16o
13: QUAL16e
14: QUAL20o
15: QUAL20e
16: GB_Access
17: Identity_threshold
18: Length_threshold
19: Library_name
20: stage
21: cultivar
22: cell_type
23: organ
24: strain
25: Organism
26: imo
27: ipo
28: numofSeq
29: numofcontig


EST Controlled Vocabulary II - Definition

cv
  PK            cv_id
  U1            name
                definition

cvterm_id  name                definition
1          Read3               3' read
2          sequence            EST Sequence
3          Contig              EST Contig
4          Cluster             EST Cluster
5          Name                EST Name
6          Lib                 Library
7          GB_Access           GenBank Access Number
8          Scr1o               Screen offset 1
9          Scr1e               Screen end 1
10         Scr2o               Screen offset 2
11         Scr2e               Screen end 2
12         QUAL16o             Quality16 offset
13         QUAL16e             Quality16 end
14         QUAL20o             Quality20 offset
15         QUAL20e             Quality20 end
16
17         Identity_threshold
18         Length_threshold
19         Library_name
20         stage
21         cultivar
22         Cell_type
23         Organ
24         strain
25         Organism            Organism and species
26         imo                 Is member of
27         ipo                 Is part of
28         numofseq            Number of seq
29         numofcontig         Number of contig
…          …                   …

• insert into cv (cv_id, name, definition) values (1, 'CGGC_UGA', 'University of Georgia, Comparative Grass Genomic Center');
• insert into cvterm (cvterm_id, cv_id, name, definition, dbxref_id) values (1, 1, 'Read3', '3'' read', 1);

cvterm
  PK            cvterm_id
  FK1,U1,I1     cv_id
  U1            name
                definition
  FK2           dbxref_id
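The two inserts above can be tried out without a CHADO installation. The following is a minimal sketch that uses Python's standard sqlite3 module with cut-down stand-ins for the cv and cvterm tables. The SQLite column types, the omission of dbxref_id, and the helper names load_vocabulary and term_id are assumptions made for illustration; only a few of the terms from the definition table are loaded.

```python
import sqlite3

# Cut-down stand-ins for the CHADO cv and cvterm tables. Column types are an
# assumption, and the dbxref_id foreign key is left out to keep the sketch short.
SCHEMA = """
CREATE TABLE cv (
    cv_id      INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    definition TEXT
);
CREATE TABLE cvterm (
    cvterm_id  INTEGER PRIMARY KEY,
    cv_id      INTEGER NOT NULL REFERENCES cv (cv_id),
    name       TEXT NOT NULL,
    definition TEXT,
    UNIQUE (cv_id, name)
);
"""

# (cvterm_id, name, definition) rows copied from the definition table.
TERMS = [
    (1, "Read3", "3' read"),
    (2, "sequence", "EST Sequence"),
    (3, "Contig", "EST Contig"),
    (4, "Cluster", "EST Cluster"),
    (6, "Lib", "Library"),
    (26, "imo", "Is member of"),
    (27, "ipo", "Is part of"),
]


def load_vocabulary(conn):
    """Create the two tables and load the CGGC_UGA vocabulary (hypothetical helper)."""
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO cv (cv_id, name, definition) VALUES (?, ?, ?)",
        (1, "CGGC_UGA", "University of Georgia, Comparative Grass Genomic Center"),
    )
    conn.executemany(
        "INSERT INTO cvterm (cvterm_id, cv_id, name, definition) VALUES (?, 1, ?, ?)",
        TERMS,
    )


def term_id(conn, name):
    """Look up a cvterm_id by term name, or return None if the term is unknown."""
    row = conn.execute(
        "SELECT cvterm_id FROM cvterm WHERE cv_id = 1 AND name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
load_vocabulary(conn)
print(term_id(conn, "Contig"))
```

Parameter binding (the ? placeholders) takes care of the quote in "3' read", so no manual escaping is needed on the Python side.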


EST Feature

feature
  PK            feature_id
  FK1,I1        dbxref_id
  I2,U1         organism_id
  I6,I5         name
  I4,U1         uniquename
                residues
                seqlen
                md5checksum
  FK2,I3,U1     type_id
                is_analysis
                timeaccessioned
                timelastmodified

insert into feature (feature_id, uniquename, residues, seqlen, type_id, …) values (1, 'IP1_1_F11.g1_A002', 'TGAG…CATTT', 788, 1, …);

feature_id  uniquename         residues            seqlen  type_id
1           IP1_1_F11.g1_A002  TGAG…CATTT          788     1
2           IP1                                            6
3           Q20_1              TTT...TGGA          579     1
4           Q16_1              TTT…TTCCGAT         618     1
5           CTGSB_100848       Consensus residues  …       3
6           CLSB_1540                              …       4

**** See the example in the appendix ****


EST Feature and Properties

feature table

Feature_id  Uniquename         Type_id
1           IP1_1_F11.g1_A002  1 (sequence)
2           IP1                6 (Library)
5           CTGSB_100848       3 (contig)
6           CLSB_1540          4 (cluster)

featureprop table: IP1_1_F11.g1_A002

Featureprop_id  Feature_id  Type_id         value
1               1           2 (sequence)
2               1           5 (ESTname)     IP1_1_F11.g1_A002
3               1           12 (QUAL16o)    11
4               1           13 (QUAL16e)    628
5               1           14 (QUAL20o)    11
6               1           15 (QUAL20e)    589
7               1           16 (GB_Access)  BG946868

featureprop table: IP1

Featureprop_id  Feature_id  Type_id            value
8               2           19 (Library_name)  IP1
9               2           20
10              2           21 (cultivar)      BTx623
11              2           22 (cell_type)     N/A
12              2           23 (organ)         Developing preanthesis panicles
13              2           24 (strain)        N/A
14              2           25 (Organism)      Sorghum bicolor L.

featureprop table: CLSB_1540

Featureprop_id  Feature_id  Type_id              value
16              6           17 (Iden_threshold)  95
17              6           18 (Len_threshold)   20
18              6           29 (numofcontig)     1

featureprop table: CTGSB_100848

Featureprop_id  Feature_id  Type_id        value
15              5           28 (numofseq)  2
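Reading properties back means joining featureprop to cvterm so that each type_id resolves to a term name. The following is a minimal sketch using Python's sqlite3 module and a subset of the rows shown in the slides. The reduced table definitions, the column types and the helper name properties are assumptions for illustration, not part of the proposed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced stand-ins for cvterm, feature and featureprop (types are assumed).
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type_id    INTEGER NOT NULL REFERENCES cvterm (cvterm_id)
);
CREATE TABLE featureprop (
    featureprop_id INTEGER PRIMARY KEY,
    feature_id     INTEGER NOT NULL REFERENCES feature (feature_id),
    type_id        INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
    value          TEXT,
    rank           INTEGER NOT NULL DEFAULT 0,
    UNIQUE (feature_id, type_id, value, rank)
);
""")

# Term ids and names as used in the slides.
conn.executemany("INSERT INTO cvterm VALUES (?, ?)", [
    (1, "Read3"), (6, "Lib"),
    (12, "QUAL16o"), (13, "QUAL16e"), (14, "QUAL20o"), (15, "QUAL20e"),
    (16, "GB_Access"), (21, "cultivar"), (25, "Organism"),
])
conn.executemany("INSERT INTO feature VALUES (?, ?, ?)", [
    (1, "IP1_1_F11.g1_A002", 1),
    (2, "IP1", 6),
])
conn.executemany(
    "INSERT INTO featureprop (featureprop_id, feature_id, type_id, value) "
    "VALUES (?, ?, ?, ?)",
    [
        (3, 1, 12, "11"), (4, 1, 13, "628"),
        (5, 1, 14, "11"), (6, 1, 15, "589"),
        (7, 1, 16, "BG946868"),
        (10, 2, 21, "BTx623"),
        (14, 2, 25, "Sorghum bicolor L."),
    ],
)


def properties(conn, uniquename):
    """Return {term name: value} for one feature (hypothetical helper)."""
    rows = conn.execute(
        """
        SELECT t.name, p.value
        FROM featureprop AS p
        JOIN feature AS f ON f.feature_id = p.feature_id
        JOIN cvterm  AS t ON t.cvterm_id  = p.type_id
        WHERE f.uniquename = ?
        ORDER BY p.featureprop_id
        """,
        (uniquename,),
    ).fetchall()
    return dict(rows)


print(properties(conn, "IP1_1_F11.g1_A002"))
print(properties(conn, "IP1"))
```

Values are stored as text, so numeric properties such as the quality offsets have to be converted by the caller.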


EST Feature Relationship

feature_relationship
  PK            feature_relationship_id
  FK2,I1,U1     subject_id
  FK1,I2,U1     object_id
  FK3,U1,I3     type_id
                rank

Feature relationship table

Feature_relationship_id  subject_id  object_id    type_id            rank
3                        1           5 (contig)   26 (is member of)
4                        1           5 (contig)   26
5                        1           2 (library)  26

feature table

Feature_id  Uniquename         Type_id
1           IP1_1_F11.g1_A002  1 (sequence)
2           IP1                6 (library)
5           CTGSB_100848       3 (contig)
6           CLSB_1540          4 (cluster)

feature_id 1 (sequence) --member of--> feature_id 5 (contig) --member of--> feature_id 6 (cluster)
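The sequence, contig and cluster chain can be walked with a recursive query over feature_relationship. The following is a minimal sketch with Python's sqlite3 module. It assumes a contig-to-cluster row (subject 5, object 6) taken from the diagram rather than from the tabulated rows, and the reduced tables and the helper name containers are likewise illustrative.

```python
import sqlite3

IMO = 26  # cvterm_id of "imo" (is member of) in the controlled vocabulary

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL
);
CREATE TABLE feature_relationship (
    feature_relationship_id INTEGER PRIMARY KEY,
    subject_id INTEGER NOT NULL REFERENCES feature (feature_id),
    object_id  INTEGER NOT NULL REFERENCES feature (feature_id),
    type_id    INTEGER NOT NULL,
    UNIQUE (subject_id, object_id, type_id)
);
""")
conn.executemany("INSERT INTO feature VALUES (?, ?)", [
    (1, "IP1_1_F11.g1_A002"),  # sequence
    (2, "IP1"),                # library
    (5, "CTGSB_100848"),       # contig
    (6, "CLSB_1540"),          # cluster
])
conn.executemany("INSERT INTO feature_relationship VALUES (?, ?, ?, ?)", [
    (3, 1, 5, IMO),  # sequence is a member of the contig
    (4, 5, 6, IMO),  # ASSUMPTION: contig is a member of the cluster (from the diagram)
    (5, 1, 2, IMO),  # sequence is a member of the library
])


def containers(conn, feature_id):
    """Return (uniquename, depth) for everything the feature is a member of."""
    return conn.execute(
        """
        WITH RECURSIVE up (feature_id, depth) AS (
            SELECT object_id, 1
            FROM feature_relationship
            WHERE subject_id = ? AND type_id = 26
            UNION
            SELECT r.object_id, up.depth + 1
            FROM feature_relationship AS r
            JOIN up ON r.subject_id = up.feature_id
            WHERE r.type_id = 26
        )
        SELECT f.uniquename, up.depth
        FROM up
        JOIN feature AS f ON f.feature_id = up.feature_id
        ORDER BY up.depth, f.feature_id
        """,
        (feature_id,),
    ).fetchall()


print(containers(conn, 1))
```

Direct containers come back at depth 1 (library and contig) and the cluster at depth 2, so one query answers "which cluster does this EST belong to" without a separate EST-to-cluster row.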


EST Feature Location

featureloc
  PK            featureloc_id
  PK,FK1,U1,I2  feature_id
  FK2,I4,I3     srcfeature_id
  I4,I1         fmin
                is_fmin_partial
  I4,I1         fmax
                is_fmax_partial
                strand
                phase
                residue_info
  U1            locgroup
  U1            rank

feature table

Feature_id  Uniquename         Type_id
1           IP1_1_F11.g1_A002  1 (sequence)
3           Q20_1              1 (sequence)
4           Q16_1              1 (sequence)

featureloc table

Featureloc_id  Feature_id  Srcfeature_id  fmin  fmax
1              3           1              11    589
2              4           1              11    628
…              …

[Figure: feature_id 1 drawn as the source sequence from position 1 to 788; feature_id 3 and feature_id 4 are located on it, each starting at position 11, with end positions 589 and 628 respectively.]
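The quality-trimmed residues do not have to be stored twice, because they can be cut out of the source feature through featureloc. The following is a minimal sketch with Python's sqlite3 module, using the Phred ranges from the appendix (Q20: 11 to 589, Q16: 11 to 628) and treating them as 1-based inclusive positions, which is what the seqlen values 579 and 618 imply. The source residues are a repeating placeholder string rather than the real EST, and the helper name located_residues is hypothetical. CHADO's own documentation describes fmin and fmax as interbase coordinates, so a production loader would probably store fmin as 10 instead of 11.

```python
import sqlite3

# Placeholder residues of the right length (788); NOT the real EST sequence.
PLACEHOLDER = "ACGT" * 197

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    residues   TEXT,
    seqlen     INTEGER
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER NOT NULL REFERENCES feature (feature_id),
    srcfeature_id INTEGER REFERENCES feature (feature_id),
    fmin          INTEGER,
    fmax          INTEGER
);
""")
conn.execute(
    "INSERT INTO feature VALUES (1, 'IP1_1_F11.g1_A002', ?, ?)",
    (PLACEHOLDER, len(PLACEHOLDER)),
)
conn.executemany(
    "INSERT INTO feature (feature_id, uniquename, seqlen) VALUES (?, ?, ?)",
    [(3, "Q20_1", 579), (4, "Q16_1", 618)],
)
# fmin/fmax are kept exactly as shown in the slides (assumed 1-based, inclusive).
conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?, ?)", [
    (1, 3, 1, 11, 589),  # Phred quality 20+ region
    (2, 4, 1, 11, 628),  # Phred quality 16+ region
])


def located_residues(conn, uniquename):
    """Cut a located feature's residues out of its source feature."""
    row = conn.execute(
        """
        SELECT substr(src.residues, loc.fmin, loc.fmax - loc.fmin + 1)
        FROM featureloc AS loc
        JOIN feature AS f   ON f.feature_id   = loc.feature_id
        JOIN feature AS src ON src.feature_id = loc.srcfeature_id
        WHERE f.uniquename = ?
        """,
        (uniquename,),
    ).fetchone()
    return row[0] if row else None


print(len(located_residues(conn, "Q20_1")))
print(len(located_residues(conn, "Q16_1")))
```

SQLite's substr counts from 1, so the 1-based positions can be passed through unchanged; with interbase coordinates the start argument would be fmin + 1 and the length fmax - fmin.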


Appendix – Example of EST Library

IP1

STAGE: N/A
FULL_NAME: Immature panicle 1
CULTIVAR: BTx623
CELL_TYPE: N/A
STRAIN: N/A
ORGANISM: Sorghum bicolor L.
BOTANICAL_NAME: S. bicolor
ORGAN: Developing preanthesis panicles
CELL_LINE: N/A
COMMENT_FOR_EST: Sequences have been trimmed to exclude PolyA, vector and regions below Phred quality 16. The threshold for high quality sequence is 20. Three-prime sequences, which are obtained with PolyTMix or T7 sequencing primer, are presented as the reverse complement.
PUBLISH: Y
HOST: N/A
SEX: N/A
RE_2: EcoRI
TISSUE: N/A
RE_1: XhoI
LIB_NAME: IP1
VECTOR: pBluescript II SK(-) from Lambda Zap II
V_TYPE: Plasmid
DESCR: The library was made from poly-A RNA in the cloning vector lambda ZAP II. Clones to be sequenced were prepared by mass excision.


Appendix – Example of EST Sequence

Sequence Name: IP1_1_F11.g1_A002
GenBank Access Number: BG946868

1 11 21 31 41 51 61 71 81 91

1 TGAGTTTTTT TTTTTTTTTT TTGTTCTTAA TTATTCAATT CATTCATGAT ACTACTGTCT GCTATTTCCA CAGTAAATGT TCATATTACA TAGGAGCCAC

101 TGGCTCCTCC GGATTCCTTA AAAAAAATGT CCATATTACA ATTGGATTTA TGATACTACA CAGGTTCGCG AAATCGAGCA GGTTAGAAAA GCTTCCACTT

201 GCTGACCTCA CTAAAAGTGA AACACAGTTC CGGGAAGTTC ATACAGTTTT CCCATATAGA TCAATTGATC CTATCTGAAA CCTTGGATTA GAATGAGATT

301 CTCTTACGCG TAGAAACCTA AACCGGAAAG CATTTGCTTT ATATCTCTTA TCCACTGTAA ATGTTTTTCT AAGGAAACGG CTCTCAAACA TTTCAGAATT

401 CCGAGCATCA AGTAGATTCC AGGTGGAACC TGCATCTGTG CTCCCTTCAA GAACCCAGTC CATTGGATCC CTCTCTGGAG CATCATTAGC TGACATCAAA

501 TCATATGACT CCAACTCACA ACTTTTGCCA AGCTTGCATTG TATAAATCAG CCAACATCCT TTGGCTCCAT CAGGCTCTTC CCATTTGGAA GAATGGATGC

601 CGTCAAAAGC TGCTGTTGCA ATTCCGATTG GGAGCTGTTC CCTGCTTGCA AGGACTGAAC CTGAGCATAC TCTGTTCCCC TCTGGGAAAT GGTTGCCCTC

701 TGTGAAAGAG GTATTANNTC TATAATACTC ATATCTCATT ACTGCATCCA GTGCTACTGG TAACGCTNAG GATGAGTGGA TTGCATTT

• Length of Sequence: 788
• Screened Vector
• Phred Quality 20+ START: 11 END: 589
• Phred Quality 16+ START: 11 END: 628
• Phred Quality Below 16


Appendix – Example of EST Cluster and Contig

95-20-CLSB_1540

Identity Threshold: 95
Length Threshold: 20
Cluster Name: CLSB_1540
Number of Contigs: 1

CTGSB_100848

Contig Name: CTGSB_100848
Number of Sequences: 2
• IP1_1_F11.g1_A002
• P1_48_H11.g1_A002


Appendix – Example of EST Database

db
  PK            db_id
  U1,I1         name
                contact_id
                description
                urlprefix
                url

dbxref
  PK            dbxref_id
  FK1,I1,U1     db_id
  U1,I2         accession
  U1,I3         version
                description

dbxrefprop
  PK            dbxrefprop_id
  FK1,U1,I1     dbxref_id
  FK2,U1,I2     type_id
  U1            value
  U1            rank

• insert into db (db_id, name, …) values (1, 'CGGC_UGA', …);
• insert into dbxref (dbxref_id, db_id, …) values (1, 1, …);
• insert into dbxrefprop (dbxrefprop_id, dbxref_id, …) values (1, 1, …);


Appendix – Example of Analysis

analysis
  PK            analysis_id
                name
                description
  U1            program
  U1            programversion
                algorithm
  U1            sourcename
                sourceversion
                sourceuri
                timeexecuted

analysisfeature
  PK            analysisfeature_id
  FK2,U1,I1     feature_id
  FK1,U1,I2     analysis_id
                rawscore
                normscore
                significance
                identity

analysis table

analysis_id  name     description  program  algorithm
1            CGGC_01               blast    cagt_miner
…            …        …            …        …

analysisfeature table

analysisfeature_id  analysis_id  feature_id
1                   1            5
2                   1            6
…                   …            …