Development of a Chicken Unigene Database
Project No. 9
Mentors: Dr. Wellington Martins - Dr. Joan Burnside
Animal Science Dept., University of Delaware
Jianshan Tang, Ruoming Jin
Department of CIS, University of Delaware
Lilian Lacoste
DBI - French National School of Aeronautics and Space
Results
Phrap clustering result:
17,090 ESTs -> Phrap -> 2,815 contigs + 6,390 singlets = 9,205 clusters
Second clustering method: using BLAST output
Contig 1 -> BLAST output 1
Contig 2 -> BLAST output 2
Filtering and parsing -> Comparing (similarity function) -> Similarity matrix
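
The comparison step above could be sketched as follows. This is only an illustration under stated assumptions, not the project's code: the slides do not define the similarity function, so the sketch assumes a Jaccard index over the sets of database accessions that each contig hits in its filtered BLAST output. The names `jaccard` and `similarity_matrix` and the sample hit sets are hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Assumed similarity function: shared hits divided by all hits."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_matrix(hits: Dict[str, Set[str]]) -> List[List[float]]:
    """Compare every pair of contigs; rows and columns follow sorted names."""
    names = sorted(hits)
    return [[jaccard(hits[x], hits[y]) for y in names] for x in names]


# Hypothetical filtered BLAST hits (accessions) per contig.
hits = {
    "contig79": {"AF178529", "BC001965", "AF112481"},
    "contig133": {"AF178529", "BC001965", "AB018110"},
    "contig200": {"X00001"},
}
matrix = similarity_matrix(hits)
```

Thresholding such a matrix (an entry above the threshold becomes an edge) would give the graph that a clustering step takes as input.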
What is gbc?
Graph Based Clustering
- Clustering: a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
- Graph: the relations among the data can be expressed as a graph. If two nodes are related, an edge connects them.
Applications in bioinformatics
- Protein sequence clustering
- EST clustering
- Many other applications
Objectives of "gbc"
- Support different input formats
- Efficiently support clustering of very large sparse graphs
- Be flexible for the user
How to use gbc
Output
- Cluster number, and all the nodes that belong to the cluster
Clique clustering
- A clique is a completely connected subgraph
- Each maximal clique in the graph becomes a cluster
- Clusters may overlap
- Generally produces small but very tight clusters
Single-link clustering
- A maximal connected subgraph becomes a cluster
- Produces larger but weaker clusters
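
The two definitions above can be illustrated with a short sketch. It is not gbc's implementation: it assumes an undirected graph stored as a dictionary of adjacency sets, finds single-link clusters as connected components, and finds clique clusters with the basic Bron-Kerbosch algorithm. The function names and the sample graph are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets


def single_link(graph: Graph) -> List[Set[str]]:
    """Each maximal connected subgraph (component) becomes one cluster."""
    seen: Set[str] = set()
    clusters: List[Set[str]] = []
    for start in graph:
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def clique_clusters(graph: Graph) -> List[Set[str]]:
    """Each maximal clique becomes one cluster (basic Bron-Kerbosch)."""
    cliques: List[Set[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            cliques.append(r)  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical graph: triangle a-b-c, d attached to c, and isolated e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
```

On this graph, single-link clustering merges a, b, c, and d into one cluster, while clique clustering places node c in two overlapping clusters, which matches the "larger but weaker" versus "small but very tight" contrast.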
A little about the implementation
Two clustering algorithms
- Single-link
- Clique
Graph classes
- Efficiently support dense and sparse graphs
- Provide the same interface, so the clustering code needs no modification
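
One way to picture the shared-interface idea is the sketch below. It is an assumption about the design, not gbc's actual classes: `Graph`, `DenseGraph`, `SparseGraph`, `count_components`, and `build` are hypothetical names. The dense variant uses an adjacency matrix and the sparse variant uses adjacency sets, while the component-counting code only touches the common interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Set


class Graph(ABC):
    """Hypothetical interface that the clustering code depends on."""

    @abstractmethod
    def add_edge(self, u: int, v: int) -> None: ...

    @abstractmethod
    def neighbors(self, u: int) -> List[int]: ...

    @abstractmethod
    def num_nodes(self) -> int: ...


class DenseGraph(Graph):
    """Adjacency matrix: memory grows with the square of the node count."""

    def __init__(self, n: int) -> None:
        self._adj = [[False] * n for _ in range(n)]

    def add_edge(self, u: int, v: int) -> None:
        self._adj[u][v] = self._adj[v][u] = True

    def neighbors(self, u: int) -> List[int]:
        return [v for v, linked in enumerate(self._adj[u]) if linked]

    def num_nodes(self) -> int:
        return len(self._adj)


class SparseGraph(Graph):
    """Adjacency sets: memory grows with the number of edges."""

    def __init__(self, n: int) -> None:
        self._n = n
        self._adj: Dict[int, Set[int]] = {}

    def add_edge(self, u: int, v: int) -> None:
        self._adj.setdefault(u, set()).add(v)
        self._adj.setdefault(v, set()).add(u)

    def neighbors(self, u: int) -> List[int]:
        return sorted(self._adj.get(u, ()))

    def num_nodes(self) -> int:
        return self._n


def count_components(g: Graph) -> int:
    """Clustering-side code: uses only the Graph interface."""
    seen: Set[int] = set()
    count = 0
    for start in range(g.num_nodes()):
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(g.neighbors(node))
    return count


def build(cls) -> Graph:
    """Build the same 5-node example graph with either representation."""
    g = cls(5)
    for u, v in [(0, 1), (1, 2), (3, 4)]:
        g.add_edge(u, v)
    return g
```

Calling `count_components(build(DenseGraph))` and `count_components(build(SparseGraph))` runs the same clustering-side code against both representations.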
Analysis program
- Reset BLAST output
- Change matrix threshold
- Reset semantics
- Run analysis
- New contig set
- Number of contigs
- Comparison algorithm
- Clustering algorithm
- Results output
- Analysis tools
- Process log output
Analysis tools: contig information
Display the BLAST output:
- sequence references
- sequence annotations
- percentage of matching base pairs
Display the list of contigs sorted according to their best matching percentage in the BLAST output
Analysis tool: EST selector
Display:
- frequency vs. length (in ESTs) of contigs
- list of ESTs in a contig
Allows selecting the best representative EST according to length and tissue type
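
A minimal sketch of such a selection is shown below. The slides only say that the choice depends on length and tissue type, so the ranking rule here is an assumption: ESTs from a preferred tissue win first, and the longest EST wins among those. The `Est` record, `best_representative`, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Est:
    name: str
    length: int  # sequence length in base pairs
    tissue: str  # tissue the library was made from


def best_representative(ests: List[Est],
                        preferred_tissue: Optional[str] = None) -> Est:
    """Assumed rule: preferred tissue first, then the longest sequence."""
    if not ests:
        raise ValueError("contig contains no ESTs")
    return max(ests, key=lambda e: (e.tissue == preferred_tissue, e.length))


# Hypothetical ESTs belonging to one contig.
contig = [
    Est("est1", 520, "liver"),
    Est("est2", 740, "muscle"),
    Est("est3", 610, "liver"),
]
longest = best_representative(contig)
liver_pick = best_representative(contig, preferred_tissue="liver")
```

Without a preferred tissue the rule reduces to picking the longest EST; with one, a shorter EST from that tissue can be chosen over a longer EST from another tissue.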
First results
On a set of 400 contigs representing 1000 ESTs
Contig number: 79
Contig size: 743
Best matching fraction: 0.43587786259541983
gb|AF178529.1|AF178529 Gallus gallus Rad54b (RAD54B) mRNA, compl... 571 e-160
gb|BC001965.1|BC001965 Homo sapiens, RAD54, S. cerevisiae, homol... 143 2e-31
ref|XM_005161.3| Homo sapiens RAD54, S. cerevisiae, homolog of, ... 143 2e-31
gb|AF112481.1|AF112481 Homo sapiens RAD54B protein (RAD54B) mRNA... 143 2e-31
ref|NM_012415.1| Homo sapiens RAD54, S. cerevisiae, homolog of, ... 143 2e-31
emb|AL133578.1|HSM801429 Homo sapiens mRNA; cDNA DKFZp434J1672 (... 143 2e-31
dbj|AP003534.1|AP003534 Homo sapiens genomic DNA, chromosome 8q2... 76 3e-11
gb|AC009623.6|AC009623 Homo sapiens chromosome 8, clone RP11-219... 40 1.7

Contig number: 133
Contig size: 740
Best matching fraction: 0.9413109756097561
gb|AF178529.1|AF178529 Gallus gallus Rad54b (RAD54B) mRNA, compl... 1235 0.0
gb|BC001965.1|BC001965 Homo sapiens, RAD54, S. cerevisiae, homol... 184 5e-44
ref|XM_005161.3| Homo sapiens RAD54, S. cerevisiae, homolog of, ... 184 5e-44
gb|AF112481.1|AF112481 Homo sapiens RAD54B protein (RAD54B) mRNA... 184 5e-44
ref|NM_012415.1| Homo sapiens RAD54, S. cerevisiae, homolog of, ... 184 5e-44
emb|AL133578.1|HSM801429 Homo sapiens mRNA; cDNA DKFZp434J1672 (... 184 5e-44
dbj|AP003534.1|AP003534 Homo sapiens genomic DNA, chromosome 8q2... 76 3e-11
gb|AC084633.1|CBRG45G04 Caenorhabditis briggsae cosmid G45G04, c... 44 0.11
dbj|AB018110.1|AB018110 Arabidopsis thaliana genomic DNA, chromo... 44 0.11