Discussion Points for 2nd Pseudogene Call
Mark Gerstein
2005,09.22 11:00 EST
Intersection of Pseudogenes from Three Groups: Original
Havana-Gencode: 167 pseudogenes
Yale: 184 pseudogenes
UCSC retrogenes: 15 expressed (7-8 pseudogenes) + 143 not expressed (all pseudogenes)
[Venn diagram of the three sets. Region counts as extracted, positions not recoverable: 86; 8787; 16 18; 22; 17; 4535 21; 42; 18]
Provided by France.
86 Havana pseudogenes overlap with at least one Yale pseudogene, and 87 Yale pseudogenes overlap with at least one Havana pseudogene (the same kind of count applies to the retrogenes). These are global counts, so the two numbers need not match: at some loci three Havana pseudogenes may overlap a single Yale pseudogene, while at other loci several Yale pseudogenes overlap one Havana pseudogene.
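The asymmetry between the two overlap counts can be illustrated with a minimal Python sketch. The coordinates and the helper names `overlaps` and `count_overlapping` are hypothetical and are not taken from the ENCODE data; intervals are (start, end) pairs on a single chromosome, treated as closed, with strand ignored.

```python
def overlaps(a, b):
    """True if closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def count_overlapping(query, target):
    """Number of intervals in `query` that overlap any interval in `target`."""
    return sum(any(overlaps(q, t) for t in target) for q in query)

# Hypothetical toy loci: three Havana calls fall under one long Yale call,
# and two Yale calls fall under one Havana call.
havana = [(100, 200), (250, 300), (320, 400), (1000, 1100)]
yale = [(150, 350), (1050, 1060), (1090, 1200), (5000, 5100)]

havana_hits = count_overlapping(havana, yale)  # Havana calls overlapping any Yale call
yale_hits = count_overlapping(yale, havana)    # Yale calls overlapping any Havana call
```

In this toy set the two directions give different counts, which is the same effect as the 86 versus 87 on the slide.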
Intersection of Pseudogenes from 4 Groups: Updated
Havana-Gencode: 167 pseudogenes
Yale: 164 pseudogenes
UCSC retrogenes: 146 not expressed
[Venn diagram of the three sets. Region counts as extracted (GIS counts in parentheses), positions not recoverable: 82 (34); 17 (7); 33 (1); 15 (1); 14 (2); 16 (0); 52 (2)]
• The numbers in parentheses are pseudogenes from GIS.
• All from http://pseudogene.org/ENCODE/cross-ref
• Pseudo-exons were merged to form pseudogenes and used for this comparison (now a pseudogene has only a single start and end)
• Strand information is ignored
• There are a total of 229 pseudogenes in the union
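The pseudo-exon merging step described for this comparison can be sketched as follows. This is a minimal illustration, assuming each record is a hypothetical (pseudogene_id, chrom, start, end) tuple and that all pseudo-exons of one pseudogene lie on the same chromosome; the function name `merge_pseudo_exons` and the sample records are not from the source.

```python
def merge_pseudo_exons(exons):
    """Collapse pseudo-exon records into one (chrom, start, end) span per pseudogene."""
    spans = {}
    for pid, chrom, start, end in exons:
        if pid in spans:
            c, s, e = spans[pid]
            # Extend the span to cover this pseudo-exon as well
            spans[pid] = (c, min(s, start), max(e, end))
        else:
            spans[pid] = (chrom, start, end)
    return spans

# Hypothetical records: psi1 has two pseudo-exons, psi2 has one
exons = [
    ("psi1", "chr21", 500, 650),
    ("psi1", "chr21", 100, 220),
    ("psi2", "chr22", 40, 90),
]
spans = merge_pseudo_exons(exons)
```

After merging, each pseudogene has a single start and end, and strand is simply not carried along.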
Intersection of Pseudogenes from 4 Groups: Non-processed Consensus
                      GENCODE Processed    GENCODE Non-Processed
Yale Processed        7 / 8                5 / 5
Yale Non-Processed    4 / 4                39 / 37
Rough agreement now is:
82 + 52 – 7 = 127 from 229 total
What to do with the remaining 102 (229 – 127)?
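The arithmetic behind these totals can be checked with a short Python sketch. The seven region counts are the ones listed with the updated Venn diagram; the slide does not state which regions the 82 and 52 belong to or what the subtracted 7 represents, so the variable names here are only descriptive assumptions.

```python
# Region counts listed with the updated Venn diagram
region_counts = [82, 17, 33, 15, 14, 16, 52]
union_total = sum(region_counts)   # should match the stated union of 229

# Agreement figure exactly as written on the slide
agreed = 82 + 52 - 7

# Pseudogenes still lacking consensus
remaining = union_total - agreed
```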
How to Pick Pseudogenes for RT-PCR?
• Start with the intersection of 127
• Duplicated vs. processed: how many of each? (2:1?)
• Rank pseudogenes:
  – By likelihood to be transcribed according to ENCODE evidence
    • ditag, then CAGE, then tiling array
  – By their uniqueness in the genome
    • Good primers
    • Non cross-hybridizing probes
• How to get a consistent rank?
• Who will do RT-PCR?
• What coordinates to use?
• (Ignore 1 processed pseudogene already being sequenced by GIS group.)
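One possible way to get a consistent rank is a fixed sort key, sketched below in Python. This is an assumption about how the criteria could be combined, not a method agreed on the call: evidence is compared in the stated priority order (ditag, then CAGE, then tiling array), and ties are broken by a hypothetical uniqueness proxy, `genomic_copies`. All candidate names and values are invented for illustration.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    ditag: bool          # supported by ditag evidence
    cage: bool           # supported by CAGE evidence
    tiling: bool         # supported by tiling-array evidence
    genomic_copies: int  # hypothetical uniqueness proxy; lower is more unique

def rank_key(c):
    # False sorts before True, so "has evidence" is negated to rank first.
    # The name is the final tie-breaker so the order is deterministic.
    return (not c.ditag, not c.cage, not c.tiling, c.genomic_copies, c.name)

# Invented candidates
candidates = [
    Candidate("psiA", False, True, True, 1),
    Candidate("psiB", True, False, False, 3),
    Candidate("psiC", True, False, False, 1),
    Candidate("psiD", False, False, True, 1),
]
ranked = [c.name for c in sorted(candidates, key=rank_key)]
```

Because the key is a plain tuple, every group applying it to the same evidence table would obtain the same order.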
How to generate a consensus for remaining 102 pseudogenes?
• Stick with the intersection of 127
• Develop consistent criteria for identifying pseudogenes and apply them uniformly to ENCODE
  – E.g. protein matches with disablements found from a pipeline
  – Ignores tricky cases flagged by manual annotation
• Do a simple union of UCSC, Havana & Yale, giving 229
  – GIS is a subset of the other 3
  – Describe pseudogenes as being identified by multiple approaches and then explicitly flag each group’s unique ones in final annotation
  – Easy but perhaps biases stats
• Do a qualified union
  – Allow each group to “question” particular pseudogenes in another’s set
  – Send questions around and then have a call to sort out differences
  – Need a way to arbitrate
  – e.g. we could demand an obvious disablement
  – We might learn something!
• How do we represent this in the browser & in stats?
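The "simple union with flags" option can be sketched in Python as below. This is a minimal illustration under stated assumptions: calls are hypothetical (group, chrom, start, end) tuples with closed coordinates, strand is ignored, and overlapping calls are chained into one locus. The function name `flagged_union` and the data are not from the source.

```python
def flagged_union(calls):
    """Merge overlapping calls into loci and record which groups support each locus."""
    loci = []
    for group, chrom, start, end in sorted(calls, key=lambda c: (c[1], c[2], c[3])):
        last = loci[-1] if loci else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            # Overlaps the current locus: extend it and add the group flag
            last["end"] = max(last["end"], end)
            last["groups"].add(group)
        else:
            loci.append({"chrom": chrom, "start": start, "end": end, "groups": {group}})
    return loci

# Hypothetical calls
calls = [
    ("Havana", "chr1", 100, 200),
    ("Yale", "chr1", 150, 260),
    ("UCSC", "chr1", 240, 300),
    ("Yale", "chr1", 900, 950),
    ("Havana", "chr2", 100, 200),
]
loci = flagged_union(calls)
in_all_three = [l for l in loci if len(l["groups"]) == 3]
unique_to_one = [l for l in loci if len(l["groups"]) == 1]
```

Note that chaining can join calls that do not overlap each other directly (here the UCSC call touches only the Yale call), so counts from this kind of clustering may differ from pairwise overlap counts.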
Once we have consensus, how to agree on pseudogene boundaries?
• Keep each group’s boundaries unchanged
  – If pseudogenes overlap, take largest region (union) or smallest
• Develop uniform criteria for assigning pseudogene boundaries and apply them to each of the pseudogenes in the consensus set
  – Could just take each pseudogene in the consensus and have one group realign it against its parent
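The two boundary choices for overlapping calls can be sketched in Python. Here "smallest" is interpreted as the region common to all groups (the intersection), which is an assumption; the function name `consensus_boundaries` and the coordinates are hypothetical.

```python
def consensus_boundaries(spans, mode="union"):
    """Combine (start, end) spans from different groups for one pseudogene."""
    starts = [s for s, _ in spans]
    ends = [e for _, e in spans]
    if mode == "union":
        # Largest region: covers every group's call
        return min(starts), max(ends)
    if mode == "intersection":
        # Smallest region: only what every group's call covers
        start, end = max(starts), min(ends)
        if start > end:
            raise ValueError("spans do not share a common region")
        return start, end
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical calls for one consensus pseudogene from three groups
spans = [(100, 220), (120, 260), (90, 240)]
largest = consensus_boundaries(spans, "union")
smallest = consensus_boundaries(spans, "intersection")
```

Realigning each consensus pseudogene against its parent, the other option on the slide, would replace both of these with boundaries derived from the alignment itself.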