Post on 05-Jan-2016
Exploring Williams-Beuren Syndrome using myGrid
R.D. Stevens,a H.J. Tipney,b C.J. Wroe,a T.M. Oinn,c
M. Senger,c P.W. Lord,a C.A. Goble,a A. Brass,a
M. Tassabehji b
a Department of Computer Science
University of Manchester
b University of Manchester,
Academic Unit of Medical Genetics
St Mary’s Hospital
c European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Williams-Beuren Syndrome (WBS)
• Congenital disorder caused by sporadic gene deletion
• 1/20,000 live births• Effects multiple systems – muscular,
nervous, circulatory• Characteristic facial features• Unique cognitive profile• Mental retardation (IQ 40-100,
mean~60, ‘normal’ mean ~ 100 )• Outgoing personality, friendly nature,
‘charming’• Haploinsuffieciency of the region
results in the phenotype
Williams-Beuren Syndrome Microdeletion
Chr 7 ~155 Mb
~1.5 Mb7q11.23
GT
F2I
RF
C2
CY
LN
2
GT
F2I
RD
1
NC
F1
WB
SC
R1/
E1f
4H
LIM
K1
EL
N
CL
DN
4
CL
DN
3
ST
X1A
WB
SC
R18
WB
SC
R21
TB
L2
BC
L7B
BA
Z1B
FZ
D9
WB
SC
R5/
LA
B
WB
SC
R22
FK
BP
6
PO
M12
1
NO
LR
1
GT
F2I
RD
2
C-c
en
C-m
id
A-c
en
B-m
id
B-c
en
A-m
id
B-t
el
A-t
el
C-t
el
WB
SC
R14
ST
AG
3P
MS
2L
Block A
FK
BP
6T
PO
M12
1N
OL
R1
Block C
GT
F2I
PN
CF
1PG
TF
2IR
D2P
Block B
**
WBS
SVAS
Patient deletions
CTA-315H11
CTB-51J22
‘Gap’
Physical Map
Eicher E, Clark R & She, X An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Genetics Reviews (2004) 5:345-354Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164
1. Identify new, overlapping sequence of interest2. Characterise the new sequence at nucleotide and amino acid
level
Cutting and pasting between numerous web-based services i.e. BLAST, InterProScan etc
Filling a genomic gap in Silico
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt 12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg 12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga 12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc 12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa 12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Filling a genomic gap in silico
• Frequently repeated – info rapidly added to public databases• Time consuming and mundane• Don’t always get results• Huge amount of interrelated data is produced – handled in notebooks and
files saved to local hard drive• Much knowledge remains undocumented: Bioinformatician does the analysis
Advantages: Specialist human intervention at every step, quick and easy access to distributed services
Disadvantages: Labour intensive, time consuming, highly repetitive and error prone process, tacit procedure so difficult to share both protocol and results
Why Workflows and Services?
Workflow = general technique for describing and enacting a processWorkflow = describes what you want to do, not how you want to do itWeb Service = how you want to do itWeb Service = automated programmatic internet access to applications
• Automation– Capturing processes in an explicit manner– Tedium! Computers don’t get bored/distracted/hungry/impatient!– Saves repeated time and effort
• Modification, maintenance, substitution and personalisation • Easy to share, explain, relocate, reuse and build• Available to wider audience: don’t need to be a coder, just need to know
how to do Bioinformatics • Releases Scientists/Bioinformaticians to do other work• Record
– Provenance: what the data is like, where it came from, its quality– Management of data (LSID - Life Science Identifiers)
myGrid
• E-Science pilot research project funded by EPSRC www.mygrig.org.uk
• Manchester, Newcastle, Sheffield, Southampton, Nottingham, EBI and RFCGR, also industrial partners.
• ‘targeted to develop open source software to support personalised in silico experiments in biology on a grid.’
www.mygrid.org.uk
Which means….
Distributed computing – machines, tools, databanks, people
Personalisation
Provenance and Data management
Enactment and notification
A virtual lab ‘workbench’, a toolkit which serves life science communities.
Workflow Components
Scufl Simple Conceptual Unified Flow LanguageTaverna Writing, running workflows & examining resultsSOAPLAB Makes applications available
Freefluo Workflow engine to run workflows
Freefluo
SOAPLABWeb Service
Any Application
Web Service e.g. DDBJ BLAST
GenBank Accession No
GenBank Entry
Seqret
Nucleotide seq (Fasta)
GenScanCoding sequence
ORFs
prettyseq
restrict
cpgreport
RepeatMasker
ncbiBlastWrapper
sixpack
transeq
6 ORFs
Restriction enzyme map
CpG Island locations and %
Repetitive elements
Translation/sequence file. Good for records and publications
Blastn Vs nr, est databases.
Amino Acid translation
epestfind
pepcoil
pepstats
pscan
Identifies PEST seq
Identifies FingerPRINTS
MW, length, charge, pI, etc
Predicts Coiled-coil regions
SignalPTargetPPSORTII
InterPro
Hydrophobic regions
Predicts cellular location
Identifies functional and structural domains/motifs
Pepwindow?Octanol?
BlastWrapper
URL inc GB identifier
tblastn Vs nr, est, est_mouse, est_human databases.Blastp Vs nr
RepeatMasker
Query nucleotide sequence
BLASTwrapper
Sort for appropriate Sequences only
Pink: Outputs/inputs of a servicePurple: Tailor-made servicesGreen: Emboss soaplab services Yellow: Manchester soaplab services
RepeatMasker
TF binding Prediction
Promotor Prediction
Regulation Element Prediction
Identify regulatory elements in genomic sequence
Williams Workflow Plan
A B C
The Williams Workflows
A: Identification of overlapping sequenceB: Characterisation of nucleotide sequenceC: Characterisation of protein sequence
The Workflow Experience
• Correct and Biologically meaningful results• Automation
– Saved time, increased productivity– Process split into three, you still require humans!
• Sharing– Other people have used and want to develop the workflows
• Change of work practises– Post hoc analysis. Don’t analyse data piece by piece receive
all data all at once– Data stored and collected in a more standardised manner– Results amplification– Results management and visualisation
Have workflows delivered on their promise? YES!
The Workflow Experience
• Activation energy versus Reusability trade-off– Lack of ‘available’ services, levels of redundancy can be limited – But once available can be reused for the greater good of the
community
• Licensing of Bioinformatics Applications– Means can’t be used outside of licensing body– No license = access third-party websites
• Instability of external services– Research level– Reliant on other peoples servers– Taverna can retry or substitute before graceful failure
• Shims
shim (sh m) n. A thin, often tapered piece of material used to fill gaps, make something level, or adjust something to fit properly. shimmed, shim·ming, shimsTo fill in, level, or adjust by using shims or a shim.
Shims
• Explicitly capturing the process• Unrecorded ‘steps’ which aren’t realised until attempting to build something• Enable services to fit together
Shims
Sequencei.e. last known 3000bp
Mask BLASTIdentify new sequences and determine their degree of identity
Sequence database entryFasta format sequenceGenbank format sequence
Alignment of full query sequence V full ‘new’ sequence
Old BLAST result
Simplify and Compare
Lister
Retrieve
BLAST2
‘I want to identify new sequences which overlap with my query sequence and determine if they are useful’
The Biological Results
CTA-315H11 CTB-51J22
EL
N
WB
SC
R14
RP11-622P13 RP11-148M21 RP11-731K22
314,004bp extension
All nine known genes identified
CL
DN
4
CL
DN
3
ST
X1A
WB
SC
R18
WB
SC
R21
WB
SC
R22
WB
SC
R24
WB
SC
R27
WB
SC
R28
Four workflow cycles totalling ~ 10 hoursThe gap was correctly closed and all known features identified
Conclusions
• It works – a new tool has been developed which is being utilised by biologists
• More regularly undertaken, less mundane, less error prone
• Once notification is installed won’t even need to initiate it• More systematic collection and analysis of results• Increased productivity• Services: only as good as the individual services, lots of
them, we don’t own them, many are unique and at a single site, research level software, reliant on other peoples services, licenses
• Activation energy
Future Directions
• Scheduling and Notification
• Portals
• Results visualisation
• Re-use: other genomic disorders, Graves Disease
Acknowledgments
• Dr May Tassabehji
• Prof Andy Brass
• Medical Genetics team at St Marys Hospital, Manchester
• Wellcome Trust
www.mygrid.org.uk
myGrid Peoplewww.mygrid.org.uk
CoreMatthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro
Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pockock Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
UsersSimon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical
Medical Sciences, University of Newcastle, UKHannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UKPostgraduatesMartin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith
Flanagan, Antoon Goderis, Tracy Craddock, Alastair HampshireIndustrial Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)Robin McEntire (GSK)CollaboratorsKeith Decker