Bioinformatics foam 2013 program and abstracts

FOAM (Focus On Analytical Methods) Bioinformatics 2013

Description

Bioinformatics Focus On Analytical Methods (FOAM) 2013 was run as part of CSIRO’s Computational and Simulation Sciences and eResearch Annual Conference and Workshops, and sponsored by the CSIRO Bioinformatics Core and The Australian Bioinformatics Network (ABN). The first half of FOAM 2013 was aimed at CSIRO bioinformaticians, computational biologists and quantitative bioscientists, recognising that this is a once-a-year opportunity for staff across Australia to get together to discuss CSIRO-specific issues. The second half of the meeting was aimed at bioinformaticians, computational biologists and quantitative bioscientists in general. Feedback to the ABN indicated a preference to hold bioinformatics-oriented meetings in conjunction with other events, rather than initiating a standalone conference (at least for the time being). CSIRO’s CSS conference gave us a great opportunity to hold a very affordable (i.e., free to members) ABN event at a great location in a city with a high concentration of Australian life-science research. We saw a diverse and engaging agenda of presentations, reflecting the breadth of research that falls under the heading “bioinformatics”. We encourage you to get a sense of the event by checking out the presentations uploaded to the Australian Bioinformatics Network Slideshare: http://www.slideshare.net/AustralianBioinformatics/tag/bioinformatics-foam-2013


Focus On Analytical Methods (FOAM)

Bioinformatics 2013


irony, n. pron: /aɪərənɪ/, as in “This page was intentionally blank until we put this footer in”

Welcome

Dear Colleagues,

Welcome to Bioinformatics Focus On Analytical Methods (FOAM) 2013, run as part of CSIRO’s Computational and Simulation Sciences and eResearch Annual Conference and Workshops, and sponsored by the CSIRO Bioinformatics Core and The Australian Bioinformatics Network (ABN).

The first half of FOAM 2013 is aimed at CSIRO bioinformaticians, computational biologists and quantitative bioscientists, recognising that this is a once-a-year opportunity for staff across Australia to get together to discuss CSIRO-specific issues.

The second half of the meeting is aimed at bioinformaticians, computational biologists and quantitative bioscientists in general. Feedback to the ABN indicated a preference to hold bioinformatics-oriented meetings in conjunction with other events, rather than initiating a standalone conference (at least for the time being). CSIRO’s CSS conference gives us a great opportunity to hold a very affordable (i.e., free to members) ABN event at a great location in a city with a high concentration of Australian life-science research.

We have a diverse and engaging agenda of presentations, reflecting the breadth of research that falls under the heading “bioinformatics”. We also want to encourage you to use this opportunity to meet new colleagues and catch up with old friends and will again be holding a special Bioinformatics FOAM dinner. We will be privileged to hear from Graham Cameron, Director of the Bioinformatics Resource Australia – EMBL (BRAEMBL), at that event.

We hope you enjoy Bioinformatics FOAM 2013 and welcome your feedback and ideas about how to make future events even better.

With best wishes from the Bioinformatics FOAM 2013 Organising Committee:

• Annette McGrath (CSIRO Bioinformatics Core Leader)

• David Lovell (Australian Bioinformatics Network Director)

• Lars Jermiin (OCE Science Leader in Genomics)


Bioinformatics FOAM 2013: Program and Abstracts


Start  Speaker  Running title

Day 1: Wednesday 20 March
13:30  Annette McGrath  Welcome to Day 1
13:35  Ross Crowhurst  A draft genome sequence of European pear (Pyrus communis L. ‘Bartlett’)
13:55  Jason Ross  “Stop_gap, measure”. Tools for handling deep bisulphite sequencing data
14:15  Tim Peters  Identifying differentially methylated regions in the human genome
14:35  Denis Bauer  Cancer from every angle
15:00  Afternoon tea
15:30  Annette McGrath  An update on Bioinformatics Core activities
15:50  Alec Zwart  Reproducible Research and R
16:10  Neil Saunders  Version control in bioinformatics: our experience using Git
16:25  Steve McMahon and Philippe Moncuquet  The CSIRO Galaxy Pilot
16:45  Sean Li  Annotation of the Helicoverpa genome

Day 2: Thursday 21 March
9:00  Shared keynote session with CSS conference
10:00  Morning tea
10:30  Group discussion  Bioinformatics at CSIRO
12:00  Lunch
13:00  Lars Jermiin  A revised phylogenetic protocol
13:15  Paul Greenfield  Error correction in primary sequence reads
13:30  Rob Lanfear  Identifying optimal partition schemes and models for molecular phylogenetic data
13:45  David Yeates  Phylogenetics in the context of collections-based research
14:00  Stuart Denman  The role of phylogenetics in the context of ecogenomics
14:15  Peter Grewe  Examining population/gene phylogenies: can old school allozyme techniques help guide Next Gen research?
14:30  Afternoon tea
14:45  David Lovell  Welcome to ABN members
14:50  Graham Cameron  Introducing Bioinformatics Resource Australia EMBL (BRAEMBL)
15:30  Roy Storey  Ensembl for non-model organisms
15:50  Bruno Gaeta  Characterising the human immunoglobulin heavy chain locus by ultra-deep sequencing of rearranged immunoglobulin genes
16:15  CSS conference close
18:30  Bioinformatics Dinner  Dinner speaker: Graham Cameron on "The Genesis of EBI"


Day 3: Friday 22 March
9:00  Joint session with Visualisation in Science workshop
9:00  Ajay Limaye  Tools for Effective Volume Exploration
9:30  Felice Frankel  Communicating Science Visually (live streaming of keynote from the VIZBI conference, Boston)
10:30  Morning tea
11:00  Cecilia Deng  Integration of WGS, RNA-Seq, and comparative genomics reveals the candidate effector repertoire of closely related Venturia pathogens of the Maloideae
11:20  Mani Grover  Association of detailed drug data with predicted candidate genes in Gentrepid
11:40  Melissa Davis  Rewiring the dynamic interactome: alternative splicing alters protein interactions across human tissues
12:00  Paul Berkman  GAME: modelling a gene's-eye view of evolution
12:15  Vidana Epa  Accurate Structural Modelling of the Interaction of a Designed Ankyrin Repeat Protein with the Human Epidermal Growth Factor Receptor 2
12:30  Lunch
13:30  Andrew Lonie  Progress on the Genomics Virtual Laboratory
13:50  Ross Lazarus  Transmuting dark script matter into reproducible tools
14:10  Tamsyn Crowley  Milking the pigeon
14:30  Nathan Hall  Bioinformatician – more than just a number cruncher (or bridging the gap between computer scientists and biologists)
15:00  Afternoon tea
15:30  Lauren Bragg  Shining a light on dark sequencing: Characterising errors in Ion Torrent PGM data
15:50  Ken Doig  PipeCleaner: Sanitation for your NGS pipeline
16:10  Tony Papenfuss  Making sense of tumours in man, mouse and devils
16:30  David Lovell  Australian Bioinformatics Network: update to members
17:00  Workshop close


Abstracts

Denis Bauer Cancer from every angle

In order to understand disease states or cancer progression, we need to gain a better insight into the interplay of different regulatory mechanisms in the cell. However, modern high-throughput data generation allows us to capture only a discrete snapshot of cellular regulation, e.g. RNA, DNA, methylation. Our goal is hence to build predictive models from these layers of discrete 'omics data that capture the continuous regulatory interplay to inform medical genomics research. To achieve this, we generated matched genetics, transcriptomics, epigenomics as well as microbiomics data from lean and obese colorectal cancer patients. We employ statistical and machine learning methods that integrate information from the different 'omics data sources at single-base resolution to identify regions with functional relevance for cancer development and prognosis.

Paul Berkman GAME: modelling a gene's-eye view of evolution

It is now nearly half a century since the establishment of game theory as a mechanism for studying evolution. While the primary application of this work has been at the population and species level, the gene's-eye view of evolution was postulated only shortly after evolutionary game theory itself. However, an experimental or empirical approach to the gene's-eye view has not been well developed, primarily due to the challenges associated with measuring how genes act as agents over the course of evolution, with the first mathematical theory describing this perspective only published in 2011. Major advances in our understanding of the core tenets of genetics and biochemistry over the last few decades are providing the data needed to calibrate the gene's-eye approach, and high-throughput sequencing technologies promise to provide even more such data. In this talk I will present GAME (Gene-Agent Modelling of Evolution), a software package designed for agent-based modelling of evolution from the gene perspective. This model provides a simulation of changes to the value and fitness of individual genes in a population of organisms over time. I will present preliminary results regarding the impacts of mutation and allelic diversity over time, testing the hypothesis that greater allelic diversity at a locus results in greater fitness for that locus.

Lauren Bragg Shining a light on dark sequencing: Characterising errors in Ion Torrent PGM data

The highly anticipated Ion Torrent Personal Genome Machine (PGM) debuted on the sequencing market in 2011. Novel platform design, most notably the measurement of pH changes to detect polymerisation events, yielded the first sequencing platform under $100K. The long-read lengths (now 400bp) and marketed high base-accuracy suggested that the PGM would supersede the Roche 454 platform for most applications. To identify potential applications for this new technology, I analysed a number of re-sequencing datasets generated using the PGM, investigating the errors and biases introduced by the PGM library preparation and sequencing process. In this presentation I will be discussing these results and how errors/biases introduced by the PGM platform may compromise specific studies.

Graham Cameron Bioinformatics Resource Australia/EMBL (BRAEMBL)

The EMBL Australia Bioinformatics Resource developed out of the EBI Mirror Project, whose goal was to create a “mirror” of EBI databases and services at UQ to serve Australia. This was motivated in large part by a desire to remove perceived disadvantages in the exploitation of bioinformatics data and tools due to Australia’s geographical remoteness and its network connectivity to the rest of the world.


It turns out that mirroring the EBI in its entirety is impossible and probably not even desirable. Many of the services of the EBI depend on a complex and extensive IT context, and even for “mirrorable” services there is a real difficulty in keeping up with the data releases and updates from the original source. This has caused us to re-examine our goals and to cast the mission in less specific terms. It is to:

• enable optimal exploitation of the tools and data of bioinformatics by Australian scientists
• contribute to the global biomolecular information infrastructure in a way which showcases Australian science.

This mission is entirely compatible with the underlying motivation for the mirror project, but is agnostic about the solution. Alongside the Mirror project, a related project at UQ, the Specialised Facility in Bioinformatics (SFB), provides compute capability to Australian bioinformatics. Our respecified mission is as fitting for the SFB as for the Mirror, and the two projects have been unified under this mission as a single project, the “Bioinformatics Resource Australia/EMBL” (BRAEMBL). We are now working out what is required in practical terms in pursuit of the BRAEMBL mission. The first stage of this was a data-gathering exercise: a survey of bioinformatics activities and needs in Australia. I will:

• present the key findings of this survey as indicators of the activities, mood and desires of our scientific constituency

• give my opinion about the global trends in IT in the life sciences
• present some emerging ideas about how we might marry modern IT and Australian bioinformatics
• give some thoughts about components beyond BRAEMBL necessary to a healthy Australian bioinformatics ecosystem.

Ross Crowhurst, Chagné D, Pindo M, Thrimawithana A, Deng C, Ireland H, Fiers M, Dzierzon H, Cestaro A, Lu A, Storey R, Knaebel M, Saeed M, Montanari S, Kim YK, Nicolini D, Larger S, Stefani E, Allan AC, Bowen J, Johnston J, Malnoy M, Troggio M, Perchepied L, Sawyer G, Wiedow C, Won KH, Viola R, Hellens R, Brewer L, Bus VGM, Schaffer R, Gardiner SE, Velasco R

A draft genome sequence of European pear (Pyrus communis L. ‘Bartlett’)

We have sequenced the genome of European pear, Pyrus communis cultivar ‘Bartlett’/‘Williams’ Bon Chrétien’, using second-generation sequencing technology (Roche 454). A draft assembly was produced from single-end reads and from 2 kb and 8 kb insert paired-end reads using Newbler (version 2.7). The assembly contained 142,083 scaffolds greater than 499 bases (maximum scaffold length 1.29 Mb), covering a total of 577.3 Mb and representing 96.1% of the expected 600 Mb Pyrus genome. Gene prediction using Augustus (version 2.6.1) predicted 50,703 models, of which 5339 proteins are unique to European pear. Preliminary analysis indicated that 2279 SNP markers anchored 171 Mb of the assembled genome. Further analysis is in progress to improve anchoring. This preliminary ‘Bartlett’ genome sequence is a unique tool for identifying the genetic control of key horticultural traits and for developing better pear cultivars, enabling wide application of marker-assisted and genomic selection.

Tamsyn Crowley Milking the Pigeon

The pigeon is one of only a few birds that produce a nutrient substance ‘crop milk’ to feed their young. This nutrient substance is produced in the crop by both male and female birds and has been shown to have functional similarities with mammalian milk. As with mammalian milk, crop milk is essential for squab growth, providing both nutritional and immune benefits. We have spent the last few years studying this interesting biological phenomenon employing many different tools, including bioinformatics. Until recently there was little genomic information available, hence we have utilised bioinformatics and experimental biology in order to gain an insight into the production and benefits of pigeon crop milk.

David LA Wood, Mark A Ragan, Nicole Cloonan, Sean M Grimmond and Melissa J Davis

Rewiring the dynamic interactome: alternative splicing alters protein interactions across human tissues

Transcriptomics continues to provide ever-more evidence that in morphologically complex eukaryotes, each protein-coding genetic locus can give rise to multiple transcripts that differ in length, exon content and/or other sequence features. In humans, the majority of loci give rise to multiple transcripts in this way. Motifs that mediate protein-protein interactions can be present or absent in these transcripts. Analysis of protein interaction networks has been a valuable development in systems biology. Interactions are typically recorded for representative proteins or even genes, although exploratory transcriptomics has revealed great spatiotemporal diversity in the output of genes at both the transcript and protein-isoform levels. The increasing availability of high-resolution protein structures has made it possible to identify the domain-domain interactions that underpin many protein interactions. Thus we are able to identify protein isoforms that gain or lose the ability to interact with other proteins by identifying the interaction domains present or absent in the set of isoforms produced from a given gene. Here we explore the impact of transcript and isoform diversity on protein interactions in 16 phenotypically normal human tissues. We use the sequenced transcriptomes of these tissues to interrogate the protein-coding transcriptional output of genes, identifying tissue-specific variation in the inclusion of protein interaction domains. We map these data to a set of high-quality protein interactions, and characterise the variation in network connectivity likely to result from tissue specific alternative splicing. We find strong evidence for altered interaction potential in many genes, suggesting that transcriptional variation can significantly rewire the human interactome. 
We further identify interactions that are widespread and supported at the transcript level across most human tissues, as well as interactions that are restricted to a single tissue or a small number of tissues. Our work highlights the rewiring of interaction networks resulting from alternate transcriptional events and underpinning the unique molecular interaction systems of each tissue.
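The domain-centric reasoning in this abstract can be sketched with simple set logic: an interaction is possible in a tissue only if each partner expresses at least one isoform retaining the mediating domain. All names below (genes, domains, tissues) are invented for illustration, not data from the study:

```python
# Sketch of domain-based interaction rewiring (all identifiers invented):
# an interaction is "possible" in a tissue only if each partner gene
# expresses at least one isoform that retains the mediating domain.

# Domains retained by each isoform expressed in a given tissue.
tissue_isoform_domains = {
    "liver": {
        "GENE_A": [{"SH3", "kinase"}],   # full-length isoform
        "GENE_B": [{"proline_rich"}],
    },
    "brain": {
        "GENE_A": [{"kinase"}],          # only an SH3-less isoform expressed
        "GENE_B": [{"proline_rich"}],
    },
}

# A domain-domain interaction underpinning the GENE_A--GENE_B interaction.
interaction = ("GENE_A", "SH3", "GENE_B", "proline_rich")

def interaction_possible(tissue, inter):
    """True if both partners express an isoform carrying the required domain."""
    gene1, dom1, gene2, dom2 = inter
    isoforms = tissue_isoform_domains[tissue]
    has1 = any(dom1 in doms for doms in isoforms.get(gene1, []))
    has2 = any(dom2 in doms for doms in isoforms.get(gene2, []))
    return has1 and has2

for tissue in ("liver", "brain"):
    print(tissue, interaction_possible(tissue, interaction))
# liver True, brain False: tissue-specific splicing removes the interaction.
```

The same lookup, run over thousands of curated domain-domain interactions and tissue transcriptomes, yields the network-level rewiring the abstract describes.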

Stuart Denman The Role of Phylogenetics in the Context of Ecogenomics

Culture-based methods focused on isolating and describing specific populations from within an environment are time consuming and heavily biased by the selected isolation media and methods employed. Culture-independent methods were devised to overcome these shortfalls. By far the majority of these studies use molecular (DNA-based) markers to identify and describe the microbes present and their changing abundance within these ecosystems. These range from methods that produce a high-level/low-resolution “fingerprint” of the community through to high-resolution phylogenetically targeted methods and metagenomic phylogenetic assignment.

Cecilia Deng, Daniel Jones, Bruno Le Cam, Kim Plummer, Carl Mesarich, Matthew Templeton and Joanna Bowen

Integration of WGS, RNA-Seq, and comparative genomics reveals the candidate effector repertoire of closely related Venturia pathogens of the Maloideae

Host specificity is exhibited by different species and races of Venturia, a fungus that infects members of the Maloideae. V. inaequalis causes the economically important disease apple scab; however, certain isolates classified as V. inaequalis infect loquat but not Malus. V. pirina infects the related woody host European pear. The genetics of the interaction between apple and V. inaequalis follow the gene-for-gene model. Effectors (small pathogen proteins required for infection) are secreted into the plant/pathogen interface to suppress defence/enhance infection. A subset of effectors can be recognised by plant resistance gene (R) products to induce resistance. Seventeen gene-for-gene pairings between effector and R genes have been identified to date. The effector repertoire of Venturia isolates determines their cultivar specificity and probably their host specificity. Draft genomes of three V. inaequalis isolates (two from apple, one from loquat) and of a V. pirina isolate have been assembled using second-generation sequencing data. Additionally, RNA sequencing data have been obtained from samples taken at two time points after inoculation, both in planta and in vitro. Comparative analysis of the predicted proteomes of the four Venturia isolates, coupled with detection of differential gene expression levels, has enabled the identification of candidate effectors determining host range. Eighty-four effectors are unique to V. pirina, six are unique to the V. inaequalis isolate specific to loquat, and 145 are specific to the apple-infecting isolates. These effector candidates are currently being characterised with respect to functionality.
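The host-range comparison described here reduces to set operations over the predicted effector complements of the four isolates. A toy sketch with invented effector IDs (not the study's real candidates or counts):

```python
# Toy illustration (invented effector IDs): candidate host-range effectors
# found by set comparisons across predicted effector complements of isolates.

effectors = {
    "V_pirina":            {"e1", "e2", "e7"},
    "V_inaequalis_loquat": {"e3", "e7"},
    "V_inaequalis_apple1": {"e4", "e5", "e7"},
    "V_inaequalis_apple2": {"e4", "e6", "e7"},
}

def unique_to(isolate):
    """Effectors present in one isolate and absent from all others."""
    others = set().union(*(v for k, v in effectors.items() if k != isolate))
    return effectors[isolate] - others

# Effectors found in the apple-infecting isolates but in neither the
# pear pathogen nor the loquat-specific isolate.
apple_specific = (effectors["V_inaequalis_apple1"]
                  | effectors["V_inaequalis_apple2"]) \
                 - effectors["V_pirina"] - effectors["V_inaequalis_loquat"]

print(sorted(unique_to("V_pirina")))   # ['e1', 'e2']
print(sorted(apple_specific))          # ['e4', 'e5', 'e6']
```

In practice the sets would come from orthology clustering of predicted secretomes, with differential expression used as a second filter, as the abstract describes.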

Ken Doig and Jason Ellul

PipeCleaner: Sanitation for your NGS pipeline

The increasing affordability of sequencing platforms has led to their widespread adoption beyond research groups. The infiltration of desktop sequencers into the clinic has meant many institutions have had to beg, borrow or steal analysis pipelines that can crunch locally generated data piles. These pipelines are needed to refine the voluminous data generated by next-generation sequencing (NGS) platforms. Ideally, they transform raw sequencing reads into meaningful biological data suitable for clinical reporting or research analysis. Unfortunately, there is little consensus across labs on how this should be done and, indeed, there is such a vast range of software components with varying attributes that there is unlikely to be any standardisation in the near future. Here we present PipeCleaner as implemented at the Peter MacCallum Cancer Centre and describe its operation and utility in developing robust clinical pipelines. We will also present a number of sequencing scenarios where its application has enhanced our internal amplicon somatic mutation

Vidana C. Epa, Olan Dolezal, Larissa Doughty, Xiaowen Xiao, and Timothy E. Adams

Accurate Structural Modelling of the Interaction of a Designed Ankyrin Repeat Protein with the Human Epidermal Growth Factor Receptor 2

The human epidermal growth factor receptor 2 (HER2) is over-expressed in a significant proportion of breast cancers and is a target for therapeutic intervention with monoclonal antibodies and small molecule inhibitors. The novel binding proteins called Designed Ankyrin Repeat Proteins (DARPins) can be selected to be high-affinity binders to targets. The DARPin H10-2-G3 has been evolved to bind with picomolar affinity to HER2. In this work, we modelled the structure of the complex between the DARPin H10-2-G3 and HER2 using computational macromolecular docking. After analyzing the structural interface between the two proteins, we validated the structural model by showing that HER2 mutations at the putative interface significantly reduce binding to the DARPin but have no effect on binding to Herceptin, a HER2-specific monoclonal antibody. Very recently the X-ray crystal structure of this complex was solved and showed that the backbone RMSD between the computational model and the X-ray structure was better than 1 Angstrom. This work illustrates the utility of computational structural biology methodologies in elucidating the details of protein-protein interactions.

Bruno Gaeta Characterising the human immunoglobulin heavy chain locus by ultra-deep sequencing of rearranged immunoglobulin genes

The study of inherited variation in the immunoglobulin heavy chain (IGH) locus has lagged behind that of other loci. This locus undergoes recombination during B-lymphocyte differentiation, as well as somatic hypermutation after antigen challenge, and the resulting variation is difficult to distinguish from inherited polymorphisms. In addition, most large-scale human genomics projects (including the Human Genome Project and the 1000 Genomes Project) have ignored the IGH locus as they are based on sequencing DNA from lymphoblastoid cells in which the IGH locus has been recombined. As an alternative, our group has pioneered the use of ultra-deep sequencing of rearranged immunoglobulin genes to understand inherited variation in the germline locus. By sampling and comparing tens of thousands of rearranged sequences from an individual it is possible to identify the patterns of variation that are consistent with inherited polymorphisms instead of resulting from somatic mutation. It is also possible to genotype, and in some cases haplotype, the IGH loci for this individual. This approach has required the development of a whole new range of bioinformatics algorithms tailored to immunoglobulin genes, and has resulted in the discovery of several new polymorphisms as well as providing the basis for in-depth population analysis of the IGH locus. In this presentation I will outline the difficulties in applying standard genomic techniques to immunoglobulin genes and describe the bioinformatics methods we developed to study this unusual locus.
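The core inference sketched below is a much-simplified caricature of the approach: a variant recurring in a large fraction of independently rearranged sequences is more plausibly an inherited polymorphism than a somatic mutation, which should recur at much lower frequency. The threshold and counts are invented, not the group's actual algorithm:

```python
# Simplified sketch (threshold and data invented, not the published method):
# classify non-germline bases at one position by their recurrence frequency
# across many independently rearranged immunoglobulin sequences.

from collections import Counter

def classify_variants(observed_bases, germline_base, poly_threshold=0.2):
    """observed_bases: base called at one position in each rearranged read."""
    n = len(observed_bases)
    calls = {}
    for base, count in Counter(observed_bases).items():
        if base == germline_base:
            continue
        freq = count / n
        calls[base] = ("inherited polymorphism" if freq >= poly_threshold
                       else "somatic mutation")
    return calls

# ~Half the reads carry G (consistent with a heterozygous polymorphism);
# a few carry T (consistent with somatic hypermutation).
bases = ["A"] * 48 + ["G"] * 47 + ["T"] * 5
print(classify_variants(bases, germline_base="A"))
```

Real data would additionally require controlling for clonal expansion, sequencing error and allele usage bias, which is where the tailored algorithms the abstract mentions come in.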

Paul Greenfield Error correction in primary sequence reads

Sequence data is now cheap and becomes cheaper all the time. The only problem with all this data is that it isn’t perfect: it contains errors, some random, some more systematic. Commonly used tools, such as aligners and assemblers, know about these errors and deal with them in various ways, such as looking for consensus or doing error-tolerant string matching. Another way of dealing with sequencing errors is to correct them, and a number of error-correction algorithms have been published. This presentation looks at a number of these algorithms and discusses their effectiveness and performance. Do they actually work and correct the errors present in typical sequencing data? Are they sufficiently practical to be at all useful? How do you measure the effectiveness of a correction algorithm? Are these programs even worth running, or are existing aligners and assemblers already handling errors well enough? The results presented here come from a comparison of published algorithms done for the paper describing Blue (a fast correction algorithm based on consensus and context).
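A minimal sketch of the k-mer-spectrum idea that consensus-based correctors build on (this is an illustration of the general principle, not the Blue algorithm): k-mers seen rarely across a read set are likely to contain errors, and flagging them localises the error within a read.

```python
# Minimal k-mer-spectrum sketch (illustrative only, not the Blue algorithm):
# k-mers that occur rarely across a read set are likely to contain errors.

from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def untrusted_kmers(read, counts, k, min_count=2):
    """Start positions of k-mers in `read` seen fewer than min_count times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]

# Five clean copies of a read, plus one copy with a T->A error at index 3.
reads = ["ACGTACGT"] * 5 + ["ACGAACGT"]
counts = kmer_counts(reads, k=4)

# All four 4-mers covering the error are unique in the data, so all are flagged.
print(untrusted_kmers("ACGAACGT", counts, k=4))   # [0, 1, 2, 3]
print(untrusted_kmers("ACGTACGT", counts, k=4))   # []
```

A corrector then tries substitutions that turn the untrusted k-mers back into trusted ones; the questions raised in the abstract concern how reliably and cheaply that step works on real data.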

Peter Grewe Examining population/gene phylogenies: can old school allozyme techniques help guide Next Gen research?

Determining phylogenetic relationships between/among populations can lead to understanding population genetic relatedness and differentiation in a way that is useful for management. In management of marine fish populations, demonstrating stock delineation has been difficult due to low levels of differentiation between areas, even when separated by large distances. This low level of differentiation has been attributed to many factors, including very large population sizes, lack of sufficient barriers to migration/immigration, and even homoplasy in the data where mutation rates may be equal to or greater than the rates of genetic drift that promote differentiation. However, genetic analyses in the past have been limited to small snapshots of the genome that have limited resolution and capability to examine populations in sufficient detail to address these issues. Next Gen sequencing techniques are now opening the way to examine genetic data in ways never before thought possible. Our lab is now examining a protein polymorphism revealed by cellulose acetate electrophoresis and shown to have spatially different allozyme frequencies that appear to be temporally stable. We are unravelling the nucleotide variation responsible for the protein phenotypes in an effort to examine the phylogeny of these polymorphisms with the increased resolution afforded at the nucleotide level. Examination of the phylogenies of allele variants should also give us an understanding of relationships among these populations. By mapping these relationships in a geographic context we hope to reveal important regions that can be used to define fish stocks useful for management purposes. We also hope to uncover subtle variation that would indicate finer relationships and further substantiate that these two populations are indeed reproductively isolated.

Mani P. Grover, K. A. Mohanasundaram, Sara Ballouz, R. A. George, C. D. H. Sherman, M. A. Wouters

Association of detailed drug data with predicted candidate genes in Gentrepid

Candidate gene prediction systems identify genes likely to be of functional relevance to a phenotype from associated genetic loci. Gentrepid, a human candidate gene discovery platform, utilizes two algorithms, Common Module Profiling and Common Pathway Scanning, to prioritize candidate genes for human inherited disorders. Recently, several protocols were developed to apply Gentrepid to the analysis of data from Genome Wide Association Studies (GWAS), using the Wellcome Trust Case Control Consortium (WTCCC) data set on seven complex diseases as an example (Ballouz et al, 2011). We are now integrating drug databases to enable researchers to immediately associate potential therapeutics with candidate genes. In the work presented here, we associated drugs with the seven WTCCC phenotypes. For instance, Gentrepid predicted Peroxisome proliferator activated receptor delta (PPARD) as a candidate gene for Type II diabetes. Using the reference drug databases, we identified a dozen drugs that target PPARD. DrugBank (Wishart et al, 2006) suggested 10 drugs used to treat lipid and glucose metabolic diseases, the Therapeutic Target Database (TTD) (Chen et al, 2002) indicated two drugs currently used to treat obesity and hyperlipidemia, and the PharmGKB database (Hernandez et al, 2008) suggested two drugs used to treat prostatic neoplasms. For Carbohydrate (chondroitin 6) sulfotransferase 3 (CHST3), another Gentrepid candidate gene for Type II diabetes, PharmGKB suggested the same two drugs to treat prostatic neoplasms as identified for the PPARD gene. Thus, these drugs can be immediately utilized in further laboratory studies and in phase III clinical trials.
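The association step described above is, at heart, a join between candidate genes and drug-target tables. A toy sketch, with drug names invented (only the gene symbols PPARD and CHST3 come from the abstract):

```python
# Toy sketch of the gene-to-drug association step (drug names invented;
# real lookups would query DrugBank, TTD and PharmGKB target tables).

drug_targets = {                 # drug -> set of target gene symbols
    "drug_X": {"PPARD"},
    "drug_Y": {"PPARD", "CHST3"},
    "drug_Z": {"OTHER_GENE"},
}

# Candidate genes predicted for each phenotype (per the abstract's example).
candidates = {"type2_diabetes": ["PPARD", "CHST3"]}

def drugs_for(gene):
    """All drugs whose target set includes the given gene."""
    return sorted(d for d, targets in drug_targets.items() if gene in targets)

for phenotype, genes in candidates.items():
    for gene in genes:
        print(phenotype, gene, drugs_for(gene))
```

Running this prints drug_X and drug_Y for PPARD and drug_Y alone for CHST3, mirroring how overlapping target sets let one drug surface against multiple candidate genes.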

Nathan Hall Bioinformatician – more than just a number cruncher (or bridging the gap between computer scientists and biologists)

What is a bioinformatician? What does a bioinformatician do? Biologist? Computer scientist? Statistician? All of the above? Every bioinformatician is different, but one thing is for sure: a bioinformatician is much more than just a number cruncher, and a critical role is to bridge the gap between biologists and computer scientists.

Bioinformatics, especially in the area of next-generation sequencing, is growing at tremendous speed and will need to continue to do so in the future. This leads to the questions: “What are the best ways to go about teaching biologists to be bioinformaticians?”, and “Who should call themselves a bioinformatician, and should this be an inclusive or exclusive club?”. I will relate my experiences working in bioinformatics at the interface between biology and computer science and discuss the benefits and downfalls of becoming a generalist bioinformatician, and the fun of getting to think about a huge range of interesting problems.

Lars Jermiin A revised phylogenetic protocol

Molecular phylogenetics has acquired an increasingly central role in studies of genomes and genomics data. In this context, a sequential set of procedures — the phylogenetic protocol — is commonly applied to extract information from these types of data. The phylogenetic protocol, however, is flawed as it contains several illogical feedback loops. In addition, the assumptions of many of the phylogenetic methods used are often not considered in sufficient detail. In this seminar, I present a revised phylogenetic protocol with a sound set of feedback loops. I also present some of the phylogenetic tools that we have developed or are in the process of developing. Finally, I demonstrate the value of the revised phylogenetic protocol using insect and yeast genome data.

Rob Lanfear Finding Good Models of Molecular Evolution in Phylogenetics

As phylogenetic datasets increase in size, it becomes more and more important to use an appropriate model of molecular evolution. Incorrect models can lead to incorrect inferences, and this problem is exacerbated with larger datasets. I will present some new methods, and associated software (PartitionFinder), that simplify and automate model selection in phylogenetics. These methods can be applied to datasets of any size, from a single locus to genome-scale datasets of many thousands of loci, and can be efficiently parallelised. I'll show how these methods can lead to huge improvements in the models of molecular evolution that are used, and discuss how this can improve the inferences we make from DNA sequence data.
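The model-selection step that tools like PartitionFinder automate can be sketched as an information-criterion comparison. In this minimal sketch the log-likelihoods and parameter counts are hypothetical; a real analysis obtains each likelihood by fitting the substitution model to the alignment:

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical maximised log-likelihoods for three common substitution models
candidates = {
    "JC69":  {"lnL": -5123.4, "k": 1},  # equal rates, equal base frequencies
    "HKY85": {"lnL": -4987.1, "k": 5},  # transition/transversion bias + frequencies
    "GTR+G": {"lnL": -4951.8, "k": 9},  # general time-reversible + gamma rate variation
}

scores = {name: aic(m["lnL"], m["k"]) for name, m in candidates.items()}
best = min(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:6s} AIC = {score:.1f}")
print("Best model:", best)
```

The extra parameters of richer models must buy enough likelihood to justify themselves, which is exactly the trade-off the criterion penalises.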

Sean Li Annotation of the Helicoverpa genome

High-throughput next-generation sequencing and modern computational power have greatly facilitated genome sequencing, assembly and annotation in terms of data resources, cost and time. Yet each of the three tasks has its own challenges to overcome. In genome annotation especially, although a number of automated pipelines have been proposed and have shown promising results, there is still no clear approach to constructing a reasonably accurate gene set. In this talk, we will present an overview of the work we have done so far to annotate the Helicoverpa genome, including the annotation tools we have applied (such as Maker, CEGMA, PASA and Blast2GO), methods for producing a consensus gene set from multiple annotation runs, lessons learnt from the different approaches, questions raised by the annotation quality assessment, and our plans for completing the Helicoverpa genome annotation.
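One simple way to combine multiple annotation runs into a consensus gene set, as described above, is majority voting over overlapping predictions. The tool names and coordinates below are illustrative only, not the actual Helicoverpa annotations:

```python
def overlaps(a, b):
    """True if two (start, end) intervals overlap."""
    return a[0] < b[1] and b[0] < a[1]

# Hypothetical gene predictions (start, end) from three annotation tools
predictions = {
    "toolA": [(100, 500), (800, 1200), (2000, 2500)],
    "toolB": [(120, 480), (810, 1190)],
    "toolC": [(90, 510), (3000, 3400)],
}

def consensus(predictions, min_votes=2):
    """Keep intervals supported by at least min_votes tools, without duplicates."""
    kept = []
    for tool, genes in predictions.items():
        for gene in genes:
            votes = sum(
                any(overlaps(gene, other) for other in others)
                for t, others in predictions.items() if t != tool
            ) + 1  # the predicting tool itself counts as one vote
            if votes >= min_votes and not any(overlaps(gene, k) for k in kept):
                kept.append(gene)
    return sorted(kept)

print(consensus(predictions))
```

Real pipelines (e.g. EVidenceModeler-style combiners) weight evidence sources rather than counting votes, but the interval-overlap core is the same idea.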

Andrew Lonie Progress on the Genomics Virtual Laboratory

The Genomics Virtual Laboratory (GVL) project, funded by NeCTAR, is building scalable infrastructure, workflow platforms and community resources for Australian genomics researchers. At this stage, the GVL comprises: a prototype workflow management system based on the Galaxy framework, a bioinformatics toolkit (for command-line users), and a visualisation service based on the UCSC Genome Browser, all implemented on the NeCTAR Research Cloud; and a developing set of tutorials and exemplar workflows targeted at common high-throughput genomics tasks. In this talk I will demonstrate GVL capabilities and discuss progress and the GVL roadmap.

David Lovell Australian Bioinformatics Network: update to members (and those yet to join!)

The Australian Bioinformatics Network aims to connect people to:

• people
• resources
• opportunities

to increase the benefits Australian bioinformatics can deliver. The ABN now has over 250 members, and this presentation will provide an update and gather feedback about how the network can serve these members and those yet to join.

Annette McGrath Bioinformatics Core update

I will present an update on the activities of the CSIRO Bioinformatics Core since our last meeting. In particular, I will be updating you on projects that are already underway and highlighting upcoming projects for the CSIRO bioinformatics community, for your input.

Steve McMahon & Philippe Moncuquet The CSIRO Galaxy pilot project

A Galaxy service pilot has been set up in CSIRO for the benefit of biologists and bioinformaticians within the organisation. The pilot is implemented as a collaboration between CSIRO’s Information Management and Technology (IM&T) staff and the CSIRO Bioinformatics Core, making best use of the IT infrastructure and service-delivery expertise of IM&T and the domain expertise of the bioinformatics staff. This presentation outlines Galaxy, the way it has been implemented in CSIRO as a service pilot, some of the outcomes and related experiences, how to use it, and how it can benefit both bioinformaticians and biologists. This presentation encourages the bioinformatics community to show demand for a full production Galaxy service.

Tony Papenfuss Making sense of tumour sequence data in man, mouse and devil

Analysis of next generation sequencing data from tumour genomes requires pipelines built around specialised tools for SNV calling, copy number analysis and genomic rearrangement prediction. These tools must deal with many challenges. Some are intrinsic to the biology, such as contaminating normal cells, aneuploidy and intra-tumour heterogeneity, and some are extrinsic, for example sample quality, experimental design or its mis-design. Our work has been focused on two areas: methods for predicting somatic structural variation and going from pipeline results to biological insight. With examples from human, mouse and Tassie devil tumours, I'll discuss how identifying genomic rearrangements works; how, motivated by different datasets, our approach has developed; and how we made sense of insanely complex genomic rearrangements.
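As an illustration of the kind of filtering a somatic SNV caller performs (not the speaker's actual pipeline), a naive tumour/normal comparison might look like the following; all counts and thresholds are invented:

```python
def call_somatic(site, min_tumour_vaf=0.1, max_normal_vaf=0.02, min_depth=20):
    """Naive somatic SNV filter on tumour/normal allele counts.

    A variant is called somatic when it is well supported in the tumour
    but (nearly) absent from the matched normal; the thresholds here are
    illustrative, not those of any published caller.
    """
    t_depth = site["tumour_ref"] + site["tumour_alt"]
    n_depth = site["normal_ref"] + site["normal_alt"]
    if t_depth < min_depth or n_depth < min_depth:
        return False  # too shallow to call confidently
    tumour_vaf = site["tumour_alt"] / t_depth
    normal_vaf = site["normal_alt"] / n_depth
    return tumour_vaf >= min_tumour_vaf and normal_vaf <= max_normal_vaf

sites = [
    {"pos": 1042, "tumour_ref": 60, "tumour_alt": 25, "normal_ref": 80, "normal_alt": 0},
    {"pos": 2330, "tumour_ref": 70, "tumour_alt": 30, "normal_ref": 50, "normal_alt": 20},  # germline
    {"pos": 5177, "tumour_ref": 95, "tumour_alt": 3,  "normal_ref": 90, "normal_alt": 1},   # noise
]
somatic = [s["pos"] for s in sites if call_somatic(s)]
print(somatic)
```

The biological complications named in the abstract (normal contamination, aneuploidy, heterogeneity) all act by distorting exactly these allele fractions, which is why real callers model them statistically rather than with fixed cut-offs.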

Tim Peters Identifying differentially methylated regions in the human genome

The Illumina® HM450K array interrogates the human methylome by measuring methylation signals at approximately half a million CpG sites of biological interest. However, identifying the most differentially methylated (DM) probes alone, even with annotation, is of fairly limited use. What is more useful is identifying regions of DM: clusters of probes whose DM signals correspond with loci of particular biological functionality. A principled agglomeration of DM probes, informed by consecutivity, annotation and relative genomic position, along with a robust measure of differential methylation itself, is needed to properly extract these regions. Methods such as bump hunting (Jaffe et al. 2012) attempt to do this, but suffer from unnecessary parameterisation and operational issues. We present a less parameterised method that fits probes of interest to a weighted probability density function with kernel estimation, which is able to rank the most differentially methylated regions based on the density of the DM signal at any given point in the genome. This method is also able to detect regions of high variability of methylation in unlabelled data, and has scope for integration into existing visualisation tools and statistical analysis software packages.
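The weighted kernel-density idea can be sketched minimally as follows, with made-up probe positions, DM statistics and bandwidth (the published method's details will differ): each probe's differential-methylation statistic acts as its kernel weight, so regions where strong DM signals cluster produce density peaks.

```python
import math

def weighted_kde(positions, weights, bandwidth):
    """Return f(x): weighted Gaussian kernel density over genomic coordinates."""
    norm = sum(weights) * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(
            w * math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
            for p, w in zip(positions, weights)
        ) / norm
    return density

# Hypothetical probe coordinates and |DM| statistics: a cluster near 10,500
positions = [10_100, 10_450, 10_500, 10_550, 10_900, 50_000]
dm_stats  = [0.5,    2.1,    2.4,    2.2,    0.4,    1.0]

f = weighted_kde(positions, dm_stats, bandwidth=200)
peak = max(range(10_000, 11_001, 50), key=f)  # crude grid search for the mode
print(peak)
```

The isolated probe at 50,000 contributes little density despite a moderate weight, which is the behaviour that lets density ranking favour consecutive runs of DM probes over lone outliers.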

Jason Ross “Stop_gap, measure”. Tools for handling deep bisulphite sequencing data.

The current iteration of Ion Torrent instruments offers high throughput, long read lengths and relatively low costs, making them attractive platforms for deep sequencing. However, Ion Torrent sequencing (like 454 sequencing) has an error mode in which the number of nucleotides in longer homopolymers is often incorrectly estimated. Bisulphite-treated DNA often has long runs of thymines, and the resulting Ion Torrent read errors introduce misalignments, making the estimation of cytosine methylation particularly difficult. “Stop_gap” is a software tool that reads BAM files, implements approaches to correct for such misalignments, and writes corrected BAM files. “Measure” is software that can walk through a BAM file from any deep sequencing platform and calculate methylation rates at CpG or CpN sites. Output can be either a CSV or an Excel file. Both tools can be executed from the command line as part of a pipeline, or alternatively are callable as Python classes.
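A toy version of the counting that a tool like “Measure” performs might look like the following; the sequences are invented, and a real tool would read alignments from BAM (e.g. via pysam) rather than from tuples. The key fact is that bisulphite conversion turns an unmethylated C into T while a methylated C stays C:

```python
def cpg_methylation_rates(reference, reads):
    """Per-CpG methylation rates from bisulphite reads (top strand only).

    At each reference CpG, a C in a read indicates methylation and a T
    indicates an unmethylated, converted cytosine. `reads` are
    (start, sequence) pairs already aligned to `reference`.
    """
    cpg_sites = [i for i in range(len(reference) - 1) if reference[i:i+2] == "CG"]
    rates = {}
    for site in cpg_sites:
        meth = unmeth = 0
        for start, seq in reads:
            offset = site - start
            if 0 <= offset < len(seq):
                if seq[offset] == "C":
                    meth += 1
                elif seq[offset] == "T":
                    unmeth += 1
        total = meth + unmeth
        if total:
            rates[site] = meth / total
    return rates

# Toy example: CpG sites at positions 2 and 7 of the reference
ref = "ATCGTTACGA"
reads = [
    (0, "ATCGTTATGA"),  # methylated at 2, converted (T) at 7
    (0, "ATCGTTACGA"),  # methylated at both sites
    (2, "CGTTATGA"),    # methylated at 2, converted at 7
]
print(cpg_methylation_rates(ref, reads))
```

This ignores the homopolymer misalignments the abstract describes, which is precisely why a correction step like “Stop_gap” is needed before counting.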

Neil Saunders Version control in bioinformatics: our experience using Git

Version control is an important aspect of reproducible bioinformatics research. However, it is still not employed as widely as we would like. In this presentation I aim to: (1) provide a basic introduction to Git, a popular open-source distributed version control system; and (2) illustrate how we use Git to manage projects in the CMIS Bioinformatics & Biostatistics group.

Roy Storey Ensembl for non-model organisms

EnsEMBL started as a data and visualisation framework for the release of the Human genome (doi:10.1038/35057062). Since then, EnsEMBL has been accumulating and hosting genomes. Initially this was confined to vertebrate genomes, such as Human, Mouse and Zebrafish, but now the EnsEMBL Genomes (http://www.ensemblgenomes.org) project includes over 6000 genomes spanning all the biotic phyla. The EnsEMBL framework serves as a resource with which to warehouse and access "omics" data in a genomic context, in an extensible and reproducible manner. We present lessons learnt from running EnsEMBL as a local instance, incorporating mirrors of public genomes and genomes that we have sequenced, assembled and annotated. This provides insight into the challenges faced and how we have extended the application programming interface (API), website visuals and functionality to provide integration into other local services.

David Yeates Phylogenetics in the context of collections-based research

CSIRO manages and develops four major biological collections in the Australian National Biological Collections Facility: the National Herbarium and the National Collections of Insects, Wildlife and Fish. Together these collections manage millions of specimens and are a significant resource for studying Australia’s biodiversity. Research scientists in the collections use phylogenetic research results to illuminate the tempo and mode of evolution in Australia’s biodiversity, in an effort to understand the processes that have shaped biological evolution here. Increasingly, these results have important implications for conservation and natural resource management, in particular helping us predict the impacts of threatening processes such as climate change. I will focus on the use of phylogenetic results from multilocus molecular datasets to understand biogeographic and coevolutionary processes in a number of different biological systems. The emerging promise of phylogenomic-scale datasets offers an expanded arsenal of tools to understand evolutionary patterns and processes, and brings with it a new set of challenges in analysis and interpretation.

Alec Zwart Reproducible Research and R

Literate programming (Knuth 1984) systems such as CWeb or Sweave (in R) provide tools to enable the concept of reproducible research (Fomel & Claerbout 2009, Donoho 2010): the idea that a publication describing the results of research can (and should) also include the code and data needed to reproduce the results and figures presented in the publication. In this talk I briefly introduce the motivations for literate programming and reproducible research, and the concept of a compendium (Gentleman & Temple Lang 2007) as a format for distributing reproducible research. I then demonstrate a particularly easy-to-use literate programming system recently developed for the R statistical software, knitr+markdown, perfect for simple, reproducible reports of analyses.


Life’s complex…

…use bioinformatics

The CSIRO Bioinformatics Core and the Australian Bioinformatics Network are proud to support Bioinformatics FOAM 2013.

The Core aims to complement and augment the efforts of bioinformaticians and bioinformatics teams across CSIRO.

The Australian Bioinformatics Network aims to connect people, resources and opportunities to increase the benefits Australian bioinformatics can deliver.

We wish all delegates a successful meeting.