CIFAR bigDATA Workshop Report

REPORT ON CIFAR bigDATA Workshop, University of British Columbia, Vancouver, BC, Canada, May 28-30, 2010. LEAD ORGANIZER: Steven Hallam [email protected]. ASSISTANT ORGANIZERS: Jody Wright [email protected], David Walsh [email protected]


Page 1: CIFAR bigDATA Workshop Report

Report on

CIFAR bigDATA Workshop

University of British Columbia, Vancouver, BC, Canada

May 28-30, 2010

Lead Organizer

Steven Hallam [email protected]

Assistant Organizers

Jody Wright [email protected]

David Walsh [email protected]


Page 2: CIFAR bigDATA Workshop Report

Report on CIFAR bigDATA Workshop

Edited By: Jody Wright

Designed By: Tora Design Ltd

Cover Image By: Martin Krzywinski

bigData Online: http://www.cmde.science.ubc.ca/hallam/bigdata.php

Page 3: CIFAR bigDATA Workshop Report

Contents

List of Attendees

Themes

Summary of Presentations

Summary of Software Demonstration Sessions

Discussion & Recommendations

Abstracts & Presenter Information

Selected Publications from Workshop Presenters

bigDATA Survey Results


Page 4: CIFAR bigDATA Workshop Report

List of Attendees


Adelshin, Renat: University of British Columbia, Trainee, CIFAR-IMB Program

Andrews, Brenda J: University of Toronto, Fellow and Program Director, CIFAR GN Program

Baldwin, Susan: University of British Columbia

Beiko, Robert G: Dalhousie University

Boucher, Yan: MIT, Scholar, CIFAR-IMB Program

Brady, Arthur: University of Maryland

Brooks, Denise: University of British Columbia

Burki, Fabien: University of British Columbia, Trainee, CIFAR-IMB Program

Case, Rebecca: Harvard University, Associate, CIFAR-IMB Program

Chan, Amy: University of British Columbia, Trainee, CIFAR-IMB Program

Chénard, Caroline: University of British Columbia, Trainee, CIFAR-IMB Program

Fortuna, Miguel A.: Princeton University

Gardy, Jennifer: University of British Columbia, BC Centre for Disease Control

Gianoulis, Tara: Harvard Medical School

Grigg, Michael E: National Institutes of Health, NIAID, Scholar, CIFAR-IMB Program

Gsponer, Joerg: University of British Columbia

Gustavsen, Julia A: University of British Columbia, Trainee, CIFAR-IMB Program

Hallam, Steven J: University of British Columbia, Scholar, CIFAR-IMB Program

Hawley, Alyse K: University of British Columbia, Trainee, CIFAR-IMB Program

Horak, Ales: University of British Columbia, Trainee, CIFAR-IMB Program

Howe, Alexis T: University of British Columbia, Trainee, CIFAR-IMB Program

Howes, Charles G: University of British Columbia, Trainee, CIFAR-IMB Program

Hugenholtz, Philip: DOE Joint Genome Institute

Imanian, Behzad: University of British Columbia, Trainee, CIFAR-IMB Program

Janouskovec, Jan: University of British Columbia, Trainee, CIFAR-IMB Program

Kang, Ruijuan: University of British Columbia, Trainee, CIFAR-IMB Program

Keeling, Patrick: University of British Columbia, Fellow and Program Director, CIFAR-IMB Program

Khorvash, Massih: University of British Columbia

Konwar, Kishori: University of Connecticut

Page 5: CIFAR bigDATA Workshop Report


Krzywinski, Martin: Genome Sciences Center, BCCA

Labonte, Jessica M: University of British Columbia, Trainee, CIFAR-IMB Program

Lee, Renny: Dalhousie University, Trainee, CIFAR-IMB Program

McLellan, Jessica: University of British Columbia

Mirjafari, Parissa: University of British Columbia

Mitchell, Kendra: University of British Columbia, Trainee, CIFAR-IMB Program

Monier, Adam: Monterey Bay Aquarium Research Institute

Norbeck, Angela D: Pacific Northwest National Laboratory

Pagé, Antoine P: University of British Columbia, Trainee, CIFAR-IMB Program

Pereira, Nicholas: Harvard Medical School

Perlman, Steve: University of Victoria, Scholar, CIFAR-IMB Program

Pombert, Jean-Francois: University of British Columbia, Trainee, CIFAR-IMB Program

Robertson, James: University of British Columbia

Silverman, Mel: CIFAR, Vice President, Research

Song, Young C: University of British Columbia, CIFAR-IMB Program

Svinti, Victoria: University of British Columbia

Tai, Vera: University of British Columbia, Junior Fellow, CIFAR-IMB Program

Twieg, Brendan: University of British Columbia

Vassilenko, Ekaterina: University of British Columbia

Visocky O’Grady, Jennifer: Enspace Inc. / Cleveland State University

Visocky O’Grady, Ken: Enspace Inc. / Kent State University

Vlok, Marli: University of British Columbia, Trainee, CIFAR-IMB Program

von Mering, Christian: University of Zurich

Walsh, David A: Concordia University, Trainee, CIFAR-IMB Program

Winget, Danielle: University of British Columbia, Trainee, CIFAR-IMB Program

Wright, Jody J: University of British Columbia, Trainee, CIFAR-IMB Program

Yu, Hang: University of British Columbia

Zaikova, Elena: University of British Columbia, Trainee, CIFAR-IMB Program

Page 6: CIFAR bigDATA Workshop Report


Themes

Although the vast majority of microbes in nature resist laboratory cultivation, they represent an almost limitless reservoir of genomic diversity and biological innovation. Next generation sequencing technologies are rapidly expanding our capacity to access this genomic information directly from environmental samples. However, to effectively organize and interpret this increasing volume of information, new analytic methods and operational knowledge must be developed and dispersed with the end user community in mind. This workshop aimed to bring together experts and trainees to explore problems and solutions in analyzing environmental sequence data at different levels of biological organization. The workshop fulfilled two goals:

(1) Providing tutorials and hands-on training for the use of existing and emerging tools for analyzing traditional and next generation sequencing data. It also included talks that demonstrated how such data sets are generated, so participants came away from the workshop with a basic understanding of how to manipulate and take full advantage of large environmental data sets.

(2) Exploring the future of bigDATA analysis with particular emphasis on organization, integration and visualization of complex and multidimensional datasets. There is presently a powerful impetus to create an open source “knowledgebase” that allows genomic diversity, metagenomic, experimental and environmental information to be analyzed and visualized in real time as a community resource. The basic framework of this system (i.e. tools and applications, stand alone versus network, accessibility and modularity, etc.) is still nascent. We discussed our collective vision of how an ideal knowledgebase should operate to better inform developers and funding agencies.

Key Ideas

1. Technological breakthroughs are creating major bottlenecks in data intensive computation and analysis.

As Phil Hugenholtz pointed out in his talk, increases in the capacity and throughput of next generation DNA sequencing technologies are both a blessing and a curse: they enable researchers to generate vast amounts of information, but a computational bottleneck in both data storage and analysis is imminent. In terms of data analysis, it is often challenging to find effective computational methods for sifting through these massive datasets to identify patterns that are actually meaningful and interesting at an appropriate level of resolution. Angela Norbeck illustrated these challenges within the field of proteomics and demonstrated some computational tools she is using to address them. Trying to understand these complex patterns and the implications they have for biological systems is equally challenging from a prediction and modeling perspective, as was discussed by many speakers including David Walsh with respect to his work on identifying valid predictors for ecosystem health.

2. The application of first principles of design in data visualization is an essential component in effective communication of scientific discovery both within the scientific community and to the public.

A recurring theme throughout many of the talks was the enormous challenge we face in effectively visualizing our data and results in ways that are intuitive, aesthetic and ultimately meaningful to both scientific and public audiences. Many sincere attempts to display significant patterns in large datasets fall short of being legible, clear and/or elegant, probably because the overwhelming majority of scientists do not have any formal training (or perhaps even interest) in principles and conventions of design. Participants agreed that this deficiency is significant because if we as a scientific community cannot create professional visualizations, our impact will not reach beyond our immediate areas of expertise and will most certainly not reach the general public.

Page 7: CIFAR bigDATA Workshop Report


Martin Krzywinski presented a variety of simple tips and techniques that scientists can apply in order to make legible, clear and elegant figures, without needing to become experts in the field of design. It was noted that effective design is so critical in communication of information that universities and organizations should have designers available to collaborate with scientists on their information design needs, and that effective visualization of information should be a topic high on the agenda of the scientific community as a whole.

3. Network theory provides an emerging conceptual framework for understanding biological complexity at different levels of organization from genomes to biomes.

Brenda Andrews, in her keynote address to workshop participants, described the genetic landscape of a cell using interaction networks to chart a path between genotype and phenotype. The use of hierarchical modular networks to describe biological interactions at the molecular, cellular and community levels was a recurring theme in many of the talks that followed. Miguel Fortuna used network algorithms borrowed from particle physics to chart the ecological relationships between bat colonies and food sources, while Tara Gianoulis used interaction maps to illustrate ecological partitioning of transport proteins in the surface ocean. Participants were excited by the power of network visualizations and confounded by the challenge of using them effectively in their own research. Thus the inclusion of a tutorial by Jennifer Gardy focused on Cytoscape, an open source bioinformatics software platform for visualizing molecular interaction networks, was a very welcome component of the program.

Page 8: CIFAR bigDATA Workshop Report


Summary of Presentations

Brenda Andrews

Genetic Landscape of the Cell

It is valuable to study how different combinations of genetic variants display themselves phenotypically in humans, particularly when those combinations lead to disease. Studying the vast number of genetic interactions that are possible is a daunting challenge that Brenda Andrews and her research group at the University of Toronto have been addressing. To define the general principles of genetic networks, she has focused on the systematic identification of genetic interactions in a biomedically relevant model system, the budding yeast Saccharomyces cerevisiae, using a technique known as synthetic genetic array (SGA). SGA provides a high throughput approach for automating studies of yeast genetics by allowing researchers to examine every single lethal combination of genetic variants possible. In this way, Brenda’s group was able to construct a “cell map” that illustrates clusters of genetic interactions within the cell and predicts the function of each gene and the biological pathways it is involved in. This map can help point out interesting patterns of interaction that are as yet unstudied, and can also enable researchers to make very precise predictions about gene function based on where genes fall within the network. When the cell map of S. cerevisiae was compared to that of the fission yeast S. pombe, Brenda’s group found significant conservation of genetic interactions, hinting at a common core eukaryotic interaction network. Brenda illustrated that conserved interactions between the two yeasts are more likely to have human homologues, indicating that we may be able to model certain aspects of human genetic interactions in yeasts.

David Walsh

Ocean Health: A Case Study for Microbial Systems Ecology

David discussed how his work at UBC involves trying to understand the health of the oceans by looking at the microbial ecology at the base of the food web of these systems. Microbes dominate the Earth’s ecology in abundance, diversity and in terms of their role in ecosystem function. At present, 80-99% of all microbes are uncultivated, which compounds the difficulty of studying their physiology and role in global biogeochemical cycles. In the oceans, microbes convert CO2 from the atmosphere into organic carbon that becomes sequestered in the ocean’s interior; they are also involved in the generation and removal of biologically available nitrogen, as well as in the production and consumption of greenhouse gases. Currently, ocean warming is leading to an expansion of naturally occurring regions of low oxygen (hypoxia) known as oxygen minimum zones (OMZs). Ecosystem models demonstrate that increasing hypoxia leads to a diversion of energy from higher trophic levels (e.g. fish, marine mammals) into microbial pathways; however, the dynamics of these pathways are as yet poorly understood, as are the implications of OMZ expansion. David discussed how the application of genomic, transcriptomic and proteomic tools allowed him to uncover the metabolic functions and ecosystem role of a particular group of ubiquitous OMZ microbes called SUP05 in the seasonally anoxic basin of Saanich Inlet, British Columbia. He emphasized the importance of integrating systems biology with ecological theory and practice, and with physical, chemical and biological system descriptions, in order to understand and predict ecosystem function and response to global environmental change.

Page 9: CIFAR bigDATA Workshop Report


Phil Hugenholtz

Divide and conquer strategies for metagenomics

Phil discussed how increases in the capacity and throughput of next generation DNA sequencing technologies are both a blessing and a curse: they enable researchers to get vast amounts of information, but a computational bottleneck in data storage and analysis is imminent. He proposed using single-cell genomics as a “divide and conquer” solution in which a single bacterium is isolated and its genome amplified, thereby decreasing the genome assembly problems associated with amplification of DNA from whole microbial communities. At present, the best method for searching populations of microbes to obtain single cells of interest involves staining the population of interest with a fluorescence in-situ hybridization (FISH) probe, followed by flow cytometric sorting of the fluorescent cells to pick out single cells from this population for further analysis. The DNA of single cells is then amplified using Multiple Displacement Amplification (MDA), which can convert mere femtograms to micrograms of DNA within several hours. The problems with this method as a whole are that it is not scalable and that it is very slow, as only one population can be targeted at a time. One possible solution may be to use anonymous cell sorting rather than FISH based probes, as this would make the process scalable. Phil discussed an example of the successful application of single cell genomics to enhance a metagenomic study of the rhizosphere from several plants, whereby the molecular conversation between the host plant and microbes could be more easily discerned by sorting and amplifying the genomes of enriched microbes.

Arthur Brady

PhyMM and PhyMMBL: Phylogenetic identification of metagenomes

A persistent problem associated with metagenomics studies is the difficulty of assigning taxonomic origin to individual DNA sequence reads. Existing methods typically result in the exclusion of large amounts of sequenced data, and Arthur was interested in developing a tool that would allow the taxonomic identification of fragments of metagenomic DNA and allow the inclusion of as much sequence data as possible in downstream studies. He created PhyMM, a software solution based on the core gene-modeling component of GLIMMER (a microbial gene finding system), which scores reads using models based on existing sequenced microbial genomes. DNA sequence reads from metagenomics projects are usually sampled in insufficient numbers to reconstruct whole populations, so assembly results in “chimeras” (assemblies of DNA that do not actually exist in nature). Using PhyMM to taxonomically bin metagenomic data before genome assembly is very useful as it decreases the number of chimeras produced and thus improves overall assembly quality. Taxonomic binning using PhyMM can also lead to the inclusion of a greater proportion of sequenced reads, which is cost effective and increases the probability of linking metabolic genes of interest to their host’s phylogenetic identity.
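The idea of scoring a read against models trained on sequenced genomes can be illustrated with a small sketch. This is not PhyMM's implementation: it assumes a simple fixed-order Markov model with add-one smoothing, and the two toy "genomes", the taxon labels and the function names are all hypothetical.

```python
import math
from collections import defaultdict

def train_markov_model(genome, order=3):
    """Count how often each base follows each k-mer context in a genome."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(genome) - order):
        context = genome[i:i + order]
        counts[context][genome[i + order]] += 1
    return counts

def log_likelihood(read, counts, order=3, alphabet="ACGT"):
    """Log-probability of a read under a model, with add-one smoothing."""
    score = 0.0
    for i in range(len(read) - order):
        seen = counts.get(read[i:i + order], {})
        total = sum(seen.values())
        next_base = read[i + order]
        score += math.log((seen.get(next_base, 0) + 1) / (total + len(alphabet)))
    return score

def classify_read(read, models, order=3):
    """Return the label of the reference model that scores the read highest."""
    return max(models, key=lambda label: log_likelihood(read, models[label], order))

# Hypothetical toy reference sequences with very different composition.
references = {
    "taxon_A": "ATATATATATATATATATATATATATAT",
    "taxon_B": "GCGCGGCGCCGCGGCGCGCCGCGGCGCG",
}
models = {label: train_markov_model(seq) for label, seq in references.items()}
print(classify_read("ATATATATAT", models))
print(classify_read("GCGCGGCGCC", models))
```

Binning reads this way before assembly means each assembly only sees reads assigned to the same label, which is the step described above as reducing chimeras.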

Page 10: CIFAR bigDATA Workshop Report


Miguel Fortuna

Nestedness and Modularity in Ecological Networks

Miguel applies network thinking to understand biological communities, focusing mainly on plant/animal networks, but also on other systems. When we think of “biodiversity”, we often think of large numbers of species, but of course species do not exist in isolation: they are constantly interacting with each other in various protagonistic, antagonistic and commensal ways. One way to simplify these interactions is to use networks, a series of nodes connected by links. We could think of the interactions in networks as being random, but in nature we find a nested pattern in which the species interact in an asymmetric way, with some generalists interacting with all other species in a group and some specialists interacting with only a few other species in a group (e.g. some bats nest in many species of trees while other bats are found in only one species of tree). Miguel has found that most mutualistic communities are statistically significantly nested. This nested pattern has dynamical implications, as it is more robust to extinction of species and environmental degradation, and these nested communities typically contain higher biodiversity. Miguel also found that both nested and modular patterns can coexist within the same network (nestedness within compartments), yet as the number of connections in a network increases, it is more likely to find one or the other type of interaction (nested OR modular) but not as likely to find both. These conclusions will aid Miguel in the prediction and study of network structure in certain types of ecological relationships. For example, will all parasitic networks have the same network structure? Will all mutualistic networks? And will these be different from each other and thus indicative of the type of interaction?
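To make the contrast between nested and modular patterns concrete, here is a minimal sketch that scores a binary interaction matrix with NODF, one commonly used nestedness metric. The report does not say which measure Miguel used, so the choice of NODF is an assumption, and the two bat-by-tree matrices are hypothetical.

```python
from itertools import combinations

def _paired_overlap(vectors):
    """Sum, over all pairs, the share of the sparser vector's links that the
    denser vector also has. Pairs with equal degree contribute zero."""
    total = 0.0
    for a, b in combinations(vectors, 2):
        deg_a, deg_b = sum(a), sum(b)
        if deg_a == deg_b or min(deg_a, deg_b) == 0:
            continue
        shared = sum(x * y for x, y in zip(a, b))
        total += shared / min(deg_a, deg_b)
    return total

def nodf(matrix):
    """NODF nestedness (0 to 100) of a binary species-by-species matrix."""
    rows = [list(r) for r in matrix]
    cols = [list(c) for c in zip(*matrix)]
    n_pairs = len(rows) * (len(rows) - 1) / 2 + len(cols) * (len(cols) - 1) / 2
    return 100.0 * (_paired_overlap(rows) + _paired_overlap(cols)) / n_pairs

# Rows are bat species, columns are tree species (hypothetical data).
nested = [
    [1, 1, 1],  # generalist: roosts in every tree species
    [1, 1, 0],
    [1, 0, 0],  # specialist: roosts in a single tree species
]
modular = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(nodf(nested), nodf(modular))
```

In the nested matrix each specialist's links are a subset of a generalist's links, while in the modular matrix the two compartments share nothing, so the two scores sit at opposite ends of the scale.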

Tamara Munzner

Solving Visualization Problems with Large Datasets – Visualization and Biology: Fertile Ground for Collaboration

Why do visualization? Pictures help us think by substituting perception for cognition, freeing up limited memory resources for higher-level problem solving. For example, reading colors off a picture is often easier than comparing numbers in a table because we are using our sense of perception instead of pure cognition. Visualization is especially important for problems that cannot be automated and where a simple summary is not adequate to fully grasp patterns in a set of data (i.e. statistics do not adequately characterize the complexity of the dataset distribution). Visualization can allow for novel discovery of patterns, as well as confirmation or contradiction of hypothesized patterns. It can often speed up the workflow of data analysis by helping researchers identify key patterns of interest in a timely fashion. Good driving problems for visualization research have a need for humans in the research loop, involve large and multidimensional datasets (bigDATA!), and address reasonably clear questions. These criteria are all satisfied in the field of biology, making it a key field in which to apply visualization research. Tamara described and illustrated several applications of the visualization software tool Cerebral, which she and her collaborators created so that immunologists can visualize multiple specific experimental conditions on biological pathways simultaneously while still viewing them in a larger context. Cerebral allows researchers to notice things across cellular compartments that would be very difficult to identify without extensive visualization. Tamara emphasized the value of using visualization tools to create overviews of very large datasets such that the small and interesting “needles in the haystack” don’t get aggregated out and become easier to identify.

Page 11: CIFAR bigDATA Workshop Report


Christian von Mering

Microbial 16S rRNA data: A Meta analysis to connect genomes to environmental context

Microbes in the wild form complex assemblages, but how microbes are distributed across habitats and the nature of the interactions within these assemblages is largely unknown. Christian is particularly interested in studying whether or not microbes typically occur in symbiotic associations, how they are dispersed, whether they have clearly defined habitats, and what shapes the contents of microbial genomes. To approach these questions, Christian identified phylogenetic marker gene sequences (16S small subunit ribosomal RNA gene) collected from a variety of habitats worldwide and found that two species will co-occur more often than not in the same habitat. He then built a network of microbial co-existence relationships and partitioned this network to define the habitats where the lineages are occurring. Fully sequenced microbial genomes were mapped into the network environment to search for genomic correlates of ecological associations. It was seen that some genomes fall in existing clusters while others do not, indicating that some environments seem to be easier to obtain sequenced genomes from than others. Many of the co-existing lineages were phylogenetically closely related, but a significant number of distant associations were observed as well. Genomes from coexisting microbes tended to be more similar than expected by chance, both with respect to functional pathway content and genome size. Christian hypothesized that groupings of lineages are often ancient, and that they may have significantly impacted genome evolution.
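A co-existence network of the kind described above can be sketched in a few lines. This is an illustration only, not Christian's method: the sample names, the lineage lists, the observed-to-expected ratio and the `min_ratio` threshold are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_edges(samples, min_ratio=1.5):
    """Link two lineages when they share samples more often than expected
    if each were distributed independently across the samples."""
    n_samples = len(samples)
    occurrences = defaultdict(int)
    pair_counts = defaultdict(int)
    for lineages in samples.values():
        for lineage in lineages:
            occurrences[lineage] += 1
        for a, b in combinations(sorted(lineages), 2):
            pair_counts[(a, b)] += 1
    edges = []
    for (a, b), observed in pair_counts.items():
        expected = occurrences[a] * occurrences[b] / n_samples
        if observed / expected >= min_ratio:
            edges.append((a, b))
    return sorted(edges)

# Hypothetical 16S survey: sample name -> set of lineages detected in it.
samples = {
    "soil_1": {"Acidobacteria", "Actinobacteria"},
    "soil_2": {"Acidobacteria", "Actinobacteria"},
    "ocean_1": {"SAR11", "Prochlorococcus"},
    "ocean_2": {"SAR11", "Prochlorococcus"},
    "estuary": {"SAR11", "Actinobacteria"},
}
for edge in cooccurrence_edges(samples):
    print(edge)
```

Partitioning the resulting graph, for example into connected components, would then give candidate habitat clusters onto which sequenced genomes could be mapped.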

Tara Gianoulis

Network Dynamics Across Environments: Metabolism and Membrane Proteins

Recent metagenomics studies have begun to sample the genomic diversity among dissimilar habitats and relate this variation to specific features of the environment. As such, Tara has been trying to understand how microbial community diversity and function are impacted by very specific environmental features (for example, across gradients of salinity or chemistry). Understanding the relationships between specific microbial groups and environmental features could allow us to use the presence or abundance of these groups as biosensors for ecosystem health in a predictive sense. In order to distinguish human impact features on the oceans (e.g. shipping, agricultural runoff, pollution) from natural features, Tara used metagenomic sequence data and oceanographic metadata collected from the Global Ocean Survey (GOS) to map the distribution of known metabolic pathways and cell membrane protein families across the surface of the oceans. Membrane proteins are an intuitive, but thus far overlooked, choice in this type of analysis as they directly interact with the environment, receiving signals from the outside and transporting nutrients into the cell. Using this approach, Tara studied global variation in the distribution of membrane proteins in terms of natural features, such as phosphate and nitrate concentrations, and also in terms of human impacts, such as pollution and climate change. Her results show that there is widespread variation in membrane protein content across marine sites, which is correlated with changes in both oceanographic variables and human factors.

Page 12: CIFAR bigDATA Workshop Report


Robert Beiko

Geographic and Temporal Analysis of Genomes and Metagenomes

GenGIS (http://kiwi.cs.dal.ca/GenGis) is a new software tool developed by Rob’s group at Dalhousie University that merges digital map, habitat, and georeferenced genomic data in a 3D Geographic Information System (GIS) environment. GenGIS is free and open source, extensible, allows for interactive manipulation of data, and can bring together maps, sample metadata and genomic information to generate appealing visualizations of geographic datasets. In his talk, Rob gave a demonstration of several applications of GenGIS, for example loading 3D topographical maps of Africa and mapping the distribution of HIV subtypes across the continent, which easily allowed the visual interpretation of HIV strain travel across major highways and country boundaries within Africa.

Angela Norbeck

IMPROV for Data Integration and Visualization

The study of complex biological systems requires the integration of descriptive information and experimental data from multiple experiments in order to reveal a biological story that is accurate and representative of the system as a whole. Although the field of genomics is revealing a great deal about the complexity of biological systems, it is of utmost importance to also study protein expression (proteomics) in these systems, because two organisms can have the same genome but very different external representation (or phenotype) depending on which genes are being expressed as proteins (for example, a caterpillar and a butterfly). The same organism can also have varying phenotypes depending on the environmental conditions it is confronting. Angela presented a software tool to aid in proteomic analyses (IMPROV, or the Integrated Metaproteomics Viewer), which is designed to collate and cluster multiple data types and display protein expression data with interactive views and interpretable visualization. IMPROV allows investigators to globally view the whole proteome (all expressed proteins) of a microbial community, to zoom into regions and extract information about specific proteins, to change between different measurements made for peptides and proteins, to filter and search for entities of interest, as well as to export selections of information for analysis in other tools and for visualization. Features in development include on-the-fly clustering of proteins (for example, by function or by location), a heatmap comparison viewer to compare expression levels of various proteins, a time course viewer, statistical analysis platforms such as Principal Components Analysis (PCA), as well as full integration of functional pathway analysis tools (in collaboration with the Hallam lab at UBC).

Page 13: CIFAR bigDATA Workshop Report


Jenn & Ken Visocky-O’Grady

Information Design Dissected

Design is based on conventions. Jenn and Ken are specifically interested in understanding WHY and how these conventions work in effective communication of information, and how these conventions affect the ways in which we communicate and interpret information. Through the presentation of several case studies of effective information design, Jenn and Ken highlighted the following principles and conventions of information design that can be applied when communicating scientific results:

· LATCH organization of content: a model for organizing information that identifies only 5 ways to organize content: Location, Alphabet, Time, Category, Hierarchy

· Principle of least effort: regardless of experience and expertise, users will naturally gravitate to familiar and easy tools even if the resulting yield is poor (e.g. use of Google for information search instead of visiting the library)

· Miller’s magic number: content should be presented in chunks of the number 7 +/- 2 (i.e. 3, 5, 7, 9 “chunks” represented on any given page)

· It takes a while to arrive at an effective design. There is no magic bullet in design; it takes time and many prototypes until a unique solution is reached. The first idea is often the worst/most common idea, but it might be helpful to tweak the most common idea to make it more powerful, as it is also likely the most familiar to an audience.

Jenn and Ken emphasized that if we as a scientific community cannot create professional visualizations, our impact will not reach beyond our immediate areas of expertise and will most certainly not reach the general public. They noted that design is so critical in communication of information that it should be on the agenda, and organizations should have designers available to collaborate with scientists on their information design needs. Incorporating design into an organization’s culture at every level causes differentiation and enables the organization to rise above (for example, Apple computers).

Page 14: CIFAR bigDATA Workshop Report


Martin Krzywinski

bigFIGURES for bigDATA: Creating Informative and Appealing Genome Data Graphics

Martin gave a very enlightening talk highlighting practical tips to help make scientific figures legible, clear, and attractive. He emphasized the importance of putting in a bit of extra thought and effort to at least make figures legible and clear so that readers are able to understand the information being conveyed, while learning to make figures attractive may take more time and practice and may involve the skills of a trained designer. Martin recommended beginning the figure generation process by considering these questions and statements:

· What is my MESSAGE?

· Is a graphical representation really necessary to explain this data?

· What is the best WAY to represent the data?

· Does the legend obviate the figure?

· Are there extraneous elements?

· Does the reader want more? (Always want to leave the reader wanting more, which they can find in the text, rather than overwhelming them with too much information)

· The reader does not know what they need to know: you must tell them

· The reader does not know what is important: you must show them

· The reader’s cognitive and visual acuity are limited

Quality of communication will increase with all improvements in legibility, clarity, and attractiveness. In addressing each of these three factors individually, we should seek to:

1) Simplify and de-clutter (improve resolution, color choice, and orientation)

2) Refine and remove redundancy

3) Restructure

other key tips to consider when creating figures include:

· If there is no emergent pattern in your figure, don’t show it
· Only show an entire dataset when there is an emergent pattern; otherwise, show only the relevant portions of the dataset and highlight where the patterns are
· Use tools like colorbrewer.com and kuler.adobe.com to help create attractive color combinations (avoid colors with the same luminance, and colors that seem more important than others when they are not)
· Background can hide data: don’t use a background color like grey, use only black or white
· Avoid occluding one dataset with another
· Remove redundant information
· For scales, use axis breaks if needed, and make sure all axes are on the same scale unless you’re just trying to show the shape of a curve
· For legends:
  » Only apply one legend for a multi-panel figure
  » Make legends into tables if necessary
  » Recapitulate the order of legend colors and labels in figures
  » Never use the same color for two different things
  » Exaggerate your message
  » You shouldn’t need a complicated legend to navigate a complicated figure
· No 3D, ever!
· Don’t let outliers hijack your figure
· Data-to-ink ratio: most of your ink should be used in showing your data
  » You don’t need strong outlines
  » Don’t use color for things that are missing or zero
  » If zeroes are not important or relevant, don’t show them
  » Omit axis labels when they are unnecessary
· In selecting glyphs, use either shapes or colors but not both
· Use a hierarchy in the information
· To make figures more elegant:
  » They should sit lightly on the page
  » Make lines at the same angles
  » Consider the entry point of the eye into the figure: starting with a legend gives the reader a chance to understand something simple first, before looking at the specifics of the data in the rest of the figure
  » The reader’s eye goes right to the middle of the figure, so orient your audience there
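The color tip in the list above (avoid colors with the same luminance) can be checked mechanically. The following is a minimal sketch, not something presented at the workshop: it assumes colors are given as sRGB hex strings and uses the WCAG 2.x relative-luminance formula; the function names and the `min_gap` threshold are hypothetical choices for illustration.

```python
def relative_luminance(hex_color):
    """Relative luminance of an sRGB colour (0 = black, 1 = white), WCAG 2.x formula."""
    hex_color = hex_color.lstrip("#")
    # Split "rrggbb" into three channels scaled to the range 0..1.
    channels = [int(hex_color[i:i + 2], 16) / 255.0 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding to get linear light.
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def low_contrast_pairs(palette, min_gap=0.05):
    """Return colour pairs whose luminance differs by less than min_gap.

    min_gap is an assumed, illustrative threshold, not a published standard.
    """
    colours = list(palette)
    lum = {c: relative_luminance(c) for c in colours}
    return [
        (a, b)
        for i, a in enumerate(colours)
        for b in colours[i + 1:]
        if abs(lum[a] - lum[b]) < min_gap
    ]


# Two near-identical greys are flagged; red against black is not.
flagged = low_contrast_pairs(["#808080", "#818181", "#000000"])
```

A palette that returns an empty list here can still be unattractive; the check only covers the luminance part of the advice.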

Martin concluded by showing some examples of figures generated by Circos, a software tool he created for visualizing complex genome information. Circos is designed for data visualization and not for data analysis, and is freely available for download at http://mkweb.bcgsc.ca/circos/


Summary of Software Demonstration Sessions

STRING

Christian von Mering

(see summary of Christian von Mering’s workshop presentation, p. 11)

Cytoscape

Jennifer Gardy

Cytoscape is a program for analyzing and visualizing network data. It was originally designed for visualizing protein-protein interaction networks, but it can be used to visualize virtually any type of network (even social networks!). Cytoscape is freely available and benefits from a large and very active user/developer community that continuously generates new plugins and tools. Its most powerful functions are the ability to use all sorts of visual encodings to represent attributes of your data and the multitude of layout options available. You can paint your network with any quantitative data you are interested in. It is also very useful for generating new hypotheses and for deciphering what your network might be telling you. Many useful tutorials are available online to help you get started and build your Cytoscape toolbox.


GenGIS

Rob Beiko

(see summary of Rob Beiko’s workshop presentation, p. 12)

Pyrotagger

Phil Hugenholtz

Pyrotagger is a software tool for phylogenetically clustering 454 pyrotag sequence data. It clusters similar reads, checks for chimeras (sequencing artefacts that do not represent real DNA sequences), and finally classifies reads using the Greengenes taxonomy.

Pathway Tools

David Walsh & Simon Eng

Pathway Tools is a software suite that facilitates the analysis of metabolic pathways in organisms and communities of organisms. The suite comprises several components: MetaCyc, a curated reference database encompassing metabolic pathways from all domains of life; PathoLogic, which constructs databases of predicted metabolic pathways in organisms and communities using MetaCyc as a base; and a viewer for navigating these databases and performing comparative analyses across them.


Discussion & Recommendations

Following the final presentation of the workshop, participants spent one hour discussing the impact the workshop had on them and the challenges they face in complex data analysis and visualization, and recommending how what was learned could be incorporated into their respective research programs. They also discussed recommendations for developing effective information design platforms and strategies in science as a whole. The challenges facing the scientific community with respect to bigDATA analysis and visualization fell into four categories: tools, training, data, and funding. The recommendations were intended to address these specific areas. Following is a list of comments, challenges, and recommendations brought to the table:

Workshop Comments

(see bigDATA Workshop Survey Results on p. 60 for additional comments)

· Excellent workshop; liked the heavy emphasis on visualization
· Particularly useful for getting researchers to think about effective and innovative ways to display data
· Data visualization presentations were a highlight, covering an area not normally addressed systematically in the day-to-day life of a scientist
· Very successful workshop that actually addressed the how of communicating our data, instead of being another meeting where we show bad, unclear figures

Challenges

Tools
· The reason we don’t use effective tools is often that they aren’t yet available
· In some cases too many tools exist, and we are unsure how effective each is or which is the best option for a given problem or dataset. Perhaps we need a rating system?
· Software documentation is a problem: users often have no idea how an example arrived at a given output, and documentation needs to cover more input, output, and formatting options. Documentation is also required to make tools portable across different computational systems.
· Whatever tool is easiest or most familiar will get used most, even if it’s not effective

Training
· There is very little formal computer training in the life sciences
· Analysis of data and communication of data are connected but different, and require different sorts of tools and skill sets

Data
· We are at the edge: our technological capacity to generate more and more data is running up against our ability to adequately analyse and present that data
· There are problems with using web portals for analyzing larger and larger datasets: there is not enough capacity to either analyze OR visualize these datasets

Funding
· Funding agencies aren’t always giving money for analytics and visualization tools; we need to influence funders to give money for visualization


· The funding model for science is different from that for software. In software, you just want to do the least you can to publish a tool, so there is little incentive to produce well-documented software or to keep updating it. Funding is often not available for v2 and beyond of software tools.

Recommendations

Tools
· We need to develop visual analytic tools that enable end users to navigate between and among different levels of biological information. These tools should be embedded in a social networking environment to promote collaborative scientific discovery, policy development and public outreach
· We must energize the developer community to create more open access tools focused on the bigDATA themes with an emphasis on visualization
· We need a clearinghouse where useful tools are collected and made available with necessary documentation
· Big data is new to biology but not new to other fields like information science or computer science, so lots of tools exist for scientists in other fields. The internet can help us find those resources and learn/adapt those existing tools for use in biology

Training
· More emphasis is required on training in communication of results and on development and maintenance of tools, both in undergraduate and continuing education
· We need a Gordon conference or some other regular meeting to cultivate the development of an active and engaged developer and user community
· Perhaps other CIFAR groups who also deal with big data (for example, some physics groups) could meet together with the IMB to bring everyone outside their comfort zones and discuss the future of data analysis
· lynda.com is an online resource where you can pay to learn to use various computational tools (Adobe, etc.)
· Integrate problem-oriented experiences that involve design and software analysis at the undergraduate level of education

Data
· There is a need to develop common standards of data collection, encoding, and formatting
· Need to create or open access to more high performance computing clusters and cloud computing services
· Get each scientific journal to hire a designer to do reviewing for figures while the content is still peer reviewed by scientists

Funding
· Philosophies, focus and funding have to switch over from data generation to analysis and informatics
· Perhaps a sort of “X Prize” or standing committee award for teams to compete for in the area of data analysis and visualization would be useful to move the field forward


Abstracts & Presenter Information

Brenda Andrews

Brenda is director of CIFAR’s Genetic Networks program, and professor and chair of the Banting & Best Department of Medical Research within the Faculty of Medicine at the University of Toronto. She is also director of the Terrence Donnelly Centre for Cellular and Biomolecular Research, a new interdisciplinary research institute with the mandate to create a research environment that encourages integration of biology, computer science, engineering and chemistry and that spans leading areas of biomedical research. After receiving her PhD in Medical Biophysics from the University of Toronto, Dr. Andrews obtained her early training in genetics with the late Dr. Ira Herskowitz at the University of California, San Francisco. In 1991, Dr. Andrews was recruited to the Department of Medical Genetics (now Medical Genetics & Microbiology) at the University of Toronto. She became chair of the department in 1999, a position she held for 5 years before assuming her current positions. Dr. Andrews’ current research interests include mechanisms of cell division control and polarity and functional genomics.

Contact

Brenda Andrews

Rm 230, The Donnelly Centre

160 College Street

Toronto, ON M5S 3E1

e-mail: [email protected]

URL: http://www.utoronto.ca/andrewslab/


Abstract

Determining how combinations of genetic variants or perturbations manifest themselves, particularly in the context of human disease, is a formidable challenge. To define general principles of genetic networks, our group has focused on the systematic identification of genetic interactions in the budding yeast. Synthetic genetic array (SGA) analysis provides a high throughput approach to automate yeast genetics. We have used SGA analysis to construct a genome-scale genetic interaction map by examining 5.4 million gene-gene pairs for synthetic genetic interactions, generating quantitative genetic interaction profiles for about 75% of all genes in Saccharomyces cerevisiae. The global network identifies functional cross-connections between all bioprocesses, mapping a cellular wiring diagram of pleiotropy. We have also expanded our SGA platform to encompass other types of genetic interactions and to include cell biological phenotypes and quantitative read-outs of the activity of specific biological pathways. In one project, we combined SGA with a high-content screening (HCS) platform to monitor morphological phenotypes of the growing mitotic spindle in both single gene deletion mutants and in selected double mutant arrays, sensitized for spindle defects. HCS enables virtually any pathway that can be monitored with a fluorescent reporter to be assessed quantitatively within the context of numerous genetic and environmental perturbations.

The Genetic Landscape of a Cell

A correlation-based network connecting genes with similar genetic interaction profiles. Genetic profile similarities were measured for all gene pairs by computing Pearson correlation coefficients (PCCs) from the complete genetic interaction matrix. Gene pairs whose profile similarity exceeded a PCC > 0.2 threshold were connected in the network and laid out using an edge-weighted, spring-embedded network layout algorithm (7, 8). Genes sharing similar patterns of genetic interactions are proximal to each other; less-similar genes are positioned farther apart. Colored regions indicate sets of genes enriched for GO biological processes summarized by the indicated terms.
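The thresholding step described in this caption can be sketched in a few lines. This is an illustrative sketch only, not the authors’ pipeline: it assumes the interaction matrix is a complete NumPy array with one row per gene (real SGA matrices contain missing values that would need handling), and the function name, gene names, and toy scores are hypothetical. The spring-embedded layout step is not covered.

```python
import numpy as np


def similarity_edges(profiles, gene_names, threshold=0.2):
    """Connect gene pairs whose interaction profiles have a PCC above threshold."""
    # np.corrcoef treats each row as one variable, so entry (i, j) is the
    # Pearson correlation between the profiles of gene i and gene j.
    pcc = np.corrcoef(profiles)
    edges = []
    n = len(gene_names)
    for i in range(n):
        for j in range(i + 1, n):
            if pcc[i, j] > threshold:
                edges.append((gene_names[i], gene_names[j], float(pcc[i, j])))
    return edges


# Toy profiles: geneA and geneB rise together, geneC runs the opposite way.
profiles = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
])
edges = similarity_edges(profiles, ["geneA", "geneB", "geneC"])
```

The resulting edge list, weighted by PCC, is the kind of input a network tool such as Cytoscape can lay out.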


Robert Beiko

Throughout his career, Rob has been passionate about the automated analysis of large datasets. His PhD research under Robert Charlebois at the University of Ottawa was focused on the detection of weak conserved patterns in the first published microbial genomes. During a postdoctoral stint with Mark Ragan at the University of Queensland, he shifted his attention to the impact and implications of lateral genetic transfer (LGT), using newly developed bioinformatic techniques to build a complete map of gene sharing among all published microbial genomes. In 2006 he took up a faculty position in computer science at Dalhousie University, and has expanded the search for LGT and other important evolutionary processes using simulations, new machine-learning techniques and metagenomic data analysis.

Contact

Robert Beiko

Faculty of Computer Science, Dalhousie University

6050 University Avenue

Halifax, NS B3H 1W5 CANADA

e-mail: [email protected]

URL: http://users.cs.dal.ca/~beiko/


Abstract

GenGIS (http://kiwi.cs.dal.ca/GenGis) is a new software tool that merges digital map, habitat, and georeferenced genomic data in a 3D GIS environment. The input files required by GenGIS are simple (Newick trees, comma-separated files) or standard (widely used digital map formats), but greater accessibility to maps and to standard metagenomic data sets can be achieved using relational databases and web services. The MOA database and web service (http://ratite.cs.dal.ca/moa/), originally developed to compute and serve comparative genomic data, has now been extended to support georeferenced metagenomic data sets that can be acquired and displayed in GenGIS. In this presentation I will give an overview of GenGIS and MOA, and show how the two have been linked together to support analysis of large metagenomic datasets. I will also briefly cover some of our other forays into analysis of emerging sequence datasets with tools such as SeqMonitor (http://ratite.cs.dal.ca/seqmonitor), which adds a temporal aspect to the geographic and genomic analyses described above.

Robert’s Recommended Readings

Lozupone, C.A., and Knight, R. (2007) Global patterns in bacterial diversity. Proceedings of the National Academy of Sciences 104: 11436-11440.

Walsh, D.A., Zaikova, E., Howes, C.G., Song, Y.C., Wright, J.J., Tringe, S.G. et al. (2009) Metagenome of a versatile chemolithoautotroph from expanding oceanic dead zones. Science 326: 578-582.

Horner-Devine, M.C., Lage, M., Hughes, J.B., and Bohannan, B.J.M. (2004) A taxa-area relationship for bacteria. Nature 432: 750-753.

Geographic and Temporal Analysis of Genomes and Metagenomes

Geographic visualizations of genetic variation using GenGIS. Top panel: phylogenetic tree of hemagglutinin proteins from the 2009 influenza A H1N1 'swine flu' outbreak. Isolate locations are coloured along a longitudinal gradient, from west (blue) to east (red). Isolates recovered from New York City are highlighted in the tree, illustrating the diversity of sequences sampled from this location. Bottom panel: distribution and phylogenetic tree of Oxyrrhis d from Lowe et al. (2010), "Patterns of genetic diversity in the marine heterotrophic flagellate Oxyrrhis marina (Alveolata: Dinophyceae)". The four distinct clades identified by the authors based on 5.8S rDNA and cytochrome oxidase analysis are indicated using different colours.


Contact

Arthur Brady

Bldg. #296, Rm. #3104D

University of Maryland

College Park, MD USA

e-mail: [email protected]

URL: http://www.cbcb.umd.edu/~abrady/

Arthur Brady

Arthur completed his PhD in computer science in 2008 at Tufts University, with a concentration in bioinformatics. He is currently working with Steven Salzberg as a postdoctoral researcher at the University of Maryland Center for Bioinformatics and Computational Biology. Recent areas of interest include metagenomics analysis, computational gene prediction and RNA expression analysis.


Abstract

Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. One main challenge facing metagenomic analysis is phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, a prerequisite for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics easier, by making sequencing faster, and more difficult, by producing shorter reads than previous technologies. Classifying sequences from reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. I’ll be presenting Phymm, a classifier for metagenomic data that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. I’ll also describe how combining Phymm with sequence alignment algorithms improves accuracy.

Phymm and PhymmBL: Phylogenetic Identification of Metagenomic Fragments


Contact

Miguel A. Fortuna

Postdoctoral Research Associate

Department of Ecology and Evolutionary Biology

Princeton University

Princeton, New Jersey 08540 USA

e-mail: [email protected]

URL: http://ieg.ebd.csic.es/fortuna/

Miguel A. Fortuna

Miguel obtained a PhD in Biology at the University of Seville under the supervision of Jordi Bascompte, where he applied the theory of complex networks to identify the spatial scale at which ecological and evolutionary processes take place. In 2009 he was awarded a Marie Curie Outgoing International Fellowship by the European Union to conduct postdoctoral work with Simon Levin at Princeton University. His current interests focus on developing an ecological theory of coevolution for understanding how natural selection shapes the architecture of ecological networks such as food webs and plant-animal mutualistic networks.


Ecological Networks: Nested and Modular Structures

Abstract

The architecture of complex networks of species interactions, such as predation, parasitism and mutualism, plays an important role in the persistence and stability of species-rich communities. If we look at the identity of who interacts with whom at a community-wide level, plant-animal mutualistic networks and host-parasite webs tend to show a significantly nested pattern wherein specialists interact with proper subsets of the species interacting with generalists. In food web studies, a significantly modular pattern, characterized by the existence of densely connected, non-overlapping subsets of species (called modules), has also been identified. In this case, modules are composed of species having many interactions among themselves and very few with species in other modules. The dynamical implications of these two community-level patterns have begun to be explored. In this talk I will illustrate how nestedness and modularity have also been detected in other ecological contexts such as scavenger communities, animal societies, and gene flow in plant populations. I also hope to show the potential of the network approach for exploring the metagenomic data from the next generation of sequencing projects.
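The nested pattern described in this abstract (specialists interacting with subsets of the partners of generalists) can be illustrated with a toy calculation. The sketch below is a simple illustrative index, not one of the nestedness measures used in the literature cited here; the function name and the example webs are hypothetical, and it counts equal partner sets as nested, whereas the abstract speaks of proper subsets.

```python
from itertools import combinations


def nested_pair_fraction(partners):
    """Fraction of species pairs in which the less-connected species'
    partner set is contained in the more-connected species' partner set.

    partners maps each species name to the set of species it interacts with.
    """
    pairs = list(combinations(partners, 2))
    if not pairs:
        return 1.0
    nested = 0
    for a, b in pairs:
        # Order the two partner sets by size, then test set containment.
        small, large = sorted((partners[a], partners[b]), key=len)
        if small <= large:
            nested += 1
    return nested / len(pairs)


# A perfectly nested web: each specialist uses a subset of the generalist's partners.
nested_web = {
    "generalist": {"p1", "p2", "p3"},
    "intermediate": {"p1", "p2"},
    "specialist": {"p1"},
}

# A modular web: two groups that share no partners with each other.
modular_web = {
    "a": {"p1", "p2"},
    "b": {"p1", "p2"},
    "c": {"p3", "p4"},
    "d": {"p3", "p4"},
}

nested_score = nested_pair_fraction(nested_web)
modular_score = nested_pair_fraction(modular_web)
```

In the toy data, every pair in the nested web satisfies the subset relation, while in the modular web only the within-module pairs do, so the two patterns score differently.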

Miguel’s Recommended Readings:

Loeuille, N. and Loreau, M. (2005). Evolutionary emergence of size-structured food webs. Proc. Natl. Acad. Sci. USA, 102: 5761-5766.

Bastolla, U., Fortuna, M. A., Pascual-Garcia, A., Ferrera, A., Luque, B., and Bascompte, J. (2009). The architecture of mutualistic networks minimizes competition and increases biodiversity. Nature, 458: 1018-1020.

Fortuna, M. A., Albaladejo, R., Fernandez, L., Aparicio, A., and Bascompte, J. (2009). Networks of spatial genetic variation across species. Proc. Natl. Acad. Sci. USA, 106: 19044-19049.

Modular structure of the bipartite roosting network of bird-predator bats. Nodes represent bats (n=25, on the left) and trees (m=73, on the right). The size of nodes is proportional (on a logarithmic scale) to the number of trees visited by each bat and to the number of bats visiting each tree, respectively. A link between a bat and a tree indicates that the bat visited the tree. The thickness of a link represents the fraction of days a particular tree was visited by a particular bat out of the total number of days that bat was recorded using trees; that is, it indicates how important each tree is for each bat. Colors represent the three modules detected by the modularity algorithm, that is, three groups of bats sharing the same roosting sites and their associated three groups of trees which are used by the same bats. In blue, n_1=7 and m_1=16; in green, n_2=8 and m_2=27; in red, n_3=10 and m_3=30.


Contact

Dr. Jennifer Gardy

Genome Research Laboratory

BC Centre for Disease Control

655 West 12th Ave.

Vancouver, BC

V5Z 4R4

ph.: 604-707-2488

e-mail: [email protected]

URL: www.bccdc.ca & http://www.cmdr.ubc.ca/~jennifer/

Jennifer Gardy

Jennifer is an adjunct professor in UBC’s Department of Microbiology & Immunology, and she runs the Genome Research Laboratory at the British Columbia Centre for Disease Control, where she uses genomics, bioinformatics, and network analysis to study the origins, spread and control of infectious disease. Prior to joining BCCDC, Jennifer was a postdoctoral fellow at UBC working on the InnateDB database and analysis environment, where she developed novel Cytoscape-based visualization tools for biological interaction and pathway data.


Abstract

“In this short session, participants will be introduced to the open-source interaction and network visualization tool Cytoscape (www.cytoscape.org). Topics to be covered include input formats, visualization and analysis methods, the power of plugins, and creative uses of the Cytoscape platform, as well as where to go for further information and training.”

Jennifer’s Recommended Readings:

Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S. Baliga, Jonathan T. Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. November 2003 13: 2498-2504.

Freifeld CC, Mandl KD, Reis BY, Brownstein JS. HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports. J Am Med Inform Assoc. 2008 Mar-Apr;15(2):150-7. Epub 2007 Dec 20.

Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D. et al. (2009) Circos: an information aesthetic for comparative genomics. Genome Res 19: 1639-1645.

Cytoscape

This is a combined phylogeny/social network of a tuberculosis outbreak. Circles are patients, and their colour indicates how infectious they are, based on their clinical presentation and lab test results (black cases are highly infectious and can transmit TB easily, grey cases are moderately infectious, white cases are extremely unlikely to transmit). There are two distinct genetic lineages in the outbreak as represented by the phylogenetic tree (a pink lineage and a blue lineage). Light blue lines connecting cases are social interactions that could have resulted in transmission of TB from person to person, while dark blue arrows are social interactions that we are very confident led to transmission.


Contact

Tara Gianoulis

Wyss Institute for Bio-Inspired Engineering at Harvard, Church Lab

Center for Life Science, Rm 528

3 Blackfan Circle

Boston, MA 02115

e-mail: [email protected]

URL: http://arep.med.harvard.edu/~tgianoulis

Tara Gianoulis

Tara completed her PhD in Computational Biology, jointly advised by Mark Gerstein and Michael Snyder, at Yale in 2009. She is currently a research fellow in George Church’s lab in the Wyss Institute for Bio-Inspired Engineering at Harvard.


Abstract

Recent metagenomics studies have begun to sample the genomic diversity among disparate habitats and relate this variation to features of the environment. Membrane proteins are an intuitive, but thus far overlooked, choice in this type of analysis as they directly interact with the environment, receiving signals from the outside and transporting nutrients. Using Global Ocean Sampling (GOS) data, we found approximately 900,000 membrane proteins in large scale metagenomic sequencing, approximately a fifth of which are completely novel, suggesting a large space of hitherto unexplored protein diversity. Using GPS coordinates for the GOS sites, we extracted additional environmental features. This allowed us to study membrane protein variation in terms of natural features, such as phosphate and nitrate concentrations, and also in terms of human impacts, such as pollution and climate change. We show that there is widespread variation in membrane protein content across marine sites, which is correlated with changes in both oceanographic variables and human factors. Further, using these data, we developed a network approach, Protein Families and Environment Features Network (PEN), to quantify and visualize the correlations. PEN identifies small groups of co-varying environmental features and membrane protein families, which we call “bimodules”.

Tara’s Recommended Readings:

Rusch DB, Halpern AL, et al. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007 Mar;5(3): 77.

Rawls, J.F., Mahowald, M.A., Ley, R.E., and Gordon, J.I. Reciprocal transplantation of gut microbial communities from zebrafish and mice into gnotobiotic recipients reveals host habitat selection of a microbiota. Cell 127:423-33 (2006).

Woyke T, Teeling H, Ivanova NN, Huntemann M, Richter M, Gloeckner FO, Boffelli D, Anderson IJ, Barry KW, Shapiro HJ, Szeto E, Kyrpides NC, Mussmann M, Amann R, Bergin C, Ruehland C, Rubin EM, Dubilier N. Symbiosis insights through metagenomic analysis of a microbial consortium. Nature. 2006 Oct 26;443(7114):950-5. Epub 2006 Sep 17.

Network Dynamics Across Environments: Metabolism and Membrane Proteins


Contact

Steven Hallam

University of British Columbia

Department of Microbiology & Immunology

Life Sciences Institute

2552-2350 Health Sciences Mall

Vancouver, BC Canada V6T 1Z3

e-mail: [email protected]

URL: http://www.cmde.science.ubc.ca/hallam/index.php

Steven Hallam

Steven is an Assistant Professor in the Department of Microbiology and Immunology at UBC and Canada Research Chair in Environmental Genomics. He is also a scholar in CIFAR’s Integrated Microbial Biodiversity program. He received his PhD from the University of California Santa Cruz, where he studied developmental regulation of neuronal asymmetry and synaptic remodeling in the model nematode C. elegans. Motivated by this experience in complex networks, he became a postdoctoral researcher at the Monterey Bay Aquarium Research Institute and later the Massachusetts Institute of Technology, focusing on microbial systems ecology with Edward DeLong. His current research interests include environmental genomics and genetics with specific emphasis on the creation of computational tools and workflows for taxonomic and functional binning, population genome assembly, and comparative community analysis.


Abstract

Metagenomics, the application of high throughput sequencing to environmental samples, has the great advantage over genomics that it bypasses the cultivation bottleneck and provides access to the largely unexplored microbial world. On the downside, metagenomes are usually aggregates of numerous microbial genomes that need to be informatically separated to facilitate analysis. Moreover, most members of a microbial community are not sampled sufficiently to produce even low-quality draft genomes. An alternative approach to analyzing communities without the need for cultivation is flow sorting constituent members and obtaining genomic data from sorted cells or populations. Although this is not a new approach, several recent advances have brought us to the point where it may be possible to sort and sequence representatives of most dominant populations (>0.5%) in a given community, essentially turning comparative metagenomics into comparative genomics. An additional benefit of this approach is the potential to obtain a phylogenetically balanced genomic representation of the microbial tree of life.

Phil Hugenholtz

Phil received his Ph.D. in microbiology from the University of Queensland, Brisbane, Australia, in 1994. He then pursued postdoctoral work with Norman Pace in the Department of Biology at Indiana University and later in the Department of Plant and Microbial Biology at the University of California, Berkeley. He joined the DOE Joint Genome Institute in May 2004 to lead the Microbial Ecology Program. His group is developing methods for analyzing metagenomic datasets and applying them to a number of interesting communities, including termite hindguts, sludges and compost.

Contact

Phil Hugenholtz

Head, Microbial Ecology Program

DOE Joint Genome Institute

2800 Mitchell Drive Bldg 400-440

Walnut Creek, CA 94598

phone: 925-296-5725

Fax: 925-296-5720

e-mail: [email protected]

URL: http://www.jgi.doe.gov/research/hugenholtz.html

Divide and Conquer Strategies for Metagenomics


Contact

Martin Krzywinski

Canada’s Michael Smith Genome Sciences Centre

100-570 West 7th Avenue

Vancouver BC V5Z 4S6 Canada

office 604.877.6000 ext. 673262

cell 604.782.1024

fax 604.876.3561

e-mail: [email protected]

URL: http://mkweb.bcgsc.ca

Martin Krzywinski

Martin started as a system administrator at Canada’s Michael Smith Genome Sciences Centre (www.bcgsc.ca) in 1999 and built its first computing and network infrastructure (www.linuxjournal.com/article/6977), applying his interests in computing to IT security (www.linuxjournal.com/article/6811) and visualization (mkweb.bcgsc.ca/schemaball). He later moved to research, using fingerprint mapping to identify rearrangements in cancer genomes. In an attempt to visualize structural variation seen in cancer, he created Circos (mkweb.bcgsc.ca/circos), which has become a common paradigm for displaying comparisons of genomes. His information graphics have appeared in the New York Times, Wired, and on the covers of books and scientific journals. Martin believes that form and function can (must) coexist in any visual forum, runs the espresso club at the GSC, and applies his creative style to fashion photography (www.lumondo.com).


Abstract

The evolutionary process has provided us with extremely sophisticated spatial perception and visual pattern recognition. Although computers easily outpace our ability to perform raw numerical calculations, we have yet to devise a system that competes with our own ability to visually identify complex patterns and structures. We have developed visualization paradigms to harness powerful vision-based cognitive resources. Our analytical reasoning can be significantly enhanced by well-crafted visualizations, which can suggest hypotheses by revealing patterns that are unexpected or otherwise difficult to parametrize. We recognize the benefit in effectively displaying and communicating our data and results. We wish to generate figures that have a clear message and emphasize important aspects of the data while preserving its underlying texture, all while distinguishing signal from noise. In practice, achieving this in biological sciences is made difficult by the large number of variables in the data, their inherent variability and measurement error. These challenges are compounded by continually developing technologies, which permit new types of questions, which in turn require new approaches to visualization. No one-stop solution to visualization currently exists; each data set can benefit from a variety of approaches. The circular layout used by the visualization tool Circos is ideal for communicating large amounts of information at a variety of length scales in the same figure. Features such as dynamic zooming and dynamic data formatting rules have been implemented to address challenges inherent in drawing genomic data. Circos can be automated and incorporated into data pipelines to create exploratory figures for screen viewing, as well as high-resolution publication-ready figures. It is an ideal platform for the display and exploration of relationships between entities, and has been used in a wide range of applications, from characterizing the structural landscapes of cancer genomes to characterizing relationships between characters on the TV show Lost.

Martin’s Recommended Readings

Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley.

Tufte, E.R. (1983). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.

Anders, S. (2009). Visualization of genomic data with the Hilbert curve. Bioinformatics 25: 1231-1235.

bigFIGURES for bigDATA: Creating Information-Rich, Informative and Appealing Genome Data Graphics

The chart shows the genomes of a variety of bacteria and viruses that cause human disease. The x-axis represents the disease burden, the average number of worldwide deaths due to the disease. The y-axis depicts mortality, the percentage of worldwide cases that result in death. Each colored line represents the genome of the bacterium or virus that causes the disease. The size of the genome and the percentage of guanine or cytosine in the genome (GC content) are also noted. For viruses, the genomes of several strains are shown when possible.


Contact

Angela D. Norbeck

Pacific Northwest National Laboratory

902 Battelle Boulevard

P.O. Box 999, MSIN K8-98

Richland, WA 99352 USA

Tel: 509-371-6575

e-mail: [email protected]

URL: http://www.pnl.gov/

Angela Norbeck

Angela received her BSc in oceanography and chemistry from the University of Washington in 2000, learning the ins and outs of mass spectrometry from Kenneth Walsh and Richard Keil. In 2004, following several years of technical service at UW and the Fred Hutchinson Cancer Research Center, she joined the staff at the Pacific Northwest National Laboratory. She is currently a Senior Research Scientist specializing in mass spectrometry and protein identification and quantification. In this capacity she has studied an array of biological systems including bacterial, plant, animal, and organic matter of unknown origin. Her current research interests include the development and application of clustering methods and visual analytic tools for systems-level studies of the microcosm.


Abstract

The study of complex biological systems requires the integration of descriptive information and experimental data from multiple experiments in order to reveal a biological story that is accurate and representative of the system as a whole. High-throughput experiments, such as genomic microarrays and proteomics, have generated millions of sequences that, if interpretable, hold a wealth of knowledge about the system under study. IMPROV, or the Integrated MetaPROteomics Viewer, is designed to collate and cluster multiple data types and display protein expression data with interactive views and interpretable visualization. Development is driven by ongoing biological investigations, including microbial community proteomics.

Angela’s Recommended Readings:

Yooseph et al, “The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families”, PLoS Biology, 2007. March, vol 5 (3), 432-466.

Wilmes et al, “Metaproteomics: studying functional gene expression in microbial ecosystems”, TRENDS in Microbiology, 2006. Feb, vol 14 (2), 92-97.

Gehlenborg et al, “Visualization of omics data for systems biology”, Nature Methods Supplement, 2010. March, vol 7 (3), S56-S68.

IMPROV for Data Integration and Visualization

Galaxy view of metaproteomics data. Clusters of protein sequences are represented by dark black rectangular boundaries, and the size of the cluster represents the number of proteins within the cluster. Identified peptides, aligning to identified proteins, are highlighted on a scale from black (no coverage) through yellow (moderate coverage) to white (high coverage).


Contact

Prof. Christian von Mering

Bioinformatics Group

Institute of Molecular Life Sciences

University of Zurich

Winterthurerstrasse 190

8057 Zurich, Switzerland

e-mail: [email protected]

URL: http://www.imls.uzh.ch/research/vonmering.html

Tel: +41-44-6353147

Christian von Mering

Christian studied biochemistry at the Free University of Berlin. He completed his Ph.D. in developmental biology at the University of Zurich, working on early wing development in Drosophila melanogaster. He then moved to the European Molecular Biology Laboratory (EMBL) in Heidelberg, working as a postdoc in the group of Peer Bork. There, he conducted a number of computational biology projects in the areas of protein-protein interaction networks and metagenomics. Since 2006, Dr. von Mering has been Associate Professor at the University of Zurich. He is a founding member of the consortium behind the protein network resource STRING (http://string-db.org/), and his group continues to develop STRING with long-term support from the Swiss federal government.


Abstract

Microbes are the most abundant and diverse organisms on Earth. In contrast to macroscopic organisms, their environmental preferences and ecological inter-dependencies remain difficult to assess, requiring laborious molecular surveys at diverse sampling sites. Here we present a global meta-analysis of previously sampled microbial lineages in the environment. We grouped publicly available 16S ribosomal RNA sequences into operational taxonomic units at various levels of resolution, and systematically searched these for co-occurrence across environments. Naturally occurring microbes indeed exhibited numerous significant inter-lineage associations. These ranged from relatively specific groupings encompassing only a few lineages to larger assemblages of microbes with shared habitat preferences. Many of the co-existing lineages were phylogenetically closely related, but a significant number of distant associations were observed as well. The increased availability of completely sequenced genomes allowed us, for the first time, to search for genomic correlates of such ecological associations. Genomes from coexisting microbes tended to be more similar than expected by chance, both with respect to pathway content and genome size. We hypothesize that groupings of lineages are often ancient, and that they may have significantly impacted genome evolution.
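One simple way to ask whether two lineages co-occur more often than chance is a one-sided hypergeometric test on presence/absence across samples. The sketch below illustrates that idea only; it is not necessarily the statistic used in the study, it applies no multiple-testing correction, and the lineage names, sample IDs, and function names are hypothetical.

```python
from itertools import combinations
from math import comb


def cooccurrence_p_value(n_samples: int, n_a: int, n_b: int, n_both: int) -> float:
    """P(overlap >= n_both) if lineage B's n_b samples were drawn at random
    from n_samples, of which n_a contain lineage A (hypergeometric upper tail)."""
    total = comb(n_samples, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_samples - n_a, n_b - k)
        for k in range(n_both, min(n_a, n_b) + 1)
    )
    return tail / total


def significant_pairs(presence, n_samples, alpha=0.05):
    """Return (lineage_a, lineage_b, p) for pairs whose co-occurrence p-value is below alpha."""
    results = []
    for (a, samples_a), (b, samples_b) in combinations(sorted(presence.items()), 2):
        p = cooccurrence_p_value(
            n_samples, len(samples_a), len(samples_b), len(samples_a & samples_b)
        )
        if p < alpha:
            results.append((a, b, p))
    return results


# Hypothetical toy data: lineage -> set of sample IDs in which it was detected.
N_SAMPLES = 20
PRESENCE = {
    "OTU_1": {0, 1, 2, 3, 4},
    "OTU_2": {0, 1, 2, 3, 5},
    "OTU_3": {10, 11, 12, 13, 14},
}

if __name__ == "__main__":
    for a, b, p in significant_pairs(PRESENCE, N_SAMPLES):
        print(f"{a} and {b} co-occur more than expected (p = {p:.4f})")
```

In a real meta-analysis the number of lineage pairs is very large, so some correction for multiple testing would be needed before calling any association significant.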

Christian’s Recommended Readings:

Wu D, et al. (2009) “A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea.” Nature. 2009 Dec 24;462(7276):1056-60.

Martiny JB, et al. (2006) “Microbial biogeography: putting microorganisms on the map.” Nat Rev Microbiol. 2006 Feb;4(2):102-12. Review.

Horner-Devine MC, Lage M, Hughes JB, Bohannan BJ (2004) “A taxa-area relationship for bacteria.” Nature. 2004 Dec 9;432(7018):750-3.

Meta-analysis of Published 16S Sequences Can (Re)Connect Sequenced Genomes to Their Environmental Contexts


Microbial coexistence networks. Microbial lineages occurring in the environment (often nameless and uncharacterized) can be detected and classified by molecular surveys. Such data have by now accumulated worldwide, allowing statistical inferences about coexistence among lineages. The figure shows statistically significant coexistence connections among microbial lineages, clustered and visualized against a map of the world (in the background, for illustration purposes only).


Anonymous microbes mapped to the tree of life. The availability of completely sequenced genomes provides essential reference information for microbial sequence data from the environment. The schematic shows four distinct environments from which microbial genome fragments have been sampled and mapped to the tree of life (using the MLTreeMap resource; http://mltreemap.org/).


Contact

Jenn + Ken Visocky O’Grady

c/o Enspace

29841 Lake Road

Bay Village, OH 44140, USA

(216) 410-8332 (Jenn’s mobile)

(216) 408-1013 (Ken’s mobile)

e-mail: [email protected]

e-mail: [email protected]

URL: http://www.enspacedesign.com/

Jenn and Ken Visocky O’Grady

Jenn + Ken Visocky O’Grady are partners in business and life. The couple cofounded Enspace, a creative think tank where collaboration enhances visual communication. The firm’s work has been recognized by numerous organizations and featured internationally in magazines and books. Together they have had the privilege to travel North America, jurying competitions and presenting workshops and lectures. Most recently they have served as consultants for the RGD Accessibility Project, defining best practices and supporting the implementation of information and communications standards under the Accessibility for Ontarians with Disabilities Act (AODA). They also promote the value of design in the classroom: Jenn as an Associate Professor at Cleveland State University, and Ken as an Associate Professor at Kent State University. Their first book, A Designer’s Research Manual, is suggested preparatory text for a portion of the Canadian RGD Qualification Examination. Their second, The Information Design Handbook, was released in September 2008.


Abstract

In a global environment, clear and accessible communication across a range of media has become essential. A broad understanding of design principles, framed by cognition, communication, and aesthetic theories, can help visualize complex ideas. This presentation will explore the connections and convergences between perception, thinking, and learning; how we transmit knowledge, share concepts, and process information through visual language; and how structure and legibility affect the visualization of messaging.

Jenn & Ken’s Recommended Readings

GOOD Magazine “Transparency” feature

GOOD Magazine has a section of each issue, called “Transparency,” devoted to the visualization of complex topics. They invite celebrated information designers to act as guest contributors. You can also catch many of these infographics on their web site at: http://www.good.is/departments/transparency/

Envisioning Information by Edward Tufte

An information design classic.

Universal Principles of Design by William Lidwell, Kritina Holden, Jill Butler

The best all-in-one reference we’ve found for design principles, regardless of discipline.

Information Design Dissected

The Pioneer plaque, NASA’s greeting to interstellar life, was mounted on Pioneer 10, the first spacecraft to travel outside the boundaries of our solar system (launched on March 2, 1972). Pioneer 10 transports an example of information design intended as possible first contact with alien life.

The personal phase of human interaction (according to the Uncertainty Reduction Theory) occurs when the engaged parties begin to feel relaxed and start to share information more freely.


Contact

David A. Walsh

Department of Biology, Concordia University

7141 Sherbrooke Street West

Montreal, Quebec H4B 1R6

e-mail: [email protected]

URL: http://www.cmde.science.ubc.ca/hallam/peoplewalsh.php

David Walsh

David received his B.Sc. from the University of Victoria and his Ph.D. from Dalhousie University, where he studied and contributed to the fields of microbial evolution and ecology. In 2007 David received fellowships from the Tula Foundation and Killam Trusts to conduct postdoctoral research with Steven Hallam at the University of British Columbia. In July 2010, David moves to Concordia University, where he will hold a Canada Research Chair in Microbial Systems Ecology. He is an Associate of CIFAR’s Integrated Microbial Biodiversity program.


Abstract

Earth’s ecology is governed by microbial life, as evidenced by the widespread dominance of microbes and the central role their activities and interactions play in the maintenance of ecosystem function. In recent years, application of the systems biology toolbox (e.g. genomics, transcriptomics, proteomics) to whole microbial communities has led to a deeper description of the depth, breadth, and ecological importance of microbial diversity. Yet the grand challenge, to model complex interactions from gene to ecosystem scales using high-throughput, large-scale descriptive technologies, remains open. Microbial systems ecology couples systems biology with ecological theory and practice and with physical, chemical, and biological system descriptions in order to meet this challenge. This integration is essential to understanding and predicting ecosystem function and response to global environmental change. The oceans play a central role in climate regulation through storage of heat and anthropogenic carbon dioxide. Currently, ocean warming is leading to an expansion of naturally occurring regions of low oxygen (hypoxia) known as oxygen minimum zones. Ecosystem models demonstrate that increasing hypoxia leads to a diversion of energy from higher trophic levels into microbial pathways. Moreover, the extent of hypoxia regulates the microbially mediated loss of fixed nitrogen from the oceans and controls the oceanic production of greenhouse gases. Given the ecological impact of marine hypoxia, and the contribution of microbial processes to oxygen-sensitive biogeochemical cycles, it is important to further understand the microbial ecology of hypoxic systems. To this end, we report on a time-series study of a seasonally hypoxic coastal basin (Saanich Inlet) in British Columbia. We highlight the power and limitations of our approach to modeling and predicting microbial community response to marine hypoxia in the world’s oceans. Moreover, we discuss the general characteristics desirable in a model system conducive to microbial systems ecology research.

David’s Recommended Readings

Proteorhodopsin phototrophy in the ocean. Béjà O, Spudich EN, Spudich JL, Leclerc M, DeLong EF. Nature. 2001 Jun 14;411(6839):786-9.

Community structure and metabolism through reconstruction of microbial genomes from the environment. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF. Nature. 2004 Mar 4;428(6978):37-43. Epub 2004 Feb 1.

Quantifying environmental adaptation of metabolic pathways in metagenomics. Gianoulis TA, Raes J, Patel PV, Bjornson R, Korbel JO, Letunic I, Yamada T, Paccanaro A, Jensen LJ, Snyder M, Bork P, Gerstein MB. Proc Natl Acad Sci U S A. 2009 Feb 3;106(5):1374-9. Epub 2009 Jan 22.

Ocean Health: A Case Study for Microbial Systems Ecology

Ecological indicator analysis reveals diagnostic genes for OMZs and oligotrophic oceanic waters spanning gradients of light, oxygen and nutrients. Saanich Inlet (SI) surface: 375 indicator operational protein families (OPFs); SI basin: 1530 indicator OPFs; Hawaii Ocean Time-series (HOT) station ALOHA: 311 indicator OPFs.


Selected Publications from Workshop Presenters

Brenda Andrews

Asenjo, A.J., Ramirez, P., Rapaport, I., Aracena, J., Goles, E., and Andrews, B.A. (2007) A discrete mathematical model applied to genetic regulation and metabolic networks. J Microbiol Biotechnol 17: 496-510.

Baetz, K., Measday, V., and Andrews, B. (2006) Revealing hidden relationships among yeast genes involved in chromosome segregation using systematic synthetic lethal and synthetic dosage lethal screens. Cell Cycle 5: 592-595.

Boone, C., Bussey, H., and Andrews, B.J. (2007) Exploring genetic interactions and networks with yeast. Nat Rev Genet 8: 437-449.

Bussey, H., Andrews, B., and Boone, C. (2006) From worm genetic networks to complex human diseases. Nat Genet 38: 862-863.

Costanzo, M., Giaever, G., Nislow, C., and Andrews, B. (2006) Experimental approaches to identify genetic networks. Curr Opin Biotechnol 17: 472-480.

Diaz, H., Andrews, B.A., Hayes, A., Castrillo, J., Oliver, S.G., and Asenjo, J.A. (2009) Global gene expression in recombinant and non-recombinant yeast Saccharomyces cerevisiae in three different metabolic states. Biotechnol Adv 27: 1092-1117.

Dixon, S.J., Andrews, B.J., and Boone, C. (2009) Exploring the conservation of synthetic lethal genetic interaction networks. Commun Integr Biol 2: 78-81.

Dixon, S.J., Costanzo, M., Baryshnikova, A., Andrews, B., and Boone, C. (2009) Systematic mapping of genetic interaction networks. Annu Rev Genet 43: 601-625.

Friesen, H., Humphries, C., Ho, Y., Schub, O., Colwill, K., and Andrews, B. (2006) Characterization of the yeast amphiphysins Rvs161p and Rvs167p reveals roles for the Rvs heterodimer in vivo. Mol Biol Cell 17: 1306-1321.

Haynes, J., Garcia, B., Stollar, E.J., Rath, A., Andrews, B.J., and Davidson, A.R. (2007) The biologically relevant targets and binding affinity requirements for the function of the yeast actin-binding protein 1 Src-homology 3 domain vary with genetic context. Genetics 176: 193-208.

Hoke, S.M., Guzzo, J., Andrews, B., and Brandl, C.J. (2008) Systematic genetic array analysis links the Saccharomyces cerevisiae SAGA/SLIK and NuA4 component Tra1 to multiple cellular processes. BMC Genet 9: 46.

Huang, D., Friesen, H., and Andrews, B. (2007) Pho85, a multifunctional cyclin-dependent protein kinase in budding yeast. Mol Microbiol 66: 303-314.

Kainth, P., and Andrews, B. Quantitative cell array screening to identify regulators of gene expression. Brief Funct Genomics 9: 13-23.

Kainth, P., Sassi, H.E., Pena-Castillo, L., Chua, G., Hughes, T.R., and Andrews, B. (2009) Comprehensive genetic analysis of transcription factor pathways using a dual reporter gene system in budding yeast. Methods 48: 258-264.

Kurat, C.F., Wolinski, H., Petschnigg, J., Kaluarachchi, S., Andrews, B., Natter, K., and Kohlwein, S.D. (2009) Cdk1/Cdc28-dependent activation of the major triacylglycerol lipase Tgl4 in yeast links lipolysis to cell-cycle progression. Mol Cell 33: 53-63.

Liu, C., van Dyk, D., Li, Y., Andrews, B., and Rao, H. (2009) A genome-wide synthetic dosage lethality screen reveals multiple pathways that require the functioning of ubiquitin-binding proteins Rad23 and Dsk2. BMC Biol 7: 75.

Sassi, H.E., Bastajian, N., Kainth, P., and Andrews, B.J. (2009) Reporter-based synthetic genetic array analysis: a functional genomics approach for investigating the cell cycle in Saccharomyces cerevisiae. Methods Mol Biol 548: 55-73.


Sopko, R., Huang, D., Smith, J.C., Figeys, D., and Andrews, B.J. (2007) Activation of the Cdc42p GTPase by cyclin-dependent protein kinases in budding yeast. EMBO J 26: 4487-4500.

Sopko, R., Papp, B., Oliver, S.G., and Andrews, B.J. (2006) Phenotypic activation to discover biological pathways and kinase substrates. Cell Cycle 5: 1397-1402.

Traven, A., Lo, T.L., Pike, B.L., Friesen, H., Guzzo, J., Andrews, B., and Heierhorst, J. Dual functions of Mdt1 in genome maintenance and cell integrity pathways in Saccharomyces cerevisiae. Yeast 27: 41-52.

Vizeacoumar, F.J., Chong, Y., Boone, C., and Andrews, B.J. (2009) A picture is worth a thousand words: genomics to phenomics in the yeast Saccharomyces cerevisiae. FEBS Lett 583: 1656-1661.

Yan, Z., Costanzo, M., Heisler, L.E., Paw, J., Kaper, F., Andrews, B.J. et al. (2008) Yeast Barcoders: a chemogenomic application of a universal donor-strain collection carrying bar-code identifiers. Nat Methods 5: 719-725.

Zou, J., Friesen, H., Larson, J., Huang, D., Cox, M., Tatchell, K., and Andrews, B. (2009) Regulation of cell polarity through phosphorylation of Bni4 by Pho85 G1 cyclin-dependent kinases in Saccharomyces cerevisiae. Mol Biol Cell 20: 3239-3250.

Robert Beiko

Bapteste, E., O’Malley, M.A., Beiko, R.G., Ereshefsky, M., Gogarten, J.P., Franklin-Hall, L. et al. (2009) Prokaryotic evolution and the tree of life are two different things. Biol Direct 4: 34.

Beiko, R.G., Chan, C.X., and Ragan, M.A. (2005) A word-oriented approach to alignment validation. Bioinformatics 21: 2230-2239.

Beiko, R.G., and Charlebois, R.L. (2005) GANN: genetic algorithm neural networks for the detection of conserved combinations of features in DNA. BMC Bioinformatics 6: 36.

Beiko, R.G., and Charlebois, R.L. (2007) A simulation test bed for hypotheses of genome evolution. Bioinformatics 23: 825-831.

Beiko, R.G., Doolittle, W.F., and Charlebois, R.L. (2008) The impact of reticulate evolution on genome phylogeny. Syst Biol 57: 844-856.

Beiko, R.G., and Hamilton, N. (2006) Phylogenetic identification of lateral genetic transfer events. BMC Evol Biol 6: 15.

Beiko, R.G., Harlow, T.J., and Ragan, M.A. (2005) Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A 102: 14332-14337.

Beiko, R.G., Keith, J.M., Harlow, T.J., and Ragan, M.A. (2006) Searching for convergence in phylogenetic Markov chain Monte Carlo. Syst Biol 55: 553-565.

Beiko, R.G., and Ragan, M.A. (2008) Detecting lateral genetic transfer: a phylogenetic approach. Methods Mol Biol 452: 457-469.

Beiko, R.G., and Ragan, M.A. (2009) Untangling hybrid phylogenetic signals: horizontal gene transfer and artifacts of phylogenetic reconstruction. Methods Mol Biol 532: 241-256.

Chan, C.X., Beiko, R.G., and Ragan, M.A. (2006) Detecting recombination in evolving nucleotide sequences. BMC Bioinformatics 7: 412.

Chan, C.X., Darling, A.E., Beiko, R.G., and Ragan, M.A. (2009) Are protein domains modules of lateral genetic transfer? PLoS One 4: e4524.

Charlebois, R.L., Beiko, R.G., and Ragan, M.A. (2003) Microbial phylogenomics: Branching out. Nature 421: 217.


Charlebois, R.L., Clarke, G.D., Beiko, R.G., and St Jean, A. (2003) Characterization of species-specific genes using a flexible, web-based querying system. FEMS Microbiol Lett 225: 213-220.

Clarke, G.D., Beiko, R.G., Ragan, M.A., and Charlebois, R.L. (2002) Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J Bacteriol 184: 2072-2080.

Davies, M.R., McMillan, D.J., Beiko, R.G., Barroso, V., Geffers, R., Sriprakash, K.S., and Chhatwal, G.S. (2007) Virulence profiling of Streptococcus dysgalactiae subspecies equisimilis isolated from infected humans reveals 2 distinct genetic lineages that do not segregate with their phenotypes or propensity to cause diseases. Clin Infect Dis 44: 1442-1454.

MacDonald, N., Parks, D., and Beiko, R. (2009) SeqMonitor: influenza analysis pipeline and visualization. PLoS Curr Influenza: RRN1040.

Martin, C.C., Tsang, C.H., Beiko, R.G., and Krone, P.H. (2002) Expression and genomic organization of the zebrafish chaperonin gene complex. Genome 45: 804-811.

McMillan, D.J., Beiko, R.G., Geffers, R., Buer, J., Schouls, L.M., Vlaminckx, B.J. et al. (2006) Genes for the majority of group A streptococcal virulence factors and extracellular surface proteins do not confer an increased propensity to cause invasive disease. Clin Infect Dis 43: 884-891.

Parks, D., MacDonald, N., and Beiko, R. (2009) Tracking the evolution and geographic spread of influenza A. PLoS Curr Influenza: RRN1014.

Parks, D.H., and Beiko, R.G. (2010) Identifying biologically relevant differences between metagenomic communities. Bioinformatics.

Parks, D.H., Porter, M., Churcher, S., Wang, S., Blouin, C., Whalley, J. et al. (2009) GenGIS: A geospatial information system for genomic data. Genome Res 19: 1896-1904.

Ragan, M.A., and Beiko, R.G. (2009) Lateral genetic transfer: open issues. Philos Trans R Soc Lond B Biol Sci 364: 2241-2251.

Ragan, M.A., Harlow, T.J., and Beiko, R.G. (2006) Do different surrogate methods detect lateral genetic transfer events of different relative ages? Trends Microbiol 14: 4-8.

Whalley, J., Brooks, S., and Beiko, R.G. (2009) Radie: visualizing taxon properties and parsimonious mappings using a radial phylogenetic tree. Bioinformatics 25: 672-673.

Arthur Brady

Brady, A., Maxwell, K., Daniels, N., and Cowen, L.J. (2009) Fault tolerance in protein interaction networks: stable bipartite subgraphs and redundant pathways. PLoS One 4: e5364.

Brady, A., and Salzberg, S.L. (2009) Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods 6: 673-676.

Miguel A. Fortuna

Bastolla, U., Fortuna, M.A., Pascual-Garcia, A., Ferrera, A., Luque, B., and Bascompte, J. (2009) The architecture of mutualistic networks minimizes competition and increases biodiversity. Nature 458: 1018-1020.

Fortuna, M.A., Albaladejo, R.G., Fernandez, L., Aparicio, A., and Bascompte, J. (2009) Networks of spatial genetic variation across species. Proc Natl Acad Sci U S A 106: 19044-19049.

Fortuna, M.A., and Bascompte, J. (2006) Habitat loss and the structure of plant-animal mutualistic networks. Ecol Lett 9: 281-286.


Fortuna, M.A., Garcia, C., Guimaraes, P.R., Jr., and Bascompte, J. (2008) Spatial mating networks in insect-pollinated plants. Ecol Lett 11: 490-498.

Fortuna, M.A., Gomez-Rodriguez, C., and Bascompte, J. (2006) Spatial network structure and amphibian persistence in stochastic environments. Proc Biol Sci 273: 1429-1434.

Fortuna, M.A., and Melian, C.J. (2007) Do scale-free regulatory networks allow more expression than random ones? J Theor Biol 247: 331-336.

Fortuna, M.A., Popa-Lisseanu, A.G., Ibanez, C., and Bascompte, J. (2009) The roosting spatial network of a bird-predator bat. Ecology 90: 934-944.

Fortuna, M.A., Stouffer, D.B., Olesen, J.M., Jordano, P., Mouillot, D., Krasnov, B.R. et al. Nestedness versus modularity in ecological networks: two sides of the same coin? J Anim Ecol.

Rezende, E.L., Albert, E.M., Fortuna, M.A., and Bascompte, J. (2009) Compartments in a marine food web associated with phylogeny, body mass, and habitat structure. Ecol Lett 12: 779-788.

Selva, N., and Fortuna, M.A. (2007) The nested structure of a scavenger community. Proc Biol Sci 274: 1101-1108.

Jennifer Gardy

Barsky, A., Gardy, J.L., Hancock, R.E., and Munzner, T. (2007) Cerebral: a Cytoscape plugin for layout of and interaction with biological networks using subcellular localization annotation. Bioinformatics 23: 1040-1042.

Barsky, A., Munzner, T., Gardy, J., and Kincaid, R. (2008) Cerebral: visualizing multiple experimental conditions on a graph with biological context. IEEE Trans Vis Comput Graph 14: 1253-1260.

Brown, K.L., Cosseau, C., Gardy, J.L., and Hancock, R.E. (2007) Complexities of targeting innate immunity to treat infection. Trends Immunol 28: 260-266.

Brown, K.L., Falsafi, R., Kum, W., Hamill, P., Gardy, J.L., Davidson, D.J. et al. Robust TLR4-induced gene expression patterns are not an accurate indicator of human immunity. J Transl Med 8: 6.

Cosseau, C., Devine, D.A., Dullaghan, E., Gardy, J.L., Chikatamarla, A., Gellatly, S. et al. (2008) The commensal Streptococcus salivarius K12 downregulates the innate immune responses of human epithelial cells and promotes host-microbe homeostasis. Infect Immun 76: 4163-4175.

Gardy, J.L., and Brinkman, F.S. (2006) Methods for predicting bacterial protein subcellular localization. Nat Rev Microbiol 4: 741-751.

Gardy, J.L., Laird, M.R., Chen, F., Rey, S., Walsh, C.J., Ester, M., and Brinkman, F.S. (2005) PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics 21: 617-623.

Gardy, J.L., Lynn, D.J., Brinkman, F.S., and Hancock, R.E. (2009) Enabling a systems biology approach to immunology: focus on innate immunity. Trends Immunol 30: 249-262.

Gardy, J.L., Spencer, C., Wang, K., Ester, M., Tusnady, G.E., Simon, I. et al. (2003) PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res 31: 3613-3617.

Lee, S.M., Gardy, J.L., Cheung, C.Y., Cheung, T.K., Hui, K.P., Ip, N.Y. et al. (2009) Systems-level comparison of host-responses elicited by avian H5N1 and seasonal H1N1 influenza viruses in primary human macrophages. PLoS One 4: e8072.

Lewenza, S., Gardy, J.L., Brinkman, F.S., and Hancock, R.E. (2005) Genome-wide identification of Pseudomonas aeruginosa exported proteins using a consensus computational strategy combined with a laboratory-based PhoA fusion screen. Genome Res 15: 321-329.

Lynn, D.J., Winsor, G.L., Chan, C., Richard, N., Laird, M.R., Barsky, A. et al. (2008) InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol 4: 218.


Mookherjee, N., Hamill, P., Gardy, J., Blimkie, D., Falsafi, R., Chikatamarla, A. et al. (2009) Systems biology evaluation of immune responses induced by human host defence peptide LL-37 in mononuclear cells. Mol Biosyst 5: 483-496.

Mookherjee, N., Lippert, D.N., Hamill, P., Falsafi, R., Nijnik, A., Kindrachuk, J. et al. (2009) Intracellular receptor for human host defense peptide LL-37 in monocytes. J Immunol 183: 2688-2696.

Rey, S., Acab, M., Gardy, J.L., Laird, M.R., deFays, K., Lambert, C., and Brinkman, F.S. (2005) PSORTdb: a protein subcellular localization database for bacteria. Nucleic Acids Res 33: D164-168.

Rey, S., Gardy, J.L., and Brinkman, F.S. (2005) Assessing the precision of high-throughput computational and laboratory approaches for the genome-wide identification of protein subcellular localization in bacteria. BMC Genomics 6: 162.

Skowronski, D.M., De Serres, G., Crowcroft, N.S., Janjua, N.Z., Boulianne, N., Hottes, T.S. et al. Association between the 2008-09 seasonal influenza vaccine and pandemic H1N1 illness during spring-summer 2009: four observational studies from Canada. PLoS Med 7: e1000258.

Vivona, S., Gardy, J.L., Ramachandran, S., Brinkman, F.S., Raghava, G.P., Flower, D.R., and Filippini, F. (2008) Computer-aided biotechnology: from immuno-informatics to reverse vaccinology. Trends Biotechnol 26: 190-200.

Tara Gianoulis

Borneman, A.R., Gianoulis, t.A., zhang, z.d., Yu, h., Rozowsky, J., seringhaus, m.R. et al. (2007) divergence of transcription factor binding sites across related yeast species. science 317: 815-819.

Gianoulis, t.A., Raes, J., patel, p.v., Bjornson, R., korbel, J.o., letunic, i. et al. (2009) Quantifying environmental adaptation of metabolic pathways in metagenomics. proc natl Acad sci u s A 106: 1374-1379.

Goh, c.s., Gianoulis, t.A., liu, Y., li, J., paccanaro, A., lussier, Y.A., and Gerstein, m. (2006) integration of curated databases to identify genotype-phenotype associations. Bmc Genomics 7: 257.

lu, l.J., sboner, A., huang, Y.J., lu, h.x., Gianoulis, t.A., Yip, k.Y. et al. (2007) comparing classical pathways and modern networks: towards the development of an edge ontology. trends Biochem sci 32: 320-331.

smith, m.G., Gianoulis, t.A., pukatzki, s., mekalanos, J.J., ornston, l.n., Gerstein, m., and snyder, m. (2007) new insights into Acinetobacter baumannii pathogenesis revealed by high-density pyrosequencing and transposon mutagenesis. Genes dev 21: 601-614.

Philip Hugenholtz

Allgaier, m., Reddy, A., park, J.i., ivanova, n., d’haeseleer, p., lowry, s. et al. (2010) targeted discovery of glycoside hydrolases from a switchgrass-adapted compost community. plos one 5: e8812.

Baker, B.J., hugenholtz, p., dawson, s.c., and Banfield, J.F. (2003) extremely acidophilic protists from acid mine drainage host Rickettsiales-lineage endosymbionts that have intervening sequences in their 16s rRnA genes. Appl environ microbiol 69: 5512-5518.

Baker, B.J., tyson, G.w., webb, R.i., Flanagan, J., hugenholtz, p., Allen, e.e., and Banfield, J.F. (2006) lineages of acidophilic archaea revealed by community genomic analysis. science 314: 1933-1935.

Bjornsson, l., hugenholtz, p., tyson, G.w., and Blackall, l.l. (2002) Filamentous chloroflexi (green non-sulfur bacteria) are abundant in wastewater treatment processes with biological nutrient removal. microbiology 148: 2309-2318.

Blackall, l.l., Rossetti, s., christensson, c., cunningham, m., hartman, p., hugenholtz, p., and tandoi, v. (1997) the characterization and description of representatives of ‘G’ bacteria from activated sludge plants. lett Appl microbiol 25: 63-69.

Bland, c., Ramsey, t.l., sabree, F., lowe, m., Brown, k., kyrpides, n.c., and hugenholtz, p. (2007) cRispR recognition tool (cRt): a tool for automatic detection of clustered regularly interspaced palindromic repeats. Bmc Bioinformatics 8: 209.

Blank, l.m., hugenholtz, p., and nielsen, l.k. (2008) evolution of the hyaluronic acid synthesis (has) operon in streptococcus zooepidemicus and other pathogenic streptococci. J mol evol 67: 13-22.

chain, p.s., Grafham, d.v., Fulton, R.s., Fitzgerald, m.G., hostetler, J., muzny, d. et al. (2009) Genomics. Genome project standards in a new era of sequencing. science 326: 236-237.

crocetti, G.R., hugenholtz, p., Bond, p.l., schuler, A., keller, J., Jenkins, d., and Blackall, l.l. (2000) identification of polyphosphate-accumulating organisms and design of 16s rRnA-directed probes for their detection and quantitation. Appl environ microbiol 66: 1175-1182.

dalevi, d., desantis, t.z., Fredslund, J., Andersen, G.l., markowitz, v.m., and hugenholtz, p. (2007) Automated group assignment in large phylogenetic trees using GRunt: GRouping, ungrouping, naming tool. Bmc Bioinformatics 8: 402.

dalevi, d., hugenholtz, p., and Blackall, l.l. (2001) A multiple-outgroup approach to resolving division-level phylogenetic relationships using 16s rdnA data. int J syst evol microbiol 51: 385-391.

dalevi, d., ivanova, n.n., mavromatis, k., hooper, s.d., szeto, e., hugenholtz, p. et al. (2008) Annotation of metagenome short reads using proxygenes. Bioinformatics 24: i7-13.

desantis, t.z., Jr., hugenholtz, p., keller, k., Brodie, e.l., larsen, n., piceno, Y.m. et al. (2006) nAst: a multiple sequence alignment server for comparative analysis of 16s rRnA genes. nucleic Acids Res 34: w394-399.

desantis, t.z., hugenholtz, p., larsen, n., Rojas, m., Brodie, e.l., keller, k. et al. (2006) Greengenes, a chimera-checked 16s rRnA gene database and workbench compatible with ARB. Appl environ microbiol 72: 5069-5072.

dojka, m.A., hugenholtz, p., haack, s.k., and pace, n.R. (1998) microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl environ microbiol 64: 3869-3877.

elkins, J.G., podar, m., Graham, d.e., makarova, k.s., wolf, Y., Randau, l. et al. (2008) A korarchaeal genome reveals insights into the evolution of the Archaea. proc natl Acad sci u s A 105: 8102-8107.

engelbrektson, A., kunin, v., wrighton, k.c., zvenigorodsky, n., chen, F., ochman, h., and hugenholtz, p. (2010) experimental factors affecting pcR-based estimates of microbial species richness and evenness. isme J.

Field, d., Garrity, G., Gray, t., morrison, n., selengut, J., sterk, p. et al. (2008) the minimum information about a genome sequence (miGs) specification. nat Biotechnol 26: 541-547.

Fuerst, J.A., Gwilliam, h.G., lindsay, m., lichanska, A., Belcher, c., vickers, J.e., and hugenholtz, p. (1997) isolation and molecular identification of planctomycete bacteria from postlarvae of the giant tiger prawn, penaeus monodon. Appl environ microbiol 63: 254-262.

Fuerst, J.A., and hugenholtz, p. (2000) microorganisms should be high on dnA preservation list. science 290: 1503.

Garcia martin, h., ivanova, n., kunin, v., warnecke, F., Barry, k.w., mchardy, A.c. et al. (2006) metagenomic analysis of two enhanced biological phosphorus removal (eBpR) sludge communities. nat Biotechnol 24: 1263-1269.

Ginige, m.p., hugenholtz, p., daims, h., wagner, m., keller, J., and Blackall, l.l. (2004) use of stable-isotope probing, full-cycle rRnA analysis, and fluorescence in situ hybridization-microautoradiography to study a methanol-fed denitrifying microbial community. Appl environ microbiol 70: 588-596.

hall, s.J., hugenholtz, p., siyambalapitiya, n., keller, J., and Blackall, l.l. (2002) the development and use of real-time pcR for the quantification of nitrifiers in activated sludge. water sci technol 46: 267-272.

he, s., kunin, v., haynes, m., martin, h.G., ivanova, n., Rohwer, F. et al. (2010) metatranscriptomic array analysis of ‘candidatus Accumulibacter phosphatis’-enriched enhanced biological phosphorus removal sludge. environ microbiol.

herlemann, d.p., Geissinger, o., ikeda-ohtsubo, w., kunin, v., sun, h., lapidus, A. et al. (2009) Genomic analysis of “elusimicrobium minutum,” the first cultivated representative of the phylum “elusimicrobia” (formerly termite group 1). Appl environ microbiol 75: 2841-2849.

huber, t., Faulkner, G., and hugenholtz, p. (2004) Bellerophon: a program to detect chimeric sequences in multiple sequence alignments. Bioinformatics 20: 2317-2319.

hugenholtz, p. (2002) exploring prokaryotic diversity in the genomic era. Genome Biol 3: Reviews0003.

hugenholtz, p. (2007) Riding giants. environ microbiol 9: 5.

hugenholtz, p., Goebel, B.m., and pace, n.R. (1998) impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. J Bacteriol 180: 4765-4774.

hugenholtz, p., hooper, s.d., and kyrpides, n.c. (2009) Focus: synergistetes. environ microbiol 11: 1327-1329.

hugenholtz, p., and kyrpides, n.c. (2009) A changing of the guard. environ microbiol 11: 551-553.

hugenholtz, p., and pace, n.R. (1996) identifying microbial diversity in the natural environment: a molecular phylogenetic approach. trends Biotechnol 14: 190-197.

hugenholtz, p., pitulle, c., hershberger, k.l., and pace, n.R. (1998) novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366-376.

hugenholtz, p., and stackebrandt, e. (2004) Reclassification of sphaerobacter thermophilus from the subclass sphaerobacteridae in the phylum Actinobacteria to the class thermomicrobia (emended description) in the phylum chloroflexi (emended description). int J syst evol microbiol 54: 2049-2051.

hugenholtz, p., and tyson, G.w. (2008) microbiology: metagenomics. nature 455: 481-483.

hugenholtz, p., tyson, G.w., and Blackall, l.l. (2002) design and evaluation of 16s rRnA-targeted oligonucleotide probes for fluorescence in situ hybridization. methods mol Biol 179: 29-42.

hugenholtz, p., tyson, G.w., webb, R.i., wagner, A.m., and Blackall, l.l. (2001) investigation of candidate division tm7, a recently recognized major lineage of the domain Bacteria with no known pure-culture representatives. Appl environ microbiol 67: 411-419.

hugenholtz, p.G. (1996) individual decision making in cardiology: it could remain our privilege. Rev port cardiol 15: 277-279.

imachi, h., sekiguchi, Y., kamagata, Y., loy, A., Qiu, Y.l., hugenholtz, p. et al. (2006) non-sulfate-reducing, syntrophic bacteria affiliated with desulfotomaculum cluster i are widely distributed in methanogenic environments. Appl environ microbiol 72: 2080-2091.

Janssen, p.h., and hugenholtz, p. (2003) Fermentation of glycolate by a pure culture of a strictly anaerobic gram-positive bacterium belonging to the family lachnospiraceae. Arch microbiol 179: 321-328.

Joseph, s.J., hugenholtz, p., sangwan, p., osborne, c.A., and Janssen, p.h. (2003) laboratory cultivation of widespread and previously uncultured soil bacteria. Appl environ microbiol 69: 7210-7215.

klein, m., Friedrich, m., Roger, A.J., hugenholtz, p., Fishbain, s., Abicht, h. et al. (2001) multiple lateral transfers of dissimilatory sulfite reductase genes between major lineages of sulfate-reducing prokaryotes. J Bacteriol 183: 6028-6035.

kristiansson, e., hugenholtz, p., and dalevi, d. (2009) shotgunFunctionalizeR: an R-package for functional comparison of metagenomes. Bioinformatics 25: 2737-2738.

kunin, v., copeland, A., lapidus, A., mavromatis, k., and hugenholtz, p. (2008) A bioinformatician’s guide to metagenomics. microbiol mol Biol Rev 72: 557-578, table of contents.

kunin, v., engelbrektson, A., ochman, h., and hugenholtz, p. (2009) wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. environ microbiol 12: 118-123.

kunin, v., he, s., warnecke, F., peterson, s.B., Garcia martin, h., haynes, m. et al. (2008) A bacterial metapopulation adapts locally to phage predation despite global dispersal. Genome Res 18: 293-297.

kunin, v., Raes, J., harris, J.k., spear, J.R., walker, J.J., ivanova, n. et al. (2008) millimeter-scale genetic gradients and community-level molecular convergence in a hypersaline microbial mat. mol syst Biol 4: 198.

kunin, v., sorek, R., and hugenholtz, p. (2007) evolutionary conservation of sequence and secondary structures in cRispR repeats. Genome Biol 8: R61.

liolios, k., chen, i.m., mavromatis, k., tavernarakis, n., hugenholtz, p., markowitz, v.m., and kyrpides, n.c. (2009) the Genomes on line database (Gold) in 2009: status of genomic and metagenomic projects and their associated metadata. nucleic Acids Res 38: d346-354.

liolios, k., tavernarakis, n., hugenholtz, p., and kyrpides, n.c. (2006) the Genomes on line database (Gold) v.2: a monitor of genome projects worldwide. nucleic Acids Res 34: d332-334.

marcy, Y., ouverney, c., Bik, e.m., losekann, t., ivanova, n., martin, h.G. et al. (2007) dissecting biological “dark matter” with single-cell genetic analysis of rare and uncultivated tm7 microbes from the human mouth. proc natl Acad sci u s A 104: 11889-11894.

markowitz, v.m., ivanova, n., palaniappan, k., szeto, e., korzeniewski, F., lykidis, A. et al. (2006) An experimental metagenome data management and analysis system. Bioinformatics 22: e359-367.

markowitz, v.m., ivanova, n.n., szeto, e., palaniappan, k., chu, k., dalevi, d. et al. (2008) imG/m: a data management and analysis system for metagenomes. nucleic Acids Res 36: d534-538.

markowitz, v.m., korzeniewski, F., palaniappan, k., szeto, e., werner, G., padki, A. et al. (2006) the integrated microbial genomes (imG) system. nucleic Acids Res 34: d344-348.

mavromatis, k., ivanova, n., Anderson, i., lykidis, A., hooper, s.d., sun, h. et al. (2009) Genome analysis of the anaerobic thermohalophilic bacterium halothermothrix orenii. plos one 4: e4192.

mavromatis, k., ivanova, n., Barry, k., shapiro, h., Goltsman, e., mchardy, A.c. et al. (2007) use of simulated data sets to evaluate the fidelity of metagenomic processing methods. nat methods 4: 495-500.

mcdevitt, c.A., hugenholtz, p., hanson, G.R., and mcewan, A.G. (2002) molecular analysis of dimethyl sulphide dehydrogenase from Rhodovulum sulfidophilum: its place in the dimethyl sulphoxide reductase family of microbial molybdopterin-containing enzymes. mol microbiol 44: 1575-1587.

mchardy, A.c., martin, h.G., tsirigos, A., hugenholtz, p., and Rigoutsos, i. (2007) Accurate phylogenetic classification of variable-length dnA fragments. nat methods 4: 63-72.

mcmahon, k.d., martin, h.G., and hugenholtz, p. (2007) integrating ecology into biotechnology. curr opin Biotechnol 18: 287-292.

peterson, s.B., warnecke, F., madejska, J., mcmahon, k.d., and hugenholtz, p. (2008) environmental distribution and population biology of candidatus Accumulibacter, a primary agent of biological phosphorus removal. environ microbiol 10: 2692-2703.

Rossetti, s., Blackall, l.l., majone, m., hugenholtz, p., plumb, J.J., and tandoi, v. (2003) kinetic and phylogenetic characterization of an anaerobic dechlorinating microbial community. microbiology 149: 459-469.

sait, m., hugenholtz, p., and Janssen, p.h. (2002) cultivation of globally distributed soil bacteria from phylogenetic lineages previously only detected in cultivation-independent surveys. environ microbiol 4: 654-666.

sandler, s.J., hugenholtz, p., schleper, c., delong, e.F., pace, n.R., and clark, A.J. (1999) diversity of radA genes from cultured and uncultured archaea: comparative analysis of putative RadA proteins and their use as a phylogenetic marker. J Bacteriol 181: 907-915.

sangwan, p., chen, x., hugenholtz, p., and Janssen, p.h. (2004) chthoniobacter flavus gen. nov., sp. nov., the first pure-culture representative of subdivision two, spartobacteria classis nov., of the phylum verrucomicrobia. Appl environ microbiol 70: 5875-5881.

schoenborn, l., Yates, p.s., Grinton, B.e., hugenholtz, p., and Janssen, p.h. (2004) liquid serial dilution is inferior to solid media for isolation of cultures representative of the phylum-level diversity of soil bacteria. Appl environ microbiol 70: 4363-4366.

seviour, e.m., Blackall, l.l., christensson, c., hugenholtz, p., cunningham, m.A., Bradford, d. et al. (1997) the filamentous morphotype eikelboom type 1863 is not a single genetic entity. J Appl microbiol 82: 411-421.

shah, n., teplitsky, m.v., minovitsky, s., pennacchio, l.A., hugenholtz, p., hamann, B., and dubchak, i.l. (2005) snp-vistA: an interactive snp visualization tool. Bmc Bioinformatics 6: 292.

sorek, R., kunin, v., and hugenholtz, p. (2008) cRispR--a widespread system that provides acquired resistance against phages in bacteria and archaea. nat Rev microbiol 6: 181-186.

thomsen, t.R., kjellerup, B.v., nielsen, J.l., hugenholtz, p., and nielsen, p.h. (2002) in situ studies of the phylogeny and physiology of filamentous bacteria with attached growth. environ microbiol 4: 383-391.

tringe, s.G., and hugenholtz, p. (2008) A renaissance for the pioneering 16s rRnA gene. curr opin microbiol 11: 442-446.

tringe, s.G., von mering, c., kobayashi, A., salamov, A.A., chen, k., chang, h.w. et al. (2005) comparative metagenomics of microbial communities. science 308: 554-557.

tschop, m.h., hugenholtz, p., and karp, c.l. (2009) Getting to the core of the gut microbiome. nat Biotechnol 27: 344-346.

tyson, G.w., chapman, J., hugenholtz, p., Allen, e.e., Ram, R.J., Richardson, p.m. et al. (2004) community structure and metabolism through reconstruction of microbial genomes from the environment. nature 428: 37-43.

tyson, G.w., lo, i., Baker, B.J., Allen, e.e., hugenholtz, p., and Banfield, J.F. (2005) Genome-directed isolation of the key nitrogen fixer leptospirillum ferrodiazotrophum sp. nov. from an acidophilic microbial community. Appl environ microbiol 71: 6319-6324.

uzal, F.A., hugenholtz, p., Blackall, l.l., petray, s., moss, s., Assis, R.A. et al. (2003) pcR detection of clostridium chauvoei in pure cultures and in formalin-fixed, paraffin-embedded tissues. vet microbiol 91: 239-248.

von mering, c., hugenholtz, p., Raes, J., tringe, s.G., doerks, t., Jensen, l.J. et al. (2007) Quantitative phylogenetic assessment of microbial communities in diverse environments. science 315: 1126-1130.

warnecke, F., and hugenholtz, p. (2007) Building on basic metagenomics with complementary technologies. Genome Biol 8: 231.

warnecke, F., luginbuhl, p., ivanova, n., Ghassemian, m., Richardson, t.h., stege, J.t. et al. (2007) metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. nature 450: 560-565.

watt, m., hugenholtz, p., white, R., and vinall, k. (2006) numbers and locations of native bacteria on field-grown wheat roots quantified by fluorescence in situ hybridization (Fish). environ microbiol 8: 871-884.

wrighton, k.c., Agbo, p., warnecke, F., weber, k.A., Brodie, e.l., desantis, t.z. et al. (2008) A novel ecological role of the Firmicutes identified in thermophilic microbial fuel cells. isme J 2: 1146-1156.

wu, d., hugenholtz, p., mavromatis, k., pukall, R., dalin, e., ivanova, n.n. et al. (2009) A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. nature 462: 1056-1060.

Yamada, t., Yamauchi, t., shiraishi, k., hugenholtz, p., ohashi, A., harada, h. et al. (2007) characterization of filamentous bacteria, belonging to candidate phylum ksB3, that are associated with bulking in methanogenic granular sludges. isme J 1: 246-255.

zhang, h., sekiguchi, Y., hanada, s., hugenholtz, p., kim, h., kamagata, Y., and nakamura, k. (2003) Gemmatimonas aurantiaca gen. nov., sp. nov., a gram-negative, aerobic, polyphosphate-accumulating micro-organism, the first cultured representative of the new bacterial phylum Gemmatimonadetes phyl. nov. int J syst evol microbiol 53: 1155-1163.

Martin Krzywinski

Butterfield, Y.s., marra, m.A., Asano, J.k., chan, s.Y., Guin, R., krzywinski, m.i. et al. (2002) An efficient strategy for large-scale high-throughput transposon-mediated sequencing of cdnA clones. nucleic Acids Res 30: 2460-2468.

costello, J.F., krzywinski, m., and marra, m.A. (2009) A first look at entire human methylomes. nat Biotechnol 27: 1130-1132.

Flibotte, s., chiu, R., Fjell, c., krzywinski, m., schein, J.e., shin, h., and marra, m.A. (2004) Automated ordering of fingerprinted clones. Bioinformatics 20: 1264-1271.

Fuhrmann, d.R., krzywinski, m.i., chiu, R., saeedi, p., schein, J.e., Bosdet, i.e. et al. (2003) software for automated analysis of dnA fingerprinting gels. Genome Res 13: 940-953.

kelleher, c.t., chiu, R., shin, h., Bosdet, i.e., krzywinski, m.i., Fjell, c.d. et al. (2007) A physical map of the highly heterozygous populus genome: integration with the genome sequence and genetic map and analysis of haplotype variation. plant J 50: 1063-1078.

krzywinski, m., Bosdet, i., mathewson, c., wye, n., Brebner, J., chiu, R. et al. (2007) A BAc clone fingerprinting approach to the detection of human genome rearrangements. Genome Biol 8: R224.

krzywinski, m., Bosdet, i., smailus, d., chiu, R., mathewson, c., wye, n. et al. (2004) A set of BAc clones spanning the human genome. nucleic Acids Res 32: 3651-3660.

krzywinski, m., schein, J., Birol, i., connors, J., Gascoyne, R., horsman, d. et al. (2009) circos: an information aesthetic for comparative genomics. Genome Res 19: 1639-1645.

krzywinski, m., wallis, J., Gosele, c., Bosdet, i., chiu, R., Graves, t. et al. (2004) integrated and sequence-ordered BAc- and YAc-based physical maps for the rat genome. Genome Res 14: 766-779.

morin, R., Bainbridge, m., Fejes, A., hirst, m., krzywinski, m., pugh, t. et al. (2008) profiling the hela s3 transcriptome using randomly primed cdnA and massively parallel short-read sequencing. Biotechniques 45: 81-94.

ness, s.R., terpstra, w., krzywinski, m., marra, m.A., and Jones, s.J. (2002) Assembly of fingerprint contigs: parallelized Fpc. Bioinformatics 18: 484-485.

pugh, t.J., keyes, m., Barclay, l., delaney, A., krzywinski, m., thomas, d. et al. (2009) sequence variant discovery in dnA repair genes from radiosensitive and radiotolerant prostate brachytherapy patients. clin cancer Res 15: 5008-5016.

sossi, v., holden, J.e., chan, G., krzywinski, m., stoessl, A.J., and Ruth, t.J. (2000) Analysis of four dopaminergic tracers kinetics using two different tissue input function methods. J cereb Blood Flow metab 20: 653-660.

vatcher, G., smailus, d., krzywinski, m., Guin, R., stott, J., tsai, m. et al. (2002) Resuspension of dnA sequencing reaction products in agarose increases sequence quality on an automated sequencer. Biotechniques 33: 532-534, 536, 538-539.

Angela Norbeck

Adkins, J.n., mottaz, h.m., norbeck, A.d., Gustin, J.k., Rue, J., clauss, t.R. et al. (2006) Analysis of the salmonella typhimurium proteome through environmental response toward infectious conditions. mol cell proteomics 5: 1450-1461.

Ansong, c., Yoon, h., norbeck, A.d., Gustin, J.k., mcdermott, J.e., mottaz, h.m. et al. (2008) proteomics analysis of the causative agent of typhoid fever. J proteome Res 7: 546-557.

chowdhury, s.m., shi, l., Yoon, h., Ansong, c., Rommereim, l.m., norbeck, A.d. et al. (2009) A method for investigating protein-protein interactions related to salmonella typhimurium pathogenesis. J proteome Res 8: 1504-1514.

Farr, c.d., Gafken, p.R., norbeck, A.d., doneanu, c.e., stapels, m.d., Barofsky, d.F. et al. (2004) proteomic analysis of native metabotropic glutamate receptor 5 protein complexes reveals novel molecular constituents. J neurochem 91: 438-450.

huang, h., lin, m., wang, x., kikuchi, t., mottaz, h., norbeck, A., and Rikihisa, Y. (2008) proteomic analysis of and immune responses to ehrlichia chaffeensis lipoproteins. infect immun 76: 3405-3414.

liu, t., Qian, w.J., mottaz, h.m., Gritsenko, m.A., norbeck, A.d., moore, R.J. et al. (2006) evaluation of multiprotein immunoaffinity subtraction for plasma proteomics and candidate biomarker discovery using mass spectrometry. mol cell proteomics 5: 2167-2174.

manes, n.p., Gustin, J.k., Rue, J., mottaz, h.m., purvine, s.o., norbeck, A.d. et al. (2007) targeted protein degradation by salmonella under phagosome-mimicking culture conditions investigated using comparative peptidomics. mol cell proteomics 6: 717-727.

mottaz-Brewer, h.m., norbeck, A.d., Adkins, J.n., manes, n.p., Ansong, c., shi, l. et al. (2008) optimization of proteomic sample preparation procedures for comprehensive protein characterization of pathogenic systems. J Biomol tech 19: 285-295.

norbeck, A.d., callister, s.J., monroe, m.e., Jaitly, n., elias, d.A., lipton, m.s., and smith, R.d. (2006) proteomic approaches to bacterial differentiation. J microbiol methods 67: 473-486.

norbeck, A.d., monroe, m.e., Adkins, J.n., Anderson, k.k., daly, d.s., and smith, R.d. (2005) the utility of accurate mass and lc elution time information in the analysis of complex proteomes. J Am soc mass spectrom 16: 1239-1249.

Ream, t.s., haag, J.R., wierzbicki, A.t., nicora, c.d., norbeck, A.d., zhu, J.k. et al. (2009) subunit compositions of the RnA-silencing enzymes pol iv and pol v reveal their origins as specialized forms of RnA polymerase ii. mol cell 33: 192-203.

Romine, m.F., carlson, t.s., norbeck, A.d., mccue, l.A., and lipton, m.s. (2008) identification of mobile elements and pseudogenes in the shewanella oneidensis mR-1 genome. Appl environ microbiol 74: 3257-3265.

shi, l., Adkins, J.n., coleman, J.R., schepmoes, A.A., dohnkova, A., mottaz, h.m. et al. (2006) proteomic analysis of salmonella enterica serovar typhimurium isolated from RAw 264.7 macrophages: identification of a novel protein that contributes to the replication of serovar typhimurium inside macrophages. J Biol chem 281: 29131-29140.

shi, l., chowdhury, s.m., smallwood, h.s., Yoon, h., mottaz-Brewer, h.m., norbeck, A.d. et al. (2009) proteomic investigation of the time course responses of RAw 264.7 macrophages to infection with salmonella enterica. infect immun 77: 3227-3233.

smith, d.p., kitner, J.B., norbeck, A.d., clauss, t.R., lipton, m.s., schwalbach, m.s. et al. transcriptional and translational regulatory responses to iron limitation in the globally distributed marine bacterium candidatus pelagibacter ubique. plos one 5: e10487.

sowell, s.m., norbeck, A.d., lipton, m.s., nicora, c.d., callister, s.J., smith, R.d. et al. (2008) proteomic analysis of stationary phase in the marine bacterium “candidatus pelagibacter ubique”. Appl environ microbiol 74: 4091-4100.

sowell, s.m., wilhelm, l.J., norbeck, A.d., lipton, m.s., nicora, c.d., Barofsky, d.F. et al. (2009) transport functions dominate the sAR11 metaproteome at low-nutrient extremes in the sargasso sea. isme J 3: 93-105.

zhou, J.Y., petritis, B.o., petritis, k., norbeck, A.d., weitz, k.k., moore, R.J. et al. (2009) mouse-specific tandem igY7-supermix immunoaffinity separations for improved lc-ms/ms coverage of the plasma proteome. J proteome Res 8: 5387-5395.

Christian von Mering

Boettner, m., steffens, c., von mering, c., Bork, p., stahl, u., and lang, c. (2007) sequence-based factors influencing the expression of heterologous genes in the yeast pichia pastoris--A comparative view on 79 human genes. J Biotechnol 130: 1-10.

Bork, p., Jensen, l.J., von mering, c., Ramani, A.k., lee, i., and marcotte, e.m. (2004) protein interaction networks from yeast to human. curr opin struct Biol 14: 292-299.

campillos, m., von mering, c., Jensen, l.J., and Bork, p. (2006) identification and analysis of evolutionarily cohesive functional modules in protein networks. Genome Res 16: 374-382.

chaffron, s., and von mering, c. (2007) termites in the woodwork. Genome Biol 8: 229.

ciccarelli, F.d., doerks, t., von mering, c., creevey, c.J., snel, B., and Bork, p. (2006) toward automatic reconstruction of a highly resolved tree of life. science 311: 1283-1287.

ciccarelli, F.d., von mering, c., suyama, m., harrington, e.d., izaurralde, e., and Bork, p. (2005) complex genomic rearrangements lead to novel primate gene function. Genome Res 15: 343-351.

doerks, t., Andrade, m.A., lathe, w., 3rd, von mering, c., and Bork, p. (2004) Global analysis of bacterial transcription factors to predict cellular target processes. trends Genet 20: 126-131.

doerks, t., von mering, c., and Bork, p. (2004) Functional clues for hypothetical proteins based on genomic context analysis in prokaryotes. nucleic Acids Res 32: 6321-6326.

Foerstner, k.u., von mering, c., and Bork, p. (2006) comparative analysis of environmental sequences: potential and challenges. philos trans R soc lond B Biol sci 361: 519-523.

Foerstner, k.u., von mering, c., hooper, s.d., and Bork, p. (2005) environments shape the nucleotide composition of genomes. emBo Rep 6: 1208-1213.

harrington, e.d., singh, A.h., doerks, t., letunic, i., von mering, c., Jensen, l.J. et al. (2007) Quantitative assessment of protein function prediction from metagenomics shotgun sequences. proc natl Acad sci u s A 104: 13913-13918.

hausmann, G., von mering, c., and Basler, k. (2009) the hedgehog signaling pathway: where did it come from? plos Biol 7: e1000146.

huynen, m.A., snel, B., von mering, c., and Bork, p. (2003) Function prediction and protein networks. curr opin cell Biol 15: 191-198.

Jensen, l.J., Julien, p., kuhn, m., von mering, c., muller, J., doerks, t., and Bork, p. (2008) eggnoG: automated construction and annotation of orthologous groups of genes. nucleic Acids Res 36: d250-254.

Jensen, l.J., lagarde, J., von mering, c., and Bork, p. (2004) Arrayprospector: a web resource of functional associations inferred from microarray expression data. nucleic Acids Res 32: w445-448.

korbel, J.o., Jensen, l.J., von mering, c., and Bork, p. (2004) Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. nat Biotechnol 22: 911-917.

Krause, R., von Mering, C., and Bork, P. (2003) A comprehensive set of protein complexes in yeast: mining large scale protein-protein interaction screens. Bioinformatics 19: 1901-1908.

Krause, R., von Mering, C., Bork, P., and Dandekar, T. (2004) Shared components of protein complexes--versatile building blocks or biochemical artefacts? Bioessays 26: 1333-1343.

Kuhn, M., Szklarczyk, D., Franceschini, A., Campillos, M., von Mering, C., Jensen, L.J. et al. (2010) STITCH 2: an interaction network database for small molecules and proteins. Nucleic Acids Res 38: D552-556.

Kuhn, M., von Mering, C., Campillos, M., Jensen, L.J., and Bork, P. (2008) STITCH: interaction networks of chemicals and proteins. Nucleic Acids Res 36: D684-688.

Perocchi, F., Jensen, L.J., Gagneur, J., Ahting, U., von Mering, C., Bork, P. et al. (2006) Assessing systems properties of yeast mitochondria through an interaction map of the organelle. PLoS Genet 2: e170.

Raes, J., Korbel, J.O., Lercher, M.J., von Mering, C., and Bork, P. (2007) Prediction of effective genome size in metagenomic samples. Genome Biol 8: R10.

Romanos, M., Renner, T.J., Schecklmann, M., Hummel, B., Roos, M., von Mering, C. et al. (2008) Improved odor sensitivity in attention-deficit/hyperactivity disorder. Biol Psychiatry 64: 938-940.

Tringe, S.G., von Mering, C., Kobayashi, A., Salamov, A.A., Chen, K., Chang, H.W. et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554-557.

von Mering, C., and Bork, P. (2002) Teamed up for transcription. Nature 417: 797-798.

von Mering, C., Hugenholtz, P., Raes, J., Tringe, S.G., Doerks, T., Jensen, L.J. et al. (2007) Quantitative phylogenetic assessment of microbial communities in diverse environments. Science 315: 1126-1130.

von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P., and Snel, B. (2003) STRING: a database of predicted functional associations between proteins. Nucleic Acids Res 31: 258-261.

von Mering, C., Jensen, L.J., Kuhn, M., Chaffron, S., Doerks, T., Kruger, B. et al. (2007) STRING 7--recent developments in the integration and prediction of protein interactions. Nucleic Acids Res 35: D358-362.

von Mering, C., Jensen, L.J., Snel, B., Hooper, S.D., Krupp, M., Foglierini, M. et al. (2005) STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 33: D433-437.

von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., and Bork, P. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417: 399-403.

von Mering, C., Zdobnov, E.M., Tsoka, S., Ciccarelli, F.D., Pereira-Leal, J.B., Ouzounis, C.A., and Bork, P. (2003) Genome evolution reveals biochemical networks and functional modules. Proc Natl Acad Sci U S A 100: 15428-15433.

Weiss, M., Schrimpf, S., Hengartner, M.O., Lercher, M.J., and von Mering, C. (2010) Shotgun proteomics data from multiple organisms reveals remarkable quantitative conservation of the eukaryotic core proteome. Proteomics.

Zdobnov, E.M., von Mering, C., Letunic, I., and Bork, P. (2005) Consistency of genome-based methods in measuring metazoan evolution. FEBS Lett 579: 3355-3361.

Zdobnov, E.M., von Mering, C., Letunic, I., Torrents, D., Suyama, M., Copley, R.R. et al. (2002) Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science 298: 149-159.

Jenn and Ken Visocky O’Grady

Visocky O'Grady, J., & Visocky O'Grady, K. (2009). A Designer's Research Manual. Beverly, MA: Rockport Publishers.

Visocky O'Grady, J., & Visocky O'Grady, K. (2008) Cincinnati, OH: HOW Books.

David Walsh

Bapteste, E., and Walsh, D.A. (2005) Does the ‘Ring of Life’ ring true? Trends Microbiol 13: 256-261.

Boucher, Y., Douady, C.J., Papke, R.T., Walsh, D.A., Boudreau, M.E., Nesbo, C.L. et al. (2003) Lateral gene transfer and the origins of prokaryotic groups. Annu Rev Genet 37: 283-328.

Gardy, J.L., Laird, M.R., Chen, F., Rey, S., Walsh, C.J., Ester, M., and Brinkman, F.S. (2005) PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics 21: 617-623.

Sharma, A.K., Walsh, D.A., Bapteste, E., Rodriguez-Valera, F., Ford Doolittle, W., and Papke, R.T. (2007) Evolution of rhodopsin ion pumps in haloarchaea. BMC Evol Biol 7: 79.

Walsh, D.A., Bapteste, E., Kamekura, M., and Doolittle, W.F. (2004) Evolution of the RNA polymerase B’ subunit gene (rpoB’) in Halobacteriales: a complementary molecular marker to the SSU rRNA gene. Mol Biol Evol 21: 2340-2351.

Walsh, D.A., Papke, R.T., and Doolittle, W.F. (2005) Archaeal diversity along a soil salinity gradient prone to disturbance. Environ Microbiol 7: 1655-1666.

Walsh, D.A., and Sharma, A.K. (2009) Molecular phylogenetics: testing evolutionary hypotheses. Methods Mol Biol 502: 131-168.

Walsh, D.A., Zaikova, E., and Hallam, S.J. (2009) Small volume (1-3L) filtration of coastal seawater samples. J Vis Exp.

Walsh, D.A., Zaikova, E., and Hallam, S.J. (2009) Large volume (20L+) filtration of coastal seawater samples. J Vis Exp.

Walsh, D.A., Zaikova, E., Howes, C.G., Song, Y.C., Wright, J.J., Tringe, S.G. et al. (2009) Metagenome of a versatile chemolithoautotroph from expanding oceanic dead zones. Science 326: 578-582.

Wright, J.J., Lee, S., Zaikova, E., Walsh, D.A., and Hallam, S.J. (2009) DNA extraction from 0.22 µm Sterivex filters and cesium chloride density gradient centrifugation. J Vis Exp.

Zaikova, E., Hawley, A., Walsh, D.A., and Hallam, S.J. (2009) Seawater sampling and collection. J Vis Exp.

Zaikova, E., Walsh, D.A., Stilwell, C.P., Mohn, W.W., Tortell, P.D., and Hallam, S.J. Microbial community dynamics in a seasonally anoxic fjord: Saanich Inlet, British Columbia. Environ Microbiol 12: 172-191.

Survey

1. Overall the following workshop elements were effective to my learning.

1 It may not have taught me anything to do with big data gathering, analysis and presentation, but it taught me what other people out there know.

2 More demonstrations of new software with directions on how to use each software provided.

3 Data visualization presentations were a highlight. An area not normally systematically addressed in the day-to-day life of a scientist.

4 The break out tutorials were good but there really isn’t a lot covered in the sessions so maybe fewer tutorials but more in depth with one or two software programs.

5 I’m glad I can look at the videos of the talks.

6 I would appreciate: shorter talks = 20-25 mins; longer discussions after each talk; more working groups; more practical info

bigDATA Survey Results

[Figure: bar charts of responses (number of people, 0 to 25; scale from Strongly Agree to Don’t Know) for two workshop elements: Presentations and Demonstration Sessions.]

2. Demonstration sessions on the following software tools were effective to my learning:

1 They all taught me what the programs could do, but I also learnt that they wouldn’t be applicable to my work.

2 I thought most tools were geared towards very specific ends. I’m sure those tools are useful but I do not foresee using them with my data.

3 MetaCyc and GenGIS are two very complicated programs so maybe a written outline of them prior to the meeting would have been useful so that we had an understanding of the basics before we got there.

4 I didn’t see the second demonstration.

5 I was already familiar with some tools.

[Figure: bar charts of responses (number of people, 0 to 25; scale from Strongly Agree to Don’t Know) for each software tool: Cytoscape, Pyrotagger, MetaCyc, STRING, and GenGIS.]

3. I would have liked to have learned more (greater depth of coverage) about the following subjects / topics / software (please also comment on subjects/topics/software not covered but could be included):

1 Limitations due to data quality/inherent bias.

2 Sorting and annotation software for big data sets. It would have been nice to have a walkthrough from sequence acquisition to sorting to annotation to analysis (this bit was covered more) to presentation (also covered to a larger degree).

3 Sequence aligners, taxonomic identification software, metagenomic comparison algorithms

4 IMG/M

5 Cytoscape and associated plugins

6 About issues with the various sequencing technologies that result in false data, artificially inflating the big data problem.

7 Visualization technology development

8 Software Circos

9 Considering the strong emphasis on data visualization, Illustrator and Photoshop (or GIMP) are great tools to generate beautiful figures and it would have been a plus to learn a few tricks on how to use them efficiently. Also, Perl, Python and SQL (or other database interfaces) are good topics on how to handle large datasets that were not discussed in the workshop.

10 How to manage/store the actual data (‘databasing’)

11 I thought there would be a bit more on the community ecology side, i.e. more emphasis on the problems associated with determining microbial community composition in different environments. It seems like these problems were only discussed as a context for genomic studies, but it does not seem a trivial matter. Then again, I’ve been out of academics and research for the past couple years... The Pyrotagger tutorial was great. I would have loved similar tutorials focused on fungal community sequence data.

12 IMG / QIIME

13 I thought that a basic discussion on conventions for dealing with metagenomic data would have been a very useful session.

14 Comparisons between the new methods presented and how they function compared to the older methods they are replacing.

15 Newbler, Consed, Perl, Python, BioPerl

16 Next time, it would be nice to have a section on phylogeny (new software, methods, etc.). Also, an additional day with “bring-your-laptop” exercises would be good to transfer the skills.

17 Sequence clustering algorithms - only covered a little. Functional annotation programs.

18 The role of viruses in microbial communities

19 The best software for every step of the analysis pipeline.

20 More tools for data analysis.

21 More tools for work with raw genomic data

22 * Standardization of bioinformatics protocol/formats

* CD-HIT or UCLUST: a presentation on clustering and the need of reducing sequence space when working with huge amounts of data

* Large scale phylogenies. C. von Mering’s presentation was great about MLTree and markers but was just a part of his talk. I would have liked another presentation on phylogenomics, e.g. AMPHORA, pplacer.

23 Hands-on tutorials would have been nice, but would have to involve a different format, and perhaps computing infrastructure. I would imagine having several sessions in parallel that people would need to commit to in advance.

4. The bigDATA Workshop has enhanced my ability to:

1 Particularly useful for getting me to think about effective and innovative ways to display data.

2 bigDATA could be an opportunity to list open source software for visualization/drawing (e.g. Inkscape, SVG file format, TreeDyn). Also, a Circos tutorial would be great (quite difficult to master).

[Figure: bar charts of responses (number of people, 0 to 25; scale from Strongly Agree to Don’t Know) for four abilities: Effective use of visualization software tools; Apply skills to current projects; Act as a resource to my home lab; Communicate more effectively about my design needs.]

5. Elements of workshop organization that are conducive to learning are:

[Figure: bar charts of responses (number of people, 0 to 25; scale from Strongly Agree to Don’t Know) for six elements: Welcome dinner; Interaction with presenters; Bringing my own laptop; Schedule of presentations / demonstrations; Interaction with other participants; Coffee breaks.]

1 Laptops can detract from the audience’s attention to the speaker. I noticed in this workshop that most of the audience was paying attention and not checking their emails.

2 I thought that the overall structure of the workshop was excellent.

3 Perhaps one more day and a social activity to break the ice? The bigBBQ was announced later and lots of people had already booked their return flight. Perhaps move one of those social events to the day before departure.

[Figure: bar charts of responses (number of people, 0 to 25; scale from Strongly Agree to Don’t Know) for two further elements: Saturday dinner; LSI / UBC as a venue.]

6. Outcomes:

[Figure: bar charts of responses (number of people, 0 to 25; scale from Strongly Agree to Don’t Know) for six statements: I am satisfied with the workshop; 3 day workshop format works well; I would recommend this workshop; Saturday / Sunday workshops work well; I received value from the workshop fee; Prefer if workshop had been during the week.]

1 Weekdays are precious, but so are weekends. I would have preferred a Sun-Tues format rather than Fri-Sun.

2 A lot of people have children and/or relatives to attend to during the weekend, or are simply tired from working all week. As such, interactions were diminished as people often had to leave early.

3 I tend to like more shorter days and would have preferred one or two partial weekdays in addition to the weekend; I get information overload after too many presentations in a row.

4 If they were during the week I would be less likely to attend.

5 It would have been nice to have the full Friday (or at least afternoon) instead of just one talk and a dinner.

6 Some talks were too much about speakers and their projects (conference style) rather than audience-centered (introducing and explaining tools, practical guiding, applicability = workshop style). The usefulness of a tool across biological disciplines should have been clearly stated in every talk.

7 Having it over the US long weekend may have reduced some of the participation.

7. What are the problems/challenges you are facing with respect to working with complex data?

1 Computational

2 Having to have programming skills in order to deal with the data when your background is purely biological.

3 We are at the beginning stages, so simple things like appropriate file formats and storage are an issue as well as overall visualization and research tools.

4 Meta-annotation of the data is often incomplete / not standardized / not made available

5 Hard disk access times, incomplete annotations.

6 Meaningful multivariate statistical analysis.

7 Lack of efficient data mining tools

8 The computing requirements are now much higher. Analyses require some sort of automation to filter through all the data and the hardware has trouble keeping up the pace.

9 Aligning several thousands of sequences and navigating through them.

10 I study soil fungal communities, and there is a lack of software (for us folks without computer programming skills) to allow simultaneous processing of thousands of sequences to determine fungal taxonomic composition.

11 Analysis and visualization, particularly innovative ways to display data.

12 I currently am working on RNA-seq data and using it to effectively identify alternatively spliced RNAs

13 Finding the patterns within and between communities.

14 Lack of good foundation for programming; ignorance about the tools available for bioinformatics.

15 Computer memory (RAM). Problems with presenting the massive amount of data in small figures. Software not always made for large amounts of data.

16 Data management, integration, extrapolation, validation.

17 Analysis of data / methodology

18 It’s hard to know about what software is available for analysis, and what’s the best tool for a particular job. Worse yet is the fact that programs die of neglect; Microsoft or Apple moves on, and programs stop working on newer versions of the operating system.

19 Finding the best tools out there.

20 Genomic information is now growing most rapidly of all so it is reasonable to expect many people need help with a huge amount of raw genomic data rather than metabolic networks, more functional data ... would be nice to take that into account.

21 Formats and harmonization with other studies. Loss of lot of valuable data because focusing on small vignette stories. Difficulty to link all the data together. Too many dimensions (PCA sucks)

22 - No “off-the-shelf” applications that do what we want

- Very large phylogenies

- Alignment free methods for generating phylogenies

- Poor documentation and support for many applications

- Adequate computational power

- Access to bioinformatic personnel

8. I heard about the bigDATA workshop through:

Graduate student: 28.0%; Post-doctoral fellow: 36.0%; Faculty: 16.0%; Technician: 0.0%; Bioinformatician: 8.0%; Research scientist: 16.0%; Other: 4.0%

9. My research role is:

Graduate student: 28.0%; Post-doctoral fellow: 36.0%; Faculty: 16.0%; Technician: 0.0%; Bioinformatician: 8.0%; Research scientist: 16.0%; Other: 4.0%

10. My main research institution would be classified as:

Government: 8.0%; Industry: 4.0%; Academia: 92.0%; Hospital/medical: 0.0%; Other: 0.0%

[Figure: bar charts of number of people (0 to 25) by research role (Graduate Student, Post-Doctoral Fellow, Faculty, Technician, Bioinformatician, Research Scientist, Other) and by institution type (Government, Industry, Academia, Hospital / Medical, Other).]

http://www.cmde.science.ubc.ca/hallam/bigdata.php