Post on 18-Jan-2015
Yannick Pouliot, PhD 10/14/2011
There’s No Avoiding It: Programming Skills You’ll Need
Three Things I Want To Impress
• Why software programming is essential for bioresearch
▫ … as essential as knowing how to use a pipette
• Why you should partially dump Excel and use a relational database
• Why the Cloud is your friend
The Good News
• Free software!
• Free algorithms!
• Pre-coded algorithms (i.e., packages)!
• Very cheap computing power!
The Bad News
• Dunno how to use
• “Not talented”
• “Not enough time”
• (can’t be bothered)
▫ e.g., reading the paper describing the software tool one is relying on
More Good News
• Not that hard
• Lots and lots of good resources
• Read a book, dammit
• Find a buddy
• Use Cloud instances (preconfigured machines)
▫ Can even be free!
The Quest For Situation-Appropriate Storage & Computation
Or, when Excel fails you
Some Questions…
1. Do you use MS Excel?
2. How much time do you spend using it?
3. Are you good at it? Be honest…
4. Have you ever read a book or tutorial on Excel?
5. So how are you going to improve your ability?
Are You an Excelaholic?
• Do you have an unhealthy dependence on Excel?
▫ Do you use Excel to store data?
▫ Do you feel like you’re making Excel jump through hoops to perform your calculations?
▫ Do you have a vague feeling of shame as a result?
The Worst Case (More Frequent Than You’d Wish)
• Postdoc uses Excel to keep track of a complex experiment involving two external groups
• Eventually realizes that data stored in Excel were corrupted (“paste failure”)
▫ Result: it took her six months to recover
• She now uses FileMaker (relational database)
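A minimal sketch of what a relational database offers over a spreadsheet here: the schema can refuse bad rows instead of silently storing them. It uses Python's built-in sqlite3 module rather than FileMaker, and the table names, column names, and sample values are hypothetical, chosen only for illustration.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a small, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE samples (sample_id TEXT PRIMARY KEY, source_lab TEXT NOT NULL)"
    )
    conn.execute(
        """CREATE TABLE measurements (
               sample_id TEXT NOT NULL REFERENCES samples(sample_id),
               assay     TEXT NOT NULL,
               value     REAL NOT NULL CHECK (value >= 0),
               UNIQUE (sample_id, assay)
           )"""
    )
    return conn

def try_insert(conn, sample_id, assay, value):
    """Return True if the row was stored, False if the schema rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO measurements VALUES (?, ?, ?)",
                (sample_id, assay, value),
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = make_db()
with conn:
    conn.execute("INSERT INTO samples VALUES ('S1', 'Lab A')")

results = [
    try_insert(conn, "S1", "qPCR", 12.5),   # valid row
    try_insert(conn, "S1", "qPCR", 13.0),   # duplicate for the same sample and assay
    try_insert(conn, "S9", "qPCR", 4.2),    # sample that was never registered
    try_insert(conn, "S1", "ELISA", -1.0),  # negative value fails the CHECK
]
row_count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
print(results, row_count)
```

Only the first insert is stored. A stray paste that duplicates a row, or one that points at a sample nobody registered, is rejected at write time rather than discovered months later.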
The Next Level Up: Relational Databases
Take Your Pick
A Real Example From Yours Truly
But You Also Need Programming…
Why Programming?
• Address small problems that can nail you
• Address bigger problems by standing on the shoulders of giants
• Flexibility: If you’re doing “real” science, off-the-shelf software will fail you every time
▫ 80% rule…
Don’t Try This With Excel
• Millions of reads compared against mouse transcriptome
• Determining number of distinct species and frequency of members in each
• Summarize using plots for each codon
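The counting step above can be sketched in a few lines of Python. This assumes a hypothetical input format, one tab-separated `read_id` and `species` pair per line, which is not taken from the original analysis. The file is streamed line by line, so the same code works on millions of reads without loading them all into memory.

```python
import io
from collections import Counter

def species_frequencies(lines):
    """Count reads per species from an iterable of 'read_id<TAB>species' lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        _read_id, species = line.split("\t")
        counts[species] += 1
    return counts

# Tiny stand-in for a real file; in practice pass an open file handle instead.
demo = io.StringIO("r1\tA\nr2\tB\nr3\tA\nr4\tC\nr5\tA\n")
counts = species_frequencies(demo)
total = sum(counts.values())

print("distinct species:", len(counts))
for species, n in counts.most_common():
    print(species, n, round(n / total, 2))
```

The resulting `counts` can then be handed to a plotting library for the summary plots.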
Remember SQL?
The Quest For Power
Heard at lab meeting:
“I would have shown you this graph
but Excel crashed while computing a big file”
→You can’t do this (censored) on your laptop anymore
Welcome To The Cloud
Why Own When You Can Rent?
An Example: PathSeq
• Compare millions of short-read sequences against all genomic + transcriptomic sequences for all microbes (!)
Amazon Cloud “Management Console”
Why The Cloud Matters For Biologists
• You can purchase as much computing power as you need
▫ You don’t have to run/manage what you don’t use
• You’re purchasing computing power, not machines
▫ Never outdated
• Can easily migrate from one machine type to another (minutes)
• Can add storage in seconds
• Accessible from anywhere
• Easy to share, e.g., (large) datasets with others
WEKA: the software
• Machine learning/data mining software written in Java (distributed under the GNU General Public License)
• Used for research, education, and applications
• Complements “Data Mining” by Witten & Frank
• Main features:
▫ Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
▫ Graphical user interfaces (incl. data visualization)
▫ Environment for comparing learning algorithms
University of Waikato
Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:
▫ Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
▫ Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …
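To make the “meta”-classifier idea concrete, here is a minimal sketch of bagging in plain Python. It is not WEKA's API or implementation. The base learner (a 1-nearest-neighbour rule, i.e., an instance-based classifier), the function names, and the toy data are all assumptions chosen for illustration.

```python
import random
from collections import Counter

def nearest_neighbor_predict(train, point):
    """Base learner: label of the closest training example (squared Euclidean distance)."""
    def dist2(example):
        features, _label = example
        return sum((a - b) ** 2 for a, b in zip(features, point))
    return min(train, key=dist2)[1]

def bagging_predict(train, point, n_models=25, seed=0):
    """Bagging: run the base learner on bootstrap resamples, then take a majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        # Bootstrap resample: same size as the training set, drawn with replacement.
        resample = [rng.choice(train) for _ in train]
        votes[nearest_neighbor_predict(resample, point)] += 1
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
train = [
    ((0.0, 0.1), "low"), ((0.2, 0.0), "low"), ((0.1, 0.3), "low"),
    ((5.0, 5.1), "high"), ((5.2, 4.9), "high"), ((4.8, 5.3), "high"),
]
pred_low = bagging_predict(train, (0.1, 0.1))
pred_high = bagging_predict(train, (5.0, 5.0))
print(pred_low, pred_high)
```

The meta-classifier never looks inside the base learner. It only resamples the data and combines votes, which is why the same wrapper can sit on top of any of the learning schemes listed above.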