LIFE SCIENCES AND ALGORITHMIC DESIGN: THE NEED FOR SPEED, THE JOY OF SPEED Raffaele Giancarlo...


LIFE SCIENCES AND ALGORITHMIC DESIGN: THE NEED FOR SPEED, THE JOY OF SPEED

Raffaele Giancarlo

Dipartimento di Matematica ed Informatica, Università di Palermo

CINI InfoLife Lab

Summary- Short Version

• Impact of Algorithmic Theory and Practice on Modern Biology:

• Gene Myers:
• ACM Paris Kanellakis Theory and Practice Award 2001
• ISMB Career Award 2014

Summary- Long Version

• Some History

• The data deluge: is algorithmic theory and practice catering to biology?

…roughly 1200 years ago

Abu Abdullah Muhammad bin Musa al-Khwarizmi

Algorithm-Intuition

Example:Algorithm prepare coffee

The Thirties

Algorithms: formalization… Turing, Church, Kleene, Gödel, Post… An unambiguous, ordered, finite sequence of steps, each effectively executable in finite time, producing a result in finite time

The Forties

Kissing The War GoodBye

The forties again

Claude Shannon: The Birth of Information Theory …and data compression

The guy is quite a character, please visit: https://www.youtube.com/watch?v=G5rJJgt_5mg

The fab sixties

Not only algorithms but fast algorithms and formal methodologies for their design and analysis: Knuth, Tarjan, Hopcroft

The seventies

NP-Completeness: Not all problems seem to admit an efficient algorithmic solution …and Computational Biology has plenty of examples

Edmonds, Cook, Karp

Discrete Algorithms

• Discrete mathematical objects: good models to represent computational problems

Example: Graphs

Discrete Algorithms

• Discrete mathematical objects: efficient organization of information

• Example: Trees

Discrete Algorithms

• How to establish the performance of an algorithm: Models of computers, Hardware, etc.

• Here: The “real thing”: The Turing Machine or equivalent

Discrete Algorithms and Bioinformatics

- Do we need more algorithms?
• PubMed search: 21471 papers [1991, 2014]
• Scopus: 101178 papers (Biochemistry, Genetics, Molecular Biology)

We need GOOD ALGORITHMS

Discrete Algorithms and Bioinformatics

• Good Algorithms
• Fast and memory efficient, i.e., process growing amounts of data in “reasonable” time and little space
• Accurate, i.e., able to identify useful biological information in terms of function and/or structure

Discrete Algorithms and Bioinformatics

• Good Algorithms: Accurate (evaluation)
• THE BIOLOGIST: a physical person
• A Statistician, i.e., statistical analysis: surprising or unexpected “events” are related to “biologically useful” information. Example: BLAST, transcription factor binding sites
• Benchmark data sets: they offer solutions, validated by experts, that one can compare against. Examples: CASP, DREAM, MSA. NOT AVAILABLE IN MANY CRUCIAL DOMAINS

Discrete Algorithms and Bioinformatics

• Good Algorithms
• The statistician: care must be exercised… (ahi ahi ahi, no Alpitour)
• Towards epistemological foundations of statistical methods for high-dimensional biology, Mehta et al., Nature Genetics 2004
• Exponential growth of statistical methods for microarray analysis
• For many of them, it is unclear what they do and why they are needed: they are defined as Questionable

Discrete Algorithms and Bioinformatics

- Good Algorithms
• Time
• Space

Let’s Take a Global Look:
• Processor Power (MIPS)
• External Disk Capacity (MB)
• Sequencing Capacity (kb per day)
• Transmission costs are not counted

Discrete Algorithms and Bioinformatics

- On the future of genomic data [Kahn11] and good algorithms (time, space)
- A “meteorological map” of the data flood, by period: [96-02], [02-06], [06-08], [08-]

Discrete Algorithms and Bioinformatics

• Questions:
1. How long does it take for a “foundational advance” in algorithmic theory to be perceived as such in bioinformatics and be applied, as a proof of principle or as the basis for a tool?
2. Is such a delay related to the “meteorological map” outlined earlier?

Algorithmic Theory and Bio Impact

- Four small case studies:
- Suffix trees in Computational Biology
- Data compression of biological sequences
- Genome-scale sequence alignment
- Compressive genomics

Suffix Trees and Comp. Bio.

Suffix Tree for the sequence banana$
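The figure for this slide is not in the transcript. As a rough stand-in, here is a minimal sketch of an uncompacted suffix trie for banana$, built naively in quadratic time. It is not one of the linear-time suffix tree constructions cited in the talk, and the function names are illustrative only.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dicts.

    Naive O(n^2) construction; a real suffix tree compacts unary paths
    and can be built in linear time.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return True if `pattern` is a substring of the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_leaves(node):
    """Count leaves; with a unique end marker there is one leaf per suffix."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie = build_suffix_trie("banana$")
print(contains(trie, "nan"))  # True
print(contains(trie, "nab"))  # False
print(count_leaves(trie))     # 7, one leaf per suffix of banana$
```

Every substring of the text is a prefix of some suffix, which is why a walk from the root answers substring queries in time proportional to the pattern length.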

Suffix Trees and Comp. Bio.

• Why Useful
• Searching
• Word Statistics
• Data Compression
• Etc., etc.

Suffix Trees and Comp. Bio.

• A brief history:
• Weiner 75
• McCreight 76
• Manber and Myers 93: Suffix arrays
• Ukkonen 95
• Gusfield 97: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press
• Gusfield and Stoye 98
• Etc., etc.
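To make the suffix array entry in this history concrete, here is a small sketch, assuming a naive sort-based construction rather than the Manber and Myers algorithm itself. Pattern occurrences are then located with two binary searches over the sorted suffixes. Function names are illustrative.

```python
def suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order.

    Naive construction by sorting the suffixes directly; fine for a demo,
    far too slow for genome-scale inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of `pattern` in `text`."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "banana$"
sa = suffix_array(text)
print(sa)                                  # [6, 5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))   # [1, 3]
```

The array stores one integer per text position, which is the space saving over a pointer-based suffix tree that made it attractive in practice.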

Suffix Trees and Comp. Bio.

• Compressed suffix arrays and Self-Indexes
• Ferragina and Manzini 2000
• Grossi and Vitter 2000

Proof of Principle in Comp. Biology: index with a 2G footprint for the Human Genome
• Sadakane and Shibuya 2001
• Lippert 2002
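The idea behind a self-index of the Ferragina and Manzini kind can be sketched with the Burrows-Wheeler transform and backward search. The version below is an uncompressed toy under stated assumptions: the rank queries use a linear scan (`str.count`) instead of the sampled or compressed rank structures a real index would use, and all names are illustrative.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform of `text`, which must end with a unique '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The character preceding each sorted suffix (wrapping around for suffix 0).
    return "".join(text[i - 1] for i in sa)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of `pattern` using backward search on the BWT only."""
    counts = Counter(bwt_str)
    # smaller[c] = number of characters in the text strictly smaller than c.
    smaller, total = {}, 0
    for c in sorted(counts):
        smaller[c] = total
        total += counts[c]

    def rank(c, i):
        """Occurrences of c in bwt_str[:i] (linear scan, for illustration)."""
        return bwt_str.count(c, 0, i)

    lo, hi = 0, len(bwt_str)  # current range of sorted suffixes
    for c in reversed(pattern):
        if c not in smaller:
            return 0
        lo = smaller[c] + rank(c, lo)
        hi = smaller[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


b = bwt("banana$")
print(b)                             # annb$aa
print(count_occurrences(b, "ana"))   # 2
print(count_occurrences(b, "nan"))   # 1
```

The search never touches the original text, which is what "self-index" refers to: the transformed string, suitably compressed, replaces the text.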

Suffix Trees in Comp. Bio.

• Compressed suffix arrays and Self-Indexes
• Ferragina and Navarro 2005: the Pizza&Chili corpus, highly tuned collections of implementations ready for download and use
• Välimäki et al. 2007: experimental study of CSA as a genome-scale sequence analysis tool

Suffix Trees and Comp. Bio.

• Compressed suffix arrays and Self-Indexes
• Vyverman et al. 2012: prospects and limitations of full-text indexes in genome analysis
• Essential for:
• Read mapping, e.g. Bowtie
• Short read error correction
• Genome assembly

Genome scale alignments

- MUMmer 1 and 2: Delcher et al. 1999, 2002
- LAGAN and Multi-LAGAN: Brudno et al. 2003
- Suffix trees: Weiner 75, McCreight 76, Manber and Myers 93, Ukkonen 95
- Sparse Dynamic Programming: H77, HS77, AG87, EGGI92
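As an illustration of the sparse dynamic programming idea, here is a sketch in the style of the Hunt-Szymanski approach, assuming that is what HS77 abbreviates. Instead of filling the full quadratic table, it visits only the matching position pairs and reduces the longest common subsequence to a longest strictly increasing subsequence. Function names are illustrative.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_length_sparse(a, b):
    """Length of the longest common subsequence of `a` and `b`.

    Work is proportional to the number of matching pairs (times a log
    factor), not to len(a) * len(b).
    """
    positions = defaultdict(list)
    for j, ch in enumerate(b):
        positions[ch].append(j)

    # tails[k] = smallest end position in b of a common subsequence of length k+1.
    tails = []
    for ch in a:
        # Visit matches right to left so one character of `a` is used at most once.
        for j in reversed(positions.get(ch, [])):
            k = bisect_left(tails, j)
            if k == len(tails):
                tails.append(j)
            else:
                tails[k] = j
    return len(tails)


print(lcs_length_sparse("ABCBDAB", "BDCABA"))  # 4
```

The gain is largest when matches are rare, for example when aligning anchors such as maximal unique matches rather than single nucleotides.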

Data Compression

Input X -> compressor A (lossless or lossy) -> output Y, hopefully |Y| < |X|
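A quick way to see the lossless case is a round trip through a general-purpose compressor. The sketch below uses Python's standard `zlib` module on a made-up, highly repetitive sequence; the input and the function name are assumptions for illustration, not data from the talk.

```python
import zlib


def compress_lossless(x: bytes) -> bytes:
    """Compressor A: maps input X to output Y using DEFLATE at maximum level."""
    return zlib.compress(x, 9)


# Hypothetical input: a very repetitive DNA-like string.
x = ("ACGT" * 250).encode("ascii")
y = compress_lossless(x)

# Lossless means X can be recovered exactly from Y.
assert zlib.decompress(y) == x
print(len(x), len(y))  # |X| and |Y|; here |Y| is much smaller than |X|
```

Real genomic sequences are far less repetitive than this toy input, which is one reason specialized compressors for biological sequences exist.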

Data Compression

• Data Compression in Computational Biology, Giancarlo, Scaturro, Utro, 2009
• Compressive Sequence Analysis, Giancarlo, Rombo, Utro 2014
• General compression: rich history, 1948…

Data Compression

• Compression of biological sequences, Grumbach and Tahi 1993

• Period 1993-2007: “only” 17 new methods specialized to biological sequences

• Period 2008-2013: 36 (and counting) new methods specialized to NGS data and large genomic sequence collections. A couple of fundamentally new ideas are present: problem to be studied

Compressive genomics

•In a nutshell:

• Algorithm A solves problem P on input x = AAAAAAAAAAACCCCCCCCGGGGGG
• Algorithm A’ solves problem P on input x’ = (A,11); (C,8); (G,6)

OUTPUT IS THE SAME
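The slide's example can be played out on a toy problem. In the sketch below, problem P is assumed to be "count each base", which is a hypothetical choice since the talk leaves P abstract. Algorithm A scans the plain sequence, while A' works directly on the run-length encoding without decompressing it.

```python
from collections import Counter
from itertools import groupby


def run_length_encode(seq):
    """Encode `seq` as a list of (symbol, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(seq)]


def base_counts_plain(seq):
    """Algorithm A: solve P on the uncompressed input, one step per symbol."""
    return Counter(seq)


def base_counts_rle(runs):
    """Algorithm A': solve P on the compressed input, one step per run."""
    counts = Counter()
    for ch, length in runs:
        counts[ch] += length
    return counts


x = "A" * 11 + "C" * 8 + "G" * 6
x_prime = run_length_encode(x)
print(x_prime)                                             # [('A', 11), ('C', 8), ('G', 6)]
print(base_counts_plain(x) == base_counts_rle(x_prime))    # True
```

A' does 3 steps instead of 25 here; the point of compressive genomics is that the running time scales with the compressed size rather than the raw size.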

Compressive Genomics

• Protein database BLAST searches on a compressed database, Berger et al. 2012, 2013
• Compressed Indexing and DNA Local Alignment, Lam et al., 2008
• String Matching over compressed text, Amir et al. 1994
• A sub-quadratic sequence alignment algorithm over compressed text, Crochemore et al. 2003

Discrete Algorithms and Bioinformatics

-A data deluge…ehm, universal

-Remedies Part 1: Historia Magistra Vitae

A. Apostolico and M. Crochemore, String pattern matching for a deluge survival kit, 2002

B. Berger, J. Peng, M. Singh, Computational solutions for omics data, 2013

-Remedies Part 2: Algorithmic foundational work to break the Big Data Wall!!!

Discrete Algorithms

- New algorithmic design paradigms
• External Memory algorithms: input data reside on disk and are too big to fit in memory
• Aggarwal and Vitter 1988
• An area that has reached full maturity; Comp. Bio. may be reasonably happy with it.
• Recoil, Yanovsky 2011: compression of embarrassingly large DNA sequence collections
• Bauer et al. 2012: Lightweight LCP construction for Next Generation Sequencing datasets
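A classic instance of the external-memory paradigm is merge sort with on-disk runs. The sketch below only illustrates the pattern (sort chunks that fit in memory, spill them to disk, then merge the runs); it does not model the block size and memory parameters of the I/O model, and the function name and chunking policy are assumptions.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(records, chunk_size):
    """Yield `records` (an iterable of newline-free strings) in sorted order.

    At most `chunk_size` records are held in memory while forming runs;
    the merge phase keeps one record per run in memory.
    """
    run_paths = []
    it = iter(records)
    try:
        # Phase 1: cut the input into sorted runs and spill each run to disk.
        while True:
            chunk = sorted(islice(it, chunk_size))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(rec + "\n" for rec in chunk)
            run_paths.append(path)

        # Phase 2: k-way merge of the sorted runs, read back sequentially.
        files = [open(p) for p in run_paths]
        try:
            streams = [(line.rstrip("\n") for line in f) for f in files]
            yield from heapq.merge(*streams)
        finally:
            for f in files:
                f.close()
    finally:
        for p in run_paths:
            os.remove(p)


reads = ["GATTACA", "ACGT", "TTGA", "CCAT", "ACGA"]  # hypothetical reads
print(list(external_sort(reads, chunk_size=2)))
```

Both phases access the disk sequentially, which is the access pattern external-memory algorithms are designed around.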

Discrete Algorithms and Bioinformatics

-New algorithmic design paradigms

Algorithms on Data Streams: the volume of data is so large that one cannot even store it
• Data is produced “in a stream” and cannot be stored in memory
• M. Henzinger, P. Raghavan, S. Rajagopalan 1999
• Probably not very good for Comp. Bio.
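For readers unfamiliar with the streaming model, reservoir sampling is a standard one-pass example: it keeps a uniform random sample of fixed size from a stream whose length is unknown and which is never stored. It is shown here only to illustrate the model, not as a method from the cited paper; the function name and the optional `rng` parameter are illustrative.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to `k` items from `stream`.

    Uses O(k) memory and reads each item exactly once (Algorithm R).
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the n-th item with probability k/n, evicting a random slot.
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample


# Hypothetical stream of read identifiers, consumed lazily.
stream = (f"read_{i}" for i in range(100_000))
print(reservoir_sample(stream, k=5, rng=random.Random(0)))
```

The output depends on the random seed, so no particular sample is shown.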

Discrete Algorithms

- New algorithmic design paradigms
- Succinct data structures: storing data in small space
- G.J. Jacobson, 1988
- Promising for Comp. Bio.
• Full Text Self-Indexes
• Bloom Filters: Pell et al., 2012: 40-fold reduction in memory requirement for metagenome assembly
• Bloom filters were invented in 1970
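A Bloom filter is small enough to sketch in full. The version below is a minimal illustration, assuming salted BLAKE2b digests as the hash family and k-mers as the stored items; it is not the implementation used by Pell et al., and the sizes chosen are arbitrary.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Yield one bit position per hash function for `item`."""
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical use: remember which 4-mers of a read have been seen.
read = "GATTACAGATTACA"
kmers = {read[i:i + 4] for i in range(len(read) - 3)}
seen = BloomFilter(num_bits=1 << 12, num_hashes=3)
for kmer in kmers:
    seen.add(kmer)
print(all(kmer in seen for kmer in kmers))  # True: inserted items are always found
```

The memory used is fixed by `num_bits` regardless of how long the items are, which is where the savings for k-mer sets come from; the price is a tunable false positive rate.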

Discrete Algorithms

-New algorithmic design paradigms

- Synopsis Data Structures: only a “relevant summary” of the data is kept (Gibbons and Matias, 1998)
• No use yet in Comp. Bio., but very promising because of its success in database system design: Iceberg Queries
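An iceberg query asks for the items whose frequency exceeds a threshold. One well-known small-space summary for that task is the Misra-Gries frequent-items algorithm, sketched below as an illustration; it is a swapped-in example, not necessarily a structure from the cited work, and applying it to k-mers is a hypothetical use.

```python
def misra_gries(stream, k):
    """Summarize `stream` with at most k - 1 counters.

    Guarantee: every item occurring more than n / k times (n = stream length)
    is among the returned keys. Counts are underestimates, and some returned
    keys may be infrequent, so a second pass is needed for exact answers.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream of 3-mers in which "ACG" is the majority item.
kmer_stream = ["ACG", "TTT", "ACG", "GGA", "ACG", "ACG", "CCC", "ACG", "ACG", "TGA"]
print(misra_gries(kmer_stream, k=2))  # {'ACG': 2}
```

The summary has a size fixed in advance by `k`, independent of the number of distinct items in the stream.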

Discrete Algorithms

-New algorithmic design paradigms

- Approximation algorithms: well known for hard problems, e.g. TSP, genome assembly
• New: use them for “resource bounded” problems in order to obtain approximations with performance guarantees
• Already in use in Comp. Bio. WITHOUT the performance guarantee part…

Conclusions

- Since the late 80’s, a solid bridge has been built between Algorithmic Research and Bioinformatics and Comp. Bio.
• Algorithmic Research seems to be asking the right questions in foundational terms for “BIG DATA”
- Biology is a privileged testbed, with a turning point in attention around 1997
• The claim that algorithmic research “does not listen to Comp. Bio. needs” is an urban legend
• Having fun learning about algorithmic theory? We do learning about biology!!!

Open Problems

- Who won the race? Hint: