Page 1:

Role of Computer and Information Science in Biology

Presented By
Dr G. P. S. Raghava
Co-ordinator, Bioinformatic Centre, IMTECH, Chandigarh, India
&
Visiting Professor, Pohang Univ. of Science & Technology, Republic of Korea

Email: [email protected]
Web: http://www.imtech.res.in/raghava/

Page 2:

Major Applications & Challenges

Introduction to Biology
Genome Annotation: Gene Prediction
Analysis and Comparison of Sequences
Protein Structure Prediction
DNA Chip (Microarray) technology
Proteomics: Analysis of 2D gel
Fingerprinting Technique
Drug development
Computer-Aided Vaccine Design

Page 3:

Hierarchy in Biology

Atoms

Molecules

Macromolecules

Organelles

Cells

Tissues

Organs

Organ Systems

Individual Organisms

Populations

Communities

Ecosystems

Biosphere

Page 4:

Animal cell

Page 5:

Human Chromosomes

Page 6:

Genes are linearly arranged along chromosomes

Page 7:

Chromosomes and DNA

Page 8:

DNA can be simplified to a string of four letters

GATTACA

Page 9:

(RT)

Page 10:

Sequence to Structure: It’s a matter of dimensions!

1D Nucleic acid sequence
AGT-TTC-CCA-GGG…

1D Protein sequence
Met-Ala-Gly-Lys-His…
M – A – G – K – H…

3D Spatial arrangement of atoms
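
The step from a 1D nucleic acid sequence to a 1D protein sequence can be sketched as a codon lookup. This is a minimal illustration using the standard genetic code; the function name `translate` and the first example sequence (chosen so that it encodes the Met-Ala-Gly-Lys-His peptide shown on the slide) are assumptions, not part of the presentation.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence into one-letter amino acid codes."""
    dna = dna.upper().replace("-", "")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATG-GCT-GGT-AAA-CAT"))  # MAGKH (hypothetical input)
print(translate("AGT-TTC-CCA-GGG"))      # SFPG (the slide's nucleic acid example)
```

Going from either 1D sequence to the 3D arrangement of atoms has no such lookup table, which is why structure prediction is treated separately later in the talk.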

Page 11:

Genome Annotation

The Process of Adding Biological Information and Predictions to a Sequenced Genome Framework

Page 12:

Importance of Sequence Comparison

Protein Structure Prediction
– Similar sequences have similar structure & function
– Phylogenetic Tree
– Homology based protein structure prediction

Genome Annotation
– Homology based gene prediction
– Function assignment & evolutionary studies

Searching drug targets
– Searching sequences present or absent across genomes

Page 13:

Protein Sequence Alignment and Database Searching

Alignment of Two Sequences (Pair-wise Alignment)
– The Scoring Schemes or Weight Matrices
– Techniques of Alignments
– DOTPLOT

Multiple Sequence Alignment (Alignment of > 2 Sequences)
– Extending Dynamic Programming to more sequences
– Progressive Alignment (Tree or Hierarchical Methods)
– Iterative Techniques
  • Stochastic Algorithms (SA, GA, HMM)
  • Non Stochastic Algorithms

Database Scanning
– FASTA, BLAST, PSIBLAST, ISS

Alignment of Whole Genomes
– MUMmer (Maximal Unique Match)

Page 14:

Alignment of Two Sequences
Dealing with Gaps in Pair-wise Alignment

Sequence Comparison without Gaps
Slide Window method to get maximum score

ALGAWDE
ALATWDE

Total score = 1+1+0+0+1+1+1 = 5; percent identity (PID) = (5*100)/7

Sequences of variable length should use dynamic programming

Sequence Comparison with Gaps
• Insertion and deletion are common
• Slide Window method fails
• Generate all possible alignments
• A 100-residue alignment requires > 10^75 possible alignments
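
The ungapped comparison above can be sketched as follows. This is a minimal illustration assuming identity scoring (1 per match, 0 per mismatch), as in the slide's worked example; the function names are hypothetical.

```python
def identity_score(a: str, b: str) -> int:
    """Count identical residues at aligned positions (1 per match, 0 per mismatch)."""
    return sum(1 for x, y in zip(a, b) if x == y)


def best_ungapped(a: str, b: str):
    """Slide b along a without gaps; return (best score, offset of b relative to a)."""
    best = (0, 0)
    for offset in range(-(len(b) - 1), len(a)):
        if offset >= 0:
            score = identity_score(a[offset:], b)
        else:
            score = identity_score(a, b[-offset:])
        if score > best[0]:
            best = (score, offset)
    return best


score, offset = best_ungapped("ALGAWDE", "ALATWDE")
pid = score * 100 / len("ALGAWDE")
print(score, offset, round(pid, 1))  # 5 0 71.4
```

The loop tries only len(a) + len(b) - 1 offsets, which is why the method is fast but cannot recover from an insertion or deletion.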

Page 15:

Alternate Dot Matrix Plot
Diagonal * shows aligned/identical regions
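
A dot matrix plot can be produced in a few lines. This sketch assumes the simplest rule (a `*` wherever two residues are identical, no window or threshold) and reuses the two sequences from the ungapped example; `dotplot` is a hypothetical name.

```python
def dotplot(a: str, b: str) -> list[str]:
    """Return one text row per residue of a, with '*' where a[i] == b[j]."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


rows = dotplot("ALGAWDE", "ALATWDE")
print("  ALATWDE")
for residue, row in zip("ALGAWDE", rows):
    print(residue, row)
```

Runs of `*` along a diagonal correspond to aligned identical regions; the isolated off-diagonal `*` marks come from the repeated residue A.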

Page 16:

Dynamic Programming

Dynamic Programming allows Optimal Alignment between two sequences
Allows Insertion and Deletion or Alignment with gaps
Needleman and Wunsch Algorithm (1970) for global alignment
Smith & Waterman Algorithm (1981) for local alignment
Important Steps
– Create DOTPLOT between two sequences
– Compute SUM matrix
– Trace Optimal Path
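
The three steps can be sketched as a small global aligner in the spirit of Needleman and Wunsch. The scoring values (match 1, mismatch 0, linear gap penalty -1) are assumptions chosen for illustration; real tools use substitution matrices such as PAM or BLOSUM and usually affine gap penalties.

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    def sub(x, y):
        # Step 1: the dot plot, expressed as a match/mismatch score.
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # Step 2: SUM matrix, S[i][j] = best score for aligning a[:i] with b[:j].
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Step 3: trace the optimal path back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return S[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ALGAWDE", "ALAWDE")
print(score)   # 5
print(top)     # ALGAWDE
print(bottom)  # AL-AWDE
```

The work is proportional to the product of the two sequence lengths, in contrast to the > 10^75 alignments that exhaustive enumeration would need.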

Page 17:

Alignment of Multiple Sequences

Extending Dynamic Programming to more sequences
– Dynamic programming can be extended to more than two sequences
– In practice it requires too much CPU time and memory (Murata et al., 1985)
– MSA, limited to only 8-10 sequences (1989)
– DCA (Divide and Conquer; Stoye et al., 1997), 20-25 sequences
– OMA (Optimal Multiple Alignment; Reinert et al., 2000)
– COSA (Althaus et al., 2002)

Progressive or Tree or Hierarchical Methods (CLUSTAL-W)
– Practical approach for multiple alignment
– Compare all sequences pair wise
– Perform cluster analysis
– Generate a hierarchy for alignment
– First align the most similar pair of sequences
– Align alignment with next similar alignment or sequence

Page 18:
Page 19:

Database scanning

Basic principles of Database searching
– Search query sequence against all sequences in database
– Calculate score and select top sequences
– Dynamic programming is the most accurate but too slow for large databases

Approximation Algorithms

FASTA
– Fast sequence search
– Based on dotplot
– Identify identical words (k-tuples)
– Search significant diagonals
– Use PAM 250 for further refinement
– Dynamic programming for narrow region
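
The first two FASTA steps (identical k-tuples, then significant diagonals) can be sketched as below. This is not the FASTA program itself: the PAM 250 rescoring and the banded dynamic programming are omitted, and the function name and the example target sequence are hypothetical.

```python
from collections import Counter, defaultdict


def diagonal_hits(query: str, target: str, k: int = 2) -> Counter:
    """Count identical k-tuples per diagonal (diagonal = target position - query position)."""
    index = defaultdict(list)  # k-tuple -> positions in the query
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            hits[j - i] += 1
    return hits


hits = diagonal_hits("ALGAWDE", "MKALATWDEG", k=2)
print(hits.most_common(1))  # [(2, 3)]
```

The diagonal with the most hits marks the narrow region that the later, more expensive steps would refine.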

Page 20:

Principles of FASTA Algorithms

Page 21:
Page 22:

Database Scanning or Fold Recognition

Concept of PSIBLAST
– Improve the sensitivity of BLAST
– Perform the BLAST search (gap handling)
– Generate the position-specific score matrix
– Use PSSM for next round of search

Intermediate Sequence Search
– Search query against protein database
– Generate multiple alignment or profile
– Use profile to search against PDB
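
How a position-specific score matrix is built from aligned hits can be sketched as a per-column log-odds table. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes a uniform background frequency, a flat pseudocount and no sequence weighting, and the toy alignment is hypothetical.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Per-column log-odds scores against a uniform background (an assumption)."""
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in alphabet})
    return pssm


def pssm_score(pssm, seq: str) -> float:
    """Score an ungapped sequence of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


profile = build_pssm(["ALGAWDE", "ALATWDE", "ALSAWDE"])
print(round(pssm_score(profile, "ALGAWDE"), 2))
print(round(pssm_score(profile, "KKKKKKK"), 2))
```

Residues seen often in a column score positively there and unseen residues score negatively, so the profile rewards family members that a single query sequence might miss.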

Page 23:

Comparison of Whole Genomes

MUMmer (Salzberg group, 1999, 2002)
– Pair-wise sequence alignment of genomes
– Assumes that sequences are closely related
– Allows detection of repeats, inverse repeats, SNPs
– Domain inserted/deleted
– Identify the exact matches

How it works
– Identify the maximal unique matches (MUMs) in two genomes
– As the two genomes are similar, large MUMs will be present
– Sort the matches found and extract the longest set of possible matches that occur in the same order (Ordered MUM)
– Suffix tree is used to identify MUMs
– Close the gaps by SNPs, large inserts
– Align regions between MUMs by Smith-Waterman
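
The "ordered MUM" step can be sketched as a longest-increasing-subsequence problem. This is an illustration only: it assumes the MUMs have already been found (the suffix tree is not shown), represents each as a hypothetical `(position_in_genome_a, position_in_genome_b, length)` tuple, and maximises the number of matches rather than their total length.

```python
def ordered_mums(mums):
    """Return the longest chain of matches whose positions increase in both genomes."""
    mums = sorted(mums)  # sort by position in genome A
    if not mums:
        return []
    best_len = [1] * len(mums)  # best chain length ending at each match
    prev = [-1] * len(mums)     # back-pointer to the previous match in that chain
    for i in range(len(mums)):
        for j in range(i):
            if mums[j][1] < mums[i][1] and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    i = max(range(len(mums)), key=best_len.__getitem__)
    chain = []
    while i != -1:
        chain.append(mums[i])
        i = prev[i]
    return chain[::-1]


mums = [(10, 12, 30), (50, 300, 25), (90, 95, 40), (150, 160, 22), (400, 20, 35)]
print(ordered_mums(mums))  # [(10, 12, 30), (90, 95, 40), (150, 160, 22)]
```

The matches left out of the chain, here at positions 50 and 400 in genome A, are candidates for rearrangements or repeats, and the regions between chained matches are what the final alignment step fills in.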

Page 24:

Protein Structure Prediction

Experimental Techniques
– X-ray Crystallography
– NMR

Limitations of Current Experimental Techniques
– Protein DataBank (PDB) -> 24000 protein structures
– SwissProt -> 100,000 proteins
– Non-Redundant (NR) -> 1,000,000 proteins

Importance of Structure Prediction
– Fill gap between known sequences and structures
– Protein Engg. to alter function of a protein
– Rational Drug Design

Page 25:

Protein Structures

Page 26:

Techniques of Structure Prediction

Computer simulation based on energy calculation
– Based on physico-chemical principles
– Thermodynamic equilibrium with a minimum free energy
– Global minimum free energy of protein surface

Knowledge Based approaches
– Homology Based Approach
– Threading Protein Sequence
– Hierarchical Methods

Page 27:

Energy Minimization Techniques

Energy Minimization based methods, in their pure form, make no a priori assumptions and attempt to locate global minima.

Static Minimization Methods
– Classical many potential-potential can be constructed
– Assume that atoms in protein are in static form
– Problems (large number of variables & minima, and validity of potentials)

Dynamical Minimization Methods
– Motions of atoms also considered
– Monte Carlo simulation (stochastic in nature, time is not considered)
– Molecular Dynamics (time, quantum mechanical, classical equ.)

Limitations
– Large number of degrees of freedom, CPU power not adequate
– Interaction potential is not good enough to model
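
The Monte Carlo idea (random moves, no explicit time) can be sketched with the Metropolis acceptance rule on a toy problem. Everything here is an assumption for illustration: the one-dimensional `toy_energy` landscape stands in for a real molecular potential, and the step size, temperature and seed are arbitrary.

```python
import math
import random


def metropolis_minimize(energy, x0, steps=5000, step_size=0.5, temperature=1.0, seed=0):
    """Metropolis Monte Carlo search; returns the lowest-energy point visited."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        e_new = energy(candidate)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temperature):
            x, e = candidate, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


def toy_energy(x: float) -> float:
    """Hypothetical 1D landscape with several local minima."""
    return 0.1 * x * x + math.sin(3 * x)


best_x, best_e = metropolis_minimize(toy_energy, x0=4.0)
print(round(best_x, 2), round(best_e, 2))
```

Accepting some uphill moves is what lets the search leave a local minimum. A protein has thousands of coupled coordinates instead of one, which is the degrees-of-freedom limitation listed above.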

Page 28:

Knowledge Based Approaches

Homology Modelling
– Need homologues of known protein structure
– Backbone modelling
– Side chain modelling
– Fails in absence of homology

Threading Based Methods
– New way of fold recognition
– Sequence is tried to fit in known structures
– Motif recognition
– Loop & Side chain modelling
– Fails in absence of known example

Page 29:

Hierarchical Methods

Intermediate structures are predicted, instead of predicting tertiary structure of protein from amino acid sequence

Prediction of backbone structure
– Secondary structure (helix, sheet, coil)
– Beta Turn Prediction
– Super-secondary structure

Tertiary structure prediction

Limitation
– Accuracy is only 75-80 %
– Only three state prediction

Page 30:

[Figure: cDNA microarray workflow. cDNA clones (probes); PCR product amplification and purification; printing (0.1 nl/spot); microarray; hybridise mRNA target to microarray; excitation (laser 1, laser 2); emission; scanning; overlay images and normalise; analysis]

Page 31:

Major Applications

Identification of differentially expressed genes in diseased tissues (in presence of drug)
Classification of differentially expressed genes, or clustering/grouping of genes having similar behaviour in different conditions
Use expression profile of known disease to diagnose and classify unknown genes

Page 32:

Terms/Jargons

Stanford/cDNA chip
– one slide/experiment
– one spot
– 1 gene => one spot or few spots (replica)
– control: control spots
– control: two fluorescent dyes (Cy3/Cy5)

Affymetrix/oligo chip
– one chip/experiment
– one probe/feature/cell
– 1 gene => many probes (20~25 mers)
– control: match and mismatch cells

Page 33:

Images: examples

[Images: Cy3 channel, Cy5 channel, pseudo-colour overlay]

Spot colour   Signal strength       Gene expression
yellow        Control = perturbed   unchanged
red           Control < perturbed   induced
green         Control > perturbed   repressed

Page 34:

Processing of images

Addressing or gridding
– Assigning coordinates to each of the spots

Segmentation
– Classification of pixels either as foreground or as background

Intensity determination for each spot
– Foreground fluorescence intensity pairs (R, G)
– Background intensities
– Quality measures

Page 35:

Management of Microarray Data

Magnitude of Data
– Experiments
  • 50 000 genes in human
  • 320 cell types
  • 2000 compounds
  • 3 time points
  • 2 concentrations
  • 2 replicates
– Data Volume
  • 4*10^11 data-points
  • 10^15 = 1 petaB of Data
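
The data-point estimate is the product of the experimental factors listed on the slide. The short check below reproduces it; the bytes-per-data-point figure is simply what a 1 petabyte total would imply and is not stated in the presentation.

```python
genes, cell_types, compounds = 50_000, 320, 2_000
time_points, concentrations, replicates = 3, 2, 2

data_points = genes * cell_types * compounds * time_points * concentrations * replicates
print(f"{data_points:.2e}")  # 3.84e+11, i.e. about 4*10^11

petabyte = 10**15
bytes_per_point = petabyte / data_points  # storage per data point implied by 1 PB
print(round(bytes_per_point))
```

A few kilobytes per data point is plausible if the raw scanned image pixels are stored alongside each spot's summary intensities.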

Page 36:

Management of Microarray Data

Major Issues

Large volume of microarray data in last few years
– Storage and efficient access
– Comparison and integration of data

Problem of data access and exchange
– Data scattered around Internet
– Supplementary material of publications
– Difficult for user to access relevant data

Problems with existing databases
– Diverse purpose
– Developed for specific purpose

Page 37:

Management of Microarray Data

Specific Database
– Platform (e.g. Stanford MA Database; SMD)
– Organism (Yeast MA global viewer)
– Project (Life cycle database of Drosophila)

Problem with Supplement and MA databases
– Lack of direct access
– Quality not checked
– No standard format
– Incomplete data

Page 38:

Pre-processed cDNA Gene Expression Data

On p genes for n slides: p is O(10,000), n is O(10-100), but growing.

Rows are genes, columns are slides:

Gene   slide 1   slide 2   slide 3   slide 4   slide 5   …
1       0.46      0.30      0.80      1.51      0.90     ...
2      -0.10      0.49      0.24      0.06      0.46     ...
3       0.15      0.74      0.04      0.10      0.20     ...
4      -0.45     -1.03     -0.79     -0.56     -0.32     ...
5      -0.06      1.06      1.35      1.09     -1.09     ...

Gene expression level of gene 5 in slide 4 = Log2(Red intensity / Green intensity)

These values are conventionally displayed on a red (>0) yellow (0) green (<0) scale.
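
Each table entry is a log2 ratio of the two channel intensities. A minimal sketch follows; the intensity values and the `tolerance` band used to call a spot yellow are hypothetical.

```python
import math


def log_ratio(red: float, green: float) -> float:
    """Expression level as log2(red intensity / green intensity)."""
    return math.log2(red / green)


def display_colour(value: float, tolerance: float = 0.1) -> str:
    """Map a log ratio to the conventional red / yellow / green display."""
    if value > tolerance:
        return "red"
    if value < -tolerance:
        return "green"
    return "yellow"


for red, green in [(2000.0, 1000.0), (1000.0, 1000.0), (500.0, 1000.0)]:
    value = log_ratio(red, green)
    print(value, display_colour(value))
# 1.0 red
# 0.0 yellow
# -1.0 green
```

Using the logarithm makes induction and repression symmetric: a two-fold increase gives +1 and a two-fold decrease gives -1.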

Page 39:

Analysis of Microarray Data

Analysis of images
Preprocessing of gene expression data
Normalization of data
– Subtraction of Background Noise
– Global/local Normalization
– House keeping genes (or same gene)
– Expression in ratio (test/references) in log
Differential Gene expression
– Repeats and calculate significance (t-test)
– Significance of fold change assessed by a statistical method
Clustering
– Supervised/Unsupervised (Hierarchical, K-means, SOM)
Prediction or Supervised Machine Learning (SVM)
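
The t-test on repeats can be sketched with the standard library. This computes Welch's t statistic (unequal variances) for one gene; the choice of Welch's variant and the replicate values are assumptions, and turning the statistic into a p-value would additionally need the t distribution, which is not shown.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control, treated) -> float:
    """Welch's t statistic for two groups of replicate log ratios of one gene."""
    var_c = variance(control) / len(control)
    var_t = variance(treated) / len(treated)
    return (mean(treated) - mean(control)) / sqrt(var_c + var_t)


control = [0.10, -0.10, 0.00, 0.05]  # hypothetical replicate log ratios
treated = [1.00, 1.20, 0.90, 1.10]
t_stat = welch_t(control, treated)
print(round(t_stat, 1))
```

A large fold change with noisy replicates gives a small t, which is why a fold-change cut-off alone is not a test of significance.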

Page 40:

Normalization Techniques

Global normalization
– Divide channel value by means
Control spots
– Common spots in both channels
– House keeping genes
– Ratio of intensity of same gene in two channels is used for correction
Iterative linear regression
Parametric nonlinear normalization
– log(CY3/CY5) vs log(CY5)
– Fitted log ratio – observed log ratio
General Non Linear Normalization
– LOESS
– curve between log(R/G) vs log(sqrt(R.G))
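
Global normalization and the two quantities plotted for LOESS can be sketched as follows. The LOESS fit itself is not implemented here; base-2 logarithms and the intensity values are assumptions (the slide writes `log` without a base).

```python
import math
from statistics import mean


def global_normalize(red, green):
    """Divide each channel by its own mean so both channels average 1.0."""
    r_mean, g_mean = mean(red), mean(green)
    return [r / r_mean for r in red], [g / g_mean for g in green]


def ma_values(red, green):
    """M = log2(R/G) and A = log2(sqrt(R*G)), the two axes of the LOESS curve."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(math.sqrt(r * g)) for r, g in zip(red, green)]
    return m, a


# Hypothetical spots where the red channel reads twice as bright for every gene.
red, green = [200.0, 400.0, 800.0], [100.0, 200.0, 400.0]
m_raw, _ = ma_values(red, green)
m_norm, a_norm = ma_values(*global_normalize(red, green))
print(m_raw)                           # [1.0, 1.0, 1.0]
print([round(m, 6) for m in m_norm])   # all approximately 0
```

Global normalization removes a constant dye bias like this one. A bias that changes with intensity shows up as a curved trend of M against A, which is what the LOESS fit is meant to subtract.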

Page 41:

Classification

Task: assign objects to classes (groups) on the basis of measurements made on the objects
Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

Page 42:

Issues in Clustering

Pre-processing (Image analysis and Normalization)
Which genes (variables) are used
Which samples are used
Which distance measure is used
Which algorithm is applied
How to decide the number of clusters K

Page 43:

Unsupervised Learning

Hierarchical clustering: merging two branches at a time until all variables (genes) are in one tree. [it does not answer the question of “how many gene clusters there are?”]
K-means clustering: assuming there are K clusters. [what if this assumption is incorrect?]
Model-based clustering: the number of clusters is determined dynamically [could be one of the most promising methods]
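
K-means with a fixed K can be sketched in plain Python. This is Lloyd's algorithm with the first K points used as starting centroids (a simplification; real implementations use random or k-means++ starts), and the two-slide expression profiles are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's k-means on equal-length tuples; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical log-ratio profiles of four genes over two slides.
profiles = [(1.0, 1.1), (-1.0, -0.9), (1.2, 0.9), (-1.1, -1.0)]
labels, centroids = kmeans(profiles, k=2)
print(labels)  # [0, 1, 0, 1]
```

Running the same data with a different `k` still returns exactly `k` groups, which is the caveat raised in the bracketed question above.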

Page 44:

Supervised Analysis

Fisher’s linear discriminant analysis
Quadratic discriminant analysis
Logistic regression (a linear discriminant analysis)
Neural networks
Support vector machine

Page 45:

Traditional Proteomics

1D gel electrophoresis (SDS-PAGE)
2D gel electrophoresis
Protein Chips
– Chips coated with proteins/Antibodies
– Large scale version of ELISA
Mass Spectrometry
– MALDI: Mass fingerprinting
– Electrospray and tandem mass spectrometry
  • Sequencing of Peptides (N->C)
  • Matching in Genome/Proteome Databases

Page 46:

Overview of 2D Gel

SDS-PAGE + Isoelectric focusing (IEF)
– Gene Expression Studies
– Medical Applications
– Sample Experiments
Capturing and Analyzing Data
– Image Acquisition
– Image Sizing & Orientation
– Spot Identification
– Matching and Analysis

Page 47:

Comparison/Matching of Gel Images

Compare 2 gel images
– Set X and Y axes
– Overlap matching spots
– Compare intensity of spots
Scan against database
– Compare query gel with all gels
– Calculate similarity score
– Sort based on score
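The two workflows above can be sketched in a few lines, assuming the gels are already aligned to common X and Y axes and each spot is reduced to an (x, y, intensity) tuple. The greedy nearest-spot matching, the distance tolerance and the Dice-style similarity score are illustrative assumptions; the slides do not specify a particular matching algorithm or score.

```python
import math


def match_spots(gel_a, gel_b, tol=2.0):
    """Greedily pair each spot in gel_a with the nearest unused
    spot in gel_b that lies within `tol` distance units."""
    used, pairs = set(), []
    for i, (xa, ya, _) in enumerate(gel_a):
        best, best_dist = None, tol
        for j, (xb, yb, _) in enumerate(gel_b):
            if j in used:
                continue
            dist = math.hypot(xa - xb, ya - yb)
            if dist <= best_dist:
                best, best_dist = j, dist
        if best is not None:
            used.add(best)
            pairs.append((i, best))
    return pairs


def intensity_ratios(gel_a, gel_b, tol=2.0):
    """Intensity of each matched spot in gel_b relative to gel_a."""
    return [gel_b[j][2] / gel_a[i][2]
            for i, j in match_spots(gel_a, gel_b, tol)]


def similarity(gel_a, gel_b, tol=2.0):
    """Dice-style score: 2 * matched spots / total spots."""
    total = len(gel_a) + len(gel_b)
    if total == 0:
        return 0.0
    return 2 * len(match_spots(gel_a, gel_b, tol)) / total


def scan_database(query, database, tol=2.0):
    """Compare the query gel with all gels, sorted by score."""
    scored = [(similarity(query, gel, tol), name)
              for name, gel in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical spot lists: (x, y, intensity).
query = [(10, 10, 100), (20, 30, 50), (40, 15, 80)]
database = {
    "gel1": [(10.5, 10.2, 200), (20.1, 29.5, 50), (41, 15, 40)],
    "gel2": [(10, 10, 90), (60, 60, 10)],
}
print(scan_database(query, database))
print(intensity_ratios(query, database["gel1"]))
```

In practice gels are distorted relative to one another, so a warping or landmark-alignment step normally precedes spot matching.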

Page 48:

Differential Proteomics: Fingerprints of Disease

Phenotypic Changes

Normal Cells

Disease Cells

• Differential protein expression
• Protein nitration patterns
• Altered phosphorylation
• Altered glycosylation profiles

Utility
• Target discovery
• Disease pathways
• Disease biomarkers

Page 49:

Fingerprinting Technique

What is fingerprinting
– It is a technique to create a specific pattern for a given organism/person
– To compare the patterns of query and target objects
– To create a phylogenetic tree/classification based on patterns
Types of Fingerprinting
– DNA Fingerprinting
– Mass/peptide fingerprinting
– Properties based (Toxicity, classification)
– Domain/conserved pattern fingerprinting
Common Applications
– Paternity and Maternity
– Criminal Identification and Forensics
– Personal Identification
– Classification/Identification of organisms
– Classification of cells

Page 50:

Fingerprinting Techniques: Principles & Applications

What is fingerprinting
Types of Fingerprinting
Common Applications
Role of Computer in DNA Fingerprinting
– Searching Restriction Enzymes
– Searching VNTRs
– Computation of size of DNA fragments
– Optimization of gels
– Comparison of patterns
– Creation of Phylogenetic tree
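Two of these computational tasks, computing fragment sizes and comparing patterns, can be sketched as follows. The example uses the EcoRI recognition site GAATTC, which is cut after the first G. The DNA sequences, the band-sharing (Dice) score and all function names are hypothetical choices for illustration, and the sequence is treated as linear.

```python
def find_sites(sequence, site):
    """Start positions of every occurrence of a recognition site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def fragment_sizes(sequence, site, cut_offset):
    """Fragment lengths after cutting a linear sequence at each site.

    cut_offset is the number of bases of the site that stay on the
    left-hand fragment (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = [p + cut_offset for p in find_sites(sequence, site)]
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def band_sharing(bands_a, bands_b, tol=0):
    """Dice coefficient: 2 * shared bands / total bands.
    Two bands are shared if their sizes differ by at most `tol`."""
    remaining = list(bands_b)
    shared = 0
    for band in bands_a:
        for other in remaining:
            if abs(band - other) <= tol:
                remaining.remove(other)
                shared += 1
                break
    total = len(bands_a) + len(bands_b)
    return 2 * shared / total if total else 0.0


ECORI = "GAATTC"

# Hypothetical sequences that differ in the length of the last fragment.
sample_a = "AAA" + ECORI + "CCCC" + ECORI + "TT"
sample_b = "AAA" + ECORI + "CCCC" + ECORI + "TTTTTTT"

bands_a = fragment_sizes(sample_a, ECORI, 1)
bands_b = fragment_sizes(sample_b, ECORI, 1)
print(bands_a, bands_b, band_sharing(bands_a, bands_b))
```

A matrix of pairwise band-sharing scores between samples is one possible input for the phylogenetic tree construction listed last.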

Page 51:

Drug Design

History of Drug/Vaccine development
– Plants or Natural Products
  Plants and natural products were the source of medicinal substances
  Example: foxglove was used to treat congestive heart failure
  Foxglove contains digitalis (cardiotonic glycosides)
  Identification of the active component
– Accidental Observations
  Penicillin is one good example
  Alexander Fleming observed the effect of mold
  The mold (Penicillium) produces the substance penicillin
  Discovery of penicillin led to large-scale screening
  Soil microorganisms were grown and tested
  Streptomycin, neomycin, gentamicin, tetracyclines etc.

Page 52:

Drug Design

Chemical Modification of Known Drugs
– Drug improvement by chemical modification
– Penicillin G -> Methicillin; morphine -> nalorphine
Receptor-based drug design
– Receptor is the target (usually a protein)
– Drug molecule binds to cause biological effects
– It is also called the lock-and-key system
– Structure determination of the receptor is important
Ligand-based drug design
– Search for a lead compound or active ligand
– Structure of the ligand guides the drug design process

Page 53:

Drug Design based on Bioinformatics Tools

Detect the Molecular Bases for Disease
– Detection of drug binding site
– Tailor drug to bind at that site
– Protein modeling techniques
– Traditional Method (brute force testing)
Rational drug design techniques
– Screen likely compounds built
– Modeling large number of compounds (automated)
– Application of Artificial intelligence
– Limitation of known structures

Page 54:

Important Points in Drug Design based on Bioinformatics Tools

Application of Genome
– 3 billion base pairs
– 30,000 unique genes
– Any gene may be a potential drug target
– ~500 unique targets
– There may be 10 to 100 variants at each target gene
– 1.4 million SNPs
– 10^200 potential small molecules

Page 55:

Concept of Drug and Vaccine

Concept of Drug
– Kill invading foreign pathogens
– Inhibit the growth of pathogens
Concept of Vaccine
– Generate memory cells
– Train the immune system to face various existing disease agents

Page 56:

VACCINES

A. SUCCESS STORY:
• COMPLETE ERADICATION OF SMALLPOX
• WHO PREDICTION: ERADICATION OF PARALYTIC POLIO THROUGHOUT THE WORLD BY YEAR 2003
• SIGNIFICANT REDUCTION OF INCIDENCE OF DISEASES: DIPHTHERIA, MEASLES, MUMPS, PERTUSSIS, RUBELLA, POLIOMYELITIS, TETANUS

B. NEED OF THE HOUR
1) SEARCH FOR EFFECTIVE VACCINES, NOT YET AVAILABLE, FOR DISEASES LIKE MALARIA, TUBERCULOSIS AND AIDS
2) IMPROVEMENT IN SAFETY AND EFFICACY OF PRESENT VACCINES
3) LOW COST
4) EFFICIENT DELIVERY TO NEEDY
5) REDUCTION OF ADVERSE SIDE EFFECTS

Page 57:

Computer Aided Vaccine Design

Whole Organism of Pathogen
– Consists of more than 4000 genes and proteins
– Genomes have millions of base pairs
Target antigen to recognise pathogen
– Search vaccine target (essential and non-self)
– Consists of amino acid sequence (e.g. A-V-L-G-Y-R-G-C-T ……)
Search antigenic region (peptide of length 9 amino acids)

Page 58:

Major steps of endogenous antigen processing

Page 59:

Computer Aided Vaccine Design

Problem of Pattern Recognition
– ATGGTRDAR Epitope
– LMRGTCAAY Non-epitope
– RTTGTRAWR Epitope
– EMGGTCAAY Non-epitope
– ATGGTRKAR Epitope
– GTCVGYATT Epitope
Commonly used techniques
– Statistical (Motif and Matrix)
– AI Techniques
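A matrix-based statistical method can be sketched with the six example peptides above. The log-odds weighting, the pseudocount of 1 and the function names are assumptions chosen for illustration. Six peptides are far too few for a usable predictor, so this only shows the mechanics.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_counts(peptides, length):
    """Count each residue at each position over a set of peptides."""
    counts = [{aa: 0 for aa in AMINO_ACIDS} for _ in range(length)]
    for peptide in peptides:
        for i, aa in enumerate(peptide):
            counts[i][aa] += 1
    return counts


def build_matrix(epitopes, non_epitopes, length=9, pseudo=1.0):
    """Position-specific log-odds matrix: epitope vs non-epitope
    residue frequencies, smoothed with a pseudocount."""
    pos = position_counts(epitopes, length)
    neg = position_counts(non_epitopes, length)
    n_pos = len(epitopes) + pseudo * len(AMINO_ACIDS)
    n_neg = len(non_epitopes) + pseudo * len(AMINO_ACIDS)
    matrix = []
    for i in range(length):
        row = {}
        for aa in AMINO_ACIDS:
            p = (pos[i][aa] + pseudo) / n_pos
            q = (neg[i][aa] + pseudo) / n_neg
            row[aa] = math.log(p / q)
        matrix.append(row)
    return matrix


def score(peptide, matrix):
    """Sum of position-specific weights; higher is more epitope-like."""
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))


# Training peptides and labels as listed on the slide.
epitopes = ["ATGGTRDAR", "RTTGTRAWR", "ATGGTRKAR", "GTCVGYATT"]
non_epitopes = ["LMRGTCAAY", "EMGGTCAAY"]

matrix = build_matrix(epitopes, non_epitopes)
for peptide in epitopes + non_epitopes:
    print(peptide, round(score(peptide, matrix), 2))
```

A motif method would instead require fixed residues at chosen anchor positions, while AI techniques such as neural networks learn the weights from the same kind of labelled peptides.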

Page 60:

Why computational tools are required for prediction.

200 aa proteins

Chopped to overlapping peptides of 9 amino acids

192 peptides

In vitro or in vivo experiments to detect which snippets of the protein will spark an immune response.

10-20 predicted peptides

Bioinformatics Tools
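The count of 192 follows from sliding a window of 9 residues along a 200-residue protein: 200 - 9 + 1 = 192. A minimal sketch, with a hypothetical function name and a placeholder sequence:

```python
def overlapping_peptides(sequence, length=9):
    """All overlapping peptides of the given length (step of 1)."""
    return [sequence[i:i + length]
            for i in range(len(sequence) - length + 1)]


# Placeholder 200-residue protein; the real sequence does not matter
# for the count.
protein = "A" * 200
print(len(overlapping_peptides(protein)))

# The example sequence from the earlier slide, extended by one
# hypothetical residue (K) so that it yields two 9-mers.
print(overlapping_peptides("AVLGYRGCTK"))
```

A prediction tool scores each of these peptides so that only the highest-ranked candidates go on to laboratory testing.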

Page 61:

Thanks