Scoring Matrices - Weizmann Institute of...

Scoring Matrices Shifra Ben-Dor Irit Orr Bioinformatics Lecture 4 2019

Transcript of Scoring Matrices - Weizmann Institute of...

Page 1: Scoring Matrices - Weizmann Institute of Sciencedors.weizmann.ac.il/.../Lect4_scoringmatrices.pdf · 2019. 4. 15. · Scoring matrices Sequence alignment and database searching programs

Scoring Matrices

Shifra Ben-Dor, Irit Orr

Bioinformatics Lecture 4 2019

Page 2:

Scoring matrices

❆ Sequence alignment and database searching programs compare sequences to each other as a series of characters.

❆ All algorithms (programs) for comparison rely on some scoring scheme for that.

❆ Scoring matrices are used to assign a score to each comparison of a pair of characters.

Page 3:

Scoring matrices

❆ The scores in the matrix are integer values.

❆ In most cases a positive score is given to identical or similar character pairs, and a negative or zero score to dissimilar character pairs.

Page 4:

Different types of matrices

❆ Identity scoring - the simplest scoring scheme, where characters are classified as identical (scores 1) or non-identical (scores 0). This scoring scheme is not much used.

❆ DNA scoring - considers changes as transitions (purine to purine, or pyrimidine to pyrimidine) or transversions (purine to pyrimidine, or the reverse). This matrix scores identical bases 3, transitions 2, and transversions 0.
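As a minimal sketch of the two simple schemes above, the matrices can be generated from a pairwise rule. Python is used here only for illustration, and the names `identity_score`, `dna_score` and `dna_matrix` are hypothetical, not from the lecture. The score values (1/0 and 3/2/0) are the ones quoted on the slide.

```python
from itertools import product

BASES = "ACGT"
PURINES = {"A", "G"}  # C and T are the pyrimidines


def identity_score(a, b):
    """Identity scoring: 1 for identical characters, 0 otherwise."""
    return 1 if a == b else 0


def dna_score(a, b):
    """DNA scoring from the slide: identical 3, transition 2, transversion 0."""
    if a == b:
        return 3
    # A transition stays within the same chemical class
    # (purine <-> purine or pyrimidine <-> pyrimidine).
    if (a in PURINES) == (b in PURINES):
        return 2
    # Anything else crosses classes: a transversion.
    return 0


# Full 4 x 4 matrix stored as a dictionary keyed by (row base, column base).
dna_matrix = {(a, b): dna_score(a, b) for a, b in product(BASES, repeat=2)}

# Print the matrix in the usual row/column layout.
print("  " + " ".join(BASES))
for a in BASES:
    print(a + " " + " ".join(str(dna_matrix[(a, b)]) for b in BASES))
```

Storing the matrix as a dictionary keyed by character pairs keeps the lookup symmetric and makes it easy to swap in a different scoring rule.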

Page 5:

Different types of matrices

❆ Chemical similarity scoring (for proteins) - this matrix gives greater weight to amino acids with similar chemical properties (e.g. size, shape or charge of the amino acid).

❆ Observed matrices for proteins - the type most commonly used by alignment programs. These matrices are constructed by analyzing the substitution frequencies seen in the alignments of known families of proteins.

Page 6:

Observed Scoring Matrices

• Every possible identity and substitution is assigned a score.

• This score is based on the observed frequencies of such occurrences in alignments of evolutionarily related proteins.

• This score will also reflect the frequency with which a particular amino acid occurs in nature, as some amino acids are more abundant than others.

Page 7:

Observed Scoring Matrices

• Identities are assigned the most positive scores.

• Frequently observed substitutions also receive positive scores.

• Mismatches, or matches that are unlikely to have been a result of evolution, are given negative scores.

Page 8:

• Each matrix entry gives the ratio of the observed frequency of substitution between each possible pair of amino acids in related proteins to that expected by chance, given the frequencies of amino acids in proteins.

• These ratios are called odds scores.

• These ratios are transformed to logarithms of odds scores, called log odds scores.

• Odds scores and log odds scores are used to score protein alignments.
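The odds and log odds calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used to build any published matrix. The function name `log_odds_score` and the input frequencies are made up, the chance expectation is taken simply as the product of the two background frequencies (treating the pair as ordered), and the result is scaled to half-bit units and rounded to an integer.

```python
import math


def log_odds_score(observed_freq, freq_a, freq_b, units_per_bit=2):
    """Integer log odds score for one amino acid pair.

    observed_freq : how often the pair is seen aligned in related proteins
    freq_a, freq_b: background frequencies of the two amino acids
    units_per_bit : 2 gives half-bit units (an assumption of this sketch)
    """
    expected_freq = freq_a * freq_b       # frequency expected by chance
    odds = observed_freq / expected_freq  # the "odds score"
    return round(units_per_bit * math.log2(odds))


# Hypothetical numbers, chosen only to show the sign of the result.
print(log_odds_score(0.04, 0.1, 0.1))    # seen more often than chance: positive
print(log_odds_score(0.01, 0.1, 0.1))    # seen as often as chance: zero
print(log_odds_score(0.0025, 0.1, 0.1))  # seen less often than chance: negative
```

A pair that occurs more often than chance gets a positive score, and a pair that occurs less often than chance gets a negative one, which matches the behaviour described for observed matrices.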

Page 9:

Different types of matrices

• Observed Scoring Matrices are superior to simple identity scores, or scores based solely on chemical properties of amino acids.

• The most frequently used observed log odds matrices are the PAM and BLOSUM matrices.

Page 10:

PAM Matrices

❆ Developed by Margaret Dayhoff and co-workers.

❆ Derived from global alignments of very similar sequences (at least 85% identity), so that there would be little likelihood of an observed change being the result of several successive mutations; each observed change should reflect one mutation only.

❆ PAM - Point Accepted Mutations.

Page 11:

❆ An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection. It is the result of two distinct processes:

❆ the first is the occurrence of a mutation in the portion of the gene template producing one amino acid of a protein

❆ the second is the acceptance of the mutation by the species as the new predominant form. To be accepted, the new amino acid usually must function in a way similar to the old one: chemical and physical similarities are found between the amino acids that are observed to interchange frequently.

Page 12:

❆ Dayhoff estimated mutation rates from substitutions observed in closely related proteins and extrapolated those rates to model distant relationships.

❆ The PAM1 matrix gives the probability that a given amino acid will be replaced by any other particular amino acid after a given evolutionary interval, in this case 1 accepted point mutation per 100 amino acids.

Page 13:

❆ When used for protein comparison, the mutation probability (odds) matrix is normalized and the logarithm is taken. (This lets us add the scores along a protein instead of multiplying the probabilities.)

❆ The resulting matrix is the "log-odds" matrix, known as the PAM matrix.
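The reason for taking the logarithm can be checked numerically: adding log odds along an alignment gives the same result as multiplying the odds and taking the logarithm once. The short sketch below uses made-up odds values for three aligned positions, and the function names are hypothetical.

```python
import math


def log_of_product(odds_per_position):
    """Multiply the odds along the alignment, then take one logarithm."""
    return math.log2(math.prod(odds_per_position))


def sum_of_logs(odds_per_position):
    """Take the logarithm at each position, then add the scores."""
    return sum(math.log2(odds) for odds in odds_per_position)


# Hypothetical odds for three aligned positions (illustrative values only).
example_odds = [4.0, 0.5, 2.0]

print(log_of_product(example_odds))
print(sum_of_logs(example_odds))
```

Both functions give the same value, so an alignment can be scored by a simple running sum of matrix entries.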

Page 14:

PAM# = Point Accepted Mutations / 100 amino acids

❆ The number with the matrix (PAM120, PAM90) refers to the evolutionary distance. Greater numbers are greater distances.

❆ To derive PAM250 you multiply the PAM1 mutation probability matrix by itself 250 times.

❆ PAM250 is the matrix for sequences separated by 250 PAMs.
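The extrapolation from PAM1 to PAM250 is repeated matrix multiplication of the mutation probability matrix. The sketch below shows the mechanics on a toy two-letter alphabet, not on the real 20 x 20 Dayhoff data. The matrix `TOY_PAM1` and the helper names are assumptions made for illustration. Each letter changes with probability 0.01 per step, mirroring 1 accepted point mutation per 100 residues.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, steps):
    """Multiply matrix m by itself `steps` times, starting from the identity."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": row = original letter, column = replacement letter.
TOY_PAM1 = [
    [0.99, 0.01],
    [0.01, 0.99],
]

toy_pam250 = mat_power(TOY_PAM1, 250)

for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Each row of the result still sums to 1, because it remains a probability distribution over replacements. The log odds scores would then be taken from this extrapolated probability matrix, not by multiplying score matrices.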

Page 15:

PAM250

❆ At this evolutionary distance, only one amino acid in five remains unchanged.

❆ However, the amino acids vary greatly in their mutability; 55% of the tryptophans, 52% of the cysteines and 27% of the glycines would still be unchanged, but only 6% of the highly mutable asparagines would remain. Several other amino acids, particularly alanine, aspartic acid, glutamic acid, glycine, lysine, and serine are more likely to occur in place of an original asparagine than asparagine itself at this evolutionary distance!

Page 16:

# This matrix was produced by "pam" Version 1.0.6 [28-Jul-93]
#
# PAM 250 substitution matrix, scale = ln(2)/3 = 0.231049
#
# Expected score = -0.844, Entropy = 0.354 bits
#
# Lowest score = -8, Highest score = 17
#
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  2 -2  0  0 -2  0  0  1 -1 -1 -2 -1 -1 -3  1  1  1 -6 -3  0  0  0  0 -8
R -2  6  0 -1 -4  1 -1 -3  2 -2 -3  3  0 -4  0  0 -1  2 -4 -2 -1  0 -1 -8
N  0  0  2  2 -4  1  1  0  2 -2 -3  1 -2 -3  0  1  0 -4 -2 -2  2  1  0 -8
D  0 -1  2  4 -5  2  3  1  1 -2 -4  0 -3 -6 -1  0  0 -7 -4 -2  3  3 -1 -8
C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3  0 -2 -8  0 -2 -4 -5 -3 -8
Q  0  1  1  2 -5  4  2 -1  3 -2 -2  1 -1 -5  0 -1 -1 -5 -4 -2  1  3 -1 -8
E  0 -1  1  3 -5  2  4  0  1 -2 -3  0 -2 -5 -1  0  0 -7 -4 -2  3  3 -1 -8
G  1 -3  0  1 -3 -1  0  5 -2 -3 -4 -2 -3 -5  0  1  0 -7 -5 -1  0  0 -1 -8
H -1  2  2  1 -3  3  1 -2  6 -2 -2  0 -2 -2  0 -1 -1 -3  0 -2  1  2 -1 -8
I -1 -2 -2 -2 -2 -2 -2 -3 -2  5  2 -2  2  1 -2 -1  0 -5 -1  4 -2 -2 -1 -8
L -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6 -3  4  2 -3 -3 -2 -2 -1  2 -3 -3 -1 -8
K -1  3  1  0 -5  1  0 -2  0 -2 -3  5  0 -5 -1  0  0 -3 -4 -2  1  0 -1 -8
M -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6  0 -2 -2 -1 -4 -2  2 -2 -2 -1 -8
F -3 -4 -3 -6 -4 -5 -5 -5 -2  1  2 -5  0  9 -5 -3 -3  0  7 -1 -4 -5 -2 -8
P  1  0  0 -1 -3  0 -1  0  0 -2 -3 -1 -2 -5  6  1  0 -6 -5 -1 -1  0 -1 -8
S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  2  1 -2 -3 -1  0  0  0 -8
T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -3  0  1  3 -5 -3  0  0 -1  0 -8
W -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17  0 -6 -5 -3 -6 -1
Y -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10 -2 -3 -1 -4 -1
V  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4 -2  2 -2 -1

PAM250 MATRIX

Page 17:

PET91 - an updated Dayhoff matrix

❆ Since the family of PAM matrices was derived from a comparatively small number of families, many of the possible mutations were not observed.

❆ Jones et al. derived an updated matrix by examining a very large number of families, and created the PET91 scoring matrix.

Page 18:

Gonnet Matrices

❆ Another improvement on the PAM matrices

❆ Done when the database was larger

❆ All-against-all pairwise comparison to see all possible substitutions

❆ Optimized to find sequences that are further apart (as opposed to PAM, which started with very similar sequences)

Page 19:

BLOSUM Matrices

❆ Created by Henikoff & Henikoff, based on local multiple alignments of more distantly related sequences.

❆ First, multiple alignments of short regions (without gaps) of related sequences were gathered.

❆ In each alignment the sequences similar at some threshold value of percent identity were clustered into groups and averaged.

Page 20:

BLOSUM Matrices

❆ Substitution frequencies for all pairs of amino acids were calculated between the groups; these were used to create the log-odds BLOSUM (Block Substitution Matrix).

Page 21:

BLOSUM# - where # is the threshold identity percentage of the sequences clustered in those blocks.

❆ Thus, BLOSUM62 means that the sequences clustered in this block are at least 62% identical.

❆ This allows detection of more distantly related sequences, as it downplays the role of the more related sequences in the block when building the matrix.
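The clustering step behind the BLOSUM number can be illustrated with a small sketch. It assumes one plausible reading of the description: within an ungapped block, a sequence joins a cluster if it is at least the threshold percent identical to any member already in it. The sequences and function names are invented for the example, and the averaging and frequency counting that follow are not shown.

```python
def percent_identity(a, b):
    """Percent of positions that are identical in two ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("block sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_block(sequences, threshold):
    """Group sequences linked by >= threshold percent identity."""
    clusters = []
    for seq in sequences:
        linked, unlinked = [], []
        for cluster in clusters:
            # A cluster is linked if any of its members is similar enough.
            is_linked = any(percent_identity(seq, member) >= threshold
                            for member in cluster)
            (linked if is_linked else unlinked).append(cluster)
        # Merge the new sequence with every cluster it is linked to.
        merged = [seq] + [member for cluster in linked for member in cluster]
        clusters = unlinked + [merged]
    return clusters


# Hypothetical ungapped block of three sequences.
block = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
clusters_62 = cluster_block(block, threshold=62)
print(clusters_62)
```

Here the first two sequences (90% identical) fall into one cluster and therefore count as a single averaged sequence, which is how closely related sequences are kept from dominating the substitution counts.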

Page 22:

# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4

BLOSUM62 MATRIX
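A matrix in this text layout can be read into a lookup table and used to score an ungapped alignment by summing one entry per column. The sketch below is a minimal illustration that uses only a four-letter excerpt (A, R, N, D) of the BLOSUM62 values listed above. The function names `parse_matrix` and `score_ungapped` and the two short sequences are hypothetical.

```python
EXCERPT = """\
# Excerpt (A, R, N, D only) of the BLOSUM62 matrix listed above
   A  R  N  D
A  4 -1 -2 -2
R -1  5  0 -2
N -2  0  6  1
D -2 -2  1  6
"""


def parse_matrix(text):
    """Parse a whitespace-separated matrix; lines starting with '#' are comments."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]
    columns = rows[0]  # first non-comment line holds the column labels
    scores = {}
    for row in rows[1:]:
        row_label, values = row[0], row[1:]
        for col_label, value in zip(columns, values):
            scores[(row_label, col_label)] = int(value)
    return scores


def score_ungapped(seq1, seq2, scores):
    """Sum the matrix score of each aligned pair (no gaps allowed)."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(scores[(a, b)] for a, b in zip(seq1, seq2))


blosum62_excerpt = parse_matrix(EXCERPT)
total = score_ungapped("ARND", "ARDN", blosum62_excerpt)
print(total)  # 4 + 5 + 1 + 1 = 11
```

The two identities (A/A and R/R) contribute most of the score, while the N/D and D/N substitutions still add a small positive amount because they are frequently observed.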

Page 23:

So which observed matrix to use???

❆ For most programs the default is BLOSUM62, and for most searches it works very well.

❆ If you don’t get results, then try the following rules of thumb:

Page 24:

So which observed matrix to use???

❆ For global alignments use PAM matrices.

❆ Lower PAM matrices tend to find short alignments of highly similar regions.

❆ Higher PAM matrices will find weaker, longer alignments.

❆ For local alignments use BLOSUM matrices.

❆ BLOSUM matrices with a HIGH number are better for similar sequences.

❆ BLOSUM matrices with a LOW number are better for distant sequences.

Page 25:

Tips...

❆ When doing global alignment (and database scanning) of related (similar) sequences use PAM200 or PAM250.

❆ If you don’t know what to expect (e.g. for database scanning) use PAM120.

❆ For local database scanning (e.g. BLAST), or for ungapped, local alignments, use BLOSUM62 (recommended for proteins).

Page 26:

Tips...

❆ In all cases it is recommended to try more than one matrix for any database scanning when the defaults don’t work...