BCHM 6280 Tutorial: Gene specific information using...
BCHM 6280 2017 NCBI & Ensembl Tutorial Page 1 of 5
BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Web resources:
NCBI database: http://www.ncbi.nlm.nih.gov/
Ensembl database: http://useast.ensembl.org/index.html
UCSC Genome browser: http://genome.ucsc.edu/
Exercise 1 homepage: http://biochem.slu.edu/bchm628/exercise1.html

Goals: Learn how to efficiently navigate the NCBI, EBI-Ensembl, and UCSC Genome browsers to find information on specific genes.

NOTE: Refseq refers to records that have been reviewed by the NCBI curation staff. The Refseq database is a precursor to the Gene database and is available as a Limits option in the protein and nucleotide databases. Curated Refseq records have the nomenclature: NM_#### for mRNA and NP_#### for protein records. Other designations are described in the PDF file RefseqNomenclature.pdf available from the Exercise 1 homepage.

Conduct text based searches of NCBI and Ensembl
a) Search the NCBI Gene database using the query term: “p53 AND human”.
The AND tells it to search for both p53 and human in every field.
b) Change the search query to: “p53 AND human[Organism]” or use the Advanced option to create the same query.
This tells the search algorithm that you are searching specifically for the species human in the Organism field of the database.
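The two queries in a) and b) can also be written as URLs for NCBI's E-utilities ESearch service, which makes the difference between them easy to see. This is only a sketch: the tutorial itself uses the web search box, the helper name gene_search_url is made up for this example, and the code builds the URLs without sending any request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public NCBI E-utilities search endpoint (not part of the tutorial itself).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(term, retmax=20):
    """Build an ESearch URL that runs `term` against the NCBI Gene database.

    `gene_search_url` is a hypothetical helper name used only in this sketch.
    """
    params = {"db": "gene", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


# Query a): both words are searched in every field.
url_all_fields = gene_search_url("p53 AND human")

# Query b): "human" is restricted to the Organism field.
url_organism = gene_search_url("p53 AND human[Organism]")

print(url_all_fields)
print(url_organism)
```

Pasting either printed URL into a browser should return the matching Gene IDs; the only difference between the two is the [Organism] field tag on the second term.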
c) Search the Ensembl database for the human gene encoding p53. Change the drop down menu to human, type “p53” in the search box and click GO. The first thing you should note is that there are many matches to the query “p53.” There are several reasons for this:
1. You are searching every field and not just the gene name.
2. You are not using the official HGNC (HUGO Gene Nomenclature Committee) gene name and there are several different aliases for this gene.
3. The p53 protein interacts with >100 other proteins, so there is a lot of literature that mentions this protein and thus the name will appear in the records of many other genes.
So how do you get around this? You can try searching for different aliases. You can look through the first few records and see if you can determine what the official gene symbol is. You can search the literature for other aliases. In this case, from your search of the NCBI Gene database in either a) or b), the top hit is the gene with the symbol TP53, which is the correct symbol. Read through the summary and you’ll note that the official gene name is Tumor Protein p53 and that it takes part in numerous cellular processes involved in gene regulation. You should also note that p53 is one of the listed aliases.
Search the Ensembl human genome with the query “p53”. How many results? Now, restrict the results to Genes and this should reduce the list to ~443 records. However, I did not find TP53 within the first few pages. Change the search to “TP53” restricted to human and Genes and it should come up as the top record.

Central to this course is dealing with lists of genes. For this reason, we will use the official gene symbols and specific database IDs. If you have to find the official gene symbol for more than about 10 genes, you will quickly see the value of using gene identifiers that are universally recognized. You will also learn to value literature that references genes by their official symbols. Unfortunately, this is not a universal practice.

Finding transcript information about a specific gene using NCBI & Ensembl
Human genes are complex and often have several transcript isoforms. The curation of gene models to identify all possible and expressed transcripts uses several experimental techniques, including tissue-specific RNAseq, which provides direct support for expression of exons.
The curation of genes at NCBI uses a single pipeline and collects the curated genomic, transcript and protein sequences into the RefSeq database. The nomenclature identifies those sequences that are considered Reference: NG_ (genomic), NM_ (mRNA) and NP_ (protein). There is a PDF on the Exercise 1 homepage that describes all of the Refseq nomenclature. Note that some records are listed as XM or XP, which indicates predicted transcripts or proteins with less or no experimental evidence for them.
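When working with a list of accessions, the prefix rules above can be applied mechanically. The following is a minimal sketch covering only the prefixes named in this tutorial (XR appears in the NCBI gene record section); the function name classify_refseq is made up for this example, and RefseqNomenclature.pdf remains the full reference for other prefixes.

```python
# Prefixes named in this tutorial, mapped to (molecule type, curation status).
# Other RefSeq prefixes exist; see RefseqNomenclature.pdf for the full list.
REFSEQ_PREFIXES = {
    "NG": ("genomic", "curated"),
    "NM": ("mRNA", "curated"),
    "NP": ("protein", "curated"),
    "XM": ("mRNA", "predicted"),
    "XP": ("protein", "predicted"),
    "XR": ("non-coding RNA", "predicted"),
}


def classify_refseq(accession):
    """Return (molecule type, status) for a RefSeq accession such as NM_000546.

    Hypothetical helper for illustration; unknown prefixes give ("unknown", "unknown").
    """
    prefix, separator, _ = accession.strip().partition("_")
    if not separator:
        return ("unknown", "unknown")
    return REFSEQ_PREFIXES.get(prefix.upper(), ("unknown", "unknown"))


# NM_000546 and NP_000537 are the curated TP53 mRNA and protein records.
# XM_000000 is a placeholder accession used only to show the predicted case.
for acc in ["NM_000546", "NP_000537", "XM_000000", "ENST_not_refseq"]:
    print(acc, classify_refseq(acc))
```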
Ensembl has two gene curation pipelines (VEGA & HAVANA), and when the two pipelines are combined, the annotation is known as GENCODE. On the Gene specific pages, the transcripts are identified by whether they are protein coding or not. There is also a visual for splice variants that matches the known domains in the gene with the different transcripts. Ensembl also makes it easy to export an Excel-compatible transcript table and usually identifies which of its transcripts have a corresponding Refseq transcript match.
a) Within the NCBI gene record for the TP53 gene there are 2 sections that provide transcript/protein information: “Genomic regions, transcripts and products” and “NCBI Reference Set”.
Export a PDF from the Genomic regions section. Here, genes are color coded (green for protein coding, blue for non-coding). It also lists gene models (XR or XM). Refseq transcripts/proteins starting with X represent computational models without experimental verification. An example is provided on the Exercise 1 homepage.
b) Within the Ensembl gene record for TP53, find the transcript table. Here you can export the entire table in CSV format and then import it into Excel. An example is provided on the Exercise 1 homepage.
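Once the transcript table is exported as CSV, it can also be summarized with a few lines of Python instead of Excel. The sketch below is illustrative only: the column headers ("Transcript ID", "Biotype", "RefSeq Match") and the three placeholder rows are assumptions, so check them against the header row of your own export before reusing the code.

```python
import csv
import io
from collections import Counter

# Placeholder rows standing in for an exported transcript table.
# The headers and IDs are assumptions for illustration, not real Ensembl data.
SAMPLE_CSV = """Transcript ID,Name,Biotype,RefSeq Match
ENST_EXAMPLE_1,TP53-example-1,Protein coding,NM_EXAMPLE_1
ENST_EXAMPLE_2,TP53-example-2,Protein coding,-
ENST_EXAMPLE_3,TP53-example-3,Retained intron,-
"""


def summarize_transcripts(handle):
    """Count transcripts per biotype and list those with a RefSeq match.

    `handle` is any open text file (or file-like object) holding the CSV export.
    """
    rows = list(csv.DictReader(handle))
    biotypes = Counter(row["Biotype"] for row in rows)
    with_refseq = [
        row["Transcript ID"]
        for row in rows
        if row["RefSeq Match"].strip() not in ("", "-")
    ]
    return biotypes, with_refseq


biotypes, with_refseq = summarize_transcripts(io.StringIO(SAMPLE_CSV))
print(dict(biotypes))
print(with_refseq)
```

For a real export, replace io.StringIO(SAMPLE_CSV) with open("your_export.csv", newline="") and adjust the column names as needed.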
NOTE: The Ensembl site generally makes it easier to deal with lists of genes (both importing and exporting). The NCBI site has better cross-database functionality and is better integrated with the literature.
You should note several things about these transcript searches:
1. TP53 has a large number of transcript isoforms. Not all human genes have this many, but if you want to conduct a whole genome expression experiment, one consideration is whether to analyze the data on a gene (~25,000) or transcript (~160,000) level.
2. The transcript variants differ between Ensembl and NCBI, though Ensembl kindly lists those that are in common between the two sites.
3. Ensembl makes it easy to distinguish between transcripts that are protein coding or not, and also between transcripts with good experimental evidence versus computationally predicted transcripts.

Exploring the genomic context of genes using Ensembl and UCSC Genome browser
The genomic context means where on the genome the gene is located. That is:
• Which chromosome
• Where on that chromosome
• What strand
• What genes are upstream/downstream
Genome browsers offer a way to visualize data that can be placed on a chromosome. These data are included as additional tracks of information (from a few to hundreds depending on the genome) and include such data as:
• Location of repetitive sequences
• Level of homology to other genomes
• SNPs or variants within the genome of interest
• TF binding sites
The data behind a genome browser is enormous and can be quite complex to sort through. This amount of data can also be slow to load. Spend some time turning tracks on and off and following links or pop-ups that explain the different data sources. We will use both the UCSC and Ensembl genome browsers for this exercise. Both allow you to export images of the browser window and offer links to download sequence data.

Ensembl genome browser
To access the Ensembl genome browser, click on the Location tab (which should have a title: Location: 17:7,661,779-7,687,550). This indicates that this gene is located on Chromosome 17 between the coordinates 7,661,779-7,687,550. The first section shows a schematic of the chromosome with a red box around the coordinates of the gene (Fig. 1). If you click on the Assembly Exceptions link, you can turn off that track and are left with just the box highlighting the gene.
Figure 1: Chromosome ideogram of chr17 with the region for TP53 shown as a red box
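The Location title packs the chromosome and both coordinates into one string, which is handy to split apart when comparing browsers or building a gene list. Below is a minimal sketch; the function name parse_location is made up for this example, and the length calculation assumes the coordinates are 1-based and inclusive.

```python
import re

# chromosome (with or without a "chr" prefix), then start-end with optional commas
LOCATION_RE = re.compile(r"^(?:chr)?(\w+):([\d,]+)-([\d,]+)$")


def parse_location(text):
    """Split a location like '17:7,661,779-7,687,550' into (chrom, start, end)."""
    match = LOCATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end location: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return chrom, start, end


chrom, start, end = parse_location("17:7,661,779-7,687,550")
# Assumption: both coordinates are 1-based and included in the region.
length = end - start + 1
print(chrom, start, end, length)
```

For the TP53 location shown in the tab title this gives a region of 25,772 bases, in line with the roughly 25 Kb region the browser displays for the gene.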
Scroll down to the next section and you’ll see the chromosome region in more detail, with the TP53 gene in the middle. This gives you an idea of the genomic context of the gene of interest.
Scroll down to the next section and this will display the 25 Kb region that encompasses the largest transcript isoform of the gene. You can see all the different splice variants. They are color coded by experimental support and whether they are protein coding or not. Click on one of the transcripts and it will open a pop-up window with additional details about that transcript. You can right-click on the links within the pop-up window to open up the link in a new tab or window. Click on the X to close the window.
Scroll down further and you will see additional tracks of information, such as SNP locations, associated phenotypes and %GC. These tracks can be expanded and turned on and off. It can take a while for the changes to be implemented, depending on how long a chromosomal region you are working with and how much data is in the track. If you scroll back to the top of this section, you can zoom in or out. Sometimes tracks won’t expand because the section you are viewing is large enough that there would be too much information to display. If you tried expanding a track and nothing happened, try zooming in so that you are displaying <10 Kb of sequence. That will usually allow any track to be expanded. Figure 2 shows a portion of the TP53 transcripts with an expanded track of SNPs.
Figure 2: Part of the TP53 transcript variants with expanded SNPs below.

Using the UCSC Genome browser
Below the headers is a dark blue bar with the link Genomes. Mouse over it and select human genome GRCh38/hg38. Or click the link and it will open a search window for the latest Human assembly as a default option. Type TP53 into the search text box and it will list many possible matches. Select the second one, which corresponds to tumor protein p53 (from HGNC TP53). This should open a window that looks something like Fig. 3.
The gene size and the coordinates of where this gene falls on Chr 17 should be very similar, if not identical, to the coordinates listed for the Ensembl browser. Scroll down through the graphics. Clicking on the graphic or on the name of a track will pop open a window with information about the track. Click on any single transcript to see details about the transcript. A FEW of the questions you can ask with a genome browser include (depending on the genome and available track information):
1) What genes are located near it or may share promoters?
2) What SNPs are found in my gene and are they located in introns, promoters or exons?
3) What strand is my gene encoded on?
4) What regulatory elements are located within or near my gene?
5) What clinical variants are associated with my gene?
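Question 2 comes down to comparing a variant's position with the exon coordinates of a transcript, which the browser does visually. As a rough sketch of the same logic, the code below uses a toy gene with made-up coordinates (not TP53 values); the function name classify_position is hypothetical, and promoters are left out because their boundaries depend on the annotation track used.

```python
def classify_position(pos, gene_start, gene_end, exons):
    """Return 'exon', 'intron' or 'outside gene' for a 1-based position.

    `exons` is a list of (start, end) pairs, both ends inclusive.
    """
    if not gene_start <= pos <= gene_end:
        return "outside gene"
    for exon_start, exon_end in exons:
        if exon_start <= pos <= exon_end:
            return "exon"
    # Inside the gene but not inside any exon.
    return "intron"


# Toy gene spanning positions 100-1000 with three exons.
# These numbers are invented for illustration only.
TOY_EXONS = [(100, 200), (400, 550), (900, 1000)]

for snp_position in (150, 300, 50):
    print(snp_position, classify_position(snp_position, 100, 1000, TOY_EXONS))
```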
Spend some time exploring the tracks and looking up what they represent and how the data is presented. You may find some of the information pertinent to your research project.
Figure 3: UCSC view of TP53