
Woods Hole Zebrafish Genetics and Development

Bioinformatics/Genomics Lab

Ian Woods

Note: This document ("wh_informatics_practical.doc") and supporting materials can be downloaded from my website:

http://faculty.ithaca.edu/iwoods/docs/wh/

Setting the stage: These tasks each pertain to the mutation that we (virtually) mapped in lab. The curved body axis and U-shaped somites observed in these mutants are hallmarks of disrupted slow muscle development, and similar phenotypes are observed in mutants with defects in Hedgehog signaling.

General descriptions of the tasks are provided below. Specific protocols can be found following this introductory section. Each of you should choose (at least) one task to accomplish, and collaboration is highly encouraged.

Task 0: High resolution mapping, sequencing, and expression

Overview: From a rough map position, refine the critical interval via (virtual) high resolution mapping with additional markers. Query the critical interval in the zebrafish genome for potential candidate genes. Locate expression patterns online for these candidates. Design primers to sequence candidate genes for the mutagenic lesion or for additional SNPs to use in mapping.

Task 1: Create a transgenic reporter line by cloning candidate enhancer/promoter sequences

Overview: Identify the translational start site of a gene of interest. Obtain ~6 kb of sequence upstream of this site. Design PCR primers that will amplify this region, and clone it in-frame with GFP in a tol2 expression vector. Identify BACs for use in creating reporter constructs via homologous recombination. Identify evolutionarily conserved sequences from other organisms to uncover potential regulatory regions around your gene of interest.

Task 2: Expression, Rescue, and Phenocopying

Overview: Identify the zebrafish ortholog of your favorite gene. Find its location in the genome, locate the translational start site (ATG), and identify the exon-intron boundaries. Design two 25-mer morpholino sequences that target (1) the ATG and (2) an exon-intron boundary. Identify an orthologous gene in another fish species for use in rescue experiments to control for morpholino specificity. Align this sequence with your morpholinos to determine degree of potential activity. Obtain a full-length clone of the zebrafish gene (via RTPCR or clone collections) for use in overexpression experiments or expression analyses via in situ hybridization. Identify potential CRISPR targets within your gene.

Task 3: Batch BLAST and parsing with Python to identify zebrafish transcripts related to a specific signaling pathway

Overview: Mine OMIM (Online Mendelian Inheritance in Man) for genes related to Hedgehog signaling. Obtain amino acid sequences for these genes, and identify putative zebrafish orthologs for these proteins via BLAST. Use a simple script to parse the BLAST results to see where the genes are located in the zebrafish genome. Finally, find out where a few of these genes are expressed (via ZFIN).

Requirements: Terminal, python (both native on MacOSX)

Task 4: Visualization of enriched motifs in putative promoter / enhancer regions.

Overview: From a file of unidentified sequences derived from a transcriptome profiling experiment, identify the best matching Ensembl transcript via local BLAST, batch download potential promoter sequences for each of these transcripts, search through these promoter sequences for enriched motifs, and visualize the location of the motifs on the promoters (just a bit advanced)

Requirements: Terminal, Python, Matplotlib (all native on MacOSX)

Protocols:

Task 0: High resolution mapping, sequencing, and expression

1. The mutation we mapped in lab is flanked by SSLP/Z-markers Z11119 and Z15270. Your first job will be to view the region of the genome that is flanked by these two markers. Within this region, you can identify candidate genes and find additional markers that can be used to refine your map position, and thereby narrow the critical interval in the genome in which to look for the gene that is disrupted by the mutation.

Start at zfin.

http://www.zfin.org

Enter Z15270 in the box at the top. On the page that follows, hit the link, and then hit the link to GenBank. On the GenBank page, click FASTA and copy the sequence onto the clipboard of the computer. To find the location of this marker in the genome, we'll go to the zebrafish genome browser hosted by EMBL:

http://ensembl.org

Follow the link for BLAT, paste the sequence of this marker into the window, select Danio_rerio from the species menu, and click RUN. On the following page, click the link for the best matching chromosome region. This takes you to a view of the genome, centered on this map marker. Note the physical location of this marker (the numbers in the genome window). Zoom out a bit to get a sense of the genomic region.

Repeat the above steps for Z11119. How many hits in the genome do you obtain for Z11119? What does this mean? Choose the best alignment. Where in the genome is your mutation likely located (answer in terms of numbers)?

2. Now let's look at a candidate gene near one of your map markers. Find primer sequences for one of these genes (calca). From your browser window for Z15270 in #1, locate calca and click on it. This takes you to the Ensembl page for this gene. Click on the ZFIN link, which takes you to the ZFIN page for this zebrafish sequence. Scroll down to the RefSeq link under Sequence Information and follow it. Locate the FASTA link and click it, which takes you to a page where the sequence is located. Copy this sequence to the clipboard on your computer.

Now go to one of many websites for primer design:

http://bioinfo.ut.ee/primer3-0.4.0/

Paste in your sequence and select a product size of 500-550 bp (the comfortable limit for sequencing PCR products). Hit Pick Primers and retrieve your primer sequences. The next step would be to amplify gDNA from wildtype and mutant embryos via PCR, sequence the PCR products, identify sequence differences, and use this information to test for linkage between this gene and your mutation. We'll go over how to do that in more detail below.

3. You collect hundreds of mutants for use in a high-resolution mapping panel, and test them for linkage to numerous markers from your region. You find that the SSLP Z15270 is the marker that is most tightly linked to your mutation, but some recombinants remain. Query the zebrafish genome assembly to see a model of your region of interest (the assembly is pretty good on a large scale, but can be misleading in a local region). Go to the Ensembl website.

http://www.ensembl.org

Find the genomic location of Z15270 as above. Click on the Configure this page link on the left hand side of the page. Here you'll find all sorts of tracks you can turn on and off to show different kinds of information. Try turning some additional features on. Save and close the configuration window by hitting the checkmark in the upper right, and zoom out in the browser as far as is allowed.

4. Exploring the genomic region: what do these genes do? Click on some of the genes found in the region, taking you to the gene record page. Find and click the orthologues link on the left hand side of the page for each gene. What kind of gene is PDE3B?

5. Go back to the genomic view. Can you get a link to ZFIN for any of these genes? Click on rras2, and follow through to ZFIN. Follow the link for Expression Data. Your mutant has defects in muscle specification. Is the expression pattern of rras2 consistent with a role in muscle?

6. You decide to sequence rras2 in wildtype and mutant embryos to see if (1) you can find a SNP to map, which could rule this gene out via recombination, and (2) you can find a change in the mutant sequence that might cause a loss-of-function phenotype. Design primers that will amplify a 600 bp PCR product that contains the first exon of rras2.

Find the rras2 entry in ZFIN (you are probably already there in step #5). Go to the ZFIN homepage:

http://zfin.org

Click on Genes/Markers/Clones and enter rras2. On the ZFIN gene page, scroll down and follow the link to the RefSeq RNA record. Scroll down and note the coordinates of the coding sequence (CDS) in the entry. Copy the coding sequence onto the clipboard.

Go to the UCSC genome browser (you can also do this on the Ensembl browser, but the UCSC interface is a bit friendlier for this task):

http://genome.ucsc.edu

Click on the BLAT tab, and paste in your sequence. Select Zebrafish from the Genome pulldown menu, and click Submit. Follow the link for details on the first BLAT hit. Scroll up and down to check your results. What do the different color-codings mean in your sequence?

Select about 600 bp of genomic sequence from which to design primers, then head to the primer3 website:

http://bioinfo.ut.ee/primer3-0.4.0/

Paste in your sequence, choose a size range of 500-600 bp (about the limit of a sequence trace from a PCR template), and click Pick Primers.

7. You PCR from genomic DNA of wildtype and mutant embryos, and sequence the PCR products. The sequencing results are as follows:

>wildtype_rras2_exon1

AGGCGGGAGTGTGAGCGCGCGCCCCCTCGCGCCCGCCGCGCGCACTGCCAGCACTGATTAGCCGTATCTTCCCCTCATCTTGCAGCACAGGCAGTCAGTCAGTGCCTGGTAGCGATTTGGACGAGGGCGTATGGACTTGAAGCAGCAGTGTATGCATTTCCCACAGACTGTGGTCGTACTTTTCTCCTGTCGGACGGATTACCACTGAGTTGACACATAGCCCAAAAGCCGCTTCGCATTTTTTCCGCTGCATTTCTCTAACTGAAGGCCTGTCACAGAGTAAAGTGGCTCGGTGTGCGTGTGTTTAGACAGCGGAGCGAGAGCAGCAGTGTGTCCCCGATGGCTGGCTGGAAGGACGGCTCAGTGCAGGAGAAATATCGCCTGGTGGTCGTCGGAGGTGGTGGCGTCGGAAAATCAGCGTTAACCATCCAGTTTATCCAGGTAAGCGGATACATGGCGGAATGTTATGTGGTTTTCGGCCCTTTAAAAAGATGTGAGGGTGTTGAGGAGAAATGCGTGGATCTTGCTCACAGAAATGGGGACCCCATGAGCGGAAAAGGGGGTTCAGGAATCCAAGCTAGGCCTGCGACACTTTAAACC

>mutant_rras2_exon1

AGGCGGGAGTGTGAGCGCGCGCCCCCTCGCGCCCGCCGCGCGCACTGCCAGCACTGATTAGCCGTATCTTCCCCTCATCTTGCAGCACAGGCAGTCAGTCAGTGCCTGGTAGCGATTTGGACGAGGGCGTATGGACTTGAAGCAGCAGTGTATGCATTTCCCACAGACTGTGGTCGTACTTTTCTCCTGTCGGACGGATTACCACTGAGTTGACACATAGCCCAAAAGCCGCTTCGCATTTTTTCCGCTGCATTTCTCTAACTGAAGGCCTGTCACAGAGTAAAGTGGCTCGGTGTGCGTGTGTTTAGACAGCGGAGCGAGAGCAGCAGTGTGTCCCCGATGGCTGGCTGGAAGGACGGCTCAGTGCAGGAGAAATATCGCCTGGTGGTCGTCGGAGGTGGTGGCGTCGGAAAATCAGCGTTAACCATCCAGTTTATCCAGGTAAGCGGATACATGGCGGAATGTTATGTGGTTTTCGGCCCTTTAAAAAGATGTGAGGGTGTTGAGGAGAAATGAGTGGATCTTGCTCACAGAAATGGGGACCCCATGAGCGGAAAAGGGGGTTCAGGAATCCAAGCTAGGCATGCGACACTTTAAACC

You wish to know if these sequences harbor any polymorphisms, and whether you can use these polymorphisms to facilitate your high resolution mapping. Align the two sequences via BLAST2:

http://blast.ncbi.nlm.nih.gov/Blast.cgi

Follow the link for nucleotide blast, and check the box for Align Two or More Sequences. Note the points at which the two sequences differ.
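For reads of equal length that differ only by substitutions, the differences can also be listed with a few lines of Python. This is a minimal sketch rather than a replacement for the BLAST alignment: the function name find_mismatches is made up for illustration, the two short excerpts are copied from the wildtype and mutant reads above, and the approach will not cope with insertions or deletions.

```python
def find_mismatches(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for each difference.

    Assumes both sequences have the same length and differ only by
    substitutions; indels need a real alignment (for example BLAST).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences differ in length; align them with BLAST instead")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
        if a != b
    ]


# Short excerpts from the wildtype and mutant reads (not the full sequences).
wildtype_excerpt = "GAGAAATGCGTGGATCTTGC"
mutant_excerpt = "GAGAAATGAGTGGATCTTGC"

for position, wt_base, mut_base in find_mismatches(wildtype_excerpt, mutant_excerpt):
    print(f"position {position}: {wt_base} -> {mut_base}")
```

To compare the full reads, paste each sequence into a string in place of the excerpts; the reported positions are then relative to the start of the read.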

Next, you'd like to see if the polymorphisms can be distinguished via restriction digest. Paste about 40 bp of wildtype and mutant sequence flanking the SNP into the dCAPS website, leaving the mismatches field blank.

http://helix.wustl.edu/dcaps/dcaps.html

Are there enzymes available that will cut wildtype but not mutant sequence (or vice versa)? If a SNP does not create a restriction site polymorphism, try entering 1 in the mismatch field. What does this accomplish?
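The idea behind this search can be sketched in Python: look for recognition sites that are present in one allele but not the other. The four enzymes below are a tiny illustrative subset chosen by me (dCAPS searches a far larger list), the function name is hypothetical, and the excerpts are copied from the end of the wildtype and mutant reads above. All four sites are palindromic, so scanning one strand is enough.

```python
# Recognition sequences for a few common enzymes (illustrative subset only).
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "StuI": "AGGCCT",
    "SphI": "GCATGC",
}


def differential_sites(wildtype, mutant, enzymes=ENZYMES):
    """Return {enzyme: (cuts_wildtype, cuts_mutant)} for enzymes that
    distinguish the two alleles (site present in exactly one of them)."""
    wildtype, mutant = wildtype.upper(), mutant.upper()
    result = {}
    for name, site in enzymes.items():
        in_wt, in_mut = site in wildtype, site in mutant
        if in_wt != in_mut:
            result[name] = (in_wt, in_mut)
    return result


# Excerpts flanking the second difference between the two reads.
wildtype_excerpt = "CCAAGCTAGGCCTGCGACAC"
mutant_excerpt = "CCAAGCTAGGCATGCGACAC"

for enzyme, (cuts_wt, cuts_mut) in differential_sites(wildtype_excerpt, mutant_excerpt).items():
    print(f"{enzyme}: wildtype={cuts_wt}, mutant={cuts_mut}")
```

An enzyme that appears in the output would give different band patterns for the two alleles after digesting the PCR product, provided the site does not also occur elsewhere in the amplicon.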

8. Finally, do the SNPs result in changes in the coding sequence for rras2? Try BLASTing the mutant sequence (from #7 above) vs. the amino acid sequence (from the GenBank/NCBI page from #6), using Align Two or More Sequences and BLASTX.
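The question BLASTX answers here (does a nucleotide change alter the encoded amino acid?) can be illustrated with a short translation sketch. Several things are assumptions for the sake of the example: the function names are made up, the 18-base excerpt is taken from the wildtype read above and is assumed to be in frame, and the "mutant" version carries a hypothetical change that is not one of the SNPs in the reads.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA sequence; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def coding_changes(wildtype, mutant):
    """Return (1-based codon number, wildtype residue, mutant residue) per change."""
    pairs = zip(translate(wildtype), translate(mutant))
    return [(n + 1, wt, mut) for n, (wt, mut) in enumerate(pairs) if wt != mut]


wildtype_excerpt = "ATGGCTGGCTGGAAGGAC"      # from the wildtype read, assumed in frame
hypothetical_mutant = "ATGGCTGGCTGAAAGGAC"   # made-up G-to-A change in codon 4

print(translate(wildtype_excerpt))
print(coding_changes(wildtype_excerpt, hypothetical_mutant))
```

A SNP that falls in an intron or leaves the residue unchanged produces an empty list, which is useful for mapping but says nothing about the causative lesion.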

Task 1: Create a transgenic reporter line by cloning candidate enhancer/promoter sequences

1. Eventually you identify the mutation as a lesion in the gene scube2. You wish to analyze the morphogenetic movements of cells expressing this gene during development in live embryos. To accomplish this, you decide to make a GFP reporter line that reflects the endogenous expression of this gene. First you decide to try a quick-and-dirty approach: you plan to clone genomic sequences upstream of the translational start site (ATG) of this gene and put them into a tol2 GFP expression vector.

Locate this gene in the genome and retrieve the coding sequence: go to the ZFIN homepage, and enter scube2 in the search box at the top of the page.

http://zfin.org

Follow the gene link to the ZFIN record for this gene, and scroll down the page. Where (which chromosome) does ZFIN say this gene is located?

2. Next, you want to retrieve the nucleotide sequence of this gene to (1) compare it with the genomic sequence, and (2) identify the translational start site. Scroll down the ZFIN page until you find the link for RNA. Follow this to the RefSeq record for this gene. Scroll down to the sequence information at the bottom of the page. Where does the coding sequence (cds) begin and end within the complete mRNA transcript? Find the ATG in the nucleotide sequence. Beginning at the ATG, copy about 100 bp of nucleotide sequence to the clipboard and head to the Ensembl Genome Browser for Zebrafish.

http://www.ensembl.org/Danio_rerio/

Enter scube2 into the search box. On the resulting page, click on Location. In which direction is the gene transcribed (i.e. which strand is the coding strand)?

By high-resolution genetic mapping, you localized the SSLP Z15270 to be 0.1 cM from the mutation in scube2. Z15270 is on chromosome 7 at about 27,488,000. The genetic map length of the zebrafish genome is 3000 cM total, and the total physical length of the genome is 1.7 x 10^9 bp. Is the actual physical (basepair) distance between Z15270 and scube2 surprising? What factors might account for any differences in expected distance?
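The expected distance follows from the genome-wide average: a 3000 cM map spread over 1.7 x 10^9 bp. A minimal sketch of the arithmetic, assuming recombination is uniform along the genome (which is exactly the assumption the question invites you to doubt); the function and constant names are made up for illustration:

```python
GENETIC_MAP_LENGTH_CM = 3000          # total genetic map length given in the text
PHYSICAL_GENOME_LENGTH_BP = 1.7e9     # total physical genome length given in the text


def expected_physical_distance_bp(genetic_distance_cm):
    """Convert a genetic distance to base pairs using the genome-wide average."""
    bp_per_cm = PHYSICAL_GENOME_LENGTH_BP / GENETIC_MAP_LENGTH_CM
    return genetic_distance_cm * bp_per_cm


expected = expected_physical_distance_bp(0.1)
print(f"Expected distance for 0.1 cM: about {expected:,.0f} bp")
```

With these numbers the average works out to roughly 570 kb per cM, so 0.1 cM corresponds to roughly 57 kb. Compare that with the distance you measure between Z15270 and scube2 in the browser.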

Zoom in and move the window so that the first exon encompasses the entire view (you can do so by drawing a rectangle around the first exon or by pressing the < and > buttons). Resize the window to include about 5 kb of upstream sequence (just add 5000 to the righthand number in the location box). Would grabbing 5 kb of upstream sequence be a good idea to make a reporter construct for scube2? Why or why not?

You decide to retrieve all intergenic sequence and test various parts of it for enhancer activity. First, resize the browser window to just include this intergenic sequence. Click the link for export data on the left hand side of the page. Pull down soft repeat masking in the genomic FASTA options, and hit next. Then click the text link to get the sequence.

Copy the DNA on to the clipboard, then go to the Primer3 website to design primers, trying to get as much of the input sequence as possible into the PCR product.

http://bioinfo.ut.ee/primer3-0.4.0/

To clone this bit of DNA, you would add appropriate restriction enzyme (or Gateway, or SLIC, or PIPE, or Gibson) sequences to the primers, PCR amplify, and hop into your favorite GFP expression vector.

3. You successfully make this vector and inject it into 1-cell stage embryos. The GFP expression in injected fish (aka transient transgenics) is promising: the pattern of GFP expression in a few fish roughly matches what is observed via in situ hybridization. In addition, many other tissues express GFP. Encouraged by this result, you raise the embryos to adulthood and cross them to identify founders. You identify ten founders, but none of your lines express GFP in a pattern consistent with the in situ data: expression in some tissues is absent, and many tissues express GFP where the gene is not normally expressed. How might you explain these results?

You decide to make a new reporter line by BAC recombination: you will obtain a large (~200kb) chunk of genomic DNA that contains this gene, and replace the first exon of your target gene with GFP. Why might this strategy result in GFP expression that more accurately recapitulates the endogenous expression pattern?

You can use at least two approaches to identify a BAC that contains your favorite gene: (1) directly from the Ensembl genome browser, (2) via a BLAST search at NCBI.

3a. Go to the Ensembl home page for zebrafish:

http://www.ensembl.org/Danio_rerio/

Enter scube2 in the search box and click Go. Follow the link for Location. Look at the Location pane in the browser page: what is written in the blue bar in the center of the page? If a region of the assembly is represented by a sequenced BAC, there will be a GenBank accession number (e.g. AL845363) in this blue bar. By contrast, if the region is represented by whole-genome shotgun traces, you will see something like Zv9_scaffold12345 in the middle bar.

Turn on the BAC ends track (if not already on) by clicking Configure this page (Simple Features) on the left hand side. Check the boxes for CHXXX and DKEYXXX (where X = a series of numbers), and hit the check mark on the upper right corner. Zoom out until you can see connected BAC ends (represented by horizontal blue bars). Are there any good options for BACs that contain the scube2 coding sequence and putative regulatory regions?

3b. Another way to search for a BAC is via a BLAST query at NCBI/GenBank. Retrieve the GenBank accession number for scube2 again from ZFIN, then go to the NCBI BLAST homepage:

http://blast.ncbi.nlm.nih.gov/Blast.cgi

Click nucleotide blast, enter the accession number in the search box, select nr from the pulldown menu, and type Danio rerio in the organism box. Hit BLAST. On the results page, genomic sequence will be annotated as Zebrafish DNA sequence from clone. Are there any BAC clones that cover the entirety of the scube2 sequence? You next decide to align the coding sequence with one BAC sequence to check for overlap. Note the accession number of the BAC, and go to the BLAST2 page:

http://blast.ncbi.nlm.nih.gov/

=> select nucleotide blast and click the Align two sequences box

Enter the accession number for the coding sequence in the top box, and for the BAC in the bottom box, and hit Align. Where does the coding sequence (ie. query) begin and end in the BAC sequence? Hit the Dot Matrix view for a graphical look.

The next steps would involve creating a targeting vector for homologous recombination. In this case, you could use ET recombination (or another method) to replace the first exon with GFP (or whatever you'd like), and also modify the BAC with tol2 transposon terminal inverted repeat sequences. BACs can be ordered from two sources, depending on the library:

DKEY and DKEYp from ImaGenes (http://www.imagenes-bio.de/); CH211, CH1073 and CH73 from BacPAC Resources (http://bacpac.chori.org/)

4. As a final step, you wish to identify candidate regulatory sequences by comparing genomic sequences from multiple teleost species. This can be accomplished via the VISTA webserver:

http://genome.lbl.gov/vista/index.shtml

First we will need to collect genomic sequences from other fish. In this example we will use three fish in which both genomic sequences and chromosome assignments are available: Tetraodon nigroviridis (Green-spotted pufferfish), Gasterosteus aculeatus (3-spined stickleback), and Oryzias latipes (medaka).

The whole-genome duplication event in the teleost lineage can make definitive orthology assignments a bit tricky. Clues to the correct ortholog can be gleaned from analyzing conserved syntenies, in which gene content on particular chromosomes has been retained after species divergence. A useful viewer of conserved syntenies in multiple organisms can be found at the Oxgrid website:

http://oxgrid.angis.org.au/oxg_table.html

By selecting the appropriate species comparisons, you can view chromosomes and chromosome segments in which gene content has been conserved. Which regions of the stickleback, medaka, and pufferfish genomes most closely match zebrafish chromosome 7?

Find the orthologs of Scube2 in these species by performing BLAT searches at the UCSC genome website, using the peptide sequence of Scube2 as the query. In the resulting browser page for each BLAT search, expand the window size to include ~10 kb of upstream and downstream flanking sequences. Note the orientation of the gene (+ or - strand), and export the genomic DNA via the DNA tab. Save these sequences on to your desktop. They may need to be edited to retain FASTA format; you can do this in Notepad, TextEdit, or via a command-line editor such as emacs.

Note: I've collected these sequences for you here, if you don't want to do all of the searching:

http://faculty.ithaca.edu/iwoods/docs/wh/vista_scube2/

Next return to the VISTA homepage, choose mVISTA, select 4 sequences to align, and upload your sequences to the VISTA server. View both the visual alignments as well as the textual alignments. Since we collected about 10kb of upstream sequence for each fish, the exons should begin to align at 10k. Can you see where the exon sequences are? Are there conserved noncoding sequences present as well? You may want to relax (via shorter alignments, lower %match) the conservation parameters a bit (Adjust Conservation Parameters), to see if you can get more sequences to show up as conserved.

Task 2: Expression, Rescue, and Phenocopying

In midline patterning, Hedgehog signals emanate from the notochord and ventral neural tube. Although scube2 mutants display many defects observed in other Hedgehog pathway mutants, scube2 expression in the neural tube is confined to dorsal regions. This expression pattern is reminiscent of Boc, a gene involved in Hedgehog signaling in mouse. You wish to analyze the zebrafish ortholog of Boc at the level of expression and function.

1. As a first step, you search for the zebrafish ortholog of Boc. Start at the NCBI home page:

http://www.ncbi.nlm.nih.gov/

Select Genes from the Search menu, and type in Boc. Scroll down until you see the first mouse record, and follow its link. On the resulting page, scroll down to the bottom to find the link for the amino acid sequence. Follow this link to the GenPept record for the protein. Scroll down and copy the amino acid sequence into your clipboard.

Now go to the BLAST home page:

http://blast.ncbi.nlm.nih.gov

Follow the link for tblastn, paste in the sequence, select nr and type in Danio rerio. While this search is running, hit the back button and select Expressed Sequence Tags from the database menu. These two simultaneous BLASTs will ensure that all available coding sequences from zebrafish will be searched. You can access all of your ongoing BLAST searches via the Recent Results tab. Check the BLAST result pages; if you have hits on the nr search with accession numbers that start with NM_, you're all set; these are typically full-length sequences that have been subjected to some degree of annotation and verification.

2. Next, you'd like to obtain a clone of zebrafish boc to use in expression analysis via in situ hybridization. You can follow at least two avenues: (1) obtain a clone from a commercial source or another laboratory, or (2) make a clone via RT-PCR. Clones that are commercially available are labeled with an IMAGE ID. Ideally, you would like a full-length sequence that you could use for rescue or overexpression experiments, but partial sequences are fine for generating in situ probes. Search the UniGene database using the accession number of your top NM hit.

http://www.ncbi.nlm.nih.gov/unigene/

Scroll down to locate sequences that are annotated with IMAGE IDs. Do any of the IMAGE clones represent full-length cDNAs (compare with the NM_ sequences)? You can order IMAGE-ID'd clones from Open Biosystems:

http://www.openbiosystems.com

3. Next, design primers that will allow you to amplify a full-length clone via RT-PCR for mRNA overexpression/rescue experiments. Follow the NM_XXX link from the UniGene page for boc. Scroll down, highlight and copy the nucleotide sequence, then paste it into primer3.

http://bioinfo.ut.ee/primer3-0.4.0/

Choose a size range that is sufficient to include the entire cds. Do your primer sequences flank the translational start and stop codons?

4. You'd also like to design morpholino oligonucleotides (MOs) that target the translational start site and a splice junction. First, compare the coding sequence with the genomic sequence to find the ATG and the exon-intron boundaries. There are several ways to do this, including (a) the GenBank record, (b) exporting sequence from a genome browser (Ensembl or UCSC), and (c) BLAST searches on genomic traces or sequenced BACs.

4a. The GenBank browser will often have 5' UTR sequences that can be used to design an ATG-binding MO. Where does the coding sequence of boc begin? Check the GenBank record for NM_001005393 and look for cds.

4b. Go to the Ensembl blast page and paste in the sequence for boc.

http://www.ensembl.org/Multi/Tools/Blast

In the Search Against box, select Danio rerio. Examine the alignment overview on the results page. Is the whole gene aligned to the genome? You can now zoom in on the first exon of boc, extract the sequence, and design your morpholinos. To get to the first exon, look for the lowest numbers in the Query column on the blast result page. Or, zoom out to the whole gene and draw a rectangle around the first exon and click Jump to Region. Similarly, you can design splice-blocking morpholinos by finding the exon-intron boundaries in the browser.

4c. You can repeat the procedure above to design a morpholino that targets an exon/intron junction. This time, we'll use the UCSC browser so that you can get a feel for its features:

http://genome.ucsc.edu
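Because a morpholino is antisense to its target, the oligo you order is the reverse complement of a 25-base window of the mRNA. The sketch below shows that step for a start-site MO. It assumes you already know the 0-based position of the ATG in your mRNA string; the function names, the default of 10 bases upstream of the ATG, and the toy mRNA are all illustrative choices of mine, not part of the protocol and not the real boc sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (uppercase output)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def atg_morpholino(mrna, atg_index, length=25, upstream=10):
    """Return an antisense oligo covering the start codon.

    The target window begins `upstream` bases 5' of the ATG and spans
    `length` bases of the mRNA (sense strand, written as DNA).
    """
    mrna = mrna.upper()
    if mrna[atg_index:atg_index + 3] != "ATG":
        raise ValueError("atg_index does not point at an ATG")
    start = atg_index - upstream
    if start < 0 or start + length > len(mrna):
        raise ValueError("not enough sequence around the ATG for this window")
    return reverse_complement(mrna[start:start + length])


# Toy sequence only: 10 bases of made-up 5' UTR, an ATG, then made-up coding sequence.
toy_mrna = "AAAAAAAAAA" + "ATG" + "CCCCCCCCCCCC"
print(atg_morpholino(toy_mrna, atg_index=10))
```

The same window-then-reverse-complement logic applies to a splice-blocking MO, with the window centred on an exon-intron boundary in the genomic sequence instead of on the ATG.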

5. Next, you would like to control for morpholino specificity via rescue by injection of an mRNA to which your morpholinos will not bind. One approach to do this is to inject an mRNA from a different species. You head to the pet store and acquire a Green-spotted pufferfish (Tetraodon nigroviridis), grind it up in liquid nitrogen, and extract total RNA. Your plan is to identify the ortholog of boc in Tetraodon, clone this sequence via high-fidelity PCR, generate mRNA via in vitro transcription, and inject this mRNA into morpholino-treated zebrafish for rescue.

First, return to the mouse GenPept page for Boc (from Step #1). Copy the sequence into the clipboard, and return to the UCSC Blat page.

http://genome.ucsc.edu/cgi-bin/hgBlat?command=start

Paste in the sequence, and select Tetraodon from the species menu. Hit Submit, and then follow the link for browser view. Zoom out until the full Tetraodon sequence is shown. How does your BLAT query compare with the Tetraodon genome?

Click on the Tetraodon Gene within the browser window; this will take you to a page in which it will be possible to export the predicted gene and peptide sequences. Copy the amino acid sequence, and paste it into blast2, along with the mouse Boc peptide sequence.

http://blast.ncbi.nlm.nih.gov/ (select blastp, and check box for compare two sequences)

Is the full mouse sequence matched by the Tetraodon sequence? How does the Tetraodon sequence compare with zebrafish boc? Which chromosome contains boc in Tetraodon? Does this make sense via patterns of conserved synteny between Zebrafish and Tetraodon? Check the OxGrid website and find the picture that depicts the orthology relationship between Zebrafish (drer) and Tetraodon (tnig):

http://oxgrid.angis.org.au/oxg_table.html

The next step is to design primers to amplify Tetraodon boc via RT-PCR. The predicted gene sequence does not include 5' and 3' untranslated regions, which does not leave much wiggle room for designing effective primers. You can collect putative UTR sequences from the genome. Go back to the UCSC BLAT page for Tetraodon, and paste in the predicted cDNA sequence. On the results page, follow the link for details and scroll down until you see the alignment between the cDNA and the genomic sequence. Collect about 80 bp of genomic sequence up and downstream of the Tetraodon boc gene, and make a new sequence that includes these putative UTR sequences. Enter this sequence into Primer3, and pick primers that will flank your coding sequence. Next, you'll amplify by high-fidelity PCR, clone into an expression vector, and verify the clone by sequencing.

How do the morpholino sequences you designed match up with the Tetraodon boc sequence? Compare them with BLAST two sequences:

Go to NCBI BLAST, click on nucleotide blast, and click on align two sequences. Paste your morpholino in one box, and the Tetraodon boc sequence in the other. Is there a significant match? Will Tetraodon boc mRNA likely escape morpholino-induced knockdown?
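For a 25-mer, a simple mismatch count against the rescue mRNA is an easy cross-check on the BLAST result. The sketch below slides the morpholino's target (its reverse complement) along an mRNA and reports the best window. The function names and the toy sequences are made up for illustration, no gaps are allowed, and how many mismatches are enough to escape knockdown is a judgment this code does not make.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (uppercase output)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def best_target_match(morpholino, mrna):
    """Return (fewest mismatches, 0-based start) of the best ungapped match
    between the morpholino's target sequence and any window of the mRNA."""
    target = reverse_complement(morpholino)
    mrna = mrna.upper()
    if len(mrna) < len(target):
        raise ValueError("mRNA is shorter than the morpholino")
    best = None
    for start in range(len(mrna) - len(target) + 1):
        window = mrna[start:start + len(target)]
        mismatches = sum(1 for a, b in zip(target, window) if a != b)
        if best is None or mismatches < best[0]:
            best = (mismatches, start)
    return best


# Toy data only: a made-up 25-mer MO, its perfect target, and a diverged target.
toy_morpholino = "GGGGGGGGGGGGCATTTTTTTTTTT"
perfect_mrna = "GG" + "AAAAAAAAAAATGCCCCCCCCCCCC" + "TT"
diverged_mrna = "GG" + "AAAAAGAAAAATGCCTCCCCACCCC" + "TT"

print(best_target_match(toy_morpholino, perfect_mrna))
print(best_target_match(toy_morpholino, diverged_mrna))
```

Run it once against the zebrafish mRNA (where the best window should have zero mismatches) and once against the Tetraodon sequence to see how far the two targets have diverged.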

6. Finally, you wish to explore the possibility of targeting boc for knockout via CRISPRs. One way to look for CRISPR targets is to take advantage of publicly-available tracks that can be added to genome browsers. To see one of these, head to

http://genome.ucsc.edu/cgi-bin/hgHubConnect?hubSearchTerms=&hgHub_do_search=on

Scroll down to ZebrafishGenomics and click Connect. Enter boc in the search box on the next page, and select the RefSeq Sequence. On the next page, scroll down, turn on the CRISPRs track under ZebrafishGenomics, and hit Refresh. Zoom in to the first exon to visualize potential targets. Lower scores indicate targets that may potentially have higher specificity.

Task 3: Batch BLAST and parsing with Python to identify zebrafish transcripts related to a specific signaling pathway (just a bit advanced)

Note: the Python software to perform this step is included as part of Mac OSX. If you run a PC, you'll need to figure out how to install and run Python.

The mutation we have been studying has a phenotype similar to mutations in genes encoding members of the Hedgehog pathway. We are going to find every human protein that has been associated with Hedgehog signaling, identify their putative zebrafish orthologs via a batch BLAST search, and determine whether any of these putative zebrafish Hedgehog-associated genes is located near our mapped mutation (on chromosome 7 at about 28.8 Mb).

A. Collect sequences for human proteins with hints of hedgehog interactions

1. Go to the NCBI website: http://www.ncbi.nlm.nih.gov/

2. Pull down the OMIM database.

3. Enter SHH in the search box; this finds any record that mentions SHH.

4. Select Protein from the Find Related Data menu on the right, and click Find Items. Limit the list to Human proteins by clicking Homo sapiens in the Top Organisms menu.

5. Select FASTA from the Display Settings pull-down menu; this retrieves the amino acid sequences.

6. From the Send to menu, select file, and save by hitting the Create File button.

7. Open a Terminal window and move the sequence file to a new folder called WH_bioInformatics on your Desktop. The BLAST command in part B expects this file to be named shh_human_proteins.fa, so rename it if necessary.
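
Before moving on, it can be worth confirming that the downloaded file really contains protein sequences in FASTA format. The following is a minimal Python 3 sketch, assuming the file is named shh_human_proteins.fa (the name used in the BLAST command in part B) and that you run it from the WH_bioInformatics folder. The function name is hypothetical and this is not one of the scripts supplied with the course materials.

```python
import os


def read_fasta(path):
    """Return a list of (header, sequence) pairs from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts; store the previous one first.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    fasta_path = "shh_human_proteins.fa"  # assumed file name
    if os.path.exists(fasta_path):
        records = read_fasta(fasta_path)
        print(len(records), "sequences found")
        for header, sequence in records[:3]:
            print(header[:60], len(sequence), "aa")
    else:
        print("Could not find", fasta_path)
```

If the count is zero, or the headers do not look like protein records, repeat the download before starting the BLAST search.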

B. Performing a local BLAST search on a batch of sequences

1. Go to the BLAST homepage at http://blast.ncbi.nlm.nih.gov/Blast.cgi

2. Click on the help tab

3. Follow the link for Download BLAST Software and Databases

4. Follow the link for the ftp site and connect as Guest if necessary, then click the "blast" link appropriate to your platform (e.g. macosx-universal). The file will be compressed; usually you can uncompress it by just double clicking it. Or you can do it via the terminal (for example, with tar -xzf followed by the name of the downloaded .tar.gz file).

5. The download will result in a folder saved somewhere on your computer (depending on your browser preferences). Move this folder to the WH_bioInformatics folder (the commands below assume the BLAST programs are at ./ncbi-blast-2.2.31+/bin/).

6. Now we'll download all current zebrafish transcripts (known and predicted) from Ensembl. While in the same folder as your protein sequences, connect via the terminal or your web browser to the Ensembl ftp site.

ftp ftp.ensembl.org, or type ftp.ensembl.org into a browser window

(login as anonymous with your email address as password)

>: cd pub/release-81/fasta/danio_rerio/cdna

7. Fetch all the sequences and disconnect:

>: mget Dan* (answer y at the prompts)

>: bye

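8. Unzip the sequences and concatenate them into one file named zebrafishcDNAs.fa (the name used in the commands below), for example with gunzip and cat in the terminal, and move this file into your WH_bioInformatics folder.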
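
If you would rather do the unzipping and concatenating in Python than in the shell, here is a minimal Python 3 sketch. It assumes the downloaded files sit in the current folder and match the pattern Dan*.fa.gz (an assumption based on the mget command above), and it writes zebrafishcDNAs.fa, the file name used by the makeblastdb command in the next step. The function name is hypothetical.

```python
import glob
import gzip
import shutil


def concatenate_gzipped(pattern, out_path):
    """Decompress every file matching pattern and append it to out_path.

    Returns the list of input files that were used. Nothing is written
    if no file matches the pattern.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        return []
    with open(out_path, "wb") as out:
        for path in paths:
            # gzip.open decompresses on the fly; copyfileobj streams the
            # data so large files do not have to fit in memory.
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return paths


if __name__ == "__main__":
    used = concatenate_gzipped("Dan*.fa.gz", "zebrafishcDNAs.fa")
    print("combined", len(used), "file(s) into zebrafishcDNAs.fa")
```

The original .gz files are left untouched, so you can delete them once you have checked the combined file.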

9. Make a BLASTable database from these zebrafish sequences (type ./ncbi-blast-2.2.31+/bin/makeblastdb -help for the options of makeblastdb):

./ncbi-blast-2.2.31+/bin/makeblastdb -in zebrafishcDNAs.fa -dbtype 'nucl'

10. Now you're ready to do a BLAST search. You can always type [command] -help (e.g. ./ncbi-blast-2.2.31+/bin/tblastn -help) for BLAST options:

./ncbi-blast-2.2.31+/bin/tblastn -query shh_human_proteins.fa -db zebrafishcDNAs.fa -num_alignments 2 -num_descriptions 2 -evalue 1e-5 -out shhVzfCDNA.tblastn &

It will take a while (~1.5 h) to compare our ~2600 human proteins with the entire database of known and predicted zebrafish transcripts. To save time, get the BLAST output from my website as part of task3_materials:

http://faculty.ithaca.edu/iwoods/docs/wh/informatics_problem/

If you do run the search yourself, you can check on its progress:

>: less shhVzfCDNA.tblastn (type q to quit out of less)
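
Paging through the output with less tells you that something is happening, but not how far along the search is. Here is a minimal Python 3 sketch that estimates progress by counting the "Query=" lines that BLAST writes at the start of each query's results and comparing that with the number of ">" header lines in the query file. The function names are hypothetical; the file names are the ones used in the tblastn command above.

```python
import os


def count_lines_starting_with(path, prefix):
    """Count the lines in a text file that begin with prefix."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(prefix))


def blast_progress(query_fasta, blast_output):
    """Return (queries_started, total_queries) for a running BLAST job."""
    total = count_lines_starting_with(query_fasta, ">")
    started = count_lines_starting_with(blast_output, "Query=")
    return started, total


if __name__ == "__main__":
    query, output = "shh_human_proteins.fa", "shhVzfCDNA.tblastn"
    if os.path.exists(query) and os.path.exists(output):
        started, total = blast_progress(query, output)
        print("%d of %d queries written so far" % (started, total))
    else:
        print("Run this from the folder that holds", query, "and", output)
```

The count can lag slightly behind the real progress because BLAST buffers its output.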

11. When the BLAST search is finished, parse it with one of my hacktastic Python scripts (wh_blast.py), downloaded from my website as part of task3_materials. To run the script, type the following:

>: python wh_blast.py shhVzfCDNA.tblastn > blast_results.csv

[sample results]

Import the results into Excel (comma-delimited) and sort by chromosome and map position. Does anything map to Chr 7 near Z15270 (27,488,000 bp)?
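
If you prefer to do the sorting and filtering in Python rather than Excel, here is a minimal Python 3 sketch. Big caveat: the column names "chromosome" and "start" are assumptions made for illustration, and the layout of blast_results.csv as written by wh_blast.py may well be different, so check the first line of your file and adjust chrom_col and pos_col. The 2 Mb window is also an arbitrary choice.

```python
import csv
import os


def hits_near(csv_path, chromosome, position, window,
              chrom_col="chromosome", pos_col="start"):
    """Return rows on the given chromosome within window bp of position.

    chrom_col and pos_col are ASSUMED column names; change them to match
    the header of your own CSV file.
    """
    nearby = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            if (row.get(chrom_col) or "").strip() != str(chromosome):
                continue
            try:
                pos = int(row[pos_col])
            except (KeyError, TypeError, ValueError):
                continue  # skip rows without a usable position
            if abs(pos - position) <= window:
                nearby.append(row)
    nearby.sort(key=lambda r: int(r[pos_col]))
    return nearby


if __name__ == "__main__":
    if os.path.exists("blast_results.csv"):
        # Z15270 is at 27,488,000 bp on chromosome 7; look 2 Mb either side.
        for row in hits_near("blast_results.csv", 7, 27488000, 2000000):
            print(row)
```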

12. Choose ENSDART00000113313. Look at the Ensembl record:

http://www.ensembl.org/Danio_rerio/Transcript/Transcript?t=ENSDART00000113313

13. What is the cDNA sequence of this gene? What is the function of this gene? Click on the Gene record, and then follow the Orthologs link (left of page) for some clues.

14. Find expression data (if it exists): if you're lucky, there will be a link to ZFIN straight from the Ensembl Gene Record page.

Task 4: Visualization of enriched motifs in putative promoter / enhancer regions.

Overview: From a file of unidentified sequences derived from a transcriptome profiling experiment, identify the best matching Ensembl transcript via local BLAST, batch download potential promoter sequences for each of these transcripts, search through these promoter sequences for enriched motifs, and visualize the location of the motifs on the promoters (just a bit advanced)

Note: the Python and Matplotlib software to perform this step is included as part of Mac OS X. If you run a PC, you'll need to figure out how to install and run Python.

1. Open the Terminal application. Navigate to the Desktop of your computer via the Terminal.

$ cd ~/Desktop

2. Download a file that contains task 4 materials from

http://faculty.ithaca.edu/iwoods/docs/wh/

Via the Finder/Windows GUI (graphical user interface), move this zip file to your Desktop. You can also do this on the Terminal:

$ mv ~/Downloads/task4_materials.zip .

Uncompress this archive by double clicking on it in your GUI, or via the terminal:

$ unzip task4_materials.zip

This will create a folder of files on your Desktop. Move to this folder via the Terminal and list its contents.

$ cd task4Materials

$ ls -1

task4sequences.fa = unidentified sequences obtained from a transcription profiling experiment

task4BlastAndParse.py = a Python script that will BLAST the sequences vs. a database of transcripts downloaded from Ensembl

kmerCounter.py = a Python script that will identify enriched motifs in a group of sequences

visualizeMotifs.py = a Python script that will show a picture representing the location of motifs within a group of sequences

You can look at each of these files via a text editor, or from your Terminal using the less command. For example, try:

$ less task4sequences.fa

Our next task is to determine the Ensembl Gene IDs that match these sequences. To do so, we will BLAST these sequences against a database of Ensembl Transcripts. First we have to download BLAST from NCBI, and then we will download the Transcripts from Ensembl.

3. Downloading BLAST:

Go to the BLAST homepage at http://blast.ncbi.nlm.nih.gov/Blast.cgi

Click on the help tab

Follow the link for Download BLAST Software and Databases

Follow the link for the ftp site and connect as Guest if necessary, then click the "blast" link appropriate to your platform (e.g. macosx-universal). The file will be compressed; usually you can uncompress it by just double clicking the downloaded file. Or you can do it via the terminal:

$ mv ~/Downloads/ncbi-blast-2.2.31+-universal-macosx.tar.gz .

$ tar -xzf ncbi-blast-2.2.31+-universal-macosx.tar.gz

Via your GUI, rename the extracted BLAST folder (its name begins with ncbi-blast-2.2.31+) to something more manageable (call it BLAST).

4. Downloading Ensembl Transcripts. Connect via the terminal or your web browser to the ensembl ftp site.

ftp ftp.ensembl.org, or type ftp.ensembl.org into a browser window

(login as anonymous with your email address as password). Once you're logged in, type the following (after the > prompt):

ftp> cd pub/release-81/fasta/danio_rerio/cdna

ftp> get Danio_rerio.GRCz10.cdna.all.fa.gz

ftp> bye

Uncompress the sequences by double clicking on them within your GUI, or via the terminal:

$ gunzip Danio_rerio.GRCz10.cdna.all.fa.gz

Rename the sequences to something more manageable. Again, you can do this within your GUI or via the terminal:

$ mv Danio_rerio.GRCz10.cdna.all.fa zfEnsemblCdna.fa

Make BLASTable databases for your Ensembl sequences with the following command:

$ ./BLAST/bin/makeblastdb -in zfEnsemblCdna.fa -dbtype 'nucl'

5. BLAST the transcriptome sequences vs. the Ensembl sequences and get the top result. Both the BLAST and the parsing can be performed via one of my Python scripts as follows. Note: this script requires that the files be named exactly as specified here (i.e. the BLAST commands must be in a BLAST folder, and the Ensembl Transcripts file must be named zfEnsemblCdna.fa). So when you list (ls -1) the contents of your folder, you should see the following:

$ ls -1

BLAST

kmerCounter.py

task4blastAndParse.py

task4sequences.fa

task4sequences.txt

visualizeMotifs.py

zfEnsemblCdna.fa

zfEnsemblCdna.fa.nhr

zfEnsemblCdna.fa.nin

zfEnsemblCdna.fa.nsq

If it looks good, run the blast and parsing as follows:

$ python task4BlastAndParse.py

This script creates several new files in your folder. You can visualize them with a list command:

$ ls -1

You should see this (the new files are the ones followed by a description):

BLAST

kmerCounter.py

task4SequencesTranscriptIDs.txt = a list of transcript IDs that match the sequences

task4SequencesgeneINFO.txt = Ensembl GeneIDs and Transcript IDs for all sequences

task4SequencesvCDNA.blastn = the BLAST results

task4blastAndParse.py

task4sequences.fa

task4sequences.txt

visualizeMotifs.py

zfEnsemblCdna.fa

zfEnsemblCdna.fa.nhr

zfEnsemblCdna.fa.nin

zfEnsemblCdna.fa.nsq
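
For a sense of what "get the top result" involves, here is a minimal Python 3 sketch that picks the best-scoring hit for each query from BLAST output. It assumes the search was run with tabular output (-outfmt 6), where the first column is the query ID, the second is the subject ID, and the twelfth is the bit score. That format and the file name hits.tsv are assumptions for illustration; task4BlastAndParse.py may store and parse its results differently.

```python
import os


def top_hits(tabular_path):
    """Return {query_id: (subject_id, bit_score)} for the best hit per query.

    Expects BLAST tabular output (-outfmt 6) with the 12 default columns.
    """
    best = {}
    with open(tabular_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 12:
                continue  # skip comments or malformed lines
            query, subject = fields[0], fields[1]
            bit_score = float(fields[11])
            # Keep the hit only if it beats what we have for this query.
            if query not in best or bit_score > best[query][1]:
                best[query] = (subject, bit_score)
    return best


if __name__ == "__main__":
    path = "hits.tsv"  # hypothetical file name
    if os.path.exists(path):
        for query, (subject, score) in sorted(top_hits(path).items()):
            print(query, subject, score)
```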

6. Now we will obtain potential promoter sequences for these transcripts by downloading the 5' UTR and 2000 bp upstream (an arbitrary distance) for these sequences from BioMart at Ensembl.

First, copy the transcript IDs (from task4SequencesTranscriptIDs.txt) to your computer's clipboard.

Next, navigate to Ensembl

http://www.ensembl.org

and follow the link for BioMart. On the following page, select Ensembl Genes from the pull-down menu, then select Danio rerio Genes.

To select for a set of Transcripts: click on Filters, check the plus sign for Genes, select Ensembl Transcript IDs, and paste in your list of transcripts.

To select information for download: click on Attributes, select Sequences, click the plus sign for SEQUENCES, select 5' UTR, check the box for Upstream flank and input 2000 into the box.

To download results: click on the Results button. Check the box for Unique results only. Hit the Go button next to Export all results to (leave File selected in the pull-down).

A file called mart_export.txt should appear in your Downloads folder. Move this to your working folder via your GUI or Terminal, and rename it as upstreamSequences.fa.

$ mv ~/Downloads/mart_export.txt upstreamSequences.fa

7. Identify enriched motifs in these sequences with a Python script. You can change the parameters within the script if you'd like, but if you search for anything more than 7 bases long, it will take quite some time.

$ python kmerCounter.py
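
To see the basic idea behind k-mer counting, here is a minimal Python 3 sketch that counts, for every k-mer, how many sequences contain it at least once. This is only one simple way to look for over-represented motifs and is not necessarily what kmerCounter.py does (for example, it makes no comparison against background sequences). The function name and the toy sequences are made up. Note that the number of possible k-mers grows as 4^k (4^7 = 16384), which is part of why longer motifs take longer to search.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count how many sequences contain each k-mer at least once.

    K-mers containing anything other than A, C, G or T (e.g. N) are ignored.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A set, so each k-mer is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmer for kmer in seen if set(kmer) <= set("ACGT"))
    return counts


if __name__ == "__main__":
    # Toy sequences for illustration only.
    toy = ["ACGTAC", "TTACGT", "GGGGGG"]
    for kmer, n in count_kmers(toy, 4).most_common(3):
        print(kmer, "found in", n, "sequence(s)")
```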

8. Choose a motif from the list, and modify the visualizeMotifs.py script to look for this motif. Run this script to visualize the locations of the motif:

$ python visualizeMotifs.py
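
If you are curious how such a picture can be built, here is a minimal Python 3 sketch: one function finds every position of a motif in a sequence, and a second draws each sequence as a horizontal line with a tick at every match using Matplotlib. This is an illustration under my own assumptions, not the code in visualizeMotifs.py; the function names, the output file name motifs.png and the toy sequences are all hypothetical.

```python
def motif_positions(sequence, motif):
    """Return the 0-based start of every (possibly overlapping) match."""
    sequence, motif = sequence.upper(), motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Move on by one base so overlapping matches are found too.
        start = sequence.find(motif, start + 1)
    return positions


def plot_motif(sequences, motif, out_path="motifs.png"):
    """Draw one line per sequence with a red tick at each motif match.

    sequences is a dict mapping a name to a DNA sequence.
    """
    import matplotlib
    matplotlib.use("Agg")  # draw to a file; no window needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    names = list(sequences)
    for row, name in enumerate(names):
        seq = sequences[name]
        ax.hlines(row, 0, len(seq), colors="gray")
        hits = motif_positions(seq, motif)
        ax.plot(hits, [row] * len(hits), "|", color="red", markersize=12)
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names)
    ax.set_xlabel("position in sequence (bp)")
    fig.savefig(out_path)
    plt.close(fig)


if __name__ == "__main__":
    # Toy sequences for illustration only.
    toy = {"seqA": "ACGTACGTTTACG", "seqB": "TTTTACGAAAA"}
    for name, seq in toy.items():
        print(name, motif_positions(seq, "ACG"))
```

Calling plot_motif(toy, "ACG") would write the picture to motifs.png, provided Matplotlib is installed.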