1

Ensembl Modules and MySQL

SQL and Database Tables

• Quick Examples

2

3

exon
  transcript_id
  exon_num = 3
  sequence_start
  sequence_stop

intron
  transcript_id
  intron_num = 3
  sequence_start
  sequence_stop

primer_pair
  id
  transcript_id
  left_primer_id
  right_primer_id

transcript
  id
  sequence_id
  source = Ensembl
  source_id

sequence
  id
  target_id
  type = nucleotide
  sequence = ATG…
  chr_name = 15
  strand = 1
  genomic_start = 15,123,120
  genomic_stop = 16,378,131
  source
  source_id
  refresh

target
  id
  date
  gene_name = BBS4
  description
  accession
  status

project
  id
  name = pro1
  description
  date

set_table
  id
  project_id
  name = testset
  date
  description

target_set_info
  set_id
  target_id
  rank = 5
  cas_rank
  cas_options

select id from target where gene_name = "BBS4";

4

MySQL Demo with Ensembl

mysql -u anonymous -h ensembldb.ensembl.org

show databases;

show databases like "%omo%core%";

use homo_sapiens_core_47_36i;

show tables;

select count(*) from exon;

show columns from gene;

select * from xref limit 10;

select * from xref where dbprimary_acc = "NM_000777";

select stable_id, gene.gene_id
from gene_stable_id, gene, transcript, object_xref, xref
where gene_stable_id.gene_id = gene.gene_id
and gene.gene_id = transcript.gene_id
and transcript.transcript_id = object_xref.ensembl_id
and xref.xref_id = object_xref.xref_id
and xref.dbprimary_acc = 'NM_000777';

select * from transcript where gene_id = 17393;

select * from exon_transcript where transcript_id = 33341;

select * from exon where exon_id = 193252;

5

Ensembl Schema

Core Schema

http://www.ensembl.org/info/docs/api/core/schema/index.html#exon_stable_id

API Tutorial:

http://www.ensembl.org/info/docs/api/core/core_tutorial.html

6

Code Development

1) Generate random sequence
ATGCCCGCTGAGT

2) Generate formatted random sequence
1 ATGCCCGCTT TGACCCTTTA 20

3) Generate random sequence, translated into protein, and formatted…

• Code revision
– adding functionality and features
– may introduce bugs that are not discovered until much later
– useful to examine the changes to code that may have caused bugs
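The three development steps above can be sketched in Perl roughly as follows. This is only an illustration, not the program used in class: the subroutine names (random_dna, format_dna, translate_dna) are made up here, and the layout of 10 bases per block and 2 blocks per line is inferred from the single formatted line on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Standard genetic code, listed for codons in TTT, TTC, TTA, TTG, TCT, ... order.
my @bases       = qw(T C A G);
my $amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $index = 0;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $codon_table{"$first$second$third"} = substr($amino_acids, $index++, 1);
        }
    }
}

# Step 1: a random DNA string of the requested length.
sub random_dna {
    my ($length) = @_;
    return join '', map { $bases[ int(rand(@bases)) ] } 1 .. $length;
}

# Step 2: blocks of 10 bases, two blocks per line, with the first and last
# position of each line printed at either end (layout assumed from the slide).
sub format_dna {
    my ($seq) = @_;
    my @lines;
    for (my $pos = 0; $pos < length($seq); $pos += 20) {
        my $chunk  = substr($seq, $pos, 20);
        my @blocks = $chunk =~ /(.{1,10})/g;
        push @lines, join(' ', $pos + 1, @blocks, $pos + length($chunk));
    }
    return join("\n", @lines);
}

# Step 3: translate complete codons in reading frame 1; stop codons become "*".
sub translate_dna {
    my ($seq) = @_;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($seq); $pos += 3) {
        $protein .= $codon_table{ substr($seq, $pos, 3) };
    }
    return $protein;
}

my $dna = random_dna(20);
print format_dna($dna), "\n";
print translate_dna($dna), "\n";
```

Each step adds a feature on top of the previous one, which is exactly the situation in which a revision history becomes useful: if step 3 breaks the formatting from step 2, you want to see what changed.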

7

Code Development Solutions

• May retain a copy of every version of every file
– have complete record
– redundant and waste of space
– responsibility on developer to maintain revision history
– Example (V1, V2, V3 experiment, V4 unfinished feature, return in 6 months?)

8

Multi-coder Environment

• Developers D1, D2, and D3
• Source code S1, S2, S3, S4, S5.
• D1 copies S1 and makes changes
• D2 copies S1 and makes changes
• D2 returns S1
• D1 returns S1
• Clearly, this is ineffective for managing and integrating changes
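The problem with the copy-and-return sequence above can be shown with a toy simulation. This is a sketch only: the "repository" is just a Perl hash, and the file contents are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The shared copy of source file S1 (hypothetical contents).
my %repository = ( S1 => "line one\n" );

# D1 and D2 each copy S1 and edit their private copy.
my $d1_copy = $repository{S1} . "D1's change\n";
my $d2_copy = $repository{S1} . "D2's change\n";

# D2 returns S1 first; then D1 returns S1 and overwrites D2's version.
$repository{S1} = $d2_copy;
$repository{S1} = $d1_copy;

print $repository{S1};
print "D2's change was lost\n" unless $repository{S1} =~ /D2's change/;
```

Whoever returns the file last wins, and the other developer's work disappears without any warning. A version control system exists to detect and merge such concurrent edits.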

9

Brief Overview of CVS

CVS – Concurrent Versions System

CVS
– only stores differences between files/versions
– uses repository structure
• check out
• check in
• lock
• branching, merging
• etc

• Reference
– http://www.gnu.org/software/cvs/
– https://www.cvshome.org/

10

Installing Ensembl Modules

Sample program – ens4.pl (a simple demo program that obtains the exons for a particular gene from the Ensembl database, given an accession number and an Ensembl Gene ID)

Connect to Ensembl's MySQL database:
% mysql -u anonymous -h ensembldb.ensembl.org

To get a list of their current databases, type:
% show databases;

Find the most recent (highest-numbered) version of the homo_sapiens core database.

Example: homo_sapiens_core_47_36i

Example: homo_sapiens_core_25_36

The final two numbers represent the Ensembl code version and the NCBI human build, respectively (i.e. Ensembl modules 25 and NCBI Human Build 36).
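The naming convention just described can be checked mechanically. The following is a sketch under stated assumptions: the subroutine names are invented here, the pattern is inferred only from the two example names above, and the trailing letter (the "i" in 36i) is kept as an uninterpreted suffix because the slides do not say what it means.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a core database name into species, Ensembl code version and build.
# Returns nothing if the name does not look like <species>_core_<version>_<build>.
sub parse_core_db_name {
    my ($dbname) = @_;
    my ($species, $version, $build, $suffix) =
        $dbname =~ /^([a-z_]+)_core_(\d+)_(\d+)([a-z]?)$/
        or return;
    return {
        name    => $dbname,
        species => $species,
        version => $version,
        build   => $build,
        suffix  => $suffix,
    };
}

# Pick the database with the highest Ensembl code version from a list of names.
sub latest_core_db {
    my @names = @_;
    my $best;
    for my $name (@names) {
        my $info = parse_core_db_name($name) or next;
        $best = $info if !defined $best || $info->{version} > $best->{version};
    }
    return defined $best ? $best->{name} : undef;
}

my @names = ("homo_sapiens_core_25_36", "homo_sapiens_core_47_36i");
my $info  = parse_core_db_name($names[1]);
print "Ensembl code version $info->{version}, NCBI build $info->{build}\n";
print "Most recent: ", latest_core_db(@names), "\n";
```

In practice the list of names would come from the output of show databases rather than being typed in by hand.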

In this case, you should be using Ensembl code 47 to do the following:

11

NO LONGER VALID

(for CSS)
%touch ~/.cvspass
%chmod 755 ~/.cvspass

create the directory
%mkdir Ensembl_modules-41

enter the directory
%cd Ensembl_modules-41

type the following:

%cvs -d :pserver:[email protected]:/cvsroot/CVSmaster login
(when prompted, the password is CVSUSER) -- yes, in all CAPS

%cvs -d :pserver:[email protected]:/cvsroot/CVSmaster checkout -r branch-ensembl-41 ensembl
%cvs -d :pserver:[email protected]:/cvsroot/CVSmaster checkout -r branch-ensembl-41 ensembl-external
%cvs -d :pserver:[email protected]:/cvsroot/CVSmaster checkout -r branch-ensembl-41 ensembl-lite

Note this is all about 9 Meg

Make a symbolic link called "Ensembl_modules-current" that points to your newly created directory of modules:
%cd ..
%ln -s Ensembl_modules-41 Ensembl_modules-current

12

http://www.ensembl.org/info/software/api_installation.html

# -- Clearly this assumes a Unix flavor --

Create an installation directory
$ cd
$ mkdir src
$ cd src

$ cvs -d :pserver:[email protected]:/home/repository/bioperl login
Logging in to :pserver:[email protected]:2401/home/repository/bioperl
CVS password: cvs

Install BioPerl (version 1.2.3)
$ cvs -d :pserver:[email protected]:/home/repository/bioperl checkout -r bioperl-release-1-2-3 bioperl-live

Log into the Ensembl CVS server at Sanger (using a password of CVSUSER):
$ cvs -d :pserver:[email protected]:/cvsroot/ensembl login
CVS password: CVSUSER

Install the Ensembl Core Perl API for version 47
$ cvs -d :pserver:[email protected]:/cvsroot/ensembl checkout -r branch-ensembl-47 ensembl

If required, install the Ensembl Variation Perl API for version 47
$ cvs -d :pserver:[email protected]:/cvsroot/ensembl checkout -r branch-ensembl-47 ensembl-variation

If required, install the Ensembl Compara Perl API for version 47
$ cvs -d :pserver:[email protected]:/cvsroot/ensembl checkout -r branch-ensembl-47 ensembl-compara

NB: You can install as many Ensembl APIs as you need in this way.

13

To install Ensembl modules -- this assumes you do need to have the BioPerl modules installed (that used to be a separate step)

Run Program

• Now put the following program into your "src" directory – and it should run.

14

15

#!/usr/local/bin/perl

use strict;
use warnings;

use lib "bioperl-live";      # you MAY have to use: use lib "bioperl-live/bioperl-live";
use lib "ensembl/modules";   # use lib "ensembl/ensembl/modules";
use Bio::EnsEMBL::DBSQL::DBAdaptor;

#my $host = "kaka.sanger.ac.uk";
my $host = "ensembldb.ensembl.org";
my $user = "anonymous";
#my $dbname = "homo_sapiens_core_41_36c";
my $dbname = "homo_sapiens_core_47_36i";

my $accession_num = "NM_000777";
my $Ensembl_gene_id = "ENSG00000106258";
my $flank_length = 5000;

# Connect to the Ensembl core database
my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host   => $host,
    -user   => $user,
    -dbname => $dbname
);

# Fetch every gene that is cross-referenced to the accession number
my $gene_adaptor = $db->get_GeneAdaptor();
my @genes = @{ $gene_adaptor->fetch_all_by_external_name($accession_num) };

foreach my $gene (@genes) {
    my $string = feature2string($gene);
    print "$string\n";
}

# Format a feature as "stable_id: name seq_region:start-end (strand)"
sub feature2string {
    my $f = shift;
    my $stable_id  = $f->stable_id();
    my $name       = $f->external_name();
    my $seq_region = $f->slice->seq_region_name();
    my $start      = $f->start();
    my $end        = $f->end();
    my $strand     = $f->strand();
    return "$stable_id: $name $seq_region:$start-$end ($strand)";
}

16

Output

ENSG00000106258: CYP3A5 7:99083759-99115557 (-1)

Doesn't seem like much, but remember:

1) Using the language "perl"

2) Using other people's software (modules)

3) Accessing genomic data in a database in England

4) Accessing data programmatically

17

Look at API

Ensembl API (full): http://www.ensembl.org/info/docs/api/Pdoc/index.html

Ensembl->gene_adaptor->fetch_all_by_external_name

@genes = @{$gene_adaptor->fetch_all_by_external_name('BRCA2')};

18

From the API…

# Fetch all clones from a slice adaptor (returns a list reference)

my $clones_ref = $slice_adaptor->fetch_all('clone');

# If you want a copy of the contents of the list referenced by

# the $clones_ref reference...

my @clones = @{$clones_ref};

# Get the first clone from the list via the reference:

my $first_clone = $clones_ref->[0];

19

Object adaptors have internal knowledge of the underlying database schema and use this knowledge to fetch, store and remove objects (and data) from the database. This way you can write code and use the Ensembl Core API without having to know anything about the underlying databases you are using.

Object adaptors are obtained from the Registry via a method named get_adaptor(). To obtain a Slice adaptor or a Gene adaptor (which retrieve Slice and Gene objects respectively) for Human, do the following after having loaded the Registry, here called $registry, as above:

my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

Don't worry if you don't immediately see how useful this could be. Just remember that you don't need to know anything about how the database is structured, but you can retrieve the necessary data (neatly packaged in objects) by asking for it from the correct adaptor. Throughout the rest of this document we are going to work through the ways the Ensembl objects can be used to derive the information you want.
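To make the adaptor idea concrete, here is a toy sketch that runs without any database. ToyGeneAdaptor is a made-up class, not part of the Ensembl API, and the two in-memory "tables" and their rows are illustrative stand-ins for the gene_stable_id and gene tables queried in the MySQL demo.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package ToyGeneAdaptor;    # hypothetical class, NOT part of the Ensembl API

sub new {
    my ($class, %tables) = @_;
    return bless { tables => {%tables} }, $class;
}

# The caller asks for a gene by stable ID. Only the adaptor knows that the
# answer requires joining the gene_stable_id and gene "tables" on gene_id.
sub fetch_by_stable_id {
    my ($self, $stable_id) = @_;
    my ($link) = grep { $_->{stable_id} eq $stable_id }
                 @{ $self->{tables}{gene_stable_id} };
    return unless $link;
    my ($gene) = grep { $_->{gene_id} == $link->{gene_id} }
                 @{ $self->{tables}{gene} };
    return unless $gene;
    return +{ stable_id => $stable_id, %$gene };
}

package main;

# Illustrative rows only; a real adaptor would run SQL against the database.
my $adaptor = ToyGeneAdaptor->new(
    gene_stable_id => [ { gene_id => 17393, stable_id => 'ENSG00000106258' } ],
    gene           => [ { gene_id => 17393, strand => -1 } ],
);

my $gene = $adaptor->fetch_by_stable_id('ENSG00000106258');
print "$gene->{stable_id} is on strand $gene->{strand}\n";
```

The calling code never mentions a table or a join. If the schema changed, only the adaptor would need to be rewritten, which is the point the tutorial text is making.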

UCSC

• http://genome.ucsc.edu/FAQ/FAQdownloads#download29

20

21

#### genome-mysql.cse.ucsc.edu

use strict;
use warnings;
use DBI;

my $dsn      = "DBI:mysql:hg18:genome-mysql.cse.ucsc.edu";
my $username = "genome";
my $passwd   = "";

my $dbh = DBI->connect($dsn, $username, $passwd, { RaiseError => 1 });

if (!defined $dbh) {
    print "\nConnect to database (hg18): FAILED\n";
}
else {
    print "\nConnect to database (hg18): SUCCESS\n";
}

# Look up the gene in refFlat; the placeholder (?) lets DBI quote the value
my $gene   = "BBS4";
my $string = "SELECT geneName, name, exonStarts, exonEnds, chrom, strand FROM refFlat WHERE geneName = ?";
my $sth    = $dbh->prepare($string);
$sth->execute($gene);

# If several rows match, the values from the last row are kept
my ($GENENAME, $NAME, $EXONSTARTS, $STRAND);
while (my @row = $sth->fetchrow_array) {
    $GENENAME   = $row[0];
    $NAME       = $row[1];
    $EXONSTARTS = $row[2];
    $STRAND     = $row[5];
}

$sth->finish();

print "geneName = $GENENAME\n";
print "Name = $NAME\n";
print "strand = $STRAND\n";
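The exonStarts and exonEnds columns fetched above come back as comma-separated strings with a trailing comma, with 0-based starts and exclusive ends. A sketch of turning them into per-exon coordinates follows; the subroutine name parse_exons and the coordinate values are invented for illustration and are not real BBS4 data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn refFlat-style exonStarts/exonEnds strings into [start, end, length]
# triples. Starts are 0-based and ends are exclusive, so length = end - start.
sub parse_exons {
    my ($exon_starts, $exon_ends) = @_;
    my @starts = split /,/, $exon_starts;    # split drops the trailing empty field
    my @ends   = split /,/, $exon_ends;
    die "exonStarts and exonEnds differ in length\n" unless @starts == @ends;
    return map { [ $starts[$_], $ends[$_], $ends[$_] - $starts[$_] ] } 0 .. $#starts;
}

# Hypothetical coordinates, written the way the columns are stored.
my @exons = parse_exons("1000,2000,3500,", "1160,2096,3600,");

for my $i (0 .. $#exons) {
    my ($start, $end, $length) = @{ $exons[$i] };
    printf "Exon %d: %d-%d length=%d\n", $i + 1, $start, $end, $length;
}
```

With the database code above, the two arguments would be $row[2] and $row[3] from each fetched row.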

22

End

23

Output

Intron: 11 -9247 -6928
Exon: 12 -6927 -6768
sequence_start = -6927
sequence_stop = -6768
exon length= 160
exon start, exon_stop 6768 6927
exon sequence:

CTGTGTTTCTTTACAAGGTTTGAAGGAGAAGTTCTGAAGGACTCTGATTAGAGCAAGTTTCATGTTCATGAGAGCAAACCTCATGCCAATGCAGTTTCTGGGTCCAGTTCCAAAGGGTGTGTATATGTAAGGATCTATGCTGTCCTTCTTCTTACTGAAC

Intron: 12 -6767 -5096
Exon: 13 -5095 -5000
sequence_start = -5095
sequence_stop = -5000
exon length= 96
exon start, exon_stop 5000 5095
exon sequence:

TCATTCTCCACTTAGGGTTCCATCTCTTGAATCCACCTTTAGAACAATGGGTTTTTCTGGTTGAAGAAGTCCTTGCGTGTCTAATTTCAAGGGGAT

chr = chr7
seq length= 41692

24

Installing bioperl (Linux)

3.5) mkdir ~/perl
3.6) mkdir ~/perl/bioperl

3.8) cd bioperl-1.2.3

4) perl Makefile.PL LIB=~/perl/bioperl

(Do it this way -- with "LIB" -- recently changed slide)

make test
make install (see installing in private space on next slides)

To uninstall, just delete ~/perl/bioperl and ~/perl/bioperl-1.2.3

Note: version -1.2.3 was the current version when I made this slide -- it may have updated since.

25

5) To use:

#!/usr/local/bin/perl

use lib "~/local/bioperl/"; # this is supposed to work, but did NOT on CSS
use Bio::Tools::BPlite; # Need -- LIB prefix for this to work.

csh
5.1) setenv PERL5LIB ~/perl/bioperl

bash
5.1) PERL5LIB=~/perl/bioperl; export PERL5LIB

mac (bash)
5.1) PERL5LIB=~/perl/bioperl; export PERL5LIB

6) To make docs work (I would just put this in your .cshrc file):
set path = ($path ~/perl/bioperl/lib/site_perl/5.8.1)
PATH=$PATH:~/perl/bioperl/lib/site_perl/5.8.1; export PATH

Test with:
cd
perldoc Bio::SearchIO

FINALLY, please note that the version numbers change over time, and the actual paths may vary a little between CPAN and/or bioperl.org

It may take some trial and error (it usually does for me).

NOTE TO SELF -- check out the CPAN installer (it's much easier)
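One likely reason a use lib line with "~" fails: the shell expands "~" in the setenv and PERL5LIB commands, but Perl does not expand it inside a string, so the literal directory name "~" is searched instead. A sketch of building the path from $HOME follows; the directory ~/perl/bioperl is taken from the LIB= setting on the install slide, so adjust it to wherever you actually installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Perl does not expand "~" inside a string, so build the path from $HOME.
use lib "$ENV{HOME}/perl/bioperl";

# Show whether the directory really is on the module search path.
my $wanted = "$ENV{HOME}/perl/bioperl";
my $found  = grep { $_ eq $wanted } @INC;
print "$wanted is ", ( $found ? "" : "NOT " ), "on \@INC\n";
```

Setting PERL5LIB in the shell, as shown on this slide, avoids the issue because the shell has already replaced "~" with your home directory before Perl starts.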

26

Using Modules

Finally, need DBI.pm
% mkdir modules
% cd modules
% ftp ftp.cpan.org (login: ftp passwd: [email protected])
% bin
% cd /pub/CPAN/modules/by-module/DBI
% get DBI-1.53.tar.gz
% cd ../DBD
% get DBD-mysql-3.0008.tar.gz
% gunzip DBI-1.53.tar.gz
% tar -xvf DBI-1.53.tar
% cd DBI-1.53

% perl Makefile.PL LIB=~/modules (**** changed this slide)
% make
% make install
(set up Environment for DBI -- next slide), then install DBD

27

Connecting w/ Perl

% mkdir modules
(put modules in this dir)
Need DBI, DBD-mysql
gunzip, and tar

(do this for both modules)
perl Makefile.PL LIB=~/modules
make
make install

csh
5.0) setenv PERL5LIB "$HOME/modules:$HOME/perl/bioperl"

bash
5.1) PERL5LIB=$HOME/modules:$HOME/perl/bioperl; export PERL5LIB

(note CSS has upgraded perl from 5.6.0 – used the last time)

28

Using Modules

CSH
setenv PERL5LIB "$HOME/local/bioperl/lib/site_perl/5.8.1:$HOME/modules/lib/site_perl/5.8.1:$HOME/Ensembl_modules-41/ensembl/modules:$HOME/Ensembl_modules-41/ensembl-external/modules:$HOME/Ensembl_modules-41/ensembl-lite/modules:$HOME/modules:$HOME/perl/bioperl"

BASH
PERL5LIB="$HOME/local/bioperl/lib/site_perl/5.8.1:$HOME/modules/lib/site_perl/5.8.1:$HOME/Ensembl_modules-41/ensembl/modules:$HOME/Ensembl_modules-41/ensembl-external/modules:$HOME/Ensembl_modules-41/ensembl-lite/modules:$HOME/modules:$HOME/perl/bioperl"
export PERL5LIB

This (below) would work if we used the LIB prefix -- but that makes it a pain to install DBD. So just rely on environment settings. NOTE -- if you log out and don't save the environment setting somewhere (such as .cshrc or .bashrc), you will have to re-type the command.

DBI used with:
use lib "~/modules/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi/";

Ensembl modules can then be used in a perl program with:

use lib "~/Ensembl_modules-current/ensembl/modules";
use lib "~/Ensembl_modules-current/ensembl-external/modules";
use lib "~/Ensembl_modules-current/ensembl-lite/modules";

29

Another Module

Finally, need DBD.pm
% cd modules
% ftp ftp.cpan.org (login: ftp passwd: [email protected])
% bin
% cd /pub/CPAN/modules/by-module/DBD
% get DBD-mysql-2.9003.tar.gz
% quit
% gunzip DBD-mysql-2.9003.tar.gz
% tar -xvf DBD-mysql-2.9003.tar
% cd DBD-mysql-2.9003

% perl Makefile.PL LIB=~/modules
% make
% make install

30

Does not work on CSS

• Concluded either
– version of Perl incompatible
– port blocking

./ens3.pl
current core DB: homo_sapiens_core_18_34
-------------------- EXCEPTION --------------------
MSG: Could not connect to database homo_sapiens_core_18_34 user anonymous using [DBI:mysql:database=homo_sapiens_core_18_34;host=ensembldb.ensembl.org;port=3306] as a locator
STACK Bio::EnsEMBL::DBSQL::DBConnection::new /user/eng/tbraun/Ensembl_modules-18/ensembl/modules/Bio/EnsEMBL/DBSQL/DBConnection.pm:125
STACK Bio::EnsEMBL::DBSQL::DBAdaptor::new /user/eng/tbraun/Ensembl_modules-18/ensembl/modules/Bio/EnsEMBL/DBSQL/DBAdaptor.pm:79
STACK main::dbconnect_Ensembl ./ens3.pl:150
STACK toplevel ./ens3.pl:26
-------------------------------------------

31

However…

• Installed local version of MySQL

• Needed modules
– DBI (perl database interface)
– DBD (database specific interface – mysql)

• Realized that I had failed to install DBD with Ensembl modules

End

32

• No longer need to install BioPerl separately – the Ensembl install instructions install BioPerl now.

33

34

Install BioPerl

• I'll assume Windows XP, Eclipse (if you are using Linux/Unix, then the default documentation with Bioperl is better than these slides: www.bioperl.org).

• Download Bioperl:
• http://pdb.eng.uiowa.edu/~tabraun/biotech/2007/modules/bioperl-1.4.zip
• The "official" version can be found here: (http://code.open-bio.org/cgi/viewcvs.cgi/bioperl-live/bioperl-live.tar.gz?tarball=1)

• Move this zip file into your Eclipse "workspace" directory and unzip it (mine is H:\windowsdata\workspace)

• You will need an "unzip" program. Most default versions of XP come with one. If you don't have one, you can download a free one:
– http://www.download.com/jZip/3000-2250_4-10761563.html?tag=lst-6

• Now in your perl program -- you will need to add the line (single quotes, so that Perl does not treat the backslashes as escape sequences):
• use lib 'H:\windowsdata\workspace\bioperl-live';

35

BioPerl continued

Depending on whether your "zip" program creates a directory for you, you may have to put in (single quotes, so that Perl does not treat the backslashes as escape sequences):
use lib 'H:\windowsdata\workspace\bioperl-live\bioperl-live';

You will also need 2 other modules (DBD and DBI). These are used by the Ensembl modules to allow a perl program to connect to a MySQL database.

• DBI - Database independent interface for Perl
• DBD::mysql - MySQL driver for the Perl5 Database Interface (DBI)

I tried to compile a library for Windows to make available -- but was unable to get it to work. Therefore I asked CSS to install these two modules for me -- since I do not have administrative permission on CSS nodes.