Protein database

Protein Database Unit-2nd Dr. Khalid Rehman Hakeem Department of Bioresources University of Kashmir


Protein Database
Unit-2nd

Dr. Khalid Rehman Hakeem
Department of Bioresources
University of Kashmir

Biores-111: Bio-informatics

Unit I
1. What is Bioinformatics? Overview of Bioinformatics.
2. Basic concepts and applications of bioinformatics.
3. Bio-informatics related to proteins, amino acids, DNA and RNA.
5. Sequence, structure and function.

Unit II
• Bioinformatics databases: introduction and types of databases. Nucleotide sequence databases; primary nucleotide sequence databases, viz. EMBL, GenBank, DDBJ
• Protein sequence databases, viz. SwissProt/TrEMBL, PIR
• Sequence motif databases: Pfam and PROSITE
• Protein structure databases: Protein Data Bank; SCOP; CATH
• Other relevant databases, e.g. KEGG

Unit III
• Single sequence and pair-wise alignments
• Scoring matrices: PAM and BLOSUM; dot plots
• Heuristic methods: FASTA and BLAST
• Statistics of sequence alignment score: E-value; P-value
• Multiple sequence alignments: ClustalW, PSI-BLAST

Unit IV
• Human Genome Project
• GenBank SNPs, GOG, STSs, and ESTs databases
• Genomic data mining
• Understanding of the principles of the microarray technique, limitations of microarrays and problems associated with the technique
• Brief introduction to PERL language

Proteome Bioinformatics/Databases

1. Protein sequence databases - SWISS-PROT; TrEMBL

2. Nucleotide sequence databases - Hidden treasures: EST

3. Pattern and profile databases - PROSITE; BLOCKS; PRINTS; Pfam

4. 2D-PAGE databases - SWISS-2DPAGE; WORLD-2DPAGE

5. 3D structural databases - PDB; DSSP; HSSP


6. Post-translational modification databases - O-GLYCBASE

7. Genomic databases - OMIM; GDB; MGD; FlyBase

8. Metabolic databases - ENZYME; KEGG; EMP/WIT

9. Interfacing and Integrating Databases - ExPASy; SWISS-PROT; Cyber-Encyclopaedia of the Proteome

Protein Database

• UniProt - protein knowledge database
• SWISS-2DPAGE - 2D PAGE
• Pfam - protein families and domains
• PROSITE - protein families and domains
• SMART - protein modules
• BLOCKS - protein conserved regions

The Pfam database is one of the most important collections of information in the world for classifying proteins. The database categorises 75 per cent of known proteins to form a library of protein families - a 'periodic table' of biology.

The open access resource was established at the Wellcome Trust Sanger Institute in 1998. Its vision is to provide a tool which allows experimental, computational and evolutionary biologists to classify protein sequences and answer questions about what they do and how they have evolved. The Pfam project is led by Dr Alex Bateman at the Sanger Institute.

Each entry in the Pfam database includes a protein sequence alignment as well as an accompanying statistical model, called a hidden Markov model (HMM).


Pfam is a large collection of multiple sequence alignments and hidden Markov models (HMMs) covering many common protein families.

Pfam version 26.0 (November 2011) contains alignments and models for 13,672 protein families, based on the Swiss-Prot and SP-TrEMBL protein sequence databases.

HMM: A Hidden Markov Model

A hidden Markov model, or HMM, is a statistical model for any system that can be represented as a succession of transitions between discrete states.
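The definition above can be made concrete with a small sketch. The two states ("buried", "exposed"), the two observable residue classes ("H" for hydrophobic, "P" for polar) and all probabilities below are invented for illustration and are not taken from Pfam. Pfam's profile HMMs are much larger and are built from sequence alignments. The function uses the standard forward algorithm to compute how probable an observed sequence is under the model.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) under the HMM, summed over all hidden state paths."""
    if not observations:
        raise ValueError("need at least one observation")
    # alpha[s] = probability of the observations seen so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Toy model: every name and number here is a made-up assumption.
STATES = ("buried", "exposed")
START = {"buried": 0.5, "exposed": 0.5}
TRANS = {
    "buried": {"buried": 0.8, "exposed": 0.2},
    "exposed": {"buried": 0.3, "exposed": 0.7},
}
EMIT = {
    "buried": {"H": 0.7, "P": 0.3},   # buried state mostly emits hydrophobic residues
    "exposed": {"H": 0.2, "P": 0.8},  # exposed state mostly emits polar residues
}

if __name__ == "__main__":
    for seq in ("HHHH", "HPHP", "PPPP"):
        print(seq, forward(seq, STATES, START, TRANS, EMIT))
```

The states are "hidden" because only the H/P symbols are observed. The forward algorithm sums over every possible state path instead of committing to one.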

• Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. The identification of domains that occur within proteins can therefore provide insights into their function.

• There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high-quality, manually curated families. Although these Pfam-A entries cover a large proportion of the sequences in the underlying sequence database, Pfam also generates a supplement using the ADDA (Automatic Domain Decomposition Algorithm) database to give more comprehensive coverage of known proteins. These automatically generated entries are called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

• Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

Biotin Synthase

Biotin synthase (BioB) converts dethiobiotin into biotin by inserting a sulfur atom between C6 and C9 of dethiobiotin in an S-adenosylmethionine (SAM)-dependent reaction.

Reaction steps common in radical SAM enzymes

PROSITE is a protein database. It consists of entries describing protein families, domains and functional sites, together with the amino acid patterns, signatures and profiles that identify them. The entries are manually curated by a team at the Swiss Institute of Bioinformatics and are tightly integrated into Swiss-Prot protein annotation.

PROSITE was created in 1988 by Amos Bairoch, who directed the group for more than 20 years. Since July 2009 the director of the PROSITE, Swiss-Prot and Vital-IT groups is Ioannis Xenarios.

PROSITE's uses include identifying possible functions of newly discovered proteins and analysis of known proteins for previously undetermined activity. Properties from well-studied genes can be propagated to biologically related organisms, and for different or poorly known genes biochemical functions can be predicted from similarities.

PROSITE offers tools for protein sequence analysis and motif detection. It is part of the ExPASy proteomics analysis servers.
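Motif detection with a PROSITE-style pattern can be sketched as a translation into a regular expression. The example uses the N-glycosylation site pattern N-{P}-[ST]-{P}. The function names and the test sequence are invented for this illustration, and only the common pattern elements are handled (x, [..], {..}, repeat counts, and the < and > anchors). This is not the scanning code PROSITE itself uses.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")   # '<' pins the match to the N-terminus
        anchor_end = element.endswith(">")       # '>' pins the match to the C-terminus
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                            # x = any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"        # {P} = any residue except P
        # [ST] and single letters are already valid regex syntax.
        piece = core + ("{" + repeat + "}" if repeat else "")
        parts.append(("^" if anchor_start else "") + piece + ("$" if anchor_end else ""))
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (1-based position, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    toy_sequence = "MKNASAANPSGNGTK"  # made-up sequence, not a real protein
    print(prosite_to_regex("N-{P}-[ST]-{P}"))              # N[^P][ST][^P]
    print(find_matches("N-{P}-[ST]-{P}", toy_sequence))    # [(3, 'NASA'), (12, 'NGTK')]
```

The asparagine at position 8 is skipped because it is followed by proline, which the {P} element excludes.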


SMART: Simple Modular Architecture Research Tool

KEGG: Kyoto Encyclopedia of Genes and Genomes

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The PATHWAY database records networks of molecular interactions in cells, along with variants of them specific to particular organisms. In July 2011 KEGG switched to a subscription model, and access via FTP is no longer free.

KEGG was initiated by the Japanese human genome programme in 1995. Its developers consider KEGG to be a "computer representation" of the biological system. The KEGG database can be used for modeling and simulation, and for browsing and retrieval of data. It is a part of the systems biology approach.

KEGG maintains five main databases:
• KEGG Atlas
• KEGG Pathway
• KEGG Genes
• KEGG Ligand
• KEGG BRITE

Purpose

Developed at the Kanehisa Laboratory.

Integrates:
• current knowledge of molecular interaction networks
• information about genes and proteins
• information about chemical compounds and reactions
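The three kinds of information listed above can be combined in a small sketch that stores a pathway as a directed graph. Compounds are nodes and each reaction is an edge labelled with the enzyme that catalyses it. The first three steps of glycolysis serve as the example. The data layout and function names are assumptions made for this illustration and do not reflect KEGG's own file formats or API. Each reaction is treated as one-directional to keep the sketch short.

```python
from collections import deque

# (substrate, enzyme, product): the first three steps of glycolysis.
REACTIONS = [
    ("glucose", "hexokinase", "glucose 6-phosphate"),
    ("glucose 6-phosphate", "glucose-6-phosphate isomerase", "fructose 6-phosphate"),
    ("fructose 6-phosphate", "6-phosphofructokinase", "fructose 1,6-bisphosphate"),
]


def build_graph(reactions):
    """Map each substrate to the (enzyme, product) pairs reachable from it."""
    graph = {}
    for substrate, enzyme, product in reactions:
        graph.setdefault(substrate, []).append((enzyme, product))
    return graph


def find_route(graph, start, goal):
    """Breadth-first search; returns the (enzyme, product) steps, or None if unreachable."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        compound, steps = queue.popleft()
        if compound == goal:
            return steps
        for enzyme, product in graph.get(compound, []):
            if product not in seen:
                seen.add(product)
                queue.append((product, steps + [(enzyme, product)]))
    return None


GRAPH = build_graph(REACTIONS)

if __name__ == "__main__":
    for enzyme, product in find_route(GRAPH, "glucose", "fructose 1,6-bisphosphate"):
        print(f"{enzyme} -> {product}")
```

Asking which enzymes connect two compounds then becomes an ordinary graph search, the kind of browsing and retrieval a pathway database supports.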

THANK YOU