Conceptual basis for critical thinking, data analysis and problem solving

Transcript of Conceptual basis for critical thinking, data analysis and problem solving

Page 1: Conceptual basis for critical thinking, data analysis and problem solving

EMBL-EBI

Conceptual basis for critical thinking, data analysis and problem solving

(and I don’t know what this is either!)

STRATEGY

Page 2: Conceptual basis for critical thinking, data analysis and problem solving

Challenges for bioinformatics

With the sequence/structure deficit, the challenges are to:
- rationalise the mass of sequence data
- derive more efficient means of data storage
- design more reliable analysis tools

Imperative - to convert sequence information into biochemical & biophysical knowledge

Page 3: Conceptual basis for critical thinking, data analysis and problem solving

What we cannot do well

“Give us sequence, we do rest”

Page 4: Conceptual basis for critical thinking, data analysis and problem solving

Page 5: Conceptual basis for critical thinking, data analysis and problem solving

Page 6: Conceptual basis for critical thinking, data analysis and problem solving

What is the function of this structure?

What is the function of this sequence?

What is the function of this motif?

The fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions. Knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level.

Page 7: Conceptual basis for critical thinking, data analysis and problem solving

Complication – Multiprotein Complexes

Page 8: Conceptual basis for critical thinking, data analysis and problem solving

1H8E (ADP.ALF4)2(ADP.SO4) BOVINE F1-ATPASE (ALL THREE CATALYTIC SITES OCCUPIED)
MENZ, R.I., WALKER, J.E., LESLIE, A.G.W.

ATPase

Page 9: Conceptual basis for critical thinking, data analysis and problem solving

1NT9 COMPLETE 12-SUBUNIT RNA POLYMERASE II
ARMACHE, K.-J., KETTENBERGER, H., CRAMER, P.

Multiprotein transcription complexes - RNA Polymerase

P. Cramer et al., Science 288, 640 (2000)

Page 10: Conceptual basis for critical thinking, data analysis and problem solving

STRING: a database of predicted functional associations between proteins. von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B

http://string.embl.de/

Prolinks: a database of protein functional linkages derived from coevolution. P.M. Bowers, M. Pellegrini, M.J. Thompson, J. Fierro, T.O. Yeates, D. Eisenberg

http://dip.doe-mbi.ucla.edu/pronav (?)

Page 11: Conceptual basis for critical thinking, data analysis and problem solving

Ground rules for bioinformatics

Don't always believe what programs tell you - they're often misleading & sometimes wrong!

Don't always believe what databases tell you - they're often misleading & sometimes wrong!

Don't always believe what lecturers tell you - they're often misleading & sometimes wrong!

In short, don't be a naive user

When computers are applied to biology, it is vital to understand the difference between mathematical & biological significance

Computers don’t do biology - they do sums quickly!

Page 12: Conceptual basis for critical thinking, data analysis and problem solving

General Evaluation Criteria

Be sceptical and cynical!

When you are searching for information you need to judge its quality and suitability.

Think critically about each piece of information you find and how you found it.

Relevance: Does the information you have found adequately support your research? Does it answer the question, or support one of your arguments? How general or specific is the information about the topic?

Page 13: Conceptual basis for critical thinking, data analysis and problem solving

Building a search protocol

- The usual starting point: searching the primary data sources (NRDB, SPTR, etc.)
- Pattern recognition methods: searching the secondary sources (patterns, profiles, blocks, fingerprints & HMMs)
- Estimating significance: when do we believe a result?

Page 14: Conceptual basis for critical thinking, data analysis and problem solving

A central goal is to predict protein function from sequence

Given a sequence, we want to know:
- what is my protein?
- to what family does it belong?
- what is its function?
- how can we explain its function in structural terms?

By searching pattern dbs & fold libraries, we may recognise patterns that allow us to infer relationships with previously-characterised families & folds

Given the variety of dbs to search, how do we use them to build a sensible search protocol?

Page 15: Conceptual basis for critical thinking, data analysis and problem solving

Planning a database Search

To find various aspects of your query sequence, you may have to search a number of databases.

1. Identify the sequence
Search for a matching or similar sequence using a 'BLAST' program.

2. Find related sequences
(a) For a protein sequence, find the mRNA sequence that produces the protein, and the DNA sequence that codes for the mRNA.
(b) For an mRNA sequence, find the protein it produces, and the DNA sequence that codes for the mRNA.
(c) For a DNA sequence, find the mRNA it is transcribed into, and the protein that the mRNA produces.

Page 16: Conceptual basis for critical thinking, data analysis and problem solving

3. If the sequence is from a protein, find a structural image.

4. Research the functionality of the sequence:
(a) What is its function in different tissues (homology)?
(b) What is its function in different organisms (phylogeny)?
(c) Are there any mutations, and what are their consequences?
(d) What is the role of the protein in cell function?
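
Step 2 of the plan relies on the DNA -> mRNA -> protein relationships. The following is a minimal sketch of those two conversions, assuming the standard genetic code, a coding-strand DNA input and reading frame 0. The function names `transcribe` and `translate` are illustrative and not taken from the slides; a real search would follow database cross-references rather than recompute sequences.

```python
# Standard genetic code, packed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA in frame 0, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unrecognised codon
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")  # made-up example sequence
    print(mrna, translate(mrna))
```

The sketch ignores introns, alternative reading frames and organism-specific codon tables, all of which matter when relating real database records to each other.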

Page 17: Conceptual basis for critical thinking, data analysis and problem solving

Protein sequence database identity search
e.g., for short fragments, pinpoints identical matches to probe - may identify correct reading frame

Protein sequence database similarity search
e.g., nrdb, OWL, SP+SPTrEMBL - identifies homologues to probe

Protein pattern database search
e.g., PROSITE, profiles, PRINTS, BLOCKS, Pfam - identifies family relationships or pinpoints key structural or functional sites

Known structure: structure classification database query / library search
e.g., scop, CATH, FSSP - provides details of ligand-binding, etc.

Unknown structure: protein fold pattern search
e.g., threading - identifies compatible structural class

Page 18: Conceptual basis for critical thinking, data analysis and problem solving
Page 19: Conceptual basis for critical thinking, data analysis and problem solving

iGAP

http://eol.sdsc.edu

Page 20: Conceptual basis for critical thinking, data analysis and problem solving

[Figure: iGAP annotation pipeline (Steps 1 to 6), drawing on structure info (SCOP, PDB) and sequence info (NR, PFAM), with a Data Warehouse]

Input: protein sequences

Pipeline stages shown in the figure:
- Prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS) and low complexity regions (SEG)
- Structural assignment of domains by PSI-BLAST profiles on FOLDLIB
- Structural assignment of domains by 123D on FOLDLIB
- Structural assignment of domains by WU-BLAST
- Functional assignment by PFAM, NR assignments
- Domain location prediction by sequence

Building FOLDLIB:
- PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP
- 90% sequence non-identical
- minimum size 25 aa
- coverage (90%, gaps <30, ends <30)

Page 21: Conceptual basis for critical thinking, data analysis and problem solving

http://harvester.embl.de/

“Harvester” collects information from selected public databases

Page 22: Conceptual basis for critical thinking, data analysis and problem solving

Similarity searching

Whether or not an identity search finds a match, the next step is to look for similar sequences, e.g., you may wish to know if a wider family exists

The most rapid option is to use BLAST & variants and look for:
- high scores with low P-values (unlikely to be random)
- clusters of high scores at the top of the hitlist (a family?)
- trends in the type of sequences matched

Use a composite database, e.g., UNIPROT
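
To make "high scores with low P-values" concrete, here is a minimal sketch of the Karlin-Altschul relationship between an alignment score, the expected number of chance hits (E) and the P-value. The default `lam` and `k` values are placeholders of the order reported for gapped protein searches; real values depend on the scoring matrix and gap penalties and should be read from the search output. The function names are illustrative only.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """E = K * m * n * exp(-lambda * S): chance hits expected at score S or better.

    lam and k are placeholder statistical parameters (an assumption of this sketch).
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Hypothetical search: 300-residue query against 50 million database residues.
    for score in (60, 100, 150):
        e = expected_hits(score, 300, 50_000_000)
        print(f"score={score:4d}  E={e:.3g}  P={p_value(e):.3g}")
```

A low P-value only says a match is unlikely to arise by chance under the statistical model; whether the match is biologically meaningful still has to be judged separately.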

Page 23: Conceptual basis for critical thinking, data analysis and problem solving

Structural & functional interpretation

A db search often does little more than identify a protein family. This only scratches the surface - we still want to know what our protein does & what it might look like

The first step is to examine the detailed family entry in InterPro - this may help to elucidate function

The next step is to examine the fold classification & structure summary resources, e.g., SCOP, CATH

Page 24: Conceptual basis for critical thinking, data analysis and problem solving

Gene prediction, structure & function prediction are non-trivial
- structure & function prediction tools are, at best, 70% accurate

What are the lessons for sequence analysis?
- when searching for distant homologues, several dbs should be searched
- different methods provide different perspectives
- dbs aren’t complete & their contents don’t fully overlap

The more dbs searched, the more difficult it can be to interpret results
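
One simple way to keep results from several databases interpretable is to tabulate how many independent sources support each assignment. The sketch below assumes each search has already been reduced to a list of family names; the database names come from the slides, but the hits themselves are made up for illustration.

```python
from collections import Counter


def combine_hits(hits_by_db):
    """Count how many databases report each family, most supported first."""
    support = Counter()
    for families in hits_by_db.values():
        support.update(set(families))  # count each database at most once per family
    return support.most_common()


if __name__ == "__main__":
    # Hypothetical results for one query sequence.
    hits = {
        "PROSITE": ["kinase"],
        "PRINTS": ["kinase", "SH2"],
        "Pfam": ["kinase", "SH2", "SH3"],
    }
    for family, n_dbs in combine_hits(hits):
        print(f"{family}: supported by {n_dbs} of {len(hits)} databases")
```

Agreement between methods raises confidence but does not prove correctness, since databases may share underlying data and so share errors.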

Page 25: Conceptual basis for critical thinking, data analysis and problem solving

Thinking about your Topic

Can you identify what you already know about the topic, and what you do not know?

Can you create questions based on these knowledge 'gaps', that is, can you identify your information needs?

What do you need to know about your protein sequence?

Develop a concept map to organise your ideas and structure your approach to the topic.

Discuss your topic with others.

Page 26: Conceptual basis for critical thinking, data analysis and problem solving

Identifying the Type of Information you need

As well as thinking about your topic, you need to consider the type of information you will need.

Which information tools are best suited to your inquiry?

How much information do you need - to what degree of detail?

Page 27: Conceptual basis for critical thinking, data analysis and problem solving

Appreciate how difficult it is to draw a complex 3-D object and appreciate the complexity of the requirements for storing sequence and structural information of molecules in a database.

There are a lot of interrelated pieces of information about a biomolecule, such as:
- sequence similarities
- genome location
- protein structure
- expression
- chemistry

Page 28: Conceptual basis for critical thinking, data analysis and problem solving

Not all information on a molecule or sequence will be found in one record, nor even in one database.

Be prepared to search in several databases for information on your query sequence

As different organisations create databases to suit their own purposes, there will not be a great deal of similarity between these databases.

Page 29: Conceptual basis for critical thinking, data analysis and problem solving

Some of the obstacles of searching databases are:
- Data formats are not standard.
- The nomenclature is not standard.
- There is more than one database offering the same information (data redundancy).
- Links between databases may not be easy to follow.
- The number of databases available makes it confusing to choose between them.

Page 30: Conceptual basis for critical thinking, data analysis and problem solving

Once you have found some information on your query sequence, you will find a new focus for your research from this information.

Through exploring any linked text in the databases:-

Page 31: Conceptual basis for critical thinking, data analysis and problem solving

What function does the protein/mRNA/DNA have?

Do mutations occur and what are their effects?

Does it play a role in disease?

Homologies: Does it have the same function in different tissues?

Phylogenies: Does it have the same function in different organisms?

What role does structure play in the protein's function?

Does it have a similar function to other molecules with similar structure?

Page 32: Conceptual basis for critical thinking, data analysis and problem solving

Pitfalls of searching databases

Remember that you are looking for information about a molecule, not database records.

- Duplication of information (even within the same database)
- Links that are not always intuitive (or self-explanatory)
- Nomenclature that is not always standard

Page 33: Conceptual basis for critical thinking, data analysis and problem solving

Accuracy or Validity

You need to determine whether the information is reliable or not

Page 34: Conceptual basis for critical thinking, data analysis and problem solving

Quality Control Issues

The quality of archived data is no better than the data determined in the contributing laboratories.

Curation of the data can help to identify errors.

Disagreement between duplicate determinations is a clear warning of an error in one or the other.

Similarly, results that disagree with established principles may contain errors.

It is useful, for instance, to flag deviations from expected stereochemistry in protein structures, but such "outliers" are not necessarily wrong.

Page 35: Conceptual basis for critical thinking, data analysis and problem solving

Data quality

- Data consistency
- Data models
- Reliability: evidence? level of confidence?
- Assignment of function by similarity: a recursive process, with propagation of errors
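
The propagation of errors can be illustrated with a deliberately simple model: if each transfer of an annotation by similarity is correct with probability p, and transfers are treated as independent (an assumption made only for this sketch), then an annotation that has passed through a chain of n transfers is correct with probability p to the power n.

```python
def chain_accuracy(per_step_accuracy, n_transfers):
    """Probability that an annotation survives n independent transfers intact."""
    return per_step_accuracy ** n_transfers


if __name__ == "__main__":
    # Illustrative per-step accuracy; not a measured value.
    for n in range(1, 5):
        print(n, round(chain_accuracy(0.9, n), 3))
```

Even with 90% accuracy per step, confidence drops below 75% after three transfers, which is why the evidence behind an annotation matters as much as the annotation itself.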

Page 36: Conceptual basis for critical thinking, data analysis and problem solving

Data quality

It’s hard to judge whether something “makes sense”.

The lack of labeling on many web pages makes it hard to know the source.

Calculations based on databases are even harder to deal with

Logical deductions may be worse.

“tacR gene regulates the human nervous system”

“tacQ gene is similar to tacR but is found in E. coli”

“so tacQ gene regulates the E. coli nervous system”

Page 37: Conceptual basis for critical thinking, data analysis and problem solving

E. coli nervous system

Who spotted it?

Page 38: Conceptual basis for critical thinking, data analysis and problem solving

Evaluating database records

In order for your research to be reliable, you must use reliable sources of information.

It is important to evaluate the information you find in databases as you would any other type of information.

In the case of sequencing research, however, peer review does not necessarily happen prior to publication.

Page 39: Conceptual basis for critical thinking, data analysis and problem solving

Significance

Appreciating that mathematical & biological significance are different is crucial

Important in understanding the limitations of:
- database search algorithms
- multiple sequence alignment algorithms
- pattern recognition techniques
- functional site & structure prediction tools

Contrary to popular opinion, there is currently still:
- no biologically-reliable automatic multiple alignment algorithm
- no infallible pattern-recognition technique
- no reliable gene, function or structure prediction algorithm

Page 40: Conceptual basis for critical thinking, data analysis and problem solving

Summary

- Difficult questions on big data
- Data and Information
- Database and Databanks
- Organise the data to provide a service
- Visualization and Rendering
- Keep it up-to-date
- Provide a means to ask questions
- Provide a useful service to a large and diverse scientific field

Page 41: Conceptual basis for critical thinking, data analysis and problem solving

Data & Information

Data: a collection of facts, e.g. X-ordinate, B-value, sequence

Information: acquired knowledge
- Data within a scientific “context”
- Meaning of the data
- Sequence/structure alignment

Page 42: Conceptual basis for critical thinking, data analysis and problem solving

Databases & Databanks

Databank: a (usually large) collection of data

Database: a (usually large) set of data organized to allow rapid retrieval of information
- Organized for a reason
- Rapid retrieval: human short term memory is ~5 seconds

Page 43: Conceptual basis for critical thinking, data analysis and problem solving

WHAT IS THE PDB?

Page 44: Conceptual basis for critical thinking, data analysis and problem solving

Databanks and Databases

The PDB Archive is a “databank”: a series of flat files that have a format originally designed for Fortran card readers.

The MSD, RCSB, and PDBj provide “databases”: collections of data (1000’s of attributes) organized into relational tables and held in an RDBMS.
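
Holding data in relational tables is what makes structured questions possible. The following sketch uses Python's built-in `sqlite3` module with a made-up `entry` table (the table name, columns, identifiers and values are all hypothetical and do not reflect the real MSD, RCSB or PDBj schemas). It shows a query that selects records under a constraint, one that selects specific fields, and one whose answer is computed at query time rather than stored in any record.

```python
import sqlite3

# In-memory database with a hypothetical, much simplified table of entries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry (id TEXT PRIMARY KEY, title TEXT, "
    "resolution REAL, n_chains INTEGER)"
)
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?, ?)",
    [
        ("X001", "example kinase", 1.8, 2),
        ("X002", "example protease", 2.5, 1),
        ("X003", "example ATPase complex", 3.0, 9),
    ],
)

# Select whole records that satisfy a constraint.
simple = conn.execute("SELECT * FROM entry WHERE resolution < 2.0").fetchall()

# Select specific fields of some records, with a constraint on another field.
fields = conn.execute(
    "SELECT id, title FROM entry WHERE n_chains > 1 ORDER BY id"
).fetchall()

# Return an answer that is computed on demand and is not part of any record.
live = conn.execute(
    "SELECT AVG(resolution) FROM entry WHERE n_chains > 1"
).fetchone()[0]

print(simple)
print(fields)
print(live)
```

The same questions asked of a flat-file databank would require parsing every file, which is the practical difference between the two kinds of resource.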

Page 45: Conceptual basis for critical thinking, data analysis and problem solving

Page 46: Conceptual basis for critical thinking, data analysis and problem solving

Data & information

ATOM   2567  N   PHE B 175       7.821 -25.530 -22.848  1.00  8.71
ATOM   2568  CA  PHE B 175       8.845 -25.172 -21.877  1.00  9.41
ATOM   2569  C   PHE B 175       9.449 -23.798 -22.169  1.00 10.02
ATOM   2570  O   PHE B 175      10.664 -23.613 -22.103  1.00 10.37
ATOM   2571  CB  PHE B 175       9.928 -26.251 -21.848  1.00  9.53
ATOM   2572  CG  PHE B 175      10.969 -26.137 -22.982  1.00 10.03
ATOM   2573  CD1 PHE B 175      12.356 -25.819 -22.988  1.00 10.51
ATOM   2574  CD2 PHE B 175      11.725 -27.211 -23.402  1.00 10.25
ATOM   2575  CE1 PHE B 175      11.821 -27.095 -22.869  1.00 11.17
ATOM   2576  CE2 PHE B 175      12.282 -26.086 -24.008  1.00 10.95
ATOM   2577  CZ  PHE B 175      10.953 -26.335 -23.622  1.00 11.38
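
As a small illustration of turning such data into information, the sketch below parses two of the ATOM records and computes the distance between the backbone N and CA atoms. It splits fields on whitespace for brevity, which is an assumption of this sketch: real PDB files are fixed-column, and whitespace splitting fails when adjacent fields run together. The function names are illustrative.

```python
import math

# Two records copied from the slide, written with single spaces between fields.
RECORDS = [
    "ATOM 2567 N PHE B 175 7.821 -25.530 -22.848 1.00 8.71",
    "ATOM 2568 CA PHE B 175 8.845 -25.172 -21.877 1.00 9.41",
]


def parse_atom(line):
    """Split a whitespace-separated ATOM record into named fields."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "name": f[2],
        "residue": f[3],
        "chain": f[4],
        "res_seq": int(f[5]),
        "xyz": (float(f[6]), float(f[7]), float(f[8])),
        "occupancy": float(f[9]),
        "b_factor": float(f[10]),
    }


def distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms, in Angstroms."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


if __name__ == "__main__":
    n_atom, ca_atom = (parse_atom(line) for line in RECORDS)
    print(f"{n_atom['name']}-{ca_atom['name']} distance: "
          f"{distance(n_atom, ca_atom):.2f} A")
```

The coordinates alone are data; the roughly 1.46 Angstrom separation, recognisable as a covalent N-CA bond, is information obtained by placing them in a chemical context.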

Page 47: Conceptual basis for critical thinking, data analysis and problem solving

http://oca.ebi.ac.uk/oca-docs/oca-home.html
http://srs.ebi.ac.uk/

http://www.rcsb.org/pdb/
http://www.ebi.ac.uk/msd/
http://www.pdbj.org/

Page 48: Conceptual basis for critical thinking, data analysis and problem solving

wwPDB are service providers

We provide a service to the scientific community

- 24/7 (almost): parallel DB with fail-over, etc.
- Service “ping” baseline check several times/day
- Data is incremented with new data weekly
- Systems are extensible

Page 49: Conceptual basis for critical thinking, data analysis and problem solving

Query capabilities

- Browsing (click and read)
- Simple search: select records with some constraints
- More elaborate search: select specific fields of some records with constraints on some fields
- Complex querying: ability to return an answer that results from a "live" computation, and was not part of any record of the database

Page 50: Conceptual basis for critical thinking, data analysis and problem solving

Interfaces

User interfaces:
- user-friendly
- convenient browsing
- intuitive query forms
- visualization (graphical output)

Programmatic interfaces - communication with external programs:
- other databases (concept of distributed database)
- analysis tools

Page 51: Conceptual basis for critical thinking, data analysis and problem solving

Annotation Issues

Page 52: Conceptual basis for critical thinking, data analysis and problem solving

Annotation

Problem: the flow of available data is increasing exponentially

Strategies:
- internal curators
- selected external experts
- public submission
- computer-based extraction of information from biological texts

Page 53: Conceptual basis for critical thinking, data analysis and problem solving

Annotation of the data

Annotation is a weak component of the enterprise.

Automation of annotation is possible only to a limited extent, and getting annotation right remains labor-intensive.

However, the importance of proper annotation should not be underestimated.

P. Bork has commented that, for people interested in analysing the protein sequences implicit in genome sequence information, errors in gene assignment corrupt the high quality of the sequence data.

Page 54: Conceptual basis for critical thinking, data analysis and problem solving

Distributed Annotation

A possible solution is a distributed and dynamic error-correction and annotation process.

The workload must be distributed because databank staff have neither the time nor the expertise for the job; specialists will have to act as curators.

Progress in automation of annotation and error identification/correction will permit re-annotation of databanks.

Page 55: Conceptual basis for critical thinking, data analysis and problem solving

New Problems

As a result, we will have to give up the "safe" idea of a stable databank composed of entries that are correct when they are first distributed in mature form and stay fixed thereafter.

Databanks are dynamic in information content, growing in size, and maturing in quality.

Maintaining local copies, largely by “top up”, is not sufficient.

Proliferation of various copies in various states with out-of-date linkages

Page 56: Conceptual basis for critical thinking, data analysis and problem solving

The more computers are involved in automating genome annotation, the greater the need for collaboration with biologists

The more data we have to handle, the more rigorous we must be in our thinking (& writing) if we are to make sense of the complexities

We are still a long way from having reliable tools for deducing protein function from sequence

but with the right approach, there is hope

Page 57: Conceptual basis for critical thinking, data analysis and problem solving

Conclusion

What can you do with bioinformatics?

Not much without intervention; however, a lot if you know how to apply it right!

Page 58: Conceptual basis for critical thinking, data analysis and problem solving

Referencing - and Plagiarism

http://www.library.cqu.edu.au/chemcompass/index.htm

Terri Attwood
School of Biological Sciences
University of Manchester, Oxford Road
Manchester M13 9PT, UK
http://www.bioinf.man.ac.uk/dbbrowser/

Page 59: Conceptual basis for critical thinking, data analysis and problem solving

http://www.vts.rdn.ac.uk/tutorial/biores