CADD meeting 08-30-2016
![Page 1: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/1.jpg)
Compact representation of 3D macromolecular
structures from the PDB
![Page 2: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/2.jpg)
Presented by Yana Valasatava
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
![Page 3: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/3.jpg)
The PDB evolving complexity
PDB archive
> 30 GB
~250 MB in mmCIF format
Structural biology efforts meet the big-data era:
● Growing size: ~120K structures, with annual growth of ~10K structures
● Evolving complexity: growing compositional heterogeneity and size
● Increasing usage: > 300,000 users per month from over 160 countries
3J3Q
3J3Q has more than 1 million atoms
The PDB has more than 1 billion atoms
![Page 4: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/4.jpg)
Scalability issues
★ Interactive visualization
○ slow network transfer
○ slow parsing
○ slow rendering
★ Mobile visualization
○ limited bandwidth
○ limited memory
★ Large-scale structural analysis
○ slow repeated I/O
○ slow repeated parsing
![Page 5: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/5.jpg)
PDBx/mmCIF
Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes.
repetitive information
redundant annotations
inefficient representation
![Page 6: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/6.jpg)
PDB/MMTF
The MacroMolecular Transmission Format
MMTF has the following advantages:
❏ it occupies less space (less disk I/O)
❏ it is faster to read (no time-consuming string parsing)
❏ it contains precalculated information useful for structural analysis and visualisation (covalent bonds and bond orders)
Fields:
○ Format data (e.g. the version number of the specification)
○ Metadata (e.g. rFree and resolution)
○ Structure data (e.g. number of models, chains, groups, atoms)
○ Chain data (e.g. list of chain IDs, chain names)
○ Group data (e.g. list of group names, formal charges, bonds)
○ Atom data (e.g. B-factors, coordinates, occupancies)
https://github.com/rcsb/mmtf/blob/master/spec.md
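The field groups listed above are flat arrays plus counts, so a reader walks models, chains, groups and atoms by advancing indices. The following is a minimal sketch of that traversal, not the reference decoder. It assumes the field names used in the linked spec (`numModels`, `chainsPerModel`, `groupsPerChain`, `chainIdList`, `groupTypeList`, `groupList`, `xCoordList`); the toy structure and the helper `iter_atoms` are invented for illustration.

```python
# Toy, already-decoded structure. All values are invented for illustration.
structure = {
    "numModels": 1,
    "chainsPerModel": [1],
    "groupsPerChain": [2],
    "chainIdList": ["A"],
    "groupTypeList": [1, 0],  # per-group indices into groupList
    "groupList": [
        {"groupName": "GLY", "atomNameList": ["N", "CA"]},
        {"groupName": "ARG", "atomNameList": ["N", "CA", "C"]},
    ],
    "xCoordList": [14.699, 14.500, 13.762, 12.1, 11.4],
}


def iter_atoms(s):
    """Yield (model, chain id, group name, atom name, x) for every atom."""
    chain_i = group_i = atom_i = 0
    for model_i in range(s["numModels"]):
        for _ in range(s["chainsPerModel"][model_i]):
            chain_id = s["chainIdList"][chain_i]
            for _ in range(s["groupsPerChain"][chain_i]):
                # The group record is shared by every residue of that type.
                group = s["groupList"][s["groupTypeList"][group_i]]
                for atom_name in group["atomNameList"]:
                    x = s["xCoordList"][atom_i]
                    yield (model_i, chain_id, group["groupName"], atom_name, x)
                    atom_i += 1
                group_i += 1
            chain_i += 1


atoms = list(iter_atoms(structure))
for atom in atoms:
    print(atom)
```

Because per-atom arrays are indexed by a running counter, no string parsing is needed once the arrays are decoded.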
![Page 7: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/7.jpg)
MMTF compression pipeline
● extract structural data; calculate bonds, SSE
● integer encoding, dictionary encoding, run-length encoding, delta encoding, recursive indexing
● the binary container format of MMTF
● GZIP
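The last stages of the pipeline can be sketched as follows: already-encoded integer arrays are packed into a binary container and then GZIP-compressed. This is an illustration under stated assumptions, not the MMTF implementation. Here `struct` with big-endian 32-bit integers stands in for the real container (the MMTF spec uses MessagePack), and the helper names `pack_int32`, `unpack_int32`, `to_payload` and `from_payload` are hypothetical.

```python
import gzip
import struct


def pack_int32(values):
    """Pack integers as big-endian 32-bit values (stand-in container)."""
    return struct.pack(f">{len(values)}i", *values)


def unpack_int32(blob):
    """Inverse of pack_int32."""
    return list(struct.unpack(f">{len(blob) // 4}i", blob))


def to_payload(encoded_ints):
    """Binary packing followed by GZIP compression."""
    return gzip.compress(pack_int32(encoded_ints))


def from_payload(payload):
    """GZIP decompression followed by binary unpacking."""
    return unpack_int32(gzip.decompress(payload))


# Integer + delta encoded x coordinates of 14.699, 14.500, 13.762.
encoded_x = [14699, -199, -738]
payload = to_payload(encoded_x)
restored = from_payload(payload)
print(restored)
```

On a three-value array GZIP overhead dominates; the compression only pays off on the long, repetitive arrays of a full structure.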
![Page 8: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/8.jpg)
Compression pipeline: dictionary encoding
Group Id Symb. AtmId ResId ChainIds x, y, z coordinates (Å) Occ. B-factor
ATOM 1 N N ARG A 18 14.699 61.369 62.050 1.00 39.19
ATOM 2 C CA ARG A 18 14.500 62.241 60.856 1.00 38.35
ATOM 3 C C ARG A 18 13.762 61.516 59.729 1.00 36.05
{ "groupName": "ARG",
"singleLetterCode": "R",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": [ "N", "CA", "C" ],
"elementList": [ "N", "C", "C"] }
index: 1 (position of the ARG entry in the group dictionary)
SER-GLY-ARG-SER-SER
groupTypeList: [ 2, 0, 1, 2, 2 ]
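The dictionary encoding on this slide can be sketched as follows: each distinct group is stored once, and the sequence becomes a list of indices into that dictionary. This is a simplified illustration. It keys the dictionary on the group name only, whereas the slide's entry carries the full group record, and the dictionary order is an assumption chosen so that ARG sits at index 1 as on the slide. The function names are hypothetical.

```python
def dictionary_encode(values, dictionary):
    """Replace each value by its index in the dictionary."""
    index_of = {entry: i for i, entry in enumerate(dictionary)}
    return [index_of[v] for v in values]


def dictionary_decode(indices, dictionary):
    """Look each index up in the dictionary again."""
    return [dictionary[i] for i in indices]


# Assumed dictionary order: GLY -> 0, ARG -> 1, SER -> 2.
group_names = ["GLY", "ARG", "SER"]
sequence = ["SER", "GLY", "ARG", "SER", "SER"]

group_type_list = dictionary_encode(sequence, group_names)
print(group_type_list)  # [2, 0, 1, 2, 2]
print(dictionary_decode(group_type_list, group_names))
```

Atom names, elements and similar per-residue data are then stored once per group type rather than once per residue.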
![Page 9: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/9.jpg)
Compression pipeline: encodings
Group Id Symb. AtmId ResId ChainIds x, y, z coordinates (Å) Occ. B-factor
ATOM 1 N N ARG A 18 14.699 61.369 62.050 1.00 39.19
ATOM 2 C CA ARG A 18 14.500 62.241 60.856 1.00 38.35
ATOM 3 C C ARG A 18 13.762 61.516 59.729 1.00 36.05
Coordinates (integer + delta):
14.699 -> 14699
14.500 -> 14500 169
Atom Ids (delta + run-length):
1,2,3 -> 1,1,1 -> 1,3
integer encoding: floating-point numbers are multiplied by a fixed factor and stored as integers
run-length encoding: stretches of equal values are represented by the value itself and the occurrence count
delta encoding: differences (deltas) between consecutive numbers are stored
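The three encodings defined on this slide can be sketched in a few lines each. This is a minimal illustration rather than the MMTF library code. The function names are hypothetical, and the factor of 1000 is an assumption that matches three decimal places.

```python
def integer_encode(values, factor=1000):
    """Map floats to integers by scaling and rounding."""
    return [round(v * factor) for v in values]


def integer_decode(values, factor=1000):
    return [v / factor for v in values]


def delta_encode(values):
    """Store the first value, then differences between neighbours."""
    return [b - a for a, b in zip([0] + values, values)]


def delta_decode(deltas):
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values):
    """Store each run as a (value, count) pair in a flat list."""
    out = []
    for v in values:
        if out and out[-2] == v:
            out[-1] += 1
        else:
            out.extend([v, 1])
    return out


def run_length_decode(pairs):
    out = []
    for value, count in zip(pairs[0::2], pairs[1::2]):
        out.extend([value] * count)
    return out


# Atom Ids from the slide: delta, then run-length.
atom_ids = run_length_encode(delta_encode([1, 2, 3]))
print(atom_ids)  # [1, 3]

# x coordinates from the slide: integer, then delta.
x_encoded = delta_encode(integer_encode([14.699, 14.500, 13.762]))
print(x_encoded)  # [14699, -199, -738]
```

Delta encoding turns a steadily increasing column into a run of equal values, which is what makes the run-length step effective on atom Ids.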
![Page 10: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/10.jpg)
Compression pipeline: Recursive Indexing
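Array of 8-bit integer values, so the open interval is (-128, 127). Values inside the interval are stored as-is; a value at or beyond a limit is stored as repeated limit values followed by the remainder:
Recursive Indexing: [-50, -128, 7, 127, 268] -> [-50, -128, 0, 7, 127, 0, 127, 127, 14]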
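Recursive indexing for 8-bit storage can be sketched as follows. A value at or beyond a limit is emitted as repeated limit values, and the remainder (possibly 0) closes the sequence. This is a minimal illustration with hypothetical function names. The limits default to the 8-bit range from the slide.

```python
def recursive_index_encode(values, lo=-128, hi=127):
    """Split values that do not fit strictly between lo and hi."""
    out = []
    for v in values:
        while v >= hi:
            out.append(hi)
            v -= hi
        while v <= lo:
            out.append(lo)
            v -= lo
        out.append(v)  # remainder, strictly inside (lo, hi)
    return out


def recursive_index_decode(values, lo=-128, hi=127):
    """Sum limit values until a non-limit value ends the group."""
    out, acc = [], 0
    for v in values:
        acc += v
        if v != lo and v != hi:
            out.append(acc)
            acc = 0
    return out


encoded = recursive_index_encode([-50, -128, 7, 127, 268])
print(encoded)  # [-50, -128, 0, 7, 127, 0, 127, 127, 14]
print(recursive_index_decode(encoded))  # [-50, -128, 7, 127, 268]
```

The limit values themselves act as continuation markers, which is why 127 and -128 are each followed by a 0 in the encoded array.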
![Page 11: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/11.jpg)
Overview of data
Full format
• all atoms (useful for structural bioinformatics analysis)
• coordinates with 3 decimal place precision (no loss after decoding)
Reduced format
• C-alpha/phosphate backbone atoms and ligands (useful for visualisation and some structural bioinformatics)
• coordinates with 1 decimal place precision (almost a further 40 % reduction in size)
• exactly the same data structure as the full format (parsers work for both)
![Page 12: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/12.jpg)
MMTF size and parsing speed
* Parsing using Java libraries
![Page 13: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/13.jpg)
Using MMTF
To efficiently store, transmit, and visualize the 3D structures of biological macromolecules
To perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
![Page 14: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/14.jpg)
Presented by Anthony Bradley
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
![Page 16: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/16.jpg)
Goals
• Analysis should be easy and simple
• Whole archive analysis of the PDB should be trivial AND fast
• Big Data tools (e.g. Spark and Hadoop) are available
![Page 17: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/17.jpg)
mmtf-python
mmtf-java
Nobody should (have to) write their own parser. Ever.
![Page 18: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/18.jpg)
MMTF-Spark - Simple API
![Page 19: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/19.jpg)
Continued…..
![Page 20: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/20.jpg)
Data mining - speed advantage
![Page 21: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/21.jpg)
Contact finding
![Page 22: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/22.jpg)
Contact finding
![Page 23: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/23.jpg)
Pros and cons
Pros:
● Looping through the whole library performing simple analyses
● Simple to parallelize code
● Much more complete data
Cons:
● Tied to Java
● Not a magic unicorn
![Page 25: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/25.jpg)
Thanks!
• http://mmtf.rcsb.org/
• https://github.com/rcsb/mmtf-javascript
• https://github.com/rcsb/mmtf-java
• https://github.com/rcsb/mmtf-python
• http://spark.apache.org/
![Page 26: CADD meeting 08-30-2016](https://reader031.fdocuments.in/reader031/viewer/2022030301/587ff95d1a28ab3a1e8b5b27/html5/thumbnails/26.jpg)
Acknowledgements
NCI/NIH (U01 CA198942)