Abstracts of main servers in CASP11 presented by Chao Wang.

Official Ranking by SUM Z-score (>-2.0)

• We focus on
– Zhang-Server
– QUARK
– BAKER-ROSETTASERVER
– RaptorX
– MULTICOM-CLUSTER
– Pcons

• We don’t focus on
– HHpred
– ZHOU-SPARKS-X
– MUFOLD-Server

• No HHpred and ZHOU-SPARKS-X abstracts in proceedings of CASP11

• MUFOLD:

• formulates the structure prediction problem as a graph realization problem

• employs the multi-dimensional scaling (MDS) technique

– Cut the sequence into different segments and generate distance matrices of the blocks.

– Cluster distance matrices on each block.

– Recombine these cluster centers of each block to generate new distance matrices and filter out some poor distance matrices by a set of criteria such as triangle law.

– Generate new structures according to the sampled distance matrices.

– Use a consensus method to select best candidates from the new structures.
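The filtering step above mentions the triangle law as one criterion for discarding poor distance matrices. A minimal sketch of such a check follows. It is not MUFOLD's code: the function names, the tolerance `tol`, and the toy 3-residue matrices are assumptions made for illustration.

```python
from itertools import combinations

def satisfies_triangle_law(dist, tol=1e-6):
    """Return True if every triple of points obeys the triangle inequality."""
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        a, b, c = dist[i][j], dist[j][k], dist[i][k]
        # No side of the triangle may be longer than the other two combined.
        if a > b + c + tol or b > a + c + tol or c > a + b + tol:
            return False
    return True

def filter_distance_matrices(candidates, tol=1e-6):
    """Keep only the candidate matrices that pass the triangle check."""
    return [d for d in candidates if satisfies_triangle_law(d, tol)]

# Toy symmetric matrices (hypothetical values, in angstroms).
good = [[0.0, 3.8, 5.5], [3.8, 0.0, 3.8], [5.5, 3.8, 0.0]]
bad = [[0.0, 3.8, 12.0], [3.8, 0.0, 3.8], [12.0, 3.8, 0.0]]
kept = filter_distance_matrices([good, bad])
```

The check is cubic in the number of residues, which is acceptable for short blocks; a real pipeline would presumably combine it with the other criteria the slide alludes to.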

Zhang-Server

• based on the I-TASSER pipeline

• In addition to the classic I-TASSER pipeline, several approaches were recently developed and integrated into I-TASSER to enhance its ability of structure modeling for distant-homology targets.

• First, the top models generated by the QUARK ab initio folding were merged into the threading template pool, which were used as the starting conformations of I-TASSER simulations.

• Second, since the hard targets generally lack global templates, the sequences were broken into segments of 2-4 consecutive secondary structure elements, which were then threaded through the PDB by the segmental threading tool SEGMER to identify super-secondary structure motifs.

• Third, SVM-SEQ and SPcon (Shen et al, in preparation) are used to generate residue contact maps.

• For multiple-domain proteins, ThreaDom was used to predict the domain boundary and linker regions.
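The segmentation of hard targets into 2-4 consecutive secondary structure elements (SSEs) can be sketched as below. This is an illustration only, not SEGMER's code: the three-state string input (H, E, C), the sliding-window enumeration with overlapping segments, and the function names are all assumptions.

```python
from itertools import groupby

def secondary_structure_elements(ss):
    """Return (type, start, end) for each helix (H) or strand (E) run; end is exclusive."""
    elements = []
    pos = 0
    for state, run in groupby(ss):
        length = len(list(run))
        if state in "HE":
            elements.append((state, pos, pos + length))
        pos += length
    return elements

def sse_segments(ss, min_sse=2, max_sse=4):
    """Enumerate sequence spans that cover min_sse..max_sse consecutive SSEs."""
    sses = secondary_structure_elements(ss)
    spans = []
    for size in range(min_sse, max_sse + 1):
        for first in range(len(sses) - size + 1):
            start = sses[first][1]            # start of the first SSE in the window
            end = sses[first + size - 1][2]   # end of the last SSE in the window
            spans.append((start, end))
    return spans

# Hypothetical predicted secondary structure: helix, strand, helix.
example_ss = "CCHHHHCCEEEECCHHHHC"
example_spans = sse_segments(example_ss)
```

Each span could then be cut from the target sequence and threaded separately.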

QUARK

• QUARK has been developed for ab initio protein structure prediction.

• It starts with a collection of continuously distributed structural fragments with 1-20 residues from unrelated proteins in the PDB. Full-length structure models are then assembled from the fragments by replica-exchange Monte Carlo (REMC) simulations, which are guided by a composite physics- and knowledge-based force field that contains a variety of local structure features derived from sequence.

• For the proteins that are deemed by LOMETS to be Easy or Medium targets, i.e. there is at least one structure template with a Z-score above the confidence cutoff, a new template-based QUARK pipeline is used to generate the structure prediction. In this pipeline, each replica in the REMC simulation starts from a different top LOMETS template.

• The weights of the QUARK force field have been reparameterized in this pipeline to enhance the knowledge-based components derived from threading alignments.

• multiple-domain proteins: ThreaDom
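For readers unfamiliar with REMC, the exchange step between neighbouring temperatures can be sketched as follows. This is the textbook Metropolis swap criterion, not QUARK's implementation: the temperature ladder, the energy values, the unit Boltzmann constant `k_b`, and the function names are assumptions.

```python
import math
import random

def swap_probability(energy_i, energy_j, temp_i, temp_j, k_b=1.0):
    """Metropolis probability of exchanging conformations between two replicas."""
    beta_i = 1.0 / (k_b * temp_i)
    beta_j = 1.0 / (k_b * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0 else math.exp(delta)

def attempt_swaps(energies, temps, rng):
    """Try one exchange per pair of neighbouring temperatures.

    energies[c] is the energy of conformation c; the returned list gives,
    for each temperature index, which conformation sits there afterwards.
    """
    order = list(range(len(temps)))
    for t in range(len(temps) - 1):
        a, b = order[t], order[t + 1]
        p = swap_probability(energies[a], energies[b], temps[t], temps[t + 1])
        if rng.random() < p:
            order[t], order[t + 1] = b, a
    return order

# Two replicas: the hotter one has found the lower-energy conformation,
# so the swap is always accepted and that conformation moves to the cold replica.
new_order = attempt_swaps([-10.0, -20.0], [1.0, 2.0], random.Random(0))
```

Between swap attempts each replica would run ordinary Monte Carlo moves at its own temperature; those moves are omitted here.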

BAKER-ROSETTASERVER

• Robetta is a fully automated structure prediction server that consists of three main steps: domain boundary identification, structure modeling, and domain assembly.

• Domain boundary identification: Domain boundaries are predicted by identifying PDB templates with optimal sequence similarity and structural coverage to the target through an iterative process. For each iteration, we use locally installed programs, HHSearch, Sparks, and Raptor, to identify templates and generate alignments. The target sequence is threaded onto the template structures to generate partial-threaded models, which are then clustered to identify distinct topologies that are ranked based on the likelihood of the alignments. Regions of the target sequence that are not covered by the partial-threads or are not similar in structure within the top ranked cluster are passed on to the next search iteration.

• Structure modeling: For each predicted domain, models are generated using our comparative modeling protocol, RosettaCM, which recombines structural elements from the clustered partial-threads and models missing segments using a combination of fragment insertion and mixed torsion-Cartesian space minimization.

• For difficult domains, models are also generated using the Rosetta fragment assembly methodology (Rosetta Abinitio), and if GREMLIN contacts are predicted, they are used as restraints for sampling and refinement.

• All models are refined using a relax protocol that minimizes the Rosetta full-atom energy in torsion and Cartesian space to allow bond angle flexibility. Final models are selected by clustering the best scoring 100 models from each topologically distinct alignment cluster, and then averaging the models within each cluster and refining the final averaged models.

RaptorX

• RaptorX is a template-based protein modeling server.

• Not finished.

• To significantly advance homology detection and fold recognition, we have developed a Markov Random Field (MRF) model of an MSA (multiple sequence alignment). MRFs can model long-range residue interactions and thus encode information about the global 3D structure of a protein family.

• Each node is associated with a function describing position-specific amino acid mutation pattern. Similarly, each edge is associated with a function describing correlated mutation statistics between two columns.

• To score the similarity of two MRFs, we use both node and edge alignment potentials, which measure the node (i.e., residue) similarity and edge (i.e., interaction pattern) similarity, respectively. To derive the node alignment potential, we use a set of 1400 protein pairs as the training data, which covers 458 SCOP folds. The reference alignment for a protein pair is generated by the structure alignment tool DeepAlign. The edge alignment potential is derived from the software package EPAD, which takes as input the PSSM and residue interaction strength and outputs the inter-residue distance probability distribution. The interaction strength of two residues can be calculated in different ways. In the current implementation we calculate the mutual information (MI) matrix.

• It is computationally challenging to optimize the MRFalign scoring function due to the edge alignment potential. We formulate this problem as an integer programming problem and then develop an ADMM (Alternating Direction Method of Multipliers) algorithm to solve it efficiently to a suboptimal solution.
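The mutual information matrix used as interaction strength can be illustrated with a small sketch. It computes plain MI between alignment columns from raw frequencies; whether RaptorX applies sequence weighting, pseudocounts, gap handling or other corrections is not stated in the abstract, so none are included, and the function names and toy alignment are assumptions.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """MI (in nats) between two alignment columns, from raw frequencies."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        mi += p_ab * math.log(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

def mi_matrix(msa):
    """Symmetric matrix of MI values for every pair of columns in the MSA."""
    length = len(msa[0])
    cols = [[seq[i] for seq in msa] for i in range(length)]
    return [[mutual_information(cols[i], cols[j]) for j in range(length)]
            for i in range(length)]

# Toy alignment: columns 0 and 1 co-vary perfectly, column 2 is conserved.
toy_msa = ["AKL", "AKL", "GEL", "GEL"]
toy_mi = mi_matrix(toy_msa)
```

In the toy alignment the co-varying pair scores ln 2 while any pair involving the conserved column scores 0, which is the signal an edge potential would pick up.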

MULTICOM-CLUSTER

• The method was based on a conformation ensemble approach to protein tertiary structure prediction.

• The basic conformation ensemble protocol in MULTICOM-CLUSTER generated an ensemble of protein models for each target using multiple templates identified by more than a dozen sequence/profile comparison tools (e.g., BLAST, PSI-BLAST, HHSearch, SAM, HMMer, MUSTER, RaptorX), a combination of alternative target-template alignments, and complementary model generation tools.

• An ensemble of hundreds (e.g., 150-250) of models generally approximated the near-native conformations of a relatively easy target well if one or more homologous templates were identified for the target. For relatively hard targets for which no good template was found, tens of additional models, selected from hundreds of template-free models generated by a fragment-assembly-based tool (i.e. Rosetta), were added to the ensemble in order to increase the diversity of the model pool.

• The conformations of all the chunks will be combined into a full-length model using Modeller.

• The ensemble of models of a target was evaluated by several different methods, including the single-model absolute model quality assessment tool ModelEvaluator, the fully pairwise model comparison tool APOLLO, the protein energy calculation tool SELECTpro, and the frequency of the templates (i.e., the number of times that a template was chosen by different sequence/profile comparison tools) used to generate models, if any. From the ensemble, MULTICOM-CLUSTER selected the top five models, ranked mostly by the APOLLO scores and supplemented by other information.

• Trick (Chao comments): Furthermore, several exception handling strategies were applied to remove seemingly bad models ranked in the top five, including replacing template-based models with very low template coverage, filling in terminal regions of models not covered by any template by model combination, replacing duplicate models within the top five, and removing models based on false positives of the BLAST-based search.
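The idea behind fully pairwise model comparison is that a model resembling many other models in the pool is more likely to be correct. A minimal sketch of such consensus ranking is given below. It is not APOLLO's algorithm or scoring function: the superposition-free similarity (fraction of CA-CA distances that agree within a cutoff), the 1.0 angstrom cutoff, and the toy coordinates are assumptions chosen to keep the example short.

```python
import math

def pairwise_similarity(model_a, model_b, cutoff=1.0):
    """Fraction of residue pairs whose CA-CA distance agrees within `cutoff` angstroms."""
    n = len(model_a)
    agree = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            d_a = math.dist(model_a[i], model_a[j])
            d_b = math.dist(model_b[i], model_b[j])
            total += 1
            if abs(d_a - d_b) <= cutoff:
                agree += 1
    return agree / total if total else 0.0

def consensus_rank(models, top_n=5):
    """Rank models by their average similarity to every other model in the pool."""
    names = list(models)
    scores = {}
    for name in names:
        others = [pairwise_similarity(models[name], models[o])
                  for o in names if o != name]
        scores[name] = sum(others) / len(others) if others else 0.0
    return sorted(names, key=lambda nm: scores[nm], reverse=True)[:top_n]

# Three hypothetical 3-residue models: "a" and "b" agree, "c" is an outlier.
pool = {
    "a": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)],
    "b": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.5, 0.5, 0.0)],
    "c": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)],
}
ranking = consensus_rank(pool)
```

`math.dist` requires Python 3.8 or later. In the toy pool the outlier "c" ranks last because it agrees with neither of the other two models.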

Pcons

• PconsFold is a fully automated pipeline for ab-initio protein structure prediction based on evolutionary information.

• PconsFold is based on PconsC contact prediction and uses the Rosetta folding protocol.

• PconsC2 is a novel method that uses a deep learning approach to identify protein-like contact patterns to improve contact predictions.

Advantages

• RaptorX: alignment accuracy, statistical model and learning methods

• MULTICOM: model selection, template selection

• Rosetta: assembly

• I-TASSER & QUARK: You see

We need to improve …

• Clean vs. Dirty
– Discovery vs. Performance
– For understanding vs. For CASP

• Clean
– Single secondary structure element
– Interaction of pair SSEs
– Topology

• Domain parsing: avoid deriving domains directly from threading alignments

• Template selection: avoid relying only on p-values or raw threading scores

• Model generation: MODELLER is really NOT reliable when the gap length is over 15(?)

• Model selection: avoid selecting based only on the dDFIRE score

• to develop:
– loop modeling tools
– template selection tools (SVM?)
– consensus based selection strategy
