# Publication

For more information on eQuant's theoretical background and implementation details, see [Bittrich, 2016].

## Underlying Concepts

eQuant aims to provide a fast and easy-to-use service for assessing protein model quality. Lightweight visualizations highlight the information that matters most to the user, and the raw output data is available for download for users who want to inspect or further process the assessments.

A major advantage of eQuant over most available model quality assessment programs (MQAPs) is its speed: structures are processed within a few seconds, and a normal-sized query structure (a single chain with about 120 residues) requires less than one second of computation. In addition, the user does not have to provide any processing parameters (force fields, distance cut-offs, etc.). During the assessment, all chains in a structure are considered, including inter-molecular interactions between residues. However, like most MQAPs, eQuant is currently limited to processing soluble proteins.

Essentially, eQuant employs a set of features and procedures that have proven successful in existing approaches, and additionally considers information derived from coarse-grained knowledge-based potentials [Heinke, 2012; Heinke, 2013] as well as residue-residue interaction preferences and packing statistics. These data are obtained for each residue and evaluated by means of the random subspace method [Ho, 1998], which yields a prediction of the local per-residue error. For each residue, the predicted error is an estimate of the Cα deviation between its location in the submitted model and its location in the unknown native structure.

## Assessing Global Structure Quality from Local Error Predictions

Given all predicted local errors (Cα deviations), the global structural match to the unknown native structure can be estimated; in other words, the deviation between the model and the native structure after superimposition is estimated. A widely used measure for this is the global distance test total score (GDT_TS) [Kryshtafovych, 2014]. Ranging from 0% to 100%, it quantifies the quality of the structural match; values close to 100% indicate almost perfect structural alignments. Based on the predicted per-residue deviations, the GDT_TS is computed as an overall estimate of the deviation from the unknown native structure. Furthermore, the computed GDT score is re-scaled to provide a Z-score. The transformation is calculated from the population of pre-processed structures whose length is within ±10% of the length of the query model. If a set of structures is submitted, all models are sorted by Z-score and reported in descending order, so the model predicted to be most reliable appears first.
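
The re-scaling and ranking step can be sketched as follows. This is only a minimal illustration, assuming that the reference population is available as `(length, gdt)` pairs and that the Z-score is the usual `(value - mean) / standard deviation`; the function names, the use of the population standard deviation and the data layout are assumptions, not eQuant's actual implementation.

```python
from statistics import mean, pstdev

def gdt_z_score(query_length, query_gdt, reference, tolerance=0.10):
    """Z-score of a GDT value against reference structures of similar length.

    reference: iterable of (length, gdt) pairs (hypothetical pre-processed data).
    tolerance: relative length window, 0.10 corresponds to +/- 10%.
    """
    lower = query_length * (1.0 - tolerance)
    upper = query_length * (1.0 + tolerance)
    population = [gdt for length, gdt in reference if lower <= length <= upper]
    if len(population) < 2:
        raise ValueError("not enough reference structures of similar length")
    sigma = pstdev(population)
    if sigma == 0:
        raise ValueError("reference population has zero spread")
    return (query_gdt - mean(population)) / sigma

def rank_models(models, reference):
    """Sort (name, length, gdt) tuples by Z-score, most reliable model first."""
    scored = [(name, gdt_z_score(length, gdt, reference))
              for name, length, gdt in models]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `rank_models` on several submitted models returns them in descending Z-score order, which mirrors the ordering described above.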

To derive the GDT scores, two structures of the same sequence but with different tertiary structure were superimposed. The GDT score was computed by $$GDT\_TS = \frac{1}{4} \sum_{t=1,2,4,8} \frac{f(t) \cdot 100}{N}$$ with $$f(t)$$ referring to the absolute count of residue pairs less than $$t$$ Å apart and $$N$$ being the number of aligned atom pairs.
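
The formula above can be written as a short function. This is a sketch that assumes the per-residue Cα distances (in Å) are already known from a single superposition or, as in eQuant's use case, predicted; the function name `gdt_ts` is hypothetical, and a full GDT implementation would additionally search for the best superposition per threshold.

```python
def gdt_ts(distances, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS in percent from per-residue C-alpha distances given in Angstrom."""
    n = len(distances)
    if n == 0:
        raise ValueError("at least one aligned residue pair is required")
    # f(t) / N for every threshold t: fraction of pairs less than t apart
    fractions = [sum(1 for d in distances if d < t) / n for t in thresholds]
    # average over the four thresholds and scale to percent
    return 100.0 * sum(fractions) / len(thresholds)

if __name__ == "__main__":
    # one residue each below 1, 2, 4 and 8 Angstrom, one residue beyond 8 Angstrom
    print(round(gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0]), 1))
```

For the five example distances, 1, 2, 3 and 4 of 5 residues fall below the thresholds of 1, 2, 4 and 8 Å, so the score is the average of 20%, 40%, 60% and 80%, that is 50%.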

As an alternative global quality measure, the TM-score is provided. It ranges from 0 to 1, where values above 0.5 generally indicate that the submitted and the native structure share the same fold. Scores close to 1 occur for almost identical structures [Xu, 2010].
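
For orientation, the standard TM-score definition averages the term 1 / (1 + (d_i / d_0)^2) over all residues, with the length-dependent scale d_0 = 1.24 (L - 15)^(1/3) - 1.8 Å. The sketch below applies this definition to a list of per-residue Cα distances. How eQuant derives the TM-score from its predicted deviations is not described here, so the function name, the lower bound of 0.5 Å on d_0 and the handling of short chains are assumptions.

```python
def tm_score(distances, target_length=None, d0_min=0.5):
    """TM-score from per-residue C-alpha distances (Angstrom).

    target_length: length used for normalization; defaults to len(distances).
    d0_min: lower bound for the distance scale d0 (assumption of this sketch).
    """
    n = target_length if target_length is not None else len(distances)
    if n <= 0:
        raise ValueError("target length must be positive")
    if n > 15:
        d0 = max(1.24 * (n - 15) ** (1.0 / 3.0) - 1.8, d0_min)
    else:
        # the cube-root term is undefined for very short chains
        d0 = d0_min
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / n
```

A model with zero deviation at every residue scores exactly 1, and the score decreases monotonically as the deviations grow.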

## Predicted Local Error

The backbone of the assessment routine is the per-residue prediction of Cα deviations between residue locations in the given model and in the native structure. eQuant was trained on the CASP9 dataset and evaluated on CASP10 [Kryshtafovych, 2014].

As the first step in the process, the following descriptive features are determined for each residue:

• Evaluation: the final, composite quality score
• Solvation Energy: the solvation energy term of the energy profile [Heinke, 2012; Heinke, 2013]
• Contact Energy: the amino acid-amino acid contact energy
• Smoothed Solvation/Contact Energy: both values averaged by a bell-shaped weighting rule of size 9 - the central residue has the biggest impact, while the influence of other residues decreases the farther away they are in sequence
• Interactions: number of long-range interactions, defined as residue pairs less than 10 Å apart with a sequence separation of more than 8 residues - amino acid-specific preferences concerning the number of these interactions are assessed by Z-scores
• Good Interactions Z-Scores: number of surrounding residues within 10 Å with favorable interaction Z-scores above +0.5 - this implies that the interaction preferences for this residue and its neighbors are met
• Bad Interactions Z-Scores: the counterpart for unfavorable Z-scores below -0.5 - this indicates erroneous interaction patterns
• Relative Accessible Surface Area: the ASA computed by DSSP [Kabsch, 1983], normalized by the observed maximum ASA of this amino acid in a dataset of high-quality PDB structures
• Loop Fraction: the fraction of unordered secondary structure (non-strand, non-helix) within a window size of 9 residues
• Egor Agreement: in a manner motivated by the GOR algorithm, the solvation energy values of the protein are predicted solely from its sequence - discrepancies between the computed and the predicted energy profile indicate structural flaws for certain residues
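
Three of the window-based features listed above can be sketched in a few lines. The exact shape of the bell-shaped weighting rule is not specified, so a Gaussian with `sigma=2.0` is assumed here; likewise, the set of DSSP letters treated as helix or strand (`HGIEB`), the truncation of windows at chain ends and all function names are assumptions made for illustration.

```python
import math

def bell_weights(size=9, sigma=2.0):
    """Normalized Gaussian-shaped weights; the central weight is the largest."""
    half = size // 2
    raw = [math.exp(-(k * k) / (2.0 * sigma * sigma))
           for k in range(-half, half + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def smooth(values, size=9, sigma=2.0):
    """Bell-weighted moving average; windows are truncated at chain ends."""
    weights = bell_weights(size, sigma)
    half = size // 2
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        norm = 0.0
        for offset in range(-half, half + 1):
            j = i + offset
            if 0 <= j < len(values):
                w = weights[offset + half]
                acc += w * values[j]
                norm += w
        smoothed.append(acc / norm)  # re-normalize truncated windows
    return smoothed

def loop_fraction(dssp_states, size=9, ordered="HGIEB"):
    """Fraction of non-helix, non-strand states in a window around each residue."""
    half = size // 2
    fractions = []
    for i in range(len(dssp_states)):
        window = dssp_states[max(0, i - half): i + half + 1]
        loops = sum(1 for state in window if state not in ordered)
        fractions.append(loops / len(window))
    return fractions

def relative_asa(asa, max_asa):
    """Accessible surface area normalized by the observed maximum for the amino acid."""
    return asa / max_asa
```

Applying `smooth` to the per-residue solvation or contact energies gives the smoothed variants, while `loop_fraction` expects one DSSP secondary structure letter per residue.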

These data provide a numerical description of a residue's environment with respect to observed and expected energies, and thus of its local stability. The random subspace method [Ho, 1998] is then employed to predict and report Cα deviations as a measure of 'unnaturalness'. All predictions are made solely by analyzing the submitted model, so no knowledge of the native structure is required. All underlying statistics were gathered from 63 soluble, non-redundant, high-resolution protein structures obtained via the PDB-REPRDB service [Noguchi, 2001]. During training, the CASP X data set [Moult, 2014] was used to compute Cα-Cα distances, which serve as the target function. QMEAN [Benkert, 2008; Benkert, 2009; Benkert, 2011] followed a comparable approach during design and strongly influenced the development of this method.
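
The idea behind the random subspace method is to train many base learners, each on a random subset of the features, and to average their predictions. The sketch below illustrates this with k-nearest-neighbour regressors as base learners. This is a deliberately simple stand-in: Ho's original work uses decision trees, and the base learner, ensemble size and subspace size used by eQuant are not stated here, so all names and parameters are assumptions.

```python
import random

def train_random_subspace(X, y, n_learners=25, subspace_size=3, seed=0):
    """Draw one random feature subset per base learner (hypothetical helper)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    subspaces = [sorted(rng.sample(range(n_features), subspace_size))
                 for _ in range(n_learners)]
    # k-NN is a lazy learner, so "training" only stores data and subspaces
    return {"X": X, "y": y, "subspaces": subspaces}

def _knn_predict(model, features, x, k):
    """Mean target of the k training rows closest to x within one subspace."""
    ranked = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), target)
        for row, target in zip(model["X"], model["y"])
    )
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)

def predict(model, x, k=3):
    """Average the predictions of all base learners."""
    votes = [_knn_predict(model, features, x, k)
             for features in model["subspaces"]]
    return sum(votes) / len(votes)
```

In this setting, each row of `X` would hold the descriptive features of one residue and `y` the observed Cα deviation, so that `predict` returns an estimated deviation for a new residue.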

eQuant accepts files containing structure data in RCSB Protein Data Bank format as well as archive files. Supported archive file formats are .tar, .gz, .rar, .zip and .7z. This enables you to submit multiple structures (and structure complexes) at the same time. After processing, all structure assessments are reported on one single page in descending order with respect to global quality expressed by the GDT Z-score.

Simple text files are available for exporting the evaluation results. The SMALL file contains the results in condensed format: only basic residue information, the actual per-residue error score and a rudimentary interpretation are provided, where scores exceeding 3.8 Å are considered unreliable. If you are interested in more detailed data, choose the FULL report file, which summarizes not only the final evaluation scores but also all information gathered during the quality assessment routine. While PV [Biasini, 2014] is used to visualize the structure on the result page, you can also download a modified PDB file of the originally submitted structure with the evaluation scores written to the B-factor column. With this file, tools such as PyMOL [DeLano, 2002] can be used to conveniently create appealing images. Additionally, a ZIP archive containing the result files is provided. Finally, each figure on the result page can be stored locally using the browser's capabilities or the Highcharts [Highsoft, 2012] context menu in the upper right corner.
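
To illustrate what writing scores to the B-factor column means, the following sketch replaces the B-factor field (columns 61-66 of `ATOM`/`HETATM` records in the PDB format) with a per-residue score and applies the 3.8 Å cut-off mentioned above. The function names and the `(chain, residue number)` keys are hypothetical, insertion codes are ignored, and the labels returned by `interpret` are not necessarily the wording used in the SMALL file.

```python
def interpret(score, threshold=3.8):
    """Rudimentary interpretation: scores above the threshold are unreliable."""
    return "unreliable" if score > threshold else "reliable"

def write_scores_to_bfactor(pdb_lines, scores):
    """Return PDB lines with per-residue scores in the B-factor field.

    pdb_lines: lines without trailing newlines, e.g. from str.splitlines().
    scores: dict mapping (chain_id, residue_number) to a predicted error in Angstrom.
    """
    result = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 54:
            # chain identifier is column 22, residue number columns 23-26
            key = (line[21], int(line[22:26]))
            if key in scores:
                padded = line.ljust(66)
                # B-factor occupies columns 61-66, formatted as %6.2f
                line = padded[:60] + f"{scores[key]:6.2f}" + padded[66:]
        result.append(line)
    return result
```

A file processed this way can be loaded into a molecular viewer and colored by B-factor to highlight regions with a high predicted error.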

## References

• S. Bittrich, F. Heinke and D. Labudde. eQuant - A Server for Fast Protein Model Quality Assessment by Integrating High-Dimensional Data and Machine Learning. Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery, 2016:419-433, 2016. ISSN 1865-0929.
doi: 10.1007/978-3-319-34099-9_32

• F. Heinke, S. Schildbach, D. Stockmann and D. Labudde. epros - a database and toolbox for investigating protein sequence-structure-function relationships through energy profiles. Nucleic Acids Res., 41(D1):D320-D326, Jan 2013. ISSN 1362-4962.
doi: 10.1093/nar/gks1079
• F. Heinke and D. Labudde. Membrane protein stability analyses by means of protein energy profiles in case of nephrogenic diabetes insipidus. Computational and Mathematical Methods in Medicine, 2012:1-11, 2012. ISSN 1748-6718.
doi: 10.1155/2012/790281

• P. Benkert, M. Kunzli and T. Schwede. Qmean server for protein model quality estimation. Nucleic Acids Res., 37(Web Server):W510-W514, Jul 2009. ISSN 1362-4962.
doi: 10.1093/nar/gkp322
• P. Benkert, M. Biasini and T. Schwede. Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics, 27(3):343-350, Feb 2011. ISSN 1460-2059.
doi: 10.1093/bioinformatics/btq662
• P. Benkert, S. Tosatto and D. Schomburg. Qmean: A comprehensive scoring function for model quality assessment. Proteins: Struct., Funct., Bioinf., 71(1):261-277, Apr 2008. ISSN 1097-0134.
doi: 10.1002/prot.21715

• A. Kryshtafovych, B. Monastyrskyy and K. Fidelis. CASP prediction center infrastructure and evaluation measures in CASP10 and CASP ROLL. Proteins: Struct., Funct., Bioinf., 82:7-13, Feb 2014. ISSN 0887-3585.
doi: 10.1002/prot.24399
• M. Biasini. PV - WebGL-based protein viewer. 2014.
doi: 10.5281/zenodo.12620
• W. DeLano. The PyMOL molecular graphics system. 2002.
http://www.pymol.org/
• J. Moult, K. Fidelis, A. Kryshtafovych, T. Schwede and A. Tramontano. Critical assessment of methods of protein structure prediction (CASP) - round x. Proteins: Struct., Funct., Bioinf., 82:1-6, Feb 2014. ISSN 0887-3585.
doi: 10.1002/prot.24452
• Highsoft AS. Highcharts JS. 2012.
http://www.highcharts.com
• T. Noguchi. PDB-REPRDB: a database of representative protein chains from the protein data bank (PDB). Nucleic Acids Res., 29(1):219-220, Jan 2001. ISSN 1362-4962.
doi: 10.1093/nar/29.1.219
• W. Kabsch and C. Sander. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577-2637, Dec 1983. ISSN 1097-0282.
doi: 10.1002/bip.360221211
• T. Ho. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell., 20(8):832-844, 1998. ISSN 0162-8828.
doi: 10.1109/34.709601
• J. Xu and Y. Zhang. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics, 26(7):889-895, 2010. ISSN 1460-2059.
doi: 10.1093/bioinformatics/btq066