Structural impact of mutations

A classic disease that results from a change in protein structure through amino acid substitution is sickle cell anemia [15]. Replacement of a hydrophilic glutamic acid residue with a strongly hydrophobic valine at the sixth position of the hemoglobin beta subunit causes the protein to aggregate into rigid molecules, which in turn distort red blood cells into a sickle shape [16]. Sickle cells die prematurely, which results in anemia. Other structural abnormalities that nsSNPs can induce include changes in secondary structure, gain or loss of protein stability, and alterations of other physicochemical properties. In this section, we illustrate two mutations in a cancer-related gene, BRCA1, describe an algorithm for predicting protein stability, and finally discuss its application to discriminating neutral from deleterious mutations.

BRCA1 is a well-known tumor suppressor implicated in breast and ovarian cancer. Two C-terminal sequence repeats (BRCT) are essential for BRCA1's function: stop-codon mutations and missense substitutions in these regions have been observed in breast cancer patients [17, 18]. The crystal structure of the BRCT segments [19] shows that the two domains pack against each other in tandem, with one helix from the N-terminal domain and two helices from the C-terminal domain forming an inner-domain interaction surface (Figure 1).

Two amino acid substitutions occur on this interface: A1708E, located near the end of the α1 helix, and M1775R, located near the beginning of the α2 helix. At position 1708, the mutant glutamic acid is much larger than the original alanine (molecular weight 147 versus 89) and introduces a negative charge. Because A1708 lies near the center of the interaction surface, the compact core cannot accommodate this mutation sterically. Thus, A1708E would destabilize the BRCT interaction. On the other hand, although R1775 could be placed spatially on the edge of the BRCT interface, it positions a positive charge against the nearby R1835. Both mutations would therefore destabilize the BRCT core, through either steric incompatibility or disruption of electrostatic interactions [19]. This explanation is supported by a mutation sensitivity assay that measures the stability of the inner-domain interaction against proteolytic degradation. The wild-type protein resists digestion by trypsin, elastase, and chymotrypsin, whereas the M1775R mutant was partially degraded and the A1708E mutant was almost completely degraded [19]. Together, the BRCT structure and the in vitro experiments suggest that the genetic variants A1708E and M1775R cause the BRCA1 defect by destabilizing its inner-domain interaction.

Figure 1. The crystal structure of human BRCT domains (PDB ID: 1JNX). The N-terminus is shown in blue; the C-terminus, in red. Residues A1708 and M1775 are depicted as ball and stick models. Three helices, α1 from the N-terminus and both α2 and α3 from the C-terminus, pack into a hydrophobic core that is important to the folding of BRCT domains.

From this example, we can see that a crystal structure can be a powerful tool for interpreting the possible consequences of nsSNPs in terms of physicochemical principles. However, we cannot reasonably expect every protein and its mutants to have high-resolution three-dimensional (3D) structures or homology models available, either because structure determination is difficult, as it is for membrane proteins, or because some proteins are intrinsically disordered [20].

To overcome this severe limitation, many computational tools that aim to predict structural properties take sequence information as input, either by using the sequence directly or through derived features such as amino acid composition and sequence motifs. Here, we describe MUpro [21], a stability prediction method that is based on a machine learning technique, the Support Vector Machine (SVM), and that achieved good performance.

In traditional molecular dynamics simulation, potential functions from a force field are usually evaluated to obtain ΔΔG, the change in free energy between wild-type and mutant proteins, which is mainly influenced by interactions between nonlocal amino acids [22]. Although it is generally difficult, if not impossible, to infer protein structural architecture accurately from amino acid sequence alone, pioneering work [23, 24] showed that protein sequence is effective in predicting secondary structure and solvent accessibility. MUpro fits a set of features derived from protein sequence to experimental stability data through a nonlinear transformation by SVM. The ProTherm database [25] collects from the literature a range of experimentally measured thermodynamic parameters, such as Gibbs free energy changes for wild-type and mutant proteins, together with experimental conditions, including pH and temperature. From ProTherm, MUpro took protein sequences and mutations, along with numeric energy changes, for training and testing.

To generate features, MUpro adopted a standard binary encoding scheme: it selected a window centered on the mutated position and encoded each amino acid in the window as a vector of 20 elements. In such a vector, each element corresponds to one of the 20 standard amino acids and takes a value of 1 if that amino acid is the one observed, and 0 otherwise. MUpro considered a window of seven amino acids for each mutation, so each mutation is represented by a 140-element feature vector. The first 20-element vector records the wild-type and mutant amino acids at the mutated position, and the remaining six vectors encode the six flanking amino acids.
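The window encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not MUpro's actual code: the function names, the example sequence, the zero padding at sequence ends, and the convention of marking the wild-type residue with -1 and the mutant residue with +1 in the first vector are all assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """Return a 20-element 0/1 vector; all zeros if residue is None or unknown."""
    vec = [0] * len(AMINO_ACIDS)
    i = INDEX.get(residue)
    if i is not None:
        vec[i] = 1
    return vec


def encode_mutation(sequence, position, mutant, half_window=3):
    """Encode a substitution at 0-based `position` as a 140-element vector.

    First 20 elements: the mutated site (assumed convention: -1 for the
    wild-type residue, +1 for the mutant residue).
    Remaining 6 x 20 elements: one-hot vectors for the flanking residues,
    from offset -3 to +3. Positions beyond either end stay all zeros.
    """
    site = [0] * len(AMINO_ACIDS)
    site[INDEX[sequence[position]]] = -1
    site[INDEX[mutant]] = 1
    features = list(site)
    for offset in range(-half_window, half_window + 1):
        if offset == 0:
            continue
        i = position + offset
        residue = sequence[i] if 0 <= i < len(sequence) else None
        features.extend(one_hot(residue))
    return features


# Hypothetical 10-residue sequence; mutate Y (index 4) to F.
features = encode_mutation("MKTAYIAKQR", 4, "F")
print(len(features))
```

With a window of seven residues the vector always has 7 x 20 = 140 elements, regardless of where in the sequence the mutation falls.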

In a two-dimensional space, linear classifiers are designed to separate two classes of data points by a straight line. As illustrated in Figure 2 (left plot), any line passing through the space between the two parallel lines separates the blue points (one class) from the orange points (the other class) perfectly, and thus would be an acceptable choice for linear classification. SVM algorithms [26], however, select the dashed line, which is equidistant from the two parallel lines, as the class boundary. In other words, SVMs choose the separator that maximizes its distance to the nearest data points. Figure 2 shows the margin m between the two classes, which is the quantity that the SVM algorithm maximizes. A larger m is expected to give the classifier better generalization, that is, better performance on new, unseen data points.
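To make the margin idea concrete, the sketch below compares three parallel separating lines on a small, made-up data set and reports, for each, the distance to its nearest data point. The points, labels, and function name are hypothetical; if m is measured between the two parallel lines as in Figure 2, it is twice the distance computed here for the middle line.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w.x + b = 0.

    The value is positive only if every point lies on its correct side.
    """
    norm = math.hypot(w[0], w[1])
    return min(
        y * (w[0] * x[0] + w[1] * x[1] + b) / norm
        for x, y in zip(points, labels)
    )


# Hypothetical, perfectly separable data: class -1 lies on x + y = 2,
# class +1 lies on x + y = 6.
points = [(1.0, 1.0), (2.0, 0.0), (3.0, 3.0), (4.0, 2.0)]
labels = [-1, -1, 1, 1]

# Three candidate boundaries x + y = c; all of them separate the classes.
for c in (3.0, 4.0, 5.0):
    margin = geometric_margin((1.0, 1.0), -c, points, labels)
    print(f"x + y = {c}: distance to nearest point = {margin:.4f}")
```

All three lines classify the data perfectly, but the middle one (c = 4) keeps the largest distance to its nearest point, which is the choice an SVM would make among these candidates.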


[Figure 2. Right plot axis label: "MUpro Prediction"]

Figure 2. The left plot illustrates a linear classification on separable data with two classes (blue and orange). The class boundary (dashed line) is the middle line between two parallel lines. The right plot shows MUpro predictions against experimental values for 1,008 nsSNPs; points on the diagonal represent exact predictions.

When the two classes overlap, SVMs instead optimize an objective function that considers both m and penalties for misclassified points. Regardless of the separability of the data, m depends only on the points located on the parallel lines (in the completely separable case) or on and between them (in the partially separable case). These points are called support vectors.
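One common way to write such an objective is the soft-margin form, which adds a hinge penalty for every point that falls inside the margin or on the wrong side. The sketch below evaluates that objective for a fixed, hand-picked separator and lists the points that lie on or inside the margin. The data, the weight C, and the function names are assumptions for illustration; a real SVM solver would search for the w and b that minimize this quantity.

```python
def decision_value(w, b, x):
    """Linear decision function f(x) = w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_margin_objective(w, b, points, labels, C=1.0):
    """0.5 * ||w||^2 plus C times the sum of hinge penalties."""
    hinge = sum(
        max(0.0, 1.0 - y * decision_value(w, b, x))
        for x, y in zip(points, labels)
    )
    return 0.5 * sum(wi * wi for wi in w) + C * hinge


def margin_points(w, b, points, labels, tol=1e-9):
    """Indices of points on or inside the margin (y * f(x) <= 1)."""
    return [
        i
        for i, (x, y) in enumerate(zip(points, labels))
        if y * decision_value(w, b, x) <= 1.0 + tol
    ]


# Hypothetical data: the last point of class +1 sits inside the margin.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (3.0, 3.0), (4.0, 2.0), (2.5, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = (0.5, 0.5), -2.0  # hand-picked separator, not an optimized one

print(soft_margin_objective(w, b, points, labels))
print(margin_points(w, b, points, labels))
```

The point at the origin lies well outside the margin, so moving it slightly would change neither the objective nor the boundary; only the points returned by `margin_points` can influence the solution.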

Besides classification, SVMs can perform regression for data points with continuous response values, where the objective function measures the difference between predicted and actual values. Unlike typical linear regression, however, SVM regression does not penalize differences that fall within a predefined range.
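The "predefined range" is usually expressed as a tolerance, here called epsilon, around the actual value. A minimal sketch of the resulting loss, with made-up predictions and an arbitrary epsilon of 0.5:

```python
def epsilon_insensitive_loss(predicted, actual, epsilon=0.5):
    """Per-point loss: zero within +/- epsilon, linear beyond it."""
    return [max(0.0, abs(p - a) - epsilon) for p, a in zip(predicted, actual)]


# Hypothetical predicted and measured values.
predicted = [-1.0, 0.2, 2.0]
actual = [-1.3, 0.0, 0.5]

print(epsilon_insensitive_loss(predicted, actual))
```

The first two predictions miss by 0.3 and 0.2, which is inside the tolerance, so they contribute no loss. The third misses by 1.5 and is charged only for the part beyond epsilon.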

The abilities of SVMs, however, go beyond linear classification and regression. By projecting the original data points into higher-dimensional spaces, SVMs in effect create additional, and usually more complex, features from the input points. By applying the same linear machinery described above in these new high-dimensional spaces, SVMs can capture highly nonlinear relationships in the data that would otherwise be missed.
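A small example shows why adding features helps. The four XOR-style points below cannot be split by any straight line in two dimensions, but they become linearly separable once a third feature, the product of the two coordinates, is added. The explicit feature map and the brute-force grid search are simplifications chosen for illustration; practical SVMs typically obtain the same effect implicitly through kernel functions.

```python
from itertools import product


def separates(w, b, points, labels):
    """True if sign(w.x + b) matches the label for every point."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0
        for x, y in zip(points, labels)
    )


def find_separator(points, labels, grid):
    """Brute-force search for (w, b) on a coarse grid; None if none is found."""
    dim = len(points[0])
    for params in product(grid, repeat=dim + 1):
        w, b = params[:-1], params[-1]
        if separates(w, b, points, labels):
            return w, b
    return None


def add_product_feature(point):
    """Map (x1, x2) to (x1, x2, x1 * x2)."""
    x1, x2 = point
    return (x1, x2, x1 * x2)


xor_points = [(0, 0), (1, 1), (0, 1), (1, 0)]
xor_labels = [-1, -1, 1, 1]
grid = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0

lifted_points = [add_product_feature(p) for p in xor_points]
print(find_separator(xor_points, xor_labels, grid))     # no line works in 2D
print(find_separator(lifted_points, xor_labels, grid))  # a plane works in 3D
```

One separating plane in the lifted space is x1 + x2 - 2 * x1 * x2 - 0.5 = 0, which corresponds to a curved boundary when viewed in the original two coordinates.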

MUpro applied a popular SVM implementation, SVMlight [27], to carry out both classification of the sign of the energy change and regression of its value. On 1,008 training mutations, MUpro performed rather well against the true energy changes, with a root-mean-square deviation (RMSD) of 0.39 (Figure 2, right plot). Its predictions were more accurate when the actual stability change between wild-type and mutant amino acids was less dramatic; in general, MUpro tended to underestimate larger energy changes.
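The two evaluation views mentioned above, agreement in sign and RMSD of the predicted values, can be computed as follows. The numbers are invented for illustration and are not MUpro results; the function names are likewise assumptions.

```python
import math


def rmsd(predicted, actual):
    """Root-mean-square deviation between predicted and measured values."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def sign_accuracy(predicted, actual):
    """Fraction of mutations whose predicted energy-change sign is correct."""
    hits = sum(1 for p, a in zip(predicted, actual) if (p > 0) == (a > 0))
    return hits / len(predicted)


# Hypothetical energy changes for four mutations.
predicted = [-1.0, 0.5, -0.2, 1.0]
actual = [-1.5, 0.5, 0.3, 1.0]

print(f"RMSD = {rmsd(predicted, actual):.3f}")
print(f"sign accuracy = {sign_accuracy(predicted, actual):.2f}")
```

RMSD summarizes the regression task, while sign accuracy summarizes the classification task of deciding whether a mutation stabilizes or destabilizes the protein.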

In one early comprehensive examination of the effects of nsSNPs on protein function, Wang et al. [28] catalogued nsSNP effects according to the structural and sequence changes caused by the introduction of mutant amino acids. That study extracted 262 disease-causing missense variants from the Human Gene Mutation Database (HGMD) and 42 neutral variants from hypertension-associated genes. The proteins harboring these variants either had 3D structures deposited in the Protein Data Bank (PDB) or had homologs of known structure with a sequence similarity of at least 40 percent. The authors then modeled both wild-type and mutant protein structures based on the available 3D structures. By examining a broad range of physicochemical parameters in the resulting models, including loss of hydrogen bonds, loss of a salt bridge, over-packing, and disruption of binding, they compared the distributions of effects observed in disease-causing and neutral variants (Table 1). Their results showed that loss of stability accounts for a far larger share of disease-causing variants than of neutral variants (83 versus 26 percent) and that 70 percent of neutral variants cause no measurable effect on protein structure.


[Table 1 body not recovered. Columns: Disease, Neutral. Surviving row labels: Ligand binding; No effect.]

Table 1. Percentage of effects from missense variants on protein function (adapted from Figure 2 in [28])

This survey suggests that nsSNPs that change protein stability are more likely than not to be disease-related, and that this property might be useful in distinguishing disease-causing from neutral nsSNPs. Moreover, computational tools such as MUpro that predict stability greatly facilitate this task, because they can be applied to virtually any protein whose sequence is available.
