Mutation prediction: MutPred

In light of the above observations on the wide variety of consequences of a single mutation, we developed a large range of features for each variant and employed a popular machine learning technique, random forest, to distinguish disease-associated mutations from neutral ones. We called the model MutPred [42].

For supervised learning, we collected two sets of disease-associated mutations. One set came from the HGMD [3], in which 95 percent of mutations were annotated to monogenic diseases; we extracted the other from a cancer-sequencing project [41]. We also created two corresponding control data sets (Table 5). For the HGMD data, a set of UniProt variants annotated as polymorphisms served as controls (SPP). For the cancer data, we identified all neutral mutations that occurred in the same proteins observed in the cancer data set and used them as the cancer controls. On average, HGMD proteins harbored 7.3 times as many variants as SPP proteins, whereas the difference between the cancer data set and its controls was much less dramatic.

[Table 5: columns are Data set, Mutations, and Proteins; rows cover the HGMD, SPP, cancer, and cancer control sets.]

Table 5. Summary of disease and neutral data sets.

We generated a total of 130 numeric attributes from the protein sequence for each mutation and used them as input to a random forest classifier. These attributes can be divided into three major types (Table 6): functional properties, structure and dynamics, and evolutionary information. Besides sequence conservation, the evolutionary attributes include the position-specific scoring matrix (PSSM) generated by PSI-BLAST, the Pfam domain profile, and the transition frequency from SNAP [43].

As the PTPS example shows, the influence of nsSNPs could spread to neighboring PTM sites. Accordingly, we expanded the definitions for gain/loss of structural and functional properties to pick up the largest gain/loss changes within an 11-residue window centered on the mutant position.
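The windowed gain/loss definition can be sketched as follows. This is a minimal illustration, not the MutPred implementation: the function name, the use of per-residue score differences between mutant and wild-type sequences, and the truncation of the window at the protein termini are all assumptions made for the example.

```python
def window_gain_loss(wt_scores, mut_scores, pos, half_width=5):
    """Return (largest gain, largest loss) of a property near a substitution.

    wt_scores, mut_scores: per-residue predicted property scores for the
    wild-type and mutant sequences (hypothetical inputs, same length).
    pos: 0-based index of the substituted residue.
    half_width=5 gives the 11-residue window described in the text.
    """
    if len(wt_scores) != len(mut_scores):
        raise ValueError("score vectors must have the same length")
    start = max(0, pos - half_width)                  # truncate at N-terminus
    end = min(len(wt_scores), pos + half_width + 1)   # truncate at C-terminus
    deltas = [mut_scores[i] - wt_scores[i] for i in range(start, end)]
    gain = max(0.0, max(deltas))    # largest increase within the window
    loss = max(0.0, -min(deltas))   # largest decrease within the window
    return gain, loss


# Toy example: a property rises at residue 8 and falls at residue 12,
# both inside the window around position 10; residue 2 lies outside it.
wt = [0.25] * 20
mut = list(wt)
mut[8], mut[12], mut[2] = 0.75, 0.0, 1.0
print(window_gain_loss(wt, mut, pos=10))
```

Taking the maximum over the window lets a substitution be credited with an effect on a neighboring site, such as a nearby PTM site, rather than only on the mutated residue itself.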

Random forest is an ensemble learning technique based on a population of binary decision trees, each of which is grown on a randomly chosen subset of the features and a bootstrapped sample of the training data [54]. For classification, the outcome is the majority vote of the individual trees.




Functional properties: DNA-binding residues; catalytic residues; MoRFs; phosphorylation sites (DisPhos [37]); methylation sites; glycosylation sites; ubiquitination sites.

Structure and dynamics: secondary structure (PHD/Prof [48]); solvent accessibility (PHD/Prof [48]); stability (MUpro [21]); intrinsic disorder (DISPROT [49]); B-factor ([50]); transmembrane helix; coiled-coil structure.

Evolutionary information: sequence conservation (conservation index [53]).

Table 6. Major attributes used in MutPred. + unpublished in-house program. £ used in latest version of MutPred.

Unlike a single decision tree, each tree within a random forest sees only a subset of the features and samples, which keeps the correlation among trees low and effectively reduces the overall variance of the model. Moreover, random forests inherit some attractive properties from decision trees, such as robustness to outliers and ease of interpretation.
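To make the two sources of randomness concrete, here is a toy sketch in plain Python. It is deliberately simplified and is not the classifier used in MutPred: each "tree" is a depth-1 decision stump, the feature subset is drawn once per tree, and all names and data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, features):
    """Pick the (feature, threshold, flip) rule with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            # Base rule: predict class 1 when the feature value exceeds t.
            errors = sum((1 if r[f] > t else 0) != y for r, y in zip(rows, labels))
            # The flipped rule makes exactly the complementary errors.
            for flip, err in ((False, errors), (True, len(rows) - errors)):
                if best is None or err < best[0]:
                    best = (err, f, t, flip)
    _, f, t, flip = best
    return f, t, flip


def predict_stump(stump, row):
    f, t, flip = stump
    pred = 1 if row[f] > t else 0
    return 1 - pred if flip else pred


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]            # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)   # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over the individual trees."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature data: class 1 has high values, class 0 low values.
rows = [(0.10, 0.20), (0.20, 0.10), (0.30, 0.25), (0.15, 0.30),
        (0.80, 0.90), (0.90, 0.70), (0.70, 0.80), (0.85, 0.75)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0.95, 0.95)), predict_forest(forest, (0.0, 0.0)))
```

Because every stump is fitted to a different bootstrap sample and feature subset, individual stumps disagree on borderline inputs, and the vote averages those disagreements out. An odd number of trees avoids tied votes in the two-class case.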

In our model, we specified 1,000 trees to build the classifier separating disease from neutral mutations. MutPred achieved better accuracy on the HGMD data than on the somatic cancer data, suggesting that it is better suited to monogenic disease-related mutations than to somatic cancer mutations (Table 7). This is likely due to the large number of passenger (non-causative) variants in cancer tissue sequencing data sets. In terms of area under the curve (AUC), MutPred achieved 0.86 on the HGMD and 0.69 on the cancer data sets (Figure 5, left).
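For readers unfamiliar with the AUC, it can be read as the probability that a randomly chosen disease variant receives a higher score than a randomly chosen neutral one. The sketch below computes it directly from that definition; the function name and the scores are made up for illustration and are not MutPred outputs.

```python
def auc(disease_scores, neutral_scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    wins = 0.0
    for d in disease_scores:
        for n in neutral_scores:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(disease_scores) * len(neutral_scores))


# Hypothetical classifier scores: 8 of the 9 disease/neutral pairs are ordered correctly.
print(auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported values of 0.86 and 0.69 in context.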

Data set    Sensitivity    Specificity    Accuracy
HGMD        76.8           79.0           77.7
Cancer      60.9           68.4           65.5

Table 7. Classification performance (in percent) on the HGMD and cancer data sets.
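The three measures in Table 7 follow from the confusion matrix in the usual way. The sketch below uses invented counts purely for illustration; the counts underlying Table 7 are not given in the text.

```python
def classification_summary(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as fractions.

    tp/fn: disease mutations predicted as disease/neutral.
    tn/fp: neutral mutations predicted as neutral/disease.
    """
    sensitivity = tp / (tp + fn)                 # recall on disease mutations
    specificity = tn / (tn + fp)                 # recall on neutral mutations
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall fraction correct
    return sensitivity, specificity, accuracy


# Hypothetical counts, not taken from the MutPred evaluation.
sens, spec, acc = classification_summary(tp=80, fn=20, tn=150, fp=50)
print(f"{100 * sens:.1f} {100 * spec:.1f} {100 * acc:.1f}")
```

Accuracy is a class-size-weighted average of sensitivity and specificity, so it always lies between the two, as it does for the cancer row (60.9, 68.4, 65.5).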

MutPred not only provides comparable predictions of a mutation's predisposition to cause disease [55], but also allows the significance level of individual gains or losses of properties to be estimated (Figure 5, right). It is reasonable to assume that the distribution of property p in the neutral data set provides an unbiased approximation of the true null distribution, given that UniProt provided the largest available set of curated neutral variants. We could therefore generate hypotheses about the molecular mechanism underlying variants at three confidence levels: (1) actionable hypotheses: 0.5 < MutPred score < 0.78 AND property score < 0.05; (2) confident hypotheses: MutPred score > 0.78 AND 0.01 < property score < 0.05; (3) very confident hypotheses: MutPred score > 0.78 AND property score < 0.01, where 0.78 corresponds to a specificity of 0.95 on the HGMD data set.

Figure 5. The Receiver Operating Characteristic (ROC) curves for the HGMD and cancer data sets (left), and example distributions of gain/loss property p in the neutral and disease sets (green and red, respectively; right). An empirical distribution of the putatively neutral substitutions can be used to define a threshold r on the false positive rate that, in turn, can be used to accept or reject the null hypothesis for new substitutions. The area shaded in green represents the P-value threshold (corresponding to the score r) that MutPred uses to hypothesize the molecular cause of disease. A particular area under the right tail of the neutral distribution is referred to as the property score.
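The property score and the three confidence levels can be sketched in a few lines. The function names are hypothetical, the empirical P-value is the simplest right-tail estimate and may differ from what MutPred computes, and the text does not say how scores falling exactly on a threshold (0.78 or 0.01) are treated, so this sketch leaves them unclassified.

```python
def property_score(value, neutral_values):
    """Empirical right-tail P-value: fraction of neutral values >= value."""
    return sum(v >= value for v in neutral_values) / len(neutral_values)


def hypothesis_level(mutpred_score, prop_score):
    """Map a MutPred score and a property score to a confidence level, or None."""
    if mutpred_score > 0.78 and prop_score < 0.01:
        return "very confident"
    if mutpred_score > 0.78 and 0.01 < prop_score < 0.05:
        return "confident"
    if 0.5 < mutpred_score < 0.78 and prop_score < 0.05:
        return "actionable"
    return None  # no hypothesis, or a boundary case the text leaves open


# Hypothetical neutral distribution of a gain/loss property (100 values).
neutral = list(range(100))
p = property_score(97, neutral)   # 3 of 100 neutral values are >= 97
print(p, hypothesis_level(0.85, p))
```

Because the null distribution is estimated from the neutral set, the smallest attainable property score is bounded by the size of that set, which is one reason a large curated neutral collection matters.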

We applied MutPred to 203,899 nsSNPs deposited in dbSNP (build 135) and examined the score distribution and the most frequent hypotheses behind the predicted deleterious mutations. Overall, 35 percent of mutations received scores higher than 0.5 and were therefore classified as disease-associated (Figure 6). Of these deleterious mutations, 19.6 percent received at least one functional or structural hypothesis about a possible molecular mechanism. The top three hypotheses all pointed to structural changes: gain of disorder (9.7 percent), loss of stability (8.5 percent), and loss of disorder (6.2 percent). This result agrees with [28], at least in the sense that these changes are the most frequently seen. Among functional alterations, the most common ones involved in disease were loss of MoRF binding (6.0 percent), gain of methylation (5.9 percent), and gain of catalytic residue (5.6 percent).
