Novel methods to protein mutation prediction based on Naïve Bayes classifier – An application in Influenza A virus  

Anh Tuan Tran1, Ly Le2, and Bao The Pham1

1VNUHCM – University of Science, 227 Nguyen Van Cu street, District 5, Ho Chi Minh City

2VNUHCM – International University, Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City


The high mutation rate of several viruses is one of the main causes of dangerous epidemics and pandemics. There is therefore a demand for accurate prediction of dangerous mutations that lead to new pathogenic strains resistant to current drugs and vaccines. In this research, we propose novel methods for mutation prediction based on the Naïve Bayes classifier and conduct experiments on Neuraminidase and Hemagglutinin of influenza A virus. Our methods achieve reasonable accuracy and matching scores. These results show that our predicted sequences are likely to share the structure of their protein families and provide useful information for antiviral drug design.

Citation: Tran AT, Le L, & Pham BT (2015) Novel methods to protein mutation prediction based on Naïve Bayes classifier – An application in Influenza A virus. Genomic Medicine 2015, eds Le L & Pham S (Ho Chi Minh City, Viet Nam).
VJS Editor: Phuc Le, Center for Value - based Care Research, Cleveland Clinic, USA


The rapid evolution of several viruses, such as HIV and influenza, threatens health care systems worldwide (1-4). These viruses have high error (mutation) rates during replication (transcription or DNA duplication), which produce a variety of different genotypes. High error rates benefit the viruses by allowing them to adapt to new environments, escape our immune system, and resist current drugs. Mutations in protein sequences change the respective polypeptide chains and protein structures. Consequently, designing drugs or vaccines against these viruses is a challenge for biologists. Moreover, these viruses can infect and cause severe disease in hosts other than their natural ones. Since 1977, for example, there have been numerous reports of avian or swine influenza viruses infecting humans (5-8).

A mutation can be synonymous, leaving the encoded protein apparently unaffected, but a non-synonymous mutation contributes significantly to the evolution of protein sequences and structures. Only mutated proteins that survive selection pressure allow a species to evolve and adapt to new environments.

Mutations occur randomly and unpredictably, but the mutations that survive selection may follow predictable patterns. Protein sequences are known to evolve more rapidly than structures (9), possibly because protein structure determines protein function. A mutated sequence that changes the structure may affect the protein's function, so that an individual carrying it cannot survive normally. Thus, we can expect surviving mutated proteins to preserve the structure and function of the protein.

Several tools related to this subject have been developed, especially for correlated mutation prediction and protein residue-residue contact map prediction (10-16). The impact of mutations on proteins has also been analyzed (17, 18). However, to our knowledge no existing work directly predicts protein mutations; these studies do not indicate which mutants will occur in the future. In this paper, we investigate how to predict, from the sequence data of a protein family, the specific protein mutations that will survive natural selection in the future.

The protein sequence sets that form the main data for this study are non-orderable discrete data. A statistical learning method such as the Naïve Bayes classifier is therefore appropriate for this problem (19). These data sets are usually large, with thousands of instances and hundreds of attributes. While more complex learning methods take a long time to run, the Naïve Bayes classifier can learn from the data in an acceptable time with reasonable accuracy. Hence, we propose a method that uses the Naïve Bayes classifier to predict protein mutations that survive natural selection pressure, and we conduct experiments on the membrane glycoproteins of influenza A virus in Viet Nam.


Our method consisted of four main steps: preprocessing, multiple sequence alignment, target determination, and mutation prediction. The overall procedure is shown in Figure 1.

Figure 1. Overall methodology


In the first step, we automatically eliminated sequences that contained unexpected characters (Algorithm 1). Several sequences were incomplete or contained uncommon amino acids, so preprocessing was necessary to reduce noise.

Algorithm 1. Preprocessing
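
The filtering step described above can be illustrated with a minimal Python sketch. The function names, the choice of the 20 standard one-letter residue codes as the allowed alphabet, and the toy FASTA text are assumptions made for illustration; the original Algorithm 1 is not reproduced here, and this sketch does not attempt to detect incomplete sequences.

```python
# Hypothetical sketch of the character-based filtering step (not the authors' Algorithm 1).
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")  # assumed allowed alphabet


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_sequences(records):
    """Keep only non-empty sequences made entirely of standard residues."""
    return [(h, s) for h, s in records if s and set(s) <= STANDARD_AMINO_ACIDS]


example = """>seq1
MKTIIALSYI
>seq2
MKXIIALSBI
>seq3
MNPNQKIITI
"""
kept = filter_sequences(parse_fasta(example))
print([header for header, _ in kept])  # seq2 is dropped because it contains X and B
```

Rejecting whole sequences, rather than masking individual residues, follows the description in the text that sequences with unexpected characters were eliminated.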

Multiple sequence alignment

The protein sequences in the data set were not equal in length, so we had to normalize them to the same length. Multiple sequence alignment using a dynamic programming algorithm is time consuming (20). Therefore, we used a progressive method with the neighbor-joining algorithm and the Jukes-Cantor evolutionary distance for this task (21-24).
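
As an illustration of the distance used to build the guide tree, the following Python sketch computes a Jukes-Cantor corrected distance between two aligned sequences. It assumes the general form of the correction with 20 equally likely states for amino acids, and it ignores columns where either sequence has a gap; both choices are assumptions of this sketch and may differ from the implementation used in the study.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(seq_a, seq_b, n_states=20):
    """Jukes-Cantor corrected distance (n_states=20 for proteins, 4 for nucleotides)."""
    b = (n_states - 1) / n_states
    p = p_distance(seq_a, seq_b)
    if p >= b:
        return math.inf  # saturated: the correction is undefined at or beyond this point
    return b * math.log(1.0 / (1.0 - p / b))


# Two toy aligned sequences that differ at one of ten positions.
d = jukes_cantor_distance("ACDEFGHIKL", "ACDEFGHIKV")
print(d)
```

The corrected distance is always at least as large as the raw fraction of differences, because it accounts for multiple substitutions at the same site.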

Target determination

Following Mitchell, the ith protein sequence in data set P was called an instance (sample) and denoted by p_i; the jth position of p_i was then an attribute (19). In addition, we needed to determine the target t_i for each instance p_i. We proposed three methods: 1) the first method is based on the phylogenetic tree; 2) the second method is based on the order of the protein sequence; 3) the third method is a combination of the first two.

Phylogenetic tree

We assumed that phylogenetic trees accurately reflect the evolutionary processes of protein families. From these processes and trees, we could then learn the rules of a protein family and predict the surviving sequences that will occur in the future. The target for each sequence is the sequence's child, as in equation (1),



where the targets are the children of p_i in the phylogenetic tree.

We constructed binary guide trees using the neighbor-joining algorithm, the Jukes-Cantor evolutionary distance, and dynamic programming for pairwise alignment (20, 24, 25). From the guide trees, we constructed phylogenetic trees by determining the parent of each pair of sequences that share the same higher-level node (Algorithm 2).

Algorithm 2. Phylogenetic tree construction

Order and components of protein sequences

To maintain its function, a surviving sequence evolves more slowly in structure than in sequence. We therefore assumed that the order and components of a sequence contain information about its structure, so that we could predict future sequences having the same structure as the protein family. The target for each sequence was determined by equation (2).


s_i = p_i    (2)


Mutation prediction

For each position in the protein sequences, we generated a learner based on the Naïve Bayes classifier to predict the mutation at that residue. We then combined the predictions for all positions to obtain the final mutated sequence. Given a protein sequence p_i, the probability that its jth residue mutates into a specific amino acid, taken as the kth element of the set of possible amino acids, was expressed by equation (3), which is based on Bayes' formula.



We assumed that the attributes of each instance are independent, so the probability could be rewritten as (4). The two probability terms in (4) were given by (5) and (6), respectively,



where |.| denotes the number of elements in a set, L(p_i) = |p_i| is the length of the sequence, and t is the target (t = f for the first method and t = s for the second method).
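
For orientation, the standard Naïve Bayes form that the derivation above describes can be written as follows. The notation is an assumption introduced for this sketch: A is the set of possible amino acids with kth element a_k, p_{i,m} is the residue at position m of p_i, t_{i,j} is the target residue at position j, and the probabilities are estimated as relative frequencies over the training set P. The original equations may differ in detail.

```latex
% Assumed notation: a_k \in A is the k-th amino acid, p_{i,m} the residue at
% position m of p_i, t_{i,j} the target residue at position j, L(p_i) = |p_i|.
\begin{align*}
P(t_{i,j} = a_k \mid p_i)
  &= \frac{P(p_i \mid t_{i,j} = a_k)\, P(t_{i,j} = a_k)}{P(p_i)}
  && \text{(Bayes' formula)} \\
  &\propto P(t_{i,j} = a_k) \prod_{m=1}^{L(p_i)} P(p_{i,m} \mid t_{i,j} = a_k)
  && \text{(independent attributes)} \\
P(t_{\cdot,j} = a_k)
  &\approx \frac{\left|\{\, n : t_{n,j} = a_k \,\}\right|}{|P|}
  && \text{(assumed frequency estimate)} \\
P(p_{\cdot,m} = x \mid t_{\cdot,j} = a_k)
  &\approx \frac{\left|\{\, n : p_{n,m} = x,\ t_{n,j} = a_k \,\}\right|}
                {\left|\{\, n : t_{n,j} = a_k \,\}\right|}
  && \text{(assumed frequency estimate)} \\
\hat{t}_{i,j}
  &= \arg\max_{a_k \in A}\; P(t_{i,j} = a_k) \prod_{m=1}^{L(p_i)} P(p_{i,m} \mid t_{i,j} = a_k)
  && \text{(most probable residue)}
\end{align*}
```

The denominator P(p_i) does not depend on a_k, which is why it can be dropped when only the most probable residue is needed.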

The mutated jth residue was predicted by equation (7).



Finally, the predicted mutant of p_i was the sequence formed by the predicted residues at all positions.
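
The per-position prediction scheme can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the implementation used in the study: every aligned position of the input sequence is used as an attribute, probabilities are relative frequencies with Laplace smoothing (alpha), the alphabet includes a gap symbol, and the class and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus the alignment gap (assumption)


class PositionLearner:
    """Naive Bayes learner that predicts the target residue at one position."""

    def __init__(self, position, alpha=1.0):
        self.position = position
        self.alpha = alpha  # Laplace smoothing constant (assumption of this sketch)
        self.class_counts = Counter()
        # cond_counts[target_residue][attribute_index][observed_residue] -> count
        self.cond_counts = defaultdict(lambda: defaultdict(Counter))
        self.total = 0

    def fit(self, instances, targets):
        for seq, target in zip(instances, targets):
            label = target[self.position]
            self.class_counts[label] += 1
            self.total += 1
            for m, residue in enumerate(seq):
                self.cond_counts[label][m][residue] += 1
        return self

    def log_score(self, seq, label):
        # log prior plus the sum of smoothed log likelihoods over all attributes
        n_label = self.class_counts[label]
        score = math.log(n_label / self.total)
        for m, residue in enumerate(seq):
            count = self.cond_counts[label][m][residue]
            score += math.log((count + self.alpha) / (n_label + self.alpha * len(ALPHABET)))
        return score

    def predict(self, seq):
        # Only residues seen as targets during training are candidate classes.
        return max(sorted(self.class_counts), key=lambda label: self.log_score(seq, label))


def predict_mutant(instances, targets, query):
    """Train one learner per position and join the per-position predictions."""
    learners = [PositionLearner(j).fit(instances, targets) for j in range(len(query))]
    return "".join(learner.predict(query) for learner in learners)


# Toy aligned data: each instance is paired with its target sequence.
instances = ["ACD", "ACD", "AGD", "AGE"]
targets = ["ACE", "ACE", "AGE", "AGE"]
print(predict_mutant(instances, targets, "ACD"))
```

Working in log space avoids numerical underflow when the product runs over hundreds of attributes, which matters for alignments of realistic length.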

Given the probabilities obtained with the first and the second method, the third method predicted the mutated jth residue by equation (8).






Experiments and results

Data collection and preprocessing

We collected Neuraminidase (NA) and Hemagglutinin (HA) sequences of influenza A virus from Viet Nam from the National Center for Biotechnology Information (NCBI), in FASTA format. We also collected sequences from Southeast Asia for testing purposes. The data obtained after preprocessing are summarized in Table 1.

Table 1. Data selection and preprocessing

Type of protein | Viet Nam | Southeast Asia

Results

First, we defined two evaluation criteria: accuracy and matching score. The accuracy of our method was defined as the mean of the accuracy rates of the individual predicted positions, as in equation (10).
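
One plausible reading of this accuracy criterion is sketched below in Python: for each position, the accuracy rate is the fraction of test sequences whose predicted residue equals the target residue, and the overall accuracy is the mean of these rates. The function name and the toy sequences are hypothetical, and the exact form of the original equation may differ.

```python
def positional_accuracy(predicted, actual):
    """Return (per-position accuracy rates, their mean) for equal-length sequences."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need the same non-zero number of predicted and actual sequences")
    length = len(predicted[0])
    if length == 0 or any(len(s) != length for s in list(predicted) + list(actual)):
        raise ValueError("all sequences must share the same non-zero length")
    rates = []
    for j in range(length):
        # Count the sequences whose prediction matches the target at position j.
        hits = sum(p[j] == a[j] for p, a in zip(predicted, actual))
        rates.append(hits / len(predicted))
    return rates, sum(rates) / length


rates, accuracy = positional_accuracy(["ACE", "AGE"], ["ACE", "ACD"])
print(rates, accuracy)
```

Keeping the per-position rates, and not only their mean, makes it possible to relate accuracy to the conservation of each position.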






We also computed the matching score between predicted sequences and reference sequences. Gaps in the predicted sequences were removed; each predicted sequence was then aligned with every reference sequence, and its Jukes-Cantor distance to each sequence in the reference set was computed. Denoting by r_i the aligned reference sequence nearest to the aligned predicted sequence a_i, the matching score was defined by equation (12).






In this research, we used the NA and HA data sets from Viet Nam for both training and testing because of the lack of data. Table 2 reports the accuracy of our methodology. For NA, the accuracies of the first and the second method are almost the same, 67.87% and 65.42% respectively. For HA, the difference between the two methods is just 4.58%. We also investigated the correlation between the conservation of the protein family and the accuracy. As Figure 2 illustrates, most highly conserved positions (80%) gave high accuracy (80%).

Table 2. Accuracy

Type of protein | First method | Second method

Figure 2. Correlation between accuracy and conserved positions. (A) The first method for NA. (B) The second method for NA. (C) The first method for HA. (D) The second method for HA.

Countries in Southeast Asia are similar to Viet Nam in terms of environment, biological resources, etc., so we used the NA and HA data sets from Southeast Asia as a reference set to evaluate the matching score. As Table 3 shows, for NA all three methods give moderate matching scores: 69.98%, 66.35%, and 69.17% respectively. For HA, the matching scores of the three methods are considerable: 77.92%, 78.17%, and 78.49% respectively. In addition, we randomly picked one predicted sequence for each method and protein type and predicted its 3D structure (26-29). According to Table 4, the rates of modeled positions are higher than 80% in all six cases.

Table 3. Matching score

Type of protein | First method | Second method | Third method


Table 4. Percentage of residues modeled

Type of protein | First method | Second method | Third method
NA | 83% (4b7qA) | 82% (3tiaA) | 83% (4gzoA)
HA | 91% (2wr0A) | 89% (2yp7A) | 89% (2yp7A)


We proposed a novel method for predicting mutated proteins that survive selection pressure, with reasonable accuracies and matching scores, but we could find almost no related work for comparison. We did not expect perfect accuracy, because a perfectly accurate prediction would not suggest any novel mutated sequences. The results are therefore reasonable, although both methods give accuracies under 80%. Moreover, our method gives high accuracy at conserved positions, which play an important role in the structure and function of the protein. The matching score is more important than the accuracy, since it indicates how similar our predicted sequences are to the sequences in the reference set. The more similar the predicted sequences are to the reference set, the more likely their structures are to resemble the structure of the protein family. In our research the matching scores reach only moderate levels, but they are expected to increase with larger reference sets. Additionally, the proportions of modeled positions are higher than 80%. These results show that our predicted sequences are likely to share the structure of their protein families. We suggest applying other effective machine learning methods to the same data sets to improve the accuracy and matching score. The protein structures of the predicted sequences can then be constructed using homology modelling for further study.


1. Shankarappa R, et al (1999) Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol 73(12): 10489-10502.
2. Buonagurio DA, et al (1986) Evolution of human influenza A viruses over 50 years: Rapid, uniform rate of change in NS gene. Science 232(4753): 980-982.
3. Parvin JD, Moscona A, Pan WT, Leider JM & Palese P (1986) Measurement of the mutation rates of animal viruses: Influenza A virus and poliovirus type 1. J Virol 59(2): 377-383.
4. Rambaut A, Posada D, Crandall KA & Holmes EC (2004) The causes and consequences of HIV evolution. Nat Rev Genet 5(1): 52-61.
5. Kimura K, Adlakha A & Simon PM (1998) Fatal case of swine influenza virus in an immunocompetent host. Mayo Clin Proc 73(3): 243-245.
6. Rota PA, et al (1989) Laboratory characterization of a swine influenza virus isolated from a fatal case of human influenza. J Clin Microbiol 27(6): 1413-1416.
7. Dowdle WR & Hattwick MA (1977) Swine influenza virus infections in humans. J Infect Dis 136 Suppl: S386-9.
8. Wells DL, et al (1991) Swine influenza virus infections. Transmission from ill pigs to humans at a Wisconsin agricultural fair and subsequent probable person-to-person transmission. JAMA 265(4): 478-481.
9. Chothia C & Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5(4): 823-826.
10. Fariselli P, Olmea O, Valencia A & Casadio R (2001) Prediction of contact maps with neural networks and correlated mutations. Protein Eng 14(11): 835-843.
11. Di Lena P, Nagata K & Baldi P (2012) Deep architectures for protein contact map prediction. Bioinformatics 28(19): 2449-2457.
12. Neher E (1994) How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sci U S A 91(1): 98-102.
13. Gobel U, Sander C, Schneider R & Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins 18(4): 309-317.
14. Olmea O & Valencia A (1997) Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold Des 2(3): S25-32.
15. Cheng J & Baldi P (2007) Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics 8: 113.
16. Ashkenazy H & Kliger Y (2010) Reducing phylogenetic bias in correlated mutation analysis. Protein Eng Des Sel 23(5): 321-326.
17. Bromberg Y, Yachdav G & Rost B (2008) SNAP predicts effect of mutations on protein function. Bioinformatics 24(20): 2397-2398.
18. Worth CL, Preissner R & Blundell TL (2011) SDM-a server for predicting effects of mutations on protein stability and malfunction. Nucleic Acids Res 39: W215-22.
19. Mitchell TM (1997) Machine Learning (McGraw-Hill, New York).
20. Lipman DJ, Altschul SF & Kececioglu JD (1989) A tool for multiple sequence alignment. Proc Natl Acad Sci U S A 86(12): 4412-4415.
21. Chenna R, et al (2003) Multiple sequence alignment with the clustal series of programs. Nucleic Acids Res 31(13): 3497-3500.
22. Higgins DG & Sharp PM (1988) CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene 73(1): 237-244.
23. Thompson JD, Higgins DG & Gibson TJ (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22): 4673-4680.
24. Jukes TH & Cantor CR (1969) in Mammalian Protein Metabolism, ed Munro HN (Academic Press, New York), pp 21-132.
25. Saitou N & Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4): 406-425.
26. Kallberg M, et al (2012) Template-based protein structure modeling using the RaptorX web server. Nat Protoc 7(8): 1511-1522.
27. Ma J, Wang S, Zhao F & Xu J (2013) Protein threading using context-specific alignment potential. Bioinformatics 29(13): i257-65.
28. Peng J & Xu J (2011) A multiple-template approach to protein threading. Proteins 79(6): 1930-1939.
29. Peng J & Xu J (2011) RaptorX: Exploiting structure information for protein alignment by statistical inference. Proteins 79 Suppl 10: 161-171.
