Pharmaceutical and biological researchers routinely study protein structures and mutations to better understand virus evolution. From a thermodynamic perspective, protein structures are predicted through computational simulations such as molecular dynamics simulation, which can estimate free energy, intermediate states, mutation effects, and protein-protein interactions. However, this approach has limitations in deciphering complex protein structures. To bridge this gap, machine learning and deep learning have been applied to study virus evolution. Notably, escape mutations of SARS-CoV-2 have been predicted using natural language processing techniques, which interpret amino acid sequences in terms of semantic change (antigenic variation) and grammaticality (viability/fitness). Surprisingly, training models on amino acid sequences alone was sufficient to predict escape mutations, without additional information on protein structure or function.
Although natural language models can suggest possible escape mutations, their predictions need to be more accurate to avoid selecting unnecessary candidates. In this study, we evaluated and refined a novel language model by incorporating nucleotide substitutions to improve prediction accuracy. Biologically, amino acid sequences are determined by nucleotide sequences, and most mutations occur at the DNA or RNA level. Although deep learning models might learn this information indirectly from amino acid sequences, integrating nucleotide data directly into the model yielded more accurate predictions. This approach enabled us not only to reduce unnecessary candidates for escape mutations but also to improve the prediction of characteristic and dominant mutations.
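To illustrate why nucleotide-level information can narrow the candidate set, the following minimal sketch lists the amino acids reachable from a codon by a single nucleotide substitution under the standard genetic code. It is an illustration only, not the model used in this study; the function name `single_substitution_neighbors` and the example codons are assumptions chosen for the demonstration.

```python
from itertools import product

# Standard genetic code, with codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_substitution_neighbors(codon):
    """Return the amino acids reachable from `codon` by one nucleotide change.

    Hypothetical helper for illustration. Synonymous changes and stop
    codons ("*") are excluded from the result.
    """
    codon = codon.upper().replace("U", "T")  # accept RNA or DNA notation
    original = CODON_TABLE[codon]
    reachable = set()
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            amino_acid = CODON_TABLE[mutant]
            if amino_acid != original and amino_acid != "*":
                reachable.add(amino_acid)
    return reachable


if __name__ == "__main__":
    # An asparagine codon (AAT) is used here purely as an example.
    neighbors = single_substitution_neighbors("AAT")
    print(sorted(neighbors), len(neighbors))
```

For the asparagine codon AAT, only 7 of the 19 alternative amino acids are one nucleotide substitution away. A model that sees only amino acid sequences has no explicit access to this constraint, which is one plausible reason nucleotide data could help filter out unlikely candidates.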