How do you know if a compound will be a mutagen before you test it? Before you even make it?
Drug companies make lots of new molecules that they hope will be useful as drugs. But a lot of other things can happen when a biologically active molecule gets inside you. A lot of potential drugs just don't work, or work poorly. Many work well enough for the task at hand, but have unpleasant side effects. Some side effects are inconvenient but tolerable; others are deal breakers.
Making a new molecule and testing it is a time-consuming process. Just making the molecule will involve several reactions run sequentially, and each step can require careful purification before you can go on to the next one. Once the molecule has been made, there is a battery of tests to run, both to see how well it works on the drug target and to find out if it is likely to make the patient sicker through side effects.
It would be helpful if you could predict the toxicity of a compound before you go to the trouble of making it in the lab, or at least rule out compounds that are likely to be highly toxic. In "Accurate and Interpretable Computational Modeling of Chemical Mutagenicity", Langham and Jain describe their work on predicting whether a compound is a mutagen based only on the types of atoms in the molecule. And they get pretty good results.
To do this properly you will have to look at lots of molecules, so you need a simple way to describe your molecules quickly without running lots of complex calculations. Langham and Jain had a computer program enumerate all possible pairs of atoms in each molecule and record them as a list of atom pairs. The example they give in their paper is an atom pair found in aspirin described as O3_1_D5_C2_Ar2. This describes an sp3-hybridized oxygen attached to one heavy atom (not hydrogen) that is 5 bonds away from an aromatic carbon with two heavy neighbors.
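To make the atom-pair idea concrete, here is a minimal sketch in plain Python. It is not the authors' program. The atom types for aspirin are assigned by hand, the label format is inferred from the single example above, and the rule for which atom is written first is an assumption chosen so that the aspirin example comes out as written.

```python
from collections import Counter, deque

# Heavy atoms of aspirin, CC(=O)Oc1ccccc1C(=O)O, typed by hand as
# (element, hybridization digit, is_aromatic). Hydrogens are left out.
ATOMS = [
    ("C", 3, False),  # 0  methyl carbon
    ("C", 2, False),  # 1  acetyl carbonyl carbon
    ("O", 2, False),  # 2  acetyl carbonyl oxygen
    ("O", 3, False),  # 3  ester oxygen
    ("C", 2, True),   # 4  ring carbon bearing the ester
    ("C", 2, True),   # 5  ring CH
    ("C", 2, True),   # 6  ring CH
    ("C", 2, True),   # 7  ring CH
    ("C", 2, True),   # 8  ring CH
    ("C", 2, True),   # 9  ring carbon bearing the acid group
    ("C", 2, False),  # 10 carboxyl carbon
    ("O", 2, False),  # 11 carboxyl carbonyl oxygen
    ("O", 3, False),  # 12 hydroxyl oxygen
]
BONDS = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5), (5, 6), (6, 7),
         (7, 8), (8, 9), (9, 4), (9, 10), (10, 11), (10, 12)]


def adjacency(n_atoms, bonds):
    """Build a neighbor list for each heavy atom."""
    adj = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def atom_label(atom, heavy_neighbors):
    """Assumed label format: element + hybridization, then neighbor count."""
    element, hybridization, aromatic = atom
    return f"{element}{hybridization}_{'Ar' if aromatic else ''}{heavy_neighbors}"


def bond_distances(adj, start):
    """Breadth-first search: shortest bond count from start to each atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in adj[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def atom_pairs(atoms, bonds):
    """Count every atom-pair descriptor in one connected molecule."""
    adj = adjacency(len(atoms), bonds)
    labels = [atom_label(a, len(adj[i])) for i, a in enumerate(atoms)]
    counts = Counter()
    for i in range(len(atoms)):
        dist = bond_distances(adj, i)
        for j in range(i + 1, len(atoms)):
            # Assumed ordering rule: reverse alphabetical, so "O..." precedes "C...".
            first, second = sorted((labels[i], labels[j]), reverse=True)
            counts[f"{first}_D{dist[j]}_{second}"] += 1
    return counts


pairs = atom_pairs(ATOMS, BONDS)
print(pairs["O3_1_D5_C2_Ar2"])  # hydroxyl oxygen to the ring CH five bonds away
```

In this hand-built graph the descriptor from the paper appears once, for the hydroxyl oxygen and the ring carbon opposite the acid group. A real pipeline would assign the atom types with a cheminformatics toolkit rather than by hand.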
Next you have to look for a pattern of atom pairs in a molecule that seems to be related to whether or not it is a mutagen. Just looking at the data would probably not be very effective, so the authors used three different machine learning techniques to look for a pattern: support vector machines (SVM), RuleFit, and k-nearest neighbors (KNN). They analyzed a training set of 4337 diverse compounds, 2401 of which were mutagens and 1936 of which were not, and found that the SVM method gave an accuracy of 0.77, while RuleFit was a little better with an accuracy of 0.79.
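Of the three techniques, k-nearest neighbors is the easiest to sketch: a new molecule gets the label that is most common among the training molecules whose atom pairs look most like its own. The example below is only an illustration under stated assumptions. The Tanimoto similarity measure, the value of k, and the placeholder features (`p1`, `p2`, ...) and labels are all made up for the sketch and are not taken from the paper.

```python
from collections import Counter


def tanimoto(a, b):
    """Shared features divided by total distinct features (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_predict(query, training, k=3):
    """Majority vote among the k training molecules most similar to the query."""
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: (set of atom-pair names, label),
# where 1 stands for "mutagen" and 0 for "non-mutagen".
TRAINING = [
    ({"p1", "p2", "p3"}, 1),
    ({"p1", "p2", "p4"}, 1),
    ({"p2", "p3", "p5"}, 1),
    ({"p6", "p7", "p8"}, 0),
    ({"p6", "p7", "p9"}, 0),
    ({"p7", "p8", "p10"}, 0),
]

print(knn_predict({"p1", "p3", "p5"}, TRAINING))  # 1
print(knn_predict({"p6", "p8", "p9"}, TRAINING))  # 0
```

Each query shares features only with training molecules of one class, so the vote is unanimous. With thousands of real compounds the neighbors would disagree far more often, which is part of why accuracies well below 1.0 are expected.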
The real test is how well the model works in predicting the activity of completely new molecules. So next they used their SVM and RuleFit results to try to predict the mutagenicity of a completely different set of compounds taken from the Carcinogenic Potency Database (CPDB). With this new set of compounds, SVM (accuracy 0.770) worked a little better than RuleFit (accuracy 0.718). This is far from ideal, but it's a pretty good start. And it is interesting to see that such a simple criterion as pairs of atoms can be predictive of a complex behavior like causing mutations.
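For context on what these accuracy figures mean, here is a small sketch of how accuracy is computed, together with the score you would get on the training set by simply guessing "mutagen" every time. The class counts are the ones quoted above for the training set. The class balance of the CPDB set is not given here, so no baseline is computed for it, and the toy predictions are made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Class counts quoted in the text for the training set.
N_MUTAGENS = 2401
N_NONMUTAGENS = 1936

# Accuracy of always predicting the larger class.
majority_baseline = max(N_MUTAGENS, N_NONMUTAGENS) / (N_MUTAGENS + N_NONMUTAGENS)
print(f"{majority_baseline:.3f}")  # 0.554

# Hypothetical toy labels, only to show the function in use.
toy_predicted = [1, 1, 0, 0, 1]
toy_actual = [1, 0, 0, 0, 1]
print(accuracy(toy_predicted, toy_actual))  # 0.8
```

On the training set, always guessing the majority class would score about 0.55, so accuracies in the high 0.7s mean the atom pairs are carrying real information.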