Computerized Detection of JWH Synthetic Cannabinoids Class Membership Based on Machine Learning Algorithms and Molecular Descriptors

An Artificial Neural Networks (ANN) model identifying JWH Synthetic Cannabinoids, that we have developed based on a combination of topological, 3D-MoRSE (Molecule Representation of Structure based on Electron diffraction) and ADMET (Absorption, Distribution, Metabolism, Excretion and Toxicity) molecular descriptors, is described and analyzed. The validation results indicate that this computerized system has a very high potential for efficiently predicting the class membership of JWH and discriminating them from a large variety of (non-JWH) substances of forensic interest.


I. INTRODUCTION
The emergence of synthetic cannabinoids such as the JWH series is an ongoing challenge for forensic and clinical analytical chemists and toxicologists. Because the JWH series of compounds produces euphoric and hallucinogenic effects, their abuse liability causes serious social problems. New analogs continuously emerge on the black market, with the aim of circumventing legal consequences. JWH synthetic cannabinoids are known to be lipid soluble and nonpolar. They contain 22 to 26 carbon atoms and share a common structural feature, i.e. a side chain composed of 4 to 9 saturated carbon atoms, which ensures optimal activity at the cannabinoid receptor [1].
Detection plays a critical role in preventing substance abuse. In this paper, we assess the efficiency of a machine learning algorithm, i.e. an Artificial Neural Network (ANN) model, designed to recognize JWH synthetic cannabinoids based on a combination of topological, 3D-MoRSE (Molecule Representation of Structure based on Electron diffraction) and ADMET (Absorption, Distribution, Metabolism, Excretion and Toxicity) molecular descriptors. These descriptors represent the chemical characteristics of the targeted molecules as a numerical vector that is suitable for computer processing.
The proposed method consists of translating the model in a data-oriented manner on a large set of molecular structures. The obtained model can provide meaningful molecular descriptors for performing structure searching without needing any retraining. In order to find which molecular descriptors are the most relevant for our purpose, we performed a variety of tests related to predictive QSAR (Quantitative Structure-Activity Relationships), toxicity and virtual screening tasks.

A. Database
The input database consists of 150 designer drugs including JWH synthetic cannabinoids, as well as non-JWH substances. The latter class includes synthetic cannabinoids other than JWH and a variety of other illicit substances (e.g. doping drugs). The database was built and updated by performing automatic data extraction. For this process, we used the KnowItAll Informatics System 2021 software package and the Chemical Abstracts Service (CAS) number identifier from the Wiley Database Collection. The substances included in the database were divided into three classes, referred to as JWH, non-JWH Cannabinoids and Others. The group of positives contains 50 JWH synthetic cannabinoids, while the group of negatives includes 100 compounds (50 non-JWH cannabinoids and 50 other drugs). The list of these compounds was presented in a previous paper [2].

B. Methods
The 3D representation of the molecular structures was obtained, for all 150 compounds forming the input database, with the HyperChem release 8.0.10 and Gaussian 16 software packages. The geometries of the molecular structures included in the database were fully optimized by using algorithms specific to the AM1 semi-empirical quantum method. The optimized molecular geometry and the associated parameters corresponding to the minimum energy of the molecular system were computed with the Polak-Ribiere conjugate gradient algorithm.
A total of 300 QSAR and 50 ADMET molecular descriptors were selected from three blocks, i.e. topological, 3D-MoRSE, and toxicity. The molecular descriptors were computed for each molecular structure in the database by using the online modules and web applications of the Swiss Institute of Bioinformatics scientific platforms. The initial 350 input molecular descriptors were evaluated from the point of view of their relative importance to the classification system. Taking into account their variation, we selected the 150 most relevant descriptors as the final input. Their list was reported in a previous paper [3]. The definitions, mathematical formulas and chemical significance of these topological and 3D-MoRSE descriptors are presented in detail by Todeschini et al. [4].
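As an illustration of ranking descriptors and keeping only the most relevant ones, the following minimal Python sketch orders descriptor columns by their variance and retains the top k. Using variance as the relevance criterion is an assumption made here for illustration only; the exact importance measure used for the reported selection is not specified in this section. The function name `top_k_by_variance` and the toy descriptor names are hypothetical.

```python
from statistics import pvariance


def top_k_by_variance(columns, k):
    """Rank descriptor columns by population variance and keep the k largest.

    `columns` maps a descriptor name to its list of values over all compounds.
    Variance is only a stand-in for "relative importance" (an assumption).
    """
    ranked = sorted(columns, key=lambda name: pvariance(columns[name]), reverse=True)
    return ranked[:k]


# Hypothetical toy data: 4 descriptors computed for 5 compounds.
toy = {
    "desc_a": [1.0, 1.0, 1.0, 1.0, 1.0],  # constant, carries no information
    "desc_b": [0.1, 0.9, 0.2, 0.8, 0.5],
    "desc_c": [10.0, 2.0, 7.0, 1.0, 5.0],
    "desc_d": [3.0, 3.1, 2.9, 3.0, 3.0],
}
selected = top_k_by_variance(toy, k=2)
print(selected)
```

Because raw variance depends on the units of each descriptor, a practical ranking would normally be applied to scaled descriptors or replaced by a criterion tied to the classification targets.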

Results
The ANN model designed to recognize JWH synthetic cannabinoids based on the 150 most relevant molecular descriptors was named nd150_DETECT_DNN and was built by using the Neural Designer 5.9.2 software. Fig. 1 shows the architecture of this system. An ANN is a predictive artificial intelligence tool. The Neural Designer software can build networks with deep architectures, which represent an efficient type of universal approximators [5]. In our case, the system contains three input nodes, corresponding to the Topological, 3D-MoRSE and ADMET descriptors respectively. The size of the scaling layer is equal to the number of input nodes, i.e. 3.
The nd150_DETECT_DNN system contains the following layers: a scaling layer containing three neurons (represented in Fig. 1 in yellow); two perceptron layers, each formed by three neurons (blue); an unscaling layer consisting of three neurons (red); and a bounding layer consisting of three neurons (purple).
The optimal number of neurons for the nd150_DETECT_DNN system was determined with the growing neurons algorithm. Fig. 2 presents the dynamics of the training error and the selection error, as determined for the different subsets while performing this algorithm. Table I shows the results obtained for the selection of the neurons, as performed by the growing neurons algorithm.
The model selection algorithms aim to determine which ANN topology yields the lowest error on new data. The most adequate model selection methods are the order selection algorithms and the input selection algorithms. The former determine the optimal number of hidden neurons in the architecture of the ANN. The latter are designed to identify the optimal subset of input variables. We chose the growing neurons algorithm for selecting the optimal number of neurons in this application (see Table II). This procedure starts with a minimum number of neurons and then increases their number incrementally. The growing inputs algorithm was used for selecting the optimal set of inputs (see Table III). Hence, the inputs were added progressively according to their correlations with the targets.
Fig. 3 presents the nature of the samples in the input data subsets. The 150 samples were divided into 90 training samples (60%), 30 selection samples (20%), and 30 testing samples (20%).
The nd150_DETECT_DNN system contains three perceptron layers. The parameters used for scaling the inputs are displayed in Table V. They include the minimum, maximum, mean, and standard deviation (the minimum and the maximum are used as scalers). Table VI indicates the size of each layer and the corresponding activation function. The size of the unscaling layer is three (equal to the number of outputs). The nd150_DETECT_DNN model contains three output classes.
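The division of the samples into subsets and the minimum-maximum scaling of the inputs can be sketched in Python as follows. This is a minimal illustration and not the Neural Designer implementation: the function names, the random seed and the target range [0, 1] of the scaling are assumptions made for this example.

```python
import random


def split_indices(n_samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle sample indices and cut them into training/selection/testing subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed (assumed) for reproducibility
    n_train = round(fractions[0] * n_samples)
    n_select = round(fractions[1] * n_samples)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_select],
        indices[n_train + n_select:],
    )


def min_max_scale(values, minimum, maximum):
    """Map values linearly so that minimum -> 0 and maximum -> 1 (assumed range)."""
    span = maximum - minimum
    if span == 0:
        # A constant input carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - minimum) / span for v in values]


# 150 samples split 60% / 20% / 20%, as described in the text.
training, selection, testing = split_indices(150)
print(len(training), len(selection), len(testing))

# Scaling a hypothetical input column with known minimum and maximum.
scaled = min_max_scale([2.0, 4.0, 6.0], minimum=2.0, maximum=6.0)
print(scaled)
```

In practice the minimum and maximum should be taken from the training subset only, so that no information from the selection or testing samples leaks into the scaling.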
The Neural Designer software initializes the neural network parameters at random with a uniform distribution. In our case, the number of randomized parameters was 31.
The training (or learning) strategy is applied to the ANN in order to obtain the lowest possible loss by adjusting the ANN parameters [8]. The loss index defines the task the ANN must perform and provides a measure of the quality of the representation that the network is required to learn [6]. When setting a loss index, we must consider two different concepts, i.e. the error and the regularization. The error term evaluates quantitatively the extent to which the ANN fits the data set.
Choosing the appropriate error method depends on the particular application. In our case, we selected the Normalized Squared Error (NSE). The normalized squared error is equal to 1 when the outputs from the ANN are equal to the mean values of the target variables. A normalized squared error equal to 0 indicates a perfect prediction of the data [7].
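The two reference values of the normalized squared error mentioned above can be checked with a short Python sketch. It assumes the usual definition, i.e. the sum of squared differences between outputs and targets divided by the sum of squared deviations of the targets from their mean; the function name and the toy targets are hypothetical.

```python
def normalized_squared_error(outputs, targets):
    """Sum of squared errors divided by the spread of the targets around their mean."""
    mean_target = sum(targets) / len(targets)
    squared_error = sum((o - t) ** 2 for o, t in zip(outputs, targets))
    normalization = sum((t - mean_target) ** 2 for t in targets)
    if normalization == 0:
        raise ValueError("targets are constant; the error cannot be normalized")
    return squared_error / normalization


# Hypothetical binary targets for one output class.
targets = [0.0, 1.0, 0.0, 1.0]

# Outputs equal to the targets: perfect prediction.
perfect = normalized_squared_error(targets, targets)

# Outputs equal to the mean of the targets: the reference "no skill" case.
baseline = normalized_squared_error([0.5] * 4, targets)

print(perfect, baseline)
```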
The regularization term measures the magnitude of the parameter values in the ANN system. If it is added to the error, then the ANN will be characterized by smaller weights and biases. In this case, its response will be smoother. In addition, overfitting may be more easily avoided. In our case, we applied the L2 regularization method, which is based on the sum of the squares of all the ANN parameters.
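How the regularization term enters the loss index can be sketched as follows. This is an illustrative Python example under stated assumptions: the penalty is taken as the plain sum of squared parameters, and the regularization weight of 0.01 is a hypothetical value, not one reported for the nd150_DETECT_DNN model.

```python
def l2_penalty(parameters):
    """Sum of the squares of all parameters (weights and biases)."""
    return sum(p * p for p in parameters)


def loss_index(error, parameters, regularization_weight=0.01):
    """Loss = error term + weighted L2 regularization term.

    The default regularization weight is a hypothetical value for illustration.
    """
    return error + regularization_weight * l2_penalty(parameters)


# Hypothetical parameter vector and error value.
parameters = [0.5, -1.0, 2.0]
penalty = l2_penalty(parameters)
loss = loss_index(0.1, parameters)
print(penalty, loss)
```

Larger parameters increase the loss even when the error is unchanged, which is what pushes the training towards smaller weights and biases.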
The quasi-Newton method, presented in Table VII, was used as the optimization algorithm. Although it is based on Newton's method, this approach has the advantage of not requiring the calculation of second derivatives. Instead, it determines an approximation of the inverse Hessian at each iteration of the algorithm, based only on gradient information.
The loss of a model is often tested by performing a linear regression analysis between the scaled ANN outputs and the associated targets, as computed for an independent testing subset [9]. This method yields three parameters for each output variable. The first two parameters, a and b, are the y-intercept and the slope of the best linear regression correlating the outputs to the targets. The third parameter, R2, is the correlation coefficient between the scaled outputs and the targets. A perfect fit is obtained when the outputs become equal to the targets. In this case, the slope is equal to 1 and the y-intercept is equal to 0. When the correlation coefficient is equal to 1, there is a perfect correlation between the ANN outputs and the targets in the testing subset. Table VIII lists the linear regression parameters obtained for the output associated with JWH cannabinoids. The value of the intercept is very small (nearly equal to 0), while the slope and the correlation are very close to 1, so we may conclude that the ANN is assigning the JWH class membership remarkably well. Fig. 4 illustrates the linear regression for the output corresponding to the class of JWH cannabinoids. Each circle represents a predicted value versus the actual one. The line indicates the best linear fit.
The confusion matrix computed for this system, presented in Table XII, shows that it is remarkably sensitive, as it detects all the analyzed JWH synthetic cannabinoids without exception.
The ANN model is characterized by a rate of true positives (JWH and non-JWH cannabinoids) of 100% and a rate of false negatives (cannabinoids misclassified as Others) of 0%. As many as 148 samples were correctly classified, and only two samples were assigned an incorrect class identity.
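The three linear regression parameters discussed above (intercept, slope and correlation) can be computed with an ordinary least-squares fit of the outputs against the targets. The Python sketch below is a minimal illustration; the function name and the toy values are hypothetical and are not the data behind Table VIII.

```python
from math import sqrt


def linear_regression(targets, outputs):
    """Least-squares fit outputs = intercept + slope * targets, plus correlation."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_o = sum(outputs) / n
    s_tt = sum((t - mean_t) ** 2 for t in targets)
    s_oo = sum((o - mean_o) ** 2 for o in outputs)
    s_to = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))
    if s_tt == 0 or s_oo == 0:
        raise ValueError("targets and outputs must both vary")
    slope = s_to / s_tt
    intercept = mean_o - slope * mean_t
    correlation = s_to / sqrt(s_tt * s_oo)
    return intercept, slope, correlation


# Hypothetical targets (class membership) and scaled network outputs.
targets = [0.0, 0.0, 1.0, 1.0]
outputs = [0.1, 0.0, 0.9, 1.0]
intercept, slope, correlation = linear_regression(targets, outputs)
print(intercept, slope, correlation)

# Perfect fit: outputs equal to the targets.
perfect_fit = linear_regression(targets, targets)
print(perfect_fit)
```

For the perfect fit the intercept is 0 and both the slope and the correlation are 1, which is the reference case described in the text.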
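The classification rates quoted above can be derived from a confusion matrix as in the following Python sketch. The label lists are reconstructed from the counts stated in the text (50 compounds per class, two Others assigned to non-JWH cannabinoids); they are an illustrative reconstruction and are not copied from Table XII. The helper names are hypothetical.

```python
CLASSES = ("JWH", "non-JWH Cannabinoids", "Others")


def confusion_matrix(actual, predicted, classes=CLASSES):
    """Rows are actual classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def recall(matrix, i):
    """Fraction of class i samples that were assigned to class i."""
    row_total = sum(matrix[i])
    return matrix[i][i] / row_total if row_total else 0.0


def accuracy(matrix):
    """Fraction of all samples on the diagonal of the matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Labels reconstructed from the counts given in the text (an assumption).
actual = ["JWH"] * 50 + ["non-JWH Cannabinoids"] * 50 + ["Others"] * 50
predicted = (["JWH"] * 50 + ["non-JWH Cannabinoids"] * 50
             + ["non-JWH Cannabinoids"] * 2 + ["Others"] * 48)

matrix = confusion_matrix(actual, predicted)
jwh_recall = recall(matrix, 0)
overall_accuracy = accuracy(matrix)
print(matrix, jwh_recall, overall_accuracy)
```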

III. CONCLUSIONS
Detecting JWH synthetic cannabinoids and non-JWH cannabinoids, and discriminating them from other compounds of forensic interest, is extremely important. These compounds have very limited legal uses. Hence, there are very few studies about their pharmacological activity and toxicity.
The most important characteristic of any system screening for these drugs of abuse is its capacity to recognize correctly the class identity of the positive samples, which, according to the requirements for a forensic tool, should not be misclassified under any circumstances.
The results obtained for the nd150_DETECT_DNN model, which was designed to recognize the class identity of JWH synthetic cannabinoids and discriminate them from non-JWH cannabinoids and other substances of forensic interest, indicate that the model is exceptionally efficient. All the JWH and non-JWH cannabinoids have been recognized as such, and no substance belonging to the class of non-JWH cannabinoids or to the class of Others was misclassified as a JWH cannabinoid. The two drugs that were misclassified belong to the class of Others and were incorrectly assigned the class identity of non-JWH cannabinoids.