Pre-selection and assessment of green organic solvents by clustering chemometric tools

The study presents the results of applying simple multivariate statistical tools to select physicochemical parameters of solvents for modelling missing variables: bioconcentration factors and octanol-water and octanol-air partitioning constants. EPI Suite software was successfully applied to predict missing values for solvents commonly considered as “green”. Values for logBCF, logKOW and logKOA were modelled for 43 rather nonpolar solvents and 69 polar ones. Multivariate statistics also proved useful in assessing the obtained modelling results. The presented approach can serve as one of the first steps and support tools in the assessment of chemicals in terms of their greenness.


Introduction
Green chemistry is the concept introduced by Paul Anastas in 1998 (Anastas and Warner 1998) with the publication of the twelve principles, which give specific guidance on introducing sustainability to chemical science. Since then the concept has developed considerably, and efforts have been made to develop zero-waste technologies, design benign products that maintain their properties, find renewable and bio-based feedstocks for chemicals and apply energy-efficient technologies. Solvents have also gained a lot of attention: the fifth principle of green chemistry states that solvents should not be applied if possible (Jessop et al., 2016); otherwise they should be as inert to the environment as possible. As the application of solvents cannot be avoided in many technological processes, the use of green solvents is highly desirable. A green solvent is characterised by favourable environmental, health and safety (EHS) parameters (Capello, Fischer, and Hungerbühler 2007). The first, very basic information about solvent greenness can be obtained from its physicochemical parameters and phase distribution constants. For example, solvents with low boiling points are very volatile; therefore exposure by inhalation is very likely. High values of the octanol-water partitioning coefficient give an initial indication that the compound can accumulate in animal tissues. More specific information on solvent greenness can be obtained from toxicological and ecotoxicological data, such as oral (Sathish et al. 2016) or inhalation toxicities, toxicity towards aquatic organisms or carcinogenicity. Similarly, environmental persistence data such as biodegradability or hydrolysis potential give information about environment-related hazards.
Remarkably, the European REACH Regulation (EC No 1907/2006) ("Regulation (EC) No 1907/2006 - REACH - Safety and Health at Work - EU-OSHA" 2017) has set as a priority the assessment of chemicals' bioaccumulative potential, which is the potential of a substance to accumulate in biota and, eventually, to pass through the food chain. A parameter that is widely used to measure a chemical substance's bioaccumulative potential is the bioconcentration factor (BCF). BCF is commonly defined as the ratio between the concentration of a chemical substance present in an aquatic organism and in the surrounding environment at thermodynamic equilibrium under controlled laboratory conditions (Arnot and Gobas 2006).
One of the problems in assessing solvents in terms of their greenness is the unavailability of required data (Alder et al. 2016). The missing data can be approximated with the average value for a given class of chemicals. However, the resulting estimate is usually characterised by high uncertainty. Therefore, reliable methods for predicting missing data values are needed. This is especially important in the case of solvents that are relatively novel and still poorly characterised, like esters or ethers derived from renewable feedstocks (Pena-Pereira, Kloskowski, and Namieśnik 2015).
Several methodologies exist for modelling the bioconcentration values of various chemicals. A quantitative structure-activity relationship (QSAR) applied to a dataset of chemicals predicted logBCF with a model with r² = 0.491 (Petoumenou et al. 2015). QSAR supported by partial least squares modelling gave a model fit with r² = 0.868 (Qin et al. 2009). An artificial neural network followed by a relatively simple generic model also established the values of logBCF with acceptable accuracy (Fatemi, Jalali-Heravi, and Konuze 2003). Linear and nonlinear models successfully modelled the logBCFs of 107 pesticides (Yuan et al. 2016). Although different logBCF prediction methods have mainly been applied to organic pollutants, to the authors' knowledge no efforts have been made to create models for predicting these values for solvents.
The aim of this study is to interpret, by multivariate statistics, two sets of solvents described by several physicochemical and biological features. We also present estimates of physicochemical properties, namely the logarithms of the partitioning constants between octanol and air (logKOA) and between octanol and water (logKOW), and logBCF, obtained with the Estimation Programs Interface (EPI) Suite. The EPI Suite predictions of the physical properties of the solvent sets are validated by comparison with the available literature values. This software is a valuable complement to experimental data on such solvent properties when sufficient data are lacking. In this way it is possible to find the relationships that form the basis for modelling the solvent parameters that define their greenness. Easy but reliable prediction of the hazards related to the introduction of greener solvents is highly desirable. The paper presents the application of multivariate statistics tools in the prediction of unknown properties and in the assessment of the modelling results.

Dataset
The dataset consists of 43 solvents conditionally characterised as non-polar and sparingly volatile and 69 solvents categorised a priori as polar. In this study the two datasets are treated separately. This a priori classification is based on previous research, where a large dataset of 151 solvents was analysed with cluster analysis using melting and boiling point, density, water solubility, vapour pressure, Henry's law constant ... (Mackay and Mackay 2006). Three additional parameters were included for the chemometric analysis, namely logKOAcalc., logKOWcalc. and logBCFcalc. These parameters were calculated with the models described in the section on modelling with EPI Suite™.

Multivariate statistics
Cluster analysis (CA) is a well-documented approach to unsupervised pattern recognition (Massart and Kaufman 1983; Massart, Vandeginste, and Buydens 1998). It aims to select groups of similarity (clusters) within different data sets and to interpret the meaning of the clustering, either between the objects of interest or between the parameters used to describe the objects. Usually, hierarchical cluster analysis requires several steps: standardization of the raw data (to avoid the effect of the different dimensionality of the parameters); determination of the distance between the objects to be clustered (to introduce a similarity measure); and a linkage procedure. The results are normally presented on a tree-like plot called a dendrogram, and in the final stage a criterion for determining cluster significance is needed to improve the interpretation. The use of chemometrics for the treatment of different data sets provides a valuable tool for objective decision-making (Hristov et al. 2016; Nedyalkova, Donkova, and Simeonov 2017).
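The steps listed above can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the software or settings used in the study. The function name `cluster_objects` and the parameter `fraction_of_dmax` are hypothetical, and SciPy's Ward linkage is computed from Euclidean distances, so linkage heights may differ from those of packages that report squared Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore


def cluster_objects(data, fraction_of_dmax=2.0 / 3.0):
    """Cluster the rows of `data` and cut the tree at a fraction of Dmax."""
    # Step 1: z-transform each column so that units do not dominate.
    standardized = zscore(data, axis=0, ddof=1)
    # Steps 2-3: distance measure and linkage (Ward, from Euclidean distances).
    tree = linkage(standardized, method="ward")
    # Step 4: significance cut at a fraction of the largest linkage distance.
    d_max = tree[:, 2].max()
    labels = fcluster(tree, t=fraction_of_dmax * d_max, criterion="distance")
    return tree, labels


# Synthetic stand-in data: two well-separated groups of five "solvents".
rng = np.random.default_rng(0)
demo = np.vstack([rng.normal(0.0, 0.3, (5, 3)), rng.normal(5.0, 0.3, (5, 3))])
tree, labels = cluster_objects(demo)
print(labels)
```

The returned `tree` can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting. Clustering variables rather than objects would amount to standardizing first and then running the linkage on the transposed matrix.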
Principal component analysis (PCA) is one of several multivariate methods that make it possible to explore patterns in complex data sets, classify the information and detect structure in a diffuse data set. In general, PCA is a mathematical treatment of the input data matrix (objects described by many features or variables) where the goal is to represent the variation present in many variables by a small number of factors or latent variables. A new feature space is formed, which makes it possible to visualise and project the real multivariate nature of the data set. The central task in PCA is to decompose the input matrix X into two parts: factor loadings (matrix A) and factor scores (matrix F). The first includes the weights of each feature (variable) in each identified factor (new latent variable). The higher the weight, the higher the impact of the original variable. Thus, this procedure makes it possible to identify which variables influence the objects. If the objects have to be presented in the space of the new latent variables, then the factor scores matrix must be used. The specific rules for performing and interpreting PCA are presented, for instance, in Einax, Zwanziger, and Geiss (1997).
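The decomposition into loadings A and scores F can be illustrated with a short NumPy sketch. It assumes PCA on the correlation matrix of standardized data, which is a common convention but is not stated explicitly in the text. The function name and the random input are hypothetical.

```python
import numpy as np


def pca_loadings_scores(data, n_factors):
    """Return loadings A, scores F and the explained-variance fractions."""
    # Standardize the variables (columns) of the input matrix X.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    # Eigen-decomposition of the correlation matrix, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_factors]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)    # A: variables x factors
    scores = z @ eigvecs / np.sqrt(eigvals)  # F: objects x factors
    explained = eigvals / corr.shape[0]
    return loadings, scores, explained


# Synthetic stand-in data: 20 objects described by 4 variables.
rng = np.random.default_rng(1)
demo = rng.normal(size=(20, 4))
loadings, scores, explained = pca_loadings_scores(demo, n_factors=2)
print(loadings.round(2))
print(explained.round(3))
```

With all factors retained, the product of F and the transpose of A reproduces the standardized data, which is a convenient check on the decomposition.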

Modelling - EPI Suite™
In the current work, subprograms of EPI Suite™ version 4.10 were used. The KOAWIN™ program estimates the logarithm of the octanol-air partition coefficient (KOA) of an organic compound using the compound's octanol-water partition coefficient (KOW) and Henry's law constant (HLC). KOAWIN requires only a chemical structure to estimate KOA. Structures are entered into KOAWIN through SMILES (Simplified Molecular Input Line Entry System) notations, which are also used by other estimation programs in EPA's EPI Suite. It is possible to estimate KOA from the octanol-water partition coefficient (KOW) and Henry's law constant (H) by the following equation: KOA = KOW (RT)/H, where R is the ideal gas constant and T is the absolute temperature. KOA and KOW are unitless values. H/RT is the unitless Henry's law constant, also known as the air-water partition coefficient (KAW) (W M Meylan and Howard 1995). Therefore, the equation to estimate KOA is:

KOA=KOW / KAW
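As a minimal numerical sketch of this relation, the function below converts a Henry's law constant to the unitless KAW and returns logKOA = logKOW - logKAW. The function name, the choice of units (Pa·m³/mol) and the input values are assumptions for illustration; this is not KOAWIN's code.

```python
import math

R_GAS = 8.314  # ideal gas constant, J/(mol*K), equivalently Pa*m^3/(mol*K)


def log_koa(log_kow, henry_pa_m3_per_mol, temperature_k=298.15):
    """Estimate logKOA from logKOW and Henry's law constant H."""
    # Unitless air-water partition coefficient: KAW = H / (R * T).
    k_aw = henry_pa_m3_per_mol / (R_GAS * temperature_k)
    # KOA = KOW / KAW, hence logKOA = logKOW - logKAW.
    return log_kow - math.log10(k_aw)


# Hypothetical inputs, chosen only to exercise the function.
print(round(log_koa(log_kow=3.0, henry_pa_m3_per_mol=100.0), 2))
```

A compound with a low Henry's law constant (small KAW) therefore has a logKOA well above its logKOW.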
The KOWWIN™ program predicts the logarithm of the octanol-water partition coefficient. KOWWIN uses a "fragment constant" methodology to predict log P. In a "fragment constant" method, a structure is divided into fragments (atoms or larger functional groups) and the coefficient values of each fragment or group are summed to yield the log P estimate. KOWWIN's methodology is known as an Atom/Fragment Contribution (AFC) method. Coefficients for individual fragments and groups were derived by multiple regression of 2447 reliably measured log P values. KOWWIN's "reductionist" fragment constant methodology (i.e. derivation via multiple regressions) differs from the "constructionist" fragment constant methodology of Hansch and Leo (1979).
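The summation step of a fragment constant method can be sketched as follows. The fragment names, coefficient values and intercept are entirely hypothetical placeholders and are not KOWWIN's regression coefficients. Only the additive structure of the estimate is illustrated.

```python
# Hypothetical fragment coefficients; NOT the values used by KOWWIN.
HYPOTHETICAL_COEFFICIENTS = {"frag_alkyl": 0.5, "frag_hydroxyl": -1.0}
HYPOTHETICAL_INTERCEPT = 0.2


def estimate_log_p(fragment_counts,
                   coefficients=HYPOTHETICAL_COEFFICIENTS,
                   intercept=HYPOTHETICAL_INTERCEPT):
    """Sum coefficient * count over all fragments, plus a constant term."""
    return intercept + sum(
        coefficients[name] * count for name, count in fragment_counts.items()
    )


# A made-up structure with four alkyl fragments and one hydroxyl fragment.
print(estimate_log_p({"frag_alkyl": 4, "frag_hydroxyl": 1}))
```

In a regression-derived ("reductionist") scheme, the coefficients in such a table would come from a multiple regression of measured log P values on fragment counts.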
The estimation methodology used by the original BCFWIN program is described in W M Meylan and Howard (1995). The logBCF was regressed against log(Kow), and chemicals with significant deviations from the line of best fit were analysed according to chemical structure. The BCFBAF method classifies a compound as either ionic or non-ionic. The ionic substances are further divided into carboxylic acids, sulfonic acids and their salts, and quaternary N compounds. LogBCF for non-ionic compounds is estimated from log(Kow) and a series of correction factors specific to each chemical (William M. Meylan, Howard, and Boethling 1996).
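The regression idea described above (fit logBCF against logKow, then inspect compounds that deviate strongly from the line) can be sketched with ordinary least squares. The data, the function names and the residual threshold are invented for illustration and do not reproduce the BCFBAF equations or correction factors.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_deviations(x, y, slope, intercept, threshold):
    """Indices of points whose absolute residual exceeds the threshold."""
    return [
        i for i, (xi, yi) in enumerate(zip(x, y))
        if abs(yi - (slope * xi + intercept)) > threshold
    ]


# Synthetic values: the third compound lies well below the general trend.
log_kow = [1.0, 2.0, 3.0, 4.0, 5.0]
log_bcf = [0.5, 1.0, 0.3, 2.0, 2.5]
slope, intercept = fit_line(log_kow, log_bcf)
outliers = flag_deviations(log_kow, log_bcf, slope, intercept, threshold=0.5)
print(round(slope, 3), round(intercept, 3), outliers)
```

Flagged compounds are the candidates for structure-specific corrections of the kind mentioned in the text.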

Polar solvents
The first step was the calculation of the values for logBCF, logKOW and logKOA, as described in the previous section. To reveal the internal patterns existing in the group of polar solvents, CA and PCA were applied. These techniques were used for classification of the chemical variables and of the solvents themselves. Fig. 1 shows the clustering of the variables. In the PCA interpretation, the variable logBCF (both experimentally found and calculated) forms a separate latent factor not directly correlated with other variables.
This indicates the specific importance of logBCF as a discriminant for the dataset. It is also indicated by CA, where K1 could be conditionally subdivided into a "logBCF" subcluster and a "logKOW" subcluster. Fig. 2 shows the grouping of polar solvents with CA. The objects (polar solvents) are clustered into three major groups. The mean values of physicochemical parameters for each group are presented in Table 2. The first group consists of alcohols with ether functional groups, aromatic alcohols and short-chain organic acids (apart from formic and acetic). Solvents in this group are less volatile and are characterised by slight water solubility and the highest (but still low) values of logKOW and logBCF. Solvents in this group are mainly novel, bio-based solvents. The second group contains lactate esters, formic and acetic acids, glycerol and some alcohols with other functional groups. These solvents are characterised by low volatility and very high water solubility. The third group consists mainly of "traditional" polar solvents, like short-chain alcohols, ketones, aldehydes and esters. Its main discriminator is the high volatility of the solvents, reflected by low boiling points, high vapour pressures and high Henry's law constants. These solvents are unlikely to bioaccumulate because of their low values of logBCF. The differences between clusters in terms of logKOW, logKOA and logBCF are not significant, and the values are all low. This is an indication that solvents defined as polar do not undergo bioaccumulation, which is one of the parameters that define their greenness.
glycerol triacetate; 3-butoxypropane-1,2-diol; methyl levulinate; ethyl levulinate; isobutyric acid; 1,2,3-trimethoxypropane; isopropylidene glycerol; 1-methoxy-3-(propan-2-yloxy)propan-2-ol; p-cresol; o-cresol; phenol; 3-n-butoxy-1-tert-butoxy-2-propanol; 1,3-di-n-butoxy-2-propanol; 1-n-butoxy-3-iso-propoxy-2-propanol; 1-n-butoxy-3-ethoxy-2-propanol; 1-ethoxy-3-iso-propoxy-2-propanol; 1-tert-butoxy-3-methoxy-2-propanol; 1-n-butoxy-3-methoxy-2-propanol; 1-tert-butoxy-3-ethoxy-2-propanol; 1,3-di-iso-propoxy-2-propanol

Non-polar solvents
As in the case of polar solvents, the values of logBCF, logKOW and logKOA were calculated with the Estimation Programs Interface (EPI) Suite, and the variables and objects were then clustered. Here, we present the estimates of physicochemical properties, namely the octanol-air partition coefficient (logKOA), the octanol-water partition coefficient (KOW) and the bioconcentration factor (BCF), obtained with EPI Suite. Predictions at room temperature were carried out for all the listed nonpolar solvents. EPI Suite requires only the chemical structure or the Chemical Abstracts Service (CAS) number to estimate the required properties. The program estimates the BCF by retrieving BCF data from a file that contains information on measured BCF and other key experimental details. The log(BCF) was regressed against log(Kow), and chemicals with significant deviations from the line of best fit were analysed according to chemical structure. Results of the EPI Suite predictions of the physical properties of the above nonpolar solvents and comparisons with the available literature values are presented. It was of interest to examine the correlation between experimentally obtained and theoretically calculated indicators.
The clustering of the variables for non-polar solvents (Fig. 3) shows a pattern similar to that for polar solvents (Fig. 1). Fig. 3 shows the hierarchical dendrogram of the clustering of variables (z-transformed input data, squared Euclidean distances as a similarity measure, Ward's method of linkage and Sneath's criterion for cluster significance). The theoretically calculated and experimental values of logBCF, logKOW and logKOA match very well (they are joined together in the clusters), which leads to the practically important conclusion that the calculation approach can be used when data for the indicators under consideration are missing from the data set. The obtained clusters confirm the relationship of parameters like logBCF and logKOW with the Henry's law constant. The variable logKOA is correlated with a whole group of physicochemical parameters like surface tension, density, boiling point and melting point.
Table 3 presents the factor loadings (Varimax rotation mode of PCA). The results of the clustering of variables are generally confirmed by the PCA results. PC1 links physicochemical parameters (such as surface tension, density, boiling point and melting point) with the experimental and calculated values of logKOA. Thus, it coincides entirely with cluster K3 and could be conditionally named the "physicochemical factor". PC2 also explains a significant part of the total variance (30.2 %) and resembles cluster K1, showing a strong relationship between the theoretical and experimental values of logKOW and logBCF. Water solubility is negatively correlated with these parameters, which differs from the clustering in K1. This relationship is not unusual, however, and this latent factor could be conditionally named the "solubility or polarity factor".
In PC3 (contribution of 14.8 % of the total variance) one finds a negative correlation between vapour pressure and the Henry's law constant. Given the large differences in the literature values of the Henry's law constant and vapour pressure for the non-polar solvents, such a connection is not surprising. In the hierarchical dendrogram (Fig. 3) vapour pressure and water solubility are linked together at the significance level of 66.67 % of Dmax, but at the level of 33.33 % of Dmax no such linkage exists. The Henry's law constant appears to be linked to logBCF and logKOW, but at quite a high level of linkage.
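The Varimax rotation mentioned for Table 3 can be sketched with the widely used iterative SVD formulation. This is an illustrative implementation under that assumption, not the routine of the statistical package used in the study, and the example loading matrix is hypothetical.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Return (rotated_loadings, rotation_matrix) under the varimax criterion."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target of the varimax criterion for the current rotation.
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / n_vars
        # The SVD yields the nearest orthogonal matrix, used as the new rotation.
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1.0 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation, rotation


# Hypothetical unrotated loadings: four variables on two factors.
example_loadings = np.array([[0.7, 0.6], [0.8, 0.5], [0.6, -0.6], [0.5, -0.7]])
rotated, rotation = varimax(example_loadings)
print(rotated.round(2))
```

Because the rotation is orthogonal, the communality of each variable (the row sum of squared loadings) is unchanged. Only the distribution of the loadings across factors becomes easier to interpret.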
The more significant aspect of the cluster analysis was to reveal relationships between the different non-polar solvents and possible markers that distinguish them within the seemingly homogeneous category of non-polarity.
Fig. 4 shows the hierarchical dendrogram of the clustering of the 43 nonpolar solvents. Three major clusters are very clearly indicated at the significance level of 66.67 % of Dmax. The first contains 19 of the 43 solvents, the second contains 10 and the third contains the remaining 14. The formation of three different patterns of non-polar solvents requires the identification of specific markers for each of the groups of similarity. Table 4 presents the average values of each of the 13 variables used for solvent clustering, for each of the clusters found.
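Per-cluster averages of the kind reported in Table 4 follow directly from the cluster labels. The sketch below uses a small invented matrix and hypothetical labels. It only shows the bookkeeping, not the study's data.

```python
import numpy as np


def cluster_means(data, labels):
    """Mean of every variable (column) within each cluster label."""
    labels = np.asarray(labels)
    return {int(k): data[labels == k].mean(axis=0) for k in np.unique(labels)}


# Invented example: three objects, two variables, two clusters.
demo_data = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
demo_labels = [1, 1, 2]
means = cluster_means(demo_data, demo_labels)
for cluster, values in means.items():
    print(cluster, values)
```

Comparing these averages across clusters is what identifies the discriminating variables for each group.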

Fig. 4 Hierarchical dendrogram presenting clusters of 43 non-polar solvents
The clustering is based on the specific discriminators present in the initial dataset. K1 includes nonpolar solvents (like pentane, cyclohexane, heptane, decane, etc.) with the lowest melting point, lowest boiling point, lowest density, highest vapour pressure, lowest surface tension and lowest logKOA (both experimental and theoretically calculated). This group is formed by volatile and rather nonpolar solvents.
The second cluster, K2, consists of solvents having on average the lowest water solubility and vapour pressure and the highest Henry's law constant, logKOW, logBCF and logKOA. Cluster K2 is thus a group of non-polar solvents that are not water soluble. The third cluster, K3, is characterised by the highest density and water solubility. All 43 solvents defined as non-polar are characterised by a much higher potential for bioaccumulation than solvents defined as polar. However, from a practical point of view it is important to develop and assess green less polar solvents, as many processes require a solvent that is not miscible with water.
The group defined as non-polar solvents could be divided into three subcategories: volatile, water-insoluble and slightly water-soluble solvents. Such grouping can be helpful in studies searching for a theoretical relationship between solvent chemical structure and bioconcentration. The close resemblance between experimentally found and theoretically calculated parameters makes it possible to use the approach to fill in missing data in a data set comprising physicochemical and bioconcentration variables. EPI Suite predicts physicochemical properties and is a relatively convenient means of studying organic materials. When experimental data are not available to assess environmental risk, a possible way to estimate the necessary values is the use of estimation models. EPI Suite was developed to help environmental scientists prepare profiles for a wide array of chemicals. The program simply requires the chemical structure or the Chemical Abstracts Service (CAS) number. In Fig. 5 an effort is made to indicate the internal relationship between clusters 1 to 3 of the non-polar solvents, based on the correlation coefficients found (predicted vs. experimental values of log BCF).
In Table xx a gradual increase of the correlation coefficient is observed, with the highest value for cluster 3. This cluster includes the solvents whose log BCF values were also predicted by EPI Suite. For the second group of solvents (polar solvents), only cluster 1 shows a reasonable correlation coefficient (predicted vs. experimental log BCF values).

The correlation coefficient could probably be used as another discriminant factor for the solvents studied: when the experimental and predicted values of log BCF vary considerably, the correlation coefficients are higher (e.g. non-polar solvents). Lower variability in log BCF leads to lower correlation, which is the case with polar solvents. Figure 5 presents the relation of predicted to experimental logBCF values. For most of the solvents, the predictions were very good. The model did not perform well for solvents that are very polar, do not tend to be concentrated and as a result have negative experimental logBCF values. The predictions were also inaccurate for pentadecane and hexadecane as well as for high-boiling methyl esters. The coefficient of determination calculated after removing these compounds and compounds of low logBCF (13 solvents removed in total) equals r² = 0.925 for the 99 remaining solvents.
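The effect of excluding poorly predicted compounds on the coefficient of determination can be sketched as follows. It is assumed here that r² is the squared Pearson correlation between experimental and predicted logBCF. The compound names and values are invented and do not reproduce the study's data.

```python
def r_squared(experimental, predicted):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(experimental)
    mean_e, mean_p = sum(experimental) / n, sum(predicted) / n
    cov = sum((e - mean_e) * (p - mean_p) for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov**2 / (var_e * var_p)


def r_squared_without(names, experimental, predicted, excluded):
    """r² recomputed after dropping the compounds listed in `excluded`."""
    kept = [(e, p) for name, e, p in zip(names, experimental, predicted)
            if name not in excluded]
    return r_squared([e for e, _ in kept], [p for _, p in kept])


# Invented values: "s5" mimics a very polar solvent with negative logBCF.
names = ["s1", "s2", "s3", "s4", "s5"]
log_bcf_exp = [0.5, 1.0, 1.5, 2.0, -1.0]
log_bcf_pred = [0.6, 1.1, 1.4, 2.1, 0.5]
r2_all = r_squared(log_bcf_exp, log_bcf_pred)
r2_kept = r_squared_without(names, log_bcf_exp, log_bcf_pred, excluded={"s5"})
print(round(r2_all, 3), round(r2_kept, 3))
```

Reporting both figures, together with the list of excluded compounds, keeps the applicability limits of the model visible.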

Conclusions
Simple classification with well-known chemometric tools is an important preliminary step in the selection of an optimal set of parameters for proper theoretical predictions. Here, cluster analysis and PCA were used to group solvents according to their similarity. Variables were grouped with principal component analysis and cluster analysis to identify the properties from which missing values can be predicted. The results show that values of logBCF for organic solvents can be modelled with EPI Suite software. These estimates will thus help to identify novel green solvents for which experimental logBCF values are not yet available.