Improvement of multi-task learning by data enrichment: application for drug discovery

Multi-task learning in deep neural networks has become a topic of growing importance in many research fields, including drug discovery. However, applying multi-task learning poses new challenges for improving prediction performance. This study investigated the potential of training data enrichment to enhance the prediction quality of multi-task models in drug discovery. We evaluated four scenarios with varying degrees of information capacity of the training data and applied two types of test data to evaluate prediction performance. We used three datasets: ViralChEMBL, which consists of binary activities of compounds against viral species, was applied to the classification task; pQSAR(159) and pQSAR(4276), which consist of bio-activities of compounds and assays from the research on the profile-QSAR method, were applied to regression tasks. We built multi-task models based on feed-forward DNNs using the PyTorch framework. Our findings showed that training data enrichment can be an effective means of enhancing prediction performance in multi-task learning, but the degree of improvement depends on the quality of the training data. The more unique compounds and targets the training data includes, the more new compound-target interactions are required to improve predictions. We also found that, even with multi-task learning, one cannot predict the interactions of compounds that are highly dissimilar from those used for model training. The study provides recommendations for effectively employing multi-task learning in drug discovery to improve prediction accuracy and facilitate the discovery of novel drug candidates.


Introduction
The amount of chemical information is growing exponentially [1], placing us in the "Big Data" era of chemistry [2]. Chemists are increasingly working with multi-output datasets combining data from a variety of sources. One example is the ViralChEMBL [3] database, which includes antiviral activity data of various molecules against multiple viruses. Toxicity data can also be collected for different toxicity endpoints, such as the investigated organisms or administration methods [4,5]. Bioactivity data can be grouped for the precise investigation of different proteins or assays [6,7]. This new data representation requires new processing methods and raises questions about the applicability of models. To tackle these challenges, the use of multi-task learning (MTL) has become widespread, while the immense potential of deep neural networks (DNNs) provided solid ground for its application. MTL allows multiple endpoints to be used simultaneously in training one model, a distinct advantage over traditional single-task learning (STL) [8][9][10]. Such a model performs predictions for all utilized endpoints, benefiting from the data of each of them. MTL has demonstrated performance boosts across numerous fields, from toxicological modeling [5,11] and the prediction of physicochemical [12] and biological properties [13,14] to drug recommendation [15].
However, even the most advanced approach is not without limitations. The success of a prediction strongly relies on the applicability of a model to the compounds under investigation. To quantify this, the concept of the applicability domain (AD) was introduced [16]. However, the definition of the AD is quite vague, and there is no consensus view on its assessment [17][18][19][20]. Nevertheless, the long history of AD investigation has led to some preferred practices in STL. The most widely used AD concept is based on structural similarity [21]: a compound is considered to be within the AD if it is structurally similar to the compounds in the training set. Since there are various ways to express similarity, there are numerous methods for the numerical estimation of the AD, for example, ensemble-based [22] or leverage-based [23,24] methods.
The estimation of the AD in MTL is more challenging than in STL. In STL, each compound is a unique instance present exclusively in either the test or the training set. MTL implies a new level of abstraction, where the same compound or target may be present in both the training and test sets. In this case, prediction is performed not for separate compounds but for interaction values (or interactions) between compounds and targets. The additional relations formed between targets and compounds thus raise new challenges for AD estimation. One of the most critical questions is how to expand the AD of a multi-task model and improve its prediction performance. In STL, the more diverse the training data, the broader the model's AD; the only way to achieve this is to add interactions of new compounds with the same target. Since several targets are used in MTL, there are two ways to diversify the data: add interactions with new compounds that are not related to any target in the training set, or add interactions with compounds already connected to some targets in the training set. It is unclear which strategy is preferable. It is also unclear whether MTL can overcome the limitations of the AD and provide accurate predictions for interactions with compounds not used for model building. To our knowledge, these issues have not been sufficiently addressed in the literature.
In this research, we utilized DNNs to carry out multi-task regression and classification modeling. Our focus was on two key issues. First, we investigated training data enrichment as a means of expanding the model's applicability domain and thus improving prediction performance. Second, we compared the prediction performance for interactions related to two groups of compounds: those included in model training (known to the model) and those not used in training (novel to the model). Our results demonstrate the potential of training data enrichment as an effective tool for model improvement. We found that the degree of improvement strongly depends on the data used. To this end, we proposed two different test sets to evaluate the models, one containing interactions with known compounds and another with novel ones. Through our research, we formulated recommendations that can assist researchers in the effective application of MTL in the drug discovery domain.

Materials and methods
In this section, we describe the datasets used in the research, the procedure for splitting the data into three different subsets, and four scenarios combining these subsets that were used to assess the influence of training data enrichment on prediction performance. We also describe the methods used to evaluate the prediction results.
We used the pQSAR(159) and pQSAR(4276) datasets for regression modeling and the ViralChEMBL dataset for classification modeling. The original papers, results, and datasets are referred to as "reference" throughout the text.

Data characterization
It is feasible to use both compound and target descriptors in multi-task learning. As compound descriptors, we calculated Morgan fingerprints (radius 2, 2048-bit vectors) using RDKit v.2021.03.1 [28]. We used the compound structures from the reference articles without additional standardization. We did not use target descriptors, as they were either not defined (pQSAR(159), pQSAR(4276)) or not reasonably informative (ViralChEMBL), as noted in the reference articles.
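For illustration, the descriptor calculation can be reproduced with a few lines of RDKit code. This is a minimal sketch assuming SMILES input; the helper name morgan_fingerprint is ours, but the RDKit calls are standard API.

```python
# A minimal sketch of the descriptor calculation: radius-2, 2048-bit
# Morgan fingerprints, assuming compounds are given as SMILES strings.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_fingerprint(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """Convert a SMILES string into a binary Morgan fingerprint vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.uint8)

# Example: featurize aspirin
x = morgan_fingerprint("CC(=O)Oc1ccccc1C(=O)O")
print(x.shape)  # (2048,)
```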

Prediction algorithm
Deep Learning is considered one of the most potent and efficient approaches for dealing with massive amounts of data in drug discovery [29][30][31]. Due to their architectural flexibility, DNNs can be easily applied to MTL. For this research, we constructed a feed-forward DNN using the PyTorch Lightning [32] framework. To get the best prediction performance, we conducted a random hyperparameter search [33] by varying: (i) the number of layers and the number of neurons per layer, (ii) activation functions (ELU, PReLU, and LeakyReLU), and (iii) optimizers (Adam, RAdam, and Yogi). We applied dropout regularization to all hidden layers (dropout values from 0.1 to 0.5). Python code implementing the prediction algorithm is available in a public repository on GitHub [34]. The hyperparameters of the best models are presented in Table S2 of the SI1 file (supplementary materials).
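The sketch below shows a feed-forward multi-task network of the kind described, written in plain PyTorch. The layer sizes, activation, and dropout are placeholders rather than the tuned hyperparameters from Table S2, and the masked loss is a common way (our assumption, not necessarily the authors' exact loss) to let only measured compound-target pairs of the sparse interaction matrix contribute to the gradient.

```python
# A minimal sketch of a feed-forward multi-task DNN; hyperparameters here
# are illustrative, not the tuned values reported in Table S2.
import torch
import torch.nn as nn

class MultiTaskDNN(nn.Module):
    def __init__(self, n_inputs=2048, n_tasks=159, hidden=(1024, 512), dropout=0.3):
        super().__init__()
        layers, d = [], n_inputs
        for h in hidden:
            layers += [nn.Linear(d, h), nn.ELU(), nn.Dropout(dropout)]
            d = h
        layers.append(nn.Linear(d, n_tasks))  # one output per target
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def masked_mse(pred, target, mask):
    """MSE over the known cells of the interaction matrix only."""
    diff = (pred - target) * mask
    return diff.pow(2).sum() / mask.sum().clamp(min=1)

# Example forward pass on a batch of 8 fingerprint vectors
model = MultiTaskDNN()
x = torch.rand(8, 2048)
print(model(x).shape)  # torch.Size([8, 159])
```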

Data splits
We utilized two ways of data splitting to generate sets for model training and testing. The first one aimed to validate the predictive ability of our novel algorithm. We kept the original training/test splits from the reference articles and compared the performance of reference models with the performance of our models.
In the second way of splitting, we focused on creating a test set that would fairly evaluate a model. There are several approaches to data splitting. The most commonly used is random splitting, which randomly assigns compounds to the test set (Fig. 1, red dots). However, this method can lead to over-optimistic model evaluation. Another approach, time splitting [35], is increasingly used to avoid this issue and is considered more appropriate for real-world use in drug discovery. However, it requires time-stamped data, which is not always available for ready-made public datasets. As an alternative, scaffold-based splitting [36] is sometimes used, resulting in training and test sets comprising compounds with different scaffolds. It separates compounds based on a hierarchical classification tree, where the more generic tree levels correspond to scaffolds with fewer rings. We applied a different approach called realistic splitting [6], which considers the similarity between compounds. This method places the most structurally unique compounds in the test set (Fig. 1, blue dots), simulating real virtual screening projects where predictions are made for compounds close to the boundary of a model's applicability domain or beyond it. This approach provides a more robust evaluation of a model and prevents overestimation of its performance.
In the case of a multi-output compound-target dataset, the realistic split is conducted for each target separately (Fig. 2, left). As a result, each target possesses its own individual sets of compounds for training and evaluation. Combining these sets across all targets creates the final training and test sets for the initial data. Thus, the splitting of a multi-output dataset is performed not by compounds (as in a single-output dataset) but by interaction values between compounds and targets. As a result, the same compound can be included in both training and test sets, depending on the targets it interacts with. For model building, the splitting results for each target can be represented as a vector of interaction values between this target and all compounds in the dataset (Fig. 2, right). The length of this vector is the same for all targets and equals the number of compounds in the dataset. Merging all vectors creates an interaction matrix that reflects the interaction information for all compound-target pairs in the dataset. The matrix columns hold the interaction information for individual targets, and the matrix rows hold the interaction information for individual compounds. The value of each matrix cell equals the interaction value for the corresponding compound-target pair or is unknown if the interaction information is absent.

Fig. 1 Visualization of the random and realistic splits for a part of the ViralChEMBL dataset. In the 2D space, the red and blue dots represent 20% of compounds selected by the random and realistic splitting, respectively. The visual representation was generated using a parametric t-SNE model [37]
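A compact way to build such an interaction matrix is a pandas pivot. The sketch below uses made-up (compound, target, value) records; NaN marks unknown interactions.

```python
# A minimal sketch of building the interaction matrix described above from
# (compound, target, value) records; the records here are illustrative.
import pandas as pd

records = pd.DataFrame({
    "compound": ["c1", "c1", "c2", "c3"],
    "target":   ["t1", "t2", "t2", "t1"],
    "value":    [6.2, 5.1, 7.4, 4.9],
})

# Rows = compounds, columns = targets; missing pairs appear as NaN (unknown).
matrix = records.pivot(index="compound", columns="target", values="value")
print(matrix)
```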
For the pQSAR(159) and pQSAR(4276) datasets, we used the realistic splits from the reference articles. For the ViralChEMBL dataset, we performed the realistic splitting ourselves.
To make the procedure more straightforward, we employed Euclidean distances on a 2D plane. To translate molecular structures into this plane, we utilized the parametric t-SNE model [37]. This model, a pre-trained artificial neural network, transforms compounds from a high-dimensional space (the space of ECFP fingerprints) into a 2D plane defined by X and Y coordinates. The network was trained to reflect the structural similarity between compounds, so that similar compounds are projected close to one another. The parametric t-SNE model used the Tanimoto distance between ECFP fingerprints as the structural similarity metric in the high-dimensional space, so this projection can be regarded as an approximation of Tanimoto similarity in the Euclidean domain. It also allowed us to visualize the outcomes of the procedure on the 2D plot in Fig. 1.
Applying the realistic splitting, we divided the data from each dataset into a "trn" subset, consisting of 75% of interactions with the most similar compounds for each target, and a "realistic" subset, consisting of the remaining 25% of interactions with the most structurally novel compounds. We used the "trn" data for model training and the "realistic" data for model testing. As previously stated, each target in a multi-output database has its own sets of compounds for training and evaluation, so the same compound could appear in both training and test sets depending on the targets it interacts with. Based on this characteristic, we further divided the "realistic" subset into two subsets. The first one consisted of interactions related to unique compounds present in the "realistic" subset only (the "compound-based" or "c" subset). The second one consisted of interactions related to compounds that are also present in the "trn" subset (the "interaction-based" or "i" subset). In total, we created three subsets for each investigated dataset, as shown in Fig. 3:

- Subset "trn" contains interactions of the most structurally similar compounds, defined for each target separately;
- Subset "i" contains interactions of the most structurally novel compounds, defined for each target separately, whose compounds are also present in the subset "trn";
- Subset "c" contains interactions of the most structurally novel compounds for each target, whose compounds are included in this subset only.

Fig. 2 The scheme of the realistic split for a multi-output dataset
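For illustration, the sketch below implements a simplified per-target realistic split on 2D t-SNE coordinates. The novelty criterion used here (mean Euclidean distance to all other compounds of the target) is our simplification for illustration, not the exact algorithm of ref. [6].

```python
# A simplified sketch of the per-target realistic split: the most
# structurally novel compounds (by mean distance to the others in the
# 2D t-SNE space) form the test set. Illustrative, not the exact ref. [6] method.
import numpy as np
from scipy.spatial.distance import cdist

def realistic_split(coords: np.ndarray, test_fraction: float = 0.25):
    """Return (train_idx, test_idx) for one target's compounds."""
    dist = cdist(coords, coords)         # pairwise Euclidean distances
    novelty = dist.mean(axis=1)          # mean distance to all other compounds
    order = np.argsort(novelty)          # most "central" compounds first
    n_test = max(1, int(round(len(coords) * test_fraction)))
    return order[:-n_test], order[-n_test:]

# Example: 100 compounds of one target with random 2D coordinates
rng = np.random.default_rng(0)
train_idx, test_idx = realistic_split(rng.normal(size=(100, 2)))
print(len(train_idx), len(test_idx))  # 75 25
```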

Prediction scenarios
We verified our algorithm's prediction performance by comparing the obtained results with those of the reference models. Our models were built on the reference training/test splits, adhering to the reference training procedure. We used the pQSAR(159) dataset to evaluate the algorithm's performance on the regression task. We built separate models for each target in the dataset, utilizing all available interaction values except those from the "realistic" subset for the investigated target; these excluded values were used to evaluate the models. We used the ViralChEMBL dataset to evaluate the algorithm's performance on the classification task, utilizing the reference training and test sets for model training and evaluation.
For the main study, we designed four scenarios referred to as Sc1, Sc2, Sc3, and Sc4 (Fig. 4). The scenarios differed in the training sets (see supplementary materials SI1, Table S1) but possessed the same test sets (see section "Test sets"), allowing us to investigate the influence of the training data on model quality.
- We considered Sc1 as a baseline scenario. The Sc1 training set consisted of interactions involving the most similar compounds for each target in a dataset, i.e., the interaction values from the subset "trn".
- In the Sc2 scenario, we utilized the Sc1 training set expanded with the interaction values from the subset "i", i.e., with new interactions related to compounds already included in the Sc1 training set. We assumed that this would slightly increase the information capacity of the training data without changing the compounds' diversity.
- In the Sc3 scenario, we utilized the Sc1 training set enriched with the interactions from the subset "c", i.e., with interactions related to compounds that are novel to the Sc1 training set. We assumed this would significantly increase the diversity of compounds in the training data.
- In the Sc4 scenario, we created separate models for each target or set of targets (pQSAR(4276)). For each target (or set of targets) in the outer loop, we separated the interactions of the "realistic" subset from all other interactions, thus creating the sets for testing and training, respectively. In addition, if a compound was not included in the "trn" subset, we did not use it for training at all (see the scheme of Sc4 in Fig. 4d). Consequently, we used extended training sets to create models with broader ADs.
In each scenario, the training data differed in the number of included interactions and compounds. Table 1 presents the absolute and relative differences between the Sc2, Sc3, and Sc4 scenarios and the Sc1 scenario.

Test sets
In the field of multi-task prediction, data splitting has a unique feature. Unlike traditional single-task prediction, where data is split based on the interactions between a specific target and multiple unique compounds, in multi-task prediction data is split according to interactions between multiple targets and compounds. This can result in training and test sets containing the same compounds, which significantly affects the prediction results. To address this issue, we propose to evaluate multi-task models in two ways, by:

- Interactions connected with compounds that are present in the training data and used for model training;
- Interactions connected with compounds that are not present in the training data and so not familiar to the model (also known as "cold-start" prediction in recommender systems [39,40]).

We assume that using both these types of test data can significantly improve the evaluation of a multi-task model. Thus, in this research, we applied the subsets "i" and "c" to predict interactions with compounds that are familiar and novel to the model, respectively. An exception was made for Sc2 and Sc3, where the interactions from the training sets intersected with those from the "i" and "c" subsets, respectively; for these scenarios, we applied only one test set. The application of the "i" and "c" subsets in the scenarios is shown in Table 2 and Fig. 4.
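In code, the division of the "realistic" interactions into the "i" and "c" test subsets reduces to a membership check on compounds. The sketch below illustrates this with made-up DataFrames; the variable names are ours.

```python
# A minimal sketch of deriving the "i" and "c" test subsets from the
# "realistic" interactions; the DataFrames are illustrative.
import pandas as pd

trn = pd.DataFrame({"compound": ["c1", "c2"], "target": ["t1", "t2"], "value": [6.2, 7.4]})
realistic = pd.DataFrame({"compound": ["c1", "c3"], "target": ["t2", "t1"], "value": [5.1, 4.9]})

known = realistic["compound"].isin(trn["compound"])
subset_i = realistic[known]     # interactions of compounds also present in "trn"
subset_c = realistic[~known]    # interactions of compounds novel to the model
print(len(subset_i), len(subset_c))  # 1 1
```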

Evaluation and metrics
We found the best model in each scenario by a random search of the model's hyperparameters [33]. For each set of hyperparameters, we performed fivefold cross-validation with the sklearn.model_selection package (sklearn v.0.24.2) [41] using three different seeds. For classification modeling, we performed fivefold stratified cross-validation, which ensures that the training and test sets have the same proportion of classes as the entire database. The prediction performance of the best models was assessed on the two test sets ("i" and "c"). We assessed the prediction quality for the subsets as a whole and for each of their targets individually, with the further computation of the mean and median (Eq. 1):

$$M_{\mathrm{mean}} = \frac{1}{T}\sum_{t=1}^{T} M_t, \qquad M_{\mathrm{median}} = \operatorname{median}\left(M_1, \ldots, M_T\right) \qquad (1)$$

where $M_t$ is the value of one of the evaluation metrics for a target $t$ out of $T$ targets in a dataset. The mean, median, and standard deviation were calculated with the numpy.mean, numpy.median, and numpy.std functions, respectively (numpy v.1.20.3) [42]. When evaluating the results of multi-task modeling, it is essential to consider the performance on each task separately rather than evaluating the model's overall performance on the entire dataset, because different tasks may have different prediction difficulties and data qualities. Moreover, evaluating the model on each task individually makes it possible to identify the tasks on which the model performs well and those on which the predictions fail, which can inform further model development and improvement.
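As an illustration of Eq. 1, the sketch below aggregates a per-target metric into the reported summary statistics; the per-target data are randomly generated, and RMSE stands in for any of the metrics used.

```python
# A minimal sketch of the per-target evaluation (Eq. 1); the (true,
# predicted) vectors per target are randomly generated for illustration.
import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
targets = [(rng.normal(size=50), rng.normal(size=50)) for _ in range(10)]

# Per-target metric values M_t (here RMSE)
m = np.array([mean_squared_error(y, p, squared=False) for y, p in targets])

print("mean:", np.mean(m))      # M_mean from Eq. 1
print("median:", np.median(m))  # M_median from Eq. 1
print("std:", np.std(m))
```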
Moreover, in the case of MTL, the evaluation of results based on an entire test set is not informative. Each target in a test set has a different number of interactions and thus contributes differently to the overall prediction result. As a consequence, the model evaluation heavily depends on the most represented targets. For example, a model could be considered accurate for a test set including dozens of targets with small numbers of poorly predicted interactions if the set also contains several targets with large numbers of well-predicted interactions. Thus, in this research, we calculated the evaluation metrics on the entire dataset for reference only and included these results in Tables S3, S5, and S7 of the SI1 file (supplementary materials). We did not use this information in the study.
To facilitate analyses of the results of training data enrichment, we computed the absolute and relative performance change for each metric applied. The performance improvement was measured by comparing evaluation metric values in scenarios Sc2, Sc3, and Sc4 to those in scenario Sc1. The results are presented in Tables S4, S6, and S8 of the SI1 file (supplementary materials).

Metrics for classification
To assess the classification modeling, we used metrics that evaluate the prediction of classes (balanced accuracy, BA) and of class probabilities (area under the receiver operating characteristic curve, ROC AUC, and area under the precision-recall curve, PR AUC). For their computation, we applied the sklearn.metrics module of sklearn v.0.24.2 [41].
Based on the ROC AUC and PR AUC calculations for individual targets, we computed their mean and median, as well as the number of targets for which the metrics' values were greater than 0.8, regarding 0.8 as a threshold for efficient prediction [43]. We did not perform the calculations for individual targets if they possessed only one class; the prediction results for these targets were neglected (see details in the supplementary materials).

Table 2 Application of the "i" and "c" subsets for model testing in the different scenarios: ✓ = the subset was applied, × = the subset was not applied
To calculate BA, we transformed the predicted probabilities into two classes: active and inactive. As the transformation threshold, we used the ratio of classes in the training set for each target individually. For example, all predictions greater than 0.6 for a target were considered active if the ratio of active to inactive classes for this target in the training set was 6:4.
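A sketch of this class-ratio thresholding follows; the data and the helper name predict_classes are illustrative, while balanced_accuracy_score is the standard sklearn call.

```python
# A minimal sketch of the class-ratio thresholding used for BA; the data
# and the helper name are illustrative.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def predict_classes(proba, y_train):
    """Threshold probabilities at the fraction of actives in the training set."""
    threshold = y_train.mean()   # e.g. 0.6 for a 6:4 active:inactive ratio
    return (proba > threshold).astype(int)

y_train = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])   # 6:4 ratio -> threshold 0.6
proba   = np.array([0.72, 0.55, 0.61, 0.30])
y_true  = np.array([1, 0, 1, 0])
print(balanced_accuracy_score(y_true, predict_classes(proba, y_train)))
```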

Metrics for regression
We used the root-mean-square error (RMSE) and the coefficient of determination (R²) to evaluate the performance of our regression models.
RMSE reflects how much the predicted results deviate from the actual values. It was calculated by the equation:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - x_i^{*}\right)^2}$$

where $x_i$ and $x_i^{*}$ are the i-th real and predicted interaction values from a dataset of N values. Besides the mean and median of the RMSE values over the targets, we computed the RMSE for the 10% most active data (RMSE10%). RMSE10% assesses the model's ability to detect the most promising interactions, which is highly important in computational screening.
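The two error measures can be computed as below; the interpretation of the "10% most active" pairs as those with the highest true values is our assumption for illustration.

```python
# A minimal sketch of RMSE and RMSE10%; "most active" is taken here as the
# 10% of interactions with the highest true values (our assumption).
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_top10(y_true, y_pred):
    n_top = max(1, int(0.1 * len(y_true)))
    top = np.argsort(y_true)[-n_top:]    # indices of the 10% most active pairs
    return rmse(y_true[top], y_pred[top])

rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.5, size=200)
print(rmse(y_true, y_pred), rmse_top10(y_true, y_pred))
```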
The coefficient of determination, commonly referred to as R-squared, quantifies the proportion of variance in the observed values that the model explains. We employed the metrics.r2_score function from sklearn v.0.24.2 [41] to calculate R-squared. We used only the median of the R-squared values for individual targets, because the mean may be heavily biased, as R-squared ranges from −∞ to 1. Additionally, we counted the number of targets for which the R-squared value was greater than specific thresholds. There is no universally accepted threshold for the R-squared value, as it depends on the research goals and application. However, we used a threshold of 0.0, which can serve as the minimum criterion for a model to be considered as providing at least some prediction of the outcome: predictions with an R-squared value less than 0 indicate that the model performs worse than simply assigning the mean value to all predicted points [44,45]. We also counted the targets above the thresholds of 0.7 and 0.9, which are commonly used to indicate good and very good results, respectively.
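The per-target threshold counting then reads, for example, as below; the per-target data are again randomly generated.

```python
# A minimal sketch of the per-target R² summary and threshold counts; the
# per-target data are randomly generated for illustration.
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
targets = [(rng.normal(size=50),) * 2 for _ in range(5)]                    # perfect predictions
targets += [(rng.normal(size=50), rng.normal(size=50)) for _ in range(5)]   # random predictions

r2 = np.array([r2_score(y, p) for y, p in targets])
print("median R2:", np.median(r2))            # only the median is reported
for thr in (0.0, 0.7, 0.9):
    print(f"targets with R2 > {thr}:", int((r2 > thr).sum()))
```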

Y-scrambling
Y-scrambling is a technique for the additional validation of a model's performance. Under this procedure, one shuffles the training data and estimates the resulting performance drop. If the performance of a "scrambled" model is on the same level as that of the model of interest, one can conclude that the statistical advantage of the model of interest is negligible. We used y-scrambling to estimate the robustness of our models and the number of targets whose seemingly successful predictions are actually due to chance correlations. Y-scrambling was performed by randomly reassigning the interaction values among the compounds for each target separately. We assessed the scrambled results using the same metrics as for the unscrambled ones.
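The per-target reassignment can be sketched as a column-wise permutation of the known cells of the interaction matrix; the matrix in the example is illustrative.

```python
# A minimal sketch of the per-target y-scrambling described above: each
# target's known interaction values (one matrix column) are randomly
# reassigned among its compounds; NaN marks unknown interactions.
import numpy as np

def y_scramble(matrix: np.ndarray, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    scrambled = matrix.copy()
    for t in range(matrix.shape[1]):             # each target separately
        known = ~np.isnan(matrix[:, t])          # compounds with known values
        scrambled[known, t] = rng.permutation(matrix[known, t])
    return scrambled

# Example: 4 compounds x 2 targets with two unknown interactions
m = np.array([[6.2, 5.1], [np.nan, 7.4], [4.9, 6.0], [5.5, np.nan]])
print(y_scramble(m))
```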

Verification of our algorithm performance
We evaluated the performance of our algorithm for the regression task by comparing its prediction results to the reference ones on the pQSAR(159) dataset (supplementary materials SI1, Table S9). The reference results were obtained with the pQSAR algorithm. Our results were comparable to the reference ones, with a slightly higher mean and median of prediction errors over the targets (Fig. 5a): the mean/median RMSE was 0.61/0.61 for pQSAR and 0.63/0.64 for our DNN-based algorithm. Our algorithm also outperformed the reference one in terms of the median coefficient of determination (Fig. 5b): the median R² was 0.46 for pQSAR and 0.53 for the DNN. Additionally, we assessed the suitability of the algorithms for computational screening by calculating the prediction error for the "top" 10% of interactions. Chemists are typically interested in the most promising compound-target pairs, so precise prediction for the "top" interactions is preferable to accurate prediction for the whole dataset. Our DNN-based algorithm slightly outperformed the baseline one: RMSE10% was 0.54 for our algorithm compared to 0.56 for pQSAR. Based on these results, we consider our DNN-based algorithm a reliable and competitive option for regression modeling.
We used the ViralChEMBL dataset for the classification modeling and compared our prediction results with the reference ones obtained by the SGIMC algorithm (supplementary materials SI1, Table S10). Our DNN-based algorithm's prediction performance was comparable to that of the reference algorithm, as shown by similar means and medians of the ROC AUC and BA scores over the targets (Fig. 6). Specifically, the mean/median ROC AUC and mean/median BA were 0.64/0.64 and 0.65/0.65 for SGIMC and 0.61/0.64 and 0.69/0.62 for the DNN. This confirms that our DNN-based algorithm performs similarly to the reference algorithm in classification modeling.
PR AUC is commonly used to evaluate the performance of classification models, but it has a limitation. Since the PR curve considers only the precision and recall measures, it ignores the prediction accuracy for "inactive" examples. This can be a problem when working with imbalanced datasets, such as the highly imbalanced ViralChEMBL dataset used in this study, where the "active" class is more prevalent than the "inactive" class (with a 9:1 ratio). In this case, PR AUC can overestimate the model's performance due to overfitting to the prevalent "active" class and underestimating the under-represented "inactive" class [46]. Despite high PR AUC scores (PR AUC > 0.8), the prediction accuracy for the "inactive" class was poor, as illustrated in Fig. 7. Therefore, PR AUC was not used for further analysis, and the ROC AUC metric, which takes both classes equally into account, was preferred for evaluating the classification models.
We proved the non-accidental nature of our successful predictions by the y-scrambling test. This test was performed using the same training/test splits and model hyperparameters as the unscrambled tests (supplementary materials SI1, Table S2). The results of the y-scrambled predictions were found to be inaccurate for both regression and classification tasks (Figs. 5, 6 and supplementary materials SI1, Tables S9 and S10), thus confirming that our DNN-based results were not accidental.

Effect of training data on prediction performance
To evaluate the impact of training data on prediction performance, we compared the prediction results in the four scenarios (Sc1-Sc4), which differ in the information value of their training sets. The Sc1 training set consisted of 75% of all interaction values, involving the most similar compounds in a dataset (subset "trn"). In Sc2, the training set consisted of the Sc1 training data expanded with new interactions (subset "trn" + subset "i"). In Sc3, the training set consisted of the Sc1 training data enriched with interactions of new compounds (subset "trn" + subset "c"). In the Sc4 scenario, we created separate models for each target (or set of targets). The training sets consisted of all interaction values from the datasets except (i) interactions from the subset "i" for the selected target (set of targets), and (ii) interactions from the subset "c" connected with compounds having known interactions with the selected target (set of targets). We evaluated the prediction performance in each scenario using two test sets ("c" and "i"), modeling the prediction for novel compounds and for compounds already used for model training, respectively.

Prediction for compounds used for model training
A comprehensive analysis of compounds whose activity has been partially investigated is common practice in drug discovery. Prior knowledge of a compound can significantly facilitate its further investigation. In the context of multi-task learning, a compound's interactions with different targets are interrelated. Therefore, known interactions of a compound with certain targets can aid in predicting its unknown interactions with other targets. To test this hypothesis, we predicted the interaction values from the subset "i", which contains compounds previously used for model training that thus belong to the models' AD, according to the traditional view. We focused on the Sc1, Sc3, and Sc4 scenarios, as the interactions from the Sc2 training set overlap with those from the subset "i".
We found that increasing the training data for the pQSAR(159) dataset improved the prediction results, as shown in Table 3 and Fig. 8. We observed a performance gain in Sc3 (blue line) compared with Sc1 (red line): the mean/median RMSE improved from 0.71/0.71 to 0.68/0.67 with the expansion of the training set by 15.3% of interaction values (30.2% new compounds). In terms of the median R², the prediction quality improved from 0.30 to 0.37, which increased the number of targets with R² > 0 from 139 to 146 out of a possible 157. Further enrichment of the training data (Sc4, green line) improved the mean/median RMSE to 0.6/0.6. The median R² was enhanced to 0.53, increasing to 149 the number of targets for which the model at least partially predicts the outcome. This was achieved by expanding each Sc4 training set (in comparison with the Sc1 training set) by 25.0 ± 3.5% of interaction values (30.0 ± 1.5% new compounds).

We also used the pQSAR(4276) dataset for the regression modeling. Among the datasets, this one is notable for its high information capacity due to the increased number of interactions, compounds, and targets. At the same time, its data density (the percentage of known interactions out of all possible interactions) is extremely low, close to 0.06%, which should negatively impact prediction accuracy. Although the prediction performance changed across the scenarios, the variation was not significant enough to claim an improvement of the models (Table 4, Fig. 9). The performance change for the pQSAR(4276) dataset was considerably lower than for the pQSAR(159) dataset. Although the relative increase in training interactions for the pQSAR(159) and pQSAR(4276) datasets was almost the same, their percentage of the total possible compound-target pairs was drastically different. For example, when we increased the relative number of interactions in Sc2 to 15.3% and 14.3% in the pQSAR(159) and pQSAR(4276) datasets, respectively, the gain of interactions relative to all possible data was 0.62% and 0.01%, respectively. Thus, the gain of 0.62% led to a significant change in the data, resulting in further prediction improvement, while the gain of 0.01% did not.

We evaluated the performance of classification modeling using the ViralChEMBL dataset. Our results showed that the prediction quality improved with the increase of the information value of the training set, as in the regression modeling for the pQSAR(159) dataset (Table 5): with the enrichment of the training data, the mean/median ROC AUC became 0.79/0.89 and the mean/median BA became 0.72/0.80. However, there was no notable increase in the share of targets for which these metrics' values were greater than 0.8. The enhancement of prediction performance due to data enrichment was not as pronounced as for the pQSAR(159) dataset. This could be attributed to the relatively small number of added interactions when considering all possible compound-target pairs, as observed for the pQSAR(4276) dataset. Here the increase in data was 0.06%, which is not as small as for the pQSAR(4276) dataset and may have been sufficient for some improvement of the prediction. The distribution of predicted values shown in Fig. 11a also illustrates the consistent improvement in classification for both classes with the enrichment of the training data.

Prediction for compounds that were not used for model training
In multi-task modeling, predicting interactions for new compounds is referred to as "cold-start" prediction and is a common practice in recommender systems [47]. The term "cold-start" means making a prediction in the absence of the data required for training a reliable model. In drug discovery, there are two kinds of cold-start predictions: compound-based and target-based. The former involves predicting interactions for new compounds that were not used in model training, while the latter involves predicting interactions for new targets that were not considered during model creation.
Both of these face the challenge of limited information about the investigated compounds or targets. In other words, a cold-start prediction lies outside the model's AD or close to its border. To perform cold-start prediction, we used the subset "c", containing compounds that were not used for model training and were not similar to the compounds in the training set. We considered only the Sc1, Sc2, and Sc4 scenarios, because the interactions from the Sc3 training set overlap with those from the subset "c".

The cold-start prediction based on the pQSAR(159) dataset was unsatisfactory across all scenarios. Despite the considerable changes in the training data, the RMSE score showed no improvement: the mean/median RMSE remained around 1.00/1.04 (Table 3, Fig. 12). The median R² varied only slightly across the scenarios, and its absolute value remained too low to consider the prediction a success: the median R² was −0.11, −0.04, and −0.08, and the number of targets with R² > 0 was 32, 62, and 46 out of a possible 159 (20, 39, and 29%) in Sc1, Sc2, and Sc4, respectively. Thus, the prediction failed for most targets, as a negative median R² indicates a prediction worse than a constant function that always predicts the mean of the data. Even for the targets with a positive R² score, the prediction was unreliable, as it did not reach the good-prediction threshold of 0.7.
Further investigation of regression modeling with the pQSAR(4276) dataset also showed no improvement in prediction performance despite the data enrichment (Table 4, Fig. 13). With the Sc1 training set increased by 19.3% (the number of compounds remained constant) in Sc2 and by 7.0 ± 4.5% (8.0 ± 4.6% new compounds) in Sc4, the prediction quality varied only slightly, as evidenced by the mean/median RMSE going from 0.97/0.90 to 0.95/0.89 and 0.96/0.89, respectively. The median R² also showed a minor change from −0.04 to −0.02 and −0.05, resulting in an increase in the number of targets with R² > 0 from 1879 to 1971 and 1880 out of a total of 4204. Based on these findings, we conclude that the models failed to perform a cold-start prediction of interactions with compounds that are highly dissimilar to the training ones.
The results of our classification modeling using the ViralChEMBL dataset also revealed poor prediction performance that failed to improve even with increased training data (Table 5, Fig. 14). The mean/median ROC AUC was 0.65/0.69, 0.67/0.70, and 0.66/0.71, and the mean/median BA was 0.63/0.62, 0.66/0.67, and 0.65/0.66 in the series of the Sc1, Sc2, and Sc4 scenarios. Despite a 5.8% increase in interaction values (the number of compounds remained constant) in Sc2 and a 33.0 ± 1.2% increase (29.0 ± 0.7% new compounds) in Sc4, the prediction quality remained unaltered. As with the regression modeling, we conclude that in multi-task classification modeling, a cold-start prediction cannot be performed for the interaction values of compounds that are highly dissimilar to those in the training data.

Comparison of prediction performance for known and novel compounds
The unique aspect of multi-task learning is its ability to predict interactions between compounds and multiple targets, which makes it possible for a compound from the test set to also appear in the training set. It is reasonable to assume that the prediction accuracy for an interaction would vary depending on whether its compound was included in the training set. To verify this assumption, we evaluated the prediction performance on two subsets: "i" and "c". The first subset consisted of interactions with compounds that were used in the model training, while the second comprised interactions with novel compounds. Our results revealed a significant difference in prediction performance between the two test subsets. The prediction for the subset "i" was accurate for all datasets regardless of the scenario, as evidenced by the results shown in Tables 3, 4, and 5. This can be attributed to the presence of compounds from the test sets in the training data and thus their belonging to the model's AD.
In contrast, the prediction for the subset "c" was inaccurate. For the regression modeling, this was illustrated by the poor mean/median RMSE and median R² scores and the low percentage of targets with R² > 0 (Tables 3, 4).
Moreover, even the positive R² scores were not sufficiently high and did not reach an acceptable level of prediction accuracy. For example, no more than 1% of targets in the pQSAR(4276) dataset exceeded the threshold of 0.7, while their number was zero in the pQSAR(159) dataset. For the classification modeling, the inaccuracy of prediction was reflected by the low mean/median ROC AUC and BA scores, which did not exceed 0.8 (Table 5). This was also depicted in the prediction density plot in Fig. 11b, illustrating the inaccurate prediction of the "inactive" class. We believe that the primary cause of the poor prediction performance for the subset "c" was the lack of overlap between the compounds in the training and test sets and their significant structural differences. Nevertheless, the prediction was not due to chance, as proved by the results of the y-scrambling test, also presented in the figures.

Proposed recommendations
As a result of the research, we formulated a list of recommendations for applying multi-task learning in drug discovery.
- Utilize two distinct test sets to evaluate model performance. The first test set should assess the model's ability to predict interactions for novel compounds, while the second should evaluate predictions for compounds used in the model's training and thus already known to the model. This approach provides a more comprehensive evaluation of the model's ability to generalize and to make accurate predictions for both known and novel compounds.
- Assess the model's performance for each individual target. This enables the identification of targets for which the model is accurate and those for which its prediction performance is poor. This information is invaluable for further improving and developing the model.
- Refrain from using multi-task models to predict interactions for compounds that are substantially different from those used in the model's training. Although multi-task models can make predictions for new compounds, their accuracy decreases as the compounds become more different from the training set. Therefore, it is advisable to limit the use of multi-task models to predicting interactions only for compounds similar to those in the training set. Based on our study, data enrichment by 0.6% of the possible number of interactions provides a valuable increment in prediction performance, while a data increase of 0.01% has no influence. However, this finding requires further investigation, taking into account other factors such as compound/target diversity and data sparsity.

Conclusions
Multi-task learning has gained widespread adoption in different domains, including drug discovery. While it presents exciting possibilities, it also poses new challenges for researchers. This study explored the potential of training data enrichment to improve the prediction quality of multi-task models. We evaluated four scenarios (Sc1-Sc4) that varied in the information capacity of the training data. We applied two types of test data for a more precise evaluation of the prediction performance: the first consists of interactions with compounds used in the model training (subset "i"), and the second contains interactions with compounds novel to the model (subset "c"). Our findings reveal that the enrichment of training data can be an effective tool for enhancing prediction performance in multi-task learning. However, the degree of improvement can vary; we assume it depends on the quality of the training data. The greatest change was observed for the small and dense dataset (pQSAR(159)). In contrast, improving the prediction quality for the large and sparse dataset (pQSAR(4276)) proved more challenging. Since enriching a dataset with a massive number of possible compound-target pairs requires adding numerous new interactions, improving prediction performance for such a dataset can be a daunting task.
Using different test sets provided a more comprehensive assessment of the models and of the influence of the data enrichment. We found that the models were significantly better at predicting interactions with compounds that were used in training than with novel compounds. The advantage in predicting interactions with known compounds was observed in all scenarios using the test set "i". In contrast, predictions for interactions with compounds that were significantly different from those used in training were poor, as demonstrated by the test set "c". This aligns with the assumption that such compounds are outside the applicability domain of the model, so their interactions cannot be predicted. Likewise, training data enrichment will not improve the prediction of interactions for novel compounds as long as the compounds in the training and test sets do not overlap and remain structurally different. Despite the high expectations for DNN-based multi-task learning, predicting interactions for dissimilar, novel compounds remains challenging.
Based on our research findings, we presented a set of recommendations for effectively employing multi-task learning in drug discovery. By following these guidelines, drug discovery researchers can effectively use multi-task learning to improve prediction accuracy and facilitate the discovery of novel drug candidates.