RankNEAT: Outperforming Stochastic Gradient Search in Preference Learning Tasks

Stochastic gradient descent (SGD) is a premier optimization method for training neural networks, especially for learning objectively defined labels such as image objects and events. When a neural network is instead faced with subjectively defined labels, such as human demonstrations or annotations, SGD may struggle to explore the deceptive and noisy loss landscapes caused by the inherent bias and subjectivity of humans. While neural networks are often trained via preference learning algorithms in an effort to eliminate such data noise, the de facto training methods rely on gradient descent. Motivated by the lack of empirical studies on the impact of evolutionary search on the training of preference learners, we introduce the RankNEAT algorithm, which learns to rank through neuroevolution of augmenting topologies. We test the hypothesis that RankNEAT outperforms traditional gradient-based preference learning within the affective computing domain, in particular when predicting annotated player arousal from the game footage of three dissimilar games. RankNEAT yields superior performance compared to the gradient-based preference learner (RankNet) in the majority of experiments, since its architecture optimization capacity acts as an efficient feature selection mechanism, thereby eliminating overfitting. Results suggest that RankNEAT is a viable and highly efficient evolutionary alternative for preference learning.


INTRODUCTION
Forms of gradient descent are the natural choice of optimization method for training deep neural networks to predict objectively defined labels in tasks such as image and speech recognition, fraud detection, and event prediction. Over the last few years, we have witnessed a rapidly growing interest in the use of neural networks that are able to classify subjectively defined labels. The family of learning-to-rank or preference learning algorithms [9] that train neural networks, such as RankNet [2], DeepRank [29] and LambdaMART [3], yields good performance by relying primarily on gradient descent methods. Subjectively defined labels, however, including human demonstrations (e.g. creative tasks, navigation traces and paths) or human annotations (e.g. of emotion or aesthetics), yield highly complex, deceptive and noisy loss landscapes for a neural network to learn. Assuming that the plasticity of neuroevolutionary processes would be beneficial for such loss landscapes, in this paper we test the hypothesis that evolutionary search would be a better optimizer for neural network training in preference learning (PL) tasks compared to stochastic gradient descent (SGD).
To test our hypothesis, this paper explores the efficacy of neuroevolutionary search in PL tasks by building on the efficient and popular RankNet [2] architecture and enhancing its search capacity through neuroevolution. In particular, we introduce a novel algorithm named RankNEAT that relies on the Siamese neural network architecture of RankNet and learns to rank via NeuroEvolution of Augmenting Topologies (NEAT) [36]. Unlike traditional gradient-based PL methods, RankNEAT resembles the process of plasticity [7], which induces changes in both the coupling strength and the spatial organization of synapses in biological neural networks. RankNEAT learns to rank subjectively defined labels with high degrees of accuracy through its ability to optimize the network's weights and its edge architecture simultaneously. We test RankNEAT (neuroevolution) and compare it against the vanilla RankNet (stochastic gradient descent) in the task of player affect modeling across three games, using the AGAIN [23] dataset of arousal-annotated gameplay videos. Player modeling [46] is an important subfield in game research since it promotes the development of reliable human-computer interaction systems and consequently improves the users' experience. Our current approach feeds images of gameplay to a pretrained vision transformer, while the last fully-connected layer of the network is then trained to predict ordinal values of arousal, using RankNet or RankNEAT. Results indicate that RankNEAT is superior to SGD (RankNet) in training PL models of arousal in the majority of experiments performed.
Our key findings suggest that RankNEAT is a viable PL paradigm which achieves performance comparable to or significantly higher than that of RankNet. In this first experiment, RankNEAT optimizes the edge topology of the network's last layer, resembling an evolutionary feature selection strategy that eliminates unnecessary features from the observed input space. Additional studies should explore how RankNEAT performs in other subjectively defined tasks and hyperparameter setups, such as increasing the topological complexity.
This paper is novel in many ways. First, to the best of our knowledge, this is the first time a NEAT-based preference learner is introduced, combining a traditional learning-to-rank neural network architecture with neuroevolution. Second, RankNEAT is tested broadly across three dissimilar games from the same genre, showcasing the robustness of the method for affect modeling. Third, the proposed approach is compared thoroughly against SGD (RankNet) across different games and hyperparameters. Finally, RankNEAT is combined with vision transformers (pretrained on ImageNet), enabling us to offer general-purpose representations for solving tasks with subjectively defined labels.

RELATED WORK
This section surveys related work on the performance comparison of evolutionary algorithms and gradient descent for training neural networks (see Section 2.1) and on the intersection of evolutionary search and affect modeling (see Section 2.2).

Evolution versus Backpropagation for Neural Network Training
Although SGD is currently the most widely applied training algorithm for neural networks, there has been a rapidly growing interest in employing evolutionary algorithms for optimizing deep learning models over the last years [34,36,47,49]. Evolution and gradient descent through backpropagation (BP) are, however, fundamentally different and thus their comparison is a challenging task that numerous studies have tried to tackle. Indicatively, Mandischer [18] pitted evolutionary strategies (ES) against BP for neural network training on several benchmark problems, evaluating them based on the computational effort required to reach a certain error limit and their ability to converge. Results showed that while ES were good for training neural networks with non-differentiable activation functions, they could not compete with BP in large-scale problems. Siddique et al. [35] proposed a genetic algorithm (GA) capable of outperforming BP in function approximation in terms of convergence. Sexton et al. [33] compared a GA and BP on in-sample, interpolation, and extrapolation data in terms of root-mean-square error, number of epochs, and execution time. Results showed that GAs can be employed to strike a balance between model over-parameterization and model robustness. Gupta et al. [12] compared GAs and BP in terms of effectiveness, ease-of-use, and efficiency for training neural networks, showing that the former can provide better results in a chaotic time series problem. Gudise et al. [11] conducted a comparative study which demonstrated that the weights of a feedforward neural network tend to converge faster with particle swarm optimization than with BP when it comes to function approximation. Sexton et al. [32] compared evolution and BP across ten real-world classification problems, showing that BP reached a higher classification error on average. Yannakakis et al. [42] employed supervised and genetic approaches to study the emergence of cooperative behavior among agents in a complex simulated environment and demonstrated that a genetic approach based on rewarding and minimal communication resulted in more efficient computational models of multi-agent spatial organization than supervised learning mechanisms. Finally, the authors of [25] compared SGD against an evolutionary algorithm that evaluates individuals on a small number of training samples per generation on several benchmarks, and showed that the evolutionary algorithm could optimize large neural networks about as fast and effectively as SGD.
In this work, we extend the literature by comparing neuroevolution and SGD performance in preference learning problems for the first time. In particular, we introduce a NEAT-based preference learner capable of predicting player arousal from gameplay footage and compare its performance with that obtained via SGD. Our results show that combining the topological and global optimization properties of NEAT with the Siamese network architecture of the traditional RankNet can result in robust learning-to-rank models that outperform the BP models trained via SGD.

Modeling Affect via Evolutionary Search
Affective computing is the study of emotions, their manifestations and expressions, and the ways to capture (model) them computationally [30]. While research at the intersection of affective computing and evolutionary algorithms has been active over the last decade, studies in the literature are still relatively sparse. For instance, Martinez et al. [21] presented a genetic search-based feature selection method for improving the accuracy of affective models, comparing it against sequential forward feature selection and random search in a game survey dataset. Tahir et al. [37] introduced a binary chaotic genetic algorithm for feature selection, which achieved scores two times higher than a baseline genetic algorithm in identifying seven emotional states. Finally, Alvarez et al. [1] employed artificial evolution to select speech feature subsets that optimize the success rate of emotion recognition.
When it comes to games, the domain we study in this paper, player modeling [46] refers to the study of models that accurately predict how a player behaves and feels while playing a game. Affect models based on gameplay can provide valuable insights into how players interact with games. As input to such models, most studies apply domain knowledge by manually authoring high-level handcrafted gameplay features. For instance, Frommel et al. [8] employed the input parameters on a graphics tablet and in-game performance to detect the players' current emotional state. Similarly, Melhart et al. [24] showed that handcrafting features that describe the player's input, the artificial agents' actions, and the gameplay context on a high level can yield general models of player arousal.
Although domain knowledge may lead to remarkable results, handcrafted features do not necessarily reduce the data needs of the algorithms but do introduce a critical data preprocessing step. Automated feature extraction, on the other hand, may address such issues. Literature in this vein is fairly sparse. Ng et al. [28] used a deep Convolutional Neural Network (CNN) pretrained on the generic ImageNet dataset to perform emotion recognition on small datasets. Makantasis et al. [16] employed three CNN architectures to predict player arousal from gameplay footage, showcasing that a mapping between gameplay video streams and the player's arousal exists. The same authors also introduced a methodology for predicting arousal from audiovisual features and demonstrated that fusing high-level pixel and audio representations can yield highly accurate models of affect [17]. Finally, the study of Martinez et al. [20] seems to be the first to introduce a deep PL methodology for predicting emotional states from physiological signals. In particular, they showed that using auto-encoders and CNNs to find a mapping from raw signals to learnable features can outperform ad-hoc feature extraction and selection.
In contrast to all aforementioned studies, in this work we employ a pretrained Vision Transformer to extract high-level representations from gameplay footage and fine-tune our RankNEAT model to construct player arousal models for three platformer games. We also compare the behavior of evolutionary PL against gradient-based PL. Results verify that RankNEAT outperforms RankNet in most experiments performed due to the global optimization capabilities of the former. At the same time, its architecture search capacity corresponds to an effective mechanism of feature elimination.

CASE STUDY: PREDICTING PLAYER AFFECT
The neuroevolutionary learning-to-rank methodology proposed in this paper is tested on a challenging dataset of three games which includes many players' gameplay and emotion annotations. The games were created for the purposes of general affect modeling and are part of the AGAIN dataset [23]. In this paper we focus on the three games of the platformer genre featured in AGAIN as they offer sufficiently diverse gameplay properties without needing excessive computation for experimental validation. The three games are shown in Fig. 1 and include Endless, an infinite runner where players must avoid obstacles while automatically moving ever rightward; Pirates!, a jumping platformer similar to Super Mario Bros (Nintendo, 1985); and Run'N'Gun, a more complex game which requires players to move while aiming and shooting at enemies. All games have arcade-style controls of varying complexity (with Run'N'Gun being the most complex), assign a score to the player for in-game actions, and finish after two minutes for the purposes of data collection.
The AGAIN dataset was collected through Mechanical Turk, with players first playing and then annotating each game. Annotation was done using a stimulated recall protocol, showing the player's own gameplay as a recorded video and requiring them to provide moment-to-moment annotation of arousal using the RankTrace annotation tool [15]. Figure 2 shows the arousal annotation trace: the player can keep increasing or decreasing their arousal annotation (unbounded) and can view their entire annotation so far.
The arousal annotations are preprocessed before being used for modeling tasks. First, the trace is normalized to [0, 1] with min-max normalization. Due to the reaction time between a stimulus and a player's emotional response, we process the data into time windows of 3 seconds. In particular, we calculate the mean arousal value of each 3-second time window, which we then use as the subjectively defined label for training (see Section 4.2). Gameplay videos are captured at 24 Hz, resulting in 72 frames per 3-second window. Each gameplay frame is rescaled to a 224 by 224 pixel RGB image, and the 72 frames of the 3-second window are processed iteratively through a Vision Transformer (see Section 4.1).
When aligning arousal time windows and gameplay frames' time windows, we apply a 1-second lag (shifting the annotations 1 second back compared to the video data) to simulate delays in the annotation process [19,24]. After processing the data into 3-second windows and data cleanup (e.g. removing videos with missing frames due to errors during video recording), the dataset across all three games consists of 262 gameplay videos corresponding to 111 different players. In particular, 103 gameplay videos remain for Endless (4,120 time windows), 92 videos for Pirates! (3,680 time windows), and 67 videos for Run'N'Gun (2,680 time windows). Each video within the same game corresponds to a different player, which is important for cross-validation purposes (see Section 5).
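The annotation preprocessing described in this section can be sketched in a few lines. This is a minimal illustration rather than the pipeline used in the experiments: the function name `window_labels`, the assumption that the annotation trace is available at a fixed sampling rate `hz`, and the reading of the 1-second lag as dropping the first second of annotations before windowing are all assumptions of this sketch.

```python
import numpy as np

def window_labels(trace, hz, window_s=3.0, lag_s=1.0):
    """Min-max normalize an arousal trace and average it per time window.

    trace: 1-D sequence of unbounded annotation values sampled at `hz` Hz
    (the fixed sampling rate is an assumption of this sketch).
    """
    trace = np.asarray(trace, dtype=float)
    span = trace.max() - trace.min()
    # Min-max normalization to [0, 1]; a flat trace maps to all zeros.
    norm = (trace - trace.min()) / span if span > 0 else np.zeros_like(trace)
    # Assumed lag handling: video window w is paired with the annotation
    # recorded `lag_s` seconds later, so the first second is dropped.
    lag = int(round(lag_s * hz))
    norm = norm[lag:]
    # Average the normalized trace inside each complete window.
    step = int(round(window_s * hz))
    n_windows = len(norm) // step
    return norm[: n_windows * step].reshape(n_windows, step).mean(axis=1)

# Toy trace sampled at 1 Hz: 13 values, so 4 complete windows after the lag.
labels = window_labels(np.arange(13), hz=1)
```

Each returned value is the label of one 3-second window; incomplete trailing windows are discarded in this sketch.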

METHODOLOGY
This section describes the main components of the algorithms examined in this paper, including the Vision Transformers [6] used to extract high-level features from the video data and the two training methods used for our preference learning task: SGD and neuroevolution. An overview of our approach is presented in Fig. 3.

Vision Transformer
A Transformer is an architecture that utilizes an attention mechanism to discover dependencies between input and output. Although Transformers still employ an encoder and decoder, they eschew recurrence and thus require less training time while achieving better results than other sequence transduction models. The Vision Transformer (ViT) is a Transformer-based architecture for image classification tasks, using a single image as input and mapping it to a high-level vector representation, which, in turn, is fed to a multilayer perceptron responsible for conducting the classification task.
In this study, we use a ViT pre-trained on ImageNet 1K [5] as a backbone model to retrieve high-level vector representations of gameplay frame sequences. Since each time window consists of 72 frames, we replicated the weights of the first layer of ViT 72 times to account for the input mismatch. Through this process, the 72 × 224 × 224 × 3 tensor of pixel values bounded in [0, 1] is transformed into a vector of 768 real values that forms a high-level representation of the gameplay video segment (see Fig. 3).

Preference Learner
Preference learning involves learning to distinguish data points in an ordinal manner [9], and thus can be applied to any supervised problem as long as the labels represent ordinal relationships. Since emotions are ordinal by nature [40,41], in this study we develop a preference learner based on the RankNet architecture [2] to predict players' arousal using ViT representations of gameplay footage. Our arousal models are trained on pairs of gameplay windows.
Specifically, we formulate the arousal prediction task as a PL problem in the following way. Let us denote as $\mathcal{X}$ the space of ViT representations of gameplay footage windows and the data corresponding to the $k$-th gameplay video as $D_k = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathcal{X}$, and $y_i \in [0, 1]$ stands for the arousal annotation of the gameplay's $i$-th window. To employ a PL model, we transform $D_k$ into a dataset of pairs $D'_k = \{((x_i, x_j), \lambda_{ij})\}$, where $\lambda_{ij}$ equals 1 if $y_i - y_j > P_t$ and 0 if $y_j - y_i > P_t$. It should be noted that when $|y_i - y_j| < P_t$ the pair $(x_i, x_j)$ is not included in the dataset $D'_k$. The preference threshold $P_t$ controls whether or not the difference between the labels qualifies as a preference, while the index $k$ emphasizes the fact that pairs are produced with datapoints belonging to the same gameplay footage. Finally, the above data transformation procedure results in a perfectly balanced binary classification dataset.
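A minimal sketch of this pairwise transformation for a single video is given below. The function name `make_pairs` and the choice to enumerate both orderings of every pair of windows (one straightforward way to obtain a balanced dataset) are assumptions made for illustration, not details confirmed by the description above.

```python
import numpy as np

def make_pairs(X, y, threshold):
    """Build preference pairs from the windows of one gameplay video.

    X: (n, d) array of window representations; y: n arousal labels in [0, 1].
    Returns the first items, the second items, and the 0/1 preference labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    idx_i, idx_j, labels = [], [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] - y[j] > threshold:      # window i is preferred: label 1
                idx_i.append(i); idx_j.append(j); labels.append(1)
            elif y[j] - y[i] > threshold:    # window j is preferred: label 0
                idx_i.append(i); idx_j.append(j); labels.append(0)
            # otherwise the difference is too small and the pair is dropped
    idx_i = np.array(idx_i, dtype=int)
    idx_j = np.array(idx_j, dtype=int)
    return X[idx_i], X[idx_j], np.array(labels, dtype=int)

# Three toy windows with one-hot "representations".
X_toy = np.eye(3)
y_toy = [0.1, 0.5, 0.9]
Xi, Xj, lam = make_pairs(X_toy, y_toy, threshold=0.25)
Xi_hi, Xj_hi, lam_hi = make_pairs(X_toy, y_toy, threshold=0.50)
```

In this toy case the lower threshold keeps all six ordered pairs while the higher one keeps only the two pairs formed by the extreme windows, which illustrates how a stricter threshold shrinks the dataset.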
As mentioned above, we adopt the RankNet model [2] for addressing the aforementioned PL problem. RankNet employs a neural network that receives as input pairs $(x_i, x_j)$ and their respective labels $\lambda_{ij}$ and outputs $o_{ij} = f(x_i) - f(x_j)$, where $f$ is a scalar function computed by the neural network. RankNet training aims to estimate the parameters of the neural network that minimize the binary cross-entropy loss of $\sigma(o_{ij})$ with respect to $\lambda_{ij}$, where $\sigma(\cdot)$ is the logistic sigmoid function. In our experiments, we consider linear functions $f$, that is, neural networks with no hidden layers, and we estimate the parameters of $f$ using two fundamentally different optimization methods: SGD with backpropagation, as traditionally employed in RankNet training, and neuroevolution, as described below through RankNEAT.
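For a linear scorer, the RankNet objective and one gradient step can be written compactly. The sketch below is an illustration under stated assumptions, not the training code used in the experiments: the names `ranknet_loss` and `sgd_step`, the learning rate, and the omission of a bias term (which would cancel in the difference of the two scores anyway) are choices made here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ranknet_loss(w, Xi, Xj, lam, eps=1e-12):
    """Mean binary cross-entropy of sigmoid(f(x_i) - f(x_j)) against lam."""
    o = Xi @ w - Xj @ w                      # score difference per pair
    p = sigmoid(o)                           # predicted P(x_i preferred)
    return float(-np.mean(lam * np.log(p + eps)
                          + (1 - lam) * np.log(1 - p + eps)))

def sgd_step(w, Xi, Xj, lam, lr=0.1):
    """One gradient step on a batch of pairs for the linear scorer."""
    diff = Xi - Xj
    p = sigmoid(diff @ w)
    grad = diff.T @ (p - lam) / len(lam)     # gradient of the mean loss
    return w - lr * grad

# Two toy pairs: the first feature alone decides the preference.
Xi_toy = np.array([[1.0, 0.0], [0.0, 0.0]])
Xj_toy = np.array([[0.0, 0.0], [1.0, 0.0]])
lam_toy = np.array([1.0, 0.0])
w0 = np.zeros(2)
loss_before = ranknet_loss(w0, Xi_toy, Xj_toy, lam_toy)
w1 = sgd_step(w0, Xi_toy, Xj_toy, lam_toy)
loss_after = ranknet_loss(w1, Xi_toy, Xj_toy, lam_toy)
```

With all-zero weights every pair is predicted at probability 0.5, so the loss starts at log 2 and decreases after the step.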
RankNEAT. NeuroEvolution of Augmenting Topologies (NEAT) is an established algorithm [36] which goes beyond earlier approaches to neuroevolution that represented only the weights of the network as a vector in the genotype. While the typical NEAT algorithm starts from a minimal network (with only input and output nodes) and expands it with new nodes and edges, in this paper we use a simplified version of NEAT which does not add new nodes and thus does not expand the size of the network. It should be noted, however, that other features of NEAT which are crucial to its success, such as speciation and custom operators for adding and removing edges, are maintained. In RankNEAT we use NEAT to train our RankNet model by optimizing the parameters of the linear function $f(\cdot)$ via weight mutations, crossover, and the addition or removal of edges. Since deleting an edge is essentially the same as setting its weight to zero, the behavior of RankNEAT resembles a feature elimination mechanism.
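The equivalence between deleting an edge and zeroing its weight can be checked directly for a linear scorer. The following sketch is not NEAT-Python code: `score` and `remove_random_edge` are hypothetical helpers written for this illustration, and the boolean mask `enabled` is an assumed stand-in for the set of edges present in a genome.

```python
import numpy as np

def score(x, weights, enabled):
    """Linear scorer f(x) that sums only over edges that still exist."""
    return float(np.sum(x[enabled] * weights[enabled]))

def remove_random_edge(enabled, rng):
    """Hypothetical 'delete edge' mutation: drop one existing edge."""
    child = enabled.copy()
    existing = np.flatnonzero(child)
    if existing.size > 0:
        child[rng.choice(existing)] = False
    return child

rng = np.random.default_rng(0)
n_inputs = 768                                # one edge per input feature
x = rng.normal(size=n_inputs)
weights = rng.normal(size=n_inputs)
enabled = np.ones(n_inputs, dtype=bool)       # fully connected start

child = remove_random_edge(enabled, rng)
zeroed = np.where(child, weights, 0.0)        # same network, weight set to 0
```

Scoring `x` with the mutated mask gives the same value as a dense dot product with `zeroed`, and the input attached to the removed edge no longer influences the output.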
We use the standard implementation of NEAT-Python [22] for running evolution. The initial population consists of fully connected RankNet networks with random weights, which are evaluated and then split into species based on their topological similarities. The fitness of each individual is calculated by processing all pairs in the training set through the ViT and RankNet. Each pair consists of two frame sequences $(x_i, x_j)$ and one ground truth preference ($\lambda_{ij}$); each frame sequence is processed through the ViT and RankNet to derive $f(x_i)$ and $f(x_j)$ and finally to calculate the negative binary cross-entropy of the produced $\sigma(o_{ij})$. The mean of these negative cross-entropy scores over all pairs forms the fitness of the network and informs the selection of parents to mate and mutate.
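The fitness computation described above can be sketched independently of any neuroevolution library. The code below is an assumption-laden illustration and does not use the NEAT-Python API: each individual is represented as a hypothetical `(weights, enabled)` tuple, the inputs are assumed to be precomputed representations, and `rank_population` simply sorts individuals by fitness without modeling speciation.

```python
import numpy as np

def fitness(weights, enabled, Xi, Xj, lam, eps=1e-12):
    """Negative mean binary cross-entropy over all training pairs."""
    w = np.where(enabled, weights, 0.0)       # missing edges contribute 0
    p = 1.0 / (1.0 + np.exp(-((Xi - Xj) @ w)))
    bce = -(lam * np.log(p + eps) + (1 - lam) * np.log(1 - p + eps))
    return float(-bce.mean())                 # higher is better

def rank_population(population, Xi, Xj, lam):
    """Return individual indices sorted from fittest to least fit."""
    scores = [fitness(w, e, Xi, Xj, lam) for w, e in population]
    order = sorted(range(len(population)), key=lambda k: scores[k],
                   reverse=True)
    return order, scores

# Two toy pairs in which only the first feature matters.
Xi_toy = np.array([[1.0, 0.0], [0.0, 0.0]])
Xj_toy = np.array([[0.0, 0.0], [1.0, 0.0]])
lam_toy = np.array([1.0, 0.0])
all_edges = np.array([True, True])
population = [
    (np.array([0.0, 0.0]), all_edges),    # uninformative individual
    (np.array([-2.0, 0.0]), all_edges),   # ranks every pair the wrong way
    (np.array([2.0, 0.0]), all_edges),    # ranks every pair correctly
]
order, scores = rank_population(population, Xi_toy, Xj_toy, lam_toy)
```

The individual that orders both toy pairs correctly receives the highest fitness and would be favored when selecting parents.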

RESULTS
This paper aims to leverage neuroevolution for preference learning, assuming that its global optimization strategy may prove beneficial compared to gradient descent. Thus, the performance metric in our experiments is the accuracy in predicting the ranking between unseen pairs of gameplay footage windows. Specifically, we use a ten-fold cross-validation strategy for splitting the data into training and test sets. We ensure that data in the test set belongs to players that are absent from the training set. Therefore, we follow a leave-several-participants-out method for cross-validation, where the number of held-out participants is between 6 and 11 depending on the game and fold. To address the randomness of weight initialization, genetic operators, and SGD, results are averaged across 5 independent runs [13] throughout the paper (including the 95% confidence interval between these 5 runs).
Due to the many hyperparameters of RankNet and RankNEAT, we perform a sensitivity analysis in Section 5.1 and report the main findings. Using the best parameters, Section 5.2 compares the performance of RankNet and RankNEAT for the three games, attempting to provide a fair ground of comparison in terms of computational effort. Throughout the experiments, we perform three tests per game by varying the preference threshold ($P_t$) between 0.15, 0.25 and 0.50. Higher threshold values can be more dependable in terms of the accuracy of the ranking but lead to significantly smaller datasets for training and testing.
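A participant-wise split and the pairwise accuracy metric can be sketched as follows. This is a minimal illustration, not the evaluation code of the experiments: the helper names `participant_folds` and `pairwise_accuracy`, the random shuffling of players, and the use of near-equal fold sizes are assumptions.

```python
import numpy as np

def participant_folds(player_ids, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) so that no player appears in both."""
    player_ids = np.asarray(player_ids)
    rng = np.random.default_rng(seed)
    players = np.unique(player_ids)
    rng.shuffle(players)
    for held_out in np.array_split(players, n_folds):
        test_mask = np.isin(player_ids, held_out)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def pairwise_accuracy(w, Xi, Xj, lam):
    """Share of pairs whose predicted order matches the ground truth."""
    pred = ((Xi - Xj) @ w > 0).astype(int)
    return float(np.mean(pred == lam))

# Toy example: 103 players (as in Endless), four windows each.
ids = np.repeat(np.arange(103), 4)
folds = list(participant_folds(ids))
held_out_counts = [len(np.unique(ids[test])) for _, test in folds]
```

As a sanity check on this sketch, splitting 103, 92, or 67 players into ten near-equal folds holds out between 6 and 11 players per fold, which is consistent with the range stated above.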

Parameter Tuning
Several training hyperparameters control the behavior of both NEAT and SGD as optimizers. Parameter setting is often achieved through empirical trial-and-error processes. In terms of RankNet, we tune the batch size since adjusting this parameter has a two-fold effect. On the one hand, the batch size is inversely proportional to the number of updates per epoch, affecting the speed of the training process. On the other hand, the ratio of learning rate to batch size is a key element influencing the SGD dynamics [14]. When it comes to RankNEAT, there is no single correct choice of parameters for all problems due to interdependencies between hyperparameters such as population size and crossover [4]. Although the compatibility threshold (set to 3), the elitism per species (set to 2), and the mutation rates (0 and 0.5 for nodes and edges, respectively) were tuned according to some preliminary experiments, the population size was adjusted based on a more systematic approach since it influences both the training time and the robustness of the learner [31]. This section details our experiments on the three game test-beds for determining the optimal population size and batch number. It should be noted that other hyperparameters, such as the learning rate for SGD and the compatibility coefficients and the survival threshold for NEAT, were kept at their default values from their respective libraries. For space considerations we only present results with a preference threshold of $P_t = 0.25$ in this section, as experiments with the other two threshold values did not reveal any substantial differences for tuning the selected hyperparameters of RankNet and RankNEAT.
Figure 4 shows the progress of RankNet (SGD) and RankNEAT (neuroevolution) over 10 epochs and 10 generations, respectively. It should be noted that generations include more evaluations (depending on the population size) than SGD epochs and thus the results between RankNet and RankNEAT are not comparable here. Evidence across all three games shows that large batch number values lead to a quick increase in accuracy for RankNet, but subsequent epochs see a drop as the process overfits to the training set. Evidently, with small batch number values testing accuracy increases more slowly but has the potential to reach higher values. Based on this finding, we use a batch number of 10 as the best parameter in the experiments of Section 5.2. Evolution, on the other hand, understandably benefits from larger populations: for instance, with a population size of 1000 we see a quick optimization at the first generation but relatively small improvements after that.
Since with a population size of 100 the test accuracy reaches similar values as with a population size of 1000 within a few generations, we choose a population size of 100 in the experiments reported in the remainder of this paper for its significantly lower computational cost. We should note that optimization with a population size of 10 is slow and does not seem to converge within 10 generations; it is possible that with more generations it could reach the performance of larger populations, but we could not test this assumption in this paper.

RankNEAT versus RankNet
This section compares the best RankNEAT and RankNet models according to the hyperparameters investigated in Section 5.1. Following earlier comparative studies [18], we treat each training epoch and each individual's fitness evaluation as having the same computational overheads and thus report test accuracy over iterations. Figure 5 shows the progress over many iterations for the different datasets produced from different games and different preference thresholds ($P_t$). Even though we chose a batch number of 10 because it did not overfit during the short training runs of Section 5.1, it is evident that as training progresses RankNet is still prone to overfitting. In all cases, test accuracy for RankNet drops after the first 100 iterations, often significantly (e.g. in Fig. 5a). On the other hand, evolution starts out performing poorly but steadily improves in later generations. While evolution assesses its individuals in terms of accuracy in the training set and consistently improves there, it is evident that the models are also able to perform well (despite some fluctuations between generations) in the test set. At the same computational effort (1,500 iterations), RankNEAT yields between 1% and 5% higher test accuracies than RankNet, on average, across the 9 experiments performed (with RankNEAT significantly outperforming RankNet in 5 of our 9 tests). Admittedly, in some of the experiments this is due to a noticeable drop in accuracy at later epochs for RankNet; in practice an early stopping criterion for RankNet would likely prevent this. Taking the best models discovered, on average, within these 1,500 iterations as a whole, we derive the results of Table 1. Here, we see that the results are comparable in several cases, although for the Pirates! game RankNEAT consistently performs better. It is worth noting that all models regardless of method underperform in Pirates! We hypothesize that RankNEAT may be able to perform better in more challenging problems. It is also worth noting that when we compare the best run of each algorithm, RankNEAT yields higher accuracies than RankNet in 7 out of 9 experiments.
Apart from the fact that RankNEAT performs global optimization, we expect that the custom operators that add or delete edges are especially powerful for this problem. As noted in Section 4.2, our version of RankNEAT does not allow for larger topologies to emerge, but both speciation and topology changes in the edges are expected to have an impact. We expect that deleting an edge can act as a feature elimination mechanism and remove features that do not play a role in predicting arousal. Indeed, we observe that the best models of Table 1 for RankNEAT have between 5% and 6% fewer edges than the fully connected SGD network (with 768 edges). Due to the stochastic nature of the edge removal operator, this "feature selection" requires several generations to be impactful, but may largely be responsible for the good performance of the models.
We observe that models tend to be more accurate at higher preference thresholds. This is not surprising, and matches past findings [16], as ambiguous rankings are more aggressively cleaned. It is worth noting, however, that this comes at the cost of volume and generality of the dataset: indicatively, the datapoints at $P_t = 0.50$ are only 28% of the datapoints at $P_t = 0.15$ across all games.

Qualitative findings
Results presented in the previous section show that player arousal can be modeled based on general-purpose representations such as video frames and, consequently, pixels. Drawing inspiration from the study of Makantasis et al. [16], we constructed class activation maps (CAM) in order to gain insights into which regions of the frames contributed the most to the final result. Our Eigen-CAM implementation relies heavily on the PyTorch library for CAM methods [10]. It should be noted that Eigen-CAM visualizes the principal components of the learned features, and thus it does not rely on the backpropagation of gradients or any other class relevance score [26]. We use RankNEAT, as it achieves the highest accuracies overall, to construct the visualization of Fig. 6. In these activation maps, warmer colors correspond to stronger predictors of the arousal value for a specific player in the test set. From the samples of Fig. 6, we observe that important predictors of arousal across games are regions containing information about the player, such as the avatar's position, life, game time, and score. Furthermore, the regions that contain information about the enemies' avatars are also very important for the model. In two out of three games, the model manages to mask out some of the redundant information in the environment, such as empty space in Endless or the sky background in Run'N'Gun. For Pirates!, however, such patterns are less clear, and the model excludes the powerups from high importance regions. This may explain the relatively low accuracy value achieved on this game.
[Figure 6: gameplay footage and the corresponding activation maps for Endless, Pirates!, and Run'N'Gun.]
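The core idea behind Eigen-CAM, projecting a layer's activations onto their first principal component, can be sketched with NumPy alone. This is not the implementation used for Fig. 6: the function name `eigen_cam`, the mean-centering step, the sign convention, and the final rescaling to [0, 1] are assumptions of this sketch, and the choice of layer is left open.

```python
import numpy as np

def eigen_cam(activations):
    """Project feature activations onto their first principal component.

    activations: array of shape (channels, height, width) from one layer.
    Returns a (height, width) saliency map rescaled to [0, 1].
    """
    channels, height, width = activations.shape
    # One row per spatial location, one column per channel.
    flat = activations.reshape(channels, height * width).T
    flat = flat - flat.mean(axis=0)
    # Rows of vt are the principal directions in channel space.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[0]
    # The sign of a singular vector is arbitrary; as a convention of this
    # sketch, orient the map so its largest-magnitude value is positive.
    if abs(proj.min()) > proj.max():
        proj = -proj
    span = proj.max() - proj.min()
    cam = (proj - proj.min()) / span if span > 0 else np.zeros_like(proj)
    return cam.reshape(height, width)

# Toy activations: a single spatial location is strongly active.
acts = np.zeros((4, 3, 3))
acts[:, 1, 1] = [1.0, 2.0, 3.0, 4.0]
cam = eigen_cam(acts)
```

In the toy example the only active location receives the highest value in the map, while all other locations fall to the bottom of the scale.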

DISCUSSION
This work investigated the potential of neuroevolution for handling PL tasks when labels are defined in a subjective and ill-posed manner.We aimed to assess the power of NEAT as a preference learner by comparing the accuracy of NEAT and backpropagation in arousal prediction from general-purpose representations (gameplay videos) across three platformer games.To the best of our knowledge, this is the first time a NEAT algorithm has been used in a PL task.In particular, we studied the case of player affect modeling due to the fact that capturing the emotional manifestation of players is of great import for the domain of digital games [45][44].The experiments indicate that RankNEAT can outperform RankNet by avoiding overfitting.There is evidence that RankNEAT's operators for deleting or adding edges is beneficial as a form of feature selection.
It should be noted that there is no straightforward way to compare evolution and SGD methods fairly.While past approaches have used CPU time [18,42,43], we instead matched epochs and individual evaluations as approximations of effort.That said, SGD selects a subset of the training data (in experiments in Section 5.2, this was   = 10) to derive a gradient while evolution evaluates cross-entropy in all pairings of the training set.Because we could multi-thread the evaluation of individuals in each generation, RankNEAT was between 57% and 72% faster in CPU times than RankNet per run, for the 1, 500 iterations of Section 5.2 (tested on a CPU-only Intel Xeon, 132GB RAM).We could explore different ways of comparing the two methods in future work, as well as perform a more thorough tuning process for the other hyperparameters.In particular, parameters such as the survival threshold, elitism, and minimum species size can affect the crossover stage, increasing the diversity of the population.Thus, properly tuning these parameters may lead to better exploration of the search space.
Another point worth discussing is our choice of a more restrained version of NEAT for our experiments. The power of NEAT arguably lies in the fact that its operators can increase the network size with new nodes and more edges between these new nodes. While our version uses speciation as well as other operators of NEAT, speciation is more meaningful when networks differ in size. The initial population is fully connected, while evolved individuals may have fewer edges (i.e., simpler topologies) but never more edges. Preliminary experiments with operators that could add nodes, however, led to an evolutionary process that quickly overfits to the training set while performing poorly on the test set. More experiments are necessary to investigate how this behavior can be countered, e.g. with different fitness evaluation schemes that assess a smaller subset of the training data, similar to SGD's batch size. We will also consider and test alternative neuroevolutionary search methods such as Covariance Matrix Adaptation Evolution Strategy, Differential Evolution, and their respective variants [27, 39] against the algorithm introduced in this paper. When it comes to RankNet, there is a plethora of hyperparameters that might influence the performance of a model (e.g. network size, regularization) and that need to be examined in a follow-up study. The initial study presented here, however, contains a fair amount of hyperparameter tuning experiments for both algorithms, as described in our results.
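A restrained topology search of this kind can be sketched as follows. This is a simplified illustration and not the RankNEAT implementation: it omits innovation numbers, speciation, crossover and weight mutation, and it assumes that "adding" an edge means re-enabling one of the connections of the initial fully connected network. All names (`fully_connected`, `mutate_topology`, `active_inputs`, `p_delete`) are hypothetical.

```python
import random


def fully_connected(n_inputs, n_outputs, rng):
    """Initial genome: every input is connected to every output."""
    return {
        (i, o): {"weight": rng.uniform(-1.0, 1.0), "enabled": True}
        for i in range(n_inputs)
        for o in range(n_outputs)
    }


def mutate_topology(genome, rng, p_delete=0.5):
    """Return a child with one edge disabled or one edge re-enabled.

    The set of candidate edges never grows, so a child can have fewer
    edges than the initial network but never more.
    """
    child = {edge: dict(gene) for edge, gene in genome.items()}
    enabled = [e for e, g in child.items() if g["enabled"]]
    disabled = [e for e, g in child.items() if not g["enabled"]]
    if enabled and (not disabled or rng.random() < p_delete):
        child[rng.choice(enabled)]["enabled"] = False
    elif disabled:
        child[rng.choice(disabled)]["enabled"] = True
    return child


def active_inputs(genome):
    """Inputs that still reach an output, i.e. the selected features."""
    return sorted({src for (src, _), g in genome.items() if g["enabled"]})
```

In this sketch, an input whose outgoing edges are all disabled no longer influences the output, which is the sense in which edge deletion can act as feature selection.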
In terms of future research, there are several directions that we can follow to extend this work. An obvious next step regarding the scalability of this approach is testing the efficiency of RankNEAT in predicting affect for the remaining six games of the AGAIN dataset, which include racing games and shooter games [24]. A more important next step is testing whether the mapping between pixels and arousal found via neuroevolution can be general-purpose, for instance being able to predict arousal rankings in unseen games of the same genre. Earlier work [24] has shown that gameplay metrics (provided they are well-designed) can be robust predictors of arousal even in unseen games of the same genre. Establishing similar predictors through gameplay footage alone is arguably fundamental for general affect modelling [38]. Although this initial study used computer games as its test-bed domain, the proposed method is general and thus applicable to any affective computing and preference learning task; it remains to be seen to what degree the results hold for other tasks, datasets and domains of preference learning.

CONCLUSIONS
This paper introduced RankNEAT, an algorithm that transfers the benefits of the NEAT algorithm to learning-to-rank tasks in challenging domains with subjectively defined and biased labels, such as affective computing. By leveraging pretrained computer vision models, we were able to evolve accurate models of arousal (with a test accuracy as high as 77% on average) using only gameplay footage. Comparing the performance of neuroevolution against stochastic gradient descent, which is the standard optimization method for PL, we observe that neuroevolution can overcome issues of overfitting. While SGD can sometimes find robust models early on, overfitting leads to a drop in accuracy that is difficult to control. In contrast, RankNEAT continues to produce ever more accurate models, and in some cases results had not converged at our ad-hoc cutoff point. Additional experiments, in more games and with more extensive exploration of hyperparameters (such as evolving larger topologies), are necessary to assess the true potential of this approach for player modeling, affective computing, and any machine learning domain that involves human demonstration and annotation.

Figure 1: The three games used in this study. From left to right: Endless, Pirates!, Run'N'Gun.

Figure 2: Arousal annotation of a player's gameplay footage using the time-continuous unbounded RankTrace protocol.

Figure 3: Illustration of the preference learning architecture (top image) and training methods (bottom image) used. ImageNet ViT (see Section 4.1) is not trainable; only the preference learner is trained. Gameplay footage feeds the Siamese PL architecture and the binary cross-entropy is calculated on the difference between the network's output and the respective preference (annotated) label (see Section 4.2). The training parameters of the preference learner are optimized either: (a) via RankNEAT (Section 4.2), where the binary cross-entropy becomes the fitness, or (b) via RankNet [2], where the binary cross-entropy serves traditionally as the loss function.

Figure 4: Impact of the population and batch size on the performance of the two algorithms.

Figure 5: Accuracy (and 95% confidence intervals) over evaluations for the RankNEAT and RankNet models. The black dotted line shows the (random) baseline accuracy of 50%.

(with each generation of RankNEAT counting as many iterations as there are individuals in the population, and each epoch in RankNet counting as 1 iteration). As in Section 5.1, we measure test accuracy based on a 10-fold leave-participants-out cross-validation, repeated and averaged over 5 independent runs. Based on Section 5.1, all RankNet experiments are performed with a batch size of 10 (10 random pairs are sampled from the training set per epoch to calculate the gradient) and all RankNEAT experiments are performed with a population size of 100 (100 individuals in the population).

Figure 6: Eigen-CAMs for indicative frames of each game: saturated areas show pixels that are important predictors of arousal.