Leveraging Negative Signals with Self-Attention for Sequential Music Recommendation

Music streaming services rely heavily on their recommendation engines to continuously provide content to their consumers. Sequential recommendation has consequently seen considerable attention in the current literature, where state-of-the-art approaches focus on self-attentive models leveraging contextual information such as long- and short-term user history and item features; however, most of these studies focus on long-form content domains (retail, movies, etc.) rather than short-form domains such as music. Additionally, many do not explore incorporating negative session-level feedback during training. In this study, we investigate the use of transformer-based self-attentive architectures to learn implicit session-level information for sequential music recommendation. We additionally propose a contrastive learning task that incorporates negative feedback (e.g., skipped tracks) to promote positive hits and penalize negative hits. This task is formulated as a simple loss term that can be incorporated into a variety of deep learning architectures for sequential recommendation. Our experiments show that this results in consistent performance gains over baseline architectures that ignore negative user feedback.


INTRODUCTION
Recommendation systems have become integral to streaming services such as Spotify, Apple Music, and Deezer, and by proxy, to the music industry as a whole. As the music streaming business model relies on continual user engagement and activity, consistent music discovery is an essential service. Sequential music recommendation is one such task in this domain: given a current user session (i.e., a current sequence of tracks listened to by a user), a system extends the session by recommending the next track to the user. Within the music domain, sequential recommendation is generally split into two categories: next song recommendation (NSR) and automatic playlist continuation (APC). These two tasks can be learned in a similar manner from playlist and listening history information, but they differ in output length: APC aims to extend the session or playlist by an arbitrary length, while NSR only aims to provide the next relevant song in the sequence [16]. For this study we focus specifically on NSR.
MuRS '23, September 19, 2023, Singapore. © 2023 Copyright held by the owner/author(s). https://doi.org/10.5281/zenodo.8372449
Music recommendation differs from other well-studied recommendation domains (retail, movies, games, etc.) in a number of important ways. Individual music tracks are generally short and easily consumed, necessitating a thorough understanding of a user's preferences in order to provide both breadth and depth over a large quantity of relevant recommendations [17]. Robust music recommendation systems often leverage previous consumer history to learn user preferences through methods such as collaborative filtering [11]; however, these approaches fall victim to the cold start problem [17]: for new users or new tracks, the recommendation model does not have any usable information and must guess preferences until the user and/or track has interacted with the system enough for it to learn a profile [6].
Sequential recommendation in general can alleviate this issue by learning session-level relationships instead of, or in tandem with, user-level relationships. By learning session-item relationships from sequential interactions, item profiles can be built rapidly as items interact with the system, since the recommendation engine can compare user sessions directly rather than relying on aggregate statistics via collaborative filtering, which requires much more data to build robust representations [6].
This study aims to leverage implicit and explicit signals present within listening sessions to learn robust profiles for sequential recommendation. Prior work has considered direct incorporation of user feedback for ad-hoc adjustments based on content and context similarity, e.g., [8, 13]. In this work, we investigate learning session-level information via transformer-based architectures, influenced by state-of-the-art (SoTA) methods for sequential retail recommendation, as well as incorporating user feedback through a learned contrastive task. To the best of our knowledge, learning from negative signals/user feedback has not been explored thoroughly for sequential music recommendation due to a lack of public data containing thorough user feedback. Many public music recommendation datasets, such as Lastfm-1K [3], were collected before the streaming boom, when logged listening history was primarily sourced from user curation, leading to few negative signals. For this study, we employ the Music Streaming Sessions Dataset from Spotify [1]. Since many of the interactions present come from programmatic or expert curation rather than user curation, they can be considered exploration events to which the user reacts positively (listens to the track in its entirety) or negatively (skips the track). This provides a rich supply of negative samples from which to learn effective session-level representations.

RELATED WORK
Sequential recommendation systems can generally be divided into two types: session-aware systems leverage session-level history from identifiable users, while session-based systems ignore user labels and aim to build user-agnostic representations using solely discrete sessions [15]. In this study, we investigate a session-based system that implicitly learns a user profile through anonymous listening sessions.
Several session-based approaches have been proposed for retail recommendation tasks. CASER [20] and NextItNet [23] leverage convolutional filters to learn sequential representations. BERT4Rec [18] leverages the bidirectional attention mechanism from BERT [4] to learn a robust vocabulary of items for sequential recommendation.
Several sequential approaches have been proposed for music recommendation tasks, incorporating a variety of information to drive recommendation [16]. Most such approaches leverage contextual and/or content features, largely through extensive user profiles and music tags. Relevant work for these respective approaches includes CoSERNN [5] and Online Learning to Rank for Sequential Music Recommendation [14]. The former leverages contextual information such as the device used, the time of day during recommendation, etc. to drive contextual user-sequential embeddings for sequential recommendation, while the latter leverages content features via music tags for an online learning-to-rank scheme. In the study closest to our task, Wen et al. investigate leveraging implicit user feedback immediately after a click for video and music recommendation, and find performance gains from incorporating this information into a variety of recommendation approaches [22]. Most state-of-the-art sequential music recommendation approaches leverage several types of information that are often not present in public datasets (e.g., lyrics, user contextual/demographic information, music tags, etc.). It would be difficult to re-implement and test these systems in a cold-start or academic setting due to the amount and variety of data required. Our approach aims to alleviate this data issue by taking advantage of implicit relationships from data present solely in listening sessions, namely item labels and timestamps of user events. We additionally do not take into account long-term user history due to a lack of user labels; thus, we focus on creating a session-based system.

METHOD
Problem Statement
In our scenario, we define a session $s_u$ of length $K$ and a set of possible tracks $t \in T$ for user $u$. Track $t_i$, where $t_{1,2,\dots,K} \in s_u$, represents the track at time step $i$ in session $s_u$, where $i \in [1 \dots K]$. Generally, the task of a sequential recommendation system is to predict the desired next item $h_i$ at time step $i + 1$ for each $t_i \in s_u$, given an interaction history $S_i$, where $S_i = \{t_j \in s_u \mid j \le i\}$.
For negative feedback-agnostic sequential recommendation systems (i.e., where the user has not explicitly responded negatively to any item), we define $h_i$ for track $t_i$ as the next track in the sequence, $t_{i+1}$.
For our feedback-aware system, we define the set of positive examples (no-skip) as $P_u$ and the set of negative examples (skipped tracks) as $N_u$ per sequence $s_u$, such that $P_u \subseteq s_u$, $N_u \subseteq s_u$, and all $p_m \in P_u$, $n_l \in N_u$, where $m, l$ correspond to the time step of each example in session $s_u$. Additionally, for clarity, we define $M$ and $L$ as the sets of time steps for all positive and negative examples, respectively, where $m \in M$ and $l \in L$. For any track $t_i$, we define the desired next track $h_i$ as the next positive example in the session, $p_m$, such that:
$$h_i = p_m, \quad m = \min\{m' \in M \mid m' > i\}$$
where the difference $m - i$ reflects the number of skipped tracks between track $t_i$ and its next positive sample (exactly $m - i - 1$ tracks are skipped).
To predict the desired next track at time step $i$, we model a probability distribution $P(h_i = t \mid S_i)$ over all possible tracks. Sorting this distribution provides a ranking of the most relevant items. By learning from negative feedback, we aim both to raise the ranking of $p_m$ and to lower the rankings of items in $N_u$ when predicting each $h_i$.
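The feedback-aware target definition above can be illustrated with a small sketch. It assumes a session is given as a list of track ids plus a parallel list of skip flags; the function name `next_positive_targets` and this input layout are hypothetical and not taken from the authors' implementation.

```python
from typing import List, Optional


def next_positive_targets(tracks: List[str], skipped: List[bool]) -> List[Optional[str]]:
    """For each position, return the next non-skipped track that follows it.

    Positions with no later positive example get None (no target).
    """
    targets: List[Optional[str]] = [None] * len(tracks)
    next_positive: Optional[str] = None
    # Walk backwards so each position sees the nearest positive after it.
    for i in range(len(tracks) - 1, -1, -1):
        targets[i] = next_positive
        if not skipped[i]:
            next_positive = tracks[i]
    return targets


session = ["a", "b", "c", "d", "e"]
skips = [False, True, True, False, False]
# A feedback-agnostic system would use b, c, d, e as targets instead.
print(next_positive_targets(session, skips))  # ['d', 'd', 'd', 'e', None]
```

In this toy session, tracks b and c are skipped, so the target for a, b, and c alike is d, the first track that was played through.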

Model Architecture
We investigate unidirectional and bidirectional transformer-based architectures in this study, inspired by the SASRec [9] and BERT4Rec [19] architectures, respectively. For both approaches we use the same base architecture described below, with the sole differences being the training procedure, the learning objective, and the use of a causal attention mask in the case of the unidirectional model. We keep the implementation analogous to those of the aforementioned authors for better comparison.

Track Embeddings.
We store learned track embeddings in a lookup table $E \in \mathbb{R}^{N \times d}$, where $N$ is the number of tracks and $d$ is the embedding dimensionality. $E(\cdot)$ denotes the function retrieving the embeddings of a track or set of tracks from table $E$.

Positional Embeddings.
To inject information about the position of each track in the sequence, we add a learnable positional embedding, drawn from a table of size $K \times d$, to each track embedding in the sequence, where $K$ corresponds to the size of the sequence.

Encoder.
We employ a standard transformer encoder to learn contextual session-level information. This is a fully attention-based model employing multiple multi-head self-attention layers and position-wise feedforward layers to learn contextual information from sequential inputs.

Prediction Layer.
After obtaining hidden vectors with contextual information from the encoder, we project them through a fully connected layer with GELU activation [7] to obtain a predicted embedding $\hat{t}_i$ for each track $t_i$ in the session. We then compute an inner product with the embedding table $E$ and apply a sampled softmax to obtain a probability distribution over the tracks.
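As a rough illustration of the final scoring step only, the sketch below takes an already-computed predicted embedding, scores it against every row of a small embedding table by inner product, and applies a full softmax. The encoder, the GELU projection, and the sampling of candidates are omitted; the helper names and toy vectors are assumptions made for the example.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_tracks(predicted: List[float], table: List[List[float]]) -> Tuple[List[int], List[float]]:
    """Score every track by inner product with the predicted embedding.

    Returns track indices sorted from most to least probable, and the
    probability assigned to each track (indexed by track id).
    """
    scores = [sum(p * e for p, e in zip(predicted, row)) for row in table]
    probs = softmax(scores)
    order = sorted(range(len(table)), key=lambda j: probs[j], reverse=True)
    return order, probs


# Toy table with three tracks and two embedding dimensions.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
order, probs = rank_tracks([0.9, 0.1], embedding_table)
print(order)  # [0, 2, 1]
```

Because the same table is used for input lookup and output scoring, moving two track embeddings closer together directly raises their mutual ranking.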

Sampled Softmax.
Additionally, for training stability with such a large number of classes (∼1M tracks in this study), we employ a sampled softmax function during training. For each session in each mini-batch, we uniformly sample 1000 unseen tracks and rank the target tracks alongside these. These 1000 tracks are re-sampled each epoch, so that as training continues, the model learns to "rank" the target items against a growing subset of the total tracks, as the number of unique tracks sampled for comparison increases.
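A minimal sketch of this idea follows, assuming tracks are identified by integers in `range(num_tracks)` and that "unseen" means not present in the current session. The function names, the rejection-sampling strategy, and the toy scoring function are illustrative assumptions rather than the authors' code.

```python
import math
import random
from typing import Callable, List, Set


def sample_unseen_tracks(num_tracks: int, seen: Set[int], num_samples: int,
                         rng: random.Random) -> List[int]:
    """Uniformly sample distinct track ids that are not in `seen`."""
    if num_tracks - len(seen) < num_samples:
        raise ValueError("not enough unseen tracks to sample from")
    sampled: Set[int] = set()
    # Rejection sampling avoids materialising the full list of candidates.
    while len(sampled) < num_samples:
        candidate = rng.randrange(num_tracks)
        if candidate not in seen:
            sampled.add(candidate)
    return sorted(sampled)


def sampled_nll(score_fn: Callable[[int], float], target: int,
                negatives: List[int]) -> float:
    """Negative log likelihood of the target under a softmax restricted
    to the target plus the sampled negatives."""
    scores = [score_fn(target)] + [score_fn(n) for n in negatives]
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[0]


rng = random.Random(0)
negatives = sample_unseen_tracks(num_tracks=10_000, seen={3, 17, 42},
                                 num_samples=1000, rng=rng)
# Toy scorer that slightly prefers the target track 42.
loss = sampled_nll(lambda track: 1.0 if track == 42 else 0.0, 42, negatives)
print(round(loss, 4))
```

Calling `sample_unseen_tracks` again with a fresh random state each epoch mimics the re-sampling described above, so the target is compared against a different subset of the catalogue over time.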

Sequential Recommendation Task
For both approaches, we employ the same learning objective for training, the negative log likelihood (NLL); however, they differ in how this learning objective is used.

Unidirectional.
We employ the next-item prediction task for this approach. For each track $t_i$ in the session, we task the model with predicting the next item in the sequence, $t_{i+1}$. We then compute log-probabilities and pass them to the NLL loss. Additionally, attention maps are computed using a causal mask, preserving the autoregressive nature of unidirectional transformers.
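The two ingredients of the unidirectional setup, shifted targets and a causal mask, can be sketched as follows. The helper names are hypothetical, and the mask polarity (True means "may attend") is a choice made for this example; deep learning frameworks differ on this convention.

```python
from typing import List, Tuple


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def shift_targets(session: List[int]) -> Tuple[List[int], List[int]]:
    """Pair every track except the last with the track that follows it."""
    return session[:-1], session[1:]


inputs, targets = shift_targets([10, 11, 12, 13])
print(inputs, targets)  # [10, 11, 12] [11, 12, 13]
for row in causal_mask(3):
    print(row)
```

With this mask, the prediction at each position depends only on that position and earlier ones, so every position in a session yields a valid training example in a single pass.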

Bidirectional.
We employ the cloze, or masked language modelling (MLM), task for this approach. We randomly mask a proportion of each sequence with a special token [MSK] and task the model with predicting the correct track at these indices using a bidirectional attention map. For the sequential recommendation task, we also append the [MSK] token to the end of the sequence and set its target to the last track in the session targets, ensuring that this target does not appear in the bidirectional attention map.
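The two masking modes described above can be sketched as below. The function names, the rounding of the number of masked positions, and the dictionary used to hold targets are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MSK]"


def mask_for_training(session: List[str], proportion: float,
                      rng: random.Random) -> Tuple[List[str], Dict[int, str]]:
    """Replace a random proportion of tracks with MASK.

    Returns the masked inputs and a mapping from masked index to the
    original track, which the model must recover.
    """
    num_masked = max(1, round(proportion * len(session)))
    positions = rng.sample(range(len(session)), num_masked)
    inputs = list(session)
    targets: Dict[int, str] = {}
    for pos in positions:
        targets[pos] = session[pos]
        inputs[pos] = MASK
    return inputs, targets


def mask_for_next_item(history: List[str], next_track: str) -> Tuple[List[str], Dict[int, str]]:
    """Append MASK after the history; its target is the held-out next track,
    which therefore never appears among the attended inputs."""
    return list(history) + [MASK], {len(history): next_track}


inputs, targets = mask_for_training(list("abcdefghij"), 0.2, random.Random(0))
print(inputs, targets)
print(mask_for_next_item(["a", "b"], "c"))  # (['a', 'b', '[MSK]'], {2: 'c'})
```

The second helper mirrors how next-item prediction is framed for a bidirectional model: the only unknown is the final position, and the model sees the full history on its left.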

Skip-informed Contrastive Task
To learn negative sequential track relationships, we employ a contrastive learning task using the skipped tracks in each listening session. We employ noise contrastive estimation with the InfoNCE loss [21], shown below:
$$\mathcal{L}_{NCE} = -\mathbb{E}\left[\log \frac{f(x, c)}{f(x, c) + \sum_{x' \in X} f(x', c)}\right]$$
Given a context vector $c$, a positive anchor $x$, and a set of noise samples $x' \in X$, this loss term uses a categorical cross entropy to classify the positive anchor from the set of noise samples, given a scoring function $f(x, c)$.
For each track $t_i$ in the session, we adapt this to our task of promoting the next true positive sample $p_m$ and penalizing all negative (skipped) samples $n_l \in N_u$ by taking $E(p_m)$ as the positive anchor, the embeddings $E(N_u)$ as the noise samples, and a scoring function $f$ based on the cosine similarity of its arguments. This maximizes the cosine similarity between the embedding of track $t_i$ and that of its next positive sample $p_m$ while minimizing the similarity between $t_i$ and all $n \in E(N_u)$. Since logits are computed during prediction by the inner product of $\hat{t}_i$ and $E$, this directly affects the rankings of $p_m$ and all $n_l \in N_u$ by drawing $t_i$ and $p_m$ closer together in the learned embedding space and consequently pushing $t_i$ and all $n_l \in N_u$ farther apart. We experiment with setting the context vector $c$ as both $\hat{t}_i$ and $E(t_i)$. Setting $c = \hat{t}_i$ includes the current session context, while setting $c = E(t_i)$ ignores the current session context and instead relies solely on the overall learned representation of the track. We explore both to examine the extent to which immediate context and contextual history affect the learning of negative preference.
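The following is a minimal sketch of a skip-informed contrastive term for a single track, assuming the scoring function is the exponentiated cosine similarity and that all vectors are non-zero. The `temperature` parameter and the function names are assumptions added for the example; the text above does not state whether a temperature is used.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def skip_contrastive_loss(context: Sequence[float], positive: Sequence[float],
                          negatives: List[Sequence[float]],
                          temperature: float = 1.0) -> float:
    """Cross entropy of picking the positive among positive + negatives.

    `context` stands for either the predicted embedding or the track's own
    embedding; `positive` is the next played-through track and `negatives`
    are the skipped tracks of the session.
    """
    scores = [cosine(positive, context) / temperature]
    scores += [cosine(n, context) / temperature for n in negatives]
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[0]


context = [1.0, 0.0]
aligned = skip_contrastive_loss(context, [1.0, 0.0], [[-1.0, 0.0]])
misaligned = skip_contrastive_loss(context, [-1.0, 0.0], [[1.0, 0.0]])
print(aligned < misaligned)  # True
```

The loss is small when the context is close to the positive and far from the skipped tracks, and it is zero when a session contains no skipped tracks, so such sessions contribute nothing to this term.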

Dataset
For this study we use the Music Streaming Sessions Dataset (MSSD) [1] for training and evaluation, which contains 160 million user sessions of 10 to 20 consecutively listened songs (<60 seconds between listens). These listening sessions are uniformly sampled from a variety of contexts, such as the user's personally curated collections, expertly curated playlists, contextual non-personalized recommendations, and personalized recommendations.
Notably, this dataset is pseudonymized, meaning all included sessions lack a user label. Consequently, we treat each session as a new user, ignoring long-term history.
Skip labels are provided for each track in each session with strength 1-3, defined by the dataset authors as the track being "played very briefly", "played briefly", and "played mostly (but not completely)", respectively. For this study, we are primarily interested in strong negative interactions and therefore only consider tracks with skip strength 1 and 2 as negative examples in each session.
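As a small illustration of this labelling rule, the sketch below splits a session into positive and negative examples. It assumes a hypothetical encoding in which each track carries one integer skip strength, with 0 meaning the track was not skipped, and it assumes that strength-3 tracks are kept on the positive side; neither detail is specified above, and the dataset's actual schema may differ.

```python
from typing import List, Tuple

# Strengths treated as strong negative feedback in this sketch.
NEGATIVE_STRENGTHS = {1, 2}

Example = Tuple[int, str]  # (time step, track id)


def split_feedback(tracks: List[str], skip_strengths: List[int]) -> Tuple[List[Example], List[Example]]:
    """Split a session into positive and negative examples with time steps."""
    positives: List[Example] = []
    negatives: List[Example] = []
    for step, (track, strength) in enumerate(zip(tracks, skip_strengths)):
        if strength in NEGATIVE_STRENGTHS:
            negatives.append((step, track))
        else:
            positives.append((step, track))
    return positives, negatives


pos, neg = split_feedback(["a", "b", "c", "d"], [0, 1, 3, 2])
print(pos)  # [(0, 'a'), (2, 'c')]
print(neg)  # [(1, 'b'), (3, 'd')]
```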
Due to time and computational constraints, we uniformly sample ∼450K discrete sessions containing ∼2 million item interactions with ∼1 million total unique tracks to train and evaluate our models. We note that our subset of sessions contains roughly 15% skipped tracks.

Training Procedure
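As with other contrastive recommendation systems [2, 24], we simply aggregate the sequential task loss $\mathcal{L}_{NLL}$ and the contrastive loss $\mathcal{L}_{NCE}$ as a weighted sum within a single training pass:
$$\mathcal{L} = \alpha \mathcal{L}_{NLL} + \beta \mathcal{L}_{NCE}$$
where $\alpha$ and $\beta$ are scalar terms. We empirically tune these parameters on the validation set.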

Hyperparameters and Implementation
As our data contains variable-length sessions of between 10 and 20 interactions, we pad all sessions to length 20. We stack 2 encoder blocks with 8 attention heads. The embedding and hidden dimensions are both set to 128. Masking for the bidirectional model is applied per batch with a proportion of 0.2. We initialize all parameters via truncated normal sampling with $\mu = 0$, $\sigma = 1$ in the range [−0.02, 0.02]. We tune the two scalar loss weights $\alpha, \beta \in [0.25, 0.5, 0.75, 1]$ using the validation set and select $\alpha = 0.5$, $\beta = 0.5$. We use the Adam optimizer [10] with a learning rate of 0.005, selected after tuning on the validation set over the values [0.0001, 0.0005, 0.001, 0.005]. All models were implemented in Python using PyTorch Lightning and trained on an NVIDIA RTX 2070 GPU.
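For reference, the reported settings can be collected into a single configuration object, shown here alongside a padding helper. The class name, field names, the reserved padding id, and the choice to pad on the right are all assumptions of this sketch; only the numeric values come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TrainingConfig:
    max_session_length: int = 20
    num_encoder_blocks: int = 2
    num_attention_heads: int = 8
    embedding_dim: int = 128
    hidden_dim: int = 128
    mask_proportion: float = 0.2          # bidirectional model only
    init_range: float = 0.02              # truncation bound for initialization
    sequential_loss_weight: float = 0.5
    contrastive_loss_weight: float = 0.5
    learning_rate: float = 0.005


PAD_ID = 0  # hypothetical id reserved for padding


def pad_session(session: List[int], max_length: int) -> Tuple[List[int], List[bool]]:
    """Right-pad a session and return a flag per position marking real tracks."""
    if len(session) > max_length:
        raise ValueError("session is longer than max_length")
    num_pad = max_length - len(session)
    padded = list(session) + [PAD_ID] * num_pad
    is_real = [True] * len(session) + [False] * num_pad
    return padded, is_real


config = TrainingConfig()
padded, is_real = pad_session([5, 6, 7], config.max_session_length)
print(len(padded), sum(is_real))  # 20 3
```

The per-position flags would typically be used to exclude padded positions from both the attention computation and the loss.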

RESULTS AND DISCUSSION
Evaluation
We employ the next-item recommendation task used by [9, 19] for our evaluation. For each sequence, we hold out the final and penultimate items as the testing and validation targets, respectively, and reserve the rest of the sequence for training. For each target, we uniformly sample 1000 unobserved tracks, and the task becomes ranking the target among these tracks. We employ Hit Rate@K (equivalent to recall) as our evaluation metric, with $K \in [1, 5, 10, 20]$. The results are shown in Table 1.
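The evaluation protocol can be sketched as follows, assuming each evaluation case has already been reduced to the model's score for the target and its scores for the sampled unobserved tracks. The function names and the tie-breaking rule (ties favour the target) are assumptions of this example.

```python
from typing import List, Sequence, Tuple


def leave_last_two_out(session: List[int]) -> Tuple[List[int], int, int]:
    """Split a session into training items, validation target, and test target."""
    return session[:-2], session[-2], session[-1]


def hit_at_k(target_score: float, negative_scores: Sequence[float], k: int) -> int:
    """1 if the target ranks within the top k among target + negatives, else 0."""
    rank = 1 + sum(1 for score in negative_scores if score > target_score)
    return int(rank <= k)


def hit_rate(cases: List[Tuple[float, Sequence[float]]], k: int) -> float:
    """Mean of hit_at_k over (target_score, negative_scores) cases."""
    return sum(hit_at_k(target, negatives, k) for target, negatives in cases) / len(cases)


print(leave_last_two_out([1, 2, 3, 4, 5]))  # ([1, 2, 3], 4, 5)
cases = [(0.5, [0.9, 0.1]), (0.8, [0.2, 0.3])]
print(hit_rate(cases, k=1))  # 0.5
```

Since each case has exactly one relevant item, the hit rate at a given cutoff coincides with recall at that cutoff.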

Discussion
We note a number of observations from our experiments: (1) Models trained with the skip-informed contrastive task outperform the feedback-agnostic models in nearly all settings, indicating that learning from negative feedback is beneficial for sequential music recommendation. (2) The unidirectional models consistently outperform the bidirectional models, with a waning performance gap as the top-K for the hit rate increases. (3) Using the final hidden state $\hat{t}_i$ with immediate contextual information as the context vector for the contrastive task performs similarly to, but consistently slightly worse than, using the item embeddings.
Overall, we observe that our contrastive task reliably increases the hit rate in a next-item recommendation scenario, with the exception of HR@20 for the bidirectional model using only track embeddings. Interestingly, even though we create a mismatch between the targets of the sequential recommendation task and the contrastive task, the hit rate for the sequential recommendation task increases, suggesting that optimizing for the next positive example ($p_m$) and the next track ($t_{i+1}$) in tandem improves performance in selecting the next track during inference. We also observe waning performance gains as the number of tracks in the ranking window increases, likely because the contrastive task only relates observed tracks to each other: as the number of unobserved tracks in the comparison increases (i.e., from HR@1 to HR@20), the effect of the contrastive task weakens. Our experiments imply that learning from negative feedback in this fashion mostly affects the top-ranked recommendations.
The relatively weak performance of the BERT-like architecture may be due to the relatively high density of our dataset and our relatively short sequence lengths, so that training in an autoregressive manner, with every sample in the training sequence used in each epoch, may be better for learning latent sequential track relationships. More work is likely needed to find an optimal setup using bidirectional attention with the MLM task. The slight performance improvement when using the track embeddings as the context vector for the contrastive task may imply that while immediate session-level contextual information is useful in learning from negative feedback, reducing this emphasis may provide a slightly stronger signal for a user's next desired track.

CONCLUSION AND FUTURE WORK
Overall, we have presented both a study on the use of transformer-based architectures for sequential music recommendation and a contrastive task for learning from negative feedback. We show through our experiments that the contrastive task results in a greater hit rate on both unidirectional and bidirectional architectures. Multiple avenues for future work arise, namely the inclusion of long-term user profiles for better modelling of long-term and changing user taste. Additionally, contextual and content information can be injected into the embeddings to learn more powerful contextual representations. An analysis of the performance on different session types and streaming behaviors (playlist, auto-generated, user-curated, etc.) [12] would also provide better insight into performance in different listening contexts.