Published June 25, 2026 | Version 1.0
Dataset Embargoed

Sounds Queer: Representation of LGBTQIA Identities in AI-generated Songs

  • 1. University of Bamberg
  • 2. Fraunhofer IDMT

Description

Sounds Queer Dataset

The Sounds Queer dataset contains AI-generated music based on prompts with queer terms. For a full description see the paper.

If you use this dataset, please cite the paper:

Sabine Weber and Andrew McLeod. 2026. Sounds Queer: Representation of LGBTQIA Identities in AI-generated Songs. The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26). https://doi.org/10.1145/3805689.3806752

@inproceedings{Weber:26,
  title={Sounds Queer: Representation of {LGBTQIA} Identities in {AI}-generated Songs},
  author={Weber, Sabine and McLeod, Andrew},
  booktitle={The 2026 ACM Conference on Fairness, Accountability, and Transparency {(FAccT '26)}},
  year={2026},
  doi={10.1145/3805689.3806752}
}

All analysis code can be found in our GitHub repo: https://github.com/apmcleod/sounds-queer.


License

The data and metadata here are provided under the Creative Commons CC BY-NC 4.0 License. See https://creativecommons.org/licenses/by-nc/4.0/ for more details.


Prompts

The dataset was created from 1200 unique prompts, the details of which can be found in 2 files:
- prompts.csv: An ordered list of the 1200 prompts.
- prompt_details.csv: A csv of the terms used to create each prompt.


Music Data

The music data is split first by AI service (Mureka in Mureka.zip and Suno in Suno.zip), and then by prompt (subfolders Prompt1-Prompt1200 in each zipfile).
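The folder layout above can be browsed without extracting the archives. The following is a minimal sketch using only the Python standard library; the helper name `files_for_prompt` is hypothetical, and it assumes the `PromptN` folders may sit at any depth inside the zipfile, since the exact internal layout is not spelled out here. The demo builds a tiny in-memory archive with placeholder names rather than reading the real data.

```python
import io
import zipfile
from pathlib import PurePosixPath


def files_for_prompt(archive, prompt_number):
    """List archive members stored inside the Prompt{prompt_number} folder.

    `archive` may be a path (e.g. "Suno.zip") or a binary file-like object.
    Matching is done on whole path components, so Prompt1 never matches Prompt10.
    """
    folder = f"Prompt{prompt_number}"
    with zipfile.ZipFile(archive) as zf:
        return [
            name
            for name in zf.namelist()
            if not name.endswith("/")  # skip directory entries
            and folder in PurePosixPath(name).parts[:-1]
        ]


# Demo on an in-memory archive; the member names are placeholders, not real data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("Prompt1/Example_v1.mp3", b"")
    demo.writestr("Prompt10/Example_v1.mp3", b"")
buffer.seek(0)
print(files_for_prompt(buffer, 1))
```

With the real data, the same call would take the path to `Suno.zip` or `Mureka.zip` in place of the in-memory buffer.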

Within each Prompt folder, there are 4 generated songs (v1, v2, v3, and v4), each having 3 files, and the following naming scheme:
- Title_Date_Time_v{1,2,3,4}.mp3: The generated audio.
- Title_Date_Time_v{1,2,3,4}_lyrics.txt: The generated lyrics (returned by the AI service).
- Title_Date_Time_v{1,2,3,4}_features.json: Audio classifier tags and features in json format, generated by code in the GitHub repo linked above.
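The naming scheme above makes it possible to group the three files that belong to each generated song. The sketch below is one way to do this with a regular expression; the helper name `group_song_files` is hypothetical, the file names in the demo are placeholders, and the exact format of the Date and Time parts is not assumed.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: every song file ends in _v1 ... _v4 followed by one of the
# three suffixes listed above.
SONG_FILE = re.compile(
    r"^.+_v(?P<version>[1-4])(?P<kind>\.mp3|_lyrics\.txt|_features\.json)$"
)
KIND_NAMES = {".mp3": "audio", "_lyrics.txt": "lyrics", "_features.json": "features"}


def group_song_files(filenames):
    """Map version number (1-4) to a dict of that song's audio/lyrics/features files."""
    songs = defaultdict(dict)
    for name in filenames:
        match = SONG_FILE.match(PurePosixPath(name).name)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        songs[int(match["version"])][KIND_NAMES[match["kind"]]] = name
    return dict(songs)


# Demo with placeholder file names.
example = [
    "Title_Date_Time_v1.mp3",
    "Title_Date_Time_v1_lyrics.txt",
    "Title_Date_Time_v1_features.json",
    "Title_Date_Time_v2.mp3",
]
for version, files in sorted(group_song_files(example).items()):
    print(version, sorted(files))
```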

Specifically, the features JSON contains:
- file: The audio filename to which these tags correspond.
- key: Estimated key.
- major_minor_classifier: Estimated key mode (major or minor).
- tempo_bpm: The average estimated beats per minute of the song.
- song_length:
    - seconds: Song length in seconds.
    - beats: Song length in number of beats.
    - bars: Song length in bars/measures.
- time_signature: The estimated time signature of the song.
- chords:
    - most_common_chords: List of the 3 most common estimated chords (in terms of total duration), each containing:
        - chord: The chord label.
        - count: The number of frames assigned that chord label.
    - major_chords: List of the major chords labeled in at least 1 frame, and their count.
    - minor_chords: List of the minor chords labeled in at least 1 frame, and their count.
    - major_percentage: The percentage of frames assigned a major chord label.
    - minor_percentage: The percentage of frames assigned a minor chord label.
    - is_major_key: Boolean indicating whether the song was estimated to be in a major key (rather than a minor key).
- instruments: List of the instruments found in the song.
- instrument_detection:
    - method, model_type: Tags describing the model used for instrument tagging.
    - confidence_scores: Confidence scores (0-1) for the instrument tags.
- mood:
    - detection_method: Model used for mood detection.
    - dominant_mood: The highest-scoring mood tag.
    - mood_confidence: The confidence of the highest-scoring mood tag (0-1).
    - mood_predictions: All mood tags with corresponding confidence (0-1).
    - active_moods: All mood tags with confidence > 0.5, and their corresponding confidence.
- voice:
    - detection_method: Model used for singer gender classification.
    - gender: Estimated singer gender (male or female).
    - gender_confidence: Confidence of the estimated singer gender (0-1).
    - male_probability: Estimated probability of a male singer (1 minus female_probability).
    - female_probability: Estimated probability of a female singer (1 minus male_probability).
- lyrics: Unused (see lyrics.txt file instead).
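As an illustration of how the nested structure above might be read, the sketch below loads a features file and pulls out a few headline fields. The function names `load_features` and `summarize_features` are hypothetical, the field names follow the list above, and the values in the demo dict are invented placeholders rather than real dataset entries. The repository code may handle these files differently.

```python
import json
from pathlib import Path


def load_features(path):
    """Read one *_features.json file into a Python dict."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize_features(features):
    """Collect a few top-level and nested fields; missing fields become None or []."""
    chords = features.get("chords", {}).get("most_common_chords", [])
    return {
        "file": features.get("file"),
        "key": features.get("key"),
        "tempo_bpm": features.get("tempo_bpm"),
        "seconds": features.get("song_length", {}).get("seconds"),
        "top_chords": [entry["chord"] for entry in chords],
        "dominant_mood": features.get("mood", {}).get("dominant_mood"),
        "singer_gender": features.get("voice", {}).get("gender"),
    }


# Placeholder values for illustration only; not taken from the dataset.
example = {
    "file": "Title_Date_Time_v1.mp3",
    "key": "C",
    "tempo_bpm": 120.0,
    "song_length": {"seconds": 180.0, "beats": 360, "bars": 90},
    "chords": {"most_common_chords": [{"chord": "C", "count": 40}]},
    "mood": {"dominant_mood": "happy", "mood_confidence": 0.8},
    "voice": {"gender": "female", "gender_confidence": 0.9},
}
print(summarize_features(example))
```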


Metadata

All musical and linguistic metadata and tags used for the analyses in our paper are included in 2 csv files. These files are used by all of the analysis code in our GitHub repo (https://github.com/apmcleod/sounds-queer).

Musical Metadata

All musical tags and metadata used for analysis are included in the file music_data.csv. It has the following columns:

Informational columns:
- model: "Suno" or "Mureka".
- prompt_num: 0-indexed prompt number: 0-1199
- genre, sexual_orientation, gender_modifier, person_word: Prompt details.
- file_path: Relative path to the features file for this song.

Data columns:
- bpm: Average estimated beats per minute of the song.
- time_signature: Estimated time signature of the song.
- song_length_{secs,beats,bars}: Length of the song in seconds, beats, and bars/measures.
- key: The estimated key of the song.
- mode: "major" or "minor".
- most_common_chords_{1,2,3}: The 1st, 2nd, and 3rd most common estimated chord labels.
- {major,minor}_chord_percentage: The percentage of estimated chord labels that are major or minor chords.
- {instrument}_confidence: The estimated confidence of all valid instrument tags, specifically: accordion, acousticbassguitar, acousticguitar, beat, bell, bongo, brass, cello, clarinet, classicalguitar, computer, doublebass, drummachine, electricguitar, electricpiano, guitar, harmonica, harp, horn, oboe, orchestra, organ, pad, percussion, piano, pipeorgan, rhodes, sampler, saxophone, trombone, trumpet, viola, and voice.
- mood_{moodtag}: The estimated confidence of the moods: aggressive, happy, party, relaxed, and sad.
- {male,female}_probability: The estimated probability of the lead singer's gender being male or female.
- pronoun_classification_spacy: Grammatical person of the lyrics (1/2/3), based on the spaCy classifier.
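The columns above lend themselves to simple per-service comparisons. The following sketch averages the `bpm` column for each value of `model` using only the standard library. The function name `mean_bpm_by_model` is hypothetical, the demo rows are invented placeholders, and treating an empty cell as a missing estimate is an assumption about how the file encodes missing values.

```python
import csv
import io
from collections import defaultdict


def mean_bpm_by_model(csv_file):
    """Average the bpm column separately for each value of the model column."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [sum of bpm, row count]
    for row in csv.DictReader(csv_file):
        if not row["bpm"]:
            continue  # assumption: a missing estimate is stored as an empty cell
        totals[row["model"]][0] += float(row["bpm"])
        totals[row["model"]][1] += 1
    return {model: total / count for model, (total, count) in totals.items()}


# Placeholder rows for illustration only; not real dataset values.
demo = io.StringIO("model,prompt_num,bpm\nSuno,0,100\nSuno,1,120\nMureka,0,90\n")
print(mean_bpm_by_model(demo))
```

To run this on the real file, pass a handle opened with `open("music_data.csv", newline="", encoding="utf-8")` instead of the in-memory demo.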


Lyrics Metadata

All linguistic tags and metadata used for analysis are included in the file lyrics_data.csv. Duplicate lyrics are dropped from this file. It has the following columns:

Informational columns:
- model: "Suno" or "Mureka".
- prompt_num: 0-indexed prompt number: 0-1199
- genre, sexual_orientation, gender_modifier, person_word: Prompt details.
- file_path: Relative path to the lyrics file for this song.
- lyrics: The lyrics of the song.

Data columns:
- avg_competence: Estimated average competence classification.
- avg_warmth: Estimated average warmth classification.
- sentiment_score: The estimated sentiment score of the lyrics as derived from the NRC Emotion Lexicon.
- emotions: JSON-formatted string listing the occurrences of words associated with specific emotions from the NRC Word-Emotion Association Lexicon.
- regard_score: The estimated regard score of the lyrics using the regard classifier of Sheng et al.
- vader: Averaged sentiment values derived from the VADER sentiment lexicon.
- afinn: Averaged sentiment values derived from the AFINN sentiment lexicon.
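Because the `emotions` column holds JSON inside a csv cell, it needs a second decoding step after the row is read. The sketch below shows that step. The function name `decode_emotions` is hypothetical, and the demo assumes the JSON is an object mapping emotion names to counts, which is a guess about the exact shape; the decoding itself works for any valid JSON value.

```python
import csv
import io
import json


def decode_emotions(csv_file):
    """Yield (file_path, emotions) pairs with the emotions column parsed from JSON."""
    for row in csv.DictReader(csv_file):
        yield row["file_path"], json.loads(row["emotions"])


# Build a one-row placeholder file; the path and counts are invented for illustration.
demo = io.StringIO()
writer = csv.writer(demo)
writer.writerow(["file_path", "emotions"])
writer.writerow(["example_lyrics.txt", json.dumps({"joy": 2, "trust": 1})])
demo.seek(0)

for path, emotions in decode_emotions(demo):
    print(path, emotions)
```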


Manual Tags

themes_tagged.tsv contains a random sample of lyrics from the dataset, which we manually tagged. Its columns are:

- lyric_id: Unique identifier to track lyrics in downstream usage.
- lyrics: The song lyrics.
- is_love_song: Is the song a love song? TRUE/FALSE
- is_about_identity: Is the song about the given identity? TRUE/FALSE
- person: Which person are the lyrics in? 1/2/3
- themes: Comma-separated list of themes found in the lyrics.
- gender_identity_main: Is the gender_identity a main focus of the lyrics? TRUE/FALSE
- sexual_orieantation_main: Is the sexual_orientation a main focus of the lyrics? TRUE/FALSE
- unclear_which_identity: Is it unclear which identity was used to prompt the generating AI? TRUE/FALSE
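The manual tags are stored as text, so the TRUE/FALSE flags and the comma-separated themes need converting before analysis. The sketch below is one possible approach; the function name `parse_tagged_rows` is hypothetical, and the demo row uses invented placeholder values, including the theme names. Any cell that is exactly TRUE or FALSE is converted, so the code does not depend on the spelling of individual column names.

```python
import csv
import io

BOOL_VALUES = {"TRUE": True, "FALSE": False}


def parse_tagged_rows(tsv_file):
    """Read tab-separated rows, converting TRUE/FALSE cells and splitting themes."""
    parsed_rows = []
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        parsed = {}
        for column, value in row.items():
            if value in BOOL_VALUES:
                parsed[column] = BOOL_VALUES[value]
            elif column == "themes":
                parsed[column] = [t.strip() for t in value.split(",") if t.strip()]
            else:
                parsed[column] = value  # left as text, e.g. lyric_id, lyrics, person
        parsed_rows.append(parsed)
    return parsed_rows


# Placeholder row for illustration only; not taken from themes_tagged.tsv.
demo = io.StringIO(
    "lyric_id\tlyrics\tis_love_song\tperson\tthemes\n"
    "demo-1\tplaceholder lyrics\tTRUE\t1\ttheme a, theme b\n"
)
print(parse_tagged_rows(demo))
```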

Files

Embargoed

The files will be made publicly available on June 25, 2026.

Reason: Awaiting FAccT conference.

Additional details

Software

Repository URL
https://github.com/apmcleod/sounds-queer
Programming language
Python