Published January 26, 2026 | Version 2.0.0
Dataset Open

UConn Voter Center - Voting Bubbles with Marginal Marks

Authors/Creators

  • University of Connecticut

Description

We introduce the UConn Bubbles with Marginal Marks dataset. It contains images of voting bubbles scanned from Connecticut ballots, captured as either grayscale (8 bpp) or color (RGB, 24 bpp) images and extracted through segmentation using ballot geometry. It improves on, and supersedes, our prior dataset, UConn Bubbles with Swatches. Each bubble image is 40x50 pixels. The labels are produced from an optical lens scanner. The images are organized into two main categories, described below: pre-print and post-print.

1. Pre-Print Dataset

This set contains the original digital images of empty and filled bubbles. It is further split into the following sub-categories:

  • OnlyBubbles: This dataset contains images of empty and filled bubbles (class 0/1) from the original dataset.
  • OnlySynthetic: This dataset contains synthetically generated marginal marks (pen rest, check, cross, straight scribble, random scribble) superimposed on blank bubbles.
  • Combined: This dataset is a combination of OnlyBubbles and OnlySynthetic.

    Note that for OnlySynthetic and Combined, we hand-labeled the first 100 synthetic marks of each class. Pen rest, cross, and check marks were sorted from lowest to highest pixel intensity; straight and random scribbles were sorted from highest to lowest. Some synthetic marks were reassigned to the opposite class. This occurs only for crosses and checks, which were reassigned from class 1 (filled) to class 0 (empty) when they were too light, visually imperceptible, or ambiguous to be classified as a filled bubble.
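
The review ordering described in this note could be sketched roughly as follows. This is only an illustration, not the authors' code: it assumes that "pixel intensity" means the mean pixel value of each image and that the images are held in a NumPy array. The function name `review_candidates`, the array layout, and the random stand-in data are all hypothetical.

```python
import numpy as np

def review_candidates(images, n=100, descending=False):
    """Return indices of the first n images when sorted by mean pixel intensity.

    Assumption: "pixel intensity" is taken to be the mean pixel value per image.
    """
    means = images.reshape(len(images), -1).mean(axis=1)
    order = np.argsort(means, kind="stable")
    if descending:
        order = order[::-1]
    return order[:n]

# Hypothetical stand-in data: 500 random grayscale images in an assumed (N, 50, 40) layout.
rng = np.random.default_rng(0)
marks = rng.integers(0, 256, size=(500, 50, 40), dtype=np.uint8)

# Lowest-to-highest ordering, as described for pen rest, check, and cross.
lowest_first = review_candidates(marks, n=100)
# Highest-to-lowest ordering, as described for straight and random scribbles.
highest_first = review_candidates(marks, n=100, descending=True)
```

The returned indices identify which images would be inspected by hand; any relabeling decision itself remains a manual step.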


    Each dataset is split into train/validation sets. See the table below for dataset split specifications.
     
    Train/Val     Data Combination   Grayscale/RGB   Data Size
    Train         OnlyBubbles        Grayscale       28,589
    Train         OnlyBubbles        RGB             28,589
    Train         OnlySynthetic      Grayscale       14,090
    Train         OnlySynthetic      RGB             14,090
    Train         Combined           Grayscale       42,679
    Train         Combined           RGB             42,679
    Validation    OnlyBubbles        Grayscale       7,137
    Validation    OnlyBubbles        RGB             7,137
    Validation    OnlySynthetic      Grayscale       3,515
    Validation    OnlySynthetic      RGB             3,515
    Validation    Combined           Grayscale       10,652
    Validation    Combined           RGB             10,652
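
As a quick sanity check on the pre-print split sizes, the short sketch below confirms that the Combined counts equal the sum of the OnlyBubbles and OnlySynthetic counts. The numbers are copied from the table above; the dictionary layout is just one convenient way to hold them and is not part of the dataset.

```python
# Split sizes copied from the pre-print table (identical for grayscale and RGB).
sizes = {
    "Train": {"OnlyBubbles": 28_589, "OnlySynthetic": 14_090, "Combined": 42_679},
    "Validation": {"OnlyBubbles": 7_137, "OnlySynthetic": 3_515, "Combined": 10_652},
}

# Combined should be exactly OnlyBubbles + OnlySynthetic in every split.
for split, n in sizes.items():
    assert n["OnlyBubbles"] + n["OnlySynthetic"] == n["Combined"], split

# Share of the Combined data that is held out for validation.
total = sum(n["Combined"] for n in sizes.values())
val_fraction = sizes["Validation"]["Combined"] / total
print(f"{total} images per color mode, {val_fraction:.1%} held out for validation")
```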

2. Post-Print Dataset


    These bubble images have been printed once and scanned again. We used the following equipment for printing and scanning:
    • Printer:
      • Grayscale: HP LaserJet Pro MFP 3101fdw
      • RGB: Brother HL-L8360CDW
    • Scanner: Ricoh Fujitsu fi-7160

    We produce the following post-print datasets. Note that some contain fewer images than their corresponding pre-print versions; these are abbreviated versions. We anticipate producing a larger and more complete range of post-print data in the future.
     
    Train/Val     Data Combination   Grayscale/RGB   Data Size
    Train         OnlyBubbles        Grayscale       2,000
    Train         Combined           Grayscale       42,679
    Validation    OnlyBubbles        Grayscale       2,000
    Validation    OnlySynthetic      Grayscale       3,515
    Validation    Combined           Grayscale       10,652
    Validation    Combined           RGB             10,652

    Examples of Dataset

    See the linked GitHub repo (https://github.com/VoterCenter/Busting-the-Ballot/blob/main/dataset/read_dataset_example.py) for how to read in these datasets.

    Here we include an example of marks from each dataset class in `Combined`. Note that OnlyBubbles + OnlySynthetic = Combined. We also include the original label for each class and the binary mapping that we use in our dataset.
     
    Mark Type            Binary Label   Original Label
    Empty                0              0
    Filled (original)    1              1
    Penrest              0              2
    Check                1              3
    Cross                1              4
    Straight scribble    1              5
    Random scribble      1              6
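
The mapping in this table can be written as a small lookup. The sketch below is illustrative only: the names `LABELS` and `to_binary` are hypothetical, and the mapping gives the default binary label per class, so it does not capture individual check or cross marks that were hand-relabeled to class 0.

```python
# Original label -> (mark type, binary label), copied from the table above.
LABELS = {
    0: ("Empty", 0),
    1: ("Filled (original)", 1),
    2: ("Penrest", 0),
    3: ("Check", 1),
    4: ("Cross", 1),
    5: ("Straight scribble", 1),
    6: ("Random scribble", 1),
}

def to_binary(original_label):
    """Map an original 0-6 label to the binary empty (0) / filled (1) label."""
    return LABELS[original_label][1]

original = [0, 2, 3, 6, 1]  # hypothetical example labels
binary = [to_binary(y) for y in original]
print(binary)  # [0, 0, 1, 1, 1]
```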

     

Files

uconn_votercenter_v2.zip (686.3 MB)
md5:f4befab51b1a80432828f4f5d7996274
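
After downloading, the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using only the Python standard library; the helper name `md5_of` is hypothetical, and the path assumes the archive was saved to the current working directory.

```python
import hashlib
from pathlib import Path

EXPECTED_MD5 = "f4befab51b1a80432828f4f5d7996274"  # checksum listed on this page

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

archive = Path("uconn_votercenter_v2.zip")  # assumed download location
if archive.exists():
    ok = md5_of(archive) == EXPECTED_MD5
    print("checksum OK" if ok else "checksum mismatch: re-download the archive")
```

Reading in chunks keeps memory use low for a file of this size.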

Additional details

Dates

Updated
2026-01-26

Software

Repository URL
https://github.com/VoterCenter/Busting-the-Ballot
Programming language
Python
Development Status
Active