Published December 5, 2022 | Version v1
Dataset Open

[LS2N_IPI_DisFER] Comparing the Robustness of Humans and Deep Neural Networks on Facial Expression Recognition

  • 1. Nantes Université
  • 2. Laboratoire des Sciences du Numérique de Nantes



Distorted-FER (DisFER) is a new facial expression recognition (FER) dataset composed of a large number of distorted images of faces.


Materials and Methods


The source images used in our experiment come from the Facial Expression Recognition 2013 (FER-2013) dataset [1]. This dataset was first introduced in 2013 at the International Conference on Machine Learning and has been used in a large number of research works since then, as it encompasses naturalistic conditions and challenges. It consists of 35,887 images of faces in 48 × 48 format, collected via a Google image search. Human accuracy on FER-2013 was estimated by its authors at around 65.5% [1].
To build the Distorted-FER (DisFER) dataset, we randomly selected, from FER-2013, twelve images per basic emotion, as defined by Ekman [2] (i.e., anger, disgust, fear, happiness, neutral, sadness, and surprise). This yields a total of 84 source images. Each original stimulus was then distorted using three different types of distortions, i.e., Gaussian blur (GB), Gaussian noise (GN), and salt-and-pepper noise (SP). Each distortion was applied at three distinct levels: standard deviation values of 0.8, 1.1, and 1.4 were tested for GB; similarly for GN, with standard deviation values of 10, 20, and 30; while probability levels of 0.02, 0.04, and 0.06 were chosen for SP; corresponding to low, medium, and high distortions, respectively. Together with the 84 originals, the 84 × 3 × 3 = 756 distorted stimuli yield a dataset of 840 images in total.
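The three distortions described above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the implementation details (intensity scale of 0–255, clipping behavior, noise seeding, and the specific blur kernel) are assumptions.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1-D normalized Gaussian kernel, truncated at ~3 standard deviations.
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    # Separable Gaussian blur; sigma in {0.8, 1.1, 1.4} for DisFER's GB levels.
    k = gaussian_kernel(sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1,
                              img.astype(float))
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)

def gaussian_noise(img, sigma, rng):
    # Additive Gaussian noise; sigma in {10, 20, 30} on a 0-255 intensity scale
    # (scale and clipping are assumptions).
    noisy = img.astype(float) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255)

def salt_and_pepper(img, p, rng):
    # Flip each pixel to 0 ("pepper") or 255 ("salt") with total probability p,
    # p in {0.02, 0.04, 0.06} for DisFER's SP levels.
    out = img.astype(float).copy()
    mask = rng.random(img.shape)
    out[mask < p / 2] = 0
    out[(mask >= p / 2) & (mask < p)] = 255
    return out
```

Each 48 × 48 source image would be passed through each function at each level, producing nine distorted versions per original.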

Crowdsourcing Experiment

In order to collect as many votes as possible on our dataset, and because rating 840 images is time-consuming and can be extremely tiring for a single participant, we decided to set up a crowdsourcing experiment. Such experiments allow large-scale subjective tests to be conducted at reduced cost and effort.
The DisFER dataset was therefore split into twenty-one playlists of forty images each, with a view to keeping the tests as short as possible, as crowdsourcing experiments should not last more than ten minutes or so. Playlists were carefully designed to contain the same number of images of a given configuration (i.e., emotion, distortion type, and distortion level). Within a playlist, images were displayed to participants in random order.
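The playlist construction above can be sketched as a round-robin deal: group images by configuration, then distribute each group across the playlists so that every playlist ends up with forty images and configurations are spread as evenly as possible. This is an illustrative reconstruction, not the authors' actual assignment procedure; the tuple representation and seeding are assumptions.

```python
import random
from collections import defaultdict

def make_playlists(images, n_playlists=21, seed=0):
    # images: list of (image_id, emotion, distortion, level) tuples.
    rng = random.Random(seed)
    # Group images by configuration (emotion, distortion type, distortion level).
    groups = defaultdict(list)
    for item in images:
        groups[item[1:]].append(item)
    # Deal images round-robin across playlists, group by group, so each
    # configuration is spread as evenly as possible over the playlists.
    playlists = [[] for _ in range(n_playlists)]
    i = 0
    for group in groups.values():
        rng.shuffle(group)
        for item in group:
            playlists[i % n_playlists].append(item)
            i += 1
    # Shuffle each playlist to randomize the display order within it.
    for pl in playlists:
        rng.shuffle(pl)
    return playlists
```

With 840 images and 21 playlists, the round-robin deal gives exactly 40 images per playlist.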
Each participant was asked to choose which emotion (i.e., anger, disgust, fear, happiness, neutral, sadness, or surprise) they recognized in the displayed image. No time constraint was imposed on participants to complete the task.
A total of 1051 participants (50% female) were recruited using the Prolific platform [3]. Prolific takes researchers' needs into consideration by maintaining a subject recruitment process similar to that of a laboratory experiment: participants are fully informed that they are being recruited for a research study. This helps address ethical concerns and improves the reliability of the collected data.
Participants were aged between 19 and 75 years (mean 30 ± 8.53; three participants declined to report their age). Twenty of the twenty-one playlists were rated in full by fifty distinct participants each, whereas one playlist was rated by fifty-one participants.


[1] Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A. Challenges in Representation Learning: A Report on Three Machine Learning Contests. In Proceedings of the Neural Information Processing, Daegu, South Korea, 3–7 November 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 117–124.
[2] Ekman, P. An argument for basic emotions. Cogn. Emot. 1992, 6, 169–200.



Files (2.9 MB)
