Dataset for eye-tracking tasks

In recent years many different deep neural networks have been developed, but because of their large number of layers, training them requires a long time and a large amount of data. Today it is popular to use pre-trained deep neural networks for various tasks, even simple ones for which such deep networks are not required. Well-known deep networks such as YoloV3 and SSD are intended for tracking and monitoring many kinds of objects; as a result, their weights are heavy and their overall accuracy on a specific task is low. Eye-tracking tasks need to detect only one object, an iris, in a given area. It is therefore logical to use a neural network built only for this task. The problem is the lack of suitable datasets for training such a model. In this manuscript, we present a dataset that is suitable for training custom convolutional neural network models for eye-tracking tasks. Using the dataset, each user can independently pre-train convolutional neural network models for eye-tracking tasks. The dataset contains 10,000 annotated eye images with a resolution of 416 by 416 pixels. The annotation table gives the coordinates and radius of the eye for each image. This manuscript can also be considered a guide to preparing datasets for eye-tracking devices.


Data value
We provide a fully labeled dataset giving the eye position in images with a resolution of 416 by 416 pixels. The dataset can be used to develop convolutional neural networks for the detection, segmentation, and classification of the position of the iris. Using the dataset, each user can independently train a neural network, adding a small set of personal data to search for a specific (user's own) type of iris.

Introduction
Today eye-tracking is used to support multimedia learning and to help in browsing the web, and it is widely used in real-time graphics systems, especially in video games. The main problem of modern eye-tracking systems is their high price: equipment with an accuracy of 0.5° costs several thousand dollars or more. The most common remote eye-trackers use the corneal reflection (CR) method. The eyes are exposed to direct invisible infrared (IR) light, which produces a reflection in the cornea. The physiology of this process is described in detail in [1]. Other studies use well-known deep networks with pre-trained weights [2,3].

The neural network model must recognize the iris with high accuracy. Eye color differs from person to person, so the network should focus on characteristics that are not directly related to color, since it is not possible to train the network on all possible colors. In this paper, a dataset is presented at a resolution that allows a neural network to identify features useful for recognizing the position of the pupil. The dataset is designed for finding features at the boundary between the iris and the sclera. Figure 1 shows the process of selecting an image resolution for the dataset.

Fig. 1. Examples of images with different resolutions: a - 416x416 pixels, b - 200x200 pixels, c - 100x100 pixels, d - 50x50 pixels, e - 25x25 pixels, f - 10x10 pixels

Visually, it is difficult to notice the difference between an image with a resolution of 416 by 416 pixels and one of 50 by 50 pixels. However, this dataset is designed for determining the features shown in Fig. 2.

Fig. 2. Features for a neural network
As a result, to determine the boundary between the iris and the sclera more accurately, we decided to use images with a resolution of 416x416 pixels. To see how such an image looks between the layers, we built a convolutional neural network (Fig. 3) and saved the intermediate images produced by its layers.

Fig. 3. Convolutional neural network diagram
The images obtained between the layers of this network are shown in Fig. 4.

Fig. 4. Visualization of the input image on different layers of the neural network model
Visual analysis of the images shows that by the 4th layer of the convolutional neural network we can already obtain the features needed to determine the position of the eye. Therefore, a dataset with sufficient resolution of the eye area is needed. Following the overview of eye-tracking datasets by Winkler, S., et al. [4] and an analysis of the available datasets, we considered the following sources: -Columbia Gaze Data Set.  [8]. The source images have a resolution of 1280 by 720 pixels; an example is shown in Fig. 5.

Fig. 5. Example from the image collection
To determine the eye area in the images, we used the dlib library. The landmark detection algorithm provided by dlib is an implementation of the Ensemble of Regression Trees (ERT) method, introduced in 2014 by Kazemi and Sullivan. This method uses a simple and fast function to estimate the location of the landmarks directly. The estimated positions are then refined iteratively by a cascade of regressors. Each regressor produces a new estimate from the previous one, trying to reduce the alignment error of the estimated points at each iteration.

At the first stage, the detector returned by dlib.get_frontal_face_detector() locates the face. Next, using dlib.shape_predictor("shape_predictor_68_face_landmarks.dat"), we determine the facial landmarks, where shape_predictor_68_face_landmarks.dat is a trained model for 68 landmarks. We take only points 36 to 41 and add a margin of 20 pixels around them to expand the region, and then use an OpenCV ROI to crop the eye area from the video stream (Fig. 6). This operation yields an image of 77 by 55 pixels, which we upscale to 416 by 416 pixels; the result is shown in Fig. 7.

Fig. 7. Images from the dataset: a - before conversion, b - after conversion

Next came the most time-consuming part of the research: finding the iris in the processed images. Since the eye area was selected using the dlib library, we studied in detail, specifically for this set of images, the ratio of the pupil size to the size of the eye area, which ultimately amounted to 14%. We then wrote a program that tries various values of the threshold for the THRESH_BINARY thresholding function and selects the image in which the iris occupies 14% of the eye-area image. This code is available on GitHub in the file 1.Convert_the_eye_Thershold.py. The algorithm for obtaining the dataset is shown in Fig. 5.
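The eye-region step described above can be illustrated with a short, dependency-free sketch. It is not the authors' code: the function names `eye_roi` and `to_resized` are hypothetical, the margin is assumed to be 20 pixels on each side, and the landmark coordinates in the example are made up so that the crop comes out at 77 by 55 pixels. With dlib, the landmark list would typically be built from `shape.part(i).x` and `shape.part(i).y` for the 68 points returned by the shape predictor.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]


def eye_roi(landmarks: List[Point], image_size: Tuple[int, int],
            margin: int = 20, first: int = 36, last: int = 41) -> Box:
    """Return (x0, y0, x1, y1): the box around landmarks first..last,
    padded by `margin` pixels (assumed unit) and clamped to the image."""
    width, height = image_size
    eye = landmarks[first:last + 1]
    xs = [x for x, _ in eye]
    ys = [y for _, y in eye]
    x0 = max(min(xs) - margin, 0)
    y0 = max(min(ys) - margin, 0)
    x1 = min(max(xs) + margin, width)
    y1 = min(max(ys) + margin, height)
    return x0, y0, x1, y1


def to_resized(point: Tuple[float, float], roi: Box,
               out_size: Tuple[int, int] = (416, 416)) -> Tuple[float, float]:
    """Map a point from full-frame coordinates into the resized crop.
    The two axes are scaled independently, as when a 77x55 crop is
    stretched to 416x416."""
    x0, y0, x1, y1 = roi
    sx = out_size[0] / (x1 - x0)
    sy = out_size[1] / (y1 - y0)
    return (point[0] - x0) * sx, (point[1] - y0) * sy


# Made-up landmarks for a 1280x720 frame; only indices 36..41 matter here.
landmarks = [(0, 0)] * 68
landmarks[36:42] = [(600, 300), (610, 293), (625, 292),
                    (637, 300), (625, 307), (610, 306)]

roi = eye_roi(landmarks, (1280, 720))
print(roi)                         # (580, 272, 657, 327)
print(roi[2] - roi[0], roi[3] - roi[1])
print(to_resized((618.5, 299.5), roi))
```

Because the crop is not square, the horizontal and vertical scale factors differ (416/77 versus 416/55), so any coordinate measured in the original frame has to be mapped per axis before it is stored alongside the 416 by 416 image.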

Steps of the algorithm for obtaining the dataset:
- Receive an image with an eye
- Restrict the eye field with the dlib library
- Calculate the iris position
- Save the new image together with the center coordinates and radius

For clarity, the implementation of this algorithm is illustrated in Fig. 8.

Fig. 8. The process of obtaining images for the dataset: a - initial image, b - eye field, c - after filters, d - iris mask, e - final image with a circle around the iris

Figure 9 shows the data storage format for the iris position.

Fig. 9. Format for the iris position
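The iris position calculation step can be sketched as follows. This is only an illustration under stated assumptions, not the program from the repository: it is written in plain Python instead of OpenCV, the mask marks dark pixels as 1 (the inverse of what THRESH_BINARY produces), and the center and radius are taken as the centroid of the mask and the radius of a circle of equal area, which is an assumption and not the experimentally obtained equation mentioned in the paper. All function names are hypothetical, and the test image is synthetic.

```python
import math
from typing import List, Tuple

Gray = List[List[int]]  # 8-bit grayscale image stored as rows of pixel values


def dark_mask(gray: Gray, threshold: int) -> Gray:
    """Mark pixels at or below the threshold (dark iris candidates) with 1."""
    return [[1 if v <= threshold else 0 for v in row] for row in gray]


def area_fraction(mask: Gray) -> float:
    """Fraction of the image covered by the mask."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total


def find_threshold(gray: Gray, target: float = 0.14) -> int:
    """Try every 8-bit threshold and keep the first one whose mask area
    is closest to the target fraction (14% in the paper)."""
    best_t, best_err = 0, float("inf")
    for t in range(256):
        err = abs(area_fraction(dark_mask(gray, t)) - target)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def center_and_radius(mask: Gray) -> Tuple[float, float, float]:
    """Centroid of the mask and radius of the circle with the same area
    (an assumed way of turning the mask into center and radius)."""
    points = [(x, y) for y, row in enumerate(mask)
              for x, v in enumerate(row) if v]
    if not points:
        raise ValueError("mask is empty")
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return cx, cy, math.sqrt(n / math.pi)


def synthetic_eye(size: int = 50) -> Gray:
    """Bright background, a mid-gray disc of radius 16 and a dark disc of
    radius 10, both centered at (30, 20)."""
    img = [[200] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            d2 = (x - 30) ** 2 + (y - 20) ** 2
            if d2 <= 10 ** 2:
                img[y][x] = 40
            elif d2 <= 16 ** 2:
                img[y][x] = 120
    return img


gray = synthetic_eye()
t = find_threshold(gray)
mask = dark_mask(gray, t)
cx, cy, r = center_and_radius(mask)
print(t, round(area_fraction(mask), 3), cx, cy, round(r, 2))
```

On real eye crops the dark mask usually also picks up eyelashes and shadows, which is presumably why the pipeline in Fig. 8 applies filters before the mask is extracted; the sketch omits that stage.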

Conclusion and discussion
As a result, 10,000 images were obtained, each with the coordinates of the center of the pupil and its radius. The annotated images were derived from a set of images with a resolution of 1280 by 720 pixels. To convert the images, the dlib library was used to extract the eye region at a resolution of 77 by 55 pixels, and the OpenCV library was then used to upscale it to 416 by 416 pixels. We then wrote a program with an experimentally obtained equation that identifies the iris in the image. This dataset is intended for pre-training convolutional neural network models for eye-tracking tasks. The dataset was tested with our own convolutional neural network model, where it was used to train the initial layers. To train the last layers, a personal dataset of 1000 photos was used. The resulting tracking error was three degrees. Given that the tracking was carried out with a web camera, this is a good result.
The author certifies that he has NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers' bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.