
Gaze Estimation

Description: Detect a person's eye gaze point of regard and gaze vector.
Publisher: NVIDIA
Latest Version: trainable_v1.0
Modified: July 24, 2023
Size: 17.44 MB

GazeNet Model Card

Model Overview

The model described in this card detects a person's eye gaze point of regard (X, Y, Z) and gaze vector (theta and phi). The eye gaze vector can also be derived from eye position and eye gaze points of regard.
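As a rough illustration of that derivation, the sketch below turns an eye position and a point of regard, both in camera coordinates, into a unit gaze vector and corresponding theta/phi angles. The pitch/yaw convention is an assumption for illustration; the model card does not define it.

```python
import numpy as np

def gaze_from_eye_and_por(eye_xyz, por_xyz):
    """Derive a gaze vector (and theta/phi angles) from eye position and point of regard.

    eye_xyz, por_xyz: 3D positions in the camera coordinate system (same units, e.g. cm).
    The theta = pitch / phi = yaw convention below is assumed, not taken from the card.
    """
    g = np.asarray(por_xyz, dtype=np.float64) - np.asarray(eye_xyz, dtype=np.float64)
    g = g / np.linalg.norm(g)        # unit gaze direction from the eye toward the target
    theta = np.arcsin(-g[1])         # pitch angle
    phi = np.arctan2(-g[0], -g[2])   # yaw angle
    return g, theta, phi
```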

Model Architecture

GazeNet is a multi-input, multi-branch network. The four inputs to GazeNet are a face crop, a left eye crop, a right eye crop, and a facegrid. The face, left eye, and right eye branches use AlexNet-based feature extractors, while the facegrid branch consists of fully connected layers. Please see the paper in the citations for an example of the model architecture.
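The following PyTorch sketch illustrates this multi-branch layout. Branch widths, layer counts, and the fusion head are illustrative placeholders (the AlexNet-style extractors are heavily trimmed); it is not the TAO GazeNet implementation.

```python
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """Trimmed AlexNet-style feature extractor for a 224 x 224 x 1 crop (illustrative sizes)."""
    def __init__(self, out_features=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.fc = nn.Linear(128, out_features)

    def forward(self, x):
        return self.fc(torch.flatten(self.features(x), 1))

class GazeNetSketch(nn.Module):
    """Multi-input network: face, left eye, right eye crops plus a 25 x 25 facegrid mask."""
    def __init__(self):
        super().__init__()
        self.face_branch = ConvBranch()
        self.left_branch = ConvBranch()
        self.right_branch = ConvBranch()
        # Facegrid branch: fully connected layers over the flattened 25 x 25 binary mask.
        self.grid_branch = nn.Sequential(
            nn.Flatten(), nn.Linear(25 * 25, 128), nn.ReLU(inplace=True), nn.Linear(128, 64),
        )
        # Fusion head regressing point of regard (X, Y, Z) and gaze angles (theta, phi).
        self.head = nn.Sequential(
            nn.Linear(128 * 3 + 64, 128), nn.ReLU(inplace=True), nn.Linear(128, 5),
        )

    def forward(self, face, left_eye, right_eye, facegrid):
        feats = torch.cat([
            self.face_branch(face),
            self.left_branch(left_eye),
            self.right_branch(right_eye),
            self.grid_branch(facegrid),
        ], dim=1)
        return self.head(feats)
```

For example, `GazeNetSketch()(torch.rand(1, 1, 224, 224), torch.rand(1, 1, 224, 224), torch.rand(1, 1, 224, 224), torch.rand(1, 1, 25, 25))` returns a `(1, 5)` tensor.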

Training Algorithm

This model was trained using the GazeNet entrypoint in TAO. The training algorithm optimizes the network to minimize the root mean square error (RMSE) between the predicted and ground-truth points of regard.
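A minimal sketch of that objective, assuming predictions and labels are (N, 3) arrays of points of regard in centimeters:

```python
import numpy as np

def por_rmse(pred_por, gt_por):
    """Root mean square error between predicted and ground-truth points of regard."""
    pred_por = np.asarray(pred_por, dtype=np.float64)
    gt_por = np.asarray(gt_por, dtype=np.float64)
    return np.sqrt(np.mean((pred_por - gt_por) ** 2))
```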

Training Data

The GazeNet trainable model was trained on a proprietary dataset with more than 220K images. The training dataset consists of images taken from cameras mounted at varied heights and angles, under various illumination conditions.

Training Data Ground-truth Labeling Guidelines

The training dataset is created by human labelers who annotate ground-truth bounding boxes and facial landmarks. The face bounding box and fiducial landmarks are used to prepare the inputs (face crop image, left eye crop image, right eye crop image, and facegrid) for the gaze model. For face bounding-box labeling, please refer to the FaceNet model card. For facial landmark labeling, please refer to the FPENet model card.

During GazeNet data collection, subjects were asked to look at a dot on the screen while image data was collected. The X, Y, Z position (point of regard) of the dot in the camera coordinate system is obtained through calibration between the camera and the screen.
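A minimal sketch of that mapping, assuming the calibration yields a rotation and translation from the screen plane to the camera frame (the names R_screen_to_cam and t_screen_to_cam_mm are hypothetical):

```python
import numpy as np

def screen_point_to_camera(por_screen_mm, R_screen_to_cam, t_screen_to_cam_mm):
    """Map a 2D on-screen dot position to a 3D point of regard in the camera frame.

    por_screen_mm: (x, y) position of the dot on the screen plane, in millimeters.
    R_screen_to_cam (3x3), t_screen_to_cam_mm (3,): screen-to-camera extrinsics
    recovered by a camera-screen calibration procedure (hypothetical names).
    """
    x, y = por_screen_mm
    p_screen = np.array([x, y, 0.0])              # the dot lies on the screen plane (z = 0)
    p_cam_mm = R_screen_to_cam @ p_screen + t_screen_to_cam_mm
    return p_cam_mm / 10.0                        # convert mm to cm to match the KPI units
```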

Performance

Evaluation Data

Dataset

The inference performance of the GazeNet v1.0 model was measured against 110K proprietary images across a variety of subjects, illumination conditions, camera heights, and camera angles.

Methodology and KPI

The key performance indicator (KPI) is the error of the point of regard, i.e. the point at which the eye is looking.

GazeNet KPI

Content | Position X Error (cm) | Position Y Error (cm) | Position Z Error (cm) | Position XYZ Error (cm) | Position XY Error (cm) | Theta Error (degrees) | Phi Error (degrees) | Theta-Phi Error (degrees)
Evaluation set | 3.55 | 3.93 | 1.66 | 6.29 | 5.92 | 2.48 | 2.45 | 3.9
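For reference, metrics of this form can be computed roughly as below. The aggregation (mean absolute / mean Euclidean error) and the angle convention used for the combined theta-phi error are assumptions; the card does not spell them out.

```python
import numpy as np

def por_errors(pred_por, gt_por):
    """Per-axis, XY, and XYZ point-of-regard errors in cm for (N, 3) arrays."""
    diff = np.abs(np.asarray(pred_por) - np.asarray(gt_por))
    return {
        "x": diff[:, 0].mean(),
        "y": diff[:, 1].mean(),
        "z": diff[:, 2].mean(),
        "xy": np.linalg.norm(diff[:, :2], axis=1).mean(),
        "xyz": np.linalg.norm(diff, axis=1).mean(),
    }

def angular_errors(pred_theta, pred_phi, gt_theta, gt_phi):
    """Theta, phi, and combined theta-phi errors in degrees; angles given in radians."""
    def to_vec(theta, phi):
        # Assumed convention: theta = pitch, phi = yaw of the gaze direction.
        return np.stack([-np.cos(theta) * np.sin(phi),
                         -np.sin(theta),
                         -np.cos(theta) * np.cos(phi)], axis=-1)
    v_pred, v_gt = to_vec(pred_theta, pred_phi), to_vec(gt_theta, gt_phi)
    cos = np.clip(np.sum(v_pred * v_gt, axis=-1), -1.0, 1.0)
    return {
        "theta": np.degrees(np.abs(pred_theta - gt_theta)).mean(),
        "phi": np.degrees(np.abs(pred_phi - gt_phi)).mean(),
        "theta_phi": np.degrees(np.arccos(cos)).mean(),   # 3D angle between gaze directions
    }
```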

Real-time Inference Performance

The inference uses FP16 precision. Inference performance was measured with trtexec on Jetson Nano, Xavier NX, AGX Xavier, and an NVIDIA T4 GPU. The Jetson devices run at the Max-N configuration for maximum system performance. End-to-end performance with streaming video data may vary slightly depending on the application.

Device | Precision | Batch size | FPS
Nano | FP16 | 1 | 87
NX | FP16 | 1 | 510
Xavier | FP16 | 1 | 704
T4 | FP16 | 1 | 1698

How to use this model

This model needs to be used with NVIDIA Hardware and Software. For Hardware, the model can run on any NVIDIA GPU including NVIDIA Jetson devices. This model can only be used with Train Adapt Optimize (TAO) Toolkit, DeepStream 6.0 or TensorRT.

The primary use case for this model is to detect the eye gaze point of regard and gaze vector. The model can be used to detect the eye gaze point of regard with appropriate video or image decoding and pre-processing.

There are two flavors of the model:

  • trainable
  • deployable

The trainable model is intended for training with the TAO Toolkit and the user's own dataset. This can provide high-fidelity models that are adapted to the use case. The Jupyter notebook available as part of the TAO container can be used to re-train. The deployable model is intended for efficient deployment on the edge using DeepStream or TensorRT. The trainable and deployable models are encrypted and will only operate with the following key:

  • Model load key: nvidia_tlt

Please make sure to use this as the key for all TAO commands that require a model load key.


Input

GazeNet is a multi-input network, which takes in a face crop image, left eye crop image, right eye crop image, and facegrid.

  • Face image: grayscale, 224 x 224 x 1
  • Left eye image: grayscale, 224 x 224 x 1
  • Right eye image: grayscale, 224 x 224 x 1
  • Facegrid input: binary mask, 25 x 25 x 1
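A hedged preprocessing sketch for building these inputs, assuming OpenCV for cropping and resizing and boxes supplied by a face detector and landmark model such as FaceNet/FPENet as described above; the normalization and facegrid construction details are illustrative, not the TAO reference pipeline:

```python
import cv2
import numpy as np

def prepare_inputs(frame_bgr, face_box, left_eye_box, right_eye_box, grid_size=25):
    """Build the four GazeNet inputs from a frame and detector outputs (illustrative only).

    face_box, left_eye_box, right_eye_box: (x1, y1, x2, y2) pixel boxes.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)

    def crop_224(box):
        x1, y1, x2, y2 = box
        crop = gray[y1:y2, x1:x2]
        return cv2.resize(crop, (224, 224)).astype(np.float32) / 255.0

    # Facegrid: a 25 x 25 binary mask marking where the face box falls in the frame.
    h, w = gray.shape
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    x1, y1, x2, y2 = face_box
    gx1, gy1 = int(x1 / w * grid_size), int(y1 / h * grid_size)
    gx2, gy2 = int(np.ceil(x2 / w * grid_size)), int(np.ceil(y2 / h * grid_size))
    grid[gy1:gy2, gx1:gx2] = 1.0

    return crop_224(face_box), crop_224(left_eye_box), crop_224(right_eye_box), grid
```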

Output

3D point of regard (X, Y, Z) and gaze vector (theta and phi)

Instructions to deploy this model with DeepStream

To create the entire end-to-end video analytic application, deploy this model with DeepStream.

Limitations

Small faces

The NVIDIA GazeNet model does not give good results for small faces (generally, a face is considered small if it occupies less than 10% of the image area).
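A simple way to screen inputs against this limitation (the 10% threshold comes from the note above; the helper name is hypothetical):

```python
def face_is_small(face_box, image_width, image_height, threshold=0.10):
    """Flag faces occupying less than ~10% of the image area (see the limitation above)."""
    x1, y1, x2, y2 = face_box
    face_area = max(0, x2 - x1) * max(0, y2 - y1)
    return face_area < threshold * (image_width * image_height)
```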

Model versions

  • trainable_v1.0 - Pre-trained model that is intended for training.
  • deployable_v1.0 - Deployment model that is intended to run in the inference pipeline.

References

Citations

Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W., Torralba, A.: Eye Tracking for Everyone. In: CVPR (2016)

License

License to use this model is covered by the Model EULA. By downloading the public and release version of the model, you accept the terms and conditions of these licenses.

Ethical Considerations

The NVIDIA GazeNet model detects the eye gaze point of regard. No additional information about the faces, such as race, gender, or skin type, is inferred.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.