What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation

Yihua Cheng1, Yaning Zhu2, Zongji Wang3, Hongquan Hao4, Yongwei Liu4, Shiqing Cheng4, Xi Wang4, Hyung Jin Chang1
1University of Birmingham, 2HUST, 3Chinese Academy of Sciences, 4CalmCar


In this paper, we present three novel elements to advance vision-based in-vehicle gaze research.

1. Dataset. We introduce IVGaze, a pioneering dataset capturing in-vehicle gaze, compiled from 125 individuals and covering a large range of gaze and head poses within vehicles. Conventional gaze collection systems are inadequate for in-vehicle use. For this dataset, we propose a new vision-based solution for in-vehicle gaze collection and introduce a refined gaze target calibration method to tackle annotation challenges.

2. Gaze Estimation. Our research focuses on in-vehicle gaze estimation using IVGaze. Images of in-vehicle faces often suffer from low resolution, which motivates our gaze pyramid transformer, a model that uses a transformer to integrate multilevel features. Expanding upon this, we introduce the dual-stream gaze pyramid transformer (GazeDPTR). We rotate virtual cameras via perspective transformation to normalize images, and use the camera pose to merge normalized and original images for accurate gaze estimation. GazeDPTR achieves state-of-the-art performance on the IVGaze dataset.

3. Extensive Application. We explore a novel strategy for gaze zone classification by extending GazeDPTR. We define a foundational tri-plane and project gaze onto these planes. By combining positional features from the projection points with visual features from images, we achieve higher accuracy than relying on visual features alone, demonstrating the advantage of gaze estimation.
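The projection of gaze onto a plane can be illustrated as a ray-plane intersection. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `project_gaze_to_plane`, the coordinate frame, and the three plane positions are hypothetical and chosen only for illustration.

```python
def project_gaze_to_plane(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Return the 3D point where a gaze ray hits a plane, or None if it misses."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None  # gaze ray is parallel to the plane
    diff = [p - o for p, o in zip(plane_point, origin)]
    t = dot(diff, plane_normal) / denom
    if t < 0:
        return None  # plane lies behind the gaze origin
    return tuple(o + t * d for o, d in zip(origin, direction))

# Hypothetical planes (metres, x right / z forward); not taken from the paper.
planes = {
    "front": ((0.0, 0.0, 1.0), (0.0, 0.0, 1.0)),
    "left": ((-0.8, 0.0, 0.0), (1.0, 0.0, 0.0)),
    "right": ((0.8, 0.0, 0.0), (1.0, 0.0, 0.0)),
}

gaze_origin = (0.0, 0.0, 0.0)
gaze_direction = (0.5, 0.0, 1.0)  # looking forward and slightly to the right

hits = {
    name: project_gaze_to_plane(gaze_origin, gaze_direction, point, normal)
    for name, (point, normal) in planes.items()
}
front_hit, left_hit, right_hit = hits["front"], hits["left"], hits["right"]
```

In this toy setup the ray reaches the front and right planes but never the left one, so `left_hit` is `None`. Intersection points like these could serve as the positional features that are combined with visual features for zone classification.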

IVGaze Dataset

Data Collection System

We introduce an in-vehicle vision-based gaze collection system. This system utilizes an infrared camera to capture human faces and is calibrated with a depth camera. Gaze targets are represented by red points on stickers affixed within the vehicle.

Data Statistics

We collect 44,795 images from 125 subjects. The horizontal gaze ranges from -50° to 90°, and the vertical gaze ranges from -40° to 40°.

Dual-Stream Gaze Pyramid Transformer

Network Architecture

We propose a gaze pyramid transformer (GazePTR) that utilizes a transformer to integrate multilevel features. Expanding upon this, we propose a dual-stream gaze pyramid transformer (GazeDPTR). We rotate virtual cameras via perspective transformation to normalize images, and leverage camera pose to merge normalized and original images. We also extend GazeDPTR to the downstream gaze zone classification task with a foundational tri-plane.
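Rotating a virtual camera about its optical centre corresponds to a homography of the form H = K_norm · R · K⁻¹, which is a standard result for pure camera rotation. The sketch below shows one way such a normalization could be set up. It assumes pinhole intrinsics, a known 3D face centre in camera coordinates, and a simple look-at rotation; the intrinsic values, the face position, and all function names are hypothetical, and the authors' exact normalization may differ.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def intrinsics_inverse(K):
    """Closed-form inverse of a zero-skew pinhole intrinsic matrix."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [[1 / fx, 0.0, -cx / fx], [0.0, 1 / fy, -cy / fy], [0.0, 0.0, 1.0]]

def look_at_rotation(face_center):
    """Rotation whose rows are the virtual camera axes in the original frame.

    The virtual z-axis points at the face centre. Assumes the face is not
    directly along the original camera's y-axis.
    """
    z = normalize(face_center)
    x = normalize(cross((0.0, 1.0, 0.0), z))
    y = cross(z, x)
    return [list(x), list(y), list(z)]

def normalization_homography(K, K_norm, face_center):
    """Homography mapping original pixels to virtual (normalized) camera pixels."""
    R = look_at_rotation(face_center)
    return matmul(matmul(K_norm, R), intrinsics_inverse(K))

def warp_pixel(H, u, v):
    x, y, w = (H[i][0] * u + H[i][1] * v + H[i][2] for i in range(3))
    return x / w, y / w

# Hypothetical intrinsics and face position, for illustration only.
K = [[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]]
K_norm = [[960.0, 0.0, 112.0], [0.0, 960.0, 112.0], [0.0, 0.0, 1.0]]
face_center = (0.2, -0.1, 0.6)

H = normalization_homography(K, K_norm, face_center)

# Pixel where the face centre appears in the original image.
u = K[0][0] * face_center[0] / face_center[2] + K[0][2]
v = K[1][1] * face_center[1] / face_center[2] + K[1][2]
u_norm, v_norm = warp_pixel(H, u, v)
```

After the warp, the face centre lands on the principal point of the virtual camera, here (112, 112). With OpenCV, the same `H` (as a NumPy array) could be passed to `cv2.warpPerspective` to warp a full image.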


Quantitative Comparison

We evaluate our method on the IVGaze dataset. Our work builds on GazeTR. We define a new metric, Average Precision (AP). For the AP of <k°, an estimation is considered correct if its angular error is lower than k°.
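The AP metric described above can be sketched as the fraction of samples whose angular error falls below the threshold. This is an illustrative reading of the definition, assuming gaze is represented as 3D direction vectors; the function names and toy data are hypothetical and not taken from the authors' evaluation code.

```python
import math

def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(t * t for t in target)))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_sim = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_sim))

def average_precision(preds, targets, k_deg):
    """Fraction of estimations whose angular error is lower than k_deg."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(e < k_deg for e in errors) / len(errors)

def direction_at(angle_deg):
    """Unit vector rotated angle_deg away from straight ahead (0, 0, -1)."""
    a = math.radians(angle_deg)
    return (math.sin(a), 0.0, -math.cos(a))

# Toy data, made up for illustration (not IVGaze results).
targets = [(0.0, 0.0, -1.0)] * 3
preds = [direction_at(0.0), direction_at(4.0), direction_at(10.0)]

ap_5 = average_precision(preds, targets, k_deg=5.0)
ap_15 = average_precision(preds, targets, k_deg=15.0)
```

In this toy example two of the three predictions fall within 5° of the target, so `ap_5` is 2/3, while all three fall within 15°.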


Gaze Estimation in Vehicles


  author    = {Yihua Cheng and Yaning Zhu and Zongji Wang and Hongquan Hao and Yongwei Liu and Shiqing Cheng and Xi Wang and Hyung Jin Chang},
  title     = {What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},


Please feel free to email Dr. Yihua Cheng if you have any questions or would like to collaborate. The latest contact information can be found here.