PG-Occ
Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction

Chi Yan1,2, Dan Xu1,*
1The Hong Kong University of Science and Technology
2ZEEKR Automobile R&D Co., Ltd
*Corresponding author

Abstract

The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, text-aligned scene modeling faces a trade-off: sparse Gaussian representations struggle to capture small objects in the scene, while dense representations incur significant computational overhead. To address these limitations, we present PG-Occ, a Progressive Gaussian Transformer framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that iteratively enhances the 3D Gaussian representation to capture fine-grained scene details, achieving increasingly precise and detailed scene understanding. Another key contribution is an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that PG-Occ achieves state-of-the-art performance with a relative 14.3% mIoU improvement over the previous best-performing method.
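
To make the densification idea concrete, below is a minimal sketch of what one feed-forward densification stage could look like. This is our illustration rather than the paper's implementation: we assume each Gaussian carries a learned split score (split_logits), and the top-scoring Gaussians spawn finer children along their longest axis while the parents are kept.

import torch

def densify_stage(means, scales, feats, split_logits, k=1024):
    """One feed-forward densification stage (illustrative sketch).

    means:        (N, 3) Gaussian centers
    scales:       (N, 3) per-axis standard deviations
    feats:        (N, C) text-aligned feature vectors
    split_logits: (N,)   learned scores; high = refine this Gaussian
    """
    k = min(k, means.shape[0])
    top = split_logits.topk(k).indices                 # Gaussians to refine
    # Spawn two children offset along each parent's longest axis;
    # parents are kept, so coverage only grows between stages.
    axis = scales[top].argmax(dim=-1)                  # (k,)
    offset = torch.zeros(k, 3)
    offset[torch.arange(k), axis] = scales[top, axis]
    child_a = means[top] + 0.5 * offset
    child_b = means[top] - 0.5 * offset
    child_scales = scales[top].repeat(2, 1) * 0.6      # children are finer
    child_feats = feats[top].repeat(2, 1)              # inherit features
    return (torch.cat([means, child_a, child_b]),
            torch.cat([scales, child_scales]),
            torch.cat([feats, child_feats]))

In the full framework, the densified set is then refined against the extracted spatio-temporal image features before the next stage, as described in the Framework section below.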

Overview

The radar chart compares occupancy prediction accuracy across multiple methods, showing the superior performance of PG-Occ. The central panel highlights the key components: progressive Gaussian modeling with online feed-forward densification, anisotropy-aware sampling with adaptive receptive fields, and open-vocabulary retrieval conditioned on prompt inputs. The bottom row illustrates an example progression from the current input view through successive densification stages to the final occupancy prediction.
Overview illustration
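
Because the Gaussians carry text-aligned features, prompt-conditioned retrieval reduces to a similarity lookup. A minimal sketch, assuming a CLIP-style text encoder consistent with the MaskCLIP supervision and a hand-picked similarity threshold (both are our assumptions, not values from the paper):

import torch
import torch.nn.functional as F

@torch.no_grad()
def query_occupancy(gauss_feats, text_embedding, threshold=0.5):
    """Return a boolean mask over Gaussians matching a text prompt.

    gauss_feats:    (N, C) predicted text-aligned features per Gaussian
    text_embedding: (C,)   CLIP-style embedding of the query prompt
    """
    sims = F.cosine_similarity(gauss_feats,
                               text_embedding.unsqueeze(0), dim=-1)
    return sims > threshold  # (N,) retrieval mask

With several prompts, taking an argmax over per-prompt similarities instead of thresholding would yield closed-set-style semantic maps rather than a single retrieval mask.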

Framework

Architecture of the proposed PG-Occ framework. The scene is represented as feature Gaussian blobs that start from a base layer and are progressively refined and densified through multiple stages. Multi-camera inputs are processed to extract spatio-temporal features, which guide the update and refinement of the Gaussians, producing an arbitrary-resolution 3D occupancy field for both geometric reconstruction and open-vocabulary semantic understanding.
Framework illustration
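
Our reading of the anisotropy-aware sampling step, sketched under assumed shapes (again an illustration, not the released code): sampling offsets are drawn in each Gaussian's local frame and transformed by its rotation and per-axis scale, so an elongated Gaussian receives a correspondingly elongated receptive field.

import torch

def anisotropic_samples(means, scales, rotations, n_samples=8):
    """Generate per-Gaussian feature-sampling points (illustrative).

    means:     (N, 3)    centers
    scales:    (N, 3)    per-axis standard deviations
    rotations: (N, 3, 3) rotation matrices
    Returns (N, n_samples, 3) world-space points whose spread follows
    each Gaussian's anisotropic shape.
    """
    N = means.shape[0]
    canonical = torch.randn(N, n_samples, 3)     # unit-Gaussian offsets
    local = canonical * scales.unsqueeze(1)      # stretch per axis
    world = torch.einsum('nij,nsj->nsi', rotations, local)
    return means.unsqueeze(1) + world

These points would then be projected into the camera views across frames to gather the multi-view, spatio-temporal features that drive each Gaussian's update.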

More demo videos

📷 Third-person Perspective Demos.

We showcase third-person view comparisons between PG-Occ predictions and Ground Truth for scenes #0099, #0770, and #0557.


🏆 Comparison with Previous State-of-the-Art Methods.

We provide a baseline comparison for scene #0103 among GaussTR, our method (PG-Occ), and Ground Truth. The video demonstrates the qualitative differences between these approaches, highlighting the improvements achieved by PG-Occ over previous state-of-the-art methods.

Results on Occ3D-nuScenes dataset

We report open-vocabulary occupancy prediction results on the Occ3D-nuScenes dataset, grouping methods by the sensor modalities used during training. LangOcc, GaussTR, and PG-Occ all use MaskCLIP (Zhou et al., 2022) for text supervision. PG-Occ achieves state-of-the-art performance with an mIoU of 15.15, a 14.3% relative improvement over the previous best method. Notably, PG-Occ outperforms VEON and other competitors even without using LiDAR data during training.
Results table
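
For readers cross-checking the headline claim, the reported figures pin down the implied previous-best score; the 13.25 below is arithmetic from the stated numbers, not a value quoted from the table.

ours = 15.15               # PG-Occ mIoU on Occ3D-nuScenes
prev_best = ours / 1.143   # baseline implied by the 14.3% relative gain
print(f"implied previous best: {prev_best:.2f} mIoU")  # -> 13.25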

BibTeX

@article{yan2025pgocc,
  title={Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction},
  author={Yan, Chi and Xu, Dan},
  journal={arXiv preprint arXiv:2510.04759},
  year={2025}
}