DINO-Foresight: Looking into the Future with DINO

Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
1. Archimedes/Athena RC | 2. valeo.ai | 3. National Technical University of Athens | 4. University of Crete | 5. IACM-Forth
NeurIPS 2025

Abstract

Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments demonstrate the strong performance, robustness, and scalability of our framework.

Overview

Our framework employs frozen Vision Foundation Model (VFM) encoders to extract semantic features from context frames. A masked feature transformer then processes these features in a self-supervised manner, predicting future VFM features by minimizing a Smooth L1 loss between predicted and actual features. At test time, the forecasted features plug directly into task-specific prediction heads (semantic/instance segmentation, depth estimation, surface normals) without retraining the core predictor, making multi-task scene understanding of future frames modular and efficient.

DINO-Foresight Overview
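
The core training step can be summarized in a few lines. Below is a minimal sketch, assuming a frozen VFM encoder callable on batches of frames and a generic predictor standing in for the masked feature transformer (both are illustrative names, not the released implementation):

import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(vfm, frames):
    # Encode a (B, T, 3, H, W) clip with the frozen VFM into (B, T, N_tokens, D) features.
    B, T = frames.shape[:2]
    feats = vfm(frames.flatten(0, 1))        # (B*T, N_tokens, D)
    return feats.unflatten(0, (B, T))

def training_step(vfm, predictor, frames, num_context=3):
    # Self-supervised objective: regress the future frame's VFM features
    # from the context frames with a Smooth L1 loss.
    feats = extract_features(vfm, frames)
    context, target = feats[:, :num_context], feats[:, num_context]
    pred = predictor(context)                # (B, N_tokens, D) forecast
    return F.smooth_l1_loss(pred, target)

At test time, the same predicted feature tensor is reshaped into a spatial feature map and handed to whichever frozen task head (segmentation, depth, surface normals) is attached.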

Hierarchical Target Feature Construction

Our framework constructs a semantically rich feature space by extracting features from multiple layers of a frozen Vision Transformer (ViT) encoder, capturing semantic information at varying levels of abstraction. Given a sequence of \(N\) image frames, we extract features \(\mathbf{F}_l \in \mathbb{R}^{N \times H' \times W' \times D_{\text{enc}}}\) from \(L\) layers of the ViT model, where \(D_{\text{enc}}\) is the feature dimension and \(H' \times W'\) is the spatial resolution. These features are concatenated along the channel dimension, resulting in \(\mathbf{F}_{\text{concat}} \in \mathbb{R}^{N \times H' \times W' \times (L \cdot D_{\text{enc}})}\). To manage the high dimensionality of the concatenated features, we apply Principal Component Analysis (PCA) for token-wise dimensionality reduction, transforming \(\mathbf{F}_{\text{concat}}\) into a lower-dimensional representation \(\mathbf{F}_{\text{PCA}} \in \mathbb{R}^{N \times H' \times W' \times D}\), where \(D \ll L \cdot D_{\text{enc}}\). These PCA-reduced features serve as the target features for our masked feature transformer, enabling efficient yet information-rich future prediction.
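
As a concrete illustration of this construction, the sketch below pulls multi-layer features through DINOv2's get_intermediate_layers and fits a token-wise PCA with scikit-learn. The layer indices, reduced dimension, and exact API usage are illustrative assumptions rather than the paper's precise configuration:

import torch
from sklearn.decomposition import PCA

LAYERS = [2, 5, 8, 11]   # 0-indexed blocks corresponding to layers 3, 6, 9, 12 of a ViT-B
D_PCA = 128              # illustrative reduced dimension D

@torch.no_grad()
def multi_layer_features(vit, frames):
    # frames: (N, 3, H, W) -> concatenated features (N, H', W', L * D_enc)
    feats = vit.get_intermediate_layers(frames, n=LAYERS, reshape=True)  # L tensors of (N, D_enc, H', W')
    return torch.cat(feats, dim=1).permute(0, 2, 3, 1).contiguous()

def fit_token_pca(features, d=D_PCA):
    # Token-wise reduction: samples are individual tokens (N*H'*W'), dims are L * D_enc.
    tokens = features.flatten(0, 2).cpu().numpy()
    return PCA(n_components=d).fit(tokens)

def reduce_tokens(features, pca):
    n, h, w, _ = features.shape
    reduced = pca.transform(features.flatten(0, 2).cpu().numpy())
    return torch.from_numpy(reduced).view(n, h, w, -1)    # (N, H', W', D)

The resulting (N, H', W', D) maps serve as the regression targets for the masked feature transformer; a DINOv2 ViT-B with registers can, for instance, be loaded from the facebookresearch/dinov2 torch.hub entry point.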

As shown in Table 1, using features from multiple layers (3, 6, 9, 12) significantly improves semantic segmentation performance compared to using only the final layer (12). Furthermore, Table 2 demonstrates that PCA-based dimensionality reduction simplifies the modeling process and leads to improved performance.

Multi-Layer Features Table
Table 1: Impact of multi-layer features on semantic segmentation performance.
Dimensionality Reduction Table
Table 2: Effect of PCA dimensionality reduction on segmentation and depth prediction.
Hierarchical Target Feature Construction
Figure 1: Multi-layer features from a frozen ViT encoder are concatenated and compressed via PCA to form compact target features for the masked feature transformer.

Comparison of VFM Encoders across tasks

We evaluate our method using three different Vision Foundation Model (VFM) encoders: DINOv2 with registers (self-supervised), EVA2-CLIP (vision-language contrastive), and SAM (Segment Anything Model, trained with supervised promptable segmentation). For each encoder, we use the ViT-B variant and assess performance across multiple scene understanding tasks. As shown in Table 3, DINOv2 consistently outperforms the other encoders across all tasks, achieving the best results for both short-term and mid-term predictions. This aligns with expectations, as the DINOv2-based Oracle also performs best in all cases. Notably, our model effectively predicts future-frame features for all VFMs, significantly improving over the Copy-Last baseline in every scenario.

Additionally, we compare VFM features against VAE-based latents used in latent generative models. The results clearly demonstrate that VAE latents lack the high-level semantic information necessary for accurate scene understanding tasks. Even the Oracle performance with VAE latents is significantly worse than with VFM features, highlighting the fundamental advantage of forecasting semantically rich VFM features over low-level VAE representations. Based on these findings, we select DINOv2 as our default VFM encoder due to its superior performance across all evaluation metrics.

Comparison of VFM Encoders
Table 3: Comparison of VFM encoders (DINOv2, EVA2-CLIP, SAM) and VAE latents across semantic segmentation, depth estimation, and surface normal prediction. For each method, we show Oracle (upper bound), Copy-Last (lower bound), and our Prediction results. DINOv2 consistently achieves the best performance across all tasks and temporal horizons.

Training-Efficient Strategies for High-Resolution Feature Forecasting

High-resolution features are crucial for pixel-wise scene understanding tasks, but training on high-resolution inputs (\(448 \times 896\)) is computationally expensive. To address this challenge, we explore three strategies as shown in Table 4: position interpolation (training on \(224 \times 448\) and adapting at test time), sliding-window processing (training on cropped features), and two-phase training (low-resolution pretraining followed by high-resolution fine-tuning). Comparing Oracle baselines at different resolutions (rows a-b) demonstrates that high-resolution features significantly improve segmentation performance. Among forecasting approaches, position interpolation (row d) performs poorly due to distribution shift, while both sliding-window (row e) and two-phase training (row f) achieve strong results. We adopt the two-phase approach as our default strategy due to its simplicity and superior performance, enabling the masked transformer to leverage larger spatial context when propagating VFM features.

High-Resolution Training Strategies Comparison
Table 4: Comparison of training strategies for high-resolution feature forecasting. Two-phase training (row f) achieves the best balance between computational efficiency and prediction accuracy.
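
For reference, both the position-interpolation and two-phase strategies rest on resizing the predictor's learned positional embeddings to the larger token grid; the difference is whether the model is fine-tuned afterwards. A minimal sketch, assuming the predictor keeps a learned (1, H'·W', D) positional grid (an assumption, since the embedding layout is not spelled out here):

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_hw, new_hw):
    # Resize a learned positional-embedding grid (1, H*W, D) to a larger token grid
    # with bicubic interpolation, so a predictor pretrained on 224x448 inputs can
    # ingest 448x896 feature maps.
    d = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_hw[0], old_hw[1], d).permute(0, 3, 1, 2)  # (1, D, H, W)
    grid = F.interpolate(grid, size=new_hw, mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], d)

# Two-phase schedule (illustrative, not the released training code):
#   phase 1: train the predictor at low resolution (224x448 inputs, coarse token grid);
#   phase 2: resize its positional embeddings with interpolate_pos_embed and
#            fine-tune briefly on the 448x896 features.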

Discrete vs. Continuous VFM Representations

Recent work in generative modeling has explored both discrete and continuous representations for image and video generation. While recent findings favor continuous representations, showing that removing vector quantization can improve generation quality, we investigate this question in the context of VFM feature forecasting. To compare these approaches, we employ 4M's pretrained DINOv2 tokenizer, which encodes DINOv2 features into discrete codes from a vocabulary of size 8192. We train a quantized variant of DINO-Foresight using cross-entropy loss to predict these discrete codes, which are then decoded back to DINOv2 features at inference time. As shown in Table 5, while the discretized variant achieves comparable Oracle performance to the continuous case, our continuous VFM feature forecasting approach yields superior future semantic prediction results across both short-term and mid-term horizons. These findings suggest that preserving the rich, continuous representations from VFMs—without quantization—offers clear advantages for dense semantic forecasting tasks.

Discrete vs Continuous VFM Representations Comparison
Table 5: Comparison of continuous DINOv2 features (our approach) against 4M's DINOv2 tokenizer with discrete codes on semantic segmentation forecasting. Continuous representations consistently outperform discrete tokenization.
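
The two variants differ only in the prediction target and loss. Below is a short sketch of both objectives; the tokenizer.encode call mapping DINOv2 features to codebook indices is an assumed interface, not necessarily 4M's actual API:

import torch
import torch.nn.functional as F

VOCAB_SIZE = 8192  # codebook size of 4M's DINOv2 tokenizer

def continuous_loss(pred_feats, target_feats):
    # Default DINO-Foresight objective: regress the (PCA-reduced) features directly.
    return F.smooth_l1_loss(pred_feats, target_feats)

def discrete_loss(pred_logits, target_feats, tokenizer):
    # Quantized variant: classify each future token into one of VOCAB_SIZE codes.
    with torch.no_grad():
        target_codes = tokenizer.encode(target_feats)          # (B, N) int64 code indices
    return F.cross_entropy(pred_logits.flatten(0, 1),          # (B*N, VOCAB_SIZE)
                           target_codes.flatten())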

Comparison with State-of-the-Art

DINO-Foresight demonstrates a key advantage over existing methods: a single feature prediction model achieves competitive or superior performance across multiple scene understanding tasks. As shown in Table 6, our approach handles semantic segmentation, instance segmentation, depth estimation, and surface normal prediction simultaneously, while prior works either require separate prediction models per task (e.g., PFA) or handle at most two tasks (e.g., Futurist). We compare against VISTA, a large-scale video latent diffusion model with 2.5 billion parameters trained on 1,740 hours of driving videos. Despite VISTA's substantially larger scale, DINO-Foresight consistently outperforms it across all tasks on both Cityscapes and nuScenes datasets. Moreover, our approach is significantly more efficient—mid-term forecasting on 500 Cityscapes validation scenes takes approximately 5 minutes versus VISTA's 8.3 hours (both on a single A100 GPU). These results validate that operating in semantically rich VFM feature space provides clear advantages over RGB-level generation for scene understanding tasks, enabling accurate predictions while remaining computationally efficient.

Comparison with State-of-the-Art Methods
Table 6: Comprehensive comparison with state-of-the-art methods across semantic segmentation, instance segmentation, depth prediction, and surface normal forecasting on Cityscapes. DINO-Foresight achieves strong performance across all tasks with a single unified model, while previous methods require separate models per task or handle limited task combinations.

Emerging Visual Representations in the Masked Feature Transformer

Beyond forecasting VFM features, we investigate whether intermediate representations within our masked feature transformer can serve as enhanced visual features for downstream tasks. Inspired by self-supervised learning methods that extract robust features from unlabeled data, we examine features from the 6th, 9th, 10th, 11th, and 12th transformer layers to assess whether these intermediate representations can further improve upon the already strong predicted VFM features. As shown in Figure 2, incorporating intermediate transformer features enhances both semantic segmentation and depth prediction performance across most layers, with the 9th layer features yielding the best segmentation results and 6th layer features achieving optimal depth performance. While improvements are modest—as expected given the strength of predicted VFM features alone—these findings suggest that our future prediction framework holds promise as a self-supervised representation learning method. The masked transformer learns meaningful visual representations through temporal prediction, potentially enhancing the pretrained VFM features and opening new directions for self-supervised learning through future forecasting.

Impact of Intermediate Transformer Features
Figure 2: Performance comparison using only predicted VFM features (dashed line) versus combining them with intermediate transformer features from layers 6, 9, 10, 11, and 12 (blue bars). Results demonstrate that intermediate features improve both semantic segmentation (mIoU) and depth prediction (AbsRel reduction) across short-term and mid-term horizons.
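
Concretely, such intermediate representations can be read out with a forward hook and concatenated with the predicted VFM features before a task head. A minimal sketch, assuming the masked feature transformer exposes its blocks as predictor.blocks (a hypothetical attribute name):

import torch

def grab_intermediate(predictor, context, layer_idx):
    # Run the predictor once, capturing the activations of one transformer block
    # alongside the final predicted features.
    cache = {}

    def _hook(module, inputs, output):
        cache["feat"] = output

    handle = predictor.blocks[layer_idx].register_forward_hook(_hook)
    pred = predictor(context)
    handle.remove()
    return pred, cache["feat"]

# e.g. take the 9th block (index 8), concatenate along channels, feed to a task head:
# pred, inter = grab_intermediate(predictor, context, layer_idx=8)
# head_input = torch.cat([pred, inter], dim=-1)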

Citation

@inproceedings{karypidis2025dinoforesight,
title={{DINO}-Foresight: Looking into the Future with {DINO}},
author={Efstathios Karypidis and Ioannis Kakogeorgiou and Spyros Gidaris and Nikos Komodakis},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=gimtybo07H}
}