Our framework constructs a semantically rich feature space by extracting features from multiple layers of a frozen Vision Transformer (ViT) encoder, capturing semantic information at varying levels of abstraction. Given a sequence of \(N\) image frames, we extract features \(\mathbf{F}_l \in \mathbb{R}^{N \times H' \times W' \times D_{\text{enc}}}\) from \(L\) layers of the ViT model, where \(D_{\text{enc}}\) is the feature dimension and \(H' \times W'\) is the spatial resolution. These features are concatenated along the channel dimension, resulting in \(\mathbf{F}_{\text{concat}} \in \mathbb{R}^{N \times H' \times W' \times (L \cdot D_{\text{enc}})}\). To manage the high dimensionality of the concatenated features, we apply Principal Component Analysis (PCA) for token-wise dimensionality reduction, transforming \(\mathbf{F}_{\text{concat}}\) into a lower-dimensional representation \(\mathbf{F}_{\text{PCA}} \in \mathbb{R}^{N \times H' \times W' \times D}\), where \(D \ll L \cdot D_{\text{enc}}\). These PCA-reduced features serve as the target features for our masked feature transformer, enabling efficient yet information-rich future prediction.
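Below is a minimal sketch of this feature-extraction pipeline, assuming a frozen backbone that exposes intermediate block outputs (the \texttt{get\_intermediate\_layers} hook, the layer indices, and the choice of \(D = 64\) are illustrative placeholders, not the exact implementation):

\begin{verbatim}
import torch
from sklearn.decomposition import PCA

def extract_multilayer_features(vit, frames, layer_ids=(3, 6, 9, 12)):
    """Collect patch-token features from several ViT blocks.

    frames: (N, 3, H, W) tensor of input frames.
    Assumes the backbone returns one (N, H'*W', D_enc) tensor per
    requested block; substitute your backbone's equivalent hook.
    """
    with torch.no_grad():
        feats = vit.get_intermediate_layers(frames, layer_ids)
    # Concatenate along the channel dimension: (N, H'*W', L * D_enc)
    return torch.cat(feats, dim=-1)

def pca_reduce(features, dim=64):
    """Token-wise PCA: each spatial token is an independent sample."""
    n, t, c = features.shape
    flat = features.reshape(-1, c).cpu().numpy()        # (N*H'*W', L*D_enc)
    reduced = PCA(n_components=dim).fit_transform(flat)  # (N*H'*W', D)
    return torch.from_numpy(reduced).reshape(n, t, dim)  # (N, H'*W', D)
\end{verbatim}

The PCA-reduced tokens produced by this sketch would then serve as the prediction targets for the masked feature transformer.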
As shown in Table 1, using features from multiple layers (3, 6, 9, 12) significantly improves semantic segmentation performance compared to using only the final layer (12). Furthermore, Table 2 demonstrates that PCA-based dimensionality reduction not only yields a more compact prediction target but also leads to improved performance.