Do 3D foundation models have an emergent understanding of extreme views?
The VGGT model was pre-trained primarily on overlapping images.
Surprisingly, when tested on non-overlapping image pairs, the model still produces plausible estimates of relative pose,
with nearly half of the pairs yielding a rotation error below 30°.
Careful fine-tuning of a small number of parameters substantially improves results, as shown above by the error distribution on our new in-the-wild relative pose benchmark (UnScenePairs-t).
Hover over the interactive canvas above to view random non-overlapping image pairs from our benchmark,
along with the rotation errors before and after our lightweight fine-tuning scheme.
3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction. All code and data will be released.
3D foundation models (3DFMs) have recently shown remarkable progress in reconstructing scene geometry directly from unstructured images. However, despite their growing adoption, their internal structure has remained largely unexplored. In this work, we first analyze their internal 3D language via cross-view attention maps, revealing that these models already encode a surprisingly rich understanding of scene geometry within their shared alternating attention backbone. We provide interactive cross-view attention visualizations below:
Hover over and select a region in the left image to view its corresponding cross-view attention map on the right. The heatmap is overlaid on the second image, with warmer colors (red/yellow) indicating higher attention values.
For regions with direct visual overlap, the high-attention areas concentrate precisely at the corresponding locations, demonstrating the model's ability to identify visual correspondences.
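To make the visualization concrete, the sketch below shows one plausible way to compute such a heatmap from the softmaxed weights of a single cross-view (global) attention layer for a two-image sequence. The tensor layout, variable names, and upsampling strategy are illustrative assumptions, not VGGT's actual API.

```python
# Minimal sketch: cross-view attention heatmap for a selected region.
# Assumes `attn` holds softmaxed weights of one global attention layer for a
# 2-image sequence, shape (heads, 2n, 2n), where the first n = H*W tokens belong
# to image 1 and the next n to image 2 (patch grid H x W). Illustrative only.
import numpy as np
import matplotlib.pyplot as plt

def cross_view_heatmap(attn, query_patches, grid_hw, image2):
    """Average attention from selected image-1 patches onto the image-2 patch grid."""
    H, W = grid_hw
    n = H * W
    # mean over heads, then over the selected query patches of image 1
    a = attn.mean(axis=0)[query_patches, :].mean(axis=0)        # (2n,)
    heat = a[n:2 * n].reshape(H, W)                             # keys belonging to image 2
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    # upsample the patch-level heatmap to (roughly) the image resolution
    scale_h, scale_w = image2.shape[0] // H, image2.shape[1] // W
    heat_up = np.kron(heat, np.ones((scale_h, scale_w)))
    plt.imshow(image2)
    plt.imshow(heat_up, cmap="jet", alpha=0.5)                  # warmer = higher attention
    plt.axis("off")
    plt.show()
```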
Building on our findings, we propose a lightweight alignment framework that applies rotation-based supervision with a geodesic loss only on the relative camera poses between image pairs. To preserve the model's pre-trained knowledge, we adopt a minimal backbone fine-tuning strategy that targets a minimal set of both layers and parameters within the backbone, updating only around 80k parameters, roughly four orders of magnitude fewer than the full model. This targeted approach effectively aligns the model's internal 3D language for extreme-view reasoning without degrading per-image depth or point quality.
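A minimal sketch of these two ingredients follows, assuming a PyTorch implementation: a geodesic loss over relative rotations, and a helper that freezes the entire model and then unfreezes only the bias terms of selected backbone layers. The module and parameter names (e.g., the `global_attn` filter) are placeholders rather than VGGT's actual attribute names.

```python
# Sketch of the alignment recipe, assuming a PyTorch 3DFM with named parameters.
import torch

def geodesic_loss(R_pred, R_gt):
    """Geodesic distance (radians) between batches of 3x3 rotation matrices."""
    R_rel = R_pred.transpose(-1, -2) @ R_gt                      # relative rotation
    trace = R_rel.diagonal(dim1=-2, dim2=-1).sum(-1)
    cos = (trace - 1.0) / 2.0
    return torch.acos(cos.clamp(-1.0 + 1e-6, 1.0 - 1e-6)).mean()

def select_trainable_biases(model, layer_keywords=("global_attn",)):
    """Freeze everything, then unfreeze only bias terms of the chosen backbone layers."""
    for p in model.parameters():
        p.requires_grad_(False)                                  # decoder heads stay frozen
    trainable = []
    for name, p in model.named_parameters():
        if name.endswith("bias") and any(k in name for k in layer_keywords):
            p.requires_grad_(True)
            trainable.append(p)
    return trainable

# usage sketch:
# params = select_trainable_biases(model)
# optimizer = torch.optim.AdamW(params, lr=1e-4)
# loss = geodesic_loss(R_pred, R_gt); loss.backward(); optimizer.step()
```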
Existing benchmarks for evaluating 3DFMs typically contain scenes captured in constrained 3D environments, e.g., assuming constant illumination, no transient objects, and fixed camera intrinsics. To evaluate 3DFMs on unconstrained inputs captured in the wild, we create MegaUnScene: a new collection of 476 Internet scenes unseen by existing models. From these scenes, we assemble two test sets for relative pose estimation and one for dense reconstruction.
We construct two subsets for evaluating relative pose estimation: UnScenePairs targets image pairs with predominant rotational motion, while UnScenePairs-t focuses on pairs with larger camera baselines. Unlike prior benchmarks, these subsets capture unconstrained, in-the-wild views unseen by 3DFMs. In total, they comprise over 6,000 image pairs across more than 450 scenes, including substantial non-overlapping splits.
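For reference, the sketch below shows the kind of per-pair metric used throughout this page: the geodesic rotation error in degrees and the fraction of pairs below an angular threshold (e.g., 30°, as in the teaser). The exact evaluation protocol of the benchmark may differ in its details.

```python
# Sketch of a per-pair relative-rotation evaluation; illustrative, not the
# benchmark's definitive protocol.
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angular error (degrees) between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def accuracy_at(errors_deg, threshold=30.0):
    """Fraction of image pairs whose rotation error is below `threshold` degrees."""
    return float((np.asarray(errors_deg) < threshold).mean())
```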
We construct UnSceneRecon, a subset comprising 100 in-the-wild reconstructions with metric scale annotations. This benchmark evaluates dense reconstruction quality on unconstrained Internet photos exhibiting diverse lighting conditions, transient objects, and varying camera models (see figure below).
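As a rough illustration of how dense reconstructions can be scored against metric-scale ground truth, the sketch below computes standard accuracy/completeness point-cloud metrics; the threshold and exact protocol are assumptions, not necessarily those used for UnSceneRecon.

```python
# Sketch of accuracy/completeness metrics for metric-scale point clouds.
import numpy as np
from scipy.spatial import cKDTree

def accuracy_completeness(pred_pts, gt_pts, threshold=0.1):
    """Accuracy: fraction of predicted points within `threshold` meters of the
    ground truth; completeness: fraction of GT points within `threshold` of a prediction."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # nearest GT point per prediction
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # nearest prediction per GT point
    accuracy = float((d_pred_to_gt < threshold).mean())
    completeness = float((d_gt_to_pred < threshold).mean())
    return accuracy, completeness
```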
We present quantitative evaluations demonstrating the effectiveness of our alignment scheme. As shown below, our method achieves consistent and substantial improvements in extreme-view settings, establishing a new state of the art. Crucially, this targeted adaptation also preserves the 3DFMs' strong pre-trained multi-task capabilities on tasks such as multi-view pose estimation and dense reconstruction.
Below we illustrate two challenging image pairs from the UnScenePairs-t benchmark, where each pair is captured from viewpoints with a large rotation and significant camera translation. For each example, the figure presents the input images, followed by a spherical visualization of the predicted relative rotations: black indicates the reference camera, blue the ground-truth rotation, red the pre-trained VGGT prediction, and yellow the fine-tuned VGGT prediction. On the right, the corresponding 3D reconstructions are shown, including the sparse ground-truth structure and the dense outputs from both the pre-trained and fine-tuned models. In both cases, the pre-trained model produces distorted geometry due to incorrect relative rotation predictions, while the fine-tuned model yields accurate relative rotations and coherent reconstructions. Please refer to our Interactive Visualization for results of all models on the sELP, UnScenePairs, and UnScenePairs-t test sets.
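For readers who want to reproduce the spherical visualization, the sketch below applies each rotation to a reference viewing direction and plots the result on the unit sphere; the plotting details and color mapping are illustrative only and follow the figure's convention.

```python
# Sketch of the spherical rotation visualization: each rotation moves a reference
# viewing direction, which is then shown as a point on the unit sphere.
import numpy as np
import matplotlib.pyplot as plt

def plot_rotations_on_sphere(rotations, colors, labels, ref_dir=np.array([0.0, 0.0, 1.0])):
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    # faint wireframe sphere for context
    u, v = np.mgrid[0:2 * np.pi:40j, 0:np.pi:20j]
    ax.plot_wireframe(np.cos(u) * np.sin(v), np.sin(u) * np.sin(v), np.cos(v),
                      color="lightgray", linewidth=0.3)
    ax.scatter(*ref_dir, color="black", s=40, label="reference camera")
    for R, c, lab in zip(rotations, colors, labels):
        d = R @ ref_dir                       # rotated viewing direction
        ax.scatter(*d, color=c, s=40, label=lab)
    ax.set_box_aspect((1, 1, 1))
    ax.legend()
    plt.show()

# usage sketch:
# plot_rotations_on_sphere([R_gt, R_pretrained, R_finetuned],
#                          ["blue", "red", "yellow"],
#                          ["ground truth", "pre-trained VGGT", "fine-tuned VGGT"])
```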