Active View Selector: Fast and Accurate Active View Selection with Cross Reference Image Quality Assessment
University of Oxford
- Treat the worst rendered view as the next best view.
- Use a Cross Reference IQA network to estimate rendering quality.
- Use a lightweight CNN backbone in the CR-IQA network for fast prediction.
Idea
We tackle active view selection in novel view synthesis and 3D reconstruction. Existing methods like FisherRF and ActiveNeRF select the next best view by minimizing uncertainty or maximizing information gain in 3D, but they require specialized designs for different 3D representations and involve complex modelling in 3D space.
Instead, we reframe this as a 2D image quality assessment (IQA) task, selecting views where current renderings have the lowest quality. Since ground-truth images for candidate views are unavailable, full-reference metrics like PSNR and SSIM are inapplicable, while no-reference metrics, such as MUSIQ and MANIQA, lack the essential multi-view context.
Inspired by CrossScore, a recent cross-reference quality assessment framework, we train a model to predict SSIM within a multi-view setup and use it to guide view selection. Our cross-reference IQA framework achieves substantial quantitative and qualitative improvements on standard benchmarks, is agnostic to the underlying 3D representation, and runs 14-33 times faster than previous methods.
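As a concrete sketch of this scoring step (the function and argument names below are illustrative assumptions, not our released code): the cross-reference IQA network takes a rendered candidate view together with a few captured reference views of the same scene and predicts a dense SSIM-like map, which we reduce to a scalar quality score.

```python
import torch


def predict_quality(cr_iqa_net, rendered_query, reference_views):
    """Score one rendered candidate view without its ground-truth image.

    cr_iqa_net:      a trained cross-reference IQA network (CrossScore-style).
    rendered_query:  (3, H, W) image rendered from the current 3D model.
    reference_views: (K, 3, H, W) captured images of the same scene, other poses.
    Returns a scalar; higher means the current rendering already looks good.
    """
    with torch.no_grad():
        # The network predicts a per-pixel SSIM-like map for the query view,
        # using the reference views as multi-view context instead of ground truth.
        quality_map = cr_iqa_net(rendered_query.unsqueeze(0),
                                 reference_views.unsqueeze(0))   # (1, 1, H, W)
    return quality_map.mean().item()
```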
Introduction
Our goal is to select the next best view from a set of candidate views for active vision algorithms, for example, path planning for robots. These active vision algorithms are then employed to guide downstream applications, such as novel view synthesis, 3D reconstruction, and space exploration.
In this work, we treat the next best view as the candidate view with the lowest rendering quality. We estimate this quality with a Cross Reference IQA (CR-IQA) network, which scores a query rendering against other captured views of the scene in a multi-view setup and therefore requires no ground-truth image for the candidate view.
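A minimal sketch of the resulting selection rule, assuming an NVS backend that exposes a `render(pose)` call and a scoring function such as the one sketched above (both names are placeholders): render every candidate pose, score each rendering, and capture the worst one next.

```python
def select_next_best_view(nvs_model, candidate_poses, score_fn):
    """Return the candidate pose whose current rendering scores worst.

    nvs_model:       any NVS / 3D reconstruction backend with render(pose);
                     the selector never looks at the 3D representation itself.
    candidate_poses: list of camera poses not yet captured.
    score_fn:        maps a rendered image to a scalar quality score,
                     e.g. the cross-reference IQA scoring sketched earlier.
    """
    scores = [score_fn(nvs_model.render(pose)) for pose in candidate_poses]
    # The worst rendered view is treated as the next best view to capture.
    worst = min(range(len(scores)), key=scores.__getitem__)
    return candidate_poses[worst]
```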
Network
The original Cross Reference IQA network, CrossScore, uses a DINOv2 backbone. While effective, it is too heavy for active vision applications, which require more responsive quality prediction. We therefore additionally integrate a lightweight CNN backbone, RepViT, into the CR-IQA network for fast quality prediction.
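The sketch below illustrates the backbone swap only, not the actual CrossScore architecture: a small convolutional encoder stands in for RepViT, chosen so that per-view feature extraction stays cheap compared with a DINOv2 transformer. All module names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LightweightEncoder(nn.Module):
    """Stand-in for a RepViT-style CNN backbone: a few strided convolutions
    that produce a feature map cheaply enough for per-candidate scoring."""

    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, out_dim, 3, stride=2, padding=1),
        )

    def forward(self, x):            # (B, 3, H, W) -> (B, C, H/8, W/8)
        return self.net(x)


# The same encoder is applied to the rendered query and to each reference view;
# in this sketch the rest of the CR-IQA network (cross-view aggregation and the
# score head) is assumed unchanged, so the backbone can be swapped freely.
encoder = LightweightEncoder()
feats = encoder(torch.rand(1, 3, 256, 256))
print(feats.shape)                   # torch.Size([1, 128, 32, 32])
```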
Self-supervised Training
We follow the same self-supervised training process as CrossScore. We leverage existing NVS systems and abundant multi-view datasets to generate SSIM maps for our training. Specifically, we select Neural Radiance Field (NeRF)-style NVS systems as our data engine. Given a set of images, a NeRF recovers a neural representation of a scene by iteratively reconstructing the given image set with photometric losses.
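A sketch of how one training target could be produced, assuming renders and captured images are already aligned as numpy arrays in [0, 1] (names are illustrative): train a NeRF-style model on a scene, render its training views, and compute dense SSIM maps between each render and the corresponding captured image; these maps serve as the regression targets for the CR-IQA network.

```python
import numpy as np
from skimage.metrics import structural_similarity


def ssim_target(rendered, captured):
    """Dense SSIM map between a NeRF rendering and the captured image.

    rendered, captured: (H, W, 3) float arrays in [0, 1] for the same viewpoint.
    Returns an (H, W) per-pixel SSIM map used as a CR-IQA training target.
    """
    _, ssim_map = structural_similarity(
        rendered, captured, channel_axis=-1, data_range=1.0, full=True)
    return ssim_map.mean(axis=-1)     # average over colour channels


# Toy example with random images; in practice `rendered` comes from the NVS
# data engine and `captured` is the corresponding training photograph.
rendered = np.random.rand(128, 128, 3)
captured = np.random.rand(128, 128, 3)
print(ssim_target(rendered, captured).shape)   # (128, 128)
```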
Results
NVS Quality and View Selection Time
Scene Coverage with MASt3R
Related Research
CrossScore: Towards Multi-View Image Evaluation and Scoring.
Acknowledgement
This research is supported by an ARIA research gift grant from Meta Reality Lab. Yash Bhalgat is supported by EPSRC AIMS CDT EP/S024050/1 and AWS.
BibTeX
@article{wang2024avs,
  title   = {Active View Selector: Fast and Accurate Active View Selection with Cross Reference Image Quality Assessment},
  author  = {Zirui Wang and Yash Bhalgat and Ruining Li and Victor Adrian Prisacariu},
  journal = {arXiv preprint arXiv:},
  year    = {2025}
}