Active View Selector: Fast and Accurate Active View Selection with Cross Reference Image Quality Assessment

Zirui Wang1    Yash Bhalgat2    Ruining Li2    Victor Adrian Prisacariu1

1Active Vision Lab     2Visual Geometry Group

University of Oxford

arXiv · Code (coming soon)
TLDR
  1. Treat the worst rendered view as the next best view.
  2. Use a cross-reference IQA (CR-IQA) network to estimate rendering quality.
  3. Use a lightweight CNN backbone in the CR-IQA network for fast prediction.

Idea

We tackle active view selection in novel view synthesis and 3D reconstruction. Existing methods like FisherRF and ActiveNeRF select the next best view by minimizing uncertainty or maximizing information gain in 3D, but they require specialized designs for different 3D representations and involve complex modelling in 3D space.

Instead, we reframe this as a 2D image quality assessment (IQA) task, selecting views where current renderings have the lowest quality. Since ground-truth images for candidate views are unavailable, full-reference metrics like PSNR and SSIM are inapplicable, while no-reference metrics, such as MUSIQ and MANIQA, lack the essential multi-view context.

Inspired by CrossScore, a recent cross-referencing quality framework, we train a model to predict SSIM within a multi-view setup and use it to guide view selection. Our cross-reference IQA framework achieves substantial quantitative and qualitative improvements across standard benchmarks, is agnostic to the underlying 3D representation, and runs 14-33x faster than previous methods.
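To make the selection rule concrete, below is a minimal sketch in PyTorch. The criqa callable and all names are hypothetical stand-ins for the trained CR-IQA model, assumed here to map a rendering plus reference images to a per-pixel predicted SSIM map.

import torch

def select_next_view(renders, references, criqa):
    # renders: list of (3, H, W) rendered candidate views
    # references: (N, 3, H, W) captured images of the same scene
    scores = []
    for r in renders:
        quality = criqa(r.unsqueeze(0), references)  # (1, H, W) predicted SSIM map
        scores.append(quality.mean().item())
    return int(torch.tensor(scores).argmin())  # worst rendering = next best view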

View selection time in seconds (↓) vs NVS quality measured by PSNR (↑) on the Garden scene from the Mip-NeRF360 dataset. Our method achieves a 14x speedup over the state-of-the-art view selection method FisherRF and a 33x speedup over its batched variant, FisherRF4, while achieving improved NVS quality. Notably, several no-reference IQA-based approaches also emerge as strong baselines for this task.

Introduction

Our goal is to select the next best view from a set of candidate views for active vision algorithms, for example, path planning for robots. These active vision algorithms are then employed to guide downstream applications, such as novel view synthesis, 3D reconstruction, and space exploration.

In this work, we consider the next best view to be the one with the lowest rendering quality among the candidate views. We estimate this quality with a cross-reference IQA (CR-IQA) network, which assesses a query image against multiple reference views of the same scene and therefore does not require ground-truth images.

Method Overview: Our method consists of two main components. Left: a lightweight cross-referencing (CR) image quality assessment (IQA) model that evaluates a rendered image by comparing it to multiple real images from different viewpoints of the same scene, generating a per-pixel quality map. This model is designed for multi-view novel view synthesis (NVS), where conventional metrics like PSNR and SSIM are inapplicable due to the lack of ground truth images for the novel view. After training on outputs from NVS methods (e.g., Gaussian Splatting, Nerfacto, TensoRF) across various scenes, it can be applied directly to new real-world scenes in a feed-forward manner. Right: a Gaussian Splatting (GS)-based active view selection system. Starting from four views, this system iteratively selects the next best view by: (a) training a GS model on the current view set, (b) rendering candidate viewpoints, (c) evaluating these with our CR-IQA model, (d) selecting the view with the lowest quality, and (e) repeating the process.
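The selection loop on the right of the figure can be summarized with the following sketch; train_gs, render, and criqa are placeholders for a Gaussian Splatting trainer, a renderer, and our CR-IQA model, not the released API.

def active_view_selection(initial_views, candidate_views, train_gs, render, criqa, n_rounds):
    # criqa maps (rendering, reference views) to a per-pixel quality map.
    selected = list(initial_views)                    # start from four views
    candidates = list(candidate_views)
    for _ in range(n_rounds):
        gs = train_gs(selected)                       # (a) fit GS to the current set
        renders = [render(gs, v) for v in candidates]                  # (b) render candidates
        scores = [float(criqa(r, selected).mean()) for r in renders]  # (c) score with CR-IQA
        worst = min(range(len(scores)), key=scores.__getitem__)       # (d) lowest quality wins
        selected.append(candidates.pop(worst))        # (e) add the view and repeat
    return selected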

Network

The original cross-reference IQA network, CrossScore, uses a DINOv2 backbone. While effective, it is too heavy for active vision applications, which require more responsive quality prediction. We therefore additionally integrate a lightweight CNN backbone, RepViT, into the CR-IQA network for fast quality prediction.
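For illustration, a RepViT feature extractor can be instantiated through timm; the model name and the features_only flag below are our assumption about timm's public API, not the project's released code.

import timm
import torch

backbone = timm.create_model(
    "repvit_m1_0",       # a lightweight RepViT variant (assumed timm model name)
    pretrained=True,
    features_only=True,  # return intermediate feature maps for the IQA head
)
x = torch.randn(1, 3, 224, 224)  # a rendered query image
feats = backbone(x)              # multi-scale features consumed by the CR-IQA head
print([f.shape for f in feats])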

Self-supervised Training

We follow the same self-supervised training process as CrossScore. We leverage existing NVS systems and abundant multi-view datasets to generate SSIM maps as training labels. Specifically, we select Neural Radiance Field (NeRF)-style NVS systems as our data engine. Given a set of images, a NeRF recovers a neural representation of the scene by iteratively reconstructing the given image set with photometric losses.
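As a concrete illustration of this label-generation step, the sketch below computes a per-pixel SSIM map between an NVS rendering and the corresponding captured training image. The use of scikit-image is our choice here; the training process only requires SSIM maps as targets.

import numpy as np
from skimage.metrics import structural_similarity

def ssim_map(rendered: np.ndarray, captured: np.ndarray) -> np.ndarray:
    # Both images are (H, W, 3) float arrays in [0, 1].
    _, full_map = structural_similarity(
        rendered, captured,
        channel_axis=-1, data_range=1.0, full=True,  # full=True also returns the map
    )
    return full_map.mean(axis=-1)  # average over channels -> (H, W) training target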

Results

NVS Quality and View Selection Time

View selection time (left) and test split PSNR (right) on the Garden scene from the Mip-NeRF360 dataset. Our method provides the highest NVS quality whilst being 14x faster than the state-of-the-art model FisherRF and 33x faster than the batched version FisherRF4. Strong baselines are highlighted with heavier line weights. Note that the upper part of the time axis, from 1 to 20 seconds, is plotted on a log scale.

Scene Coverage with MASt3R

Visualization of Scene Coverage: We compare reconstruction errors for each view selection strategy. Errors are visualized as distances from each point in the “complete” point cloud to its nearest neighbor in the reconstruction produced by the subset of views selected by each method. Blue areas indicate low reconstruction error, while red areas indicate higher errors due to limited view coverage.
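The error computation described above reduces to a nearest-neighbour query. A minimal sketch using SciPy's KD-tree follows; function and variable names are ours, not the paper's released code.

import numpy as np
from scipy.spatial import cKDTree

def coverage_errors(complete_pts: np.ndarray, subset_pts: np.ndarray) -> np.ndarray:
    # complete_pts: (N, 3) "complete" point cloud; subset_pts: (M, 3) reconstruction
    # from the views selected by a given strategy. Returns one distance per point.
    tree = cKDTree(subset_pts)
    dists, _ = tree.query(complete_pts, k=1)  # nearest-neighbour distance per point
    return dists  # colour-map these (blue = low, red = high) to reproduce the figure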

Related Research

CrossScore: Towards Multi-View Image Evaluation and Scoring.

Acknowledgement

This research is supported by an ARIA research gift grant from Meta Reality Lab. Yash Bhalgat is supported by EPSRC AIMS CDT EP/S024050/1 and AWS.

BibTeX

@article{wang2024avs,
  title={Active View Selector: Fast and Accurate Active View Selection with Cross Reference Image Quality Assessment},
  author={Wang, Zirui and Bhalgat, Yash and Li, Ruining and Prisacariu, Victor Adrian},
  journal={arXiv preprint arXiv:},
  year={2025}
}