Dimensions of motion:
Monocular prediction through flow subspaces
3DV 2022 Oral

Richard Strong Bowen*1,2
Richard Tucker*1
Ramin Zabih1,2
Noah Snavely1,2
1 Google Research, 2 Cornell Tech

Abstract

We introduce a way to learn to estimate a scene representation from a single image by predicting, for each training example, a low-dimensional subspace of optical flow that encompasses the variety of possible camera and object movement. Supervision is provided by a novel loss that measures the distance between this predicted flow subspace and an observed optical flow. This provides a new approach to learning scene representation tasks, such as monocular depth prediction or instance segmentation, in an unsupervised fashion using in-the-wild input videos without requiring camera poses, intrinsics, or an explicit multi-view stereo step. We evaluate our method in multiple settings, including an indoor depth prediction task where it achieves performance comparable to recent methods trained with more supervision.
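To make the idea of a distance between a flow subspace and an observed flow concrete, here is a minimal sketch. It assumes the subspace is given as a small set of basis flow fields and takes the distance to be the squared residual of a least-squares projection; the exact loss, normalization, and basis construction used in the paper may differ. The function name `subspace_distance` and the array shapes are illustrative choices, not taken from the paper.

```python
import numpy as np

def subspace_distance(basis, flow):
    """Squared distance from an observed flow to the span of basis flows.

    basis: (k, H, W, 2) array of k basis flow fields (assumed layout).
    flow:  (H, W, 2) observed optical flow.
    Returns (squared residual, least-squares coefficients).
    """
    k = basis.shape[0]
    A = basis.reshape(k, -1).T   # (2*H*W, k): one column per basis flow
    b = flow.reshape(-1)         # (2*H*W,): flattened observed flow
    # Best coefficients for approximating the flow within the subspace.
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    residual = b - A @ coeffs    # part of the flow the subspace cannot explain
    return float(residual @ residual), coeffs

# Toy usage: a flow built from the basis itself lies inside the subspace,
# so its distance should be numerically close to zero.
rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 4, 5, 2))
flow_in_span = 2.0 * basis[0] - 3.0 * basis[1]
dist, coeffs = subspace_distance(basis, flow_in_span)
print(dist, coeffs)
```

A flow that the basis cannot explain leaves a larger residual, which is what would drive a network to predict a subspace covering the observed motion.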

Embedding examples

Each row of the following table shows an input image (overlaid with a few manually chosen seed points) and the predicted outputs (disparity and embedding) from our network. The rightmost column then shows the segmentation induced by coloring each pixel according to which of the seed points is closest to it in bilateral embedding space. (See Figures 5–6 and Section 4.2 in the paper.)
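The nearest-seed coloring described above can be sketched as follows. This assumes that "bilateral embedding space" means the per-pixel embedding concatenated with scaled pixel coordinates; that definition, the `spatial_weight` parameter, and the function name `induced_segmentation` are assumptions made for illustration, and Section 4.2 of the paper gives the actual formulation.

```python
import numpy as np

def induced_segmentation(embedding, seeds, spatial_weight=1.0):
    """Label each pixel with the index of its nearest seed point.

    embedding: (H, W, D) per-pixel embedding predicted by a network.
    seeds: list of (row, col) seed pixel positions.
    spatial_weight: hypothetical scale applied to normalized pixel coordinates.
    """
    H, W, _ = embedding.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Normalized pixel coordinates, appended to the embedding as extra features.
    coords = np.stack([rows / H, cols / W], axis=-1) * spatial_weight
    feats = np.concatenate([embedding, coords], axis=-1)      # (H, W, D+2)
    seed_feats = np.stack([feats[r, c] for r, c in seeds])    # (S, D+2)
    # Squared distance from every pixel to every seed in the joint space.
    diffs = feats[:, :, None, :] - seed_feats[None, None]     # (H, W, S, D+2)
    return np.argmin((diffs ** 2).sum(axis=-1), axis=-1)      # (H, W) labels

# Toy usage: a two-region embedding with one seed placed in each region.
embedding = np.zeros((4, 6, 1))
embedding[:, 3:, 0] = 1.0
labels = induced_segmentation(embedding, [(0, 0), (0, 5)], spatial_weight=0.1)
print(labels)
```

Coloring each pixel by its label reproduces the kind of induced segmentation shown in the rightmost column; a larger `spatial_weight` makes regions favor seeds that are close in the image rather than close in embedding.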

Table columns: Input and seed points | Disparity | Embedding PCA (dimensions 0–2) | Embedding PCA (dimensions 3–5) | Induced segmentation
(Images used under Creative Commons license from YouTube user POPtravel.)

BibTeX

@inproceedings{bowen2022dimensions,
title = {Dimensions of Motion: Monocular Prediction through Flow Subspaces},
author = {Richard Strong Bowen and Richard Tucker and Ramin Zabih and Noah Snavely},
booktitle = {Proceedings of the International Conference on {3D} Vision (3DV)},
year = {2022}
}

*equal authorial contribution