
BeNeRF: Neural Radiance Fields from a Single Blurry Image and Event Stream

🌈 Abstract

The paper proposes a method to jointly recover the underlying 3D scene representation and camera motion trajectory from a single blurry image and its corresponding event stream. The method is formulated within the framework of neural radiance fields (NeRF). Extensive experimental evaluations on both synthetic and real datasets demonstrate that the proposed method outperforms prior works, even those that require multi-view images and longer event streams.

🙋 Q&A

[01] Neural Implicit Representation

1. How does the paper represent the 3D scene using neural implicit representation?

  • The paper adopts a Multi-Layer Perceptron (MLP) to represent the 3D scene, as in the original NeRF [33]; a minimal sketch of such a mapping follows this list.
  • The scene model is represented by a learnable mapping function f(x, d) that takes a 3D point coordinate x and a viewing direction d as input, and outputs the corresponding volume density and color.
  • The intensity of a pixel is obtained by volume rendering: 3D points are sampled along the ray that originates at the camera center and passes through the pixel, and the MLP is queried at those samples.
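
Below is a minimal PyTorch sketch of such a mapping f(x, d) → (σ, c). The layer sizes, module names, and the omission of positional encoding are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TinySceneMLP(nn.Module):
    """Toy scene network mapping a 3D point x and view direction d to (density, color)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)              # volume density
        self.color_head = nn.Sequential(                    # view-dependent color
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x, d):
        h = self.trunk(x)                                   # per-point features
        sigma = torch.relu(self.sigma_head(h))              # density is non-negative
        rgb = self.color_head(torch.cat([h, d], dim=-1))    # color in [0, 1]
        return sigma, rgb
```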

2. How is the volume rendering formulated?

  • The volume rendering is formulated as C(r) = ∫_{t_n}^{t_f} T(t) σ(x(t)) c(x(t), d) dt, where T(t) = exp(−∫_{t_n}^{t} σ(x(s)) ds) is the transmittance and t_n, t_f are the near and far bounds of the ray.
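
In practice this integral is approximated by numerical quadrature over samples along the ray. Below is a minimal sketch of the standard NeRF-style discretization; the function and variable names are assumptions, not the paper's code.

```python
import torch

def render_ray(sigma, rgb, t_vals):
    """Discretized volume rendering of one ray.
    sigma: (N,) densities, rgb: (N, 3) colors, t_vals: (N,) sample depths."""
    deltas = t_vals[1:] - t_vals[:-1]                        # spacing between samples
    deltas = torch.cat([deltas, deltas[-1:]], dim=0)         # pad the last interval
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # per-sample opacity
    # transmittance T_i = prod_{j < i} (1 - alpha_j)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1]]), dim=0)
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(dim=0)               # composited pixel color
```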

[02] Camera Motion Trajectory Modeling

1. How does the paper model the camera motion trajectory?

  • The paper uses a differentiable cubic B-Spline in SE(3) space to model the continuous camera motion trajectory.
  • The spline is represented by a set of learnable control knots, which define the transformation matrices from the camera coordinate frame to the world frame.
  • The camera pose at any time can then be computed from the control knots using the De Boor-Cox formula; a generic sketch of this interpolation follows this list.
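
Below is a generic NumPy/SciPy sketch of cumulative cubic B-spline interpolation on SE(3) (the "Spline Fusion" style formulation). It illustrates the idea but is not claimed to match the paper's implementation.

```python
import numpy as np
from scipy.linalg import expm, logm

# Cumulative basis matrix of a uniform cubic B-spline (Spline Fusion formulation).
C = (1.0 / 6.0) * np.array([[6.0, 0.0,  0.0,  0.0],
                            [5.0, 3.0, -3.0,  1.0],
                            [1.0, 3.0,  3.0, -2.0],
                            [0.0, 0.0,  0.0,  1.0]])

def spline_pose(knots, u):
    """Interpolate a 4x4 camera-to-world pose at normalized time u in [0, 1)
    from four consecutive 4x4 control knots knots[0..3]."""
    b = C @ np.array([1.0, u, u ** 2, u ** 3])          # cumulative basis weights
    T = knots[0].copy()
    for j in range(1, 4):
        # relative motion between consecutive knots, scaled in the Lie algebra se(3)
        omega = logm(np.linalg.inv(knots[j - 1]) @ knots[j]).real
        T = T @ expm(b[j] * omega)
    return T
```

Because the pose is built from matrix exponentials of scaled relative motions, it varies smoothly with both the time u and the control knots, which is what makes joint optimization of the trajectory possible.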

2. What are the benefits of using a cubic B-Spline to represent the camera motion?

  • A cubic B-Spline can model complex continuous camera motions better than linear interpolation.
  • The differentiable nature of the B-Spline allows the camera motion trajectory to be jointly optimized with the neural scene representation.

[03] Blurry Image Formation Model

1. How does the paper model the formation of a blurry image?

  • The paper models the blurry image formation as the integration of virtual sharp images rendered from the neural scene representation along the camera trajectory during the exposure time.
  • This blurry image formation model is differentiable with respect to the parameters of the NeRF and of the camera motion trajectory; a minimal sketch of the discretized integration follows this list.
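
Below is a minimal sketch of this discretized integration; `render_sharp` is a placeholder for rendering a virtual sharp image from the NeRF at a given camera pose, not the paper's API.

```python
import numpy as np

def synthesize_blurry(render_sharp, poses):
    """Approximate the blurry image as the average of virtual sharp images rendered
    at camera poses sampled along the trajectory within the exposure time."""
    sharp_images = [render_sharp(T) for T in poses]   # virtual sharp images
    return np.mean(sharp_images, axis=0)              # discretized integral over exposure
```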

[04] Event Data Formation Model

1. How does the paper relate the NeRF representation with the event stream?

  • The paper accumulates the real measured events within a time interval into an image, and then normalizes it to eliminate the effect of the unknown contrast threshold.
  • Given the interpolated camera poses from the cubic B-Spline, the paper can render two gray-scale images from the NeRF and compute the synthesized accumulated event image.
  • The synthesized event image is likewise differentiable with respect to the parameters of the cubic B-Spline and the NeRF; a sketch of this event synthesis follows this list.
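
Below is a sketch of both sides of this comparison. Normalizing by the maximum absolute value is an assumption about the normalization scheme, and the helper names are hypothetical.

```python
import torch

def synth_event_image(gray_t1, gray_t2, eps=1e-6):
    """Synthesized accumulated event image from two rendered gray-scale images:
    the log-intensity difference, normalized so the unknown contrast threshold cancels."""
    diff = torch.log(gray_t2 + eps) - torch.log(gray_t1 + eps)
    return diff / (diff.abs().max() + eps)

def accumulate_events(xs, ys, polarities, height, width):
    """Accumulate measured events (pixel coordinates, +/-1 polarities) within a time
    interval into an image, then normalize it in the same way."""
    img = torch.zeros(height, width)
    img.index_put_((ys.long(), xs.long()), polarities.float(), accumulate=True)
    return img / (img.abs().max() + 1e-6)
```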

[05] Loss Functions and Optimization

1. What loss functions does the paper use to optimize the NeRF and camera motion?

  • The paper minimizes the sum of a photometric loss for the blurry image and an event loss for the accumulated events.
  • Both losses are defined as the L1 difference between the real measurements and the synthesized data; a sketch of the combined loss follows this list.
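
A minimal sketch of the combined objective is shown below; the weight `lam` is an illustrative assumption (the paper describes the objective as a sum of the two terms).

```python
import torch

def total_loss(blurry_pred, blurry_gt, event_pred, event_gt, lam=1.0):
    """Sum of an L1 photometric loss on the blurry image and an L1 loss on the
    accumulated event image."""
    photo_loss = torch.abs(blurry_pred - blurry_gt).mean()
    event_loss = torch.abs(event_pred - event_gt).mean()
    return photo_loss + lam * event_loss
```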

2. How are the NeRF and camera motion trajectory jointly optimized?

  • The paper uses two separate Adam optimizers to optimize the scene model (NeRF) and the camera motion (cubic B-Spline), respectively.
  • The optimization proceeds by minimizing the combined photometric and event losses; a sketch of this two-optimizer setup follows this list.
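
Below is a sketch of the two-optimizer setup. The toy scene MLP, the 6-DoF knot parameterization, and the learning rates are all illustrative assumptions.

```python
import torch

# Illustrative stand-ins: a toy scene MLP and learnable 6-DoF control knots
# (axis-angle + translation); neither matches the paper's actual parameterization.
scene_mlp = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 4))
spline_knots = torch.nn.Parameter(torch.zeros(4, 6))

# Two separate Adam optimizers, one per parameter group (learning rates assumed).
opt_scene = torch.optim.Adam(scene_mlp.parameters(), lr=5e-4)
opt_motion = torch.optim.Adam([spline_knots], lr=1e-3)

def training_step(loss):
    """One joint update: a single backward pass, then both optimizers step."""
    opt_scene.zero_grad()
    opt_motion.zero_grad()
    loss.backward()
    opt_scene.step()
    opt_motion.step()
    return loss.item()
```

Keeping the two parameter groups in separate optimizers allows the scene model and the camera trajectory to use different learning rates while still being updated from the same combined loss.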

[06] Experimental Evaluation

1. What are the key findings from the quantitative evaluations?

  • The paper's method significantly outperforms prior state-of-the-art single image deblurring methods, as well as event-enhanced single image deblurring methods.
  • Despite using only a single blurred image and event stream, the paper's method achieves comparable performance to NeRF-based methods that require multi-view images and longer event streams.
  • On real-world datasets, the paper's method outperforms all prior methods, including those trained with multi-view images.

2. What are the key findings from the qualitative evaluations?

  • The paper's method is able to recover high quality latent sharp images and high frame-rate video from a single blurry image, without any generalization issues.
  • The paper's method outperforms prior learning-based methods, which struggle to generalize to domain-shifted images.
  • The paper's joint optimization of the camera motion and neural scene representation is crucial for achieving superior performance on real-world noisy datasets.