On Scaling Up 3D Gaussian Splatting Training
Abstract
The paper introduces Grendel, a distributed system designed to partition 3D Gaussian Splatting (3DGS) parameters and parallelize computation across multiple GPUs. 3DGS is a popular technique for 3D reconstruction, but training currently runs on a single GPU, whose memory limits make it difficult to handle high-resolution and large-scale 3D reconstruction tasks. Grendel addresses this by:
 Employing a mixed parallelism approach, using Gaussian-wise distribution for Gaussian transformation and pixel-wise distribution for image rendering and loss computation.
 Leveraging sparse all-to-all communication to transfer necessary Gaussians to pixel partitions and performing dynamic load balancing.
 Supporting batched training with multiple views, unlike existing 3DGS systems that train using one camera view image at a time.
 Exploring optimization hyperparameter scaling strategies and finding that a simple sqrt(batch_size) scaling rule is highly effective.
Evaluations on large-scale, high-resolution scenes show that Grendel can enhance rendering quality by scaling up 3DGS parameters across multiple GPUs. For example, on the "Rubble" dataset, Grendel achieves a test PSNR of 27.28 by distributing 40.4 million Gaussians across 16 GPUs, compared to a PSNR of 26.28 using 11.2 million Gaussians on a single GPU.
Q&A
[01] Mixed Parallelism Training
1. How does Grendel distribute work under its mixed parallelism scheme? Grendel uses Gaussian-wise distribution for the Gaussian transformation step, where each GPU operates on a disjoint subset of Gaussians. For the image rendering and loss computation step, Grendel uses pixel-wise distribution, where each GPU operates on a disjoint subset of pixels.
2. How does Grendel leverage spatial locality to reduce communication when transitioning between Gaussian-wise and pixel-wise distribution? Grendel exploits the spatial locality of 3DGS: each pixel requires only a small subset of all 3D Gaussians. It uses sparse all-to-all communication to transfer only the necessary Gaussians to each pixel partition, reducing the communication overhead.
3. How does Grendel's design differ from distributed deep neural network training approaches like FSDP? Unlike weight sharding in FSDP, Gaussian-wise distribution in Grendel is used not just for storage but also for computation (the Gaussian transformation). Additionally, Grendel uses sparse all-to-all communication to transfer only relevant Gaussians, unlike the dense all-gather communication in FSDP.
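The difference between sparse all-to-all and dense all-gather can be illustrated with a small single-process simulation. This is a minimal sketch, not Grendel's implementation: the names `ProjectedGaussian`, `owner_of`, `row_strips`, and `build_send_lists` are hypothetical, ownership is assumed to be round-robin, pixels are assumed to be split into horizontal row strips, and a Gaussian's footprint is approximated by a vertical interval around its projected center.

```python
from dataclasses import dataclass


@dataclass
class ProjectedGaussian:
    gid: int       # global Gaussian index
    y: float       # projected center row, in pixels
    radius: float  # approximate screen-space extent, in pixels


def owner_of(gid, num_gpus):
    """Gaussian-wise distribution: round-robin ownership (an assumption)."""
    return gid % num_gpus


def row_strips(image_height, num_gpus):
    """Pixel-wise distribution: one contiguous row strip [start, end) per GPU."""
    base, extra = divmod(image_height, num_gpus)
    strips, start = [], 0
    for rank in range(num_gpus):
        end = start + base + (1 if rank < extra else 0)
        strips.append((start, end))
        start = end
    return strips


def build_send_lists(gaussians, image_height, num_gpus):
    """send[src][dst] lists the Gaussians owned by src that touch dst's pixels."""
    strips = row_strips(image_height, num_gpus)
    send = [[[] for _ in range(num_gpus)] for _ in range(num_gpus)]
    for g in gaussians:
        src = owner_of(g.gid, num_gpus)
        lo, hi = g.y - g.radius, g.y + g.radius
        for dst, (start, end) in enumerate(strips):
            if hi >= start and lo < end:  # footprint overlaps this strip
                send[src][dst].append(g.gid)
    return send


# Eight small Gaussians spread down an 80-row image, shared by 4 GPUs.
gaussians = [ProjectedGaussian(gid=i, y=i * 10 + 5.0, radius=3.0) for i in range(8)]
send = build_send_lists(gaussians, image_height=80, num_gpus=4)

sparse_volume = sum(len(ids) for row in send for ids in row)
dense_volume = len(gaussians) * 4  # all-gather: every GPU receives every Gaussian
print(sparse_volume, dense_volume)
```

In this toy scene each Gaussian overlaps exactly one strip, so the sparse exchange moves 8 Gaussians where an all-gather would move 32. Real scenes have Gaussians that straddle partition boundaries, so the saving depends on how localized the footprints are.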
[02] Iterative Workload Rebalancing
1. How does Grendel address the challenge of dynamic and unbalanced workloads in 3DGS training? Grendel employs two strategies for workload rebalancing:

Pixel-wise Distribution Rebalancing: Grendel measures the running time of each pixel block assigned to a GPU, computes the average per-pixel computation time, and uses this to adaptively reassign pixels to different GPUs in subsequent iterations to balance the workload.

Gaussian-wise Distribution Rebalancing: As new Gaussians are added during the densification process, Grendel redistributes the 3D Gaussians across GPUs after every few densification steps to restore the uniform distribution.
2. Why is a fixed or uniform pixel distribution not sufficient to handle the dynamic and unbalanced workloads in 3DGS training? The computational load of rendering a pixel varies across space (different pixels) and time (different training iterations) due to the changing density, position, shape, and opacity of Gaussians. A fixed or uniform pixel distribution cannot guarantee balanced workloads, necessitating Grendel's adaptive pixel distribution strategy.
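The pixel-wise rebalancing idea can be sketched as follows. This is an illustrative approximation rather than Grendel's actual algorithm: the function names are hypothetical, each GPU is assumed to own a contiguous range of pixel blocks, the measured time of a GPU is spread evenly over its blocks as the cost estimate, and the sketch assumes there are at least as many blocks as GPUs.

```python
def per_block_cost_estimates(assignments, measured_times):
    """Spread each GPU's measured time evenly over the blocks it rendered."""
    num_blocks = sum(end - start for start, end in assignments)
    cost = [0.0] * num_blocks
    for (start, end), elapsed in zip(assignments, measured_times):
        for block in range(start, end):
            cost[block] = elapsed / (end - start)
    return cost


def rebalance(cost, num_gpus):
    """Split blocks into contiguous ranges with roughly equal estimated cost."""
    total, n = sum(cost), len(cost)
    assignments, start, running = [], 0, 0.0
    for block in range(n):
        running += cost[block]
        closed = len(assignments)
        ranges_left_after = num_gpus - closed - 1
        if ranges_left_after == 0:
            break  # the last GPU takes every remaining block
        blocks_left_after = n - (block + 1)
        reached_share = running >= total * (closed + 1) / num_gpus
        # Force a cut when the remaining blocks are just enough to give
        # every remaining GPU one block, so no range ends up empty.
        must_close = blocks_left_after == ranges_left_after
        if reached_share or must_close:
            assignments.append((start, block + 1))
            start = block + 1
    assignments.append((start, n))
    return assignments


# Previous iteration: 8 blocks split evenly, but GPU 0 took 6.0 ms and GPU 1 took 2.0 ms.
old_assignments = [(0, 4), (4, 8)]
cost = per_block_cost_estimates(old_assignments, measured_times=[6.0, 2.0])
new_assignments = rebalance(cost, num_gpus=2)
print(new_assignments)
```

With these numbers the slow GPU gives up one block, and the estimated loads move from 6.0 and 2.0 to 4.5 and 3.5. Because the costs change as Gaussians move and densify, the estimate has to be refreshed across iterations rather than computed once.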
[03] Scaling Hyperparameters for Batched Training
1. What is the key hypothesis behind Grendel's proposed hyperparameter scaling rules for batched 3DGS training? Grendel proposes the Independent Gradients Hypothesis, which assumes that the gradients received by 3D Gaussian parameters from each camera view are independent of gradients induced by other views. This allows Grendel to derive scaling rules for the Adam optimizer's learning rate and momentum based on the batch size.
2. What are the proposed scaling rules, and how do they differ from linear scaling commonly used for neural networks? Grendel proposes a square-root learning rate scaling (learning_rate = original_learning_rate * sqrt(batch_size)) and an exponential momentum scaling (momentum = original_momentum ^ batch_size, applied to Adam's momentum coefficients) for batched 3DGS training. These rules differ from the linear learning rate scaling used for neural networks, as they account for the unique optimization dynamics of 3DGS.
3. How does Grendel empirically validate the effectiveness of the proposed hyperparameter scaling rules? Grendel trains the "Rubble" scene to 15,000 iterations with a batch size of 1, then resets the optimizer and continues training with different batch sizes. It compares the training trajectory and update similarity when using the proposed scaling rules versus alternative scaling strategies, demonstrating the effectiveness of the square-root learning rate and exponential momentum scaling.
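As a concrete illustration, the scaling rules can be written as a small helper. This is a sketch under stated assumptions: square-root scaling is taken to multiply the batch-size-1 learning rate by sqrt(batch_size), exponential momentum scaling is taken to raise each Adam beta to the power batch_size, and the function name and the example base values (1.6e-4, 0.9, 0.999) are hypothetical, not figures from the text.

```python
import math


def scale_adam_hyperparameters(lr, beta1, beta2, batch_size):
    """Return (lr, beta1, beta2) adjusted from batch size 1 to batch_size."""
    scaled_lr = lr * math.sqrt(batch_size)   # square-root learning rate scaling
    scaled_beta1 = beta1 ** batch_size       # exponential momentum scaling
    scaled_beta2 = beta2 ** batch_size
    return scaled_lr, scaled_beta1, scaled_beta2


# Hypothetical batch-size-1 settings, rescaled for a batch of 4 views.
lr, beta1, beta2 = scale_adam_hyperparameters(1.6e-4, 0.9, 0.999, batch_size=4)
print(lr, beta1, beta2)
```

Raising a beta to the power batch_size makes one batched step forget past gradients about as much as batch_size consecutive single-view steps would, which is why the momentum rule is exponential rather than linear.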
[04] Performance and Memory Scaling
1. How does Grendel's performance scale with the number of GPUs and batch size? Grendel's evaluations show that both increasing the number of GPUs and increasing the batch size lead to significant performance improvements. For the Rubble scene, Grendel's throughput increases from 5.55 images per second (4 GPUs, batch size 1) to 38.03 images per second (32 GPUs, batch size 64).
2. How does Grendel's memory scaling enable the use of more Gaussians to improve reconstruction quality? Grendel's ability to distribute 3DGS parameters across multiple GPUs allows it to use a larger number of Gaussians to represent scenes. Evaluations on the Rubble, MatrixCity Block_All, and Bicycle datasets show that increasing the number of Gaussians, enabled by Grendel's multi-GPU scaling, leads to significant improvements in reconstruction quality (PSNR, SSIM, LPIPS).
3. What are the key findings from Grendel's ablation study on the importance of its dynamic load balancing techniques? Grendel's ablation study shows that its dynamic load balancing techniques, both for pixel-wise distribution and Gaussian-wise distribution, are crucial for achieving high performance across various types and scales of scenes. Without these techniques, Grendel's training throughput is significantly lower, demonstrating the importance of addressing the dynamic and unbalanced workloads in 3DGS training.
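To put the Rubble throughput figures in perspective, the short calculation below compares the measured speedup with the increase in GPU count. It uses only the two data points quoted above. Since the batch size also changes between them (1 versus 64), the ratio is a rough indicator and not a pure strong-scaling efficiency.

```python
# Figures quoted in the text for the Rubble scene.
base_gpus, base_throughput = 4, 5.55       # images/s at batch size 1
large_gpus, large_throughput = 32, 38.03   # images/s at batch size 64

speedup = large_throughput / base_throughput  # about 6.85x
gpu_ratio = large_gpus / base_gpus            # 8x more GPUs
efficiency = speedup / gpu_ratio              # about 0.86 of linear scaling

print(f"speedup: {speedup:.2f}x on {gpu_ratio:.0f}x the GPUs "
      f"({efficiency:.0%} of linear)")
```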