This paper introduces a novel pipeline for generating large-scale, highly realistic, and automatically labeled datasets for computer vision tasks in robotic environments. Our approach addresses two critical challenges: the domain gap between synthetic and real-world imagery, and the time-consuming bottleneck of manual annotation. We leverage 3D Gaussian Splatting (3DGS) to create photorealistic representations of the operational environment and objects. These assets are then placed in a game engine, where physics simulation produces natural object arrangements. A novel two-pass rendering technique combines the realism of the splats with a shadow map generated from proxy meshes; this map is algorithmically composited with the appearance render to add both physically plausible shadows and subtle highlights, significantly enhancing realism. Pixel-perfect segmentation masks are generated automatically and formatted for direct use with object detection models such as YOLO. Our experiments show that a hybrid training strategy, combining a small set of real images with a large volume of our synthetic data, yields the best detection and segmentation performance, confirming it as an effective strategy for efficiently training robust and accurate models.
An overview of our end-to-end pipeline, divided into "Asset Acquisition and Preparation" and "Synthetic Scene Generation and Rendering" stages.
A key innovation of our method is a two-pass hybrid rendering approach. For each frame, we generate a photorealistic render of the Gaussian Splats (the "Appearance pass") and a separate "Shadow pass" rendered from simple proxy meshes of the objects. This shadow map is algorithmically processed and composited with the appearance render to create physically plausible soft shadows and highlights, significantly narrowing the domain gap.
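To make the compositing step concrete, the following is a minimal sketch of one way such a pass could be blended. The function name `composite_shadow_pass` and the blur/strength parameters are illustrative assumptions, not the paper's exact implementation: the shadow map is softened with a Gaussian blur, then used to darken shadowed pixels and subtly brighten directly lit ones.

```python
import cv2
import numpy as np

def composite_shadow_pass(appearance: np.ndarray,
                          shadow_map: np.ndarray,
                          blur_ksize: int = 21,       # must be odd; illustrative value
                          shadow_strength: float = 0.45,   # assumed, not from the paper
                          highlight_strength: float = 0.15) -> np.ndarray:
    """Blend a proxy-mesh shadow map into a splat render (hypothetical sketch).

    appearance: HxWx3 uint8 render of the Gaussian Splats ("Appearance pass").
    shadow_map: HxW float32 in [0, 1]; 1 = fully shadowed, 0 = fully lit.
    """
    # Soften the hard proxy-mesh shadow edges so they read as plausible soft shadows.
    soft = cv2.GaussianBlur(shadow_map, (blur_ksize, blur_ksize), 0)

    # Per-pixel gain: below 1 in shadowed regions, slightly above 1 in lit regions,
    # which yields both the darkening and the subtle highlights described above.
    gain = 1.0 - shadow_strength * soft + highlight_strength * (1.0 - soft)

    out = appearance.astype(np.float32) * gain[..., None]
    return np.clip(out, 0, 255).astype(np.uint8)
```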
Leveraging the controlled environment of the game engine, we generate pixel-perfect labels automatically. We render each object's mesh with a unique solid color to an in-memory buffer, creating a multi-colored "ID map". By processing this map, we can extract the precise contour of each visible object part, generating perfect segmentation masks without manual labor.
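A minimal sketch of how such an ID map can be turned into YOLO-style segmentation labels with OpenCV; the function name, the `color_to_class` mapping, and the speckle threshold are assumptions for illustration. Each unique ID color is isolated into a binary mask, visible contours are extracted, and their vertices are normalized to the [0, 1] coordinates YOLO segmentation labels expect.

```python
import cv2
import numpy as np

def id_map_to_yolo_labels(id_map: np.ndarray, color_to_class: dict) -> list[str]:
    """Convert a flat-colored ID map (HxWx3 uint8) into YOLO segmentation lines.

    color_to_class: maps a color tuple (in the image's channel order) to a class index.
    Returns lines of the form "class x1 y1 x2 y2 ..." with coordinates in [0, 1].
    """
    h, w = id_map.shape[:2]
    lines = []
    for color, cls in color_to_class.items():
        # Binary mask of every pixel carrying this object's unique ID color.
        c = np.array(color, dtype=np.uint8)
        mask = cv2.inRange(id_map, c, c)
        # Outer contours only: each one is a visible part of the object.
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for contour in contours:
            if cv2.contourArea(contour) < 10:  # drop speckle (threshold is illustrative)
                continue
            pts = contour.reshape(-1, 2).astype(np.float32)
            pts[:, 0] /= w  # normalize x to [0, 1]
            pts[:, 1] /= h  # normalize y to [0, 1]
            coords = " ".join(f"{v:.6f}" for v in pts.flatten())
            lines.append(f"{cls} {coords}")
    return lines
```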
Our experiments show that a hybrid training strategy, combining a small set of real images with a large volume of our synthetic data, consistently achieves the highest performance across all model sizes and metrics. This approach successfully leverages the high domain fidelity of real images and the vast variation in object poses, lighting, and backgrounds provided by our synthetic dataset, confirming it as an optimal strategy for training robust and accurate models.
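As a practical illustration of assembling such a hybrid split, the sketch below mixes a small real set with a large synthetic set into a single train list; the counts, directory layout, and file extensions are assumptions, not the paper's reported mixing ratio. YOLO-style trainers (e.g., Ultralytics) accept a plain text file of image paths as a dataset split.

```python
import random
from pathlib import Path

def write_hybrid_split(real_dir: str, synth_dir: str, out_file: str,
                       n_real: int = 100, n_synth: int = 10000,
                       seed: int = 0) -> None:
    """Write a train list mixing real and synthetic images (hypothetical sketch)."""
    rng = random.Random(seed)
    real = sorted(Path(real_dir).glob("*.jpg"))    # small real set
    synth = sorted(Path(synth_dir).glob("*.png"))  # large synthetic set
    picked = (rng.sample(real, min(n_real, len(real))) +
              rng.sample(synth, min(n_synth, len(synth))))
    rng.shuffle(picked)  # interleave the two sources
    Path(out_file).write_text("\n".join(str(p.resolve()) for p in picked))
```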
@article{nizeniec2025splatting,
author = {Niżeniec, Patryk and Iwanowski, Marcin},
title = {Computer vision training dataset generation for robotic environments using Gaussian splatting},
journal = {N/A},
year = {2025},
}