Empirical Analysis of Sim-and-Real Cotraining for Diffusion Policies

Real Only Policy

Policies trained on a limited number of real-world demonstrations perform poorly.

Real demos: 50, Sim demos: 0
Success Rate: 50% (10/20)

Real demos: 10, Sim demos: 0
Success Rate: 10% (2/20)

Sim+Real Policy

Cotraining with simulated data drastically improves performance over real-only policies.

Real demos: 50, Sim demos: 2000
Success Rate: 90% (18/20)

Real demos: 10, Sim demos: 2000
Success Rate: 70% (14/20)

Abstract

Cotraining with demonstration data generated both in simulation and on real hardware has emerged as a promising recipe for scaling imitation learning in robotics. This work seeks to elucidate basic principles of this sim-and-real cotraining to inform simulation design, sim-and-real dataset creation, and policy training.

Our experiments confirm that cotraining with simulated data can dramatically improve performance, especially when real data is limited. We show that these performance gains scale with additional simulated data up to a plateau; adding more real-world data increases this performance ceiling. The results also suggest that reducing physical domain gaps may be more impactful than visual fidelity for non-prehensile or contact-rich tasks. Perhaps surprisingly, we find that some visual gap can help cotraining: binary probes reveal that high-performing policies must learn to distinguish simulated domains from real. We conclude by investigating this nuance and the mechanisms that facilitate positive transfer between sim and real.

Focusing narrowly on the canonical task of planar pushing from pixels allows us to be thorough in our study. In total, our experiments span 50+ real-world policies (evaluated on 1000+ trials) and 250 simulated policies (evaluated on 50,000+ trials).

Real World Experiments

We cotrain and evaluate policies with varying amounts of both real and simulated data with different mixing ratios \(\alpha\). We find that cotraining can improve performance by 2-7x, but is sensitive to the mixing ratio. Scaling up simulated data improves performance and reduces sensitivity to the mixing ratio.
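
As a rough illustration of what a mixing ratio controls, the sketch below draws each training sample from the real dataset with probability \(\alpha\) and from the simulated dataset otherwise. This convention for \(\alpha\), and the names `sample_cotraining_batch`, `real_data`, and `sim_data`, are assumptions made for illustration; this page does not define them, so the paper should be consulted for the exact formulation.

```python
import random

def sample_cotraining_batch(real_data, sim_data, alpha, batch_size, rng=None):
    """Draw a batch in which each item is real with probability alpha.

    Assumed convention: alpha is the probability of sampling the real domain.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = rng or random.Random()
    batch = []
    for _ in range(batch_size):
        # Pick the source domain first, then a uniform sample within it.
        if rng.random() < alpha:
            batch.append(("real", rng.choice(real_data)))
        else:
            batch.append(("sim", rng.choice(sim_data)))
    return batch

# Placeholder datasets sized like one setting on this page: 50 real, 2000 sim demos.
real_data = list(range(50))
sim_data = list(range(2000))
batch = sample_cotraining_batch(
    real_data, sim_data, alpha=0.5, batch_size=1000, rng=random.Random(0)
)
real_fraction = sum(1 for domain, _ in batch if domain == "real") / len(batch)
```

Under this convention, setting \(\alpha\) to 50/2050 would be equivalent to sampling uniformly from the concatenated datasets, while larger values upweight the scarcer real data.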

Simulation Experiments

We scale up our investigation and improve its statistical significance by repeating our real-world experiments in simulation. To emulate sim-and-real cotraining in sim, we create a second target sim environment that serves as a surrogate for the real world.
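
To give a sense of why more evaluation trials help, the sketch below computes a Wilson score interval for two success rates reported on this page: 18/20 from a real-world evaluation and 162/200 from a target-sim evaluation. The Wilson interval is a choice made here for illustration and is not necessarily the statistical analysis used in the paper; `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    )
    return center - half_width, center + half_width

real_lo, real_hi = wilson_interval(18, 20)    # real-world evaluation, 20 trials
sim_lo, sim_hi = wilson_interval(162, 200)    # target-sim evaluation, 200 trials
real_width = real_hi - real_lo
sim_width = sim_hi - sim_lo
```

With ten times as many trials, the interval around the simulated success rate is less than half as wide as the one around the real-world rate, which is the sense in which simulation allows finer comparisons between policies.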

Sim Data

Visuals: No shadows, white light, incorrect camera extrinsics.
Physics: Quasistatic dynamics.
Actions: Optimization-based planner.

Target Sim Data

(mimics the real-world environment)

Visuals: Two shadows, 'carbon arc' light.
Physics: Drake's physics engine.
Actions: Human teleoperation.

Cotrained Policy (Sim+Target Sim)

Target sim demos: 50, Sim demos: 4000
Success Rate: 81% (162/200)
Note: the meshcat visualization does not include the shadows, but they are correctly rendered in the camera inputs.

We find that scaling up simulated data generation improves performance, but this trend eventually plateaus. This suggests that sim data can effectively supplement real data, but cannot replace it; real data is still needed to increase the cotraining ceiling.

Distribution Shift Experiments

We also use our simulation to study the following questions about distribution shifts and sim-and-real cotraining:

Which sim2real gaps matter for cotraining? How does this inform simulator design for data generation?
What are the best practices for cotraining under different types and magnitudes of distribution shifts?

We introduce 6 different types of distribution shifts (spanning visual, task, and physical gaps) into the sim data at different magnitudes and cotrain policies. Four of the six shifts are visualized below.

Overall, larger sim2real gaps along all axes reduce performance. Our results show that:

For non-prehensile manipulation tasks, simulated data generation environments should match the physics as closely as possible. This may be less important for more semantic tasks, like pick-and-place.
Better rendering increases performance; however, perfect rendering is unnecessary (and difficult to achieve).
We observed no clear trend between the optimal mixing ratio and the magnitude of the sim2real gaps.

Analysis

1. High Performing Policies Distinguish Sim vs Real

Binary probing experiments show that high-performing policies implicitly discern sim vs real and adjust their behavior accordingly. We show that this is an important mechanism of sim-and-real cotraining, since the different physics of sim and real require different behaviors and strategies.
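
A binary probe of this kind can be sketched as a linear classifier trained to predict the domain (sim or real) from a frozen policy embedding. The example below is only a minimal stand-in: the embeddings are synthetic Gaussians rather than activations from a trained policy, and the layer probed, the probe architecture, and all function names are assumptions not taken from this page.

```python
import math
import random

def _logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression (a linear probe) with batch gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = max(-30.0, min(30.0, _logit(w, b, x)))  # clamp for numerical safety
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, features, labels):
    hits = sum(int((_logit(w, b, x) > 0.0) == (y == 1)) for x, y in zip(features, labels))
    return hits / len(labels)

def synthetic_embeddings(rng, count, offset, dim=8):
    # Stand-in for frozen policy embeddings; the offset mimics a domain signature.
    return [[rng.gauss(offset, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
# Label 1 = sim, label 0 = real.
train_x = synthetic_embeddings(rng, 200, +1.0) + synthetic_embeddings(rng, 200, -1.0)
train_y = [1] * 200 + [0] * 200
test_x = synthetic_embeddings(rng, 100, +1.0) + synthetic_embeddings(rng, 100, -1.0)
test_y = [1] * 100 + [0] * 100
w, b = train_linear_probe(train_x, train_y)
accuracy = probe_accuracy(w, b, test_x, test_y)
```

High held-out probe accuracy indicates that the domain is linearly decodable from the embedding; accuracy near 50% would suggest the representation does not separate sim from real.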

2. Sim improves data coverage

We find that cotraining improves performance by improving data coverage and preventing compounding errors from missing states in the real-world dataset. Interestingly, cotrained policies mostly rely on real data for action generation and use cotraining data in local neighborhoods where real data is missing. This results in smoother and more robust policies.

3. Power Laws

Scaling up sim data reduces the test loss and action MSE on the real-world distribution predictably according to a power law. These power laws are evidence of positive transfer and provide insight into the relative value between sim and real data.
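
One simple way to check for such a trend is to fit \(L(N) \approx a N^{-b}\) by linear regression in log-log space, where \(N\) is the number of sim demos and \(L\) is the test loss. The sketch below does this on synthetic placeholder values, not results from the paper; the functional form (with no irreducible-loss offset) and the name `fit_power_law` are assumptions.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic placeholder data generated from a known power law (NOT paper results).
sim_demo_counts = [250, 500, 1000, 2000, 4000]
true_a, true_b = 2.0, 0.3
test_losses = [true_a * n ** (-true_b) for n in sim_demo_counts]

a_hat, b_hat = fit_power_law(sim_demo_counts, test_losses)
```

On measured losses, the fitted exponent summarizes how quickly additional sim data reduces real-world loss, and a straight line on a log-log plot is the visual signature of the same relationship.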

4. Ablations

We conducted several additional ablation studies, including:

Finetuning experiments
Domain conditioning via classifier-free guidance
Alternative formulations for cotraining based on adversarial objectives and the maximum mean discrepancy (MMD) metric

For a comprehensive analysis of all ablation studies, please refer to our paper.
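
For readers unfamiliar with the MMD metric mentioned in the list above, the sketch below computes a biased estimate of squared MMD between two sets of feature vectors using a Gaussian (RBF) kernel. It illustrates the metric only; how it enters a cotraining objective is not described on this page, and the kernel, bandwidth, function names, and synthetic features are all assumptions.

```python
import math
import random

def rbf_kernel(x, y, bandwidth):
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    k_xx = sum(rbf_kernel(a, b, bandwidth) for a in xs for b in xs) / (len(xs) ** 2)
    k_yy = sum(rbf_kernel(a, b, bandwidth) for a in ys for b in ys) / (len(ys) ** 2)
    k_xy = sum(rbf_kernel(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy

def gaussian_features(rng, count, mean, dim=2):
    # Synthetic stand-in for sim or real feature vectors.
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
real_feats = gaussian_features(rng, 100, 0.0)
sim_matched = gaussian_features(rng, 100, 0.0)   # same distribution as real
sim_shifted = gaussian_features(rng, 100, 2.0)   # mean-shifted distribution

mmd_same = mmd_squared(real_feats, sim_matched)
mmd_shifted = mmd_squared(real_feats, sim_shifted)
```

The estimate stays near zero when the two samples come from the same distribution and grows as the distributions move apart, which is what makes it usable as a measure of the gap between sim and real features.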

BibTeX

@article{wei2025simandrealcotraining,
      title={Empirical Analysis of Sim-and-Real Cotraining of Diffusion Policies for Planar Pushing from Pixels}, 
      author={Adam Wei and Abhinav Agarwal and Boyuan Chen and Rohan Bosworth and Nicholas Pfaff and Russ Tedrake},
      year={2025},
      eprint={2503.22634},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2503.22634}, 
}