Chapter 15: From Human to Robot — Embodiment Retargeting

Overview

Converting human demonstration data (Chapter 6) into robot policies requires overcoming the cross-embodiment gap between human and robot. The human hand's 27 DoF and a robot hand's 7-22 DoF represent fundamentally different kinematic structures, with differences in visual appearance and tactile properties as well. This chapter addresses three dimensions of the gap — kinematic, visual, and tactile — and their respective solutions, as well as emerging paradigms of human-robot co-training and teleop-free policy learning from human data alone.

After reading this chapter, you will be able to:
- Distinguish the three dimensions of the cross-embodiment gap.
- Compare kinematic retargeting approaches (AnyTeleop, ImMimic, DexH2R, ManipTrans).
- Understand recent visual gap solutions including Mirage and H2R.
- Understand the tactile gap solutions of UniTacHand [#16] and OSMO [#18].
- Explain the scaling laws of human-robot co-training (EgoMimic, EgoScale).
- Assess the possibilities and limitations of teleop-free approaches (X-Sim, EgoZero, VidBot).
- Explain why no general solution exists for the tactile domain gap.

15.1 The Cross-Embodiment Gap: Kinematic, Visual, and Tactile

Gap Dimension	Human	Robot	Core Problem
Kinematic	27 DoF	7-22 DoF	Joint structure, range, coupling differences
Visual	Skin, nails, flexible	Metal, plastic, rigid	Visual policy appearance dependence
Tactile	~17,000 receptors	0-17 sensors	Density, type, distribution entirely different

Key observation from Seminar 1: UniTacHand is the only paper that has addressed the tactile domain gap, and no general methodology exists.

15.2 Kinematic Retargeting: AnyTeleop, ImMimic, DexH2R

AnyTeleop (2023)

Qin et al. [2023, RSS]. Dex-Retarget maps human hand keypoints to robot joint positions. Limitation: a naive keypoint mapping loses physical feasibility.
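The keypoint-to-joint mapping can be illustrated with a minimal sketch. It is not AnyTeleop's or Dex-Retarget's implementation: it assumes a hypothetical planar two-link robot finger (link lengths `L1`, `L2`, joint limits `LIMITS`) and a hand-size scale factor `alpha`, all invented here for illustration. Joint angles are chosen by projected gradient descent so that the robot fingertip tracks the scaled human fingertip while staying inside the joint limits.

```python
import math

L1, L2 = 1.0, 0.8                      # hypothetical link lengths (arbitrary units)
LIMITS = [(0.0, 1.6), (0.0, 1.7)]      # hypothetical joint limits (rad)

def fingertip(q):
    """Forward kinematics of a planar two-link finger."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return x, y

def retarget(human_tip, alpha=1.0, q0=(0.3, 0.3), steps=2000, lr=0.1):
    """Find joint angles whose fingertip matches the scaled human fingertip."""
    tx, ty = alpha * human_tip[0], alpha * human_tip[1]
    q = list(q0)
    for _ in range(steps):
        x, y = fingertip(q)
        ex, ey = x - tx, y - ty
        s1, c1 = math.sin(q[0]), math.cos(q[0])
        s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
        # Gradient of the squared error, 2 * J^T * e, with the analytic Jacobian J.
        g0 = 2 * (ex * (-L1 * s1 - L2 * s12) + ey * (L1 * c1 + L2 * c12))
        g1 = 2 * (ex * (-L2 * s12) + ey * (L2 * c12))
        for i, g in enumerate((g0, g1)):
            lo, hi = LIMITS[i]
            q[i] = min(hi, max(lo, q[i] - lr * g))   # projected step keeps limits
    return q
```

Position-only tracking of this kind is exactly where physical feasibility can be lost: nothing in the cost knows about contacts or forces, which is the limitation the later methods in this section try to address.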

Figure 15.1: AnyTeleop — a general vision-based teleoperation system operating across IsaacGym and SAPIEN simulators as well as real-world setups, supporting diverse arms, hand types, and camera configurations. Source: Qin et al. (2023), Fig. 1.

ImMimic (2025)

Liu et al.^[2]: Interpolation between large-scale human trajectories and few teleoperation trajectories. Combines human data diversity with teleop physical feasibility.

Key Paper: Liu et al. 2025. "ImMimic: Large-scale Human Trajectory + Few-shot Teleoperation Interpolation." Data augmentation through interpolation, mitigating the cross-embodiment gap from the data side.
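As a rough illustration of interpolation-based augmentation, the sketch below blends a human trajectory with a teleoperation trajectory. It assumes, beyond what the text states, that both trajectories are already time-aligned and expressed in a shared space (for example end-effector positions); the function names are hypothetical and this is not ImMimic's actual procedure.

```python
def interpolate_trajectories(human, robot, lam):
    """Blend two time-aligned trajectories; lam=0 gives human, lam=1 gives robot."""
    if len(human) != len(robot):
        raise ValueError("trajectories must be time-aligned to the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        tuple((1.0 - lam) * h + lam * r for h, r in zip(hp, rp))
        for hp, rp in zip(human, robot)
    ]

def augment(human, robot, n_levels=3):
    """Generate n_levels intermediate trajectories strictly between the domains."""
    return [
        interpolate_trajectories(human, robot, k / (n_levels + 1))
        for k in range(1, n_levels + 1)
    ]
```

Each intermediate trajectory sits partway between the human and robot domains, which is one way to read "mitigating the gap from the data side".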

DexH2R (2024)

Task-oriented retargeting: extracts the intent of human demonstrations and reproduces it within the robot's kinematics.

ManipTrans (CVPR 2025)

Lv et al. [2025, CVPR] — the first large-scale framework for bimanual manipulation retargeting:

Retargets human bimanual trajectories to robot bimanual hands
DexManipNet: dataset of 3,300+ bimanual manipulation episodes
Optimization-based retargeting ensuring both physical feasibility and contact consistency
Extends beyond single-hand retargeting into bimanual coordination

Park et al. (CMU/SNU, 2025)

Park et al. [arXiv, Jan 2025]:

Aligns the joint motion manifolds of human and robot hands
Latent-space mapping naturally preserves physical feasibility
Retargeting accuracy of 0.59 vs. 0.39 for the baseline
Manifold-based approach enables rapid adaptation to new robot hands

Human2Sim2Robot

Human video → simulation reproduction → robot policy. Extension of internet video utilization (→ Chapter 6.5).

Figure 15.2: MANO, a parametric blendshape model of the human hand's articulated and non-rigid deformations, combined with the SMPL+H body model to recover hand shape and pose. MANO underpins the UV-map-based approach described later in §15.4 (UniTacHand). Source: Romero et al. (2017), Fig. 1.

15.3 Bridging the Visual Gap: DexUMI [#8], RoboPaint [#15], Mirage, H2R

DexUMI (2025)

SAM2 inpainting: Erases human hand from camera images, replaces with robot hand. Visual policies become independent of human hand appearance. Kinematic gap resolved separately via exoskeleton.
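The erase-and-replace step can be sketched as per-pixel compositing. This is a simplified illustration, not DexUMI's pipeline: the hand mask (which would come from a segmentation model such as SAM2), the inpainted background, and the rendered robot hand are all assumed to be given, and images are plain nested lists with hypothetical names.

```python
def replace_hand(image, background, robot_render, hand_mask, robot_mask):
    """Composite a robot-hand render over an image with the human hand erased.

    Priority per pixel: robot render, then inpainted background where the
    human hand was, then the original image.
    """
    out = []
    for y, row in enumerate(image):
        out_row = []
        for x, pixel in enumerate(row):
            if robot_mask[y][x]:
                out_row.append(robot_render[y][x])
            elif hand_mask[y][x]:
                out_row.append(background[y][x])
            else:
                out_row.append(pixel)
        out.append(out_row)
    return out
```

Pixels outside both masks are untouched, so the policy still sees the real scene; only the embodiment-specific region changes.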

RoboPaint (2025)

3DGS reconstructs real scenes in simulation, reducing the visual sim-to-real gap (→ Chapter 14.4).

Mirage (RSS 2024)

Chen et al. [2024, RSS] — Cross-Painting for zero-shot visual transfer:

Cross-painting technique visually transforms human hands into robot hands
Achieves zero-shot visual transfer without additional training
Directly converts human demonstration videos into robot-perspective images for policy learning
Complementary to DexUMI's inpainting — Mirage does style transfer, DexUMI does full replacement

H2R (2025)

Human-to-Robot Video Augmentation:

Transforms human demonstration videos into robot visual data for pretraining
Automatically generates synthetic robot video from large-scale human video
Maximizes data efficiency for visual representation learning
A novel approach: bridging the visual gap at the pretraining stage

Masquerade (2025)

A framework for transforming human video into robot-compatible visuals:

Automatically converts visual elements of human demonstrations for the robot domain
Combines masking and regeneration in a style transfer pipeline
Makes existing human video datasets immediately usable for robot learning

Figure 15.3: DexUMI — a framework that combines a wearable exoskeleton with SAM2-based visual inpainting to transfer human hand demonstrations to diverse robot hands (Inspire-Hand, XHand). Achieves an average 86% success rate across long-horizon, contact-rich, multi-finger, and precise tasks. Source: Xu et al. (2025), Fig. 1.

15.4 Bridging the Tactile Gap: UniTacHand and OSMO

The tactile gap is the hardest of the three dimensions to resolve.

UniTacHand (2025)

Zhang et al.^[6]:

MANO [#17] UV map: Unfolds hand surface into 2D UV space
Human glove tactile → projected onto MANO UV map
Robot hand tactile → projected onto same MANO UV map
Shared representation space for tactile policy learning
The only paper addressing the tactile domain gap
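The shared-representation idea above can be sketched as rasterizing tactile readings into a fixed-size UV grid. This is a toy illustration under stated assumptions: each taxel's (u, v) coordinate on the unfolded hand surface is taken as given, no MANO mesh is involved, and the function name and grid size are hypothetical.

```python
def project_to_uv(taxels, grid=8):
    """Rasterize (u, v, pressure) readings into a grid x grid map.

    Readings that fall into the same cell are averaged; empty cells stay 0.0.
    """
    total = [[0.0] * grid for _ in range(grid)]
    count = [[0] * grid for _ in range(grid)]
    for u, v, p in taxels:
        if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
            raise ValueError("UV coordinates must lie in [0, 1]")
        col = min(int(u * grid), grid - 1)
        row = min(int(v * grid), grid - 1)
        total[row][col] += p
        count[row][col] += 1
    return [
        [total[r][c] / count[r][c] if count[r][c] else 0.0 for c in range(grid)]
        for r in range(grid)
    ]
```

A dense glove and a sparse robot hand produce maps of the same shape, so a single policy input format can serve both, even though the sparse map will have many empty cells.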

Key Paper: Zhang, C., et al. (2025). "UniTacHand: Unified Spatio-Tactile Representation for Human to Robotic Hand Skill Transfer." arXiv:2512.21233. The first systematic treatment of tactile cross-embodiment transfer, using MANO UV maps as a shared representation.

Figure 15.4: UniTacHand overview — Stage 1 projects both human-glove and DexHand tactile signals onto a 2D MANO UV map; Stage 2 aligns their heterogeneous structures and adversarial morphologies into a unified latent space. This tactile-centric alignment enables zero-shot transfer of robot tactile policies. Source: Zhang et al. (2025), Fig. 1.

OSMO (2025)

OSMO glove [arXiv:2512.08920] (→ Chapter 6.3):

Embodiment Bridge: Same tactile glove on both human and robot
Physically eliminates the tactile domain gap — identical sensor means no representation difference
12 three-axis magnetic sensors, open-source

Fundamental difference between the two approaches:

UniTacHand: Software solution (representation alignment)
OSMO: Hardware solution (physical sharing)

Figure 15.5: OSMO — an open-source tactile glove with 3-axis magnetic sensors providing full-hand tactile coverage. Demonstrates human-to-robot skill transfer without any robot teleop data. Source: Yin et al. (2025), Fig. 1.

15.5 Mechanical Coupling: The DEXOP [#10] Four-Bar Linkage Approach

DEXOP^[8] mechanically couples human and robot fingers directly through four-bar linkages:

Resolves all three gaps simultaneously:
Kinematic: Mechanical 1:1 mapping
Visual: Robot acts directly, only robot visible to camera
Tactile: Direct contact feedback to human
8x faster data collection than teleoperation
51.3% success rate vs. 42.5% for teleoperation
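To see how a four-bar linkage can give a 1:1 kinematic mapping, the sketch below solves the standard planar four-bar position problem. DEXOP's actual link dimensions are not given here, so the lengths used are hypothetical; the parallelogram case (crank equal to rocker, coupler equal to ground) is shown only because it makes the output angle copy the input angle.

```python
import math

def four_bar_output_angle(theta2, a, b, c, d):
    """Rocker angle for crank angle theta2 (open configuration).

    a: crank, b: coupler, c: rocker, d: ground link length.
    The crank pivots at (0, 0) and the rocker pivots at (d, 0).
    """
    ax, ay = a * math.cos(theta2), a * math.sin(theta2)
    dx, dy = d - ax, -ay                  # vector from crank tip A to rocker pivot
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > b + c or dist < abs(b - c):
        raise ValueError("linkage cannot be assembled at this crank angle")
    # Intersect the circle of radius b around A with the circle of radius c
    # around the rocker pivot to find the coupler-rocker joint B.
    along = (b * b - c * c + dist * dist) / (2 * dist)
    h = math.sqrt(max(b * b - along * along, 0.0))
    px, py = ax + along * dx / dist, ay + along * dy / dist
    bx, by = px - h * dy / dist, py + h * dx / dist
    return math.atan2(by, bx - d)
```

With `a = c` and `b = d`, the returned angle equals `theta2`, which is the mechanical analogue of a one-to-one joint mapping; other link ratios give a nonlinear but still deterministic coupling.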

Figure 15.6: DEXOP overview. A passive exoskeleton couples human fingers to a passive robotic hand via four-bar linkages. Equipped with whole-hand tactile sensors, it enables precise, contact-rich, long-horizon, and bimanual dexterous perioperation. Source: Fang et al. (2025), Fig. 1.

Figure 15.7: Exploded view of DEXOP-12 — (a) the linkage systems for index, middle, ring, and thumb fingers connecting the wearable exoskeleton to a passive robotic hand; (b) annotated view of the two-stage four-bar linkage coupling the thumb's TM joint. Source: Fang et al. (2025), Fig. 4.

15.6 Human + Robot Co-training

The most prominent emerging paradigm is co-training — jointly training on human demonstration data and robot data. Rather than requiring kinematic retargeting, human data directly improves robot policies.
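One simple way to picture co-training is a batch sampler that mixes the two data sources at a fixed ratio. The sketch below is a generic illustration, not the sampler of any paper in this section; `human_fraction` and the source tags are hypothetical choices.

```python
import random

def cotrain_batch(human_data, robot_data, batch_size, human_fraction, rng=random):
    """Draw one mixed batch; each sample is tagged with its source embodiment."""
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must lie in [0, 1]")
    n_human = round(batch_size * human_fraction)
    batch = [("human", rng.choice(human_data)) for _ in range(n_human)]
    batch += [("robot", rng.choice(robot_data)) for _ in range(batch_size - n_human)]
    rng.shuffle(batch)
    return batch
```

Tagging each sample with its source lets a training loop apply embodiment-specific losses or normalization if needed, while sampling with replacement keeps a small robot dataset from being exhausted by a much larger human one.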

EgoMimic (Georgia Tech, 2024)

Figure 15.8: EgoMimic overview — Aria smart-glasses-captured human 3D hand trajectories (Human Data) and bimanual teleoperated robot data (Robot Data) are merged into unified Training Data for joint policy learning. Source: Kareer et al. (2024), Fig. 1.

Figure 15.9: EgoMimic hardware — the human session only requires Aria smart glasses (enabling in-the-wild data collection), while the robot session mounts the same Aria as a side camera to minimize viewpoint and sensor-domain mismatch. Source: Kareer et al. (2024), Fig. 2.

Kareer et al. [arXiv, 2024]:

Co-trains on human egocentric video and robot data
1 hour of human demonstrations outperforms 2 hours of robot data — counterintuitive but reproducible
+34-228% performance improvement over robot-only training
The richness and diversity of human data compensates for robot data quantity

Key Paper: Kareer et al. 2024. "EgoMimic: Scaling Imitation Learning via Egocentric Video." arXiv preprint. Demonstrates the counterintuitive result that 1 hour of human egocentric data outperforms 2 hours of robot data. Establishes the foundation for the human-robot co-training paradigm.

EgoScale (NVIDIA, 2026)

EgoScale [arXiv, Feb 2026]:

Leverages 20,854 hours of large-scale human egocentric data
Discovers a log-linear scaling law between human data scale and robot performance: R² = 0.998
+54% improvement over robot-only training
Predictable performance gains with data scale — enables estimating return on investment
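A log-linear law of this kind can be fitted and checked with ordinary least squares on the logarithm of data scale. The sketch below uses synthetic placeholder numbers, not EgoScale's measurements, and the function name is hypothetical; it only shows how the slope, intercept, and R² would be computed.

```python
import math

def fit_log_linear(hours, scores):
    """Least-squares fit of score = a + b * ln(hours); returns (a, b, r_squared)."""
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mx, my = sum(xs) / n, sum(scores) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, scores))
    ss_tot = sum((y - my) ** 2 for y in scores)
    return a, b, 1.0 - ss_res / ss_tot

# Synthetic, illustrative numbers only (not taken from EgoScale).
hours = [100, 1000, 10000]
scores = [0.2 + 0.1 * math.log10(h) for h in hours]
a, b, r2 = fit_log_linear(hours, scores)
```

Once `a` and `b` are fitted, `a + b * math.log(h)` gives a predicted score at a new data scale `h`, which is what makes a return-on-investment estimate possible, subject to the law holding outside the fitted range.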

Key Paper: Zheng et al. 2026. "EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data." arXiv:2602.16710. Reports a log-linear scaling law (R² = 0.998) from 20,854 hours of human data. Demonstrates that human data scaling provides predictable gains for robot learning.

AoE (2026)

AoE (Augmentation of Experience) [arXiv, Feb 2026]:

Small-scale mixture of 50 teleop + 200 human ego demonstrations
Close Laptop task: success rate 45% → 95%
Small amounts of human data effectively compensate for limited robot data
A practical co-training approach under resource constraints

pi0 [#2] Human-to-Robot Transfer (Physical Intelligence, 2025)

Physical Intelligence [Dec 2025]:

Co-finetuning the pretrained pi0 foundation model with human data
2x performance improvement across 4 generalization scenarios (new objects, environments, tasks, robots)
Emergent alignment: the model automatically learns human-robot correspondences without explicit retargeting
Cross-embodiment transfer in the Foundation Model era — at sufficient scale, explicit mapping becomes unnecessary

15.7 Teleop-Free Approaches

A more radical direction entirely eliminates robot data, learning robot policies from human data alone. If successful, this fundamentally removes the cost and bottleneck of teleoperation.

X-Sim (Cornell, CoRL 2025 Oral)

Dan et al. [2025, CoRL Oral]:

A single human RGBD video → RL training in simulation → real robot deployment
Achieves real robot control with zero robot data
Complete realization of the Real-to-Sim-to-Real pipeline
CoRL 2025 Oral — demonstrates the feasibility of generating robot policies from human demonstrations alone

Figure 15.10: X-Sim overview — (Real→Sim) collect a human video and generate a photorealistic simulation; (Train X-Sim Policy) train RL with privileged state, then render varied lighting/views to distill an image-based policy; (Sim→Real) deploy the image-based policy and auto-calibrate sim vs. real. Source: Dan et al. (2025), Fig. 1.

EgoZero (2025)

EgoZero^[25]:

Learns robot policies from smart glasses egocentric video alone
Achieves 70% success rate across 7 manipulation tasks
Zero robot data — uses only wearable device data from daily activities
The most natural form of human demonstration collection, requiring no special equipment (gloves, exoskeletons)

VidBot (TU Munich, CVPR 2025)

VidBot [2025, CVPR]:

Internet video → 3D affordance extraction → zero-shot robot control
+20% improvement over existing robot data approaches
Transforms the internet's vast video data into robot learning resources
Affordance-based representations naturally abstract away the embodiment gap

Human2Bot (Autonomous Robots, 2025)

Human2Bot ^[27]:

Human video → task similarity reward design → zero-shot robot control
No robot data required — human video serves as the reward function
Indirect knowledge transfer through reward shaping
Task-level abstraction bypasses the embodiment gap entirely
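A task-similarity reward can be sketched as a similarity score between an embedding of the human video and an embedding of the robot rollout. The sketch assumes the embeddings come from some video encoder that is not shown, and it uses cosine similarity as a stand-in; the text does not specify Human2Bot's actual similarity measure, so both choices are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

def task_similarity_reward(human_embedding, robot_embedding):
    """Reward is high when the robot rollout 'looks like' the same task."""
    return cosine_similarity(human_embedding, robot_embedding)
```

Because the reward compares task-level embeddings rather than joint angles or hand poses, nothing in it depends on the two embodiments sharing a kinematic structure.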

15.8 Open Challenges: Tactile Domain Gap and the Limits of New Paradigms

One of Seminar 1's most important insights: no general methodology exists for solving the cross-embodiment gap in the tactile domain.

UniTacHand: MANO UV map-based, limited to MANO-compatible hands
OSMO: Requires identical glove, incompatible with existing sensors
DEXOP: Mechanical coupling, strongly hand-form dependent

Future research directions needed:

Sensor-agnostic tactile transfer: Extending AnyTouch/Sensor-Invariant to cross-embodiment (→ Chapter 3.3)
Universal tactile UV maps: Shared representations applicable beyond MANO
Simulation-based tactile transfer: Overcoming the tactile gap via DiffTactile

Open Problems in Co-training and Teleop-Free Approaches

The new paradigms covered in Sections 15.6 and 15.7 are powerful, but several challenges remain:

Counterintuitive scaling: EgoMimic's result (1hr human > 2hr robot) shows that diversity of human data is key, but which kinds of diversity matter remains unclear.
Human-only robot control scope: X-Sim, EgoZero, and VidBot demonstrate robot control from human data alone, but the range of successful tasks remains limited.
Universality of scaling laws: Whether EgoScale's log-linear law (R² = 0.998) generalizes across diverse tasks and robot platforms requires further validation.
Conditions for emergent alignment: pi0's results may be a phenomenon that only emerges at Foundation Model scale; the necessary conditions are poorly understood.
Absence of tactile data: Both co-training and teleop-free approaches rely on visual data; transfer of tactile information remains entirely unaddressed.

Summary and Outlook

Embodiment retargeting addresses three gap dimensions, kinematic (AnyTeleop, ImMimic, ManipTrans), visual (DexUMI, RoboPaint, Mirage), and tactile (UniTacHand, OSMO), with mechanical coupling (DEXOP) as a fourth route that targets all three at once. More recently, human-robot co-training (EgoMimic, EgoScale, pi0) and teleop-free approaches (X-Sim, EgoZero, VidBot) have been driving a fundamental paradigm shift.

Three counterintuitive findings are redefining the field's direction:

1 hour human > 2 hours robot (EgoMimic): human data diversity dominates quantity
Human-only robot control is possible (X-Sim, EgoZero, VidBot): robot data may be unnecessary
Log-linear scaling (EgoScale, R² = 0.998): returns on human data investment are predictable

Yet tactile cross-embodiment transfer remains addressed by only one paper (UniTacHand), and neither co-training nor teleop-free approaches handle tactile information. This remains the field's largest open problem.

The "Shared Sensing Platform" direction proposed in Chapter 18 — generalizing OSMO/UniTacHand-style cross-embodiment tactile transfer — is the key research direction for closing this gap.

The next part addresses system integration and outlook (→ Chapter 16: Research Integration).

15.9 Embodiment-Gap Decomposition: Kinematics, Vision, and Touch

The most useful viewpoint from work on large-scale human-hand data is to decompose the embodiment gap into three dimensions. The kinematic gap comes from different joints, ranges of motion, and finger counts. The visual gap comes from the appearance difference between human and robot hands. The tactile gap comes from different contact distributions, sensor densities, taxel locations, and palm/fingertip roles even when grasping the same object.

This decomposition strengthens S1's retargeting chapter. The visual gap has relatively mature tools such as Mirage, H2R, robot overlays, and inpainting. The kinematic gap can be addressed with residual RL, motion manifolds, or hardware co-design. The tactile gap is much less mature, with only early signals from OSMO's shared platform, UniTacHand's UV maps, and EquiTac's equivariant tactile representation.

For tactile in-hand manipulation, this gap is severe. A contact pattern stabilized by a human palm and five fingers cannot be copied directly to LEAP, Allegro, XHand, Tesollo, or Wuji hands. Finger count, workspace, palm contact, side contact, and fingertip-pad stiffness all differ. Retargeting must therefore be reframed as contact-role transfer: which robot contact should replace each human contact role?
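One hypothetical way to make contact-role transfer concrete is to treat it as an assignment problem: each human contact role is given to the robot contact that minimizes a total mismatch cost. The role names, contact names, and cost function below are invented for illustration, and brute-force search is used only because hands have few contacts.

```python
from itertools import permutations

def assign_contact_roles(human_roles, robot_contacts, cost):
    """Return the role -> robot-contact assignment with the lowest total cost."""
    if len(robot_contacts) < len(human_roles):
        raise ValueError("not enough robot contacts to cover every role")
    best, best_cost = None, float("inf")
    for perm in permutations(robot_contacts, len(human_roles)):
        total = sum(cost(role, contact) for role, contact in zip(human_roles, perm))
        if total < best_cost:
            best, best_cost = dict(zip(human_roles, perm)), total
    return best
```

The cost function is where finger count, workspace, palm contact, and pad stiffness would enter; a role with no acceptable robot contact would show up as a high minimum cost rather than a silent bad copy.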

Translated into this survey: human data provides manipulation priors, robot data provides executability, and tactile data provides contact roles and failure explanations. All three are needed to convert human dexterity into stable robot-hand behavior for contact-rich manufacturing tasks.

Operational Reading Note

The practical value of this chapter is not only the concept of human-to-robot transfer; it is the set of engineering decisions that the concept changes. A deployable robot-hand project should start by asking what state becomes observable after this chapter is applied. The answer should be concrete: contact existence, contact patch, normal force, shear direction, slip margin, object pose, task phase, operator override, or product-damage risk. If a variable cannot be logged or consumed by a controller, it remains an explanatory idea rather than a system capability.

The second decision is the unit of evidence. Research demos often report one success metric, but tactile manipulation improves through failure records. A useful attempt record contains the object or SKU, the selected grasp candidate, the robot hand and sensor configuration, calibration version, task phase, tactile summary, policy action, safety intervention, and final outcome. This record is what connects the sensor chapters to the data chapter, the control chapters to the learning chapters, and the manufacturing chapters to QA.
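The attempt record described above maps naturally onto a small data structure. The following is a minimal sketch with hypothetical field names and types chosen to mirror the listed items; a real project would adapt them to its own logging schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class AttemptRecord:
    object_id: str             # object or SKU
    grasp_candidate: str       # selected grasp candidate
    hand_config: str           # robot hand configuration
    sensor_config: str         # tactile sensor configuration
    calibration_version: str
    task_phase: str
    tactile_summary: dict      # e.g. peak forces or slip flags
    policy_action: list
    safety_intervention: bool
    outcome: str               # final outcome label

    def to_json(self):
        """Serialize to a stable JSON string suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping every attempt in one flat, serializable record is what allows failures to be grouped later by calibration version, task phase, or sensor configuration.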

The third decision is where the chapter sits in the control stack. Some ideas belong in fast reflex loops, some in contact MPC, some in policy inputs, and some only in offline diagnosis. Mixing these time scales creates brittle systems: a VLA cannot react to millisecond slip, and a low-level force controller cannot infer the next process step. The right architecture separates fast contact stabilization, mid-level grasp or rearrangement control, and slow task reasoning.
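The separation of time scales can be sketched as a multi-rate schedule driven from one base tick. The layer names and rates below are hypothetical round numbers chosen for illustration, not measurements from any system in this chapter.

```python
# Hypothetical layer rates in Hz; each must divide the base rate evenly.
RATES_HZ = {"reflex": 1000, "contact_control": 100, "task_reasoning": 1}

def schedule(duration_s, base_hz=1000):
    """Count how often each layer runs when driven from a single base tick."""
    calls = {name: 0 for name in RATES_HZ}
    for tick in range(int(duration_s * base_hz)):
        for name, hz in RATES_HZ.items():
            if tick % (base_hz // hz) == 0:
                calls[name] += 1
    return calls
```

Over two seconds the reflex layer runs 2000 times while task reasoning runs twice, which makes the point of the paragraph concrete: a slow reasoning layer cannot be the component that reacts to slip.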

Finally, the chapter should be evaluated by the failure modes it removes. A method that improves benchmark success but leaves the team unable to distinguish perception failure, contact-acquisition failure, force-closure failure, execution-time slip, or maintenance drift is not yet production-ready. A method with slightly lower headline performance but better logs, safer force limits, and clearer recovery hooks may be the stronger basis for manufacturing Physical AI.

References

Qin, Y., et al. (2023). AnyTeleop: A general vision-based dexterous robot teleoperation system. RSS 2023.
Liu, Y., et al. (2025). ImMimic: Large-scale human trajectory + few-shot teleoperation interpolation.
Li, Y., et al. (2024). DexH2R: Task-oriented dexterous manipulation from human to robots. arXiv preprint.
Xu, M., Zhang, H., Hou, Y., Xu, Z., Fan, L., Veloso, M., & Song, S. (2025). DexUMI: Using human hand as the universal manipulation interface for dexterous manipulation. arXiv preprint. #8
Various. (2025). RoboPaint: 3DGS for Real-Sim-Real visual transfer. #15
Zhang, C., Xue, Z., Yin, S., Zhao, B., et al. (2025). UniTacHand: Unified spatio-tactile representation for human to robotic hand skill transfer. arXiv preprint. #16
Yin, J., Qi, H., Wi, Y., Kundu, S., Lambeta, M., Yang, W., Wang, C., Wu, T., Malik, J., & Hellebrekers, T. (2025). OSMO: Open-source tactile glove for human-to-robot skill transfer. arXiv preprint. arXiv:2512.08920. #18
Fang, H.-S., Romero, B., Xie, Y., et al. (2025). DEXOP: A device for robotic transfer of dexterous human manipulation. arXiv preprint. arXiv:2509.04441. #10
Dan, Y., et al. (2025). X-Sim: Cross-embodiment learning via real-to-sim-to-real. CoRL 2025 (Oral).
Si, Z., Qian, K., Sontakke, N., et al. (2025). ExoStart: Efficient learning for dexterous manipulation with sensorized exoskeleton demonstrations. arXiv preprint. #9
Shaw, K., Bahl, S., & Pathak, D. (2024). Learning dexterity from human hand motion in internet videos. arXiv preprint. arXiv:2212.04498.
Romero, J., Tzionas, D., & Black, M. J. (2017). Embodied hands (MANO). SIGGRAPH Asia 2017. #17
Various. (2025). Human2Sim2Robot: Internet video to robot policy pipeline.
Various. (2025). AnyTouch: Unified static-dynamic tactile representation. arXiv:2502.12191.
Various. (2025). Sensor-invariant tactile representation. OpenReview.
Lv, Y., et al. (2025). ManipTrans: Efficient bimanual dexterous manipulation retargeting. CVPR 2025.
Park, J., et al. (2025). Joint motion manifold for human-to-robot hand retargeting. arXiv preprint, Jan 2025.
Chen, B., et al. (2024). Mirage: Cross-embodiment zero-shot transfer via cross-painting. RSS 2024.
Various. (2025). H2R: Human-to-robot video augmentation for policy pretraining. arXiv preprint.
Various. (2025). Masquerade: Human video to robot visual transformation. arXiv preprint.
Kareer, S., et al. (2024). EgoMimic: Scaling imitation learning via egocentric video. arXiv preprint.
Zheng, R., Niu, D., Xie, Y., et al. (2026). EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data. arXiv:2602.16710.
Various. (2026). AoE: Augmentation of experience via human ego demonstrations. arXiv preprint, Feb 2026.
Physical Intelligence. (2025). pi0 human-to-robot transfer: Co-finetuning for cross-embodiment generalization. Technical Report, Dec 2025. #2
Various. (2025). EgoZero: Zero robot data policy learning from smart glasses. arXiv preprint.
Various. (2025). VidBot: Internet video to 3D affordance for zero-shot robot control. CVPR 2025.
Various. (2025). Human2Bot: Task similarity reward from human video for zero-shot robot control. Autonomous Robots, 2025.
Um, T. (2026). S2 From Human Hands to Robot Hands: large-scale tactile hand data survey.