Glossary

Key terms in A-Z order. `(Ch N)` marks the chapter where the term is introduced or discussed in depth.

Definitions are kept in sync with the monorepo master at `glossary/master_en.md`.

A

ACT (Action Chunking with Transformers): Transformer-based imitation-learning policy that predicts a chunk of future actions from demonstrations at each step, reducing compounding error over long horizons. (Ch6, Ch7, Ch8, Ch9, Ch11)

ADR (Automatic Domain Randomization): Sim-to-real strategy that automatically widens the randomization ranges of physical and non-physical parameters during training as the policy improves. (Ch7, Ch9)

ALOHA: Low-cost (<$20K) bimanual teleoperation hardware (includes ALOHA and ALOHA 2). (Ch7, Ch11)

C

Capacitive sensor: Measures contact via capacitance change between electrodes. (Ch6)

Closed-loop: Architecture that feeds execution results back to update plans. (Ch2, Ch8)

Compliance: Mechanical yielding to external force — essential for contact-rich manipulation. (Ch2, Ch3, Ch4, Ch7, Ch8, Ch13)

Contrastive learning: Representation learning that pulls similar pairs together and pushes dissimilar pairs apart. (Ch1, Ch3)

Co-training: Strategy of jointly training on human and robot data for complementary representations. (Ch8, Ch9, Ch10)

Cross-embodiment gap: Transfer gap arising from kinematic, visual, and tactile differences across agents (e.g., human vs. robot, or robot vs. robot). (Ch10)

D

DEXOP: Framework for learning manipulation policies from human hand data. (Ch6, Ch10)

Dexterous manipulation: Precise object manipulation with multi-fingered hands — in-hand rotation, assembly, etc. (Ch1, Ch2, Ch3, Ch4, Ch5, Ch6, Ch7, Ch8, Ch9, Ch10, Ch11, Ch12, Ch13)

DexUMI: Framework for converting human demonstrations into robot policies. (Ch3, Ch6, Ch10)

Diffusion Policy: Policy learning via conditional denoising diffusion over action distributions. (Ch1, Ch2, Ch3, Ch5, Ch7, Ch8, Ch11, Ch13)

DoF (Degrees of Freedom): Number of independent joint axes. The human hand has ~27 DoF. (Ch2, Ch3, Ch4, Ch6, Ch8, Ch10, Ch11, Ch12)

Domain Randomization: Randomizes simulation parameters to improve policy robustness. (Ch7, Ch9)
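As a minimal sketch of the Domain Randomization and ADR entries above: each episode draws simulator parameters from fixed ranges, and an ADR-style step widens those ranges. The parameter names, bounds, and the `widen_ranges` helper are illustrative assumptions, not taken from any specific simulator or from the chapters.

```python
import random

# Hypothetical parameter ranges; names and bounds are illustrative assumptions.
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "object_mass_kg": (0.05, 0.5),
    "light_intensity": (0.3, 1.0),
}


def sample_sim_params(rng, ranges=PARAM_RANGES):
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


def widen_ranges(ranges, factor=1.1):
    """ADR-style step (simplified): widen every range around its midpoint."""
    widened = {}
    for name, (lo, hi) in ranges.items():
        mid = 0.5 * (lo + hi)
        half_width = 0.5 * (hi - lo) * factor
        widened[name] = (mid - half_width, mid + half_width)
    return widened


rng = random.Random(0)           # seeded so the draws are reproducible
episode_params = sample_sim_params(rng)
wider = widen_ranges(PARAM_RANGES)
```

In practice ADR widens a range only when the policy clears a performance threshold at the current boundary; the unconditional widening here is a simplification.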

E

Embodiment Retargeting: Maps motion from one embodiment (e.g., a human hand) into the joint space of another (e.g., a robot hand). (Ch9, Ch10)

E-skin: Flexible electronic substrate integrated with tactile sensors. (Ch2)

F

FEM (Finite Element Method): Numerical method for deformation simulation. (Ch3, Ch9)

Flow Matching: Learns action distributions via continuous normalizing flows — core technique of pi0. (Ch7, Ch8, Ch9)

ForceVLA: VLA extended with force sensing using MoE routing to branch on contact modes. (Ch7, Ch8, Ch11, Ch13)

Foundation Model: Large-scale pretrained general-purpose model — e.g., Sparsh (tactile), pi0 (VLA). (Ch3, Ch6, Ch8, Ch9, Ch10, Ch11, Ch12, Ch13)

F-TAC Hand: Robot hand platform integrating high-resolution tactile sensing. (Ch1, Ch2, Ch4, Ch5, Ch11)

G

GelSight: Vision-based tactile sensor developed at MIT that uses photometric stereo. (Ch1, Ch2, Ch3, Ch9, Ch11, Ch13)

Grounding: Connecting an LLM's abstract language to feasible actions, objects, and states in the environment. (Ch13)

H

Human hand data: Datasets of human hand manipulation captured via gloves, video, or motion capture. (Ch1, Ch2, Ch5, Ch6)

I

IL (Imitation Learning): Learns policies by directly imitating human demonstrations. (Ch7)

Impedance control: Regulates the dynamic relationship between contact force and motion (position and velocity) rather than tracking either one alone [Hogan, 1985]. (Ch1, Ch5, Ch7, Ch13)

In-hand manipulation: Changing the position or orientation of a grasped object within the hand. (Ch1, Ch2, Ch3, Ch4, Ch7, Ch9)
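To make the impedance control entry above concrete, here is a hedged one-axis sketch of a spring-damper impedance law, F = K (x_des - x) + D (v_des - v). The function name, gains, and units are assumptions chosen for illustration; real controllers work in multiple dimensions and often add an inertia term.

```python
def impedance_force(x, v, x_des, v_des, stiffness, damping):
    """Commanded force for a 1-DoF spring-damper impedance law.

    x, v         : measured position [m] and velocity [m/s]
    x_des, v_des : desired position [m] and velocity [m/s]
    stiffness    : virtual spring gain K [N/m]
    damping      : virtual damper gain D [N*s/m]
    """
    return stiffness * (x_des - x) + damping * (v_des - v)


# Fingertip at rest, 1 cm short of its target, with K = 200 N/m and D = 5 N*s/m.
force = impedance_force(x=0.0, v=0.0, x_des=0.01, v_des=0.0,
                        stiffness=200.0, damping=5.0)
```

With these illustrative gains the 1 cm error yields a command of about 2 N; lowering `stiffness` makes the same error produce a gentler push, which is how compliance is tuned.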

K

Kinesthetic teaching: Human physically guides the robot to demonstrate motions. (Ch1, Ch3, Ch7)

M

MANO: Statistical human hand model trained on 1,000 3D scans (778 vertices, 16 joints). (Ch3, Ch6, Ch8, Ch10)

MoE (Mixture of Experts): Architecture that dynamically routes inputs to multiple expert networks — ForceVLA is a representative example. (Ch7, Ch8, Ch11, Ch13)

O

Open X-Embodiment: One of the largest open-source robot datasets, aggregating 1M+ trajectories from 34 labs. (Ch3, Ch8, Ch11, Ch12)

OpenVLA: Open-source VLA foundation model (7B params, trained on Open X-Embodiment). (Ch8, Ch11, Ch12)

OSMO: Learns and transfers robot policy from human hand motion. (Ch2, Ch3, Ch6, Ch10, Ch11, Ch13)

P

PaLM-E: Google's embodied multimodal language model unifying image, state, and language into one token space. (Ch8)

Particle jamming: Mechanism that transitions from soft to rigid via particle densification. (Ch5)

Photometric stereo: Reconstructs 3D surface shape from camera images taken under illumination from multiple directions; the core of GelSight. (Ch1, Ch2)

Physical AI: AI that understands and interacts with the physical world — convergence of Foundation Models, simulation, and sensors. (Ch1, Ch4, Ch11, Ch12, Ch13)

pi0 (π₀): Physical Intelligence's flow-based VLA foundation model. (Ch1, Ch7, Ch8, Ch10, Ch12, Ch13)

Point cloud: 3D coordinate set representing tactile or visual data. (Ch3, Ch7, Ch11)

R

RL (Reinforcement Learning): Policy learning by maximizing reward. (Ch3, Ch4, Ch6, Ch7, Ch8, Ch9, Ch10, Ch13)

RT-2: Google DeepMind's VLA model jointly trained on web VQA and robot manipulation. (Ch8, Ch12, Ch13)

S

Scaling laws: Empirical relationships between data/model scale and performance; an active research area in robot learning. (Ch3, Ch10, Ch13)

Shear force: Force parallel to the contact surface — essential for slip detection. (Ch1, Ch2, Ch6)

SIMPLER: Benchmark that aligns simulation evaluation with real-world performance. (Ch5)

Sim-to-Real: Process/strategies for transferring policies trained in simulation to the real world. (Ch3, Ch4, Ch6, Ch7, Ch8, Ch9, Ch10, Ch11, Ch12, Ch13)

Slip detection: Detecting an object slipping from the hand — typically via shear-force monitoring. (Ch1, Ch2, Ch3, Ch7, Ch11, Ch13)

Synergy: Coordinated motion pattern across joints — often captured via PCA-based dimensionality reduction. (Ch4, Ch5)
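The shear force and slip detection entries above can be illustrated with a simple Coulomb-friction check: slip is predicted once the shear magnitude exceeds the friction coefficient times the normal force. This is a sketch under that assumption only; the function names and the default friction coefficient are hypothetical, and practical detectors often rely on learned or vibration-based signals instead.

```python
import math


def slip_margin(shear_x, shear_y, normal, friction_coeff):
    """Return mu * F_n - |F_t| in newtons; a negative value predicts slip."""
    shear = math.hypot(shear_x, shear_y)  # magnitude of the tangential force
    return friction_coeff * normal - shear


def is_slipping(shear_x, shear_y, normal, friction_coeff=0.5):
    """Coulomb-friction slip test for a single contact patch."""
    if normal <= 0.0:
        # No normal force pressing on the object: treated here as lost contact.
        return True
    return slip_margin(shear_x, shear_y, normal, friction_coeff) < 0.0


# 0.5 N of shear against a 1.0 N friction limit: the grasp holds.
holding = is_slipping(0.3, 0.4, normal=2.0)
# 1.5 N of shear against the same limit: slip is predicted.
slipping = is_slipping(0.9, 1.2, normal=2.0)
```

A controller could use the margin rather than the boolean, raising grip force as the margin approaches zero.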

T

Tactile foundation model: General-purpose representation pretrained on large-scale tactile data — e.g., Sparsh, UniTouch. (Ch3, Ch11)

Tactile skin: Large-area flexible tactile sensor array covering hands/arms. (Ch2, Ch7, Ch9)

Taxel: Tactile pixel — unit sensing element. Example: Digit 360 ≈ 8.3M taxels. (Ch2)

Teleoperation (TeleOp): Humans remotely operating a robot to collect demonstration data. (Ch3, Ch4, Ch6, Ch7, Ch10, Ch11, Ch13)

Tendon-driven: Actuation via tendons transmitting force — e.g., SoftHand, ORCA. (Ch4, Ch5, Ch12)

Torque control: Directly controls joint torque — essential in contact-rich environments. (Ch4)

U

Underactuation: Design with fewer actuators than joints, enabling passive shape adaptation. (Ch4, Ch5, Ch11)

UV map: 2D planar unwrapping of a 3D surface — e.g., MANO UV map. (Ch3, Ch6, Ch10)

V

Visuo-tactile: Representation or model fusing vision and tactile information — e.g., Robot Synesthesia, 3D-ViTac. (Ch2, Ch3, Ch7, Ch11)

VLA (Vision-Language-Action): Unified model that directly outputs robot actions from vision and language input. (Ch1, Ch7, Ch8, Ch10, Ch11, Ch12, Ch13)

VSA (Variable Stiffness Actuator): Actuator whose stiffness can be actively adjusted between compliant and rigid. (Ch5, Ch11)