Glossary
Key terms in A-Z order. `(Ch N)` marks the chapter where the term is introduced or discussed in depth.
Definitions are kept in sync with the monorepo master at `glossary/master_en.md`.
A
ACT (Action Chunking with Transformers): Transformer-based imitation-learning policy that predicts a chunk of future actions at each step from demonstrations, reducing compounding error on long-horizon tasks. (Ch6, Ch7, Ch8, Ch9, Ch11)
ADR (Automatic Domain Randomization): Sim-to-real strategy that automatically widens the randomization ranges of physical and non-physical simulation parameters as the policy improves during training. (Ch7, Ch9)
ALOHA: Low-cost (<$20K) bimanual teleoperation hardware (includes ALOHA and ALOHA 2). (Ch7, Ch11)
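To make the ADR entry above concrete, here is a minimal sketch of a single randomized parameter whose range grows as the policy succeeds. The class name `AdrRange`, the thresholds, and the symmetric widen/narrow rule are illustrative assumptions, not the procedure of any particular ADR implementation (published variants typically evaluate each range boundary separately).

```python
import random


class AdrRange:
    """One randomized parameter whose sampling range grows with policy success.

    Hypothetical helper: names, thresholds, and the update rule are assumptions.
    """

    def __init__(self, nominal, step, hard_min, hard_max):
        self.low = self.high = nominal  # start with no randomization
        self.step = step
        self.hard_min, self.hard_max = hard_min, hard_max

    def sample(self, rng=random):
        """Draw a parameter value for the next simulated episode."""
        return rng.uniform(self.low, self.high)

    def update(self, success_rate, expand_at=0.8, shrink_at=0.4):
        """Widen the range when the policy does well, narrow it when it struggles."""
        if success_rate >= expand_at:
            self.low = max(self.hard_min, self.low - self.step)
            self.high = min(self.hard_max, self.high + self.step)
        elif success_rate <= shrink_at and self.high - self.low >= 2 * self.step:
            self.low += self.step
            self.high -= self.step


# Friction coefficient: two expansions, one contraction, one more expansion.
friction = AdrRange(nominal=1.0, step=0.1, hard_min=0.2, hard_max=2.0)
for rate in (0.9, 0.85, 0.3, 0.95):
    friction.update(rate)
```

Plain Domain Randomization corresponds to fixing `low` and `high` up front and only ever calling `sample`.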
C
Capacitive sensor: Measures contact via capacitance change between electrodes. (Ch6)
Closed-loop: Architecture that feeds execution results back to update plans. (Ch2, Ch8)
Compliance: Mechanical yielding to external force — essential for contact-rich manipulation. (Ch2, Ch3, Ch4, Ch7, Ch8, Ch13)
Contrastive learning: Representation learning that pulls similar pairs together and pushes dissimilar pairs apart. (Ch1, Ch3)
Co-training: Strategy of jointly training on human and robot data for complementary representations. (Ch8, Ch9, Ch10)
Cross-embodiment gap: Transfer gap arising from kinematic, visual, and tactile differences across agents (e.g., human vs. robot, or robot vs. robot). (Ch10)
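The "pull together, push apart" idea in the Contrastive learning entry above is often implemented with an InfoNCE-style loss. The sketch below is one common formulation, written for a single anchor with plain Python lists; the function names and the temperature value are assumptions chosen for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive pair."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


anchor = [1.0, 0.0]
# Positive aligned with the anchor: low loss.
aligned = info_nce(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Positive orthogonal to the anchor while a negative is aligned: high loss.
misaligned = info_nce(anchor, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

Minimizing this loss raises the similarity of the positive pair relative to every negative, which is the behavior the definition describes.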
D
DEXOP: Framework for learning manipulation policies from human hand data. (Ch6, Ch10)
Dexterous manipulation: Precise object manipulation with multi-fingered hands — in-hand rotation, assembly, etc. (Ch1, Ch2, Ch3, Ch4, Ch5, Ch6, Ch7, Ch8, Ch9, Ch10, Ch11, Ch12, Ch13)
DexUMI: Framework for converting human demonstrations into robot policies. (Ch3, Ch6, Ch10)
Diffusion Policy: Policy learning via conditional denoising diffusion over action distributions. (Ch1, Ch2, Ch3, Ch5, Ch7, Ch8, Ch11, Ch13)
DoF (Degrees of Freedom): Number of independent joint axes. The human hand has ~27 DoF. (Ch2, Ch3, Ch4, Ch6, Ch8, Ch10, Ch11, Ch12)
Domain Randomization: Randomizes simulation parameters to improve policy robustness. (Ch7, Ch9)
E
Embodiment Retargeting: Maps motion from one embodiment (e.g., human hand) into the joint space of another (e.g., robot hand). (Ch9, Ch10)
E-skin: Flexible electronic substrate integrated with tactile sensors. (Ch2)
F
FEM (Finite Element Method): Numerical method for deformation simulation. (Ch3, Ch9)
Flow Matching: Learns action distributions by training a continuous normalizing flow to regress a velocity field that transports noise samples to actions; core technique of pi0. (Ch7, Ch8, Ch9)
ForceVLA: VLA extended with force sensing using MoE routing to branch on contact modes. (Ch7, Ch8, Ch11, Ch13)
Foundation Model: Large-scale pretrained general-purpose model — e.g., Sparsh (tactile), pi0 (VLA). (Ch3, Ch6, Ch8, Ch9, Ch10, Ch11, Ch12, Ch13)
F-TAC Hand: Robot hand platform integrating high-resolution tactile sensing. (Ch1, Ch2, Ch4, Ch5, Ch11)
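As a rough illustration of the Flow Matching entry above, the sketch below builds one training example under the linear-path convention, where a sample moves from Gaussian noise (t = 0) to the demonstrated action (t = 1). The function names are hypothetical, and the direction of t is an assumption; some papers use the opposite convention.

```python
import random


def flow_matching_sample(action, rng=random):
    """Build one training example for conditional flow matching.

    Uses the linear path x_t = (1 - t) * noise + t * action, whose
    target velocity is (action - noise) at every t.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, t, target_velocity


def squared_error(predicted, target):
    """Mean squared error a velocity network would be trained to minimize."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


action = [0.2, -0.5, 0.1]
x_t, t, v = flow_matching_sample(action, random.Random(0))
# Following the target velocity for the remaining time (1 - t) lands on the action.
reached = [x + (1.0 - t) * vi for x, vi in zip(x_t, v)]
```

At inference time a trained network replaces `target_velocity`, and the action is obtained by integrating its output from t = 0 to t = 1.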
G
GelSight: MIT-developed photometric-stereo-based vision-tactile sensor. (Ch1, Ch2, Ch3, Ch9, Ch11, Ch13)
Grounding: Connecting an LLM's abstract language to feasible actions, objects, and states in the environment. (Ch13)
H
Human hand data: Datasets of human hand manipulation captured via gloves, video, or motion capture. (Ch1, Ch2, Ch5, Ch6)
I
IL (Imitation Learning): Learns policies by directly imitating human demonstrations. (Ch7)
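Impedance control: Regulates the dynamic relationship between contact force and motion (displacement and velocity) instead of tracking position or force alone [Hogan, 1985]. (Ch1, Ch5, Ch7, Ch13)
In-hand manipulation: Repositioning or reorienting a grasped object within the hand, without releasing it. (Ch1, Ch2, Ch3, Ch4, Ch7, Ch9)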
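The Impedance control entry above can be illustrated with a one-dimensional spring-damper law, F = K (x_des - x) + D (v_des - v). This is a simplified sketch: the function name and the numeric gains are assumptions, and a full impedance controller may also shape the apparent inertia.

```python
def impedance_force(x_des, x, v_des, v, stiffness, damping):
    """Commanded force for a 1-D spring-damper impedance law.

    F = stiffness * (x_des - x) + damping * (v_des - v)
    Units assumed: metres, metres/second, N/m, N*s/m.
    """
    return stiffness * (x_des - x) + damping * (v_des - v)


# Fingertip held 5 mm short of its target by an obstacle, at rest.
soft = impedance_force(0.010, 0.005, 0.0, 0.0, stiffness=100.0, damping=5.0)
stiff = impedance_force(0.010, 0.005, 0.0, 0.0, stiffness=2000.0, damping=5.0)
```

With the same 5 mm position error, the low-stiffness setting presses with about 0.5 N and the high-stiffness setting with about 10 N, which is why low stiffness is associated with compliant contact.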
K
Kinesthetic teaching: Human physically guides the robot to demonstrate motions. (Ch1, Ch3, Ch7)
M
MANO: Statistical human hand model trained on 1,000 3D scans (778 vertices, 16 joints). (Ch3, Ch6, Ch8, Ch10)
MoE (Mixture of Experts): Architecture that dynamically routes inputs to multiple expert networks — ForceVLA is a representative example. (Ch7, Ch8, Ch11, Ch13)
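The routing described in the MoE entry above is commonly implemented as top-k gating. The sketch below assumes the gate scores are already computed (in a real model they come from a learned gating network) and uses scalar toy experts; it does not reproduce ForceVLA's architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_scores, experts, top_k=2):
    """Route x to the top_k experts and mix their outputs by renormalized gate weight."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = sum(probs[i] / norm * experts[i](x) for i in chosen)
    return output, chosen


# Three toy experts; the gate strongly prefers experts 1 and 2.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
output, chosen = moe_forward(3.0, gate_scores=[0.0, 5.0, 5.0], experts=experts)
```

Only the selected experts are evaluated, so compute per input stays roughly constant as more experts are added.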
O
Open X-Embodiment: Largest open-source robot dataset at its release, aggregating 1M+ trajectories from 34 labs. (Ch3, Ch8, Ch11, Ch12)
OpenVLA: Open-source VLA foundation model (7B params, trained on Open X-Embodiment). (Ch8, Ch11, Ch12)
OSMO: Learns and transfers robot policy from human hand motion. (Ch2, Ch3, Ch6, Ch10, Ch11, Ch13)
P
PaLM-E: Google's embodied multimodal language model unifying image, state, and language into one token space. (Ch8)
Particle jamming: Mechanism that transitions from soft to rigid via particle densification. (Ch5)
Photometric stereo: Reconstructs 3D surface shape from images taken by a fixed camera under multiple illumination directions; core of GelSight. (Ch1, Ch2)
Physical AI: AI that understands and interacts with the physical world — convergence of Foundation Models, simulation, and sensors. (Ch1, Ch4, Ch11, Ch12, Ch13)
pi0 (π₀): Physical Intelligence's flow-based VLA foundation model. (Ch1, Ch7, Ch8, Ch10, Ch12, Ch13)
Point cloud: 3D coordinate set representing tactile or visual data. (Ch3, Ch7, Ch11)
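For the Photometric stereo entry above, the classic textbook formulation recovers a surface normal per pixel from three intensities under a Lambertian model. The sketch below assumes exactly three known, linearly independent light directions and ignores shadows; it is not GelSight's calibration pipeline, and all names are illustrative.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, b):
    """Solve the 3x3 linear system a @ g = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("light directions must be linearly independent")
    solution = []
    for col in range(3):
        replaced = [[b[r] if c == col else a[r][c] for c in range(3)] for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution


def normal_from_intensities(lights, intensities):
    """Recover (unit normal, albedo) at one pixel.

    Assumes each intensity equals albedo * dot(light, normal) with nonzero albedo.
    """
    g = solve3(lights, intensities)
    albedo = math.sqrt(sum(v * v for v in g))
    return [v / albedo for v in g], albedo


# Synthetic pixel: render intensities from a known normal, then recover it.
lights = [[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.6, 0.8]]
true_normal = [0.0, 0.6, 0.8]
observed = [0.5 * sum(l * n for l, n in zip(light, true_normal)) for light in lights]
normal, albedo = normal_from_intensities(lights, observed)
```

Integrating the recovered normals over the image then yields a height map of the contact surface.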
R
RL (Reinforcement Learning): Policy learning by maximizing reward. (Ch3, Ch4, Ch6, Ch7, Ch8, Ch9, Ch10, Ch13)
RT-2: Google DeepMind's VLA model jointly trained on web VQA and robot manipulation. (Ch8, Ch12, Ch13)
S
Scaling laws: Empirical relationships (often power laws) between data/model scale and performance; an active research topic in robot learning. (Ch3, Ch10, Ch13)
Shear force: Force parallel to the contact surface — essential for slip detection. (Ch1, Ch2, Ch6)
SIMPLER: Benchmark that aligns simulation evaluation with real-world performance. (Ch5)
Sim-to-Real: Process/strategies for transferring policies trained in simulation to the real world. (Ch3, Ch4, Ch6, Ch7, Ch8, Ch9, Ch10, Ch11, Ch12, Ch13)
Slip detection: Detecting an object slipping from the hand — typically via shear-force monitoring. (Ch1, Ch2, Ch3, Ch7, Ch11, Ch13)
Synergy: Coordinated motion pattern across joints — often captured via PCA-based dimensionality reduction. (Ch4, Ch5)
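The link between the Shear force and Slip detection entries above can be sketched with a Coulomb friction check: slip is expected once the shear magnitude exceeds the friction coefficient times the normal force. The function names and the friction coefficient of 0.5 are assumptions; real detectors often also use vibration or learned features.

```python
import math


def slip_margin(normal_force, shear_x, shear_y, friction_coeff):
    """Distance to the Coulomb friction limit in newtons; negative means slip is expected."""
    shear = math.hypot(shear_x, shear_y)
    return friction_coeff * normal_force - shear


def is_slipping(normal_force, shear_x, shear_y, friction_coeff=0.5):
    """True when the measured shear force exceeds the friction limit."""
    return slip_margin(normal_force, shear_x, shear_y, friction_coeff) < 0.0


# Same 4 N grip force, increasing tangential load.
holding = is_slipping(normal_force=4.0, shear_x=1.0, shear_y=1.0)
sliding = is_slipping(normal_force=4.0, shear_x=1.5, shear_y=1.5)
```

A controller can watch `slip_margin` and raise grip force before it reaches zero instead of reacting after the object has moved.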
T
Tactile foundation model: General-purpose representation pretrained on large-scale tactile data — e.g., Sparsh, UniTouch. (Ch3, Ch11)
Tactile skin: Large-area flexible tactile sensor array covering hands/arms. (Ch2, Ch7, Ch9)
Taxel: Tactile pixel — unit sensing element. Example: Digit 360 ≈ 8.3M taxels. (Ch2)
Teleoperation (TeleOp): Humans remotely operating a robot to collect demonstration data. (Ch3, Ch4, Ch6, Ch7, Ch10, Ch11, Ch13)
Tendon-driven: Actuation via tendons transmitting force — e.g., SoftHand, ORCA. (Ch4, Ch5, Ch12)
Torque control: Directly controls joint torque — essential in contact-rich environments. (Ch4)
U
Underactuation: Design with fewer actuators than degrees of freedom, enabling passive shape adaptation. (Ch4, Ch5, Ch11)
UV map: 2D planar unwrapping of a 3D surface — e.g., MANO UV map. (Ch3, Ch6, Ch10)
V
Visuo-tactile: Representation or model fusing vision and tactile information — e.g., Robot Synesthesia, 3D-ViTac. (Ch2, Ch3, Ch7, Ch11)
VLA (Vision-Language-Action): Unified model that directly outputs robot actions from vision and language input. (Ch1, Ch7, Ch8, Ch10, Ch11, Ch12, Ch13)
VSA (Variable Stiffness Actuator): Actuator whose mechanical stiffness can be adjusted during operation, anywhere from compliant to rigid. (Ch5, Ch11)