Part IV: Learning and Transfer

Chapter 13: Vision-Language-Action Models — See, Speak, Act

Written: 2026-04-01 Last updated: 2026-06-09

Overview

VLA (Vision-Language-Action) models combine large vision-language models (VLMs) with robot action outputs, aiming at general-purpose robot control that sees, understands, and acts. This paradigm, which began with RT-2 in 2023, now serves as the "brain" of the major humanoid robot programs discussed in this chapter. This chapter covers the VLA lineage, pi0's Flow Matching approach, post-deployment improvement, tactile integration, and scaling strategies.

After reading this chapter, you will be able to:

  • Trace the key inflection points in the VLA lineage from RT-1 to Gemini Robotics.
  • Understand pi0's VLM + Flow Matching architecture.
  • Describe how tactile sensing is integrated into VLAs (ForceVLA, Tactile-VLA).
  • Explain the cross-embodiment data strategy of Open X-Embodiment.

13.1 The VLA Lineage: From RT-1 to Gemini Robotics

The evolution of VLA models reflects a paradigm shift in robot learning:

RT-1 (2023)

Google / Everyday Robots' RT-1 [1] was the first large-scale real-world robot Transformer:

  • 130K episodes from 13 robots
  • Proved the feasibility of large-scale Transformers for real-world control
  • 800+ citations (RSS 2023)

RT-2 (2023)

Google DeepMind's RT-2 [2] established the VLA paradigm:

  • Fine-tuned large VLMs (PaLI-X, PaLM-E) on robot demonstration data
  • Transferred web knowledge to robotic control
  • Could execute commands like "pick up the apple"
Figure 13.1: RT-2 represents robot actions as "another language" — text tokens that are co-fine-tuned alongside Internet-scale vision-language datasets. At inference the text tokens are de-tokenized into robot actions, enabling closed-loop control and transferring the VLM's generalization, semantic understanding, and reasoning to robot control. Source: Brohan et al. (2023), Fig. 1.
Key Paper: Brohan et al. 2023. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL 2023. The landmark paper establishing the VLA paradigm. First demonstrated that vision-language knowledge from large VLMs can transfer to robot actions.

OpenVLA (2024)

Stanford/DeepMind's OpenVLA [3] launched open-source VLAs:

  • 7B parameters, about 7x fewer than the 55B-parameter RT-2-X
  • Trained on Open X-Embodiment
  • Outperformed RT-2-X
  • Fully open-source: weights, code, data

Octo (2024)

UC Berkeley's Octo [4] is a generalist robot policy:

  • Transformer-based diffusion policy
  • Pretrained on 800K episodes
  • Flexible task/observation definitions
  • Quick finetuning support
Figure 13.2: Octo model architecture. Language instructions are tokenized by a lightweight language encoder (green) and camera images by a CNN (blue); these tokens feed a Transformer backbone whose readout tokens drive a diffusion action head. During fine-tuning, new observation streams or action spaces can be attached as additional block-wise components, adapting Octo to new embodiments without retraining the backbone. Source: Octo Model Team (2024), Fig. 2.

13.2 pi0: Vision-Language Models Meet Flow Matching

Physical Intelligence's pi0 [5] represents a current state of the art:

  • PaliGemma 3B VLM backbone
  • Flow Matching action expert: Action generation via continuous normalizing flows
  • 7 robots, 68 tasks, 10,000+ hours of data
  • Pre-training → post-training two-stage recipe
Key Paper: Physical Intelligence. (2024). "pi0: A Vision-Language-Action Flow Model for General Robot Control." arXiv:2410.24164. PaliGemma VLM + Flow Matching for general robot control across 7 robot configurations. The two-stage pre-training/post-training recipe is the key innovation.

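pi0's core innovation is applying Flow Matching [8] to action generation. Flow matching trains a continuous normalizing flow without simulating the underlying ODE during training: the network regresses a velocity field along a fixed probability path from noise to data. At inference, actions are generated by integrating that field, which can require fewer steps than the iterative denoising of standard diffusion models and so makes action generation more efficient.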
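
The training target and the sampler can be sketched in a few lines. The following is a minimal illustration of conditional flow matching with a straight-line path from noise to action, and it is not pi0's actual implementation. The function names, the linear path, the closed-form `oracle_velocity` (a hypothetical stand-in for a trained network), and the 10-step Euler integrator are all assumptions made for illustration.

```python
import random


def interpolate(noise, action, t):
    """Point at time t on the straight line from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight-line path; it is constant in t."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(velocity_fn, action, rng):
    """Single-sample regression loss; no ODE is simulated during training."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, action, t)
    predicted = velocity_fn(x_t, t)
    target = target_velocity(noise, action)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(action)


def sample_action(velocity_fn, noise, num_steps=10):
    """Generate an action by forward-Euler integration of dx/dt = v(x, t)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(action):
    """Hypothetical stand-in for a trained network: exact field for one action."""
    def velocity_fn(x, t):
        return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]
    return velocity_fn
```

With the oracle field, the loss is zero up to floating-point error and the Euler steps land on the target action. A learned network would only approximate this field, and in a VLA it would additionally be conditioned on images and language.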

Figure 13.3: Overview of the pi0 framework. A pretraining mixture of self-collected dexterous manipulation data and open-source datasets (including OXE) trains a PaliGemma-based VLM backbone (~3B params, initialized from Internet pretraining) coupled to a 300M-parameter Flow Matching action expert. The single model drives 14-DoF bimanual manipulators, 18-DoF mobile manipulators, and 7+6-DoF single-arm platforms from one policy. Source: Black et al. (2024), Fig. 3.

13.3 pi0.6/RECAP: Continuous Improvement Through Post-Deployment RL

pi0.6 [15] and RECAP realize continuous improvement via post-deployment reinforcement learning:

  • Deploy initial pi0 → collect failure/success data
  • Fine-tune with RL on collected data
  • Deployment-learning-improvement data flywheel

This approach targets a fundamental limitation of VLAs, failure outside the demonstration data distribution, through continuous post-deployment learning.

pi0 Human-to-Robot Transfer (2025)

Physical Intelligence [Dec 2025] applied human co-finetuning to pi0:

  • 2x performance improvement across 4 generalization scenarios
  • Emergent alignment: Joint human/robot training produces natural alignment without explicit retargeting
  • Consistent with the co-training approaches (EgoMimic, EgoScale) discussed in Chapter 15.6

EgoVLA (2025)

EgoVLA [arXiv Jul 2025] is a VLA pretrained on egocentric human videos then fine-tuned for robots:

  • Learns action representations from MANO hand parameters in human egocentric video
  • Resolves human hand → robot hand representation alignment within the VLA framework
  • Integrates Chapter 6.1's MANO model with Chapter 15's retargeting approaches in a VLA context

PhysBrain (2025)

PhysBrain [arXiv Dec 2025] fine-tunes VLMs for the physical world using large-scale VQA data:

  • Generates 3M VQA pairs from Ego4D/BuildAI
  • VLM fine-tune → 53.9% SimplerEnv success
  • Demonstrates that human egocentric video is effective for teaching VLAs physical common sense

13.4 Tactile Integration: ForceVLA and Tactile-VLA

Integrating tactile/force information as a "first-class modality" in VLAs is an emerging direction.

ForceVLA (2025)

Yu et al. [9] (→ introduced in Chapter 12.4):

  • FVLMoE: Dynamic routing across 4 experts
  • Force sensor integration into pi0-based VLA
  • +23.2 percentage points over force-free baseline
  • 90% success under visual occlusion

Tactile-VLA (2025)

Tactile-VLA [10] unlocks VLA's physical knowledge through tactile sensing:

  • Pretrained vision-language knowledge contributes to tactile generalization
  • Improved generalization on contact-rich tasks

Challenges of Tactile Integration

The key challenge is temporal resolution mismatch:

  • Vision: ~30 Hz
  • Force/tactile: 100-1,000 Hz
  • Transformer latency: Real-time constraints during inference

ForceVLA's MoE approach partially addresses this through dynamic routing, but a fundamental solution remains an open problem (→ Chapter 18.1).
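
One common way to bridge the rate gap is to reduce the high-rate force stream to one compact summary per vision frame. The sketch below illustrates that idea only and is not how ForceVLA or Tactile-VLA handle it. The function name, the summary fields, and the default rates (1,000 Hz force, 30 Hz vision, taken from the ranges above) are assumptions.

```python
def summarize_force_windows(force_samples_n, force_hz=1000, vision_hz=30):
    """Return one summary dict per vision frame from a high-rate force stream."""
    if force_hz < vision_hz:
        raise ValueError("force stream must be at least as fast as vision")
    num_frames = len(force_samples_n) * vision_hz // force_hz
    summaries = []
    for frame in range(num_frames):
        # Integer index arithmetic avoids drift when the rates do not divide evenly.
        start = frame * force_hz // vision_hz
        end = (frame + 1) * force_hz // vision_hz
        window = force_samples_n[start:end]
        # Largest sample-to-sample change, including the step into the window.
        steps = [
            abs(force_samples_n[i] - force_samples_n[i - 1])
            for i in range(max(start, 1), end)
        ]
        summaries.append({
            "mean_n": sum(window) / len(window),
            "peak_n": max(abs(f) for f in window),
            "max_step_n": max(steps, default=0.0),
        })
    return summaries
```

Each vision frame then carries a mean, a peak, and a largest-step value for its 33 or 34 force samples. The summary keeps short force transients visible to a slow model, but it does not remove the need for a fast loop that reacts to them directly.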

Figure 13.4: Three capabilities unlocked by Tactile-VLA. (a) Tactile-Aware Instruction Following — mapping language modifiers like "softly" / "hard" to force control; (b) Tactile-Relevant Common Sense — inferring grip force for unseen objects (heavy iron ball vs. fragile pitaya) from hardness/weight priors; (c) Adaptive Tactile-Involved Reasoning — recovering from wipe failures through CoT-based force adjustment. Source: Huang et al. (2025), Fig. 1.

13.5 Scaling and Data: Open X-Embodiment

VLA performance strongly depends on data scale. Open X-Embodiment [2024, ICRA] is the key solution:

  • 1M+ trajectories from 34 laboratories
  • 22 embodiments: Diverse robot forms
  • RT-1-X: 50% improvement via cross-embodiment training
  • RT-2-X: 3x performance improvement
  • 300+ citations
Key Paper: Open X-Embodiment Collaboration 2024. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." ICRA 2024. The largest open-source robot dataset from 34 labs with 1M+ trajectories. Demonstrated the power of cross-embodiment learning.

NVIDIA's synthetic data pipeline also contributes to scale: it generated 780K trajectories (equivalent to 6,500 hours of demonstration) in 11 hours and improved real-world performance by 40%. GR00T N1 [16], which NVIDIA describes as the world's first open humanoid foundation model, applies cross-embodiment VLA to manipulation. GR00T N1.6 [17] added reasoning capabilities via Cosmos Reason.

Figure 13.5: Open X-Embodiment dataset overview. 21 institutions contributed 22 embodiments, 527 skills, 60 datasets, and 1M+ trajectories in a unified format, enabling cross-embodiment learning at scale. Source: Open X-Embodiment Collaboration (2024), Fig. 1.

13.6 Limitations and Outlook for VLAs

Current Limitations

The VLA Systematic Review [2026, Information Fusion] analyzed 102 models, 26 datasets, and 12 simulation platforms to identify:

  1. Insufficient cross-task generalization: Analysis of 164 VLA papers at ICLR 2026 shows cross-task generalization remains inadequate
  2. Long-horizon task failure: Error compounding in multi-step tasks beyond 5-30 seconds
  3. Open-world objects: Failure on objects absent from training data
  4. Material properties not captured: Vision alone cannot infer friction, compliance — the case for tactile integration
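
The long-horizon item can be made concrete with a deliberately simple model. The sketch below assumes that every step of a task succeeds independently with the same probability, which real tasks do not satisfy exactly. The function names and the example numbers are illustrative and are not taken from the review.

```python
import math


def chain_success(per_step_success, num_steps):
    """Probability that all steps succeed, assuming independent steps."""
    return per_step_success ** num_steps


def max_steps_for_target(per_step_success, target_success):
    """Largest step count that keeps the whole chain at or above the target."""
    if not (0.0 < per_step_success < 1.0 and 0.0 < target_success < 1.0):
        raise ValueError("both probabilities must lie strictly between 0 and 1")
    return math.floor(math.log(target_success) / math.log(per_step_success))
```

Under this assumption, a policy that succeeds on 95% of individual steps completes a 10-step task only about 60% of the time, and `max_steps_for_target(0.95, 0.5)` shows that it falls below 50% after 13 steps.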

Architecture Design Principles

"What Matters in Building VLA Models" [2025, Nature Machine Intelligence] finds through systematic analysis:

  • Hierarchical/late fusion architectures achieve best generalization
  • Diffusion decoders are optimal for action generation
  • These principles align with ForceVLA's MoE architecture

Gemini Robotics (2025)

Google DeepMind's Gemini Robotics [6] is a VLA family built on Gemini 2.0:

  • Gemini Robotics-ER: Embodied reasoning including spatial understanding and grasp prediction
  • Precision control for dexterous manipulation
  • The most ambitious attempt toward a "universal robot brain"
Figure 13.6: Gemini Robotics family overview. Building on Gemini 2.0's spatial-temporal reasoning, Gemini Robotics-ER (embodied reasoning) and Gemini Robotics (robotics-specific VLA) branch out and are designed to adapt to new embodiments, dexterous tasks, and advanced reasoning. Embodied reasoning primitives — 3D object detection, 2D pointing, grasp/trajectory prediction — form the upper control layer that wraps the VLA. Source: Gemini Robotics Team (2025), Fig. 1.

Outlook: First of Eight Industry Trends

As detailed in Chapter 17, "VLA as Standard Brain" is the first of eight industry trends. The major humanoid efforts have adopted a VLA architecture: Figure's Helix [18], NVIDIA's GR00T [16], Google's Gemini Robotics [6], and Physical Intelligence's pi0 [5].


Summary and Outlook

VLA models have rapidly evolved from RT-1's feasibility proof to pi0's general control and Gemini Robotics' embodied reasoning. Open X-Embodiment's cross-embodiment data and NVIDIA's synthetic data address scale; ForceVLA and Tactile-VLA integrate touch as a first-class modality; pi0.6/RECAP enables continuous post-deployment improvement.

However, key challenges remain: long-horizon tasks, open-world generalization, and vision-tactile temporal alignment. These are systematically addressed in Chapter 18.

The next chapter examines Sim-to-Real transfer — bringing VLA and RL policies from simulation to reality (→ Chapter 14).


Manufacturing-Cell Checkpoint

For tactile VLA systems, the first design question is where the tactile stream is consumed. Feeding every taxel into a giant model is costly and slow. Treating tactile data only as a post-hoc log misses real-time contact failures. A practical architecture splits tactile data into fast reflex inputs and compact contact summaries. The VLA consumes state such as contact existence, slip history, and force-limit violations rather than raw sensor frames at every control tick.
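
A compact contact summary of this kind might look like the following sketch. The class names, field names, units, and thresholds are hypothetical. A real system would calibrate them per sensor and would likely add contact location and shear.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ContactSummary:
    in_contact: bool
    active_taxels: int
    total_force_n: float
    force_limit_violated: bool


def summarize_taxel_frame(taxel_forces_n, contact_threshold_n=0.05,
                          force_limit_n=10.0):
    """Reduce one raw taxel frame to the few values a slow policy needs."""
    active = [f for f in taxel_forces_n if f > contact_threshold_n]
    total = sum(active)
    return ContactSummary(
        in_contact=bool(active),
        active_taxels=len(active),
        total_force_n=total,
        force_limit_violated=total > force_limit_n,
    )


class SlipHistory:
    """Rolling count of slip detections over the most recent frames."""

    def __init__(self, window=50):
        self._events = deque(maxlen=window)

    def update(self, slipped):
        self._events.append(bool(slipped))
        return sum(self._events)
```

The fast reflex loop would still read the raw frame. Only the `ContactSummary` and the rolling slip count would be passed up to the VLA.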

In manufacturing, a VLA should improve after failure, not merely imitate successful videos. Tactile logs must be linked to grasp candidates, SKU, pose estimates, operator overrides, and product-damage flags. Only then can post-deployment update loops and the S6/S9 manufacturing data flywheel work. Touch is therefore not just an extra modality for the model; it is a safety gate and a failure-labeling channel.

Operational Reading Note

The practical value of this chapter is not only the concept of VLA and tactile integration; it is the set of engineering decisions that the concept changes. A deployable robot-hand project should start by asking what state becomes observable after this chapter is applied. The answer should be concrete: contact existence, contact patch, normal force, shear direction, slip margin, object pose, task phase, operator override, or product-damage risk. If a variable cannot be logged or consumed by a controller, it remains an explanatory idea rather than a system capability.

The second decision is the unit of evidence. Research demos often report one success metric, but tactile manipulation improves through failure records. A useful attempt record contains the object or SKU, the selected grasp candidate, the robot hand and sensor configuration, calibration version, task phase, tactile summary, policy action, safety intervention, and final outcome. This record is what connects the sensor chapters to the data chapter, the control chapters to the learning chapters, and the manufacturing chapters to QA.

The third decision is where the chapter sits in the control stack. Some ideas belong in fast reflex loops, some in contact MPC, some in policy inputs, and some only in offline diagnosis. Mixing these time scales creates brittle systems: a VLA cannot react to millisecond slip, and a low-level force controller cannot infer the next process step. The right architecture separates fast contact stabilization, mid-level grasp or rearrangement control, and slow task reasoning.

Finally, the chapter should be evaluated by the failure modes it removes. A method that improves benchmark success but leaves the team unable to distinguish perception failure, contact-acquisition failure, force-closure failure, execution-time slip, or maintenance drift is not yet production-ready. A method with slightly lower headline performance but better logs, safer force limits, and clearer recovery hooks may be the stronger basis for manufacturing Physical AI.

Chapter-Specific Implementation Framework

Turning VLA and tactile integration into a working system begins with state definition. The concept should not remain an abstract performance claim; it should become a variable that a controller and a logger can both read. For this chapter, the relevant state may include tactile summary, contact patch, normal force, shear direction, object pose, task phase, safety margin, operator override, and product-damage risk. Each variable needs a coordinate frame, a timestamp convention, a calibration version, and an owner in the control stack. Without this discipline, a successful trial is hard to explain and a failed trial is almost impossible to diagnose.

The second step is time-scale separation. A fast loop at hundreds of hertz or 1 kHz should handle motor current, force derivatives, shear spikes, and slip reflexes. A mid-level loop at tens of hertz should update contact pose, grasp phase, and reference finger motion. A slower task loop should reason over object identity, SKU, fixture state, instruction, and the next grasp candidate. VLA and tactile integration must be assigned to the right layer. A VLA cannot react to millisecond slip. A low-level force controller cannot infer the next process step. A robust architecture lets these layers communicate through compact state summaries rather than forcing every signal into one monolithic policy.
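
The layering can be illustrated with a single-threaded simulation on a 1 kHz tick. This is a toy sketch under assumed rates (mid-level loop every 20 ticks, task loop every 1,000 ticks) and an assumed slip test (a force jump larger than a threshold). A real controller would run these loops in separate real-time tasks.

```python
def run_layers(force_trace_n, slip_step_n=0.5, mid_every=20, slow_every=1000):
    """Simulate three control layers on one tick stream; return what each saw."""
    counts = {"fast": 0, "mid": 0, "slow": 0}
    reflex_ticks = []    # ticks where the fast loop fired a slip reflex
    slow_summaries = []  # slip counts handed to the slow task loop
    slips_since_slow = 0
    previous = force_trace_n[0] if force_trace_n else 0.0

    for tick, force in enumerate(force_trace_n):
        # Fast loop (every tick): react to a force jump without waiting.
        counts["fast"] += 1
        if abs(force - previous) > slip_step_n:
            reflex_ticks.append(tick)
            slips_since_slow += 1
        previous = force

        # Mid-level loop: would update contact pose and grasp phase here.
        if tick % mid_every == 0:
            counts["mid"] += 1

        # Slow loop: consumes only a compact summary, then resets it.
        if tick % slow_every == 0:
            counts["slow"] += 1
            slow_summaries.append(slips_since_slow)
            slips_since_slow = 0

    return counts, reflex_ticks, slow_summaries
```

For a trace with a force drop at tick 1234, the reflex fires on that tick, while the slow loop learns about it only at tick 2000 as a count of one. That delay is the reason slip handling cannot live in the task layer.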

The third step is a record schema. A useful attempt record should contain attempt id, robot-hand model, sensor layout, calibration version, task phase, object or SKU id, selected grasp, measured contact patch, normal and shear force summary, slip event, policy output, safety intervention, operator note, and final outcome. In a manufacturing cell this record is also a QA trace. A research demo can be persuasive with a video, but a production experiment needs replayable evidence. For that reason, the result table for VLA and tactile integration should include failure-type distribution, retry count, product-damage rate, cycle-time variance, and intervention frequency alongside success rate.
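
One possible shape for such a record and its result table is sketched below. The field names, units, and the use of the outcome string as the failure-type label are assumptions for illustration, not a schema defined by this book.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import pvariance


@dataclass
class AttemptRecord:
    attempt_id: str
    hand_model: str
    sensor_layout: str
    calibration_version: str
    task_phase: str
    sku_id: str
    selected_grasp: str
    contact_patch_mm2: float
    normal_force_peak_n: float
    shear_force_peak_n: float
    slip_event: bool
    policy_output: str
    safety_intervention: bool
    operator_note: str
    outcome: str  # "success" or a failure-type label
    retries: int = 0
    product_damaged: bool = False
    cycle_time_s: float = 0.0


def result_table(records):
    """Aggregate attempt records into the metrics listed in the text."""
    if not records:
        raise ValueError("need at least one attempt record")
    n = len(records)
    failures = Counter(r.outcome for r in records if r.outcome != "success")
    return {
        "success_rate": sum(r.outcome == "success" for r in records) / n,
        "failure_types": dict(failures),
        "mean_retries": sum(r.retries for r in records) / n,
        "damage_rate": sum(r.product_damaged for r in records) / n,
        "cycle_time_variance_s2": pvariance([r.cycle_time_s for r in records]),
        "intervention_rate": sum(r.safety_intervention for r in records) / n,
    }
```

Because every metric is computed from the same records, a change in success rate can be traced back to individual attempts instead of being read off a video.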

The fourth step is a small test protocol. Starting with every object and every hand motion makes failures uninterpretable. A better protocol begins with atomic tasks: contact acquisition, stable hold, controlled release, contact switch, recovery after slip, and force-limited correction. The next stage composes two or three atoms into sequential manipulation. Only after that should the system attempt a Cosmax-style first grasp, in-hand rearrangement, and second grasp sequence. This staged protocol reveals whether VLA and tactile integration actually removes a failure mode or merely shifts the failure later in the trajectory.
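
The staging can be enforced with a simple gate that only releases the next task once the previous one has passed. In the sketch below, the stage and task identifiers follow the text, while the gate values (20 trials, 90% pass rate) and the function name are placeholders that a team would set for its own cell.

```python
STAGES = [
    ("atomic", [
        "contact_acquisition", "stable_hold", "controlled_release",
        "contact_switch", "slip_recovery", "force_limited_correction",
    ]),
    ("composed", ["two_or_three_atom_sequence"]),
    ("full_sequence", ["first_grasp_rearrange_second_grasp"]),
]


def next_task(results, min_trials=20, min_pass_rate=0.9):
    """Return (stage, task) for the first task that has not passed its gate.

    `results` maps a task name to a list of booleans, one per trial.
    """
    for stage, tasks in STAGES:
        for task in tasks:
            outcomes = results.get(task, [])
            enough = len(outcomes) >= min_trials
            if not enough or sum(outcomes) / len(outcomes) < min_pass_rate:
                return stage, task
    return "done", None
```

If a later regression drops an atomic task below the gate, `next_task` points back to that atom, which keeps a failure in the full sequence from being debugged at the wrong level.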

The fifth step is to treat hardware and maintenance as experimental variables. The same algorithm can behave differently when gel surfaces wear, pads become contaminated, cable tension changes, a sensor is replaced, calibration drifts, backlash grows, or surface humidity changes. The log therefore needs software version, pad age, cleaning state, calibration time, replacement event, and fault code. These fields are not administrative details. They determine whether a performance drop comes from the learned policy, the contact model, the sensor, the hand mechanics, or the production environment.

The sixth step is failure-driven decision making. The team should ask which failure class improves after adding VLA and tactile integration: perception before contact, contact acquisition, force-closure insufficiency, execution-time slip, collision, product damage, or operator override. If the answer is unclear, the method is not yet actionable. If the answer is clear, the next investment becomes much easier to choose. A contact-state problem suggests better sensing or calibration. A closure-margin problem suggests hand geometry or force control. A replay mismatch suggests simulation fidelity. A repeated intervention suggests task design, fixture design, or operator workflow.
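
The mapping from failure class to next investment can be written down directly. The sketch below encodes the four pairings named above. The class labels and function names are hypothetical, and any label outside the table is reported as unclassified instead of being guessed.

```python
from collections import Counter

NEXT_INVESTMENT = {
    "contact_state": "better sensing or calibration",
    "closure_margin": "hand geometry or force control",
    "replay_mismatch": "simulation fidelity",
    "repeated_intervention": "task design, fixture design, or operator workflow",
}


def failure_deltas(before, after):
    """Change in count per failure class after a method is added."""
    b, a = Counter(before), Counter(after)
    return {label: a[label] - b[label] for label in sorted(set(b) | set(a))}


def recommend(failures):
    """Return (dominant failure class, suggested investment), or None if empty."""
    counts = Counter(failures)
    if not counts:
        return None
    label, _ = counts.most_common(1)[0]
    return label, NEXT_INVESTMENT.get(
        label, "unclassified: refine failure labels first")
```

A negative delta for one class and flat counts elsewhere is the clear answer the text asks for. If every delta is near zero, the method has not yet shown which failure it removes.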

Implementation question | Evidence to log | Passing criterion
Is the state observable? | sensor packet, calibrated value, contact frame | controller and QA read the same value
Are control layers separated? | fast reflex, mid-level planner, slow policy timestamps | fast contact events do not wait for slow task reasoning
Can failures be classified? | failure type, task phase, intervention note | root cause narrows to a small set of candidates
Is maintenance visible? | pad age, calibration version, replacement event | hardware drift can be separated from policy error
Does it connect to manufacturing KPI? | cycle time, damage rate, retry count, downtime | research success translates into operating metrics

References

  1. Brohan, A., Brown, N., et al. (2023). RT-1: Robotics Transformer for real-world control at scale. RSS 2023. arXiv:2212.06817.
  2. Brohan, A., Brown, N., et al. (2023). RT-2: Vision-Language-Action models transfer web knowledge to robotic control. CoRL 2023. arXiv:2307.15818.
  3. Kim, M. J., Pertsch, K., Karamcheti, S., et al. (2024). OpenVLA: An open-source Vision-Language-Action model. arXiv:2406.09246.
  4. Octo Model Team. (2024). Octo: An open-source generalist robot policy. arXiv:2405.12213.
  5. Black, K., Brown, N., et al. (2024). pi0: A Vision-Language-Action flow model for general robot control. arXiv:2410.24164.
  6. Google DeepMind. (2025). Gemini Robotics: Bringing AI into the physical world. arXiv:2503.20020.
  7. Open X-Embodiment Collaboration. (2024). Open X-Embodiment: Robotic learning datasets and RT-X models. ICRA 2024. arXiv:2310.08864.
  8. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. ICLR 2023. arXiv:2210.02747.
  9. Yu, J., Liu, H., Yu, Q., et al. (2025). ForceVLA: Enhancing VLA models with a force-aware MoE for contact-rich manipulation. NeurIPS 2025.
  10. Huang, J., Wang, S., Lin, F., Hu, Y., Wen, C., & Gao, Y. (2025). Tactile-VLA: Unlocking vision-language-action model's physical knowledge for tactile generalization. arXiv:2507.09160.
  11. Helmut, E., Funk, N., Schneider, T., de Farias, C., & Peters, J. (2025). Tactile-conditioned diffusion policy for force-aware robotic manipulation. ICRA 2026. arXiv:2510.13324.
  12. Various. (2026). VLA systematic review. Information Fusion (Elsevier). https://doi.org/10.1016/j.inffus.2025.103148.
  13. Various. (2025). What matters in building VLA models. Nature Machine Intelligence. https://doi.org/10.1038/s42256-025-01168-7.
  14. Various. (2025). Diffusion models for robotic manipulation survey. Frontiers. https://doi.org/10.3389/frobt.2025.1606247.
  15. Various. (2025). pi0.6/RECAP: Post-deployment RL for continuous improvement.
  16. NVIDIA. (2025). GR00T N1: Open humanoid foundation model.
  17. NVIDIA. (2026). GR00T N1.6: Added reasoning via Cosmos Reason.
  18. Figure AI. (2025). Helix: VLA for full humanoid upper body (35 DoF).
  19. Physical Intelligence. (2025). pi0 human-to-robot transfer: Human co-finetuning for generalization. Technical report.
  20. Various. (2025). EgoVLA: Egocentric human video pretraining for robot VLA. arXiv preprint.
  21. Various. (2025). PhysBrain: Physical world fine-tuning of VLMs via 3M VQA from Ego4D. arXiv preprint.