Chapter 13: Vision-Language-Action Models — See, Speak, Act
Overview
VLA (Vision-Language-Action) models combine large vision-language models (VLMs) with robot actions to pursue general-purpose robot control that sees, understands, and acts. This paradigm, which began with RT-2 in 2023, is now adopted as the "brain" of every major humanoid robot. This chapter covers the VLA lineage, pi0 [#2]'s Flow Matching approach, post-deployment improvement, tactile integration, and scaling strategies.
After reading this chapter, you will be able to:
- Trace the key inflection points in the VLA lineage from RT-1 to Gemini Robotics.
- Understand pi0's VLM + Flow Matching architecture.
- Describe how tactile sensing is integrated into VLAs (ForceVLA [#1], Tactile-VLA).
- Explain the cross-embodiment data strategy of Open X-Embodiment.
13.1 The VLA Lineage: From RT-1 to Gemini Robotics
The evolution of VLA models reflects a paradigm shift in robot learning:
RT-1 (2023)
Google / Everyday Robots' RT-1 [1] was the first large-scale real-world robot Transformer:
- 130K episodes from 13 robots
- Proved the feasibility of large-scale Transformers for real-world control
- 800+ citations (RSS 2023)
RT-2 (2023)
Google DeepMind's RT-2 [2] established the VLA paradigm:
- Fine-tuned large VLMs (PaLI-X, PaLM-E) on robot demonstration data
- Transferred web knowledge to robotic control
- Could execute commands like "pick up the apple"
Key Paper: Brohan et al. 2023. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL 2023. The landmark paper establishing the VLA paradigm. First demonstrated that vision-language knowledge from large VLMs can transfer to robot actions.
OpenVLA (2024)
Stanford/DeepMind's OpenVLA [3] launched open-source VLAs:
- 7B parameters: roughly 7x fewer than the 55B RT-2-X
- Trained on Open X-Embodiment
- Outperformed RT-2-X
- Fully open-source: weights, code, data
Octo (2024)
UC Berkeley's Octo[4] is a generalist robot policy:
- Transformer-based diffusion policy
- Pretrained on 800K episodes
- Flexible task/observation definitions
- Quick finetuning support
13.2 pi0: Vision-Language Models Meet Flow Matching
Physical Intelligence's pi0 [5] represents a current state of the art:
- PaliGemma 3B VLM backbone
- Flow Matching action expert: Action generation via continuous normalizing flows
- 7 robots, 68 tasks, 10,000+ hours of data
- Pre-training → post-training two-stage recipe
Key Paper: Physical Intelligence. (2024). "pi0: A Vision-Language-Action Flow Model for General Robot Control." arXiv:2410.24164. PaliGemma VLM + Flow Matching for general robot control across 7 robot configurations. The two-stage pre-training/post-training recipe is the key innovation.
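pi0's core innovation is applying Flow Matching [8] to action generation. Flow matching trains a continuous normalizing flow by regressing a velocity field directly, so training never has to simulate the flow's ODE. At inference, an action is produced by integrating that velocity field from a noise sample, which typically needs fewer steps than the iterative denoising of a standard diffusion model and makes action generation more efficient.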
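The training target and sampling loop behind flow matching can be sketched in a few lines. This is a minimal illustration, not pi0's implementation: it assumes a straight-line path from noise (t = 0) to the demonstration action (t = 1), and `velocity_fn` is a hypothetical stand-in for the action expert, which in a real VLA would also be conditioned on images and language. Time and sign conventions differ between papers.

```python
import random

def flow_matching_example(action, rng):
    """Return (x_t, t, target_velocity) for one demonstration action (list of floats)."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    # Point on the straight path from the noise sample to the action.
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    # Along a straight path the velocity is constant: action minus noise.
    target = [a - n for n, a in zip(noise, action)]
    return x_t, t, target

def flow_matching_loss(velocity_fn, actions, rng):
    """Mean squared error between predicted and target velocities."""
    total, count = 0.0, 0
    for action in actions:
        x_t, t, target = flow_matching_example(action, rng)
        pred = velocity_fn(x_t, t)
        total += sum((p - u) ** 2 for p, u in zip(pred, target))
        count += len(action)
    return total / count

def sample_action(velocity_fn, dim, rng, steps=10):
    """Generate an action by Euler-integrating the velocity field from noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Note that the loss only needs a random time and a noise sample per demonstration; no ODE is solved during training. The integration loop appears only at inference, where the number of steps trades accuracy against latency.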
13.3 pi0.6/RECAP: Continuous Improvement Through Post-Deployment RL
pi0.6[15] [#4] and RECAP realize continuous improvement via post-deployment reinforcement learning:
- Deploy initial pi0 → collect failure/success data
- Fine-tune with RL on collected data
- Deployment-learning-improvement data flywheel
This approach overcomes VLA's fundamental limitation — failure outside the demonstration data distribution — through continuous post-deployment learning.
pi0 Human-to-Robot Transfer (2025)
Physical Intelligence [Dec 2025] applied human co-finetuning to pi0:
- 2x performance improvement across 4 generalization scenarios
- Emergent alignment: Joint human/robot training produces natural alignment without explicit retargeting
- Consistent with the co-training approaches (EgoMimic, EgoScale) discussed in Chapter 15.6
EgoVLA (2025)
EgoVLA [arXiv Jul 2025] is a VLA pretrained on egocentric human videos then fine-tuned for robots:
- Learns action representations from MANO [#17] hand parameters in human egocentric video
- Resolves human hand → robot hand representation alignment within the VLA framework
- Integrates Chapter 6.1's MANO model with Chapter 15's retargeting approaches in a VLA context
PhysBrain (2025)
PhysBrain [arXiv Dec 2025] fine-tunes VLMs for the physical world using large-scale VQA data:
- Generates 3M VQA pairs from Ego4D/BuildAI
- VLM fine-tune → 53.9% SimplerEnv success
- Demonstrates that human egocentric video is effective for teaching VLAs physical common sense
13.4 Tactile Integration: ForceVLA and Tactile-VLA
Integrating tactile/force information as a "first-class modality" in VLAs is an emerging direction.
ForceVLA (2025)
Yu et al. [9] (→ introduced in Chapter 12.4):
- FVLMoE: Dynamic routing across 4 experts
- Force sensor integration into pi0-based VLA
- +23.2 percentage points over force-free baseline
- 90% success under visual occlusion
Tactile-VLA (2025)
Tactile-VLA[10] unlocks VLA's physical knowledge through tactile sensing:
- Pretrained vision-language knowledge contributes to tactile generalization
- Improved generalization on contact-rich tasks
Challenges of Tactile Integration
The key challenge is the mismatch between sensor rates and model inference speed:
- Vision: ~30 Hz
- Force/tactile: 100-1,000 Hz
- Transformer latency: Real-time constraints during inference
ForceVLA's MoE approach partially addresses this through dynamic routing, but a fundamental solution remains an open problem (→ Chapter 18.1).
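One common way to bridge the rate gap is to summarize the high-rate force stream once per camera frame, so the slower model sees a compact feature vector instead of every sample. The sketch below is an illustration under assumptions, not the method used by ForceVLA or Tactile-VLA: the function name and the chosen statistics are hypothetical, and timestamps are assumed to be sorted and expressed in one shared unit.

```python
from bisect import bisect_right
from statistics import fmean

def summarize_force_per_frame(frame_times, force_times, force_values):
    """Summarize the force samples that arrived since the previous camera frame.

    frame_times and force_times are sorted timestamps in the same unit;
    force_values[i] is the reading taken at force_times[i].
    """
    summaries = []
    lo = 0
    for t in frame_times:
        hi = bisect_right(force_times, t)  # index past the last sample with time <= t
        window = force_values[lo:hi]
        if window:
            # Largest jump between consecutive samples: a crude proxy for force rate.
            steps = [abs(b - a) for a, b in zip(window, window[1:])]
            summaries.append({
                "count": len(window),
                "mean": fmean(window),
                "peak": max(window),
                "max_step": max(steps, default=0.0),
            })
        else:
            summaries.append({"count": 0, "mean": None, "peak": None, "max_step": None})
        lo = hi
    return summaries
```

With a 1 kHz force sensor and a 30 Hz camera, each window holds roughly 33 samples. Summaries like these reduce bandwidth but do not remove inference latency, which is why fast contact events still need a separate reflex path.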
13.5 Scaling and Data: Open X-Embodiment
VLA performance depends strongly on data scale. Open X-Embodiment [2024, ICRA] is the main community effort to address this:
- 1M+ trajectories from 34 laboratories
- 22 embodiments: Diverse robot forms
- RT-1-X: 50% improvement via cross-embodiment training
- RT-2-X: 3x performance improvement
- 300+ citations
Key Paper: Open X-Embodiment Collaboration 2024. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." ICRA 2024. The largest open-source robot dataset from 34 labs with 1M+ trajectories. Demonstrated the power of cross-embodiment learning.
NVIDIA's synthetic data pipeline plays a critical role: 780K trajectories (6,500 hours equivalent) generated in 11 hours, improving real performance by 40%. GR00T N1 [2025] is the world's first open humanoid foundation model applying cross-embodiment VLA to manipulation. GR00T N1.6[17] added reasoning capabilities via Cosmos Reason.
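The three synthetic-data figures quoted above imply two further quantities that are easy to check: the average trajectory length and the generation rate relative to wall-clock time. The short calculation below uses only the numbers in the text; the function name is illustrative.

```python
# Figures quoted in the text for NVIDIA's synthetic-data run.
trajectories = 780_000
equivalent_hours = 6_500
wall_clock_hours = 11

def synthetic_data_ratios(trajectories, equivalent_hours, wall_clock_hours):
    """Derive average trajectory length (s) and hours of data per wall-clock hour."""
    seconds_per_trajectory = equivalent_hours * 3600 / trajectories
    hours_per_wall_clock_hour = equivalent_hours / wall_clock_hours
    return seconds_per_trajectory, hours_per_wall_clock_hour

seconds_per_trajectory, rate = synthetic_data_ratios(
    trajectories, equivalent_hours, wall_clock_hours
)
print(f"{seconds_per_trajectory:.0f} s per trajectory")
print(f"about {rate:.0f} hours of demonstration per wall-clock hour")
```

The result is an average of 30 seconds per trajectory and roughly 591 hours of demonstration data per hour of generation.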
13.6 Limitations and Outlook for VLAs
Current Limitations
The VLA Systematic Review [2026, Information Fusion] analyzed 102 models, 26 datasets, and 12 simulation platforms to identify:
- Insufficient cross-task generalization: Analysis of 164 VLA papers at ICLR 2026 shows cross-task generalization remains inadequate
- Long-horizon task failure: Error compounding in multi-step tasks beyond 5-30 seconds
- Open-world objects: Failure on objects absent from training data
- Material properties not captured: Vision alone cannot infer friction, compliance — the case for tactile integration
Architecture Design Principles
"What Matters in Building VLA Models" [2025, Nature Machine Intelligence] finds through systematic analysis:
- Hierarchical/late fusion architectures achieve best generalization
- Diffusion decoders are optimal for action generation
- These principles align with ForceVLA's MoE architecture
Gemini Robotics (2025)
Google DeepMind's Gemini Robotics[6] is a VLA family built on Gemini 2.0:
- Gemini Robotics-ER: Embodied reasoning including spatial understanding and grasp prediction
- Precision control for dexterous manipulation
- The most ambitious attempt toward a "universal robot brain"
Outlook: First of Eight Industry Trends
As detailed in Chapter 17, "VLA as Standard Brain" is the first of eight industry trends. The major humanoid and generalist-robot programs (Figure's Helix, NVIDIA's GR00T, Google DeepMind's Gemini Robotics, Physical Intelligence's pi0) have all adopted a VLA architecture.
Summary and Outlook
VLA models have rapidly evolved from RT-1's feasibility proof to pi0's general control and Gemini Robotics' embodied reasoning. Open X-Embodiment's cross-embodiment data and NVIDIA's synthetic data address scale; ForceVLA and Tactile-VLA integrate touch as a first-class modality; pi0.6/RECAP enables continuous post-deployment improvement.
However, key challenges remain: long-horizon tasks, open-world generalization, and vision-tactile temporal alignment. These are systematically addressed in Chapter 18.
The next chapter examines Sim-to-Real transfer — bringing VLA and RL policies from simulation to reality (→ Chapter 14).
Manufacturing-Cell Checkpoint
For tactile VLA systems, the first design question is where the tactile stream is consumed. Feeding every taxel into a giant model is costly and slow. Treating tactile data only as a post-hoc log misses real-time contact failures. A practical architecture splits tactile data into fast reflex inputs and compact contact summaries. The VLA consumes state such as contact existence, slip history, and force-limit violations rather than raw sensor frames at every control tick.
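The split between reflex inputs and contact summaries can be sketched as a single per-sample handler: it returns a reflex command immediately and, as a side effect, updates the compact state that the VLA reads later. All names and threshold values here are hypothetical placeholders; real limits depend on the hand, the sensor, and the product.

```python
from dataclasses import dataclass

@dataclass
class ContactSummary:
    """Compact contact state handed to the slow policy instead of raw frames."""
    in_contact: bool = False
    slip_events: int = 0
    force_limit_violations: int = 0
    peak_normal_force: float = 0.0

def process_tactile_sample(summary, normal_force, shear_rate, *,
                           contact_threshold=0.2,    # N, assumed value
                           slip_rate_threshold=5.0,  # N/s, assumed value
                           force_limit=15.0):        # N, assumed value
    """Fast-path handler: update the summary and return a reflex command."""
    summary.in_contact = normal_force > contact_threshold
    summary.peak_normal_force = max(summary.peak_normal_force, normal_force)
    if normal_force > force_limit:
        # Safety takes priority over slip handling.
        summary.force_limit_violations += 1
        return "release_pressure"
    if summary.in_contact and abs(shear_rate) > slip_rate_threshold:
        summary.slip_events += 1
        return "tighten_grip"
    return "hold"
```

The reflex decision never waits for the VLA, while the summary object accumulates exactly the fields named above: contact existence, slip history, and force-limit violations.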
In manufacturing, a VLA should improve after failure, not merely imitate successful videos. Tactile logs must be linked to grasp candidates, SKU, pose estimates, operator overrides, and product-damage flags. Only then can post-deployment update loops and the S6/S9 manufacturing data flywheel work. Touch is therefore not just an extra modality for the model; it is a safety gate and a failure-labeling channel.
Operational Reading Note
The practical value of this chapter is not only the concept of VLA and tactile integration; it is the set of engineering decisions that the concept changes. A deployable robot-hand project should start by asking what state becomes observable once the ideas in this chapter are applied. The answer should be concrete: contact existence, contact patch, normal force, shear direction, slip margin, object pose, task phase, operator override, or product-damage risk. If a variable cannot be logged or consumed by a controller, it remains an explanatory idea rather than a system capability.
The second decision is the unit of evidence. Research demos often report one success metric, but tactile manipulation improves through failure records. A useful attempt record contains the object or SKU, the selected grasp candidate, the robot hand and sensor configuration, calibration version, task phase, tactile summary, policy action, safety intervention, and final outcome. This record is what connects the sensor chapters to the data chapter, the control chapters to the learning chapters, and the manufacturing chapters to QA.
The third decision is where each idea from this chapter sits in the control stack. Some ideas belong in fast reflex loops, some in contact MPC, some in policy inputs, and some only in offline diagnosis. Mixing these time scales creates brittle systems: a VLA cannot react to millisecond slip, and a low-level force controller cannot infer the next process step. The right architecture separates fast contact stabilization, mid-level grasp or rearrangement control, and slow task reasoning.
Finally, a method from this chapter should be evaluated by the failure modes it removes. A method that improves benchmark success but leaves the team unable to distinguish perception failure, contact-acquisition failure, force-closure failure, execution-time slip, or maintenance drift is not yet production-ready. A method with slightly lower headline performance but better logs, safer force limits, and clearer recovery hooks may be the stronger basis for manufacturing Physical AI.
Chapter-Specific Implementation Framework
Turning VLA and tactile integration into a working system begins with state definition. The concept should not remain an abstract performance claim; it should become a variable that a controller and a logger can both read. For this chapter, the relevant state may include tactile summary, contact patch, normal force, shear direction, object pose, task phase, safety margin, operator override, and product-damage risk. Each variable needs a coordinate frame, a timestamp convention, a calibration version, and an owner in the control stack. Without this discipline, a successful trial is hard to explain and a failed trial is almost impossible to diagnose.
The second step is time-scale separation. A fast loop at hundreds of hertz or 1 kHz should handle motor current, force derivatives, shear spikes, and slip reflexes. A mid-level loop at tens of hertz should update contact pose, grasp phase, and reference finger motion. A slower task loop should reason over object identity, SKU, fixture state, instruction, and the next grasp candidate. VLA and tactile integration must be assigned to the right layer. A VLA cannot react to millisecond slip. A low-level force controller cannot infer the next process step. A robust architecture lets these layers communicate through compact state summaries rather than forcing every signal into one monolithic policy.
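The layering described above can be illustrated with a single-clock simulation in which three callbacks run at different rates and exchange only compact state. This is a scheduling sketch under assumptions: the rates, the callback signatures, and the shared-state keys are hypothetical, and a real system would typically run the layers in separate real-time threads or processes.

```python
from collections import Counter

def run_control_stack(n_ticks, fast_step, mid_step, slow_step,
                      fast_hz=1000, mid_hz=50, slow_hz=2):
    """Simulate three control layers on one clock.

    Each callback receives (tick, state) and returns its layer's new summary.
    Returns (calls, state), where calls counts how often each layer ran.
    """
    mid_period = fast_hz // mid_hz    # fast ticks between mid-level updates
    slow_period = fast_hz // slow_hz  # fast ticks between task-level updates
    state = {"contact_summary": None, "grasp_phase": None, "task_goal": None}
    calls = Counter()
    for tick in range(n_ticks):
        # The fast layer runs every tick and never waits for the slower layers.
        state["contact_summary"] = fast_step(tick, state)
        calls["fast"] += 1
        if tick % mid_period == 0:
            state["grasp_phase"] = mid_step(tick, state)
            calls["mid"] += 1
        if tick % slow_period == 0:
            state["task_goal"] = slow_step(tick, state)
            calls["slow"] += 1
    return calls, state
```

Over one simulated second at the default rates, the fast layer runs 1,000 times, the mid-level layer 50 times, and the task layer twice. The slow layer (where a VLA would sit) only ever sees the latest contact summary, not the individual samples.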
The third step is a record schema. A useful attempt record should contain attempt id, robot-hand model, sensor layout, calibration version, task phase, object or SKU id, selected grasp, measured contact patch, normal and shear force summary, slip event, policy output, safety intervention, operator note, and final outcome. In a manufacturing cell this record is also a QA trace. A research demo can be persuasive with a video, but a production experiment needs replayable evidence. For that reason, the result table for VLA and tactile integration should include failure-type distribution, retry count, product-damage rate, cycle-time variance, and intervention frequency alongside success rate.
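A record schema of this kind maps naturally onto a small data class plus an aggregation function. The sketch below covers only a subset of the fields listed above, and every field name, unit, and outcome label is an assumption made for illustration rather than a fixed standard.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass

@dataclass
class AttemptRecord:
    attempt_id: str
    hand_model: str
    calibration_version: str
    task_phase: str
    sku_id: str
    selected_grasp: str
    peak_normal_force_n: float
    slip_event: bool
    safety_intervention: bool
    outcome: str      # "success" or a failure-type label
    retries: int = 0

def summarize_attempts(records):
    """Aggregate attempt records into the metrics reported next to success rate."""
    n = len(records)
    failures = Counter(r.outcome for r in records if r.outcome != "success")
    return {
        "success_rate": sum(r.outcome == "success" for r in records) / n,
        "failure_types": dict(failures),
        "mean_retries": sum(r.retries for r in records) / n,
        "intervention_rate": sum(r.safety_intervention for r in records) / n,
    }

def to_json_line(record):
    """Serialize one record for an append-only log that can be replayed later."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one JSON line per attempt keeps the log replayable, and the summary function produces a failure-type distribution instead of a single success number.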
The fourth step is a small test protocol. Starting with every object and every hand motion makes failures uninterpretable. A better protocol begins with atomic tasks: contact acquisition, stable hold, controlled release, contact switch, recovery after slip, and force-limited correction. The next stage composes two or three atoms into sequential manipulation. Only after that should the system attempt a Cosmax-style first grasp, in-hand rearrangement, and second grasp sequence. This staged protocol reveals whether VLA and tactile integration actually removes a failure mode or merely shifts the failure later in the trajectory.
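The staged protocol amounts to a gate: composed tasks are attempted only once every atomic task clears a success threshold, and the full sequence only once the composed tasks do. The following sketch encodes that gate; the task identifiers and the 0.9 threshold are hypothetical choices, not values from the text.

```python
ATOMIC_TASKS = (
    "contact_acquisition", "stable_hold", "controlled_release",
    "contact_switch", "slip_recovery", "force_limited_correction",
)

def next_stage(success_rates, threshold=0.9):
    """Return (stage, blocking_tasks) given per-task success rates in [0, 1].

    Tasks missing from success_rates count as untested and therefore block.
    """
    blocking = [t for t in ATOMIC_TASKS if success_rates.get(t, 0.0) < threshold]
    if blocking:
        return "atomic", blocking
    if success_rates.get("composed_sequence", 0.0) < threshold:
        return "composed", ["composed_sequence"]
    return "full_sequence", []
```

Because the function returns the blocking tasks, a failure in the full sequence can be traced back to the atom that was never reliable in isolation.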
The fifth step is to treat hardware and maintenance as experimental variables. The same algorithm can behave differently when gel surfaces wear, pads become contaminated, cable tension changes, a sensor is replaced, calibration drifts, backlash grows, or surface humidity changes. The log therefore needs software version, pad age, cleaning state, calibration time, replacement event, and fault code. These fields are not administrative details. They determine whether a performance drop comes from the learned policy, the contact model, the sensor, the hand mechanics, or the production environment.
The sixth step is failure-driven decision making. The team should ask which failure class improves after adding VLA and tactile integration: perception before contact, contact acquisition, force-closure insufficiency, execution-time slip, collision, product damage, or operator override. If the answer is unclear, the method is not yet actionable. If the answer is clear, the next investment becomes much easier to choose. A contact-state problem suggests better sensing or calibration. A closure-margin problem suggests hand geometry or force control. A replay mismatch suggests simulation fidelity. A repeated intervention suggests task design, fixture design, or operator workflow.
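The mapping from failure class to next investment can be made explicit as a lookup driven by logged failure labels. In this sketch the label names and the 40% dominance threshold are assumptions; the recommendations themselves are taken from the paragraph above. When no class dominates, the function returns `None`, mirroring the point that an unclear answer is not yet actionable.

```python
from collections import Counter

NEXT_INVESTMENT = {
    "contact_state": "sensing or calibration",
    "closure_margin": "hand geometry or force control",
    "replay_mismatch": "simulation fidelity",
    "repeated_intervention": "task design, fixture design, or operator workflow",
}

def recommend_investment(failure_labels, min_share=0.4):
    """Return the suggested investment for the dominant failure class, or None."""
    counts = Counter(failure_labels)
    if not counts:
        return None
    label, n = counts.most_common(1)[0]
    # Require a clearly dominant, known class before recommending anything.
    if n / len(failure_labels) < min_share or label not in NEXT_INVESTMENT:
        return None
    return NEXT_INVESTMENT[label]
```

For example, a log dominated by `contact_state` failures points toward sensing or calibration work, while an even spread across classes yields no recommendation.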
| Implementation question | Evidence to log | Passing criterion |
|---|---|---|
| Is the state observable? | sensor packet, calibrated value, contact frame | controller and QA read the same value |
| Are control layers separated? | fast reflex, mid-level planner, slow policy timestamps | fast contact events do not wait for slow task reasoning |
| Can failures be classified? | failure type, task phase, intervention note | root cause narrows to a small set of candidates |
| Is maintenance visible? | pad age, calibration version, replacement event | hardware drift can be separated from policy error |
| Does it connect to manufacturing KPI? | cycle time, damage rate, retry count, downtime | research success translates into operating metrics |
References
- Brohan, A., Brown, N., et al. (2023). RT-1: Robotics Transformer for real-world control at scale. RSS 2023. arXiv:2212.06817.
- Brohan, A., Brown, N., et al. (2023). RT-2: Vision-Language-Action models transfer web knowledge to robotic control. CoRL 2023. arXiv:2307.15818.
- Kim, M. J., Pertsch, K., Karamcheti, S., et al. (2024). OpenVLA: An open-source Vision-Language-Action model. arXiv:2406.09246.
- Octo Model Team. (2024). Octo: An open-source generalist robot policy. arXiv:2405.12213. #55
- Black, K., Brown, N., et al. (2024). pi0: A Vision-Language-Action flow model for general robot control. arXiv:2410.24164. #2
- Google DeepMind. (2025). Gemini Robotics: Bringing AI into the physical world. arXiv:2503.20020.
- Open X-Embodiment Collaboration. (2024). Open X-Embodiment: Robotic learning datasets and RT-X models. ICRA 2024. arXiv:2310.08864.
- Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. ICLR 2023. arXiv:2210.02747.
- Yu, J., Liu, H., Yu, Q., et al. (2025). ForceVLA: Enhancing VLA models with a force-aware MoE for contact-rich manipulation. NeurIPS 2025. #1
- Huang, J., Wang, S., Lin, F., Hu, Y., Wen, C., & Gao, Y. (2025). Tactile-VLA: Unlocking vision-language-action model's physical knowledge for tactile generalization. arXiv:2507.09160.
- Helmut, E., Funk, N., Schneider, T., de Farias, C., & Peters, J. (2025). Tactile-conditioned diffusion policy for force-aware robotic manipulation. ICRA 2026. arXiv:2510.13324.
- Various. (2026). VLA systematic review. Information Fusion (Elsevier). https://doi.org/10.1016/j.inffus.2025.103148.
- Various. (2025). What matters in building VLA models. Nature Machine Intelligence. https://doi.org/10.1038/s42256-025-01168-7.
- Various. (2025). Diffusion models for robotic manipulation survey. Frontiers. https://doi.org/10.3389/frobt.2025.1606247.
- Various. (2025). pi0.6/RECAP: Post-deployment RL for continuous improvement. #4
- NVIDIA. (2025). GR00T N1: Open humanoid foundation model.
- NVIDIA. (2026). GR00T N1.6: Added reasoning via Cosmos Reason.
- Figure AI. (2025). Helix: VLA for full humanoid upper body (35 DoF).
- Physical Intelligence. (2025). pi0 human-to-robot transfer: Human co-finetuning for generalization. Technical report.
- Various. (2025). EgoVLA: Egocentric human video pretraining for robot VLA. arXiv preprint.
- Various. (2025). PhysBrain: Physical world fine-tuning of VLMs via 3M VQA from Ego4D. arXiv preprint.