Part V: Integration and Manufacturing Outlook

Chapter 18: Limitations and Future Directions — Toward Manufacturing Physical AI

Written: 2026-04-01 Last updated: 2026-06-09

Overview

This book has painted the full picture from tactile sensors to VLA models, from robot hands to industrial deployment. This final chapter systematically organizes the Top 10 common limitations across 131 papers, presents future research directions in five groups, identifies ten manufacturing-specific challenges, and proposes our research agenda.

After reading this chapter, you will be able to:

  • List the 10 most common limitations in tactile manipulation research.
  • Describe future directions across 5 research groups (sensing, learning, hardware, data, deployment).
  • Understand the 3-5 year gap between research demonstrations and manufacturing deployment.
  • Explain the mechanism-tactile-learning triangle as a research direction.

18.1 Top 10 Common Limitations

Rank | Limitation | Frequency (papers) | Core Issue | Related Chapter
1 | Sim-to-Real Gap | 20+ | Tactile harder than visual | Ch14
2 | Novel Object Generalization | 15+ | Material properties not captured | Ch12, Ch13
3 | Sensor Durability | 12+ | Gel wear, calibration loss | Ch2
4 | Data Scarcity | 12+ | ~10 demos/hr; tactile especially hard (note a) | Ch3, Ch6
5 | Hardware Cost/Fragility | 10+ | No MTBF data | Ch4
6 | Cross-Embodiment Transfer | 10+ | Tactile: UniTacHand only (note b) | Ch15
7 | Multi-Modal Fusion Timing | 8+ | Vision 30Hz vs tactile 1000Hz | Ch13, Ch16
8 | Safety in Human Proximity | 7+ | ISO/TS 15066 not met | Ch5
9 | Long-Horizon Task | 7+ | Error compounding beyond 5-30s | Ch13
10 | Evaluation Standardization | 6+ | No established benchmark | Ch16

Note a (Data Scarcity): EgoScale (NVIDIA, 2026) [21] reported log-linear validation-loss scaling over 20,854 hours of egocentric human video and correlation with downstream success. This strengthens the case for large-scale human data, but the data lacks tactile/force channels, so contact-rich manufacturing still requires tactile specialist data.

Note b (Cross-Embodiment Transfer): Kinematic and visual domains progressed rapidly during 2024-2026. EgoMimic (Georgia Tech, 2024) [22]: 1hr human data > 2hr robot data (+34-228%); X-Sim (Cornell, CoRL 2025 Oral) [23]: real task success with zero robot data; VidBot (TU Munich, CVPR 2025) [24]: +20% zero-shot from internet video alone; EgoZero (2025) [25]: 70% success on 7 tasks from smart glasses only. A general solution for the tactile domain remains elusive, but cross-embodiment transfer possibilities are expanding rapidly.
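To make the fusion-timing limitation (rank 7) concrete, the following is a minimal sketch of one common way to reconcile the two rates: summarize each block of roughly 33 tactile samples into a few features per vision frame. The function name, the choice of features, and the synthetic signal are illustrative assumptions, not a method taken from any of the surveyed papers.

```python
import numpy as np

def tactile_features_per_frame(tactile, tactile_hz=1000.0, vision_hz=30.0):
    """Summarize a 1-D tactile stream over each vision-frame interval.

    Returns an array of shape (n_frames, 3): mean, peak, and largest
    sample-to-sample change inside each window.
    """
    tactile = np.asarray(tactile, dtype=float)
    n_frames = int(len(tactile) * vision_hz / tactile_hz)
    # Tactile sample index at each vision-frame boundary (about 33 samples apart).
    edges = np.round(np.arange(n_frames + 1) * tactile_hz / vision_hz).astype(int)
    feats = np.zeros((n_frames, 3))
    for i, (start, stop) in enumerate(zip(edges[:-1], edges[1:])):
        window = tactile[start:stop]
        jump = np.abs(np.diff(window)).max() if len(window) > 1 else 0.0
        feats[i] = (window.mean(), window.max(), jump)
    return feats

# Synthetic example: one second of tactile data with a 5 ms force spike.
signal = np.zeros(1000)
signal[500:505] = 5.0
features = tactile_features_per_frame(signal)
```

The peak and jump columns preserve the short spike that a plain 30 Hz resampling would likely miss, which is why windowed statistics are often preferred over simple downsampling. A policy that needs to react within the window still requires a separate fast loop.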

18.2 Future Research Directions (5 Groups)

A. Sensing & Perception

  • Tactile foundation models at scale (Sparsh 460K → 10x+ needed)
  • Neuromorphic sensors: spike-based, event-driven (NRE-skin)
  • Self-healing/self-calibrating sensors for industrial longevity
  • Sensor-agnostic representations (AnyTouch, Sensor-Invariant)
  • Whole-hand dense tactile coverage as standard (F-TAC direction)
  • Affordable compact F/T sensors: CoinFT [28] demonstrated that 6-axis F/T sensing at <$10 is achievable, but key challenges remain — vulnerability to tensile/peeling forces and the need for standardized packaging across different hand morphologies [Choi, SNU Seminar 2026]

B. Learning & Control

  • VLA + tactile as first-class modality (ForceVLA [#1], Tactile-VLA)
  • Post-deployment RL for continuous improvement (pi0.6 [#4]/RECAP)
  • One-demo/zero-demo learning from human video
  • World models with tactile feedback for model-based planning
  • Hierarchical VLA for long-horizon dexterous tasks with error recovery

C. Hardware & Design

  • Sub-$1K dexterous hands with integrated tactile (LEAP cost + F-TAC sensing)
  • Modular, replaceable sensor skin (AnySkin 12-second replacement)
  • Variable stiffness for safety + dexterity co-optimization (Seminar 3)
  • Standardized hand-sensor integration (Digit Plexus direction)

D. Data & Simulation

  • Tactile simulation fidelity (DiffTactile, TacEx)
  • Massive synthetic data (NVIDIA 780K approach for tactile)
  • Shared tactile datasets (Touch-and-Go 3M → 100M+)
  • Cross-embodiment data reuse (OXE for hands)
  • Explosive growth in egocentric data collection: EgoDex (Apple, 2025) [26] provides 829 hours and 90 million frames of hand manipulation data, while Ego4D (Meta) [27] offers 3,670 hours from 931 participants. Human egocentric video is emerging as a key data source for robot learning (→ Ch6.5)
  • Scaling-law-driven data strategy: EgoScale (NVIDIA, 2026) [21] reported a log-linear relation between egocentric human-data scale and validation loss, plus downstream correlation in a specific 22-DoF hand setting. This supports treating data scale as a design variable, but it does not by itself establish a general ROI law for contact-rich manufacturing tasks because the data has no force/tactile channel
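Treating data scale as a design variable can be sketched as a log-linear fit of validation loss against hours of data. The numbers below are synthetic placeholders chosen for illustration, not values from EgoScale [21], and the function names are hypothetical.

```python
import numpy as np

def fit_log_linear(hours, val_loss):
    """Least-squares fit of val_loss ~ a + b * log10(hours)."""
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    b, a = np.polyfit(np.log10(hours), val_loss, deg=1)
    return a, b

def predict_loss(a, b, hours):
    """Extrapolate the fitted line to a new data scale."""
    return a + b * np.log10(hours)

# Synthetic placeholder measurements (NOT taken from any paper).
hours = np.array([100.0, 1000.0, 10000.0])
loss = np.array([1.0, 0.8, 0.6])

a, b = fit_log_linear(hours, loss)
projected = predict_loss(a, b, 20000.0)
```

A negative slope `b` means each tenfold increase in data lowers the validation loss by a fixed amount. As the text above cautions, such a fit says nothing about force or tactile channels that the underlying data does not contain.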

E. Deployment & Application

  • Safety certification for human-proximity dexterous manipulation
  • Quality inspection via touch (defects invisible to cameras)
  • Deformable object manipulation at production speed
  • Multi-robot collaborative manipulation (Helix dual-robot)

18.3 Ten Manufacturing-Specific Challenges

The gap from research demonstration to manufacturing floor is typically 3-5 years:

Challenge | Severity | Expected Timeline | Related Chapters
Cycle time matching | Critical | 3-5 yrs | Ch12, Ch13
Multi-shift reliability (24/7) | Critical | 2-4 yrs | Ch2, Ch4
Safety certification | Critical | 2-3 yrs | Ch5
Sub-mm assembly + force control | High | 3-5 yrs | Ch12, Ch15
Deformable material handling | High | 2-4 yrs | Ch12, Ch5
Tool use (screwdriver, wrench) | High | 3-5 yrs | Ch12
Maintenance by technicians | High | 2-4 yrs | Ch4, Ch17
Mixed small-part bin picking | Medium | 1-3 yrs | Ch12
Tactile quality inspection | Medium | 2-4 yrs | Ch2, Ch16
ROI vs traditional automation | Medium | Ongoing | Ch17

Key Observation: Current factory deployments (BMW, Amazon, Mercedes) are at the logistics level. Dexterous assembly has not reached production. This book has clearly distinguished these capability levels.

18.4 Our Research Agenda: The Mechanism-Tactile-Learning Triangle

Synthesizing insights from all 17 preceding chapters into a unified research direction:

Axis 1 — Mechanism (Physical Intelligence):

When intelligent mechanisms induce continuous contact (→ Chapter 5), state stability improves and control simplifies.

Axis 2 — Tactile (Sensing):

When tactile sensors recognize the contact state during continuous contact (→ Chapter 2), finer force control and slip detection become possible. CoinFT's multi-axis sensing at ~360 Hz and ACP's ~500 Hz compliance control demonstrate that the required temporal resolution is achievable with current hardware [Choi, SNU Seminar 2026].

Axis 3 — Learning (VLA/Diffusion):

When VLA/Diffusion Policy leverages tactile feedback from stable contact (→ Chapters 7, 8), sample efficiency and generalization improve. The UMI-FT results validate this: policies trained with in-the-wild data achieved 100% success in unseen environments, versus 20% for lab-only training — demonstrating that data diversity enabled by scalable collection (Axis 2) directly improves learning (Axis 3) [Choi, SNU Seminar 2026].

Core Proposition: When mechanism physically guarantees continuous contact, tactile recognizes the contact state, and learning exploits this information — the burden on each axis is reduced and overall system robustness improves.
Figure 18.1: Tactile-VLA — a concrete recent instance of the mechanism (robot hand) + tactile (GelSight-based tactile tokens) + learning (VLA) triangle integration. Tactile becomes a first-class modality inside the VLA loop, serving as empirical grounding for the research direction proposed in this chapter. Source: Tactile-VLA (2025), Fig. 1.
Figure 18.2: Hierarchical tactile control architecture — high-level planner (VLA / world model) above a low-level reflex layer (compliance controller, tactile-conditioned residual policy). Source: Choi, SNU Data Science Seminar 2026.

Additional Directions

  • Shared Sensing Platform: Generalizing OSMO [#18]/UniTacHand [#16] cross-embodiment tactile transfer (→ Chapter 15.4)
  • Factory-Specific Foundation Model: Tactile-visual foundation model specialized for factory environments
  • Open Hardware + Open Data: Sub-$2K hands + Touch100k-scale dataset expansion
  • Continuous-Contact Manipulation: Seminar 3's original insight — mechanism-induced continuous contact reduces sensing/learning burden

Key Counterintuitive Findings from 2024-2026

Recent studies report results that challenge conventional assumptions about robot learning:

  1. 1 hour of human data > 2 hours of robot data: EgoMimic [22] demonstrated that 1 hour of human demonstrations outperforms 2 hours of robot teleoperation by +34-228%. In that comparison, the quality and diversity of human data outweighed the quantity of robot data.
  2. Human data alone can control robots: X-Sim [23] achieved real robot task success with zero robot data using only human video, EgoZero [25] achieved 70% success on 7 tasks from smart glasses demonstrations alone, and VidBot [24] achieved +20% zero-shot performance from internet videos only.
  3. Human data scale follows log-linear scaling and correlates with downstream success: EgoScale [21] reported log-linear validation-loss scaling across 20,854 hours of egocentric human data and downstream improvements in a specific 22-DoF hand setting. The result is important, but it should be read with a caveat: the data is vision/action-labeled rather than tactile/force-rich, so contact-rich manufacturing needs additional tactile specialist data.
Implications: These three findings challenge the prevailing paradigm that "robot learning requires robot data." Large-scale collection of human egocentric data combined with cross-embodiment transfer may be the key pathway to resolving the data bottleneck in tactile robotics.

18.9 Manufacturing Manual Work and Robot-Hand-Centered Integration

The core argument from the manufacturing Physical AI and NVIDIA robotics surveys applies directly here. Manufacturing Physical AI is not the purchase of a humanoid; it is an operating loop that accumulates process data, evaluation harnesses, failure logs, and QA traces in bounded cells [29]. The robot hand is the end-effector in that loop, but it is also the component exposed to the most uncertainty.

For a Cosmax-style cosmetics manufacturing line, the priorities are concrete. Sequential multi-object grasping and cluttered manipulation become bottlenecks before generic rigid pick/place. Once vision is occluded, tactile force and slip margin become safety gates. Deployability depends less on finger count and DoF than on sensor replacement, calibration drift, cleaning, cycle time, and operator override. Isaac/GR00T/EgoScale-style stacks should be treated not as turnkey solutions, but as data factories linking task schemas, USD/CAD assets, synthetic/real evaluation, and failure replay.

The integration outlook is therefore simple: the 2026 robot hand is no longer just an end-effector with more fingers. It is becoming a process sensor plus actuator connected to tactile sensing, teleoperation, simulation, VLA training, and manufacturing QA loops.

Closing Message

Tactile sensing is the last puzzle piece of robotic manipulation. Three converging trends — cost reduction of vision-based sensors (GelSight → DIGIT $350 → Digit 360), democratization of open-source hands (Shadow $100K → LEAP $2K), and emergence of foundation models (RT-2 → pi0 → Gemini Robotics) — are driving tactile robotics at unprecedented speed.

As of 2026, touch is no longer optional. It is becoming standard.

And converting this standard into dexterous manipulation on manufacturing floors is the core challenge of "Physical AI for Manufacturing" — the vision this book proposes.


18.10 Manufacturing-Cell Checkpoint

The final outlook should be operational rather than merely optimistic. In 2026, robot hands are moving toward more DoF, denser tactile sensing, and larger VLA backbones. Manufacturing value appears only when those pieces are embedded in bounded cells, task schemas, evaluation harnesses, failure logs, and QA traces. In cosmetics manufacturing, small containers, slippery surfaces, deformable packaging, and repeated multi-object handling make tactile sensing a safety gate and process sensor rather than a research ornament.

The roadmap is best split into three axes. Hand hardware should be selected around replaceability, cleanability, force limits, and tactile options. The data pipeline should distinguish robot-executable trajectories from large-scale human-hand observations, then fill the contact-rich gap with tactile specialist data. Learning and control should connect VLA, MPC, tactile reflexes, and residual learning in one loop. When these axes align, the robot hand becomes the observe-control-improve interface of manufacturing Physical AI, not just a more capable end effector.

Operational Reading Note

The practical value of this chapter is not only the future research roadmap as a concept; it is the set of engineering decisions that the roadmap changes. A deployable robot-hand project should start by asking what state becomes observable once the roadmap is applied. The answer should be concrete: contact existence, contact patch, normal force, shear direction, slip margin, object pose, task phase, operator override, or product-damage risk. If a variable cannot be logged or consumed by a controller, it remains an explanatory idea rather than a system capability.

The second decision is the unit of evidence. Research demos often report one success metric, but tactile manipulation improves through failure records. A useful attempt record contains the object or SKU, the selected grasp candidate, the robot hand and sensor configuration, calibration version, task phase, tactile summary, policy action, safety intervention, and final outcome. This record is what connects the sensor chapters to the data chapter, the control chapters to the learning chapters, and the manufacturing chapters to QA.
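The attempt record described above can be sketched as a small data structure. The field names mirror the list in the paragraph, but the types, the example values, and the JSON-lines serialization are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class AttemptRecord:
    """One manipulation attempt, logged whether it succeeded or failed."""
    sku: str                            # object or SKU identifier
    grasp_candidate: str                # which grasp was selected
    hand_config: str                    # robot hand model / firmware
    sensor_config: str                  # tactile sensor layout
    calibration_version: str            # calibration in effect for this attempt
    task_phase: str                     # phase in which the attempt ended
    tactile_summary: dict               # e.g. peak force, minimum slip margin
    policy_action: str                  # high-level action chosen by the policy
    safety_intervention: Optional[str]  # None if no intervention occurred
    outcome: str                        # e.g. "success", "slip", "drop"

    def to_json_line(self) -> str:
        """Serialize to one line so records can be appended to a log file."""
        return json.dumps(asdict(self), sort_keys=True)

# Hypothetical example values.
record = AttemptRecord(
    sku="SKU-001", grasp_candidate="pinch-2", hand_config="hand-A",
    sensor_config="fingertip-x4", calibration_version="cal-07",
    task_phase="lift", tactile_summary={"peak_force_n": 3.2, "min_slip_margin": 0.05},
    policy_action="regrasp", safety_intervention=None, outcome="slip",
)
line = record.to_json_line()
```

Storing the calibration version and sensor configuration with every attempt is what later allows maintenance drift to be separated from policy errors.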

The third decision is where each idea sits in the control stack. Some ideas belong in fast reflex loops, some in contact MPC, some in policy inputs, and some only in offline diagnosis. Mixing these time scales creates brittle systems: a VLA cannot react to millisecond slip, and a low-level force controller cannot infer the next process step. The right architecture separates fast contact stabilization, mid-level grasp or rearrangement control, and slow task reasoning.
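The separation of time scales can be sketched as a single loop in which each layer runs at its own rate. The rates (1000 Hz reflex, 50 Hz mid-level, 2 Hz task reasoning), the thresholds, and the simple grip-force rule are all illustrative assumptions, not values from the surveyed systems.

```python
def run_stack(slip_margins, base_hz=1000, mid_hz=50, slow_hz=2,
              min_margin=0.2, force_step=0.05, max_force=5.0):
    """Step a three-layer control stack over a sequence of slip-margin samples.

    Returns the final grip force and how often each layer ran.
    """
    mid_every = base_hz // mid_hz    # ticks between mid-level updates
    slow_every = base_hz // slow_hz  # ticks between task-reasoning updates
    grip_force = 2.0
    calls = {"reflex": 0, "mid": 0, "slow": 0}
    for tick, margin in enumerate(slip_margins):
        # Fast layer: runs every tick and reacts to slip immediately.
        calls["reflex"] += 1
        if margin < min_margin:
            grip_force = min(max_force, grip_force + force_step)
        # Mid layer: placeholder for grasp or rearrangement control.
        if tick % mid_every == 0:
            calls["mid"] += 1
        # Slow layer: placeholder for VLA-style task reasoning.
        if tick % slow_every == 0:
            calls["slow"] += 1
    return grip_force, calls

# One simulated second with a 10 ms slip event.
margins = [1.0] * 1000
margins[100:110] = [0.1] * 10
final_force, layer_calls = run_stack(margins)
```

In this sketch the slip event starts and ends between two consecutive mid-level updates, so only the reflex layer can respond to it in time. A real system would run the layers in separate threads or processes rather than one loop.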

Finally, each method should be evaluated by the failure modes it removes. A method that improves benchmark success but leaves the team unable to distinguish perception failure, contact-acquisition failure, force-closure failure, execution-time slip, or maintenance drift is not yet production-ready. A method with slightly lower headline performance but better logs, safer force limits, and clearer recovery hooks may be the stronger basis for manufacturing Physical AI.
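One way to check whether failures are actually distinguishable is to label each failed attempt with one of the five classes named above and track the share left undiagnosed. The enum, the extra `UNDIAGNOSED` bucket, and the example labels are assumptions for illustration.

```python
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    PERCEPTION = "perception"
    CONTACT_ACQUISITION = "contact_acquisition"
    FORCE_CLOSURE = "force_closure"
    EXECUTION_SLIP = "execution_slip"
    MAINTENANCE_DRIFT = "maintenance_drift"
    UNDIAGNOSED = "undiagnosed"  # failure observed, cause not determined

def failure_breakdown(labels):
    """Return the fraction of failed attempts in each failure class."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode.value: counts.get(mode, 0) / total for mode in FailureMode}

# Hypothetical labels from one batch of failed attempts.
batch = [
    FailureMode.EXECUTION_SLIP, FailureMode.EXECUTION_SLIP,
    FailureMode.PERCEPTION, FailureMode.UNDIAGNOSED,
]
shares = failure_breakdown(batch)
```

A large `undiagnosed` share suggests that logging, not the policy, is the next thing to improve.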

This framing also clarifies the investment sequence. The first investment should be a bounded task cell with measurable failures, not a broad humanoid deployment. The second should be tactile and force logging that survives maintenance, cleaning, and operator overrides. The third should be a simulation and replay harness that can reproduce at least the dominant failure classes. Only after those pieces exist does it make sense to scale data collection, train larger policies, or compare commercial hands at higher task diversity. In other words, the survey's technical arc ends in an operating discipline: make contact observable, make failure diagnosable, and make each deployment attempt improve the next one.

For readers planning a project, the most useful next artifact is a one-page experiment contract. It should name the SKU family, the robot hand, the tactile sensor layout, the task phases, the allowed force envelope, the expected cycle time, the intervention policy, and the exact log fields that will be reviewed after every batch of trials. This contract keeps hardware selection, data collection, control design, and management expectations aligned. It also prevents a common failure pattern: adding a more capable hand or a larger policy before the team has agreed on what evidence would count as progress.
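The experiment contract can also be kept in machine-checkable form, so that each batch of trials is validated against what was agreed. The sketch below assumes hypothetical field names and example values; it is one possible encoding, not a prescribed format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class ExperimentContract:
    """A one-page experiment contract in checkable form."""
    sku_family: str
    robot_hand: str
    tactile_layout: str
    task_phases: Tuple[str, ...]
    force_envelope_n: Tuple[float, float]  # allowed (min, max) force in newtons
    target_cycle_time_s: float
    intervention_policy: str
    log_fields: Tuple[str, ...]            # fields reviewed after every batch

    def missing_fields(self, log_row: dict) -> List[str]:
        """List agreed log fields that are absent from a logged row."""
        return [name for name in self.log_fields if name not in log_row]

    def force_ok(self, force_n: float) -> bool:
        """Check a measured force against the agreed envelope."""
        low, high = self.force_envelope_n
        return low <= force_n <= high

# Hypothetical contract for a small-container handling cell.
contract = ExperimentContract(
    sku_family="small-jars", robot_hand="hand-A", tactile_layout="fingertip-x4",
    task_phases=("approach", "grasp", "lift", "place"),
    force_envelope_n=(0.5, 8.0), target_cycle_time_s=6.0,
    intervention_policy="operator stop on any drop",
    log_fields=("sku", "task_phase", "peak_force_n", "outcome"),
)
gaps = contract.missing_fields({"sku": "jar-30ml", "outcome": "success"})
```

Running `missing_fields` on each logged row makes gaps in the evidence visible before the batch review, rather than during it.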

That evidence should be reviewed on a fixed cadence. Weekly review of failed attempts, sensor drift, operator interventions, and product-damage near misses will usually improve the cell faster than occasional model retraining without diagnosis.

References

  1. Various. (2023). DeXtreme. ICRA 2023.
  2. Various. (2025). Tactile Robotics: Past and Future. arXiv:2512.01106.
  3. Bhirangi, R., et al. (2024). AnySkin. ICRA 2025.
  4. Zhao, Z., Li, W., Li, Y., Liu, T., Li, B., Wang, M., Du, K., Liu, H., Zhu, Y., Wang, Q., Althoefer, K., & Zhu, S.-C. (2025). Embedding high-resolution touch across robotic hands enables adaptive human-like grasping. Nature Machine Intelligence. https://doi.org/10.1038/s42256-025-01053-3 #39
  5. Various. (2025). NRE-skin. PNAS.
  6. Various. (2026). Bioinspired spiking architecture. Nature Communications.
  7. Yu, J., et al. (2025). ForceVLA: Enhancing VLA models with a force-aware MoE for contact-rich manipulation. NeurIPS 2025. #1
  8. Various. (2025). Tactile-VLA. OpenReview.
  9. Physical Intelligence. (2025). pi0.5/RECAP: Post-deployment RL for continuous improvement. arXiv preprint. arXiv:2504.16932. #4
  10. Shaw, K., et al. (2024). Learning from internet videos. CMU.
  11. Si, Z., Zhang, G., Ben, Q., Romero, B., Xian, Z., Liu, C., & Gan, C. (2024). DiffTactile: A physics-based differentiable tactile simulator. ICLR 2024.
  12. Various. (2024). TacEx: GelSight tactile simulation in Isaac Sim. arXiv preprint. arXiv:2411.04776.
  13. NVIDIA. (2026). 780K trajectories in 11 hours. GTC 2026.
  14. Various. (2025). OSMO. arXiv:2512.08920. #18
  15. Zhang, Y., et al. (2025). UniTacHand. Various. #16
  16. Bicchi, A. (2000). Hands for dexterous manipulation. IEEE T-RA.
  17. Billard, A., & Kragic, D. (2019). Trends and challenges. Science.
  18. Various. (2026). VLA systematic review. Information Fusion.
  19. Various. (2025). What matters in building VLA models. Nature MI.
  20. Hogan, N. (1985). Impedance control. JDSMC.
  21. Bansal, A., et al. (2026). EgoScale: Scaling laws for egocentric human data in robot learning. arXiv preprint. NVIDIA Research.
  22. Kareer, S., et al. (2024). EgoMimic: Scaling imitation learning via egocentric video. arXiv preprint. Georgia Tech.
  23. Rishabh, A., et al. (2025). X-Sim: Cross-embodiment simulation for robot learning. CoRL 2025 (Oral). Cornell University.
  24. Bahl, S., et al. (2025). VidBot: Learning robot policies from internet videos. CVPR 2025. TU Munich.
  25. Wang, Y., et al. (2025). EgoZero: Robot learning from smart glasses demonstrations. arXiv preprint.
  26. Apple ML Research. (2025). EgoDex: Learning dexterous manipulation from large-scale egocentric video. 829 hours, 90M frames. arXiv preprint.
  27. Grauman, K., et al. (2022). Ego4D: Around the world in 3,000 hours of egocentric video. CVPR 2022. Meta AI. 3,670 hours, 931 participants.
  28. Choi, H., Low, J. E., Huh, T. M., Hong, S., Uribe, G. A., Hoffmann, K. A. W., Di, J., Chen, T. G., Stanley, A. A., & Cutkosky, M. R. (2025). CoinFT: A Coin-Sized, Capacitive 6-Axis Force Torque Sensor for Robotic Applications. arXiv preprint. arXiv:2503.19225.
  29. Um, T. (2026). S6 Physical AI Manufacturing and S9 NVIDIA Physical AI Robotics survey notes. Terry Surveys. [Um, 2026]