Chapter 3: Tactile Data — Turning Sensor Signals into Representations

Overview

Tactile sensors (Chapter 2) convert physical contact into electrical signals. This chapter addresses what comes next: how to structure those signals, how to collect them, and how to build general-purpose representations through pretraining. Anchored by the taxonomy of Albini et al.^[1], we cover six data structures, collection pipelines, public datasets, and the path toward tactile foundation models through self-supervised pretraining.

After reading this chapter, you will be able to:
- Distinguish the six tactile data representation structures and their application contexts.
- Understand canonical and sensor-agnostic representations.
- Identify major tactile data collection pipelines and public datasets.
- Explain the significance and limitations of self-supervised pretraining approaches like Sparsh and UniTouch.

3.1 A Taxonomy of Data Representations

Albini et al.^[1] classified tactile data representations into six structures. This survey (submitted to IEEE T-RO) is establishing itself as the de facto standard taxonomy for tactile data representation.

Key Paper: Albini et al. 2025. "Representing Data in Robotic Tactile Perception — A Review." arXiv preprint (submitted to IEEE T-RO). Identifies 6 data structures (vector, matrix, map, point cloud, mesh, image) with selection guidelines based on hardware, task, and information requirements.

Figure 3.1: General overview of the tactile perception pipeline — the Data Representation block bridges hardware and high-level processing. Source: Albini et al. (2025), Fig. 1.

3.1.0 Checklist: Turning Sensor Signals into Stored Data

Mounting a tactile sensor does not automatically create a usable dataset. The raw signal first has to become a record. A minimal schema should include:

Field	Example	Why it matters
timestamp	robot clock, sensor clock	synchronization with vision/proprioception
sensor_id	left_index_tip_taxel_17	sensor replacement and calibration tracking
frame_id	left_index_distal_link	force direction and contact-location interpretation
raw_value	ADC/RGB/Hall/F-T voltage	recalibration and debugging
calibrated_value	N, Pa, mm, wrench	control and learning input
contact_state	none/stick/slip/break	input to contact dynamics models
quality_flags	saturated, occluded, drifted	filtering training data

With this schema, Chapter 2's sensor outputs can be converted into Chapter 3's vectors, maps, images, and point clouds. Without it, sensor demos remain hard to compare or reuse across policies.
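As a minimal sketch, the checklist can be written down as a typed record. The field names come from the table above; the Python types, the choice of seconds for the timestamp, and the `usable_for_training` rule are assumptions made for illustration, not a schema prescribed by any of the cited works.

```python
from dataclasses import dataclass, field
from typing import Optional

# Allowed contact states, taken from the checklist table.
CONTACT_STATES = ("none", "stick", "slip", "break")


@dataclass
class TactileRecord:
    """One stored tactile sample; field names follow the checklist table."""
    timestamp: float                 # seconds on a shared clock (assumed unit)
    sensor_id: str                   # e.g. "left_index_tip_taxel_17"
    frame_id: str                    # e.g. "left_index_distal_link"
    raw_value: tuple                 # uncalibrated reading, kept for recalibration
    calibrated_value: Optional[tuple] = None   # physical units, e.g. newtons
    contact_state: str = "none"
    quality_flags: set = field(default_factory=set)

    def __post_init__(self):
        if self.contact_state not in CONTACT_STATES:
            raise ValueError(f"unknown contact_state: {self.contact_state}")

    def usable_for_training(self) -> bool:
        """Assumed rule: calibrated and carrying no quality flag."""
        return self.calibrated_value is not None and not self.quality_flags


records = [
    TactileRecord(0.010, "left_index_tip_taxel_17", "left_index_distal_link",
                  raw_value=(512,), calibrated_value=(0.8,), contact_state="stick"),
    TactileRecord(0.020, "left_index_tip_taxel_17", "left_index_distal_link",
                  raw_value=(1023,), calibrated_value=(4.0,),
                  contact_state="slip", quality_flags={"saturated"}),
]
clean = [r for r in records if r.usable_for_training()]
print(len(clean))
```

Keeping `raw_value` next to `calibrated_value` is what makes later recalibration possible: the saturated sample is filtered out of training here, but it is still stored for debugging.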

3.1.1 Vector

The simplest form: sensor readings arranged as a 1D vector. For multi-axis sensors, this becomes a force vector such as [fx, fy, fz]. The OSMO glove^[21] [#18] represents tactile state as a 36-dimensional vector from its 12 three-axis sensors.

Suitable tasks: Classification, force control, slip detection

Limitations: Loss of spatial relationship information
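A minimal sketch of the vector representation, assuming 12 three-axis sensors as in the OSMO example. The function names, the (fx, fy, fz) ordering, and the sample values are hypothetical and are not taken from the OSMO codebase.

```python
from typing import List, Tuple

Force = Tuple[float, float, float]  # (fx, fy, fz) for one three-axis sensor


def flatten_forces(per_sensor: List[Force]) -> List[float]:
    """Concatenate per-sensor force triples into one flat state vector."""
    return [component for force in per_sensor for component in force]


def total_normal_force(vector: List[float]) -> float:
    """Sum every third entry (fz), assuming fz is the normal component."""
    return sum(vector[2::3])


# 12 hypothetical sensors, each reading 0.1 N shear in x and 0.5 N normal.
readings = [(0.1, 0.0, 0.5)] * 12
state = flatten_forces(readings)
print(len(state))
print(total_normal_force(state))
```

The flat vector has 36 entries, but nothing in it records which sensor sits next to which. That is the loss of spatial relationship information noted above.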

3.1.2 Matrix

Sensor array outputs represented as a 2D matrix. The 548 piezoresistive sensors of the STAG glove^[12] produce a pressure matrix mapped to the hand surface.

Suitable tasks: Contact pattern classification, grasp state recognition

Limitations: Distortion when sensors are placed on curved surfaces
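The sketch below shows one thing the matrix form makes easy: a spatial summary such as the center of pressure. It assumes a flat, row-major taxel grid; the 3x3 patch and both function names are illustrative only.

```python
from typing import List, Tuple


def to_matrix(flat: List[float], rows: int, cols: int) -> List[List[float]]:
    """Reshape a flat list of taxel readings into a row-major 2D grid."""
    if len(flat) != rows * cols:
        raise ValueError("reading count does not match grid size")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def center_of_pressure(grid: List[List[float]]) -> Tuple[float, float]:
    """Pressure-weighted mean (row, col) index of the grid."""
    total = sum(sum(row) for row in grid)
    if total == 0:
        raise ValueError("no contact: total pressure is zero")
    row_c = sum(r * v for r, row in enumerate(grid) for v in row) / total
    col_c = sum(c * v for row in grid for c, v in enumerate(row)) / total
    return row_c, col_c


# Hypothetical 3x3 patch with pressure concentrated in the lower-right corner.
grid = to_matrix([0, 0, 0,
                  0, 1, 1,
                  0, 1, 1], rows=3, cols=3)
print(center_of_pressure(grid))
```

Grid indices are only a faithful stand-in for physical position when the array is flat and evenly spaced. On a curved surface, neighboring cells are no longer equally far apart, which is the distortion listed as a limitation.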

3.1.3 Map

Sensor data projected onto a reference surface. UniTacHand ^[13] [#16] projects tactile data onto a MANO UV map, mapping human and robot hand tactile data into a shared representation space (→ Chapter 11.4).

Suitable tasks: Cross-embodiment transfer, whole-hand tactile analysis

Limitations: UV mapping distortion, dependence on reference model
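To make the idea of a shared reference surface concrete, here is a simplified sketch that rasterizes taxel readings into a square UV grid. It is not UniTacHand's method: the per-taxel (u, v) coordinates are assumed to be given, whereas a real pipeline would derive them from the MANO mesh parameterization. All taxel names are hypothetical.

```python
from typing import Dict, List, Tuple


def rasterize_to_uv(
    readings: Dict[str, float],
    uv_coords: Dict[str, Tuple[float, float]],
    size: int,
) -> List[List[float]]:
    """Write each taxel reading into the UV-map cell its (u, v) falls in.

    u and v are assumed to lie in [0, 1]. Cells hit by several taxels keep
    the maximum reading; cells hit by none stay at 0.0.
    """
    uv_map = [[0.0] * size for _ in range(size)]
    for taxel, value in readings.items():
        u, v = uv_coords[taxel]
        col = min(int(u * size), size - 1)   # clamp u == 1.0 into the last column
        row = min(int(v * size), size - 1)
        uv_map[row][col] = max(uv_map[row][col], value)
    return uv_map


# Hypothetical layouts: a glove and a robot hand use different taxel names,
# but their fingertip sensors land on the same UV location.
glove_uv = {"glove_index_tip": (0.9, 0.1)}
robot_uv = {"robot_f1_pad_3": (0.9, 0.1)}

glove_map = rasterize_to_uv({"glove_index_tip": 0.7}, glove_uv, size=4)
robot_map = rasterize_to_uv({"robot_f1_pad_3": 0.7}, robot_uv, size=4)
print(glove_map == robot_map)
```

Once both devices write into the same grid, a downstream model no longer needs to know which hardware produced the signal. The quality of that equivalence depends entirely on the assumed UV assignment.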

3.1.4 Point Cloud

Contact points represented with 3D coordinates. Robot Synesthesia ^[2] implemented visuotactile in-hand manipulation using point cloud-based tactile representations. A PointNet encoder processes the tactile point cloud to achieve double-ball rotation and three-axis rotation on novel objects (ICRA 2024).

Figure 3.2: Robot Synesthesia — a point cloud-based visuotactile policy trained in simulation generalizes to novel objects in the real world without any real-world data. Source: Yuan et al. (2024), Fig. 1.

Key Paper: Yuan et al. 2024. "Robot Synesthesia: In-Hand Manipulation with Visuotactile Sensing." ICRA 2024. arXiv:2312.01853. Point cloud-based tactile representation with teacher-student RL for visuotactile in-hand manipulation, achieving generalization to novel objects.

Suitable tasks: 3D shape reconstruction, 6-DoF pose estimation, visuo-tactile fusion

Limitations: Non-uniform density, unordered data processing required
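A tactile point cloud can be built by thresholding taxel pressures and transforming the active taxel positions into a common frame. The sketch below assumes taxel positions are known in the link frame and that the link pose comes from forward kinematics. The geometry, threshold, and function names are hypothetical and do not reproduce the Robot Synesthesia pipeline.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def transform(rotation: Sequence[Sequence[float]], translation: Point, p: Point) -> Point:
    """Apply a rigid transform: rotation (3x3, row-major) then translation."""
    return tuple(
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def contact_point_cloud(taxel_positions, pressures, rotation, translation, threshold):
    """Return world-frame points for taxels whose pressure exceeds threshold."""
    return [
        transform(rotation, translation, pos)
        for pos, pressure in zip(taxel_positions, pressures)
        if pressure > threshold
    ]


# Hypothetical fingertip with three taxels, positions in the link frame (meters).
taxels = [(0.00, 0.0, 0.0), (0.01, 0.0, 0.0), (0.02, 0.0, 0.0)]
pressures = [0.0, 0.6, 0.9]

# Link pose in the world: 90 degree rotation about z, shifted 0.5 m along x.
rot_z_90 = [[0.0, -1.0, 0.0],
            [1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0]]
cloud = contact_point_cloud(taxels, pressures, rot_z_90, (0.5, 0.0, 0.0), threshold=0.5)
print(cloud)
```

The number of returned points changes with every contact, and their order carries no meaning. This is why encoders for this representation must handle unordered, variable-size input.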

3.1.5 Mesh

Contact surfaces modeled as triangular meshes, which can be combined with finite element methods (FEM) for deformation simulation. DiffTactile^[11] uses mesh-based contact modeling in a differentiable tactile simulator (→ Chapter 10.1).

Suitable tasks: Deformation simulation, force distribution analysis

Limitations: Computational cost, real-time processing difficulty

3.1.6 Image

The raw output of vision-based tactile sensors (GelSight, DIGIT) is already an image. This allows direct use of the rich vision model ecosystem (CNNs, ViTs), making it the most widely used representation today. Sparsh ^[4] performed self-supervised pretraining on 460,000+ tactile images (→ Section 3.6).

Suitable tasks: Texture recognition, object classification, contact map reconstruction

Limitations: Sensor-specific — cannot directly compare images from different sensors

3.2 How Representation Choices Affect Task Performance

The choice of data representation directly impacts learning performance. Wu et al.^[9] proposed Canonical 3D Tactile [#14] representations, transforming raw sensor output into a 3D canonical coordinate system to enable task-agnostic transfer (ICRA 2025).

The key insight, also discussed in Seminar 1, is a two-stage approach: pretrain a tactile encoder on large-scale play data, then learn the policy from a few expert demonstrations. The core pipeline presented in Seminar 1 was a visuo-tactile imitation learning pipeline that combines three-axis tactile sensing, vision (RealSense D435), and robot state.

Albini et al.^[1] propose selection guidelines along three axes:

Hardware: Sensor output characteristics determine the natural representation (vision sensor → image, distributed sensor → matrix)
Task: Grasp classification → vector/matrix; shape reconstruction → point cloud/mesh
Required information: Normal force only → scalar/vector; 3D contact geometry → point cloud/image
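The three guideline axes can be read as constraints that narrow the set of candidate representations. The sketch below encodes only the examples listed above and intersects them. Treating the guidelines as a lookup table is a simplification made here, not something Albini et al. propose, and the key strings are invented labels.

```python
# Candidate representations per guideline axis, transcribed from the three
# guideline lines above. The key strings are labels invented for this sketch.
BY_HARDWARE = {
    "vision_sensor": {"image"},
    "distributed_sensor": {"matrix"},
}
BY_TASK = {
    "grasp_classification": {"vector", "matrix"},
    "shape_reconstruction": {"point_cloud", "mesh"},
}
BY_INFORMATION = {
    "normal_force_only": {"scalar", "vector"},
    "3d_contact_geometry": {"point_cloud", "image"},
}


def candidate_representations(hardware=None, task=None, information=None):
    """Intersect the candidate sets of every axis the caller specifies.

    An empty result means the stated constraints pull in different
    directions, so a conversion step (for example image to point cloud)
    would be needed.
    """
    sets = []
    if hardware is not None:
        sets.append(BY_HARDWARE[hardware])
    if task is not None:
        sets.append(BY_TASK[task])
    if information is not None:
        sets.append(BY_INFORMATION[information])
    if not sets:
        raise ValueError("specify at least one axis")
    return set.intersection(*sets)


print(candidate_representations(task="shape_reconstruction",
                                information="3d_contact_geometry"))
print(candidate_representations(hardware="vision_sensor",
                                task="grasp_classification"))
```

The second query returns an empty set: a vision-based sensor naturally yields images, while the grasp-classification guideline points to vectors or matrices. In practice that gap is closed by a conversion or by a learned encoder.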

3.3 Canonical and Sensor-Agnostic Representations

A fundamental limitation of tactile research is sensor specificity: representations learned on GelSight do not work on DIGIT; those from DIGIT do not transfer to ReSkin. Three main approaches address this problem:

3.3.1 AnyTouch / AnyTouch 2 (2025)

AnyTouch^[6] learns unified representations of static and dynamic touch across multiple vision-based tactile sensors. AnyTouch 2^[7] extends this to dynamic perception, enabling sensor-agnostic deployment.

3.3.2 Sensor-Invariant Tactile Representation (2025)

This approach^[8] achieves zero-shot transfer across optical sensor designs by learning representations that strip sensor-specific information while preserving essential contact characteristics.

3.3.3 Canonical 3D Tactile (Wu et al., 2025)

Transforms raw sensor output into a 3D canonical coordinate frame, constructing sensor-independent tactile representations. Combined with task-agnostic play data pretraining, it enables transfer from few demonstrations.

Key Insight: Sensor-agnostic representations are the path toward tactile sensing's "CLIP moment" — the ultimate goal is aligning outputs from diverse tactile sensors into a unified embedding space, just as CLIP did for vision-language.

Figure 3.3: AnyTouch's TacQuad — an aligned multi-modal multi-sensor dataset spanning GelSight Mini, DIGIT, DuraGel, and Tac3D, enabling learning of sensor-agnostic tactile representations. Source: Feng et al. (2025), Fig. 1.

3.4 Data Collection Pipelines

Collecting tactile data is inherently harder than collecting visual data because physical contact is required. This section covers five collection methods:

3.4.1 Teleoperation

A human operator remotely controls the robot while recording tactile data. This yields high-quality demonstrations but very low throughput: roughly 10 demonstrations per hour [DexCap benchmark]. DexCap is reported to be 3x faster than teleoperation, but its throughput is still limited.

Wu et al.^[9], as presented in Seminar 1, used a two-stage approach: collect large-scale play data via teleoperation, then fine-tune a policy with a few expert demonstrations.

3.4.2 Kinesthetic Teaching

DexForce^[14] [#3] proposed a pipeline using a spring model for kinesthetic teaching that naturally records force-torque information. This captures more natural force profiles than teleoperation (→ Chapter 8.4).

Figure 3.4: DexForce — across six contact-rich tasks (opening AirPods case, unscrewing a nut, etc.) six-axis force/torque is recorded via kinesthetic demonstration, and these force-informed actions are fed to imitation learning. Source: Chen et al. (2025), Fig. 1.

3.4.3 Autonomous Exploration

The robot autonomously explores the environment to collect tactile data. Throughput is high, but demonstration quality may be lower. PP-Tac^[15] [#12] uses physics-based trajectory synthesis for automatic tactile data generation.

3.4.4 Handheld Demonstration Devices (UMI-FT)

UMI-FT ^[24] extends the Universal Manipulation Interface (UMI) ^[26] with CoinFT force/torque sensors on each finger, enabling multimodal human demonstration collection at scale without requiring a full robot setup:

Hardware: Handheld gripper-shaped device with iPhone (RGB + depth), CoinFTs per finger
Modalities captured: Vision, proprioceptive pose, finger-level 6-axis force/torque
Key advantage: Natural haptic feedback during demonstration (unlike teleoperation); scalable deployment anywhere
Learning pipeline: High-level multimodal Diffusion Policy (vision + force → pose, gripper width, grasp force, stiffness) + low-level grasp force and compliance controllers

UMI-FT demonstrated that force/torque sensing is critical for contact-rich tasks: on whiteboard wiping, the policy with force achieved controlled, firm wiping and generalized to different table heights, board sizes, and eraser widths, while baselines without force either rammed the surface or barely skimmed it. On light bulb insertion (requiring ~15-20 N force with haptic search), both compliance and grasp force control were essential — removing either caused failure ^[25].

Figure 3.5: UMI-FT system overview — a handheld gripper equipped with CoinFT six-axis F/T sensors on each finger. External contact forces (whiteboard wiping) and internal grasp forces (zucchini skewering, bulb insertion) are both recorded and regulated. Source: Choi et al. (2026), Fig. 1.

Figure 3.6: UMI-FT controller structure — a 1 Hz visuomotor policy fuses RGB+Depth and left/right CoinFT signals, then issues reference targets to a 500 Hz arm compliance controller and a 30 Hz grasp-force controller. Source: Choi et al. (2026), Fig. 3.
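The three loop rates in the Figure 3.6 caption imply a multi-rate schedule. The sketch below only counts how often each loop would fire on a shared base clock. It is an illustration of the timing relationship under the quoted rates, not the UMI-FT implementation, and the function names are hypothetical.

```python
from functools import reduce
from math import gcd

# Loop rates in Hz, as quoted in the Figure 3.6 caption.
RATES_HZ = {"policy": 1, "arm_compliance": 500, "grasp_force": 30}


def lcm_all(values):
    """Least common multiple of a collection of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), values)


def count_calls(duration_s: int, rates_hz: dict) -> dict:
    """Step a base clock and count how often each loop fires.

    The base clock runs at the least common multiple of all rates so that
    every loop period is a whole number of base ticks.
    """
    base_hz = lcm_all(rates_hz.values())
    period_ticks = {name: base_hz // hz for name, hz in rates_hz.items()}
    calls = {name: 0 for name in rates_hz}
    for tick in range(duration_s * base_hz):
        for name, period in period_ticks.items():
            if tick % period == 0:
                calls[name] += 1      # a real system would run the loop body here
    return calls


calls = count_calls(duration_s=2, rates_hz=RATES_HZ)
print(calls)
```

Between two consecutive policy outputs, the compliance loop runs 500 times and the grasp-force loop 30 times. Contact stabilization is therefore handled below the policy rather than by it.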

3.4.5 Synthetic Data

NVIDIA's Isaac Sim pipeline generates 780,000 trajectories (equivalent to 6,500 hours) in just 11 hours, improving real-world performance by 40%. Tacto ^[10] is an open-source simulator for vision-based tactile sensors enabling sim-to-real training. DiffTactile^[11] supports gradient-based optimization through differentiable simulation (→ Chapter 10).
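The throughput figures quoted for the synthetic pipeline can be unpacked with a few lines of arithmetic. Only the three input numbers come from the text. The derived quantities are computed here, and the comparison baseline of 10 demonstrations per hour is the teleoperation figure from Section 3.4.1.

```python
trajectories = 780_000
equivalent_hours = 6_500      # demonstration time the data is said to equal
generation_hours = 11

# Implied average length of one trajectory.
seconds_per_trajectory = equivalent_hours * 3600 / trajectories

# How much faster than real time the data was produced.
speedup = equivalent_hours / generation_hours

# Generation rate, and its ratio to teleoperation at about 10 demos per hour.
trajectories_per_hour = trajectories / generation_hours
ratio_to_teleop = trajectories_per_hour / 10

print(seconds_per_trajectory)
print(round(speedup))
print(round(trajectories_per_hour))
print(round(ratio_to_teleop))
```

The implied trajectory length is 30 seconds, and generation runs roughly 590 times faster than real time. These ratios describe volume only; the sim-to-real gap noted in the comparison table still applies.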

Figure 3.7: DiffTactile — grasping a deformable object in the real world (left) reproduced in a physics-based differentiable simulator (right). Mesh-FEM contact modeling computes precise force distributions. Source: Si et al. (2024), Fig. 1.

Method	Throughput	Data Quality	Force Info	Cost	Representative
Teleoperation	Low (~10/hr)	High	Limited	High	DexCap, DexUMI
Kinesthetic teaching	Medium	High	Natural	Medium	DexForce
Handheld device	High	High	6-axis F/T	Low	UMI-FT
Autonomous exploration	High	Medium	Available	Low	PP-Tac
Synthetic data	Very high	Medium (gap)	Available	Low	Isaac Sim, Tacto

3.5 Public Datasets

The scale of tactile datasets remains orders of magnitude smaller than vision datasets. To put this in perspective: LLaMA 3 was trained on 34,000 human-years of text data (at ~40 words/min); the largest robot manipulation dataset (Generalist AI) contains 57 human-years of demonstration data (vision + position only); multimodal tactile data in academia amounts to less than 1 human-year — nearly non-existent by comparison ^[25]. This massive deficit motivates every data collection and pretraining approach discussed in this chapter.

Figure 3.8: Data scale comparison — LLM (34,000 human-years) vs. robot manipulation (57 human-years) vs. multimodal tactile (~0.x human-years). Source: Choi (2026), SNU Data Science Seminar.

Major public datasets include:

Dataset	Scale	Sensor	Modalities	Tasks	Year
Touch-and-Go	3M+ contacts	GelSight	Vision + tactile	Texture, objects	2023
Touch100k	100K+ images	Various	Tactile	Texture classification	2024
ObjectFolder	1K+ objects	Simulation	Vision + tactile + audio	Multimodal	2022
VTDexManip	10 tasks, 182 objects	Vision + tactile	Human demos	Dexterous manipulation	2025
Open X-Embodiment	1M+ trajectories	22 embodiments	Vision + action	General manipulation	2024
EgoDex	829hr, 90M frames	Apple Vision Pro	RGB + hand pose	194 tasks	2025

Key Paper: VTDexManip (ICLR 2025). The first large-scale visual-tactile dataset from human demonstrations, spanning 10 tasks and 182 objects with an RL benchmark.

EgoDex ^[22] is a large-scale egocentric hand manipulation dataset collected using Apple Vision Pro + ARKit. With 829 hours, 90M frames, and 194 tasks recorded with per-finger tracking at 30 Hz, it surpasses existing tactile/manipulation datasets by orders of magnitude. Together with the scaling laws discovered by EgoScale (Chapter 11.6), this demonstrates that large-scale egocentric human data collection is emerging as a key direction for robot learning.

VTDexManip is a significant milestone — the first large-scale collection of visual-tactile data from actual human demonstrations of dexterous manipulation.

Open X-Embodiment^[16] is among the largest robot manipulation datasets (1M+ trajectories from 22 embodiments), but tactile data is included in only a small subset. The absence of large-scale tactile-specific datasets is a key limitation (→ Chapter 14.1).

Figure 3.9: Touch100k dataset construction pipeline — three stages (collection, multi-granularity description generation with GPT-4V, consistency verification with Gemini) produce 100K+ aligned vision-touch-language samples. Source: Yang et al. (2024), Fig. 1.

3.6 Self-Supervised Pretraining: Sparsh and UniTouch

The key approach toward the "ImageNet moment" for touch is building general-purpose tactile representations through self-supervised learning.

Sparsh (2024)

A collaboration between Meta FAIR, CMU, and UC Berkeley, Sparsh ^[4] is a tactile foundation model pretrained via self-supervised learning on 460,000+ tactile images (CoRL 2024).

Key Paper: Higuera et al. 2024. "Sparsh: Self-Supervised Touch Representations for Vision-Based Tactile Sensing." CoRL 2024. Self-supervised tactile foundation model pretrained on 460K+ images from multiple vision-based tactile sensors. A milestone for general-purpose tactile representations.

Sparsh's significance lies in being the first large-scale attempt to replicate for touch what ImageNet pretraining accomplished for vision. However, at 460K images its corpus is still about 30 times smaller than ImageNet's 14M, and a 10x+ data expansion is needed (→ Chapter 14.2).

UniTouch (2024)

Yang et al.^[3] use contrastive learning to align touch with vision, language, and audio (CVPR 2024). Analogous to how CLIP aligned vision-language, UniTouch builds a unified embedding space across touch-vision-language-audio, enabling zero-shot tactile classification.

Key Paper: Yang et al. 2024. "Binding Touch to Everything: Learning Unified Multimodal Tactile Representations." CVPR 2024. Contrastive learning aligning touch with vision, language, and audio for cross-modal retrieval and zero-shot classification.
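Contrastive alignment of the kind described above is commonly implemented with an InfoNCE-style loss. The sketch below is a generic one-directional version in plain Python. The temperature of 0.07, the toy embeddings, and the function names are assumptions, and the exact objective used by UniTouch is not reproduced here.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(touch: List[List[float]], other: List[List[float]],
             temperature: float = 0.07) -> float:
    """Mean cross-entropy of matching touch[i] to other[i] among all other[j].

    Row i of both inputs is assumed to come from the same contact event;
    every other row in the batch serves as a negative.
    """
    touch = [normalize(v) for v in touch]
    other = [normalize(v) for v in other]
    loss = 0.0
    for i, t in enumerate(touch):
        logits = [dot(t, o) / temperature for o in other]
        peak = max(logits)                                  # for numerical stability
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_sum - logits[i]                         # -log softmax at index i
    return loss / len(touch)


# Toy 2D embeddings: aligned pairs give a lower loss than shuffled pairs.
touch_emb = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [[0.9, 0.1], [0.1, 0.9]]
aligned = info_nce(touch_emb, image_emb)
shuffled = info_nce(touch_emb, image_emb[::-1])
print(aligned < shuffled)
```

Minimizing this loss pulls each touch embedding toward its paired embedding from the other modality and away from the rest of the batch. Zero-shot classification then reduces to a nearest-neighbor lookup against text embeddings in the shared space.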

Tactile Sensing for Dexterous Manipulation Survey (2024)

This survey^[17] comprehensively covers tactile sensing, datasets, and sim-to-real transfer, providing the full context for the data representation, collection, and pretraining topics in this chapter.

Figure 3.10: UniTacHand — a human glove and a DexHand are aligned on a MANO UV map to learn a unified latent space. An illustrative example of foundation-model-class representations for human-to-robot cross-embodiment tactile transfer. Source: Zhang et al. (2025), Fig. 1.

Summary and Outlook

Tactile data representations range from vectors to meshes and images, with the taxonomy of Albini et al.^[1] providing selection criteria. Sensor-agnostic representations (AnyTouch, Sensor-Invariant, Canonical 3D) are essential for reproducibility and scalability, while Sparsh and UniTouch mark the first milestones toward tactile foundation models.

Yet the scale of tactile data remains orders of magnitude below that of visual data. Expanding from Touch-and-Go's 3M contacts to 100M+, applying NVIDIA's synthetic data pipeline to the tactile domain, and enabling cross-embodiment data reuse (Open X-Embodiment for hands) are key future directions.

The next chapter examines the design principles of the robot hands that house these sensors and produce this data (→ Chapter 4: Robot Hand Design).

3.8 Data A/Data B: Representation Includes the Collection Mode

S2's Part I/II expands tactile representation into a data-strategy question. It distinguishes robot teleoperation data (Data A) from human hand data (Data B). Data A is directly executable by the robot, but suffers from throughput, cost, and contact-rich teleoperation limits. Data B can be collected at natural human speed with gloves, exoskeletons, smart glasses, or handheld interfaces, but must be retargeted to the robot embodiment.

This distinction directly affects representation. Data A already aligns robot state, action, and contact force under one clock and one frame tree. Data B requires post-hoc alignment of human hand pose, egocentric video, tactile glove signals, object state, worker id, and environment id. A Data B record therefore needs richer metadata than a robot trajectory record.

Data source	Advantage	Extra representation requirement
Data A: robot teleop	executability, robot-frame alignment	force/action/state synchronization, failure logs
Data B: human hand data	scale, realism, worker diversity	hand model, retarget map, human/robot frame bridge
Data A+B co-training	scale plus executability	shared embedding, modality dropout, alignment labels
tactile Data B	contact-rich information	taxel pose, glove calibration, slip/contact event labels

The lesson is that tactile representation is not just sensor-to-vector conversion. For learning, the record must also preserve which hand, embodiment, contact mode, and collection protocol produced the signal. S2's key claim is that Data B can build a useful policy prior, but contact-rich manufacturing tasks need tactile-rich Data B plus a smaller amount of executable Data A.
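The post-hoc alignment that Data B requires can be sketched as nearest-timestamp matching between two streams. The sketch assumes both streams are already expressed on one clock, sorted, and non-empty. The sample rates, the dropout, and the 10 ms tolerance are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional


def nearest_index(timestamps: List[float], t: float) -> int:
    """Index of the entry in a sorted, non-empty timestamp list closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(reference: List[float], stream: List[float],
          max_gap: float) -> List[Optional[int]]:
    """For each reference time, pick the nearest stream sample or None.

    None marks a reference sample with no stream sample within max_gap
    seconds, which a training pipeline could drop or mask.
    """
    matches = []
    for t in reference:
        j = nearest_index(stream, t)
        matches.append(j if abs(stream[j] - t) <= max_gap else None)
    return matches


# Hypothetical clocks: egocentric video at about 30 Hz, glove at 100 Hz with
# a dropout between 0.04 s and 0.09 s.
video_t = [0.000, 0.033, 0.067, 0.100]
glove_t = [0.00, 0.01, 0.02, 0.03, 0.04, 0.09, 0.10]
print(align(video_t, glove_t, max_gap=0.01))
```

The third video frame receives no tactile match because of the glove dropout. Recording such gaps explicitly, rather than silently reusing a stale sample, is one example of the richer metadata a Data B record needs.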

Operational Reading Note

The practical value of this chapter is not only the concept of tactile record schema; it is the set of engineering decisions that the concept changes. A deployable robot-hand project should start by asking what state becomes observable after this chapter is applied. The answer should be concrete: contact existence, contact patch, normal force, shear direction, slip margin, object pose, task phase, operator override, or product-damage risk. If a variable cannot be logged or consumed by a controller, it remains an explanatory idea rather than a system capability.

The second decision is the unit of evidence. Research demos often report one success metric, but tactile manipulation improves through failure records. A useful attempt record contains the object or SKU, the selected grasp candidate, the robot hand and sensor configuration, calibration version, task phase, tactile summary, policy action, safety intervention, and final outcome. This record is what connects the sensor chapters to the data chapter, the control chapters to the learning chapters, and the manufacturing chapters to QA.

The third decision is where the chapter sits in the control stack. Some ideas belong in fast reflex loops, some in contact MPC, some in policy inputs, and some only in offline diagnosis. Mixing these time scales creates brittle systems: a VLA cannot react to millisecond slip, and a low-level force controller cannot infer the next process step. The right architecture separates fast contact stabilization, mid-level grasp or rearrangement control, and slow task reasoning.

Finally, the chapter should be evaluated by the failure modes it removes. A method that improves benchmark success but leaves the team unable to distinguish perception failure, contact-acquisition failure, force-closure failure, execution-time slip, or maintenance drift is not yet production-ready. A method with slightly lower headline performance but better logs, safer force limits, and clearer recovery hooks may be the stronger basis for manufacturing Physical AI.
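One way to make failure modes countable is to label every failed attempt with a category and report the breakdown next to the success rate. The categories below are the five named in the paragraph above; the record layout and the sample log are hypothetical.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """Failure categories named in the paragraph above."""
    PERCEPTION = "perception"
    CONTACT_ACQUISITION = "contact_acquisition"
    FORCE_CLOSURE = "force_closure"
    EXECUTION_SLIP = "execution_time_slip"
    MAINTENANCE_DRIFT = "maintenance_drift"


def failure_breakdown(attempts):
    """Return (success_rate, Counter of failure modes) for attempt records.

    Each attempt is a dict with a boolean "success" and, when it failed,
    a "failure_mode" entry. Unlabeled failures are counted under None so
    that gaps in the logs stay visible.
    """
    successes = sum(1 for a in attempts if a["success"])
    modes = Counter(a.get("failure_mode") for a in attempts if not a["success"])
    return successes / len(attempts), modes


# Hypothetical attempt log for one SKU.
attempts = [
    {"sku": "A-100", "success": True},
    {"sku": "A-100", "success": False, "failure_mode": FailureMode.EXECUTION_SLIP},
    {"sku": "A-100", "success": False, "failure_mode": FailureMode.EXECUTION_SLIP},
    {"sku": "A-100", "success": False},   # failed, but nobody recorded why
]
rate, modes = failure_breakdown(attempts)
print(rate, modes.most_common(1))
```

A single success rate of 0.25 would hide the fact that two of the three failures share one cause and that the third was never diagnosed.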

References

Albini, A., Kaboli, M., Cannata, G., & Maiolino, P. (2025). Representing data in robotic tactile perception — A review. arXiv preprint (submitted to IEEE T-RO). arXiv:2510.10804.
Yuan, Y., Che, H., Qin, Y., Huang, B., Yin, Z.-H., Lee, K.-W., Wu, Y., Lim, S.-C., & Wang, X. (2024). Robot Synesthesia: In-hand manipulation with visuotactile sensing. ICRA 2024. arXiv:2312.01853.
Yang, F., Feng, C., Chen, Z., Park, H., Wang, D., Dou, Y., ... & Wong, A. (2024). Binding touch to everything: Learning unified multimodal tactile representations. CVPR 2024.
Higuera, C., Sharma, A., Bodduluri, C. K., Fan, T., Lancaster, P., Malik, J., Pathak, D., Lambeta, M., & Calandra, R. (2024). Sparsh: Self-supervised touch representations for vision-based tactile sensing. CoRL 2024.
Liu, Q., Cui, Y., Sun, Z., Li, G., Chen, J., & Ye, Q. (2025). VTDexManip: A dataset and benchmark for visual-tactile pretraining and dexterous manipulation with reinforcement learning. ICLR 2025.
Feng, R., Hu, J., Xia, W., Gao, T., Shen, A., Sun, Y., Fang, B., & Hu, D. (2025). AnyTouch: Learning unified static-dynamic representation across multiple visuo-tactile sensors. ICLR 2025. arXiv:2502.12191.
Various. (2025). AnyTouch 2: General optical tactile representation learning for dynamic tactile perception. OpenReview.
Various. (2025). Sensor-invariant tactile representation. OpenReview.
Wu, C., et al. (2025). Canonical 3D tactile representation for visuo-tactile imitation learning. ICRA 2025. #14
Wang, S., Lambeta, M., et al. (2022). Tacto: A fast, flexible, and open-source simulator for vision-based tactile sensors. IEEE Robotics and Automation Letters.
Si, Z., Zhang, G., Ben, Q., Romero, B., Xian, Z., Liu, C., & Gan, C. (2024). DiffTactile: A physics-based differentiable tactile simulator for contact-rich robotic manipulation. ICLR 2024.
Sundaram, S., Kellnhofer, P., Li, Y., Zhu, J.-Y., Torralba, A., & Matusik, W. (2019). Learning the signatures of the human grasp using a scalable tactile glove. Nature, 569, 698-702.
Zhang, C., Xue, Z., Yin, S., Zhao, B., et al. (2025). UniTacHand: Unified spatio-tactile representation for human to robotic hand skill transfer. arXiv preprint. arXiv:2512.21233. #16
Chen, C., Yu, Z., Choi, H., Cutkosky, M., & Bohg, J. (2025). DexForce: Extracting force-informed actions from kinesthetic demonstrations for dexterous manipulation. IEEE Robotics and Automation Letters. arXiv:2501.10356. #3
Lin, P., Huang, Y., Li, W., Ma, J., Xiao, C., & Jiao, Z. (2025). PP-Tac: Paper picking using omnidirectional tactile feedback in dexterous robotic hands. RSS 2025. #12
Open X-Embodiment Collaboration. (2024). Open X-Embodiment: Robotic learning datasets and RT-X models. ICRA 2024. arXiv:2310.08864.
Various. (2024). Tactile sensing for dexterous manipulation: Taxonomies, datasets, and sim-to-real transfer. Journal of Multidisciplinary Engineering Science.
Yang, F., et al. (2023). Touch and Go: Learning from human-collected vision and touch. ICCV 2023.
Yang, F., et al. (2024). Touch100k: A large-scale touch-language-vision dataset. arXiv preprint. arXiv:2406.03813.
Gao, R., et al. (2022). ObjectFolder 2.0: A multisensory object dataset for sim2real transfer. ICML 2022.
Yin, J., Qi, H., Wi, Y., Kundu, S., Lambeta, M., Yang, W., Wang, C., Wu, T., Malik, J., & Hellebrekers, T. (2025). OSMO: Open-source tactile glove for human-to-robot skill transfer. arXiv preprint. arXiv:2512.08920. #18
Various. (2025). EgoDex: Large-scale egocentric hand manipulation dataset via Apple Vision Pro. arXiv preprint.
Choi, H., Low, J. E., Huh, T. M., Hong, S., Uribe, G. A., Hoffmann, K. A. W., Di, J., Chen, T. G., Stanley, A. A., & Cutkosky, M. R. (2025). CoinFT: A coin-sized, capacitive 6-axis force/torque sensor for robotic applications. arXiv preprint. arXiv:2503.19225. https://coin-ft.github.io/
Choi, H., Hou, Y., Pan, C., Hong, S., Patel, A., Xu, X., Cutkosky, M. R., & Song, S. (2026). In-the-Wild Compliant Manipulation with UMI-FT. arXiv preprint. arXiv:2601.09988.
Choi, H. (2026). Multimodal Data for Robot Manipulation. SNU Data Science Seminar.
Chi, C., et al. (2024). Universal Manipulation Interface: In-the-wild robot teaching without in-the-wild robots. RSS 2024.
Um, T. (2026). S2 From Human Hands to Robot Hands: large-scale tactile hand data survey.