Heterogeneous Tactile Transformer

Learning Shared Tactile Representations Across Heterogeneous Sensors

Jianxin Bi1†, Qiang Wang1, Jayaram Reddy1, Kelvin Lin1, Soibkhon Khajikhanov1, Ruihan Gao2, and Harold Soh1,3†
1National University of Singapore   2Carnegie Mellon University   3Smart Systems Institute, NUS
†Corresponding authors

Code, dataset, and checkpoints will be released in August 2026.

Under Review

HTT learns shared tactile representations across heterogeneous sensors:

1) Pretraining: per-modality masked reconstruction + cross-modal alignment over 1.6M paired frames from four sensors.
2) Transfer: shared representations adapt to new perception tasks and previously unseen sensors for contact-rich manipulation.
Heterogeneous Tactile Transformer (HTT). HTT is pretrained on the Heterogeneous Paired Tactile (HPT) dataset, which contains 1.6M synchronized paired frames from four distinct sensors, collected with a UMI device. HTT combines sensor-specific encoders with a shared transformer trunk: data from each sensor is patchified, encoded, and forwarded to the shared trunk. During pretraining, decoders reconstruct each sensor's input while cross-sensor predictors align the shared latent space across heterogeneous sensors. The pretrained model transfers to distinct perception tasks and improves manipulation policy learning on unseen sensors.
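The pretraining flow described in this caption (patchify, mask, reconstruct, and align paired sensors) can be illustrated with a small plain-Python sketch. This is not the authors' implementation. The patch size, mask ratio, random masking strategy, the `align_weight` factor, the use of mean-squared error for both terms, and every function name below are assumptions made for illustration, since the page does not specify them.

```python
import random

def patchify(frame, patch):
    """Split a 2-D frame (list of rows) into flattened, non-overlapping patches."""
    h, w = len(frame), len(frame[0])
    if h % patch or w % patch:
        raise ValueError("frame size must be divisible by patch size")
    return [
        [frame[r + i][c + j] for i in range(patch) for j in range(patch)]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]

def sample_mask(num_patches, mask_ratio, rng):
    """Pick the patch indices hidden from the encoder (random masking is assumed)."""
    k = max(1, round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), k))

def mse(a, b):
    """Mean-squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pretraining_loss(patches, recon, masked, latent_pred, latent_target,
                     align_weight=1.0):
    """Masked reconstruction for one sensor plus alignment to its paired sensor.

    recon:         patches produced by that sensor's decoder
    latent_pred:   paired sensor's latent, as predicted by a cross-sensor predictor
    latent_target: latent actually computed from the paired sensor's frame
    """
    recon_term = sum(mse(patches[i], recon[i]) for i in masked) / len(masked)
    align_term = mse(latent_pred, latent_target)
    return recon_term + align_weight * align_term

# Toy 4x4 "array sensor" frame split into four 2x2 patches (hypothetical data).
frame = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = patchify(frame, patch=2)
masked = sample_mask(len(patches), mask_ratio=0.5, rng=random.Random(0))

# A perfect reconstruction with perfectly aligned latents gives zero loss.
print(pretraining_loss(patches, patches, masked, [0.0, 0.0], [0.0, 0.0]))
```

In a real model the reconstructed patches and latents would come from learned encoders, the shared trunk, and the decoders. Here they are passed in directly so that only the structure of the objective is visible.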

Abstract

Tactile sensors are inherently heterogeneous: a model trained on one sensor cannot be directly used on another, which limits learning contact-rich manipulation policies from diverse tactile data at scale. To bridge this gap, we propose the Heterogeneous Tactile Transformer (HTT), a framework that learns shared tactile representations across heterogeneous sensors. HTT consists of sensor-specific encoders and a shared transformer trunk, and is pretrained with per-modality masked reconstruction together with cross-modal alignment between paired sensors. Pretraining uses our novel Heterogeneous Paired Tactile (HPT) dataset, containing 1.6M synchronized paired frames across four vision- and array-based tactile sensors. Across distinct tactile perception and real-world manipulation tasks, HTT is shown to learn transferable representations that adapt to new tasks and previously unseen sensors. Dataset, code, and model checkpoints will be released upon publication.


Heterogeneous Tactile Sensors

The four heterogeneous tactile sensors used in HPT pretraining, each mounted in the UMI gripper shell. Left to right: GelSight Mini and 9DTact (optical-based), TAC-02 and Xela uSkin (array-based).

HPT Dataset

Force/slip data collection and dataset statistics. A1. A tactile sensor and a 6-D F/T sensor are mounted on a robot arm; a probe rig contacts the tactile sensor while synchronized tactile frames and ground-truth force are recorded. A2. Example tactile frames from the four probe geometries. B1. The force range spans up to 40 N normal and 14 N shear. B2. Slip labels are heavily imbalanced (13.6% static, 1.2% incipient, 85.2% slide), making the rare classes challenging to detect.
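To make the imbalance concrete, the sketch below turns the reported slip-label fractions into inverse-frequency class weights, one common way to keep rare classes from being ignored during training. The page does not say how the authors handle the imbalance, so the weighting scheme and the function name are assumptions. Only the three fractions come from the caption.

```python
# Slip-label fractions as reported in the caption.
slip_fractions = {"static": 0.136, "incipient": 0.012, "slide": 0.852}

def inverse_frequency_weights(fractions):
    """Weight each class by 1 / (num_classes * fraction).

    With this normalization a perfectly balanced dataset gives every class
    a weight of 1.0.
    """
    k = len(fractions)
    return {name: 1.0 / (k * frac) for name, frac in fractions.items()}

weights = inverse_frequency_weights(slip_fractions)
for name, w in weights.items():
    print(f"{name:>9}: {w:.2f}")
```

Under this scheme the weights come out to roughly 2.45 (static), 27.78 (incipient), and 0.39 (slide), so an incipient-slip frame counts about 71 times as much as a slide frame.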

Real-World Manipulation

Real-world Toy Screw and Grasp Tofu experiments. Left: task setup and tactile image from the Sharpa fingertip. Middle: final rotation of the screw; a rotation of more than 600 degrees (close to fully tightened) counts as a success. Right: Grasp Tofu completion status over 20 rollouts; slip is the most common failure mode. On both tasks, policies using HTT representations achieve the best results.
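The Toy Screw success criterion can be written as a small check. The threshold comes from the caption. The function names and the rollout values are hypothetical and are not experimental results.

```python
SUCCESS_THRESHOLD_DEG = 600.0  # caption: more than 600 degrees counts as success

def screw_success(final_rotation_deg):
    """A rollout succeeds only if the final rotation strictly exceeds the threshold."""
    return final_rotation_deg > SUCCESS_THRESHOLD_DEG

def success_rate(final_rotations_deg):
    """Fraction of rollouts that meet the success criterion."""
    if not final_rotations_deg:
        raise ValueError("need at least one rollout")
    return sum(screw_success(r) for r in final_rotations_deg) / len(final_rotations_deg)

# Hypothetical final rotations, for illustration only.
example_rotations = [720.0, 600.0, 450.0, 610.0]
print(success_rate(example_rotations))
```

Note that a rollout ending at exactly 600 degrees does not count as a success under the "more than" wording.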

Video Gallery

Tactile Perception

Raw tactile streams across the four heterogeneous sensors used for force estimation and slip detection.

9DTact
GelSight Mini
TAC-02
Xela uSkin

Simulation (ManiFeel Benchmark)

HTT embeddings boost policy learning on contact-rich simulated tasks, with two tactile modalities: RGB tactile images and force-field (FF).

HTT (RGB) — Success
HTT (RGB) — Failure
HTT (FF) — Success
HTT (FF) — Failure

Real-World Experiments

Setup: a Sharpa hand on a Franka arm, with no cameras. Three policy inputs are compared. qpos: joint positions only; wrench: qpos plus 6-D fingertip force; HTT: qpos plus HTT tactile embeddings (zero-shot on unseen sensors).
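The three input variants differ only in what is appended to the joint positions. The sketch below assembles each observation vector under assumed conventions: the function name, the variant strings, the number of joints and fingertips, and the embedding size are all hypothetical, since the page gives only the composition of each input.

```python
def build_observation(variant, qpos, fingertip_wrenches=None, htt_embedding=None):
    """Assemble a flat policy input for one of the three compared variants."""
    obs = list(qpos)
    if variant == "qpos":
        return obs
    if variant == "wrench":
        if fingertip_wrenches is None:
            raise ValueError("wrench variant needs fingertip wrenches")
        for wrench in fingertip_wrenches:
            if len(wrench) != 6:
                raise ValueError("each fingertip wrench must be 6-D (force + torque)")
            obs.extend(wrench)
        return obs
    if variant == "htt":
        if htt_embedding is None:
            raise ValueError("htt variant needs a tactile embedding")
        obs.extend(htt_embedding)
        return obs
    raise ValueError(f"unknown variant: {variant}")

# Hypothetical sizes: 7 joints, 2 fingertips, 16-D tactile embedding.
qpos = [0.0] * 7
wrenches = [[0.0] * 6, [0.0] * 6]
embedding = [0.0] * 16

for variant in ("qpos", "wrench", "htt"):
    obs = build_observation(variant, qpos, wrenches, embedding)
    print(variant, len(obs))
```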

Qpos
Wrench
HTT (Ours)

BibTeX

@article{bi2026htt,
      title={Heterogeneous Tactile Transformer},
      author={Bi, Jianxin and Wang, Qiang and Reddy, Jayaram and Lin, Kelvin and Khajikhanov, Soibkhon and Gao, Ruihan and Soh, Harold},
      year={2026},
      eprint={2606.29948},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.29948},
}