Heterogeneous Tactile Transformer

Learning Shared Tactile Representations Across Heterogeneous Sensors

Jianxin Bi¹†, Qiang Wang¹, Jayaram Reddy¹, Kelvin Lin¹, Soibkhon Khajikhanov¹, Ruihan Gao², and Harold Soh¹,³†

¹National University of Singapore ²Carnegie Mellon University ³Smart Systems Institute, NUS

†Corresponding authors

Code, dataset, and checkpoints will be released in August 2026.

Under Review

HTT learns shared tactile representations across heterogeneous sensors:

1) Pretraining: per-modality masked reconstruction + cross-modal alignment over 1.6M paired frames from four sensors.

2) Transfer: shared representations adapt to new perception tasks and previously unseen sensors for contact-rich manipulation.

Figure (HTT teaser): **Heterogeneous Tactile Transformer (HTT).** HTT is pretrained on the *Heterogeneous Paired Tactile (HPT) dataset*, which contains 1.6M synchronized paired frames across four distinct sensors collected with a UMI device. HTT uses sensor-specific encoders and a shared transformer trunk: data from each sensor is patchified, encoded by that sensor's encoder, and forwarded to the shared trunk. During pretraining, decoders reconstruct each sensor's input while cross-sensor predictors align the shared latent space across heterogeneous sensors. The pretrained model applies to distinct perception tasks and improves manipulation policy learning on unseen sensors.
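The patchify step and the masking behind masked reconstruction can be illustrated with a minimal sketch. It assumes a 2-D frame stored as a list of rows, non-overlapping square patches, and uniform random masking. The names `patchify` and `random_mask`, the patch size, and the mask ratio are hypothetical and are not taken from the HTT code, which has not been released yet.

```python
import random


def patchify(frame, patch):
    """Split a 2-D frame (list of equal-length rows) into non-overlapping
    patch x patch tiles, each flattened in row-major order."""
    height, width = len(frame), len(frame[0])
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    patches = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            tile = [frame[r + i][c + j] for i in range(patch) for j in range(patch)]
            patches.append(tile)
    return patches


def random_mask(num_patches, mask_ratio, rng):
    """Pick which patch indices are hidden from the encoder (assumed uniform sampling)."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


# Toy 4x4 "tactile frame" with values 0..15 (illustrative data only).
frame = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(frame, patch=2)
masked = random_mask(len(patches), mask_ratio=0.75, rng=random.Random(0))
visible = [i for i in range(len(patches)) if i not in masked]
```

In a setup like this, only the `visible` patches would be passed to the sensor-specific encoder, and the decoder would be asked to reconstruct the `masked` ones.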

Abstract

Tactile sensors are inherently heterogeneous: a model trained on one sensor cannot be used directly on another, which limits learning contact-rich manipulation policies from diverse tactile data at scale. To bridge this gap, we propose the Heterogeneous Tactile Transformer (HTT), a framework that learns shared tactile representations across heterogeneous sensors. HTT consists of sensor-specific encoders and a shared transformer trunk, and is pretrained with per-modality masked reconstruction together with cross-modal alignment between paired sensors. Pretraining uses our new Heterogeneous Paired Tactile (HPT) dataset, which contains 1.6M synchronized paired frames across four vision- and array-based tactile sensors. Across distinct tactile perception and real-world manipulation tasks, HTT learns transferable representations that adapt to new tasks and previously unseen sensors. Dataset, code, and model checkpoints will be released upon publication.
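One way to picture how the two pretraining objectives could combine is sketched below. The page does not specify the loss functions, so this sketch assumes mean squared error for both terms and a simple weighted sum. The function names and the `align_weight` parameter are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def masked_reconstruction_loss(pred_patches, target_patches, masked_idx):
    """Average reconstruction error over the masked patches only (assumed MSE)."""
    errors = [mse(pred_patches[i], target_patches[i]) for i in masked_idx]
    return sum(errors) / len(errors)


def alignment_loss(predicted_latent, target_latent):
    """Error of a cross-sensor predictor: the latent predicted from sensor A
    versus the latent actually computed from the paired sensor B frame (assumed MSE)."""
    return mse(predicted_latent, target_latent)


def pretraining_loss(recon_losses, align_losses, align_weight=1.0):
    """Sum per-sensor reconstruction terms and weighted cross-sensor alignment terms.
    align_weight is a hypothetical hyperparameter."""
    return sum(recon_losses) + align_weight * sum(align_losses)


# Toy numbers, for illustration only.
recon = masked_reconstruction_loss(
    pred_patches=[[0.0, 0.0], [1.0, 1.0]],
    target_patches=[[0.0, 0.0], [3.0, 1.0]],
    masked_idx=[1],
)
align = alignment_loss([1.0, 2.0], [1.0, 4.0])
total = pretraining_loss([recon], [align], align_weight=0.5)
```

The reconstruction term is computed per sensor, while the alignment term needs synchronized frames from two sensors, which is what a paired dataset provides.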

Heterogeneous Tactile Sensors

HPT Dataset

Real-World Manipulation

Real robot experiments: real-world Toy Screw and Grasp Tofu tasks. Left: task setup and tactile image from the Sharpa fingertip. Middle: final rotation of the screw; a rotation of more than 600 degrees (close to tight) counts as a success. Right: Grasp Tofu completion status over 20 rollouts; slip is the most common failure mode. On both tasks, HTT representations achieve the highest success rate among the compared inputs.
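The Toy Screw success criterion in the caption can be written down as a short sketch. The 600-degree threshold comes from the caption; the function names and the rotation values in the example are hypothetical, not measured results.

```python
SUCCESS_ROTATION_DEG = 600.0  # threshold stated in the caption


def is_screw_success(final_rotation_deg):
    """A rollout succeeds when the final rotation is strictly more than 600 degrees."""
    return final_rotation_deg > SUCCESS_ROTATION_DEG


def success_rate(final_rotations_deg):
    """Fraction of rollouts that meet the success criterion."""
    if not final_rotations_deg:
        raise ValueError("need at least one rollout")
    successes = sum(is_screw_success(r) for r in final_rotations_deg)
    return successes / len(final_rotations_deg)


# Made-up final rotations for four rollouts (illustrative only).
example_rate = success_rate([610.0, 600.0, 720.0, 90.0])
```

Note that a rotation of exactly 600 degrees does not count here, because the caption says "more than 600 degrees".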

Video Gallery

Tactile Perception

Raw tactile streams across the four heterogeneous sensors used for force estimation and slip detection.

Force Estimation (one video per sensor): 9DTact, GelSight Mini, TAC-02, Xela uSkin

Slip Detection (one video per sensor): 9DTact, GelSight Mini, TAC-02, Xela uSkin

Simulation (ManiFeel Benchmark)

HTT embeddings improve policy learning on contact-rich simulated tasks with two tactile modalities: RGB tactile images and force fields (FF).

Bulb Installation (videos): HTT (RGB) success, HTT (RGB) failure, HTT (FF) success, HTT (FF) failure

Peg Insertion (videos): HTT (RGB) success, HTT (RGB) failure, HTT (FF) success, HTT (FF) failure

Real-World Experiments

Setup: a Sharpa hand mounted on a Franka arm, with no cameras. Three policy inputs are compared. qpos: joint positions only. wrench: qpos plus 6-D fingertip force. HTT: qpos plus HTT tactile embeddings (zero-shot on unseen sensors).
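The three policy-input variants can be sketched as a single observation builder. This is a minimal illustration that assumes flat Python lists and simple concatenation. The function name, argument names, and vector sizes are hypothetical, and the check that the wrench length is a multiple of 6 is an assumption to allow one 6-D reading per fingertip.

```python
def build_observation(mode, qpos, wrench=None, tactile_embedding=None):
    """Assemble the policy input for one of the three compared variants.

    mode: "qpos", "wrench", or "htt"
    qpos: joint positions
    wrench: 6-D fingertip force readings, flattened (assumed layout)
    tactile_embedding: embedding vector from a pretrained tactile encoder
    """
    if mode == "qpos":
        return list(qpos)
    if mode == "wrench":
        if wrench is None or len(wrench) % 6 != 0:
            raise ValueError("wrench mode needs 6-D force readings")
        return list(qpos) + list(wrench)
    if mode == "htt":
        if tactile_embedding is None:
            raise ValueError("htt mode needs a tactile embedding")
        return list(qpos) + list(tactile_embedding)
    raise ValueError(f"unknown mode: {mode}")


# Toy values, for illustration only.
obs_qpos = build_observation("qpos", [0.1, 0.2])
obs_wrench = build_observation("wrench", [0.1, 0.2], wrench=[0.0] * 6)
obs_htt = build_observation("htt", [0.1, 0.2], tactile_embedding=[1.0, 2.0, 3.0])
```

Keeping qpos in every variant means the comparison isolates what the extra tactile signal contributes.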

Toy Screw (success rate: qpos 5%, wrench 50%, HTT 95%). Videos: Qpos, Wrench, HTT (Ours)

Grasp Tofu (success rate: qpos 5%, wrench 35%, HTT 55%). Videos: Qpos, Wrench, HTT (Ours)

BibTeX

@article{bi2026htt,
      title={Heterogeneous Tactile Transformer},
      author={Bi, Jianxin and Wang, Qiang and Reddy, Jayaram and Lin, Kelvin and Khajikhanov, Soibkhon and Gao, Ruihan and Soh, Harold},
      year={2026},
      eprint={2606.29948},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.29948},
}