VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback

1 Dept. of Computer Science, National University of Singapore
2 Smart Systems Institute, NUS
3 Show Lab, NUS

VLA-Touch improves VLA models with dual-level tactile feedback:
1) Planning: Convert raw tactile images to linguistic tactile descriptions for effective task planning with VLMs;
2) Manipulation: Refine VLA-generated actions with tactile signals for contact-rich manipulation.



Abstract

Tactile feedback is generally recognized to be crucial for effective interaction with the physical world. However, state-of-the-art Vision-Language-Action (VLA) models lack the ability to interpret and use tactile signals, limiting their effectiveness in contact-rich tasks. Incorporating tactile feedback into these systems is challenging due to the absence of large multi-modal datasets.

We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing without fine-tuning the base VLA. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation. Through real-world experiments, we demonstrate that our dual-level integration of tactile feedback improves task planning efficiency while enhancing execution precision.


Teaser

VLA-Touch Teaser
VLA-Touch enhances VLA models for contact-rich manipulation through dual-level tactile feedback. Left: Tactile-Assisted Task Planning - the VLM task planner actively acquires tactile feedback during planning. Octopi interprets the contacted objects and generates linguistic tactile descriptions that inform subsequent plans. Right: Tactile-Enhanced Manipulation - an interpolant model refines VLA-generated actions using tactile signals, yielding improved contact-rich interactions, such as more consistent contact with the mango surface when peeling.
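
The planning side of this loop can be summarized with a short sketch. This is a minimal illustration only: the wrappers octopi, vlm_planner, and vla_policy, and their method names, are hypothetical placeholders for the tactile-language model, the VLM task planner, and the base VLA, not the released interfaces.

# Sketch of the tactile-assisted planning loop (hypothetical interfaces).
def run_episode(goal, env, octopi, vlm_planner, vla_policy, max_steps=10):
    """Alternate between VLM planning and VLA execution, feeding back
    linguistic tactile descriptions after each contact-rich subtask."""
    tactile_description = None
    for _ in range(max_steps):
        scene_image = env.get_camera_image()

        # The planner conditions on the goal, the current scene, and the
        # latest tactile description (if any) to produce the next instruction.
        instruction = vlm_planner.plan(
            goal=goal,
            image=scene_image,
            tactile_feedback=tactile_description,
        )
        if instruction == "done":
            break

        # The base VLA executes the instruction; raw tactile frames are
        # collected while the gripper is in contact with the object.
        tactile_frames = vla_policy.execute(instruction, env)

        # Octopi converts the raw tactile sequence into a linguistic
        # description (e.g., hardness, roughness) that informs the next plan.
        tactile_description = octopi.describe(tactile_frames)
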

Framework

VLA-Touch Framework
Pipeline of Dual-Level Tactile Feedback for Planning and Manipulation. Planning: Given a scene image s_t and a task goal g, the VLM task planner generates a manipulation instruction I_k for policy execution. A tactile-language model (Octopi) converts a sequence of tactile inputs o^m_{t-n:t} into a language description L^m_t, which informs the VLM's updated instruction. Manipulation: The base VLA π(a_t | s_t, I_k) generates an action chunk a_t from the visual observation s_t and instruction I_k. The action chunk is then refined by an interpolant policy π^I conditioned on s_t, a_t, and m_t, which takes as input visual embeddings from a pretrained DINOv2 model together with low-dimensional tactile signals m_t extracted from the raw tactile input o^m_t by a marker-tracking algorithm.
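
At the manipulation level, the interpolant refiner acts as a post-processor on the VLA's action chunks. The sketch below is illustrative only: vla, dino_encoder, track_markers, and interpolant_policy are placeholder names for the base VLA, a frozen DINOv2 image encoder, the marker-tracking routine, and the interpolant refiner; they do not reflect the released code.

# Sketch of tactile-enhanced action refinement (hypothetical interfaces).
def refined_action_chunk(obs, instruction, vla, dino_encoder,
                         track_markers, interpolant_policy):
    """Generate a coarse action chunk with the frozen VLA, then refine it
    with visual embeddings and low-dimensional tactile features."""
    # Coarse action chunk a_t from the base VLA, conditioned on the current
    # visual observation s_t and the instruction I_k from the planner.
    coarse_chunk = vla.predict(obs["image"], instruction)  # shape: (horizon, action_dim)

    # Visual embedding of s_t from a pretrained DINOv2 backbone.
    visual_emb = dino_encoder(obs["image"])

    # Low-dimensional tactile signal m_t: marker displacements extracted
    # from the raw tactile image o^m_t by a marker-tracking algorithm.
    tactile_feat = track_markers(obs["tactile_image"])

    # The interpolant policy treats the coarse chunk as its source
    # distribution and outputs a refined chunk conditioned on s_t and m_t.
    refined_chunk = interpolant_policy.refine(
        source_actions=coarse_chunk,
        visual_embedding=visual_emb,
        tactile_signal=tactile_feat,
    )
    return refined_chunk
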

Acknowledgements

VLA-Touch builds on several open-source works, including BRIDGeR, Octopi, and RDT-1B. We thank the authors for open-sourcing their code and for their great contributions to the community.

BibTex

@misc{bi2025vlatouchenhancingvisionlanguageactionmodels,
      title={VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback}, 
      author={Jianxin Bi and Kevin Yuchen Ma and Ce Hao and Mike Zheng Shou and Harold Soh},
      year={2025},
      eprint={2507.17294},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2507.17294}, 
    }