VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback

1 Dept. of Computer Science, National University of Singapore
2 Smart Systems Institute, NUS
3 Show Lab, NUS

VLA-Touch improves VLA models with dual-level tactile feedback:
1) Planning: Convert raw tactile images to linguistic tactile descriptions for effective task planning with VLMs;
2) Manipulation: Refine VLA-generated actions with tactile signals for contact-rich manipulation.



Abstract

Tactile feedback is generally recognized to be crucial for effective interaction with the physical world. However, state-of-the-art Vision-Language-Action (VLA) models lack the ability to interpret and use tactile signals, limiting their effectiveness in contact-rich tasks. Incorporating tactile feedback into these systems is challenging due to the absence of large multi-modal datasets.

We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing without fine-tuning the base VLA. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation. Through real-world experiments, we demonstrate that our dual-level integration of tactile feedback improves task planning efficiency while enhancing execution precision.


Teaser

VLA-Touch Teaser
VLA-Touch enhances VLA models for contact-rich manipulation through dual-level tactile feedback. Left: Tactile-Assisted Task Planning - the VLM task planner actively acquires tactile feedback during planning. Octopi interprets the contacted objects and generates linguistic tactile descriptions that inform subsequent plans. Right: Tactile-Enhanced Manipulation - an interpolant model refines VLA-generated actions using tactile signals, yielding improved contact-rich interactions, such as more consistent contact with the mango surface when peeling.
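
The planning side of this loop can be summarized with a short sketch. This is a minimal illustration only: the wrappers octopi, vlm_planner, and vla_policy, and their method names, are hypothetical placeholders for the tactile-language model, the VLM task planner, and the base VLA, not the released interfaces.

# Sketch of the tactile-assisted planning loop (hypothetical interfaces).
def run_episode(goal, env, octopi, vlm_planner, vla_policy, max_steps=10):
    """Alternate between VLM planning and VLA execution, feeding back
    linguistic tactile descriptions after each contact-rich subtask."""
    tactile_description = None
    for _ in range(max_steps):
        scene_image = env.get_camera_image()

        # The planner conditions on the goal, the current scene, and the
        # latest tactile description (if any) to produce the next instruction.
        instruction = vlm_planner.plan(
            goal=goal,
            image=scene_image,
            tactile_feedback=tactile_description,
        )
        if instruction == "done":
            break

        # The base VLA executes the instruction; raw tactile frames are
        # collected while the gripper is in contact with the object.
        tactile_frames = vla_policy.execute(instruction, env)

        # Octopi converts the raw tactile sequence into a linguistic
        # description (e.g., hardness, roughness) that informs the next plan.
        tactile_description = octopi.describe(tactile_frames)
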

Framework

VLA-Touch Framework
Pipeline of Dual-Level Tactile Feedback for Planning and Manipulation. Planning: Given a scene image s_t and a task goal g, the VLM task planner generates a manipulation instruction I_k for policy execution. A tactile-language model (Octopi) converts a sequence of tactile inputs o^m_{t-n:t} into a language description L^m_t, which informs the VLM's updated instruction. Manipulation: The base VLA π(a_t | s_t, I_k) generates an action chunk a_t from the visual observation s_t and instruction I_k. The action chunk is then refined by an interpolant policy π^I conditioned on s_t, a_t, and m_t, which takes as input visual embeddings from a pretrained DINOv2 model together with low-dimensional tactile signals m_t extracted from the raw tactile input o^m_t by a marker-tracking algorithm.
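
At the manipulation level, the interpolant refiner acts as a post-processor on the VLA's action chunks. The sketch below is illustrative only: vla, dino_encoder, track_markers, and interpolant_policy are placeholder names for the base VLA, a frozen DINOv2 image encoder, the marker-tracking routine, and the interpolant refiner; they do not reflect the released code.

# Sketch of tactile-enhanced action refinement (hypothetical interfaces).
def refined_action_chunk(obs, instruction, vla, dino_encoder,
                         track_markers, interpolant_policy):
    """Generate a coarse action chunk with the frozen VLA, then refine it
    with visual embeddings and low-dimensional tactile features."""
    # Coarse action chunk a_t from the base VLA, conditioned on the current
    # visual observation s_t and the instruction I_k from the planner.
    coarse_chunk = vla.predict(obs["image"], instruction)  # shape: (horizon, action_dim)

    # Visual embedding of s_t from a pretrained DINOv2 backbone.
    visual_emb = dino_encoder(obs["image"])

    # Low-dimensional tactile signal m_t: marker displacements extracted
    # from the raw tactile image o^m_t by a marker-tracking algorithm.
    tactile_feat = track_markers(obs["tactile_image"])

    # The interpolant policy treats the coarse chunk as its source
    # distribution and outputs a refined chunk conditioned on s_t and m_t.
    refined_chunk = interpolant_policy.refine(
        source_actions=coarse_chunk,
        visual_embedding=visual_emb,
        tactile_signal=tactile_feat,
    )
    return refined_chunk
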

Acknowledgements

VLA-Touch builds on several open-source works, including BRIDGeR, Octopi, and RDT-1B. We thank the authors for open-sourcing their code and for their great contributions to the community.

BibTex

@misc{bi2025vlatouchenhancingvisionlanguageactionmodels,
      title={VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback}, 
      author={Jianxin Bi and Kevin Yuchen Ma and Ce Hao and Mike Zheng Shou and Harold Soh},
      year={2025},
      eprint={2507.17294},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2507.17294}, 
    }