AVA-VLA: Improving Vision-Language-Action models with Active Visual ... Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). However, this history-agnostic design is suboptimal for effective ...
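The snippet's point about per-timestep processing can be made concrete with two hypothetical policy signatures (the names and shapes below are illustrative sketches, not taken from the AVA-VLA paper): a history-agnostic call that sees only the current frame, which is what implicitly treats the task as an MDP, versus a variant that conditions on the observation history.

```python
from typing import Sequence

import numpy as np


def markovian_vla_step(image: np.ndarray, instruction: str) -> np.ndarray:
    """History-agnostic call: the predicted action depends only on the
    current frame and the instruction, i.e. an implicit MDP assumption."""
    raise NotImplementedError  # stand-in for a VLM-backed forward pass


def history_conditioned_vla_step(frames: Sequence[np.ndarray], instruction: str) -> np.ndarray:
    """Contrasting signature: the policy also conditions on past frames,
    dropping the Markov assumption the abstract criticizes."""
    raise NotImplementedError  # stand-in for a history-aware forward pass
```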
Vision-language-action model - Wikipedia. In robot learning, a vision-language-action model (VLA) is a class of multimodal foundation models that integrates vision, language and actions. Given an input image (or video) of the robot's surroundings and a text instruction, a VLA directly outputs low-level robot actions that can be executed to accomplish the requested task.
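A minimal sketch of the interface this definition describes, assuming a hypothetical policy class and environment (none of these names come from a real library): each control step maps an RGB frame plus the text instruction directly to a low-level action vector.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    """One timestep of sensing: a camera frame plus the task instruction."""
    image: np.ndarray  # HxWx3 RGB view of the robot's surroundings
    instruction: str   # e.g. "pick up the red block"


class VisionLanguageActionPolicy:
    """Hypothetical VLA interface: (image, instruction) -> low-level action."""

    def predict_action(self, obs: Observation) -> np.ndarray:
        """Return a low-level action, e.g. an end-effector delta plus gripper command."""
        raise NotImplementedError


def control_loop(policy: VisionLanguageActionPolicy, env, instruction: str, max_steps: int = 200):
    """Run the policy closed-loop against a hypothetical env with reset()/step()."""
    image = env.reset()
    for _ in range(max_steps):
        action = policy.predict_action(Observation(image=image, instruction=instruction))
        image, done = env.step(action)  # execute the action, observe the next frame
        if done:
            break
```

Note that this loop feeds only the current frame to the policy at every step, which is exactly the history-agnostic usage the AVA-VLA snippet above argues is suboptimal.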