- LLaVA: Large Language and Vision Assistant - GitHub
[1/30] 🔥 LLaVA-NeXT (LLaVA-1.6) is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the blog post, and explore the demo! Models are available in the Model Zoo. Training/eval data and scripts coming soon.
- LLaVA: Large Language and Vision Assistant - Microsoft Research
LLaVA represents a cost-efficient approach to building a general-purpose multimodal assistant. It is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.
- LLaVA Architecture: From Frozen ViT to Fine-Tuned LLM
A complete technical breakdown of the LLaVA-1.5 multimodal visual assistant. Explore its architecture, open-source training data, and how to use the model.
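
In LLaVA-1.5, the bridge from the frozen ViT to the fine-tuned LLM is a small two-layer MLP projector that maps patch features into the LLM's token-embedding space. A minimal sketch of that module is below; the 1024/4096 dimensions assume a CLIP ViT-L/14 encoder and a Vicuna-7B language model and should be adjusted for other backbones.

```python
import torch
import torch.nn as nn

class LlavaProjector(nn.Module):
    """Two-layer MLP mapping frozen ViT patch features into the LLM embedding space.

    Dimensions assume CLIP ViT-L/14 (1024-d) and Vicuna-7B (4096-d); adjust for other backbones.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the frozen ViT.
        # Returns visual "tokens" of shape (batch, num_patches, llm_dim) that are
        # concatenated with the text token embeddings before the LLM forward pass.
        return self.proj(patch_features)
```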
- LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound
In addition, we perform data redundancy adaptation through cross-modal contrastive learning and knowledge distillation. LLaVA-Ultra shows strong capability and robustness in medical scenarios. On three Med-VQA datasets, LLaVA-Ultra surpasses previous state-of-the-art models on various metrics.
- LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound
Owing to its robust architecture as well as fine-grained professional data, LLaVA-Ultra demonstrates best practice in the Chinese medical domain. Trained in only 60 hours on four 48GB A40 GPUs, it provides detailed answers relevant to the visual content in medical conversations.
- GitHub - HumanMLLM/LLaVA-Scissor
Overview: LLaVA-Scissor provides a solution for compressing video tokens based on identifying semantic connected components. Previous methods mostly attempt to compress tokens based on attention scores, but fail to achieve complete semantic coverage and tend to repeatedly select similar semantic regions.
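
As an illustration only (not the repository's actual code), the sketch below groups tokens into connected components of a cosine-similarity graph and keeps one mean representative per component; the `threshold` value is an arbitrary assumption.

```python
import torch

def compress_by_connected_components(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Illustrative sketch: merge tokens belonging to the same connected component
    of the graph whose edges are pairwise cosine similarities above `threshold`.

    tokens: (num_tokens, dim) frame/patch features
    returns: (num_components, dim), one mean representative per component
    """
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    adj = (normed @ normed.T) > threshold          # boolean similarity graph
    n = tokens.shape[0]
    labels = torch.full((n,), -1, dtype=torch.long)
    current = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        # flood-fill the component containing token i
        stack, labels[i] = [i], current
        while stack:
            j = stack.pop()
            neighbors = torch.nonzero(adj[j] & (labels < 0)).flatten()
            labels[neighbors] = current
            stack.extend(neighbors.tolist())
        current += 1
    # keep one representative (mean feature) per component
    return torch.stack([tokens[labels == c].mean(dim=0) for c in range(current)])
```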
- LLaVA - Cerebras AI
Model Description: LLaVA (Large Language and Vision Assistant) is a multimodal model that integrates a vision encoder with a language model via a lightweight projector module, enabling end-to-end visual and language understanding. It accepts both image and text inputs and generates text-based outputs, making it suitable for instruction-following, question answering, and general-purpose visual …
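
A minimal inference sketch using the Hugging Face `transformers` port of LLaVA-1.5; the `llava-hf/llava-1.5-7b-hf` checkpoint id and the `USER: <image> ... ASSISTANT:` prompt format follow that port's model card, and the image path is a placeholder.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # Hugging Face port of LLaVA-1.5
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("example.jpg")  # any local image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

# The processor inserts the projected image features at the <image> placeholder.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```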
- Mingze Xu - Indiana University
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models. Mingze Xu*, Mingfei Gao*, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan.