Liquid AI Just Dropped the Fastest, Best Open-Source Foundation Model

TLDR;

Liquid AI's LFM2-VL is a family of vision-language models built to run efficiently on devices such as phones and laptops. Key points include:

  • LFM2-VL models are designed for low latency and on-device deployment, offering up to twice the GPU inference speed of comparable vision-language models.
  • The architecture combines a language-model backbone, a vision encoder, and a multimodal projector, optimised for detail and efficiency.
  • The models integrate with Hugging Face Transformers and support customisation and deployment via the LEAP platform.
  • Licensing is open for smaller companies; larger enterprises require a commercial license.

Introduction to Liquid AI and LFM2-VL [0:02]

Liquid AI, a spin-out of MIT's CSAIL, is focused on redesigning AI architectures for efficiency. Its liquid foundation models (LFMs) are built on principles from mathematics and signal processing, resulting in lighter, faster, and more flexible models. LFM2-VL is a family of vision-language models designed for low latency and on-device use, meaning they are fast enough to run on everyday hardware.

LFM2-VL Versions and Speed [1:16]

There are two versions of LFM2-VL: a 450-million-parameter model for resource-constrained devices and a 1.6-billion-parameter model for single GPUs or high-end mobile hardware. Both offer up to twice the GPU inference speed of comparable vision-language models, significantly reducing processing time.

Model Architecture [2:13]

LFM2-VL consists of a language-model backbone, a vision encoder, and a multimodal projector. The language backbone is LFM2-1.2B for the larger model and LFM2-350M for the smaller one. The vision encoder is a SigLIP2 NaFlex encoder, which processes images at their native resolution up to 512x512 pixels. The multimodal projector maps vision features into the language model's token space and applies pixel unshuffle to reduce the number of image tokens for efficiency.
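
To make the token-reduction step concrete, here is a minimal, illustrative sketch of pixel unshuffle in PyTorch. The shapes, the 2x2 folding factor, and the function name are assumptions chosen for illustration, not Liquid AI's actual implementation:

```python
import torch

def pixel_unshuffle(features: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Fold each `factor` x `factor` block of spatial positions into the
    channel dimension.

    features: (batch, height, width, channels) patch embeddings from the
    vision encoder. Returns (batch, height/factor, width/factor,
    channels * factor**2), i.e. 4x fewer image tokens when factor=2.
    """
    b, h, w, c = features.shape
    assert h % factor == 0 and w % factor == 0
    x = features.reshape(b, h // factor, factor, w // factor, factor, c)
    x = x.permute(0, 1, 3, 2, 4, 5)  # gather each 2x2 block together
    return x.reshape(b, h // factor, w // factor, c * factor * factor)

# Example: a 512x512 image with 16x16 patches yields a 32x32 grid = 1024
# tokens; after pixel unshuffle only 16x16 = 256 tokens reach the projector.
tokens = torch.randn(1, 32, 32, 768)
print(pixel_unshuffle(tokens).shape)  # torch.Size([1, 16, 16, 3072])
```

The trade-off is that each remaining token carries four times as many channels, so the projector sees the same information in a quarter of the sequence length.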

Flexibility and Training [4:24]

Users can adjust settings to prioritise speed or accuracy depending on the target device. Training proceeded in stages: pre-training the language backbone, gradually joining the vision and language components, and fine-tuning for image understanding on roughly 100 billion multimodal tokens drawn from open-source and synthetic datasets.
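
As a rough illustration of that speed-accuracy trade-off, the snippet below caps how many tokens each image may produce: a lower cap shortens the prefill and speeds up inference, a higher cap preserves more visual detail. The parameter names (min_image_tokens, max_image_tokens, do_image_splitting) are taken from the public model card and may differ between releases, so treat them as assumptions to verify against the processor documentation.

```python
from transformers import AutoProcessor

# Tunable knobs for the speed/accuracy trade-off (names assumed from the
# model card; check the processor config for your release).
processor = AutoProcessor.from_pretrained(
    "LiquidAI/LFM2-VL-1.6B",
    min_image_tokens=64,       # floor on tokens per image
    max_image_tokens=256,      # cap: fewer tokens = faster inference
    do_image_splitting=True,   # tile large images instead of downscaling
)
```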

Performance Benchmarks [5:23]

The 1.6-billion-parameter model scored 65.23 on RealWorldQA, 58.68 on InfoVQA, and 742 on OCRBench. The smaller model also performed well, scoring 52.29 on RealWorldQA and 655 on OCRBench. Both are up to twice as fast as comparable systems at inference, which matters for real-world applications such as smart cameras and robots.

Ease of Use and Offline Capabilities [6:30]

LFM2-VL integrates with Hugging Face Transformers and supports quantization for reduced memory usage. The LEAP platform enables customisation and deployment on mobile devices, and the Apollo app allows offline testing. Liquid AI aims to reduce cloud dependency so devices can run AI tasks locally for better privacy and speed.
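
Below is a minimal usage sketch with Hugging Face Transformers, assuming the 1.6B checkpoint is published on the Hub as LiquidAI/LFM2-VL-1.6B and follows the standard image-text-to-text API; the exact auto class and chat-template behaviour may vary with your Transformers version.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "LiquidAI/LFM2-VL-1.6B"  # assumed Hub id; a 450M variant also exists
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat-style multimodal prompt: one image plus a text question.
conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("photo.jpg")},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate a short caption and decode it back to text.
output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```

For tighter memory budgets, the same checkpoint could in principle be loaded through a quantization backend such as bitsandbytes, in line with the quantization support mentioned above.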

Licensing and Use Cases [7:30]

The models are released under the LFM1.0 license, which is based on Apache 2.0 with additional conditions. Companies with under $10 million in revenue can use the models for research and commercial projects, while larger companies need a commercial license. Use cases include real-time image captioning, multimodal chatbots, visual search, robotics, IoT systems, and smart cameras.

Date: 8/24/2025 Source: www.youtube.com