what-we-mean-by-multimodal
Prashanth Rao

One System, Many Workloads: Rethinking What "Multimodal" Means for AI

January 5, 2026
Engineering

Modern AI-native data infrastructure is no longer about just storing text, metadata, embeddings and pointers to JPEGs – it’s about storage engines that natively support the diverse structures, access patterns, and use cases of complex data seen in practice. A robust data retrieval layer involves far more than just indexing text and running filtered SQL queries. Whether it’s an e-commerce site filtering data by its image catalog, or a financial organization deriving insights from PDFs, or a media firm managing its data for search and content generation (images, video and audio), multimodality is a pragmatic reality of modern AI.

The multimodal reality check

Before diving into the nuances of the term “multimodal”, we must address a question we often hear in the wild: “Why bother with multimodal? Can’t I just generate embeddings and be done with it for my AI application?”

Embeddings are useful tools for similarity search, but they’re lossy representations of data. Real-world systems still need to retrieve, govern, and version the source-of-truth bytes: the actual image, video segment, audio clip, or text spans that a human (or an agent) uses downstream.

The conventional pattern for handling multimodal data is “store scalar fields and pointers (URLs) in one table, and store the blobs separately in the object store”. Over time, this architecture fragments your mental model: structured fields live in one system, vectors in another, and the bytes somewhere else – all held together by a catalog of pointer URLs and brittle ETL jobs.

As the volume and variety of data, modalities and derived artifacts grow with time (images, PDFs, audio/video, embeddings, captions, labels), traditional architectures often lead to governance drift: deleted files, invalid pointers, stale embeddings, and unclear lineage of what changed when. The usual response is more ETL and more glue code to keep everything coherent, plus more IOPS at runtime (multiple network hops just to serve one “result” back to an application or agent).
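To make governance drift concrete, here is a minimal sketch of the kind of audit a pointer-based setup tends to need. Everything in it is hypothetical: the `MetadataRow` fields, the `s3://` URLs, and the dictionary that stands in for an object store listing. It only illustrates how dangling pointers and stale embeddings arise when rows and blobs are managed separately.

```python
from dataclasses import dataclass


@dataclass
class MetadataRow:
    item_id: str
    blob_url: str          # pointer into the object store (hypothetical URL)
    embedded_version: str  # blob version the stored embedding was computed from


def audit(rows, object_store):
    """Return (dangling, stale) item ids.

    `object_store` maps blob URL -> current blob version and stands in
    for a real object store listing.
    """
    dangling, stale = [], []
    for row in rows:
        current_version = object_store.get(row.blob_url)
        if current_version is None:
            dangling.append(row.item_id)   # the blob was deleted or moved
        elif current_version != row.embedded_version:
            stale.append(row.item_id)      # the blob changed after embedding
    return dangling, stale


rows = [
    MetadataRow("a", "s3://bucket/a.jpg", "v1"),
    MetadataRow("b", "s3://bucket/b.jpg", "v1"),
    MetadataRow("c", "s3://bucket/c.jpg", "v1"),
]
object_store = {"s3://bucket/a.jpg": "v1", "s3://bucket/b.jpg": "v2"}

dangling, stale = audit(rows, object_store)
print("dangling:", dangling, "stale:", stale)
```

A job like this has to run continuously and can only detect drift after it has happened, which is why it usually ends up as one more piece of glue code to maintain.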

A more useful definition of the term “multimodal”

Most people encounter the term “multimodal” as a content-based definition, i.e., what format the data is in – text, images, audio, video, PDFs, sensor logs, and so on. That’s a useful starting point, but it’s not the whole story. In real systems, multimodality also shows up in how that data is accessed (scans vs. random access, online vs. batch) and in how it’s used downstream (analytics, search, feature engineering, training).

In this section, we break down these dimensions of multimodal complexity.

How the data looks

Multimodal data in the wild is often complex, heterogeneous, and nested. Even “simple” image/video datasets tend to mix standard scalar fields alongside high-dimensional vectors and large media blobs. In OpenVid-1M, for example, a single record might include a video_path and caption (strings), a handful of quality and motion scores (floats), an embedding vector (a fixed-size list), and a nested video_blob struct that points at the raw video bytes. Any given row is already a bundle of different shapes and sizes – some scalar, some nested, others large blobs.

This scenario shows up in how the open source video dataset, OpenVid-1M, is distributed: tabular metadata lives in standard files, while the actual videos are shipped as separate archives/blobs and joined back via paths or offsets.

├── nkp37/OpenVid-1M
├── data
│   └── train/
│       ├── OpenVid-1M.csv
│       └── OpenVidHD.csv
├── OpenVid_part0.zip
├── OpenVid_part1.zip
├── ...  # Zip files
├── OpenVidHD/
│   ├──  OpenVidHD.json
│   ├──  OpenVidHD_part_1.zip
│   ├──  OpenVidHD_part_2.zip
│   └── ... # Zip files
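As a rough illustration of how uneven a single row can be, the sketch below models one OpenVid-1M-style record with plain Python dataclasses. The score field names, the layout of `VideoBlob`, and all values are assumptions made for illustration; they are not taken from the dataset itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VideoBlob:
    # Hypothetical pointer to the raw bytes: an archive plus a byte range.
    uri: str
    offset: int
    size_bytes: int


@dataclass
class VideoRecord:
    video_path: str          # string
    caption: str             # string
    aesthetic_score: float   # float (field name is an assumption)
    motion_score: float      # float (field name is an assumption)
    embedding: List[float]   # fixed-size list
    video_blob: VideoBlob    # nested struct


EMBEDDING_DIM = 8  # kept tiny for illustration


def approx_widths(rec):
    """Approximate per-field sizes in bytes, assuming float32 storage."""
    return {
        "caption": len(rec.caption.encode("utf-8")),
        "scores": 2 * 4,
        "embedding": 4 * len(rec.embedding),
        "video_blob": rec.video_blob.size_bytes,
    }


record = VideoRecord(
    video_path="clips/example_0001.mp4",
    caption="A cyclist rides along a river at sunset.",
    aesthetic_score=5.1,
    motion_score=0.8,
    embedding=[0.0] * EMBEDDING_DIM,
    video_blob=VideoBlob(uri="archive.zip", offset=0, size_bytes=12_000_000),
)

widths = approx_widths(record)
print(widths)
```

Even in this toy row, the blob is several orders of magnitude wider than every other field combined, which is the mixed-width problem described next.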

In traditional table formats, mixed-width columns (scalar data + large blobs) force a compromise – for reasons of partitioning and performance, you split the “fast” scalar data into a table that’s queryable in SQL, and banish the embeddings and blobs (large binaries) to a separate table or system. Large binary blobs (video, audio or image bytes) tend to be managed separately in object stores, with pointer URLs in the scalar data table. “Offline” embeddings that are generated via a batch job are stored in feature stores or object stores, and “online” embeddings are maintained in a dedicated vector database, in-memory index, or key-value store.

This is the first dimension of multimodal complexity: it’s not just that you have multiple formats and data structures to store on disk, but that it takes a lot of time and energy to manage the data across different storage layers and multiple systems, while keeping everything in sync with the source data.

How the data is accessed

Even when you keep everything in sync, multimodal workloads introduce a second (and often underappreciated) source of complexity: the way you need to access the data changes dramatically across the AI lifecycle. This is less about where the bytes live and more about the competing I/O patterns your system has to serve. The same dataset gets touched by two very different access patterns:

  1. Sequential scans (throughput-oriented): When you’re doing analytics or exploratory work, you want bandwidth. You might scan millions of rows to compute simple statistics (“average brightness across 1M frames”, “distribution of clip lengths”, “how often does this label appear?”), and you want the storage layer to read data efficiently and apply filters quickly. This is the world of high-throughput reads and large, contiguous scans.
  2. Random access (latency- and IOPS-oriented): In real-time retrieval and agentic workflows, you want the opposite: small, targeted reads with low latency. You’re rarely reading “a lot of data at once” – you’re trying to find the most relevant items fast, and then you often need to fetch the original bytes (the image, the audio segment, the PDF page, the video frames) to ground a response or power a downstream tool.
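The two patterns above can be sketched with nothing more than a flat binary file. This is a toy stand-in, not how Lance or any real columnar format lays out data: the fixed-width record layout, the file name, and the brightness values are all made up for illustration.

```python
import os
import struct
import tempfile

# One fixed-width record per frame: (frame_id: uint32, brightness: float32).
RECORD = struct.Struct("<If")


def write_frames(path, n_frames):
    with open(path, "wb") as f:
        for frame_id in range(n_frames):
            f.write(RECORD.pack(frame_id, (frame_id % 10) / 10))


def scan_mean_brightness(path, records_per_read=1024):
    """Sequential scan: large contiguous reads, every row is touched."""
    total, count = 0.0, 0
    with open(path, "rb") as f:
        while chunk := f.read(RECORD.size * records_per_read):
            for _, brightness in RECORD.iter_unpack(chunk):
                total += brightness
                count += 1
    return total / count


def take(path, frame_ids):
    """Random access: one small seek-and-read per requested row."""
    rows = []
    with open(path, "rb") as f:
        for frame_id in frame_ids:
            f.seek(frame_id * RECORD.size)
            rows.append(RECORD.unpack(f.read(RECORD.size)))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frames.bin")
    write_frames(path, 1000)
    mean_brightness = scan_mean_brightness(path)  # throughput-bound
    picked = take(path, [7, 999, 42])             # latency- and IOPS-bound

print(round(mean_brightness, 3), [frame_id for frame_id, _ in picked])
```

The scan issues a handful of large reads, while `take` issues one small read per row. On object storage each of those small reads becomes a network request, which is where IOPS limits start to matter.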

What makes these multimodal access patterns especially tricky is that real workflows often combine both personalities of data in a single request. For example, you might want to isolate frames of “outdoor scenes in broad daylight with city landscapes”. A common pipeline is: find candidate frames via embeddings and semantic search (random access), apply additional filters (sometimes scan-like, sometimes point lookups), and then fetch the raw frames for downstream use.

Training pipelines also bring a mix of scan and random access pressure. Each epoch needs fresh, random batches of examples, requiring efficient shuffles (random access). If your storage engine is only good at scans, the data loader ends up doing lots of small, scattered reads – and if those reads also require multiple network hops (first to fetch IDs or embeddings, then to fetch the raw bytes from an object store), IOPS saturate quickly. This dichotomy leads to ML teams maintaining lots of custom infrastructure and caching layers to balance IOPS such that their GPUs are sufficiently fed, adding significant costs over time.
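To see why shuffling translates into random access, consider this minimal sketch of a per-epoch shuffled sampler. The function name and the seeding scheme are assumptions for illustration and do not come from any particular data loader.

```python
import random


def epoch_batches(n_rows, batch_size, seed, epoch):
    """Yield batches of row ids in a fresh random order for each epoch."""
    # Derive a per-epoch seed so every epoch is reproducible but different.
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(n_rows))
    rng.shuffle(order)
    for start in range(0, n_rows, batch_size):
        yield order[start:start + batch_size]


epoch0 = list(epoch_batches(100, 32, seed=7, epoch=0))
epoch1 = list(epoch_batches(100, 32, seed=7, epoch=1))

print(len(epoch0), "batches; last batch has", len(epoch0[-1]), "rows")
```

Every batch is a list of scattered row ids, so the storage layer has to serve what is effectively a `take` of 32 arbitrary rows, many times per second, for as long as the GPUs need feeding.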

This is the second dimension of multimodal complexity: you don’t just need a place to store the data – you need a retrieval layer that can serve both high-throughput scans and low-latency random access, from the same source of truth, without forcing you to build and maintain multiple “performance copies” of the same dataset across different systems.

How the data is used

If the previous dimension of multimodal complexity was how you access the data, this one is about why you’re accessing it in the first place. Enterprise data isn’t a static artifact you “load once and query forever”. It’s a living, breathing asset that gets reused, reinterpreted, and enriched over time – often by people with very different goals and mental models.

The same underlying video collection (or robotics dataset, or PDF archive) might need to support all of the following (sometimes in the same week!):

  • The data scientist (EDA and dataset sanity checks): They’re trying to understand what’s in the data before they trust it. They ask questions like “Do we have class imbalance?”, “Which scenes are underrepresented?”, or “Show me frames that look like a pedestrian in the middle of the road.” These are ad-hoc, iterative queries that mix filters, sampling, and pulling back raw examples to inspect.
  • The analyst (metrics and reporting): They care about longitudinal trends and aggregates. “How many red-light incidents happened last month?”, “What’s the distribution of near-miss events by region?”, “How did the model’s false positive rate change after the last release?” These are scan-heavy workloads that need consistent semantics and reproducible results.
  • The application developer (search and product experiences): They’re building interactive features: “search by image”, “find similar clips”, “retrieve the most relevant PDF pages”, “show related products”. Latency matters, and the application needs both the retrieved context (metadata) and the retrieved content (pixels, audio, page crops).
  • The ML engineer (feature engineering and labeling loops): They want to run pipelines that generate new/derived columns (e.g., captions, tags, OCR output, embeddings, quality scores) and write them back, often repeatedly, as models improve. The dataset evolves, but it still has to remain queryable and consistent for everyone else.
  • The training/infrastructure team (large-scale training and evaluation): They’re pulling massive subsets of the data for training foundation models, doing repeated evaluations, and slicing performance by metadata. They need high throughput (scan performance) to gulp large batches of data for their GPU clusters, but they also need fast random access, to isolate the most relevant subsets of the data and to ensure randomized samples via shuffling.
  • The agent (tool-using retrieval and action): Increasingly, the “user” is an agent that chains retrieval with downstream tools and workflows. It doesn’t just use “nearest neighbors” – it requires the source-of-truth bytes (the image, the relevant video segment, the specific PDF page) to stay grounded while it reasons and acts.

This is the third dimension of multimodal data in the enterprise: the same data must serve a wide range of downstream consumption scenarios, across multiple personas, without forcing you to copy, reshape, and migrate your dataset into different systems every time the question changes.

Why we built LanceDB

LanceDB’s multimodal lakehouse is built on top of the open source Lance format, designed to address the challenges of managing multimodal complexity we just walked through. In LanceDB, the same table can function as a lakehouse table when you’re scanning and transforming data, and like a low-latency retrieval system when you’re serving search and agentic applications.

1. Single source of truth, for all your data

The first premise is straightforward: stop treating your raw bytes as second-class citizens. LanceDB uses Lance tables under the hood: a Lance table can natively store not only standard tabular columns, deeply nested structures, and embeddings, but also large binary blobs (images, video and/or audio) alongside traditional tabular data. That means the content can stay physically co-located with the context, instead of being scattered across a relational database, a vector store, and an object store full of opaque files.

Here’s how the structure for several terabytes of video data from the OpenVid-1M dataset would look with Lance: it’s just a collection of Lance files that include the video embeddings and binaries in one place.

├── lance-format/openvid-lance
├── data
│   └── train/
│       ├── 0001.lance
│       ├── 0002.lance
│       └── ...  # more Lance files

Lance, an open lakehouse format, makes this kind of storage practical with its file format and table layout. Rows are grouped into fragments, where each fragment can have multiple data files that hold one or more columns. This two-dimensional layout matters in multimodal settings because it lets the dataset evolve without constantly rewriting the heavy parts. If you add a new embedding column, run OCR on PDFs, generate captions, or attach new labels, Lance can write those new columns as additional data files per fragment – without rewriting the existing image/video bytes you already stored.
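The two-dimensional layout can be pictured with a toy in-memory model. This is a conceptual sketch under simplifying assumptions, not Lance's implementation or API: the class names, the `add_column` helper, and the data file names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    name: str      # hypothetical file name
    columns: dict  # column name -> list of values for this fragment's rows


@dataclass
class Fragment:
    data_files: list = field(default_factory=list)

    def read(self, column):
        # Find whichever data file in this fragment holds the column.
        for data_file in self.data_files:
            if column in data_file.columns:
                return data_file.columns[column]
        raise KeyError(column)


def add_column(fragments, new_column, source_column, fn):
    """Write the new column as one extra data file per fragment."""
    for i, fragment in enumerate(fragments):
        values = [fn(v) for v in fragment.read(source_column)]
        new_file = DataFile(f"frag{i}-{new_column}.data", {new_column: values})
        fragment.data_files.append(new_file)  # existing files are untouched


fragments = [
    Fragment([DataFile("frag0-base.data", {
        "caption": ["a dog", "a red car"],
        "video": [b"\x00" * 8, b"\x01" * 8],
    })]),
    Fragment([DataFile("frag1-base.data", {
        "caption": ["rain"],
        "video": [b"\x02" * 8],
    })]),
]
heavy_files_before = [fragment.data_files[0] for fragment in fragments]

add_column(fragments, "caption_len", "caption", len)

print(fragments[0].read("caption_len"), fragments[1].read("caption_len"))
```

In this model the files that hold the video bytes are the same objects before and after the new column is added, which is the property that keeps enrichment jobs cheap on large media datasets.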

Just as importantly, Lance tables are versioned. Each commit produces a new table version (tracked via manifests), which gives you a stable snapshot to reproduce training runs, debug “what changed?”, and avoid the silent drift that happens when pointers, blobs, and embeddings evolve out of sync.
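The versioning idea can be illustrated in the same toy style. This sketch assumes a manifest is simply an immutable list of the data files that make up one version; the real Lance manifest carries far more than that, and the `VersionedTable` class and its methods are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Manifest:
    version: int
    data_files: Tuple[str, ...]  # every file that belongs to this version


class VersionedTable:
    def __init__(self):
        self._manifests = []

    def commit(self, new_files):
        """Record a new version; earlier manifests are never modified."""
        previous = self._manifests[-1].data_files if self._manifests else ()
        manifest = Manifest(len(self._manifests) + 1, previous + tuple(new_files))
        self._manifests.append(manifest)
        return manifest.version

    def checkout(self, version: Optional[int] = None):
        """Return the manifest for a version (the latest one by default)."""
        if version is None:
            return self._manifests[-1]
        return self._manifests[version - 1]


table = VersionedTable()
v1 = table.commit(["frag0-video.data"])
v2 = table.commit(["frag0-embedding.data"])

training_snapshot = table.checkout(v1)  # stable view for reproducing a run
latest = table.checkout()

print(training_snapshot.data_files, latest.data_files)
```

Because a manifest is never edited after it is written, a training run pinned to `v1` keeps seeing exactly the same files no matter how many enrichment commits land afterwards.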

The result is simplified data governance and fewer unnecessary IOPS, with performance and scalability baked in.

2. Supports online and batch workloads

The second premise directly addresses the “split personality” problem of access patterns. Lance tables are stored in a columnar file format, so just like other columnar formats, they can deliver high-throughput scans (with column pruning) for analytics and filtered queries. But unlike traditional columnar formats that essentially optimize for “scan-is-all-you-need” use cases, Lance is designed to be the fastest format for random access, without compromising scan performance. In the Lance format (and LanceDB above it), indexes aren’t an afterthought: they’re stored alongside the data (using pointers in the Lance manifest), and the system can use them to avoid read amplification when you only need a small slice of a massive dataset.

Practically, these design decisions are what make LanceDB a “one dataset, many workloads” solution for multimodal data in the enterprise. You can use the same table in a batch pipeline (scan, filter, compute new columns, write them back) and then immediately serve low-latency queries against the enriched data – without exporting to a separate online store. And because Lance “speaks Arrow”, the same dataset can be consumed by tools that understand Arrow datasets (e.g., DuckDB, Polars, DataFusion), making it highly versatile for a variety of stacks.

Below is an example of two different access patterns in an autonomous vehicle dataset. The first shows an online scenario in which you may want to search, explore and isolate (in real time) a small subset of scenes that align with a query in vector space. The second shows how that same data, at a different point in time, may be used to run a batch SQL query (which LanceDB supports via the Arrow FlightSQL protocol, or via DuckDB using the lance-duckdb extension) for a month-end report.
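As a library-agnostic stand-in for those two patterns, the sketch below runs both against one small in-memory table using only the Python standard library. It does not use the LanceDB API. SQLite plays the role of the table, the vector search is a brute-force cosine scan rather than an index lookup, and the schema, rows, and embeddings are invented for illustration.

```python
import json
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scenes (id INTEGER PRIMARY KEY, region TEXT, "
    "event TEXT, day TEXT, embedding TEXT)"
)
scenes = [
    (1, "north", "red_light", "2026-01-03", [1.0, 0.0]),
    (2, "north", "near_miss", "2026-01-10", [0.9, 0.1]),
    (3, "south", "red_light", "2026-01-15", [0.0, 1.0]),
    (4, "south", "red_light", "2026-02-01", [0.1, 0.9]),
]
conn.executemany(
    "INSERT INTO scenes VALUES (?, ?, ?, ?, ?)",
    [(i, region, event, day, json.dumps(vec)) for i, region, event, day, vec in scenes],
)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(query, k):
    """Online pattern: return the ids of the k scenes closest to the query."""
    scored = [
        (cosine(query, json.loads(embedding)), scene_id)
        for scene_id, embedding in conn.execute("SELECT id, embedding FROM scenes")
    ]
    scored.sort(reverse=True)
    return [scene_id for _, scene_id in scored[:k]]


# Batch pattern: a month-end aggregate over the same rows.
report = conn.execute(
    "SELECT region, COUNT(*) FROM scenes "
    "WHERE event = 'red_light' AND day BETWEEN '2026-01-01' AND '2026-01-31' "
    "GROUP BY region ORDER BY region"
).fetchall()

top_scenes = nearest([1.0, 0.05], k=2)
print(top_scenes, report)
```

The point of the sketch is that both questions are answered from one table. In a split architecture, the first query would go to a vector database and the second to a warehouse, with a pipeline in between to keep the two in sync.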

3. Far more than just search

The third design premise is all about scope: LanceDB is not just “a vector database plus a storage format”. It’s a complete solution built for the reality that search is only one part of the lifecycle, and that the boundary between “retrieval” and “data processing” gets blurry in multimodal systems.

Search is only one of the many workloads seen in the enterprise. It’s common to see teams building and iterating on datasets, by first looking up small, meaningful slices to understand the data, and then enriching the data via feature engineering, labeling or captioning, and transforming existing or derived columns (embeddings, captions, OCR outputs, quality scores) for downstream applications. This new data can also be used as training data for foundation or fine-tuned models that can further empower agents. All these aspects govern the mechanics of how messy multimodal bytes are turned into a dataset your organization can actually trust and reuse.

This is also where the Lance format’s role matters. Lance is a file format and a table format with a built-in notion of versioned datasets (manifests, fragments, and snapshots), acting like a lightweight catalog for how your data evolves over time. Instead of relying on a separate catalog plus a maze of pipelines to infer “what version of the data is this?”, or “where is the latest source data for these embeddings located?”, the table itself carries that history forward. This means you can reproduce a training run, roll back a bad data enrichment job, and confidently iterate on your data model as your source data and use cases evolve.

In practice, this is what makes up LanceDB: an OSS and enterprise product that helps teams run search, analytics, feature engineering, and training, all on the same underlying data, no matter its size or shape. Fast data evolution (adding and rewriting columns without rewriting the whole dataset) and first-class indexes (vector, full-text search, and scalar) keep the dataset usable as it changes, without duplicating data across systems or rebuilding brittle pipelines every time you switch workloads.

Rethinking what “multimodal” means for AI

The main takeaway from this post is that the term “multimodal” is not just a content label (text vs. images vs. audio) – it’s a property of your system: how you store and version raw bytes, how you serve competing access patterns, and how you keep a dataset usable as different teams (and different applications) reuse it over time.

This shift in perspective matters because as AI agents become more and more widespread, the ecosystem’s needs are rapidly moving from “generate an answer” to “act in an environment”. An agent fluent in natural language can still be ungrounded if it can’t reliably retrieve and interpret the underlying pixels, pages, clips, and sensor traces that make up reality.

“LLMs transformed how we access and work with abstract knowledge. Yet they remain wordsmiths in the dark – eloquent but inexperienced, knowledgeable but ungrounded. Of all the aspects of intelligence, visual intelligence is a cornerstone of intelligence for animals and humans. The pixel world is so rich and mathematically infinite.”

– Fei-Fei Li, CEO of World Labs and Professor of Computer Science at Stanford

So as multimodal AI continues to proliferate across industries, the right question to ask yourself isn’t “Do I have multimodal data?”, but rather: “How should I store, query, and persist my data effectively to handle the full spectrum of use cases and workloads in my organization?”

We believe that true multimodal infrastructure for AI shouldn’t require you to be a distributed systems expert or have an army of data engineers just to get your data to the teams that need it. We built LanceDB, the Multimodal Lakehouse, a unified system where images, videos, embeddings, metadata (and more) live together – queryable by SQL, searchable via embeddings, and available for training and inference at petabyte scale.

The future of AI infrastructure lies in architectures that embrace these aspects of multimodal complexity – without forcing you to fragment your data, duplicate your pipelines, or compromise on performance.
