stable-worldmodel-a-high-performance-platform-for-reproducible-world-model-research
Ayush Chaurasia
Quentin Lhoest
Lucas Maes
Quentin Le Lidec
reproducible-data-curation-in-the-multimodal-lakehouse
Prashanth Rao
newsletter-may-2026
ChanChan Mao
newsletter-april-2026
ChanChan Mao
how-lancedb-accelerates-vector-search-at-10-billion-scale
Yang Cen
opensearch-vs-lancedb-for-vector-search-query-cost-and-infrastructure
Justin Miller
volcano-engine-autonomous-driving-data-lake-solution
Kejian Ju
unifying-the-av-ml-stack-lancedb
Ayush Chaurasia
lance-json-support-why-you-might-not-really-need-variant
Jack Ye
building-a-storage-format-for-the-next-era-of-biology
Pavan Ramkumar
newsletter-march-2026
ChanChan Mao
smart-parsing-meets-sharp-retrieval-combining-liteparse-and-lancedb
Clelia Astra Bertelli
Prashanth Rao
lance-format-v2-2-benchmarks-half-the-storage-none-of-the-slowdown
Xuanwo
make-your-sql-workflows-multimodal-with-lancedb-x-duckdb
Prashanth Rao
agentic-coding-as-community-stewardship
Xuanwo
what-we-mean-by-multimodal
Prashanth Rao
ai-native-development-local-continue-lancedb
Ty Dunn
lance-file-format-2-2-taming-complex-data
Xuanwo
lance-blob-v2
Xuanwo
Jack Ye
openclaw-lancedb-memory-layer
Xuanwo
Prashanth Rao
openclaw-lancedb-seed2
LanceDB
openclaw-memory-from-zero-to-lancedb-pro
Prashanth Rao
upload-lance-datasets-to-hf-hub
Prashanth Rao
zero-shot-image-classification-with-vector-search
Vipul Maheshwari
werides-data-platform-transformation-how-lancedb-fuels-model-development-velocity
Qian Zhu
Fei Chen
training-a-variational-autoencoder-from-scratch-with-the-lance-file-format
LanceDB
track-ai-trends-crewai-agents-rag
LanceDB
tokens-per-second-is-not-all-you-need
Mingran Wang
Tan Li
the-future-of-open-source-table-formats-iceberg-and-lance
Jack Ye
the-case-for-random-access-i-o
LanceDB
series-a-funding
Chang She
semanticdotart
Ayush Chaurasia
second-dinners-secret-weapon-lancedb-powered-rag-for-faster-smarter-game-development
Qian Zhu
search-within-an-image-331b54e4285e
Kaushal Choudhary
scalable-computer-vision-with-lancedb-voxel51-d8b65066d5f6
LanceDB
rethinking-table-file-paths-lance-multi-base-layout
Jack Ye
rag-isnt-one-size-fits-all
Leonard Marcq
python-package-to-convert-image-datasets-to-lance-type
Vipul Maheshwari
one-million-iops
Weston Pace
november-feature-roundup
Will Jones
newsletter-september-2025
Jasmine Wang
newsletter-october-2025
Jasmine Wang
newsletter-november-2025
ChanChan Mao
newsletter-june-2025
David Myriel
newsletter-july-2025
Jasmine Wang
newsletter-january-2026
ChanChan Mao
newsletter-february-2026
ChanChan Mao
newsletter-december-2025
ChanChan Mao
newsletter-august-2025
Jasmine Wang
my-summer-internship-experience-at-lancedb-2
Raunak Sinha
my-simd-is-faster-than-yours-fb2989bf25e7
LanceDB
multimodal-myntra-fashion-search-engine-using-lancedb
LanceDB
multimodal-lakehouse
David Myriel
multi-document-agentic-rag-a-walkthrough
Vipul Maheshwari
modified-rag-parent-document-bigger-chunk-retriever-62b3d1e79bc6
Mahesh Deshwal
memgpt-os-inspired-llms-that-manage-their-own-memory-793d6eed417e
Ayush Chaurasia
late-interaction-efficient-multi-modal-retrievers-need-more-than-just-a-vector-index
Ayush Chaurasia
lancedb-x-continue
LanceDB
lance-x-huggingface-a-new-era-of-sharing-multimodal-data
Prashanth Rao
Quentin Lhoest
Xuanwo
Ayush Chaurasia
lance-x-duckdb-sql-retrieval-on-the-multimodal-lakehouse-format
Xuanwo
lance-windows-windows-lance
Chang She
lance-v2
Weston Pace
lance-namespace-lancedb-and-ray
Jack Ye
lance-file-2-1-stable
Weston Pace
lance-file-2-1-smaller-and-simpler
Weston Pace
lance-data-viewer
Gordon Murray
lance-community-governance
Jack Ye
introducing-lance-namespace-spark-integration
Jack Ye
implementing-corrective-rag-in-the-easiest-way-2
LanceDB
hybrid-search-rag-for-real-life-production-grade-applications-e1e727b3965a
Mahesh Deshwal
hybrid-search-combining-bm25-and-semantic-search-for-better-results-with-lan-1358038fe7e6
LanceDB
hybrid-search-and-custom-reranking-with-lancedb-4c10a6a3447e
LanceDB
how-to-reduce-hallucinations-from-llm-powered-agents-using-long-term-memory-72f262c3cc1f
Tevin Wang
guide-to-use-contextual-retrieval-and-prompt-caching-with-lancedb
LanceDB
grpo-understanding-and-fine-tuning-the-next-gen-reasoning-model-2
Mahesh Deshwal
graphrag-hierarchical-approach-to-retrieval-augmented-generation
Akash Desai
gpu-accelerated-indexing-in-lancedb-27558fa7eee5
LanceDB
geo-support
Jack Ye
geneva-twelvelabs
David Myriel
geneva-feature-engineering
Jonathan Hsieh
from-bi-to-ai-lance-and-iceberg
Jack Ye
Prashanth Rao
fluss-integration
Wayne Wang
file-readers-in-depth-parallelism-without-row-groups
Weston Pace
feature-rabitq-quantization
David Myriel
Yang Cen
feature-full-text-search
David Myriel
enhance-rag-integrate-contextual-compression-and-filtering-for-precision-a29d4a810301
Kaushal Choudhary
effortlessly-loading-and-processing-images-with-lance-a-code-walkthrough
LanceDB
designing-a-table-format-for-ml-workloads
Weston Pace
custom-dataset-for-llm-training-using-lance
LanceDB
creating-a-fintech-agent
Vipul Maheshwari
convert-any-image-dataset-to-lance
LanceDB
columnar-file-readers-in-depth-structural-encoding
Weston Pace
columnar-file-readers-in-depth-repetition-definition-levels
Weston Pace
columnar-file-readers-in-depth-compression-transparency
Weston Pace
columnar-file-readers-in-depth-column-shredding
Weston Pace
columnar-file-readers-in-depth-backpressure
Weston Pace
columnar-file-readers-in-depth-apis-and-fusion
Weston Pace
chunking-techniques-with-langchain-and-llamaindex
Prashant Kumar
chunking-analysis-which-is-the-right-chunking-approach-for-your-language
Shresth Shukla
chat-with-csv-excel-using-lancedb
LanceDB
case-study-netflix
David Myriel
case-study-dosu
Qian Zhu
Michael Ludden
case-study-cognee
David Myriel
Vasilije Markovic
case-study-coderabbit
Qian Zhu
building-rag-on-codebases-part-2
Sankalp Shubham
building-rag-on-codebases-part-1
Sankalp Shubham
branching-and-shallow-clone
Jack Ye
better-rag-with-active-retrieval-augmented-generation-flare-3b66646e2a9f
LanceDB
benchmarking-random-access-in-lance
Chang She
benchmarking-lancedb-92b01032874a-2
LanceDB
benchmarking-cohere-reranker-with-lancedb
LanceDB
anythingllms-competitive-edge-lancedb-for-seamless-rag-and-agent-workflows
Ayush Chaurasia
announcing-lance-sdk
Weston Pace
agentic-rag-using-langgraph-building-a-simple-customer-support-autonomous-agent
LanceDB
advanced-rag-precise-zero-shot-dense-retrieval-with-hyde-0946c54dfdcb
LanceDB
accelerate-vector-search-applications-using-openvino-lancedb
LanceDB
a-primer-on-text-chunking-and-its-types-a420efc96a13
Prashant Kumar
a-practical-guide-to-training-custom-rerankers
Ayush Chaurasia
a-practical-guide-to-fine-tuning-embedding-models
Ayush Chaurasia
keep-your-data-fresh-with-cocoindex-and-lancedb
Prashanth Rao
Linghua Jin

Lance File 2.1: Smaller and Simpler

March 27, 2025
Engineering

Almost a year ago I announced we were going to be embarking on a journey to build a new 2.0 version of our file format. Several months later, we released a beta, and last fall it became our default file format. Overall, I’ve been super pleased with how well it worked. As we’ve been working with the community and stressing the format in production, we’ve identified a number of areas for improvement. We’ve been working on a 2.1 format for a few months to address these insights.

As our 2.1 format enters beta I wanted to take some time and look back on the 2.0 format and talk about what worked, what didn’t work, and what we’re going to be doing differently in 2.1.

2.0: We Got Rid of Row Groups

The most direct success of the 2.0 format, by far, was getting rid of row groups. We have not missed them…ever…at all. We have not lost any opportunity for parallelism or performance and we have removed the single biggest foot gun from Parquet. I described my reasoning for this already in an earlier post so I don’t need to go into details here.

2.0: We Filled the I/O Queue

The most important performance concept in modern I/O, whether it is NVMe or cloud storage, is queue depth. If you do not have enough concurrent I/O requests in flight, you will leave bandwidth on the floor. In fact, it’s even better to have too many I/O requests even though you face overhead and fairness issues. In 2.0 we built a scheduling algorithm to hit whatever queue depth we wanted, in priority order, without going over our parallelism budget. So we don’t even have to pay those overhead and fairness costs.

Disk Utilization Comparison
If your disk/NIC isn't screaming then we aren

I discussed most of this in depth in an earlier post. However, I was not prepared for just how effective this was. I had hoped, originally, to match Parquet’s scan speed while introducing random access. I figured this meant we would be slower than Parquet until we figured out compression. However, I was surprised to find that Lance often beats Parquet in full scans. This turned out to be true even when the file was 2-3 times bigger!

2.0: Our Flexible Container Format um…Flexed?

The overall structure of the file format has worked great. Our encoding scheme for 2.0 evolved significantly as it was developed. In 2.1 we pretty much reworked everything related to encodings. The protobufs have been many, and they have been varied, and they have enabled extremely rapid prototyping. Through all of this, I have not once encountered any reason to change the overall file structure. Pages, column descriptors, and a file footer are a simple, consistent, and extremely flexible structure.

Lance File Format Structure
Lance file format in four words

💡 Victory Tour Complete

Ok, the victory tour is over. I’m sure there’s other cool 2.0 things I could talk about but this is a retrospective and I suspect everyone is really just here to read the juicy takes about what went wrong.

2.1: We’re Squeezing out Compression via Structural Encoding

The 2.0 format has very little compression, primarily limited to dictionary compression and a few other tricks. This often surprises people coming from Parquet, but the truth is, compression isn’t as critical for us. Our primary goal is working well with multimodal data such as embeddings, tensors, audio, video images, etc. These types are typically compressed already (often with lossy compression) and they’re so large that any compression of metadata isn’t as significant as you might expect. As a result, compression was a lower priority and, as we started using 2.0 internally, and customers started adopting it in production, we realized we needed to tie things off and ship something stable.

That being said, we’ve found some areas where compression is significant. Compression metadata doesn’t have much impact on overall storage costs, but it can significantly speed up things like prefiltering. Also, semantic text is a very common part of our user’s datasets, and some of these text-heavy datasets (e.g. Common Crawl, Fineweb, Github Code) have massive text columns which need solid compression.

Integrating compression into Lance is trickier than it might seem. Compression is implicitly related to I/O scheduling. For example, if you delta encode your values, or you strip out NULLs during compression, then it can become difficult to extract a single value from the compressed output. Fortunately, while we were working on 2.0, we had two brilliant interns run various experiments into compression, and this really helped us nail down these integration points and refine our encoding traits. What we discovered is that we were mashing together two different concepts. In 2.1 we have split the generic step of “encoding” into structural encoding and compressive encoding.

I’m going to be writing an entire blog post on this topic but structural encoding tells us how an “array” is split into a series of “buffers”. Compressive encoding then focuses on how those buffers can be compressed. For example, in Arrow, we have a well defined standard for splitting an array into buffers. But…there are too many buffers! (this turns out to be bad for random access). In Parquet, we have a completely different standard for splitting an array into buffers that uses far fewer buffers, but can introduce read amplification. In Lance 2.1 we now have two different strategies that we switch between depending on the size of the value.

Structural Encoding Comparison
Dividing structure into buffers can be important for random access

In fact, there was one more lesson for us. Once we figured out structural encoding we suddenly discovered a way to create a taxonomy for compression algorithms. For example, transparent compression algorithms (like bitpacking) support random access while opaque compression algorithms (like delta encoding) do not. You may think we obviously just use transparent compression everywhere but—and here is where the magic comes in—the structural encoding that you choose controls which category of compressive encodings you are allowed. There are times in Lance 2.1 where opaque compression is perfectly valid! I promise I will write a lot more about this later.

2.1: We’re Knocking Out the 1-2 IOP Challenge

As I was working on 2.0 encodings I would often refer to something I called the “1-2 IOP challenge” (it’s a punchy name 😉). The dream was to come up with an encoding so that I could access any element in an array in a single IOP (for fixed width data types) or 2 IOPS (for variable width data types). This sounds simple at first. We all know how to grab a string/binary value. First we grab the offsets and then we grab the value.

Unfortunately, it seems that someone has told customers about fancy data types. Also, customers love NULL values. As an example, let’s look at what happens if you have a list of strings (e.g. tags, facets, etc.). This one array has many pieces of information.

What does this mean? Well, if you’ve been keeping up on your Arrow column format homework you know that we now have a struct validity buffer, a list validity buffer, a list offsets buffer, a string offsets buffer, a string validity buffer, and a string values buffer. In Lance 2.0 this, unfortunately, meant that we needed to perform 6 IOPS to fetch a single value. We have failed the 1-2 IOP challenge.

Bad Tag Access Pattern
In 2.0 we spent too many IOPS to read a single value

In Lance 2.1 it turns out the answer is once again related to the idea of structural encodings. Recall that the structural encoding tells us how an array is split into buffers (that’s important because it controls our IOPS). What’s more, the structural encodings all have a concept known as a repetition index. The union of these two concepts turns out to be exactly what we need for the 1-2 IOP challenge and I am proud to say that we have solved it. No matter what data type you have, no matter how complex it is, no matter how many layers of validity you’ve stashed away, we can access any value in 1 or 2 IOPS. I’ll definitely be writing more on this concept soon.

Good Tag Access Pattern
In Lance 2.1 we only need 2 IOPS (and sometimes just 1) for the same data!

2.1: We’re Pushing Statistics so Far Down they Fall Out

During the 2.0 development process we had a working prototype of pushdown filtering. I didn’t love it but it worked. It then broke. We fixed it. Then it broke again. As I write this that defunct prototype is still lurking in the code, but I’ve come to figure out why it just didn’t stick.

There’s a lot of reasons for this but I’ll boil them down to “introducing all this compute complexity in the file format, when you have a fully functional and complex query engine just sitting there, is just a lot of work for little gain”. What’s more, keeping statistics decoupled from encoding can be helpful for performance too.

It turns out the answer, for us, is that statistics based pushdown is “just another index” (technically a primary index) and it is much easier for us to store this information outside the file (just like we do with all our other indices). This concept may sound strange but it has existed for a very long time, it just had a different name, the zone map (just not the one you use for planting your garden).

External Indices Flow
Once you can read arbitrary ranges from files then many things become possible

By storing these indices outside the file we can pick which columns we want to index after we’ve already written the file. We can also retrain the index with a different block size without rewriting the file. What’s more, we just so happen to have infrastructure for managing external indices lying around.

I can already hear you asking, “what if I don’t have a query engine lying around?”. Well, in that case we will give you one. We take some kind of compute engine like Datafusion (or possibly just arrow-rs) and shove it into a plugin. This plugin calculates the index during write and stores it in the file as a free global buffer. Then, during read, the plugin will apply the filter.

2.1 We’re Experiencing Deja Vu with I/O Scheduling

One of the things we got right in 2.0 was I/O scheduling. It turns out this is one of the things we also got wrong. We could saturate the bandwidth on full scans but random access on NVMe is tricky. A good NVMe can perform close to a million reads a second (the $200 disk I have at home for playing games daily development can achieve ~800K reads/s). This means an average latency of almost 1 microsecond.

Synchronous I/O Issues
These gaps can be larger than the time it takes for the I/O to run!

At the risk of stating the obvious, this is not a lot of time. In 2.0 our poor little read request requires two thread transfers. First we put it in a queue to the I/O thread. Then we put the response back into a queue for the decode thread. As a result we were spending 2-3 microseconds of overhead for a request that only takes a single microsecond. Adding a fully synchronous blocking API brought our file reader from 40K values/second to 400K values/second in some low-level benchmarking! As part of the stabilization for 2.1 I’m hoping to look into this further, and make up new terms like “semi-synchronous I/O”.

2.1 Timeline

The 2.1 format is now officially in beta. There are a few known limitations but it should work for most types. Follow our work and track its completion here. If you want to run experiments you can enable the 2.1 format in a few ways:

Dataset Level

import lance
import pyarrow as pa

data = pa.table({"x": range(1000)})
ds = lance.write_dataset(data, "/tmp/test_ds", data_storage_version="2.1")
print(ds.data_storage_version)  # Should be 2.1

The data_storage_version is set when a new dataset is created and will control all data written by that dataset.

File Level

import lance
import pyarrow as pa

data = pa.table({"x": range(1000)})
ds = lance.write_dataset(data, "/tmp/test_ds", data_storage_version="2.1")
print(ds.data_storage_version)  # Should be 2.1

If you’d prefer to work directly with Lance files for experimentation or lower-level access we have python bindings for a file reader/writer.

⚠️ Beta Warning

By “beta” we mean “files you write with the beta may not be readable in the future”. Please do NOT use unstable / beta versions with production data.

Over the next few months we will be adding new tests, fixing some remaining todos, dogfooding, and tuning the performance. We encourage you all to try it out and do your own experiments. Feel free to start filing bugs about 2.1 and we will take a look. I’ll also be writing additional blog posts going over the new features at a much lower level.

Join the Conversation! 👩‍💻

We’re always happy to chat more on our Discord and on Github. Feel free to ask for more details or help us find better ways to do things! If something doesn’t work or could be faster then let us know too.

Weston Pace
Data engineer from the open source space, working on LanceDB, Arrow, Substrait.

Stable-Worldmodel: A High Performance Platform for Reproducible World Model Research

Ayush Chaurasia
Quentin Lhoest
Lucas Maes
Quentin Le Lidec
June 2, 2026
stable-worldmodel-a-high-performance-platform-for-reproducible-world-model-research

🌍 Lance-Backed World Model Platform, 🦆 Multimodal SQL with Lance DuckDB Extension, 💰 LanceDB vs OpenSearch Cost Breakdown

ChanChan Mao
May 28, 2026
newsletter-may-2026

Reproducible Data Curation In The Multimodal Lakehouse

Prashanth Rao
May 29, 2026
reproducible-data-curation-in-the-multimodal-lakehouse