stable-worldmodel-a-high-performance-platform-for-reproducible-world-model-research
Ayush Chaurasia
Quentin Lhoest
Lucas Maes
Quentin Le Lidec
reproducible-data-curation-in-the-multimodal-lakehouse
Prashanth Rao
newsletter-may-2026
ChanChan Mao
newsletter-april-2026
ChanChan Mao
how-lancedb-accelerates-vector-search-at-10-billion-scale
Yang Cen
opensearch-vs-lancedb-for-vector-search-query-cost-and-infrastructure
Justin Miller
volcano-engine-autonomous-driving-data-lake-solution
Kejian Ju
unifying-the-av-ml-stack-lancedb
Ayush Chaurasia
lance-json-support-why-you-might-not-really-need-variant
Jack Ye
building-a-storage-format-for-the-next-era-of-biology
Pavan Ramkumar
newsletter-march-2026
ChanChan Mao
smart-parsing-meets-sharp-retrieval-combining-liteparse-and-lancedb
Clelia Astra Bertelli
Prashanth Rao
lance-format-v2-2-benchmarks-half-the-storage-none-of-the-slowdown
Xuanwo
make-your-sql-workflows-multimodal-with-lancedb-x-duckdb
Prashanth Rao
agentic-coding-as-community-stewardship
Xuanwo
what-we-mean-by-multimodal
Prashanth Rao
ai-native-development-local-continue-lancedb
Ty Dunn
lance-file-format-2-2-taming-complex-data
Xuanwo
lance-blob-v2
Xuanwo
Jack Ye
openclaw-lancedb-memory-layer
Xuanwo
Prashanth Rao
openclaw-lancedb-seed2
LanceDB
openclaw-memory-from-zero-to-lancedb-pro
Prashanth Rao
upload-lance-datasets-to-hf-hub
Prashanth Rao
zero-shot-image-classification-with-vector-search
Vipul Maheshwari
werides-data-platform-transformation-how-lancedb-fuels-model-development-velocity
Qian Zhu
Fei Chen
training-a-variational-autoencoder-from-scratch-with-the-lance-file-format
LanceDB
track-ai-trends-crewai-agents-rag
LanceDB
tokens-per-second-is-not-all-you-need
Mingran Wang
Tan Li
the-future-of-open-source-table-formats-iceberg-and-lance
Jack Ye
the-case-for-random-access-i-o
LanceDB
series-a-funding
Chang She
semanticdotart
Ayush Chaurasia
second-dinners-secret-weapon-lancedb-powered-rag-for-faster-smarter-game-development
Qian Zhu
search-within-an-image-331b54e4285e
Kaushal Choudhary
scalable-computer-vision-with-lancedb-voxel51-d8b65066d5f6
LanceDB
rethinking-table-file-paths-lance-multi-base-layout
Jack Ye
rag-isnt-one-size-fits-all
Leonard Marcq
python-package-to-convert-image-datasets-to-lance-type
Vipul Maheshwari
one-million-iops
Weston Pace
november-feature-roundup
Will Jones
newsletter-september-2025
Jasmine Wang
newsletter-october-2025
Jasmine Wang
newsletter-november-2025
ChanChan Mao
newsletter-june-2025
David Myriel
newsletter-july-2025
Jasmine Wang
newsletter-january-2026
ChanChan Mao
newsletter-february-2026
ChanChan Mao
newsletter-december-2025
ChanChan Mao
newsletter-august-2025
Jasmine Wang
my-summer-internship-experience-at-lancedb-2
Raunak Sinha
my-simd-is-faster-than-yours-fb2989bf25e7
LanceDB
multimodal-myntra-fashion-search-engine-using-lancedb
LanceDB
multimodal-lakehouse
David Myriel
multi-document-agentic-rag-a-walkthrough
Vipul Maheshwari
modified-rag-parent-document-bigger-chunk-retriever-62b3d1e79bc6
Mahesh Deshwal
memgpt-os-inspired-llms-that-manage-their-own-memory-793d6eed417e
Ayush Chaurasia
late-interaction-efficient-multi-modal-retrievers-need-more-than-just-a-vector-index
Ayush Chaurasia
lancedb-x-continue
LanceDB
lance-x-huggingface-a-new-era-of-sharing-multimodal-data
Prashanth Rao
Quentin Lhoest
Xuanwo
Ayush Chaurasia
lance-x-duckdb-sql-retrieval-on-the-multimodal-lakehouse-format
Xuanwo
lance-windows-windows-lance
Chang She
lance-v2
Weston Pace
lance-namespace-lancedb-and-ray
Jack Ye
lance-file-2-1-stable
Weston Pace
lance-file-2-1-smaller-and-simpler
Weston Pace
lance-data-viewer
Gordon Murray
lance-community-governance
Jack Ye
introducing-lance-namespace-spark-integration
Jack Ye
implementing-corrective-rag-in-the-easiest-way-2
LanceDB
hybrid-search-rag-for-real-life-production-grade-applications-e1e727b3965a
Mahesh Deshwal
hybrid-search-combining-bm25-and-semantic-search-for-better-results-with-lan-1358038fe7e6
LanceDB
hybrid-search-and-custom-reranking-with-lancedb-4c10a6a3447e
LanceDB
how-to-reduce-hallucinations-from-llm-powered-agents-using-long-term-memory-72f262c3cc1f
Tevin Wang
guide-to-use-contextual-retrieval-and-prompt-caching-with-lancedb
LanceDB
grpo-understanding-and-fine-tuning-the-next-gen-reasoning-model-2
Mahesh Deshwal
graphrag-hierarchical-approach-to-retrieval-augmented-generation
Akash Desai
gpu-accelerated-indexing-in-lancedb-27558fa7eee5
LanceDB
geo-support
Jack Ye
geneva-twelvelabs
David Myriel
geneva-feature-engineering
Jonathan Hsieh
from-bi-to-ai-lance-and-iceberg
Jack Ye
Prashanth Rao
fluss-integration
Wayne Wang
file-readers-in-depth-parallelism-without-row-groups
Weston Pace
feature-rabitq-quantization
David Myriel
Yang Cen
feature-full-text-search
David Myriel
enhance-rag-integrate-contextual-compression-and-filtering-for-precision-a29d4a810301
Kaushal Choudhary
effortlessly-loading-and-processing-images-with-lance-a-code-walkthrough
LanceDB
designing-a-table-format-for-ml-workloads
Weston Pace
custom-dataset-for-llm-training-using-lance
LanceDB
creating-a-fintech-agent
Vipul Maheshwari
convert-any-image-dataset-to-lance
LanceDB
columnar-file-readers-in-depth-structural-encoding
Weston Pace
columnar-file-readers-in-depth-repetition-definition-levels
Weston Pace
columnar-file-readers-in-depth-compression-transparency
Weston Pace
columnar-file-readers-in-depth-column-shredding
Weston Pace
columnar-file-readers-in-depth-backpressure
Weston Pace
columnar-file-readers-in-depth-apis-and-fusion
Weston Pace
chunking-techniques-with-langchain-and-llamaindex
Prashant Kumar
chunking-analysis-which-is-the-right-chunking-approach-for-your-language
Shresth Shukla
chat-with-csv-excel-using-lancedb
LanceDB
case-study-netflix
David Myriel
case-study-dosu
Qian Zhu
Michael Ludden
case-study-cognee
David Myriel
Vasilije Markovic
case-study-coderabbit
Qian Zhu
building-rag-on-codebases-part-2
Sankalp Shubham
building-rag-on-codebases-part-1
Sankalp Shubham
branching-and-shallow-clone
Jack Ye
better-rag-with-active-retrieval-augmented-generation-flare-3b66646e2a9f
LanceDB
benchmarking-random-access-in-lance
Chang She
benchmarking-lancedb-92b01032874a-2
LanceDB
benchmarking-cohere-reranker-with-lancedb
LanceDB
anythingllms-competitive-edge-lancedb-for-seamless-rag-and-agent-workflows
Ayush Chaurasia
announcing-lance-sdk
Weston Pace
agentic-rag-using-langgraph-building-a-simple-customer-support-autonomous-agent
LanceDB
advanced-rag-precise-zero-shot-dense-retrieval-with-hyde-0946c54dfdcb
LanceDB
accelerate-vector-search-applications-using-openvino-lancedb
LanceDB
a-primer-on-text-chunking-and-its-types-a420efc96a13
Prashant Kumar
a-practical-guide-to-training-custom-rerankers
Ayush Chaurasia
a-practical-guide-to-fine-tuning-embedding-models
Ayush Chaurasia
keep-your-data-fresh-with-cocoindex-and-lancedb
Prashanth Rao
Linghua Jin

November Feature Roundup

December 3, 2024
Engineering

Lance

On November 13th, we released Lance 0.19.2. This release includes several new features and improvements, including:

  • Flexible handling of data for inserts
  • An experimental storage type for video columns and other large blobs
  • Full-text search includes not-yet-indexed data

A full list of changes can be found in the release notes.

Inserting data, without the footguns

Up until now, Lance has been pretty strict about schemas. When you appended to a table:

  • You had to provide all the fields in the schema.
  • You had to provide them in the same order as the dataset schema.
  • You had to match the exact nullability for each field in the schema. If a field was non-nullable in the dataset, providing data with a nullable schema would fail, even if the actual values were never null.

This all changes starting in 0.19.2. Appending data is much more flexible. Now when you append new data:

  • You can omit fields from the schema, even nested fields.
  • You can provide fields in any order.
  • You can provide different but compatible nullability. For example, if a field is non-nullable in the dataset, you can insert data with a nullable field as long as the actual values aren’t null.

Users could already create tables with missing fields in some rows, as most tools know how to insert nulls:

    import lance
    import pyarrow as pa
    
    data = [
        {"vec": [1.0, 2.0, 3.0], "metadata": {"x": 1, "y": 2}},
        {"metadata": {"x": 3}},
        {"vec": [2.0, 3.0, 5.0], "metadata": {"y": 4}},
    ]
    table = pa.Table.from_pylist(data)
    ds = lance.write_dataset(table, "./demo")
    ds.to_table().to_pandas()

                   vec               metadata
    0  [1.0, 2.0, 3.0]   {'x': 1.0, 'y': 2.0}
    1             None  {'x': 3.0, 'y': None}
    2  [2.0, 3.0, 5.0]  {'x': None, 'y': 4.0}

But what's new is you can create a table that is completely missing some fields and then insert data into the table. Notice how we can omit not only the `vec` field, but also the nested `metadata.x` field:

    new_data = [
        {"metadata": {"y": 6}},
    ]
    new_table = pa.Table.from_pylist(new_data)
    ds = lance.write_dataset(new_table, "./demo", mode="append")
    ds.to_table().to_pandas()

                   vec               metadata
    0  [1.0, 2.0, 3.0]   {'x': 1.0, 'y': 2.0}
    1             None  {'x': 3.0, 'y': None}
    2  [2.0, 3.0, 5.0]  {'x': None, 'y': 4.0}
    3             None    {'x': None, 'y': 6}

User feedback has taught us that this development is important for schema evolution.

In many cases, an existing ETL (Extract, Transform, Load) process is designed to work with a specific schema. If you add a new field to the schema, you must carefully update the ETL process to handle the new field. If not, the ETL process might fail because it doesn’t recognize the new field.

With the new ability to insert subschemas, you can now add new fields to the schema without breaking the ETL process. This means you can evolve your schema more easily and with less risk of errors.

This feature was developed by LanceDB engineers Weston Pace (@westonpace) and Will Jones (@wjones127). (https://github.com/lance-format/lance/pull/3041, https://github.com/lance-format/lance/pull/2467)

On top of this feature, we are planning on supporting inserting subschemas using merge insert (This is tracked in #2904). Earlier this year, we already added support for updating subschemas using merge insert.

Fast analytical queries for video datasets

0.19.2 introduced an experimental feature to solve a tricky problem: balancing file sizes when column values are very different in size. This is common with unstructured data like videos.

For example:

  • An integer column might have values that are only 4 bytes each. With compression, they can be even smaller.
  • A video column can have values that are megabytes each and not compressible.

This difference makes it hard to decide how many rows to put in each data file.

For the best performance, we want millions of rows per file for integer columns. A million rows would be just 4 MB without compression. But for video columns, a million rows would be 10 PB per file, which is too large for most storage systems. This means that previously, you had to choose a suboptimal file size for analytical columns if you had any very large columns.

| Column type   | Size per value | Rows for 10GB | Size of 1 million rows |
| ------------- | -------------- | ------------- | ---------------------- |
| int32         | 4B             | 2.5 billion   | 4 MB                   |
| Image (100KB) | 100 KB         | 100,000       | 10 GB                  |
| Video (10MB)  | 10 MB          | 1,000         | 10 PB                  | 

The new balanced storage feature aims to solve this problem. Large columns can be assigned to a new “storage class” called “blob”. The blob storage class is kept in separate files from main storage. The main storage will be limited by the default maximum of 1 million rows per file, while the blob storage will be limited by a maximum file size of 90 GB. This allows excellent scan performance for small columns, even if your dataset contains large unstructured data.

To opt into this feature for a particular column, you must add a metadata field to the schema. For example, here’s how to construct the PyArrow schema:

import pyarrow as pa
    
schema = pa.schema([
	pa.field("int_col", pa.int32()),
    pa.field("video_col", pa.large_binary(), metadata={"lance-schema:storage-class": "blob"}),
])

This feature is experimental and made available for prototyping. Additional work and testing will be done in the next few months before we will recommend it for production use.

This feature was developed by LanceDB engineer Weston Pace (@westonpace).

Eliminating the lag to see new data in full text search

Up until now, searching with full text search only included data that was in the table when the index was created or updated. If you added new data to the table, those new rows would be missing from the search results. You could update the index and the results would appear, but this meant a delay between inserting data and being able to search it. It’s also different than all other indices in Lance: vector search and scalar indices could both automatically search new data. With 0.19.2, full text search now includes not-yet-indexed data.

data = pa.table({"animal": ["wild horse", "domestic rabbit"]})
ds = lance.write_dataset(data, "./demo_fts", mode="overwrite")
ds.create_scalar_index("animal", index_type="INVERTED")
ds.to_table(full_text_query="domestic rabbit and horse").to_pandas()

            animal    _score
0  domestic rabbit  1.386294
1       wild horse  0.693147

new_data = pa.table({"animal": ["wild rabbit"]})
ds = lance.write_dataset(new_data, "./demo_fts", mode="append")
ds.to_table(full_text_query="domestic rabbit and horse").to_pandas()

            animal    _score
0  domestic rabbit  1.386294
1      wild rabbit  0.871385
2       wild horse  0.693147

Eventually, you will still want to update the index to include the new data. This is partially for performance reasons. But importantly it also will improve search results. The search on new data only works for terms that have already been indexed. So if you have a new term that wasn’t in any documents previously, searching that term will not return any results until you update the index.

This feature was developed by LanceDB engineer Yang Cen (@BubbleCal). (#3036).

LanceDB

On November 15th, we released LanceDB 0.13.0 (Rust/Node) and LanceDB Python 0.16.0. In addition to the features from Lance 0.19.2, these releases include several improvements to feature parity between the Python, Node, and Rust SDKs.

A full list of changes can be found in the release notes:

Features from Lance

LanceDB 0.13.0 includes all the features from Lance 0.19.2. This includes then ability to insert subschemas and the full-text search improvements.

Improvements in feature parity

We’ve made several more improvements to feature parity between the Python, Node, and Rust SDKs this release, which include:

New Rust-based remote client

v0.19.2 also included a major milestone in feature parity: the LanceDB remote client (used to connect to LanceDB Cloud and Enterprise) has been consolidated into a single Rust-based client. Previously Python, Node, and Rust had independent implementations, supporting different features. Now they all use the same underlying implementation, ensuring feature parity now and in the future.

Stable-Worldmodel: A High Performance Platform for Reproducible World Model Research

Ayush Chaurasia
Quentin Lhoest
Lucas Maes
Quentin Le Lidec
June 2, 2026
stable-worldmodel-a-high-performance-platform-for-reproducible-world-model-research

🌍 Lance-Backed World Model Platform, 🦆 Multimodal SQL with Lance DuckDB Extension, 💰 LanceDB vs OpenSearch Cost Breakdown

ChanChan Mao
May 28, 2026
newsletter-may-2026

Reproducible Data Curation In The Multimodal Lakehouse

Prashanth Rao
May 29, 2026
reproducible-data-curation-in-the-multimodal-lakehouse