🌍 Lance-Backed World Model Platform, 🦆 Multimodal SQL with Lance DuckDB Extension, 💰 LanceDB vs OpenSearch Cost Breakdown

June 2, 2026
Newsletter

🌍 stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

World model training requires small-batch random access across large sequence buffers—a pattern that performs poorly on HDF5 or video formats, especially over the network. stable-worldmodel is an open-source platform with a Lance-based data layer that streams directly from S3 several times faster than alternatives, making ephemeral GPU training practical without local sync.

The platform includes reference implementations of DINO-WM, LeWorldModel, PLDM, and TD-MPC2, plus ~150 environments with controllable visual and physical factors for standardized zero-shot generalization benchmarks. Details and benchmark numbers are in the paper.

Paper → · Code →

🦆 Make Your SQL Workflows Multimodal with LanceDB x DuckDB

Multimodal workflows often span three systems: an object store for image bytes, a vector database for embeddings, and a SQL warehouse for metadata. The Lance core extension for DuckDB collapses all three: image bytes live as a BLOB column alongside embeddings and structured fields in a single table.

lance_vector_search() is a SQL table function that returns ranked results with raw JPEG bytes inline. Standard SQL handles the rest: joins, filters, aggregations. The blog post walks through setup and query examples.

Read more →

💰 OpenSearch vs LanceDB for Vector Search: Query Cost and Infrastructure

Vector search costs scale with RAM when the index lives in memory — bigger dataset means bigger instance. LanceDB stores the index in S3 and memory-maps it, so RAM scales with QPS rather than dataset size. At 100M docs (1152-dim, SQ8): ~$779/month total. At 10M: ~$148. At 1M: ~$65.
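To see why an in-memory index ties instance size to dataset size, here is a back-of-the-envelope calculation of raw vector storage at the dataset sizes quoted above. It is a sketch under stated assumptions: SQ8 is taken to mean one byte per dimension, and index structures, metadata, and replication are ignored, so the figures are a lower bound and are not numbers from the post.

```python
# Rough lower bound on vector storage; ignores index overhead and metadata.
DIM = 1152             # embedding dimensionality quoted above
BYTES_PER_DIM_SQ8 = 1  # assumption: 8-bit scalar quantization, one byte per dimension
BYTES_PER_DIM_F32 = 4  # unquantized float32, shown for comparison


def vector_bytes(num_vectors: int, dim: int, bytes_per_dim: int) -> int:
    """Return the raw bytes needed to hold num_vectors vectors."""
    return num_vectors * dim * bytes_per_dim


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


for n in (1_000_000, 10_000_000, 100_000_000):
    sq8 = to_gb(vector_bytes(n, DIM, BYTES_PER_DIM_SQ8))
    f32 = to_gb(vector_bytes(n, DIM, BYTES_PER_DIM_F32))
    print(f"{n:>11,} vectors: SQ8 {sq8:7.1f} GB, float32 {f32:7.1f} GB")
```

At 100M vectors the quantized payload alone comes to roughly 115 GB, which is the amount an in-memory design has to provision in RAM before serving a single query.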

Benchmarked on 287K COCO images with SigLIP 2 embeddings using IVF_HNSW_SQ — above 0.95 recall@10, sub-50ms p95 on a single node. Full cost breakdown and OpenSearch comparison in the post.
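For readers unfamiliar with the metric, recall@10 measures how many of the true ten nearest neighbors an approximate index actually returns. The following is a minimal sketch of that definition; the function names and the toy ID lists are illustrative and do not come from the benchmark.

```python
def recall_at_k(retrieved: list[int], ground_truth: list[int], k: int = 10) -> float:
    """Fraction of the true top-k neighbors that appear in the retrieved top-k."""
    true_top_k = set(ground_truth[:k])
    if not true_top_k:
        raise ValueError("ground_truth must contain at least one id")
    hits = len(true_top_k.intersection(retrieved[:k]))
    return hits / len(true_top_k)


def mean_recall_at_k(all_retrieved, all_ground_truth, k: int = 10) -> float:
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(r, g, k) for r, g in zip(all_retrieved, all_ground_truth)]
    return sum(scores) / len(scores)


# Toy data: exact neighbors from brute-force search vs. results from an approximate index.
exact = list(range(1, 11))
approx = [1, 2, 3, 4, 5, 6, 7, 8, 9, 42]  # one true neighbor missed
print(recall_at_k(approx, exact))  # 0.9
```

A reported recall@10 above 0.95 therefore means that, averaged over queries, fewer than one in twenty true neighbors is missed.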

Read more →

📚 Also Published

🎤 Talks & Recordings

Managing Data at Exabyte Scale for AI Model Training | Chang She (LanceDB) | OpenXdata 2026

Chang She (LanceDB) discusses how current data infrastructure forces ML teams to copy data across disconnected tools throughout the training workflow, and presents LanceDB’s approach of managing multimodal training data in a unified table to streamline the path from exploration to GPU loading.

Watch the recording →

Q&A with Chang She, CEO, LanceDB: Why the 20-year-old data stack is breaking under AI workloads

Chang She (CEO, LanceDB) discusses why traditional database architectures that separate data storage from file storage via pointers are failing under AI workloads, particularly the throughput challenges posed by agentic data access at scale.

Watch the recording →

📅 Upcoming Events

Microsoft Build — June 2-3, 2026 · San Francisco, CA

This session compares multi-system AI data architectures on Azure with unified approaches that use Blob Storage as the foundation for raw data, embeddings, and features. It covers the practical tradeoffs when curation, feature engineering, training, search, and analytics start spreading data across systems.

Register →

Snowflake Summit — June 1-4, 2026 · San Francisco, CA

Apache Polaris is expanding beyond Iceberg to support Delta and Lance formats as an open catalog layer, with new capabilities including S3-compatible on-prem storage, generic tables, a built-in policy store, and catalog federation. Engineers from Snowflake and partners will demo real-world usage and walk through the evolving Polaris REST catalog spec.

Register →

Data+AI Summit — June 15-18, 2026 · San Francisco, CA

Managing Data at Exabyte Scale for AI Model Training

Enterprise AI model training succeeds or fails based on iteration speed across the full data flywheel — from exploration and curation to feature engineering and GPU loading. LanceDB manages all multimodal training data in a unified table, eliminating the tool-hopping that delays or kills training projects. 

From Streaming to Search: How Exa Uses Lance and Apache Spark for high-throughput AI Workloads

Exa uses Lance and Spark Structured Streaming to power search and AI workloads across large volumes of crawled web data. The pipeline performs local and global deduplication, generates embeddings, and sustains ~10k rows per second into Lance tables that power downstream vector search databases. 

Register →

🏗️ LanceDB Enterprise Updates

Performance

  • Page-Boundary Read Coalescing — The storage caching layer now widens reads to page boundaries instead of issuing one request per page, significantly reducing read fan-out and improving throughput for cached access patterns.
  • KNN Pushdown for Filtered Queries — K-nearest-neighbor queries over remotely filtered result sets now push execution closer to the data, reducing data movement and improving latency for hybrid filtered vector searches.
  • IVF_RQ Index in Distributed Indexer — The distributed indexer now supports IVF with residual quantization (IVF_RQ), offering a better compression/speed tradeoff for large-scale vector indexes built across multiple nodes.
  • Vector Index Row Count Persistence — Row counts for vector indexes are now persisted at build time, eliminating expensive full-table scans that were previously required to determine index coverage.
  • Distributed Indexer Training Optimization — Single-segment index training in the distributed indexer was optimized to reduce unnecessary work, improving build throughput for tables that fit within one segment.
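The page-boundary read coalescing item can be pictured with a small sketch: byte ranges are rounded out to page boundaries, and ranges that then overlap or touch are merged into a single request. This is an illustration of the general technique only, not the LanceDB Enterprise implementation; the 4096-byte page size and the function name are hypothetical.

```python
PAGE_SIZE = 4096  # hypothetical page size; the real cache's value is not stated


def coalesce_to_pages(
    ranges: list[tuple[int, int]], page_size: int = PAGE_SIZE
) -> list[tuple[int, int]]:
    """Widen half-open byte ranges to page boundaries and merge touching ones."""
    aligned = []
    for start, end in ranges:
        if end <= start:
            continue  # skip empty ranges
        lo = (start // page_size) * page_size  # round start down to a page boundary
        hi = -(-end // page_size) * page_size  # round end up to a page boundary
        aligned.append((lo, hi))

    aligned.sort()
    merged: list[tuple[int, int]] = []
    for lo, hi in aligned:
        if merged and lo <= merged[-1][1]:
            # Overlaps or touches the previous request, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


requests = [(100, 200), (4000, 5000), (8192, 8200), (20000, 20010)]
print(coalesce_to_pages(requests))  # [(0, 12288), (16384, 20480)]
```

Four small reads collapse into two page-aligned requests here, which is the kind of fan-out reduction the item describes.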

Features

Feature | Description
Distributed Full-Text Search | Full-text and BTree index queries can now execute across distributed query nodes using a two-phase RPC coordinator, bringing distributed search on par with distributed vector search.
Segmented Distributed Indexing | The distributed indexer now supports fragment-scoped and segmented builds for vector, scalar, FTS, and BTree indexes, enabling finer-grained parallelism and faster incremental index updates at scale.
Two-Tier Index Cache & Prewarm | A two-tier index cache (in-memory + NVMe disk) is now available, with new API and CLI commands to prewarm the cache ahead of traffic, significantly reducing cold-start query latency.
Write-Ahead Log | A new write-ahead log (WAL) writer built on the Lance OSS memory WAL is now integrated, alongside a unified developer CLI with fuzz testing support, laying the foundation for stronger durability guarantees.
Feature Engineering Views & Schema Evolution | Materialized views and user-defined table-valued function (UDTF) views are now supported in the query engine, along with add_columns and alter_columns schema evolution APIs.
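The two-tier index cache with prewarm can be illustrated with a toy model: a small in-memory LRU sits in front of a local-disk directory, and a prewarm step loads keys before traffic arrives. Everything in this sketch is hypothetical (the class, its methods, and the `fetch` stand-in for a remote read); it shows the idea, not the LanceDB Enterprise API or CLI. Keys are assumed to be safe to use as file names.

```python
import tempfile
from collections import OrderedDict
from pathlib import Path
from typing import Callable, Iterable


class TwoTierCache:
    """Toy two-tier cache: an in-memory LRU in front of a local-disk directory."""

    def __init__(self, disk_dir: str, memory_capacity: int = 2):
        self.memory: "OrderedDict[str, bytes]" = OrderedDict()
        self.memory_capacity = memory_capacity
        self.disk_dir = Path(disk_dir)
        self.disk_dir.mkdir(parents=True, exist_ok=True)

    def _disk_path(self, key: str) -> Path:
        return self.disk_dir / f"{key}.bin"

    def _put_memory(self, key: str, value: bytes) -> None:
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_capacity:
            self.memory.popitem(last=False)  # evict the least recently used entry

    def get(self, key: str, load: Callable[[str], bytes]) -> "tuple[bytes, str]":
        """Return (value, tier), where tier is "memory", "disk", or "remote"."""
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key], "memory"
        path = self._disk_path(key)
        if path.exists():
            value = path.read_bytes()
            self._put_memory(key, value)  # promote to the faster tier
            return value, "disk"
        value = load(key)  # slow path, e.g. an object-store read
        path.write_bytes(value)
        self._put_memory(key, value)
        return value, "remote"

    def prewarm(self, keys: Iterable[str], load: Callable[[str], bytes]) -> None:
        """Populate both tiers ahead of traffic."""
        for key in keys:
            self.get(key, load)


def fetch(key: str) -> bytes:
    """Stand-in for a slow remote read of one index partition."""
    return f"index-partition-{key}".encode()


with tempfile.TemporaryDirectory() as tmp:
    cache = TwoTierCache(tmp, memory_capacity=2)
    cache.prewarm(["a", "b", "c"], fetch)  # "a" falls out of memory but stays on disk
    print(cache.get("c", fetch)[1])  # memory
    print(cache.get("a", fetch)[1])  # disk
    print(cache.get("z", fetch)[1])  # remote
```

After prewarming, only keys that were never loaded take the slow remote path, which is where the cold-start latency reduction comes from.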

🌟 Open Source Releases

Lance v6.0.0 – v7.0.0
Release notes
• New MemWAL (memory write-ahead log) system with ShardWriter, memtable-based HNSW index, manual compaction APIs, and Java bindings, enabling low-latency ingestion pipelines with durable replay (#6669, #6675, #6795, #6833)
• Segmented and distributed index builds: segmented inverted index (#6305), segmented btree (#6605), zonemap index segments (#6593), distributed bitmap index (#6598), and FTS segment merging (#6790)
• SIMD distance kernels for scalar-quantized vectors: u8 dot/L2/cosine (#6506, #6517), bf16 (#6510), f64 (#6540), and 16× faster RaBitQ 4-bit LUT on ARM (#6537)

LanceDB v0.32.0 – v0.33.0
Release notes
• New IVF_HNSW_FLAT vector index type available in Python (#3366); native FTS now supports model-backed tokenizers (#3289)
• Nested namespace operations for organizing tables hierarchically, with manifest-enabled directory mode (#3265, #3332); Node.js SDK gains connectNamespace and namespace management methods (#3371, #3383)
• PyTorch DataLoader compatibility: Permutation is now fork-safe and picklable for multiprocessing workers (#3339, #3335)
• Vector indexes on nested fields now work correctly: paths are discovered automatically and canonicalized for remote tables (#3408, #3423, #3430)

lance-ray v0.3.0 – v0.4.2
Release notes
• New dataset maintenance operations: optimize_indices (#2152) and compact_database (#2250) for managing Lance datasets at scale via Ray
• Index creation now supports namespaces via create_index() (#2594) and configurable vector index sample rates (#4631)
• Blob v2 support with legacy compatibility and multi-base config for read/write operations (#1614, #4516)

lance-trino v0.3.0
Release notes
• Added support for Struct type mapping to Trino RowType, including Substrait schema support for nested field filtering (#96, #125)
• Upgraded to lance-core 6.0.0 and lance-namespace 0.7.6 (#124)

lance-namespace v0.7.3 – v0.7.7
Release notes
• Added backfill columns and refresh materialized view operations (#335)
• Extended virtual column entries with UDF metadata support (#337)
• API changes for materialized views and UDTFs (#344)

🫶 Community Contributions

Announcing 5 New Lance Maintainers

Thank you to contributors from ByteDance, Adobe, Baidu, Tencent, Hugging Face, Google, Red Hat, Roblox, and Uber for improvements across storage, indexing, query execution, distributed processing, and ecosystem integrations in LanceDB, Lance, and the broader ecosystem.

Notable contributions this month:

  • @zhangyue19921010 — Exposed LSM API to Python and Java with distributed bitmap index build support, enabling scalable incremental indexing workflows
  • @beinan — Added segmented btree indices and unenforced clustering key support to the format spec, improving query optimization for large-scale datasets
  • @touch-of-grey — Built write-ahead log appender/tailer primitives and MemWAL HNSW integration, enabling real-time vector indexing for streaming workloads
  • @yanghua — Added Rust support for update by _row_id, enabling efficient row-level mutations in analytical pipelines
  • @alex766 — Added Dataset.takeRows() Java binding for physical row ID access, unlocking low-level row retrieval for JVM applications
  • @Its-Tanay — Added Scannable primitive for Node.js streaming ingestion, enabling efficient large-dataset loading in JavaScript workflows
  • @shenganzhang — Added IVF_HNSW_FLAT vector index support to Python SDK, expanding hybrid search options for production deployments
  • @atakanyenel — Added Struct type mapping to Trino RowType, enabling nested data querying through the Trino connector
  • @devteamaegis — Fixed inverted scores and missing-FTS penalty in LinearCombinationReranker, correcting hybrid search ranking accuracy
  • @yuvalif — Added DataFusion Expr support for table row deletions in Rust SDK, enabling programmatic delete operations

A heartfelt thank you to our community contributors of Lance and LanceDB this past month:

@aayushbaluni @adaworldapi @alex766 @ali2arslan @ar-maan05 @atakanyenel @beinan @chenghao-guo @ddupg @dentiny @devteamaegis @dhruvgarg111 @fangbo @farmerchillax @geserdugarov @gezi-lzq @ghx5t-sol @guillesd @guinik @haochengliu @hfutatzhanghb @hushengquan @its-tanay @ivscheianu @jay-ju @jerryjch @jiaoew1991 @jiengup @jja725 @johnchak @kaan-simbe @keepromise @kushudai @lennylxx @leoreeyang @lhoestq @luciferyang @majin1102 @martji @mesut-doner @mikewhb @myandpr @n1teshy @omkar-334 @pengw0048 @plotor @pragnyanramtha @qingfeng-occ @ragnorc @sezruby @shenganzhang @shiwk @siddiqueahmad @singhvishalkr @sinianluoye @snigenigmatic @summaryzb @touch-of-grey @vip892766gma @wangxiaobao1222 @wending-y @wojiaodoubao @wombatu-kun @xiaguanglei @xodn348 @xuqianjin-stars @yanghua @yuqi1129 @yuvalif @zelys-dfkh @zhangyue19921010 @zouhuajian

🤝 Lance Community Sync Recap

Community syncs this month covered the upcoming Lance 2.1.0 release, which will make the Lance 2.1 file format the default, along with indexing improvements and DataFusion enhancements. Early GPU acceleration work through a CuVS integration was also discussed. On the SDK side, Lance 6.0.1 shipped with bug fixes, and Lance 7.0.0 moved toward a release candidate with a community vote planned. The Lance DuckDB extension gained visibility with an official blog post published on the DuckDB blog.

The next Lance Community Sync will take place on Thursday, June 4 @ 9am PT.

📬 Subscribe to the Lance mailing list to receive the meeting invite →

📄 Add discussion topics to the meeting notes →

📺 Watch previous recordings →

ChanChan Mao
Developer Relations @ LanceDB
