🌍 Lance-Backed World Model Platform, 🦆 Multimodal SQL with Lance DuckDB Extension, 💰 LanceDB vs OpenSearch Cost Breakdown

ChanChan Mao • June 2, 2026 • Newsletter

🌍 stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

World model training requires small-batch random access across large sequence buffers—a pattern that performs poorly on HDF5 or video formats, especially over the network. stable-worldmodel is an open-source platform with a Lance-based data layer that streams directly from S3 several times faster than alternatives, making ephemeral GPU training practical without local sync.
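
To make the access pattern concrete, here is a minimal sketch of how a training loop might pick small batches of random windows from a large sequence buffer. It is not taken from stable-worldmodel: the function name, the back-to-back row layout, and the uniform choice of episode are all assumptions made for illustration.

```python
import random
from itertools import accumulate


def sample_window_ranges(episode_lengths, window, batch_size, seed=0):
    """Pick `batch_size` random windows of `window` consecutive rows.

    Assumption: the rows of all episodes are stored back to back, so each
    window is a (first_row, last_row_exclusive) pair of global row indices
    that never crosses an episode boundary.
    """
    rng = random.Random(seed)
    # Global row offset at which each episode begins.
    offsets = [0] + list(accumulate(episode_lengths))[:-1]
    # Only episodes long enough to hold a full window are eligible.
    eligible = [i for i, n in enumerate(episode_lengths) if n >= window]
    if not eligible:
        raise ValueError("no episode is long enough for the requested window")
    batch = []
    for _ in range(batch_size):
        ep = rng.choice(eligible)  # uniform over episodes, not over rows
        start = rng.randrange(episode_lengths[ep] - window + 1)
        first = offsets[ep] + start
        batch.append((first, first + window))
    return batch


# Three episodes of 10, 3, and 20 steps; the 3-step episode is skipped.
for first, last in sample_window_ranges([10, 3, 20], window=5, batch_size=4):
    print(f"read rows [{first}, {last})")
```

Each batch is a handful of short reads scattered across the whole buffer, so per-request latency matters more than sequential scan throughput.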

The platform includes reference implementations of DINO-WM, LeWorldModel, PLDM, and TD-MPC2, plus ~150 environments with controllable visual and physical factors for standardized zero-shot generalization benchmarks. Details and benchmark numbers are in the paper.

Paper → · Code →

🦆 Make Your SQL Workflows Multimodal with LanceDB x DuckDB

Multimodal pipelines typically span three systems: an object store for images, a vector database for embeddings, and a SQL warehouse for metadata. The Lance core extension for DuckDB collapses all three: image bytes live as a BLOB column alongside embeddings and structured fields in a single table.

lance_vector_search() is a SQL table function that returns ranked results with raw JPEG bytes inline. Standard SQL handles the rest: joins, filters, aggregations. The blog post walks through setup and query examples.

💰 LanceDB vs OpenSearch Cost Breakdown

Vector search costs scale with RAM when the index lives in memory: a bigger dataset means a bigger instance. LanceDB stores the index in S3 and memory-maps it, so RAM scales with QPS rather than dataset size. At 100M docs (1152-dim, SQ8): ~$779/month total. At 10M: ~$148. At 1M: ~$65.
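
A back-of-the-envelope calculation shows why an in-memory index ties RAM to dataset size. The sketch below counts raw vector bytes only, assuming SQ8 stores one byte per dimension and float32 stores four. It ignores index structures and metadata, and it does not reproduce the cost model from the post.

```python
def vector_bytes(num_docs, dim, bytes_per_dim):
    """Raw storage for the vectors alone (no index graph, no metadata)."""
    return num_docs * dim * bytes_per_dim


DIM = 1152  # embedding width quoted above

for num_docs in (1_000_000, 10_000_000, 100_000_000):
    sq8 = vector_bytes(num_docs, DIM, 1)  # 8-bit scalar quantization
    f32 = vector_bytes(num_docs, DIM, 4)  # unquantized float32
    print(f"{num_docs:>11,} docs: SQ8 {sq8 / 1e9:7.2f} GB, float32 {f32 / 1e9:7.2f} GB")
```

At 100M docs the SQ8 codes alone come to about 115 GB, which is the amount an in-memory design would need to keep resident.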

Benchmarked on 287K COCO images with SigLIP 2 embeddings using IVF_HNSW_SQ — above 0.95 recall@10, sub-50ms p95 on a single node. Full cost breakdown and OpenSearch comparison in the post.
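
For readers unfamiliar with the metric, recall@10 is the fraction of the true ten nearest neighbors that the approximate index returns in its top ten. A minimal sketch follows; the function names are illustrative and are not part of the benchmark code.

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Fraction of the true top-k neighbors found in the retrieved top-k."""
    truth = set(ground_truth[:k])
    if not truth:
        raise ValueError("ground_truth must contain at least one id")
    return len(set(retrieved[:k]) & truth) / len(truth)


def mean_recall_at_k(all_retrieved, all_ground_truth, k=10):
    """Average recall@k over a set of queries."""
    scores = [recall_at_k(r, g, k) for r, g in zip(all_retrieved, all_ground_truth)]
    return sum(scores) / len(scores)


# Two of the four true neighbors (ids 1 and 2) were retrieved.
print(recall_at_k([1, 2, 3, 4], [1, 2, 5, 6], k=4))
```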

Test-Driving the Lance Lakehouse Format in DuckDB

🎤 Talks & Recordings

Managing Data at Exabyte Scale for AI Model Training | Chang She (LanceDB) | OpenXdata 2026

Chang She (LanceDB) discusses how current data infrastructure forces ML teams to copy data across disconnected tools throughout the training workflow, and presents LanceDB’s approach of managing multimodal training data in a unified table to streamline the path from exploration to GPU loading.

Watch the recording →

Q&A with Chang She, CEO, LanceDB: Why the 20-year-old data stack is breaking under AI workloads

Chang She (CEO, LanceDB) discusses why traditional database architectures that separate data storage from file storage via pointers are failing under AI workloads, particularly the throughput challenges posed by agentic data access at scale.

Watch the recording →

📅 Upcoming Events

Microsoft Build — June 2-3, 2026 · San Francisco, CA

Compares multi-system AI data architectures on Azure with unified approaches that use Blob Storage as the foundation for raw data, embeddings, and features. Covers the practical tradeoffs when curation, feature engineering, training, search, and analytics start spreading data across systems.

Snowflake Summit — June 1-4, 2026 · San Francisco, CA

Apache Polaris is expanding beyond Iceberg to support Delta and Lance formats as an open catalog layer, with new capabilities including S3-compatible on-prem storage, generic tables, a built-in policy store, and catalog federation. Engineers from Snowflake and partners will demo real-world usage and walk through the evolving Polaris REST catalog spec.

Data+AI Summit — June 15-18, 2026 · San Francisco, CA

Managing Data at Exabyte Scale for AI Model Training

Enterprise AI model training succeeds or fails based on iteration speed across the full data flywheel — from exploration and curation to feature engineering and GPU loading. LanceDB manages all multimodal training data in a unified table, eliminating the tool-hopping that delays or kills training projects.

From Streaming to Search: How Exa Uses Lance and Apache Spark for high-throughput AI Workloads

Exa uses Lance and Spark Structured Streaming to power search and AI workloads across large volumes of crawled web data. The pipeline performs local and global deduplication, generates embeddings, and sustains ~10k rows per second into Lance tables that power downstream vector search databases.

🏗️ LanceDB Enterprise Updates

Performance

Page-Boundary Read Coalescing — The storage caching layer now widens reads to page boundaries instead of issuing one request per page, significantly reducing read fan-out and improving throughput for cached access patterns.
KNN Pushdown for Filtered Queries — K-nearest-neighbor queries over remotely filtered result sets now push execution closer to the data, reducing data movement and improving latency for hybrid filtered vector searches.
IVF_RQ Index in Distributed Indexer — The distributed indexer now supports IVF with residual quantization (IVF_RQ), offering a better compression/speed tradeoff for large-scale vector indexes built across multiple nodes.
Vector Index Row Count Persistence — Row counts for vector indexes are now persisted at build time, eliminating expensive full-table scans that were previously required to determine index coverage.
Distributed Indexer Training Optimization — Single-segment index training in the distributed indexer was optimized to reduce unnecessary work, improving build throughput for tables that fit within one segment.
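
The general idea behind page-boundary read coalescing can be sketched in a few lines: widen each requested byte range to the enclosing page boundaries, then merge ranges that overlap or touch so they become one request. This illustrates the technique only and is not the Enterprise caching layer's implementation. The function name and the 4096-byte page size are hypothetical.

```python
def coalesce_to_pages(ranges, page_size):
    """Widen (start, end) byte ranges to page boundaries and merge neighbors.

    `end` is exclusive. Returns a sorted list of non-overlapping,
    page-aligned (start, end) tuples.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    aligned = sorted(
        (
            (start // page_size) * page_size,   # round start down
            -(-end // page_size) * page_size,   # round end up
        )
        for start, end in ranges
        if end > start
    )
    merged = []
    for start, end in aligned:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


# Three small reads that land on adjacent pages collapse into one request.
print(coalesce_to_pages([(100, 200), (5000, 9000), (4000, 4100)], page_size=4096))
```

Reads that land on distant pages stay separate, so only genuinely adjacent requests are merged.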

Features

Feature	Description
Distributed Full-Text Search	Full-text and BTree index queries can now execute across distributed query nodes using a two-phase RPC coordinator, bringing distributed search on par with distributed vector search.
Segmented Distributed Indexing	The distributed indexer now supports fragment-scoped and segmented builds for vector, scalar, FTS, and BTree indexes, enabling finer-grained parallelism and faster incremental index updates at scale.
Two-Tier Index Cache & Prewarm	A two-tier index cache (in-memory + NVMe disk) is now available, with new API and CLI commands to prewarm the cache ahead of traffic, significantly reducing cold-start query latency.
Write-Ahead Log	A new write-ahead log (WAL) writer built on the Lance OSS memory WAL is now integrated, alongside a unified developer CLI with fuzz testing support, laying the foundation for stronger durability guarantees.
Feature Engineering Views & Schema Evolution	Materialized views and user-defined table-valued function (UDTF) views are now supported in the query engine, along with `add_columns` and `alter_columns` schema evolution APIs.

🌟 Open Source Releases

Project	Description
Lance v6.0.0 – v7.0.0 Release notes	• New MemWAL (memory write-ahead log) system with `ShardWriter`, memtable-based HNSW index, manual compaction APIs, and Java bindings—enables low-latency ingestion pipelines with durable replay (#6669, #6675, #6795, #6833) • Segmented and distributed index builds: segmented inverted index (#6305), segmented btree (#6605), zonemap index segments (#6593), distributed bitmap index (#6598), and FTS segment merging (#6790) • SIMD distance kernels for scalar-quantized vectors: u8 dot/L2/cosine (#6506, #6517), bf16 (#6510), f64 (#6540), and 16× faster RaBitQ 4-bit LUT on ARM (#6537)
LanceDB v0.32.0 – v0.33.0 Release notes	• New `IVF_HNSW_FLAT` vector index type available in Python (#3366); native FTS now supports model-backed tokenizers (#3289) • Nested namespace operations for organizing tables hierarchically, with manifest-enabled directory mode (#3265, #3332); Node.js SDK gains `connectNamespace` and namespace management methods (#3371, #3383) • PyTorch DataLoader compatibility: `Permutation` is now fork-safe and picklable for multiprocessing workers (#3339, #3335) • Vector indexes on nested fields now work correctly—paths are discovered automatically and canonicalized for remote tables (#3408, #3423, #3430)
lance-ray v0.3.0 – v0.4.2 Release notes	• New dataset maintenance operations: `optimize_indices` (#2152) and `compact_database` (#2250) for managing Lance datasets at scale via Ray • Index creation now supports namespaces via `create_index()` (#2594) and configurable vector index sample rates (#4631) • Blob v2 support with legacy compatibility and multi-base config for read/write operations (#1614, #4516)
lance-trino v0.3.0 Release notes	• Added support for Struct type mapping to Trino `RowType`, including Substrait schema support for nested field filtering (#96, #125) • Upgraded to lance-core 6.0.0 and lance-namespace 0.7.6 (#124)
lance-namespace v0.7.3 – v0.7.7 Release notes	• Added backfill columns and refresh materialized view operations (#335) • Extended virtual column entries with UDF metadata support (#337) • API changes for materialized views and UDTFs (#344)

🫶 Community Contributions

Announcing 5 New Lance Maintainers

Jianjian Xie @jja725 (Uber)
Chunxu Tang @ChunxuTang
Yang Jie @LuciferYang (Baidu AI Cloud)
Zhang Yue @zhangyue19921010 (ByteDance Volcano Engine)
Dan Rammer @hamersaw (LanceDB)

Thank you to contributors from ByteDance, Adobe, Baidu, Tencent, Hugging Face, Google, Red Hat, Roblox, and Uber for improvements across storage, indexing, query execution, distributed processing, and ecosystem integrations in LanceDB, Lance, and the broader ecosystem.

Notable contributions this month:

@zhangyue19921010 — Exposed LSM API to Python and Java with distributed bitmap index build support, enabling scalable incremental indexing workflows
@beinan — Added segmented btree indices and unenforced clustering key support to the format spec, improving query optimization for large-scale datasets
@touch-of-grey — Built write-ahead log appender/tailer primitives and MemWAL HNSW integration, enabling real-time vector indexing for streaming workloads
@yanghua — Added Rust support for update by _row_id, enabling efficient row-level mutations in analytical pipelines
@alex766 — Added Dataset.takeRows() Java binding for physical row ID access, unlocking low-level row retrieval for JVM applications
@Its-Tanay — Added Scannable primitive for Node.js streaming ingestion, enabling efficient large-dataset loading in JavaScript workflows
@shenganzhang — Added IVF_HNSW_FLAT vector index support to Python SDK, expanding hybrid search options for production deployments
@atakanyenel — Added Struct type mapping to Trino RowType, enabling nested data querying through the Trino connector
@devteamaegis — Fixed inverted scores and missing-FTS penalty in LinearCombinationReranker, correcting hybrid search ranking accuracy
@yuvalif — Added DataFusion Expr support for table row deletions in Rust SDK, enabling programmatic delete operations

A heartfelt thank you to our community contributors of Lance and LanceDB this past month:

@aayushbaluni • @adaworldapi • @alex766 • @ali2arslan • @ar-maan05 • @atakanyenel • @beinan • @chenghao-guo • @ddupg • @dentiny • @devteamaegis • @dhruvgarg111 • @fangbo • @farmerchillax • @geserdugarov • @gezi-lzq • @ghx5t-sol • @guillesd • @guinik • @haochengliu • @hfutatzhanghb • @hushengquan • @its-tanay • @ivscheianu • @jay-ju • @jerryjch • @jiaoew1991 • @jiengup • @jja725 • @johnchak • @kaan-simbe • @keepromise • @kushudai • @lennylxx • @leoreeyang • @lhoestq • @luciferyang • @majin1102 • @martji • @mesut-doner • @mikewhb • @myandpr • @n1teshy • @omkar-334 • @pengw0048 • @plotor • @pragnyanramtha • @qingfeng-occ • @ragnorc • @sezruby • @shenganzhang • @shiwk • @siddiqueahmad • @singhvishalkr • @sinianluoye • @snigenigmatic • @summaryzb • @touch-of-grey • @vip892766gma • @wangxiaobao1222 • @wending-y • @wojiaodoubao • @wombatu-kun • @xiaguanglei • @xodn348 • @xuqianjin-stars • @yanghua • @yuqi1129 • @yuvalif • @zelys-dfkh • @zhangyue19921010 • @zouhuajian

🤝 Lance Community Sync Recap

Community syncs this month covered the upcoming Lance 2.1.0 release, which will make the Lance 2.1 file format the default, along with indexing improvements and DataFusion enhancements. Early GPU acceleration work through a CuVS integration was also discussed. On the SDK side, Lance 6.0.1 shipped with bug fixes, and Lance 7.0.0 moved toward a release candidate with a community vote planned. The Lance DuckDB extension gained visibility with an official blog post published on the DuckDB blog.

The next Lance Community Sync will take place on Thursday, June 4 @ 9am PT.

• 📬 Subscribe to the Lance mailing list to receive the meeting invite →

• 📄 Add discussion topics to the meeting notes →

• 📺 Watch previous recordings →