🌍 stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

World model training requires small-batch random access across large sequence buffers—a pattern that performs poorly on HDF5 or video formats, especially over the network. stable-worldmodel is an open-source platform with a Lance-based data layer that streams directly from S3 several times faster than alternatives, making ephemeral GPU training practical without local sync.
The platform includes reference implementations of DINO-WM, LeWorldModel, PLDM, and TD-MPC2, plus ~150 environments with controllable visual and physical factors for standardized zero-shot generalization benchmarks. Details and benchmark numbers are in the paper.
🦆 Make Your SQL Workflows Multimodal with LanceDB x DuckDB

Multimodal pipelines typically span three systems: a store for image bytes, a vector database for embeddings, and a SQL warehouse for metadata. The Lance core extension for DuckDB collapses all three: image bytes live as a BLOB column alongside embeddings and structured fields in a single table.
lance_vector_search() is a SQL table function that returns ranked results with raw JPEG bytes inline. Standard SQL handles the rest: joins, filters, aggregations. The blog post walks through setup and query examples.
💰 OpenSearch vs LanceDB for Vector Search: Query Cost and Infrastructure

Vector search costs scale with RAM when the index lives in memory — bigger dataset means bigger instance. LanceDB stores the index in S3 and memory-maps it, so RAM scales with QPS rather than dataset size. At 100M docs (1152-dim, SQ8): ~$779/month total. At 10M: ~$148. At 1M: ~$65.
Benchmarked on 287K COCO images with SigLIP 2 embeddings using IVF_HNSW_SQ — above 0.95 recall@10, sub-50ms p95 on a single node. Full cost breakdown and OpenSearch comparison in the post.
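As a rough back-of-envelope check on why an in-memory index pushes up instance size, the snippet below estimates the raw vector payload at the three dataset sizes quoted above. It is only a sketch: it assumes SQ8 stores one byte per dimension and float32 stores four, it ignores graph, partition, and metadata overhead, and the function name is illustrative rather than part of any LanceDB API.

```python
def vector_payload_gb(num_docs: int, dim: int, bytes_per_dim: int) -> float:
    """Raw vector storage in decimal gigabytes (no index or metadata overhead)."""
    return num_docs * dim * bytes_per_dim / 1e9


DIM = 1152  # embedding width quoted in the cost figures

for num_docs in (1_000_000, 10_000_000, 100_000_000):
    sq8 = vector_payload_gb(num_docs, DIM, 1)  # assumption: SQ8 = 1 byte per dimension
    f32 = vector_payload_gb(num_docs, DIM, 4)  # assumption: float32 = 4 bytes per dimension
    print(f"{num_docs:>11,} docs: SQ8 ~{sq8:.1f} GB, float32 ~{f32:.1f} GB")
```

At 100M documents the quantized vectors alone come to roughly 115 GB under these assumptions, which is the amount of RAM an in-memory design would have to provision before serving a single query.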
📚 Also Published
🎤 Talks & Recordings
Managing Data at Exabyte Scale for AI Model Training | Chang She (LanceDB) | OpenXdata 2026
Chang She (LanceDB) discusses how current data infrastructure forces ML teams to copy data across disconnected tools throughout the training workflow, and presents LanceDB’s approach of managing multimodal training data in a unified table to streamline the path from exploration to GPU loading.
Q&A with Chang She, CEO, LanceDB: Why the 20-year-old data stack is breaking under AI workloads
Chang She (CEO, LanceDB) discusses why traditional database architectures that separate data storage from file storage via pointers are failing under AI workloads, particularly the throughput challenges posed by agentic data access at scale.
📅 Upcoming Events

Microsoft Build — June 2-3, 2026 · San Francisco, CA
This session compares multi-system AI data architectures on Azure with unified approaches that use Blob Storage as the foundation for raw data, embeddings, and features. It covers the practical tradeoffs that appear when curation, feature engineering, training, search, and analytics start spreading data across systems.

Snowflake Summit — June 1-4, 2026 · San Francisco, CA
Apache Polaris is expanding beyond Iceberg to support Delta and Lance formats as an open catalog layer, with new capabilities including S3-compatible on-prem storage, generic tables, a built-in policy store, and catalog federation. Engineers from Snowflake and partners will demo real-world usage and walk through the evolving Polaris REST catalog spec.
🏗️ LanceDB Enterprise Updates
Performance
- Page-Boundary Read Coalescing — The storage caching layer now widens reads to page boundaries instead of issuing one request per page, significantly reducing read fan-out and improving throughput for cached access patterns.
- KNN Pushdown for Filtered Queries — K-nearest-neighbor queries over remotely filtered result sets now push execution closer to the data, reducing data movement and improving latency for hybrid filtered vector searches.
- IVF_RQ Index in Distributed Indexer — The distributed indexer now supports IVF with residual quantization (IVF_RQ), offering a better compression/speed tradeoff for large-scale vector indexes built across multiple nodes.
- Vector Index Row Count Persistence — Row counts for vector indexes are now persisted at build time, eliminating expensive full-table scans that were previously required to determine index coverage.
- Distributed Indexer Training Optimization — Single-segment index training in the distributed indexer was optimized to reduce unnecessary work, improving build throughput for tables that fit within one segment.
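The general idea behind page-boundary read coalescing can be sketched in a few lines. This is not LanceDB Enterprise's implementation, only an illustration of the technique: the function name, the 4096-byte page size, and the sample offsets are all assumptions made for the example.

```python
def coalesce_to_pages(ranges, page_size):
    """Widen half-open (start, end) byte ranges to page boundaries and merge neighbors.

    Each range must satisfy end > start. Returns a sorted list of merged ranges.
    """
    aligned = sorted(
        # Round start down and end up to the nearest page boundary.
        (start // page_size * page_size, -(-end // page_size) * page_size)
        for start, end in ranges
    )
    merged = []
    for start, end in aligned:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous read: extend it instead of adding a request.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


# Hypothetical byte ranges requested by a scan.
requests = [(100, 300), (4000, 4200), (4200, 9000), (20000, 20010)]
print(coalesce_to_pages(requests, page_size=4096))
```

In this example four requested ranges collapse into two page-aligned reads, which is the kind of fan-out reduction the first item above describes.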
Features
🌟 Open Source Releases
🫶 Community Contributions
Announcing 5 New Lance Maintainers
- Jianjian Xie @jja725 (Uber)
- Chunxu Tang @ChunxuTang
- Yang Jie @LuciferYang (Baidu AI Cloud)
- Zhang Yue @zhangyue19921010 (ByteDance Volcano Engine)
- Dan Rammer @hamersaw (LanceDB)

Thank you to contributors from ByteDance, Adobe, Baidu, Tencent, Hugging Face, Google, Red Hat, Roblox, and Uber for improvements to storage, indexing, query execution, distributed processing, and integrations across LanceDB, Lance, and the broader ecosystem.
Notable contributions this month:
- @zhangyue19921010 — Exposed LSM API to Python and Java with distributed bitmap index build support, enabling scalable incremental indexing workflows
- @beinan — Added segmented btree indices and unenforced clustering key support to the format spec, improving query optimization for large-scale datasets
- @touch-of-grey — Built write-ahead log appender/tailer primitives and MemWAL HNSW integration, enabling real-time vector indexing for streaming workloads
- @yanghua — Added Rust support for update by _row_id, enabling efficient row-level mutations in analytical pipelines
- @alex766 — Added Dataset.takeRows() Java binding for physical row ID access, unlocking low-level row retrieval for JVM applications
- @Its-Tanay — Added Scannable primitive for Node.js streaming ingestion, enabling efficient large-dataset loading in JavaScript workflows
- @shenganzhang — Added IVF_HNSW_FLAT vector index support to Python SDK, expanding hybrid search options for production deployments
- @atakanyenel — Added Struct type mapping to Trino RowType, enabling nested data querying through the Trino connector
- @devteamaegis — Fixed inverted scores and missing-FTS penalty in LinearCombinationReranker, correcting hybrid search ranking accuracy
- @yuvalif — Added DataFusion Expr support for table row deletions in Rust SDK, enabling programmatic delete operations
A heartfelt thank you to our community contributors of Lance and LanceDB this past month:
@aayushbaluni • @adaworldapi • @alex766 • @ali2arslan • @ar-maan05 • @atakanyenel • @beinan • @chenghao-guo • @ddupg • @dentiny • @devteamaegis • @dhruvgarg111 • @fangbo • @farmerchillax • @geserdugarov • @gezi-lzq • @ghx5t-sol • @guillesd • @guinik • @haochengliu • @hfutatzhanghb • @hushengquan • @its-tanay • @ivscheianu • @jay-ju • @jerryjch • @jiaoew1991 • @jiengup • @jja725 • @johnchak • @kaan-simbe • @keepromise • @kushudai • @lennylxx • @leoreeyang • @lhoestq • @luciferyang • @majin1102 • @martji • @mesut-doner • @mikewhb • @myandpr • @n1teshy • @omkar-334 • @pengw0048 • @plotor • @pragnyanramtha • @qingfeng-occ • @ragnorc • @sezruby • @shenganzhang • @shiwk • @siddiqueahmad • @singhvishalkr • @sinianluoye • @snigenigmatic • @summaryzb • @touch-of-grey • @vip892766gma • @wangxiaobao1222 • @wending-y • @wojiaodoubao • @wombatu-kun • @xiaguanglei • @xodn348 • @xuqianjin-stars • @yanghua • @yuqi1129 • @yuvalif • @zelys-dfkh • @zhangyue19921010 • @zouhuajian
🤝 Lance Community Sync Recap
Community syncs this month covered the upcoming Lance 2.1.0 release, which will make the Lance 2.1 file format the default, along with indexing improvements and DataFusion enhancements. Early GPU acceleration work through a CuVS integration was also discussed. On the SDK side, Lance 6.0.1 shipped with bug fixes, and Lance 7.0.0 moved toward a release candidate with a community vote planned. The Lance DuckDB extension gained visibility with an official blog post published on the DuckDB blog.
The next Lance Community Sync will take place on Thursday, June 4 @ 9am PT.
• 📬 Subscribe to the Lance mailing list to receive the meeting invite →





