🦆 Lance x DuckDB SQL Retrieval, 🚗 Uber-Scale Storage, ⚡ 1.5M IOPS

ChanChan Mao • February 9, 2026 • Newsletter

🦆 Lance x DuckDB: SQL for Retrieval on the Multimodal Lakehouse Format

The Lance extension for DuckDB turns DuckDB into a SQL compute engine over Lance datasets, exposing vector, full-text, and hybrid retrieval as SQL table functions. This enables fully composable retrieval workflows — joins with eval data, reproducible top-k slicing, SQL-based debugging, and materialization back into Lance.

This extension bridges traditional SQL analytics with multimodal retrieval on a single open dataset format.

Rethinking Table File Paths with Uber: Lance's Multi-Base Layout

Working with Uber's AI Infrastructure team, Lance introduced a multi-base layout to support product systems that need a single dataset to span multiple S3 buckets for parallel reads and writes.

By separating storage bases from file references, Lance enables multi-bucket and multi-region layouts with compact, relocatable metadata — allowing Uber to scale training and retrieval workloads without fragmenting datasets or rewriting metadata.
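The idea of separating storage bases from file references can be illustrated with a small sketch. This is not Lance's actual manifest format or API: the `FileRef` and `Manifest` classes, the bucket names, and the file paths are all hypothetical, chosen only to show why relocating a base does not require rewriting per-file metadata.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FileRef:
    """Hypothetical file reference: a small base id plus a relative path."""
    base_id: int
    relative_path: str


class Manifest:
    """Hypothetical manifest holding a base table and per-file references."""

    def __init__(self, bases: dict[int, str], files: list[FileRef]):
        self.bases = dict(bases)  # base id -> storage URI (bucket or region)
        self.files = list(files)  # references never embed the full URI

    def resolve(self, ref: FileRef) -> str:
        """Join the base URI and the relative path into a full object URI."""
        return f"{self.bases[ref.base_id].rstrip('/')}/{ref.relative_path}"

    def relocate_base(self, base_id: int, new_uri: str) -> None:
        """Point a base at a new location; file references stay untouched."""
        self.bases[base_id] = new_uri


# Illustrative dataset spanning two (made-up) buckets.
manifest = Manifest(
    bases={0: "s3://bucket-a/dataset", 1: "s3://bucket-b/dataset"},
    files=[FileRef(0, "data/frag-0.lance"), FileRef(1, "data/frag-1.lance")],
)
```

In this sketch, moving every file under base `1` to another bucket is a single update to the base table, and each `FileRef` remains valid as written.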

The Quest for One Million IOPS: Benchmarking Storage at Lance

Recent storage benchmarks in Lance reached up to 1.5 million IOPS by combining a scheduler rework with io_uring, showing that high random-access throughput depends more on reducing CPU overhead and context switching than on single-read latency.

The post explains how the reworked scheduler and io_uring drive modern NVMe hardware more effectively for vector, text, and key-based lookups. It also contrasts embedded and disaggregated architectures to show how LanceDB scales from single-process deployments to large distributed systems.
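A rough back-of-the-envelope calculation shows why CPU overhead matters more than single-read latency at this scale. The sketch below uses Little's law (reads in flight = throughput × latency). Only the 1.5 million IOPS figure comes from the benchmark summary above; the 100 µs latency and the 16-core count are illustrative assumptions, and the function names are hypothetical.

```python
def required_in_flight(target_iops: float, latency_us: float) -> float:
    """Little's law: concurrent reads needed to sustain a target IOPS."""
    return target_iops * latency_us / 1_000_000


def cpu_budget_per_io_us(cores: int, target_iops: float) -> float:
    """CPU time available per read, in microseconds, if all cores do only I/O."""
    return cores * 1_000_000 / target_iops


TARGET_IOPS = 1_500_000  # figure reported in the benchmark summary

# Assumed values, for illustration only.
in_flight_slow = required_in_flight(TARGET_IOPS, latency_us=100)
in_flight_fast = required_in_flight(TARGET_IOPS, latency_us=50)
budget = cpu_budget_per_io_us(cores=16, target_iops=TARGET_IOPS)

print(f"in flight at 100 us: {in_flight_slow:.0f}")
print(f"in flight at 50 us:  {in_flight_fast:.0f}")
print(f"CPU budget per read on 16 cores: {budget:.2f} us")
```

Under these assumptions, halving per-read latency only changes how many reads must be kept in flight. The CPU budget per read does not depend on latency at all, so any scheduling or context-switch cost that exceeds it caps throughput regardless of how fast the device responds.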

📅 Upcoming Events

February Open Data + AI Meetup - Peninsula, Bay Area Edition — Thursday, February 12

Hear from speakers from LanceDB, Fivetran, Dremio, and typedef about what they're building and how they're defining the future of open data and AI.

NYC Lakehouse Meetup — Tuesday, February 17

We're bringing together the Apache Iceberg, Lance, and Apache DataFusion communities at Cloudflare's NYC office to chat about all things open lakehouse and data infrastructure!

🏗️ LanceDB Enterprise Updates

| Feature | Description |
| --- | --- |
| Page cache prewarm API | Users can prewarm LanceDB tables through a LanceDB administrative API, either for a whole table or for selected columns only. This is useful for ensuring that data is in the page cache before running a specific workload, and for benchmarking. |
| Admission control for Feature Engineering jobs | Avoids deadlocks by rejecting a job when the cluster does not have enough resources to execute it. |
| Adaptive batch sizing for Feature Engineering job checkpoints | Backfill jobs now adjust checkpoint size based on UDF execution time. Internal benchmarks show up to 2x performance improvement. |
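The two Feature Engineering changes above can be pictured with a minimal sketch. This is not LanceDB Enterprise's implementation: the function names, the resource dimensions (CPUs and memory), the target checkpoint interval, and the size limits are all assumptions made for illustration.

```python
def admit(requested_cpus: int, requested_memory_gb: float,
          free_cpus: int, free_memory_gb: float) -> bool:
    """Admission control: accept a job only if it fits in the free capacity.

    Rejecting up front avoids a job holding part of its resources while
    waiting forever for the rest.
    """
    return requested_cpus <= free_cpus and requested_memory_gb <= free_memory_gb


def next_checkpoint_size(current_size: int, observed_seconds: float,
                         target_seconds: float,
                         min_size: int = 1, max_size: int = 100_000) -> int:
    """Adaptive sizing: scale the batch so a checkpoint takes ~target_seconds.

    observed_seconds is how long the UDF took on the last batch of
    current_size rows.
    """
    if observed_seconds <= 0:
        return current_size  # no usable measurement; keep the current size
    rows_per_second = current_size / observed_seconds
    proposed = int(rows_per_second * target_seconds)
    return max(min_size, min(max_size, proposed))


# A slow UDF shrinks the next checkpoint; a fast one grows it.
slow = next_checkpoint_size(1000, observed_seconds=20.0, target_seconds=10.0)
fast = next_checkpoint_size(1000, observed_seconds=5.0, target_seconds=10.0)
```

The clamp to `min_size` and `max_size` is a design choice in this sketch to keep a single noisy measurement from producing a degenerate batch.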

🌟 Open Source Releases

| Project | Description |
| --- | --- |
| Lance v1.0.1 - v1.0.4 | Multi-base storage layouts enabling a single dataset to span multiple buckets or regions for parallel reads and writes (#5790, #5801). Faster query execution via tighter WAND block score bounds and reduced per-query overhead (#5668, #5696). |
| LanceDB v0.26 - v0.28 | DuckDB-powered SQL retrieval with vector, FTS, and hybrid search exposed as composable table functions (#2946, #2957). Expanded embedding support (VoyageAI v4, multimodal) and improved ingestion robustness via parallel embedding computation and better remote query cancellation (#2959, #2887, #2896, #2913). |
| lance-graph v0.4.0 - v0.5.0 | Significantly expanded Cypher expressiveness with `WITH` clause chaining, `COLLECT`, and `COUNT(DISTINCT …)` support (#86, #85, #116). Integrated vector search and similarity UDFs into graph queries, with improved execution efficiency on object stores (#80, #81, #83, #89, #96). |
| lance-context v0.2.0 - v0.2.1 | Core context store APIs for append, search, and versioned checkout across Python and Rust (#6, #11, #12, #24). Improved runtime behavior with multimodal context support, background compaction, and reduced Python-side blocking during remote I/O (#9, #28, #29). |
| lance-duckdb v0.4.1 - v0.5.0 | Improved DuckDB integration with global aggregate pushdown and expanded vector search ergonomics, including ARRAY-based query vectors and tuning controls (#124, #119, #120). |
| lance-namespace v0.4.4 - v0.4.5 | New Lance partitioning specification for defining and operating on partitioned datasets (#279, #297). |
| lance-ray v0.1.0 - v0.2.0 | Distributed Ray-based IVF_SQ / PQ / FLAT index builder for scalable, parallel index creation (#67). |
| lance-spark v0.2.0 | Spark `MERGE INTO` support for upserts and deletes, plus vector search and distributed index creation for large-scale Spark pipelines (#172, #189, #171). |

🫶 Community Contributions

Thank you to contributors from Uber, Netflix, Hugging Face, Bytedance, Huawei, Tencent, and Alibaba for improvements across embeddings, query robustness, storage compatibility, distributed indexing, Spark integration, and core format reliability in LanceDB, Lance, lance-spark, and lance-ray.

Notable contributions this month:

- [@fzowl](https://github.com/fzowl) — Added support for VoyageAI v4 and multimodal models, expanding first-class embedding options in LanceDB.
- [@dcfocus](https://github.com/dcfocus) — Delivered major Cypher features in lance-graph, including `COLLECT` aggregation, `WITH` clause query chaining, and foundational context APIs.
- [@ChunxuTang](https://github.com/ChunxuTang) — Expanded Cypher query capabilities with `COUNT(DISTINCT …)`, case-insensitive matching, and vector search operators.
- [@beinan](https://github.com/beinan) — Improved execution efficiency and deployability across lance-graph and lance-context, enabling more scalable production deployments.
- [@jja725](https://github.com/jja725) — Implemented background compaction for Lance fragments, improving long-running system performance.
- [@ex172000](https://github.com/ex172000) — Improved performance and correctness through executor fixes and parallelized embedding computation.
- [@fatelei](https://github.com/fatelei) — Prevented Python-side blocking by releasing the GIL during remote storage operations.
- [@wojiaodoubao](https://github.com/wojiaodoubao) — Introduced the Lance partitioning specification, enabling native support for partitioned datasets.
- [@chenghao-guo](https://github.com/chenghao-guo) — Implemented a Ray-based distributed IVF index builder, enabling scalable index construction.
- [@nyl3532016](https://github.com/nyl3532016) — Added vector search support to `lance-spark`, enabling similarity search in Spark pipelines.
- [@jiaoew1991](https://github.com/jiaoew1991) — Built a fragment-aware join optimizer to improve Spark query performance on Lance datasets.
- [@jtuglu1](https://github.com/jtuglu1) — Implemented distributed full-text search index creation in `lance-spark`.
- [@bryanck](https://github.com/bryanck) — Improved stability of `lance-spark` by fixing Kryo serialization and classloader issues.
- [@zhangyue19921010](https://github.com/zhangyue19921010) — Implemented Spark `MERGE INTO` support for upsert and delete operations on Lance tables.

We want to especially highlight the initial release of lance-context contributed by Uber.

A heartfelt thank you to our community contributors of Lance and LanceDB this past month:

@fzowl • @dcfocus • @ChunxuTang • @beinan • @jja725 • @ex172000 • @hushengquan • @fatelei • @ddupg • @Mesut-Doner • @amanharshx • @Angryrou • @youssef-tharwat • @leiyuou • @prrao87 • @fenfeng9 • @chyyran • @camilesing • @zhangyue19921010 • @touch-of-grey • @fredlarochelle • @LuciferYang • @lhoestq • @majin1102 • @yanghua • @wojiaodoubao • @lichuang • @Ke-Wang • @niebayes • @HaochengLIU • @markmcd • @chenghao-guo • @nyl3532016 • @jiaoew1991 • @jtuglu1 • @bryanck • @fangbo • @majian1998 • @hamersaw

🤝 Lance Community Sync Recap

In January, we held two Lance Community Syncs. They focused on the upcoming Lance 2.0.0 release (now at RC4 and approaching a final community vote), growing ecosystem integrations with DuckDB, Polaris, and Hugging Face, and the formalization of lance-context and lance-graph as official sub-projects.

We also discussed recent performance work across Spark, vector indexing, and WAL/mem-table updates, alongside forward-looking proposals covering schema semantics, metadata visibility, clustering strategies, and a new Incubator governance stage for emerging projects.

The next Lance Community Sync will take place on Thursday, February 12, 2026.