⚡ How LanceDB Accelerates Vector Search at 10 Billion Scale

LanceDB Enterprise introduces a distributed architecture for vector search at 10B scale that avoids common bottlenecks in indexing and query execution. Indexes are split into independently built segments, enabling near-linear scaling in build time, while query execution is parallelized across workers without increasing per-query latency.
Key optimizations—like HNSW over centroids to remove linear scans and faster preprocessing for RaBitQ—reduce query overhead even at high dimensions. The result is scalable throughput, predictable latency, and a stable API, even as datasets grow to tens of billions of vectors.
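To make the routing optimization concrete, here is a minimal pure-Python sketch of IVF-style search (not LanceDB's actual implementation; `build_ivf` and `ivf_search` are hypothetical names). Vectors are partitioned by nearest centroid, and a query is routed to a few partitions before scanning. The routing step shown here is a linear scan over centroids; at 10B scale the centroid count grows large enough that replacing this scan with a graph index such as HNSW, as described above, meaningfully cuts per-query overhead.

```python
import math
import random

def l2(a, b):
    return math.dist(a, b)

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid (inverted lists)."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=2, k=3):
    # Routing step: pick the nprobe closest centroids. This linear scan
    # is O(num_centroids); swapping in a graph index over the centroids
    # removes it from the per-query cost.
    probes = sorted(range(len(centroids)),
                    key=lambda c: l2(query, centroids[c]))[:nprobe]
    # Fine search: scan only the vectors in the probed partitions.
    cands = [vid for c in probes for vid in lists[c]]
    return sorted(cands, key=lambda vid: l2(query, vectors[vid]))[:k]

random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(200)]
centroids = random.sample(vectors, 8)
lists = build_ivf(vectors, centroids)
print(ivf_search([0.5] * 8, vectors, centroids, lists))
```

With `nprobe` equal to the number of centroids this degrades to exhaustive search; the trade-off is recall versus the fraction of partitions scanned.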
📊 Lance Format v2.2 Benchmarks: Half the Storage, None of the Slowdown

Lance format v2.2 reduces storage by ~50% vs. Parquet on text-heavy data while enabling up to 75× faster random reads for blobs like images and video. Filtering and sampling performance stays stable as data scales, without requiring application changes.
These gains come from compressing dictionary values without impacting access paths—reducing I/O during training and improving GPU utilization.
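As a rough illustration of why dictionary compression need not hurt access paths (a toy sketch, not Lance v2.2's actual encoding), repeated string values can be stored once while each row keeps a small integer code. Reading row *i* remains a direct lookup, so random reads stay cheap even as storage shrinks:

```python
def dict_encode(values):
    """Dictionary-encode a column: each unique string is stored once,
    and each row becomes an integer index into that dictionary."""
    dictionary, codes, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        codes.append(seen[v])
    return dictionary, codes

def take(dictionary, codes, row_ids):
    # Random access stays a direct lookup: no need to decompress the
    # whole column just to read a handful of rows.
    return [dictionary[codes[i]] for i in row_ids]

column = ["cat", "dog", "cat", "cat", "bird", "dog"] * 1000
dictionary, codes = dict_encode(column)
print(take(dictionary, codes, [0, 4, 5999]))  # ['cat', 'bird', 'dog']
print(len(dictionary), len(codes))  # 3 unique values vs 6000 rows
```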
🚗 Unifying the AV ML Stack: From Raw Data to Trained Model with LanceDB

LanceDB consolidates AV ML pipelines into a single system: raw data, annotations, and embeddings in one table, with SQL-based curation and materialized views for training splits. New data and signals can be added incrementally without rebuilding pipelines.
Training jobs can resume from checkpoints, and retraining workflows stay stable as data evolves—reducing iteration time on large multimodal datasets.
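The SQL-curation pattern described above can be sketched with stdlib `sqlite3` standing in for the lakehouse (the table schema and queries here are illustrative, not LanceDB's API): one table holds frame references, annotations, and metadata together, and a training split is just a declarative query over it.

```python
import sqlite3

# One table holds frame references, annotations, and metadata together.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE frames (
    frame_id INTEGER PRIMARY KEY,
    uri TEXT, label TEXT, weather TEXT, num_pedestrians INTEGER)""")
rows = [
    (1, "s3://bucket/f1.jpg", "car", "rain", 0),
    (2, "s3://bucket/f2.jpg", "pedestrian", "rain", 3),
    (3, "s3://bucket/f3.jpg", "pedestrian", "clear", 1),
    (4, "s3://bucket/f4.jpg", "cyclist", "rain", 2),
]
con.executemany("INSERT INTO frames VALUES (?, ?, ?, ?, ?)", rows)

# Curate a training split declaratively: rare, hard examples only.
split = con.execute(
    "SELECT frame_id, uri FROM frames "
    "WHERE weather = 'rain' AND num_pedestrians > 0 "
    "ORDER BY frame_id").fetchall()
print(split)  # [(2, 's3://bucket/f2.jpg'), (4, 's3://bucket/f4.jpg')]
```

Materializing such a query as a view gives a stable, versionable training split that can be recomputed as new data and signals arrive.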
🗞️ Case Study
Bytedance Volcano Engine LAS's Lance-Based PB-Scale Autonomous Driving Data Lake Solution

Bytedance’s Volcano Engine LAS (Lake for AI Service) team rebuilt their autonomous driving data lake on Lance to address scaling bottlenecks in annotation and training. Instead of rewriting datasets on every schema change, they added new labels incrementally, reduced storage by up to 70% with built-in compression, and enabled training jobs to read only required columns.
At production scale, this translated into higher GPU utilization (60% → 96%) and faster iteration: 10PB label processing dropped from 4 days to 1, with overall model iteration improving by ~40%.
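The two mechanics behind those numbers can be sketched in a few lines of plain Python (a toy columnar layout, not Lance's on-disk format; `add_column` and `scan` are hypothetical names): adding a label column writes only the new column, and a projection read touches only the columns a training job asks for.

```python
# Columnar layout: each column is stored (and read) independently.
table = {
    "frame_id": list(range(6)),
    "image_uri": [f"s3://av/frames/{i}.jpg" for i in range(6)],
    "lidar_uri": [f"s3://av/lidar/{i}.bin" for i in range(6)],
}

def add_column(table, name, values):
    """Schema evolution: a new label column is written alongside the
    existing data; nothing already stored is rewritten."""
    assert len(values) == len(table["frame_id"])
    table[name] = values

def scan(table, columns):
    """Projection: read only the requested columns, skipping the heavy
    blob columns a training job does not need."""
    return [dict(zip(columns, vals))
            for vals in zip(*(table[c] for c in columns))]

add_column(table, "occlusion_label", [0, 1, 0, 0, 1, 0])
batch = scan(table, ["frame_id", "occlusion_label"])
print(batch[0])  # {'frame_id': 0, 'occlusion_label': 0}
```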
📖 Also Published This Month
- Agentic Coding as Community Stewardship
- Smart Parsing Meets Sharp Retrieval: Combining LiteParse and LanceDB
- Lance JSON Support: Why You Might Not Really Need Variant
- Building A Storage Format For The Next Era of Biology
📺 Talks & Recordings
10 Things I Hate About Feature Engineering for AI
In a world where software engineers have stopped writing most of their code manually, data teams are still debugging distributed pipeline failures at 2 AM and watching their OpenAI bills spike over the weekend. Chang breaks down the ten biggest pain points of feature engineering for AI — and makes the case that our data infrastructure was built for the last generation.
Powering Netflix's Multimodal Feature Engineering at Scale
Jack Ye (Software Engineer @ LanceDB) and Pablo Delgado (Machine Learning Engineer @ Netflix) share how Netflix builds and curates multimodal features across large video and image corpora, with LanceDB serving as the core storage and query layer for multimodal data.
Exa-Scale Search with Lance & Ray
Dive into the infrastructure behind Exa's AI search engine, covering how Lance and Ray support distributed embedding pipelines and semantic retrieval at web scale across billions of documents.
- Data & Engineering at Exa – Hubert Yuan, Software Engineer @ Exa
- Lance: Exa-Scale | Multimodal Lakehouse – Lei Xu, CTO @ LanceDB
- Ray Data: Scalable AI Computing & Distributed Systems – Goutam Venkatramanan, Software Engineer @ Anyscale
📅 Upcoming Events

AI Agent Conference – May 4-5 in NYC
Join Chang's session to learn why the biggest bottleneck for production AI agents isn't model intelligence, but data infrastructure – and how context engineering and purpose-built systems will drive the next leap in agent quality.

KGC (Knowledge Graph Conference) – May 4-8 in NYC
Prashanth Rao (AI Engineer @ LanceDB) and David Hughes (AI & Graph Solution Architect) will present a Lance-native multimodal RAG architecture that unifies embeddings, graph traversal, and media access in a single system—enabling zero-copy retrieval, lower latency, and simpler pipelines.

AI Council – May 12-14 in SF
Explore the infrastructure challenges of managing trillion-scale multimodal datasets, and how Lance format and LanceDB are built to help you scale faster and cut costs.
LanceDB is a sponsor of AI Council this year! Come find us at our booth to talk training data infrastructure and feature engineering at PB scale, and for a chance to win some cool swag!
🎙️Meetups
Agent Builders Night – May 5, NYC
LanceDB, Braintrust, Modal, and Augment Code are bringing together AI leaders and builders for an evening of relaxed conversations and cocktails. No pitches or panels - just good food, drinks, and great vibes!
San Francisco DataFusion Meetup – May 11, SF
Prashanth Rao will dive into the internals of distributed query execution built with Apache DataFusion and Lance, the multimodal lakehouse format.
Ship It: Dinner & Drinks on a Boat – May 13, SF
Join us for a private gathering of AI and data peeps for an evening on the water during AI Council week in San Francisco. We’re bringing together a small group of builders, operators, and technical leaders shaping the future of AI systems and data infrastructure.
The missing data layer for ML – May 13, Menlo Park
Join LanceDB, dltHub, and DataHub for a night of technical talks and demos as we demystify the missing data layer for ML. Hear from the engineers building the ingestion, retrieval, and metadata layers of the open source AI stack.
🏗️ LanceDB Enterprise Updates
Performance
- IVF Centroid Routing — Accelerated centroid routing for the IVF index reduces vector search latency at query time.
- Page Map Caching — Replaced the page map cache with a high-concurrency implementation, improving read throughput under contention.
- Index Metadata Lookups — Reduced overhead on index metadata lookups during table maintenance, cutting operation cost at scale.
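To illustrate the idea behind the high-concurrency page map cache (a generic sketch of lock sharding, not the actual Enterprise implementation; `ShardedCache` is a hypothetical name), one global lock can be split into N shard locks so lookups for unrelated keys rarely block each other under contention:

```python
import threading

class ShardedCache:
    """Toy sketch of reducing lock contention: split one global lock
    into per-shard locks keyed by hash, so concurrent readers and
    writers on different keys proceed in parallel."""

    def __init__(self, num_shards=16):
        self._shards = [({}, threading.Lock()) for _ in range(num_shards)]

    def _shard(self, key):
        return self._shards[hash(key) % len(self._shards)]

    def get(self, key, default=None):
        data, lock = self._shard(key)
        with lock:  # only this shard is locked, not the whole cache
            return data.get(key, default)

    def put(self, key, value):
        data, lock = self._shard(key)
        with lock:
            data[key] = value

cache = ShardedCache()
cache.put(("file-42", 7), b"page bytes")
print(cache.get(("file-42", 7)))  # b'page bytes'
```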
Features
🌟 Open Source Releases
🫶 Community Contributions
Thank you to contributors from Google, Bytedance, Tencent, Baidu, Adobe, Uber, Pinterest, Microsoft, Luma AI, and Rerun.io for improvements across storage, indexing, query execution, distributed processing, and ecosystem integrations in LanceDB, Lance, and the broader ecosystem.
Notable contributions this month:
- @beinan — Added Dataset.sample() API to Java SDK and zonemap index segment support, enabling efficient sampling and index-based filtering in JVM environments
- @hushengquan — Improved I/O throughput by submitting requests eagerly in FullZipScheduler, reducing latency for batch reads
- @zhangyue19921010 — Extended dictionary-namespace with table operations and exposed base-scoped store bindings to Python, enabling richer catalog integration
- @dentiny — Fixed float64-stored number detection in JSON type extraction and corrected a logical operator bug in the conflict resolver
- @pratik0316 — Added type-safe expression builder API to Python SDK, enabling compile-time query validation
- @VedantMadane — Extended Node.js SDK to support Float16, Float64, and Uint8 vector queries, broadening precision options for embeddings
- @myandpr — Fixed max_batch_length handling for Rust vector and hybrid queries, ensuring consistent batch size limits
- @ivscheianu — Prevented arithmetic overflow in U64Segment encoding for sparse/extreme row ID ranges, improving stability for large datasets
- @LuciferYang — Improved robustness by warning and clamping LANCE_INITIAL_UPLOAD_SIZE instead of panicking on invalid values
A heartfelt thank you to our community contributors of Lance and LanceDB this past month:
@adaworldapi • @atakanyenel • @beinan • @bryanck • @butnaruandrei • @chenghao-guo • @danielmao1 • @dardourimohamed • @dcfocus • @dentiny • @dhruvgarg111 • @emilk • @fangbo • @fightboxing • @frankliee • @gezi-lzq • @huahuay • @hushengquan • @its-tanay • @ivscheianu • @jaystarshot • @jerryjch • @jiaoew1991 • @jja725 • @justsml • @kaan-simbe • @lakshjain7 • @lennylxx • @lilei1128 • @luciferyang • @majiayu000 • @majin1102 • @myandpr • @pengw0048 • @ppei-wang • @pratik0316 • @puchengy • @qingfeng-occ • @shiwk • @shmilygkd • @sinianluoye • @snigenigmatic • @summaryzb • @timsaucer • @tobocop2 • @vedantmadane • @wojiaodoubao • @wombatu-kun • @xiaguanglei • @xodn348 • @xuqianjin-stars • @xuzha • @xuzifu666 • @yangmeilly • @yangshangqing95 • @ysbf • @yxd-ym • @zehiko • @zelys-dfkh • @zhangyue19921010 • @ztorchan
🤝 Lance Community Sync Recap
This month's community sync highlighted Lance 2.2 file format benchmarks that showed major performance gains, alongside new Blob V2 and variant benchmarks, while the Lance 4.0.0 SDK is out and work has started on the 6.0.0 release candidate.
The next Lance Community Sync will take place on Thursday, May 7.



