⚡ How LanceDB Accelerates Vector Search at 10 Billion Scale

LanceDB Enterprise introduces a distributed architecture for vector search at 10B scale that avoids common bottlenecks in indexing and query execution. Indexes are split into independently built segments, enabling near-linear scaling in build time, while query execution is parallelized across workers without increasing per-query latency.
Key optimizations—like HNSW over centroids to remove linear scans and faster preprocessing for RaBitQ—reduce query overhead even at high dimensions. The result is scalable throughput, predictable latency, and a stable API, even as datasets grow to tens of billions of vectors.
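To make the routing optimization concrete, here is a minimal pure-Python sketch of IVF-style search (not LanceDB's actual implementation; `build_ivf` and `ivf_search` are hypothetical names). Vectors are partitioned by nearest centroid, and a query is routed to a few partitions before scanning. The routing step shown here is a linear scan over centroids; at 10B scale the centroid count grows large enough that replacing this scan with a graph index such as HNSW, as described above, meaningfully cuts per-query overhead.

```python
import math
import random

def l2(a, b):
    return math.dist(a, b)

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid (inverted lists)."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=2, k=3):
    # Routing step: pick the nprobe closest centroids. This linear scan
    # is O(num_centroids); swapping in a graph index over the centroids
    # removes it from the per-query cost.
    probes = sorted(range(len(centroids)),
                    key=lambda c: l2(query, centroids[c]))[:nprobe]
    # Fine search: scan only the vectors in the probed partitions.
    cands = [vid for c in probes for vid in lists[c]]
    return sorted(cands, key=lambda vid: l2(query, vectors[vid]))[:k]

random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(200)]
centroids = random.sample(vectors, 8)
lists = build_ivf(vectors, centroids)
print(ivf_search([0.5] * 8, vectors, centroids, lists))
```

With `nprobe` equal to the number of centroids this degrades to exhaustive search; the trade-off is recall versus the fraction of partitions scanned.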
📊 Lance Format v2.2 Benchmarks: Half the Storage, None of the Slowdown

Lance format v2.2 reduces storage by ~50% vs. Parquet on text-heavy data while enabling up to 75× faster random reads for blobs like images and video. Filtering and sampling performance stays stable as data scales, without requiring application changes.
These gains come from compressing dictionary values without impacting access paths—reducing I/O during training and improving GPU utilization.
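As a rough illustration of why dictionary compression need not hurt access paths (a toy sketch, not Lance v2.2's actual encoding), repeated string values can be stored once while each row keeps a small integer code. Reading row *i* remains a direct lookup, so random reads stay cheap even as storage shrinks:

```python
def dict_encode(values):
    """Dictionary-encode a column: each unique string is stored once,
    and each row becomes an integer index into that dictionary."""
    dictionary, codes, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        codes.append(seen[v])
    return dictionary, codes

def take(dictionary, codes, row_ids):
    # Random access stays a direct lookup: no need to decompress the
    # whole column just to read a handful of rows.
    return [dictionary[codes[i]] for i in row_ids]

column = ["cat", "dog", "cat", "cat", "bird", "dog"] * 1000
dictionary, codes = dict_encode(column)
print(take(dictionary, codes, [0, 4, 5999]))  # ['cat', 'bird', 'dog']
print(len(dictionary), len(codes))  # 3 unique values vs 6000 rows
```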
🚗 Unifying the AV ML Stack: From Raw Data to Trained Model with LanceDB

LanceDB consolidates AV ML pipelines into a single system: raw data, annotations, and embeddings in one table, with SQL-based curation and materialized views for training splits. New data and signals can be added incrementally without rebuilding pipelines.
Training jobs can resume from checkpoints, and retraining workflows stay stable as data evolves—reducing iteration time on large multimodal datasets.
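The SQL-curation pattern described above can be sketched with stdlib `sqlite3` standing in for the lakehouse (the table schema and queries here are illustrative, not LanceDB's API): one table holds frame references, annotations, and metadata together, and a training split is just a declarative query over it.

```python
import sqlite3

# One table holds frame references, annotations, and metadata together.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE frames (
    frame_id INTEGER PRIMARY KEY,
    uri TEXT, label TEXT, weather TEXT, num_pedestrians INTEGER)""")
rows = [
    (1, "s3://bucket/f1.jpg", "car", "rain", 0),
    (2, "s3://bucket/f2.jpg", "pedestrian", "rain", 3),
    (3, "s3://bucket/f3.jpg", "pedestrian", "clear", 1),
    (4, "s3://bucket/f4.jpg", "cyclist", "rain", 2),
]
con.executemany("INSERT INTO frames VALUES (?, ?, ?, ?, ?)", rows)

# Curate a training split declaratively: rare, hard examples only.
split = con.execute(
    "SELECT frame_id, uri FROM frames "
    "WHERE weather = 'rain' AND num_pedestrians > 0 "
    "ORDER BY frame_id").fetchall()
print(split)  # [(2, 's3://bucket/f2.jpg'), (4, 's3://bucket/f4.jpg')]
```

Materializing such a query as a view gives a stable, versionable training split that can be recomputed as new data and signals arrive.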
🗞️ Case Study
Bytedance Volcano Engine LAS's Lance-Based PB-Scale Autonomous Driving Data Lake Solution

Bytedance’s Volcano Engine LAS (Lake for AI Service) team rebuilt their autonomous driving data lake on Lance to address scaling bottlenecks in annotation and training. Instead of rewriting datasets on every schema change, they added new labels incrementally, reduced storage by up to 70% with built-in compression, and enabled training jobs to read only required columns.
At production scale, this translated into higher GPU utilization (60% → 96%) and faster iteration: 10PB label processing dropped from 4 days to 1, with overall model iteration improving by ~40%.
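The two mechanics behind those numbers can be sketched in a few lines of plain Python (a toy columnar layout, not Lance's on-disk format; `add_column` and `scan` are hypothetical names): adding a label column writes only the new column, and a projection read touches only the columns a training job asks for.

```python
# Columnar layout: each column is stored (and read) independently.
table = {
    "frame_id": list(range(6)),
    "image_uri": [f"s3://av/frames/{i}.jpg" for i in range(6)],
    "lidar_uri": [f"s3://av/lidar/{i}.bin" for i in range(6)],
}

def add_column(table, name, values):
    """Schema evolution: a new label column is written alongside the
    existing data; nothing already stored is rewritten."""
    assert len(values) == len(table["frame_id"])
    table[name] = values

def scan(table, columns):
    """Projection: read only the requested columns, skipping the heavy
    blob columns a training job does not need."""
    return [dict(zip(columns, vals))
            for vals in zip(*(table[c] for c in columns))]

add_column(table, "occlusion_label", [0, 1, 0, 0, 1, 0])
batch = scan(table, ["frame_id", "occlusion_label"])
print(batch[0])  # {'frame_id': 0, 'occlusion_label': 0}
```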
📖 Also Published This Month
- Agentic Coding as Community Stewardship
- Smart Parsing Meets Sharp Retrieval: Combining LiteParse and LanceDB
- Lance JSON Support: Why You Might Not Really Need Variant
- Building A Storage Format For The Next Era of Biology
📺 Talks & Recordings
10 Things I Hate About Feature Engineering for AI
In a world where software engineers have stopped writing most of their code manually, data teams are still debugging distributed pipeline failures at 2 AM and watching their OpenAI bills spike over the weekend. Chang breaks down the ten biggest pain points of feature engineering for AI — and makes the case that our data infrastructure was built for the last generation.
Powering Netflix's Multimodal Feature Engineering at Scale
Jack Ye (Software Engineer @ LanceDB) and Pablo Delgado (Machine Learning Engineer @ Netflix) share how Netflix builds and curates multimodal features across large video and image corpora, with LanceDB serving as the core storage and query layer for multimodal data.
Exa-Scale Search with Lance & Ray
Dive into the infrastructure behind Exa's AI search engine, covering how Lance and Ray support distributed embedding pipelines and semantic retrieval at web scale across billions of documents.
- Data & Engineering at Exa – Hubert Yuan, Software Engineer @ Exa
- Lance: Exa-Scale | Multimodal Lakehouse – Lei Xu, CTO @ LanceDB
- Ray Data: Scalable AI Computing & Distributed Systems – Goutam Venkatramanan, Software Engineer @ Anyscale
📅 Upcoming Events

AI Agent Conference – May 4-5 in NYC
Join Chang's session to learn why the biggest bottleneck for production AI agents isn't model intelligence, but data infrastructure – and how context engineering and purpose-built systems will drive the next leap in agent quality.

KGC (Knowledge Graph Conference) – May 4-8 in NYC
Prashanth Rao (AI Engineer @ LanceDB) and David Hughes (AI & Graph Solution Architect) will present a Lance-native multimodal RAG architecture that unifies embeddings, graph traversal, and media access in a single system—enabling zero-copy retrieval, lower latency, and simpler pipelines.

AI Council – May 12-14 in SF
Explore the infrastructure challenges of managing trillion-scale multimodal datasets, and how Lance format and LanceDB are built to help you scale faster and cut costs.
LanceDB is a sponsor of AI Council this year! Come find us at our booth to talk training data infrastructure and feature engineering at PB scale, and for a chance to win some cool swag!
🎙️Meetups
Agent Builders Night – May 5, NYC
LanceDB, Braintrust, Modal, and Augment Code are bringing together AI leaders and builders for an evening of relaxed conversations and cocktails. No pitches or panels - just good food, drinks, and great vibes!
San Francisco DataFusion Meetup – May 11, SF
Prashanth Rao will dive into the internals of distributed query execution built with Apache DataFusion and Lance, the multimodal lakehouse format.
Ship It: Dinner & Drinks on a Boat – May 13, SF
Join us for a private gathering of AI and data peeps for an evening on the water during AI Council week in San Francisco. We’re bringing together a small group of builders, operators, and technical leaders shaping the future of AI systems and data infrastructure.
The missing data layer for ML – May 13, Menlo Park
Join LanceDB, dltHub, and DataHub for a night of technical talks and demos as we demystify the missing data layer for ML. Hear from the engineers building the ingestion, retrieval, and metadata layers of the open source AI stack.
🏗️ LanceDB Enterprise Updates
Performance
- IVF Centroid Routing — Accelerated centroid routing for the IVF index reduces vector search latency at query time.
- Page Map Caching — Replaced the page map cache with a high-concurrency implementation, improving read throughput under contention.
- Index Metadata Lookups — Reduced overhead on index metadata lookups during table maintenance, cutting operation cost at scale.
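To illustrate the idea behind the high-concurrency page map cache (a generic sketch of lock sharding, not the actual Enterprise implementation; `ShardedCache` is a hypothetical name), one global lock can be split into N shard locks so lookups for unrelated keys rarely block each other under contention:

```python
import threading

class ShardedCache:
    """Toy sketch of reducing lock contention: split one global lock
    into per-shard locks keyed by hash, so concurrent readers and
    writers on different keys proceed in parallel."""

    def __init__(self, num_shards=16):
        self._shards = [({}, threading.Lock()) for _ in range(num_shards)]

    def _shard(self, key):
        return self._shards[hash(key) % len(self._shards)]

    def get(self, key, default=None):
        data, lock = self._shard(key)
        with lock:  # only this shard is locked, not the whole cache
            return data.get(key, default)

    def put(self, key, value):
        data, lock = self._shard(key)
        with lock:
            data[key] = value

cache = ShardedCache()
cache.put(("file-42", 7), b"page bytes")
print(cache.get(("file-42", 7)))  # b'page bytes'
```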
Features
🌟 Open Source Releases
🫶 Community Contributions
Thank you to contributors from Google, Bytedance, Tencent, Baidu, Adobe, Uber, Pinterest, Microsoft, Luma AI, and Rerun.io for improvements across storage, indexing, query execution, distributed processing, and ecosystem integrations in LanceDB, Lance, and the broader ecosystem.
Notable contributions this month:
- @beinan — Added Dataset.sample() API to Java SDK and zonemap index segment support, enabling efficient sampling and index-based filtering in JVM environments
- @hushengquan — Improved I/O throughput by submitting requests eagerly in FullZipScheduler, reducing latency for batch reads
- @zhangyue19921010 — Extended dictionary-namespace with table operations and exposed base-scoped store bindings to Python, enabling richer catalog integration
- @dentiny — Fixed float64-stored number detection in JSON type extraction and corrected a logical operator bug in the conflict resolver
- @pratik0316 — Added type-safe expression builder API to Python SDK, enabling compile-time query validation
- @VedantMadane — Extended Node.js SDK to support Float16, Float64, and Uint8 vector queries, broadening precision options for embeddings
- @myandpr — Fixed max_batch_length handling for Rust vector and hybrid queries, ensuring consistent batch size limits
- @ivscheianu — Prevented arithmetic overflow in U64Segment encoding for sparse/extreme row ID ranges, improving stability for large datasets
- @LuciferYang — Improved robustness by warning and clamping LANCE_INITIAL_UPLOAD_SIZE instead of panicking on invalid values
A heartfelt thank you to our community contributors of Lance and LanceDB this past month:
@adaworldapi • @atakanyenel • @beinan • @bryanck • @butnaruandrei • @chenghao-guo • @danielmao1 • @dardourimohamed • @dcfocus • @dentiny • @dhruvgarg111 • @emilk • @fangbo • @fightboxing • @frankliee • @gezi-lzq • @huahuay • @hushengquan • @its-tanay • @ivscheianu • @jaystarshot • @jerryjch • @jiaoew1991 • @jja725 • @justsml • @kaan-simbe • @lakshjain7 • @lennylxx • @lilei1128 • @luciferyang • @majiayu000 • @majin1102 • @myandpr • @pengw0048 • @ppei-wang • @pratik0316 • @puchengy • @qingfeng-occ • @shiwk • @shmilygkd • @sinianluoye • @snigenigmatic • @summaryzb • @timsaucer • @tobocop2 • @vedantmadane • @wojiaodoubao • @wombatu-kun • @xiaguanglei • @xodn348 • @xuqianjin-stars • @xuzha • @xuzifu666 • @yangmeilly • @yangshangqing95 • @ysbf • @yxd-ym • @zehiko • @zelys-dfkh • @zhangyue19921010 • @ztorchan
🤝 Lance Community Sync Recap
This month's community sync highlighted Lance 2.2 file format benchmarks that showed major performance gains, alongside new Blob V2 and variant benchmarks, while the Lance 4.0.0 SDK is out and work has started on the 6.0.0 release candidate.
The next Lance Community Sync will take place on Thursday, May 7.



