⚡ Vector Search at 10B Scale, 📊 Lance Format Benchmarks, 🚗 AV Pipelines at Scale

May 4, 2026
Newsletter

⚡ How LanceDB Accelerates Vector Search at 10 Billion Scale

LanceDB Enterprise introduces a distributed architecture for vector search at 10B scale that avoids common bottlenecks in indexing and query execution. Indexes are split into independently built segments, enabling near-linear scaling in build time, while query execution is parallelized across workers without increasing per-query latency.

Key optimizations—like HNSW over centroids to remove linear scans and faster preprocessing for RaBitQ—reduce query overhead even at high dimensions. The result is scalable throughput, predictable latency, and a stable API, even as datasets grow to tens of billions of vectors.
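
The segment-and-merge pattern described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not LanceDB's implementation: the `search_segment` and `distributed_search` helpers are hypothetical names, the per-segment scan is brute force rather than a real ANN index, and a thread pool stands in for distributed workers.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_segment(segment, query, k):
    """Return the k nearest (distance, row_id) pairs within one segment.

    A real segment would carry its own index; brute force keeps the sketch short.
    """
    scored = ((l2(vec, query), row_id) for row_id, vec in segment)
    return heapq.nsmallest(k, scored)


def distributed_search(segments, query, k, max_workers=4):
    """Fan the query out to every segment in parallel, then merge partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda seg: search_segment(seg, query, k), segments))
    # Each partial list is sorted by distance, so a k-way merge gives the global order.
    merged = list(heapq.merge(*partials))
    return [row_id for _, row_id in merged[:k]]


# Hypothetical data: two independently built segments of (row_id, vector) pairs.
segments = [
    [(0, [0.0, 0.0]), (1, [5.0, 5.0])],
    [(2, [1.0, 0.0]), (3, [9.0, 9.0])],
]
result = distributed_search(segments, query=[0.0, 0.0], k=2)
print(result)
```

Because each segment returns only its own top k, the merge step handles at most k results per segment, which is one reason fan-out of this kind need not inflate per-query latency as segments are added.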

Read more →

📊 Lance Format v2.2 Benchmarks: Half the Storage, None of the Slowdown

Lance format v2.2 reduces storage by ~50% vs. Parquet on text-heavy data while enabling up to 75× faster random reads for blobs like images and video. Filtering and sampling performance stays stable as data scales, without requiring application changes.

These gains come from compressing dictionary values without impacting access paths—reducing I/O during training and improving GPU utilization.

Read more →

🚗 Unifying the AV ML Stack: From Raw Data to Trained Model with LanceDB

LanceDB consolidates AV ML pipelines into a single system: raw data, annotations, and embeddings in one table, with SQL-based curation and materialized views for training splits. New data and signals can be added incrementally without rebuilding pipelines.

Training jobs can resume from checkpoints, and retraining workflows stay stable as data evolves—reducing iteration time on large multimodal datasets.

Read more →

🗞️ Case Study

ByteDance Volcano Engine LAS's Lance-Based PB-Scale Autonomous Driving Data Lake Solution

ByteDance’s Volcano Engine LAS (Lake for AI Service) team rebuilt their autonomous driving data lake on Lance to address scaling bottlenecks in annotation and training. Instead of rewriting datasets on every schema change, they added new labels incrementally, reduced storage by up to 70% with built-in compression, and enabled training jobs to read only required columns.

At production scale, this translated into higher GPU utilization (60% → 96%) and faster iteration: 10 PB label processing dropped from 4 days to 1, with overall model iteration improving by ~40%.

Read more →

📖 Also Published This Month

📺 Talks & Recordings

10 Things I Hate About Feature Engineering for AI

In a world where software engineers have stopped writing most of their code manually, data teams are still debugging distributed pipeline failures at 2 AM and watching their OpenAI bills spike over the weekend. Chang breaks down the ten biggest pain points of feature engineering for AI — and makes the case that our data infrastructure was built for the last generation.

Watch the recording →

Powering Netflix's Multimodal Feature Engineering at Scale

Jack Ye (Software Engineer @ LanceDB) and Pablo Delgado (Machine Learning Engineer @ Netflix) share how Netflix builds and curates multimodal features across large video and image corpora, with LanceDB serving as the core storage and query layer for multimodal data. 

Watch the recording →

Exa-Scale Search with Lance & Ray

Dive into the infrastructure behind Exa's AI search engine, covering how Lance and Ray support distributed embedding pipelines and semantic retrieval at web scale across billions of documents.

📅 Upcoming Events

AI Agent Conference – May 4-5 in NYC

Join Chang's session to learn why the biggest bottleneck for production AI agents isn't model intelligence, but data infrastructure – and how context engineering and purpose-built systems will drive the next leap in agent quality.

Register →

KGC (Knowledge Graph Conference) – May 4-8 in NYC

Prashanth Rao (AI Engineer @ LanceDB) and David Hughes (AI & Graph Solution Architect) will present a Lance-native multimodal RAG architecture that unifies embeddings, graph traversal, and media access in a single system—enabling zero-copy retrieval, lower latency, and simpler pipelines.

Register →

AI Council – May 12-14 in SF

Explore the infrastructure challenges of managing trillion-scale multimodal datasets, and how Lance format and LanceDB are built to help you scale faster and cut costs.

LanceDB is a sponsor of AI Council this year! Come find us at our booth to talk training data infrastructure and feature engineering at PB scale, and for a chance to win some cool swag!

Register →

🎙️ Meetups

Agent Builders Night – May 5, NYC

LanceDB, Braintrust, Modal, and Augment Code are bringing together AI leaders and builders for an evening of relaxed conversations and cocktails. No pitches or panels - just good food, drinks, and great vibes!

San Francisco DataFusion Meetup – May 11, SF

Prashanth Rao will be diving into the internals of distributed query execution built with Apache DataFusion and Lance, the multimodal lakehouse format.

Ship It: Dinner & Drinks on a Boat – May 13, SF

Join us for a private gathering of AI and data peeps for an evening on the water during AI Council week in San Francisco. We’re bringing together a small group of builders, operators, and technical leaders shaping the future of AI systems and data infrastructure.

The missing data layer for ML – May 13, Menlo Park

Join LanceDB, dltHub, and DataHub for a night of technical talks and demos as we demystify the missing data layer for ML. Hear from the engineers building the ingestion, retrieval, and metadata layers of the open source AI stack.

🏗️ LanceDB Enterprise Updates

Performance

  • IVF Centroid Routing — Accelerated centroid routing for the IVF index reduces vector search latency at query time.
  • Page Map Caching — Replaced the page map cache with a high-concurrency implementation, improving read throughput under contention.
  • Index Metadata Lookups — Reduced overhead on index metadata lookups during table maintenance, cutting operation cost at scale.
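
For readers unfamiliar with IVF, centroid routing can be pictured with a small sketch. It is a generic illustration of the technique, not LanceDB Enterprise code: `ivf_search`, the data, and the `nprobe` parameter name are assumptions made for the example. Step 1 below is a plain linear scan over centroids, which is the part that a structure such as HNSW over centroids would replace.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ivf_search(centroids, partitions, query, k, nprobe):
    """Route the query to its nprobe nearest centroids and scan only those partitions."""
    # Step 1 (routing): rank partition ids by centroid distance to the query.
    ranked = sorted(range(len(centroids)), key=lambda i: l2(centroids[i], query))
    probed = ranked[:nprobe]
    # Step 2 (scan): score only the vectors stored in the selected partitions.
    candidates = [
        (l2(vec, query), row_id)
        for p in probed
        for row_id, vec in partitions[p]
    ]
    candidates.sort()
    return probed, [row_id for _, row_id in candidates[:k]]


# Hypothetical index: three centroids, each owning a partition of (row_id, vector) pairs.
centroids = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]]
partitions = [
    [(0, [0.5, 0.5]), (1, [1.0, 1.0])],
    [(2, [9.0, 9.0]), (3, [11.0, 10.0])],
    [(4, [19.0, 21.0])],
]
probed, ids = ivf_search(centroids, partitions, query=[1.0, 0.0], k=2, nprobe=1)
print(probed, ids)
```

Raising `nprobe` scans more partitions, trading latency for recall. The routing step runs on every query, so speeding it up lowers latency regardless of how many partitions are then scanned.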

Features

  • Distributed Vector Search: ANN execution now distributes across workers with segment-level routing, distributed query plan execution with built-in metrics, and segment-based index builds. This is the architecture behind the 10B-scale results.
  • Table Maintenance Automation: Intelligent job planning with automated and remote backfill support, plus configurable warm-up readiness gating to block traffic until the query engine is ready.
  • Telemetry Privacy Controls: Table names, column names, and user identifiers can be obfuscated in telemetry and indexer workloads, which is required for deployments operating under strict data governance or regulatory constraints.
  • Secure Namespace Credential Vending: Manifest-based credential vending for cross-namespace access to storage and services without exposing credentials at the application layer.
  • Catalog Explorer & SQL Console: Catalog explorer frontend and integrated SQL console in the Feature Engineering sidebar for interactive data exploration without leaving the platform.
  • Cache Observability: Page read and disk read counters added to query engine cache metrics, giving operators direct visibility into cache behavior for performance tuning.

🌟 Open Source Releases

LanceDB v0.30.2
Release notes
  • Parallel inserts for remote tables via multipart write improve throughput for large uploads (#3071)
  • New type-safe expression builder API in Python for constructing queries programmatically (#3150)
  • Node.js SDK now supports Float16, Float64, and Uint8 vector queries (#3193)
  • Progress bar added to add() for visibility during bulk inserts (#3067)

lance-duckdb v0.5.4
Release notes
  • Added dataset and session reuse across read, write, and search paths, including per-connection dataset caching and shared per-database sessions, reducing overhead for repeated operations (#182, #183, #185, #188, #189)
  • Vector index controls now available in hybrid search (#190)

lance-namespace v0.7.0 – v0.7.2
Release notes
  • CreateTable REST API now supports storage options and properties; CreateEmptyTable removed (#330)
  • ListTables now includes declared tables by default via include_declared=true (#332)
  • DescribeTable request adds check_declared parameter for explicit declared table lookups (#334)

🫶 Community Contributions

Thank you to contributors from Google, ByteDance, Tencent, Baidu, Adobe, Uber, Pinterest, Microsoft, Luma AI, and Rerun.io for improvements across storage, indexing, query execution, distributed processing, and ecosystem integrations in LanceDB, Lance, and the broader ecosystem.

Notable contributions this month:

  • @beinan — Added Dataset.sample() API to Java SDK and zonemap index segments support, enabling efficient sampling and index-based filtering in JVM environments
  • @hushengquan — Improved I/O throughput by submitting requests eagerly in FullZipScheduler, reducing latency for batch reads
  • @zhangyue19921010 — Extended dictionary-namespace with table operations and exposed base-scoped store bindings to Python, enabling richer catalog integration
  • @dentiny — Fixed float64-stored number detection in JSON type extraction and corrected logical operator bug in conflict resolver
  • @pratik0316 — Added type-safe expression builder API to Python SDK, enabling compile-time query validation
  • @VedantMadane — Extended Node.js SDK to support Float16, Float64, and Uint8 vector queries, broadening precision options for embeddings
  • @myandpr — Fixed max_batch_length handling for Rust vector and hybrid queries, ensuring consistent batch size limits
  • @ivscheianu — Prevented arithmetic overflow in U64Segment encoding for sparse/extreme row ID ranges, improving stability for large datasets
  • @LuciferYang — Improved robustness by warning and clamping LANCE_INITIAL_UPLOAD_SIZE instead of panicking on invalid values

A heartfelt thank you to our community contributors of Lance and LanceDB this past month:

@adaworldapi @atakanyenel @beinan @bryanck @butnaruandrei @chenghao-guo @danielmao1 @dardourimohamed @dcfocus @dentiny @dhruvgarg111 @emilk @fangbo @fightboxing @frankliee @gezi-lzq @huahuay @hushengquan @its-tanay @ivscheianu @jaystarshot @jerryjch @jiaoew1991 @jja725 @justsml @kaan-simbe @lakshjain7 @lennylxx @lilei1128 @luciferyang @majiayu000 @majin1102 @myandpr @pengw0048 @ppei-wang @pratik0316 @puchengy @qingfeng-occ @shiwk @shmilygkd @sinianluoye @snigenigmatic @summaryzb @timsaucer @tobocop2 @vedantmadane @wojiaodoubao @wombatu-kun @xiaguanglei @xodn348 @xuqianjin-stars @xuzha @xuzifu666 @yangmeilly @yangshangqing95 @ysbf @yxd-ym @zehiko @zelys-dfkh @zhangyue19921010 @ztorchan

🤝 Lance Community Sync Recap

This month's community sync highlighted Lance 2.2 file format benchmarks that showed major performance gains, alongside new Blob V2 and variant benchmarks. The Lance 4.0.0 SDK is out, and work has started on the 6.0.0 release candidate.

The next Lance Community Sync will take place on Thursday, May 7.

ChanChan Mao
Developer Relations @ LanceDB
