
LanceDB's Geneva: Scalable Feature Engineering

August 21, 2025
Engineering

When you start a machine learning project, your first challenge usually isn’t training a model. It’s getting the data into a usable shape.

Raw data is messy: images are just pixels, audio is just waveforms, and text needs structure. Before you can do anything interesting, you need features.

LanceDB’s Geneva is designed to take that pain away. Instead of writing ad hoc scripts that don’t scale, you define feature transformations once as Python functions, run them locally or on a distributed cluster, and Geneva materializes the results as typed, queryable columns.

In this walkthrough, you’ll load a dataset of cats and dogs, define four feature extractors (file size, dimensions, captions with BLIP, and embeddings with OpenCLIP), and run them at scale. Along the way, you’ll see how Geneva keeps the process consistent whether you’re running on your laptop or across a Ray cluster.

Figure 1: The Feature Engineering Workflow

This diagram shows how Geneva transforms raw data into enriched features through UDFs, creating new columns for file size, dimensions, captions, and embeddings.

flowchart LR
    A[Raw dataset] --> B[Load into Geneva]
    B --> C[Define UDFs]
    C --> D[File size]
    C --> E[Dimensions]
    C --> F[Captions]
    C --> G[Embeddings]
    D --> H[New columns]
    E --> H
    F --> H
    G --> H
    H --> I[Enriched table]

💡 Geneva’s basic promise is deceptively simple. Write Python like you normally would. Keep your functions pure. Geneva will serialize the code, ship the exact same environment to worker nodes, execute at scale, and persist results as new columns in LanceDB.

Watch the demo on YouTube

The demo is quite complex, so we recommend reading the article first. The steps outlined in this blog will help guide you through the tutorial.

Step 1: Install and check your environment

Start by installing the required packages. Geneva integrates with PyTorch, Hugging Face Transformers, and OpenCLIP, so you can use state-of-the-art models right out of the box.

!pip install --upgrade datasets pillow
!pip install transformers==4.51 torch accelerate     # BLIP captioning
!pip install open-clip-torch scikit-learn matplotlib # CLIP embeddings

If you have access to a GPU, Geneva will take advantage of it automatically. It’s worth checking:

import torch
print("CUDA available:", torch.cuda.is_available())

This matters because GPU acceleration can dramatically reduce the time it takes to generate captions and embeddings. If CUDA isn’t available, you can still run everything on CPU, which is fine for testing smaller datasets. The important thing is that the exact same code will work in either environment.
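The CPU fallback described above can be written as a small helper. This is a minimal sketch, not part of the original notebook: `pick_device` is a hypothetical name, and the `ImportError` branch is an added assumption so the snippet also runs where PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when a GPU is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Running on:", device)
```

Code that moves models and tensors to `device` then behaves the same on a laptop and on a GPU worker.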

Step 2: Load a dataset into Geneva

For this demo, we’ll use the Oxford-IIIT Pets dataset — a collection of cats and dogs with labels.

You can swap this for any dataset you like. The loader below converts rows into Arrow record batches, which LanceDB tables ingest directly and store in a form that is efficient to process and query.

from datasets import load_dataset
import pyarrow as pa
import io, shutil
import geneva
from geneva.tqdm import tqdm

# Placeholder values: the excerpt uses these names without defining them
GENEVA_DB_PATH = "./geneva_db"
NUM_IMAGES = 500

# Start from a clean database directory
shutil.rmtree(GENEVA_DB_PATH, ignore_errors=True)

def load_images(frag_size: int = 25):
    dataset = load_dataset("timm/oxford-iiit-pet", split=f"train[:{NUM_IMAGES}]")
    batch = []
    for row in tqdm(dataset):
        # Serialize each PIL image to PNG bytes
        buf = io.BytesIO()
        row["image"].save(buf, format="png")
        batch.append({"image": buf.getvalue(),
                      "label": row["label"],
                      "image_id": row["image_id"],
                      "label_cat_dog": row["label_cat_dog"]})
        # Emit a record batch once frag_size rows have accumulated
        if len(batch) >= frag_size:
            yield pa.RecordBatch.from_pylist(batch)
            batch = []
    # Flush the final, possibly smaller, batch
    if batch:
        yield pa.RecordBatch.from_pylist(batch)

db = geneva.connect(GENEVA_DB_PATH)

This function streams the dataset into batches instead of loading everything into memory at once.

Smaller batches are also easier to parallelize. If the data arrived as one large partition, a single worker would process it while the rest sat idle. Split into batches, the work can be farmed out to multiple workers at once.
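The batching logic is independent of Geneva and Arrow. Here is a minimal sketch in plain Python, where `batched` and the integer rows are hypothetical stand-ins for the loader above:

```python
from typing import Iterable, Iterator, List


def batched(rows: Iterable[int], frag_size: int = 25) -> Iterator[List[int]]:
    """Yield lists of at most frag_size rows, mirroring the loader's flush logic."""
    batch: List[int] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= frag_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


# 60 hypothetical rows split into fragments of at most 25
sizes = [len(b) for b in batched(range(60), frag_size=25)]
print(sizes)
```

Each yielded batch is an independent unit of work, which is what lets a scheduler hand different batches to different workers.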

To make the table, just loop through the batches and add them:

first = True
for batch in load_images():
    if first:
        tbl = db.create_table("images", batch, mode="overwrite")
        first = False
    else:
        tbl.add(batch)

At this point you’ve got a table of raw images with labels. It doesn’t do much on its own, but now you’re ready to enrich it with features.

Step 3: Add simple features (file size, dimensions)

Start small. Geneva lets you define UDFs as simple Python functions, and the results become new columns in your table. The following examples show how to return both scalars and structured data:

from geneva import udf
import pyarrow as pa
from PIL import Image
import io

@udf
def file_size(image: bytes) -> int:
    return len(image)

@udf(data_type=pa.struct([
    pa.field("width", pa.int32()),
    pa.field("height", pa.int32())
]))
def dimensions(image: bytes):
    img = Image.open(io.BytesIO(image))
    return {"width": img.size[0], "height": img.size[1]}

The first UDF calculates the size of each image file in bytes. It’s a trivial example, but it demonstrates how easy it is to add scalar values. The second UDF extracts the width and height of each image and returns them as a structured record. With these two functions, you now have queryable columns that let you filter images by resolution or spot outliers based on file size. Even though these examples are simple, they highlight Geneva’s flexibility in handling different data types.
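To make "filter by resolution" and "spot outliers" concrete, here is a small sketch over plain Python dicts shaped like the two new columns. The rows are invented for illustration, and `at_least` and `size_outliers` are hypothetical helpers, not Geneva or LanceDB functions.

```python
from statistics import median

# Illustrative rows only: the values are made up to show the shape of the data.
rows = [
    {"image_id": "a", "file_size": 15_000, "dimensions": {"width": 128, "height": 128}},
    {"image_id": "b", "file_size": 28_000, "dimensions": {"width": 256, "height": 256}},
    {"image_id": "c", "file_size": 19_000, "dimensions": {"width": 128, "height": 96}},
    {"image_id": "d", "file_size": 400_000, "dimensions": {"width": 300, "height": 200}},
]


def at_least(rows, min_width, min_height):
    """Keep rows whose struct column meets a minimum resolution."""
    return [r["image_id"] for r in rows
            if r["dimensions"]["width"] >= min_width
            and r["dimensions"]["height"] >= min_height]


def size_outliers(rows, factor=5.0):
    """Flag rows whose file size exceeds `factor` times the median size."""
    mid = median(r["file_size"] for r in rows)
    return [r["image_id"] for r in rows if r["file_size"] > factor * mid]


large = at_least(rows, 200, 200)
outliers = size_outliers(rows)
print(large, outliers)
```

Once the columns are materialized in a table, the same conditions would be expressed as filters on the table instead of Python loops.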

Step 4: Generate captions with BLIP

Now let’s create something more useful. Geneva makes it easy to run expensive models at scale by letting you write stateful UDFs. That means the model is loaded once and reused across rows, instead of being reloaded every time. Here’s how you can generate captions using BLIP:

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch, io

@udf(cuda=True)
class BlipCaptioner:
    def __init__(self):
        self.is_loaded = False

    def setup(self):
        # Load the processor and model once, then reuse them for every row
        name = "Salesforce/blip-image-captioning-base"
        self.processor = BlipProcessor.from_pretrained(name)
        self.model = BlipForConditionalGeneration.from_pretrained(name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.is_loaded = True

    def __call__(self, image: bytes) -> str:
        if not self.is_loaded:
            self.setup()
        raw = Image.open(io.BytesIO(image)).convert("RGB")
        inputs = self.processor([raw], return_tensors="pt")
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        out = self.model.generate(**inputs, max_length=50)
        return self.processor.decode(out[0], skip_special_tokens=True)

With this UDF, each image in your dataset now gets a natural language description. Instead of just having raw pixels and labels, you can run queries like “show me all the rows where the caption mentions a dog.” This transforms your dataset into something you can search and analyze in ways that weren’t possible before. And because the model is cached, it runs efficiently even across large batches.
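The load-once behaviour comes from the `is_loaded` guard rather than from anything model-specific. A toy sketch of the same pattern, with a counter added to make it observable (`LazyModel` and its string output are hypothetical placeholders for a real model):

```python
class LazyModel:
    """Toy stand-in for a stateful UDF: setup runs once, then is reused."""

    def __init__(self):
        self.is_loaded = False
        self.setup_calls = 0  # counter added only to make the behaviour visible

    def setup(self):
        # In a real UDF this is where the model weights would be loaded.
        self.setup_calls += 1
        self.is_loaded = True

    def __call__(self, image: bytes) -> str:
        if not self.is_loaded:
            self.setup()
        return f"{len(image)} bytes"  # placeholder for real inference


captioner = LazyModel()
results = [captioner(b"x" * n) for n in range(1, 6)]
print(captioner.setup_calls, results)
```

Five calls trigger a single setup, which is why per-row cost stays close to pure inference time once a worker has warmed up.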

Step 5: Create embeddings with OpenCLIP

Captions give you text, but embeddings give you the ability to search semantically. With embeddings, you can ask questions like “find images most similar to this one” or “cluster my dataset into related groups.” Geneva makes it simple to generate these embeddings and store them as vector columns.

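import io, torch
import open_clip, numpy as np
from PIL import Image

@udf(data_type=pa.list_(pa.float32(), 512))
class GenEmbeddings:
    def __init__(self):
        self.is_loaded = False

    def setup(self):
        self.model, _, self.preprocess = open_clip.create_model_and_transforms(
            "ViT-B-32", pretrained="laion2b_s34b_b79k")
        self.model.eval()
        self.is_loaded = True

    def __call__(self, image: bytes):
        if not self.is_loaded:
            self.setup()
        # Sketch of the elided body: preprocess → forward pass → normalize → return 512-d vector
        img = self.preprocess(Image.open(io.BytesIO(image)).convert("RGB"))
        with torch.no_grad():
            vec = self.model.encode_image(img.unsqueeze(0))[0]
            return (vec / vec.norm()).numpy().astype(np.float32)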

Once you run this, every image will have a 512-dimensional vector representation. That vector becomes the foundation for building similarity search, recommendations, or clustering workflows. Instead of just relying on labels or captions, you now have a mathematical representation of content that Geneva can index and query.
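To show what "find images most similar to this one" means numerically, here is a brute-force cosine similarity sketch in plain Python. It uses tiny 3-dimensional vectors instead of 512-dimensional ones, and `most_similar` is a hypothetical helper; a real deployment would rely on a vector index rather than a full scan.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar(query: Sequence[float], vectors: List[Sequence[float]], k: int = 2) -> List[int]:
    """Return the indices of the k vectors with the highest cosine similarity."""
    q = normalize(query)
    scored = []
    for i, v in enumerate(vectors):
        score = sum(a * b for a, b in zip(q, normalize(v)))
        scored.append((score, i))
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Made-up embeddings: 0 and 1 point in nearly the same direction
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
nearest = most_similar(vectors[0], vectors, k=2)
print(nearest)
```

Querying with an existing row returns that row first, followed by its closest neighbour.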

Figure 2: Backfill Execution and Querying

This diagram shows Geneva’s execution workflow: UDFs are processed in batches with partial commits, then results become available for searching and filtering.

flowchart LR
    A[Define UDFs] --> B[Run Backfill]
    B --> C[Partial Commits]
    C --> D[Async Execution]
    D --> E[Table Updated]
    E --> F[Search/Filter]
    F --> G[Results]

Step 6: Run backfills

Defining UDFs is only half the story — you need to run them against your dataset. Geneva’s backfill API applies your UDFs across the table and writes the results into new columns. For lightweight tasks like file size, you can run them synchronously:

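# Register the UDF-backed columns first, following the add_columns pattern used for the embedding column
tbl.add_columns({"file_size": file_size, "dimensions": dimensions})
tbl.backfill("file_size", batch_size=10)
tbl.backfill("dimensions", batch_size=10, commit_granularity=5)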

For heavier tasks like embeddings, you’ll want to run them asynchronously. This way, Geneva commits partial results as they’re processed, and you can monitor progress in real time:

tbl.add_columns({"embedding": GenEmbeddings()})
fut = tbl.backfill_async("embedding", batch_size=10, commit_granularity=2)

while not fut.done(timeout=5):
    tbl.checkout_latest()
    done = tbl.search().where("embedding is not null").to_arrow()
    print(f"committed {len(done)} rows, version {tbl.version}")

This workflow makes a big difference when you scale. You don’t have to wait hours for the entire dataset to finish before you can see results. Geneva will stream partial results as they come in, and because every version of the table is saved, you can safely retry or roll back if something fails.
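The value of partial commits is easiest to see in a toy model. The sketch below is not Geneva's implementation: it simulates a backfill in plain Python, where a dict stands in for committed table state and the hypothetical `fail_at` parameter injects a crash, to show that a retry only recomputes rows that were never committed.

```python
from typing import Callable, Dict, List, Optional


def backfill(rows: List[bytes], column: Dict[int, int], fn: Callable[[bytes], int],
             batch_size: int = 2, fail_at: Optional[int] = None) -> int:
    """Fill missing entries of `column` batch by batch and return commits made.

    `column` plays the role of committed state: anything written to it
    survives a crash. `fail_at` injects a failure at that row index.
    """
    commits = 0
    pending = [i for i in range(len(rows)) if i not in column]  # skip finished rows
    for start in range(0, len(pending), batch_size):
        staged = {}
        for i in pending[start:start + batch_size]:
            if i == fail_at:
                raise RuntimeError(f"worker died on row {i}")
            staged[i] = fn(rows[i])
        column.update(staged)  # commit the whole batch at once
        commits += 1
    return commits


rows = [b"a" * n for n in range(1, 8)]  # 7 hypothetical rows
column: Dict[int, int] = {}

try:
    backfill(rows, column, len, batch_size=2, fail_at=5)
except RuntimeError:
    pass
committed_before_retry = sorted(column)  # rows that survived the crash

retry_commits = backfill(rows, column, len, batch_size=2)
print(committed_before_retry, retry_commits, column)
```

The first run commits rows 0 through 3 before failing. The retry picks up at row 4 and needs only two more commits.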

A quick “before and after” view

Here is a simplified picture of how the table changes. You start with the raw rows:

| image_id | label | image (bytes) |
|----------|-------|---------------|
| pet_001  | cat   | ...           |
| pet_002  | dog   | ...           |
| pet_003  | cat   | ...           |

Once the UDFs have run, the table grows with feature columns:

| image_id | label | file_size | dimensions                   | caption                           | embedding        |
|----------|-------|-----------|------------------------------|-----------------------------------|------------------|
| pet_001  | cat   | 15423     | {"width":128,"height":128}   | "a small brown cat sitting down"  | [0.12, 0.33,...] |
| pet_002  | dog   | 28764     | {"width":256,"height":256}   | "a black dog looking at camera"   | [0.42, 0.77,...] |
| pet_003  | cat   | 19287     | {"width":128,"height":128}   | "a fluffy white kitten indoors"   | [0.09, 0.55,...] |

(These tables are illustrative to show the shape of the data; in the notebook you’ll see Arrow/Lance tables and DataFrame previews.)

Step 7: Query your results

Once your features are materialized, you can treat them just like any other database column. For example, you can run a SQL query to filter by captions:

SELECT image_id, caption
FROM images
WHERE caption LIKE '%dog%'

Or you can filter in Python, here keeping only the rows whose caption has been populated:

rows = tbl.search().where("caption is not null").to_arrow()

At this point, your dataset has evolved from a collection of raw images into a rich table with metadata, captions, and embeddings. You can search it, analyze it, or feed it into downstream pipelines without writing custom glue code.
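If you want to try the caption filters without a Geneva table at hand, the same SQL predicates can be exercised against an in-memory SQLite database. This is only a stand-in for illustration: SQLite is not LanceDB, and the rows reuse the illustrative captions from the table above, with one left empty to mimic a row that has not been backfilled yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (image_id TEXT, caption TEXT)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [
        ("pet_001", "a small brown cat sitting down"),
        ("pet_002", "a black dog looking at camera"),
        ("pet_003", None),  # caption not backfilled yet
    ],
)

# Same predicate as the SQL query above
dogs = conn.execute(
    "SELECT image_id, caption FROM images WHERE caption LIKE '%dog%'"
).fetchall()

# Same predicate as the Python filter above
captioned = conn.execute(
    "SELECT image_id FROM images WHERE caption IS NOT NULL ORDER BY image_id"
).fetchall()
conn.close()

print(dogs)
print(captioned)
```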

Why this workflow matters

What you’ve built here is more than just a demo. By combining UDFs with Geneva’s execution engine, you’ve taken raw data and turned it into something structured, searchable, and scalable.

  • Scalars like file size show you that you can store simple numbers.
  • Structs like dimensions let you work with richer metadata.
  • Captions give you natural language descriptions that make the dataset human-readable, while embeddings let you run semantic search and clustering.

Because Geneva backfills work asynchronously and are versioned, you don’t have to worry about reruns or failures — you always know where you left off.

And perhaps most importantly, you don’t have to change your code when you scale: the same functions you tested on a handful of images locally will run across millions of rows in a Ray cluster. That’s the difference between a quick prototype and a production-ready workflow.

Your next step

  1. Pick a dataset that matters to you. Maybe it’s product images from your company’s website, or a collection of documents you want to index.
  2. Start with a small batch locally to make sure your UDFs behave as expected.
  3. Then, when you’re ready, scale it out with a cluster. Geneva gives you a smooth path from experimentation to production, without the usual friction of rewriting pipelines for distributed systems.

Once you’ve experienced how quickly you can go from raw pixels to searchable features, you’ll see why Geneva changes the way you approach feature engineering.

Over the coming months, we’ll be publishing a lot more regarding our roadmap for feature engineering in LanceDB, so stay tuned! If you’re interested in using Geneva in production, please contact us for more information and example notebooks.

Jonathan Hsieh
Software Engineer & Geneva Project Lead.
