Working With Tables in LanceDB

In LanceDB, tables store records with a defined schema that specifies column names and types. You can create LanceDB tables from these data formats:

  • Pandas DataFrames
  • Polars DataFrames
  • Apache Arrow Tables

The Python SDK additionally supports:

  • PyArrow schemas for explicit schema control
  • LanceModel for Pydantic-based validation

Create a LanceDB Table

Initialize a LanceDB connection and create a table

python
import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)
typescript
import * as lancedb from "@lancedb/lancedb";
import * as arrow from "apache-arrow";

const uri = "data/sample-lancedb";
const db = await lancedb.connect(uri);

LanceDB can ingest data from various sources: dict, list[dict], pd.DataFrame, pa.Table, or an Iterator[pa.RecordBatch]. Let's take a look at some of these.

From a list of tuples or dictionaries

python
data = [
    {"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
    {"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1},
]
db.create_table("test_table", data)
db["test_table"].head()
typescript
const _tbl = await db.createTable(
  "myTable",
  [
    { vector: [3.1, 4.1], item: "foo", price: 10.0 },
    { vector: [5.9, 26.5], item: "bar", price: 20.0 },
  ],
  { mode: "overwrite" },
);

From a Pandas DataFrame

python
import pandas as pd

data = pd.DataFrame(
    {
        "vector": [[1.1, 1.2, 1.3, 1.4], [0.2, 1.8, 0.4, 3.6]],
        "lat": [45.5, 40.1],
        "long": [-122.7, -74.1],
    }
)
db.create_table("my_table_pandas", data)
db["my_table_pandas"].head()
💡 Note
Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.
💡 Vector Column Type
The vector column must be a Vector type (represented as a pyarrow.FixedSizeList).

From a custom schema

python
import pyarrow as pa

custom_schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32(), 4)),
        pa.field("lat", pa.float32()),
        pa.field("long", pa.float32()),
    ]
)

tbl = db.create_table("my_table_custom_schema", data, schema=custom_schema)

From a Polars DataFrame

LanceDB supports Polars, a modern, fast DataFrame library written in Rust. As with Pandas, the Polars integration is powered by PyArrow under the hood. A deeper integration between LanceDB tables and Polars DataFrames is on the way.

python
import polars as pl

data = pl.DataFrame(
    {
        "vector": [[3.1, 4.1], [5.9, 26.5]],
        "item": ["foo", "bar"],
        "price": [10.0, 20.0],
    }
)
tbl = db.create_table("my_table_pl", data)

From an Arrow Table

You can also create LanceDB tables directly from Arrow tables, including columns that use the float16 data type.

python
import pyarrow as pa

import numpy as np

dim = 16
total = 2
schema = pa.schema(
    [pa.field("vector", pa.list_(pa.float16(), dim)), pa.field("text", pa.string())]
)
data = pa.Table.from_arrays(
    [
        pa.array(
            [np.random.randn(dim).astype(np.float16) for _ in range(total)],
            pa.list_(pa.float16(), dim),
        ),
        pa.array(["foo", "bar"]),
    ],
    ["vector", "text"],
)
tbl = db.create_table("f16_tbl", data, schema=schema)
typescript
const dim = 16;
const total = 10;
const f16Schema = new arrow.Schema([
  new arrow.Field("id", new arrow.Int32()),
  new arrow.Field(
    "vector",
    new arrow.FixedSizeList(dim, new arrow.Field("item", new arrow.Float16(), true)),
    false,
  ),
]);
const data = lancedb.makeArrowTable(
  Array.from(Array(total), (_, i) => ({
    id: i,
    vector: Array.from(Array(dim), Math.random),
  })),
  { schema: f16Schema },
);
const _table = await db.createTable("f16_tbl", data);

From Pydantic Models

When you create an empty table without data, you must specify the table schema. LanceDB supports creating tables by specifying a PyArrow schema or a specialized Pydantic model called LanceModel.

For example, the following Content model specifies a table with 5 columns: movie_id, vector, genres, title, and imdb_id. When you create a table, you can pass the class as the value of the schema parameter to create_table. The vector column is a Vector type, which is a specialized Pydantic type that can be configured with the vector dimensions. It is also important to note that LanceDB only understands subclasses of lancedb.pydantic.LanceModel (which itself derives from pydantic.BaseModel).

python
from lancedb.pydantic import Vector, LanceModel

import pyarrow as pa

class Content(LanceModel):
    movie_id: int
    vector: Vector(128)
    genres: str
    title: str
    imdb_id: int

    @property
    def imdb_url(self) -> str:
        return f"https://www.imdb.com/title/tt{self.imdb_id}"


tbl = db.create_table("movielens_small", schema=Content)

Nested schemas

Sometimes your data model may contain nested objects. For example, you may want to store the document string and the document source name as a nested Document object:

python
from pydantic import BaseModel

class Document(BaseModel):
    content: str
    source: str

This can be used as the type of a LanceDB table column:

python
class NestedSchema(LanceModel):
    id: str
    vector: Vector(1536)
    document: Document


tbl = db.create_table("nested_table", schema=NestedSchema)

This creates a struct column called “document” that has two subfields called “content” and “source”:

code
In [28]: tbl.schema
Out[28]:
id: string not null
vector: fixed_size_list<item: float>[1536] not null
    child 0, item: float
document: struct<content: string not null, source: string not null> not null
    child 0, content: string not null
    child 1, source: string not null

Validators

Note that neither Pydantic nor PyArrow automatically validates that input data is of the correct timezone, but this is easy to add as a custom field validator:

python
from datetime import datetime
from zoneinfo import ZoneInfo

from lancedb.pydantic import LanceModel
from pydantic import Field, field_validator, ValidationError, ValidationInfo

tzname = "America/New_York"
tz = ZoneInfo(tzname)

class TestModel(LanceModel):
    dt_with_tz: datetime = Field(json_schema_extra={"tz": tzname})

    @field_validator('dt_with_tz')
    @classmethod
    def tz_must_match(cls, dt: datetime) -> datetime:
        assert dt.tzinfo == tz
        return dt

ok = TestModel(dt_with_tz=datetime.now(tz))

try:
    TestModel(dt_with_tz=datetime.now(ZoneInfo("Asia/Shanghai")))
    assert 0 == 1, "this should raise ValidationError"
except ValidationError:
    print("A ValidationError was raised.")
    pass

When you run this code it should print “A ValidationError was raised.”

Pydantic custom types

LanceDB does NOT yet support converting Pydantic custom types. If you need this, please file a feature request on the LanceDB GitHub repo.

Using Iterators / Writing Large Datasets

When creating a table from a large dataset in one go, it is recommended to use an iterator that adds the data in batches. Unlike manually adding batches with table.add(), this does not create multiple versions of your dataset.

LanceDB additionally supports PyArrow’s RecordBatch Iterators or other generators producing supported data types.

Here's an example using a RecordBatch iterator to create a table.

python
import pyarrow as pa

def make_batches():
    for i in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array(
                    [[3.1, 4.1, 5.1, 6.1], [5.9, 26.5, 4.7, 32.8]],
                    pa.list_(pa.float32(), 4),
                ),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            ["vector", "item", "price"],
        )


schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32(), 4)),
        pa.field("item", pa.utf8()),
        pa.field("price", pa.float32()),
    ]
)
db.create_table("batched_table", make_batches(), schema=schema)

You can also use iterators of other types, such as Pandas DataFrames or Python lists, in the same way.

Open existing tables

If you forget the name of your table, you can always get a listing of all table names.

python
print(db.table_names())
typescript
console.log(await db.tableNames());

Then, you can open any existing tables.

python
tbl = db.open_table("test_table")
typescript
const tbl = await db.openTable("my_table");

Creating empty table

You can create an empty table for scenarios where you want to add data to the table later. An example would be when you want to collect data from a stream/external file and then add it to a table in batches.

An empty table can be initialized via a PyArrow schema.

python
import lancedb

import pyarrow as pa

schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32(), 2)),
        pa.field("item", pa.string()),
        pa.field("price", pa.float32()),
    ]
)
tbl = db.create_table("test_empty_table", schema=schema)
typescript
const schema = new arrow.Schema([
  new arrow.Field("id", new arrow.Int32()),
  new arrow.Field("name", new arrow.Utf8()),
]);

const emptyTbl = await db.createEmptyTable("empty_table", schema);

Alternatively, you can use Pydantic to specify the schema for the empty table. Note that we do not import pydantic directly; instead we import LanceModel from lancedb.pydantic, a subclass of pydantic.BaseModel that has been extended to support LanceDB-specific types like Vector.

python
import lancedb

from lancedb.pydantic import Vector, LanceModel

class Item(LanceModel):
    vector: Vector(2)
    item: str
    price: float


tbl = db.create_table("test_empty_table_new", schema=Item.to_arrow_schema())

Once the empty table has been created, you can add data to it via the various methods listed in the Adding to a table section.

Drop a table

Use the drop_table() method on the database to remove a table.

python
db.drop_table("my_table")
typescript
await db.dropTable("myTable");

This permanently removes the table and is not recoverable, unlike deleting rows. By default, an exception is raised if the table does not exist. To suppress this, pass ignore_missing=True.