
Modified RAG: Parent Document & Bigger Chunk Retriever

December 15, 2023
Engineering

If you’re interested in improving the retrieval accuracy of RAG pipelines, you should also check out the re-ranking post.

What’s it about?

Sometimes your users want a task done after providing just a couple of lines of input or, even worse, a couple of words. In this example, let’s say I have a “sequel” song generation task that takes a line or two as input. If the output is Part 2 of something, its tone, writing style, and story should follow the previous song. So given the line “I am whatever I am”, my LLM should generate something related to that one song, not a mixture of 10 different songs and artists. With a vanilla RAG setup, you’d get multiple results that might not come from the same song, artist, or even genre. If you use only the first match, you lose a lot of context, because a small chunk can’t carry the full context of the song.

Solution?

There are two approaches to tackle this. Let’s go through them one by one, from theory to code, starting with the Parent Document Retriever.

Parent Document retriever diagram

Given a text, first find the most related chunk (you can also fetch the top N chunks and apply additional logic, such as counting how many of them belong to each document). Then, instead of passing that chunk, fetch the parent document the chunk came from and pass that to the LLM as context. Let’s jump to the code quickly. Install the packages and bring in all the required imports, such as LanceDB and LangChain. For the embedding function, I’m using a BAAI encoder, but you can use any one.

# Shell command (prefix with "!" in a notebook cell)
pip install -U "langchain==0.0.344" openai tiktoken lark datasets sentence_transformers FlagEmbedding lancedb -qq

from langchain.vectorstores import LanceDB
from langchain.retrievers import ParentDocumentRetriever

# Text Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.docstore.document import Document

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

import os
from datasets import load_dataset

from langchain.embeddings import HuggingFaceBgeEmbeddings
import lancedb

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"  # Needed if you run LLM experiment below

# Embedding Functions
model_name = "BAAI/bge-small-en-v1.5"  # Open Source and effective embedding
encode_kwargs = {"normalize_embeddings": True}  # set True to compute cosine similarity
bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={"device": "cuda"},  # use "cpu" if no GPU is available
    encode_kwargs=encode_kwargs,
)

# Data Chunking Functions
# NOTE: chunk_size is measured in characters by default, not tokens
small_chunk_splitter = RecursiveCharacterTextSplitter(chunk_size=512)  # Split documents into small chunks
big_chunk_splitter = RecursiveCharacterTextSplitter(chunk_size=2048)   # Another level of bigger chunks

# LanceDB Connection. Load if exists else create
my_db = lancedb.connect("./my_db")

For the demo data, I’m using the Eminem song lyrics dataset available here. Load the data, split it into chunks, and create embeddings for them. Save all of them into a LanceDB table.

# Load a sample data here
long_texts = load_dataset("huggingartists/eminem")["train"].to_pandas().sample(100)["text"]  # Data of huge context length. Use 100 random examples for demo

# Convert to LangChain Document object
# Document only has page_content and metadata fields, so the ID is kept in metadata
docs = [Document(page_content=content, metadata={"doc_id": _id}) for (_id, content) in enumerate(long_texts)]

if "small_chunk_table" in my_db.table_names():
    small_chunk_table = my_db.open_table("small_chunk_table")
else:  # NOTE: 384 is the size of BAAI Embedding and -999 because it's a dummy data so invalid Embedding
    small_chunk_table = my_db.create_table("small_chunk_table", data=[{"vector": [-999] * 384, "text": "", "doc_id": "-1"}], mode="overwrite")

small_chunk_table.delete('doc_id = "-1"')

vectorstore = LanceDB(small_chunk_table, bge_embeddings)  # Index child chunks
store = InMemoryStore()  # Storage layer for the parent documents

full_doc_retriever = ParentDocumentRetriever(vectorstore=vectorstore, docstore=store, child_splitter=small_chunk_splitter)

full_doc_retriever.add_documents(docs, ids=None)  # Add all the documents
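
Conceptually, the retriever set up above indexes small child chunks but hands back the document they came from. Here is a minimal, library-free sketch of that idea, assuming a toy word-overlap score in place of real embeddings and a naive word-based splitter. The function names (`split_into_chunks`, `overlap_score`, `build_child_index`, `retrieve_parent`) and the two made-up "songs" are hypothetical and not part of LangChain or LanceDB. It also illustrates the "fetch N, then count" variation mentioned earlier.

```python
from collections import Counter


def split_into_chunks(text, words_per_chunk):
    """Naive splitter: group words into fixed-size chunks (stand-in for a real text splitter)."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def overlap_score(query, chunk):
    """Toy similarity: number of distinct lowercase words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def build_child_index(parents, words_per_chunk):
    """Return (parent_id, child_chunk) pairs so every chunk remembers its parent."""
    return [
        (parent_id, chunk)
        for parent_id, text in parents.items()
        for chunk in split_into_chunks(text, words_per_chunk)
    ]


def retrieve_parent(query, parents, index, top_n=3):
    """Rank child chunks, then return the parent that owns most of the top-N hits."""
    ranked = sorted(index, key=lambda pair: overlap_score(query, pair[1]), reverse=True)
    votes = Counter(parent_id for parent_id, _ in ranked[:top_n])
    best_parent_id, _ = votes.most_common(1)[0]
    return best_parent_id, parents[best_parent_id]


# Hypothetical toy data, standing in for full song lyrics
parents = {
    "song_a": "sunrise paints golden hills while rivers wander toward distant oceans",
    "song_b": "midnight train leaves station while rain keeps falling down tonight",
}
index = build_child_index(parents, words_per_chunk=4)
parent_id, parent_text = retrieve_parent("rain falling station", parents, index)
print(parent_id)
print(parent_text)
```

The matching happens on the small chunks, but the caller only ever sees the full parent text, which is the behaviour the real retriever provides with embeddings and a docstore.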

Now that everything is set up, let’s retrieve some text. We’ll first retrieve a smaller chunk and then move on to fetching the parent document itself.

# Fetch 3 most similar Smaller Documents
sub_docs = vectorstore.similarity_search("I am whatever you say I am and if I wasn't why would you say I am", k=3)

print(sub_docs[0].page_content)  # This is a smaller chunk

full_docs = full_doc_retriever.get_relevant_documents("I am whatever you say I am and if I wasn't why would you say I am", k=3)
print(full_docs[0].page_content)  # Parent document returned after matching the smaller chunks internally

Girls Lyrics
Ayo, dawg, I got some *** on my ****** chest
That I need to get off cause if I dont
Ima ****** explode or somethin
Now, look, this is the story about......

If you compare the two outputs, you’ll see that the first result of each call is the child chunk and its parent document, respectively. You can use the parent document for the task described. But there might be a problem with that: what if your parent documents are very large and your LLM has a short context length? Summarising, the good old fallback, doesn’t really work for lyrics, does it? Fortunately, there’s a middle ground.

Bigger Chunk Retrieval

To get around the problem of large parent documents, you can create bigger chunks alongside the smaller ones. For example, if your smaller chunks are 512 tokens and your parent documents are 2048 tokens on average, you can make chunks of size 1024. During retrieval, matching works exactly as before, but this time, instead of the parent document, the retriever fetches the bigger chunk and passes it to the LLM. This way you’ll lose some text for sure, but not all of it. You could use 2 verses instead of the original 4 to help the model understand the writing style and context while staying within limits. The good thing is that you only have to change one line from the previous setup.

if "big_chunk_table" in my_db.table_names():
    big_chunk_table = my_db.open_table("big_chunk_table")
else:
    big_chunk_table = my_db.create_table(
        "big_chunk_table",
        data=[{"vector": [-999] * 384, "text": "", "doc_id": "-1"}],
        mode="overwrite",
    )

big_chunk_table.delete('doc_id = "-1"')

vectorstore = LanceDB(big_chunk_table, bge_embeddings)
store = InMemoryStore()

big_chunk_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=small_chunk_splitter,
    parent_splitter=big_chunk_splitter,  # retrieves the larger chunk instead of Parent Document
)

big_chunk_retriever.add_documents(docs, ids=None)  # Add all the documents

big_chunks_docs = big_chunk_retriever.get_relevant_documents(
    "I am whatever you say I am and if I wasn't why would you say I am", k=3
)
print(big_chunks_docs[0].page_content)  # BIG chunks (in place of Parent Document)

.........But as soon as someone calls you out
You put your tail between your legs and bow down
Now, I dont ask nobody to share my beliefs
To be involved in my beefs
Im a man, I can stand on my feet
So if you dont wanna be in em, all I ask
Is that you dont open your mouth with an opinion
And I wont put you in em
Cause I dont ask nobody to share my beliefs...........
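
To see why the bigger chunk comes back instead of the whole document, here is a small, library-free sketch of two-level chunking. It assumes a naive word-based splitter and a toy word-overlap match rather than real embeddings; the helper names (`split_words`, `build_two_level_index`, `retrieve_big_chunk`) and the sample text are hypothetical, not LangChain or LanceDB APIs.

```python
def split_words(text, words_per_chunk):
    """Naive splitter: group words into fixed-size chunks."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def build_two_level_index(text, big_size, small_size):
    """Split text into big chunks, then split each big chunk into small ones.

    Returns (big_chunks, small_index), where small_index holds
    (big_chunk_id, small_chunk) pairs.
    """
    big_chunks = split_words(text, big_size)
    small_index = [
        (big_id, small_chunk)
        for big_id, big_chunk in enumerate(big_chunks)
        for small_chunk in split_words(big_chunk, small_size)
    ]
    return big_chunks, small_index


def retrieve_big_chunk(query, big_chunks, small_index):
    """Match the query against small chunks, but return the enclosing big chunk."""
    query_words = set(query.lower().split())
    best_big_id, _ = max(
        small_index,
        key=lambda pair: len(query_words & set(pair[1].lower().split())),
    )
    return big_chunks[best_big_id]


# Hypothetical stand-in for one long parent document
lyrics = (
    "verse one opens softly then verse two builds tension before "
    "verse three resolves everything and verse four fades quietly"
)
big_chunks, small_index = build_two_level_index(lyrics, big_size=10, small_size=5)
result = retrieve_big_chunk("builds tension", big_chunks, small_index)
print(result)
```

Only the small chunks are searched, so matching stays precise, while the returned context is capped at the big chunk size rather than the full document length.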

Using it with OpenAI

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=big_chunk_retriever,
)

query = "I am whatever you say I am and if I wasn't why would you say I am? So who is Em?"
qa.run(query)

You can find the whole code (and a lot more examples like this) HERE.

Until next time, happy parenting :)

Bigger chunk retriever diagram
Mahesh Deshwal
ML Engineer and researcher specializing in reinforcement learning, fine-tuning techniques, and practical AI applications.
