stable-worldmodel-a-high-performance-platform-for-reproducible-world-model-research
Ayush Chaurasia
Quentin Lhoest
Lucas Maes
Quentin Le Lidec
reproducible-data-curation-in-the-multimodal-lakehouse
Prashanth Rao
newsletter-may-2026
ChanChan Mao
newsletter-april-2026
ChanChan Mao
how-lancedb-accelerates-vector-search-at-10-billion-scale
Yang Cen
opensearch-vs-lancedb-for-vector-search-query-cost-and-infrastructure
Justin Miller
volcano-engine-autonomous-driving-data-lake-solution
Kejian Ju
unifying-the-av-ml-stack-lancedb
Ayush Chaurasia
lance-json-support-why-you-might-not-really-need-variant
Jack Ye
building-a-storage-format-for-the-next-era-of-biology
Pavan Ramkumar
newsletter-march-2026
ChanChan Mao
smart-parsing-meets-sharp-retrieval-combining-liteparse-and-lancedb
Clelia Astra Bertelli
Prashanth Rao
lance-format-v2-2-benchmarks-half-the-storage-none-of-the-slowdown
Xuanwo
make-your-sql-workflows-multimodal-with-lancedb-x-duckdb
Prashanth Rao
agentic-coding-as-community-stewardship
Xuanwo
what-we-mean-by-multimodal
Prashanth Rao
ai-native-development-local-continue-lancedb
Ty Dunn
lance-file-format-2-2-taming-complex-data
Xuanwo
lance-blob-v2
Xuanwo
Jack Ye
openclaw-lancedb-memory-layer
Xuanwo
Prashanth Rao
openclaw-lancedb-seed2
LanceDB
openclaw-memory-from-zero-to-lancedb-pro
Prashanth Rao
upload-lance-datasets-to-hf-hub
Prashanth Rao
zero-shot-image-classification-with-vector-search
Vipul Maheshwari
werides-data-platform-transformation-how-lancedb-fuels-model-development-velocity
Qian Zhu
Fei Chen
training-a-variational-autoencoder-from-scratch-with-the-lance-file-format
LanceDB
track-ai-trends-crewai-agents-rag
LanceDB
tokens-per-second-is-not-all-you-need
Mingran Wang
Tan Li
the-future-of-open-source-table-formats-iceberg-and-lance
Jack Ye
the-case-for-random-access-i-o
LanceDB
series-a-funding
Chang She
semanticdotart
Ayush Chaurasia
second-dinners-secret-weapon-lancedb-powered-rag-for-faster-smarter-game-development
Qian Zhu
search-within-an-image-331b54e4285e
Kaushal Choudhary
scalable-computer-vision-with-lancedb-voxel51-d8b65066d5f6
LanceDB
rethinking-table-file-paths-lance-multi-base-layout
Jack Ye
rag-isnt-one-size-fits-all
Leonard Marcq
python-package-to-convert-image-datasets-to-lance-type
Vipul Maheshwari
one-million-iops
Weston Pace
november-feature-roundup
Will Jones
newsletter-september-2025
Jasmine Wang
newsletter-october-2025
Jasmine Wang
newsletter-november-2025
ChanChan Mao
newsletter-june-2025
David Myriel
newsletter-july-2025
Jasmine Wang
newsletter-january-2026
ChanChan Mao
newsletter-february-2026
ChanChan Mao
newsletter-december-2025
ChanChan Mao
newsletter-august-2025
Jasmine Wang
my-summer-internship-experience-at-lancedb-2
Raunak Sinha
my-simd-is-faster-than-yours-fb2989bf25e7
LanceDB
multimodal-myntra-fashion-search-engine-using-lancedb
LanceDB
multimodal-lakehouse
David Myriel
multi-document-agentic-rag-a-walkthrough
Vipul Maheshwari
modified-rag-parent-document-bigger-chunk-retriever-62b3d1e79bc6
Mahesh Deshwal
memgpt-os-inspired-llms-that-manage-their-own-memory-793d6eed417e
Ayush Chaurasia
late-interaction-efficient-multi-modal-retrievers-need-more-than-just-a-vector-index
Ayush Chaurasia
lancedb-x-continue
LanceDB
lance-x-huggingface-a-new-era-of-sharing-multimodal-data
Prashanth Rao
Quentin Lhoest
Xuanwo
Ayush Chaurasia
lance-x-duckdb-sql-retrieval-on-the-multimodal-lakehouse-format
Xuanwo
lance-windows-windows-lance
Chang She
lance-v2
Weston Pace
lance-namespace-lancedb-and-ray
Jack Ye
lance-file-2-1-stable
Weston Pace
lance-file-2-1-smaller-and-simpler
Weston Pace
lance-data-viewer
Gordon Murray
lance-community-governance
Jack Ye
introducing-lance-namespace-spark-integration
Jack Ye
implementing-corrective-rag-in-the-easiest-way-2
LanceDB
hybrid-search-rag-for-real-life-production-grade-applications-e1e727b3965a
Mahesh Deshwal
hybrid-search-combining-bm25-and-semantic-search-for-better-results-with-lan-1358038fe7e6
LanceDB
hybrid-search-and-custom-reranking-with-lancedb-4c10a6a3447e
LanceDB
how-to-reduce-hallucinations-from-llm-powered-agents-using-long-term-memory-72f262c3cc1f
Tevin Wang
guide-to-use-contextual-retrieval-and-prompt-caching-with-lancedb
LanceDB
grpo-understanding-and-fine-tuning-the-next-gen-reasoning-model-2
Mahesh Deshwal
graphrag-hierarchical-approach-to-retrieval-augmented-generation
Akash Desai
gpu-accelerated-indexing-in-lancedb-27558fa7eee5
LanceDB
geo-support
Jack Ye
geneva-twelvelabs
David Myriel
geneva-feature-engineering
Jonathan Hsieh
from-bi-to-ai-lance-and-iceberg
Jack Ye
Prashanth Rao
fluss-integration
Wayne Wang
file-readers-in-depth-parallelism-without-row-groups
Weston Pace
feature-rabitq-quantization
David Myriel
Yang Cen
feature-full-text-search
David Myriel
enhance-rag-integrate-contextual-compression-and-filtering-for-precision-a29d4a810301
Kaushal Choudhary
effortlessly-loading-and-processing-images-with-lance-a-code-walkthrough
LanceDB
designing-a-table-format-for-ml-workloads
Weston Pace
custom-dataset-for-llm-training-using-lance
LanceDB
creating-a-fintech-agent
Vipul Maheshwari
convert-any-image-dataset-to-lance
LanceDB
columnar-file-readers-in-depth-structural-encoding
Weston Pace
columnar-file-readers-in-depth-repetition-definition-levels
Weston Pace
columnar-file-readers-in-depth-compression-transparency
Weston Pace
columnar-file-readers-in-depth-column-shredding
Weston Pace
columnar-file-readers-in-depth-backpressure
Weston Pace
columnar-file-readers-in-depth-apis-and-fusion
Weston Pace
chunking-techniques-with-langchain-and-llamaindex
Prashant Kumar
chunking-analysis-which-is-the-right-chunking-approach-for-your-language
Shresth Shukla
chat-with-csv-excel-using-lancedb
LanceDB
case-study-netflix
David Myriel
case-study-dosu
Qian Zhu
Michael Ludden
case-study-cognee
David Myriel
Vasilije Markovic
case-study-coderabbit
Qian Zhu
building-rag-on-codebases-part-2
Sankalp Shubham
building-rag-on-codebases-part-1
Sankalp Shubham
branching-and-shallow-clone
Jack Ye
better-rag-with-active-retrieval-augmented-generation-flare-3b66646e2a9f
LanceDB
benchmarking-random-access-in-lance
Chang She
benchmarking-lancedb-92b01032874a-2
LanceDB
benchmarking-cohere-reranker-with-lancedb
LanceDB
anythingllms-competitive-edge-lancedb-for-seamless-rag-and-agent-workflows
Ayush Chaurasia
announcing-lance-sdk
Weston Pace
agentic-rag-using-langgraph-building-a-simple-customer-support-autonomous-agent
LanceDB
advanced-rag-precise-zero-shot-dense-retrieval-with-hyde-0946c54dfdcb
LanceDB
accelerate-vector-search-applications-using-openvino-lancedb
LanceDB
a-primer-on-text-chunking-and-its-types-a420efc96a13
Prashant Kumar
a-practical-guide-to-training-custom-rerankers
Ayush Chaurasia
a-practical-guide-to-fine-tuning-embedding-models
Ayush Chaurasia
keep-your-data-fresh-with-cocoindex-and-lancedb
Prashanth Rao
Linghua Jin

A Primer on Text Chunking and Its Types

October 24, 2023
Community

Text chunking is a technique in natural language processing that divides text into smaller segments, usually based on the parts of speech and grammatical meanings of the words. Text chunking can help extract important information from a text, such as noun phrases, verb phrases, or other semantic units.

In this blog, We’ll see some Text Chunking strategies, why they are important for building LLM-based systems like RAG, and How to use them.

Why is Text Chunking Important?

There are various reasons why text chunking becomes important when working with LLMs, and in this post, we’ll share a few examples that have a significant impact on the results.

Say you have a document of 15 pages full of text and you want to perform summarization and question-answering on the document. The first and foremost step is to extract embeddings of the full document. It’s from here that all the problems related to text processing begin.

  1. When you extract embeddings for an entire document at once, you may capture the overall context, but you risk losing valuable information about specific topics. This can lead to imprecise results and missing details when working with large language models (LLMs).
  2. Regardless of the embedding model provider, you need to be aware of their context window limit, to pick the right chunk size. Although OpenAI GPT-4 has a 32K token window, which sounds reasonably large, it’s still a good practice to think about the right chunk size from the outset.

Not using text chunking correctly when it’s needed can cause problems that impact retrieval quality, which further affects the downstream task. If the chunk size is too large, the relevant piece of information in the retrieved chunk may be buried under a mountain of other irrelevant text. If the chunk size is too small, it can scatter the relevant information across several pieces of retrieved information, in a way that order isn’t preserved (which affects semantics).

Text Chunking Strategies

There are different Text Chunking Strategies, Here we’ll discuss them with their strengths and weaknesses and the right scenarios where they can be applied.

Sentence Splitting Using Classical Methods

1. Naive method

The most naive method of sentence splitting is using a split function that splits text on a given character.

text = "content.of.document." #input text
chunks = text.split(".")
print(chunks)

This gives:

['content', 'of', 'document', '']

2. NLTK text splitter

NLTK is a library used for working with Language data. It provides a sentence tokenizer that can split the text into sentences, helping to create more meaningful chunks.

import nltk

input_text ="Much of refactoring is devoted to correctly composing methods. In most cases, excessively long methods are the root of all evil. The vagaries of code inside these methods conceal the execution logic and make the method extremely hard to understand"

sentences = nltk.sent_tokenize(input_text)
print(sentences)

Here, we have not used any character to split sentences. Further, many other chunking techniques can be used, like tokenizing, POS tagging, etc.

Output:

['Much of refactoring is devoted to correctly composing methods.',
    'In most cases, excessively long methods are the root of all evil.',
    'The vagaries of code inside these methods conceal the execution logic and make the method extremely hard to understand']

3. SpaCy text splitter

spaCy is a popular NLP library is used for performing various tasks of NLP. The text splitter in spaCy creates text splits, preserving contexts of resultant splits.

import spacy
input_text ="Much of refactoring is devoted to correctly composing methods. In most cases, excessively long methods are the root of all evil. The vagaries of code inside these methods conceal the execution logic and make the method extremely hard to understand"

nlp = spacy.load("en_core_web_sm")
doc = nlp(input_text)
for s in doc.sents:
    print(s)

Output:

Much of refactoring is devoted to correctly composing methods.
In most cases, excessively long methods are the root of all evil.
The vagaries of code inside these methods conceal the execution logic and make the method extremely hard to understand

spaCy uses tokenization and a dependency parser under the hood and offers more sophisticated techniques for detecting sentence boundaries for splitting.

Recursive Splitting

Recursive Splitting splits the input text into small chunks in iterative matter using a set of separators. If, in the starting steps, chunks are not created of the desired size, it will recursively try different separators or criteria to achieve the desired size of the chunk.

Here is an example of using Recursive Splitting using LangChain.

# input text
input_text ="Much of refactoring is devoted to correctly composing methods. In most cases, excessively long methods are the root of all evil. The vagaries of code inside these methods conceal the execution logic and make the method extremely hard to understand"

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100, #set desired text size
    chunk_overlap  = 20 )

chunks = text_splitter.create_documents([input_text])
print(chunks)

Output:

[Document(page_content='Much of refactoring is devoted to correctly composing methods. In most cases, excessively long'),
Document(page_content='excessively long methods are the root of all evil. The vagaries of code inside these methods'),
Document(page_content='these methods conceal the execution logic and make the method extremely hard to understand')]

Specialized Structured Splitting

1. HTML Text Splitter

HTML Splitter is a structure-aware chunker that splits text at the HTML element level and adds metadata for each header relevant to any given chunk.

Given this input HTML string:

html_string ="""
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

We can process it as follows:

from langchain.text_splitter import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text(html_string)
print(html_header_split)

The HTML header will extract only the headers mentioned in header_to_split_on.

[Document(page_content='Foo'),
Document(page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),
Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),
Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),
Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),
Document(page_content='Baz', metadata={'Header 1': 'Foo'}),
Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),
Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})]

2. Markdown text splitting

Markdown Splitting is used to chunk based on Markdown syntax like heading, bash code, images, and lists. It can also structure-aware chunker.

#input markdown string
markdown_text = '# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly'

from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_text)
print(md_header_splits)

MarkdownHeaderTextSplitter splits markdown text based on headers_to_split_on.

Output:

[Document(page_content='Hi this is Jim\nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]

3. LaTex text splitting

LaTex text splitting is another code-splitting chunker that parses LaTex commands to create chunks that are the logical organization, like sections and subsections, leading to more accurate and contextually relevant results.

#input latex string
latex_text = """
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""

from langchain.text_splitter import LatexTextSplitter
latex_splitter = LatexTextSplitter(chunk_size=100, chunk_overlap=0)

latex_splits = latex_splitter.create_documents([latex_text])
print(latex_splits)

In the above example you can see that *overlap is 0, *it is because when we are working with code splits at that time overlapping codes totally change the meaning of it. So overlapping should be 0.

[Document(page_content='\\documentclass{article}\n\n\x08egin{document}\n\n\\maketitle\n\n\\section{Introduction}\nLarge language models'),
    Document(page_content='(LLMs) are a type of machine learning model that can be trained on vast amounts of text data to'),
    Document(page_content='generate human-like language. In recent years, LLMs have made significant advances in a variety of'),
    Document(page_content='natural language processing tasks, including language translation, text generation, and sentiment'),
    Document(page_content='analysis.\n\n\\subsection{History of LLMs}\nThe earliest LLMs were developed in the 1980s and 1990s,'),
    Document(page_content='but they were limited by the amount of data that could be processed and the computational power'),
    Document(page_content='available at the time. In the past decade, however, advances in hardware and software have made it'),
    Document(page_content='possible to train LLMs on massive datasets, leading to significant improvements in'),
    Document(page_content='performance.\n\n\\subsection{Applications of LLMs}\nLLMs have many applications in industry, including'),
    Document(page_content='chatbots, content creation, and virtual assistants. They can also be used in academia for research'),
    Document(page_content='in linguistics, psychology, and computational linguistics.\n\n\\end{document}')]

Now when you choose which chunker to use for your data, extract embeddings for them and store them in Vector DB.

In this post, we’ve seen numerous examples of using chunking with LanceDB to store text chunks and their respective embeddings. LanceDB is an easy-to-setup, open source vector database that persists your data and vector index on disk, allowing you to scale without breaking the bank. It is also well-integrated with the Python data ecosystem so you can use it with your existing data tools like pandas, pyarrow, and more.

See this notebook for the code that performs these tasks.

Conclusion

Text chunking is a relatively straightforward task, but it’s also important to study how it works, because it presents certain challenges and tradeoffs. There’s no single strategy that works universally, nor a chunk size that suits every scenario. What works for one type of data or solution might not be suitable for others.

Hopefully, this post reinforced the importance of text chunking and showed you some new approaches!

Learn more about LanceDB or applied GenAI applications from vectordb-recipes. Don’t forget to drop us a 🌟!

Stable-Worldmodel: A High Performance Platform for Reproducible World Model Research

Ayush Chaurasia
Quentin Lhoest
Lucas Maes
Quentin Le Lidec
June 2, 2026
stable-worldmodel-a-high-performance-platform-for-reproducible-world-model-research

🌍 Lance-Backed World Model Platform, 🦆 Multimodal SQL with Lance DuckDB Extension, 💰 LanceDB vs OpenSearch Cost Breakdown

ChanChan Mao
May 28, 2026
newsletter-may-2026

Reproducible Data Curation In The Multimodal Lakehouse

Prashanth Rao
May 29, 2026
reproducible-data-curation-in-the-multimodal-lakehouse