Semantic Memory for Hermes Agent with LanceDB

Prashanth Rao

•

June 16, 2026

•

Engineering

Table of Contents

This is a title

This is a subtitle

Anyone who's worked with a personal agent for more than a few sessions tends to hit the same wall: it forgets. You explain your preferences, you state your desired style, you spell out the caveats behind a particular task, and then a few sessions later you find yourself explaining all of it over again. This happens because the agent's session context only holds so much before older details get compressed away to make room. Persisting the facts to flat files can only go so far.

An agent that can hold on to durable facts across sessions rapidly becomes one that you bank on for daily, repeatable tasks. The interesting question is where that memory should live and how it should work, and that's what we'll build toward in this post. We'll introduce the LanceDB memory plugin for Hermes Agent, an open-source AI agent framework that's rapidly gaining in popularity.

What Is Hermes Agent?

Hermes Agent is a self-hosted, open-source personal agent from Nous Research. Unlike many other well-known agent harnesses, Hermes isn't tied to a single interface: you can talk to it from a full terminal UI, or run a gateway process and reach the same agent from Telegram, Discord, Slack, and other messaging platforms. Every one of those surfaces funnels into one AIAgent loop that builds the prompt, calls your model of choice, dispatches tools, and persists each conversation to a local SQLite database. The agent stays platform-agnostic, and everything else plugs into it.

Hermes is already built around an agent that learns: it curates its own notes across sessions, writes and refines skills from experience, and can search back through past conversations. It also exposes a dedicated slot for external memory providers like LanceDB, which is the work we discuss in the rest of this post. Before we get there, it's worth seeing how Hermes already handles memory out of the box.

How Typical Hermes Users Deal With Memory Today

Out of the box, Hermes already remembers across sessions in two ways, and a third slot exists for going further:

Layer	What it is	Good for	Limitation
Curated memory	`MEMORY.md` (facts, conventions) and `USER.md` (your preferences), agent-written, frozen into the system prompt at session start	A handful of facts that should always be in context	Tiny budgets (~800 and ~500 tokens); not a growing history
Session search	Every conversation stored in local SQLite with FTS5; queried on demand	Fast, free "did we talk about X last week?" lookups	Lexical matching: misses paraphrases of what you originally typed
External memory providers	A pluggable slot that runs alongside the built-in memory	Deeper cross-session recall, semantic search, fact extraction, provenance, agent-callable memory tools	One provider active at a time

The first two are great at what they do, but because they rely on lexical search, neither provides good results for paraphrased queries during memory recall. That's the gap the third layer fills. By providing a hook to an external memory provider, Hermes Agent users can benefit from semantic search to retrieve relevant facts from long-term memory. The LanceDB memory plugin we introduce below is one such option, and it's built for local, embedded, semantic recall.

How Hermes Relates To Other Personal Agent Runtimes

If you've used OpenClaw or a similar personal-agent runtime, Hermes Agent will feel familiar. Like OpenClaw, Hermes agents are built to do real work rather than just chat or answer questions. You can easily wire any LLM up to the provided tools, messaging surfaces, and local state, and lean on skills as reusable procedural knowledge.

The key distinction between Hermes and OpenClaw is how Hermes draws its plugin extension boundaries. Hermes exposes clean, first-class slots and an abstract interface for external memory providers to hook into. Long-term memory in particular is an important add-on for many users that runs alongside the built-in session-based MEMORY.md/USER.md layer, and this is precisely the slot we're filling with LanceDB.

Hermes Agent Memory Providers

Concretely, a memory provider is a Python class implementing the MemoryProvider interface. Hermes initializes it once at session start, handing over the context it needs to scope storage: the active hermes_home, the platform, the session ID, and the workspace and user identity. From there, the provider hooks into the agent loop at a few well-defined points:

Hook	When it fires	What the provider can do
`system_prompt_block()`	Prompt assembly	Inject a static block telling the model its memory tools exist
`get_tool_schemas()` / `handle_tool_call()`	Init, then on use	Expose memory tools and respond when the agent calls them
`prefetch(query)`	Before each API call	Return recalled context for the upcoming turn
`sync_turn(user, assistant)`	After each turn	Persist the conversation (non-blocking)
`on_pre_compress(messages)`	Before context compression	Extract durable facts before they're summarized away
`on_session_end(messages)`	Conversation ends	Final extraction and flush

This contract allows exactly one external memory provider active at a given time (alongside the built-in session-based memory that's persisted to local files).

Why LanceDB Fits Agent Memory

LanceDB is an open-source, embedded retrieval library, and it's well-suited to serve as a memory store for personal agents like Hermes: there's no separate server to stand up, and it reads and writes a table on local disk. A single table holds your data, its metadata, and its embeddings together, and you can query that table by vector similarity, full-text (BM25) search, a hybrid of the two, or plain metadata filters. The same table abstraction scales up to larger LanceDB deployments later, so nothing about the local setup is a dead end.

Those capabilities line up almost one-to-one with what agent memory needs. A memory becomes a structured row, not just a text blob, so we can attach a category, tags, timestamps, and provenance alongside the content. Metadata filters let recall stay scoped to the right workspace and user instead of bleeding across contexts. Vector search is what survives paraphrasing, the exact failure mode that sank FTS-only session search earlier, while BM25 is still there for the cases where an exact name, ID, or piece of jargon matters more than meaning. And because it's embedded, the whole thing installs as a dependency rather than a service to operate.

The mental model is clean: Hermes owns the agent loop; LanceDB manages durable long-term memories and offers semantic recall.

Introducing The LanceDB Memory Plugin

The plugin slots seamlessly into an existing Hermes installation, and runs inside the Python process, with no remote memory server to operate. It does so by storing a single workspace-scoped table at ~/.hermes/lancedb/memories.lance. By default, it calls OpenAI's embeddings endpoint, but because it uses an OpenAI-compatible client, you can point it at any other compatible provider.

Here's how the pieces fit together:

An agent interacts with LanceDB memory through four tools that ship with this plugin:

lancedb_remember: stores a durable fact when the agent is explicitly asked to.
lancedb_recall: runs the actual search, returning the most relevant facts (vector by default, hybrid if you opt in).
lancedb_read: fetches a single memory by ID, optionally with the original conversation messages it was drawn from, so the agent can check where a fact came from.
lancedb_forget: deletes, but only after previewing candidates and confirming an exact ID, so nothing disappears by accident.

Beyond these tools, the plugin also captures durable facts from your conversations automatically, which we'll walk through in the next section.

Everything is stored in one memories.lance table. A kind column separates the extracted facts (fact rows) from the original conversation messages they were drawn from (turn rows), which are kept so the agent can trace any fact back to where it came from. Each row carries the content plus metadata: a category, tags, the originating session, the workspace and user it belongs to, a timestamp, and a content hash. That hash is what dedupes facts, so remembering the same thing twice doesn't pile up duplicate rows.

_💡_{What does "Workspace-scoped" mean?}_{Scoping of agent data happens at two levels. Each Hermes profile gets its own}_{hermes_home}_{, so a separate profile means a physically separate table on disk. Within one profile, memories from different agent workspaces share the same table but stay logically isolated: every row is tagged with its}_{agent_workspace}_{, and recall prefilters on it (the content hash folds it in too, so the same fact in two workspaces is two distinct rows). In practice, this means recall in one workspace never surfaces memories from another, and you get that separation without having to spin up a separate database for each project.}

The Memory Lifecycle: From Conversation To Durable Fact

So how does a passing comment in a chat become a fact the agent can recall weeks later? Here's the path it travels over a single session:

Session starts. Hermes activates the plugin and hands it the session details. The plugin opens (or creates) its LanceDB table and gets its search index ready.
The agent learns its tools. The plugin adds a short note to the system prompt telling the agent that memory exists and when to use each tool.
Before each reply, the agent checks memory. Hermes asks the plugin for anything relevant to the current message, so past facts can inform the response.
After each reply, the conversation is saved. The plugin stores the new messages in the background, so it never slows down the reply you're waiting on.
Facts are distilled. When the conversation is about to be compressed, and again when the session ends, the plugin sends the recent messages to a cheaper helper model that picks out the durable facts worth keeping.
Facts are stored. Each extracted fact is turned into an embedding, checked against existing rows so duplicates don't pile up, and saved as a fact row linked back to the messages it came from.

The result is that the store fills with two kinds of rows: compact, distilled facts, and the raw messages behind them. When the agent recalls something, it searches the facts first. That's deliberate: a handful of clean, durable facts is far more useful to the model than a pile of raw conversation snippets, which tend to bury the answer in noise. The original messages are still there as backing evidence and as a fallback when no fact matches, but the facts are what lead the process.

Retrieval Modes: Vector First, Hybrid When You Need It

What does recalling a memory actually need? The user rarely repeats the exact words from when a fact was first mentioned, so matching on keywords alone misses too much. What you'd want is semantic search: compare meaning rather than spelling, so "what's my preferred commit style?" can pull back a fact you phrased months ago as "I like conventional commits, scoped, lowercase."

So recall ships in vector mode by default: pure semantic search. If you also need exact name or jargon matching, you can switch to hybrid (vector + BM25) and choose how the two are fused: RRF, a vector-biased linear blend, or a cross-encoder reranker. The retrieval mode is set per call, fusion is a config setting, and the benchmark below puts all four head to head.

In the config, switching to hybrid with the default RRF fusion looks like this. If you'd trade a little latency for better ranking quality, the cross-encoder option (commented out) is also available.

# ~/.hermes/config.yaml
plugins:
  lancedb:
    retrieval:
      mode: hybrid          # vector (default) | hybrid
      reranker:
        type: rrf           # how the vector + BM25 legs are fused
        # Swap RRF for a reranking pass (pulls in sentence-transformers + torch):
        # type: cross-encoder
        # model: cross-encoder/ettin-reranker-17m-v1
        # rerank_top_n: 50

For now, the cross-encoder runs locally through sentence-transformers, defaulting to the compact 17M-parameter ettin reranker (you can point model at any Sentence Transformer cross-encoder). Support for other reranker backends, such as hosted rerank APIs, is left out in this initial release.

Benchmark

Setup: What We Measured

We ran LongMemEval-S, a long-conversation QA benchmark. Each case is a long multi-session chat history plus a question whose answer is buried somewhere in it. For every case the harness builds an isolated store, loads in all the conversation messages, retrieves the top 5 for the question, hands only those snippets to an answer model, and a judge model grades the answer against the reference. So this measures the retrieval substrate itself, how well each method finds the right messages, rather than the full extraction lifecycle.

This lets us put Hermes' built-in search side by side with the plugin's modes:

Variant	What it is
`hermes-session-search`	Baseline. Hermes' built-in SQLite FTS5 (BM25) search over raw messages
`lancedb-vector`	Pure semantic search (the shipped default)
`lancedb-hybrid-rrf`	Hybrid, fused with equal-weight Reciprocal Rank Fusion
`lancedb-hybrid-linear`	Hybrid, fused with a vector-biased linear blend
`lancedb-hybrid-cross-encoder`	Hybrid, then re-ranked by a cross-encoder model

The full LongMemEval-S set has 500 cases. Rather than run all of them, we sampled a stratified subset: 10 cases from each of the six question types, for 60 cases total (top-k 5), answered by gpt-5.4 and judged by gpt-5.4-mini, on the plugin's shipped defaults (text-embedding-3-small, a 17M-parameter cross-encoder). It's recommended to treat the results as illustrative. You can always reproduce it yourself on the full set with models of your choice using the benchmark harness in this repo.

Results: Semantic Recall Shows Gains

Variant	Accuracy	Recall@5	MRR@5	Query latency p50	Query latency p95
hermes-session-search	0.53	0.66	0.64	<0.01s	<0.01s
lancedb-vector	0.66	0.80	0.68	0.21s	1.05s
lancedb-hybrid-rrf	0.61	0.65	0.64	0.24s	0.93s
lancedb-hybrid-linear	0.61	0.72	0.68	0.25s	1.46s
lancedb-hybrid-cross-encoder	0.68	0.75	0.69	0.70s	3.06s

_{💡 Note on the latency numbers shown above}_{The latency includes}_{query embedding + search}_{. The actual vector search over this local LanceDB table is sub-millisecond; nearly all of the ~0.2-1s is the network round-trip to the embedding API, so it varies with your embedding provider and a faster endpoint shrinks it. The Hermes baseline is near-instant only because it makes no embedding call and uses pure FTS.}

A few things stand out:

Vector search clearly beats the lexical baseline. It surfaces the right messages even when the question is worded differently from the original conversation, exactly where keyword search fails. Accuracy and recall both step up.
Equal-weight hybrid (RRF) goes backwards. Treating keywords and meaning as equally important lets noisy lexical matches crowd out stronger semantic hits, dropping it below pure vector. Tilting the blend toward vectors (linear) wins most of that back.
The cross-encoder takes the quality crown. A reranking pass over the candidates lands the best accuracy and ranking, especially on temporal-reasoning questions, but it's the slowest and pulls in a heavy dependency, so it stays opt-in.

As expected, using a cross-encoder to rerank potential candidates does yield the best results, but at a more significant cost. Traditional vector search does better than lexical search on average.

Example Answer

To make things concrete, let's look at a single case from the benchmark that shows why semantic search helps with memory recall. In this case, the user asks:

What play did I attend at the local community theater?

Keyword search latches onto the message that literally contains "play" and "local community theater":

"I recently went to a play at the local community theater and was impressed..."

But that message never names the play, so the answer comes back "I don't know." Semantic search instead reads the intent and surfaces a different message, the one where the play is actually named:

"The play I attended was actually a production of The Glass Menagerie."

The agent answers correctly: The Glass Menagerie. The phrase that mattered never appears in the question, and no keyword links it to "play" or "theater." This shows the for semantic recall, in one example.

LongMemEval-S is a challenging benchmark with six question types. The per-type breakdown explains why the average accuracy, recall and MRR are not that high across them all:

Question type	session-search	vector	hybrid-rrf	hybrid-linear	cross-encoder
knowledge-update	0.70	0.78	0.56	0.67	1.00
multi-session	0.10	0.50	0.30	0.40	0.40
single-session-assistant	0.60	0.90	0.90	0.90	0.80
single-session-preference	0.20	0.30	0.40	0.10	0.10
single-session-user	0.80	0.90	0.90	1.00	0.90
temporal-reasoning	0.80	0.60	0.60	0.60	0.90

Single-session questions are easy for everyone (our play example is one of them); multi-session and preference questions are hard across the board − there could be additional room for Hermes' extraction lifecycle to help, which this benchmark harness doesn't exercise. Temporal reasoning is where the cross-encoder's reranking really shows its worth.

Usage: A Brief Intro

Getting a LanceDB memory store running in Hermes Agent takes only a few minutes. We'll do it inside an isolated profile called demo (the -p demo flag), so you can try everything here without touching your main Hermes setup. The same steps work in your default profile, just drop the -p demo.

First, install Hermes itself (skip this if you already have it installed):

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

Create the demo profile, then set it up with a model and API key and add the plugin to it. The profile has to exist first; -p demo won't conjure it on the fly:

# Create a new profile
hermes profile create demo
# Pick a model + API key
hermes -p demo setup
# Install the LanceDB memory plugin
hermes -p demo plugins install lancedb/hermes-agent-memory

The plugin needs a few Python packages, and they go into Hermes' own Python environment, because Hermes loads plugins inside its own interpreter, not a separate virtualenv you might have active. (This interpreter is shared across profiles, so you only install once.)

uv pip install --python ~/.hermes/hermes-agent/venv/bin/python lancedb openai pyyaml

The plugin turns your conversations into embeddings, so it needs an embedding API key. By default that's OpenAI, so set OPENAI_API_KEY in your environment or in the profile's ~/.hermes/profiles/demo/.env. (If you'd rather use a local or non-OpenAI endpoint, you can point the embedding client at any OpenAI-compatible URL in your config.)

Finally, switch memory on and pick this plugin:

hermes -p demo memory setup     # choose "lancedb"

Before you start chatting, verify the provider is actually active. This is an important step and shouldn't be skipped: if the plugin isn't switched on, Hermes quietly falls back to its built-in notes, making you wonder if memory "isn't working", even when it simply wasn't wired in. You can check the status of the memory plugins with:

hermes -p demo memory status

Memory status
────────────────────────────────────────
  Built-in:  always active
  Provider:  lancedb

  Plugin:    installed ✓
  Status:    available ✓

You want to see Provider: lancedb with both installed ✓ and available ✓. If Provider is blank or shows something else, re-run hermes -p demo memory setup and pick lancedb.

Once everything checks out, start a chat with hermes -p demo and the agent has its four memory tools to begin retrieving!

Demo: Teaching Hermes Your Project Conventions

Let's make this concrete with the pain point we opened with: re-explaining your setup every session. We'll save a convention once, then prove a brand-new session can recall it correctly.

Remember: First, ask Hermes to commit a convention to long-term memory. Saying "remember in long-term memory" makes sure it lands in the LanceDB store, which shows up as the ⚡ lancedb_r (lancedb_remember) line below. Hermes' built-in notes grab a copy too (the 🧠 memory line), since the two layers run side by side, but it's the LanceDB copy that powers durable semantic recall later:

● Remember in long-term memory: for this project I only use uv, never pip, and I always add type hints to Python functions.
Initializing agent...

────────────────────────────────────────
  ┊ 🧠 memory    +memory: "For this project, the user only uses uv for Python package management, never pip, and always adds type hints to Python functions."  0.0s
  ┊ ⚡ lancedb_r   0.0s
 ─  ⚕ Hermes  ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

     Remembered. I’ve stored that project convention: use uv only, never pip, and always add type hints to Python functions.

Isolate the plugin: Notice both Hermes' inbuilt store and the LanceDB memory grabbed the fact. The built-in note gets reloaded into the prompt at the start of every session, so a later recall could be answered from either store and we couldn't tell which did the work. To prove LanceDB can do it on its own, we can take the built-in layer out of the picture:

hermes -p demo config set memory.memory_enabled false

Now the LanceDB plugin is the only memory store in play.

Recall: Start a fresh session by typing /new and ask Hermes to consult its long-term memory. With built-in memory disabled, the memory can only come from LanceDB, and we see that it comes back clean: no repo files read, just the durable fact you saved:

● /new
● Look through your long-term memory for my Python project preferences.
────────────────────────────────────────

 ─  ⚕ Hermes  ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

     I found one durable memory about Python project preferences:

     - Project Python tooling preference is uv; pip is not used; type hints are required.

This was the whole promise of the LanceDB memory plugin in one exchange: a fresh session, with none of the earlier conversation in context, recalling a fact that now lives only in the LanceDB store.

Read: You can also ask where a fact came from. Here Hermes attributes the answer to its stored memory and quotes the saved convention back, rather than guessing from a file in the repo (under the hood, lancedb_read can also return the exact source messages a fact was distilled from):

● Where did that uv preference come from?
────────────────────────────────────────

 ─  ⚕ Hermes  ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

     It came from the project’s persistent memory, not from a codebase file I inspected.

     Specifically, the recalled memory says:
     - “For this project, the user only uses uv for Python package management, never pip, and always adds type hints to Python functions.”

     So I treated it as an established project convention. If that’s no longer true, I can update or remove it.

Forget: Say you change your project preferences at a later time. You can ask Hermes to drop the old fact, and the tool calls tell the whole story: the two ⚡ lancedb_f (lancedb_forget) lines are it previewing matches and then deleting, and the trailing ⚡ lancedb_r is it saving the replacement preference in the same breath:

● I've switched from uv to pixi. Forget the uv preference.
────────────────────────────────────────

  ┊ ⚡ lancedb_f uv preference python package management never pip type hints  0.3s
  ┊ ⚡ lancedb_f   0.0s
  ┊ ⚡ lancedb_r   0.2s
 ─  ⚕ Hermes  ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

     Done — I removed the uv preference from memory and replaced it with:

     - This project uses pixi for Python package management
     - Type hints on Python functions are still preferred

Remember, recall, read, forget: four small operations, and between them they cover the entire lifecycle of a durable memory. Note that we turned built-in memory off only to isolate the plugin for this walkthrough. You can flip it back on with hermes -p demo config set memory.memory_enabled true for everyday use, where the two layers happily coexist.

What To Watch During Real Use

Below, we list a few practical things to keep in mind while adding the LanceDB memory plugin to your daily workflow:

Extraction quality: Memory is only as good as what gets extracted. The auxiliary model that distills facts matters; point it at a capable (or cheaper) model to taste, and the agent is tuned to skip trivial, throwaway details.
Storage upkeep: Each save is a tiny write, which would fragment the table over time. The plugin runs background compaction automatically; leave maintenance enabled and you never think about it. Still, it's worth monitoring the overall size of the memories.lance table over time.‍
Privacy: The LanceDB table is local, but embeddings are sent to whatever serving endpoint you configure. If privacy and local-first really matter, point it at a local model (Ollama, vLLM) and nothing needs to leave your machine.‍
Deletion safety: forget always previews before it deletes and removes by exact ID, so the agent can't wipe the wrong memory on a vague request.

Memory As A Small, Durable Interface

Even though model providers may be racing toward ever-bigger context windows, the key fact still remains: raw context isn't really what makes an agent smarter and remember relevant facts. A small, deliberate set of retrieval operations helps narrow down on the facts, especially ones that were from a distant point in the past. By pointing to a long-term memory store, you can persist a durable fact, recall it when it's relevant, read back where it came from, and forget it when it's wrong or no longer relevant, over many sessions and longer time periods.

In an earlier post, we explained why LanceDB is a natural memory layer for OpenClaw. This post makes a similar case, but for Hermes Agent. Because LanceDB is open-source and embedded, it's lightweight and simple to get started with. Under the hood, it does semantic search to recall memories, with optional hybrid and reranking, and it keeps each memory as a structured row with metadata and a link back to its source. Hermes still owns the agent loop, while LanceDB manages the durable, ranked recall, and the plugin is the small piece of glue between them.

To try it, install the plugin, enable it with hermes memory setup, and run the kind of workflow we walked through. The repo that contains the plugin and the benchmark harness, is there if you'd like to dig in or reproduce the numbers yourself. If you'd like to see more features in the plugin, feel free to send over pull requests on the repo!

‍