I built a self-hosted memory system for my AI tools

ai elixir self-hosted open-source

Open Brain — self-hosted semantic memory system

Every AI assistant has the same flaw: it forgets everything the moment the session ends. You spend time giving it context — your setup, your preferences, what you decided last week — and the next day, gone. Blank slate.

Context windows keep getting bigger, which helps. But they don’t solve the problem. They just delay it. Stuffing the same background into every prompt gets old fast, and at some point the context is so full of repeated setup that you’re wasting tokens on things that shouldn’t need restating.

I got tired of workarounds. So I built something.

What Open Brain is

Open Brain is a self-hosted semantic memory service. It’s an Elixir/Phoenix API backed by Postgres with pgvector. I run it on a home server behind Tailscale. When I want an AI tool to remember something — a project decision, a config detail, a preference — I store it. When I need it back, I search for it semantically, and it surfaces the right memory even if my search terms don’t match the original wording exactly.

The important detail: all embeddings are computed locally using Ollama with nomic-embed-text. 768-dimensional vectors. Nothing gets sent to an external API. When you’re storing notes about your own infrastructure and personal decisions, that matters.

Why this stack

I’ve been writing Elixir for a few years and it was the obvious choice for a small, long-running API service. Reliable, low overhead, easy to reason about. Elixir apps tend to just stay up without a lot of babysitting. That’s what I wanted here.

Postgres with pgvector is the less obvious choice that I’d make again without hesitation. You’re already running Postgres for something. pgvector gives you cosine similarity search inside that same database, with ivfflat indexing that holds up well for personal-scale memory stores. No separate vector database, no additional service to maintain. One less thing.
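To make the similarity search concrete: pgvector's `<=>` operator returns cosine distance, which is just 1 minus cosine similarity. A small Python sketch of the same math, plus the shape of query it enables (the table and column names here are illustrative, not Open Brain's actual schema):

```python
import math

def cosine_distance(a, b):
    """The quantity pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Illustrative SQL only -- not Open Brain's real schema.
NEAREST_MEMORIES = """
SELECT id, content, embedding <=> $1 AS distance
FROM memories
ORDER BY embedding <=> $1
LIMIT 5;
"""
```

An ivfflat index on the embedding column makes that `ORDER BY ... LIMIT` query an approximate nearest-neighbor lookup instead of a full scan, which is what keeps it fast at personal scale.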

Ollama running locally handles the embedding generation. nomic-embed-text is good at this — it’s a purpose-built embedding model, not a general-purpose LLM, so it’s fast and cheap to run. On an M3 or a capable mini PC, the latency is low enough that you don’t notice it.
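Getting an embedding out of a local Ollama instance is one HTTP call to its `/api/embeddings` endpoint. A minimal sketch using only the standard library; the base URL is Ollama's default, and the helper names are mine, not Open Brain's:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's default port

def embed_request(text: str) -> dict:
    """Build the request body Ollama expects for an embedding call."""
    return {"model": "nomic-embed-text", "prompt": text}

def embed(text: str) -> list:
    """POST to the local Ollama instance; returns the 768-dim vector."""
    body = json.dumps(embed_request(text)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```

Nothing in that round trip leaves the machine, which is the whole point.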

The combination means the system is entirely self-contained. Embeddings happen on device. Retrieval happens on device. The only network involved is my private Tailscale network.

How memories are organized

This part matters more than it might seem.

Every memory has a namespace: personal, health, family, infra, ideas, projects/open-brain, and so on. When I’m searching for infrastructure facts, I’m not swimming through health notes. The scope makes retrieval faster and cleaner.

Tags go on top of that. I can store something in the projects/mission-control namespace, tag it decision and auth, and later retrieve exactly that slice if I need it. Each memory also records where it came from, which embedding model generated the vector, and timestamps for both creation and last update.
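The record described above can be modeled as a small structure. This is a sketch of the shape, not Open Brain's actual schema; field names are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

def _now() -> datetime:
    return datetime.now(timezone.utc)

@dataclass
class Memory:
    content: str
    namespace: str                          # e.g. "infra" or "projects/open-brain"
    tags: list = field(default_factory=list)
    source: str = "manual"                  # where the memory came from
    embedding_model: str = "nomic-embed-text"
    inserted_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)

m = Memory("Switched auth to magic links", "projects/mission-control",
           tags=["decision", "auth"])
```

Namespace scopes the search; tags slice within it; the rest is provenance you'll be glad you kept.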

It’s structured data. That discipline is boring but it pays off when you’re actually using the thing every day rather than just admiring the architecture.

Hybrid search

Pure semantic search has an edge case problem. It’s great for conceptual retrieval — “what did I decide about authentication?” surfaces the right memory even if I phrased it differently when I stored it. But for specific proper nouns, exact strings, or narrow technical details, vector similarity can miss.

Open Brain runs semantic search first. If that comes up short, it falls back to keyword search against the stored content. Most of the time the semantic search wins. Occasionally the keyword fallback catches something the vector search would have ranked too low. The combination means fewer frustrated “I know I stored this somewhere” moments.
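The semantic-first, keyword-fallback flow can be sketched in a few lines. The distance threshold and function names here are illustrative assumptions, not the project's actual values:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a))
                        * math.sqrt(sum(x * x for x in b)))

def semantic_search(query_vec, memories, max_distance=0.5):
    """Rank by cosine distance; keep only sufficiently close matches."""
    scored = sorted(memories, key=lambda m: cosine_distance(query_vec, m["embedding"]))
    return [m for m in scored
            if cosine_distance(query_vec, m["embedding"]) <= max_distance]

def keyword_search(query, memories):
    """Fallback: case-insensitive substring match on stored content."""
    return [m for m in memories if query.lower() in m["content"].lower()]

def hybrid_search(query, query_vec, memories):
    # Semantic first; fall back to keywords only when nothing scores well enough.
    return semantic_search(query_vec, memories) or keyword_search(query, memories)
```

The `or` is doing the work: an empty semantic result triggers the keyword pass, otherwise the keyword index is never touched.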

The MCP server

This is the part that made everything click.

Open Brain has a Model Context Protocol server at /mcp. It exposes four tools: add_memory, search_memories, list_memories, delete_memory. Any MCP-capable client — Claude Code, local Ollama agents, OpenAI-compatible tools — can connect and use the same memory store.
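Under the hood, MCP tool calls are JSON-RPC 2.0 messages with a `tools/call` method, per the MCP spec. A sketch of the envelope a client would send to invoke `add_memory` (the argument names are my guess at the shape, not Open Brain's documented parameters):

```python
import json

def tool_call(request_id, tool, arguments):
    """JSON-RPC 2.0 envelope for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }

msg = tool_call(1, "add_memory",
                {"content": "Moved DNS to the new server", "namespace": "infra"})
print(json.dumps(msg, indent=2))
```

The client library handles this framing for you; the point is that the four tools are plain named operations any JSON-RPC-speaking client can reach.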

The client config is straightforward:

{
  "open-brain": {
    "type": "http",
    "url": "http://10.0.0.1:8765/mcp",
    "headers": {
      "Authorization": "Bearer your-api-key"
    }
  }
}

That’s it. Drop that in your MCP client configuration and now your AI tool can search and write memories mid-task without you doing anything manually.

The reason this matters: before the MCP server, memories were something I managed explicitly with shell scripts. I’d run brain-add "fact" or brain-search "query" by hand. Useful, but still manual. With MCP, the AI tools themselves can retrieve relevant context when they need it and store new facts as they learn them. The memory layer becomes active rather than passive.

One shared brain. Every AI tool on my network reads from and writes to the same store. That’s the unlock.

Tailscale as the perimeter

Here’s the honest constraint: Open Brain doesn’t have authentication implemented yet.

That means it’s internal-only, full stop. I don’t expose it to the public internet. Tailscale handles the perimeter — it gives me private networking across all my devices, WireGuard encryption, and zero firewall configuration. If you’re on my Tailscale network, you can reach the service. If you’re not, you can’t.

For personal use, this is fine. It’s a reasonable tradeoff: ship something useful now, add auth when the use case demands it. The use case that demands it is team access or exposing the service more broadly. I’m not there yet.

When I do add auth, the Tailscale perimeter doesn’t go away — it becomes one layer of a layered approach rather than the only layer. That’s the plan.

How I actually use it

My AI assistant — Xander, running via OpenClaw — uses Open Brain as its long-term memory. Every session, before answering anything about past decisions or project state, it searches Open Brain. The things that would otherwise vanish when a session ends now persist across model restarts, configuration changes, everything.

The practical difference is that Xander actually knows my situation. It knows which server runs which services, what I decided about the auth approach on a project last month, what supplements I take and why. Not because I told it in this session, but because it’s stored. It can pick up where we left off.

That’s what I wanted. An assistant that treats accumulated context as an asset rather than something that evaporates.

What I’d do differently

Start with auth. It’s one of those things that’s genuinely easier to build in from the beginning than to bolt on later. I knew this and shipped without it anyway because I wanted to get to the useful part. No regrets, but if I were starting fresh, API key auth would be in the first version.

I’d also think harder about memory lifecycle from day one. Right now memories accumulate. I run a weekly cleanup pass that uses an LLM to prune duplicates and stale entries. It works, but it’s reactive. Something like automatic expiration on certain memory types, or confidence decay over time, would be better designed in than patched in later.
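Confidence decay is simple to reason about: score each memory down exponentially with age, and prune below a floor. This is a sketch of the idea, not anything Open Brain implements today; the half-life and floor values are arbitrary:

```python
def decayed_confidence(initial, age_days, half_life_days=90.0):
    """Exponential decay: confidence halves every half_life_days."""
    return initial * 0.5 ** (age_days / half_life_days)

def should_prune(initial, age_days, floor=0.2):
    """A memory drops out of retrieval once its decayed confidence sinks below the floor."""
    return decayed_confidence(initial, age_days) < floor
```

Different namespaces could get different half-lives: infrastructure facts decay slowly, project state decays fast.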

What’s next

Authentication is first. Once that’s in place, I can share access with collaborators and think about multi-user memory spaces.

After that, memory consolidation — periodically synthesizing related memories into higher-level summaries rather than just accumulating discrete facts. The system works well now, but it’ll work better with some intelligence about what to keep, what to merge, and what to let go.

Current version is v0.2.7. It does what I built it to do. The next versions will make it ready for more than just me.


Stack: Elixir 1.17, Phoenix, Postgres 16, pgvector 0.8, Ollama, nomic-embed-text. MCP server follows the 2025-03-26 spec.


© 2026 Andrew Kalek