Problem: Burning $20/day on AI API costs. Reading 5,000-line memory files every session. Slow, expensive, doesn’t scale.
Solution: Built a local memory system and intelligent routing layer. Open-source tools. Zero ongoing costs.
Result: 60% cost reduction ($4,320/year saved), <50ms memory retrieval, and it gets smarter over time.
Here’s what we built in one afternoon.
The Two Problems
1. Memory Was Slow and Dumb
Reading MEMORY.md (5,000+ lines) every session just to remember yesterday. Hundreds of thousands of tokens wasted on file reads. No indexing. No entity tracking. Grep and hope.
2. Every Task Cost the Same
Sending “format this JSON” to Claude Sonnet ($0.098/request) because it’s good at everything. No routing logic. No cheaper alternatives. Just overpay or build custom infrastructure.
Most people accept this. We didn’t.
What We Built
Local Memory System (Inspired by Hindsight, Mem0, GitHub Copilot)
SQLite database with multi-strategy retrieval. No API calls. Completely private.
What it does:
- Extracts facts automatically from conversations
- Resolves entities (“Jed” = “Jed Wilson” = “my client”)
- Indexes keywords for fast search
- Multi-strategy retrieval (keyword + entity + temporal)
- Synthesis across memories (not just retrieval)
What we imported:
- 228 existing memories from markdown files
- 1,089 entities tracked
- 2,664 keywords indexed
Performance:
- Search: <50ms (vs minutes reading files)
- Accuracy: ~70-80% (vs Hindsight’s 91.4%)
- Cost: $0 forever
The code:
# Core operations
engine.retain("Jed works at night") # Stores with auto-extraction
engine.recall("When does Jed work?") # Multi-strategy search
engine.reflect("Tell me about Jed") # Synthesis across memories
No external dependencies. Just SQLite and Python.
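The real engine runs ~600 lines. Here's a stripped-down sketch of what retain() might do under the hood; MemoryEngine and the naive keyword extractor here are illustrative, not our actual implementation:

import re
import sqlite3

STOPWORDS = {"the", "a", "an", "at", "is", "of", "to", "and"}

class MemoryEngine:
    def __init__(self, path="memory.db"):
        self.db = sqlite3.connect(path)
        self.db.executescript("""
            CREATE TABLE IF NOT EXISTS memories (
                id INTEGER PRIMARY KEY,
                content TEXT,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            );
            CREATE TABLE IF NOT EXISTS keywords (memory_id INTEGER, keyword TEXT, weight REAL);
        """)

    def retain(self, content):
        # Store the raw memory, then index its keywords for fast recall later.
        memory_id = self.db.execute(
            "INSERT INTO memories (content) VALUES (?)", (content,)
        ).lastrowid
        words = [w for w in re.findall(r"[a-z']+", content.lower()) if w not in STOPWORDS]
        for word in set(words):
            self.db.execute(
                "INSERT INTO keywords VALUES (?, ?, ?)",
                (memory_id, word, words.count(word) / len(words)),
            )
        self.db.commit()
        return memory_id

engine = MemoryEngine()
engine.retain("Jed works at night")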
Ollama (Local LLM for Simple Tasks)
Downloaded Qwen2.5:14b (9GB). Runs locally. Handles 60-70% of tasks with zero API costs.
Installation:
brew install ollama
ollama pull qwen2.5:14b
Good for:
- Email drafts
- Basic research
- Data formatting
- Simple Q&A
- File operations
Not good for:
- Deep analysis
- Strategic decisions
- Complex synthesis
That’s fine. Claude handles the hard stuff. Ollama handles everything else. For free.
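Once Ollama is running, anything on your machine can hit it over HTTP at localhost:11434. A minimal sketch using only the Python standard library (the /api/generate request shape shown here is Ollama's standard API):

import json
import urllib.request

# Ask the local model to draft an email: no API key, nothing leaves the machine.
payload = json.dumps({
    "model": "qwen2.5:14b",
    "prompt": "Draft a two-line email politely declining a meeting.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])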
NadirClaw (Smart Router)
Open-source LLM router that analyzes request complexity in ~10ms and routes automatically.
Installation:
pip install nadirclaw
nadirclaw setup # Interactive wizard
nadirclaw serve # Starts on localhost:8857
Routing logic:
- Simple tasks → Claude Haiku ($0.0004)
- Complex tasks → Claude Sonnet ($0.098)
- Free tier → Ollama qwen2.5:14b ($0)
No manual decisions. It just works.
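We can't speak for NadirClaw's internals, but the core idea of complexity routing fits in a few lines. A toy sketch, with made-up signals and thresholds for illustration:

def route(prompt):
    # Crude complexity signals: length, plus keywords that hint at multi-step reasoning.
    words = prompt.split()
    hard_markers = {"analyze", "strategy", "compare", "architecture", "tradeoffs"}
    looks_hard = len(words) > 150 or any(w.lower().strip(".,") in hard_markers for w in words)
    return "claude-sonnet" if looks_hard else "ollama/qwen2.5:14b"

route("format this JSON")                          # → ollama/qwen2.5:14b ($0)
route("Analyze the tradeoffs in our Q3 strategy")  # → claude-sonnet ($0.098)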
The Numbers
Before
- Memory: Reading 5,000-line files every session (slow, expensive)
- Cost: All tasks → Claude Sonnet
- Daily spend: $20/day ($600/month, $7,200/year)
After
- Memory: <50ms indexed retrieval (fast, free)
- Cost: 60% cheap/free, 40% premium
- Daily spend: $8/day ($240/month, $2,880/year)
Savings: $4,320/year
Time to build: 4 hours
Ongoing cost: $0
The Architecture
User request
↓
Local Memory System (instant recall, $0)
↓
NadirClaw Router (analyze complexity, ~10ms)
↓
├─→ 60% Simple → Haiku ($0.0004) or Ollama ($0)
└─→ 40% Complex → Sonnet ($0.098)
Every layer optimizes something:
- Memory: eliminates redundant file reads
- Router: right model for the job
- Local LLM: free inference when possible
Why This Works
It’s Actually Free
Not “free tier with limits.” Free as in:
- SQLite: open-source, embedded
- Ollama: open-source, local inference
- NadirClaw: open-source, self-hosted
No subscriptions. No usage caps. No vendor that can raise prices.
It’s Better, Not Just Cheaper
Memory improvements:
- <50ms retrieval vs minutes of file reading
- Entity resolution: “Jed” links to all mentions
- Multi-strategy search catches what single-strategy misses
- Synthesis: reasons across memories, not just retrieves
Cost improvements:
- 60% reduction immediately
- Gets better as routing improves
- Predictable spending (no surprise bills)
It Compounds
Memory gets smarter with use. More memories → better retrieval → better synthesis.
Routing gets smarter with data. More requests → better classification → better savings.
Traditional approaches plateau. This improves.
How to Replicate
1. Build Local Memory
-- SQLite schema
CREATE TABLE memories (
    id INTEGER PRIMARY KEY,
    content TEXT,
    fact TEXT,                -- extracted core fact
    created_at TIMESTAMP
);
CREATE TABLE entities (
    id INTEGER PRIMARY KEY,
    name TEXT,
    aliases TEXT              -- JSON array of alternate names
);
CREATE TABLE keywords (
    memory_id INTEGER,        -- references memories(id)
    keyword TEXT,
    weight REAL
);
Multi-strategy retrieval (sketched in code after this list):
- Keyword search (BM25-like)
- Entity-based search
- Temporal filtering
- Merge and rerank
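Here's a sketch of that merge-and-rerank step against the schema above. The scoring weights are illustrative; a production version would add real BM25 scoring and temporal decay:

def recall(db, query, limit=5):
    terms = [t.lower() for t in query.split()]
    scores = {}  # memory_id -> combined relevance score

    # Strategy 1: keyword hits, weighted by the stored keyword weight.
    for term in terms:
        for memory_id, weight in db.execute(
            "SELECT memory_id, weight FROM keywords WHERE keyword = ?", (term,)
        ):
            scores[memory_id] = scores.get(memory_id, 0) + weight

    # Strategy 2: entity hits get a bigger boost; entities are stronger signals.
    for term in terms:
        for (name,) in db.execute(
            "SELECT name FROM entities WHERE name LIKE ? OR aliases LIKE ?",
            (f"%{term}%", f"%{term}%"),
        ):
            for (memory_id,) in db.execute(
                "SELECT id FROM memories WHERE content LIKE ?", (f"%{name}%",)
            ):
                scores[memory_id] = scores.get(memory_id, 0) + 2.0

    # Merge and rerank: highest combined score wins.
    ranked = sorted(scores, key=scores.get, reverse=True)[:limit]
    return [db.execute("SELECT content FROM memories WHERE id = ?", (m,)).fetchone()[0]
            for m in ranked]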
2. Install Ollama
brew install ollama
ollama pull qwen2.5:14b # or qwen2.5:7b for smaller
ollama serve # Runs on localhost:11434
3. Install NadirClaw
pip install nadirclaw
nadirclaw setup
nadirclaw serve --port 8857
4. Point Requests to Router
# Instead of direct API:
curl https://api.anthropic.com/...
# Route through NadirClaw:
curl http://localhost:8857/v1/chat/completions
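The /v1/chat/completions path implies an OpenAI-compatible API, so existing clients should work once you swap the base URL. A hedged example; the "model" value below is a placeholder, since the router picks the real model:

import json
import urllib.request

payload = json.dumps({
    "model": "auto",  # placeholder: NadirClaw decides which model actually serves this
    "messages": [{"role": "user", "content": "Format this JSON: {a: 1, b: 2}"}],
}).encode()

req = urllib.request.Request(
    "http://localhost:8857/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
    print(reply["choices"][0]["message"]["content"])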
5. Monitor
nadirclaw report # Routing decisions
nadirclaw savings # Cost savings
What We Learned
Memory Systems
The best memory systems we studied (Hindsight, Mem0, GitHub Copilot) all use:
- Fact extraction (not raw storage)
- Entity resolution (link mentions)
- Multi-strategy retrieval (semantic + keyword + graph + temporal)
- Citation tracking (know the source)
We implemented the core ideas in ~600 lines of Python. Good enough for personal use. Beats reading files every session.
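Entity resolution is the piece that pays off most, and its core is small. A toy version of the alias lookup (the matching rule is illustrative):

import json

def resolve_entity(db, mention):
    # Map a mention ("Jed", "my client") to its canonical entity name.
    mention = mention.lower().strip()
    for name, aliases in db.execute("SELECT name, aliases FROM entities"):
        known = {name.lower()} | {a.lower() for a in json.loads(aliases or "[]")}
        if mention in known:
            return name  # canonical form, e.g. "Jed Wilson"
    return None  # unknown: retain() could create a new entity here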
Cost Optimization
60-70% of LLM requests are simple. They don’t need premium models. But routing logic is hard to build.
Open-source routers (NadirClaw) solve this. Drop-in replacement. Smart classification. Automatic failover.
The infrastructure exists. Just use it.
Local vs Cloud
Local wins when:
- Cost matters more than latest models
- Privacy matters
- You control the hardware
Cloud wins when:
- You need cutting-edge reasoning
- You need to scale beyond one machine
- You want zero maintenance
We use both. Local for 60%, cloud for 40%. Best of both worlds.
The ROI
Time invested: 4 hours (research, build, document)
Money invested: $0
Annual savings: $4,320
Ongoing cost: $0
Payback period: Immediate (no cost to recover)
ROI: Infinite
The Trade-Offs
What You Give Up
- Managed service convenience
- Latest models immediately
- Infinite cloud scale
What You Gain
- 60% cost reduction
- Complete control of your stack
- Privacy (data never leaves your machine)
- No vendor lock-in
- Predictable costs
For most personal and small-team use cases, this is the right tradeoff.
The Bottom Line
We went from:
- ❌ $20/day API costs
- ❌ Reading 5,000-line files every session
- ❌ No entity tracking
- ❌ No learning between sessions
To:
- ✅ $8/day API costs (60% reduction)
- ✅ <50ms indexed memory retrieval
- ✅ Entity resolution and multi-strategy search
- ✅ Memory that compounds over time
- ✅ All local, all private, all free
In one afternoon. With open-source tools. Zero ongoing costs.
The only question: how long will you keep overpaying?
Tools used: SQLite, Python, Ollama (Qwen2.5:14b), NadirClaw.
Research sources: Hindsight, Mem0, GitHub Copilot memory research.
Total cost: Still $0.