Open Source · Apache 2.0

StreetAI Memory

A Python memory layer that sits between your app and the LLM API. It stores the conversation, organises it, and at each turn sends only the parts the model needs. Your AI's memory grows forever. Your token bill doesn't.

Plain history: every turn resends the whole conversation, so the prompt grows linearly.
StreetAI Memory: each turn sends only what matters, so the prompt stays near flat.

In a 16-turn benchmark, input tokens dropped 55 to 80 percent per turn (average 68 percent), with the savings growing as the conversation lengthens.

What is it for?

A chatbot that remembers things across sessions. An agent that learns which facts about a user actually matter. A support bot that recalls a customer's history without you resending the whole ticket log on every turn. Any LLM app where the conversation outlives a single request.

What makes it different from plain chat history

| | Plain chat history | StreetAI Memory |
| --- | --- | --- |
| Prompt grows with conversation | Yes (linear) | No (near flat) |
| Recent context kept verbatim | Yes | Yes (recency window) |
| Activity-aware decay | No | Yes (per interaction) |
| Learns from outcomes | No | Yes (boost / demote) |
| Self-organizing | No | Yes (auto-stacks) |
| Cross-provider | Yes | Yes |

Install

Requires Python 3.10 or newer. Works on Windows, macOS, and Linux.

# Core package
pip install streetai-memory

# With provider adapters (optional)
pip install "streetai-memory[anthropic]"
pip install "streetai-memory[openai]"      # also covers DeepSeek, Together, Groq
pip install "streetai-memory[gemini]"
pip install "streetai-memory[all]"         # all of the above

First use downloads a roughly 90MB embedding model (all-MiniLM-L6-v2, in ONNX format) into a local cache. The cache lives inside the installed package by default. Set FASTEMBED_CACHE_DIR to share it across virtual environments.

Quickstart

The shortest path: a Memory you can write to, query, and serialise without touching an LLM yet.

from streetai import MemoryRegistry

registry = MemoryRegistry("./memory.db")
mem = registry.get("user_123")

mem.add_message("Hi, I'm planning a trip to Japan.", role="user")
mem.add_message("Great. Which cities?", role="assistant")

prompt = mem.build_prompt("What did I say about Japan?")
# prompt.messages   -> list of {role, content} ready for any LLM API
# prompt.retrieved  -> the signals pulled in (pass to post_process later)
# prompt.inspector  -> debug data (stacks activated, retrieval scores)

After your LLM responds, feed the response back to the memory so it can learn:

response_text = your_llm(prompt.messages)
mem.post_process(prompt.retrieved, response_text)
mem.add_message("What did I say about Japan?", role="user")
mem.add_message(response_text, role="assistant")

Drop-in adapters

If you would rather not write that loop by hand, wrap your provider client with with_memory(). The wrapper exposes the same SDK methods you already know. Memory reads happen before each call; writes happen after.

Anthropic

from anthropic import Anthropic
from streetai.adapters.anthropic import with_memory

client = with_memory(Anthropic(), memory_id="user_123")
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are helpful.",
    messages=[{"role": "user", "content": "What did I mention earlier?"}],
)
print(response.content[0].text)

OpenAI

from openai import OpenAI
from streetai.adapters.openai import with_memory

client = with_memory(OpenAI(), memory_id="user_123")
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What did I mention earlier?"}],
)
print(response.choices[0].message.content)

DeepSeek (uses the OpenAI adapter)

DeepSeek exposes an OpenAI-compatible API, so use the OpenAI adapter and point base_url at the DeepSeek endpoint.

import os
from openai import OpenAI
from streetai.adapters.openai import with_memory

deepseek = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)
client = with_memory(deepseek, memory_id="user_123")
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "What did I mention earlier?"}],
)

The same pattern works for Together, Anyscale, Groq, and any other OpenAI-compatible endpoint.

Google Gemini

from google import genai
from streetai.adapters.gemini import with_memory

client = with_memory(genai.Client(api_key="..."), memory_id="user_123")
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What did I mention earlier?",
)
print(response.text)

Async

Every adapter has an async equivalent. Pass an async client; use await. Memory operations are wrapped in asyncio.to_thread.

# Each snippet uses await, so run it inside a coroutine (for example via asyncio.run).
# Anthropic
from anthropic import AsyncAnthropic
from streetai.adapters.anthropic import with_memory_async
client = with_memory_async(AsyncAnthropic(), memory_id="user_123")
response = await client.messages.create(model="claude-sonnet-4-6",
    max_tokens=1024, messages=[{"role": "user", "content": "..."}])

# OpenAI (and DeepSeek via base_url)
from openai import AsyncOpenAI
from streetai.adapters.openai import with_memory_async
client = with_memory_async(AsyncOpenAI(), memory_id="user_123")
response = await client.chat.completions.create(model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}])

# Gemini (uses the existing client's .aio namespace internally)
from google import genai
from streetai.adapters.gemini import with_memory_async
client = with_memory_async(genai.Client(api_key="..."), memory_id="user_123")
response = await client.models.generate_content(model="gemini-2.0-flash",
    contents="...")

Streaming is not yet supported. Setting stream=True on any adapter raises NotImplementedError. The chunks would bypass the memory write step, so we block it explicitly rather than silently dropping the response.

Memory IDs and persistence

Each memory_id is an isolated, persistent memory. Use one per user or per session. They never leak into each other. The memory is saved to the SQLite file you pass to the registry and reloads automatically next time you start the process. It survives restarts and works across processes that share the file.

registry = MemoryRegistry("./memory.db")
user_a = registry.get("alice")
user_b = registry.get("bob")

user_a.add_message("My favourite city is Kyoto", role="user")
user_b.add_message("My favourite city is Lisbon", role="user")

# Two completely separate memories.
# Querying user_a will never surface user_b's signals, and vice versa.

To wipe one memory entirely, for example on a "clear chat" action or an account deletion:

registry.reset("alice")   # everything stored under that id is gone

What lives where: the model file (about 90MB) is one copy on disk per Python environment. The signals and stacks live in one SQLite file, scoped per memory_id. The model in RAM is shared across every memory in the process, so adding users does not multiply embedding cost.

Editing and deleting messages

Every add_message call returns the signals it created. Each signal carries a parent_turn_id that identifies the whole message. Save that id alongside the message in your own display database, then use it later when a user edits or deletes.

From a direct Memory

created = mem.add_message("I am vegetarian", role="user")
pid = created[0].parent_turn_id

mem.update_message(pid, "I am vegan")     # in-place edit; keeps the same turn
mem.delete_message(pid)                    # remove the message entirely

From an adapter

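The wrapper exposes a .memory property, so any Memory method works on the client. Each adapter call stores the user message and then the assistant reply. When each of them fits in a single chunk, that is two signals, so the user message's parent_turn_id can be read from client.memory.signals[-2] after the call. Longer messages may be chunked into several signals, in which case match on parent_turn_id rather than relying on a fixed offset.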

resp = client.messages.create(model="claude-sonnet-4-6",
    max_tokens=512, messages=[{"role": "user", "content": "I am vegetarian"}])

user_pid = client.memory.signals[-2].parent_turn_id
client.memory.update_message(user_pid, "I am vegan")
client.memory.delete_message(user_pid)

Two honest details to know. An edit replaces the old content but does not carry over the boost/demote history from before; the new version starts at base weight 1.0. And deleting a message does not advance the turn counter, so turn numbers stay monotonic but may have gaps. Both are intentional and tested.

How it works

The full pipeline, in order. Nothing magic happens here. Each piece is small enough to read in source.

1. Chunking

An incoming message is split into sentence-sized pieces (between 30 and 400 characters, or 5 to 60 words). Triple-backtick code blocks stay atomic. Undersized chunks merge into their neighbours. Oversized ones split on comma or semicolon near the midpoint.
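The splitting and merging described above can be sketched roughly as follows. This is a simplified illustration, not the library's chunker: the name chunk_message and its min_chars parameter are hypothetical, the sentence boundary is a naive regex, only the lower character bound is applied, and the comma/semicolon split for oversized pieces is left out.

```python
import re

FENCE = "`" * 3  # triple backtick, built indirectly so this example nests cleanly


def chunk_message(text: str, min_chars: int = 30) -> list[str]:
    """Split text into sentence-sized chunks, keeping fenced code blocks whole."""
    # The capture group keeps code blocks in the result, at odd indexes.
    parts = re.split(f"({FENCE}.*?{FENCE})", text, flags=re.DOTALL)
    pieces: list[tuple[str, bool]] = []  # (text, is_code)
    for index, part in enumerate(parts):
        if index % 2 == 1:
            pieces.append((part, True))  # code block stays atomic
            continue
        # Naive sentence boundary: ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", part.strip()):
            if sentence:
                pieces.append((sentence, False))

    chunks: list[tuple[str, bool]] = []
    for piece, is_code in pieces:
        can_merge = bool(chunks) and not is_code and not chunks[-1][1]
        if can_merge and (len(chunks[-1][0]) < min_chars or len(piece) < min_chars):
            # An undersized chunk is folded into its prose neighbour.
            chunks[-1] = (f"{chunks[-1][0]} {piece}", False)
        else:
            chunks.append((piece, is_code))
    return [chunk for chunk, _ in chunks]


example = (
    "Hi. I am planning a long trip to Japan next spring. "
    "Which cities should I visit first, given two weeks?"
)
chunks = chunk_message(example)
```

Here the three-character "Hi." is too short to stand alone, so it merges into the sentence that follows it, and the message ends up as two chunks.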

2. Embedding

Each chunk is embedded with all-MiniLM-L6-v2 through fastembed, producing a 384-dimensional normalised vector. Chunks from one message are batched into a single model call.

3. Stack assignment

The chunk is compared (cosine similarity) against the centroid of every existing stack. If the best match crosses the stack threshold (default 0.55), the chunk joins that stack. Otherwise it starts a new stack.
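As a rough sketch of that decision, assuming plain Python lists stand in for the embedding vectors: assign_stack and cosine are hypothetical helper names, treating "crosses the threshold" as greater-than-or-equal is an assumption, and updating the centroid after a chunk joins is left out.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_stack(chunk_vec, centroids, stack_threshold=0.55):
    """Return the index of the stack the chunk joins, creating one if none match."""
    best_index, best_sim = -1, -1.0
    for index, centroid in enumerate(centroids):
        sim = cosine(chunk_vec, centroid)
        if sim > best_sim:
            best_index, best_sim = index, sim
    if best_sim >= stack_threshold:
        return best_index  # close enough: join the best-matching stack
    centroids.append(list(chunk_vec))  # otherwise the chunk seeds a new stack
    return len(centroids) - 1


centroids = [[1.0, 0.0], [0.0, 1.0]]
joined = assign_stack([0.9, 0.1], centroids)     # close to the first centroid
created = assign_stack([-1.0, -1.0], centroids)  # unlike either centroid
```

Raising stack_threshold makes stacks tighter and more numerous; lowering it produces fewer, broader stacks.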

4. Tier 1 retrieval

When a query arrives, it is embedded the same way. A FAISS IndexFlatIP over stack centroids returns the top stacks (default 6) by inner product. This is the fast filter, not the final ranking.
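FAISS IndexFlatIP performs an exhaustive inner-product search, so the Tier 1 filter can be illustrated without FAISS. The pure-Python stand-in below is only a sketch of the idea; top_stacks is a hypothetical name and the library itself uses the FAISS index.

```python
def top_stacks(query_vec, centroids, k=6):
    """Exhaustive inner-product search: (stack_index, score) pairs, best first."""
    scores = [
        (index, sum(q * c for q, c in zip(query_vec, centroid)))
        for index, centroid in enumerate(centroids)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


centroids = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
survivors = top_stacks([0.0, 1.0], centroids, k=2)
surviving_ids = [index for index, _ in survivors]
```

Only the stacks returned here are opened in Tier 2, which is what keeps per-signal scoring cheap as the number of stacks grows.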

5. Tier 2 scoring

Inside each surviving stack, every signal is scored by max(effective_weight, revival_floor) × similarity. Signals that clear the activation threshold (default 0.15) join the thinking space, capped at 25 signals total.
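A minimal sketch of the scoring gate, assuming each candidate arrives as a (signal_id, effective_weight, similarity) tuple. The function name thinking_space is hypothetical, and keeping the highest-scoring signals when the cap is hit is an assumption about how the cap is applied.

```python
def thinking_space(candidates, revival_floor, activation_threshold=0.15, max_size=25):
    """Score candidates and keep those that clear the activation threshold."""
    scored = []
    for signal_id, effective_weight, similarity in candidates:
        score = max(effective_weight, revival_floor) * similarity
        if score >= activation_threshold:
            scored.append((signal_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)  # assumption: best first
    return scored[:max_size]


floor = 0.15 / 0.45  # default activation_threshold / revival_similarity
candidates = [
    ("fresh", 1.0, 0.30),         # healthy weight, moderate match
    ("faded_strong", 0.01, 0.60), # decayed, but a strong match
    ("faded_weak", 0.01, 0.30),   # decayed and a weak match
]
selected = [signal_id for signal_id, _ in thinking_space(candidates, floor)]
```

The fresh signal and the strongly matching faded one both surface; the weakly matching faded one stays out.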

6. Recency window

The last few messages (by default the last 3, with all their chunks) are reassembled into their original form and sent to the LLM verbatim, alongside the retrieved long-term signals. Recent context is never replaced by retrieval.

7. Prompt assembly

The final prompt is a small list: a retrieved-context block (if any), the recency window, and the new user message. The total size is determined by what is relevant, not by how long the conversation has been.
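The assembly order can be sketched like this. The function assemble_prompt, the wording of the context header, and the choice of a "system" role for the retrieved block are all assumptions made for illustration; the library's actual message layout may differ.

```python
def assemble_prompt(retrieved_texts, recent_messages, user_message):
    """Build the message list: retrieved context, recency window, new message."""
    messages = []
    if retrieved_texts:
        # Hypothetical formatting for the retrieved-context block.
        context = "Relevant context from earlier in the conversation:\n" + "\n".join(
            f"- {text}" for text in retrieved_texts
        )
        messages.append({"role": "system", "content": context})
    messages.extend(recent_messages)  # recency window, verbatim
    messages.append({"role": "user", "content": user_message})
    return messages


recent = [
    {"role": "user", "content": "Hi, I'm planning a trip to Japan."},
    {"role": "assistant", "content": "Great. Which cities?"},
]
messages = assemble_prompt(
    ["My favourite city is Kyoto"], recent, "What did I say about Japan?"
)
```

The length of the result depends on the number of retrieved signals and the recency window, not on how many turns the conversation has had.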

8. Post-processing

After the LLM replies, each retrieved signal is scored against the response. Strong contributors get boosted (base weight times 1.10, capped at 8.0). Weak ones get demoted (times 0.95). Touched stacks recompute their centroids.

Decay and revival

Two forces shape what gets surfaced at retrieval time. They work together; you tune each separately.

Decay is measured in interactions, not wall-clock time

A signal's age is the number of turns since it was last used. The effective weight is base_weight × exp(-rate × (current_turn - last_used_turn)). With the default 50-turn half-life, a signal at base weight 1.0 drops to half its weight after 50 unused turns, to a quarter after 100, and falls below the death threshold of 0.05 after roughly 216 turns of no use. An idle gap (no new turns) does not advance decay, so a user returning weeks later finds memory where they left it. Every time a signal is pulled into the thinking space, its clock resets to the current turn. Useful signals stay sharp through use; unused ones quietly fade.
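The arithmetic behind those numbers is easy to check. The snippet below is a standalone sketch of the formula as stated, with hypothetical names (effective_weight, turns_until_dead); it does not call the library.

```python
import math

HALF_LIFE_TURNS = 50
DECAY_RATE = math.log(2) / HALF_LIFE_TURNS  # default per-turn rate
DEATH_THRESHOLD = 0.05


def effective_weight(base_weight, current_turn, last_used_turn, rate=DECAY_RATE):
    """Exponential decay measured in turns since the signal was last used."""
    return base_weight * math.exp(-rate * (current_turn - last_used_turn))


half = effective_weight(1.0, current_turn=50, last_used_turn=0)
quarter = effective_weight(1.0, current_turn=100, last_used_turn=0)

# Turns of disuse before a base-weight-1.0 signal reaches the death threshold.
turns_until_dead = math.log(1.0 / DEATH_THRESHOLD) / DECAY_RATE
```

Solving base_weight × exp(-rate × t) = 0.05 for t gives ln(20) / rate, which is just over 216 turns at the default rate. A signal with a boosted base weight takes correspondingly longer to fade.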

Dead signals can come back on a strong match

A faded signal would normally be too weak to surface. But the right cue should still be able to bring an old memory back. StreetAI Memory does this with a weight floor in the retrieval score: score = max(effective_weight, floor) × similarity, where floor = activation_threshold / revival_similarity. With the defaults (0.15 and 0.45), the floor is about 0.33, which means a faded signal needs a similarity of at least 0.45 to clear the gate. Strong matches revive a dead signal; weak ones do not. Set revival_similarity = 0 in your Config to disable revival entirely.
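To see why the floor turns revival_similarity into the effective gate for dead signals, here is a small sketch of the formula. The helper names revival_floor and surfaces are hypothetical, and returning a floor of 0.0 when revival is disabled is an assumption about how the library handles revival_similarity = 0.

```python
def revival_floor(activation_threshold=0.15, revival_similarity=0.45):
    """Weight floor that lets a faded signal surface on a strong enough match."""
    if revival_similarity <= 0:
        return 0.0  # assumption: disabled revival means no floor at all
    return activation_threshold / revival_similarity


def surfaces(effective_weight, similarity,
             activation_threshold=0.15, revival_similarity=0.45):
    """True if the signal's score clears the activation threshold."""
    floor = revival_floor(activation_threshold, revival_similarity)
    return max(effective_weight, floor) * similarity >= activation_threshold


dead_weight = 0.01  # well below the death threshold of 0.05
strong_match = surfaces(dead_weight, similarity=0.60)
weak_match = surfaces(dead_weight, similarity=0.30)
revival_off = surfaces(dead_weight, similarity=0.60, revival_similarity=0)
```

Because floor × similarity reaches activation_threshold exactly when similarity reaches revival_similarity, the dead signal comes back at similarity 0.60, stays out at 0.30, and stays out at any similarity once revival is switched off.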

Boost and demote, the slow learning loop

After each LLM response, post_process scores every retrieved signal against the response (cosine similarity). Signals that score above the high-relevance threshold (default 0.55) have their base weight multiplied by 1.10, capped at 8.0. Signals that score below the low-relevance threshold (default 0.20) are multiplied by 0.95. Anything in between is left untouched. Over many interactions, the signals that consistently help drift toward higher base weight; noise drifts lower and dies on its own.
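The update rule can be written as one small function. This is a sketch using the documented defaults; update_base_weight is a hypothetical name, and how the exact boundary values (a relevance of precisely 0.55 or 0.20) are treated is an assumption.

```python
def update_base_weight(base_weight, relevance, high=0.55, low=0.20,
                       boost=1.10, demote=0.95, cap=8.0):
    """Boost strong contributors, demote weak ones, leave the middle band alone."""
    if relevance >= high:
        return min(base_weight * boost, cap)
    if relevance < low:
        return base_weight * demote
    return base_weight


# How many consecutive boosts does it take to reach the cap from 1.0?
weight, boosts = 1.0, 0
while weight < 8.0:
    weight = update_base_weight(weight, relevance=0.70)
    boosts += 1
```

Starting from base weight 1.0, it takes 22 consecutive boosts to hit the 8.0 cap, which is why the text calls this the slow learning loop: a single lucky match barely moves a signal.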

Configuration

All knobs live on the Config dataclass. Pass one to the registry to override the defaults for every Memory it creates. Most apps never need to change these.

import math
from streetai import MemoryRegistry, Config

cfg = Config(
    recency_turns=5,                # last 5 messages verbatim (default 3)
    decay_rate=math.log(2)/100,     # 100-turn half-life (default 50; decay is per turn)
    stack_threshold=0.65,           # tighter stack assignment (default 0.55)
    activation_threshold=0.1,       # min score for a signal to surface (default 0.15)
    revival_similarity=0.45,        # match needed to revive a faded signal (0 disables)
)

registry = MemoryRegistry("./memory.db", config=cfg)

Full reference

| Field | Default | What it does |
| --- | --- | --- |
| stack_threshold | 0.55 | Cosine similarity required for a new chunk to join an existing stack instead of starting its own. |
| decay_rate | ln(2)/50 | Per-turn decay rate. The default gives a 50-turn half-life. Pass math.log(2)/N for an N-turn half-life. |
| death_threshold | 0.05 | Effective weight below this is considered dead. Dead signals are skipped in retrieval unless revival rescues them. |
| top_stacks | 6 | How many stacks Tier 1 returns. Higher means broader retrieval, slightly more work in Tier 2. |
| activation_threshold | 0.15 | Minimum score = eff × sim for a signal to enter the thinking space. |
| max_thinking_space | 25 | Hard cap on the number of retrieved signals sent to the LLM per turn. |
| revival_similarity | 0.45 | Similarity required for a dead signal to come back. Implemented as a weight floor at activation_threshold / revival_similarity. Set to 0 to disable. |
| recency_turns | 3 | How many of the most recent messages are sent to the LLM verbatim alongside the retrieved signals. |
| boost_factor | 1.10 | Multiplier applied to base weight when a signal contributed strongly to the response. |
| demote_factor | 0.95 | Multiplier applied to base weight when a signal did not contribute. |
| high_relevance | 0.55 | Cosine similarity (signal vs response) above which a signal is boosted. |
| low_relevance | 0.20 | Cosine similarity below which a signal is demoted. |
| max_base_weight | 8.0 | Hard cap on base weight. Prevents runaway reinforcement of a single signal. |

Limitations

The current release is alpha (0.2.0). Here is the honest list of what is not in the box yet.

Non-streaming only

stream=True raises NotImplementedError on every adapter. Streaming bypasses the memory write step, so it is blocked rather than silently broken. Support is on the roadmap.

English-tuned defaults

Chunking sizes, the embedding model, and the default thresholds were validated on English conversations. Other languages may need tuning of the Config and possibly a different encoder later.

fastembed is the only encoder

The embedding model is fixed to all-MiniLM-L6-v2 through fastembed. A pluggable encoder interface, so you can supply a faster or hosted embedder, is planned.

Scale not yet validated

Behaviour at hundreds of thousands of signals or after months of continuous use has not been benchmarked. The architecture supports it, but treat large-scale claims as theoretical until tested.
