Open Source · Apache 2.0
A Python memory layer that sits between your app and the LLM API. It stores the conversation, organises it, and at each turn sends only the parts the model needs. Your AI's memory grows forever. Your token bill doesn't.
In a 16-turn benchmark, input tokens dropped 55 to 80 percent per turn (average 68 percent), with the savings growing as the conversation lengthens.
A chatbot that remembers things across sessions. An agent that learns which facts about a user actually matter. A support bot that recalls a customer's history without you resending the whole ticket log on every turn. Any LLM app where the conversation outlives a single request.
| | Plain chat history | StreetAI Memory |
|---|---|---|
| Prompt grows with conversation | Yes (linear) | No (near flat) |
| Recent context kept verbatim | Yes | Yes (recency window) |
| Activity-aware decay | No | Yes (per interaction) |
| Learns from outcomes | No | Yes (boost / demote) |
| Self-organizing | No | Yes (auto-stacks) |
| Cross-provider | Yes | Yes |
Requires Python 3.10 or newer. Works on Windows, macOS, and Linux.
# Core package
pip install streetai-memory
# With provider adapters (optional)
pip install "streetai-memory[anthropic]"
pip install "streetai-memory[openai]" # also covers DeepSeek, Together, Groq
pip install "streetai-memory[gemini]"
pip install "streetai-memory[all]" # all of the above
First use downloads a roughly 90MB embedding model (all-MiniLM-L6-v2, in ONNX format) into a local cache. The cache lives inside the installed package by default. Set FASTEMBED_CACHE_DIR to share it across virtual environments.
The shortest path: a Memory you can write to, query, and serialise without touching an LLM yet.
from streetai import MemoryRegistry
registry = MemoryRegistry("./memory.db")
mem = registry.get("user_123")
mem.add_message("Hi, I'm planning a trip to Japan.", role="user")
mem.add_message("Great. Which cities?", role="assistant")
prompt = mem.build_prompt("What did I say about Japan?")
# prompt.messages -> list of {role, content} ready for any LLM API
# prompt.retrieved -> the signals pulled in (pass to post_process later)
# prompt.inspector -> debug data (stacks activated, retrieval scores)
After your LLM responds, feed the response back to the memory so it can learn:
response_text = your_llm(prompt.messages)
mem.post_process(prompt.retrieved, response_text)
mem.add_message("What did I say about Japan?", role="user")
mem.add_message(response_text, role="assistant")
If you would rather not write that loop by hand, wrap your provider client with with_memory(). The wrapper exposes the same SDK methods you already know. Memory reads happen before each call; writes happen after.
from anthropic import Anthropic
from streetai.adapters.anthropic import with_memory
client = with_memory(Anthropic(), memory_id="user_123")
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are helpful.",
    messages=[{"role": "user", "content": "What did I mention earlier?"}],
)
print(response.content[0].text)
from openai import OpenAI
from streetai.adapters.openai import with_memory
client = with_memory(OpenAI(), memory_id="user_123")
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What did I mention earlier?"}],
)
print(response.choices[0].message.content)
DeepSeek exposes an OpenAI-compatible API. Use the OpenAI adapter and point base_url at it.
import os
from openai import OpenAI
from streetai.adapters.openai import with_memory
deepseek = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)
client = with_memory(deepseek, memory_id="user_123")
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "What did I mention earlier?"}],
)
The same pattern works for Together, Anyscale, Groq, and any other OpenAI-compatible endpoint.
from google import genai
from streetai.adapters.gemini import with_memory
client = with_memory(genai.Client(api_key="..."), memory_id="user_123")
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What did I mention earlier?",
)
print(response.text)
Every adapter has an async equivalent. Pass an async client; use await. Memory operations are wrapped in asyncio.to_thread.
# Anthropic
from anthropic import AsyncAnthropic
from streetai.adapters.anthropic import with_memory_async
client = with_memory_async(AsyncAnthropic(), memory_id="user_123")
response = await client.messages.create(model="claude-sonnet-4-6",
    max_tokens=1024, messages=[{"role": "user", "content": "..."}])
# OpenAI (and DeepSeek via base_url)
from openai import AsyncOpenAI
from streetai.adapters.openai import with_memory_async
client = with_memory_async(AsyncOpenAI(), memory_id="user_123")
response = await client.chat.completions.create(model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}])
# Gemini (uses the existing client's .aio namespace internally)
from google import genai
from streetai.adapters.gemini import with_memory_async
client = with_memory_async(genai.Client(api_key="..."), memory_id="user_123")
response = await client.models.generate_content(model="gemini-2.0-flash",
    contents="...")
stream=True on any adapter raises NotImplementedError. The chunks would bypass the memory write step, so we block it explicitly rather than silently dropping the response.
Each memory_id is an isolated, persistent memory. Use one per user or per session. They never leak into each other. The memory is saved to the SQLite file you pass to the registry and reloads automatically next time you start the process. It survives restarts and works across processes that share the file.
registry = MemoryRegistry("./memory.db")
user_a = registry.get("alice")
user_b = registry.get("bob")
user_a.add_message("My favourite city is Kyoto", role="user")
user_b.add_message("My favourite city is Lisbon", role="user")
# Two completely separate memories.
# Querying user_a will never surface user_b's signals, and vice versa.
To wipe one memory entirely, for example on a "clear chat" action or an account deletion:
registry.reset("alice") # everything stored under that id is gone
The embedding model in RAM is shared across every memory in the process, so adding users does not multiply embedding cost.
Every add_message call returns the signals it created. Each signal carries a parent_turn_id that identifies the whole message. Save that id alongside the message in your own display database, then use it later when a user edits or deletes.
created = mem.add_message("I am vegetarian", role="user")
pid = created[0].parent_turn_id
mem.update_message(pid, "I am vegan") # in-place edit; keeps the same turn
mem.delete_message(pid) # remove the message entirely
The wrapper exposes a .memory property, so any Memory method works on the client. Each adapter call stores two signals (user, then assistant); read the parent_turn_id from client.memory.signals after the call.
resp = client.messages.create(model="claude-sonnet-4-6",
    max_tokens=512, messages=[{"role": "user", "content": "I am vegetarian"}])
user_pid = client.memory.signals[-2].parent_turn_id
client.memory.update_message(user_pid, "I am vegan")
client.memory.delete_message(user_pid)
The full pipeline, in order. Nothing magic happens here. Each piece is small enough to read in source.
An incoming message is split into sentence-sized pieces (between 30 and 400 characters, or 5 to 60 words). Triple-backtick code blocks stay atomic. Undersized chunks merge into their neighbours. Oversized ones split on comma or semicolon near the midpoint.
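The chunking rules above can be sketched in a few lines. This is a simplified illustration, not the library's implementation: it applies only the character bounds (not the word bounds), splits sentences with a naive regex, and the helper names `split_oversized`, `merge_undersized`, and `chunk` are hypothetical.

```python
import re

MIN_CHARS, MAX_CHARS = 30, 400   # character bounds quoted in the text
FENCE = "`" * 3                  # triple backtick, built indirectly for readability

def split_oversized(piece):
    """Split a too-long piece at the comma or semicolon nearest its midpoint."""
    if len(piece) <= MAX_CHARS:
        return [piece]
    cuts = [m.end() for m in re.finditer(r"[,;]", piece)]
    if not cuts:
        return [piece]           # nothing to split on: keep the piece whole
    mid = len(piece) // 2
    cut = min(cuts, key=lambda i: abs(i - mid))
    left, right = piece[:cut].strip(), piece[cut:].strip()
    if not left or not right:
        return [piece]
    return split_oversized(left) + split_oversized(right)

def merge_undersized(pieces):
    """Fold any piece shorter than MIN_CHARS into its neighbour."""
    merged = []
    for piece in pieces:
        if merged and (len(merged[-1]) < MIN_CHARS or len(piece) < MIN_CHARS):
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged

def chunk(text):
    """Return sentence-sized chunks; fenced code blocks are never split or merged."""
    pattern = "(" + re.escape(FENCE) + ".*?" + re.escape(FENCE) + ")"
    chunks = []
    # The capturing group makes re.split keep the code blocks in its output.
    for part in re.split(pattern, text, flags=re.DOTALL):
        if part.startswith(FENCE):
            chunks.append(part)
            continue
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", part) if s.strip()]
        pieces = [p for s in sentences for p in split_oversized(s)]
        chunks.extend(merge_undersized(pieces))
    return chunks
```

In this sketch a short piece with no prose neighbour (for example a lone "Ok." after a code block) stays as its own chunk, because merging it into the code block would break atomicity.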
Each chunk is embedded with all-MiniLM-L6-v2 through fastembed, producing a 384-dimensional normalised vector. Chunks from one message are batched into a single model call.
The chunk is compared (cosine similarity) against the centroid of every existing stack. If the best match crosses the stack threshold (default 0.55), the chunk joins that stack. Otherwise it starts a new stack.
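A minimal sketch of that assignment rule, in plain Python so it runs without dependencies. The names `cosine`, `centroid`, and `assign` are hypothetical, the centroid is assumed to be the plain mean of a stack's member vectors, and "crosses the threshold" is read as greater than or equal.

```python
import math

STACK_THRESHOLD = 0.55  # default quoted in the text

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors (an assumption)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def assign(chunk_vec, stacks, threshold=STACK_THRESHOLD):
    """Add chunk_vec to the best-matching stack or open a new one.

    stacks is a list of lists of vectors and is modified in place.
    Returns the index of the stack the chunk ended up in.
    """
    best_idx, best_sim = None, -1.0
    for idx, members in enumerate(stacks):
        sim = cosine(chunk_vec, centroid(members))
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    if best_idx is not None and best_sim >= threshold:
        stacks[best_idx].append(chunk_vec)
        return best_idx
    stacks.append([chunk_vec])
    return len(stacks) - 1
```

Raising the threshold produces more, tighter stacks; lowering it produces fewer, broader ones.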
When a query arrives, it is embedded the same way. A FAISS IndexFlatIP over stack centroids returns the top stacks (default 6) by inner product. This is the fast filter, not the final ranking.
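To make the first tier concrete, here is a dependency-free stand-in for what a flat inner-product index does: score every centroid against the query and keep the best few. It deliberately uses plain Python rather than FAISS, and `top_stacks` is a hypothetical name.

```python
import heapq

TOP_STACKS = 6  # default quoted in the text

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_stacks(query_vec, centroids, k=TOP_STACKS):
    """Return the indices of the k centroids with the largest inner product, best first."""
    scored = ((inner_product(query_vec, c), i) for i, c in enumerate(centroids))
    return [i for _, i in heapq.nlargest(k, scored)]
```

For unit-length vectors the inner product equals cosine similarity, which is why an inner-product index works for normalised embeddings.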
Inside each surviving stack, every signal is scored by max(effective_weight, revival_floor) × similarity. Signals that clear the activation threshold (default 0.15) join the thinking space, capped at 25 signals total.
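The second tier can be sketched as a filter and a cap. The `Candidate` type and `thinking_space` function are hypothetical, the floor is taken as activation_threshold / revival_similarity as defined in the revival description, and keeping the highest-scoring signals when the cap is hit is an assumption.

```python
from dataclasses import dataclass

ACTIVATION_THRESHOLD = 0.15
REVIVAL_SIMILARITY = 0.45
MAX_THINKING_SPACE = 25

@dataclass
class Candidate:
    text: str
    effective_weight: float  # base weight after decay
    similarity: float        # cosine similarity to the query

def thinking_space(candidates,
                   activation=ACTIVATION_THRESHOLD,
                   revival=REVIVAL_SIMILARITY,
                   cap=MAX_THINKING_SPACE):
    """Score candidates, drop those under the activation threshold, keep the top `cap`."""
    floor = activation / revival if revival > 0 else 0.0
    scored = [(max(c.effective_weight, floor) * c.similarity, c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= activation]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept[:cap]]
```

A fresh signal with a modest match and a faded signal with a strong match can both make the cut; a faded signal with a weak match cannot.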
The last few messages (default the last 3, all their chunks) are reassembled into their original form and sent to the LLM verbatim, alongside the retrieved long-term signals. Recent context is never replaced by retrieval.
The final prompt is a small list: a retrieved-context block (if any), the recency window, and the new user message. The total size is determined by what is relevant, not by how long the conversation has been.
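The assembly step might look roughly like this. It is a sketch under stated assumptions: the function name `assemble_prompt`, the use of a system message for the retrieved block, and the "Relevant memory:" header are all illustrative and not taken from the library.

```python
RECENCY_TURNS = 3  # default quoted in the text

def assemble_prompt(history, retrieved, user_message, recency_turns=RECENCY_TURNS):
    """Build the message list: retrieved block (if any), recency window, new message.

    history   -- list of {"role", "content"} dicts, oldest first
    retrieved -- list of retrieved signal texts
    """
    messages = []
    if retrieved:
        block = "\n".join("- " + text for text in retrieved)
        messages.append({"role": "system", "content": "Relevant memory:\n" + block})
    # Guard against recency_turns == 0, because history[-0:] is the whole list.
    window = history[-recency_turns:] if recency_turns > 0 else []
    messages.extend(window)
    messages.append({"role": "user", "content": user_message})
    return messages
```

With the same retrieved signals, a 10-message history and a 1,000-message history produce a prompt with the same number of entries, which is the near-flat behaviour described above.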
After the LLM replies, each retrieved signal is scored against the response. Strong contributors get boosted (base weight times 1.10, capped at 8.0). Weak ones get demoted (times 0.95). Touched stacks recompute their centroids.
Two forces shape what gets surfaced at retrieval time. They work together; you tune each separately.
A signal's age is the number of turns since it was last used. The effective weight is
base_weight × exp(-rate × (current_turn - last_used_turn)).
With the default 50-turn half-life, a fresh signal's effective weight halves after 50 turns without use, quarters after 100, and falls below the death threshold of 0.05 after roughly 216. An idle gap (no new turns) does not advance decay; a user returning weeks later finds memory where they left it. Every time a signal is pulled into the thinking space, its clock resets to the current turn. Useful signals stay sharp through use; unused ones quietly fade.
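The numbers in that paragraph follow directly from the formula. The sketch below reproduces them; the function names are hypothetical and a fresh signal is assumed to start with a base weight of 1.0, which is what the 216-turn figure implies.

```python
import math

HALF_LIFE_TURNS = 50
DEATH_THRESHOLD = 0.05

def decay_rate(half_life_turns):
    """Per-turn rate that halves the weight every half_life_turns turns."""
    return math.log(2) / half_life_turns

def effective_weight(base_weight, current_turn, last_used_turn, rate):
    """base_weight x exp(-rate x turns since last use)."""
    return base_weight * math.exp(-rate * (current_turn - last_used_turn))

def turns_until_dead(base_weight, rate, death_threshold=DEATH_THRESHOLD):
    """Unused turns after which the effective weight reaches the death threshold."""
    return math.log(base_weight / death_threshold) / rate

rate = decay_rate(HALF_LIFE_TURNS)
half = effective_weight(1.0, 50, 0, rate)      # about 0.5
quarter = effective_weight(1.0, 100, 0, rate)  # about 0.25
dead_after = turns_until_dead(1.0, rate)       # a little over 216 turns
```

Because the clock counts turns rather than wall time, `current_turn - last_used_turn` only grows while the conversation is active.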
A faded signal would normally be too weak to surface. But the right cue should still be able to bring an old memory back. StreetAI Memory does this with a weight floor in the retrieval score:
score = max(effective_weight, floor) × similarity,
where floor = activation_threshold / revival_similarity. With the defaults (0.15 and 0.45), the floor is about 0.33, which means a faded signal needs a similarity of at least 0.45 to clear the gate. Strong matches revive a dead signal; weak ones do not. Set revival_similarity = 0 in your Config to disable revival entirely.
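A quick way to see what the floor buys you is to ask: for a given effective weight, how similar must the query be for the signal to surface? The helpers below are hypothetical, and treating a disabled revival as a floor of zero is an assumption.

```python
import math

def revival_floor(activation_threshold=0.15, revival_similarity=0.45):
    """Weight floor used in the retrieval score; 0.0 when revival is disabled."""
    if revival_similarity <= 0:
        return 0.0
    return activation_threshold / revival_similarity

def min_similarity_to_surface(effective_weight,
                              activation_threshold=0.15,
                              revival_similarity=0.45):
    """Smallest similarity for which max(weight, floor) x similarity reaches the threshold."""
    weight = max(effective_weight, revival_floor(activation_threshold, revival_similarity))
    if weight <= 0:
        return math.inf  # a zero-weight signal with no floor can never surface
    return activation_threshold / weight
```

With the defaults, a fully fresh signal (weight 1.0) surfaces at a similarity of 0.15, while any signal that has decayed below the floor needs 0.45, no matter how far it has faded.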
After each LLM response, post_process scores every retrieved signal against the response (cosine similarity). Signals that contributed strongly (default threshold 0.55) have their base weight multiplied by 1.10, capped at 8.0. Signals that did not (default threshold 0.20) are multiplied by 0.95. Anything in between is left untouched. Over many interactions, the signals that consistently help drift toward higher base weight; noise drifts lower and dies on its own.
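The update rule is small enough to write out. This sketch uses the default values from the text; `update_base_weight` is a hypothetical name, and the boundary cases (exactly 0.55 or exactly 0.20) are treated as "left untouched", which is an assumption.

```python
BOOST_FACTOR = 1.10
DEMOTE_FACTOR = 0.95
HIGH_RELEVANCE = 0.55
LOW_RELEVANCE = 0.20
MAX_BASE_WEIGHT = 8.0

def update_base_weight(base_weight, relevance):
    """Return the new base weight given the signal-vs-response similarity."""
    if relevance > HIGH_RELEVANCE:
        return min(base_weight * BOOST_FACTOR, MAX_BASE_WEIGHT)  # boost, capped
    if relevance < LOW_RELEVANCE:
        return base_weight * DEMOTE_FACTOR                       # demote
    return base_weight                                           # in between: unchanged
```

Starting from a base weight of 1.0, it takes 22 consecutive boosts to hit the 8.0 cap, so a single lucky match cannot dominate retrieval.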
All knobs live on the Config dataclass. Pass one to the registry to override the defaults for every Memory it creates. Most apps never need to change these.
import math
from streetai import MemoryRegistry, Config
cfg = Config(
    recency_turns=5, # last 5 messages verbatim (default 3)
    decay_rate=math.log(2)/100, # 100-turn half-life (default 50; decay is per turn)
    stack_threshold=0.65, # tighter stack assignment (default 0.55)
    activation_threshold=0.1, # min score for a signal to surface (default 0.15)
    revival_similarity=0.45, # match needed to revive a faded signal (0 disables)
)
registry = MemoryRegistry("./memory.db", config=cfg)
| Field | Default | What it does |
|---|---|---|
| stack_threshold | 0.55 | Cosine similarity required for a new chunk to join an existing stack instead of starting its own. |
| decay_rate | ln(2)/50 | Per-turn decay rate. The default gives a 50-turn half-life. Pass math.log(2)/N for an N-turn half-life. |
| death_threshold | 0.05 | Effective weight below this is considered dead. Dead signals are skipped in retrieval unless revival rescues them. |
| top_stacks | 6 | How many stacks Tier 1 returns. Higher means broader retrieval, slightly more work in Tier 2. |
| activation_threshold | 0.15 | Minimum score, max(eff, floor) × sim, for a signal to enter the thinking space. |
| max_thinking_space | 25 | Hard cap on the number of retrieved signals sent to the LLM per turn. |
| revival_similarity | 0.45 | Similarity required for a dead signal to come back. Implemented as a weight floor at activation_threshold / revival_similarity. Set to 0 to disable. |
| recency_turns | 3 | How many of the most recent messages are sent to the LLM verbatim alongside the retrieved signals. |
| boost_factor | 1.10 | Multiplier applied to base weight when a signal contributed strongly to the response. |
| demote_factor | 0.95 | Multiplier applied to base weight when a signal did not contribute. |
| high_relevance | 0.55 | Cosine similarity (signal vs response) above which a signal is boosted. |
| low_relevance | 0.20 | Cosine similarity below which a signal is demoted. |
| max_base_weight | 8.0 | Hard cap on base weight. Prevents runaway reinforcement of a single signal. |
The current release is alpha (0.2.0). Here is the honest list of what is not in the box yet.
stream=True raises NotImplementedError on every adapter. Streaming bypasses the memory write step, so it is blocked rather than silently broken. Support is on the roadmap.
Chunking sizes, the embedding model, and the default thresholds were validated on English conversations. Other languages may need tuning of the Config and possibly a different encoder later.
The embedding model is fixed to all-MiniLM-L6-v2 through fastembed. A pluggable encoder interface, so you can supply a faster or hosted embedder, is planned.
Behaviour at hundreds of thousands of signals or after months of continuous use has not been benchmarked. The architecture supports it, but treat large-scale claims as theoretical until tested.