Architecture
How Cortex is built and why it's designed this way.
High-Level Overview
┌──────────────────────────────────────────────────────────────┐
│ SDKs │
│ TypeScript (@cortex/memory) │ Python (cortex-memory) │
│ MCP Server (@cortex/mcp) │ REST API │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ API Gateway │
│ Authentication │ Rate Limiting │ Request Routing │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Processing Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Entity │ │ Temporal │ │ Commitment │ │
│ │ Extractor │ │ Resolver │ │ Extractor │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Importance │ │ Profile │ │ Consolidation│ │
│ │ Scorer │ │ Generator │ │ Engine │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ D1 (SQLite) │ │ Vectorize │ │
│ │ Structured Data │ │ Embeddings │ │
│ └─────────────────┘ └─────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Infrastructure
Cortex runs on Cloudflare Workers - serverless compute at the edge.
Why Cloudflare?
- Global edge: Requests handled close to users (~20ms latency)
- Serverless: No servers to manage
- D1 Database: SQLite at the edge with automatic replication
- Vectorize: Native vector search without external services
- Queues: Async processing without managing infrastructure
Stack
| Component | Technology |
|---|---|
| Compute | Cloudflare Workers |
| Database | Cloudflare D1 (SQLite) |
| Vector DB | Cloudflare Vectorize |
| Queue | Cloudflare Queues |
| Embeddings | @cf/baai/bge-base-en-v1.5 |
| LLM (extraction) | Llama 3.1-8B |
| LLM (critical) | GPT-4o-mini |
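Inside a Worker, these pieces surface as environment bindings. A minimal sketch of what that interface might look like (binding names are illustrative assumptions; the types come from @cloudflare/workers-types):

```ts
// Hypothetical Worker bindings for the stack above (names are illustrative).
interface Env {
  DB: D1Database;            // Cloudflare D1: structured data
  VECTORIZE: VectorizeIndex; // Cloudflare Vectorize: embeddings
  MEMORY_QUEUE: Queue;       // Cloudflare Queues: async processing
  AI: Ai;                    // Workers AI: embeddings + fast LLM
}
```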
Data Model
Core Tables
memories
├── id (primary key)
├── user_id (foreign key)
├── container_tag (multi-tenant isolation)
├── content (the memory text)
├── source (email, chat, manual, etc.)
├── memory_type (episodic, semantic)
├── importance_score (0-1)
├── processing_status
├── valid_from / valid_to (temporal)
├── supersedes / superseded_by
└── created_at / updated_at
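For illustration, the memories table expressed as a TypeScript row type (a sketch inferred from the diagram above; exact column types are assumptions):

```ts
// Row shape inferred from the memories table diagram (types are assumptions).
interface MemoryRow {
  id: string;
  user_id: string;
  container_tag: string | null; // multi-tenant isolation
  content: string;              // the memory text
  source: string;               // email, chat, manual, etc.
  memory_type: "episodic" | "semantic";
  importance_score: number;     // 0-1
  processing_status: string;
  valid_from: string | null;    // temporal validity window
  valid_to: string | null;
  supersedes: string | null;    // versioning links
  superseded_by: string | null;
  created_at: string;
  updated_at: string;
}
```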
entities
├── id (primary key)
├── user_id
├── name ("Sarah")
├── entity_type (person, place, org, etc.)
├── attributes (JSON: title, email, etc.)
├── importance_score
└── created_at / updated_at
entity_relationships
├── source_entity_id
├── target_entity_id
├── relationship_type (works_with, reports_to, etc.)
├── confidence
├── valid_from / valid_to
└── source_memory_ids (provenance)
beliefs
├── id
├── user_id
├── content ("User prefers dark mode")
├── category (preference, characteristic, etc.)
├── confidence (0-1)
├── source_memory_ids
└── valid_from / valid_to
commitments
├── id
├── user_id
├── title
├── description
├── status (pending, completed, overdue)
├── due_date
├── related_entity_id
└── source_memory_id
Vector Storage
Memories are embedded using BGE embeddings (768 dimensions) and stored in Vectorize for semantic search.
namespace: cortex-memories
id: mem_abc123
vector: [0.123, -0.456, ...]
metadata: { userId, containerTag, source }
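A sketch of how a memory might be embedded and indexed from a Worker, using the Workers AI and Vectorize bindings assumed above:

```ts
// Embed the memory text and upsert it into Vectorize (illustrative sketch).
async function indexMemory(env: Env, memory: MemoryRow): Promise<void> {
  // @cf/baai/bge-base-en-v1.5 returns 768-dimensional embeddings.
  const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [memory.content],
  });

  await env.VECTORIZE.upsert([
    {
      id: memory.id,
      values: data[0],
      metadata: {
        userId: memory.user_id,
        containerTag: memory.container_tag ?? "",
        source: memory.source,
      },
    },
  ]);
}
```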
Processing Pipeline
When you add a memory, it goes through multiple processing stages:
Memory Created
│
▼
┌─────────────┐
│ Queued │ ← Immediately returns, processing is async
└─────────────┘
│
▼
┌─────────────┐
│ Extracting │ ← Extract metadata, dates, entities
└─────────────┘
│
▼
┌─────────────┐
│ Chunking │ ← Split long content if needed
└─────────────┘
│
▼
┌─────────────┐
│ Embedding │ ← Generate vector embeddings
└─────────────┘
│
▼
┌─────────────┐
│ Indexing │ ← Store in Vectorize
└─────────────┘
│
▼
┌─────────────┐
│ Temporal │ ← Extract event dates, check conflicts
└─────────────┘
│
▼
┌─────────────┐
│ Entities │ ← Extract and link entities
└─────────────┘
│
▼
┌─────────────┐
│ Importance │ ← Score importance
└─────────────┘
│
▼
┌─────────────┐
│Commitments │ ← Extract obligations
└─────────────┘
│
▼
┌─────────────┐
│ Done │
└─────────────┘
Async Processing
Processing happens asynchronously using Cloudflare Queues. This means:
- POST /v3/memories returns immediately
- Processing happens in the background
- Full data available within seconds
For synchronous needs, poll processingStatus or use webhooks.
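The stages above surface as a processingStatus field you can poll. A hedged sketch (the status values mirror the diagram; the GET endpoint, base URL, and response shape are assumptions for illustration):

```ts
// Status values mirroring the pipeline stages above.
type ProcessingStatus =
  | "queued" | "extracting" | "chunking" | "embedding" | "indexing"
  | "temporal" | "entities" | "importance" | "commitments" | "done";

// Poll until processing finishes (endpoint and response shape are assumed).
const BASE_URL = "https://api.example.com"; // placeholder base URL

async function waitForMemory(id: string, apiKey: string): Promise<void> {
  while (true) {
    const res = await fetch(`${BASE_URL}/v3/memories/${id}`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    const { processingStatus } = (await res.json()) as {
      processingStatus: ProcessingStatus;
    };
    if (processingStatus === "done") return;
    await new Promise((r) => setTimeout(r, 500)); // wait before re-polling
  }
}
```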
Search Architecture
Cortex uses hybrid search - combining semantic and keyword search for best results.
Hybrid Search Flow
Query: "project deadlines"
│
▼
┌───────────────────┐
│ Generate Query │ ← Create embedding for query
│ Embedding │
└───────────────────┘
│
├──────────────────────┐
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Vector Search │ │ Keyword Search│
│ (Vectorize) │ │ (D1 FTS) │
└───────────────┘ └───────────────┘
│ │
└──────────┬───────────┘
▼
┌─────────────────┐
│ Fusion Scoring │ ← Combine and rerank
└─────────────────┘
│
▼
┌─────────────────┐
│ Importance Boost│ ← Boost high-importance
└─────────────────┘
│
▼
Results
Reranking
Results are reranked using:
- Semantic score: Cosine similarity of embeddings
- Keyword score: FTS relevance
- Importance boost: Higher-importance memories are ranked higher
- Recency decay: Recent memories slightly preferred
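Putting the flow together, a simplified sketch (the fusion weights, the FTS table name, and the score normalization are illustrative assumptions, not production values):

```ts
// Hybrid search sketch: vector + keyword, fused and importance-boosted.
async function hybridSearch(env: Env, userId: string, query: string) {
  // 1. Embed the query.
  const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [query],
  });

  // 2. Run both searches in parallel.
  const [vector, keyword] = await Promise.all([
    env.VECTORIZE.query(data[0], { topK: 20, filter: { userId } }),
    // The FTS5 table name `memories_fts` is an assumption.
    env.DB.prepare(
      "SELECT id, rank FROM memories_fts WHERE memories_fts MATCH ?1 LIMIT 20"
    ).bind(query).all<{ id: string; rank: number }>(),
  ]);

  // 3. Fuse scores (weights are illustrative).
  const scores = new Map<string, number>();
  for (const m of vector.matches) {
    scores.set(m.id, 0.7 * m.score);
  }
  for (const row of keyword.results) {
    // FTS5 rank: more negative = more relevant; normalize crudely here.
    scores.set(row.id, (scores.get(row.id) ?? 0) + 0.3 * -row.rank);
  }

  // 4. Sort descending; an importance boost would be applied at this step.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}
```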
Multi-Tenancy
Cortex supports multi-tenant isolation through containers.
How It Works
Every query is scoped by:
- user_id: From the JWT token
- container_tag: Optional namespace within a user
-- Every query includes these filters
SELECT * FROM memories
WHERE user_id = ? AND container_tag = ?
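Through the D1 binding, that scoping might look like this (a sketch using the illustrative Env from earlier):

```ts
// Every read is scoped to the caller's user_id and container_tag.
async function listMemories(env: Env, userId: string, containerTag: string) {
  return env.DB.prepare(
    "SELECT * FROM memories WHERE user_id = ?1 AND container_tag = ?2"
  ).bind(userId, containerTag).all();
}
```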
Use Cases
- Per-user isolation: Each user's memories are private
- Per-project isolation: Separate memories by project/workspace
- Per-environment isolation: Separate test and production
LLM Strategy
Cortex takes a tiered approach to LLM usage, matching model cost to task difficulty:
Fast Model (Llama 3.1-8B)
Used for:
- Entity extraction
- Basic classification
- Sentiment analysis
- Importance scoring
Characteristics:
- Very fast (~100ms)
- Good enough for most extractions
- Low cost
Quality Model (GPT-4o-mini)
Used for:
- Profile generation
- Semantic consolidation
- Conflict resolution
- Complex reasoning
Characteristics:
- Higher quality output
- Slower (~500ms)
- Used sparingly
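A sketch of how this routing might look (the task names and the Workers AI model id for Llama are illustrative assumptions; GPT-4o-mini would be called via the OpenAI API):

```ts
// Route each task to the cheapest model that handles it well (illustrative).
type Task =
  | "entity_extraction" | "classification" | "sentiment" | "importance"
  | "profile_generation" | "consolidation" | "conflict_resolution";

const FAST_MODEL = "@cf/meta/llama-3.1-8b-instruct"; // Workers AI model id
const QUALITY_MODEL = "gpt-4o-mini";                 // via the OpenAI API

function modelFor(task: Task): string {
  switch (task) {
    case "profile_generation":
    case "consolidation":
    case "conflict_resolution":
      return QUALITY_MODEL;
    default:
      return FAST_MODEL;
  }
}
```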
Fallback Strategy
If LLM calls fail:
- Retry with exponential backoff
- Fall back to simpler extraction
- Never block the pipeline
- Log for manual review
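A minimal sketch of that fallback behavior (retry counts and delays are assumptions):

```ts
// Retry an LLM call with exponential backoff; on exhaustion, fall back to a
// simpler extractor so the pipeline never blocks (parameters are illustrative).
async function extractWithFallback<T>(
  llmCall: () => Promise<T>,
  simpleFallback: () => T,
  maxAttempts = 3
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await llmCall();
    } catch (err) {
      console.error("LLM call failed, attempt", attempt + 1, err); // for review
      // Exponential backoff: 250ms, 500ms, 1s, ...
      await new Promise((r) => setTimeout(r, 250 * 2 ** attempt));
    }
  }
  return simpleFallback(); // degrade gracefully rather than block
}
```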
Security
Authentication
- JWT tokens for API access
- API keys stored hashed (bcrypt)
- Keys can be revoked instantly
Data Isolation
- Row-level security via user_id
- Container isolation via container_tag
- No cross-user data access possible
Encryption
- TLS 1.3 for all API traffic
- Data encrypted at rest in D1
- API keys never logged
Scalability
Current Limits
| Metric | Limit |
|---|---|
| Memories per user | 1,000,000 |
| Entities per user | 50,000 |
| Memory size | 32KB |
| Batch size | 100 |
| Requests/second | 1,000 |
Why These Work
- D1 handles millions of rows efficiently
- Vectorize scales to millions of vectors
- Edge compute eliminates bottlenecks
- Async processing absorbs spikes