# KOTA Custom Database Technical Architecture
## System Overview
The KOTA Database (KotaDB) is a purpose-built storage engine designed specifically for distributed cognition between human and AI. It combines the best aspects of document stores, graph databases, and vector databases while maintaining compatibility with KOTA's existing file-based architecture.
## Design Philosophy
- Memory as a Graph, Not a Hierarchy: Documents are nodes in a knowledge graph
- Time as a First-Class Dimension: All data is temporal by default
- Semantic Understanding Built-In: Vector embeddings for every document
- Human-Readable Storage: Markdown files remain the source of truth
- AI-Native Query Language: Designed for LLM interaction patterns
## Core Architecture Components
### 1. Storage Layer
```
┌─────────────────────────────────────────────────────────────┐
│                        Storage Engine                        │
├─────────────────┬────────────────┬──────────────────────────┤
│  Page Manager   │  Write-Ahead   │   Memory-Mapped Files    │
│  (4KB pages)    │  Log (WAL)     │   (hot data cache)       │
├─────────────────┴────────────────┴──────────────────────────┤
│                      Compression Layer                       │
│               (ZSTD with domain dictionaries)                │
├──────────────────────────────────────────────────────────────┤
│                     Filesystem Interface                     │
│              (Markdown files + Binary indices)               │
└──────────────────────────────────────────────────────────────┘
```
#### Page Manager
- Fixed 4KB pages: Matches OS page size for optimal I/O
- Copy-on-Write: Enables versioning without duplication
- Free space management: Bitmap allocation for efficiency
- Checksums: CRC32C for corruption detection
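
As a minimal sketch of the page concepts above, the following shows a 4KB page whose CRC32C checksum is verified before the page is trusted after a read. The field names, the header-size constant, and the injected `crc32c` function are illustrative assumptions, not the actual KotaDB page format.
```rust
const PAGE_SIZE: usize = 4096;   // fixed page size, matching the OS page size
const HEADER_BYTES: usize = 24;  // assumed: page_id + lsn + checksum + padding

struct Page {
    page_id: u64,     // logical page number
    lsn: u64,         // WAL sequence number of the last mutation (for recovery)
    checksum: u32,    // CRC32C over `body`
    body: Box<[u8]>,  // page payload
}

impl Page {
    fn empty(page_id: u64) -> Self {
        Self { page_id, lsn: 0, checksum: 0, body: vec![0; PAGE_SIZE - HEADER_BYTES].into() }
    }

    /// Recompute the checksum and compare it to the stored value before
    /// trusting a page read back from disk.
    fn is_intact(&self, crc32c: impl Fn(&[u8]) -> u32) -> bool {
        crc32c(&self.body) == self.checksum
    }
}
```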
#### Write-Ahead Log (WAL)
- Append-only design: Sequential writes for performance
- Group commit: Batch multiple transactions
- Checkpoint mechanism: Periodic state snapshots
- Recovery protocol: Fast startup after crashes
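
A sketch of the append path with group commit, assuming a simple length-prefixed record layout with a CRC32C per record; the record format and type names are illustrative, not KotaDB's actual WAL encoding.
```rust
use std::fs::{File, OpenOptions};
use std::io::{BufWriter, Write};

/// Append-only WAL writer: records are only ever appended, so all disk
/// writes are sequential.
struct WalWriter {
    file: BufWriter<File>,
}

impl WalWriter {
    fn open(path: &str) -> std::io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file: BufWriter::new(file) })
    }

    /// Group commit: append a batch of records, then flush and fsync once
    /// for the whole batch instead of once per record.
    fn append_batch(
        &mut self,
        records: &[Vec<u8>],
        crc32c: impl Fn(&[u8]) -> u32,
    ) -> std::io::Result<()> {
        for payload in records {
            self.file.write_all(&(payload.len() as u32).to_le_bytes())?;
            self.file.write_all(&crc32c(payload).to_le_bytes())?;
            self.file.write_all(payload)?;
        }
        self.file.flush()?;
        self.file.get_ref().sync_data() // one fsync per batch, not per record
    }
}
```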
#### Compression Strategy
- Domain-specific dictionaries:
  - Markdown syntax patterns
  - YAML frontmatter structures
  - Common tag vocabularies
- Adaptive compression levels:
  - Hot data: LZ4 (fast)
  - Warm data: ZSTD level 3
  - Cold data: ZSTD level 19
- Estimated ratios: 3-5x for typical KOTA content
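
A small sketch of the temperature-to-codec mapping in the list above; the access-recency thresholds in the comments are illustrative assumptions rather than tuned defaults.
```rust
/// Data temperature drives the codec choice.
enum Temperature {
    Hot,   // e.g. accessed within the last day
    Warm,  // e.g. accessed within the last month
    Cold,  // everything older
}

enum Codec {
    Lz4,
    Zstd { level: i32 },
}

fn pick_codec(temp: Temperature) -> Codec {
    match temp {
        Temperature::Hot => Codec::Lz4,                 // favor speed
        Temperature::Warm => Codec::Zstd { level: 3 },  // balanced
        Temperature::Cold => Codec::Zstd { level: 19 }, // favor ratio
    }
}
```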
### 2. Index Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                        Index Manager                         │
├──────────────┬───────────────┬───────────────┬──────────────┤
│   Primary    │   Full-Text   │     Graph     │   Semantic   │
│  (B+ Tree)   │   (Trigram)   │  (Adjacency)  │    (HNSW)    │
├──────────────┼───────────────┼───────────────┼──────────────┤
│   Temporal   │      Tag      │   Metadata    │   Spatial    │
│ (Time-Series)│   (Bitmap)    │    (Hash)     │   (R-Tree)   │
└──────────────┴───────────────┴───────────────┴──────────────┘
```
#### Primary Index (B+ Tree)
- Key: File path (for filesystem compatibility)
- Value: Document ID + metadata
- Features: Range queries, ordered traversal
- Performance: O(log n) lookups
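
As a rough sketch, std's `BTreeMap` can stand in for the on-disk B+ tree to show the path-keyed lookups and ordered range scans described above; representing `DocumentId` as a bare `u128` is a simplifying assumption.
```rust
use std::collections::BTreeMap;

type DocumentId = u128;

struct PrimaryIndex {
    by_path: BTreeMap<String, DocumentId>, // file path → document id
}

impl PrimaryIndex {
    /// O(log n) point lookup by path.
    fn get(&self, path: &str) -> Option<DocumentId> {
        self.by_path.get(path).copied()
    }

    /// Ordered range scan, e.g. every document under "projects/".
    fn scan_prefix<'a>(
        &'a self,
        prefix: &'a str,
    ) -> impl Iterator<Item = (&'a String, &'a DocumentId)> + 'a {
        self.by_path
            .range(prefix.to_string()..)
            .take_while(move |(path, _)| path.starts_with(prefix))
    }
}
```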
#### Full-Text Index (Trigram)
- Trigram extraction: "hello" → ["hel", "ell", "llo"]
- Inverted index: Trigram → Document IDs (RoaringBitmap)
- Fuzzy matching: Levenshtein distance calculation
- Position tracking: For snippet extraction
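
A minimal trigram extractor matching the "hello" example above; in the full index each trigram would then key a RoaringBitmap of document IDs.
```rust
/// Slide a 3-character window over the lowercased term.
fn trigrams(term: &str) -> Vec<String> {
    let chars: Vec<char> = term.to_lowercase().chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

fn main() {
    assert_eq!(trigrams("hello"), vec!["hel", "ell", "llo"]);
}
```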
#### Graph Index (Adjacency List)
- Forward edges: Document → Related documents
- Backward edges: Document ← Referencing documents
- Edge metadata: Relationship type, strength, timestamp
- Traversal optimization: Bloom filters for existence checks
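
A sketch of the forward/backward adjacency structure with per-edge metadata; the field names and the `link` helper are illustrative assumptions, not the planned layout.
```rust
use std::collections::HashMap;

type DocumentId = u128;

struct Edge {
    to: DocumentId,
    kind: String,    // relationship type, e.g. "implements"
    strength: f32,
    created_at: u64, // timestamp
}

#[derive(Default)]
struct GraphIndex {
    forward: HashMap<DocumentId, Vec<Edge>>,        // Document → related documents
    backward: HashMap<DocumentId, Vec<DocumentId>>, // Document ← referencing documents
}

impl GraphIndex {
    /// Maintain both directions so backlinks never require a full scan.
    fn link(&mut self, from: DocumentId, edge: Edge) {
        self.backward.entry(edge.to).or_default().push(from);
        self.forward.entry(from).or_default().push(edge);
    }

    fn outbound(&self, doc: DocumentId) -> &[Edge] {
        self.forward.get(&doc).map(Vec::as_slice).unwrap_or(&[])
    }
}
```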
#### Semantic Index (HNSW)
- Hierarchical Navigable Small World: Fast approximate search
- Vector dimensions: 384 (all-MiniLM-L6-v2) or 1536 (OpenAI)
- Distance metrics: Cosine similarity, L2 distance
- Performance: Sub-linear search time
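
HNSW itself is too involved for a short example, but the scoring step it relies on is simple; this is a plain cosine-similarity function over embedding vectors such as the 384-dimensional ones mentioned above.
```rust
/// Cosine similarity between two embedding vectors of the same dimension.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must share a dimension (e.g. 384)");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}
```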
### 3. Query Engine
```
┌─────────────────────────────────────────────────────────────┐
│                       Query Interface                        │
│                     (Natural Language)                       │
├─────────────────────────────────────────────────────────────┤
│                        Query Parser                          │
│                 (KQL - KOTA Query Language)                  │
├─────────────────────────────────────────────────────────────┤
│                        Query Planner                         │
│                  (Cost-based optimization)                   │
├─────────────────────────────────────────────────────────────┤
│                       Query Executor                         │
│                    (Parallel, streaming)                     │
├─────────────────────────────────────────────────────────────┤
│                      Result Processor                        │
│             (Ranking, aggregation, projection)               │
└─────────────────────────────────────────────────────────────┘
```
#### KOTA Query Language (KQL)
```
// Natural language queries
"meetings about rust programming last week"
"documents similar to distributed cognition"
"show my productivity patterns"

// Structured queries
{
  "type": "semantic",
  "query": "consciousness evolution",
  "filters": {
    "created": { "$gte": "2025-01-01" },
    "tags": { "$contains": "philosophy" }
  },
  "limit": 10
}

// Graph traversal
{
  "type": "graph",
  "start": "projects/kota-ai/README.md",
  "depth": 3,
  "direction": "outbound",
  "edge_filter": { "type": "implements" }
}
```
#### Query Planning
- Parse: Convert natural language to AST
- Analyze: Determine required indices
- Optimize: Choose best execution path
- Estimate: Predict cost and result size
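
A toy illustration of the cost-based step above: among candidate plans, pick the one whose index scan promises the fewest rows. The plan-node shapes and the cost heuristic are assumptions made for illustration only.
```rust
enum PlanNode {
    IndexScan { index: &'static str, estimated_rows: usize },
    Filter { predicate: String, input: Box<PlanNode> },
    Limit { n: usize, input: Box<PlanNode> },
}

/// Toy cost model: the cost of a plan is the estimated row count of its
/// underlying index scan (more selective index → cheaper plan).
fn estimated_cost(node: &PlanNode) -> usize {
    match node {
        PlanNode::IndexScan { estimated_rows, .. } => *estimated_rows,
        PlanNode::Filter { input, .. } | PlanNode::Limit { input, .. } => estimated_cost(input),
    }
}

fn choose_plan(candidates: Vec<PlanNode>) -> Option<PlanNode> {
    candidates.into_iter().min_by_key(estimated_cost)
}
```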
#### Execution Strategy
- Index selection: Use most selective index first
- Parallel execution: Split independent subqueries
- Pipeline processing: Stream results as available
- Memory budget: Spill to disk if needed
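
A minimal sketch of the streaming pipeline idea: candidate IDs flow from the most selective index through lazy filters, so results are yielded as soon as they are found rather than after a full scan. The function shape is an assumption for illustration.
```rust
/// Lazily filter a stream of candidate document IDs and stop at `limit`.
fn execute<'a>(
    candidates: impl Iterator<Item = u128> + 'a,
    passes_filters: impl Fn(&u128) -> bool + 'a,
    limit: usize,
) -> impl Iterator<Item = u128> + 'a {
    candidates.filter(move |id| passes_filters(id)).take(limit)
}
```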
### 4. Transaction Management
```
┌─────────────────────────────────────────────────────────────┐
│                     Transaction Manager                      │
├─────────────────┬────────────────┬──────────────────────────┤
│      MVCC       │  Lock Manager  │    Deadlock Detector     │
│ (Multi-Version) │  (Row-level)   │    (Wait-for graph)      │
└─────────────────┴────────────────┴──────────────────────────┘
```
#### MVCC Implementation
- Version chains: Each document has version history
- Snapshot isolation: Consistent reads
- Garbage collection: Clean old versions
- Read-write separation: No read locks needed
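
A sketch of a per-document version chain and the visibility rule that snapshot isolation implies: a reader sees the newest version committed at or before its snapshot and not yet superseded from its point of view. Transaction-ID-based visibility as shown here is a common MVCC scheme; the field names are assumptions rather than KotaDB's actual records.
```rust
type TxnId = u64;

struct Version {
    created_by: TxnId,         // transaction that wrote this version
    deleted_by: Option<TxnId>, // set when a later transaction overwrites or deletes it
    content: String,
}

struct VersionChain {
    versions: Vec<Version>, // newest last
}

impl VersionChain {
    /// No read locks: each reader just picks the version visible to its snapshot.
    fn visible(&self, snapshot_txn: TxnId) -> Option<&Version> {
        self.versions.iter().rev().find(|v| {
            v.created_by <= snapshot_txn
                && v.deleted_by.map_or(true, |d| d > snapshot_txn)
        })
    }
}
```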
### 5. Consciousness Integration
```
┌─────────────────────────────────────────────────────────────┐
│                   Consciousness Interface                    │
├──────────────┬────────────────┬─────────────────────────────┤
│   Session    │    Insight     │           Memory            │
│   Tracking   │   Recording    │         Compression         │
├──────────────┼────────────────┼─────────────────────────────┤
│   Trigger    │    Pattern     │          Narrative          │
│   Monitor    │   Detection    │         Generation          │
└──────────────┴────────────────┴─────────────────────────────┘
```
#### Direct Integration Benefits
- Real-time context: No file scanning needed
- Pattern detection: Built-in analytics
- Memory optimization: Compression-aware queries
- Trigger efficiency: Index-based monitoring
## Data Model
### Document Structure
```rust
pub struct Document {
    // Identity
    id: DocumentId,            // 128-bit UUID
    path: CompressedPath,      // Original file path

    // Content
    frontmatter: Frontmatter,
    content: MarkdownContent,

    // Metadata
    created: Timestamp,
    updated: Timestamp,
    accessed: Timestamp,
    version: Version,

    // Relationships
    tags: TagSet,
    related: Vec<DocumentId>,
    backlinks: Vec<DocumentId>,

    // Cognitive metadata
    embedding: Option<Vector>,
    relevance_score: f32,
    access_count: u32,
}
```
### Index Entry Structure
```rust
pub struct IndexEntry {
    doc_id: DocumentId,
    score: f32,          // Relevance score
    positions: Vec<u32>, // Word positions for highlighting
    metadata: Metadata,  // Quick-access fields
}
```
## Performance Characteristics
### Time Complexity
| Operation        | Complexity | Typical Time |
|------------------|------------|--------------|
| Insert           | O(log n)   | <1ms         |
| Update           | O(log n)   | <1ms         |
| Delete           | O(log n)   | <1ms         |
| Lookup by path   | O(log n)   | <0.1ms       |
| Full-text search | O(k)       | <10ms        |
| Graph traversal  | O(V + E)   | <50ms        |
| Semantic search  | O(log n)   | <20ms        |
### Space Complexity
| Component  | Memory Usage  | Disk Usage      |
|------------|---------------|-----------------|
| Document   | ~8KB avg      | ~3KB compressed |
| Indices    | ~500B/doc     | ~200B/doc       |
| WAL        | 10MB active   | Configurable    |
| Page cache | 100MB default | N/A             |
### Throughput Targets
- Writes: 10,000+ documents/second
- Reads: 100,000+ queries/second
- Mixed: a 50% read / 50% write workload while sustaining the read and write targets above
## Security Architecture
### Encryption
- At rest: AES-256-GCM for sensitive documents
- In transit: TLS 1.3 for network operations
- Key management: OS keychain integration
### Access Control
- Document-level: Read/write permissions
- Field-level: Redaction for sensitive fields
- Audit logging: All operations tracked
### Privacy Features
- PII detection: Automatic flagging
- Retention policies: Automatic expiry
- Right to forget: Complete removal
## Extensibility Points
### Plugin System
```rust
pub trait KotaPlugin {
    fn on_insert(&mut self, doc: &Document) -> Result<()>;
    fn on_query(&mut self, query: &Query) -> Result<()>;
    fn on_index(&mut self, index: &Index) -> Result<()>;
}
```
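
A hypothetical plugin built against the trait above, counting tag usage as documents are inserted. `Document`, `Query`, `Index`, and `Result` are the types the trait already references; `TagStatsPlugin` is invented here for illustration, and it assumes `TagSet` exposes an `iter()` over string-like tags.
```rust
use std::collections::HashMap;

/// Illustrative plugin: tallies how often each tag appears across inserts.
struct TagStatsPlugin {
    tag_counts: HashMap<String, u64>,
}

impl KotaPlugin for TagStatsPlugin {
    fn on_insert(&mut self, doc: &Document) -> Result<()> {
        // Assumes TagSet::iter() yields displayable tag values.
        for tag in doc.tags.iter() {
            *self.tag_counts.entry(tag.to_string()).or_insert(0) += 1;
        }
        Ok(())
    }

    fn on_query(&mut self, _query: &Query) -> Result<()> {
        Ok(()) // no-op: this plugin only reacts to inserts
    }

    fn on_index(&mut self, _index: &Index) -> Result<()> {
        Ok(()) // no-op
    }
}
```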
### Custom Index Types
- Bloom filter index: For existence checks
- Geospatial index: For location data
- Phonetic index: For name matching
- Custom embeddings: Domain-specific vectors
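
For the Bloom-filter index type listed above, a minimal in-memory sketch of the existence-check behavior; the sizing and double-hashing scheme are illustrative choices rather than the planned implementation.
```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<bool>,
    hashes: u64, // number of probe positions per item
}

impl BloomFilter {
    fn new(num_bits: usize, hashes: u64) -> Self {
        Self { bits: vec![false; num_bits], hashes }
    }

    // Double hashing: derive all probe positions from one 64-bit hash.
    fn positions<T: Hash>(&self, item: &T) -> Vec<usize> {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let a = hasher.finish();
        let b = a.rotate_left(32) | 1;
        let m = self.bits.len() as u64;
        (0..self.hashes)
            .map(|i| (a.wrapping_add(i.wrapping_mul(b)) % m) as usize)
            .collect()
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        for i in self.positions(item) {
            self.bits[i] = true;
        }
    }

    /// May return a false positive, never a false negative.
    fn may_contain<T: Hash>(&self, item: &T) -> bool {
        self.positions(item).iter().all(|&i| self.bits[i])
    }
}
```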
### Query Extensions
- Custom functions: User-defined computations
- External data sources: Federation support
- Streaming queries: Real-time updates
## Operational Considerations
### Monitoring
- Prometheus metrics: Performance and health
- OpenTelemetry traces: Distributed tracing
- Custom dashboards: Grafana integration
### Maintenance
- Online defragmentation: No downtime
- Index rebuilding: Background operation
- Backup coordination: Consistent snapshots
### Disaster Recovery
- Point-in-time recovery: Any timestamp
- Geo-replication: Optional for critical data
- Incremental backups: Efficient storage
## Future Optimizations
### Hardware Acceleration
- SIMD instructions: Batch operations
- GPU indexing: Parallel vector search
- Persistent memory: Intel Optane support
### Advanced Features
- Learned indices: ML-based optimization
- Adaptive compression: Content-aware
- Predictive caching: Access pattern learning
### Cognitive Enhancements
- Thought chains: Native support
- Memory consolidation: Sleep-like processing
- Attention mechanisms: Priority-based indexing
## Conclusion
This architecture provides a solid foundation for KOTA's evolution from a tool collection to a genuine cognitive partner. The custom database design specifically addresses the unique requirements of human-AI distributed cognition while maintaining practical considerations like Git compatibility and human readability.

The modular design allows for incremental implementation and testing, reducing risk while enabling rapid innovation in areas like consciousness integration and semantic understanding.