KOTA Custom Database Implementation Plan

Executive Summary

This document outlines a comprehensive plan for implementing a custom database system specifically designed for KOTA's unique memory architecture needs. The database will replace the current file-scanning approach with a high-performance, memory-efficient system that maintains Git compatibility while enabling advanced cognitive capabilities.

Key Metrics from Analysis

  • Current Scale: 1,002 markdown files, ~4.8MB total
  • Update Rate: 85% of files modified weekly
  • Query Performance Target: <100ms for consciousness sessions, <500ms for chat
  • Memory Budget: <500MB for indices, unlimited for memory-mapped content

Phase 0: Foundation Research & Design (Week 0-1)

0.1 Feasibility Prototype

Goal: Validate core assumptions with minimal implementation

// Proof of concept in 500 lines
pub struct MiniKotaDB {
    // Memory-mapped file storage
    mmap: memmap2::MmapMut,

    // Simple B-tree index
    index: BTreeMap<PathBuf, DocumentOffset>,

    // Basic query engine
    query: SimpleQueryEngine,
}

Deliverables:

  • [ ] Benchmark memory-mapped vs file I/O for markdown (see the sketch below)
  • [ ] Test ZSTD compression ratios on KOTA content
  • [ ] Validate B-tree performance for 10k documents
  • [ ] Prototype fuzzy search with trigrams
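
A minimal sketch of the first benchmark, assuming the memmap2 crate; the file path and iteration count are placeholders, not final benchmark parameters:

// Sketch: compare buffered file I/O vs memory-mapped reads for one markdown file.
use std::fs::File;
use std::io::Read;
use std::time::Instant;

fn bench_read(path: &str, iterations: u32) -> std::io::Result<()> {
    // Buffered file I/O: re-read the whole file each iteration.
    let start = Instant::now();
    for _ in 0..iterations {
        let mut buf = Vec::new();
        File::open(path)?.read_to_end(&mut buf)?;
        std::hint::black_box(&buf);
    }
    println!("file I/O: {:?}", start.elapsed());

    // Memory-mapped: map once, then touch every byte per iteration.
    let file = File::open(path)?;
    let mmap = unsafe { memmap2::Mmap::map(&file)? };
    let start = Instant::now();
    for _ in 0..iterations {
        let sum: u64 = mmap.iter().map(|&b| b as u64).sum();
        std::hint::black_box(sum);
    }
    println!("mmap: {:?}", start.elapsed());
    Ok(())
}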

0.2 Architecture Documentation

Goal: Detailed technical design before implementation

Documents to Create:

  1. ARCHITECTURE.md - System design and components
  2. DATA_MODEL.md - Storage format and indices
  3. QUERY_LANGUAGE.md - KOTA-specific query syntax
  4. INTEGRATION_GUIDE.md - How to integrate with the existing system

0.3 Development Environment Setup

# Create project structure
mkdir -p crates/kota-db/{src,tests,benches,examples}
mkdir -p crates/kota-db/src/{storage,index,query,compression}

# Add dependencies
cat >> Cargo.toml << EOF
[workspace]
members = ["crates/kota-db"]

[dependencies.kota-db]
version = "0.1.0"
path = "crates/kota-db"
EOF

Phase 1: Core Storage Engine (Week 2-3)

1.1 Page-Based Storage Manager

Goal: Efficient disk I/O with fixed-size pages

pub struct StorageEngine {
    // Page size: 4KB (matches OS page size)
    page_size: usize,

    // Page cache with LRU eviction
    page_cache: LruCache<PageId, Page>,

    // Free page management
    free_list: FreePageList,

    // Write-ahead log
    wal: WriteAheadLog,
}

impl StorageEngine {
    pub fn allocate_page(&mut self) -> Result<PageId>;
    pub fn read_page(&mut self, id: PageId) -> Result<&Page>;
    pub fn write_page(&mut self, id: PageId, page: Page) -> Result<()>;
    pub fn sync(&mut self) -> Result<()>;
}

Key Features:

  • Copy-on-write for versioning
  • Checksums for corruption detection
  • Compression at page level
  • Memory-mapped option for hot data

1.2 Document Storage Format

Goal: Optimized format for markdown with frontmatter

#[repr(C)]
pub struct DocumentHeader {
    magic: [u8; 4],              // "KOTA"
    version: u16,                // Format version
    flags: DocumentFlags,        // Compression, encryption, etc.

    // Offsets within document
    frontmatter_offset: u32,
    frontmatter_len: u32,
    content_offset: u32,
    content_len: u32,

    // Metadata
    created: i64,                // Unix timestamp
    updated: i64,
    git_hash: [u8; 20],         // SHA-1

    // Relationships
    related_count: u16,
    tags_count: u16,
}

pub struct CompressedDocument {
    header: DocumentHeader,
    data: Vec<u8>,  // ZSTD compressed with dictionary
}

1.3 Write-Ahead Logging

Goal: Durability and crash recovery

pub struct WriteAheadLog {
    log_file: tokio::fs::File,
    sequence: AtomicU64,
    checkpoint_interval: Duration,
}

pub enum WalEntry {
    Begin { tx_id: u64 },
    Insert { tx_id: u64, doc: Document },
    Update { tx_id: u64, id: DocId, changes: Delta },
    Delete { tx_id: u64, id: DocId },
    Commit { tx_id: u64 },
    Checkpoint { snapshot: DatabaseState },
}
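
A simplified sketch of the append-and-replay discipline behind the WAL, using synchronous std::fs rather than the tokio file above; the length-prefixed raw-byte encoding stands in for serialized WalEntry values:

use std::fs::OpenOptions;
use std::io::Write;

fn append_entry(path: &str, payload: &[u8]) -> std::io::Result<()> {
    let mut log = OpenOptions::new().create(true).append(true).open(path)?;
    // Length prefix lets recovery detect a torn final record.
    log.write_all(&(payload.len() as u32).to_le_bytes())?;
    log.write_all(payload)?;
    // Durability point: the entry must hit disk before the commit is acknowledged.
    log.sync_data()?;
    Ok(())
}

// Recovery replays complete records and stops at the first torn one.
fn replay(path: &str) -> std::io::Result<Vec<Vec<u8>>> {
    let data = std::fs::read(path)?;
    let (mut off, mut entries) = (0usize, Vec::new());
    while off + 4 <= data.len() {
        let len = u32::from_le_bytes(data[off..off + 4].try_into().unwrap()) as usize;
        if off + 4 + len > data.len() { break; } // torn tail: truncate here
        entries.push(data[off + 4..off + 4 + len].to_vec());
        off += 4 + len;
    }
    Ok(entries)
}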

Phase 2: Indexing Subsystem (Week 4-5)

2.1 Multi-Modal Index Manager

Goal: Unified interface for different index types

pub trait Index: Send + Sync {
    type Key;
    type Value;

    fn insert(&mut self, key: Self::Key, value: Self::Value) -> Result<()>;
    fn delete(&mut self, key: &Self::Key) -> Result<()>;
    fn search(&self, query: &Query) -> Result<Vec<Self::Value>>;
    fn range(&self, start: &Self::Key, end: &Self::Key) -> Result<Vec<Self::Value>>;
}

pub struct IndexManager {
    // Primary indices
    path_index: BTreeIndex<PathBuf, DocId>,

    // Secondary indices
    tag_index: InvertedIndex<String, DocId>,
    fulltext_index: TrigramIndex,
    temporal_index: TimeSeriesIndex,

    // Graph indices
    relationship_graph: AdjacencyList<DocId>,

    // Semantic indices
    embedding_index: HnswIndex<Vector, DocId>,
}

2.2 Full-Text Search with Fuzzy Matching

Goal: Fast, typo-tolerant search

pub struct TrigramIndex {
    // Trigram to document mapping
    trigrams: HashMap<[u8; 3], RoaringBitmap>,

    // Document to position mapping
    positions: HashMap<DocId, Vec<TrigramPosition>>,

    // Fuzzy matcher
    matcher: FuzzyMatcher,
}

impl TrigramIndex {
    pub fn search_fuzzy(&self, query: &str, max_distance: u32) -> Vec<SearchResult> {
        // 1. Extract query trigrams
        // 2. Find candidate documents
        // 3. Calculate edit distance
        // 4. Rank by relevance
    }
}
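
A rough sketch of the candidate-selection steps (1-2 above), assuming the roaring crate for the bitmaps; edit-distance ranking is omitted:

use std::collections::HashMap;
use roaring::RoaringBitmap;

fn trigrams(text: &str) -> Vec<[u8; 3]> {
    let bytes = text.to_lowercase().into_bytes();
    bytes.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}

// Candidates = documents sharing at least `min_hits` trigrams with the query.
fn candidates(
    index: &HashMap<[u8; 3], RoaringBitmap>,
    query: &str,
    min_hits: u32,
) -> Vec<u32> {
    let mut hits: HashMap<u32, u32> = HashMap::new();
    for tri in trigrams(query) {
        if let Some(docs) = index.get(&tri) {
            for doc in docs.iter() {
                *hits.entry(doc).or_insert(0) += 1;
            }
        }
    }
    hits.into_iter()
        .filter(|&(_, n)| n >= min_hits)
        .map(|(doc, _)| doc)
        .collect()
}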

2.3 Graph Index for Relationships

Goal: Efficient traversal of document relationships

pub struct GraphIndex {
    // Forward edges (document -> related)
    forward: HashMap<DocId, Vec<Edge>>,

    // Backward edges (document <- related)
    backward: HashMap<DocId, Vec<Edge>>,

    // Edge metadata
    edge_data: HashMap<EdgeId, EdgeMetadata>,

    // Bloom filter for quick existence checks
    bloom: BloomFilter,
}

pub struct Edge {
    target: DocId,
    weight: f32,    // Relationship strength
    type_: EdgeType, // Related, references, child, etc.
}
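
A depth-bounded traversal sketch over the forward map, matching the shape of the Related query in Phase 3; it is simplified to plain DocId adjacency lists rather than the full Edge struct:

use std::collections::{HashMap, HashSet, VecDeque};

type DocId = u64;

fn related(forward: &HashMap<DocId, Vec<DocId>>, start: DocId, max_depth: u32) -> Vec<DocId> {
    let mut seen = HashSet::from([start]);
    let mut queue = VecDeque::from([(start, 0u32)]);
    let mut out = Vec::new();
    while let Some((doc, depth)) = queue.pop_front() {
        if depth == max_depth { continue; }
        for &next in forward.get(&doc).into_iter().flatten() {
            if seen.insert(next) {
                out.push(next);
                queue.push_back((next, depth + 1));
            }
        }
    }
    out
}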

2.4 Semantic Vector Index

Goal: Find conceptually similar documents

pub struct HnswIndex {
    // Hierarchical Navigable Small World graph
    layers: Vec<Layer>,

    // Vector storage
    vectors: HashMap<DocId, Vector>,

    // Distance function
    distance: DistanceMetric,
}

impl HnswIndex {
    pub fn search_knn(&self, query: &Vector, k: usize) -> Vec<(DocId, f32)> {
        // Approximate nearest neighbor search
    }

    pub fn add_vector(&mut self, id: DocId, vector: Vector) -> Result<()> {
        // Insert with automatic layer assignment
    }
}
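
One plausible DistanceMetric for this index is cosine distance over dense f32 vectors, sketched here for reference:

// Cosine distance: 0.0 for identical directions, up to 2.0 for opposite ones.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb).max(f32::EPSILON) // guard against zero vectors
}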

Phase 3: Query Engine (Week 6-7)

3.1 KOTA Query Language (KQL)

Goal: Natural, powerful query syntax

// Example queries:
// "meetings about rust"
// "related_to: 'project-mosaic' AND created: last_week"
// "consciousness sessions WITH insights ABOUT productivity"
// "similar_to: 'distributed cognition' LIMIT 10"

pub enum KotaQuery {
    // Text search
    Text { 
        query: String, 
        fields: Vec<Field>,
        fuzzy: bool 
    },

    // Relationship queries
    Related { 
        start: DocId, 
        depth: u32,
        filter: Option<Filter> 
    },

    // Temporal queries
    Temporal { 
        range: TimeRange,
        aggregation: Option<Aggregation> 
    },

    // Semantic queries
    Semantic { 
        vector: Vector,
        threshold: f32 
    },

    // Compound queries
    And(Box<KotaQuery>, Box<KotaQuery>),
    Or(Box<KotaQuery>, Box<KotaQuery>),
    Not(Box<KotaQuery>),
}
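
For illustration, a query like "meetings about rust, created last week" might lower to the following KotaQuery value; the Field and TimeRange variants shown are assumptions, not a settled API:

let query = KotaQuery::And(
    Box::new(KotaQuery::Text {
        query: "meetings about rust".into(),
        fields: vec![Field::Title, Field::Content], // hypothetical variants
        fuzzy: true,
    }),
    Box::new(KotaQuery::Temporal {
        range: TimeRange::LastWeek, // hypothetical variant
        aggregation: None,
    }),
);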

3.2 Query Parser and Planner

Goal: Convert text queries to execution plans

pub struct QueryParser {
    lexer: Lexer,
    grammar: Grammar,
}

pub struct QueryPlanner {
    statistics: TableStatistics,
    cost_model: CostModel,
}

pub struct ExecutionPlan {
    steps: Vec<PlanStep>,
    estimated_cost: f64,
    estimated_rows: usize,
}

pub enum PlanStep {
    IndexScan { index: IndexType, range: Range },
    SeqScan { filter: Filter },
    Join { left: Box<PlanStep>, right: Box<PlanStep> },
    Sort { key: SortKey },
    Limit { count: usize },
}

3.3 Query Executor

Goal: Efficient execution with streaming results

pub struct QueryExecutor {
    buffer_pool: BufferPool,
    thread_pool: ThreadPool,
}

impl QueryExecutor {
    pub async fn execute(&self, plan: ExecutionPlan) -> Result<QueryStream> {
        // Parallel execution where possible
        // Streaming results for large queries
        // Progress reporting for long operations
    }
}

pub struct QueryStream {
    receiver: mpsc::Receiver<Result<Document>>,
    metadata: QueryMetadata,
}
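
A sketch of how a caller might drain the stream; it reads the channel field directly for brevity, though the final API would more likely wrap this in a next() method, and doc.path is a hypothetical field:

async fn print_results(mut stream: QueryStream) -> Result<()> {
    // Each item is a Result<Document>; errors surface per-row, not per-query.
    while let Some(doc) = stream.receiver.recv().await {
        let doc = doc?;
        println!("{}", doc.path.display()); // path: hypothetical field
    }
    Ok(())
}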

Phase 4: Advanced Features (Week 8-9)

4.1 Memory Compression Integration

Goal: Intelligent compression aware of content patterns

pub struct CompressionEngine {
    // Domain-specific dictionaries
    markdown_dict: ZstdDict,
    frontmatter_dict: ZstdDict,

    // Compression levels by age/access
    hot_level: i32,  // Fast compression
    cold_level: i32, // High compression

    // Statistics for adaptive compression
    stats: CompressionStats,
}

impl CompressionEngine {
    pub fn compress_document(&self, doc: &Document) -> CompressedDocument {
        // 1. Separate frontmatter and content
        // 2. Apply appropriate dictionary
        // 3. Choose compression level based on access patterns
    }
}
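
A minimal sketch of dictionary training and hot-path compression with the zstd crate; the sample set and the 16 KiB dictionary size are placeholders:

use zstd::bulk::Compressor;

fn compress_with_dict(samples: &[Vec<u8>], doc: &[u8]) -> std::io::Result<Vec<u8>> {
    // Train a dictionary on representative markdown samples.
    let dict = zstd::dict::from_samples(samples, 16 * 1024)?;
    // Level 3 stands in for hot_level: fast compression for recently touched docs.
    let mut compressor = Compressor::with_dictionary(3, &dict)?;
    compressor.compress(doc)
}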

4.2 Real-Time Synchronization

Goal: Keep database in sync with filesystem

pub struct FileSystemSync {
    watcher: notify::RecommendedWatcher,
    db: Arc<KotaDB>,

    // Debouncing for rapid changes
    debouncer: Debouncer,

    // Conflict resolution
    resolver: ConflictResolver,
}

impl FileSystemSync {
    pub async fn start(&mut self) -> Result<()> {
        // Watch for filesystem changes
        // Queue updates with debouncing
        // Apply changes in batches
        // Handle conflicts (DB vs filesystem)
    }
}
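
The watch loop underneath this, sketched with the notify crate's channel-based API and without the debouncing or conflict handling shown above; the directory path is a placeholder:

use std::path::Path;
use std::sync::mpsc;
use notify::{recommended_watcher, RecursiveMode, Watcher};

fn watch_knowledge_base() -> notify::Result<()> {
    let (tx, rx) = mpsc::channel();
    let mut watcher = recommended_watcher(tx)?;
    watcher.watch(Path::new("knowledge/"), RecursiveMode::Recursive)?;
    for event in rx {
        // Each event carries the changed paths; real code would batch these
        // through the debouncer before touching the database.
        println!("fs change: {:?}", event?);
    }
    Ok(())
}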

4.3 Consciousness Integration

Goal: Direct integration with consciousness system

pub struct ConsciousnessInterface {
    db: Arc<KotaDB>,
    session_cache: LruCache<SessionId, SessionState>,
}

impl ConsciousnessInterface {
    pub async fn record_insight(&self, insight: Insight) -> Result<()> {
        // Store with temporal context
        // Update relationship graph
        // Trigger relevant indices
    }

    pub async fn query_context(&self, focus: Focus) -> Result<Context> {
        // Multi-index query
        // Relevance scoring
        // Context assembly
    }
}

4.4 Performance Optimizations

Goal: Sub-100ms query latency

pub struct PerformanceOptimizer {
    // Query result caching
    query_cache: Cache<QueryHash, ResultSet>,

    // Prepared statement cache
    prepared_statements: HashMap<String, PreparedQuery>,

    // Statistics for query optimization
    query_stats: QueryStatistics,

    // Adaptive indices
    adaptive_indexer: AdaptiveIndexer,
}

Phase 5: Integration & Testing (Week 10-11)

5.1 MCP Server Wrapper

Goal: Expose database through MCP protocol

pub struct KotaDBServer {
    db: Arc<KotaDB>,
    tools: Vec<Tool>,
}

impl McpServer for KotaDBServer {
    async fn handle_tool_call(&self, tool: &str, args: Value) -> Result<Value> {
        match tool {
            "query" => self.handle_query(args).await,
            "insert" => self.handle_insert(args).await,
            "update" => self.handle_update(args).await,
            // ... other operations
            _ => Err(anyhow::anyhow!("unknown tool: {tool}")), // exhaustive match; anyhow assumed
        }
    }
}

5.2 CLI Integration

Goal: Seamless integration with existing kota commands

// New commands
pub enum DatabaseCommand {
    Query { kql: String },
    Index { path: PathBuf },
    Compact,
    Stats,
    Export { format: ExportFormat },
}

// Integration with existing commands
impl KnowledgeOrgCommand {
    pub async fn execute_with_db(&self, db: &KotaDB) -> Result<()> {
        // Use database instead of in-memory indices
    }
}

5.3 Migration Tools

Goal: Smooth transition from current system

pub struct Migrator {
    source: FileSystemSource,
    target: KotaDB,
    progress: ProgressBar,
}

impl Migrator {
    pub async fn migrate(&mut self) -> Result<MigrationReport> {
        // 1. Scan all markdown files
        // 2. Parse and validate
        // 3. Insert into database
        // 4. Build indices
        // 5. Verify integrity
    }
}
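
A sketch of steps 1-3 as a single scan-and-insert loop, assuming the walkdir crate and a hypothetical insert_markdown method on KotaDB:

use walkdir::WalkDir;

async fn migrate_dir(db: &KotaDB, root: &str) -> anyhow::Result<(usize, usize)> {
    let (mut ok, mut failed) = (0, 0);
    for entry in WalkDir::new(root).into_iter().filter_map(|e| e.ok()) {
        if entry.path().extension().is_some_and(|e| e == "md") {
            match std::fs::read_to_string(entry.path()) {
                Ok(text) => {
                    db.insert_markdown(entry.path(), &text).await?; // hypothetical API
                    ok += 1;
                }
                Err(_) => failed += 1, // report in MigrationReport, don't abort the run
            }
        }
    }
    Ok((ok, failed))
}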

5.4 Testing Strategy

Unit Tests

#[cfg(test)]
mod tests {
    #[test]
    fn test_document_serialization() { }

    #[test]
    fn test_index_operations() { }

    #[test]
    fn test_query_parsing() { }
}

Integration Tests

#[tokio::test]
async fn test_full_query_pipeline() {
    // 1. Insert test documents
    // 2. Build indices
    // 3. Execute complex queries
    // 4. Verify results
}

Performance Benchmarks

#[bench]
fn bench_insert_throughput(b: &mut Bencher) {
    // Measure documents/second
}

#[bench]
fn bench_query_latency(b: &mut Bencher) {
    // Measure p50, p95, p99 latencies
}
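
Note that the #[bench] attribute requires nightly Rust; on stable, the benches/ directory would likely use criterion instead, sketched here (setup_db and insert_test_document are hypothetical helpers):

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_insert_throughput(c: &mut Criterion) {
    let db = setup_db(); // hypothetical helper
    c.bench_function("insert_doc", |b| {
        b.iter(|| db.insert_test_document()); // hypothetical helper
    });
}

criterion_group!(benches, bench_insert_throughput);
criterion_main!(benches);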

Chaos Testing

pub struct ChaosTester {
    db: KotaDB,
    chaos_monkey: ChaosMonkey,
}

impl ChaosTester {
    pub async fn test_crash_recovery(&mut self) {
        // 1. Start transaction
        // 2. Random crash
        // 3. Recover from WAL
        // 4. Verify consistency
    }
}

Phase 6: Production Hardening (Week 12-13)

6.1 Monitoring and Observability

pub struct Metrics {
    // Performance metrics
    query_latency: Histogram,
    index_hit_rate: Gauge,
    compression_ratio: Gauge,

    // Health metrics
    page_cache_hit_rate: Gauge,
    wal_size: Gauge,
    connection_count: Counter,
}

6.2 Backup and Recovery

pub struct BackupManager {
    schedule: CronSchedule,
    retention: RetentionPolicy,
    storage: BackupStorage,
}

impl BackupManager {
    pub async fn create_backup(&self) -> Result<BackupId> {
        // 1. Checkpoint WAL
        // 2. Snapshot data files
        // 3. Export metadata
        // 4. Compress and encrypt
    }
}

6.3 Security Hardening

pub struct SecurityLayer {
    // Encryption at rest
    encryption: AesGcm,

    // Access control
    permissions: PermissionSystem,

    // Audit logging
    audit_log: AuditLog,
}

Implementation Timeline

Week 1: Foundation

  • Set up project structure
  • Implement basic storage engine
  • Create simple B-tree index
  • Write first integration test

Week 2-3: Storage Engine

  • Complete page manager
  • Implement WAL
  • Add compression support
  • Benchmark I/O performance

Week 4-5: Indexing

  • Build inverted index for text
  • Implement graph index
  • Add fuzzy search
  • Create index benchmarks

Week 6-7: Query Engine

  • Design query language
  • Build parser and planner
  • Implement executor
  • Add streaming results

Week 8-9: Advanced Features

  • Integrate compression engine
  • Add filesystem sync
  • Build consciousness interface
  • Optimize performance

Week 10-11: Integration

  • Create MCP server wrapper
  • Update CLI commands
  • Build migration tools
  • Write comprehensive tests

Week 12-13: Production

  • Add monitoring/metrics
  • Implement backup system
  • Security hardening
  • Performance tuning

Success Metrics

Performance Targets

  • Insert throughput: >10,000 docs/sec
  • Query latency p50: <10ms
  • Query latency p99: <100ms
  • Memory usage: <500MB for 100k docs
  • Startup time: <1 second

Functionality Goals

  • Query types: Text, graph, temporal, semantic
  • Index types: B-tree, inverted, graph, vector
  • Compression ratio: >3x for typical content
  • Crash recovery: <10 second RTO
  • Backup size: <30% of original

Quality Standards

  • Test coverage: >90%
  • Documentation: 100% public API
  • Zero clippy warnings
  • No unsafe code (except FFI)
  • Fuzz testing: 24 hours no crashes

Risk Mitigation

Technical Risks

  1. Performance not meeting targets
     Mitigation: Profile early, optimize hot paths

  2. Memory usage too high
     Mitigation: Implement aggressive paging

  3. Query language too complex
     Mitigation: Start simple, iterate with users

Schedule Risks

  1. Underestimated complexity
     Mitigation: MVP first, features later

  2. Integration challenges
     Mitigation: Continuous integration from week 1

Operational Risks

  1. Migration failures
     Mitigation: Extensive testing, rollback plan

  2. Data corruption
     Mitigation: Checksums, backups, WAL

Conclusion

This custom database will provide KOTA with:

  • 10-100x faster queries than the current approach
  • Native markdown support with Git compatibility
  • Advanced cognitive features through semantic search
  • Complete control over memory architecture evolution

The 13-week timeline is aggressive but achievable, with clear milestones and risk mitigation strategies. The phased approach allows for early validation and continuous integration with the existing KOTA system.