KOTA Custom Database Implementation Plan

Executive Summary

This document outlines a comprehensive plan for implementing a custom database system specifically designed for KOTA's unique memory architecture needs. The database will replace the current file-scanning approach with a high-performance, memory-efficient system that maintains Git compatibility while enabling advanced cognitive capabilities.

Key Metrics from Analysis

  • Current Scale: 1,002 markdown files, ~4.8MB total
  • Update Rate: 85% of files modified weekly
  • Query Performance Target: <100ms for consciousness sessions, <500ms for chat
  • Memory Budget: <500MB for indices, unlimited for memory-mapped content

Phase 0: Foundation Research & Design (Week 0-1)

0.1 Feasibility Prototype

Goal: Validate core assumptions with minimal implementation

// Proof of concept in 500 lines
pub struct MiniKotaDB {
    // Memory-mapped file storage
    mmap: memmap2::MmapMut,

    // Simple B-tree index
    index: BTreeMap<PathBuf, DocumentOffset>,

    // Basic query engine
    query: SimpleQueryEngine,
}

Deliverables:

  • [ ] Benchmark memory-mapped vs file I/O for markdown (see the sketch below)
  • [ ] Test ZSTD compression ratios on KOTA content
  • [ ] Validate B-tree performance for 10k documents
  • [ ] Prototype fuzzy search with trigrams
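
A minimal sketch of the first benchmark, assuming the memmap2 crate; the file path and iteration count are placeholders, not final benchmark parameters:

// Sketch: compare buffered file I/O vs memory-mapped reads for one markdown file.
use std::fs::File;
use std::io::Read;
use std::time::Instant;

fn bench_read(path: &str, iterations: u32) -> std::io::Result<()> {
    // Buffered file I/O: re-read the whole file each iteration.
    let start = Instant::now();
    for _ in 0..iterations {
        let mut buf = Vec::new();
        File::open(path)?.read_to_end(&mut buf)?;
        std::hint::black_box(&buf);
    }
    println!("file I/O: {:?}", start.elapsed());

    // Memory-mapped: map once, then touch every byte per iteration.
    let file = File::open(path)?;
    let mmap = unsafe { memmap2::Mmap::map(&file)? };
    let start = Instant::now();
    for _ in 0..iterations {
        let sum: u64 = mmap.iter().map(|&b| b as u64).sum();
        std::hint::black_box(sum);
    }
    println!("mmap: {:?}", start.elapsed());
    Ok(())
}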

0.2 Architecture Documentation

Goal: Detailed technical design before implementation

Documents to Create:

  1. ARCHITECTURE.md - System design and components
  2. DATA_MODEL.md - Storage format and indices
  3. QUERY_LANGUAGE.md - KOTA-specific query syntax
  4. INTEGRATION_GUIDE.md - How to integrate with the existing system

0.3 Development Environment Setup

# Create project structure
mkdir -p crates/kota-db/{src,tests,benches,examples}
mkdir -p crates/kota-db/src/{storage,index,query,compression}

# Add dependencies
cat >> Cargo.toml << EOF
[workspace]
members = ["crates/kota-db"]

[dependencies.kota-db]
version = "0.1.0"
path = "crates/kota-db"
EOF

Phase 1: Core Storage Engine (Week 2-3)

1.1 Page-Based Storage Manager

Goal: Efficient disk I/O with fixed-size pages

pub struct StorageEngine {
    // Page size: 4KB (matches OS page size)
    page_size: usize,

    // Page cache with LRU eviction
    page_cache: LruCache<PageId, Page>,

    // Free page management
    free_list: FreePageList,

    // Write-ahead log
    wal: WriteAheadLog,
}

impl StorageEngine {
    pub fn allocate_page(&mut self) -> Result<PageId>;
    pub fn read_page(&mut self, id: PageId) -> Result<&Page>;
    pub fn write_page(&mut self, id: PageId, page: Page) -> Result<()>;
    pub fn sync(&mut self) -> Result<()>;
}

Key Features:

  • Copy-on-write for versioning
  • Checksums for corruption detection
  • Compression at page level
  • Memory-mapped option for hot data

1.2 Document Storage Format

Goal: Optimized format for markdown with frontmatter

#[repr(C)]
pub struct DocumentHeader {
    magic: [u8; 4],              // "KOTA"
    version: u16,                // Format version
    flags: DocumentFlags,        // Compression, encryption, etc.

    // Offsets within document
    frontmatter_offset: u32,
    frontmatter_len: u32,
    content_offset: u32,
    content_len: u32,

    // Metadata
    created: i64,                // Unix timestamp
    updated: i64,
    git_hash: [u8; 20],         // SHA-1

    // Relationships
    related_count: u16,
    tags_count: u16,
}

pub struct CompressedDocument {
    header: DocumentHeader,
    data: Vec<u8>,  // ZSTD compressed with dictionary
}

1.3 Write-Ahead Logging

Goal: Durability and crash recovery

pub struct WriteAheadLog {
    log_file: tokio::fs::File,
    sequence: AtomicU64,
    checkpoint_interval: Duration,
}

pub enum WalEntry {
    Begin { tx_id: u64 },
    Insert { tx_id: u64, doc: Document },
    Update { tx_id: u64, id: DocId, changes: Delta },
    Delete { tx_id: u64, id: DocId },
    Commit { tx_id: u64 },
    Checkpoint { snapshot: DatabaseState },
}
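
A simplified sketch of the append-and-replay discipline behind the WAL, using synchronous std::fs rather than the tokio file above; the length-prefixed raw-byte encoding stands in for serialized WalEntry values:

use std::fs::OpenOptions;
use std::io::Write;

fn append_entry(path: &str, payload: &[u8]) -> std::io::Result<()> {
    let mut log = OpenOptions::new().create(true).append(true).open(path)?;
    // Length prefix lets recovery detect a torn final record.
    log.write_all(&(payload.len() as u32).to_le_bytes())?;
    log.write_all(payload)?;
    // Durability point: the entry must hit disk before the commit is acknowledged.
    log.sync_data()?;
    Ok(())
}

// Recovery replays complete records and stops at the first torn one.
fn replay(path: &str) -> std::io::Result<Vec<Vec<u8>>> {
    let data = std::fs::read(path)?;
    let (mut off, mut entries) = (0usize, Vec::new());
    while off + 4 <= data.len() {
        let len = u32::from_le_bytes(data[off..off + 4].try_into().unwrap()) as usize;
        if off + 4 + len > data.len() { break; } // torn tail: truncate here
        entries.push(data[off + 4..off + 4 + len].to_vec());
        off += 4 + len;
    }
    Ok(entries)
}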

Phase 2: Indexing Subsystem (Week 4-5)

2.1 Multi-Modal Index Manager

Goal: Unified interface for different index types

pub trait Index: Send + Sync {
    type Key;
    type Value;

    fn insert(&mut self, key: Self::Key, value: Self::Value) -> Result<()>;
    fn delete(&mut self, key: &Self::Key) -> Result<()>;
    fn search(&self, query: &Query) -> Result<Vec<Self::Value>>;
    fn range(&self, start: &Self::Key, end: &Self::Key) -> Result<Vec<Self::Value>>;
}

pub struct IndexManager {
    // Primary indices
    path_index: BTreeIndex<PathBuf, DocId>,

    // Secondary indices
    tag_index: InvertedIndex<String, DocId>,
    fulltext_index: TrigramIndex,
    temporal_index: TimeSeriesIndex,

    // Graph indices
    relationship_graph: AdjacencyList<DocId>,

    // Semantic indices
    embedding_index: HnswIndex<Vector, DocId>,
}

2.2 Full-Text Search with Fuzzy Matching

Goal: Fast, typo-tolerant search

pub struct TrigramIndex {
    // Trigram to document mapping
    trigrams: HashMap<[u8; 3], RoaringBitmap>,

    // Document to position mapping
    positions: HashMap<DocId, Vec<TrigramPosition>>,

    // Fuzzy matcher
    matcher: FuzzyMatcher,
}

impl TrigramIndex {
    pub fn search_fuzzy(&self, query: &str, max_distance: u32) -> Vec<SearchResult> {
        // 1. Extract query trigrams
        // 2. Find candidate documents
        // 3. Calculate edit distance
        // 4. Rank by relevance
    }
}
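
A rough sketch of the candidate-selection steps (1-2 above), assuming the roaring crate for the bitmaps; edit-distance ranking is omitted:

use std::collections::HashMap;
use roaring::RoaringBitmap;

fn trigrams(text: &str) -> Vec<[u8; 3]> {
    let bytes = text.to_lowercase().into_bytes();
    bytes.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}

// Candidates = documents sharing at least `min_hits` trigrams with the query.
fn candidates(
    index: &HashMap<[u8; 3], RoaringBitmap>,
    query: &str,
    min_hits: u32,
) -> Vec<u32> {
    let mut hits: HashMap<u32, u32> = HashMap::new();
    for tri in trigrams(query) {
        if let Some(docs) = index.get(&tri) {
            for doc in docs.iter() {
                *hits.entry(doc).or_insert(0) += 1;
            }
        }
    }
    hits.into_iter()
        .filter(|&(_, n)| n >= min_hits)
        .map(|(doc, _)| doc)
        .collect()
}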

2.3 Graph Index for Relationships

Goal: Efficient traversal of document relationships

pub struct GraphIndex {
    // Forward edges (document -> related)
    forward: HashMap<DocId, Vec<Edge>>,

    // Backward edges (document <- related)
    backward: HashMap<DocId, Vec<Edge>>,

    // Edge metadata
    edge_data: HashMap<EdgeId, EdgeMetadata>,

    // Bloom filter for quick existence checks
    bloom: BloomFilter,
}

pub struct Edge {
    target: DocId,
    weight: f32,    // Relationship strength
    type_: EdgeType, // Related, references, child, etc.
}
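
A depth-bounded traversal sketch over the forward map, matching the shape of the Related query in Phase 3; it is simplified to plain DocId adjacency lists rather than the full Edge struct:

use std::collections::{HashMap, HashSet, VecDeque};

type DocId = u64;

fn related(forward: &HashMap<DocId, Vec<DocId>>, start: DocId, max_depth: u32) -> Vec<DocId> {
    let mut seen = HashSet::from([start]);
    let mut queue = VecDeque::from([(start, 0u32)]);
    let mut out = Vec::new();
    while let Some((doc, depth)) = queue.pop_front() {
        if depth == max_depth { continue; }
        for &next in forward.get(&doc).into_iter().flatten() {
            if seen.insert(next) {
                out.push(next);
                queue.push_back((next, depth + 1));
            }
        }
    }
    out
}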

2.4 Semantic Vector Index

Goal: Find conceptually similar documents

pub struct HnswIndex {
    // Hierarchical Navigable Small World graph
    layers: Vec<Layer>,

    // Vector storage
    vectors: HashMap<DocId, Vector>,

    // Distance function
    distance: DistanceMetric,
}

impl HnswIndex {
    pub fn search_knn(&self, query: &Vector, k: usize) -> Vec<(DocId, f32)> {
        // Approximate nearest neighbor search
    }

    pub fn add_vector(&mut self, id: DocId, vector: Vector) -> Result<()> {
        // Insert with automatic layer assignment
    }
}
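
One plausible DistanceMetric for this index is cosine distance over dense f32 vectors, sketched here for reference:

// Cosine distance: 0.0 for identical directions, up to 2.0 for opposite ones.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb).max(f32::EPSILON) // guard against zero vectors
}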

Phase 3: Query Engine (Week 6-7)

3.1 KOTA Query Language (KQL)

Goal: Natural, powerful query syntax

// Example queries:
// "meetings about rust"
// "related_to: 'project-mosaic' AND created: last_week"
// "consciousness sessions WITH insights ABOUT productivity"
// "similar_to: 'distributed cognition' LIMIT 10"

pub enum KotaQuery {
    // Text search
    Text { 
        query: String, 
        fields: Vec<Field>,
        fuzzy: bool 
    },

    // Relationship queries
    Related { 
        start: DocId, 
        depth: u32,
        filter: Option<Filter> 
    },

    // Temporal queries
    Temporal { 
        range: TimeRange,
        aggregation: Option<Aggregation> 
    },

    // Semantic queries
    Semantic { 
        vector: Vector,
        threshold: f32 
    },

    // Compound queries
    And(Box<KotaQuery>, Box<KotaQuery>),
    Or(Box<KotaQuery>, Box<KotaQuery>),
    Not(Box<KotaQuery>),
}
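
For illustration, a query like "meetings about rust, created last week" might lower to the following KotaQuery value; the Field and TimeRange variants shown are assumptions, not a settled API:

let query = KotaQuery::And(
    Box::new(KotaQuery::Text {
        query: "meetings about rust".into(),
        fields: vec![Field::Title, Field::Content], // hypothetical variants
        fuzzy: true,
    }),
    Box::new(KotaQuery::Temporal {
        range: TimeRange::LastWeek, // hypothetical variant
        aggregation: None,
    }),
);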

3.2 Query Parser and Planner

Goal: Convert text queries to execution plans

pub struct QueryParser {
    lexer: Lexer,
    grammar: Grammar,
}

pub struct QueryPlanner {
    statistics: TableStatistics,
    cost_model: CostModel,
}

pub struct ExecutionPlan {
    steps: Vec<PlanStep>,
    estimated_cost: f64,
    estimated_rows: usize,
}

pub enum PlanStep {
    IndexScan { index: IndexType, range: Range },
    SeqScan { filter: Filter },
    Join { left: Box<PlanStep>, right: Box<PlanStep> },
    Sort { key: SortKey },
    Limit { count: usize },
}

3.3 Query Executor

Goal: Efficient execution with streaming results

pub struct QueryExecutor {
    buffer_pool: BufferPool,
    thread_pool: ThreadPool,
}

impl QueryExecutor {
    pub async fn execute(&self, plan: ExecutionPlan) -> Result<QueryStream> {
        // Parallel execution where possible
        // Streaming results for large queries
        // Progress reporting for long operations
    }
}

pub struct QueryStream {
    receiver: mpsc::Receiver<Result<Document>>,
    metadata: QueryMetadata,
}
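
A sketch of how a caller might drain the stream; it reads the channel field directly for brevity, though the final API would more likely wrap this in a next() method, and doc.path is a hypothetical field:

async fn print_results(mut stream: QueryStream) -> Result<()> {
    // Each item is a Result<Document>; errors surface per-row, not per-query.
    while let Some(doc) = stream.receiver.recv().await {
        let doc = doc?;
        println!("{}", doc.path.display()); // path: hypothetical field
    }
    Ok(())
}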

Phase 4: Advanced Features (Week 8-9)

4.1 Memory Compression Integration

Goal: Intelligent compression aware of content patterns

pub struct CompressionEngine {
    // Domain-specific dictionaries
    markdown_dict: ZstdDict,
    frontmatter_dict: ZstdDict,

    // Compression levels by age/access
    hot_level: i32,  // Fast compression
    cold_level: i32, // High compression

    // Statistics for adaptive compression
    stats: CompressionStats,
}

impl CompressionEngine {
    pub fn compress_document(&self, doc: &Document) -> CompressedDocument {
        // 1. Separate frontmatter and content
        // 2. Apply appropriate dictionary
        // 3. Choose compression level based on access patterns
    }
}
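
A minimal sketch of dictionary training and hot-path compression with the zstd crate; the sample set and the 16 KiB dictionary size are placeholders:

use zstd::bulk::Compressor;

fn compress_with_dict(samples: &[Vec<u8>], doc: &[u8]) -> std::io::Result<Vec<u8>> {
    // Train a dictionary on representative markdown samples.
    let dict = zstd::dict::from_samples(samples, 16 * 1024)?;
    // Level 3 stands in for hot_level: fast compression for recently touched docs.
    let mut compressor = Compressor::with_dictionary(3, &dict)?;
    compressor.compress(doc)
}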

4.2 Real-Time Synchronization

Goal: Keep database in sync with filesystem

pub struct FileSystemSync {
    watcher: notify::RecommendedWatcher,
    db: Arc<KotaDB>,

    // Debouncing for rapid changes
    debouncer: Debouncer,

    // Conflict resolution
    resolver: ConflictResolver,
}

impl FileSystemSync {
    pub async fn start(&mut self) -> Result<()> {
        // Watch for filesystem changes
        // Queue updates with debouncing
        // Apply changes in batches
        // Handle conflicts (DB vs filesystem)
    }
}
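
The watch loop underneath this, sketched with the notify crate's channel-based API and without the debouncing or conflict handling shown above; the directory path is a placeholder:

use std::path::Path;
use std::sync::mpsc;
use notify::{recommended_watcher, RecursiveMode, Watcher};

fn watch_knowledge_base() -> notify::Result<()> {
    let (tx, rx) = mpsc::channel();
    let mut watcher = recommended_watcher(tx)?;
    watcher.watch(Path::new("knowledge/"), RecursiveMode::Recursive)?;
    for event in rx {
        // Each event carries the changed paths; real code would batch these
        // through the debouncer before touching the database.
        println!("fs change: {:?}", event?);
    }
    Ok(())
}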

4.3 Consciousness Integration

Goal: Direct integration with consciousness system

pub struct ConsciousnessInterface {
    db: Arc<KotaDB>,
    session_cache: LruCache<SessionId, SessionState>,
}

impl ConsciousnessInterface {
    pub async fn record_insight(&self, insight: Insight) -> Result<()> {
        // Store with temporal context
        // Update relationship graph
        // Trigger relevant indices
    }

    pub async fn query_context(&self, focus: Focus) -> Result<Context> {
        // Multi-index query
        // Relevance scoring
        // Context assembly
    }
}

4.4 Performance Optimizations

Goal: Sub-100ms query latency

pub struct PerformanceOptimizer {
    // Query result caching
    query_cache: Cache<QueryHash, ResultSet>,

    // Prepared statement cache
    prepared_statements: HashMap<String, PreparedQuery>,

    // Statistics for query optimization
    query_stats: QueryStatistics,

    // Adaptive indices
    adaptive_indexer: AdaptiveIndexer,
}

Phase 5: Integration & Testing (Week 10-11)

5.1 MCP Server Wrapper

Goal: Expose database through MCP protocol

pub struct KotaDBServer {
    db: Arc<KotaDB>,
    tools: Vec<Tool>,
}

impl McpServer for KotaDBServer {
    async fn handle_tool_call(&self, tool: &str, args: Value) -> Result<Value> {
        match tool {
            "query" => self.handle_query(args).await,
            "insert" => self.handle_insert(args).await,
            "update" => self.handle_update(args).await,
            // ... other operations
            _ => Err(anyhow::anyhow!("unknown tool: {tool}")), // exhaustive match; anyhow assumed
        }
    }
}

5.2 CLI Integration

Goal: Seamless integration with existing kota commands

// New commands
pub enum DatabaseCommand {
    Query { kql: String },
    Index { path: PathBuf },
    Compact,
    Stats,
    Export { format: ExportFormat },
}

// Integration with existing commands
impl KnowledgeOrgCommand {
    pub async fn execute_with_db(&self, db: &KotaDB) -> Result<()> {
        // Use database instead of in-memory indices
    }
}

5.3 Migration Tools

Goal: Smooth transition from current system

pub struct Migrator {
    source: FileSystemSource,
    target: KotaDB,
    progress: ProgressBar,
}

impl Migrator {
    pub async fn migrate(&mut self) -> Result<MigrationReport> {
        // 1. Scan all markdown files
        // 2. Parse and validate
        // 3. Insert into database
        // 4. Build indices
        // 5. Verify integrity
    }
}
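
A sketch of steps 1-3 as a single scan-and-insert loop, assuming the walkdir crate and a hypothetical insert_markdown method on KotaDB:

use walkdir::WalkDir;

async fn migrate_dir(db: &KotaDB, root: &str) -> anyhow::Result<(usize, usize)> {
    let (mut ok, mut failed) = (0, 0);
    for entry in WalkDir::new(root).into_iter().filter_map(|e| e.ok()) {
        if entry.path().extension().is_some_and(|e| e == "md") {
            match std::fs::read_to_string(entry.path()) {
                Ok(text) => {
                    db.insert_markdown(entry.path(), &text).await?; // hypothetical API
                    ok += 1;
                }
                Err(_) => failed += 1, // report in MigrationReport, don't abort the run
            }
        }
    }
    Ok((ok, failed))
}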

5.4 Testing Strategy

Unit Tests

#[cfg(test)]
mod tests {
    #[test]
    fn test_document_serialization() { }

    #[test]
    fn test_index_operations() { }

    #[test]
    fn test_query_parsing() { }
}

Integration Tests

#[tokio::test]
async fn test_full_query_pipeline() {
    // 1. Insert test documents
    // 2. Build indices
    // 3. Execute complex queries
    // 4. Verify results
}

Performance Benchmarks

#[bench]
fn bench_insert_throughput(b: &mut Bencher) {
    // Measure documents/second
}

#[bench]
fn bench_query_latency(b: &mut Bencher) {
    // Measure p50, p95, p99 latencies
}
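
Note that the #[bench] attribute requires nightly Rust; on stable, the benches/ directory would likely use criterion instead, sketched here (setup_db and insert_test_document are hypothetical helpers):

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_insert_throughput(c: &mut Criterion) {
    let db = setup_db(); // hypothetical helper
    c.bench_function("insert_doc", |b| {
        b.iter(|| db.insert_test_document()); // hypothetical helper
    });
}

criterion_group!(benches, bench_insert_throughput);
criterion_main!(benches);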

Chaos Testing

pub struct ChaosTester {
    db: KotaDB,
    chaos_monkey: ChaosMonkey,
}

impl ChaosTester {
    pub async fn test_crash_recovery(&mut self) {
        // 1. Start transaction
        // 2. Random crash
        // 3. Recover from WAL
        // 4. Verify consistency
    }
}

Phase 6: Production Hardening (Week 12-13)

6.1 Monitoring and Observability

pub struct Metrics {
    // Performance metrics
    query_latency: Histogram,
    index_hit_rate: Gauge,
    compression_ratio: Gauge,

    // Health metrics
    page_cache_hit_rate: Gauge,
    wal_size: Gauge,
    connection_count: Counter,
}

6.2 Backup and Recovery

pub struct BackupManager {
    schedule: CronSchedule,
    retention: RetentionPolicy,
    storage: BackupStorage,
}

impl BackupManager {
    pub async fn create_backup(&self) -> Result<BackupId> {
        // 1. Checkpoint WAL
        // 2. Snapshot data files
        // 3. Export metadata
        // 4. Compress and encrypt
    }
}

6.3 Security Hardening

pub struct SecurityLayer {
    // Encryption at rest
    encryption: AesGcm,

    // Access control
    permissions: PermissionSystem,

    // Audit logging
    audit_log: AuditLog,
}

Implementation Timeline

Week 1: Foundation

  • Set up project structure
  • Implement basic storage engine
  • Create simple B-tree index
  • Write first integration test

Week 2-3: Storage Engine

  • Complete page manager
  • Implement WAL
  • Add compression support
  • Benchmark I/O performance

Week 4-5: Indexing

  • Build inverted index for text
  • Implement graph index
  • Add fuzzy search
  • Create index benchmarks

Week 6-7: Query Engine

  • Design query language
  • Build parser and planner
  • Implement executor
  • Add streaming results

Week 8-9: Advanced Features

  • Integrate compression engine
  • Add filesystem sync
  • Build consciousness interface
  • Optimize performance

Week 10-11: Integration

  • Create MCP server wrapper
  • Update CLI commands
  • Build migration tools
  • Write comprehensive tests

Week 12-13: Production

  • Add monitoring/metrics
  • Implement backup system
  • Security hardening
  • Performance tuning

Success Metrics

Performance Targets

  • Insert throughput: >10,000 docs/sec
  • Query latency p50: <10ms
  • Query latency p99: <100ms
  • Memory usage: <500MB for 100k docs
  • Startup time: <1 second

Functionality Goals

  • Query types: Text, graph, temporal, semantic
  • Index types: B-tree, inverted, graph, vector
  • Compression ratio: >3x for typical content
  • Crash recovery: <10 second RTO
  • Backup size: <30% of original

Quality Standards

  • Test coverage: >90%
  • Documentation: 100% public API
  • Zero clippy warnings
  • No unsafe code (except FFI)
  • Fuzz testing: 24 hours no crashes

Risk Mitigation

Technical Risks

  1. Performance not meeting targets
     Mitigation: Profile early, optimize hot paths

  2. Memory usage too high
     Mitigation: Implement aggressive paging

  3. Query language too complex
     Mitigation: Start simple, iterate with users

Schedule Risks

  1. Underestimated complexity
     Mitigation: MVP first, features later

  2. Integration challenges
     Mitigation: Continuous integration from week 1

Operational Risks

  1. Migration failures
     Mitigation: Extensive testing, rollback plan

  2. Data corruption
     Mitigation: Checksums, backups, WAL

Conclusion

This custom database will provide KOTA with:

  • 10-100x faster queries than the current approach
  • Native markdown support with Git compatibility
  • Advanced cognitive features through semantic search
  • Complete control over memory architecture evolution

The 13-week timeline is aggressive but achievable, with clear milestones and risk mitigation strategies. The phased approach allows for early validation and continuous integration with the existing KOTA system.