KOTA Custom Database Implementation Plan¶
Executive Summary¶
This document outlines a comprehensive plan for implementing a custom database system specifically designed for KOTA's unique memory architecture needs. The database will replace the current file-scanning approach with a high-performance, memory-efficient system that maintains Git compatibility while enabling advanced cognitive capabilities.
Key Metrics from Analysis¶
- Current Scale: 1,002 markdown files, ~4.8MB total
- Update Rate: 85% of files modified weekly
- Query Performance Target: <100ms for consciousness sessions, <500ms for chat
- Memory Budget: <500MB for indices, unlimited for memory-mapped content
Phase 0: Foundation Research & Design (Week 0-1)¶
0.1 Feasibility Prototype¶
Goal: Validate core assumptions with minimal implementation
// Proof of concept in 500 lines
pub struct MiniKotaDB {
// Memory-mapped file storage
mmap: memmap2::MmapMut,
// Simple B-tree index
index: BTreeMap<PathBuf, DocumentOffset>,
// Basic query engine
query: SimpleQueryEngine,
}
Deliverables:
- [ ] Benchmark memory-mapped vs file I/O for markdown (see the sketch below)
- [ ] Test ZSTD compression ratios on KOTA content
- [ ] Validate B-tree performance for 10k documents
- [ ] Prototype fuzzy search with trigrams
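A minimal sketch of the first benchmark, assuming the memmap2 crate already referenced in this plan. A real measurement would use criterion with repeated runs to control for page-cache effects; the `compare_io` helper and its output format are illustrative only.

```rust
use std::{fs::File, io::Read, time::Instant};

/// Compare a buffered read of one markdown file against touching the same
/// bytes through a memory map. Illustrative helper, not a rigorous benchmark.
fn compare_io(path: &str) -> std::io::Result<()> {
    // Buffered read: copy the whole file into a Vec.
    let start = Instant::now();
    let mut buf = Vec::new();
    File::open(path)?.read_to_end(&mut buf)?;
    let read_time = start.elapsed();

    // Memory-mapped read: mapping a file requires an unsafe block.
    let file = File::open(path)?;
    let start = Instant::now();
    let mmap = unsafe { memmap2::Mmap::map(&file)? };
    let checksum: u64 = mmap.iter().map(|&b| b as u64).sum();
    let mmap_time = start.elapsed();

    println!("read: {read_time:?}, mmap: {mmap_time:?}, bytes: {} (sum {checksum})", buf.len());
    Ok(())
}
```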
0.2 Architecture Documentation¶
Goal: Detailed technical design before implementation
Documents to Create:
1. ARCHITECTURE.md - System design and components
2. DATA_MODEL.md - Storage format and indices
3. QUERY_LANGUAGE.md - KOTA-specific query syntax
4. INTEGRATION_GUIDE.md - How to integrate with existing system
0.3 Development Environment Setup¶
# Create project structure
mkdir -p crates/kota-db/{src,tests,benches,examples}
mkdir -p crates/kota-db/src/{storage,index,query,compression}
# Add dependencies
cat >> Cargo.toml << EOF
[workspace]
members = ["crates/kota-db"]
[dependencies.kota-db]
version = "0.1.0"
path = "crates/kota-db"
EOF
Phase 1: Core Storage Engine (Week 2-3)¶
1.1 Page-Based Storage Manager¶
Goal: Efficient disk I/O with fixed-size pages
pub struct StorageEngine {
// Page size: 4KB (matches OS page size)
page_size: usize,
// Page cache with LRU eviction
page_cache: LruCache<PageId, Page>,
// Free page management
free_list: FreePageList,
// Write-ahead log
wal: WriteAheadLog,
}
impl StorageEngine {
pub fn allocate_page(&mut self) -> Result<PageId>;
pub fn read_page(&mut self, id: PageId) -> Result<&Page>;
pub fn write_page(&mut self, id: PageId, page: Page) -> Result<()>;
pub fn sync(&mut self) -> Result<()>;
}
Key Features:
- Copy-on-write for versioning
- Checksums for corruption detection (see the sketch below)
- Compression at page level
- Memory-mapped option for hot data
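A sketch of the checksum feature under two assumptions that this plan does not yet fix: the checksum lives in the first four bytes of each 4 KB page, and hashing uses the crc32fast crate.

```rust
const PAGE_SIZE: usize = 4096;

pub struct Page {
    bytes: [u8; PAGE_SIZE],
}

impl Page {
    /// Recompute the CRC over the page body and stamp it into the header bytes.
    pub fn seal(&mut self) {
        let crc = crc32fast::hash(&self.bytes[4..]);
        self.bytes[..4].copy_from_slice(&crc.to_le_bytes());
    }

    /// Verify the stored checksum before the page is handed to callers.
    pub fn verify(&self) -> bool {
        let stored = u32::from_le_bytes(self.bytes[..4].try_into().unwrap());
        stored == crc32fast::hash(&self.bytes[4..])
    }
}
```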
1.2 Document Storage Format¶
Goal: Optimized format for markdown with frontmatter
#[repr(C)]
pub struct DocumentHeader {
magic: [u8; 4], // "KOTA"
version: u16, // Format version
flags: DocumentFlags, // Compression, encryption, etc.
// Offsets within document
frontmatter_offset: u32,
frontmatter_len: u32,
content_offset: u32,
content_len: u32,
// Metadata
created: i64, // Unix timestamp
updated: i64,
git_hash: [u8; 20], // SHA-1
// Relationships
related_count: u16,
tags_count: u16,
}
pub struct CompressedDocument {
header: DocumentHeader,
data: Vec<u8>, // ZSTD compressed with dictionary
}
1.3 Write-Ahead Logging¶
Goal: Durability and crash recovery
pub struct WriteAheadLog {
log_file: tokio::fs::File,
sequence: AtomicU64,
checkpoint_interval: Duration,
}
pub enum WalEntry {
Begin { tx_id: u64 },
Insert { tx_id: u64, doc: Document },
Update { tx_id: u64, id: DocId, changes: Delta },
Delete { tx_id: u64, id: DocId },
Commit { tx_id: u64 },
Checkpoint { snapshot: DatabaseState },
}
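A sketch of the append path for the WriteAheadLog above. The length-prefixed record layout and the use of serde_json are placeholders; a real log would use a compact binary encoding plus a per-record checksum.

```rust
use std::sync::atomic::Ordering;
use tokio::io::AsyncWriteExt;

impl WriteAheadLog {
    /// Append one record and force it to disk before acknowledging the write.
    pub async fn append(&mut self, entry: &serde_json::Value) -> std::io::Result<u64> {
        let seq = self.sequence.fetch_add(1, Ordering::SeqCst);
        let payload = serde_json::to_vec(entry).map_err(std::io::Error::other)?;

        // Record layout (assumed): [sequence: u64][len: u32][payload bytes]
        self.log_file.write_all(&seq.to_le_bytes()).await?;
        self.log_file.write_all(&(payload.len() as u32).to_le_bytes()).await?;
        self.log_file.write_all(&payload).await?;

        // Durability: nothing commits until the OS has flushed the record.
        self.log_file.sync_all().await?;
        Ok(seq)
    }
}
```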
Phase 2: Indexing Subsystem (Week 4-5)¶
2.1 Multi-Modal Index Manager¶
Goal: Unified interface for different index types
pub trait Index: Send + Sync {
type Key;
type Value;
fn insert(&mut self, key: Self::Key, value: Self::Value) -> Result<()>;
fn delete(&mut self, key: &Self::Key) -> Result<()>;
fn search(&self, query: &Query) -> Result<Vec<Self::Value>>;
fn range(&self, start: &Self::Key, end: &Self::Key) -> Result<Vec<Self::Value>>;
}
pub struct IndexManager {
// Primary indices
path_index: BTreeIndex<PathBuf, DocId>,
// Secondary indices
tag_index: InvertedIndex<String, DocId>,
fulltext_index: TrigramIndex,
temporal_index: TimeSeriesIndex,
// Graph indices
relationship_graph: AdjacencyList<DocId>,
// Semantic indices
embedding_index: HnswIndex<Vector, DocId>,
}
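Why a B-tree for the path index: a prefix lookup ("everything under a directory") becomes a single ordered range scan. A minimal sketch, with DocId as a stand-in u64 and the raw BTreeMap in place of the BTreeIndex wrapper:

```rust
use std::collections::BTreeMap;
use std::ops::Bound;
use std::path::PathBuf;

type DocId = u64; // stand-in for the real DocId type

/// All documents under a directory via one ordered range scan, which is the
/// main reason the path index is a B-tree rather than a hash map.
fn docs_under(index: &BTreeMap<PathBuf, DocId>, dir: &str) -> Vec<DocId> {
    let start = PathBuf::from(dir);
    index
        .range((Bound::Included(start), Bound::Unbounded))
        .take_while(|(path, _)| path.starts_with(dir))
        .map(|(_, id)| *id)
        .collect()
}
```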
2.2 Full-Text Search with Fuzzy Matching¶
Goal: Fast, typo-tolerant search
pub struct TrigramIndex {
// Trigram to document mapping
trigrams: HashMap<[u8; 3], RoaringBitmap>,
// Document to position mapping
positions: HashMap<DocId, Vec<TrigramPosition>>,
// Fuzzy matcher
matcher: FuzzyMatcher,
}
impl TrigramIndex {
pub fn search_fuzzy(&self, query: &str, max_distance: u32) -> Vec<SearchResult> {
// 1. Extract query trigrams
// 2. Find candidate documents
// 3. Calculate edit distance
// 4. Rank by relevance
}
}
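A sketch of step 2, candidate generation, assuming DocId fits in the u32 that RoaringBitmap stores. Rather than intersecting every trigram's posting list, the fuzzy variant keeps any document sharing at least `min_shared` trigrams with the query; edit-distance ranking then runs only over this reduced set.

```rust
use roaring::RoaringBitmap;
use std::collections::HashMap;

/// Count how many query trigrams each document shares and keep the documents
/// above a threshold. Returned ids are the u32 stand-in for DocId.
fn fuzzy_candidates(
    trigrams: &HashMap<[u8; 3], RoaringBitmap>,
    query: &str,
    min_shared: u32,
) -> Vec<u32> {
    let mut shared: HashMap<u32, u32> = HashMap::new();
    for window in query.as_bytes().windows(3) {
        let key = [window[0], window[1], window[2]];
        if let Some(postings) = trigrams.get(&key) {
            for doc in postings.iter() {
                *shared.entry(doc).or_insert(0) += 1;
            }
        }
    }
    shared
        .into_iter()
        .filter(|&(_, count)| count >= min_shared)
        .map(|(doc, _)| doc)
        .collect()
}
```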
2.3 Graph Index for Relationships¶
Goal: Efficient traversal of document relationships
pub struct GraphIndex {
// Forward edges (document -> related)
forward: HashMap<DocId, Vec<Edge>>,
// Backward edges (document <- related)
backward: HashMap<DocId, Vec<Edge>>,
// Edge metadata
edge_data: HashMap<EdgeId, EdgeMetadata>,
// Bloom filter for quick existence checks
bloom: BloomFilter,
}
pub struct Edge {
target: DocId,
weight: f32, // Relationship strength
type_: EdgeType, // Related, references, child, etc.
}
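The traversal behind depth-limited relationship queries is a bounded breadth-first search over the forward map. A sketch with edges simplified to bare DocIds (u64 stand-in), ignoring weights and edge types:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type DocId = u64; // stand-in for the real DocId type

/// Every document reachable from `start` within `max_depth` hops.
fn related_within(
    forward: &HashMap<DocId, Vec<DocId>>,
    start: DocId,
    max_depth: u32,
) -> HashSet<DocId> {
    let mut seen = HashSet::from([start]);
    let mut frontier = VecDeque::from([(start, 0u32)]);
    while let Some((doc, depth)) = frontier.pop_front() {
        if depth == max_depth {
            continue;
        }
        for &next in forward.get(&doc).into_iter().flatten() {
            if seen.insert(next) {
                frontier.push_back((next, depth + 1));
            }
        }
    }
    seen.remove(&start); // report only related documents, not the start node
    seen
}
```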
2.4 Vector Index for Semantic Search¶
Goal: Find conceptually similar documents
pub struct HnswIndex {
// Hierarchical Navigable Small World graph
layers: Vec<Layer>,
// Vector storage
vectors: HashMap<DocId, Vector>,
// Distance function
distance: DistanceMetric,
}
impl HnswIndex {
pub fn search_knn(&self, query: &Vector, k: usize) -> Vec<(DocId, f32)> {
// Approximate nearest neighbor search
}
pub fn add_vector(&mut self, id: DocId, vector: Vector) -> Result<()> {
// Insert with automatic layer assignment
}
}
Phase 3: Query Engine (Week 6-7)¶
3.1 KOTA Query Language (KQL)¶
Goal: Natural, powerful query syntax
// Example queries:
// "meetings about rust"
// "related_to: 'project-mosaic' AND created: last_week"
// "consciousness sessions WITH insights ABOUT productivity"
// "similar_to: 'distributed cognition' LIMIT 10"
pub enum KotaQuery {
// Text search
Text {
query: String,
fields: Vec<Field>,
fuzzy: bool
},
// Relationship queries
Related {
start: DocId,
depth: u32,
filter: Option<Filter>
},
// Temporal queries
Temporal {
range: TimeRange,
aggregation: Option<Aggregation>
},
// Semantic queries
Semantic {
vector: Vector,
threshold: f32
},
// Compound queries
And(Box<KotaQuery>, Box<KotaQuery>),
Or(Box<KotaQuery>, Box<KotaQuery>),
Not(Box<KotaQuery>),
}
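For illustration, this is how the second example query above would look once parsed into the enum; `mosaic_id` and `last_week` stand in for values the parser and planner would resolve.

```rust
/// "related_to: 'project-mosaic' AND created: last_week" as a KotaQuery value.
fn example_query(mosaic_id: DocId, last_week: TimeRange) -> KotaQuery {
    KotaQuery::And(
        Box::new(KotaQuery::Related {
            start: mosaic_id, // DocId resolved from 'project-mosaic'
            depth: 1,         // direct relationships only
            filter: None,
        }),
        Box::new(KotaQuery::Temporal {
            range: last_week, // TimeRange covering the last seven days
            aggregation: None,
        }),
    )
}
```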
3.2 Query Parser and Planner¶
Goal: Convert text queries to execution plans
pub struct QueryParser {
lexer: Lexer,
grammar: Grammar,
}
pub struct QueryPlanner {
statistics: TableStatistics,
cost_model: CostModel,
}
pub struct ExecutionPlan {
steps: Vec<PlanStep>,
estimated_cost: f64,
estimated_rows: usize,
}
pub enum PlanStep {
IndexScan { index: IndexType, range: Range },
SeqScan { filter: Filter },
Join { left: Box<PlanStep>, right: Box<PlanStep> },
Sort { key: SortKey },
Limit { count: usize },
}
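The planner's central decision is index scan versus sequential scan. A deliberately simplified sketch: `ScanStats` stands in for TableStatistics, and the 10% selectivity cutoff is an arbitrary placeholder that the real cost model would replace.

```rust
struct ScanStats {
    total_docs: usize,
}

/// Pick a scan strategy from estimated selectivity.
fn choose_scan(stats: &ScanStats, estimated_matches: usize) -> &'static str {
    let selectivity = estimated_matches as f64 / stats.total_docs.max(1) as f64;
    if selectivity < 0.10 {
        "IndexScan" // few matches: index lookups plus random reads win
    } else {
        "SeqScan"   // most documents qualify: a streaming full scan is cheaper
    }
}
```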
3.3 Query Executor¶
Goal: Efficient execution with streaming results
pub struct QueryExecutor {
buffer_pool: BufferPool,
thread_pool: ThreadPool,
}
impl QueryExecutor {
pub async fn execute(&self, plan: ExecutionPlan) -> Result<QueryStream> {
// Parallel execution where possible
// Streaming results for large queries
// Progress reporting for long operations
}
}
pub struct QueryStream {
receiver: mpsc::Receiver<Result<Document>>,
metadata: QueryMetadata,
}
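Usage sketch for the streaming interface: results arrive over the channel as they are produced, so a caller can render incrementally or stop early instead of waiting for the full set. The error type behind the bare Result alias is assumed.

```rust
/// Drain a query stream; a real caller could render each document as it arrives
/// rather than collecting everything first.
async fn collect_results(mut stream: QueryStream) -> Result<Vec<Document>> {
    let mut docs = Vec::new();
    while let Some(item) = stream.receiver.recv().await {
        docs.push(item?); // each item is a Result<Document>; an error ends the stream early
    }
    Ok(docs)
}
```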
Phase 4: Advanced Features (Week 8-9)¶
4.1 Memory Compression Integration¶
Goal: Intelligent compression aware of content patterns
pub struct CompressionEngine {
// Domain-specific dictionaries
markdown_dict: ZstdDict,
frontmatter_dict: ZstdDict,
// Compression levels by age/access
hot_level: i32, // Fast compression
cold_level: i32, // High compression
// Statistics for adaptive compression
stats: CompressionStats,
}
impl CompressionEngine {
pub fn compress_document(&self, doc: &Document) -> CompressedDocument {
// 1. Separate frontmatter and content
// 2. Apply appropriate dictionary
// 3. Choose compression level based on access patterns
}
}
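A sketch of dictionary-based compression, assuming the zstd crate: a dictionary trained on representative notes lets small markdown files compress far better than they would individually, which is what makes the >3x ratio target plausible.

```rust
use zstd::bulk::{Compressor, Decompressor};

/// Train a ~16 KB dictionary on representative documents.
fn build_markdown_dict(samples: &[Vec<u8>]) -> std::io::Result<Vec<u8>> {
    zstd::dict::from_samples(samples, 16 * 1024)
}

/// Compress one document with the trained dictionary at the chosen level.
fn compress_with_dict(dict: &[u8], doc: &[u8], level: i32) -> std::io::Result<Vec<u8>> {
    Compressor::with_dictionary(level, dict)?.compress(doc)
}

/// Decompress; `capacity` is an upper bound on the decompressed size.
fn decompress_with_dict(dict: &[u8], data: &[u8], capacity: usize) -> std::io::Result<Vec<u8>> {
    Decompressor::with_dictionary(dict)?.decompress(data, capacity)
}
```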
4.2 Real-Time Synchronization¶
Goal: Keep database in sync with filesystem
pub struct FileSystemSync {
watcher: notify::RecommendedWatcher,
db: Arc<KotaDB>,
// Debouncing for rapid changes
debouncer: Debouncer,
// Conflict resolution
resolver: ConflictResolver,
}
impl FileSystemSync {
pub async fn start(&mut self) -> Result<()> {
// Watch for filesystem changes
// Queue updates with debouncing
// Apply changes in batches
// Handle conflicts (DB vs filesystem)
}
}
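A sketch of the watch-and-debounce loop, assuming the notify crate: events are batched until the filesystem has been quiet for 500 ms, then flushed as one update so an editor that saves repeatedly does not trigger repeated re-indexing. The flush step is a placeholder for the actual database update.

```rust
use notify::{recommended_watcher, RecursiveMode, Watcher};
use std::{collections::HashSet, path::PathBuf, sync::mpsc, time::Duration};

fn watch_knowledge_base(root: &std::path::Path) -> notify::Result<()> {
    let (tx, rx) = mpsc::channel();
    let mut watcher = recommended_watcher(tx)?;
    watcher.watch(root, RecursiveMode::Recursive)?;

    let mut pending: HashSet<PathBuf> = HashSet::new();
    loop {
        match rx.recv_timeout(Duration::from_millis(500)) {
            Ok(Ok(event)) => pending.extend(event.paths), // accumulate touched paths
            Ok(Err(e)) => eprintln!("watch error: {e}"),
            Err(mpsc::RecvTimeoutError::Timeout) if !pending.is_empty() => {
                // Quiet period elapsed: apply the whole batch to the database here.
                println!("flushing {} changed paths", pending.len());
                pending.clear();
            }
            Err(mpsc::RecvTimeoutError::Timeout) => {}
            Err(mpsc::RecvTimeoutError::Disconnected) => break,
        }
    }
    Ok(())
}
```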
4.3 Consciousness Integration¶
Goal: Direct integration with consciousness system
pub struct ConsciousnessInterface {
db: Arc<KotaDB>,
session_cache: LruCache<SessionId, SessionState>,
}
impl ConsciousnessInterface {
pub async fn record_insight(&self, insight: Insight) -> Result<()> {
// Store with temporal context
// Update relationship graph
// Trigger relevant indices
}
pub async fn query_context(&self, focus: Focus) -> Result<Context> {
// Multi-index query
// Relevance scoring
// Context assembly
}
}
4.4 Performance Optimizations¶
Goal: Sub-100ms query latency
pub struct PerformanceOptimizer {
// Query result caching
query_cache: Cache<QueryHash, ResultSet>,
// Prepared statement cache
prepared_statements: HashMap<String, PreparedQuery>,
// Statistics for query optimization
query_stats: QueryStatistics,
// Adaptive indices
adaptive_indexer: AdaptiveIndexer,
}
Phase 5: Integration & Testing (Week 10-11)¶
5.1 MCP Server Wrapper¶
Goal: Expose database through MCP protocol
pub struct KotaDBServer {
db: Arc<KotaDB>,
tools: Vec<Tool>,
}
impl McpServer for KotaDBServer {
async fn handle_tool_call(&self, tool: &str, args: Value) -> Result<Value> {
match tool {
"query" => self.handle_query(args).await,
"insert" => self.handle_insert(args).await,
"update" => self.handle_update(args).await,
// ... other operations
}
}
}
5.2 CLI Integration¶
Goal: Seamless integration with existing kota commands
// New commands
pub enum DatabaseCommand {
Query { kql: String },
Index { path: PathBuf },
Compact,
Stats,
Export { format: ExportFormat },
}
// Integration with existing commands
impl KnowledgeOrgCommand {
pub async fn execute_with_db(&self, db: &KotaDB) -> Result<()> {
// Use database instead of in-memory indices
}
}
5.3 Migration Tools¶
Goal: Smooth transition from current system
pub struct Migrator {
source: FileSystemSource,
target: KotaDB,
progress: ProgressBar,
}
impl Migrator {
pub async fn migrate(&mut self) -> Result<MigrationReport> {
// 1. Scan all markdown files
// 2. Parse and validate
// 3. Insert into database
// 4. Build indices
// 5. Verify integrity
}
}
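A sketch of the scan-and-insert steps, assuming the walkdir crate; `KotaDB::insert_document` and the MigrationReport fields shown are assumptions about APIs this plan has not yet fixed. Failures are collected rather than aborting the whole run.

```rust
use walkdir::WalkDir;

async fn migrate_markdown(root: &std::path::Path, db: &KotaDB) -> anyhow::Result<MigrationReport> {
    let mut migrated = 0usize;
    let mut failed = Vec::new();
    for entry in WalkDir::new(root).into_iter().filter_map(Result::ok) {
        // Only markdown files participate in the migration.
        if entry.path().extension().and_then(|e| e.to_str()) != Some("md") {
            continue;
        }
        match tokio::fs::read_to_string(entry.path()).await {
            Ok(raw) => {
                db.insert_document(entry.path(), &raw).await?; // assumed API
                migrated += 1;
            }
            Err(e) => failed.push((entry.path().to_path_buf(), e.to_string())),
        }
    }
    Ok(MigrationReport { migrated, failed }) // assumed report shape
}
```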
5.4 Testing Strategy¶
Unit Tests¶
#[cfg(test)]
mod tests {
#[test]
fn test_document_serialization() { }
#[test]
fn test_index_operations() { }
#[test]
fn test_query_parsing() { }
}
Integration Tests¶
#[tokio::test]
async fn test_full_query_pipeline() {
// 1. Insert test documents
// 2. Build indices
// 3. Execute complex queries
// 4. Verify results
}
Performance Benchmarks¶
#[bench]
fn bench_insert_throughput(b: &mut Bencher) {
// Measure documents/second
}
#[bench]
fn bench_query_latency(b: &mut Bencher) {
// Measure p50, p95, p99 latencies
}
Chaos Testing¶
pub struct ChaosTester {
db: KotaDB,
chaos_monkey: ChaosMonkey,
}
impl ChaosTester {
pub async fn test_crash_recovery(&mut self) {
// 1. Start transaction
// 2. Random crash
// 3. Recover from WAL
// 4. Verify consistency
}
}
Phase 6: Production Hardening (Week 12-13)¶
6.1 Monitoring and Observability¶
pub struct Metrics {
// Performance metrics
query_latency: Histogram,
index_hit_rate: Gauge,
compression_ratio: Gauge,
// Health metrics
page_cache_hit_rate: Gauge,
wal_size: Gauge,
connection_count: Counter,
}
6.2 Backup and Recovery¶
pub struct BackupManager {
schedule: CronSchedule,
retention: RetentionPolicy,
storage: BackupStorage,
}
impl BackupManager {
pub async fn create_backup(&self) -> Result<BackupId> {
// 1. Checkpoint WAL
// 2. Snapshot data files
// 3. Export metadata
// 4. Compress and encrypt
}
}
6.3 Security Hardening¶
pub struct SecurityLayer {
// Encryption at rest
encryption: AesGcm,
// Access control
permissions: PermissionSystem,
// Audit logging
audit_log: AuditLog,
}
Implementation Timeline¶
Week 1: Foundation¶
- Set up project structure
- Implement basic storage engine
- Create simple B-tree index
- Write first integration test
Week 2-3: Storage Engine¶
- Complete page manager
- Implement WAL
- Add compression support
- Benchmark I/O performance
Week 4-5: Indexing¶
- Build inverted index for text
- Implement graph index
- Add fuzzy search
- Create index benchmarks
Week 6-7: Query Engine¶
- Design query language
- Build parser and planner
- Implement executor
- Add streaming results
Week 8-9: Advanced Features¶
- Integrate compression engine
- Add filesystem sync
- Build consciousness interface
- Optimize performance
Week 10-11: Integration¶
- Create MCP server wrapper
- Update CLI commands
- Build migration tools
- Write comprehensive tests
Week 12-13: Production¶
- Add monitoring/metrics
- Implement backup system
- Security hardening
- Performance tuning
Success Metrics¶
Performance Targets¶
- Insert throughput: >10,000 docs/sec
- Query latency p50: <10ms
- Query latency p99: <100ms
- Memory usage: <500MB for 100k docs
- Startup time: <1 second
Functionality Goals¶
- Query types: Text, graph, temporal, semantic
- Index types: B-tree, inverted, graph, vector
- Compression ratio: >3x for typical content
- Crash recovery: <10 second RTO
- Backup size: <30% of original
Quality Standards¶
- Test coverage: >90%
- Documentation: 100% public API
- Zero clippy warnings
- No unsafe code (except FFI)
- Fuzz testing: 24 hours no crashes
Risk Mitigation¶
Technical Risks¶
- Performance not meeting targets
  - Mitigation: Profile early, optimize hot paths
- Memory usage too high
  - Mitigation: Implement aggressive paging
- Query language too complex
  - Mitigation: Start simple, iterate with users
Schedule Risks¶
- Underestimated complexity
  - Mitigation: MVP first, features later
- Integration challenges
  - Mitigation: Continuous integration from week 1
Operational Risks¶
- Migration failures
  - Mitigation: Extensive testing, rollback plan
- Data corruption
  - Mitigation: Checksums, backups, WAL
Conclusion¶
This custom database will provide KOTA with:
- 10-100x faster queries than the current approach
- Native markdown support with Git compatibility
- Advanced cognitive features through semantic search
- Complete control over memory architecture evolution
The 13-week timeline is aggressive but achievable, with clear milestones and risk mitigation strategies. The phased approach allows for early validation and continuous integration with the existing KOTA system.