KotaDB MVP Specification¶
Overview¶
This document defines a Minimum Viable Product for KotaDB that can be built in 2-3 weeks and immediately provide value to KOTA. The MVP focuses on solving the most painful current problems while laying a foundation for future expansion.
MVP Goals¶
- Eliminate startup scan time (currently ~30s for 1000 files)
- Enable persistent indices (survive restarts)
- Provide fast full-text search (<10ms for common queries)
- Support basic relationship queries (1-2 levels deep)
- Maintain Git compatibility (keep markdown files as source)
What's In Scope¶
Core Features (Week 1)¶
- Document Storage
  - Read markdown files on demand (not stored in DB)
  - Store only metadata and indices
  - SHA-256 hashes for change detection
- Primary Index
  - Simple B-tree for path → metadata lookup
  - In-memory with periodic persistence
  - ~500 bytes per document overhead
- Full-Text Search
  - Basic trigram index
  - Case-insensitive matching
  - Simple relevance scoring (TF-IDF)
- Tag Index
  - Inverted index for tags
  - Fast intersection queries
  - Support for tag hierarchies
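The "simple relevance scoring (TF-IDF)" item could look roughly like the sketch below. The function name `tf_idf` and the smoothed IDF variant `ln(1 + N / df)` are assumptions made for illustration; the spec does not fix a formula. The `word_count` argument corresponds to the `word_count` field that the document metadata keeps "for scoring".

```rust
/// Hypothetical TF-IDF score for one term in one document.
///
/// tf  = occurrences of the term / words in the document
/// idf = ln(1 + total documents / documents containing the term)
pub fn tf_idf(term_count: u32, word_count: u32, total_docs: u32, docs_with_term: u32) -> f32 {
    // An empty document or an unseen term contributes nothing.
    if word_count == 0 || docs_with_term == 0 {
        return 0.0;
    }
    let tf = term_count as f32 / word_count as f32;
    let idf = (1.0 + total_docs as f32 / docs_with_term as f32).ln();
    tf * idf
}
```

With this shape, a term that appears in few documents scores higher than one that appears in most of them, and more occurrences within a document raise its score.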
Extended Features (Week 2)¶
- Relationship Graph
  - Simple adjacency list
  - Bidirectional links
  - 1-2 level traversal only
- File Watcher
  - Monitor for changes
  - Incremental index updates
  - Debouncing for rapid edits
- Basic Query Interface
  - Simple JSON-based queries
  - No query language parser
  - Direct index access
Integration (Week 3)¶
- CLI Commands
  - kota db index: Build indices
  - kota db search: Query interface
  - kota db stats: Database statistics
- MCP Server
  - Expose search via MCP tools
  - Replace KnowledgeOrgServer indices
- Migration Tool
  - Scan existing files
  - Build initial indices
  - Verify integrity
What's Out of Scope (Future)¶
- ❌ Complex query language (use JSON for now)
- ❌ Semantic/vector search (requires embeddings)
- ❌ Advanced graph algorithms (keep it simple)
- ❌ Compression (files stay uncompressed)
- ❌ Transactions (single-writer for now)
- ❌ Backup/restore (just rebuild indices)
- ❌ Encryption (rely on OS)
Technical Design¶
Storage Format¶
// Minimal document metadata
pub struct DocumentMeta {
    pub id: [u8; 16],     // UUID
    pub path: String,     // Full path
    pub hash: [u8; 32],   // Content hash (SHA-256)
    pub size: u64,        // File size in bytes
    pub created: i64,     // Unix timestamp
    pub updated: i64,     // Unix timestamp
    pub title: String,    // From frontmatter
    pub word_count: u32,  // For scoring
}

// Simple index entry
pub struct IndexEntry {
    pub doc_id: [u8; 16],
    pub score: f32,       // Relevance score
}
File Layout¶
~/.kota/db/
├── meta.db              # Document metadata (MessagePack)
├── indices/
│   ├── paths.idx        # Path → ID mapping
│   ├── trigrams.idx     # Trigram inverted index
│   ├── tags.idx         # Tag inverted index
│   └── links.idx        # Relationship graph
└── wal/                 # Write-ahead log
    └── changes.log      # Pending updates
Index Structures¶
Path Index (B-Tree)¶
// Simple B-tree node
pub struct BTreeNode {
    pub keys: Vec<String>,      // Paths
    pub values: Vec<[u8; 16]>,  // Document IDs
    pub children: Vec<u64>,     // Child page offsets
    pub is_leaf: bool,
}
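To illustrate how a lookup might walk this node layout, here is a minimal sketch of probing a single node. It assumes a classic B-tree in which keys are sorted, internal nodes also carry values, and an internal node has one more child than it has keys; the `Probe` enum and the `probe` method are hypothetical names, and the node struct is repeated only so the example is self-contained.

```rust
/// Outcome of looking for a path inside one node (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub enum Probe {
    Found([u8; 16]), // the path is stored in this node
    Descend(u64),    // continue at the child page with this offset
    Missing,         // the path is not in the tree
}

pub struct BTreeNode {
    pub keys: Vec<String>,
    pub values: Vec<[u8; 16]>,
    pub children: Vec<u64>,
    pub is_leaf: bool,
}

impl BTreeNode {
    /// Binary-search the sorted keys; on a miss, pick the child that covers the gap.
    pub fn probe(&self, path: &str) -> Probe {
        match self.keys.binary_search_by(|k| k.as_str().cmp(path)) {
            Ok(i) => Probe::Found(self.values[i]),
            Err(_) if self.is_leaf => Probe::Missing,
            // `i` is the insertion point, which is also the index of the covering child.
            Err(i) => self
                .children
                .get(i)
                .map_or(Probe::Missing, |&offset| Probe::Descend(offset)),
        }
    }
}
```

A full lookup would repeat `probe` from the root, loading the page at each returned offset until it sees `Found` or `Missing`.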
Trigram Index¶
use std::collections::HashMap;

// Trigram posting list
pub struct TrigramIndex {
    // Trigram → Document IDs
    pub postings: HashMap<[u8; 3], Vec<[u8; 16]>>,
    // Document → Trigram positions
    pub positions: HashMap<[u8; 16], Vec<u32>>,
}
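A minimal sketch of how such an index could be filled and queried follows. It is an illustration rather than the planned implementation: `TrigramSketch` and `trigrams` are hypothetical names, postings are kept in `HashSet`s for brevity, positions are left out, and trigrams are taken over lowercased bytes to get case-insensitive matching.

```rust
use std::collections::{HashMap, HashSet};

type DocId = [u8; 16];

/// Distinct byte trigrams of the lowercased text.
pub fn trigrams(text: &str) -> HashSet<[u8; 3]> {
    let lower = text.to_lowercase();
    lower
        .as_bytes()
        .windows(3)
        .map(|w| [w[0], w[1], w[2]])
        .collect()
}

#[derive(Default)]
pub struct TrigramSketch {
    postings: HashMap<[u8; 3], HashSet<DocId>>,
}

impl TrigramSketch {
    pub fn index_document(&mut self, id: DocId, content: &str) {
        for t in trigrams(content) {
            self.postings.entry(t).or_default().insert(id);
        }
    }

    /// Documents containing every trigram of the query. These are candidates
    /// only: the trigrams may occur in a document without being adjacent.
    pub fn search(&self, query: &str) -> HashSet<DocId> {
        let mut result: Option<HashSet<DocId>> = None;
        for t in trigrams(query) {
            let docs = match self.postings.get(&t) {
                Some(d) => d,
                None => return HashSet::new(), // one unseen trigram rules out every document
            };
            result = Some(match result {
                None => docs.clone(),
                Some(acc) => acc.intersection(docs).copied().collect(),
            });
        }
        // Queries shorter than three bytes produce no trigrams and no hits.
        result.unwrap_or_default()
    }
}
```

Because the result is a candidate set, a real search would still verify hits against the file content (or use the stored positions) before ranking them.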
Tag Index¶
use std::collections::HashMap;

// Simple inverted index
pub struct TagIndex {
    // Tag → Document IDs
    pub postings: HashMap<String, Vec<[u8; 16]>>,
    // Document → Tags (for removal)
    pub doc_tags: HashMap<[u8; 16], Vec<String>>,
}
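The reverse `doc_tags` map exists so that a document can be removed without scanning every posting list. The sketch below shows that idea together with an AND query; `TagSketch`, `remove_document`, and `search_all` are hypothetical names, and `HashSet` postings are an assumption made to keep the intersection short.

```rust
use std::collections::{HashMap, HashSet};

type DocId = [u8; 16];

#[derive(Default)]
pub struct TagSketch {
    postings: HashMap<String, HashSet<DocId>>,
    doc_tags: HashMap<DocId, Vec<String>>,
}

impl TagSketch {
    pub fn add_tags(&mut self, id: DocId, tags: &[&str]) {
        for tag in tags {
            self.postings.entry(tag.to_string()).or_default().insert(id);
        }
        self.doc_tags
            .entry(id)
            .or_default()
            .extend(tags.iter().map(|t| t.to_string()));
    }

    /// Uses the reverse map to touch only the posting lists that mention `id`.
    pub fn remove_document(&mut self, id: &DocId) {
        for tag in self.doc_tags.remove(id).unwrap_or_default() {
            let now_empty = match self.postings.get_mut(&tag) {
                Some(docs) => {
                    docs.remove(id);
                    docs.is_empty()
                }
                None => false,
            };
            if now_empty {
                self.postings.remove(&tag);
            }
        }
    }

    /// Documents carrying every requested tag ("op": "and").
    pub fn search_all(&self, tags: &[&str]) -> HashSet<DocId> {
        let mut sets: Vec<&HashSet<DocId>> = Vec::new();
        for tag in tags {
            match self.postings.get(*tag) {
                Some(s) => sets.push(s),
                None => return HashSet::new(), // an unknown tag matches nothing
            }
        }
        // Start from the smallest set so intermediate results stay small.
        sets.sort_by_key(|s| s.len());
        let mut iter = sets.into_iter();
        let first = match iter.next() {
            Some(s) => s.clone(),
            None => return HashSet::new(),
        };
        iter.fold(first, |acc, s| acc.intersection(s).copied().collect())
    }
}
```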
Query Format¶
Simple JSON-based queries:
// Text search
{
  "type": "text",
  "query": "rust programming",
  "limit": 10
}

// Tag filter
{
  "type": "tags",
  "tags": ["meeting", "cogzia"],
  "op": "and"
}

// Combined query
{
  "type": "and",
  "queries": [
    { "type": "text", "query": "consciousness" },
    { "type": "tags", "tags": ["philosophy"] }
  ]
}

// Relationship query
{
  "type": "related",
  "start": "/projects/kota-ai/README.md",
  "depth": 1
}
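On the Rust side, these four query shapes map naturally onto an enum, with the combined query evaluated by intersecting the results of its parts. The sketch below is one possible modeling, not the spec's API: the `Query` and `TagOp` types, the `Indices` trait, and the `execute` function are all hypothetical, JSON parsing is left out, and ranking and `limit` handling are omitted.

```rust
use std::collections::HashSet;

type DocId = [u8; 16];

pub enum TagOp {
    And,
    Or,
}

/// One variant per "type" value in the JSON format.
pub enum Query {
    Text { query: String, limit: Option<usize> },
    Tags { tags: Vec<String>, op: TagOp },
    And(Vec<Query>),
    Related { start: String, depth: u32 },
}

/// Hypothetical lookup interface that leaf queries delegate to.
pub trait Indices {
    fn text(&self, query: &str) -> HashSet<DocId>;
    fn tags(&self, tags: &[String], op: &TagOp) -> HashSet<DocId>;
    fn related(&self, start: &str, depth: u32) -> HashSet<DocId>;
}

pub fn execute(q: &Query, idx: &dyn Indices) -> HashSet<DocId> {
    match q {
        // Ranking and `limit` are not applied in this sketch.
        Query::Text { query, .. } => idx.text(query),
        Query::Tags { tags, op } => idx.tags(tags, op),
        Query::Related { start, depth } => idx.related(start, *depth),
        // A combined query keeps only documents matched by every sub-query.
        Query::And(subs) => {
            let mut results = subs.iter().map(|s| execute(s, idx));
            let first = results.next().unwrap_or_default();
            results.fold(first, |acc, s| acc.intersection(&s).copied().collect())
        }
    }
}
```

The combined example above omits `"op"` in its tag sub-query, so a parser would need a default; treating a missing `"op"` as `And` would be an assumption.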
Implementation Plan¶
Week 1: Core Storage and Indexing¶
Day 1-2: Storage Layer¶
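// Minimal implementation (API outline; bodies are stubbed with todo!())
use std::collections::{BTreeMap, HashMap};
use anyhow::Result; // assumed error type, matching the anyhow! call in the MCP server code

pub struct Storage {
    meta: HashMap<[u8; 16], DocumentMeta>,
    path_index: BTreeMap<String, [u8; 16]>,
}

impl Storage {
    pub fn insert(&mut self, path: &str, meta: DocumentMeta) { todo!() }
    pub fn get(&self, id: &[u8; 16]) -> Option<&DocumentMeta> { todo!() }
    pub fn persist(&self) -> Result<()> { todo!() }
    pub fn load() -> Result<Self> { todo!() }
}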
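The in-memory half of this API is small enough to sketch directly. The version below is an illustration under stated assumptions: `DocumentMeta` is trimmed to four fields, `get_by_path` and `is_stale` are hypothetical additions, persistence is left out, and the SHA-256 hash is passed in rather than computed (hashing would need an external crate).

```rust
use std::collections::{BTreeMap, HashMap};

type DocId = [u8; 16];

/// Trimmed copy of the metadata struct, enough for this example.
#[derive(Clone, Debug, PartialEq)]
pub struct DocumentMeta {
    pub id: DocId,
    pub path: String,
    pub hash: [u8; 32],
    pub size: u64,
}

#[derive(Default)]
pub struct Storage {
    meta: HashMap<DocId, DocumentMeta>,
    path_index: BTreeMap<String, DocId>,
}

impl Storage {
    pub fn insert(&mut self, path: &str, meta: DocumentMeta) {
        // If the path was already indexed under a different ID, drop the old record.
        if let Some(old) = self.path_index.insert(path.to_string(), meta.id) {
            if old != meta.id {
                self.meta.remove(&old);
            }
        }
        self.meta.insert(meta.id, meta);
    }

    pub fn get(&self, id: &DocId) -> Option<&DocumentMeta> {
        self.meta.get(id)
    }

    /// Path → ID → metadata, i.e. the primary index lookup.
    pub fn get_by_path(&self, path: &str) -> Option<&DocumentMeta> {
        self.path_index.get(path).and_then(|id| self.meta.get(id))
    }

    /// Change detection: a file needs reindexing if it is unknown
    /// or its current content hash differs from the stored one.
    pub fn is_stale(&self, path: &str, current_hash: &[u8; 32]) -> bool {
        self.get_by_path(path).map_or(true, |m| &m.hash != current_hash)
    }
}
```

A startup pass could call `is_stale` for each markdown file and reindex only the files that report `true`, which is what removes the full scan.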
Day 3-4: Trigram Index¶
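use std::collections::HashMap;
use roaring::RoaringBitmap;

// Compact posting lists. RoaringBitmap stores u32 values, so each 16-byte
// document ID must first be mapped to a u32 slot (mapping not shown here).
pub struct TrigramIndex {
    postings: HashMap<[u8; 3], RoaringBitmap>,
}

impl TrigramIndex {
    pub fn index_document(&mut self, id: [u8; 16], content: &str) { todo!() }
    pub fn search(&self, query: &str) -> Vec<[u8; 16]> { todo!() }
}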
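Because `RoaringBitmap` holds `u32` values while document IDs are 16-byte UUIDs, the bitmap-backed indices need some translation layer between the two. One simple option is an interning table that hands out dense slots; the `DocIdMap` type below is a hypothetical sketch of that idea and is not part of the spec.

```rust
use std::collections::HashMap;

type DocId = [u8; 16];

/// Assigns each document ID a dense u32 slot suitable for bitmap postings.
#[derive(Default)]
pub struct DocIdMap {
    to_slot: HashMap<DocId, u32>,
    from_slot: Vec<DocId>,
}

impl DocIdMap {
    /// Returns the existing slot for `id`, or allocates the next free one.
    pub fn intern(&mut self, id: DocId) -> u32 {
        if let Some(&slot) = self.to_slot.get(&id) {
            return slot;
        }
        let slot = self.from_slot.len() as u32;
        self.from_slot.push(id);
        self.to_slot.insert(id, slot);
        slot
    }

    /// Translates a slot found in a bitmap back to the document ID.
    pub fn resolve(&self, slot: u32) -> Option<DocId> {
        self.from_slot.get(slot as usize).copied()
    }
}
```

`index_document` would call `intern` before inserting into a bitmap, and `search` would call `resolve` on each hit. The table itself would have to be persisted with the indices so that slots stay stable across restarts.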
Day 5: Tag Index¶
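use std::collections::HashMap;
use roaring::RoaringBitmap;

// Same u32-slot requirement as the trigram index applies to these bitmaps.
pub struct TagIndex {
    postings: HashMap<String, RoaringBitmap>,
}

impl TagIndex {
    pub fn add_tags(&mut self, id: [u8; 16], tags: &[String]) { todo!() }
    pub fn search(&self, tags: &[String]) -> Vec<[u8; 16]> { todo!() }
}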
Week 2: Extended Features¶
Day 6-7: Relationship Graph¶
use std::collections::HashMap;

// Adjacency list: document ID → linked document IDs
pub struct GraphIndex {
    edges: HashMap<[u8; 16], Vec<[u8; 16]>>,
}

impl GraphIndex {
    pub fn add_edge(&mut self, from: [u8; 16], to: [u8; 16]) { todo!() }
    pub fn get_related(&self, id: [u8; 16], depth: u32) -> Vec<[u8; 16]> { todo!() }
}
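A depth-limited breadth-first search is enough for the 1-2 level traversal the MVP targets. The sketch below is one way to fill in these two methods, assuming that "bidirectional" means every link is stored in both directions; `GraphSketch` is a hypothetical stand-in for `GraphIndex`.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type DocId = [u8; 16];

#[derive(Default)]
pub struct GraphSketch {
    edges: HashMap<DocId, Vec<DocId>>,
}

impl GraphSketch {
    /// Stores the link in both directions, skipping duplicates.
    pub fn add_edge(&mut self, from: DocId, to: DocId) {
        for (a, b) in [(from, to), (to, from)] {
            let list = self.edges.entry(a).or_default();
            if !list.contains(&b) {
                list.push(b);
            }
        }
    }

    /// Documents reachable from `id` within `depth` hops, nearest first.
    /// The start document itself is not included.
    pub fn get_related(&self, id: DocId, depth: u32) -> Vec<DocId> {
        let mut seen = HashSet::from([id]);
        let mut queue = VecDeque::from([(id, 0u32)]);
        let mut out = Vec::new();
        while let Some((node, d)) = queue.pop_front() {
            if d == depth {
                continue; // do not expand beyond the requested depth
            }
            for &next in self.edges.get(&node).into_iter().flatten() {
                if seen.insert(next) {
                    out.push(next);
                    queue.push_back((next, d + 1));
                }
            }
        }
        out
    }
}
```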
Day 8-9: File Watcher¶
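use std::path::Path;
use std::sync::{Arc, Mutex};
use anyhow::Result; // assumed error type

pub struct FileWatcher {
    watcher: notify::RecommendedWatcher,
    db: Arc<Mutex<Database>>,
}

impl FileWatcher {
    pub fn watch(&mut self, path: &Path) -> Result<()> { todo!() }
    pub fn handle_event(&mut self, event: notify::Event) { todo!() }
}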
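The debouncing mentioned in scope can be kept separate from the watcher itself. The sketch below is a minimal, library-independent version: it remembers the last event time per path and releases a path only after a quiet period. `Debouncer`, `record`, and `drain_ready` are hypothetical names, and the caller passes in the current time so the logic stays easy to test.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

pub struct Debouncer {
    quiet: Duration,
    pending: HashMap<PathBuf, Instant>,
}

impl Debouncer {
    pub fn new(quiet: Duration) -> Self {
        Self { quiet, pending: HashMap::new() }
    }

    /// Note a change event; a newer event for the same path restarts its quiet period.
    pub fn record(&mut self, path: &Path, now: Instant) {
        self.pending.insert(path.to_path_buf(), now);
    }

    /// Remove and return (sorted) the paths that have been quiet for at least `quiet`.
    pub fn drain_ready(&mut self, now: Instant) -> Vec<PathBuf> {
        let quiet = self.quiet;
        let mut ready: Vec<PathBuf> = self
            .pending
            .iter()
            .filter(|(_, last)| now.saturating_duration_since(**last) >= quiet)
            .map(|(path, _)| path.clone())
            .collect();
        for path in &ready {
            self.pending.remove(path);
        }
        ready.sort();
        ready
    }
}
```

`handle_event` would call `record` for each changed path, and a periodic tick would call `drain_ready` and reindex whatever it returns, so a burst of saves to one file triggers a single index update.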
Day 10: Query Engine¶
use std::sync::Arc;
use anyhow::Result; // assumed error type

pub struct QueryEngine {
    storage: Arc<Storage>,
    indices: Indices,
}

impl QueryEngine {
    pub fn execute(&self, query: Query) -> Result<Vec<SearchResult>> { todo!() }
}
Week 3: Integration¶
Day 11-12: CLI Integration¶
# New commands
kota db index # Build/rebuild indices
kota db search "query" # Search interface
kota db stats # Show statistics
kota db verify # Check integrity
Day 13-14: MCP Server¶
use std::sync::Arc;
use anyhow::{anyhow, Result};
use serde_json::Value; // assumed JSON value type

pub struct DatabaseServer {
    db: Arc<Database>,
}

impl McpServer for DatabaseServer {
    async fn handle_tool_call(&self, tool: &str, args: Value) -> Result<Value> {
        match tool {
            "search" => self.search(args).await,
            "get_related" => self.get_related(args).await,
            _ => Err(anyhow!("Unknown tool: {tool}")),
        }
    }
}
Day 15: Testing and Polish¶
- Integration tests
- Performance benchmarks
- Documentation
- Bug fixes
Performance Targets¶
Storage¶
- Metadata size: <500 bytes per document
- Index size: <2KB per document total
- Memory usage: <100MB for 10k documents
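As a quick consistency check on these targets: if every one of 10,000 documents hit both per-document limits, metadata and indices together would take about 25 MB, which leaves ample room under the 100 MB memory target. The sketch below spells out that arithmetic, assuming 1 KB = 1024 bytes; the constant and function names are hypothetical, and in-memory container overhead is not modeled.

```rust
/// Per-document ceilings taken from the storage targets.
pub const META_BYTES_PER_DOC: u64 = 500;
pub const INDEX_BYTES_PER_DOC: u64 = 2 * 1024;

/// Upper bound in bytes if every document uses its full budget.
pub fn worst_case_bytes(docs: u64) -> u64 {
    docs * (META_BYTES_PER_DOC + INDEX_BYTES_PER_DOC)
}
```

For 10,000 documents this gives 10,000 × 2,548 = 25,480,000 bytes.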
Operations¶
- Indexing: >1000 documents/second
- Search latency: <10ms for simple queries
- Startup time: <100ms (with indices)
- Update latency: <1ms per document
Benchmarks¶
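#![feature(test)] // #[bench] is unstable and requires a nightly toolchain
extern crate test;
use test::{black_box, Bencher};

#[bench]
fn bench_index_document(b: &mut Bencher) {
    let mut idx = TrigramIndex::new();
    b.iter(|| {
        // Note: UUID generation is included in the measured time
        idx.index_document(uuid::Uuid::new_v4().into_bytes(), "sample content");
    });
}

#[bench]
fn bench_search(b: &mut Bencher) {
    let idx = create_test_index();
    b.iter(|| {
        // black_box keeps the optimizer from discarding the unused result
        black_box(idx.search("test query"));
    });
}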
Migration Path¶
From Current System¶
- Parallel Operation
  - Run alongside existing KnowledgeOrgServer
  - Compare results for validation
  - Gradual cutover
- Data Migration
- Verification
  - Count documents
  - Verify relationships
  - Test queries
  - Check performance
Success Criteria¶
Functional¶
- ✅ Indexes persist between restarts
- ✅ Search returns correct results
- ✅ File changes are detected
- ✅ Relationships are bidirectional
- ✅ No data corruption
Performance¶
- ✅ Startup time <1 second
- ✅ Search latency <10ms
- ✅ Memory usage <100MB
- ✅ CPU usage minimal when idle
Integration¶
- ✅ CLI commands work correctly
- ✅ MCP server responds properly
- ✅ No regression in functionality
- ✅ Easy to set up and use
Risk Mitigation¶
Technical Risks¶
- Corruption: Use checksums, atomic writes
- Performance: Profile early, optimize hotspots
- Compatibility: Keep markdown files unchanged
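The "atomic writes" mitigation is commonly done by writing to a temporary file next to the target and renaming it into place, so a crash leaves either the old file or the new one and never a half-written mix. The sketch below shows that pattern with the standard library only; `write_atomic` and the `.tmp` suffix are hypothetical, checksums are not covered, and the atomicity of the rename depends on the platform and on both paths being on the same filesystem.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `bytes` to `<target>.tmp`, flush to disk, then rename over `target`.
pub fn write_atomic(target: &Path, bytes: &[u8]) -> io::Result<()> {
    // Keep the temporary file in the same directory so the rename stays on one filesystem.
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    {
        let mut file = File::create(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?; // make sure the data is on disk before it becomes visible
    }

    fs::rename(&tmp, target)
}
```

`Storage::persist` could route each index file through a helper like this, which pairs well with the "just rebuild indices" recovery story if a file is ever found to be damaged.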
Schedule Risks¶
- Scope creep: Stick to MVP features
- Integration issues: Test continuously
- Unknown unknowns: Time buffer in week 3
Future Roadmap¶
After MVP success:
Phase 2 (Weeks 4-6)¶
- Query language parser
- Advanced text search (stemming, synonyms)
- Basic vector search
- Compression
Phase 3 (Weeks 7-9)¶
- ACID transactions
- Multi-version concurrency
- Advanced graph algorithms
- Backup/restore
Phase 4 (Weeks 10-12)¶
- Distributed queries
- Real-time subscriptions
- Machine learning integration
- Performance optimization
Conclusion¶
This MVP provides immediate value by solving KOTA's most pressing database needs while laying a foundation for future enhancements. The 3-week timeline is aggressive but achievable by focusing on pragmatic solutions and deferring complexity.
The key is to start simple, validate the approach, and iterate based on real usage. This MVP will prove the custom database concept and provide a platform for the more ambitious features described in the full implementation plan.