know provides several configuration options to optimize indexing for your content. Understanding these settings helps you balance search quality, performance, and storage.
Chunk Size and Overlap
Documents are split into chunks for indexing. The chunk size and overlap determine how content is divided.
Chunk Size
The --chunk-size parameter controls the maximum size of each chunk in tokens (roughly equivalent to words).
- Small (256)
- Medium (512)
- Large (1024)
Use small chunks (256) for:
- Short documents (notes, snippets)
- Precise location matching
- Code files with small functions
- FAQ-style content
Trade-offs:
- ✅ More precise results
- ✅ Better for short queries
- ❌ May split logical units
- ❌ More chunks = more storage
Chunk Overlap
The --overlap parameter controls how many tokens overlap between consecutive chunks.
Why Use Chunk Overlap?
Overlap ensures that content at chunk boundaries isn’t lost or split awkwardly.
Without overlap (--overlap 0): a phrase that straddles a chunk boundary is split in two, so searching for “return value processed” might miss it.
With overlap (--overlap 50): both chunks contain the complete context.
Recommended overlap:
- 10-20% of chunk size
- Default 50 tokens works well with 512 chunk size (~10%)
- Increase for narrative content
- Decrease for independent items (logs, code)
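The boundary effect described above can be reproduced with a small sliding-window sketch. This is an illustration only, not know's implementation: it splits on whitespace instead of real tokens, and `chunk_tokens` and `contains_phrase` are hypothetical helper names.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split tokens into windows of at most chunk_size items.

    Each window starts `chunk_size - overlap` items after the previous one,
    so it repeats the last `overlap` items of its predecessor.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def contains_phrase(chunks, phrase):
    """True if any single chunk contains the whole phrase."""
    return any(phrase in " ".join(chunk) for chunk in chunks)


# Whitespace-separated words stand in for tokens (an assumption).
words = "the function computes a return value processed by the caller later".split()

no_overlap = chunk_tokens(words, chunk_size=6, overlap=0)
with_overlap = chunk_tokens(words, chunk_size=6, overlap=2)

print(contains_phrase(no_overlap, "return value processed"))    # False
print(contains_phrase(with_overlap, "return value processed"))  # True
```

With no overlap, the phrase is cut between the first and second chunk. With an overlap of two words, the second chunk starts early enough to contain it whole.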
How Chunking Works
know uses LlamaIndex’s SentenceSplitter for intelligent chunking:
- Respects sentence boundaries when possible
- Avoids breaking words or sentences mid-way
- Maintains metadata (file path, chunk index)
- Creates overlapping regions for context
src/db.py:79-313
Caching and Performance
File Cache
know maintains a cache in ~/.cache/know/ to track indexed files.
The cache is invalidated when:
- File modification time changes
- File size changes
- Chunk size changes
- Chunk overlap changes
src/db.py:142-173
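One way to picture these four invalidation conditions is a per-file fingerprint that is compared on every run. The sketch below is an assumption about the general approach, not the code in src/db.py; `fingerprint` and `needs_reindex` are hypothetical names, and the real cache format is not shown in this page.

```python
import os
import tempfile


def fingerprint(path, chunk_size, overlap):
    """Collect the four values whose change should trigger re-indexing."""
    st = os.stat(path)
    return {
        "mtime": st.st_mtime,
        "size": st.st_size,
        "chunk_size": chunk_size,
        "overlap": overlap,
    }


def needs_reindex(cached, path, chunk_size, overlap):
    """True when there is no cache entry or any fingerprint field differs."""
    return cached is None or cached != fingerprint(path, chunk_size, overlap)


# Demo on a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as f:
    f.write("hello")
    path = f.name

entry = fingerprint(path, 512, 50)

unchanged = needs_reindex(entry, path, 512, 50)   # same file, same settings
resized = needs_reindex(entry, path, 256, 50)     # chunk size changed

with open(path, "a") as f:
    f.write(" world")                             # file size changes
edited = needs_reindex(entry, path, 512, 50)

os.remove(path)
print(unchanged, resized, edited)
```

Storing the chunk settings next to the file metadata is what makes a change to --chunk-size or --overlap invalidate entries even when the files themselves are untouched.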
Deduplication
know automatically deduplicates chunks using MD5 hashing. This prevents:
- Duplicate content from being indexed multiple times
- Wasted storage and compute
- Redundant search results
src/db.py:198-255
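As a rough sketch of hash-based deduplication, the snippet below hashes each chunk's text with MD5 and keeps only the first occurrence. Whether know hashes the raw text, normalized text, or text plus metadata is not stated here, so hashing the raw UTF-8 text is an assumption, and `dedupe_chunks` is a hypothetical name.

```python
import hashlib


def dedupe_chunks(chunks):
    """Return (unique, skipped), keeping the first chunk for each MD5 digest."""
    seen = set()
    unique = []
    skipped = []
    for text in chunks:
        # Assumption: the digest is computed over the raw UTF-8 chunk text.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            skipped.append(text)
            continue
        seen.add(digest)
        unique.append(text)
    return unique, skipped


unique, skipped = dedupe_chunks(["license header", "intro", "license header"])
print(unique)   # ['license header', 'intro']
print(skipped)  # ['license header']
```

MD5 is used here as a fast content fingerprint, not for security.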
Batch Processing
Indexing uses batched operations for performance. Batching reduces:
- API call overhead
- Memory usage
- Indexing time
src/db.py:190-301
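The effect of batching on call count can be sketched with a small generator. The batch size of 100 is a placeholder, since this page does not state the value know uses, and `batched` is a hypothetical helper rather than a function from src/db.py.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


chunks = [f"chunk-{i}" for i in range(250)]

# One write per batch instead of one write per chunk.
# batch_size=100 is an illustrative value, not know's actual setting.
calls = sum(1 for _ in batched(chunks, batch_size=100))
print(calls)  # 3
```

Because the generator holds only one batch at a time, peak memory depends on the batch size and not on the total number of chunks.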
Directory Management
know tracks watched directories in ~/.know_dirs.
When you run know index, all watched directories are indexed together.
Recursive Scanning
By default, indexing is recursive. You can disable this.
Index Storage
know stores indexes in ./know_index/.
Storage Requirements
Approximate storage per 1000 documents (512-token chunks):
- Dense vectors: ~5-10 MB (depends on embedding model)
- BM25 index: ~2-5 MB (depends on vocabulary size)
- Metadata: ~1 MB
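Adding the three figures gives roughly 8-16 MB per 1000 documents. The helper below extends that to other collection sizes under the assumption that storage grows linearly with document count, which the figures above do not guarantee; `estimate_storage_mb` is a hypothetical name.

```python
def estimate_storage_mb(num_documents):
    """Return a (low, high) storage estimate in MB for 512-token chunks.

    Assumes storage scales linearly with document count.
    """
    # (low, high) MB per 1000 documents, taken from the list above.
    per_1000 = {
        "dense_vectors": (5, 10),
        "bm25_index": (2, 5),
        "metadata": (1, 1),
    }
    scale = num_documents / 1000
    low = sum(lo for lo, _ in per_1000.values()) * scale
    high = sum(hi for _, hi in per_1000.values()) * scale
    return low, high


print(estimate_storage_mb(1000))   # (8.0, 16.0)
print(estimate_storage_mb(25000))  # (200.0, 400.0)
```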
Maintenance Operations
Pruning Orphaned Chunks
Remove chunks that belong to deleted files.
src/db.py:528-582
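Conceptually, pruning means finding every chunk whose source file no longer exists on disk. The sketch below models the index as a plain dictionary from chunk ID to source path, which is an assumption made for illustration; `find_orphans` is a hypothetical name and know's real storage layout is not described here.

```python
import os
import tempfile


def find_orphans(chunk_sources):
    """Return IDs of chunks whose source file no longer exists.

    chunk_sources maps chunk ID -> source file path (illustrative structure).
    """
    return [cid for cid, path in chunk_sources.items() if not os.path.exists(path)]


with tempfile.TemporaryDirectory() as tmp:
    kept = os.path.join(tmp, "kept.md")
    with open(kept, "w") as f:
        f.write("still here")
    gone = os.path.join(tmp, "deleted.md")  # never created, so it is missing

    index = {"chunk-1": kept, "chunk-2": gone, "chunk-3": gone}
    orphans = find_orphans(index)
    pruned = {cid: p for cid, p in index.items() if cid not in orphans}

print(orphans)          # ['chunk-2', 'chunk-3']
print(sorted(pruned))   # ['chunk-1']
```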
Resetting the Index
Clear everything and start fresh.
src/db.py:519-525
Advanced Indexing Options
Dry Run
Preview what would be indexed without making changes.
Detailed Logging
Get verbose output during indexing.
Force Reindex
Bypass cache and reindex everything.
Skip Reports
Generate detailed reports of skipped chunks:
- Files skipped (unchanged)
- Chunks skipped (already indexed)
- Chunks skipped (duplicate content)
- Collision details
src/db.py:57-73, src/db.py:146-267
Extension Filtering
Control which file types to index.
Default supported extensions in src/db.py:29-54:
Documents: .md, .txt, .pdf, .docx, .pptx, .html
Code: .py, .js, .ts, .jsx, .tsx, .go, .rs, .java, .c, .cpp, .h, .hpp, .rb, .sh, .lua, .swift
Optimization Guidelines
Small Documents
Long Documents
Code Files
Documentation
Configuration Recommendations
General Purpose (Default)
Technical Documentation
Code Search
Notes and Snippets
Academic Papers
Next Steps
Search Modes
Learn about dense, BM25, and hybrid search
Output Formats
Explore different output formats