Chunking Strategy

Optimize how documents are split into memories for better retrieval and context preservation.
Chunking Methods
Fixed-Size Chunking
Split text into fixed-length chunks.
Best for: Simple documents, quick setup
```python
om.ingest_file(
    'document.pdf',
    chunk_strategy='fixed',
    chunk_size=512,
    chunk_overlap=50
)
```
Pros:
- Fast and simple
- Predictable chunk sizes
- Low memory usage
Cons:
- May split sentences/paragraphs
- Loses semantic boundaries
Semantic Chunking
Split based on semantic coherence.
Best for: Articles, documentation, books
```python
om.ingest_file(
    'article.md',
    chunk_strategy='semantic',
    similarity_threshold=0.75
)
```
Pros:
- Maintains topic coherence
- Better context preservation
- Improves retrieval quality
Cons:
- Slower processing
- Variable chunk sizes
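A common way to implement semantic chunking is to embed consecutive sentences and start a new chunk whenever similarity to the previous sentence drops below the threshold. The sketch below shows that approach; the `embed` callable is a placeholder for whatever embedding model you use and is not part of this library's API.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences; start a new chunk when adjacent embeddings diverge."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev, vec) < threshold:
            chunks.append(' '.join(current))
            current = []
        current.append(sent)
        prev = vec
    chunks.append(' '.join(current))
    return chunks
```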
Sentence-Based Chunking
Split at sentence boundaries.
Best for: Chat logs, Q&A, structured text
```python
om.ingest_file(
    'conversation.txt',
    chunk_strategy='sentence',
    sentences_per_chunk=3
)
```
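In essence, this strategy splits on sentence boundaries and then groups a fixed number of sentences per chunk. A minimal sketch using a naive regex splitter (the library's own sentence detection may be more sophisticated):

```python
import re

def sentence_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Split on sentence-ending punctuation, then group N sentences per chunk."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    return [' '.join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]
```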
Code-Aware Chunking
Split code by functions/classes.
Best for: Source code repositories
```python
om.ingest_file(
    'module.py',
    chunk_strategy='code',
    split_by='function'  # or 'class', 'method'
)
```
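For Python sources, function- and class-level splitting can be approximated with the standard library's `ast` module, as in the sketch below. This only illustrates the idea for Python; it says nothing about how other languages are parsed.

```python
import ast

def code_chunks(source: str) -> list[str]:
    """Emit one chunk per top-level function or class in a Python module."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks
```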
Configuration
Chunk Size Guidelines
| Content Type | Recommended Size | Strategy |
|---|---|---|
| Technical docs | 300-500 chars | Semantic |
| Books/Articles | 500-800 chars | Semantic |
| Code | By function | Code-aware |
| Chat/Logs | 100-200 chars | Sentence |
| API responses | 200-400 chars | Fixed |
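When ingesting mixed content, it can be convenient to keep these guidelines as named presets and select one per file. A small sketch (the preset names and file name are hypothetical; each parameter combination is one already shown on this page):

```python
CHUNK_PRESETS = {
    'docs_or_books': {'chunk_strategy': 'semantic', 'similarity_threshold': 0.75},
    'code':          {'chunk_strategy': 'code', 'split_by': 'function'},
    'chat_or_logs':  {'chunk_strategy': 'sentence', 'sentences_per_chunk': 3},
    'api_responses': {'chunk_strategy': 'fixed', 'chunk_size': 300, 'chunk_overlap': 30},
}

om.ingest_file('guide.md', **CHUNK_PRESETS['docs_or_books'])
```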
Overlap Strategy
```python
# High overlap - better context but more storage
om.ingest_file(
    'document.pdf',
    chunk_size=500,
    chunk_overlap=100  # 20% overlap
)

# Low overlap - less storage but may miss context
om.ingest_file(
    'document.pdf',
    chunk_size=500,
    chunk_overlap=25  # 5% overlap
)
```
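To make the trade-off concrete, the sliding-window arithmetic below estimates how many chunks each setting produces for a 10,000-character document (this assumes the simple fixed-size window model; exact counts depend on the chunker):

```python
import math

def chunk_count(doc_len: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of chunks a sliding window of chunk_size with chunk_overlap produces."""
    step = chunk_size - chunk_overlap
    return max(1, math.ceil((doc_len - chunk_size) / step) + 1)

chunk_count(10_000, 500, 100)  # 25 chunks -> ~12,500 chars stored (~25% extra)
chunk_count(10_000, 500, 25)   # 21 chunks -> ~10,500 chars stored (~5% extra)
```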
Advanced Techniques
Hierarchical Chunking
```python
# Create parent-child chunk relationships
om.ingest_file(
    'book.pdf',
    chunk_strategy='hierarchical',
    levels=[
        {'size': 2000, 'name': 'chapter'},
        {'size': 500, 'name': 'section'},
        {'size': 100, 'name': 'paragraph'}
    ]
)
```
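Internally, hierarchical chunking amounts to splitting at each level and recording which larger chunk a smaller one belongs to. The standalone sketch below shows that parent-child structure; the `Chunk` class and function are illustrative, not the library's data model.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Chunk:
    id: int
    level: str
    text: str
    parent_id: int | None

def hierarchical_chunks(text: str, levels: list[dict]) -> list[Chunk]:
    """Split text level by level, linking each chunk to its parent chunk."""
    ids = count()
    chunks: list[Chunk] = []

    def split(segment: str, depth: int, parent_id: int | None) -> None:
        size = levels[depth]['size']
        for i in range(0, len(segment), size):
            chunk = Chunk(next(ids), levels[depth]['name'], segment[i:i + size], parent_id)
            chunks.append(chunk)
            if depth + 1 < len(levels):
                split(chunk.text, depth + 1, chunk.id)

    split(text, 0, None)
    return chunks
```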
Metadata-Enhanced Chunking
```python
# Extract and add metadata to chunks
om.ingest_file(
    'document.pdf',
    extract_metadata=True,       # Headers, page numbers, etc.
    metadata_strategy='inherit'  # Inherit from document
)
```
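With the 'inherit' strategy, every chunk carries a copy of its parent document's metadata alongside anything extracted per chunk. A rough sketch of what that looks like on the chunk side (the field names are illustrative, not the library's schema):

```python
def attach_metadata(chunks: list[str], doc_metadata: dict) -> list[dict]:
    """Copy document-level metadata onto every chunk (the 'inherit' strategy)."""
    return [
        {'text': chunk, 'metadata': {**doc_metadata, 'chunk_index': i}}
        for i, chunk in enumerate(chunks)
    ]

attach_metadata(['first chunk...', 'second chunk...'],
                {'source': 'document.pdf', 'page': 3})
```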
Custom Chunking Function
```python
def custom_chunker(text: str) -> list[str]:
    """Custom chunking logic"""
    chunks = []
    # Your logic here
    return chunks

om.ingest_file(
    'document.pdf',
    chunk_function=custom_chunker
)
```
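As a concrete example of this pattern, the hypothetical chunker below splits a markdown file into one chunk per heading section:

```python
import re

def markdown_section_chunker(text: str) -> list[str]:
    """Example custom chunker: one chunk per markdown heading section."""
    sections = re.split(r'\n(?=#{1,6} )', text)
    return [s.strip() for s in sections if s.strip()]

om.ingest_file('notes.md', chunk_function=markdown_section_chunker)
```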
Best Practices
- Match chunk size to query length - Similar sizes work better
- Use semantic chunking for quality - Worth the extra processing
- Add overlap for context - 10-20% overlap recommended
- Preserve structure - Keep paragraphs/sections together
- Test and iterate - Evaluate retrieval quality on a fixed query set (see the sketch after this list)
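For the last point, a simple way to compare chunking settings is to hold a small set of query/expected-passage pairs fixed, re-ingest with each configuration, and measure top-k hit rate. A minimal sketch (the `retrieve` callable stands in for whatever search call your setup exposes; it is not an API shown on this page):

```python
def retrieval_hit_rate(retrieve, eval_set: list[tuple[str, str]], k: int = 5) -> float:
    """Fraction of queries whose expected snippet appears in the top-k retrieved chunks."""
    hits = sum(
        any(expected in chunk for chunk in retrieve(query, k))
        for query, expected in eval_set
    )
    return hits / len(eval_set)
```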
Performance Impact
| Strategy | Speed | Storage | Quality |
|---|---|---|---|
| Fixed | ⚡⚡⚡ | ✅ | ⭐⭐ |
| Sentence | ⚡⚡ | ✅✅ | ⭐⭐⭐ |
| Semantic | ⚡ | ✅✅✅ | ⭐⭐⭐⭐ |
| Code-aware | ⚡⚡ | ✅✅ | ⭐⭐⭐⭐ |
See Multimodal Ingestion for file ingestion and Custom Providers for custom chunkers.