# Postprocessors

Components for processing and refining retrieved results, including similarity filtering, reranking, metadata replacement, and recency scoring. Postprocessors enhance retrieval quality by applying various refinement strategies to improve relevance and remove irrelevant content.

## Capabilities

### Base Postprocessor Interface

Foundation interface for all postprocessing operations with standardized node processing methods.

```python { .api }
class BaseNodePostprocessor:
    """
    Base interface for node postprocessing operations.

    Parameters:
    - callback_manager: Optional[CallbackManager], callback management system
    """
    def __init__(self, callback_manager: Optional[CallbackManager] = None): ...

    def postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None
    ) -> List[NodeWithScore]:
        """
        Process and refine retrieved nodes.

        Parameters:
        - nodes: List[NodeWithScore], nodes to postprocess
        - query_bundle: Optional[QueryBundle], original query for context

        Returns:
        - List[NodeWithScore], processed and refined nodes
        """

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None
    ) -> List[NodeWithScore]:
        """Internal postprocessing method to be implemented by subclasses."""
```
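The split between the public `postprocess_nodes` and the subclass hook `_postprocess_nodes` can be illustrated without installing the library. The sketch below is dependency-free and uses stand-in classes: `ScoredNode`, `SketchPostprocessor`, and `TopKPostprocessor` are hypothetical names for illustration only and are not part of llama_index.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredNode:
    """Hypothetical stand-in for NodeWithScore: text plus a retrieval score."""
    text: str
    score: Optional[float] = None


class SketchPostprocessor:
    """Hypothetical stand-in for BaseNodePostprocessor."""

    def postprocess_nodes(
        self, nodes: List[ScoredNode], query_str: Optional[str] = None
    ) -> List[ScoredNode]:
        # Public entry point: copy the input list, then delegate to the hook.
        return self._postprocess_nodes(list(nodes), query_str)

    def _postprocess_nodes(
        self, nodes: List[ScoredNode], query_str: Optional[str] = None
    ) -> List[ScoredNode]:
        raise NotImplementedError("subclasses implement the actual logic")


class TopKPostprocessor(SketchPostprocessor):
    """Keeps the k highest-scoring nodes (missing scores count as 0.0)."""

    def __init__(self, k: int = 2) -> None:
        self.k = k

    def _postprocess_nodes(
        self, nodes: List[ScoredNode], query_str: Optional[str] = None
    ) -> List[ScoredNode]:
        ranked = sorted(nodes, key=lambda n: n.score or 0.0, reverse=True)
        return ranked[: self.k]


nodes = [ScoredNode("a", 0.2), ScoredNode("b", 0.9), ScoredNode("c", 0.5)]
top_two = TopKPostprocessor(k=2).postprocess_nodes(nodes)
print([n.text for n in top_two])
```

Callers only ever use `postprocess_nodes`, so a subclass needs to override just the one hook.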

### Similarity Filtering

Filters nodes based on relevance scores and similarity thresholds to remove low-quality results.

```python { .api }
class SimilarityPostprocessor(BaseNodePostprocessor):
    """
    Postprocessor that filters nodes based on similarity score thresholds.

    Parameters:
    - similarity_cutoff: Optional[float], minimum similarity score to retain nodes
    """
    def __init__(self, similarity_cutoff: Optional[float] = None): ...
```
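The cutoff logic itself is small. Here is a minimal, library-free sketch of threshold filtering over `(text, score)` pairs. The helper name `filter_by_cutoff` is hypothetical, and dropping items whose score is missing is an assumption of this sketch rather than a documented guarantee.

```python
from typing import List, Optional, Tuple

ScoredText = Tuple[str, Optional[float]]


def filter_by_cutoff(
    items: List[ScoredText], cutoff: Optional[float]
) -> List[ScoredText]:
    """Keep items whose score is at least `cutoff`; no cutoff keeps everything."""
    if cutoff is None:
        return list(items)
    # Assumption: an item without a score cannot pass a threshold, so drop it.
    return [(text, s) for text, s in items if s is not None and s >= cutoff]


items = [
    ("Machine learning algorithms", 0.85),
    ("Deep learning techniques", 0.72),
    ("Unrelated content here", 0.45),
    ("No score attached", None),
]
kept = filter_by_cutoff(items, cutoff=0.7)
print([text for text, _ in kept])
```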

### Keyword Filtering

Filters nodes based on keyword inclusion or exclusion criteria for content-based filtering.

```python { .api }
class KeywordNodePostprocessor(BaseNodePostprocessor):
    """
    Postprocessor for keyword-based node filtering.

    Parameters:
    - required_keywords: Optional[List[str]], keywords that must be present
    - exclude_keywords: Optional[List[str]], keywords that must not be present
    - lang: str, language for keyword matching
    """
    def __init__(
        self,
        required_keywords: Optional[List[str]] = None,
        exclude_keywords: Optional[List[str]] = None,
        lang: str = "en"
    ): ...
```
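To make the inclusion and exclusion rules concrete, here is a simplified sketch that uses plain case-insensitive word matching instead of a language-aware matcher. The function `keyword_filter` is hypothetical. It assumes that a text passes when it contains at least one required keyword and none of the excluded ones; check the library's behaviour if you need every required keyword to be present.

```python
import re
from typing import List, Optional


def keyword_filter(
    texts: List[str],
    required: Optional[List[str]] = None,
    excluded: Optional[List[str]] = None,
) -> List[str]:
    """Keep texts with any required keyword and no excluded keyword."""
    kept = []
    for text in texts:
        # Tokenize into lowercase words so matching is case-insensitive.
        words = set(re.findall(r"\w+", text.lower()))
        if required and not any(k.lower() in words for k in required):
            continue
        if excluded and any(k.lower() in words for k in excluded):
            continue
        kept.append(text)
    return kept


texts = [
    "Machine learning algorithms",
    "Deep learning techniques",
    "Unrelated content here",
    "Neural network architectures",
]
kept_texts = keyword_filter(
    texts, required=["machine", "learning"], excluded=["unrelated", "spam"]
)
print(kept_texts)
```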

### Context Enhancement

Enhances nodes by adding adjacent or related content for better context understanding.

```python { .api }
class PrevNextNodePostprocessor(BaseNodePostprocessor):
    """
    Postprocessor that adds previous and next nodes for enhanced context.

    Parameters:
    - docstore: BaseDocumentStore, document store for node relationships
    - num_nodes: int, number of previous/next nodes to include
    - mode: str, inclusion mode (previous, next, or both)
    """
    def __init__(
        self,
        docstore: BaseDocumentStore,
        num_nodes: int = 1,
        mode: str = "both"
    ): ...

class AutoPrevNextNodePostprocessor(BaseNodePostprocessor):
    """
    Automatic previous/next node inclusion with intelligent boundary detection.

    Parameters:
    - docstore: BaseDocumentStore, document store for node relationships
    - num_nodes: int, number of nodes to include in each direction
    """
    def __init__(
        self,
        docstore: BaseDocumentStore,
        num_nodes: int = 1
    ): ...
```

### Long Context Optimization

Reorders and optimizes nodes for long context scenarios to improve model performance.

```python { .api }
class LongContextReorder(BaseNodePostprocessor):
    """
    Reorders nodes to optimize performance in long context scenarios.

    Long context reordering places the most relevant information at the beginning
    and end of the context window where language models pay more attention.
    """
    def __init__(self): ...
```
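One way to place the strongest results at both ends is to walk the items from lowest to highest score and alternately push them to the front and the back of the output. The sketch below does this over `(text, score)` pairs. The helper `reorder_for_long_context` is a hypothetical illustration of the idea, not necessarily the exact ordering the class produces.

```python
from typing import List, Tuple


def reorder_for_long_context(
    items: List[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """Put high-scoring items at the edges and low-scoring ones in the middle."""
    ascending = sorted(items, key=lambda item: item[1])
    reordered: List[Tuple[str, float]] = []
    for i, item in enumerate(ascending):
        if i % 2 == 0:
            # Even positions go to the front, pushing earlier items inward.
            reordered.insert(0, item)
        else:
            # Odd positions go to the back.
            reordered.append(item)
    return reordered


items = [("a", 0.1), ("b", 0.2), ("c", 0.3), ("d", 0.4), ("e", 0.5)]
reordered = reorder_for_long_context(items)
print([score for _, score in reordered])
```

With five items the lowest score lands exactly in the middle, and the two highest scores occupy the first and last positions.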

### Recency Processing

Applies recency-based scoring and filtering to prioritize recent or time-relevant content.

```python { .api }
class FixedRecencyPostprocessor(BaseNodePostprocessor):
    """
    Postprocessor that applies fixed recency scoring based on date metadata.

    Parameters:
    - top_k: int, number of top recent nodes to return
    - date_key: str, metadata key containing date information
    - service_context: Optional[ServiceContext], service context for processing
    """
    def __init__(
        self,
        top_k: int = 1,
        date_key: str = "date",
        service_context: Optional[ServiceContext] = None
    ): ...

class EmbeddingRecencyPostprocessor(BaseNodePostprocessor):
    """
    Recency postprocessor using embedding-based similarity for temporal relevance.

    Parameters:
    - embed_model: Optional[BaseEmbedding], embedding model for similarity computation
    - similarity_cutoff: float, minimum similarity threshold
    - date_key: str, metadata key containing date information
    - service_context: Optional[ServiceContext], service context for processing
    """
    def __init__(
        self,
        embed_model: Optional[BaseEmbedding] = None,
        similarity_cutoff: float = 0.7,
        date_key: str = "date",
        service_context: Optional[ServiceContext] = None
    ): ...

class TimeWeightedPostprocessor(BaseNodePostprocessor):
    """
    Time-weighted relevance scoring that balances content relevance with recency.

    Parameters:
    - time_decay: float, decay factor for time-based scoring
    - time_access_refresh: bool, whether to refresh access times
    - top_k: int, number of top nodes to return
    """
    def __init__(
        self,
        time_decay: float = 0.99,
        time_access_refresh: bool = True,
        top_k: int = 1
    ): ...
```
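To show how a decay factor can trade relevance against recency, the sketch below adds a recency bonus of `(1 - time_decay) ** hours_passed` to the similarity score. This formula is an assumption chosen for illustration, and `time_weighted_score` is a hypothetical helper; consult the library source for the exact scoring it applies.

```python
from datetime import datetime, timedelta


def time_weighted_score(
    similarity: float,
    last_accessed: datetime,
    now: datetime,
    time_decay: float = 0.99,
) -> float:
    """Similarity plus a recency bonus that shrinks as hours pass."""
    hours_passed = (now - last_accessed).total_seconds() / 3600.0
    # Assumed formula: the bonus is 1.0 for a just-accessed item and
    # decays geometrically with every elapsed hour.
    return similarity + (1.0 - time_decay) ** hours_passed


now = datetime(2024, 1, 2, 12, 0, 0)
fresh = time_weighted_score(0.6, last_accessed=now, now=now)
stale = time_weighted_score(0.8, last_accessed=now - timedelta(hours=24), now=now)
print(f"fresh={fresh:.3f} stale={stale:.3f}")
```

Under this assumed formula a just-accessed item with lower similarity can outrank a more similar item that has not been touched for a day.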

### Privacy & Security Processing

Removes or masks personally identifiable information (PII) and sensitive data from retrieved content.

```python { .api }
class PIINodePostprocessor(BaseNodePostprocessor):
    """
    Postprocessor for detecting and removing personally identifiable information.

    Parameters:
    - pii_node_info_key: str, metadata key for storing PII information
    - pii_str_tmpl: str, template for PII replacement strings
    - service_context: Optional[ServiceContext], service context for processing
    """
    def __init__(
        self,
        pii_node_info_key: str = "__pii_node_info__",
        pii_str_tmpl: str = "[PII_REMOVED]",
        service_context: Optional[ServiceContext] = None
    ): ...

class NERPIINodePostprocessor(BaseNodePostprocessor):
    """
    Named Entity Recognition-based PII detection and removal postprocessor.

    Parameters:
    - pii_node_info_key: str, metadata key for PII information
    - pii_str_tmpl: str, template for PII replacement
    - ner_model_name: str, name of NER model to use
    - service_context: Optional[ServiceContext], service context for processing
    """
    def __init__(
        self,
        pii_node_info_key: str = "__pii_node_info__",
        pii_str_tmpl: str = "[PII_REMOVED]",
        ner_model_name: str = "StanfordAIMI/stanford-deidentifier-base",
        service_context: Optional[ServiceContext] = None
    ): ...
```
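As a rough picture of what masking looks like, the sketch below redacts email addresses and US-style phone numbers with regular expressions. This is a deliberately simple substitute technique, not the model-based or NER-based detection the classes above describe; the patterns and the `redact` helper are hypothetical and will miss many kinds of PII.

```python
import re

# Illustrative patterns only: real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text: str, placeholder: str = "[PII_REMOVED]") -> str:
    """Replace every email address and phone number with the placeholder."""
    text = EMAIL_RE.sub(placeholder, text)
    return PHONE_RE.sub(placeholder, text)


redacted_email = redact("Contact John Doe at john.doe@email.com for more info")
redacted_phone = redact("The phone number is 555-123-4567")
print(redacted_email)
print(redacted_phone)
```

Note that the personal name in the first sentence survives; catching names is where an NER model or an LLM becomes necessary.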

### Reranking Systems

Advanced reranking using language models and specialized algorithms to improve result ordering.

```python { .api }
class LLMRerank(BaseNodePostprocessor):
    """
    LLM-based reranking postprocessor for improved result ordering.

    Parameters:
    - choice_batch_size: int, batch size for LLM processing
    - top_n: int, number of top nodes to return after reranking
    - service_context: Optional[ServiceContext], service context for LLM operations
    - choice_select_prompt: Optional[PromptTemplate], prompt for node selection
    - choice_batch_select_prompt: Optional[PromptTemplate], prompt for batch selection
    - llm: Optional[LLM], language model for reranking
    """
    def __init__(
        self,
        choice_batch_size: int = 10,
        top_n: int = 10,
        service_context: Optional[ServiceContext] = None,
        choice_select_prompt: Optional[PromptTemplate] = None,
        choice_batch_select_prompt: Optional[PromptTemplate] = None,
        llm: Optional[LLM] = None
    ): ...

class StructuredLLMRerank(BaseNodePostprocessor):
    """
    Structured LLM reranking with explicit scoring criteria and rationale.

    Parameters:
    - llm: Optional[LLM], language model for structured reranking
    - top_n: int, number of top nodes to return
    - choice_batch_size: int, batch size for processing
    """
    def __init__(
        self,
        llm: Optional[LLM] = None,
        top_n: int = 10,
        choice_batch_size: int = 10
    ): ...

class SentenceTransformerRerank(BaseNodePostprocessor):
    """
    Sentence transformer-based reranking for semantic similarity.

    Parameters:
    - model: str, sentence transformer model name
    - top_n: int, number of top nodes to return
    - device: Optional[str], device for model computation (cpu, cuda)
    - keep_retrieval_score: bool, whether to preserve original retrieval scores
    """
    def __init__(
        self,
        model: str = "cross-encoder/ms-marco-MiniLM-L-2-v2",
        top_n: int = 10,
        device: Optional[str] = None,
        keep_retrieval_score: bool = False
    ): ...
```
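All three rerankers share the same shape: score every candidate against the query, sort, and keep the best `top_n`. The sketch below shows that shape with a trivial word-overlap scorer standing in for an LLM or cross-encoder. Both `overlap_score` and `rerank` are hypothetical helpers, and the scorer is only a placeholder for a real relevance model.

```python
import re
from typing import Callable, List, Tuple


def overlap_score(query: str, text: str) -> float:
    """Placeholder relevance: fraction of query words that appear in the text."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(
    query: str,
    texts: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Score each text against the query and return the top_n, best first."""
    scored = [(text, score_fn(query, text)) for text in texts]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


texts = [
    "Neural network architectures",
    "Machine learning algorithms",
    "Deep learning techniques",
]
top = rerank("machine learning algorithms", texts, top_n=2)
for text, score in top:
    print(f"{score:.2f} {text}")
```

Swapping `score_fn` for a call into a real model is the only change needed to turn the skeleton into a working reranker.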

### Embedding Optimization

Optimizes embedding-based operations and enhances semantic understanding of retrieved content.

```python { .api }
class SentenceEmbeddingOptimizer(BaseNodePostprocessor):
    """
    Optimizer for sentence embeddings to improve semantic retrieval quality.

    Parameters:
    - embed_model: Optional[BaseEmbedding], embedding model for optimization
    - percentile_cutoff: Optional[float], percentile cutoff for optimization
    - threshold_cutoff: Optional[float], absolute threshold for optimization
    - mode: str, optimization mode (percentile, threshold, or auto)
    """
    def __init__(
        self,
        embed_model: Optional[BaseEmbedding] = None,
        percentile_cutoff: Optional[float] = None,
        threshold_cutoff: Optional[float] = None,
        mode: str = "percentile"
    ): ...
```
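A percentile cutoff can be read as "keep the best fraction of sentences". The sketch below applies that reading to sentences with precomputed similarity scores. The helper `keep_top_percentile`, the rounding rule, and the choice to preserve the original sentence order are all assumptions of this sketch, and the scores would normally come from an embedding model rather than being supplied by hand.

```python
import math
from typing import List


def keep_top_percentile(
    sentences: List[str], scores: List[float], percentile_cutoff: float = 0.5
) -> List[str]:
    """Keep the highest-scoring fraction of sentences, in their original order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    if not sentences:
        return []
    # Assumption: round up and always keep at least one sentence.
    keep = max(1, math.ceil(len(sentences) * percentile_cutoff))
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(best[:keep])
    return [sentences[i] for i in chosen]


sentences = ["ML basics.", "Lunch menu.", "Training tips.", "Parking info."]
scores = [0.9, 0.1, 0.6, 0.3]
shortened = keep_top_percentile(sentences, scores, percentile_cutoff=0.5)
print(shortened)
```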

### Metadata Processing

Processes and transforms node metadata to enhance content understanding and presentation.

```python { .api }
class MetadataReplacementPostProcessor(BaseNodePostprocessor):
    """
    Postprocessor for replacing and transforming node metadata.

    Parameters:
    - target_metadata_key: str, metadata key to replace or transform
    - new_metadata_key: str, new key name for transformed metadata
    - replacement_function: Optional[Callable], function for metadata transformation
    """
    def __init__(
        self,
        target_metadata_key: str,
        new_metadata_key: str = "new_metadata",
        replacement_function: Optional[Callable] = None
    ): ...
```

### Document Relevance Processing

Advanced relevance scoring and document-level processing for improved result quality.

```python { .api }
class DocumentWithRelevance:
    """
    Document wrapper with relevance scoring for postprocessing operations.

    Parameters:
    - document: Document, the original document
    - relevance_score: float, computed relevance score
    - metadata: Optional[dict], additional relevance metadata
    """
    def __init__(
        self,
        document: Document,
        relevance_score: float,
        metadata: Optional[dict] = None
    ): ...

    @property
    def text(self) -> str:
        """Get document text content."""

    @property
    def doc_id(self) -> str:
        """Get document identifier."""
```

## Usage Examples

### Basic Similarity Filtering

```python
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.schema import NodeWithScore, TextNode

# Create test nodes with scores
nodes = [
    NodeWithScore(node=TextNode(text="Machine learning algorithms"), score=0.85),
    NodeWithScore(node=TextNode(text="Deep learning techniques"), score=0.72),
    NodeWithScore(node=TextNode(text="Unrelated content here"), score=0.45),
    NodeWithScore(node=TextNode(text="Neural network architectures"), score=0.78)
]

# Filter by similarity threshold
similarity_filter = SimilarityPostprocessor(similarity_cutoff=0.7)
filtered_nodes = similarity_filter.postprocess_nodes(nodes)

print(f"Original nodes: {len(nodes)}")
print(f"Filtered nodes: {len(filtered_nodes)}")
for node in filtered_nodes:
    print(f"Score: {node.score:.2f}, Text: {node.text}")
```

### Keyword-Based Filtering

```python
from llama_index.core.postprocessor import KeywordNodePostprocessor

# Keyword filtering; reuses `nodes` from the Basic Similarity Filtering example.
# Note: keyword matching relies on spaCy being installed.
keyword_filter = KeywordNodePostprocessor(
    required_keywords=["machine", "learning"],
    exclude_keywords=["unrelated", "spam"]
)

filtered_by_keywords = keyword_filter.postprocess_nodes(nodes)
print(f"Keyword filtered nodes: {len(filtered_by_keywords)}")
```

### LLM-Based Reranking

```python
from llama_index.core.postprocessor import LLMRerank
from llama_index.core.llms import MockLLM
from llama_index.core.schema import QueryBundle

# Initialize LLM reranker.
# MockLLM is only a placeholder that needs no API key; substitute a real LLM
# for meaningful rankings, since mock output may not parse as a ranking.
llm = MockLLM()
reranker = LLMRerank(
    llm=llm,
    top_n=3,
    choice_batch_size=5
)

# Rerank nodes based on relevance (reuses `nodes` from the first example)
reranked_nodes = reranker.postprocess_nodes(
    nodes,
    query_bundle=QueryBundle(query_str="What is machine learning?")
)

print("Reranked results:")
for i, node in enumerate(reranked_nodes):
    print(f"{i+1}. Score: {node.score:.2f}, Text: {node.text}")
```

### Context Enhancement with Previous/Next Nodes

```python
from llama_index.core.postprocessor import PrevNextNodePostprocessor
from llama_index.core.storage.docstore import SimpleDocumentStore

# Setup document store with node relationships
docstore = SimpleDocumentStore()
# Add nodes with relationships to docstore
# docstore.add_documents([...])

# Context enhancement postprocessor
context_enhancer = PrevNextNodePostprocessor(
    docstore=docstore,
    num_nodes=1,
    mode="both"
)

# Add context to retrieved nodes
enhanced_nodes = context_enhancer.postprocess_nodes(nodes)
print("Enhanced nodes with context:")
for node in enhanced_nodes:
    print(f"Enhanced text length: {len(node.text)}")
```

### Recency-Based Processing

```python
from datetime import datetime, timedelta

from llama_index.core.postprocessor import FixedRecencyPostprocessor
from llama_index.core.schema import NodeWithScore, QueryBundle, TextNode

# Create nodes with date metadata
recent_nodes = [
    NodeWithScore(
        node=TextNode(
            text="Latest ML research findings",
            metadata={"date": datetime.now().isoformat()}
        ),
        score=0.75
    ),
    NodeWithScore(
        node=TextNode(
            text="Historical ML overview",
            metadata={"date": (datetime.now() - timedelta(days=365)).isoformat()}
        ),
        score=0.80
    )
]

# Prioritize recent content
recency_processor = FixedRecencyPostprocessor(
    top_k=1,
    date_key="date"
)

# Pass the query as well; recency postprocessors may reject calls without one.
recent_filtered = recency_processor.postprocess_nodes(
    recent_nodes,
    query_bundle=QueryBundle(query_str="machine learning research")
)
print("Most recent content:")
for node in recent_filtered:
    print(f"Date: {node.node.metadata['date']}")
    print(f"Text: {node.text}")
```

### PII Removal

```python
from llama_index.core.postprocessor import PIINodePostprocessor
from llama_index.core.schema import NodeWithScore, TextNode

# Nodes with potential PII
pii_nodes = [
    NodeWithScore(
        node=TextNode(text="Contact John Doe at john.doe@email.com for more info"),
        score=0.80
    ),
    NodeWithScore(
        node=TextNode(text="The phone number is 555-123-4567"),
        score=0.75
    )
]

# Remove PII from nodes
pii_remover = PIINodePostprocessor(pii_str_tmpl="[REDACTED]")
sanitized_nodes = pii_remover.postprocess_nodes(pii_nodes)

print("Sanitized content:")
for node in sanitized_nodes:
    print(f"Text: {node.text}")
```

### Long Context Optimization

```python
from llama_index.core.postprocessor import LongContextReorder

# Reorder for long context optimization
long_context_reorder = LongContextReorder()
reordered_nodes = long_context_reorder.postprocess_nodes(nodes)

print("Reordered for long context:")
for i, node in enumerate(reordered_nodes):
    print(f"Position {i}: {node.text[:50]}...")
```

### Sentence Transformer Reranking

```python
from llama_index.core.postprocessor import SentenceTransformerRerank

# Advanced semantic reranking
sentence_reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=3,
    keep_retrieval_score=True
)

# Note: running this requires the sentence-transformers library.
# from llama_index.core.schema import QueryBundle
# reranked_semantic = sentence_reranker.postprocess_nodes(
#     nodes,
#     query_bundle=QueryBundle(query_str="machine learning algorithms")
# )
```
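
### Chaining Multiple Postprocessors

```python
from llama_index.core.llms import MockLLM
from llama_index.core.postprocessor import (
    KeywordNodePostprocessor,
    LLMRerank,
    SimilarityPostprocessor,
)
from llama_index.core.schema import QueryBundle

# Chain multiple postprocessors (reuses `nodes` from the first example)
llm = MockLLM()  # placeholder; use a real LLM for meaningful reranking
postprocessors = [
    SimilarityPostprocessor(similarity_cutoff=0.6),
    KeywordNodePostprocessor(required_keywords=["machine", "learning"]),
    LLMRerank(llm=llm, top_n=2)
]

# Apply postprocessors in sequence; the reranker needs the query,
# so the same bundle is passed to every stage.
query_bundle = QueryBundle(query_str="What is machine learning?")
processed_nodes = nodes
for processor in postprocessors:
    processed_nodes = processor.postprocess_nodes(
        processed_nodes, query_bundle=query_bundle
    )

print(f"Final processed nodes: {len(processed_nodes)}")
for node in processed_nodes:
    print(f"Final result: {node.text}")
```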

## Configuration & Types

```python { .api }
# Postprocessor modes and configurations
class PostprocessorMode(str, Enum):
    SIMILARITY = "similarity"
    KEYWORD = "keyword"
    LLM_RERANK = "llm_rerank"
    RECENCY = "recency"
    PII_REMOVAL = "pii_removal"

# Default configuration values
DEFAULT_SIMILARITY_CUTOFF = 0.7
DEFAULT_TOP_N = 10
DEFAULT_BATCH_SIZE = 10
DEFAULT_PII_TEMPLATE = "[PII_REMOVED]"
DEFAULT_DATE_KEY = "date"
```