# Retriever Components

Retrievers are the search components in Haystack that find relevant documents from document stores based on queries. They implement various retrieval strategies, including sparse keyword-based methods, dense vector similarity, and specialized multi-modal approaches.

## Core Imports

```python { .api }
from haystack.nodes.retriever import (
    BaseRetriever,
    # Sparse retrievers
    BM25Retriever, TfidfRetriever, FilterRetriever,
    # Dense retrievers
    DenseRetriever, DensePassageRetriever, EmbeddingRetriever,
    MultihopEmbeddingRetriever, TableTextRetriever,
    # Specialized retrievers
    MultiModalRetriever, WebRetriever, LinkContentFetcher
)
```

## Base Retriever

### BaseRetriever

Abstract base class defining the retriever interface.

```python { .api }
from haystack.nodes.base import BaseComponent
from haystack.nodes.retriever.base import BaseRetriever
from haystack.document_stores.base import BaseDocumentStore, FilterType
from haystack.schema import Document
from typing import List, Optional, Dict, Union

class BaseRetriever(BaseComponent):
    """Abstract base class for all retrievers."""

    def retrieve(
        self,
        query: str,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None,
    ) -> List[Document]:
        """
        Retrieve documents most relevant to the query.

        Args:
            query: Search query string
            filters: Metadata filters to narrow the search scope
            top_k: Number of documents to retrieve
            index: Document store index name
            headers: Custom HTTP headers for the document store
            scale_score: Whether to normalize scores to the [0, 1] range
            document_store: Override the default document store

        Returns:
            List of retrieved Document objects with relevance scores
        """

    def retrieve_batch(
        self,
        queries: List[str],
        filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
        top_k: Optional[int] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        batch_size: Optional[int] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None,
    ) -> List[List[Document]]:
        """Batch retrieval for multiple queries."""
```
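
The two abstract methods above are all a custom retriever needs to implement. Below is a minimal, illustrative sketch of a subclass; `KeywordOverlapRetriever` and its overlap scoring are invented for this example and are not part of Haystack, and the signatures are abbreviated for brevity.

```python { .api }
from typing import List

from haystack.nodes.retriever.base import BaseRetriever
from haystack.schema import Document

class KeywordOverlapRetriever(BaseRetriever):
    """Toy retriever that ranks an in-memory corpus by query-term overlap."""

    def __init__(self, documents: List[Document], top_k: int = 10):
        super().__init__()
        self.documents = documents
        self.top_k = top_k

    def retrieve(self, query, filters=None, top_k=None, index=None,
                 headers=None, scale_score=None, document_store=None):
        top_k = top_k or self.top_k
        terms = set(query.lower().split())
        # Score each document by how many query terms it contains
        scored = [
            (len(terms & set(doc.content.lower().split())), doc)
            for doc in self.documents
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_k]]

    def retrieve_batch(self, queries, **kwargs):
        return [self.retrieve(query=q, **kwargs) for q in queries]
```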

## Sparse Retrievers

Sparse retrievers use keyword-based methods like BM25 and TF-IDF for document retrieval.

### BM25Retriever

Best Matching 25 (BM25) algorithm for keyword-based retrieval.

```python { .api }
from haystack.nodes.retriever import BM25Retriever
from haystack.document_stores import KeywordDocumentStore

class BM25Retriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[KeywordDocumentStore] = None,
        top_k: int = 10,
        all_terms_must_match: bool = False,
        custom_query: Optional[str] = None,
        scale_score: bool = True,
    ):
        """
        Initialize BM25 retriever.

        Args:
            document_store: Keyword-searchable document store
            top_k: Number of documents to retrieve
            all_terms_must_match: Whether all query terms must match (AND vs. OR)
            custom_query: Custom Elasticsearch query template
            scale_score: Whether to normalize scores to [0, 1]
        """
```
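
For intuition, the sketch below implements the plain BM25 scoring formula over a toy in-memory corpus. It is illustrative only: in Haystack the scoring runs inside the document store's search backend, and the `K1`/`B` constants here are just common defaults.

```python { .api }
import math
from collections import Counter
from typing import List

K1, B = 1.5, 0.75  # common BM25 defaults

def bm25_scores(query: str, docs: List[str]) -> List[float]:
    """Score each document against the query with plain BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    avgdl = sum(len(toks) for toks in tokenized) / len(tokenized)
    n = len(tokenized)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            freq = tf[term]
            # Term frequency saturates via K1; B controls length normalization
            score += idf * freq * (K1 + 1) / (freq + K1 * (1 - B + B * len(toks) / avgdl))
        scores.append(score)
    return scores

print(bm25_scores("python framework", ["a python web framework", "cooking recipes"]))
```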

### Usage Examples

```python { .api }
from haystack.nodes.retriever import BM25Retriever
from haystack.document_stores import ElasticsearchDocumentStore

# Basic setup
document_store = ElasticsearchDocumentStore()
bm25_retriever = BM25Retriever(
    document_store=document_store,
    top_k=10,
    all_terms_must_match=False
)

# Simple retrieval
documents = bm25_retriever.retrieve(
    query="Python machine learning framework",
    filters={"category": "documentation"}
)

# Custom Elasticsearch query, passed as a string template;
# ${query} and ${filters} are substituted by Haystack at query time
custom_bm25 = BM25Retriever(
    document_store=document_store,
    custom_query="""
    {
        "size": 10,
        "query": {
            "bool": {
                "should": [{"multi_match": {
                    "query": ${query},
                    "type": "most_fields",
                    "fields": ["content", "title"]
                }}],
                "filter": ${filters}
            }
        },
        "highlight": {
            "fields": {"content": {}, "title": {}}
        }
    }
    """
)

# Access highlighted results
highlighted_docs = custom_bm25.retrieve(query="Haystack framework")
highlighted_content = highlighted_docs[0].meta["highlighted"]["content"]
```

### TfidfRetriever

Term Frequency-Inverse Document Frequency (TF-IDF) retriever.

```python { .api }
from haystack.nodes.retriever import TfidfRetriever

class TfidfRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        top_k: int = 10,
        auto_fit: bool = True,
    ):
        """
        Initialize TF-IDF retriever.

        Args:
            document_store: Document store to retrieve from
            top_k: Number of documents to retrieve
            auto_fit: Whether to automatically fit the TF-IDF model
        """

# Usage
tfidf_retriever = TfidfRetriever(
    document_store=document_store,
    top_k=10
)

# Fit the model (only needed if auto_fit is disabled)
tfidf_retriever.fit()

# Retrieve documents
results = tfidf_retriever.retrieve(query="information retrieval")
```

### FilterRetriever

Metadata-based document filtering without similarity scoring.

```python { .api }
from haystack.nodes.retriever import FilterRetriever

class FilterRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        top_k: int = 10,
    ):
        """
        Initialize filter-based retriever.

        Args:
            document_store: Document store to filter
            top_k: Maximum number of documents to return
        """

# Usage - returns documents matching the filters only
filter_retriever = FilterRetriever(document_store=document_store)

# Retrieve by metadata only (no query text needed)
filtered_docs = filter_retriever.retrieve(
    query="",  # Empty query
    filters={
        "source": "documentation",
        "date": {"$gte": "2023-01-01"},
        "status": "published"
    }
)
```

## Dense Retrievers

Dense retrievers use embedding vectors for semantic similarity search.

### EmbeddingRetriever

General-purpose embedding-based retriever.

```python { .api }
import torch
from typing import Dict, List, Optional, Union

from haystack.nodes.retriever import EmbeddingRetriever
from haystack.document_stores import FAISSDocumentStore

class EmbeddingRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        model_version: Optional[str] = None,
        use_gpu: bool = True,
        batch_size: int = 32,
        max_seq_len: int = 512,
        model_format: str = "sentence_transformers",
        pooling_strategy: str = "reduce_mean",
        emb_extraction_layer: int = -1,
        top_k: int = 10,
        similarity_function: str = "dot_product",
        progress_bar: bool = True,
        devices: Optional[List[Union[str, torch.device]]] = None,
        use_auth_token: Optional[Union[str, bool]] = None,
        scale_score: bool = True,
        embed_title: bool = True,
        api_key: Optional[str] = None,
        azure_api_version: str = "2022-12-01",
        azure_base_url: Optional[str] = None,
    ):
        """
        Initialize embedding retriever.

        Args:
            document_store: Vector-enabled document store
            embedding_model: Model name or path for embeddings
            model_format: Format type ("sentence_transformers", "transformers", "openai")
            use_gpu: Whether to use GPU for embedding generation
            batch_size: Batch size for embedding generation
            max_seq_len: Maximum sequence length for the model
            similarity_function: Similarity metric ("dot_product", "cosine")
            embed_title: Whether to include the document title in embeddings
        """
```
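
The `model_format` parameter also covers hosted embedding APIs. A sketch assuming an OpenAI embeddings setup; the model name and key below are placeholders:

```python { .api }
# Hosted embeddings via the OpenAI API (key and model name are placeholders)
openai_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="text-embedding-ada-002",
    model_format="openai",
    api_key="your-openai-api-key",
    top_k=10
)
```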

### DensePassageRetriever (DPR)

Facebook's Dense Passage Retrieval implementation, which encodes queries and passages with two separate encoders.

```python { .api }
import torch
from pathlib import Path
from typing import List, Optional, Union

from haystack.nodes.retriever import DensePassageRetriever

class DensePassageRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        query_embedding_model: Union[str, Path] = "facebook/dpr-question_encoder-single-nq-base",
        passage_embedding_model: Union[str, Path] = "facebook/dpr-ctx_encoder-single-nq-base",
        model_version: Optional[str] = None,
        max_seq_len_query: int = 64,
        max_seq_len_passage: int = 256,
        top_k: int = 10,
        use_gpu: bool = True,
        batch_size: int = 16,
        embed_title: bool = True,
        use_fast_tokenizers: bool = True,
        infer_tokenizer_classes: bool = False,
        similarity_function: str = "dot_product",
        progress_bar: bool = True,
        devices: Optional[List[Union[str, torch.device]]] = None,
        use_auth_token: Optional[Union[str, bool]] = None,
        scale_score: bool = True,
    ):
        """
        Initialize DPR retriever.

        Args:
            document_store: Vector-enabled document store
            query_embedding_model: Model for encoding queries
            passage_embedding_model: Model for encoding passages
            max_seq_len_query: Max sequence length for queries
            max_seq_len_passage: Max sequence length for passages
            embed_title: Whether to embed document titles
            similarity_function: Similarity metric for ranking
        """
```

### Usage Examples

```python { .api }
from haystack.nodes.retriever import EmbeddingRetriever, DensePassageRetriever
from haystack.document_stores import FAISSDocumentStore

# Embedding retriever setup (384-dim store matches all-MiniLM-L6-v2)
document_store = FAISSDocumentStore(
    embedding_dim=384,
    faiss_index_factory_str="Flat"
)

embedding_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    model_format="sentence_transformers",
    top_k=10
)

# Generate embeddings for documents
document_store.update_embeddings(embedding_retriever)

# Semantic search
semantic_results = embedding_retriever.retrieve(
    query="How to build chatbots with AI?",
    top_k=5
)

# DPR retriever for question answering; the DPR encoders produce 768-dim
# vectors, so they need a store with a matching embedding dimension
dpr_document_store = FAISSDocumentStore(
    embedding_dim=768,
    faiss_index_factory_str="Flat"
)

dpr_retriever = DensePassageRetriever(
    document_store=dpr_document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=64,
    max_seq_len_passage=256
)

# Generate DPR embeddings
dpr_document_store.update_embeddings(dpr_retriever)

# QA-optimized retrieval
qa_results = dpr_retriever.retrieve(
    query="What is the capital of France?",
    top_k=3
)
```

### MultihopEmbeddingRetriever

Multi-step reasoning retriever for complex queries.

```python { .api }
from haystack.nodes.retriever import MultihopEmbeddingRetriever

class MultihopEmbeddingRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        num_iterations: int = 2,
        top_k: int = 10,
        use_gpu: bool = True,
        batch_size: int = 32,
    ):
        """
        Initialize multi-hop retriever for iterative document retrieval.

        Args:
            document_store: Vector-enabled document store
            embedding_model: Model for generating embeddings
            num_iterations: Number of retrieval iterations
            top_k: Documents per iteration
        """

# Usage for complex multi-step queries
multihop_retriever = MultihopEmbeddingRetriever(
    document_store=document_store,
    num_iterations=3,
    top_k=5
)

# Complex reasoning query
complex_results = multihop_retriever.retrieve(
    query="What company founded by Steve Jobs created the iPhone and when was it released?"
)
```
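
Conceptually, each hop feeds the text of the current top results back into the query before retrieving again. A simplified sketch of that loop, not Haystack's actual implementation:

```python { .api }
# Simplified illustration of the multi-hop idea: each iteration enriches
# the query with the content of previously retrieved documents.
def multihop_retrieve(retriever, query: str, num_iterations: int = 2, top_k: int = 5):
    context = query
    results = []
    for _ in range(num_iterations):
        results = retriever.retrieve(query=context, top_k=top_k)
        # Append retrieved passages to the query for the next hop
        context = query + " " + " ".join(doc.content for doc in results)
    return results
```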

### TableTextRetriever

Joint retrieval from both text and tabular data.

```python { .api }
from haystack.nodes.retriever import TableTextRetriever

class TableTextRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        query_embedding_model: str = "deepset/all-mpnet-base-v2-table",
        passage_embedding_model: str = "deepset/all-mpnet-base-v2-table",
        table_embedding_model: str = "deepset/all-mpnet-base-v2-table",
        model_version: Optional[str] = None,
        max_seq_len: int = 256,
        use_gpu: bool = True,
        batch_size: int = 16,
        similarity_function: str = "dot_product",
        top_k: int = 10,
    ):
        """
        Initialize table-text joint retriever.

        Args:
            document_store: Document store with text and table documents
            query_embedding_model: Model for query embeddings
            passage_embedding_model: Model for text passage embeddings
            table_embedding_model: Model for table embeddings
            similarity_function: Similarity computation method
        """

# Usage with mixed text and table documents
table_text_retriever = TableTextRetriever(
    document_store=document_store,
    query_embedding_model="deepset/all-mpnet-base-v2-table"
)

# Retrieve from both text and tables
mixed_results = table_text_retriever.retrieve(
    query="What was the revenue in Q4 2022?",
    top_k=10
)
```
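
Table documents are typically stored as pandas DataFrames marked with `content_type="table"`. A sketch of indexing one, assuming the store and retriever from above; the column names are illustrative:

```python { .api }
import pandas as pd
from haystack.schema import Document

# A table document: the content is a DataFrame, content_type marks it as a table
revenue_table = pd.DataFrame(
    {"quarter": ["Q3 2022", "Q4 2022"], "revenue": ["$10M", "$12M"]}
)
document_store.write_documents(
    [Document(content=revenue_table, content_type="table")]
)
document_store.update_embeddings(table_text_retriever)
```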

## Specialized Retrievers

### MultiModalRetriever

Retrieval across multiple content modalities (text, images, etc.).

```python { .api }
from haystack.nodes.retriever import MultiModalRetriever

class MultiModalRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        query_embedding_model: str = "sentence-transformers/clip-ViT-B-32",
        document_embedding_models: Optional[Dict[str, str]] = None,
        top_k: int = 10,
        progress_bar: bool = True,
    ):
        """
        Initialize multi-modal retriever.

        Args:
            document_store: Document store with multi-modal documents
            query_embedding_model: Model for query embeddings (e.g., CLIP)
            document_embedding_models: Models per content type
            top_k: Number of documents to retrieve
        """

# Usage with images and text
multimodal_retriever = MultiModalRetriever(
    document_store=document_store,
    query_embedding_model="sentence-transformers/clip-ViT-B-32",
    document_embedding_models={
        "text": "sentence-transformers/all-MiniLM-L6-v2",
        "image": "sentence-transformers/clip-ViT-B-32"
    }
)

# Search across text and images
multimodal_results = multimodal_retriever.retrieve(
    query="sunset over mountains",
    top_k=5
)
```
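
For image search, documents point at image files and are marked with `content_type="image"`. A sketch under that assumption; the file paths are placeholders:

```python { .api }
from haystack.schema import Document

# Image documents: content holds the file path, content_type marks the modality
image_docs = [
    Document(content="photos/mountain_sunset.jpg", content_type="image"),
    Document(content="photos/city_night.jpg", content_type="image")
]
document_store.write_documents(image_docs)
document_store.update_embeddings(multimodal_retriever)
```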

### WebRetriever

Web search integration for external content retrieval.

```python { .api }
from typing import Dict, Optional

from haystack.nodes.retriever import WebRetriever
from haystack.nodes.preprocessor import BasePreProcessor

class WebRetriever(BaseRetriever):
    def __init__(
        self,
        api_key: str,
        search_engine_provider: str = "SerperDev",
        top_k: int = 10,
        mode: str = "preprocessed_documents",
        preprocessor: Optional[BasePreProcessor] = None,
        cache_document: Optional[bool] = None,
        cache_index: Optional[str] = None,
        cache_headers: Optional[Dict[str, str]] = None,
        document_store: Optional[BaseDocumentStore] = None,
    ):
        """
        Initialize web search retriever.

        Args:
            api_key: API key for the search engine provider
            search_engine_provider: Provider ("SerperDev", "SerpAPI")
            mode: Return mode ("preprocessed_documents", "raw_documents", "snippets")
            preprocessor: Text preprocessor for web content
            cache_document: Whether to cache retrieved documents
            document_store: Optional document store for caching
        """

# Usage for web search
web_retriever = WebRetriever(
    api_key="your-serper-api-key",
    search_engine_provider="SerperDev",
    top_k=10,
    mode="preprocessed_documents"
)

# Search the web
web_results = web_retriever.retrieve(
    query="latest developments in large language models 2024"
)
```

### LinkContentFetcher

Fetch and process content from web links.

```python { .api }
from haystack.nodes.retriever import LinkContentFetcher
from haystack.schema import Document

class LinkContentFetcher(BaseRetriever):
    def __init__(
        self,
        raise_on_failure: bool = False,
        suppress_extraction_errors: bool = True,
    ):
        """
        Initialize link content fetcher.

        Args:
            raise_on_failure: Whether to raise exceptions on fetch failures
            suppress_extraction_errors: Whether to suppress content extraction errors
        """

# Usage for processing web links
link_fetcher = LinkContentFetcher()

# Documents whose metadata points at URLs to fetch
documents_with_links = [
    Document(content="", meta={"url": "https://example.com/article1"}),
    Document(content="", meta={"url": "https://example.com/article2"})
]

# Fetch content from the URLs
fetched_content = link_fetcher.run(documents=documents_with_links)
```

## Batch Processing and Performance

### Batch Retrieval

All retrievers support efficient batch processing:

```python { .api }
# Batch queries for efficiency
queries = [
    "What is machine learning?",
    "How does deep learning work?",
    "What are neural networks?"
]

# Batch retrieval
batch_results = embedding_retriever.retrieve_batch(
    queries=queries,
    top_k=5,
    batch_size=10
)

# Process results
for i, query in enumerate(queries):
    docs = batch_results[i]
    print(f"Query: {query}")
    print(f"Found {len(docs)} documents")
```
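
As the `retrieve_batch` signature on `BaseRetriever` shows, `filters` can also be a list with one (optional) filter set per query:

```python { .api }
# One filter set per query: the filters list is aligned with the queries list
per_query_filters = [
    {"category": "ml-basics"},
    {"category": "deep-learning"},
    None  # no filter for the third query
]
filtered_batch_results = embedding_retriever.retrieve_batch(
    queries=queries,
    filters=per_query_filters,
    top_k=5
)
```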

### Performance Optimization

```python { .api }
# GPU acceleration for embeddings
gpu_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=True,
    batch_size=64,  # Larger batch for GPU efficiency
    devices=["cuda:0"]
)

# Optimize for production
production_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=True,
    batch_size=128,
    progress_bar=False,  # Disable for production
    scale_score=True
)
```

## Advanced Filtering

### Complex Metadata Filters

```python { .api }
# Complex filter examples
advanced_filters = {
    "$and": [
        {"source": {"$in": ["docs", "tutorials"]}},
        {"date": {"$gte": "2023-01-01"}},
        {"$or": [
            {"category": "beginner"},
            {"rating": {"$gte": 4.5}}
        ]},
        {"tags": {"$in": ["python", "ai"]}}
    ]
}

# Apply complex filters
filtered_results = embedding_retriever.retrieve(
    query="getting started with AI",
    filters=advanced_filters,
    top_k=10
)

# Date range filtering
date_filters = {
    "published_date": {
        "$gte": "2023-01-01",
        "$lt": "2024-01-01"
    }
}

# Numeric range filtering
score_filters = {
    "confidence": {"$gte": 0.8},
    "word_count": {"$gte": 100, "$lte": 5000}
}
```

## Evaluation and Metrics

### Retriever Evaluation

```python { .api }
# Evaluate retriever performance
eval_result = bm25_retriever.eval(
    label_index="evaluation_labels",
    doc_index="evaluation_docs",
    top_k=10,
    open_domain=True,
    return_preds=True
)

# Access metrics
print(f"Recall@10: {eval_result['recall']}")
print(f"MAP: {eval_result['map']}")
print(f"MRR: {eval_result['mrr']}")

# Performance timing
print(f"Average query time: {bm25_retriever.query_time / bm25_retriever.query_count:.3f}s")
```

## Integration with Pipelines

### Pipeline Integration

```python { .api }
from haystack import Pipeline
from haystack.nodes import FARMReader

# Create retrieval pipeline
pipeline = Pipeline()
pipeline.add_node(
    component=embedding_retriever,
    name="Retriever",
    inputs=["Query"]
)

# Add additional processing
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
pipeline.add_node(
    component=reader,
    name="Reader",
    inputs=["Retriever"]
)

# Execute pipeline
result = pipeline.run(
    query="How to use Haystack retrievers?",
    params={
        "Retriever": {
            "top_k": 5,
            "filters": {"source": "documentation"}
        },
        "Reader": {"top_k": 3}
    }
)
```
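
For retrieval-only workloads, Haystack also ships a ready-made wrapper around a single retriever; a minimal sketch:

```python { .api }
from haystack.pipelines import DocumentSearchPipeline

# Ready-made retrieval-only pipeline; the retriever node is named "Retriever"
search_pipeline = DocumentSearchPipeline(embedding_retriever)
search_result = search_pipeline.run(
    query="How to use Haystack retrievers?",
    params={"Retriever": {"top_k": 5}}
)
documents = search_result["documents"]
```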

Retrievers form the core search capability in Haystack, enabling everything from simple keyword matching to sophisticated semantic search and multi-modal retrieval across diverse content types and storage backends.