# Retriever Components

Retrievers are the search components in Haystack that find relevant documents from document stores based on queries. They implement various retrieval strategies, including sparse keyword-based methods, dense vector similarity, and specialized multi-modal approaches.

## Core Imports

```python { .api }
from haystack.nodes.retriever import (
    BaseRetriever,
    # Sparse retrievers
    BM25Retriever, TfidfRetriever, FilterRetriever,
    # Dense retrievers
    DenseRetriever, DensePassageRetriever, EmbeddingRetriever,
    MultihopEmbeddingRetriever, TableTextRetriever,
    # Specialized retrievers
    MultiModalRetriever, WebRetriever, LinkContentFetcher
)
```

## Base Retriever

### BaseRetriever

Abstract base class defining the retriever interface.

```python { .api }
from haystack.nodes.base import BaseComponent
from haystack.nodes.retriever.base import BaseRetriever
from haystack.document_stores.base import BaseDocumentStore, FilterType
from haystack.schema import Document
from typing import List, Optional, Dict, Union

class BaseRetriever(BaseComponent):
    """Abstract base class for all retrievers."""

    def retrieve(
        self,
        query: str,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None,
    ) -> List[Document]:
        """
        Retrieve documents most relevant to the query.

        Args:
            query: Search query string
            filters: Metadata filters to narrow search scope
            top_k: Number of documents to retrieve
            index: Document store index name
            headers: Custom HTTP headers for document store requests
            scale_score: Whether to normalize scores to the [0, 1] range
            document_store: Override the default document store

        Returns:
            List of retrieved Document objects with relevance scores
        """

    def retrieve_batch(
        self,
        queries: List[str],
        filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
        top_k: Optional[int] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        batch_size: Optional[int] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None,
    ) -> List[List[Document]]:
        """Batch retrieval for multiple queries."""
```
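To make the `retrieve()` contract concrete, here is a minimal, framework-free sketch. The class and field names (`Doc`, `ToyKeywordRetriever`) are illustrative stand-ins, not Haystack API; real retrievers delegate scoring to the document store or an embedding model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Doc:
    """Toy stand-in for haystack.schema.Document."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)
    score: float = 0.0

class ToyKeywordRetriever:
    """Scores documents by query-term overlap, honoring filters and top_k."""

    def __init__(self, docs: List[Doc], top_k: int = 10):
        self.docs, self.top_k = docs, top_k

    def retrieve(self, query: str, filters: Optional[Dict[str, str]] = None,
                 top_k: Optional[int] = None) -> List[Doc]:
        k = top_k or self.top_k
        terms = set(query.lower().split())
        # Apply metadata filters first, then rank the survivors
        candidates = [
            d for d in self.docs
            if not filters or all(d.meta.get(f) == v for f, v in filters.items())
        ]
        for d in candidates:
            d.score = float(len(terms & set(d.content.lower().split())))
        return sorted(candidates, key=lambda d: d.score, reverse=True)[:k]
```

The shape of the call (`query`, `filters`, `top_k`, ranked `Document` list out) is the same one every concrete retriever below implements.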

## Sparse Retrievers

Sparse retrievers use keyword-based methods like BM25 and TF-IDF for document retrieval.

### BM25Retriever

Best Matching 25 (BM25) algorithm for keyword-based retrieval.

```python { .api }
from haystack.nodes.retriever import BM25Retriever
from haystack.document_stores import KeywordDocumentStore

class BM25Retriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[KeywordDocumentStore] = None,
        top_k: int = 10,
        all_terms_must_match: bool = False,
        custom_query: Optional[str] = None,
        scale_score: bool = True,
    ):
        """
        Initialize BM25 retriever.

        Args:
            document_store: Keyword-searchable document store
            top_k: Number of documents to retrieve
            all_terms_must_match: Whether all query terms must match (AND vs. OR)
            custom_query: Custom Elasticsearch query template
            scale_score: Whether to normalize scores to [0, 1]
        """
```
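For intuition about what the document store computes under the hood, here is the textbook BM25 scoring formula in plain Python (an illustration, not Haystack's or Elasticsearch's implementation; `k1` and `b` are the usual free parameters):

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with the classic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation, normalized by document length
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Documents sharing no terms with the query score zero, which is why BM25 is "sparse": only exact (or analyzed) term matches contribute.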

### Usage Examples

```python { .api }
from haystack.nodes.retriever import BM25Retriever
from haystack.document_stores import ElasticsearchDocumentStore

# Basic setup
document_store = ElasticsearchDocumentStore()
bm25_retriever = BM25Retriever(
    document_store=document_store,
    top_k=10,
    all_terms_must_match=False
)

# Simple retrieval
documents = bm25_retriever.retrieve(
    query="Python machine learning framework",
    filters={"category": "documentation"}
)

# Custom Elasticsearch query, passed as a JSON string template
# (matching the Optional[str] type of custom_query); ${query} and
# ${filters} are substituted at retrieval time
custom_bm25 = BM25Retriever(
    document_store=document_store,
    custom_query="""
    {
        "size": 10,
        "query": {
            "bool": {
                "should": [{"multi_match": {
                    "query": ${query},
                    "type": "most_fields",
                    "fields": ["content", "title"]
                }}],
                "filter": ${filters}
            }
        },
        "highlight": {
            "fields": {"content": {}, "title": {}}
        }
    }
    """
)

# Access highlighted results
highlighted_docs = custom_bm25.retrieve(query="Haystack framework")
highlighted_content = highlighted_docs[0].meta["highlighted"]["content"]
```

### TfidfRetriever

Term Frequency-Inverse Document Frequency (TF-IDF) retriever.

```python { .api }
from haystack.nodes.retriever import TfidfRetriever

class TfidfRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        top_k: int = 10,
        auto_fit: bool = True,
    ):
        """
        Initialize TF-IDF retriever.

        Args:
            document_store: Document store to retrieve from
            top_k: Number of documents to retrieve
            auto_fit: Whether to automatically fit the TF-IDF model
        """

# Usage
tfidf_retriever = TfidfRetriever(
    document_store=document_store,
    top_k=10
)

# Fit the model (only needed if auto_fit is disabled)
tfidf_retriever.fit()

# Retrieve documents
results = tfidf_retriever.retrieve(query="information retrieval")
```

### FilterRetriever

Metadata-based document filtering without similarity scoring.

```python { .api }
from haystack.nodes.retriever import FilterRetriever

class FilterRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        top_k: int = 10,
    ):
        """
        Initialize filter-based retriever.

        Args:
            document_store: Document store to filter
            top_k: Maximum number of documents to return
        """

# Usage - returns documents matching the filters only
filter_retriever = FilterRetriever(document_store=document_store)

# Retrieve by metadata only (no query text needed)
filtered_docs = filter_retriever.retrieve(
    query="",  # Empty query
    filters={
        "source": "documentation",
        "date": {"$gte": "2023-01-01"},
        "status": "published"
    }
)
```

## Dense Retrievers

Dense retrievers use embedding vectors for semantic similarity search.
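The mechanism common to every dense retriever can be sketched in a few lines: embed the query and each document as a vector, then rank documents by similarity. The toy vectors below stand in for the output of a trained encoder (this is an illustration of the ranking step, not Haystack code):

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dense_retrieve(query_vec: List[float],
                   doc_vecs: Dict[str, List[float]],
                   top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank documents by cosine similarity to the query embedding."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return scored[:top_k]
```

Unlike sparse retrieval, a document can rank highly without sharing a single term with the query, as long as the encoder maps both to nearby vectors.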

### EmbeddingRetriever

General-purpose embedding-based retriever.

```python { .api }
import torch

from haystack.nodes.retriever import EmbeddingRetriever
from haystack.document_stores import FAISSDocumentStore

class EmbeddingRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        model_version: Optional[str] = None,
        use_gpu: bool = True,
        batch_size: int = 32,
        max_seq_len: int = 512,
        model_format: str = "sentence_transformers",
        pooling_strategy: str = "reduce_mean",
        emb_extraction_layer: int = -1,
        top_k: int = 10,
        similarity_function: str = "dot_product",
        progress_bar: bool = True,
        devices: Optional[List[Union[str, torch.device]]] = None,
        use_auth_token: Optional[Union[str, bool]] = None,
        scale_score: bool = True,
        embed_title: bool = True,
        api_key: Optional[str] = None,
        azure_api_version: str = "2022-12-01",
        azure_base_url: Optional[str] = None,
    ):
        """
        Initialize embedding retriever.

        Args:
            document_store: Vector-enabled document store
            embedding_model: Model name or path for embeddings
            model_format: Format type ("sentence_transformers", "transformers", "openai")
            use_gpu: Whether to use GPU for embedding generation
            batch_size: Batch size for embedding generation
            max_seq_len: Maximum sequence length for the model
            similarity_function: Similarity metric ("dot_product", "cosine")
            embed_title: Whether to include the document title in embeddings
        """
```

### DensePassageRetriever (DPR)

Facebook's Dense Passage Retrieval implementation.

```python { .api }
from pathlib import Path

import torch

from haystack.nodes.retriever import DensePassageRetriever

class DensePassageRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        query_embedding_model: Union[str, Path] = "facebook/dpr-question_encoder-single-nq-base",
        passage_embedding_model: Union[str, Path] = "facebook/dpr-ctx_encoder-single-nq-base",
        model_version: Optional[str] = None,
        max_seq_len_query: int = 64,
        max_seq_len_passage: int = 256,
        top_k: int = 10,
        use_gpu: bool = True,
        batch_size: int = 16,
        embed_title: bool = True,
        use_fast_tokenizers: bool = True,
        infer_tokenizer_classes: bool = False,
        similarity_function: str = "dot_product",
        progress_bar: bool = True,
        devices: Optional[List[Union[str, torch.device]]] = None,
        use_auth_token: Optional[Union[str, bool]] = None,
        scale_score: bool = True,
    ):
        """
        Initialize DPR retriever.

        Args:
            document_store: Vector-enabled document store
            query_embedding_model: Model for encoding queries
            passage_embedding_model: Model for encoding passages
            max_seq_len_query: Maximum sequence length for queries
            max_seq_len_passage: Maximum sequence length for passages
            embed_title: Whether to embed document titles
            similarity_function: Similarity metric for ranking
        """
```

### Usage Examples

```python { .api }
from haystack.nodes.retriever import EmbeddingRetriever, DensePassageRetriever
from haystack.document_stores import FAISSDocumentStore

# Embedding retriever setup (all-MiniLM-L6-v2 produces 384-dim vectors)
document_store = FAISSDocumentStore(
    vector_dim=384,
    faiss_index_factory_str="Flat"
)

embedding_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    model_format="sentence_transformers",
    top_k=10
)

# Generate embeddings for documents
document_store.update_embeddings(embedding_retriever)

# Semantic search
semantic_results = embedding_retriever.retrieve(
    query="How to build chatbots with AI?",
    top_k=5
)

# DPR retriever for question answering. The DPR base models output
# 768-dim embeddings, so pair them with a store of matching dimension.
dpr_store = FAISSDocumentStore(
    vector_dim=768,
    faiss_index_factory_str="Flat"
)

dpr_retriever = DensePassageRetriever(
    document_store=dpr_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=64,
    max_seq_len_passage=256
)

# Generate DPR embeddings
dpr_store.update_embeddings(dpr_retriever)

# QA-optimized retrieval
qa_results = dpr_retriever.retrieve(
    query="What is the capital of France?",
    top_k=3
)
```

### MultihopEmbeddingRetriever

Multi-step reasoning retriever for complex queries.

```python { .api }
from haystack.nodes.retriever import MultihopEmbeddingRetriever

class MultihopEmbeddingRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        num_iterations: int = 2,
        top_k: int = 10,
        use_gpu: bool = True,
        batch_size: int = 32,
    ):
        """
        Initialize multi-hop retriever for iterative document retrieval.

        Args:
            document_store: Vector-enabled document store
            embedding_model: Model for generating embeddings
            num_iterations: Number of retrieval iterations
            top_k: Documents per iteration
        """

# Usage for complex multi-step queries
multihop_retriever = MultihopEmbeddingRetriever(
    document_store=document_store,
    num_iterations=3,
    top_k=5
)

# Complex reasoning query
complex_results = multihop_retriever.retrieve(
    query="What company founded by Steve Jobs created the iPhone and when was it released?"
)
```
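The iterative pattern behind multi-hop retrieval can be sketched abstractly: each hop folds the evidence found so far back into the query before retrieving again. The `retrieve_fn` callable is a placeholder for a real embedding-based retrieval step; this sketch shows the control flow, not Haystack's internals:

```python
from typing import Callable, List

def multihop_retrieve(query: str,
                      retrieve_fn: Callable[[str], List[str]],
                      num_iterations: int = 2) -> List[str]:
    """Iteratively expand the query with previously retrieved passages."""
    results: List[str] = []
    current_query = query
    for _ in range(num_iterations):
        hits = retrieve_fn(current_query)
        # Deduplicate while preserving retrieval order
        results.extend(h for h in hits if h not in results)
        # Fold the evidence gathered so far back into the query
        current_query = query + " " + " ".join(results)
    return results
```

This is why multi-hop retrieval can answer the example query above: the first hop finds the company, and the expanded query lets the second hop find the release date.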

### TableTextRetriever

Joint retrieval from both text and tabular data.

```python { .api }
from haystack.nodes.retriever import TableTextRetriever

class TableTextRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        query_embedding_model: str = "deepset/all-mpnet-base-v2-table",
        passage_embedding_model: str = "deepset/all-mpnet-base-v2-table",
        table_embedding_model: str = "deepset/all-mpnet-base-v2-table",
        model_version: Optional[str] = None,
        max_seq_len: int = 256,
        use_gpu: bool = True,
        batch_size: int = 16,
        similarity_function: str = "dot_product",
        top_k: int = 10,
    ):
        """
        Initialize table-text joint retriever.

        Args:
            document_store: Document store with text and table documents
            query_embedding_model: Model for query embeddings
            passage_embedding_model: Model for text passage embeddings
            table_embedding_model: Model for table embeddings
            similarity_function: Similarity computation method
        """

# Usage with mixed text and table documents
table_text_retriever = TableTextRetriever(
    document_store=document_store,
    query_embedding_model="deepset/all-mpnet-base-v2-table"
)

# Retrieve from both text and tables
mixed_results = table_text_retriever.retrieve(
    query="What was the revenue in Q4 2022?",
    top_k=10
)
```

## Specialized Retrievers

### MultiModalRetriever

Retrieval across multiple content modalities (text, images, etc.).

```python { .api }
from haystack.nodes.retriever import MultiModalRetriever

class MultiModalRetriever(BaseRetriever):
    def __init__(
        self,
        document_store: Optional[BaseDocumentStore] = None,
        query_embedding_model: str = "sentence-transformers/clip-ViT-B-32",
        document_embedding_models: Optional[Dict[str, str]] = None,
        top_k: int = 10,
        progress_bar: bool = True,
    ):
        """
        Initialize multi-modal retriever.

        Args:
            document_store: Document store with multi-modal documents
            query_embedding_model: Model for query embeddings (e.g., CLIP)
            document_embedding_models: Embedding model per content type
            top_k: Number of documents to retrieve
        """

# Usage with images and text
multimodal_retriever = MultiModalRetriever(
    document_store=document_store,
    query_embedding_model="sentence-transformers/clip-ViT-B-32",
    document_embedding_models={
        "text": "sentence-transformers/all-MiniLM-L6-v2",
        "image": "sentence-transformers/clip-ViT-B-32"
    }
)

# Search across text and images
multimodal_results = multimodal_retriever.retrieve(
    query="sunset over mountains",
    top_k=5
)
```

### WebRetriever

Web search integration for external content retrieval.

```python { .api }
from haystack.nodes.preprocessor.base import BasePreProcessor
from haystack.nodes.retriever import WebRetriever

class WebRetriever(BaseRetriever):
    def __init__(
        self,
        api_key: str,
        search_engine_provider: str = "SerperDev",
        top_k: int = 10,
        mode: str = "preprocessed_documents",
        preprocessor: Optional[BasePreProcessor] = None,
        cache_document: Optional[bool] = None,
        cache_index: Optional[str] = None,
        cache_headers: Optional[Dict[str, str]] = None,
        document_store: Optional[BaseDocumentStore] = None,
    ):
        """
        Initialize web search retriever.

        Args:
            api_key: API key for the search engine provider
            search_engine_provider: Provider ("SerperDev", "SerpAPI")
            mode: Return mode ("preprocessed_documents", "raw_documents", "snippets")
            preprocessor: Text preprocessor for web content
            cache_document: Whether to cache retrieved documents
            document_store: Optional document store for caching
        """

# Usage for web search
web_retriever = WebRetriever(
    api_key="your-serper-api-key",
    search_engine_provider="SerperDev",
    top_k=10,
    mode="preprocessed_documents"
)

# Search the web
web_results = web_retriever.retrieve(
    query="latest developments in large language models 2024"
)
```

### LinkContentFetcher

Fetch and process content from web links.

```python { .api }
from haystack.nodes.retriever import LinkContentFetcher
from haystack.schema import Document

class LinkContentFetcher(BaseRetriever):
    def __init__(
        self,
        raise_on_failure: bool = False,
        suppress_extraction_errors: bool = True,
    ):
        """
        Initialize link content fetcher.

        Args:
            raise_on_failure: Whether to raise exceptions on fetch failures
            suppress_extraction_errors: Whether to suppress content extraction errors
        """

# Usage for processing web links
link_fetcher = LinkContentFetcher()

# Process documents containing URLs
documents_with_links = [
    Document(content="", meta={"url": "https://example.com/article1"}),
    Document(content="", meta={"url": "https://example.com/article2"})
]

# Fetch content from URLs
fetched_content = link_fetcher.run(documents=documents_with_links)
```

## Batch Processing and Performance

### Batch Retrieval

All retrievers support efficient batch processing:

```python { .api }
# Batch queries for efficiency
queries = [
    "What is machine learning?",
    "How does deep learning work?",
    "What are neural networks?"
]

# Batch retrieval
batch_results = embedding_retriever.retrieve_batch(
    queries=queries,
    top_k=5,
    batch_size=10
)

# Process results (one list of documents per query)
for query, docs in zip(queries, batch_results):
    print(f"Query: {query}")
    print(f"Found {len(docs)} documents")
```

### Performance Optimization

```python { .api }
# GPU acceleration for embeddings
gpu_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=True,
    batch_size=64,  # Larger batch for GPU efficiency
    devices=["cuda:0"]
)

# Optimize for production
production_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=True,
    batch_size=128,
    progress_bar=False,  # Disable for production
    scale_score=True
)
```

## Advanced Filtering

### Complex Metadata Filters

```python { .api }
# Complex filter examples
advanced_filters = {
    "$and": [
        {"source": {"$in": ["docs", "tutorials"]}},
        {"date": {"$gte": "2023-01-01"}},
        {"$or": [
            {"category": "beginner"},
            {"rating": {"$gte": 4.5}}
        ]},
        {"tags": {"$in": ["python", "ai"]}}
    ]
}

# Apply complex filters
filtered_results = embedding_retriever.retrieve(
    query="getting started with AI",
    filters=advanced_filters,
    top_k=10
)

# Date range filtering
date_filters = {
    "published_date": {
        "$gte": "2023-01-01",
        "$lt": "2024-01-01"
    }
}

# Numeric range filtering
score_filters = {
    "confidence": {"$gte": 0.8},
    "word_count": {"$gte": 100, "$lte": 5000}
}
```
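To pin down what these operators mean, here is a small self-contained evaluator for this filter dialect ($and, $or, $in, $gte, $lte, $lt, plus exact match). It is an illustration of the semantics, not Haystack's implementation, which pushes filters down to the document store:

```python
from typing import Any, Dict

def matches(filters: Dict[str, Any], meta: Dict[str, Any]) -> bool:
    """Return True if a document's metadata satisfies the filter tree."""
    ops = {
        "$in": lambda v, c: v in c,
        "$gte": lambda v, c: v is not None and v >= c,
        "$lte": lambda v, c: v is not None and v <= c,
        "$lt": lambda v, c: v is not None and v < c,
    }
    for key, cond in filters.items():
        if key == "$and":
            if not all(matches(sub, meta) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches(sub, meta) for sub in cond):
                return False
        elif isinstance(cond, dict):  # comparison operators on one field
            value = meta.get(key)
            if not all(ops[op](value, arg) for op, arg in cond.items()):
                return False
        elif meta.get(key) != cond:  # bare value means exact match
            return False
    return True
```

Note that ISO-formatted date strings compare correctly with plain lexicographic `>=`/`<`, which is why the date-range filters above work on string values.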

## Evaluation and Metrics

### Retriever Evaluation

```python { .api }
# Evaluate retriever performance
eval_result = bm25_retriever.eval(
    label_index="evaluation_labels",
    doc_index="evaluation_docs",
    top_k=10,
    open_domain=True,
    return_preds=True
)

# Access metrics
print(f"Recall@10: {eval_result['recall']}")
print(f"MAP: {eval_result['map']}")
print(f"MRR: {eval_result['mrr']}")

# Performance timing
print(f"Average query time: {bm25_retriever.query_time / bm25_retriever.query_count:.3f}s")
```
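The headline metrics are straightforward to compute from ranked results. A minimal sketch of recall@k and MRR (mean reciprocal rank of the first relevant hit, averaged over queries), shown for intuition rather than as Haystack's evaluation code:

```python
from typing import List, Sequence, Set

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_reciprocal_rank(runs: List[Sequence[str]],
                         relevant_sets: List[Set[str]]) -> float:
    """Average 1/rank of the first relevant hit per query (0 if none found)."""
    total = 0.0
    for retrieved, relevant in zip(runs, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```

Recall@k rewards finding relevant documents anywhere in the top k, while MRR rewards ranking the first relevant one as high as possible; a good retriever needs both.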

## Integration with Pipelines

### Pipeline Integration

```python { .api }
from haystack import Pipeline
from haystack.nodes import FARMReader

# Create retrieval pipeline
pipeline = Pipeline()
pipeline.add_node(
    component=embedding_retriever,
    name="Retriever",
    inputs=["Query"]
)

# Add additional processing
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
pipeline.add_node(
    component=reader,
    name="Reader",
    inputs=["Retriever"]
)

# Execute pipeline
result = pipeline.run(
    query="How to use Haystack retrievers?",
    params={
        "Retriever": {
            "top_k": 5,
            "filters": {"source": "documentation"}
        },
        "Reader": {"top_k": 3}
    }
)
```

Retrievers form the core search capability in Haystack, enabling everything from simple keyword matching to sophisticated semantic search and multi-modal retrieval across diverse content types and storage backends.