# Haystack

Haystack is an end-to-end NLP framework for building applications powered by Large Language Models (LLMs), Transformer models, and vector search. It provides a modular architecture in which Pipelines connect Nodes (preprocessing, retrieval, and language model components) to perform complex NLP tasks such as retrieval-augmented generation (RAG), question answering, semantic document search, and answer generation.

## Package Information

- **Package Name**: farm-haystack
- **Language**: Python
- **Installation**: `pip install farm-haystack`
- **Python Support**: 3.8+
- **License**: Apache-2.0

## Core Imports

```python
import haystack
from haystack import Document, Answer, Label, MultiLabel, Span, EvaluationResult, TableCell, Pipeline, hash128
from haystack.nodes.base import BaseComponent
```

Common imports for building pipelines:

```python
from haystack.document_stores import InMemoryDocumentStore, ElasticsearchDocumentStore, FAISSDocumentStore
from haystack.nodes import BM25Retriever, EmbeddingRetriever, FARMReader, TransformersReader
from haystack.pipelines import ExtractiveQAPipeline, DocumentSearchPipeline
```

## Basic Usage

```python
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Create documents
docs = [
    Document(content="Paris is the capital of France."),
    Document(content="Berlin is the capital of Germany."),
    Document(content="Madrid is the capital of Spain."),
]

# Initialize document store and add documents
# (use_bm25=True enables BM25 keyword search; available in farm-haystack 1.15+)
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents(docs)

# Create retriever and reader components
retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# Build pipeline
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# Ask a question
result = pipeline.run(query="What is the capital of France?")
print(result["answers"][0].answer)  # "Paris"
```

## Architecture

Haystack follows a modular, component-based architecture with three core concepts:

- **Components (Nodes)**: Modular processing units that perform specific tasks (retrieval, reading, generation, preprocessing)
- **Document Stores**: Backend storage systems for documents and embeddings (Elasticsearch, FAISS, Pinecone, etc.)
- **Pipelines**: Orchestration layer that connects components in directed graphs to solve complex NLP tasks (see the sketch below)

The framework supports both **Retrieval-Augmented Generation (RAG)** workflows and **Agent-based** interactive systems, making it suitable for production-grade applications that require sophisticated natural language processing.
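
To make the directed-graph idea concrete, here is a minimal sketch of a custom pipeline built with `Pipeline.add_node`, equivalent to the prebuilt `ExtractiveQAPipeline` from Basic Usage; the document content is an illustrative placeholder.

```python
from haystack import Document, Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader

document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([Document(content="Paris is the capital of France.")])

retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# Wire the graph by hand: "Query" is the reserved input node for query pipelines
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

result = pipeline.run(query="What is the capital of France?")
print(result["answers"][0].answer)
```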

## Capabilities

### Core Schema & Data Structures

Fundamental data classes for documents, answers, labels, and evaluation results that form the foundation of all Haystack operations.

```python { .api }
class Document:
    def __init__(self, content: Union[str, DataFrame], content_type: str = "text",
                 meta: Dict[str, Any] = None, id: Optional[str] = None): ...

class Answer:
    def __init__(self, answer: str, type: str = "extractive",
                 score: Optional[float] = None, context: Optional[str] = None): ...

class Label:
    def __init__(self, query: str, document: Document, is_correct_answer: bool,
                 is_correct_document: bool, origin: str,
                 answer: Optional[Answer] = None): ...
```

[Core Schema](./core-schema.md)
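
A short usage sketch of these classes; field values such as the `meta` dict are illustrative. In farm-haystack 1.x a `Label` also carries the matched `Document` and an `origin` such as `"gold-label"`:

```python
from haystack import Answer, Document, Label

doc = Document(content="Paris is the capital of France.", meta={"source": "geo-notes"})

answer = Answer(answer="Paris", type="extractive", score=0.95, context=doc.content)

# A gold annotation pairing a query with its expected answer and source document
label = Label(
    query="What is the capital of France?",
    document=doc,
    answer=answer,
    is_correct_answer=True,
    is_correct_document=True,
    origin="gold-label",
)
```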

### Document Stores

Backend storage systems supporting vector and keyword search across multiple databases including Elasticsearch, FAISS, Pinecone, Weaviate, and others.

```python { .api }
class BaseDocumentStore:
    def write_documents(self, documents: List[Document]): ...
    def get_all_documents(self) -> List[Document]: ...
    def query(self, query: str, top_k: int = 10) -> List[Document]: ...

class ElasticsearchDocumentStore(BaseDocumentStore): ...
class FAISSDocumentStore(BaseDocumentStore): ...
class PineconeDocumentStore(BaseDocumentStore): ...
```

[Document Stores](./document-stores.md)
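
All stores expose the same `BaseDocumentStore` interface, so code written against the in-memory store carries over to the production backends. A minimal sketch (the `use_bm25=True` flag assumes a recent 1.x release):

```python
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore

store = InMemoryDocumentStore(use_bm25=True)
store.write_documents([
    Document(content="Berlin is the capital of Germany."),
    Document(content="Madrid is the capital of Spain."),
])

print(store.get_document_count())            # 2
print(store.get_all_documents()[0].content)  # first stored document
```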

### Retriever Components

Dense and sparse retrieval components for finding relevant documents using embeddings, BM25, TF-IDF, and specialized retrieval methods.

```python { .api }
class BM25Retriever(BaseRetriever):
    def __init__(self, document_store: BaseDocumentStore): ...
    def retrieve(self, query: str, top_k: int = 10) -> List[Document]: ...

class EmbeddingRetriever(BaseRetriever):
    def __init__(self, document_store: BaseDocumentStore,
                 embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"): ...
```

[Retriever Components](./retrievers.md)
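
A sketch of sparse retrieval with BM25 (document content is illustrative). For `EmbeddingRetriever`, document embeddings must be computed first via `document_store.update_embeddings(retriever)`:

```python
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever

store = InMemoryDocumentStore(use_bm25=True)
store.write_documents([Document(content="Berlin is the capital of Germany.")])

retriever = BM25Retriever(document_store=store)
docs = retriever.retrieve(query="capital of Germany", top_k=3)
for doc in docs:
    print(doc.score, doc.content)  # retrieved documents carry a relevance score
```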

### Reader Components

Reading comprehension components for extractive question answering using FARM, Transformers, and specialized table readers.

```python { .api }
class FARMReader(BaseReader):
    def __init__(self, model_name_or_path: str = "deepset/roberta-base-squad2"): ...
    def predict(self, query: str, documents: List[Document],
                top_k: Optional[int] = None) -> Dict[str, Any]: ...

class TransformersReader(BaseReader):
    def __init__(self, model_name_or_path: str = "deepset/roberta-base-squad2"): ...
```

[Reader Components](./readers.md)
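
A minimal reader-only sketch; in farm-haystack 1.x, `predict` returns a dict whose `"answers"` list holds `Answer` objects:

```python
from haystack import Document
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)
prediction = reader.predict(
    query="What is the capital of Germany?",
    documents=[Document(content="Berlin is the capital of Germany.")],
    top_k=1,
)
print(prediction["answers"][0].answer)  # "Berlin"
```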

### Generator Components

Language model components for text generation using OpenAI, Transformers, and other LLM providers for generative QA and text synthesis.

```python { .api }
class OpenAIAnswerGenerator(BaseGenerator):
    def __init__(self, api_key: str, model: str = "gpt-3.5-turbo"): ...
    def predict(self, query: str, documents: List[Document],
                top_k: Optional[int] = None) -> Dict[str, Any]: ...

class OpenAIChatGenerator(BaseGenerator):
    def __init__(self, api_key: str, model: str = "gpt-3.5-turbo"): ...
```

[Generator Components](./generators.md)
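
A hedged sketch of generative QA. It assumes an OpenAI key in the `OPENAI_API_KEY` environment variable, and the default model depends on the installed release:

```python
import os

from haystack import Document
from haystack.nodes import OpenAIAnswerGenerator

generator = OpenAIAnswerGenerator(api_key=os.environ["OPENAI_API_KEY"])
prediction = generator.predict(
    query="What is the capital of France?",
    documents=[Document(content="Paris is the capital of France.")],
    top_k=1,
)
print(prediction["answers"][0].answer)
```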

### Pipeline System

Pre-built and custom pipeline templates for orchestrating component workflows including QA, search, generation, and indexing pipelines.

```python { .api }
class Pipeline:
    def __init__(self): ...
    def add_node(self, component: BaseComponent, name: str, inputs: List[str]): ...
    def run(self, **kwargs): ...

class ExtractiveQAPipeline(Pipeline): ...
class GenerativeQAPipeline(Pipeline): ...
class DocumentSearchPipeline(Pipeline): ...
```

[Pipeline System](./pipelines.md)
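
Beyond the hand-wired graph shown under Architecture, the prebuilt templates wrap common layouts; per-node parameters are passed through `params`, keyed by node name. A sketch with `DocumentSearchPipeline` (document content is illustrative):

```python
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever
from haystack.pipelines import DocumentSearchPipeline

document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([Document(content="Berlin is the capital of Germany.")])

search = DocumentSearchPipeline(retriever=BM25Retriever(document_store=document_store))
result = search.run(
    query="capital of Germany",
    params={"Retriever": {"top_k": 5}},  # parameters are routed by node name
)
for doc in result["documents"]:
    print(doc.content)
```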

### Agent System

Interactive LLM agents with tool usage, memory management, and conversational capabilities for complex reasoning tasks.

```python { .api }
class Agent:
    def __init__(self, prompt_node: PromptNode, memory: Optional[BaseMemory] = None): ...
    def run(self, query: str) -> AgentStep: ...

class ConversationalAgent(Agent): ...
class Tool:
    def __init__(self, name: str, pipeline_or_node: Union[BaseComponent, Pipeline]): ...
```

[Agent System](./agents.md)
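
A hedged sketch of an agent that can call an extractive QA pipeline as a tool. It assumes an OpenAI key in `OPENAI_API_KEY`; the tool name and description are illustrative, and in farm-haystack 1.x a `Tool` also takes a `description` the LLM uses to choose tools:

```python
import os

from haystack import Document
from haystack.agents import Agent, Tool
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader, PromptNode
from haystack.pipelines import ExtractiveQAPipeline

# Small extractive QA pipeline to expose as a tool
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([Document(content="Berlin is the capital of Germany.")])
qa_pipeline = ExtractiveQAPipeline(
    reader=FARMReader(model_name_or_path="deepset/roberta-base-squad2"),
    retriever=BM25Retriever(document_store=document_store),
)

# The PromptNode wraps the LLM that drives the agent's reasoning loop
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ["OPENAI_API_KEY"],
    stop_words=["Observation:"],
)
agent = Agent(prompt_node=prompt_node)
agent.add_tool(Tool(
    name="CapitalsQA",
    pipeline_or_node=qa_pipeline,
    description="Answers questions about the capitals of countries",
))

result = agent.run(query="Which country has Berlin as its capital?")
```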

### File Processing

Document converters and preprocessors for handling PDF, DOCX, HTML, images, and other file formats with text extraction and cleaning.

```python { .api }
class BaseConverter:
    def convert(self, file_path: Path, **kwargs) -> List[Document]: ...

class PDFToTextConverter(BaseConverter): ...
class DocxToTextConverter(BaseConverter): ...
class PreProcessor(BaseComponent):
    def process(self, documents: List[Document]) -> List[Document]: ...
```

[File Processing](./file-processing.md)
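
A sketch of a typical indexing flow: convert a PDF, then clean and split it into passages before writing to a store. `"report.pdf"` is a placeholder path and the splitting parameters are illustrative:

```python
from haystack.nodes import PDFToTextConverter, PreProcessor

converter = PDFToTextConverter(remove_numeric_tables=True)
docs = converter.convert(file_path="report.pdf", meta={"source": "report.pdf"})

# Clean and split into overlapping ~200-word passages for retrieval
preprocessor = PreProcessor(
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=False,
)
passages = preprocessor.process(docs)
```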

### Evaluation & Utilities

Evaluation metrics, model evaluation tools, and utility functions for assessing pipeline performance and data processing. In farm-haystack 1.x, evaluation runs through `Pipeline.eval`, which returns an `EvaluationResult`.

```python { .api }
class Pipeline:
    def eval(self, labels: List[MultiLabel],
             params: Optional[dict] = None) -> EvaluationResult: ...

class EvaluationResult:
    def calculate_metrics(self) -> Dict[str, Dict[str, float]]: ...
```

[Evaluation & Utilities](./evaluation-utilities.md)
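
A hedged sketch of evaluating an extractive pipeline against one hand-made gold label; the metric names (`exact_match`, `recall_single_hit`) follow the 1.x per-node metrics, and the label content is illustrative:

```python
from haystack import Answer, Document, Label, MultiLabel
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([Document(content="Paris is the capital of France.")])
pipeline = ExtractiveQAPipeline(
    reader=FARMReader(model_name_or_path="deepset/roberta-base-squad2"),
    retriever=BM25Retriever(document_store=document_store),
)

# One hand-made gold label; origin marks it as a gold annotation
gold = Label(
    query="What is the capital of France?",
    document=Document(content="Paris is the capital of France."),
    answer=Answer(answer="Paris", type="extractive"),
    is_correct_answer=True,
    is_correct_document=True,
    origin="gold-label",
)

eval_result = pipeline.eval(labels=[MultiLabel(labels=[gold])])
metrics = eval_result.calculate_metrics()  # metrics are grouped per node
print(metrics["Reader"]["exact_match"], metrics["Retriever"]["recall_single_hit"])
```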