# Core Schema & Data Structures

Haystack's core data structures form the foundation of the framework, providing standardized representations for documents, answers, labels, and evaluation results. These Pydantic dataclass-based structures provide typed fields and consistent serialization across all components.

## Core Imports

```python { .api }
from haystack.schema import Document, Answer, Label, MultiLabel, Span, TableCell, EvaluationResult
from haystack.schema import ContentTypes, FilterType, LABEL_DATETIME_FORMAT
```

## Document Class

The `Document` class is the primary data structure for representing content in Haystack.

### Document Definition

```python { .api }
from haystack.schema import Document
from pandas import DataFrame
from numpy import ndarray
from pydantic.dataclasses import dataclass
from typing import Union, Dict, Any, List, Optional, Literal

ContentTypes = Literal["text", "table", "image", "audio"]

# Abbreviated signature overview; defaults are written as plain literals for readability.
@dataclass
class Document:
    id: str
    content: Union[str, DataFrame]
    content_type: ContentTypes = "text"
    meta: Dict[str, Any] = {}
    id_hash_keys: List[str] = ["content"]
    score: Optional[float] = None
    embedding: Optional[ndarray] = None

    def __init__(
        self,
        content: Union[str, DataFrame],
        content_type: ContentTypes = "text",
        id: Optional[str] = None,
        score: Optional[float] = None,
        meta: Optional[Dict[str, Any]] = None,
        embedding: Optional[ndarray] = None,
        id_hash_keys: Optional[List[str]] = None,
    ):
        """
        Creates a Document instance representing a piece of content.

        Args:
            content: The document content (text string or DataFrame for tables)
            content_type: One of "text", "table", "image", "audio"
            id: Unique identifier; auto-generated from content hash if None
            score: Relevance score [0,1] from retrieval/ranking models
            meta: Custom metadata dictionary
            embedding: Vector representation of the content
            id_hash_keys: Document attributes used for ID generation
        """
```

### Document Methods

```python { .api }
# Serialization
document.to_dict(field_map: Optional[Dict[str, Any]] = None) -> Dict
document.to_json(field_map: Optional[Dict[str, Any]] = None) -> str

# Deserialization
Document.from_dict(dict: Dict[str, Any], field_map: Optional[Dict[str, Any]] = None) -> Document
Document.from_json(data: Union[str, Dict[str, Any]], field_map: Optional[Dict[str, Any]] = None) -> Document
```

### Document Usage Examples

```python { .api }
from haystack.schema import Document
import pandas as pd

# Text document
text_doc = Document(
    content="Haystack is a Python framework for building LLM applications.",
    meta={"source": "documentation", "author": "deepset"}
)

# Table document
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
table_doc = Document(
    content=df,
    content_type="table",
    meta={"source": "user_data.csv"}
)

# Document with custom ID generation
doc_with_meta_id = Document(
    content="Content with metadata-based ID",
    meta={"url": "https://example.com/page1"},
    id_hash_keys=["content", "meta.url"]
)

# Serialization
doc_dict = text_doc.to_dict()
doc_json = text_doc.to_json()
restored_doc = Document.from_dict(doc_dict)
```
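
The effect of `id_hash_keys` can be illustrated without Haystack installed. The sketch below is a simplified stand-in, not the library's implementation: the helper `make_id`, the choice of SHA-256 from `hashlib`, and the dotted `meta.<key>` lookup are all assumptions made for illustration.

```python
import hashlib
from typing import Any, Dict, List, Optional


def make_id(content: str, meta: Dict[str, Any], id_hash_keys: Optional[List[str]] = None) -> str:
    """Hypothetical helper: derive a deterministic ID from selected attributes."""
    keys = id_hash_keys or ["content"]
    parts = []
    for key in keys:
        if key == "content":
            parts.append(content)
        elif key.startswith("meta."):
            # Dotted keys such as "meta.url" select a single metadata value.
            parts.append(str(meta.get(key[len("meta."):])))
        else:
            raise ValueError(f"Unsupported hash key: {key}")
    joined = ":".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


same_text = "Content with metadata-based ID"
id_a = make_id(same_text, {"url": "https://example.com/page1"}, ["content", "meta.url"])
id_b = make_id(same_text, {"url": "https://example.com/page2"}, ["content", "meta.url"])
id_c = make_id(same_text, {"url": "https://example.com/page2"})  # content only

print(id_a != id_b)  # True: identical text, different URL
print(id_c == make_id(same_text, {}))  # True: metadata is ignored when only content is hashed
```

Including a metadata value in the hash is useful when the same text legitimately appears under several sources and each copy should keep its own ID.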

## Answer Class

The `Answer` class represents answers from question-answering systems.

### Answer Definition

```python { .api }
from haystack.schema import Answer, Span, TableCell
from pandas import DataFrame
from pydantic.dataclasses import dataclass
from typing import List, Optional, Union, Dict, Any, Literal

@dataclass
class Answer:
    answer: str
    type: Literal["generative", "extractive", "other"] = "extractive"
    score: Optional[float] = None
    context: Optional[Union[str, DataFrame]] = None
    offsets_in_document: Optional[Union[List[Span], List[TableCell]]] = None
    offsets_in_context: Optional[Union[List[Span], List[TableCell]]] = None
    document_ids: Optional[List[str]] = None
    meta: Optional[Dict[str, Any]] = None

    """
    Creates an Answer instance from QA systems.

    Args:
        answer: The answer string (empty if no answer found)
        type: "extractive" (from document text), "generative" (LLM-generated), or "other"
        score: Confidence score [0,1] from the QA model
        context: Source context (text passage or table) used for the answer
        offsets_in_document: Character/cell positions in original document
        offsets_in_context: Character/cell positions in the context window
        document_ids: List of document IDs containing the answer
        meta: Additional metadata about the answer
    """
```

### Answer Usage Examples

```python { .api }
from haystack.schema import Answer, Span, TableCell

# Extractive answer
# "Python framework" occupies characters 14-29 of the context, so end (exclusive) is 30
extractive_answer = Answer(
    answer="Python framework",
    type="extractive",
    score=0.95,
    context="Haystack is a Python framework for building LLM applications.",
    offsets_in_document=[Span(start=14, end=30)],
    offsets_in_context=[Span(start=14, end=30)],
    document_ids=["doc123"],
    meta={"model": "bert-base-uncased-qa"}
)

# Generative answer
generative_answer = Answer(
    answer="Haystack enables developers to build production-ready LLM applications with modular components.",
    type="generative",
    score=0.88,
    document_ids=["doc123", "doc124", "doc125"],
    meta={"model": "gpt-3.5-turbo", "tokens_used": 45}
)

# Table-based answer
table_answer = Answer(
    answer="25",
    type="extractive",
    offsets_in_document=[TableCell(row=0, col=1)],
    document_ids=["table_doc_1"]
)
```
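
Character offsets are easy to get wrong, so it can help to derive or check them by slicing. The sketch below uses plain integers in place of `Span` and assumes offsets are half-open (`start` inclusive, `end` exclusive), which matches Python slicing; the helper name `find_offsets` is hypothetical.

```python
from typing import Optional, Tuple


def find_offsets(text: str, answer: str) -> Optional[Tuple[int, int]]:
    """Hypothetical helper: return (start, end) of the first occurrence of answer, or None."""
    start = text.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


context = "Haystack is a Python framework for building LLM applications."
offsets = find_offsets(context, "Python framework")
print(offsets)  # (14, 30)

# Slicing with the offsets must reproduce the answer string exactly.
start, end = offsets
assert context[start:end] == "Python framework"
```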

## Label Class

The `Label` class represents training and evaluation labels for supervised learning.

### Label Definition

```python { .api }
from haystack.schema import Label, Document, Answer
from pydantic.dataclasses import dataclass
from typing import Optional, Dict, Any, Literal

@dataclass
class Label:
    id: str
    query: str
    document: Document
    is_correct_answer: bool
    is_correct_document: bool
    origin: Literal["user-feedback", "gold-label"]
    answer: Optional[Answer] = None
    pipeline_id: Optional[str] = None
    created_at: Optional[str] = None
    updated_at: Optional[str] = None
    meta: Optional[Dict[str, Any]] = None
    filters: Optional[Dict[str, Any]] = None

    def __init__(
        self,
        query: str,
        document: Document,
        is_correct_answer: bool,
        is_correct_document: bool,
        origin: Literal["user-feedback", "gold-label"],
        answer: Optional[Answer] = None,
        id: Optional[str] = None,
        pipeline_id: Optional[str] = None,
        created_at: Optional[str] = None,
        updated_at: Optional[str] = None,
        meta: Optional[Dict[str, Any]] = None,
        filters: Optional[Dict[str, Any]] = None,
    ):
        """
        Creates a Label for training/evaluation.

        Args:
            query: The question or query text
            document: Document containing the answer
            is_correct_answer: Whether the provided answer is correct
            is_correct_document: Whether the document is relevant
            origin: "user-feedback" (human annotation) or "gold-label" (reference data)
            answer: Optional Answer object with correct answer
            id: Unique label identifier
            pipeline_id: ID of pipeline that generated this label
            created_at: Creation timestamp string in LABEL_DATETIME_FORMAT ("%Y-%m-%d %H:%M:%S")
            updated_at: Last update timestamp string in the same format
            meta: Additional metadata
            filters: Document store filters applied during labeling
        """
```

### Label Usage Examples

```python { .api }
from haystack.schema import Label, Document, Answer
from datetime import datetime

# Create training label
training_doc = Document(content="The capital of France is Paris.")
training_label = Label(
    query="What is the capital of France?",
    document=training_doc,
    is_correct_answer=True,
    is_correct_document=True,
    origin="gold-label",
    answer=Answer(answer="Paris", type="extractive"),
    meta={"dataset": "squad", "difficulty": "easy"}
)

# User feedback label
feedback_label = Label(
    query="How does Haystack work?",
    document=Document(content="Haystack uses modular components..."),
    is_correct_answer=False,
    is_correct_document=True,
    origin="user-feedback",
    created_at=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    meta={"user_id": "user123", "feedback_type": "incorrect_answer"}
)
```

## Supporting Classes

### Span Class

```python { .api }
from haystack.schema import Span
from pydantic.dataclasses import dataclass

@dataclass
class Span:
    start: int
    end: int

    def __contains__(self, value) -> bool:
        """Check if a value or span is contained within this span."""

# Usage
span = Span(start=10, end=20)
assert 15 in span  # True - value is in range
assert Span(12, 18) in span  # True - span is fully contained
assert 25 not in span  # 25 lies outside the range
```

### TableCell Class

```python { .api }
from haystack.schema import TableCell
from pydantic.dataclasses import dataclass

@dataclass
class TableCell:
    row: int
    col: int

# Usage
cell = TableCell(row=2, col=3)  # Third row, fourth column (0-indexed)
```
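
A `TableCell` is just a pair of zero-based positions. As a rough illustration, the sketch below resolves such a pair against a nested list standing in for the table body. `CellRef` and `cell_value` are hypothetical names; with a real DataFrame, the comparable lookup would be positional indexing such as `df.iat[row, col]`.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class CellRef:
    """Stand-in for TableCell: zero-based row and column positions."""
    row: int
    col: int


def cell_value(rows: List[List[Any]], cell: CellRef) -> Any:
    """Return the value at the referenced position (header row not included)."""
    return rows[cell.row][cell.col]


# Table body for the columns ["Name", "Age"]
rows = [["Alice", 25], ["Bob", 30]]

print(cell_value(rows, CellRef(row=0, col=1)))  # 25
print(cell_value(rows, CellRef(row=1, col=0)))  # Bob
```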

### MultiLabel Class

```python { .api }
from typing import List
from haystack.schema import MultiLabel, Label

class MultiLabel:
    # The labels wrapped by this container
    labels: List[Label]

    def __init__(self, labels: List[Label]):
        """Container for multiple labels, typically for multi-answer questions."""

# Usage (label1, label2, and label3 are existing Label instances)
multi_label = MultiLabel(labels=[label1, label2, label3])
```
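
As a rough picture of what collecting several labels for one question involves, the sketch below groups label-like records by query using only the standard library. `SimpleLabel` and `group_by_query` are hypothetical stand-ins; the real `MultiLabel` may apply additional rules that are not modelled here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SimpleLabel:
    """Stand-in for Label with only the fields needed here."""
    query: str
    answer: str


def group_by_query(labels: List[SimpleLabel]) -> Dict[str, List[str]]:
    """Collect all answers that were annotated for the same query."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for label in labels:
        grouped[label.query].append(label.answer)
    return dict(grouped)


labels = [
    SimpleLabel("What is the capital of France?", "Paris"),
    SimpleLabel("What is the capital of France?", "the city of Paris"),
    SimpleLabel("How does Haystack work?", "modular components"),
]
grouped = group_by_query(labels)
print(grouped["What is the capital of France?"])  # ['Paris', 'the city of Paris']
```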

### EvaluationResult Class

```python { .api }
from typing import Dict, List
from haystack.schema import EvaluationResult

# Signatures as listed in this reference; check them against your installed Haystack version.
class EvaluationResult:
    def __init__(self):
        """Container for evaluation metrics and results."""

    # Evaluation metrics and analysis methods
    def calculate_metrics(self, predictions: List, labels: List) -> Dict[str, float]: ...
    def print_metrics(self) -> None: ...
```

## Type Definitions

### Core Types

```python { .api }
from typing import Literal, Dict, Union, List, Any

# Content types supported by Document
ContentTypes = Literal["text", "table", "image", "audio"]

# Filter type for document stores
FilterType = Dict[str, Union[Dict[str, Any], List[Any], str, int, float, bool]]

# Date format constant
LABEL_DATETIME_FORMAT: str = "%Y-%m-%d %H:%M:%S"
```
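
The format string can be exercised with the standard library alone. The sketch below redefines the constant locally so that it runs without Haystack, formats a fixed timestamp, and parses it back.

```python
from datetime import datetime

# Redefined locally for the example; same value as the constant listed above.
LABEL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

moment = datetime(2023, 5, 17, 9, 30, 0)
stamp = moment.strftime(LABEL_DATETIME_FORMAT)
print(stamp)  # 2023-05-17 09:30:00

# Parsing with the same format restores the original value (to second precision).
parsed = datetime.strptime(stamp, LABEL_DATETIME_FORMAT)
print(parsed == moment)  # True
```

Note that the format carries no time zone or sub-second information, so both are lost if the source `datetime` has them.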

## Serialization & Interoperability

### Field Mapping

`Document` supports field mapping, which renames attributes during serialization and deserialization:

```python { .api }
from haystack.schema import Document

document = Document(content="Haystack overview", score=0.95)

# Custom field names for external systems (external name -> Document attribute)
field_map = {"custom_content_field": "content", "custom_score": "score"}

# Serialize with custom field names
doc_dict = document.to_dict(field_map=field_map)
# Result: {"custom_content_field": "...", "custom_score": 0.95, ...}

# Deserialize a dictionary that uses the custom field names
external_dict = {"custom_content_field": "External text", "custom_score": 0.42}
restored_doc = Document.from_dict(external_dict, field_map=field_map)
```
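
To make the direction of the mapping concrete, here is a small sketch on plain dictionaries. It assumes, as in the example above, that `field_map` maps external names to standard attribute names. The helpers `apply_field_map` and `undo_field_map` are hypothetical and do not reproduce Haystack's internals.

```python
from typing import Any, Dict


def apply_field_map(record: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename standard keys to their external names (serialization direction)."""
    inverse = {standard: external for external, standard in field_map.items()}
    return {inverse.get(key, key): value for key, value in record.items()}


def undo_field_map(record: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename external keys back to standard names (deserialization direction)."""
    return {field_map.get(key, key): value for key, value in record.items()}


field_map = {"custom_content_field": "content", "custom_score": "score"}
standard = {"content": "Haystack overview", "score": 0.95, "meta": {}}

external = apply_field_map(standard, field_map)
print(external)
# {'custom_content_field': 'Haystack overview', 'custom_score': 0.95, 'meta': {}}

print(undo_field_map(external, field_map) == standard)  # True
```

Keys that do not appear in the map pass through unchanged in both directions, which keeps the round trip lossless for this simple case.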

### JSON Serialization

```python { .api }
# Document, Answer, and Label support JSON serialization
doc_json = document.to_json()
answer_json = answer.to_json()
label_json = label.to_json()

# And deserialization
doc = Document.from_json(doc_json)
answer = Answer.from_json(answer_json)
label = Label.from_json(label_json)
```

## Integration with Components

### Document Store Integration

```python { .api }
from haystack.document_stores import InMemoryDocumentStore
from haystack.schema import Document

document_store = InMemoryDocumentStore()

# Documents are stored and retrieved as Document objects
documents = [Document(content="Text 1"), Document(content="Text 2")]
document_store.write_documents(documents)

retrieved_docs = document_store.get_all_documents()
# Returns List[Document]
```

### Pipeline Integration

```python { .api }
from haystack import Pipeline

# Pipeline components work with standardized data structures.
# `retriever` and `reader` are assumed to be already-initialized retriever and reader nodes.
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

# Pipeline returns structured results
result = pipeline.run(query="What is Haystack?")
# result["answers"] contains List[Answer]
# result["documents"] contains List[Document]
```

## Validation & Error Handling

```python { .api }
from haystack.schema import Document

# Missing content is rejected when the Document is constructed
try:
    doc = Document(content=None)  # Raises ValueError
except ValueError as e:
    print(f"Validation error: {e}")

# content_type is expected to be one of the ContentTypes literals
try:
    doc = Document(content="text", content_type="invalid_type")  # Validation error
except ValueError as e:
    print(f"Validation error: {e}")
```
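
For readers who want to see what a `Literal`-based check amounts to, the sketch below validates a content type against the allowed values using only `typing`. It is an illustration under the assumption that rejection surfaces as a `ValueError`; `check_content_type` is a hypothetical helper, not a Haystack function.

```python
from typing import Literal, get_args

ContentTypes = Literal["text", "table", "image", "audio"]


def check_content_type(value: str) -> str:
    """Return value unchanged if it is an allowed content type, else raise ValueError."""
    allowed = get_args(ContentTypes)
    if value not in allowed:
        raise ValueError(f"content_type must be one of {allowed}, got {value!r}")
    return value


print(check_content_type("table"))  # table

try:
    check_content_type("invalid_type")
except ValueError as err:
    print(f"Validation error: {err}")
```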

These core data structures provide the foundation for all Haystack operations, ensuring consistent, type-safe data flow throughout the framework while supporting flexible serialization and integration patterns.