Tessl Tile for pypi/farm-haystack@1.26.0

# Core Schema & Data Structures

Haystack's core data structures form the foundation of the framework, providing standardized representations for documents, answers, labels, and evaluation results. These Pydantic dataclass-based structures ensure type safety and seamless serialization across all components.

## Core Imports

```python { .api }
from haystack.schema import Document, Answer, Label, MultiLabel, Span, TableCell, EvaluationResult
from haystack.schema import ContentTypes, FilterType, LABEL_DATETIME_FORMAT
```

## Document Class

The `Document` class is the primary data structure for representing content in Haystack.

### Document Definition

```python { .api }
# Abridged listing; in application code use: from haystack.schema import Document
from dataclasses import dataclass, field  # Haystack applies the Pydantic variant of this decorator
from typing import Union, Dict, Any, List, Optional, Literal

from numpy import ndarray
from pandas import DataFrame

ContentTypes = Literal["text", "table", "image", "audio"]

@dataclass
class Document:
    id: str
    content: Union[str, DataFrame]
    content_type: ContentTypes = "text"
    meta: Dict[str, Any] = field(default_factory=dict)  # empty dict by default
    id_hash_keys: List[str] = field(default_factory=lambda: ["content"])
    score: Optional[float] = None
    embedding: Optional[ndarray] = None

    def __init__(
        self,
        content: Union[str, DataFrame],
        content_type: ContentTypes = "text",
        id: Optional[str] = None,
        score: Optional[float] = None,
        meta: Optional[Dict[str, Any]] = None,
        embedding: Optional[ndarray] = None,
        id_hash_keys: Optional[List[str]] = None,
    ):
        """
        Creates a Document instance representing a piece of content.

        Args:
            content: The document content (text string or DataFrame for tables)
            content_type: One of "text", "table", "image", "audio"
            id: Unique identifier; auto-generated from content hash if None
            score: Relevance score [0,1] from retrieval/ranking models
            meta: Custom metadata dictionary
            embedding: Vector representation of the content
            id_hash_keys: Document attributes used for ID generation
        """
```
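
The `id` and `id_hash_keys` arguments describe deterministic IDs derived from selected attributes. The sketch below illustrates that idea with the standard library only. `make_id` is a hypothetical helper and SHA-256 is an assumption chosen for illustration, so the IDs it produces will not match the ones Haystack generates.

```python
import hashlib
from typing import Any, Dict, List, Optional


def make_id(content: str, meta: Dict[str, Any], id_hash_keys: Optional[List[str]] = None) -> str:
    """Derive a deterministic ID from the attributes named in id_hash_keys.

    Hypothetical helper for illustration; not Haystack's implementation.
    """
    keys = id_hash_keys or ["content"]
    parts = []
    for key in keys:
        if key == "content":
            parts.append(content)
        elif key.startswith("meta."):
            # "meta.url" selects meta["url"]; a missing key contributes an empty string
            parts.append(str(meta.get(key.split(".", 1)[1], "")))
        else:
            raise ValueError(f"Unsupported hash key: {key}")
    return hashlib.sha256(":".join(parts).encode("utf-8")).hexdigest()


same_text = "Content with metadata-based ID"
id_a = make_id(same_text, {"url": "https://example.com/page1"})
id_b = make_id(same_text, {"url": "https://example.com/page2"})
id_c = make_id(same_text, {"url": "https://example.com/page2"}, ["content", "meta.url"])

print(id_a == id_b)  # identical content with default keys gives the same ID
print(id_b == id_c)  # adding "meta.url" to the keys changes the ID
```

With the default keys, two documents with the same text collapse to one ID, which is why including a metadata key can matter when texts are not unique.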

### Document Methods

```python { .api }
# Serialization
document.to_dict(field_map: Optional[Dict[str, Any]] = None) -> Dict
document.to_json(field_map: Optional[Dict[str, Any]] = None) -> str

# Deserialization
Document.from_dict(dict: Dict[str, Any], field_map: Optional[Dict[str, Any]] = None) -> Document
Document.from_json(data: Union[str, Dict[str, Any]], field_map: Optional[Dict[str, Any]] = None) -> Document
```

### Document Usage Examples

```python { .api }
from haystack.schema import Document
import pandas as pd

# Text document
text_doc = Document(
    content="Haystack is a Python framework for building LLM applications.",
    meta={"source": "documentation", "author": "deepset"}
)

# Table document
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
table_doc = Document(
    content=df,
    content_type="table",
    meta={"source": "user_data.csv"}
)

# Document with custom ID generation
doc_with_meta_id = Document(
    content="Content with metadata-based ID",
    meta={"url": "https://example.com/page1"},
    id_hash_keys=["content", "meta.url"]
)

# Serialization
doc_dict = text_doc.to_dict()
doc_json = text_doc.to_json()
restored_doc = Document.from_dict(doc_dict)
```

## Answer Class

The `Answer` class represents answers from question-answering systems.

### Answer Definition

```python { .api }
# Abridged listing; in application code use: from haystack.schema import Answer
from dataclasses import dataclass  # Haystack applies the Pydantic variant of this decorator
from typing import List, Optional, Union, Dict, Any, Literal

from pandas import DataFrame
from haystack.schema import Span, TableCell

@dataclass
class Answer:
    """
    Creates an Answer instance from QA systems.

    Args:
        answer: The answer string (empty if no answer found)
        type: "extractive" (from document text), "generative" (LLM-generated), or "other"
        score: Confidence score [0,1] from the QA model
        context: Source context (text passage or table) used for the answer
        offsets_in_document: Character/cell positions in original document
        offsets_in_context: Character/cell positions in the context window
        document_ids: List of document IDs containing the answer
        meta: Additional metadata about the answer
    """
    answer: str
    type: Literal["generative", "extractive", "other"] = "extractive"
    score: Optional[float] = None
    context: Optional[Union[str, DataFrame]] = None
    offsets_in_document: Optional[Union[List[Span], List[TableCell]]] = None
    offsets_in_context: Optional[Union[List[Span], List[TableCell]]] = None
    document_ids: Optional[List[str]] = None
    meta: Optional[Dict[str, Any]] = None
```

### Answer Usage Examples

```python { .api }
from haystack.schema import Answer, Span, TableCell

# Extractive answer
context = "Haystack is a Python framework for building LLM applications."
extractive_answer = Answer(
    answer="Python framework",
    type="extractive",
    score=0.95,
    context=context,
    # context[14:30] == "Python framework"
    offsets_in_document=[Span(start=14, end=30)],
    offsets_in_context=[Span(start=14, end=30)],
    document_ids=["doc123"],
    meta={"model": "bert-base-uncased-qa"}
)

# Generative answer
generative_answer = Answer(
    answer="Haystack enables developers to build production-ready LLM applications with modular components.",
    type="generative",
    score=0.88,
    document_ids=["doc123", "doc124", "doc125"],
    meta={"model": "gpt-3.5-turbo", "tokens_used": 45}
)

# Table-based answer
table_answer = Answer(
    answer="25",
    type="extractive",
    offsets_in_document=[TableCell(row=0, col=1)],
    document_ids=["table_doc_1"]
)
```
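
Offsets are easiest to get right when they are checked against the text they index. The sketch below computes a span from the answer string instead of typing the numbers by hand. It uses a stand-in `Span` dataclass so it runs without Haystack installed, `find_span` is a hypothetical helper, and it assumes Python slice semantics in which `end` is exclusive.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for haystack.schema.Span so the sketch runs on its own."""
    start: int
    end: int


def find_span(text: str, answer: str) -> Span:
    """Locate the first occurrence of answer in text (hypothetical helper)."""
    start = text.find(answer)
    if start < 0:
        raise ValueError(f"{answer!r} does not occur in the text")
    return Span(start=start, end=start + len(answer))


context = "Haystack is a Python framework for building LLM applications."
span = find_span(context, "Python framework")

# Slicing the context with the offsets should give back the answer string
recovered = context[span.start:span.end]
print(span, recovered)
```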

## Label Class

The `Label` class represents training and evaluation labels for supervised learning.

### Label Definition

```python { .api }
# Abridged listing; in application code use: from haystack.schema import Label
from dataclasses import dataclass  # Haystack applies the Pydantic variant of this decorator
from typing import Optional, Dict, Any, Literal

from haystack.schema import Document, Answer

@dataclass
class Label:
    id: str
    query: str
    document: Document
    is_correct_answer: bool
    is_correct_document: bool
    origin: Literal["user-feedback", "gold-label"]
    answer: Optional[Answer] = None
    pipeline_id: Optional[str] = None
    created_at: Optional[str] = None
    updated_at: Optional[str] = None
    meta: Optional[Dict[str, Any]] = None
    filters: Optional[Dict[str, Any]] = None

    def __init__(
        self,
        query: str,
        document: Document,
        is_correct_answer: bool,
        is_correct_document: bool,
        origin: Literal["user-feedback", "gold-label"],
        answer: Optional[Answer] = None,
        id: Optional[str] = None,
        pipeline_id: Optional[str] = None,
        created_at: Optional[str] = None,
        updated_at: Optional[str] = None,
        meta: Optional[Dict[str, Any]] = None,
        filters: Optional[Dict[str, Any]] = None,
    ):
        """
        Creates a Label for training/evaluation.

        Args:
            query: The question or query text
            document: Document containing the answer
            is_correct_answer: Whether the provided answer is correct
            is_correct_document: Whether the document is relevant
            origin: "user-feedback" (human annotation) or "gold-label" (reference data)
            answer: Optional Answer object with correct answer
            id: Unique label identifier
            pipeline_id: ID of pipeline that generated this label
            created_at: Creation timestamp in LABEL_DATETIME_FORMAT ("%Y-%m-%d %H:%M:%S")
            updated_at: Last update timestamp in LABEL_DATETIME_FORMAT ("%Y-%m-%d %H:%M:%S")
            meta: Additional metadata
            filters: Document store filters applied during labeling
        """
```

### Label Usage Examples

```python { .api }
from haystack.schema import Label, Document, Answer
from datetime import datetime

# Create training label
training_doc = Document(content="The capital of France is Paris.")
training_label = Label(
    query="What is the capital of France?",
    document=training_doc,
    is_correct_answer=True,
    is_correct_document=True,
    origin="gold-label",
    answer=Answer(answer="Paris", type="extractive"),
    meta={"dataset": "squad", "difficulty": "easy"}
)

# User feedback label
feedback_label = Label(
    query="How does Haystack work?",
    document=Document(content="Haystack uses modular components..."),
    is_correct_answer=False,
    is_correct_document=True,
    origin="user-feedback",
    created_at=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    meta={"user_id": "user123", "feedback_type": "incorrect_answer"}
)
```

## Supporting Classes

### Span Class

```python { .api }
from dataclasses import dataclass  # Haystack applies the Pydantic variant of this decorator

# Abridged listing of haystack.schema.Span
@dataclass
class Span:
    start: int
    end: int

    def __contains__(self, value) -> bool:
        """Check if a value or span is contained within this span."""

# Usage: import the real class so the abridged listing above is not the one exercised
from haystack.schema import Span

span = Span(start=10, end=20)
assert 15 in span  # value is in range
assert Span(12, 18) in span  # span is fully contained
assert 25 not in span  # value is outside the range
```
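
To make the containment check concrete, here is a minimal stand-in written with the standard library. `HalfOpenSpan` is a hypothetical class, not `haystack.schema.Span`, and its edge handling (start included, end excluded for single values) is an assumption; check the installed version for the exact behaviour at the boundaries.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class HalfOpenSpan:
    """Illustrative stand-in, not haystack.schema.Span."""
    start: int
    end: int

    def __contains__(self, value: Union[int, float, "HalfOpenSpan"]) -> bool:
        if isinstance(value, HalfOpenSpan):
            # A span is contained when it lies entirely inside this one
            return self.start <= value.start and value.end <= self.end
        # Single values: start is included, end is excluded
        return self.start <= value < self.end


outer = HalfOpenSpan(start=10, end=20)
checks = {
    "15": 15 in outer,
    "20": 20 in outer,
    "inner span": HalfOpenSpan(12, 18) in outer,
    "overlapping span": HalfOpenSpan(15, 25) in outer,
}
print(checks)
```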

### TableCell Class

```python { .api }
from dataclasses import dataclass  # Haystack applies the Pydantic variant of this decorator

# Abridged listing of haystack.schema.TableCell
@dataclass
class TableCell:
    row: int
    col: int

# Usage
cell = TableCell(row=2, col=3)  # Third row, fourth column (0-indexed)
```
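
A cell reference only becomes useful once it is resolved against table data. The sketch below does that with plain Python lists so it runs without pandas or Haystack. The `TableCell` class is a stand-in, `cell_value` is a hypothetical helper, and treating row 0 as the first data row (header excluded) is an assumption that matches the table answer example earlier in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableCell:
    """Stand-in for haystack.schema.TableCell."""
    row: int
    col: int


header = ["Name", "Age"]
rows: List[List[str]] = [["Alice", "25"], ["Bob", "30"]]


def cell_value(cell: TableCell) -> str:
    """Resolve a zero-indexed cell against the data rows (header excluded)."""
    return rows[cell.row][cell.col]


cell = TableCell(row=0, col=1)
value = cell_value(cell)
print(f"{header[cell.col]} = {value}")
```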

### MultiLabel Class

```python { .api }
from typing import List
from haystack.schema import Label

# Abridged listing of haystack.schema.MultiLabel
class MultiLabel:
    # Methods for label aggregation and evaluation
    labels: List[Label]

    def __init__(self, labels: List[Label]):
        """Container for multiple labels, typically for multi-answer questions."""

# Usage: label1, label2, and label3 are existing Label objects for the same query
from haystack.schema import MultiLabel

multi_label = MultiLabel([label1, label2, label3])
```
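
Before building a container of labels, the labels usually have to be grouped by the question they belong to. The sketch below shows one way to do that grouping with the standard library. `SimpleLabel` and `group_by_query` are hypothetical names used only for illustration; each resulting group would be a natural input for a `MultiLabel`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SimpleLabel:
    """Minimal stand-in for haystack.schema.Label with only two fields."""
    query: str
    answer: str


def group_by_query(labels: List[SimpleLabel]) -> Dict[str, List[SimpleLabel]]:
    """Collect labels that share a query, preserving input order."""
    grouped: Dict[str, List[SimpleLabel]] = defaultdict(list)
    for label in labels:
        grouped[label.query].append(label)
    return dict(grouped)


labels = [
    SimpleLabel("What is the capital of France?", "Paris"),
    SimpleLabel("How does Haystack work?", "With modular components"),
    SimpleLabel("What is the capital of France?", "The city of Paris"),
]
groups = group_by_query(labels)
for query, members in groups.items():
    print(query, [m.answer for m in members])
```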

### EvaluationResult Class

```python { .api }
# Abridged listing; in application code use: from haystack.schema import EvaluationResult
from typing import Dict, List

class EvaluationResult:
    def __init__(self):
        """Container for evaluation metrics and results."""

    # Evaluation metrics and analysis methods
    def calculate_metrics(self, predictions: List, labels: List) -> Dict[str, float]: ...
    def print_metrics(self) -> None: ...
```

## Type Definitions

### Core Types

```python { .api }
from typing import Literal, Dict, Union, List, Any

# Content types supported by Document
ContentTypes = Literal["text", "table", "image", "audio"]

# Filter type for document stores
FilterType = Dict[str, Union[Dict[str, Any], List[Any], str, int, float, bool]]

# Date format constant
LABEL_DATETIME_FORMAT: str = "%Y-%m-%d %H:%M:%S"
```
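
The date format constant is what label timestamps such as `created_at` are expected to follow. As a small illustration, the sketch below formats and parses a timestamp with that format string using only the standard library. The constant is redefined locally so the example runs without Haystack, and the sample date is arbitrary.

```python
from datetime import datetime

# Same format string as the LABEL_DATETIME_FORMAT constant listed above
LABEL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

moment = datetime(2024, 1, 15, 9, 30, 0)

# Format a datetime as a label timestamp string, then parse it back
stamp = moment.strftime(LABEL_DATETIME_FORMAT)
parsed = datetime.strptime(stamp, LABEL_DATETIME_FORMAT)

print(stamp)
print(parsed == moment)
```

Note that the format has second resolution, so a `datetime` carrying microseconds would not survive the round trip unchanged.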

## Serialization & Interoperability

### Field Mapping

The `Document` serialization methods accept a `field_map` argument that renames fields for external systems. Keys are the external field names and values are the `Document` attribute names:

```python { .api }
from haystack.schema import Document

document = Document(content="Haystack is a Python framework.", score=0.95)

# Custom field names for external systems
field_map = {"custom_content_field": "content", "custom_score": "score"}

# Serialize with custom field names
doc_dict = document.to_dict(field_map=field_map)
# Result: {"custom_content_field": "...", "custom_score": 0.95, ...}

# Deserialize a dict that uses the same custom field names
restored_doc = Document.from_dict(doc_dict, field_map=field_map)
```
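
The renaming itself is a plain dictionary transformation. The sketch below reproduces the idea on ordinary dicts so the direction of the mapping is easy to see. `apply_field_map` and `revert_field_map` are hypothetical helpers written for this illustration, not Haystack functions.

```python
from typing import Any, Dict


def apply_field_map(record: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename internal field names to external ones.

    field_map maps external name -> internal name, so it is inverted first.
    """
    internal_to_external = {internal: external for external, internal in field_map.items()}
    return {internal_to_external.get(key, key): value for key, value in record.items()}


def revert_field_map(record: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename external field names back to internal ones."""
    return {field_map.get(key, key): value for key, value in record.items()}


field_map = {"custom_content_field": "content", "custom_score": "score"}
internal = {"content": "Text 1", "score": 0.95, "meta": {}}

external = apply_field_map(internal, field_map)
round_trip = revert_field_map(external, field_map)
print(external)
```

Fields that are not named in the map, such as `meta` here, pass through unchanged in both directions.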

### JSON Serialization

```python { .api }
from haystack.schema import Document, Answer, Label

# Document, Answer, and Label support JSON serialization
# (document, answer, and label are existing instances)
doc_json = document.to_json()
answer_json = answer.to_json()
label_json = label.to_json()

# And deserialization
doc = Document.from_json(doc_json)
answer = Answer.from_json(answer_json)
label = Label.from_json(label_json)
```

## Integration with Components

### Document Store Integration

```python { .api }
from haystack.document_stores import InMemoryDocumentStore
from haystack.schema import Document

document_store = InMemoryDocumentStore()

# Documents are stored and retrieved as Document objects
documents = [Document(content="Text 1"), Document(content="Text 2")]
document_store.write_documents(documents)

retrieved_docs = document_store.get_all_documents()
# Returns List[Document]
```

### Pipeline Integration

```python { .api }
from haystack import Pipeline

# Pipeline components work with standardized data structures
# (retriever and reader are components initialized beforehand)
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

# Pipeline returns structured results
result = pipeline.run(query="What is Haystack?")
# result["answers"] contains List[Answer]
# result["documents"] contains List[Document]
```
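
Because the result is a dictionary of typed objects, post-processing is ordinary Python. As one illustration, the sketch below filters answers by score and sorts them. `SimpleAnswer` is a stand-in carrying only the two `Answer` fields the example needs, and `confident_answers`, the sample scores, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimpleAnswer:
    """Stand-in with the two Answer fields this sketch needs."""
    answer: str
    score: Optional[float] = None


def confident_answers(answers: List[SimpleAnswer], threshold: float) -> List[SimpleAnswer]:
    """Keep answers whose score is set and at least `threshold`, best first."""
    kept = [a for a in answers if a.score is not None and a.score >= threshold]
    return sorted(kept, key=lambda a: a.score, reverse=True)


# Shaped like a pipeline result, with made-up values
result = {
    "answers": [
        SimpleAnswer("a framework", 0.41),
        SimpleAnswer("a Python framework", 0.93),
        SimpleAnswer("", None),
    ]
}
top = confident_answers(result["answers"], threshold=0.5)
print([a.answer for a in top])
```

The `score is not None` guard matters because `score` is optional on `Answer`.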

## Validation & Error Handling

```python { .api }
from haystack.schema import Document

# Pydantic validation ensures type safety; a missing content value raises ValueError
try:
    doc = Document(content=None)
except ValueError as e:
    print(f"Validation error: {e}")

# content_type is expected to be one of the ContentTypes values;
# an unsupported value is reported as a validation error
try:
    doc = Document(content="text", content_type="invalid_type")
except ValueError as e:
    print(f"Validation error: {e}")
```
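
When content types arrive from user input or configuration, it can help to check them before constructing a document. The sketch below validates a value against the `ContentTypes` literal using only the standard library. `check_content_type` is a hypothetical helper, not part of Haystack, and the literal is redefined locally so the example is self-contained.

```python
from typing import Literal, get_args

ContentTypes = Literal["text", "table", "image", "audio"]


def check_content_type(value: str) -> str:
    """Return value unchanged if it is an allowed content type, else raise."""
    allowed = get_args(ContentTypes)
    if value not in allowed:
        raise ValueError(f"content_type must be one of {allowed}, got {value!r}")
    return value


accepted = check_content_type("table")

try:
    check_content_type("invalid_type")
    rejected = False
except ValueError as error:
    rejected = True
    print(f"Validation error: {error}")
```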

These core data structures provide the foundation for all Haystack operations, ensuring consistent, type-safe data flow throughout the framework while supporting flexible serialization and integration patterns.
