# Core Schema & Data Structures

Haystack's core data structures form the foundation of the framework, providing standardized representations for documents, answers, labels, and evaluation results. Built on Pydantic dataclasses, they give every component type-checked fields and a consistent serialization format.

## Core Imports

```python { .api }
from haystack.schema import Document, Answer, Label, MultiLabel, Span, TableCell, EvaluationResult
from haystack.schema import ContentTypes, FilterType, LABEL_DATETIME_FORMAT
```

## Document Class

The `Document` class is the primary data structure for representing content in Haystack.

### Document Definition

```python { .api }
from haystack.schema import Document
from pandas import DataFrame
from numpy import ndarray
from pydantic.dataclasses import dataclass
from typing import Union, Dict, Any, List, Optional, Literal

ContentTypes = Literal["text", "table", "image", "audio"]

@dataclass
class Document:
    id: str
    content: Union[str, DataFrame]
    content_type: ContentTypes = "text"
    meta: Dict[str, Any] = {}
    id_hash_keys: List[str] = ["content"]
    score: Optional[float] = None
    embedding: Optional[ndarray] = None

    def __init__(
        self,
        content: Union[str, DataFrame],
        content_type: ContentTypes = "text",
        id: Optional[str] = None,
        score: Optional[float] = None,
        meta: Optional[Dict[str, Any]] = None,
        embedding: Optional[ndarray] = None,
        id_hash_keys: Optional[List[str]] = None,
    ):
        """
        Creates a Document instance representing a piece of content.

        Args:
            content: The document content (text string or DataFrame for tables)
            content_type: One of "text", "table", "image", "audio"
            id: Unique identifier; auto-generated from content hash if None
            score: Relevance score [0,1] from retrieval/ranking models
            meta: Custom metadata dictionary
            embedding: Vector representation of the content
            id_hash_keys: Document attributes used for ID generation
        """
```

### Document Methods

```python { .api }
# Serialization
document.to_dict(field_map: Optional[Dict[str, Any]] = None) -> Dict
document.to_json(field_map: Optional[Dict[str, Any]] = None) -> str

# Deserialization
Document.from_dict(dict: Dict[str, Any], field_map: Optional[Dict[str, Any]] = None) -> Document
Document.from_json(data: Union[str, Dict[str, Any]], field_map: Optional[Dict[str, Any]] = None) -> Document
```

### Document Usage Examples

```python { .api }
from haystack.schema import Document
import pandas as pd

# Text document
text_doc = Document(
    content="Haystack is a Python framework for building LLM applications.",
    meta={"source": "documentation", "author": "deepset"}
)

# Table document
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
table_doc = Document(
    content=df,
    content_type="table",
    meta={"source": "user_data.csv"}
)

# Document with custom ID generation
doc_with_meta_id = Document(
    content="Content with metadata-based ID",
    meta={"url": "https://example.com/page1"},
    id_hash_keys=["content", "meta.url"]
)

# Serialization
doc_dict = text_doc.to_dict()
doc_json = text_doc.to_json()
restored_doc = Document.from_dict(doc_dict)
```
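
To illustrate how `id_hash_keys` can influence an auto-generated ID, here is a minimal standard-library sketch. `make_id` is a hypothetical helper, and SHA-256 is chosen purely for illustration; it is an assumption and not necessarily the hash Haystack applies internally.

```python
import hashlib
from typing import Any, Dict, List, Optional


def make_id(content: str, meta: Dict[str, Any], id_hash_keys: Optional[List[str]] = None) -> str:
    """Hypothetical helper: derive a deterministic ID from selected attributes."""
    keys = id_hash_keys or ["content"]
    parts = []
    for key in keys:
        if key == "content":
            parts.append(content)
        elif key.startswith("meta."):
            # Dot notation selects a single metadata field, e.g. "meta.url"
            parts.append(str(meta.get(key[len("meta."):])))
        else:
            raise ValueError(f"Unsupported hash key: {key}")
    return hashlib.sha256(":".join(parts).encode("utf-8")).hexdigest()


text = "Content with metadata-based ID"
id_a = make_id(text, {"url": "https://example.com/page1"}, ["content", "meta.url"])
id_b = make_id(text, {"url": "https://example.com/page2"}, ["content", "meta.url"])
id_c = make_id(text, {"url": "https://example.com/page2"})  # content only

print(id_a != id_b)                # True: the URL contributes to the ID
print(id_c == make_id(text, {}))   # True: metadata is ignored by default
```

Including a metadata field in the hash keys is useful when several documents share identical text but come from different sources.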

## Answer Class

The `Answer` class represents answers from question-answering systems.

### Answer Definition

```python { .api }
from haystack.schema import Answer, Span, TableCell
from pandas import DataFrame
from pydantic.dataclasses import dataclass
from typing import List, Optional, Union, Dict, Any, Literal

@dataclass
class Answer:
    answer: str
    type: Literal["generative", "extractive", "other"] = "extractive"
    score: Optional[float] = None
    context: Optional[Union[str, DataFrame]] = None
    offsets_in_document: Optional[Union[List[Span], List[TableCell]]] = None
    offsets_in_context: Optional[Union[List[Span], List[TableCell]]] = None
    document_ids: Optional[List[str]] = None
    meta: Optional[Dict[str, Any]] = None

    """
    Creates an Answer instance from QA systems.

    Args:
        answer: The answer string (empty if no answer found)
        type: "extractive" (from document text), "generative" (LLM-generated), or "other"
        score: Confidence score [0,1] from the QA model
        context: Source context (text passage or table) used for the answer
        offsets_in_document: Character/cell positions in original document
        offsets_in_context: Character/cell positions in the context window
        document_ids: List of document IDs containing the answer
        meta: Additional metadata about the answer
    """
```

### Answer Usage Examples

```python { .api }
from haystack.schema import Answer, Span, TableCell

# Extractive answer
# "Python framework" occupies characters 14-29 of the context, so the span is [14, 30)
extractive_answer = Answer(
    answer="Python framework",
    type="extractive",
    score=0.95,
    context="Haystack is a Python framework for building LLM applications.",
    offsets_in_document=[Span(start=14, end=30)],
    offsets_in_context=[Span(start=14, end=30)],
    document_ids=["doc123"],
    meta={"model": "bert-base-uncased-qa"}
)

# Generative answer
generative_answer = Answer(
    answer="Haystack enables developers to build production-ready LLM applications with modular components.",
    type="generative",
    score=0.88,
    document_ids=["doc123", "doc124", "doc125"],
    meta={"model": "gpt-3.5-turbo", "tokens_used": 45}
)

# Table-based answer
table_answer = Answer(
    answer="25",
    type="extractive",
    offsets_in_document=[TableCell(row=0, col=1)],
    document_ids=["table_doc_1"]
)
```
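
Offsets for an extractive answer can be derived and checked by slicing the context rather than counting characters by hand. The sketch below uses a stand-in `Span` dataclass so it runs without Haystack installed, and assumes offsets are character indexes with an exclusive `end`; `slice_answer` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for haystack.schema.Span (assumption: end is exclusive)."""
    start: int
    end: int


def slice_answer(context: str, span: Span) -> str:
    """Hypothetical helper: return the text covered by a span."""
    return context[span.start:span.end]


context = "Haystack is a Python framework for building LLM applications."
answer = "Python framework"

# Locate the answer in the context to derive its offsets
start = context.find(answer)
span = Span(start=start, end=start + len(answer))

print(span)                                    # Span(start=14, end=30)
print(slice_answer(context, span) == answer)   # True
```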

## Label Class

The `Label` class represents training and evaluation labels for supervised learning.

### Label Definition

```python { .api }
from haystack.schema import Label, Document, Answer
from pydantic.dataclasses import dataclass
from typing import Optional, Dict, Any, Literal

@dataclass
class Label:
    id: str
    query: str
    document: Document
    is_correct_answer: bool
    is_correct_document: bool
    origin: Literal["user-feedback", "gold-label"]
    answer: Optional[Answer] = None
    pipeline_id: Optional[str] = None
    created_at: Optional[str] = None
    updated_at: Optional[str] = None
    meta: Optional[Dict[str, Any]] = None
    filters: Optional[Dict[str, Any]] = None

    def __init__(
        self,
        query: str,
        document: Document,
        is_correct_answer: bool,
        is_correct_document: bool,
        origin: Literal["user-feedback", "gold-label"],
        answer: Optional[Answer] = None,
        id: Optional[str] = None,
        pipeline_id: Optional[str] = None,
        created_at: Optional[str] = None,
        updated_at: Optional[str] = None,
        meta: Optional[Dict[str, Any]] = None,
        filters: Optional[Dict[str, Any]] = None,
    ):
        """
        Creates a Label for training/evaluation.

        Args:
            query: The question or query text
            document: Document containing the answer
            is_correct_answer: Whether the provided answer is correct
            is_correct_document: Whether the document is relevant
            origin: "user-feedback" (human annotation) or "gold-label" (reference data)
            answer: Optional Answer object with correct answer
            id: Unique label identifier
            pipeline_id: ID of pipeline that generated this label
            created_at: Creation timestamp in LABEL_DATETIME_FORMAT ("%Y-%m-%d %H:%M:%S")
            updated_at: Last update timestamp in LABEL_DATETIME_FORMAT ("%Y-%m-%d %H:%M:%S")
            meta: Additional metadata
            filters: Document store filters applied during labeling
        """
```

### Label Usage Examples

```python { .api }
from haystack.schema import Label, Document, Answer
from datetime import datetime

# Create training label
training_doc = Document(content="The capital of France is Paris.")
training_label = Label(
    query="What is the capital of France?",
    document=training_doc,
    is_correct_answer=True,
    is_correct_document=True,
    origin="gold-label",
    answer=Answer(answer="Paris", type="extractive"),
    meta={"dataset": "squad", "difficulty": "easy"}
)

# User feedback label
feedback_label = Label(
    query="How does Haystack work?",
    document=Document(content="Haystack uses modular components..."),
    is_correct_answer=False,
    is_correct_document=True,
    origin="user-feedback",
    created_at=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    meta={"user_id": "user123", "feedback_type": "incorrect_answer"}
)
```

## Supporting Classes

### Span Class

```python { .api }
from haystack.schema import Span
from pydantic.dataclasses import dataclass

@dataclass
class Span:
    start: int
    end: int

    def __contains__(self, value) -> bool:
        """Check if a value or span is contained within this span."""

# Usage
span = Span(start=10, end=20)
assert 15 in span  # True - value is in range
assert Span(12, 18) in span  # True - span is fully contained
assert 25 not in span  # 25 is outside the range
```

### TableCell Class

```python { .api }
from haystack.schema import TableCell
from pydantic.dataclasses import dataclass

@dataclass
class TableCell:
    row: int
    col: int

# Usage
cell = TableCell(row=2, col=3)  # Third row, fourth column (0-indexed)
```
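
As a rough illustration of how a cell reference resolves to a value, the sketch below looks up cells in a table stored as a plain list of rows. `TableCell` here is a stand-in dataclass so the example runs without Haystack, `cell_value` is a hypothetical helper, and treating the header as separate from row 0 is an assumption.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class TableCell:
    """Stand-in for haystack.schema.TableCell."""
    row: int
    col: int


def cell_value(rows: List[List[Any]], cell: TableCell) -> Any:
    """Hypothetical helper: look up a cell in a table stored as a list of rows."""
    return rows[cell.row][cell.col]


# Data rows only; the header ("Name", "Age") is assumed not to count as row 0
rows = [["Alice", 25], ["Bob", 30]]

print(cell_value(rows, TableCell(row=0, col=1)))  # 25
print(cell_value(rows, TableCell(row=1, col=0)))  # Bob
```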

### MultiLabel Class

```python { .api }
from typing import List

from haystack.schema import MultiLabel, Label

class MultiLabel:
    def __init__(self, labels: List[Label]):
        """Container for multiple labels, typically for multi-answer questions."""

    # The wrapped labels, used for aggregation and evaluation
    labels: List[Label]

# Usage (label1, label2, and label3 are previously created Label objects)
multi_label = MultiLabel([label1, label2, label3])
```

### EvaluationResult Class

```python { .api }
from typing import Dict, List

from haystack.schema import EvaluationResult

class EvaluationResult:
    def __init__(self):
        """Container for evaluation metrics and results."""

    # Evaluation metrics and analysis methods
    # (simplified signatures; check the installed Haystack version for the exact parameters)
    def calculate_metrics(self, predictions: List, labels: List) -> Dict[str, float]: ...
    def print_metrics(self) -> None: ...
```

## Type Definitions

### Core Types

```python { .api }
from typing import Literal, Dict, Union, List, Any

# Content types supported by Document
ContentTypes = Literal["text", "table", "image", "audio"]

# Filter type for document stores
FilterType = Dict[str, Union[Dict[str, Any], List[Any], str, int, float, bool]]

# Date format constant
LABEL_DATETIME_FORMAT: str = "%Y-%m-%d %H:%M:%S"
```
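
The date format constant can be used with the standard `datetime` module to produce and parse label timestamps. The sketch below redefines the constant locally, using the value listed above, so that it runs without Haystack installed; the sample date is arbitrary.

```python
from datetime import datetime

# Same value as the constant listed above, redefined locally for a standalone sketch
LABEL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

created = datetime(2024, 1, 15, 9, 30, 0)

# Format a timestamp for a label's created_at / updated_at field
created_at = created.strftime(LABEL_DATETIME_FORMAT)

# Parse it back into a datetime object
parsed = datetime.strptime(created_at, LABEL_DATETIME_FORMAT)

print(created_at)          # 2024-01-15 09:30:00
print(parsed == created)   # True
```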

## Serialization & Interoperability

### Field Mapping

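`Document` supports field mapping for custom serialization. Its `to_dict`, `to_json`, `from_dict`, and `from_json` methods each accept a `field_map` whose keys are the custom field names and whose values are the standard ones: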

```python { .api }
from haystack.schema import Document

document = Document(content="Haystack is a Python framework.", score=0.95)

# Custom field names for external systems
field_map = {"custom_content_field": "content", "custom_score": "score"}

# Serialize with custom field names
doc_dict = document.to_dict(field_map=field_map)
# Result: {"custom_content_field": "...", "custom_score": 0.95, ...}

# Deserialize with custom field names
# (external_dict is a dict from the external system that uses the custom names)
restored_doc = Document.from_dict(external_dict, field_map=field_map)
```
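
The renaming that a field map describes can be pictured with plain dictionaries. The following sketch is not Haystack's implementation; `rename_keys` and `restore_keys` are hypothetical helpers that assume the map goes from custom name to standard name, as in the example above.

```python
from typing import Any, Dict


def rename_keys(data: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Hypothetical helper: replace standard keys with their custom names."""
    inverse = {standard: custom for custom, standard in field_map.items()}
    return {inverse.get(key, key): value for key, value in data.items()}


def restore_keys(data: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Hypothetical helper: map custom names back to the standard keys."""
    return {field_map.get(key, key): value for key, value in data.items()}


field_map = {"custom_content_field": "content", "custom_score": "score"}
standard = {"content": "some text", "score": 0.95, "content_type": "text"}

external = rename_keys(standard, field_map)
print(external)
print(restore_keys(external, field_map) == standard)  # True
```

Keys that do not appear in the map, such as `content_type` here, pass through unchanged in both directions.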

### JSON Serialization

```python { .api }
from haystack.schema import Document, Answer, Label

# Document, Answer, and Label support JSON serialization
# (document, answer, and label are existing instances)
doc_json = document.to_json()
answer_json = answer.to_json()
label_json = label.to_json()

# And deserialization
doc = Document.from_json(doc_json)
answer = Answer.from_json(answer_json)
label = Label.from_json(label_json)
```

## Integration with Components

### Document Store Integration

```python { .api }
from haystack.document_stores import InMemoryDocumentStore
from haystack.schema import Document

document_store = InMemoryDocumentStore()

# Documents are stored and retrieved as Document objects
documents = [Document(content="Text 1"), Document(content="Text 2")]
document_store.write_documents(documents)

retrieved_docs = document_store.get_all_documents()
# Returns List[Document]
```

### Pipeline Integration

```python { .api }
from haystack import Pipeline

# Pipeline components work with standardized data structures
# (retriever and reader are previously initialized components)
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

# Pipeline returns structured results
result = pipeline.run(query="What is Haystack?")
# result["answers"] contains List[Answer]
# result["documents"] contains List[Document]
```

## Validation & Error Handling

```python { .api }
from haystack.schema import Document

# Pydantic validation ensures type safety
try:
    doc = Document(content=None)  # Raises ValueError
except ValueError as e:
    print(f"Validation error: {e}")

# Proper content types are enforced
try:
    doc = Document(content="text", content_type="invalid_type")  # Validation error
except ValueError as e:
    print(f"Validation error: {e}")
```

These core data structures provide the foundation for all Haystack operations, ensuring consistent, type-safe data flow throughout the framework while supporting flexible serialization and integration patterns.