# Langfuse

A comprehensive Python SDK for AI application observability and experimentation. Langfuse provides automatic tracing of LLM applications, experiment management with evaluation capabilities, dataset handling, and prompt template management, all built on OpenTelemetry standards for seamless integration with existing observability infrastructure.

## Package Information

- **Package Name**: langfuse
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install langfuse`
- **Version**: 3.7.0
- **License**: MIT

## Core Imports

```python
from langfuse import Langfuse, observe, get_client
```

For specialized functionality:

```python
# Experiment system
from langfuse import Evaluation

# Span types
from langfuse import (
    LangfuseSpan, LangfuseGeneration, LangfuseEvent,
    LangfuseAgent, LangfuseTool, LangfuseChain,
)

# OpenAI integration (drop-in replacement)
from langfuse.openai import OpenAI, AsyncOpenAI

# LangChain integration
from langfuse.langchain import CallbackHandler
```

## Basic Usage

```python
from langfuse import Langfuse, observe, Evaluation
from langfuse.openai import openai  # drop-in OpenAI wrapper; calls are traced automatically

# Initialize client
langfuse = Langfuse(
    public_key="your-public-key",
    secret_key="your-secret-key",
    host="https://cloud.langfuse.com"  # or your self-hosted URL
)

# Simple tracing with decorator
@observe(as_type="generation")
def generate_response(prompt: str) -> str:
    # Your LLM call here
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Manual span creation
with langfuse.start_as_current_span(name="process-query") as span:
    result = process_data()  # your application logic
    span.update(output=result)
    span.score(name="accuracy", value=0.95)

# Experiments with evaluators
def accuracy_evaluator(*, input, output, expected_output=None, **kwargs):
    is_correct = (
        expected_output is not None
        and output.strip().lower() == expected_output.strip().lower()
    )
    return Evaluation(
        name="accuracy",
        value=1.0 if is_correct else 0.0,
        comment="Exact match" if is_correct else "No match"
    )

# Task functions receive each dataset item as the `item` keyword argument
def capital_task(*, item, **kwargs):
    return generate_response(item["input"])

result = langfuse.run_experiment(
    name="Capital Cities Test",
    data=[{"input": "Capital of France?", "expected_output": "Paris"}],
    task=capital_task,
    evaluators=[accuracy_evaluator]
)
```

## Architecture

Langfuse is built around four core concepts that work together to provide comprehensive observability:

### Tracing Foundation
Built on **OpenTelemetry**, providing industry-standard distributed tracing with hierarchical span relationships. Every operation creates spans that capture timing, inputs, outputs, and metadata, enabling detailed performance analysis and debugging.

### Observation Types
**Specialized span types** for AI/LLM applications including generations (for model calls), agents (for reasoning), tools (for external calls), chains (for workflows), and evaluators (for quality assessment). Each type captures relevant metadata and provides appropriate visualizations.

### Automatic Instrumentation
**Decorator-based tracing** with the `@observe` decorator automatically instruments Python functions, supporting both synchronous and asynchronous operations with proper context propagation and error handling.

### Experiment Framework
**Built-in experimentation** system for running evaluations on datasets with automatic tracing, supporting both local data and Langfuse-managed datasets with comprehensive result formatting and analysis.

## Capabilities

### Core Tracing and Observability

Fundamental tracing functionality for instrumenting AI applications with automatic span creation, context propagation, and detailed performance monitoring.

```python { .api }
class Langfuse:
    def start_span(self, name: str, **kwargs) -> LangfuseSpan: ...
    def start_as_current_span(self, *, name: str, **kwargs) -> ContextManager[LangfuseSpan]: ...
    def start_observation(self, *, name: str, as_type: str, **kwargs) -> Union[LangfuseSpan, LangfuseGeneration, ...]: ...
    def start_as_current_observation(self, *, name: str, as_type: str, **kwargs) -> ContextManager[...]: ...
    def create_event(self, *, name: str, **kwargs) -> LangfuseEvent: ...
    def flush(self) -> None: ...
    def shutdown(self) -> None: ...

def observe(func=None, *, name: str = None, as_type: str = None, **kwargs) -> Callable: ...
def get_client(*, public_key: str = None) -> Langfuse: ...
```

[Core Tracing](./core-tracing.md)

### Specialized Observation Types

Dedicated span types for different AI application components, each optimized for specific use cases with appropriate metadata and visualization.

```python { .api }
class LangfuseGeneration:
    # Specialized for LLM calls with model metrics
    def update(self, *, model: str = None, usage_details: Dict[str, int] = None,
               cost_details: Dict[str, float] = None, **kwargs) -> "LangfuseGeneration": ...

class LangfuseAgent:
    # For agent reasoning blocks
    pass

class LangfuseTool:
    # For external tool calls (APIs, databases)
    pass

class LangfuseChain:
    # For connecting application steps
    pass

class LangfuseRetriever:
    # For data retrieval operations
    pass
```

[Observation Types](./observation-types.md)

### Experiment Management

Comprehensive system for running experiments on datasets with automatic evaluation, result aggregation, and detailed reporting capabilities.

```python { .api }
class Evaluation:
    def __init__(self, *, name: str, value: Union[int, float, str, bool, None],
                 comment: str = None, metadata: Dict[str, Any] = None): ...

class ExperimentResult:
    def format(self, *, include_item_results: bool = False) -> str: ...

    # Attributes
    name: str
    item_results: List[ExperimentItemResult]
    run_evaluations: List[Evaluation]

def run_experiment(*, name: str, data: List[Any], task: Callable,
                   evaluators: List[Callable] = None, **kwargs) -> ExperimentResult: ...
```

[Experiments](./experiments.md)

### Dataset Management

Tools for creating, managing, and running experiments on datasets with support for both local data and Langfuse-hosted datasets.

```python { .api }
class DatasetClient:
    def run_experiment(self, *, name: str, task: Callable, **kwargs) -> ExperimentResult: ...

    # Attributes
    id: str
    name: str
    items: List[DatasetItemClient]

class DatasetItemClient:
    # Attributes
    input: Any
    expected_output: Any
    metadata: Any

class Langfuse:
    def get_dataset(self, name: str) -> DatasetClient: ...
    def create_dataset(self, *, name: str, **kwargs) -> DatasetClient: ...
    def create_dataset_item(self, *, dataset_name: str, **kwargs) -> DatasetItemClient: ...
```

[Dataset Management](./datasets.md)

### Prompt Management

Template management system supporting both text and chat-based prompts with variable interpolation and LangChain integration.

```python { .api }
class TextPromptClient:
    def compile(self, **kwargs) -> str: ...
    def get_langchain_prompt(self) -> Any: ...

    # Attributes
    name: str
    version: int
    prompt: str

class ChatPromptClient:
    def compile(self, **kwargs) -> List[Dict[str, str]]: ...
    def get_langchain_prompt(self) -> Any: ...

    # Attributes
    name: str
    version: int
    prompt: List[Dict[str, Any]]

class Langfuse:
    def get_prompt(self, name: str, version: int = None, **kwargs) -> Union[TextPromptClient, ChatPromptClient]: ...
    def create_prompt(self, *, name: str, prompt: Union[str, List[Dict]], **kwargs) -> Union[TextPromptClient, ChatPromptClient]: ...
```

[Prompt Management](./prompts.md)

### Scoring and Evaluation

System for adding scores and evaluations to traces and observations, supporting numeric, categorical, and boolean score types.

```python { .api }
class LangfuseObservationWrapper:
    def score(self, *, name: str, value: Union[float, str],
              data_type: str = None, comment: str = None) -> None: ...
    def score_trace(self, *, name: str, value: Union[float, str],
                    data_type: str = None, comment: str = None) -> None: ...

class Langfuse:
    def create_score(self, *, name: str, value: Union[float, str], trace_id: str = None,
                     observation_id: str = None, **kwargs) -> None: ...
```

[Scoring](./scoring.md)

### Integration Support

Pre-built integrations for popular AI frameworks with automatic instrumentation and minimal configuration required.

```python { .api }
# OpenAI Integration (drop-in replacement)
from langfuse.openai import OpenAI, AsyncOpenAI, AzureOpenAI

# LangChain Integration
from langfuse.langchain import CallbackHandler

class CallbackHandler:
    def __init__(self, *, public_key: str = None, secret_key: str = None, **kwargs): ...
```

[Integrations](./integrations.md)

### Media and Advanced Features

Support for media uploads, data masking, multi-project setups, and advanced configuration options.

```python { .api }
class LangfuseMedia:
    def __init__(self, *, obj: object = None, base64_data_uri: str = None,
                 content_type: str = None, **kwargs): ...

class Langfuse:
    def get_trace_url(self, trace_id: str) -> str: ...
    def auth_check(self) -> bool: ...
    def create_trace_id(self) -> str: ...
    def get_current_trace_id(self) -> str: ...
```

[Advanced Features](./advanced.md)
## Types
285
286
```python { .api }
287
# Core Types
288
SpanLevel = Literal["DEBUG", "DEFAULT", "WARNING", "ERROR"]
289
ScoreDataType = Literal["NUMERIC", "CATEGORICAL", "BOOLEAN"]
290
ObservationTypeLiteral = Literal["span", "generation", "event", "agent", "tool", "chain", "retriever", "embedding", "evaluator", "guardrail"]
291
292
# Experiment Types
293
LocalExperimentItem = TypedDict('LocalExperimentItem', {
294
'input': Any,
295
'expected_output': Any,
296
'metadata': Optional[Dict[str, Any]]
297
}, total=False)
298
299
ExperimentItem = Union[LocalExperimentItem, DatasetItemClient]
300
301
# Function Protocols
302
class TaskFunction(Protocol):
303
def __call__(self, *, item: ExperimentItem, **kwargs) -> Union[Any, Awaitable[Any]]: ...
304
305
class EvaluatorFunction(Protocol):
306
def __call__(self, *, input: Any, output: Any, expected_output: Any = None,
307
metadata: Dict[str, Any] = None, **kwargs) -> Union[Evaluation, List[Evaluation], Awaitable[Union[Evaluation, List[Evaluation]]]]: ...
308
```