Apache Airflow provider for executing Jupyter notebooks with Papermill
npx @tessl/cli install tessl/pypi-apache-airflow-providers-papermill@2021.3.00
# Apache Airflow Papermill Provider

Apache Airflow provider for executing Jupyter notebooks with Papermill. This provider enables data teams to integrate notebook-based analytics and machine learning workflows into their Airflow DAGs by executing parameterized Jupyter notebooks through the Papermill library.

## Package Information

- **Package Name**: apache-airflow-providers-papermill
- **Language**: Python
- **Installation**: `pip install apache-airflow-providers-papermill`

## Core Imports

```python
from airflow.providers.papermill.operators.papermill import PapermillOperator
```

For lineage tracking:

```python
from airflow.providers.papermill.operators.papermill import NoteBook
```

## Basic Usage
```python
from datetime import timedelta

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator
from airflow.utils.dates import days_ago

# Define the DAG
dag = DAG(
    'example_papermill',
    default_args={'owner': 'airflow'},
    schedule_interval='0 0 * * *',
    start_date=days_ago(2),
    dagrun_timeout=timedelta(minutes=60),
)

# Execute a notebook with parameters
run_notebook = PapermillOperator(
    task_id="run_analysis_notebook",
    input_nb="/path/to/analysis.ipynb",
    output_nb="/path/to/output-{{ execution_date }}.ipynb",
    parameters={"date": "{{ execution_date }}", "source": "airflow"},
    dag=dag,
)
```

## Capabilities

### Notebook Execution

Execute Jupyter notebooks through Papermill with parameter injection and lineage tracking support.

```python { .api }
class PapermillOperator(BaseOperator):
    """
    Executes a jupyter notebook through papermill that is annotated with parameters

    :param input_nb: input notebook (can also be a NoteBook or a File inlet)
    :type input_nb: str
    :param output_nb: output notebook (can also be a NoteBook or File outlet)
    :type output_nb: str
    :param parameters: the notebook parameters to set
    :type parameters: dict
    """

    supports_lineage = True

    @apply_defaults
    def __init__(
        self,
        *,
        input_nb: Optional[str] = None,
        output_nb: Optional[str] = None,
        parameters: Optional[Dict] = None,
        **kwargs,
    ) -> None: ...

    def execute(self, context): ...
```
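The execution flow can be sketched without Airflow installed: validate that notebooks are configured, then hand each inlet/outlet pair to a Papermill-style runner. This is a simplified model of the behavior described in this document, not the provider's actual internals; `execute_pairs` and the `run` callback are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class NoteBook:
    """Minimal stand-in for the provider's lineage entity."""
    url: str
    parameters: Dict = field(default_factory=dict)

def execute_pairs(inlets: List[NoteBook], outlets: List[NoteBook],
                  run: Callable[[str, str, Dict], None]) -> None:
    # Mirrors the operator's validation: both sides must be configured.
    if not inlets or not outlets:
        raise ValueError("Input notebook or output notebook is not specified")
    # Each input notebook is executed and written to the matching output.
    for i, inlet in enumerate(inlets):
        run(inlet.url, outlets[i].url, inlet.parameters)
```

In the real operator, `run` corresponds to `papermill.execute_notebook` invoked with the settings listed under Integration Notes.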

### Lineage Entity

Represents Jupyter notebooks for Airflow lineage tracking.

```python { .api }
@attr.s(auto_attribs=True)
class NoteBook(File):
    """Jupyter notebook"""

    type_hint: Optional[str] = "jupyter_notebook"
    parameters: Optional[Dict] = {}
    meta_schema: str = __name__ + '.NoteBook'
```

## Types

```python { .api }
from typing import Dict, Optional
import attr
import papermill as pm
from airflow.lineage.entities import File
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
```

## Advanced Usage Examples

### Template Variables

Use Airflow's templating system for dynamic notebook paths and parameters:

```python
run_notebook = PapermillOperator(
    task_id="daily_report",
    input_nb="/notebooks/daily_report_template.ipynb",
    output_nb="/reports/daily_report_{{ ds }}.ipynb",
    parameters={
        "report_date": "{{ ds }}",
        "execution_time": "{{ execution_date }}",
        "run_id": "{{ run_id }}",
    },
)
```

### Lineage Tracking

The operator automatically creates lineage entities that can be used by downstream tasks:

```python
from airflow.lineage import AUTO
from airflow.operators.python import PythonOperator

def process_notebook_output(inlets, **context):
    # Access the output notebook through lineage
    notebook_path = inlets[0].url
    # Process the executed notebook...

process_task = PythonOperator(
    task_id='process_output',
    python_callable=process_notebook_output,
    inlets=AUTO,  # Automatically detects upstream notebook outputs
)

run_notebook >> process_task
```

### Multiple Notebook Execution

Execute multiple notebooks in a single task by configuring lineage inlets and outlets directly instead of `input_nb`/`output_nb`:

```python
# The operator pairs lineage inlets with outlets: inlet[i] is executed
# and written to outlet[i]. The NoteBook entities below are illustrative;
# this pattern requires careful setup of matching inlet/outlet lists.
from airflow.providers.papermill.operators.papermill import NoteBook

multi_notebook = PapermillOperator(
    task_id="run_multiple_notebooks",
    inlets=[
        NoteBook(url="/notebooks/extract.ipynb", parameters={"stage": "extract"}),
        NoteBook(url="/notebooks/transform.ipynb", parameters={"stage": "transform"}),
    ],
    outlets=[
        NoteBook(url="/output/extract-{{ ds }}.ipynb"),
        NoteBook(url="/output/transform-{{ ds }}.ipynb"),
    ],
    dag=dag,
)
```

## Error Handling

The operator performs validation during execution:

- Raises `ValueError` with the message "Input notebook or output notebook is not specified" if inlets or outlets are not properly configured
- Papermill execution errors are propagated as task failures

## Integration Notes

### Papermill Configuration

The operator calls `papermill.execute_notebook()` with these settings:

- `progress_bar=False` - Disables progress display for cleaner logs
- `report_mode=True` - Enables report generation mode
- Parameters are passed through for notebook injection
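As a sketch, these settings amount to the following keyword arguments; the input/output paths and parameter values are placeholders, and `papermill` itself is not imported here:

```python
# Keyword arguments corresponding to the settings above. The paths and
# parameters are placeholders, not provider defaults.
papermill_kwargs = dict(
    input_path="/path/to/analysis.ipynb",
    output_path="/path/to/output.ipynb",
    parameters={"date": "2021-03-01", "source": "airflow"},
    progress_bar=False,  # cleaner task logs
    report_mode=True,    # report generation mode
)
```

The operator would then invoke `pm.execute_notebook(**papermill_kwargs)`.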
### Airflow Features

- **Templating**: All string parameters support Airflow's Jinja templating
- **Lineage**: Automatic lineage tracking through `NoteBook` entities
- **XCom**: Can be used with XCom for passing data between tasks
- **Retries**: Supports standard Airflow retry mechanisms
- **Connections**: Can use Airflow connections for remote notebook storage
## Migration Notes

The legacy import path is deprecated and issues a warning:

```python
# DEPRECATED - issues a DeprecationWarning
from airflow.operators.papermill_operator import PapermillOperator
```

Use the current import path:

```python
# Current import path
from airflow.providers.papermill.operators.papermill import PapermillOperator
```