Protocol classes and configuration objects for ETL (Extract, Transform, Load) pipelines in CDAP
npx @tessl/cli install tessl/maven-co-cask-cdap--cdap-etl-proto@5.1.00
# CDAP ETL Protocol
1
2
CDAP ETL Protocol provides protocol classes and configuration objects for defining ETL (Extract, Transform, Load) pipelines in the Cask Data Application Platform (CDAP). It contains core data structures for representing ETL configurations, stages, plugins, and connections with comprehensive backward compatibility through versioned protocol support (v0, v1, v2).
3
4
## Package Information
5
6
- **Package Name**: cdap-etl-proto
7
- **Package Type**: maven
8
- **Language**: Java
9
- **Installation**:
10
```xml
11
<dependency>
12
<groupId>co.cask.cdap</groupId>
13
<artifactId>cdap-etl-proto</artifactId>
14
<version>5.1.2</version>
15
</dependency>
16
```
17
18
## Core Imports
19
20
```java
21
import co.cask.cdap.etl.proto.v2.*;
22
import co.cask.cdap.etl.proto.Connection;
23
import co.cask.cdap.api.Resources;
24
import co.cask.cdap.etl.api.Engine;
25
```
26
27
For legacy compatibility:
28
```java
29
import co.cask.cdap.etl.proto.v1.*;
30
import co.cask.cdap.etl.proto.v0.*;
31
```
32
33
## Basic Usage
34
35
```java
36
import co.cask.cdap.etl.proto.v2.*;
37
import co.cask.cdap.etl.proto.Connection;
38
import co.cask.cdap.api.Resources;
39
40
// Create ETL plugin configuration
41
Map<String, String> sourceProps = new HashMap<>();
42
sourceProps.put("name", "users");
43
sourceProps.put("schema.row.field", "userid");
44
45
ETLPlugin sourcePlugin = new ETLPlugin(
46
"Table",
47
"batchsource",
48
sourceProps
49
);
50
51
Map<String, String> transformProps = new HashMap<>();
52
transformProps.put("script", "function transform(input, emitter, context) { emitter.emit(input); }");
53
54
ETLPlugin transformPlugin = new ETLPlugin(
55
"JavaScript",
56
"transform",
57
transformProps
58
);
59
60
Map<String, String> sinkProps = new HashMap<>();
61
sinkProps.put("name", "processed_users");
62
63
ETLPlugin sinkPlugin = new ETLPlugin(
64
"Table",
65
"batchsink",
66
sinkProps
67
);
68
69
// Create ETL stages
70
ETLStage sourceStage = new ETLStage("users_source", sourcePlugin);
71
ETLStage transformStage = new ETLStage("process_users", transformPlugin);
72
ETLStage sinkStage = new ETLStage("users_sink", sinkPlugin);
73
74
// Create pipeline connections
75
Set<Connection> connections = new HashSet<>();
76
connections.add(new Connection("users_source", "process_users"));
77
connections.add(new Connection("process_users", "users_sink"));
78
79
// Build batch ETL configuration
80
ETLBatchConfig config = ETLBatchConfig.builder()
81
.addStage(sourceStage)
82
.addStage(transformStage)
83
.addStage(sinkStage)
84
.addConnections(connections)
85
.setResources(new Resources(1024, 2))
86
.build();
87
88
// Validate configuration
89
config.validate();
90
```
91
92
## Architecture
93
94
CDAP ETL Protocol is structured around versioned protocol packages that enable backward compatibility and smooth upgrades:
95
96
- **Version Management**: Three protocol versions (v0, v1, v2) with automatic upgrade paths
97
- **Configuration Hierarchy**: Base ETL configuration classes extended for specific pipeline types (batch, streaming)
98
- **Stage System**: Modular pipeline components with plugin-based architecture
99
- **Connection Model**: Explicit data flow definition between pipeline stages
100
- **Resource Management**: Configurable resource allocation for different pipeline components
101
- **Validation Framework**: Comprehensive configuration validation with detailed error reporting
102
103
## Capabilities
104
105
### Pipeline Configuration (v2 - Current)
106
107
Latest version ETL pipeline configuration with comprehensive features for batch and streaming data processing. Includes advanced resource management, stage logging, and extensive property support.
108
109
```java { .api }
110
public class ETLConfig extends Config implements UpgradeableConfig {
111
public String getDescription();
112
public Set<ETLStage> getStages();
113
public Set<Connection> getConnections();
114
public Resources getResources();
115
public Resources getDriverResources();
116
public Resources getClientResources();
117
public int getNumOfRecordsPreview();
118
public boolean isStageLoggingEnabled();
119
public boolean isProcessTimingEnabled();
120
public Map<String, String> getProperties();
121
public void validate();
122
public boolean canUpgrade();
123
public UpgradeableConfig upgrade(UpgradeContext upgradeContext);
124
}
125
126
public final class ETLBatchConfig extends ETLConfig {
127
public List<ETLStage> getPostActions();
128
public Engine getEngine();
129
public String getSchedule();
130
public Integer getMaxConcurrentRuns();
131
public ETLBatchConfig convertOldConfig();
132
public static Builder builder();
133
public static Builder builder(String schedule);
134
}
135
136
public final class DataStreamsConfig extends ETLConfig {
137
public String getBatchInterval();
138
public boolean isUnitTest();
139
public boolean checkpointsDisabled();
140
public String getExtraJavaOpts();
141
public Boolean getStopGracefully();
142
public String getCheckpointDir();
143
public static Builder builder();
144
}
145
```
146
147
[Pipeline Configuration](./pipeline-configuration.md)
148
149
### Stage and Plugin Management
150
151
Core components for defining individual pipeline stages and their associated plugins. Handles plugin configuration, artifact selection, and validation.
152
153
```java { .api }
154
public final class ETLStage {
155
public ETLStage(String name, ETLPlugin plugin);
156
public String getName();
157
public ETLPlugin getPlugin();
158
public void validate();
159
}
160
161
public class ETLPlugin {
162
public ETLPlugin(String name, String type, Map<String, String> properties);
163
public ETLPlugin(String name, String type, Map<String, String> properties, ArtifactSelectorConfig artifact);
164
public String getName();
165
public String getType();
166
public Map<String, String> getProperties();
167
public PluginProperties getPluginProperties();
168
public ArtifactSelectorConfig getArtifactConfig();
169
public void validate();
170
}
171
```
172
173
[Stage and Plugin Management](./stage-plugin-management.md)
174
175
### Connection and Data Flow
176
177
Connection management for defining data flow between pipeline stages, including support for conditional connections and port-based routing.
178
179
```java { .api }
180
public class Connection {
181
public Connection(String from, String to);
182
public Connection(String from, String to, String port);
183
public Connection(String from, String to, Boolean condition);
184
public String getFrom();
185
public String getTo();
186
public String getPort();
187
public Boolean getCondition();
188
}
189
```
190
191
[Connection and Data Flow](./connection-data-flow.md)
192
193
### Pipeline Triggering and Property Mapping
194
195
Advanced pipeline triggering capabilities with property mapping between triggering and triggered pipelines, supporting both argument and plugin property mappings.
196
197
```java { .api }
198
public class TriggeringPropertyMapping {
199
public TriggeringPropertyMapping(List<ArgumentMapping> arguments, List<PluginPropertyMapping> pluginProperties);
200
public List<ArgumentMapping> getArguments();
201
public List<PluginPropertyMapping> getPluginProperties();
202
}
203
204
public class ArgumentMapping {
205
public ArgumentMapping(String source, String target);
206
public String getSource();
207
public String getTarget();
208
}
209
210
public class PluginPropertyMapping extends ArgumentMapping {
211
public PluginPropertyMapping(String stageName, String source, String target);
212
public String getStageName();
213
}
214
```
215
216
[Pipeline Triggering](./pipeline-triggering.md)
217
218
### Legacy Version Support
219
220
Backward compatibility support for v0 and v1 protocol versions with automatic upgrade mechanisms to current v2 format.
221
222
```java { .api }
223
public interface UpgradeableConfig<T extends UpgradeableConfig> {
224
boolean canUpgrade();
225
T upgrade(UpgradeContext upgradeContext);
226
}
227
228
public interface UpgradeContext {
229
ArtifactSelectorConfig getPluginArtifact(String pluginType, String pluginName);
230
}
231
```
232
233
[Legacy Version Support](./legacy-version-support.md)
234
235
### Artifact and Resource Management
236
237
Configuration management for plugin artifacts and pipeline resource allocation, including driver, client, and executor resource specifications.
238
239
```java { .api }
240
public class ArtifactSelectorConfig {
241
public ArtifactSelectorConfig();
242
public ArtifactSelectorConfig(String scope, String name, String version);
243
public String getScope();
244
public String getName();
245
public String getVersion();
246
}
247
```
248
249
[Artifact and Resource Management](./artifact-resource-management.md)
250
251
## Types
252
253
```java { .api }
254
// From co.cask.cdap.etl.api.Engine
255
public enum Engine {
256
MAPREDUCE, SPARK
257
}
258
259
// From co.cask.cdap.api.Resources
260
public class Resources {
261
public Resources(int memoryMB, int virtualCores);
262
public int getMemoryMB();
263
public int getVirtualCores();
264
}
265
266
// Validation error structure (conceptual)
267
public class ValidationError {
268
public String getMessage();
269
public String getField();
270
}
271
```