npx @tessl/cli install tessl/maven-io-cdap-cdap--cdap@6.11.00
# CDAP - Cask Data Application Platform

The Cask Data Application Platform (CDAP) is an integrated, open source application development platform for the Hadoop ecosystem that provides developers with data and application abstractions to simplify and accelerate application development, address a broader range of real-time and batch use cases, and deploy applications into production while satisfying enterprise requirements.

## Package Information

- **Package Name**: cdap
- **Package Type**: maven
- **Language**: Java
- **Installation**: Add to your Maven dependencies:

```xml
<dependency>
  <groupId>io.cdap.cdap</groupId>
  <artifactId>cdap-api</artifactId>
  <version>6.11.0</version>
</dependency>
```

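If you build with Gradle rather than Maven, the equivalent dependency declaration (assuming the same coordinates as the Maven snippet above) would be:

```groovy
dependencies {
    // cdap-api: the compile-time API for CDAP applications
    implementation 'io.cdap.cdap:cdap-api:6.11.0'
}
```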
## Overview

CDAP abstracts the complexity of underlying infrastructure while providing developers with powerful tools for building portable, maintainable data applications. From simple MapReduce jobs to complex ETL pipelines and real-time data processing workflows, CDAP enables enterprise-ready development with features like automatic metadata capture, operational control, security integration, and lineage tracking.

## Core Imports

```java
// Core application framework
import io.cdap.cdap.api.app.*;
import io.cdap.cdap.api.*;

// Configuration
import io.cdap.cdap.api.Config;

// Common utilities
import io.cdap.cdap.api.common.*;
```

For specific program types:

```java
// MapReduce
import io.cdap.cdap.api.mapreduce.*;

// Spark (Beta)
import io.cdap.cdap.api.spark.*;

// Services
import io.cdap.cdap.api.service.*;
import io.cdap.cdap.api.service.http.*;

// Workers
import io.cdap.cdap.api.worker.*;

// Workflows
import io.cdap.cdap.api.workflow.*;
```

## Basic Usage

```java
import io.cdap.cdap.api.app.*;
import io.cdap.cdap.api.*;
import io.cdap.cdap.api.dataset.table.Table;
import io.cdap.cdap.api.mapreduce.AbstractMapReduce;

// 1. Define application configuration
public class MyAppConfig extends Config {
  private String inputDataset = "input";
  private String outputDataset = "output";

  public String getInputDataset() { return inputDataset; }
  public String getOutputDataset() { return outputDataset; }
}

// 2. Create a simple application
public class MyDataApp extends AbstractApplication<MyAppConfig> {

  @Override
  public void configure() {
    MyAppConfig config = getConfig();

    // Set application metadata
    setName("MyDataProcessingApp");
    setDescription("Processes data from input to output");

    // Add a simple MapReduce program
    addMapReduce(new MyMapReduce());

    // Create datasets
    createDataset(config.getInputDataset(), Table.class);
    createDataset(config.getOutputDataset(), Table.class);
  }
}

// 3. Simple MapReduce program
public class MyMapReduce extends AbstractMapReduce {
  @Override
  public void configure() {
    setName("MyMapReduce");
    setDescription("Simple data processing");
  }
}
```

## Package Structure

The CDAP API is organized into logical functional areas:

```java { .api }
// Core API packages
io.cdap.cdap.api.*              // Base interfaces and classes
io.cdap.cdap.api.annotation.*   // Annotations for configuration and metadata

// Application Framework
io.cdap.cdap.api.app.*          // Application building blocks
io.cdap.cdap.api.service.*      // HTTP services and handlers
io.cdap.cdap.api.worker.*       // Worker programs
io.cdap.cdap.api.workflow.*     // Workflow orchestration

// Data Processing
io.cdap.cdap.api.mapreduce.*    // MapReduce integration
io.cdap.cdap.api.spark.*        // Spark integration
io.cdap.cdap.api.customaction.* // Custom workflow actions

// Data Management
io.cdap.cdap.api.dataset.*      // Dataset abstractions and implementations
io.cdap.cdap.api.messaging.*    // Messaging system integration

// Plugin System
io.cdap.cdap.api.plugin.*       // Extensibility framework

// Operations & Governance
io.cdap.cdap.api.metrics.*      // Metrics collection
io.cdap.cdap.api.metadata.*     // Metadata management
io.cdap.cdap.api.lineage.*      // Data lineage tracking
io.cdap.cdap.api.security.*     // Authentication and authorization
```

## Architecture Overview

CDAP applications follow a component-based architecture in which different types of programs work together:

### Core Components

```java { .api }
// Application - root container
public interface Application<T extends Config> {
  void configure(ApplicationConfigurer configurer, ApplicationContext<T> context);

  default boolean isUpdateSupported() {
    return false;
  }

  default ApplicationUpdateResult<T> updateConfig(ApplicationUpdateContext applicationUpdateContext)
    throws Exception {
    throw new UnsupportedOperationException("Application config update operation is not supported.");
  }
}

// Base program types
public enum ProgramType {
  MAPREDUCE, // Batch data processing with Hadoop MapReduce
  SPARK,     // Batch/streaming processing with Apache Spark
  SERVICE,   // HTTP services for real-time data access
  WORKER,    // Long-running background processes
  WORKFLOW   // Orchestration of multiple programs
}

// Resource allocation
public final class Resources {
  public static final int DEFAULT_VIRTUAL_CORES = 1;
  public static final int DEFAULT_MEMORY_MB = 512;

  public Resources() { /* 512 MB, 1 core */ }
  public Resources(int memoryMB) { /* specified memory, 1 core */ }
  public Resources(int memoryMB, int cores) { /* specified memory and cores */ }

  public int getMemoryMB() { /* returns allocated memory */ }
  public int getVirtualCores() { /* returns allocated CPU cores */ }
}
```

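The constructor defaults documented for `Resources` can be sketched with a self-contained stand-in class; this mirrors only the documented behavior and is not the real `io.cdap.cdap.api.Resources` implementation:

```java
// Self-contained stand-in mirroring the documented Resources defaults
// (512 MB memory, 1 virtual core); the real class ships in cdap-api.
final class ResourcesSketch {
  static final int DEFAULT_VIRTUAL_CORES = 1;
  static final int DEFAULT_MEMORY_MB = 512;

  private final int memoryMB;
  private final int virtualCores;

  ResourcesSketch() { this(DEFAULT_MEMORY_MB, DEFAULT_VIRTUAL_CORES); }
  ResourcesSketch(int memoryMB) { this(memoryMB, DEFAULT_VIRTUAL_CORES); }
  ResourcesSketch(int memoryMB, int virtualCores) {
    this.memoryMB = memoryMB;
    this.virtualCores = virtualCores;
  }

  int getMemoryMB() { return memoryMB; }
  int getVirtualCores() { return virtualCores; }
}
```

In a real application you would pass, for example, `new Resources(1024, 2)` to a program configurer to request 1 GB and two cores per container.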
### Application Lifecycle

```java { .api }
// Program lifecycle interface
public interface ProgramLifecycle<T extends RuntimeContext> {
  @TransactionPolicy(TransactionControl.IMPLICIT)
  void initialize(T context) throws Exception;

  @TransactionPolicy(TransactionControl.IMPLICIT)
  void destroy();
}

// Program execution states
public enum ProgramStatus {
  INITIALIZING, // Program is starting up
  RUNNING,      // Program is executing
  STOPPING,     // Program is shutting down
  COMPLETED,    // Program finished successfully
  FAILED,       // Program failed with an error
  KILLED;       // Program was terminated

  public static final Set<ProgramStatus> TERMINAL_STATES =
    EnumSet.of(COMPLETED, FAILED, KILLED);
}
```

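Since `TERMINAL_STATES` is a plain `EnumSet`, deciding whether a run has finished is a simple membership test. A self-contained sketch using a mirror of the enum shown above (only so the check runs without cdap-api on the classpath; the real type is `io.cdap.cdap.api.ProgramStatus`):

```java
import java.util.EnumSet;
import java.util.Set;

// Mirror of the documented ProgramStatus enum, declared locally so the
// terminal-state membership check below is runnable stand-alone.
enum ProgramStatusSketch {
  INITIALIZING, RUNNING, STOPPING, COMPLETED, FAILED, KILLED;

  static final Set<ProgramStatusSketch> TERMINAL_STATES =
      EnumSet.of(COMPLETED, FAILED, KILLED);

  // A run is finished once it reaches any terminal state.
  boolean isTerminal() {
    return TERMINAL_STATES.contains(this);
  }
}
```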
## Getting Started

### Basic Application Structure

```java { .api }
import io.cdap.cdap.api.Config;
import io.cdap.cdap.api.annotation.Description;
import io.cdap.cdap.api.app.*;
import io.cdap.cdap.api.dataset.table.Table;

// 1. Define application configuration
public class MyAppConfig extends Config {
  @Description("Input dataset name")
  private String inputDataset = "input";

  @Description("Output dataset name")
  private String outputDataset = "output";

  public String getInputDataset() { return inputDataset; }
  public String getOutputDataset() { return outputDataset; }
}

// 2. Create the application
public class MyDataApp extends AbstractApplication<MyAppConfig> {

  @Override
  public void configure() {
    MyAppConfig config = getConfig();

    // Set application metadata
    setName("MyDataProcessingApp");
    setDescription("Processes data from input to output");

    // Add programs
    addMapReduce(new MyMapReduce());
    addSpark(new MySparkProgram());
    addService(new MyDataService());
    addWorkflow(new MyProcessingWorkflow());

    // Create datasets
    createDataset(config.getInputDataset(), Table.class);
    createDataset(config.getOutputDataset(), Table.class);
  }
}
```

### Runtime Context Access

All CDAP programs receive a runtime context providing access to system services:

```java { .api }
// Base runtime context
public interface RuntimeContext extends FeatureFlagsProvider {
  String getNamespace();
  ApplicationSpecification getApplicationSpecification();
  String getClusterName();
  long getLogicalStartTime();
  Map<String, String> getRuntimeArguments();
  Metrics getMetrics();
}

// Dataset access context
public interface DatasetContext {
  <T extends Dataset> T getDataset(String name) throws DataSetException;
  <T extends Dataset> T getDataset(String namespace, String name) throws DataSetException;
  void releaseDataset(Dataset dataset);
  void discardDataset(Dataset dataset);
}

// Service discovery
public interface ServiceDiscoverer {
  URL getServiceURL(String applicationId, String serviceId);
  URL getServiceURL(String applicationId, String serviceId, String methodPath);
}
```

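Runtime arguments arrive as a plain `Map<String, String>`, so programs typically resolve them against defaults in `initialize()`. A self-contained sketch of that pattern (the helper class and argument names are illustrative, not part of the CDAP API):

```java
import java.util.Map;

// Illustrative helper for resolving runtime arguments with defaults,
// as a program might do with RuntimeContext.getRuntimeArguments().
final class RuntimeArgs {
  private RuntimeArgs() { }

  // Resolve an optional argument, falling back to a default value.
  static String get(Map<String, String> args, String key, String defaultValue) {
    return args.getOrDefault(key, defaultValue);
  }

  // Resolve a required numeric argument, failing fast with a clear message.
  static int getInt(Map<String, String> args, String key) {
    String value = args.get(key);
    if (value == null) {
      throw new IllegalArgumentException("Missing required runtime argument: " + key);
    }
    return Integer.parseInt(value);
  }
}
```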
## Core Concepts

### Configuration and Plugins

```java { .api }
// Configuration base class
public class Config implements Serializable {
  // Base for all configuration classes
}

// Plugin configuration
public abstract class PluginConfig extends Config {
  // Base for plugin configurations
}

// Plugin interface
public interface PluginConfigurer {
  <T> T usePlugin(String pluginType, String pluginName, String pluginId,
                  PluginProperties properties);
  <T> Class<T> usePluginClass(String pluginType, String pluginName, String pluginId,
                              PluginProperties properties);
}
```

### Transaction Management

```java { .api }
// Transactional interface for explicit transaction control
public interface Transactional {
  void execute(TxRunnable runnable) throws TransactionFailureException;
  void execute(int timeoutInSeconds, TxRunnable runnable) throws TransactionFailureException;
}

// Transactional operations
public interface TxRunnable {
  void run(DatasetContext context) throws Exception;
}

public interface TxCallable<V> {
  V call(DatasetContext context) throws Exception;
}

// Utility for transaction operations
public final class Transactionals {
  // Static helpers for running TxRunnable / TxCallable against a Transactional
}
```

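Because `TxRunnable` and `TxCallable` are single-method interfaces, transactional work is usually written as lambdas. A self-contained sketch of that calling pattern, using stand-in interfaces and a trivial "transaction" that just runs the body against an in-memory map (the real implementations commit or roll back dataset changes, and overload `execute` rather than splitting the method names as done here for clarity):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for DatasetContext: exposes a toy key/value store.
interface DatasetContextSketch {
  Map<String, String> store();
}

@FunctionalInterface
interface TxRunnableSketch {
  void run(DatasetContextSketch context) throws Exception;
}

@FunctionalInterface
interface TxCallableSketch<V> {
  V call(DatasetContextSketch context) throws Exception;
}

final class TransactionalSketch {
  private final Map<String, String> store = new HashMap<>();

  // Run a void transactional body; failures become unchecked, standing in
  // for TransactionFailureException in the real API.
  void execute(TxRunnableSketch runnable) {
    try {
      runnable.run(() -> store);
    } catch (Exception e) {
      throw new RuntimeException("transaction failed", e);
    }
  }

  // Run a value-returning transactional body.
  <V> V call(TxCallableSketch<V> callable) {
    try {
      return callable.call(() -> store);
    } catch (Exception e) {
      throw new RuntimeException("transaction failed", e);
    }
  }
}
```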
### Annotations

Key annotations for configuration and behavior control:

```java { .api }
// Core annotations
@Description("Provides descriptive text for API elements")
@Name("Specifies custom names for elements")
@Property  // Marks fields as configuration properties
@Macro     // Enables macro substitution in field values

// Plugin annotations
@Plugin(type = "source")  // Marks classes as plugins of specific types
@Category("transform")    // Categorizes elements for organization

// Metadata annotations
@Metadata(properties = {@MetadataProperty(key = "author", value = "team")})
```

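To see how such annotations carry metadata, here is a self-contained sketch that declares a stand-in `@Description` (mirroring `io.cdap.cdap.api.annotation.Description`, assumed to be runtime-retained) and reads it back reflectively from a config field, which is essentially how frameworks turn annotated config classes into UI hints and documentation:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-in for io.cdap.cdap.api.annotation.Description, declared here only
// so the reflective lookup below runs without cdap-api on the classpath.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.FIELD, ElementType.METHOD, ElementType.PARAMETER})
@interface DescriptionSketch {
  String value();
}

// Hypothetical config class whose fields carry descriptive metadata.
class SourceConfigSketch {
  @DescriptionSketch("Input dataset name")
  String inputDataset = "input";

  // Look up the description attached to a field, as a framework would.
  static String describe(String fieldName) {
    try {
      DescriptionSketch d = SourceConfigSketch.class
          .getDeclaredField(fieldName)
          .getAnnotation(DescriptionSketch.class);
      return d == null ? "" : d.value();
    } catch (NoSuchFieldException e) {
      throw new IllegalArgumentException("No such field: " + fieldName, e);
    }
  }
}
```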
## Key Features

### Enterprise Capabilities

- **Automatic Metadata Capture**: Track data lineage and transformations
- **Security Integration**: Role-based access control and secure storage
- **Operational Control**: Metrics collection, logging, and monitoring
- **Plugin Extensibility**: Custom transformations and connectors
- **Multi-tenancy**: Namespace isolation and resource management

### Hadoop Ecosystem Integration

- **Apache Spark**: Native integration for batch and streaming processing
- **Hadoop MapReduce**: Direct MapReduce job execution and management
- **Apache Hive**: SQL-based data processing and analytics
- **Apache HBase**: NoSQL database operations and management
- **HDFS**: Distributed file system access and operations

### Development Productivity

- **Unified API**: Consistent programming model across all components
- **Configuration Management**: Type-safe configuration with validation
- **Testing Support**: Local execution and testing frameworks
- **Deployment Flexibility**: Multi-environment deployment support

## Quick Reference

### Import Statements

```java { .api }
// Core application framework
import io.cdap.cdap.api.app.*;
import io.cdap.cdap.api.*;

// Data processing
import io.cdap.cdap.api.mapreduce.*;
import io.cdap.cdap.api.spark.*;

// Data access
import io.cdap.cdap.api.dataset.*;
import io.cdap.cdap.api.dataset.lib.*;

// Services
import io.cdap.cdap.api.service.*;
import io.cdap.cdap.api.service.http.*;

// Workflows
import io.cdap.cdap.api.workflow.*;

// Plugin system
import io.cdap.cdap.api.plugin.*;

// Annotations
import io.cdap.cdap.api.annotation.*;

// Utilities
import io.cdap.cdap.api.common.*;
import io.cdap.cdap.api.metrics.*;
```

### Program Types Summary

| Program Type | Purpose | Context Interface | Use Cases |
|--------------|---------|-------------------|-----------|
| **MapReduce** | Batch processing | `MapReduceContext` | ETL, data transformation, aggregation |
| **Spark** | Batch/stream processing | `SparkClientContext` | ML, real-time analytics, complex transformations |
| **Service** | HTTP endpoints | `HttpServiceContext` | REST APIs, real-time queries, data serving |
| **Worker** | Background processing | `WorkerContext` | Data ingestion, monitoring, housekeeping |
| **Workflow** | Orchestration | `WorkflowContext` | Multi-step pipelines, conditional logic |

## Next Steps

- **[Application Framework](application-framework.md)**: Learn about building applications, services, and workflows
- **[Data Processing](data-processing.md)**: Explore MapReduce and Spark integration patterns
- **[Data Management](data-management.md)**: Understand datasets, tables, and data access patterns
- **[Plugin System](plugin-system.md)**: Build extensible applications with custom plugins
- **[Security & Metadata](security-metadata.md)**: Implement security and governance features
- **[Operational APIs](operational.md)**: Add metrics, scheduling, and operational control

The CDAP API provides a complete toolkit for enterprise data application development, combining the power of Hadoop ecosystem tools with enterprise-grade operational features.