Core data management capabilities for CDAP including dataset operations, metadata management, lineage tracking, audit functionality, and data registry services for Hadoop-based applications.
npx @tessl/cli install tessl/maven-co-cask-cdap--cdap-data-fabric@5.1.00
# CDAP Data Fabric
CDAP Data Fabric is a core component of the Cask Data Application Platform that provides essential data management and infrastructure services for Hadoop-based applications. It handles dataset metrics reporting and monitoring, comprehensive metadata management with indexing capabilities for efficient search and discovery, data lineage tracking to understand data flow and transformations, audit trail functionality for compliance and debugging, and a centralized data registry for managing dataset definitions and configurations.

## Package Information

- **Package Name**: cdap-data-fabric
- **Package Type**: maven
- **Group ID**: co.cask.cdap
- **Language**: Java
- **Installation**: Add to Maven dependencies:

```xml
<dependency>
    <groupId>co.cask.cdap</groupId>
    <artifactId>cdap-data-fabric</artifactId>
    <version>5.1.2</version>
</dependency>
```

## Core Imports

```java
import co.cask.cdap.data2.dataset2.DatasetFramework;
import co.cask.cdap.data2.metadata.dataset.MetadataDataset;
import co.cask.cdap.data2.registry.UsageRegistry;
import co.cask.cdap.store.NamespaceStore;
import co.cask.cdap.data2.audit.AuditPublisher;
```

## Basic Usage

```java
// Dataset management example
DatasetFramework datasetFramework = // ... obtain from DI container
DatasetId datasetId = NamespaceId.DEFAULT.dataset("myDataset");
DatasetProperties datasetProperties = DatasetProperties.builder().build();

// Create a dataset instance (the null argument means no owner principal)
datasetFramework.addInstance("keyValueTable", datasetId, datasetProperties, null);

// Access the dataset
KeyValueTable dataset = datasetFramework.getDataset(
    datasetId, null, null, null, null, AccessType.READ_WRITE);

// Metadata management example
MetadataDataset metadataDataset = // ... obtain instance
MetadataEntity entity = // ... define entity
// Note: Map.of and Set.of require Java 9 or later
Map<String, String> metadataProperties =
    Map.of("environment", "production", "owner", "team-alpha");

// Set metadata properties one key at a time, matching setProperty(entity, key, value)
for (Map.Entry<String, String> entry : metadataProperties.entrySet()) {
    metadataDataset.setProperty(entity, entry.getKey(), entry.getValue());
}

// Add tags
Set<String> tags = Set.of("production", "critical", "team-alpha");
metadataDataset.addTags(entity, tags);

// Search metadata
SearchRequest searchRequest = SearchRequest.of("production").build();
SearchResults results = metadataDataset.search(searchRequest);
```

## Architecture

The CDAP Data Fabric follows a layered architecture that abstracts complex data operations:

- **Dataset Framework Layer**: Provides unified APIs for dataset lifecycle management across different storage backends (HBase, LevelDB, in-memory)
- **Metadata Management Layer**: Handles comprehensive metadata operations including properties, tags, lineage, and search with pluggable indexing strategies
- **Transaction Layer**: Integrates with Apache Tephra for ACID transactions across distributed datasets
- **Registry Layer**: Tracks usage relationships between programs and datasets for governance and lineage
- **Audit Layer**: Provides comprehensive audit trails for compliance and debugging
- **Storage Abstraction Layer**: Supports multiple storage backends with consistent APIs

This architecture enables developers to build scalable data applications without dealing directly with underlying Hadoop complexities while maintaining full transactional guarantees and comprehensive metadata management.

## Capabilities

### Dataset Management

Comprehensive dataset lifecycle management including creation, configuration, access, and administration across multiple storage backends with transaction support and lineage tracking.

```java { .api }
public interface DatasetFramework {
    void addInstance(String datasetTypeName, DatasetId datasetInstanceId,
                     DatasetProperties props, KerberosPrincipalId ownerPrincipal)
        throws DatasetManagementException, IOException;

    <T extends Dataset> T getDataset(DatasetId datasetInstanceId, Map<String, String> arguments,
                                     ClassLoader classLoader, DatasetClassLoaderProvider classLoaderProvider,
                                     Iterable<? extends EntityId> owners, AccessType accessType)
        throws DatasetManagementException, IOException;

    void deleteInstance(DatasetId datasetInstanceId) throws DatasetManagementException, IOException;

    Collection<DatasetSpecificationSummary> getInstances(NamespaceId namespaceId)
        throws DatasetManagementException;
}
```

[Dataset Management](./dataset-management.md)
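
The add, list, and delete lifecycle described above can be illustrated with a small in-memory stand-in. This is only a sketch: `InMemoryDatasetRegistry` and its string-based identifiers are hypothetical and are not part of CDAP. It assumes namespace names never contain `:` and that adding a duplicate or deleting a missing instance should fail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the instance-lifecycle part of a dataset framework. */
final class InMemoryDatasetRegistry {
    // Key: "namespace:dataset", value: dataset type name. TreeMap keeps keys sorted.
    private final Map<String, String> instances = new TreeMap<>();

    private static String key(String namespace, String dataset) {
        return namespace + ":" + dataset;
    }

    /** Registers an instance; fails if the name is already taken in that namespace. */
    void addInstance(String typeName, String namespace, String dataset) {
        if (instances.putIfAbsent(key(namespace, dataset), typeName) != null) {
            throw new IllegalStateException("Dataset already exists: " + dataset);
        }
    }

    /** Returns true if the instance is currently registered. */
    boolean hasInstance(String namespace, String dataset) {
        return instances.containsKey(key(namespace, dataset));
    }

    /** Removes an instance; fails if it was never registered. */
    void deleteInstance(String namespace, String dataset) {
        if (instances.remove(key(namespace, dataset)) == null) {
            throw new IllegalStateException("Dataset not found: " + dataset);
        }
    }

    /** Lists the dataset names of one namespace in sorted order. */
    List<String> getInstances(String namespace) {
        String prefix = namespace + ":";
        List<String> names = new ArrayList<>();
        for (String k : instances.keySet()) {
            if (k.startsWith(prefix)) {
                names.add(k.substring(prefix.length()));
            }
        }
        return names;
    }
}
```

Scoping every key by namespace mirrors how `getInstances(NamespaceId)` returns only the datasets of a single namespace; a real backend would persist this mapping rather than hold it in memory.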

### Metadata Management

Complete metadata management system for properties, tags, search, and indexing with support for custom indexing strategies and historical snapshots.

```java { .api }
public class MetadataDataset extends AbstractDataset {
    public MetadataChange setProperty(MetadataEntity metadataEntity, String key, String value);
    public MetadataChange addTags(MetadataEntity metadataEntity, Set<String> tagsToAdd);
    public Metadata getMetadata(MetadataEntity metadataEntity);
    public SearchResults search(SearchRequest request) throws BadRequestException;
    public Set<Metadata> getSnapshotBeforeTime(Set<MetadataEntity> metadataEntities, long timeMillis);
}
```

[Metadata Management](./metadata-management.md)
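
To make the relationship between properties, tags, and search concrete, here is a minimal in-memory sketch of an inverted index. `MetadataIndexSketch` is a hypothetical class, entities are plain strings, and matching is exact and case-insensitive; none of this reflects how `MetadataDataset` is actually implemented.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory metadata store with an inverted index for search. */
final class MetadataIndexSketch {
    private final Map<String, Map<String, String>> properties = new HashMap<>();
    private final Map<String, Set<String>> tags = new HashMap<>();
    // Inverted index: lower-cased term -> entities whose tags or property values equal it.
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Sets one property and refreshes the entity's index entries. */
    void setProperty(String entity, String key, String value) {
        properties.computeIfAbsent(entity, e -> new HashMap<>()).put(key, value);
        reindex(entity);
    }

    /** Adds tags and refreshes the entity's index entries. */
    void addTags(String entity, Set<String> tagsToAdd) {
        tags.computeIfAbsent(entity, e -> new TreeSet<>()).addAll(tagsToAdd);
        reindex(entity);
    }

    /** Returns a sorted copy of the entities indexed under the given term. */
    Set<String> search(String term) {
        String normalized = term.toLowerCase(Locale.ROOT);
        return new TreeSet<>(index.getOrDefault(normalized, Collections.emptySet()));
    }

    // Drops the entity from every term, then re-adds its current values,
    // so an overwritten property value no longer matches.
    private void reindex(String entity) {
        for (Set<String> entities : index.values()) {
            entities.remove(entity);
        }
        for (String value : properties.getOrDefault(entity, Collections.emptyMap()).values()) {
            addTerm(value, entity);
        }
        for (String tag : tags.getOrDefault(entity, Collections.emptySet())) {
            addTerm(tag, entity);
        }
    }

    private void addTerm(String term, String entity) {
        index.computeIfAbsent(term.toLowerCase(Locale.ROOT), t -> new TreeSet<>()).add(entity);
    }
}
```

Rebuilding an entity's entries on every write is the simplest way to keep the index consistent; a production index would more likely update only the affected terms.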

### Usage Registry

Program-dataset relationship tracking for governance, lineage analysis, and impact assessment with comprehensive query capabilities.

```java { .api }
public interface UsageRegistry extends UsageWriter {
    void unregister(ApplicationId applicationId);
    Set<DatasetId> getDatasets(ApplicationId id);
    Set<ProgramId> getPrograms(DatasetId id);
    Set<StreamId> getStreams(ProgramId id);
}
```

[Usage Registry](./usage-registry.md)
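
The queries above answer two questions: which datasets an application touches, and which programs touch a dataset. One way to support both is to keep a forward and a reverse map in step, as in the sketch below. `UsageRegistrySketch`, its `register` method, and the `"app.program"` key format are illustrative assumptions, not CDAP's implementation; the sketch also assumes application names contain no `.`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory usage registry with forward and reverse lookups. */
final class UsageRegistrySketch {
    // Forward map: "app.program" -> datasets it uses.
    private final Map<String, Set<String>> datasetsByProgram = new HashMap<>();
    // Reverse map: dataset -> "app.program" keys that use it.
    private final Map<String, Set<String>> programsByDataset = new HashMap<>();

    /** Records that a program of an application uses a dataset. */
    void register(String app, String program, String dataset) {
        String programKey = app + "." + program;
        datasetsByProgram.computeIfAbsent(programKey, k -> new TreeSet<>()).add(dataset);
        programsByDataset.computeIfAbsent(dataset, k -> new TreeSet<>()).add(programKey);
    }

    /** Returns every dataset used by any program of the application. */
    Set<String> getDatasets(String app) {
        String prefix = app + ".";
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : datasetsByProgram.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                result.addAll(e.getValue());
            }
        }
        return result;
    }

    /** Returns every program that uses the dataset. */
    Set<String> getPrograms(String dataset) {
        return new TreeSet<>(programsByDataset.getOrDefault(dataset, Collections.emptySet()));
    }

    /** Removes all usage records of an application from both maps. */
    void unregister(String app) {
        String prefix = app + ".";
        datasetsByProgram.keySet().removeIf(k -> k.startsWith(prefix));
        for (Set<String> programs : programsByDataset.values()) {
            programs.removeIf(p -> p.startsWith(prefix));
        }
    }
}
```

The reverse map is what makes impact assessment cheap: before changing a dataset, `getPrograms` lists every program that would be affected without scanning all applications.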

### Namespace Management

Namespace lifecycle management for multi-tenancy support with metadata persistence and comprehensive administrative operations.

```java { .api }
public interface NamespaceStore {
    NamespaceMeta create(NamespaceMeta metadata);
    void update(NamespaceMeta metadata);
    NamespaceMeta get(NamespaceId id);
    NamespaceMeta delete(NamespaceId id);
    List<NamespaceMeta> list();
}
```

[Namespace Management](./namespace-management.md)

### Audit and Compliance

Comprehensive audit logging system for compliance, monitoring, and debugging with pluggable publishers and structured payload builders.

```java { .api }
public interface AuditPublisher {
    void publish(EntityId entityId, AuditType auditType, AuditPayload auditPayload);
    void publish(MetadataEntity metadataEntity, AuditType auditType, AuditPayload auditPayload);
}
```

[Audit and Compliance](./audit-compliance.md)
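
The "pluggable publisher" idea can be sketched as a component that reports each operation to whatever sink it is given. All names below (`AuditEvent`, `AuditSink`, `InMemoryAuditSink`, `AuditedDatasetOps`) are hypothetical and exist only for illustration; the event type is a plain string rather than CDAP's `AuditType`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical audit event: which entity, what happened, and a free-form payload. */
final class AuditEvent {
    final String entity;
    final String type;     // for example "CREATE" or "ACCESS"
    final String payload;

    AuditEvent(String entity, String type, String payload) {
        this.entity = entity;
        this.type = type;
        this.payload = payload;
    }

    @Override
    public String toString() {
        return type + " " + entity + " " + payload;
    }
}

/** Pluggable destination for audit events: a log, a message topic, a test buffer. */
interface AuditSink {
    void publish(AuditEvent event);
}

/** Sink that keeps events in memory, which is convenient in unit tests. */
final class InMemoryAuditSink implements AuditSink {
    private final List<AuditEvent> events = new ArrayList<>();

    @Override
    public void publish(AuditEvent event) {
        events.add(event);
    }

    /** Returns a read-only view of the recorded events in publish order. */
    List<AuditEvent> events() {
        return Collections.unmodifiableList(events);
    }
}

/** Example component that reports every operation to the sink it was constructed with. */
final class AuditedDatasetOps {
    private final AuditSink sink;

    AuditedDatasetOps(AuditSink sink) {
        this.sink = sink;
    }

    void create(String dataset, String user) {
        // ... the actual dataset creation would happen here ...
        sink.publish(new AuditEvent(dataset, "CREATE", "user=" + user));
    }

    void read(String dataset, String user) {
        // ... the actual read would happen here ...
        sink.publish(new AuditEvent(dataset, "ACCESS", "user=" + user + ",access=READ"));
    }
}
```

Because `AuditSink` has a single method, a no-op sink can be supplied as the lambda `event -> { }` when auditing is disabled, without touching the calling code.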

### Transaction Management

Distributed transaction support with retry logic, consumer state management, and integration with Apache Tephra for ACID guarantees.

```java { .api }
public interface TransactionExecutorFactory extends org.apache.tephra.TransactionExecutorFactory {
    // Transaction executor creation with custom configuration
}

public interface TransactionSystemClient {
    // Transaction system client operations with distributed coordination
}
```

[Transaction Management](./transaction-management.md)
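
Retry logic for a transactional operation that may fail on conflict is commonly built on capped exponential backoff. The sketch below shows that general technique only; `RetrySketch`, `delayMillis`, and `callWithRetries` are hypothetical names, and the doubling schedule is an assumption rather than the behavior of CDAP or Tephra. The sleep function is injected so the logic can be exercised without real waiting.

```java
import java.util.concurrent.Callable;
import java.util.function.LongConsumer;

/** Hypothetical helper showing capped exponential backoff around a retried operation. */
final class RetrySketch {
    private RetrySketch() { }

    /**
     * Delay before the next attempt: initialDelay doubled once per earlier failure,
     * never exceeding maxDelay. failedAttempts starts at 1 for the first failure.
     */
    static long delayMillis(int failedAttempts, long initialDelay, long maxDelay) {
        long delay = initialDelay;
        for (int i = 1; i < failedAttempts && delay < maxDelay; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxDelay);
    }

    /**
     * Runs the operation up to maxAttempts times, calling sleeper between failed attempts.
     * Rethrows the last failure if every attempt fails.
     */
    static <T> T callWithRetries(Callable<T> operation, long initialDelay, long maxDelay,
                                 int maxAttempts, LongConsumer sleeper) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleeper.accept(delayMillis(attempt, initialDelay, maxDelay));
                }
            }
        }
        throw last;
    }
}
```

With an initial delay of 100 ms and a cap of 1000 ms, the waits after successive failures are 100, 200, 400, 800, and then 1000 ms. Only operations that are safe to repeat should be wrapped this way.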

### Stream Processing

Real-time stream processing capabilities with coordination, file management, partitioning, and multiple decoder support for various data formats.

```java { .api }
public interface StreamAdmin {
    // Stream administration and lifecycle operations
}

public interface StreamConsumer extends Closeable, TransactionAware {
    // Stream consumption with transaction support and state management
}
```

[Stream Processing](./stream-processing.md)

## Common Types

```java { .api }
// Core entity identifiers
public final class DatasetId extends EntityId {
    public static DatasetId of(String namespace, String dataset);
}

public final class NamespaceId extends EntityId {
    public static final NamespaceId DEFAULT = new NamespaceId("default");
    public static NamespaceId of(String namespace);
}

public final class ProgramId extends EntityId {
    // Program identification with application and program type context
}

public final class ApplicationId extends EntityId {
    // Application identification within namespace context
}

// Metadata entities
public interface MetadataEntity {
    // Metadata entity representation for flexible entity types
}

// Dataset properties and specifications
public final class DatasetProperties {
    public static Builder builder();
    public Map<String, String> getProperties();
}

public interface DatasetSpecification {
    String getName();
    String getType();
    DatasetProperties getProperties();
}

// Access and security
public enum AccessType {
    READ, WRITE, ADMIN, READ_WRITE
}

public final class KerberosPrincipalId {
    public static KerberosPrincipalId of(String principal);
}

// Metadata types
public enum MetadataScope {
    USER, SYSTEM
}

public final class MetadataRecordV2 {
    public MetadataEntity getMetadataEntity();
    public Map<String, String> getProperties();
    public Set<String> getTags();
    public MetadataScope getScope();
}

public final class ViewSpecification {
    // Stream view configuration specification
    public String getFormat();
    public Schema getSchema();
    public Map<String, String> getSettings();
}

public final class ViewDetail {
    // Complete view information including metadata
    public StreamViewId getId();
    public ViewSpecification getSpec();
    public Map<String, String> getProperties();
}

public final class StreamViewId extends EntityId {
    public static StreamViewId of(String namespace, String stream, String view);
    public StreamId getParent();
    public String getView();
}

public final class RetryStrategy {
    // Configurable retry policies for operations
    public static RetryStrategy noRetry();
    public static RetryStrategy exponentialDelay(long initialDelay, long maxDelay, int maxAttempts);
}

public final class TransactionContextFactory {
    // Factory for creating transaction contexts
}

// Exceptions
public class DatasetManagementException extends Exception {
    // Dataset operation failures with detailed error context
}

public class BadRequestException extends Exception {
    // Invalid request parameter handling
}

public class NotFoundException extends Exception {
    // Resource not found handling
}
```