Low-level memory management and unsafe operations library for Apache Spark with direct memory access and optimized data structures
npx @tessl/cli install tessl/maven-org-apache-spark--spark-unsafe_2.12@3.5.00
# Apache Spark Unsafe
1
2
Apache Spark Unsafe is a low-level memory management and unsafe operations library that provides direct memory access, optimized data structures, and platform-specific functionality for high-performance data processing within Apache Spark. This library enables bypassing standard Java memory safety mechanisms when necessary, offering direct access to off-heap memory, efficient serialization operations, and optimized data structures for large-scale distributed computing workloads.
3
4
## Package Information
5
6
- **Package Name**: spark-unsafe_2.12
7
- **Package Type**: Maven
8
- **Language**: Java
9
- **Group ID**: org.apache.spark
10
- **Artifact ID**: spark-unsafe_2.12
11
- **Installation**: Add to Maven `pom.xml`:
12
13
```xml
14
<dependency>
15
<groupId>org.apache.spark</groupId>
16
<artifactId>spark-unsafe_2.12</artifactId>
17
<version>3.5.6</version>
18
</dependency>
19
```
20
21
## Core Imports
22
23
```java
24
// Memory operations and platform utilities
25
import org.apache.spark.unsafe.Platform;
26
import org.apache.spark.unsafe.memory.MemoryAllocator;
27
import org.apache.spark.unsafe.memory.MemoryBlock;
28
import org.apache.spark.unsafe.memory.MemoryLocation;
29
import org.apache.spark.unsafe.memory.HeapMemoryAllocator;
30
import org.apache.spark.unsafe.memory.UnsafeMemoryAllocator;
31
32
// Data structures and arrays
33
import org.apache.spark.unsafe.array.LongArray;
34
import org.apache.spark.unsafe.array.ByteArrayMethods;
35
import org.apache.spark.unsafe.types.UTF8String;
36
import org.apache.spark.unsafe.types.UTF8StringBuilder;
37
import org.apache.spark.unsafe.types.CalendarInterval;
38
import org.apache.spark.unsafe.types.ByteArray;
39
40
// Hashing and utilities
41
import org.apache.spark.unsafe.hash.Murmur3_x86_32;
42
import org.apache.spark.unsafe.bitset.BitSetMethods;
43
import org.apache.spark.unsafe.UnsafeAlignedOffset;
44
import org.apache.spark.sql.catalyst.expressions.HiveHasher;
45
import org.apache.spark.sql.catalyst.util.DateTimeConstants;
46
```
47
48
## Basic Usage
49
50
```java
51
import org.apache.spark.unsafe.Platform;
52
import org.apache.spark.unsafe.memory.*;
53
import org.apache.spark.unsafe.types.UTF8String;
54
55
// Memory allocation example
56
MemoryAllocator allocator = MemoryAllocator.HEAP;
57
MemoryBlock block = allocator.allocate(1024);
58
59
// Direct memory operations
60
Platform.putLong(block.getBaseObject(), block.getBaseOffset(), 42L);
61
long value = Platform.getLong(block.getBaseObject(), block.getBaseOffset());
62
63
// UTF8String operations
64
UTF8String str = UTF8String.fromString("Hello Spark");
65
UTF8String upper = str.toUpperCase();
66
int length = str.numChars();
67
68
// Clean up
69
allocator.free(block);
70
```
71
72
## Architecture
73
74
Apache Spark Unsafe is built around several key components:
75
76
- **Memory Management**: Allocation strategies for heap and off-heap memory with automatic pooling
77
- **Platform Layer**: Low-level unsafe operations that provide direct memory access across different platforms
78
- **Data Structures**: Memory-efficient arrays, strings, and bitsets optimized for Spark's use cases
79
- **Type System**: Specialized types like UTF8String and CalendarInterval designed for internal Spark operations
80
- **Hashing**: Fast hashing implementations for data distribution and partitioning
81
- **Safety Features**: Debug modes and alignment handling for platform-specific memory requirements
82
83
## Capabilities
84
85
### Memory Management
86
87
Core memory allocation and management functionality providing both heap and off-heap memory strategies with automatic pooling and debug support.
88
89
```java { .api }
90
interface MemoryAllocator {
91
MemoryBlock allocate(long size) throws OutOfMemoryError;
92
void free(MemoryBlock memory);
93
94
static final MemoryAllocator HEAP = new HeapMemoryAllocator();
95
static final MemoryAllocator UNSAFE = new UnsafeMemoryAllocator();
96
}
97
98
class MemoryBlock extends MemoryLocation {
99
public MemoryBlock(Object obj, long offset, long length);
100
public long size();
101
public void fill(byte value);
102
public static MemoryBlock fromLongArray(long[] array);
103
}
104
```
105
106
[Memory Management](./memory-management.md)
107
108
### Platform Operations
109
110
Low-level unsafe memory operations and platform-specific functionality for direct memory access, copying, and platform feature detection.
111
112
```java { .api }
113
final class Platform {
114
// Memory access methods for all primitive types
115
public static int getInt(Object object, long offset);
116
public static void putInt(Object object, long offset, int value);
117
public static long getLong(Object object, long offset);
118
public static void putLong(Object object, long offset, long value);
119
public static byte getByte(Object object, long offset);
120
public static void putByte(Object object, long offset, byte value);
121
public static short getShort(Object object, long offset);
122
public static void putShort(Object object, long offset, short value);
123
public static float getFloat(Object object, long offset);
124
public static void putFloat(Object object, long offset, float value);
125
public static double getDouble(Object object, long offset);
126
public static void putDouble(Object object, long offset, double value);
127
public static boolean getBoolean(Object object, long offset);
128
public static void putBoolean(Object object, long offset, boolean value);
129
130
// Volatile operations
131
public static Object getObjectVolatile(Object object, long offset);
132
public static void putObjectVolatile(Object object, long offset, Object value);
133
134
// Memory allocation and management
135
public static long allocateMemory(long size);
136
public static void freeMemory(long address);
137
public static long reallocateMemory(long address, long oldSize, long newSize);
138
public static void copyMemory(Object src, long srcOffset, Object dst, long dstOffset, long length);
139
public static void setMemory(Object object, long offset, long size, byte value);
140
public static void setMemory(long address, byte value, long size);
141
142
// Platform detection and utilities
143
public static boolean unaligned();
144
public static boolean cleanerCreateMethodIsDefined();
145
public static void throwException(Throwable t);
146
public static ByteBuffer allocateDirectBuffer(int size);
147
}
148
```
149
150
[Platform Operations](./platform-operations.md)
151
152
### UTF8 String Operations
153
154
Memory-efficient UTF-8 string implementation with comprehensive string manipulation, parsing, and comparison operations optimized for Spark's internal use.
155
156
```java { .api }
157
final class UTF8String implements Comparable<UTF8String>, Externalizable, KryoSerializable, Cloneable {
158
// Factory methods
159
public static UTF8String fromString(String str);
160
public static UTF8String fromBytes(byte[] bytes);
161
public static UTF8String concat(UTF8String... inputs);
162
163
// String operations
164
public UTF8String substring(int start, int until);
165
public UTF8String toUpperCase();
166
public UTF8String trim();
167
public boolean contains(UTF8String substring);
168
169
// Parsing
170
public long toLongExact();
171
public int toIntExact();
172
}
173
```
174
175
[UTF8 String Operations](./utf8-string-operations.md)
176
177
### Array Operations
178
179
Efficient array implementations and utility methods for byte arrays and long arrays with support for both heap and off-heap memory storage.
180
181
```java { .api }
182
final class LongArray {
183
public LongArray(MemoryBlock memory);
184
public long get(int index);
185
public void set(int index, long value);
186
public long size();
187
public void zeroOut();
188
}
189
190
class ByteArrayMethods {
191
public static boolean arrayEquals(Object leftBase, long leftOffset, Object rightBase, long rightOffset, long length);
192
public static boolean contains(byte[] arr, byte[] sub);
193
public static long nextPowerOf2(long num);
194
}
195
```
196
197
[Array Operations](./array-operations.md)
198
199
### Hashing and Utilities
200
201
High-performance hashing implementations and utility classes including Murmur3 hashing, bitset operations, date/time constants, and platform alignment utilities.
202
203
```java { .api }
204
final class Murmur3_x86_32 {
205
public Murmur3_x86_32(int seed);
206
public static int hashInt(int input, int seed);
207
public static int hashLong(long input, int seed);
208
public static int hashUnsafeBytes(Object base, long offset, int lengthInBytes, int seed);
209
}
210
211
final class BitSetMethods {
212
public static void set(Object baseObject, long baseOffset, int index);
213
public static boolean isSet(Object baseObject, long baseOffset, int index);
214
public static int nextSetBit(Object baseObject, long baseOffset, int fromIndex, int bitsetSizeInWords);
215
}
216
217
class HiveHasher {
218
public static int hashInt(int input);
219
public static int hashLong(long input);
220
public static int hashUnsafeBytes(Object base, long offset, int lengthInBytes);
221
}
222
223
class DateTimeConstants {
224
public static final int MONTHS_PER_YEAR = 12;
225
public static final long HOURS_PER_DAY = 24L;
226
public static final long SECONDS_PER_MINUTE = 60L;
227
public static final long MILLIS_PER_SECOND = 1000L;
228
public static final long MICROS_PER_MILLIS = 1000L;
229
public static final long NANOS_PER_MICROS = 1000L;
230
}
231
232
class UnsafeAlignedOffset {
233
public static int getUaoSize();
234
public static int getSize(Object object, long offset);
235
public static void putSize(Object object, long offset, int value);
236
}
237
```
238
239
[Hashing and Utilities](./hashing-utilities.md)
240
241
## Types
242
243
### Core Memory Types
244
245
```java { .api }
246
class MemoryLocation {
247
public MemoryLocation(Object obj, long offset);
248
public final Object getBaseObject();
249
public final long getBaseOffset();
250
}
251
252
abstract class KVIterator<K, V> {
253
public abstract boolean next() throws IOException;
254
public abstract K getKey();
255
public abstract V getValue();
256
public abstract void close();
257
}
258
```
259
260
### Specialized Data Types
261
262
```java { .api }
263
final class CalendarInterval implements Serializable {
264
public final int months;
265
public final int days;
266
public final long microseconds;
267
268
public CalendarInterval(int months, int days, long microseconds);
269
}
270
271
class UTF8StringBuilder {
272
public UTF8StringBuilder();
273
public UTF8StringBuilder(int initialSize);
274
public void append(UTF8String value);
275
public UTF8String build();
276
}
277
278
final class ByteArray {
279
public static final byte[] EMPTY_BYTE;
280
public static void writeToMemory(byte[] src, Object target, long targetOffset);
281
public static long getPrefix(byte[] bytes);
282
public static int compareBinary(byte[] leftBase, byte[] rightBase);
283
public static byte[] subStringSQL(byte[] bytes, int pos, int len);
284
public static byte[] concat(byte[]... inputs);
285
}
286
```
287
288
## Error Handling
289
290
The library throws standard Java exceptions:
291
- `OutOfMemoryError` - When memory allocation fails
292
- `IOException` - During iterator operations
293
- `NumberFormatException` - During string parsing operations
294
- `IndexOutOfBoundsException` - For array access violations
295
296
Most operations are designed to be fail-fast and do not perform bounds checking for performance reasons. Users should validate inputs before calling unsafe operations.
297
298
## Performance Considerations
299
300
- **Zero-Copy Operations**: Many string and array operations avoid memory copying
301
- **Direct Memory Access**: Bypasses JVM safety checks for maximum performance
302
- **Memory Pooling**: Automatic pooling for large heap allocations
303
- **Platform Optimization**: Platform-specific optimizations for memory alignment
304
- **Lazy Evaluation**: Some operations use lazy evaluation patterns for efficiency