JavaScript Unicode normalization library providing NFC, NFD, NFKC, NFKD forms and String.prototype.normalize polyfill
npx @tessl/cli install tessl/npm-unorm@1.6.0
# Unorm

Unorm is a JavaScript Unicode normalization library that provides all four Unicode normalization forms (NFC, NFD, NFKC, NFKD) according to the Unicode 8.0 standard. It works both as a standalone library and as a polyfill for `String.prototype.normalize()` in environments that lack native support.

## Package Information

- **Package Name**: unorm
- **Package Type**: npm
- **Language**: JavaScript
- **Installation**: `npm install unorm`

## Core Imports

```javascript
const unorm = require('unorm');
```

For AMD (RequireJS):

```javascript
define(['unorm'], function(unorm) {
  // Use unorm functions
});
```

In browser (global):

```javascript
// Available as global unorm object
unorm.nfc(string);
```

## Basic Usage

```javascript
const unorm = require('unorm');

// Example text with mixed Unicode forms
const text = 'The \u212B symbol invented by A. J. \u00C5ngstr\u00F6m';

// Apply different normalization forms
const nfcText = unorm.nfc(text);   // Canonical composition
const nfdText = unorm.nfd(text);   // Canonical decomposition
const nfkcText = unorm.nfkc(text); // Compatibility composition
const nfkdText = unorm.nfkd(text); // Compatibility decomposition

console.log('Original:', text);
console.log('NFC:', nfcText);
console.log('NFD:', nfdText);
console.log('NFKC:', nfkcText);
console.log('NFKD:', nfkdText);

// Same result through String.prototype.normalize (native, or unorm's polyfill)
console.log('Polyfill:', text.normalize('NFC'));
```

## Architecture

Unorm implements the normalization algorithm defined in Unicode Standard Annex #15 and ships the Unicode data tables that algorithm needs.

### Core Components

- **Normalization Engine**: Unicode character decomposition and composition engine with built-in Unicode data tables
- **Polyfill System**: Automatic detection and implementation of `String.prototype.normalize()` when native support is unavailable
- **Multi-Environment Support**: Works consistently across CommonJS (Node.js), AMD (RequireJS), and browser global contexts

### Unicode Normalization Forms

Unicode normalization addresses the fact that the same text can be represented in multiple ways using different combinations of base characters and combining marks.

**Canonical vs Compatibility:**
- **Canonical**: Deals with different representations of the same abstract character (e.g., é as single codepoint vs. e + combining accent)
- **Compatibility**: Also handles formatting differences and alternative representations (e.g., superscript/subscript digits)

**Decomposition vs Composition:**
- **Decomposition**: Breaks composite characters into base characters plus combining marks
- **Composition**: Combines base characters and marks into single composite characters where possible

**The Four Forms:**
- **NFC** (Canonical Composition): Most common form, produces composed characters when possible
- **NFD** (Canonical Decomposition): Breaks down composed characters, useful for mark removal and analysis
- **NFKC** (Compatibility Composition): Like NFC but also normalizes compatibility characters (superscripts, etc.)
- **NFKD** (Compatibility Decomposition): Most decomposed form, ideal for search and indexing operations

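To make the four forms concrete, the sketch below prints the code points each form produces for two inputs: the angstrom sign (a canonical case) and a superscript two (a compatibility case). It calls the built-in `String.prototype.normalize()`, the method unorm polyfills, so that it runs without dependencies; `unorm.nfc`, `unorm.nfd`, `unorm.nfkc`, and `unorm.nfkd` are expected to return the same strings. The `codePoints` helper is a hypothetical name introduced only for this illustration.

```javascript
// Hypothetical helper: list a string's code points as U+XXXX labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  ).join(' ');
}

const angstromSign = '\u212B'; // ANGSTROM SIGN, canonically equivalent to U+00C5
const squared = 'x\u00B2';     // SUPERSCRIPT TWO, compatibility equivalent to "2"

// The built-in normalize() stands in for unorm.nfc / nfd / nfkc / nfkd here.
for (const form of ['NFC', 'NFD', 'NFKC', 'NFKD']) {
  console.log(
    form,
    codePoints(angstromSign.normalize(form)),
    '|',
    codePoints(squared.normalize(form))
  );
}
```

The angstrom sign changes under every form (to U+00C5 when composed, to U+0041 U+030A when decomposed), while the superscript two is only rewritten to a plain "2" by the two compatibility forms.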
### Polyfill Mechanism

On load, the library checks whether `String.prototype.normalize()` already exists. If it does not, unorm defines the method with `Object.defineProperty()`, including the error handling the ECMAScript specification requires (a `TypeError` for a null or undefined receiver, a `RangeError` for an unknown form). The `shimApplied` property reports whether the polyfill was installed.

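The detect-then-define pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not unorm's actual source: `installNormalizeShim`, `forms`, and `fakeProto` are hypothetical names, and the per-form functions simply delegate to the built-in method where unorm would use its own engine.

```javascript
// Hypothetical helper, not unorm's source: define `normalize` on `proto`
// only when it is missing, and report whether the shim was installed.
function installNormalizeShim(proto, forms) {
  if (typeof proto.normalize === 'function') {
    return false; // an existing implementation is left untouched
  }
  Object.defineProperty(proto, 'normalize', {
    enumerable: false, // keep the method out of for...in loops
    configurable: true,
    writable: true,
    value: function normalize(form) {
      'use strict'; // so `this` can really be null or undefined
      if (this === null || this === undefined) {
        throw new TypeError('normalize called on null or undefined');
      }
      const name = form === undefined ? 'NFC' : String(form);
      if (!Object.prototype.hasOwnProperty.call(forms, name)) {
        throw new RangeError('Invalid normalization form: ' + name);
      }
      return forms[name](String(this));
    },
  });
  return true;
}

// Stand-in implementations; with unorm these would be unorm.nfc, unorm.nfd, ...
const forms = {
  NFC: (s) => s.normalize('NFC'),
  NFD: (s) => s.normalize('NFD'),
  NFKC: (s) => s.normalize('NFKC'),
  NFKD: (s) => s.normalize('NFKD'),
};

// Demonstrate on a bare object rather than String.prototype to avoid side effects.
const fakeProto = {};
const shimApplied = installNormalizeShim(fakeProto, forms);
console.log(shimApplied);
console.log(fakeProto.normalize.call('e\u0301') === '\u00E9');
```

Defining the method as non-enumerable mirrors how built-in prototype methods behave, which is why `Object.defineProperty()` is used instead of a plain assignment.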
## Capabilities

### Canonical Composition (NFC)

Applies canonical decomposition followed by canonical composition to produce a composed form.

```javascript { .api }
/**
 * Normalize string using Canonical Decomposition followed by Canonical Composition
 * @param {string} str - String to normalize
 * @returns {string} NFC normalized string
 */
function nfc(str);
```

**Usage Example:**

```javascript
const unorm = require('unorm');

// Combining characters are composed into single codepoints when possible
const result = unorm.nfc('a\u0308'); // a + combining diaeresis -> ä (single codepoint)
console.log(result); // "ä" (U+00E4)
```

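Composition only happens "when possible": where Unicode defines no precomposed character, NFC leaves the marks separate but still puts them into canonical order. The sketch below shows this with a `q` carrying two dots. It uses the built-in `normalize()` as a stand-in for `unorm.nfc`, on the assumption that both follow UAX #15 and agree on this input.

```javascript
// Stand-in for unorm.nfc (assumption: same result for this input).
const nfc = (s) => s.normalize('NFC');

// q + COMBINING DOT ABOVE (U+0307) + COMBINING DOT BELOW (U+0323):
// Unicode has no precomposed character for this sequence.
const input = 'q\u0307\u0323';
const output = nfc(input);

// Nothing is composed, so the result still has three code units.
console.log(output.length);

// The marks are reordered: dot below (combining class 220) sorts
// before dot above (combining class 230).
console.log(output === 'q\u0323\u0307');

// Both input orderings therefore normalize to the same string.
console.log(nfc('q\u0323\u0307') === output);
```

This reordering is what lets two visually identical strings compare equal after normalization even when their marks were typed in a different order.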
### Canonical Decomposition (NFD)

Applies canonical decomposition to produce a decomposed form where composite characters are broken down into base characters plus combining marks.

```javascript { .api }
/**
 * Normalize string using Canonical Decomposition
 * @param {string} str - String to normalize
 * @returns {string} NFD normalized string
 */
function nfd(str);
```

**Usage Example:**

```javascript
const unorm = require('unorm');

// Composite characters are decomposed into base + combining marks
const result = unorm.nfd('\u00E4'); // ä (single codepoint) -> a + combining diaeresis
console.log(result); // "a\u0308"
```

### Compatibility Composition (NFKC)

Applies compatibility decomposition followed by canonical composition, replacing compatibility characters with their compatibility equivalents.

```javascript { .api }
/**
 * Normalize string using Compatibility Decomposition followed by Canonical Composition
 * @param {string} str - String to normalize
 * @returns {string} NFKC normalized string
 */
function nfkc(str);
```

**Usage Example:**

```javascript
const unorm = require('unorm');

// Compatibility characters like subscripts are replaced with normal equivalents
const result = unorm.nfkc('CO₂'); // Subscript 2 becomes normal 2
console.log(result); // "CO2"
```

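The difference between NFC and NFKC shows up only on compatibility characters. The following sketch compares the two on a ligature, a circled digit, a fullwidth letter, and an ordinary precomposed letter. The built-in `normalize()` stands in for `unorm.nfc` and `unorm.nfkc` (an assumption made so the snippet runs without installing anything).

```javascript
// Stand-ins for unorm.nfc / unorm.nfkc (assumption: same results as the built-in method).
const nfc = (s) => s.normalize('NFC');
const nfkc = (s) => s.normalize('NFKC');

const samples = [
  '\uFB01', // LATIN SMALL LIGATURE FI
  '\u2460', // CIRCLED DIGIT ONE
  '\uFF21', // FULLWIDTH LATIN CAPITAL LETTER A
  '\u00E9', // LATIN SMALL LETTER E WITH ACUTE (no compatibility mapping)
];

for (const s of samples) {
  // NFC keeps each sample as is; NFKC maps the first three to "fi", "1", "A".
  console.log(JSON.stringify(nfc(s)), '->', JSON.stringify(nfkc(s)));
}
```

Because the compatibility mapping discards formatting information (the circle, the ligature, the width), NFKC output cannot be converted back to the original text; it suits matching and indexing better than storage.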
### Compatibility Decomposition (NFKD)

Applies compatibility decomposition, replacing compatibility characters with their compatibility equivalents and breaking composite characters into base characters plus combining marks.

```javascript { .api }
/**
 * Normalize string using Compatibility Decomposition
 * @param {string} str - String to normalize
 * @returns {string} NFKD normalized string
 */
function nfkd(str);
```

**Usage Example:**

```javascript
const unorm = require('unorm');

// Useful for search/indexing: decompose first, then strip the combining marks
const text = 'Ångström';
const normalized = unorm.nfkd(text);
const withoutMarks = normalized.replace(/[\u0300-\u036F]/g, ''); // Remove combining marks
console.log(withoutMarks); // "Angstrom"
```

### String.prototype.normalize Polyfill

Automatically provides the `String.prototype.normalize()` method when the JavaScript environment does not supply it natively.

```javascript { .api }
/**
 * Polyfill for String.prototype.normalize method
 * @param {string} [form="NFC"] - Normalization form: "NFC", "NFD", "NFKC", or "NFKD"
 * @returns {string} Normalized string according to specified form
 * @throws {TypeError} When called on null or undefined
 * @throws {RangeError} When invalid normalization form provided
 */
String.prototype.normalize(form);
```

**Usage Examples:**

```javascript
// When native normalize() isn't available, unorm provides it
require('unorm'); // Automatically adds polyfill if needed

const text = 'café';
console.log(text.normalize('NFC')); // Composed form (unorm's implementation if the shim was applied)
console.log(text.normalize('NFD')); // Decomposes é into e + combining acute accent
console.log(text.normalize('NFKC')); // Same as NFC for this example
console.log(text.normalize('NFKD')); // Same as NFD for this example

// Error handling
try {
  text.normalize('INVALID'); // Throws RangeError
} catch (error) {
  // With unorm's shim: "Invalid normalization form: INVALID"; native engines word it differently
  console.error(error.message);
}
```

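Since the error message text depends on which implementation is active, it is safer to branch on the error type than on the message. The sketch below wraps `normalize()` in a hypothetical `safeNormalize` helper (a name introduced here for illustration) that falls back to NFC when the form name is rejected; it should behave the same whether the method is native or supplied by unorm, given that both are documented to throw `RangeError`.

```javascript
// Hypothetical helper: normalize, falling back to NFC when the form name is invalid.
function safeNormalize(str, form) {
  try {
    return str.normalize(form);
  } catch (error) {
    if (error instanceof RangeError) {
      return str.normalize('NFC'); // unknown form: use the default instead
    }
    throw error; // anything else is unexpected, so let it propagate
  }
}

const decomposed = 'e\u0301'; // e + combining acute accent

console.log(safeNormalize(decomposed, 'NFD').length);     // stays decomposed
console.log(safeNormalize(decomposed, 'INVALID').length); // fell back to NFC
console.log(decomposed.normalize().length);               // omitted form defaults to "NFC"
```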
### Polyfill Status Detection

Property to check whether the `String.prototype.normalize` polyfill was applied.

```javascript { .api }
/**
 * Boolean indicating whether String.prototype.normalize polyfill was applied
 * @type {boolean}
 */
unorm.shimApplied;
```

**Usage Example:**

```javascript
const unorm = require('unorm');

if (unorm.shimApplied) {
  console.log('String.prototype.normalize polyfill was applied');
} else {
  console.log('Native String.prototype.normalize is available');
}
```

## Types

```typescript { .api }
/**
 * Main unorm module interface
 */
interface UnormModule {
  /** Canonical Decomposition followed by Canonical Composition */
  nfc: (str: string) => string;
  /** Canonical Decomposition */
  nfd: (str: string) => string;
  /** Compatibility Decomposition followed by Canonical Composition */
  nfkc: (str: string) => string;
  /** Compatibility Decomposition */
  nfkd: (str: string) => string;
  /** Whether String.prototype.normalize polyfill was applied */
  shimApplied: boolean;
}

/**
 * Valid normalization forms for String.prototype.normalize
 */
type NormalizationForm = "NFC" | "NFD" | "NFKC" | "NFKD";
```

## Common Use Cases

### Text Search and Indexing

```javascript
const unorm = require('unorm');

function normalizeForSearch(text) {
  // Use NFKD to decompose, then remove combining marks for search
  const decomposed = unorm.nfkd(text);
  return decomposed.replace(/[\u0300-\u036F]/g, ''); // Remove combining marks
}

const searchTerm = normalizeForSearch('café');
// Named `haystack` rather than `document` to avoid clashing with the browser global
const haystack = normalizeForSearch('I love café au lait');
console.log(haystack.includes(searchTerm)); // true
```

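The character class `[\u0300-\u036F]` covers only the Combining Diacritical Marks block. Where an ES2018 engine is available, a Unicode property escape can strip marks from every block. The variant below is a sketch under that assumption: `foldForSearch` is a hypothetical name, the lowercasing step is an addition for case-insensitive matching, and the built-in `normalize()` stands in for `unorm.nfkd`.

```javascript
// Hypothetical variant of normalizeForSearch; requires ES2018 for \p{M}.
// The built-in normalize() stands in for unorm.nfkd here.
function foldForSearch(text) {
  return text
    .normalize('NFKD')      // decompose, including compatibility characters
    .replace(/\p{M}/gu, '') // drop every combining mark, whatever its block
    .toLowerCase();         // added step: make matching case-insensitive
}

console.log(foldForSearch('Crème Brûlée'));

// U+20D7 COMBINING RIGHT ARROW ABOVE lies outside U+0300-U+036F,
// so the narrower character class would leave it in place.
console.log(foldForSearch('x\u20D7'));
```

Note that `\p{M}` also matches marks that carry meaning in some scripts (for example vowel signs in Indic scripts), so this broader pattern is best limited to text where stripping all marks is acceptable.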
### String Comparison

```javascript
const unorm = require('unorm');

function compareStrings(str1, str2) {
  // Normalize both strings to same form for accurate comparison
  return unorm.nfc(str1) === unorm.nfc(str2);
}

const text1 = '\u00E9';  // é as a single codepoint
const text2 = 'e\u0301'; // e + combining acute accent
console.log(compareStrings(text1, text2)); // true
```

### Data Cleaning

```javascript
const unorm = require('unorm');

function cleanUserInput(input) {
  // Normalize to consistent form and trim
  return unorm.nfc(input.trim());
}

const userInput = ' cafe\u0301 '; // Padded, with é typed as e + combining accent
const cleaned = cleanUserInput(userInput);
console.log(cleaned); // "café", with é as the single codepoint U+00E9
```

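A common follow-up to cleaning is deduplication, since values that differ only in their Unicode representation would otherwise be stored twice. The sketch below normalizes each value and collects the results in a `Set`. `dedupeByNfc` is a hypothetical helper, and the built-in `normalize()` stands in for `unorm.nfc`.

```javascript
// Hypothetical helper; the built-in normalize() stands in for unorm.nfc.
function dedupeByNfc(values) {
  // Trim and normalize first, then let the Set drop exact duplicates.
  return Array.from(new Set(values.map((v) => v.trim().normalize('NFC'))));
}

const tags = [
  'caf\u00E9',   // precomposed é
  'cafe\u0301',  // e + combining acute accent
  ' caf\u00E9 ', // padded
  'cafe',        // genuinely different: no accent
];

console.log(dedupeByNfc(tags)); // two entries remain: "café" and "cafe"
```

NFC is used here rather than NFKC so that deliberately distinct compatibility characters are not merged.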
## Browser Compatibility

- **Modern Browsers**: Works in all modern browsers; these provide `String.prototype.normalize()` natively, so the polyfill is not applied there
- **Legacy Support**: Requires ES5 features (`Object.defineProperty`)
- **Recommended**: Use [es5-shim](https://github.com/kriskowal/es5-shim) for older browsers
- **Node.js**: Supports Node.js >= 0.4.0