
# Unorm

Unorm is a JavaScript Unicode normalization library that provides all four Unicode normalization forms (NFC, NFD, NFKC, NFKD) according to the Unicode 8.0 standard. It serves as both a standalone library and a polyfill for `String.prototype.normalize()` in environments that don't natively support it.

## Package Information

- **Package Name**: unorm
- **Package Type**: npm
- **Language**: JavaScript
- **Installation**: `npm install unorm`

## Core Imports

```javascript
const unorm = require('unorm');
```

For AMD (RequireJS):

```javascript
define(['unorm'], function(unorm) {
  // Use unorm functions
});
```

In browser (global):

```javascript
// Available as global unorm object
unorm.nfc(string);
```

## Basic Usage

```javascript
const unorm = require('unorm');

// Example text with mixed Unicode forms
const text = 'The \u212B symbol invented by A. J. \u00C5ngstr\u00F6m';

// Apply different normalization forms
const nfcText = unorm.nfc(text);   // Canonical composition
const nfdText = unorm.nfd(text);   // Canonical decomposition
const nfkcText = unorm.nfkc(text); // Compatibility composition
const nfkdText = unorm.nfkd(text); // Compatibility decomposition

console.log('Original:', text);
console.log('NFC:', nfcText);
console.log('NFD:', nfdText);
console.log('NFKC:', nfkcText);
console.log('NFKD:', nfkdText);

// Using as String.prototype.normalize polyfill
console.log('Polyfill:', text.normalize('NFC'));
```

## Architecture

Unorm implements Unicode normalization according to Unicode Standard Annex #15, providing a comprehensive solution for text normalization in JavaScript environments.

### Core Components

- **Normalization Engine**: Unicode character decomposition and composition engine with built-in Unicode data tables
- **Polyfill System**: Automatic detection and implementation of `String.prototype.normalize()` when native support is unavailable
- **Multi-Environment Support**: Works consistently across CommonJS (Node.js), AMD (RequireJS), and browser global contexts

### Unicode Normalization Forms

Unicode normalization addresses the fact that the same text can be represented in multiple ways using different combinations of base characters and combining marks.

**Canonical vs Compatibility:**

- **Canonical**: Deals with different representations of the same abstract character (e.g., é as single codepoint vs. e + combining accent)
- **Compatibility**: Also handles formatting differences and alternative representations (e.g., superscript/subscript digits)

**Decomposition vs Composition:**

- **Decomposition**: Breaks composite characters into base characters plus combining marks
- **Composition**: Combines base characters and marks into single composite characters where possible

**The Four Forms:**

- **NFC** (Canonical Composition): Most common form, produces composed characters when possible
- **NFD** (Canonical Decomposition): Breaks down composed characters, useful for mark removal and analysis
- **NFKC** (Compatibility Composition): Like NFC but also normalizes compatibility characters (superscripts, etc.)
- **NFKD** (Compatibility Decomposition): Most decomposed form, ideal for search and indexing operations
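
To make the four forms concrete, the sketch below normalizes one short string in each form and prints the resulting code points. It assumes a runtime with a native `String.prototype.normalize()`; with unorm loaded, `unorm.nfc(s)`, `unorm.nfd(s)`, `unorm.nfkc(s)` and `unorm.nfkd(s)` are the documented equivalents. The `codePoints` helper is hypothetical and exists only for this illustration.

```javascript
// Hypothetical helper: render a string as space-separated U+XXXX code points.
function codePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  ).join(' ');
}

// Precomposed "é" (U+00E9) followed by the "ﬁ" ligature (U+FB01).
const sample = '\u00E9\uFB01';

// Normalize the same input in every form and record the code points.
const forms = {};
for (const form of ['NFC', 'NFD', 'NFKC', 'NFKD']) {
  forms[form] = codePoints(sample.normalize(form));
}

console.log(forms.NFC);  // U+00E9 U+FB01          (already composed, ligature kept)
console.log(forms.NFD);  // U+0065 U+0301 U+FB01   (é split, ligature kept)
console.log(forms.NFKC); // U+00E9 U+0066 U+0069   (ligature replaced by "fi")
console.log(forms.NFKD); // U+0065 U+0301 U+0066 U+0069
```

The canonical forms leave the ligature alone because it is only a compatibility variant of "fi"; the compatibility forms replace it.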

### Polyfill Mechanism

The library automatically detects if `String.prototype.normalize()` is available in the current environment. If not present, it adds the method using `Object.defineProperty()` with proper error handling that matches the ECMAScript specification. The `shimApplied` property indicates whether the polyfill was activated.
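
The detect-then-define pattern described above can be sketched roughly as follows. This is not unorm's actual source: `installNormalizeShim`, `VALID_FORMS`, and the `normalizeImpl` callback are hypothetical names, and the demo installs the method on a plain object rather than on `String.prototype` so it has no side effects.

```javascript
// Hypothetical sketch of a feature-detecting shim installer (not unorm's real code).
const VALID_FORMS = ['NFC', 'NFD', 'NFKC', 'NFKD'];

// `normalizeImpl(str, form)` stands in for the library's internal normalization routine.
function installNormalizeShim(proto, normalizeImpl) {
  if (typeof proto.normalize === 'function') {
    return false; // an existing implementation is left untouched
  }
  Object.defineProperty(proto, 'normalize', {
    enumerable: false, // keep the method out of for...in loops
    configurable: true,
    writable: true,
    value: function normalize(form) {
      'use strict'; // so that `this` can actually be null or undefined
      if (this === null || this === undefined) {
        throw new TypeError('normalize called on null or undefined');
      }
      const str = String(this);
      const chosen = form === undefined ? 'NFC' : String(form);
      if (!VALID_FORMS.includes(chosen)) {
        throw new RangeError('Invalid normalization form: ' + chosen);
      }
      return normalizeImpl(str, chosen);
    },
  });
  return true; // comparable in spirit to unorm.shimApplied
}

// Demo on a stand-in object; the native method plays the role of the engine.
const fakeProto = {};
const nativeEngine = (str, form) => String.prototype.normalize.call(str, form);
const applied = installNormalizeShim(fakeProto, nativeEngine);      // true
const appliedAgain = installNormalizeShim(fakeProto, nativeEngine); // false
console.log(applied, appliedAgain);
console.log(fakeProto.normalize.call('e\u0301') === '\u00E9'); // true (defaults to NFC)
```

Defining the method as non-enumerable keeps it invisible to `for...in` loops, which is a common reason to prefer `Object.defineProperty()` over plain assignment.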

## Capabilities

### Canonical Composition (NFC)

Applies canonical decomposition followed by canonical composition to produce a composed form.

```javascript { .api }
/**
 * Normalize string using Canonical Decomposition followed by Canonical Composition
 * @param {string} str - String to normalize
 * @returns {string} NFC normalized string
 */
function nfc(str);
```

**Usage Example:**

```javascript
const unorm = require('unorm');

// Combining characters are composed into single codepoints when possible
const result = unorm.nfc('a\u0308'); // ä (combining diaeresis) -> ä (single codepoint)
console.log(result); // "\u00e4"
```
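
Because NFC decomposes before it recomposes, it also unifies characters that have more than one precomposed encoding. The sketch below, which assumes a native `String.prototype.normalize()` (the documented unorm equivalent is `unorm.nfc(s)`), shows the Angstrom sign from the Basic Usage text collapsing to the ordinary "Å", and that applying NFC twice changes nothing.

```javascript
// Three encodings of the same abstract character.
const angstromSign = '\u212B'; // ANGSTROM SIGN
const precomposed = '\u00C5';  // LATIN CAPITAL LETTER A WITH RING ABOVE
const sequence = 'A\u030A';    // "A" + COMBINING RING ABOVE

const nfcResults = [angstromSign, precomposed, sequence].map((s) => s.normalize('NFC'));

// All three inputs end up as the single code point U+00C5.
const allEqual = nfcResults.every((s) => s === '\u00C5');
console.log(allEqual); // true

// NFC is idempotent: normalizing an already normalized string is a no-op.
const isIdempotent = nfcResults.every((s) => s.normalize('NFC') === s);
console.log(isIdempotent); // true
```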

### Canonical Decomposition (NFD)

Applies canonical decomposition to produce a decomposed form where composite characters are broken down into base characters plus combining marks.

```javascript { .api }
/**
 * Normalize string using Canonical Decomposition
 * @param {string} str - String to normalize
 * @returns {string} NFD normalized string
 */
function nfd(str);
```

**Usage Example:**

```javascript
const unorm = require('unorm');

// Composite characters are decomposed into base + combining marks
const result = unorm.nfd('ä'); // ä (single codepoint) -> a + combining diaeresis
console.log(result); // "a\u0308"
```
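
Decomposition also puts combining marks into a canonical order, which matters when the same marks were typed in a different sequence. A small sketch, again assuming a native `String.prototype.normalize()` in place of `unorm.nfd(s)`:

```javascript
// "q" with a dot above (U+0307) and a dot below (U+0323), typed in two different orders.
const aboveFirst = 'q\u0307\u0323';
const belowFirst = 'q\u0323\u0307';

// The raw strings differ, but NFD reorders the marks by combining class,
// so both inputs normalize to the same sequence (dot below first).
const rawEqual = aboveFirst === belowFirst;
const nfdEqual = aboveFirst.normalize('NFD') === belowFirst.normalize('NFD');
console.log(rawEqual, nfdEqual); // false true
console.log(aboveFirst.normalize('NFD') === 'q\u0323\u0307'); // true

// Marks of the same combining class (both attach above) keep their typed order,
// because swapping them would change how the text renders.
const acuteThenDiaeresis = 'a\u0301\u0308'.normalize('NFD');
const diaeresisThenAcute = 'a\u0308\u0301'.normalize('NFD');
const sameClassEqual = acuteThenDiaeresis === diaeresisThenAcute;
console.log(sameClassEqual); // false
```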

### Compatibility Composition (NFKC)

Applies compatibility decomposition followed by canonical composition, replacing compatibility characters with their canonical equivalents.

```javascript { .api }
/**
 * Normalize string using Compatibility Decomposition followed by Canonical Composition
 * @param {string} str - String to normalize
 * @returns {string} NFKC normalized string
 */
function nfkc(str);
```

**Usage Example:**

```javascript
const unorm = require('unorm');

// Compatibility characters like subscripts are replaced with normal equivalents
const result = unorm.nfkc('CO₂'); // Subscript 2 becomes normal 2
console.log(result); // "CO2"
```
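
Compatibility normalization is lossy: once a superscript or ligature has been folded, the original formatting cannot be recovered from the result. The sketch below illustrates this with a few compatibility characters, assuming a native `String.prototype.normalize()` in place of `unorm.nfkc(s)`.

```javascript
// Compatibility characters: ligature, circled digit, fullwidth letter, superscript.
const samples = [
  '\uFB01',  // "ﬁ" LATIN SMALL LIGATURE FI
  '\u2460',  // "①" CIRCLED DIGIT ONE
  '\uFF21',  // "Ａ" FULLWIDTH LATIN CAPITAL LETTER A
  '2\u2075', // "2⁵" digit two followed by SUPERSCRIPT FIVE
];

// NFKC folds each one to its plain counterpart.
const folded = samples.map((s) => s.normalize('NFKC'));
console.log(folded); // [ 'fi', '1', 'A', '25' ]

// NFC, by contrast, leaves all of them untouched.
const untouchedByNfc = samples.every((s) => s.normalize('NFC') === s);
console.log(untouchedByNfc); // true
```

Note how "2⁵" becomes "25": a reason to keep the original text for display and use the NFKC result only for matching or indexing.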

### Compatibility Decomposition (NFKD)

Applies compatibility decomposition to replace compatibility characters with their canonical forms and decompose composite characters.

```javascript { .api }
/**
 * Normalize string using Compatibility Decomposition
 * @param {string} str - String to normalize
 * @returns {string} NFKD normalized string
 */
function nfkd(str);
```

**Usage Example:**

```javascript
const unorm = require('unorm');

// Useful for search/indexing by removing combining marks
const text = 'Ångström';
const normalized = unorm.nfkd(text);
const withoutMarks = normalized.replace(/[\u0300-\u036F]/g, ''); // Remove combining marks
console.log(withoutMarks); // "Angstrom"
```

### String.prototype.normalize Polyfill

Automatically provides `String.prototype.normalize()` method when not natively available in the JavaScript environment.

```javascript { .api }
/**
 * Polyfill for String.prototype.normalize method
 * @param {string} [form="NFC"] - Normalization form: "NFC", "NFD", "NFKC", or "NFKD"
 * @returns {string} Normalized string according to specified form
 * @throws {TypeError} When called on null or undefined
 * @throws {RangeError} When invalid normalization form provided
 */
String.prototype.normalize(form);
```

**Usage Examples:**

```javascript
// When native normalize() isn't available, unorm provides it
require('unorm'); // Automatically adds polyfill if needed

const text = 'café';
console.log(text.normalize('NFC'));  // Uses unorm's implementation if the polyfill was applied
console.log(text.normalize('NFD'));  // Decomposes é into e + combining accent
console.log(text.normalize('NFKC')); // Same as NFC for this example
console.log(text.normalize('NFKD')); // Same as NFD for this example

// Error handling
try {
  text.normalize('INVALID'); // Throws RangeError
} catch (error) {
  // With the polyfill: "Invalid normalization form: INVALID"
  // (native engines word this message differently)
  console.error(error.message);
}
```
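
The documented error behaviour can be probed with a small wrapper. The sketch below is an illustration only: `tryNormalize` is a hypothetical helper, and it runs against the native method, which follows the same ECMAScript rules the polyfill is described as matching.

```javascript
// Hypothetical helper: call normalize() on any value and report the outcome
// instead of throwing.
function tryNormalize(value, form) {
  try {
    return { ok: true, result: String.prototype.normalize.call(value, form) };
  } catch (error) {
    return { ok: false, errorName: error.name };
  }
}

const defaultForm = tryNormalize('e\u0301');  // form omitted, so NFC is used
const lowerCase = tryNormalize('abc', 'nfc'); // form names are case-sensitive
const onNull = tryNormalize(null, 'NFC');     // null receiver
const onNumber = tryNormalize(42, 'NFC');     // non-string receivers are converted

console.log(defaultForm.result === '\u00E9'); // true
console.log(lowerCase.errorName);             // RangeError
console.log(onNull.errorName);                // TypeError
console.log(onNumber.result);                 // 42 (as the string "42")
```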

### Polyfill Status Detection

Property to check whether the String.prototype.normalize polyfill was applied.

```javascript { .api }
/**
 * Boolean indicating whether String.prototype.normalize polyfill was applied
 * @type {boolean}
 */
unorm.shimApplied;
```

**Usage Example:**

```javascript
const unorm = require('unorm');

if (unorm.shimApplied) {
  console.log('String.prototype.normalize polyfill was applied');
} else {
  console.log('Native String.prototype.normalize is available');
}
```

## Types

```javascript { .api }
/**
 * Main unorm module interface
 */
interface UnormModule {
  /** Canonical Decomposition followed by Canonical Composition */
  nfc: (str: string) => string;
  /** Canonical Decomposition */
  nfd: (str: string) => string;
  /** Compatibility Decomposition followed by Canonical Composition */
  nfkc: (str: string) => string;
  /** Compatibility Decomposition */
  nfkd: (str: string) => string;
  /** Whether String.prototype.normalize polyfill was applied */
  shimApplied: boolean;
}

/**
 * Valid normalization forms for String.prototype.normalize
 */
type NormalizationForm = "NFC" | "NFD" | "NFKC" | "NFKD";
```

## Common Use Cases

### Text Search and Indexing

```javascript
const unorm = require('unorm');

function normalizeForSearch(text) {
  // Use NFKD to decompose, then remove combining marks for search
  const decomposed = unorm.nfkd(text);
  return decomposed.replace(/[\u0300-\u036F]/g, ''); // Remove combining marks
}

const searchTerm = normalizeForSearch('café');
// Named `doc` rather than `document` to avoid clashing with the browser global
const doc = normalizeForSearch('I love café au lait');
console.log(doc.includes(searchTerm)); // true
```
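
The range `\u0300-\u036F` covers only the Combining Diacritical Marks block. Where the runtime supports Unicode property escapes (an assumption; older environments that need the polyfill generally do not), `\p{M}` matches combining marks from every block. The sketch below compares the two approaches using the native `normalize()`; `stripMarksBasic` and `stripMarksUnicode` are hypothetical names.

```javascript
// Same approach as normalizeForSearch above: strip one block of combining marks.
function stripMarksBasic(text) {
  return text.normalize('NFKD').replace(/[\u0300-\u036F]/g, '');
}

// Variant using a Unicode property escape to strip every combining mark.
function stripMarksUnicode(text) {
  return text.normalize('NFKD').replace(/\p{M}/gu, '');
}

// Both variants handle common Latin diacritics.
console.log(stripMarksBasic('Ångström'));   // Angstrom
console.log(stripMarksUnicode('Ångström')); // Angstrom

// U+20D7 COMBINING RIGHT ARROW ABOVE lies outside U+0300..U+036F.
const vector = 'v\u20D7';
console.log(stripMarksBasic(vector) === vector); // true (mark survives)
console.log(stripMarksUnicode(vector));          // v

// Letters without a decomposition, such as "Ł", are not folded by either variant.
console.log(stripMarksUnicode('Łódź')); // Łodz
```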

### String Comparison

```javascript
const unorm = require('unorm');

function compareStrings(str1, str2) {
  // Normalize both strings to same form for accurate comparison
  return unorm.nfc(str1) === unorm.nfc(str2);
}

const text1 = 'é'; // Single codepoint
const text2 = 'e\u0301'; // e + combining acute accent
console.log(compareStrings(text1, text2)); // true
```
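
The same idea extends from pairwise comparison to collections: normalizing before inserting into a `Set` or using a string as a `Map` key avoids duplicate entries that differ only in encoding. A minimal sketch, assuming a native `String.prototype.normalize()` (with unorm, `unorm.nfc(value)` would be used instead); `dedupeByNormalizedForm` is a hypothetical helper.

```javascript
// Hypothetical helper: remove entries that are canonically equivalent,
// returning the NFC form of each distinct value in first-seen order.
function dedupeByNormalizedForm(values) {
  const seen = new Set();
  for (const value of values) {
    seen.add(value.normalize('NFC'));
  }
  return Array.from(seen);
}

const tags = [
  'caf\u00E9',  // precomposed é
  'cafe\u0301', // e + combining acute accent
  'cafe',       // genuinely different: no accent at all
];

const uniqueTags = dedupeByNormalizedForm(tags);
console.log(uniqueTags.length); // 2
console.log(uniqueTags);        // [ 'café', 'cafe' ]
```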

### Data Cleaning

```javascript
const unorm = require('unorm');

function cleanUserInput(input) {
  // Normalize to consistent form and trim
  return unorm.nfc(input.trim());
}

const userInput = ' café '; // With inconsistent Unicode
const cleaned = cleanUserInput(userInput);
console.log(cleaned); // Normalized "café"
```

## Browser Compatibility

- **Modern Browsers**: Works in all modern browsers
- **Legacy Support**: Requires ES5 features (Object.defineProperty)
- **Recommended**: Use [es5-shim](https://github.com/kriskowal/es5-shim) for older browsers
- **Node.js**: Supports Node.js >= 0.4.0