
# Language Models

spaCy provides access to 70+ language-specific models and processing pipelines, each optimized for its language's linguistic characteristics and writing system. It offers both pre-trained models and blank language classes for custom training.

## Capabilities

### Model Loading and Management

Functions for loading pre-trained models and creating blank language objects for custom training.

```python { .api }
def load(name: str, vocab: Vocab = None, disable: List[str] = None,
         exclude: List[str] = None, config: dict = None) -> Language:
    """
    Load a spaCy model by name or path.

    Args:
        name: Model name (e.g., 'en_core_web_sm') or path
        vocab: Optional vocabulary to use
        disable: Pipeline components to disable
        exclude: Pipeline components to exclude entirely
        config: Config overrides

    Returns:
        Language object with loaded model
    """

def blank(name: str, vocab: Vocab = None, config: dict = None) -> Language:
    """
    Create a blank Language object for a given language.

    Args:
        name: Language code (e.g., 'en', 'de', 'zh')
        vocab: Optional vocabulary
        config: Optional config overrides

    Returns:
        Blank Language object without trained components
    """

def info(model: str = None, markdown: bool = False, silent: bool = False) -> None:
    """
    Display information about a model or the spaCy installation.

    Args:
        model: Model name to get info for
        markdown: Print in Markdown format
        silent: Don't print to stdout
    """
```

### Language Classes

Each supported language has a specialized `Language` subclass with language-specific tokenization rules, stop words, and linguistic features.

#### Major Languages

```python { .api }
class English(Language):
    """English language processing pipeline."""
    lang = "en"

class German(Language):
    """German language processing pipeline."""
    lang = "de"

class French(Language):
    """French language processing pipeline."""
    lang = "fr"

class Spanish(Language):
    """Spanish language processing pipeline."""
    lang = "es"

class Italian(Language):
    """Italian language processing pipeline."""
    lang = "it"

class Portuguese(Language):
    """Portuguese language processing pipeline."""
    lang = "pt"

class Russian(Language):
    """Russian language processing pipeline."""
    lang = "ru"

class Chinese(Language):
    """Chinese language processing pipeline with specialized tokenizer."""
    lang = "zh"

class Japanese(Language):
    """Japanese language processing pipeline with specialized tokenizer."""
    lang = "ja"

class Korean(Language):
    """Korean language processing pipeline."""
    lang = "ko"

class Arabic(Language):
    """Arabic language processing pipeline."""
    lang = "ar"

class Hindi(Language):
    """Hindi language processing pipeline."""
    lang = "hi"
```

#### Supported Languages (70+ total)

All supported language codes and their corresponding Language classes:

- **European**: en, de, fr, es, it, pt, ru, pl, nl, sv, da, no, fi, is, et, lv, lt, sl, sk, cs, hr, bg, mk, sr, hu, ro, el, ca, eu, ga, cy, mt, sq, lb
- **Asian**: zh, ja, ko, hi, bn, ta, te, ml, kn, gu, mr, ne, si, th, vi, id, ms, tl
- **Middle Eastern/African**: ar, fa, he, tr, ur, am, ti, yo
- **Others**: xx (multi-language)

### Language Configuration

Each language class has an associated Defaults class containing language-specific configuration.

```python { .api }
class LanguageDefaults:
    """Language-specific configuration and defaults."""

    # Tokenizer configuration
    tokenizer_exceptions: dict
    prefixes: List[str]
    suffixes: List[str]
    infixes: List[str]
    token_match: Pattern
    url_match: Pattern

    # Stop words
    stop_words: Set[str]

    # Writing system info
    writing_system: dict

    # Lemmatizer and lookup tables
    lemma_rules: dict
    lemma_index: dict
    lemma_exc: dict

    # Morph rules
    morph_rules: dict

    # Tag map
    tag_map: dict

    # Syntax iterators (noun chunks, etc.)
    syntax_iterators: dict
```
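The affix fields above drive spaCy's rule-based tokenizer: prefix and suffix patterns are applied repeatedly to peel punctuation off each whitespace-separated chunk. The following is a minimal stdlib sketch of that peeling loop; the patterns are tiny illustrative samples, not spaCy's real affix rules or its actual implementation.

```python
import re

# Toy prefix/suffix peeling in the spirit of spaCy's tokenizer algorithm.
PREFIXES = re.compile(r'^[\("\']')
SUFFIXES = re.compile(r'[\)"\'.,!?]$')

def split_affixes(word):
    """Split leading/trailing punctuation off a whitespace-delimited chunk."""
    pre, post = [], []
    while word:
        m = PREFIXES.search(word)
        if not m:
            break
        pre.append(word[:m.end()])   # peel one prefix character
        word = word[m.end():]
    while word:
        m = SUFFIXES.search(word)
        if not m:
            break
        post.insert(0, word[m.start():])  # peel one suffix character
        word = word[:m.start()]
    return pre + ([word] if word else []) + post

print(split_affixes('("hello!")'))
# ['(', '"', 'hello', '!', '"', ')']
```

spaCy's real tokenizer additionally consults `tokenizer_exceptions`, `infixes`, `token_match`, and `url_match` between peeling steps, which is why affix lists alone don't fully determine its output.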

### Pre-trained Models

spaCy provides pre-trained models in different sizes for many languages:

#### Model Sizes

- **sm (small)**: ~15 MB, CPU-optimized, basic accuracy
- **md (medium)**: ~50 MB, includes word vectors, better accuracy
- **lg (large)**: ~750 MB, large word vectors, best non-transformer accuracy
- **trf (transformer)**: ~500 MB, transformer-based, state-of-the-art accuracy

#### Available Models

```python
# English models
"en_core_web_sm"   # Small English model
"en_core_web_md"   # Medium English model with vectors
"en_core_web_lg"   # Large English model with large vectors
"en_core_web_trf"  # Transformer-based English model

# German models
"de_core_news_sm"  # Small German model
"de_core_news_md"  # Medium German model
"de_core_news_lg"  # Large German model

# French models
"fr_core_news_sm"  # Small French model
"fr_core_news_md"  # Medium French model
"fr_core_news_lg"  # Large French model

# Spanish models
"es_core_news_sm"  # Small Spanish model
"es_core_news_md"  # Medium Spanish model
"es_core_news_lg"  # Large Spanish model

# Chinese models
"zh_core_web_sm"   # Small Chinese model
"zh_core_web_md"   # Medium Chinese model
"zh_core_web_lg"   # Large Chinese model

# And models for: pt, it, nl, ru, ja, ko, ca, da, el, lt, mk, nb, pl, ro, xx
```
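The names above all follow spaCy's documented naming convention `{lang}_{type}_{genre}_{size}`. A small helper to unpack that convention (the helper itself is illustrative, not part of spaCy's API):

```python
def parse_model_name(name: str) -> dict:
    """Unpack a spaCy model name of the form {lang}_{type}_{genre}_{size}."""
    lang, model_type, genre, size = name.split("_")
    return {"lang": lang, "type": model_type, "genre": genre, "size": size}

print(parse_model_name("de_core_news_lg"))
# {'lang': 'de', 'type': 'core', 'genre': 'news', 'size': 'lg'}
```

The `genre` part indicates the training data: `web` models are trained on web text (blogs, news, comments), `news` models on news text.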

## Usage Examples

### Loading Models

```python
import spacy

# Load pre-trained models
nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")
nlp_fr = spacy.load("fr_core_news_sm")

# Load with specific components disabled
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Load with config overrides
config = {"nlp": {"batch_size": 1000}}
nlp = spacy.load("en_core_web_sm", config=config)

# Process text with different models
doc_en = nlp_en("Hello world")
doc_de = nlp_de("Hallo Welt")
doc_fr = nlp_fr("Bonjour le monde")
```

### Creating Blank Models

```python
import spacy

# Create blank models for custom training
nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")
nlp_zh = spacy.blank("zh")

# Add components to a blank model
nlp_en.add_pipe("tagger")
nlp_en.add_pipe("parser")
nlp_en.add_pipe("ner")

# Create with a custom vocabulary
from spacy.vocab import Vocab
custom_vocab = Vocab()
nlp = spacy.blank("en", vocab=custom_vocab)

print(f"Language: {nlp.lang}")
print(f"Pipeline: {nlp.pipe_names}")
```

### Multi-language Processing

```python
import spacy

# Load multiple language models
models = {
    "en": spacy.load("en_core_web_sm"),
    "de": spacy.load("de_core_news_sm"),
    "fr": spacy.load("fr_core_news_sm"),
    "es": spacy.load("es_core_news_sm"),
}

# Process texts in different languages
texts = {
    "en": "Apple Inc. is an American technology company.",
    "de": "Apple Inc. ist ein amerikanisches Technologieunternehmen.",
    "fr": "Apple Inc. est une entreprise technologique américaine.",
    "es": "Apple Inc. es una empresa tecnológica estadounidense.",
}

for lang, text in texts.items():
    doc = models[lang](text)
    print(f"{lang.upper()}: {doc.text}")
    for ent in doc.ents:
        print(f"  {ent.text} -> {ent.label_}")
```

### Language Detection and Processing

```python
import spacy

# Model for each language this helper knows how to process
LANGUAGE_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "es": "es_core_news_sm",
}
_loaded = {}  # cache so each model is loaded only once

def process_multilingual(text, detected_lang="en"):
    """Process text with the appropriate language model, falling back to English."""
    model_name = LANGUAGE_MODELS.get(detected_lang, "en_core_web_sm")
    if model_name not in _loaded:
        _loaded[model_name] = spacy.load(model_name)
    return _loaded[model_name](text)

# Process texts
english_doc = process_multilingual("Hello world", "en")
german_doc = process_multilingual("Hallo Welt", "de")
```
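`process_multilingual` assumes the language code has already been detected elsewhere. One throwaway way to guess it is stop-word overlap; the word lists below are tiny illustrative samples (not spaCy's stop-word sets), and a dedicated detector such as `langdetect` is the better choice in practice.

```python
# Minimal language-guess sketch using stop-word overlap.
STOP_WORDS = {
    "en": {"the", "is", "an", "and", "of"},
    "de": {"der", "die", "das", "ist", "und"},
    "fr": {"le", "la", "est", "et", "un"},
}

def guess_lang(text, default="en"):
    """Return the language whose stop words overlap the text most."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(guess_lang("die Katze ist klein"))  # -> "de"
```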

### Working with Language-Specific Features

```python
import spacy

# Load models with different capabilities
nlp_en = spacy.load("en_core_web_sm")
nlp_zh = spacy.load("zh_core_web_sm")   # Chinese with specialized tokenizer
nlp_ja = spacy.load("ja_core_news_sm")  # Japanese with specialized tokenizer

# English processing
doc_en = nlp_en("Apple Inc. is buying a startup for $1 billion.")
print("English tokens:")
for token in doc_en:
    print(f"  {token.text} ({token.pos_})")

# Chinese processing (no spaces between words)
doc_zh = nlp_zh("苹果公司正在收购一家初创公司")
print("\nChinese tokens:")
for token in doc_zh:
    print(f"  {token.text} ({token.pos_})")

# Japanese processing (mixed scripts)
doc_ja = nlp_ja("アップル社はスタートアップを買収している")
print("\nJapanese tokens:")
for token in doc_ja:
    print(f"  {token.text} ({token.pos_})")
```

### Custom Language Classes

```python
import spacy
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    """Custom defaults, e.g. extra stop words."""
    stop_words = English.Defaults.stop_words | {"custom", "stop"}

# Register the custom language class so spacy.blank can find it by code
@spacy.registry.languages("custom_en")
class CustomEnglish(English):
    """Custom English class with additional features."""
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

# Use the custom language
nlp = spacy.blank("custom_en")
```

### Model Information and Metadata

```python
import spacy

# Load model and inspect metadata
nlp = spacy.load("en_core_web_sm")

# Model metadata
print("Model info:")
print(f"  Language: {nlp.lang}")
print(f"  Name: {nlp.meta['name']}")
print(f"  Version: {nlp.meta['version']}")
print(f"  Description: {nlp.meta['description']}")
print(f"  Pipeline: {nlp.pipe_names}")

# Vocabulary info
print(f"\nVocabulary size: {len(nlp.vocab)}")
print(f"Vectors: {nlp.vocab.vectors.size}")

# Component info
for name, component in nlp.pipeline:
    print(f"Component '{name}': {type(component)}")

# Display full model info
spacy.info("en_core_web_sm")
```

### Performance and Memory Optimization

```python
import spacy

# Load only the components you need for speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # tokenization + tagging only

# Use a smaller model under memory constraints
nlp_small = spacy.load("en_core_web_sm")  # ~15 MB
nlp_large = spacy.load("en_core_web_lg")  # ~750 MB

# Batch processing for efficiency
texts = ["Text 1", "Text 2", "Text 3"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, batch_size=100))

# Temporarily disable components while processing
with nlp.select_pipes(disable=["parser", "ner"]):
    # Faster processing without parsing and NER
    docs = list(nlp.pipe(texts))
```
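Batch size matters because `nlp.pipe` groups documents before pushing each group through the pipeline. The grouping itself can be sketched in a few lines of stdlib code; spaCy ships an equivalent helper as `spacy.util.minibatch`.

```python
from itertools import islice

def minibatch(items, size):
    """Yield successive fixed-size chunks from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

print(list(minibatch(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
```

Larger batches amortize per-batch overhead but hold more documents in memory at once, which is the trade-off the `batch_size` argument tunes.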