# Language Models

Access to 70+ language-specific models and processing pipelines, each optimized for specific linguistic characteristics and writing systems. spaCy provides pre-trained models and blank language classes for custom training.

## Capabilities

### Model Loading and Management

Functions for loading pre-trained models and creating blank language objects for custom training.

```python { .api }
def load(name: str, vocab: Vocab = None, disable: List[str] = None,
         exclude: List[str] = None, config: dict = None) -> Language:
    """
    Load a spaCy model by name or path.

    Args:
        name: Model name (e.g., 'en_core_web_sm') or path
        vocab: Optional vocabulary to use
        disable: Pipeline components to disable
        exclude: Pipeline components to exclude entirely
        config: Config overrides

    Returns:
        Language object with loaded model
    """

def blank(name: str, vocab: Vocab = None, config: dict = None) -> Language:
    """
    Create a blank Language object for a given language.

    Args:
        name: Language code (e.g., 'en', 'de', 'zh')
        vocab: Optional vocabulary
        config: Optional config overrides

    Returns:
        Blank Language object without trained components
    """

def info(model: str = None, markdown: bool = False, silent: bool = False) -> None:
    """
    Display information about a model or the spaCy installation.

    Args:
        model: Model name to get info for
        markdown: Print in Markdown format
        silent: Don't print to stdout
    """
```

### Language Classes

Each supported language has a specialized Language subclass with language-specific tokenization rules, stop words, and linguistic features.

#### Major Languages

```python { .api }
class English(Language):
    """English language processing pipeline."""
    lang = "en"

class German(Language):
    """German language processing pipeline."""
    lang = "de"

class French(Language):
    """French language processing pipeline."""
    lang = "fr"

class Spanish(Language):
    """Spanish language processing pipeline."""
    lang = "es"

class Italian(Language):
    """Italian language processing pipeline."""
    lang = "it"

class Portuguese(Language):
    """Portuguese language processing pipeline."""
    lang = "pt"

class Russian(Language):
    """Russian language processing pipeline."""
    lang = "ru"

class Chinese(Language):
    """Chinese language processing pipeline with specialized tokenizer."""
    lang = "zh"

class Japanese(Language):
    """Japanese language processing pipeline with specialized tokenizer."""
    lang = "ja"

class Korean(Language):
    """Korean language processing pipeline."""
    lang = "ko"

class Arabic(Language):
    """Arabic language processing pipeline."""
    lang = "ar"

class Hindi(Language):
    """Hindi language processing pipeline."""
    lang = "hi"
```
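
These classes can also be used directly, without any trained pipeline: instantiating a Language subclass gives a blank, tokenizer-only pipeline. A minimal sketch:

```python
from spacy.lang.en import English

# Instantiating a Language subclass directly gives a blank,
# tokenizer-only pipeline (no trained components)
nlp = English()
doc = nlp("This is a sentence.")

print(English.lang)                   # "en"
print([token.text for token in doc])  # ['This', 'is', 'a', 'sentence', '.']
```

This is equivalent to `spacy.blank("en")` and is useful when you only need language-appropriate tokenization.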

#### Supported Languages (70+ total)

All supported language codes and their corresponding Language classes:

- **European**: en, de, fr, es, it, pt, ru, pl, nl, sv, da, no, fi, is, et, lv, lt, sl, sk, cs, hr, bg, mk, sr, hu, ro, el, ca, eu, ga, cy, mt, sq, lb
- **Asian**: zh, ja, ko, hi, bn, ta, te, ml, kn, gu, mr, ne, si, th, vi, id, ms, tl
- **Middle Eastern/African**: ar, fa, he, tr, ur, am, ti, yo
- **Others**: xx (multi-language)
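
The special `xx` code selects the language-agnostic MultiLanguage class, useful when a pipeline must handle mixed- or unknown-language text:

```python
import spacy

# "xx" gives a language-neutral pipeline with basic tokenization
nlp = spacy.blank("xx")
doc = nlp("Hello Welt bonjour")

print(nlp.lang)                       # "xx"
print([token.text for token in doc])  # ['Hello', 'Welt', 'bonjour']
```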

### Language Configuration

Each language class has an associated Defaults class containing language-specific configuration.

```python { .api }
class LanguageDefaults:
    """Language-specific configuration and defaults."""

    # Tokenizer configuration
    tokenizer_exceptions: dict
    prefixes: List[str]
    suffixes: List[str]
    infixes: List[str]
    token_match: Pattern
    url_match: Pattern

    # Stop words
    stop_words: Set[str]

    # Writing system info
    writing_system: dict

    # Lemmatizer and lookup tables
    lemma_rules: dict
    lemma_index: dict
    lemma_exc: dict

    # Morph rules
    morph_rules: dict

    # Tag map
    tag_map: dict

    # Syntax iterators (noun chunks, etc.)
    syntax_iterators: dict
```
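
These defaults can be inspected directly on a language class. A short sketch, assuming spaCy v3's class-level `Defaults` attributes:

```python
from spacy.lang.en import English

defaults = English.Defaults

# Stop words are a plain set of strings
print("is" in defaults.stop_words)           # True

# Writing-system metadata (direction, casing, letters)
print(defaults.writing_system["direction"])  # "ltr"
```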

### Pre-trained Models

spaCy provides pre-trained models in different sizes for many languages:

#### Model Sizes

- **sm (small)**: ~15MB, CPU-optimized, basic accuracy
- **md (medium)**: ~50MB, word vectors, better accuracy
- **lg (large)**: ~750MB, large word vectors, best accuracy
- **trf (transformer)**: ~500MB, transformer-based, state-of-the-art accuracy

#### Available Models

```python
# English models
"en_core_web_sm"   # Small English model
"en_core_web_md"   # Medium English model with vectors
"en_core_web_lg"   # Large English model with large vectors
"en_core_web_trf"  # Transformer-based English model

# German models
"de_core_news_sm"  # Small German model
"de_core_news_md"  # Medium German model
"de_core_news_lg"  # Large German model

# French models
"fr_core_news_sm"  # Small French model
"fr_core_news_md"  # Medium French model
"fr_core_news_lg"  # Large French model

# Spanish models
"es_core_news_sm"  # Small Spanish model
"es_core_news_md"  # Medium Spanish model
"es_core_news_lg"  # Large Spanish model

# Chinese models
"zh_core_web_sm"   # Small Chinese model
"zh_core_web_md"   # Medium Chinese model
"zh_core_web_lg"   # Large Chinese model

# And models for: pt, it, nl, ru, ja, ko, ca, da, el, lt, mk, nb, pl, ro, xx
```
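
Models are distributed as installable Python packages (typically via `python -m spacy download en_core_web_sm`). A sketch for checking what is available at runtime, assuming spaCy v3's `spacy.util.get_installed_models()`:

```python
import spacy
from spacy import util

# List model packages installed in the current environment
installed = util.get_installed_models()
print(installed)

# Fall back to a blank pipeline if the model isn't available
if "en_core_web_sm" in installed:
    nlp = spacy.load("en_core_web_sm")
else:
    nlp = spacy.blank("en")

print(nlp.lang)  # "en"
```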

## Usage Examples

### Loading Models

```python
import spacy

# Load pre-trained models
nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")
nlp_fr = spacy.load("fr_core_news_sm")

# Load with specific components disabled
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Load with config overrides
config = {"nlp": {"batch_size": 1000}}
nlp = spacy.load("en_core_web_sm", config=config)

# Process text with different models
text = "Hello world"
doc_en = nlp_en(text)
doc_de = nlp_de("Hallo Welt")
doc_fr = nlp_fr("Bonjour le monde")
```

### Creating Blank Models

```python
import spacy

# Create blank models for custom training
nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")
nlp_zh = spacy.blank("zh")

# Add components to the blank model
nlp_en.add_pipe("tagger")
nlp_en.add_pipe("parser")
nlp_en.add_pipe("ner")

# Create with custom vocabulary
from spacy.vocab import Vocab

custom_vocab = Vocab()
nlp = spacy.blank("en", vocab=custom_vocab)

print(f"Language: {nlp.lang}")
print(f"Pipeline: {nlp.pipe_names}")
```

### Multi-language Processing

```python
import spacy

# Load multiple language models
models = {
    "en": spacy.load("en_core_web_sm"),
    "de": spacy.load("de_core_news_sm"),
    "fr": spacy.load("fr_core_news_sm"),
    "es": spacy.load("es_core_news_sm"),
}

# Process texts in different languages
texts = {
    "en": "Apple Inc. is an American technology company.",
    "de": "Apple Inc. ist ein amerikanisches Technologieunternehmen.",
    "fr": "Apple Inc. est une entreprise technologique américaine.",
    "es": "Apple Inc. es una empresa tecnológica estadounidense.",
}

for lang, text in texts.items():
    doc = models[lang](text)
    print(f"{lang.upper()}: {doc.text}")
    for ent in doc.ents:
        print(f"  {ent.text} -> {ent.label_}")
```

### Language Detection and Processing

```python
import spacy

def process_multilingual(text, detected_lang="en"):
    """Process text with the model matching the detected language."""
    language_models = {
        "en": "en_core_web_sm",
        "de": "de_core_news_sm",
        "fr": "fr_core_news_sm",
        "es": "es_core_news_sm",
    }

    # Fall back to English for unsupported languages
    model_name = language_models.get(detected_lang, "en_core_web_sm")
    nlp = spacy.load(model_name)
    return nlp(text)

# Process texts
english_doc = process_multilingual("Hello world", "en")
german_doc = process_multilingual("Hallo Welt", "de")
```
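
Calling `spacy.load` on every request reloads the whole pipeline from disk, which is expensive. One way to avoid this is a small caching helper; the `get_model` function below is a hypothetical sketch using the standard library's `functools.lru_cache`:

```python
import functools

import spacy

@functools.lru_cache(maxsize=None)
def get_model(name: str):
    """Load each pipeline at most once; later calls return the cached object."""
    return spacy.load(name)
```

The first `get_model("en_core_web_sm")` call pays the full load cost; subsequent calls with the same name return the already-loaded pipeline.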

### Working with Language-Specific Features

```python
import spacy

# Load models with different capabilities
nlp_en = spacy.load("en_core_web_sm")
nlp_zh = spacy.load("zh_core_web_sm")   # Chinese with specialized tokenizer
nlp_ja = spacy.load("ja_core_news_sm")  # Japanese with specialized tokenizer

# English processing
doc_en = nlp_en("Apple Inc. is buying a startup for $1 billion.")
print("English tokens:")
for token in doc_en:
    print(f"  {token.text} ({token.pos_})")

# Chinese processing (no spaces between words)
doc_zh = nlp_zh("苹果公司正在收购一家初创公司")
print("\nChinese tokens:")
for token in doc_zh:
    print(f"  {token.text} ({token.pos_})")

# Japanese processing (mixed scripts)
doc_ja = nlp_ja("アップル社はスタートアップを買収している")
print("\nJapanese tokens:")
for token in doc_ja:
    print(f"  {token.text} ({token.pos_})")
```

### Custom Language Classes

```python
import spacy
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    """Custom defaults, e.g. extra stop words."""
    stop_words = English.Defaults.stop_words | {"custom"}

# Register the custom language so spacy.blank() can find it by name
@spacy.registry.languages("custom_en")
class CustomEnglish(English):
    """Custom English class with additional features."""
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

# Use the custom language
nlp = spacy.blank("custom_en")
```

### Model Information and Metadata

```python
import spacy

# Load model and inspect metadata
nlp = spacy.load("en_core_web_sm")

# Model metadata
print("Model info:")
print(f"  Language: {nlp.lang}")
print(f"  Name: {nlp.meta['name']}")
print(f"  Version: {nlp.meta['version']}")
print(f"  Description: {nlp.meta['description']}")
print(f"  Pipeline: {nlp.pipe_names}")

# Vocabulary info
print(f"\nVocabulary size: {len(nlp.vocab)}")
print(f"Vectors: {nlp.vocab.vectors.size}")

# Component info
for name, component in nlp.pipeline:
    print(f"Component '{name}': {type(component)}")

# Display full model info
spacy.info("en_core_web_sm")
```

### Performance and Memory Optimization

```python
import spacy

# Load only the components you need
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # faster: tokenization + tagging only

# Use a smaller model when memory is constrained
nlp_small = spacy.load("en_core_web_sm")  # ~15MB
nlp_large = spacy.load("en_core_web_lg")  # ~750MB

# Batch processing for efficiency
texts = ["Text 1", "Text 2", "Text 3"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, batch_size=100))

# Temporarily disable components (select_pipes supersedes the
# deprecated disable_pipes in spaCy v3)
with nlp.select_pipes(disable=["parser", "ner"]):
    # Faster processing without parsing and NER
    docs = list(nlp.pipe(texts))
```