# spaCy

Industrial-strength Natural Language Processing (NLP) in Python. spaCy is designed for production use and provides fast, accurate processing for 70+ languages, with state-of-the-art neural network models for tokenization, part-of-speech tagging, dependency parsing, named entity recognition, and text classification.

## Package Information

- **Package Name**: spacy
- **Language**: Python
- **Installation**: `pip install spacy`
- **Models**: Download language models with `python -m spacy download en_core_web_sm`

## Core Imports

```python
import spacy

# Load a language model
nlp = spacy.load("en_core_web_sm")
```

Most common imports:
```python
from spacy import displacy
from spacy.matcher import Matcher, PhraseMatcher
from spacy.tokens import Doc, Token, Span
```

## Basic Usage

```python
import spacy

# Load a language model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Access linguistic annotations
for token in doc:
    print(token.text, token.pos_, token.dep_, token.lemma_)

# Access named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Process multiple texts efficiently
texts = ["First text", "Second text", "Third text"]
docs = list(nlp.pipe(texts))
```

## Architecture
52
53
spaCy's processing pipeline is built around a Language object that chains together multiple pipeline components. Each document passes through tokenization, then through pipeline components (tagger, parser, NER, etc.) in sequence. This design allows for:
54
55
- **Efficient processing**: Stream processing with `nlp.pipe()` for batches
56
- **Modular architecture**: Add, remove, or replace pipeline components
57
- **Multi-language support**: 70+ language models with specialized tokenizers
58
- **Production-ready**: Optimized for speed and memory usage
59
60
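The batching behavior can be demonstrated without downloading a trained model by using a blank pipeline (tokenizer only), a minimal sketch:

```python
import spacy

# A blank pipeline contains only the language's tokenizer,
# so this runs without any model download.
nlp = spacy.blank("en")

texts = ["First text", "Second text", "Third text"]

# nlp.pipe streams texts through the pipeline in batches,
# which is faster than calling nlp() once per text.
docs = list(nlp.pipe(texts, batch_size=2))
print(len(docs))        # 3
print(docs[0][0].text)  # First
```

With a loaded model, the same call additionally runs the trained components over each batch.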
## Capabilities

### Core Processing Objects

The fundamental objects for text processing, including documents, tokens, spans, and vocabulary management. These form the foundation of all spaCy operations.

```python { .api }
class Language:
    def __call__(self, text: str) -> Doc: ...
    def pipe(self, texts: Iterable[str]) -> Iterator[Doc]: ...

class Doc:
    text: str
    ents: tuple
    sents: Iterator

class Token:
    text: str
    pos_: str
    lemma_: str

class Span:
    text: str
    label_: str
```

[Core Objects](./core-objects.md)

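How these objects relate can be sketched with a blank pipeline (no trained model required):

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline
doc = nlp("San Francisco is foggy")

# Token: a single unit of text, accessed by index
assert doc[0].text == "San"

# Span: a slice of the Doc
span = doc[0:2]
assert span.text == "San Francisco"

# The Vocab is shared between the pipeline and its Docs
assert doc.vocab is nlp.vocab
```

Linguistic attributes such as `pos_` and `lemma_` are only populated once trained components have run over the document.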
### Pipeline Components

Built-in pipeline components for linguistic analysis, including part-of-speech tagging, dependency parsing, named entity recognition, and text classification.

```python { .api }
class Tagger: ...
class DependencyParser: ...
class EntityRecognizer: ...
class TextCategorizer: ...
```

[Pipeline Components](./pipeline-components.md)

### Pattern Matching

Powerful pattern matching systems for finding and extracting specific linguistic patterns, phrases, and dependency structures from text.

```python { .api }
class Matcher:
    def add(self, key: str, patterns: List[dict]) -> None: ...
    def __call__(self, doc: Doc) -> List[tuple]: ...

class PhraseMatcher:
    def add(self, key: str, docs: List[Doc]) -> None: ...
```

[Pattern Matching](./pattern-matching.md)

### Language Models

Access to 70+ language-specific models and tokenizers, each optimized for specific linguistic characteristics and writing systems.

```python { .api }
def load(name: str, **overrides) -> Language: ...
def blank(name: str, **kwargs) -> Language: ...
```

[Language Models](./language-models.md)

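The difference between the two entry points, sketched briefly: `spacy.blank` builds a pipeline with only the language-specific tokenizer and works offline, while `spacy.load` requires an installed model package.

```python
import spacy

# Blank German pipeline: language-specific tokenizer, no trained parts.
nlp = spacy.blank("de")
doc = nlp("Hallo Welt")
print([t.text for t in doc])  # ['Hallo', 'Welt']

# spacy.load needs the model installed first, e.g.:
#   python -m spacy download en_core_web_sm
# nlp = spacy.load("en_core_web_sm")
```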
### Visualization

Interactive visualization tools for displaying linguistic analysis, including dependency trees, named entities, and custom visualizations.

```python { .api }
def render(docs, style: str = "dep", **options) -> str: ...
def serve(docs, style: str = "dep", port: int = 5000, **options) -> None: ...
```

[Visualization](./visualization.md)

### Training and Model Building

Tools for training custom models, fine-tuning existing models, and creating specialized NLP pipelines for domain-specific applications.

```python { .api }
def train(nlp: Language, examples: List, **kwargs) -> dict: ...
def evaluate(nlp: Language, examples: List, **kwargs) -> dict: ...
```

[Training](./training.md)

## Key Types

```python { .api }
class Language:
    """Main NLP pipeline class."""
    vocab: Vocab
    pipeline: List[tuple]
    pipe_names: List[str]

    def __call__(self, text: str) -> Doc: ...
    def pipe(self, texts: Iterable[str], batch_size: int = 1000) -> Iterator[Doc]: ...
    def add_pipe(self, factory_name: str, name: str = None, **kwargs) -> callable: ...

class Doc:
    """Container for accessing linguistic annotations."""
    text: str
    text_with_ws: str
    ents: tuple
    noun_chunks: Iterator
    sents: Iterator
    vector: numpy.ndarray

    def similarity(self, other) -> float: ...
    def to_json(self) -> dict: ...

class Token:
    """Individual token with linguistic annotations."""
    text: str
    lemma_: str
    pos_: str
    tag_: str
    dep_: str
    ent_type_: str
    head: 'Token'
    children: Iterator
    is_alpha: bool
    is_digit: bool
    is_punct: bool
    like_num: bool

class Span:
    """Slice of a document."""
    text: str
    label_: str
    kb_id_: str
    vector: numpy.ndarray

    def similarity(self, other) -> float: ...
    def as_doc(self) -> Doc: ...

class Vocab:
    """Vocabulary store."""
    strings: StringStore
    vectors: Vectors

    def __getitem__(self, string: str) -> Lexeme: ...
```
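
The `Vocab` behaviors above can be sketched briefly: indexing the vocab with a string yields a `Lexeme` (a context-independent word type), and the `StringStore` interns strings as 64-bit hashes.

```python
import spacy

nlp = spacy.blank("en")

# Vocab.__getitem__ returns a Lexeme for the given string.
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.is_alpha)  # coffee True

# StringStore maps strings to hashes and back.
coffee_hash = nlp.vocab.strings["coffee"]
print(nlp.vocab.strings[coffee_hash])  # coffee
```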