# spaCy

Industrial-strength Natural Language Processing (NLP) in Python. spaCy is designed for production use and provides fast, accurate processing for 70+ languages, with state-of-the-art neural network models for tokenization, part-of-speech tagging, dependency parsing, named entity recognition, and text classification.

## Package Information

- **Package Name**: spacy
- **Language**: Python
- **Installation**: `pip install spacy`
- **Models**: Download language models with `python -m spacy download en_core_web_sm`

## Core Imports

```python
import spacy

# Load a language model
nlp = spacy.load("en_core_web_sm")
```

Most common imports:

```python
from spacy import displacy
from spacy.matcher import Matcher, PhraseMatcher
from spacy.tokens import Doc, Token, Span
```

## Basic Usage

```python
import spacy

# Load a language model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Access linguistic annotations
for token in doc:
    print(token.text, token.pos_, token.dep_, token.lemma_)

# Access named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Process multiple texts efficiently
texts = ["First text", "Second text", "Third text"]
docs = list(nlp.pipe(texts))
```

## Architecture

spaCy's processing pipeline is built around a Language object that chains together multiple pipeline components. Each document first passes through the tokenizer, then through the pipeline components (tagger, parser, NER, etc.) in sequence. This design allows for:

- **Efficient processing**: Stream processing with `nlp.pipe()` for batches
- **Modular architecture**: Add, remove, or replace pipeline components
- **Multi-language support**: 70+ language models with specialized tokenizers
- **Production-ready**: Optimized for speed and memory usage
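The modular design can be seen by assembling a pipeline from scratch. A minimal sketch using the built-in rule-based `sentencizer` component, which needs no trained model:

```python
import spacy

# A blank pipeline has a tokenizer but no trained components
nlp = spacy.blank("en")
print(nlp.pipe_names)  # []

# Add a rule-based sentence segmenter by its registered factory name
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)  # ['sentencizer']

doc = nlp("First sentence. Second sentence.")
print([sent.text for sent in doc.sents])
```

Because components are registered by factory name, the same `add_pipe` call works for built-in and custom components alike.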

## Capabilities

### Core Processing Objects

The fundamental objects for text processing, including documents, tokens, spans, and vocabulary management. These form the foundation of all spaCy operations.

```python { .api }
class Language:
    def __call__(self, text: str) -> Doc: ...
    def pipe(self, texts: Iterable[str]) -> Iterator[Doc]: ...

class Doc:
    text: str
    ents: tuple
    sents: Iterator

class Token:
    text: str
    pos_: str
    lemma_: str

class Span:
    text: str
    label_: str
```

[Core Objects](./core-objects.md)
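A quick illustration of how these objects relate, sketched with a blank English pipeline so no model download is needed:

```python
import spacy

nlp = spacy.blank("en")  # a Language object with only a tokenizer
doc = nlp("Hello beautiful world")

# A Doc is a sequence of Token objects
print([token.text for token in doc])  # ['Hello', 'beautiful', 'world']

# A Span is a slice of a Doc
span = doc[0:2]
print(span.text)  # Hello beautiful
```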

### Pipeline Components

Built-in pipeline components for linguistic analysis, including part-of-speech tagging, dependency parsing, named entity recognition, and text classification.

```python { .api }
class Tagger: ...
class DependencyParser: ...
class EntityRecognizer: ...
class TextCategorizer: ...
```

[Pipeline Components](./pipeline-components.md)

### Pattern Matching

Powerful pattern matching systems for finding and extracting specific linguistic patterns, phrases, and dependency structures from text.

```python { .api }
class Matcher:
    def add(self, key: str, patterns: List[List[dict]]) -> None: ...
    def __call__(self, doc: Doc) -> List[tuple]: ...

class PhraseMatcher:
    def add(self, key: str, docs: List[Doc]) -> None: ...
```

[Pattern Matching](./pattern-matching.md)
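A minimal `Matcher` sketch. Token-attribute patterns like `LOWER` work on a blank pipeline, since they only need the tokenizer:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Each pattern is a list of per-token attribute dicts
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Hello world! hello WORLD again")
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end].text)  # 'Hello world', then 'hello WORLD'
```

Patterns matching on `POS` or `DEP` attributes additionally require a pipeline with a trained tagger or parser.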

### Language Models

Access to 70+ language-specific models and tokenizers, each optimized for specific linguistic characteristics and writing systems.

```python { .api }
def load(name: str, **overrides) -> Language: ...
def blank(name: str, **kwargs) -> Language: ...
```

[Language Models](./language-models.md)
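`blank()` builds a tokenizer-only pipeline from a language code, while `load()` loads a trained pipeline package that must be downloaded first. A small sketch:

```python
import spacy

# Blank pipeline: language-specific tokenizer, no trained components
nlp_de = spacy.blank("de")
print(nlp_de.lang)  # de

# Trained pipeline: requires `python -m spacy download en_core_web_sm` first
# nlp = spacy.load("en_core_web_sm")
```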

### Visualization

Interactive visualization tools for displaying linguistic analysis, including dependency trees, named entities, and custom visualizations.

```python { .api }
def render(docs, style: str = "dep", **options) -> str: ...
def serve(docs, style: str = "dep", port: int = 5000, **options) -> None: ...
```

[Visualization](./visualization.md)
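Outside a notebook, `displacy.render()` returns the markup as a string, which can be written to a file or embedded in a page. A sketch rendering entity highlights, with the entities supplied by the rule-based `entity_ruler` so no trained model is needed:

```python
import spacy
from spacy import displacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Acme"}])

doc = nlp("Acme hired ten engineers.")
# style="ent" highlights named entities; style="dep" draws dependency arcs
html = displacy.render(doc, style="ent")
print(html[:50])  # beginning of the generated HTML markup
```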

### Training and Model Building

Tools for training custom models, fine-tuning existing models, and creating specialized NLP pipelines for domain-specific applications.

```python { .api }
def train(nlp: Language, examples: List, **kwargs) -> dict: ...
def evaluate(nlp: Language, examples: List, **kwargs) -> dict: ...
```

[Training](./training.md)
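In spaCy v3, in-Python training runs through `nlp.update()` with `Example` objects. A minimal sketch training a fresh text classifier on toy data (real training would use the config-driven `spacy train` CLI and far more data):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

train_data = [
    ("I love this product", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("This is terrible", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(t), ann) for t, ann in train_data]

# Initialize weights from the examples, then run update steps
optimizer = nlp.initialize(get_examples=lambda: examples)
for epoch in range(10):
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

doc = nlp("I love this product")
print(doc.cats)  # scores for POSITIVE and NEGATIVE
```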

## Key Types

```python { .api }
class Language:
    """Main NLP pipeline class."""
    vocab: Vocab
    pipeline: List[tuple]
    pipe_names: List[str]

    def __call__(self, text: str) -> Doc: ...
    def pipe(self, texts: Iterable[str], batch_size: int = 1000) -> Iterator[Doc]: ...
    def add_pipe(self, factory_name: str, name: str = None, **kwargs) -> callable: ...

class Doc:
    """Container for accessing linguistic annotations."""
    text: str
    text_with_ws: str
    ents: tuple
    noun_chunks: Iterator
    sents: Iterator
    vector: numpy.ndarray

    def similarity(self, other) -> float: ...
    def to_json(self) -> dict: ...

class Token:
    """Individual token with linguistic annotations."""
    text: str
    lemma_: str
    pos_: str
    tag_: str
    dep_: str
    ent_type_: str
    head: 'Token'
    children: Iterator
    is_alpha: bool
    is_digit: bool
    is_punct: bool
    like_num: bool

class Span:
    """Slice of a document."""
    text: str
    label_: str
    kb_id_: str
    vector: numpy.ndarray

    def similarity(self, other) -> float: ...
    def as_doc(self) -> Doc: ...

class Vocab:
    """Vocabulary store."""
    strings: StringStore
    vectors: Vectors

    def __getitem__(self, string: str) -> Lexeme: ...
```
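The `Vocab` mediates between strings and 64-bit hash IDs through its `StringStore`, and indexing it with a string yields a `Lexeme` (context-independent word entry). A small sketch:

```python
import spacy

nlp = spacy.blank("en")
nlp("apple banana")  # processing text adds entries to the vocab

# StringStore maps strings to hash IDs and back
apple_hash = nlp.vocab.strings["apple"]
print(nlp.vocab.strings[apple_hash])  # apple

# Indexing the Vocab with a string returns a Lexeme
lexeme = nlp.vocab["apple"]
print(lexeme.text, lexeme.is_alpha)  # apple True
```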