# Text Classification

Categorizes text documents into predefined categories, enabling automated content organization and filtering by subject matter and theme. The classifier identifies topics, genres, and content types to support content management, routing, and analysis at scale.

## Capabilities

### Classify Text

Analyzes the provided text and assigns it to relevant predefined categories, each with a confidence score.

```python { .api }
def classify_text(
    self,
    request: Optional[Union[ClassifyTextRequest, dict]] = None,
    *,
    document: Optional[Document] = None,
    retry: OptionalRetry = gapic_v1.method.DEFAULT,
    timeout: Union[float, object] = gapic_v1.method.DEFAULT,
    metadata: Sequence[Tuple[str, Union[str, bytes]]] = ()
) -> ClassifyTextResponse:
    """
    Classifies a document into categories.

    Args:
        request: The request object containing document and options
        document: Input document for classification
        retry: Retry configuration for the request
        timeout: Request timeout in seconds
        metadata: Additional metadata to send with the request

    Returns:
        ClassifyTextResponse containing classification categories
    """
```

#### Usage Example

```python
from google.cloud import language

# Initialize client
client = language.LanguageServiceClient()

# Create document
document = language.Document(
    content="""
    The latest advancements in artificial intelligence and machine learning
    are revolutionizing how we approach data analysis and predictive modeling.
    Neural networks and deep learning algorithms are becoming increasingly
    sophisticated, enabling more accurate predictions and insights from
    complex datasets.
    """,
    type_=language.Document.Type.PLAIN_TEXT
)

# Classify text
response = client.classify_text(
    request={"document": document}
)

# Process classification results
print("Classification Results:")
for category in response.categories:
    print(f"Category: {category.name}")
    print(f"Confidence: {category.confidence:.3f}")
    print()
```

## Request and Response Types

### ClassifyTextRequest

```python { .api }
class ClassifyTextRequest:
    document: Document
    classification_model_options: ClassificationModelOptions  # v1/v1beta2 only
```

### ClassifyTextResponse

```python { .api }
class ClassifyTextResponse:
    categories: MutableSequence[ClassificationCategory]
```

## Supporting Types

### ClassificationCategory

Represents a classification category and its confidence score.

```python { .api }
class ClassificationCategory:
    name: str  # Category name (hierarchical path)
    confidence: float  # Confidence score [0.0, 1.0]
```

### ClassificationModelOptions (v1/v1beta2 only)

Configuration options for the classification model.

```python { .api }
class ClassificationModelOptions:
    class V1Model(proto.Message):
        pass

    class V2Model(proto.Message):
        pass

    v1_model: V1Model  # Use V1 classification model
    v2_model: V2Model  # Use V2 classification model
```

## Category Hierarchy

Category names are hierarchical paths. Each level begins with a forward slash, and deeper levels are more specific.

### Common Top-Level Categories

- `/Arts & Entertainment`
- `/Autos & Vehicles`
- `/Beauty & Fitness`
- `/Books & Literature`
- `/Business & Industrial`
- `/Computers & Electronics`
- `/Finance`
- `/Food & Drink`
- `/Games`
- `/Health`
- `/Hobbies & Leisure`
- `/Home & Garden`
- `/Internet & Telecom`
- `/Jobs & Education`
- `/Law & Government`
- `/News`
- `/Online Communities`
- `/People & Society`
- `/Pets & Animals`
- `/Real Estate`
- `/Reference`
- `/Science`
- `/Shopping`
- `/Sports`
- `/Travel`

### Hierarchical Examples

- `/Computers & Electronics/Software`
- `/Computers & Electronics/Software/Business Software`
- `/Arts & Entertainment/Movies`
- `/Arts & Entertainment/Music & Audio`
- `/Science/Computer Science`
- `/Business & Industrial/Advertising & Marketing`
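
Because category names are slash-delimited paths, their levels can be recovered with plain string handling. The following is a minimal sketch; `split_category_path` and `category_ancestors` are hypothetical helpers written for illustration and are not part of the client library.

```python
def split_category_path(name):
    """Split a category path such as '/A/B' into its levels ['A', 'B']."""
    # The leading slash produces an empty first element, which is dropped here
    return [part for part in name.split("/") if part]


def category_ancestors(name):
    """Return every prefix path of a category, from top level to the full path."""
    levels = split_category_path(name)
    return ["/" + "/".join(levels[: i + 1]) for i in range(len(levels))]


path = "/Computers & Electronics/Software/Business Software"
print(split_category_path(path))
print(category_ancestors(path))
```

The ancestor list is useful when a document should be counted under its parent categories as well as its most specific one.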

## Advanced Usage

### Multi-Category Classification

```python
def classify_and_rank_categories(client, text, min_confidence=0.1):
    """Classify text and rank all categories above threshold."""
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT
    )

    response = client.classify_text(
        request={"document": document}
    )

    # Filter and sort categories
    filtered_categories = [
        cat for cat in response.categories
        if cat.confidence >= min_confidence
    ]

    sorted_categories = sorted(
        filtered_categories,
        key=lambda x: x.confidence,
        reverse=True
    )

    return sorted_categories

# Usage
text = """
Machine learning algorithms are transforming healthcare by enabling
early disease detection through medical imaging analysis. Artificial
intelligence systems can now identify patterns in X-rays, MRIs, and
CT scans that might be missed by human radiologists.
"""

categories = classify_and_rank_categories(client, text, min_confidence=0.1)

print("All Categories (above 10% confidence):")
for cat in categories:
    print(f"{cat.name}: {cat.confidence:.3f}")
```

### Batch Classification

```python
def classify_multiple_documents(client, documents):
    """Classify multiple documents and return aggregated results."""
    results = []

    for i, doc_text in enumerate(documents):
        document = language.Document(
            content=doc_text,
            type_=language.Document.Type.PLAIN_TEXT
        )
        # Truncate long texts to a 100-character preview
        text_preview = (doc_text[:100] + "...") if len(doc_text) > 100 else doc_text

        try:
            response = client.classify_text(
                request={"document": document}
            )

            doc_categories = []
            for category in response.categories:
                doc_categories.append({
                    'name': category.name,
                    'confidence': category.confidence
                })

            results.append({
                'document_index': i,
                'text_preview': text_preview,
                'categories': doc_categories
            })

        except Exception as e:
            # Record the failure and continue with the remaining documents
            results.append({
                'document_index': i,
                'text_preview': text_preview,
                'error': str(e),
                'categories': []
            })

    return results

# Usage
documents = [
    "Stock market analysis and investment strategies for portfolio management.",
    "Latest updates in artificial intelligence and machine learning research.",
    "Healthy cooking recipes for vegetarian and vegan diets.",
    "Professional basketball game highlights and player statistics."
]

batch_results = classify_multiple_documents(client, documents)

for result in batch_results:
    print(f"Document {result['document_index']}: {result['text_preview']}")
    if 'error' in result:
        print(f"  Error: {result['error']}")
    else:
        for cat in result['categories']:
            print(f"  {cat['name']}: {cat['confidence']:.3f}")
    print()
```

### Category Filtering and Grouping

```python
def group_by_top_level_category(categories):
    """Group categories by their top-level parent."""
    grouped = {}

    for category in categories:
        # Extract top-level category
        parts = category.name.split('/')
        top_level = '/' + parts[1] if len(parts) > 1 else category.name

        if top_level not in grouped:
            grouped[top_level] = []

        grouped[top_level].append(category)

    return grouped

def get_most_specific_categories(categories, max_categories=3):
    """Get the most specific (deepest) categories with highest confidence."""
    # Sort by depth (number of slashes) and confidence
    sorted_cats = sorted(
        categories,
        key=lambda x: (x.name.count('/'), x.confidence),
        reverse=True
    )

    return sorted_cats[:max_categories]

# Usage
response = client.classify_text(request={"document": document})

# Group by top-level category
grouped_categories = group_by_top_level_category(response.categories)

print("Categories grouped by top-level:")
for top_level, cats in grouped_categories.items():
    print(f"{top_level}:")
    for cat in cats:
        print(f"  {cat.name}: {cat.confidence:.3f}")
    print()

# Get most specific categories
specific_categories = get_most_specific_categories(response.categories)

print("Most specific categories:")
for cat in specific_categories:
    depth = cat.name.count('/')
    print(f"{cat.name} (depth: {depth}): {cat.confidence:.3f}")
```

### Content Organization System

```python
class ContentOrganizer:
    def __init__(self, client):
        self.client = client
        self.category_mapping = {
            'technology': ['/Computers & Electronics', '/Science'],
            'business': ['/Business & Industrial', '/Finance'],
            'entertainment': ['/Arts & Entertainment', '/Games'],
            'health': ['/Health', '/Beauty & Fitness'],
            'lifestyle': ['/Home & Garden', '/Food & Drink', '/Hobbies & Leisure'],
            'news': ['/News', '/Law & Government'],
            'education': ['/Jobs & Education', '/Reference', '/Books & Literature'],
            'travel': ['/Travel'],
            'sports': ['/Sports'],
            'other': []  # Catch-all for unmatched categories
        }

    def organize_content(self, text):
        """Organize content into predefined buckets."""
        document = language.Document(
            content=text,
            type_=language.Document.Type.PLAIN_TEXT
        )

        response = self.client.classify_text(
            request={"document": document}
        )

        if not response.categories:
            return 'other', []

        # Find best matching bucket
        best_bucket = 'other'
        best_confidence = 0
        matched_categories = []

        for category in response.categories:
            for bucket, prefixes in self.category_mapping.items():
                for prefix in prefixes:
                    if category.name.startswith(prefix):
                        # Record every match, not only the best one so far
                        matched_categories.append({
                            'bucket': bucket,
                            'category': category.name,
                            'confidence': category.confidence
                        })
                        # Keep the bucket with the highest-confidence match
                        if category.confidence > best_confidence:
                            best_bucket = bucket
                            best_confidence = category.confidence
                        break

        return best_bucket, matched_categories

    def get_bucket_statistics(self, texts):
        """Get distribution of texts across buckets."""
        bucket_counts = {bucket: 0 for bucket in self.category_mapping.keys()}
        bucket_examples = {bucket: [] for bucket in self.category_mapping.keys()}

        for text in texts:
            bucket, categories = self.organize_content(text)
            bucket_counts[bucket] += 1

            if len(bucket_examples[bucket]) < 3:  # Store up to 3 examples
                bucket_examples[bucket].append({
                    'text': (text[:50] + "...") if len(text) > 50 else text,
                    'categories': categories
                })

        return bucket_counts, bucket_examples

# Usage
organizer = ContentOrganizer(client)

sample_texts = [
    "Latest developments in quantum computing and artificial intelligence.",
    "Investment strategies for stock market volatility and portfolio management.",
    "Delicious pasta recipes with organic ingredients and wine pairings.",
    "Professional soccer match analysis and player performance statistics.",
    "Breaking news about government policy changes and legal implications."
]

bucket_counts, bucket_examples = organizer.get_bucket_statistics(sample_texts)

print("Content Distribution:")
for bucket, count in bucket_counts.items():
    if count > 0:
        print(f"{bucket}: {count} documents")
        for example in bucket_examples[bucket]:
            print(f"  - {example['text']}")
```

### Model Selection (v1/v1beta2 only)

```python
from google.cloud import language_v1

def classify_with_specific_model(client, text, model_version='v2'):
    """Classify text using a specific model version."""
    document = language_v1.Document(
        content=text,
        type_=language_v1.Document.Type.PLAIN_TEXT
    )

    # Configure model options
    if model_version == 'v1':
        model_options = language_v1.ClassificationModelOptions(
            v1_model=language_v1.ClassificationModelOptions.V1Model()
        )
    else:  # v2
        model_options = language_v1.ClassificationModelOptions(
            v2_model=language_v1.ClassificationModelOptions.V2Model()
        )

    response = client.classify_text(
        request={
            "document": document,
            "classification_model_options": model_options
        }
    )

    return response.categories

# Usage (only with v1/v1beta2 clients)
# v1_categories = classify_with_specific_model(client, text, 'v1')
# v2_categories = classify_with_specific_model(client, text, 'v2')
```

### Confidence Threshold Analysis

```python
def analyze_classification_confidence(client, texts, thresholds=(0.1, 0.3, 0.5, 0.7)):
    """Analyze how classification results vary with different confidence thresholds."""
    results = {}

    for threshold in thresholds:
        results[threshold] = {
            'classified_count': 0,
            'unclassified_count': 0,
            'avg_categories_per_doc': 0,
            'total_categories': 0
        }

    for text in texts:
        document = language.Document(
            content=text,
            type_=language.Document.Type.PLAIN_TEXT
        )

        try:
            response = client.classify_text(
                request={"document": document}
            )

            for threshold in thresholds:
                filtered_categories = [
                    cat for cat in response.categories
                    if cat.confidence >= threshold
                ]

                if filtered_categories:
                    results[threshold]['classified_count'] += 1
                    results[threshold]['total_categories'] += len(filtered_categories)
                else:
                    results[threshold]['unclassified_count'] += 1

        except Exception:
            # A failed request counts as unclassified at every threshold
            for threshold in thresholds:
                results[threshold]['unclassified_count'] += 1

    # Calculate averages
    for threshold in thresholds:
        classified = results[threshold]['classified_count']
        if classified > 0:
            results[threshold]['avg_categories_per_doc'] = (
                results[threshold]['total_categories'] / classified
            )

    return results

# Usage
texts = [
    "Advanced machine learning techniques for predictive analytics.",
    "Gourmet cooking with seasonal vegetables and herbs.",
    "Financial planning strategies for retirement savings.",
    "Professional basketball playoffs and championship predictions."
]

confidence_analysis = analyze_classification_confidence(client, texts)

print("Classification Analysis by Confidence Threshold:")
for threshold, stats in confidence_analysis.items():
    print(f"Threshold {threshold}:")
    print(f"  Classified: {stats['classified_count']}")
    print(f"  Unclassified: {stats['unclassified_count']}")
    print(f"  Avg categories per doc: {stats['avg_categories_per_doc']:.2f}")
    print()
```

## Error Handling

```python
from google.api_core import exceptions

response = None  # Stays None if the request fails
try:
    response = client.classify_text(
        request={"document": document},
        timeout=15.0
    )
except exceptions.InvalidArgument as e:
    print(f"Invalid document: {e}")
    # Common causes: empty document, unsupported language, insufficient content
except exceptions.ResourceExhausted:
    print("API quota exceeded")
except exceptions.DeadlineExceeded:
    print("Request timed out")
except exceptions.GoogleAPIError as e:
    print(f"API error: {e}")

# Handle no classification results (only when the request itself succeeded)
if response is not None and not response.categories:
    print("No classification categories found - document may be too short or ambiguous")
```
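
Transient failures such as quota exhaustion or timeouts are often handled by retrying with exponential backoff. The sketch below uses only the standard library. `call_with_backoff` and `TransientError` are hypothetical names: the former is an illustrative wrapper, and the latter stands in for retryable exceptions such as `ResourceExhausted` or `DeadlineExceeded`. The client's own `retry` parameter may well be the better fit in practice.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable API error."""


def call_with_backoff(func, max_attempts=4, base_delay=1.0,
                      sleep=time.sleep, retryable=(TransientError,)):
    """Call func(), retrying retryable errors with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            # Delay doubles after each failed attempt: base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** attempt))
```

With the real client, one option is to pass `retryable=(exceptions.ResourceExhausted, exceptions.DeadlineExceeded)` and wrap the request in a lambda, for example `lambda: client.classify_text(request={"document": document})`.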

## Performance Considerations

- **Text Length**: Requires sufficient text (typically 20+ words) for accurate classification
- **Content Quality**: Better results with well-written, focused content
- **Language Support**: Optimized for English, with varying support for other languages
- **Caching**: Results can be cached for static content
- **Batch Processing**: Use the async client (`LanguageServiceAsyncClient`) for large document sets
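
The text-length and caching points above can be combined in a thin wrapper around whatever function performs the API call. This is a minimal sketch under stated assumptions: `CachedClassifier` and `classify_fn` are hypothetical names, the cache is an unbounded in-memory dict, and the 20-word minimum is only the rough guidance given above.

```python
import hashlib

MIN_WORDS = 20  # Rough threshold from the guidance above; tune for your content


class CachedClassifier:
    """Wrap a classify callable with a length check and an in-memory cache."""

    def __init__(self, classify_fn, min_words=MIN_WORDS):
        self.classify_fn = classify_fn  # e.g. a function that calls classify_text
        self.min_words = min_words
        self._cache = {}

    def classify(self, text):
        # Skip texts that are likely too short to classify reliably
        if len(text.split()) < self.min_words:
            return []
        # Key the cache on a digest so long texts are not stored as dict keys
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.classify_fn(text)
        return self._cache[key]
```

Passing the API call in as `classify_fn` keeps the wrapper independent of the client version and makes it easy to exercise without network access.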

## Use Cases

- **Content Management**: Automatically organize articles, documents, and web content
- **Email Routing**: Route support emails to appropriate departments
- **News Categorization**: Classify news articles by topic and theme
- **Product Categorization**: Organize product descriptions and reviews
- **Social Media Monitoring**: Categorize social media posts and comments
- **Document Archival**: Organize large document repositories
- **Content Recommendation**: Suggest related content based on categories
- **Compliance Filtering**: Filter content for regulatory compliance
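
As an illustration of the email routing use case, the sketch below maps classification results to departments by category prefix. Everything in it is an assumption made for the example: the `ROUTES` table, the department names, the 0.5 confidence cutoff, and the `route_message` helper. It takes plain `(name, confidence)` pairs so it can run without the client.

```python
# Hypothetical mapping from category prefixes to departments
ROUTES = [
    ("/Finance", "billing"),
    ("/Computers & Electronics", "technical-support"),
    ("/Law & Government", "legal"),
]


def route_message(categories, min_confidence=0.5, default="general"):
    """Pick a department from (name, confidence) pairs, highest confidence first."""
    ranked = sorted(categories, key=lambda c: c[1], reverse=True)
    for name, confidence in ranked:
        if confidence < min_confidence:
            break  # Remaining categories are even less confident
        for prefix, department in ROUTES:
            # Match the prefix itself or any of its subcategories
            if name == prefix or name.startswith(prefix + "/"):
                return department
    return default


print(route_message([("/Finance/Investing", 0.9), ("/News", 0.6)]))
```

With a real response, the pairs could be built as `[(c.name, c.confidence) for c in response.categories]`.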