Google Cloud Document AI client library for extracting structured information from documents using machine learning
npx @tessl/cli install tessl/pypi-google-cloud-documentai@3.6.0
# Google Cloud Document AI

Google Cloud Document AI is a machine learning service that extracts structured data from documents using pre-trained and custom document processors. The service can process various document types including invoices, receipts, forms, contracts, and other business documents.

## Package Information

- **Package Name:** `google-cloud-documentai`
- **Version:** 3.6.0
- **Documentation:** [Google Cloud Document AI Documentation](https://cloud.google.com/document-ai)

### Installation

```bash
pip install google-cloud-documentai
```

### Authentication

This package requires Google Cloud credentials. Provide them with one of the first two methods below; the third setting is optional and only selects a default project:

1. **Application Default Credentials (Recommended)**:
   ```bash
   gcloud auth application-default login
   ```

2. **Service Account Key**:
   ```bash
   export GOOGLE_APPLICATION_CREDENTIALS="path/to/service-account-key.json"
   ```

3. **Project Environment Variable (optional, not a credential)**:
   ```bash
   export GOOGLE_CLOUD_PROJECT="your-project-id"
   ```
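
As a rough sketch of how application code might pick up these settings: the helper names `resolve_project_id` and `load_default_credentials` are hypothetical, while `google.auth.default()` is the google-auth call that loads Application Default Credentials. It is imported lazily so the pure-Python part runs even where the library is not installed.

```python
import os
from typing import Optional


def resolve_project_id(explicit: Optional[str] = None) -> Optional[str]:
    """Return an explicit project ID if given, else the GOOGLE_CLOUD_PROJECT value."""
    if explicit:
        return explicit
    return os.environ.get("GOOGLE_CLOUD_PROJECT")


def load_default_credentials():
    """Load Application Default Credentials (needs google-auth and configured credentials)."""
    import google.auth  # lazy import: only required when credentials are actually loaded

    credentials, detected_project = google.auth.default()
    # Prefer the environment variable, fall back to the project detected by google-auth
    return credentials, resolve_project_id() or detected_project
```

Calling `resolve_project_id("my-project")` returns the explicit value, so code paths that already know the project never depend on the environment.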

## Core Imports

```python { .api }
# Main module - exports v1 (stable) API
from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.documentai import Document, ProcessRequest, ProcessResponse

# Alternative import pattern
from google.cloud import documentai

# For async operations
from google.cloud.documentai import DocumentProcessorServiceAsyncClient

# Core types for document processing, re-exported by the main module
# (they are defined in google.cloud.documentai_v1.types)
from google.cloud.documentai import (
    RawDocument,
    GcsDocument,
    Processor,
    ProcessorType,
    BoundingPoly,
    Vertex,
)
```

## Basic Usage Example

```python { .api }
from google.api_core.client_options import ClientOptions
from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.documentai import RawDocument, ProcessRequest

def process_document(project_id: str, location: str, processor_id: str, file_path: str, mime_type: str):
    """
    Process a document using Google Cloud Document AI.

    Args:
        project_id: Google Cloud project ID
        location: Processor location (e.g., 'us' or 'eu')
        processor_id: ID of the document processor to use
        file_path: Path to the document file
        mime_type: MIME type of the document (e.g., 'application/pdf')

    Returns:
        Document: Processed document with extracted data
    """
    # Initialize the client against the endpoint for the processor's location
    client = DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    )

    # The full resource name of the processor
    name = client.processor_path(project_id, location, processor_id)

    # Read the document file
    with open(file_path, "rb") as f:
        document_content = f.read()

    # Create raw document
    raw_document = RawDocument(content=document_content, mime_type=mime_type)

    # Configure the process request
    request = ProcessRequest(name=name, raw_document=raw_document)

    # Process the document
    result = client.process_document(request=request)

    # Access processed document
    document = result.document

    print(f"Document text: {document.text}")
    print(f"Number of pages: {len(document.pages)}")

    # Extract entities
    for entity in document.entities:
        print(f"Entity: {entity.type_} = {entity.mention_text}")

    return document

# Example usage
document = process_document(
    project_id="my-project",
    location="us",
    processor_id="abc123def456",
    file_path="invoice.pdf",
    mime_type="application/pdf"
)
```

## Architecture

### Document Processing Workflow

Google Cloud Document AI follows this processing workflow:

1. **Document Input**: Raw documents (PDF, images) or Cloud Storage references
2. **Processor Selection**: Choose appropriate pre-trained or custom processor
3. **Processing**: AI models extract text, layout, and structured data
4. **Output**: Structured document with text, entities, tables, and metadata
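
For the document input step, a request carries a MIME type alongside the raw bytes. The sketch below guesses it from the file name with the standard library. The helper name `guess_mime_type` and the `ASSUMED_SUPPORTED` set are illustrative assumptions, not part of the client library; check the processor documentation for the formats it actually accepts.

```python
import mimetypes

# Assumed set for illustration only; confirm against the processor's documentation
ASSUMED_SUPPORTED = {
    "application/pdf",
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/gif",
}


def guess_mime_type(file_path: str) -> str:
    """Guess a MIME type from the file extension and reject unexpected types."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    if mime_type is None:
        raise ValueError(f"Cannot determine MIME type for {file_path!r}")
    if mime_type not in ASSUMED_SUPPORTED:
        raise ValueError(f"Unexpected MIME type {mime_type!r} for {file_path!r}")
    return mime_type
```

The returned string can be passed as the `mime_type` of a `RawDocument`. Guessing from the extension is only a convenience, so callers that know the real type should pass it explicitly.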

### Key Concepts

#### Processors
Processors are AI models that extract data from specific document types:
- **Pre-trained processors**: Ready-to-use for common documents (invoices, receipts, forms)
- **Custom processors**: Trained on your specific document types
- **Processor versions**: Different iterations of a processor with varying capabilities

#### Documents
The `Document` type represents processed documents with:
- **Text**: Extracted text content with character-level positioning
- **Pages**: Individual pages with layout elements (blocks, paragraphs, lines, tokens)
- **Entities**: Extracted structured data (names, dates, amounts, addresses)
- **Tables**: Detected tables with cell-level data
- **Form fields**: Key-value pairs from forms
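
Layout elements do not carry their own copy of the text. They point into `Document.text` through a text anchor made of segments with `start_index` and `end_index` offsets. The sketch below resolves such an anchor; the helper name `anchor_text` is hypothetical, and the `SimpleNamespace` objects are stand-ins that only mimic the shape of a text anchor for a dry run.

```python
from types import SimpleNamespace


def anchor_text(text_anchor, full_text: str) -> str:
    """Join the slices of full_text referenced by text_anchor.text_segments."""
    parts = []
    for segment in text_anchor.text_segments:
        start = int(segment.start_index or 0)  # an unset start index means offset 0
        end = int(segment.end_index)
        parts.append(full_text[start:end])
    return "".join(parts)


# Stand-in data shaped like a document's text and one text anchor
text = "Invoice 42\nTotal: $10\n"
anchor = SimpleNamespace(
    text_segments=[SimpleNamespace(start_index=0, end_index=10)]
)
print(anchor_text(anchor, text))  # Invoice 42
```

With a processed document, the same helper would typically be called as `anchor_text(page.paragraphs[0].layout.text_anchor, document.text)`.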
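
#### Locations
Each processor is created in a specific location, which appears in its resource name:
- `us`: United States (multi-region)
- `eu`: European Union (multi-region)
- Additional single regions are available for some processors; see the Document AI locations documentation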
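
Requests generally need to reach the API endpoint that matches the processor's location. A minimal sketch, assuming the `{location}-documentai.googleapis.com` endpoint pattern; the helper names `regional_endpoint` and `make_client` are hypothetical, and the client library is imported lazily so the endpoint helper can be tried on its own.

```python
def regional_endpoint(location: str) -> str:
    """Build the Document AI endpoint host for a processor location."""
    if not location:
        raise ValueError("location must be a non-empty string")
    return f"{location}-documentai.googleapis.com"


def make_client(location: str):
    """Create a client bound to the endpoint for `location` (needs credentials)."""
    # Lazy imports: only required when a client is actually constructed
    from google.api_core.client_options import ClientOptions
    from google.cloud.documentai import DocumentProcessorServiceClient

    options = ClientOptions(api_endpoint=regional_endpoint(location))
    return DocumentProcessorServiceClient(client_options=options)
```

Keeping the endpoint logic in one helper avoids mixing a processor from one location with a client pointed at another.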

## Capabilities

### Document Processing Operations
Core functionality for processing individual and batch documents.

```python { .api }
# Process single document
from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.documentai import ProcessRequest, RawDocument

client = DocumentProcessorServiceClient()
# A request needs both the processor name and a document to process
with open("invoice.pdf", "rb") as f:
    raw_document = RawDocument(content=f.read(), mime_type="application/pdf")

request = ProcessRequest(
    name="projects/my-project/locations/us/processors/abc123",
    raw_document=raw_document,
)
result = client.process_document(request=request)
```

**[→ Document Processing Operations](./document-processing.md)**

### Processor Management
Manage processor lifecycle including creation, deployment, and training.

```python { .api }
# List available processors
from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.documentai import ListProcessorsRequest

client = DocumentProcessorServiceClient()
request = ListProcessorsRequest(parent="projects/my-project/locations/us")

# list_processors returns a pager; iterating it walks through every page
for processor in client.list_processors(request=request):
    print(f"Processor: {processor.display_name} ({processor.name})")
```

**[→ Processor Management](./processor-management.md)**

### Document Types and Schemas
Work with document structures, entities, and type definitions.

```python { .api }
# Access document structure
from google.cloud.documentai import Document

def analyze_document_structure(document: Document):
    """Analyze the structure of a processed document."""
    print(f"Total text length: {len(document.text)}")

    # Analyze pages
    for i, page in enumerate(document.pages):
        print(f"Page {i+1}: {len(page.blocks)} blocks, {len(page.paragraphs)} paragraphs")

    # Analyze entities by type
    entity_types = {}
    for entity in document.entities:
        entity_type = entity.type_
        if entity_type not in entity_types:
            entity_types[entity_type] = []
        entity_types[entity_type].append(entity.mention_text)

    for entity_type, mentions in entity_types.items():
        print(f"{entity_type}: {len(mentions)} instances")
```

**[→ Document Types and Schemas](./document-types.md)**

### Batch Operations
Process multiple documents asynchronously for high-volume workflows.

```python { .api }
# Batch process documents
from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.documentai import (
    BatchDocumentsInputConfig,
    BatchProcessRequest,
    GcsDocuments,
)

client = DocumentProcessorServiceClient()

# Configure batch request
gcs_documents = GcsDocuments(documents=[
    {"gcs_uri": "gs://my-bucket/doc1.pdf", "mime_type": "application/pdf"},
    {"gcs_uri": "gs://my-bucket/doc2.pdf", "mime_type": "application/pdf"}
])

request = BatchProcessRequest(
    name="projects/my-project/locations/us/processors/abc123",
    # input_documents expects a BatchDocumentsInputConfig wrapping the GCS documents
    input_documents=BatchDocumentsInputConfig(gcs_documents=gcs_documents),
    document_output_config={
        "gcs_output_config": {"gcs_uri": "gs://my-bucket/output/"}
    }
)

# Returns a long-running operation; result() blocks until the batch finishes
operation = client.batch_process_documents(request=request)
operation.result(timeout=600)
```

**[→ Batch Operations](./batch-operations.md)**

### Beta Features (v1beta3)
Access experimental features including dataset management and enhanced document processing.

```python { .api }
# Beta features - DocumentService for dataset management
from google.cloud.documentai_v1beta3 import DocumentServiceClient
from google.cloud.documentai_v1beta3.types import Dataset

client = DocumentServiceClient()

# List documents in a dataset (the request field is `dataset`, not `parent`)
request = {"dataset": "projects/my-project/locations/us/processors/abc123/dataset"}
response = client.list_documents(request=request)
```

**[→ Beta Features](./beta-features.md)**

## API Versions

### V1 (Stable)
The main `google.cloud.documentai` module exports the stable v1 API:
- **Module**: `google.cloud.documentai`
- **Direct access**: `google.cloud.documentai_v1`
- **Status**: Production ready
- **Features**: Core document processing and processor management

### V1beta3 (Beta)
Extended API with additional features:
- **Module**: `google.cloud.documentai_v1beta3`
- **Status**: Beta (subject to breaking changes)
- **Additional features**: Dataset management, enhanced document operations, custom training

## Error Handling

```python { .api }
from google.api_core.exceptions import GoogleAPICallError, InvalidArgument, NotFound
from google.cloud.documentai import DocumentProcessorServiceClient

client = DocumentProcessorServiceClient()

try:
    # Process document (`request` is a ProcessRequest built as in Basic Usage Example)
    result = client.process_document(request=request)
except NotFound as e:
    print(f"Processor not found: {e}")
except InvalidArgument as e:
    print(f"Invalid request: {e}")
except GoogleAPICallError as e:
    # Base class for API call errors; avoids depending on google-cloud-core
    print(f"Google Cloud error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```
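
Some failures are transient and worth retrying rather than reporting. The following is a generic, library-independent sketch of exponential backoff; `call_with_backoff` and its parameters are hypothetical names, not part of the Document AI client. The client methods also accept a `retry` argument backed by `google.api_core.retry.Retry`, which is usually the better choice in production code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying with exponential backoff while is_retryable(exc) is true."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Delay doubles on each retry: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

With Document AI, `is_retryable` could for example test for `ServiceUnavailable` or `DeadlineExceeded` from `google.api_core.exceptions`, while errors such as `InvalidArgument` should fail immediately because retrying cannot fix the request.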

## Resource Names

Google Cloud Document AI uses hierarchical resource names:

```python { .api }
from google.cloud.documentai import DocumentProcessorServiceClient

client = DocumentProcessorServiceClient()

# Build resource names using helper methods
processor_path = client.processor_path("my-project", "us", "processor-id")
# Result: "projects/my-project/locations/us/processors/processor-id"

processor_version_path = client.processor_version_path(
    "my-project", "us", "processor-id", "version-id"
)
# Result: "projects/my-project/locations/us/processors/processor-id/processorVersions/version-id"

location_path = client.common_location_path("my-project", "us")
# Result: "projects/my-project/locations/us"
```
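
Going the other way, a processor name can be split back into its components. The sketch below does this with a regular expression to make the hierarchy explicit; `parse_processor_name` is a hypothetical helper. The generated client also offers `parse_processor_path` for the same purpose, so this standalone version is mainly useful where the client library is not available.

```python
import re
from typing import Dict

# Pattern mirrors projects/{project}/locations/{location}/processors/{processor}
_PROCESSOR_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/processors/(?P<processor>[^/]+)$"
)


def parse_processor_name(name: str) -> Dict[str, str]:
    """Split a processor resource name into project, location, and processor ID."""
    match = _PROCESSOR_NAME.match(name)
    if match is None:
        raise ValueError(f"Not a processor resource name: {name!r}")
    return match.groupdict()
```

The parsed `location` is handy for choosing the matching API endpoint when only the full resource name is stored in configuration.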

## Performance Considerations

- **Document Size**: Individual documents up to about 20 MB for online processing, and up to about 1,000 documents per batch operation; exact limits depend on the processor, so confirm them against the current quotas
- **Rate Limits**: Vary by processor type and region
- **Async Processing**: Use batch operations for high-volume processing
- **Caching**: Consider caching processed results for frequently accessed documents
- **Regional Processing**: Use the same region as your data for better performance
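
When a workload exceeds the per-request document cap, the input list has to be split across several batch operations. A minimal sketch, assuming the cap is passed in by the caller rather than hard-coded; `chunk_uris` is a hypothetical helper and the `gs://` paths are placeholders.

```python
from typing import Iterator, List, Sequence


def chunk_uris(uris: Sequence[str], max_per_request: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `uris`, each holding at most max_per_request items."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    for start in range(0, len(uris), max_per_request):
        yield list(uris[start:start + max_per_request])


# Placeholder URIs for a dry run
uris = [f"gs://my-bucket/doc{i}.pdf" for i in range(5)]
for batch in chunk_uris(uris, max_per_request=2):
    print(len(batch), batch[0])
```

Each yielded list can then populate one batch request, which keeps every operation within the limit and lets failed chunks be retried independently.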

## Next Steps

- **[Document Processing Operations](./document-processing.md)**: Learn core document processing workflows
- **[Processor Management](./processor-management.md)**: Manage and configure processors
- **[Document Types and Schemas](./document-types.md)**: Understand document structure and types
- **[Batch Operations](./batch-operations.md)**: Process documents at scale
- **[Beta Features](./beta-features.md)**: Explore cutting-edge capabilities