tessl/pypi-google-cloud-documentai

Google Cloud Document AI client library for extracting structured information from documents using machine learning


Google Cloud Document AI

Google Cloud Document AI is a machine learning service that extracts structured data from documents using pre-trained and custom document processors. The service can process various document types including invoices, receipts, forms, contracts, and other business documents.

Package Information

Package Name: google-cloud-documentai
Version: 3.6.0
Documentation: Google Cloud Document AI Documentation

Installation

pip install google-cloud-documentai

Authentication

This package requires Google Cloud authentication. Set up authentication using one of these methods:

  1. Application Default Credentials (Recommended):

    gcloud auth application-default login
  2. Service Account Key:

    export GOOGLE_APPLICATION_CREDENTIALS="path/to/service-account-key.json"
  3. Environment Variables:

    export GOOGLE_CLOUD_PROJECT="your-project-id"

Core Imports

# Main module - exports v1 (stable) API
from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.documentai import Document, ProcessRequest, ProcessResponse

# Alternative import pattern
from google.cloud import documentai

# For async operations
from google.cloud.documentai import DocumentProcessorServiceAsyncClient

# Core types for document processing (re-exported at the package top level;
# they are also available under google.cloud.documentai_v1.types)
from google.cloud.documentai import (
    RawDocument,
    GcsDocument,
    Processor,
    ProcessorType,
    BoundingPoly,
    Vertex
)

Basic Usage Example

from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.documentai import RawDocument, ProcessRequest

def process_document(project_id: str, location: str, processor_id: str, file_path: str, mime_type: str):
    """
    Process a document using Google Cloud Document AI.
    
    Args:
        project_id: Google Cloud project ID
        location: Processor location (e.g., 'us' or 'eu')  
        processor_id: ID of the document processor to use
        file_path: Path to the document file
        mime_type: MIME type of the document (e.g., 'application/pdf')
    
    Returns:
        Document: Processed document with extracted data
    """
    # Initialize the client
    client = DocumentProcessorServiceClient()
    
    # The full resource name of the processor
    name = client.processor_path(project_id, location, processor_id)
    
    # Read the document file (avoid naming the handle `document`,
    # which would be shadowed by the processed result below)
    with open(file_path, "rb") as f:
        document_content = f.read()
    
    # Create raw document
    raw_document = RawDocument(content=document_content, mime_type=mime_type)
    
    # Configure the process request
    request = ProcessRequest(name=name, raw_document=raw_document)
    
    # Process the document
    result = client.process_document(request=request)
    
    # Access processed document
    document = result.document
    
    print(f"Document text: {document.text}")
    print(f"Number of pages: {len(document.pages)}")
    
    # Extract entities
    for entity in document.entities:
        print(f"Entity: {entity.type_} = {entity.mention_text}")
    
    return document

# Example usage
document = process_document(
    project_id="my-project",
    location="us",
    processor_id="abc123def456",
    file_path="invoice.pdf",
    mime_type="application/pdf"
)

Architecture

Document Processing Workflow

Google Cloud Document AI follows this processing workflow:

  1. Document Input: Raw documents (PDF, images) or Cloud Storage references
  2. Processor Selection: Choose appropriate pre-trained or custom processor
  3. Processing: AI models extract text, layout, and structured data
  4. Output: Structured document with text, entities, tables, and metadata

Key Concepts

Processors

Processors are AI models that extract data from specific document types:

  • Pre-trained processors: Ready-to-use for common documents (invoices, receipts, forms)
  • Custom processors: Trained on your specific document types
  • Processor versions: Different iterations of a processor with varying capabilities

Documents

The Document type represents processed documents with:

  • Text: Extracted text content with character-level positioning
  • Pages: Individual pages with layout elements (blocks, paragraphs, lines, tokens)
  • Entities: Extracted structured data (names, dates, amounts, addresses)
  • Tables: Detected tables with cell-level data
  • Form fields: Key-value pairs from forms

Locations

Processors are deployed in specific regions:

  • us: United States multi-region
  • eu: European Union multi-region
  • Additional regional locations are available for some processor types

Capabilities

Document Processing Operations

Core functionality for processing individual and batch documents.

# Process single document
from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.documentai import ProcessRequest

client = DocumentProcessorServiceClient()
# A complete request also needs a document payload (raw_document or
# gcs_document); see the basic usage example above
request = ProcessRequest(name="projects/my-project/locations/us/processors/abc123")
result = client.process_document(request=request)

→ Document Processing Operations

Processor Management

Manage processor lifecycle including creation, deployment, and training.

# List available processors
from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.documentai import ListProcessorsRequest

client = DocumentProcessorServiceClient()
request = ListProcessorsRequest(parent="projects/my-project/locations/us")
response = client.list_processors(request=request)

for processor in response.processors:
    print(f"Processor: {processor.display_name} ({processor.name})")

→ Processor Management

Document Types and Schemas

Work with document structures, entities, and type definitions.

# Access document structure
from google.cloud.documentai import Document

def analyze_document_structure(document: Document):
    """Analyze the structure of a processed document."""
    print(f"Total text length: {len(document.text)}")
    
    # Analyze pages
    for i, page in enumerate(document.pages):
        print(f"Page {i+1}: {len(page.blocks)} blocks, {len(page.paragraphs)} paragraphs")
    
    # Analyze entities by type
    entity_types = {}
    for entity in document.entities:
        entity_type = entity.type_
        if entity_type not in entity_types:
            entity_types[entity_type] = []
        entity_types[entity_type].append(entity.mention_text)
    
    for entity_type, mentions in entity_types.items():
        print(f"{entity_type}: {len(mentions)} instances")

→ Document Types and Schemas

Batch Operations

Process multiple documents asynchronously for high-volume workflows.

# Batch process documents
from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.documentai import BatchProcessRequest, GcsDocuments

client = DocumentProcessorServiceClient()

# Configure batch request
gcs_documents = GcsDocuments(documents=[
    {"gcs_uri": "gs://my-bucket/doc1.pdf", "mime_type": "application/pdf"},
    {"gcs_uri": "gs://my-bucket/doc2.pdf", "mime_type": "application/pdf"}
])

request = BatchProcessRequest(
    name="projects/my-project/locations/us/processors/abc123",
    input_documents=gcs_documents,
    document_output_config={
        "gcs_output_config": {"gcs_uri": "gs://my-bucket/output/"}
    }
)

operation = client.batch_process_documents(request=request)
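batch_process_documents returns a long-running operation rather than a finished result, so callers normally block on it before reading output from the GCS destination. A minimal sketch (the helper name and timeout value are illustrative):

```python
def wait_for_batch(operation, timeout: float = 600.0):
    """Block until the batch operation finishes; raises on failure."""
    response = operation.result(timeout=timeout)
    # Per-document progress and status are reported in the operation metadata
    return response, operation.metadata
```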

→ Batch Operations

Beta Features (v1beta3)

Access experimental features including dataset management and enhanced document processing.

# Beta features - DocumentService for dataset management
from google.cloud.documentai_v1beta3 import DocumentServiceClient
from google.cloud.documentai_v1beta3.types import Dataset

client = DocumentServiceClient()

# List documents in a processor's dataset (the request takes a `dataset`
# resource name rather than a generic `parent`)
request = {"dataset": "projects/my-project/locations/us/processors/abc123/dataset"}
response = client.list_documents(request=request)

→ Beta Features

API Versions

V1 (Stable)

The main google.cloud.documentai module exports the stable v1 API:

  • Module: google.cloud.documentai
  • Direct access: google.cloud.documentai_v1
  • Status: Production ready
  • Features: Core document processing and processor management

V1beta3 (Beta)

Extended API with additional features:

  • Module: google.cloud.documentai_v1beta3
  • Status: Beta (subject to breaking changes)
  • Additional features: Dataset management, enhanced document operations, custom training

Error Handling

from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.exceptions import GoogleCloudError
from google.api_core.exceptions import NotFound, InvalidArgument

client = DocumentProcessorServiceClient()

try:
    # Process document
    result = client.process_document(request=request)
except NotFound as e:
    print(f"Processor not found: {e}")
except InvalidArgument as e:
    print(f"Invalid request: {e}")
except GoogleCloudError as e:
    print(f"Google Cloud error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Resource Names

Google Cloud Document AI uses hierarchical resource names:

from google.cloud.documentai import DocumentProcessorServiceClient

client = DocumentProcessorServiceClient()

# Build resource names using helper methods
processor_path = client.processor_path("my-project", "us", "processor-id")
# Result: "projects/my-project/locations/us/processors/processor-id"

processor_version_path = client.processor_version_path(
    "my-project", "us", "processor-id", "version-id"
)
# Result: "projects/my-project/locations/us/processors/processor-id/processorVersions/version-id"

location_path = client.common_location_path("my-project", "us") 
# Result: "projects/my-project/locations/us"

Performance Considerations

  • Document Size: Individual documents up to 20MB, batch operations up to 1000 documents
  • Rate Limits: Varies by processor type and region
  • Async Processing: Use batch operations for high-volume processing
  • Caching: Consider caching processed results for frequently accessed documents
  • Regional Processing: Use the same region as your data for better performance
