tessl/pypi-google-cloud-documentai

Google Cloud Document AI client library for extracting structured information from documents using machine learning

Document Types and Schemas

Name: tessl/pypi-google-cloud-documentai
Author: tessl

This guide covers the type system and document structures in Google Cloud Document AI: document representation, document input types, geometry, processor definitions, schemas, and barcodes.

Core Document Structure

Document Type

The Document type represents a processed document with all extracted information:

from google.cloud.documentai.types import Document

class Document:
    """
    Represents a processed document with extracted text, layout, and entities.
    
    Attributes:
        text (str): UTF-8 encoded text extracted from the document
        pages (Sequence[Document.Page]): List of document pages
        entities (Sequence[Document.Entity]): Extracted entities
        text_styles (Sequence[Document.Style]): Text styling information
        shard_info (Document.ShardInfo): Shard metadata, set when a large document is split across multiple outputs
        error (google.rpc.Status): Processing error information if any
        mime_type (str): Original MIME type of the document
        uri (str): Optional URI where the document was retrieved from
    """
    
    class Page:
        """
        Represents a single page in the document.
        
        Attributes:
            page_number (int): 1-based page number
            dimension (Document.Page.Dimension): Page dimensions
            layout (Document.Page.Layout): Page layout information
            detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages detected on page
            blocks (Sequence[Document.Page.Block]): Text blocks on the page
            paragraphs (Sequence[Document.Page.Paragraph]): Paragraphs on the page
            lines (Sequence[Document.Page.Line]): Text lines on the page
            tokens (Sequence[Document.Page.Token]): Individual tokens on the page
            visual_elements (Sequence[Document.Page.VisualElement]): Visual elements like images
            tables (Sequence[Document.Page.Table]): Tables detected on the page
            form_fields (Sequence[Document.Page.FormField]): Form fields detected on the page
            symbols (Sequence[Document.Page.Symbol]): Symbols detected on the page
            detected_barcodes (Sequence[Document.Page.DetectedBarcode]): Barcodes on the page
        """
        
        class Dimension:
            """
            Physical dimension of the page.
            
            Attributes:
                width (float): Page width in specified unit
                height (float): Page height in specified unit  
                unit (str): Unit of measurement ('INCH', 'CM', 'POINT')
            """
            pass
        
        class Layout:
            """
            Layout information for a page element.
            
            Attributes:
                text_anchor (Document.TextAnchor): Text location reference
                confidence (float): Confidence score [0.0, 1.0]
                bounding_poly (BoundingPoly): Bounding box of the element
                orientation (Document.Page.Layout.Orientation): Text orientation
            """
            pass
        
        class Block:
            """
            A block of text on a page.
            
            Attributes:
                layout (Document.Page.Layout): Block layout information
                detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in block
                provenance (Document.Provenance): Processing provenance information
            """
            pass
            
        class Table:
            """
            A table detected on the page.
            
            Attributes:
                layout (Document.Page.Layout): Table layout information
                header_rows (Sequence[Document.Page.Table.TableRow]): Table header rows
                body_rows (Sequence[Document.Page.Table.TableRow]): Table body rows
                detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in table
            """
            
            class TableRow:
                """
                A single row in a table.
                
                Attributes:
                    cells (Sequence[Document.Page.Table.TableCell]): Cells in the row
                """
                pass
                
            class TableCell:
                """
                A single cell in a table.
                
                Attributes:
                    layout (Document.Page.Layout): Cell layout information
                    row_span (int): Number of rows this cell spans
                    col_span (int): Number of columns this cell spans
                    detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in cell
                """
                pass
        
        class FormField:
            """
            A form field (key-value pair) detected on the page.
            
            Attributes:
                field_name (Document.Page.Layout): Layout of the field name/key
                field_value (Document.Page.Layout): Layout of the field value
                name_detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in name
                value_detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in value
                value_type (str): Type of the field value
                corrected_key_text (str): Corrected key text if available
                corrected_value_text (str): Corrected value text if available
            """
            pass
    
    class Entity:
        """
        An entity extracted from the document.
        
        Attributes:
            text_anchor (Document.TextAnchor): Reference to entity text in document
            type_ (str): Entity type (e.g., 'invoice_date', 'total_amount')
            mention_text (str): Text mention of the entity
            mention_id (str): Unique mention identifier
            confidence (float): Confidence score [0.0, 1.0]
            page_anchor (Document.PageAnchor): Page reference for the entity
            id (str): Entity identifier
            normalized_value (Document.Entity.NormalizedValue): Normalized entity value
            properties (Sequence[Document.Entity]): Sub-entities or properties
            provenance (Document.Provenance): Processing provenance
            redacted (bool): Whether entity was redacted
        """
        
        class NormalizedValue:
            """
            Normalized representation of an entity value.
            
            Attributes:
                money_value (google.type.Money): Monetary value
                date_value (google.type.Date): Date value
                datetime_value (google.type.DateTime): DateTime value
                address_value (google.type.PostalAddress): Address value
                boolean_value (bool): Boolean value
                integer_value (int): Integer value
                float_value (float): Float value
                text (str): Text representation
            """
            pass
    
    class TextAnchor:
        """
        Text anchor referencing a segment of text in the document.
        
        Attributes:
            text_segments (Sequence[Document.TextAnchor.TextSegment]): Text segments
            content (str): Text content (if not referencing document.text)
        """
        
        class TextSegment:
            """
            A segment of text.
            
            Attributes:
                start_index (int): Start character index in document text
                end_index (int): End character index in document text
            """
            pass
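
A TextAnchor points into the document text through start and end offsets, so resolving one means slicing `Document.text`. The sketch below illustrates this with hypothetical stand-in dataclasses (`TextSegment`, `TextAnchor`) rather than the library's own classes, so that it runs without the client installed. It assumes `end_index` is exclusive, as in ordinary Python slicing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextSegment:
    """Hypothetical stand-in for Document.TextAnchor.TextSegment."""
    start_index: int = 0
    end_index: int = 0


@dataclass
class TextAnchor:
    """Hypothetical stand-in for Document.TextAnchor."""
    text_segments: List[TextSegment] = field(default_factory=list)


def anchor_text(full_text: str, anchor: TextAnchor) -> str:
    """Join the slices of full_text that the anchor's segments point at."""
    return "".join(
        full_text[seg.start_index:seg.end_index] for seg in anchor.text_segments
    )


# Made-up document text; the two segments cover "12345" and "$99.00"
document_text = "Invoice No: 12345\nTotal: $99.00\n"
anchor = TextAnchor(text_segments=[TextSegment(12, 17), TextSegment(25, 31)])
resolved = anchor_text(document_text, anchor)
print(resolved)
```

An anchor with several segments describes text that is not contiguous in the document, which is why the segments are joined rather than taking a single slice.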

Document I/O Types

RawDocument

from typing import Optional

from google.cloud.documentai.types import RawDocument

class RawDocument:
    """
    Represents a raw document for processing.
    
    Attributes:
        content (bytes): Raw document content
        mime_type (str): MIME type of the document
        display_name (str): Optional display name for the document
    """
    
    def __init__(
        self, 
        content: bytes,
        mime_type: str,
        display_name: Optional[str] = None
    ):
        """
        Initialize a raw document.
        
        Args:
            content: Raw document bytes
            mime_type: Document MIME type (e.g., 'application/pdf')
            display_name: Optional display name
        """
        self.content = content
        self.mime_type = mime_type
        self.display_name = display_name

# Example usage
from pathlib import Path

def create_raw_document_from_file(file_path: str, mime_type: str) -> RawDocument:
    """
    Create RawDocument from a file.
    
    Args:
        file_path: Path to document file
        mime_type: MIME type of the document
        
    Returns:
        RawDocument: Raw document object
    """
    # Read the whole file as bytes; RawDocument carries the content inline
    with open(file_path, "rb") as f:
        content = f.read()
    
    return RawDocument(
        content=content,
        mime_type=mime_type,
        # Path.name handles platform-specific separators, unlike split("/")
        display_name=Path(file_path).name
    )
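
The helper above expects the caller to supply the MIME type. One way to derive it from the file name is the standard library's `mimetypes` module, sketched below. The function name `guess_document_mime_type` and the allow-list `ASSUMED_SUPPORTED_MIME_TYPES` are hypothetical; the allow-list is an assumption for illustration, not the service's authoritative list of accepted formats.

```python
import mimetypes

# Assumed allow-list for this sketch only; check the Document AI
# documentation for the formats your processor actually accepts.
ASSUMED_SUPPORTED_MIME_TYPES = {
    "application/pdf",
    "image/jpeg",
    "image/png",
    "image/tiff",
}


def guess_document_mime_type(file_path: str) -> str:
    """Guess a MIME type from the file extension and validate it."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    if mime_type not in ASSUMED_SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported or unknown MIME type for {file_path!r}")
    return mime_type


print(guess_document_mime_type("invoices/scan.pdf"))
```

Guessing from the extension is only a convenience; a file with a misleading extension would still be sent with the wrong type.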

GcsDocument

from google.cloud.documentai.types import GcsDocument

class GcsDocument:
    """
    Represents a document stored in Google Cloud Storage.
    
    Attributes:
        gcs_uri (str): Cloud Storage URI (gs://bucket/path)
        mime_type (str): MIME type of the document
    """
    
    def __init__(self, gcs_uri: str, mime_type: str):
        """
        Initialize a GCS document reference.
        
        Args:
            gcs_uri: Cloud Storage URI
            mime_type: Document MIME type
        """
        self.gcs_uri = gcs_uri
        self.mime_type = mime_type

# Example usage
def create_gcs_documents_batch(
    gcs_uris: list[str], 
    mime_types: list[str]
) -> list[GcsDocument]:
    """
    Create batch of GCS document references.
    
    Args:
        gcs_uris: List of Cloud Storage URIs
        mime_types: List of corresponding MIME types
        
    Returns:
        list[GcsDocument]: List of GCS document references
    """
    if len(gcs_uris) != len(mime_types):
        raise ValueError("Number of URIs must match number of MIME types")
    
    return [
        GcsDocument(gcs_uri=uri, mime_type=mime_type)
        for uri, mime_type in zip(gcs_uris, mime_types)
    ]

GcsDocuments

from google.cloud.documentai.types import GcsDocuments, GcsDocument

class GcsDocuments:
    """
    Collection of documents stored in Google Cloud Storage.
    
    Attributes:
        documents (Sequence[GcsDocument]): List of GCS documents
    """
    
    def __init__(self, documents: list[GcsDocument]):
        """
        Initialize GCS documents collection.
        
        Args:
            documents: List of GcsDocument objects
        """
        self.documents = documents

# Example usage
from typing import Optional

def create_gcs_documents_from_prefix(
    gcs_prefix: str,
    file_extensions: Optional[list[str]] = None
) -> GcsDocuments:
    """
    Create GcsDocuments from a Cloud Storage prefix.
    
    Args:
        gcs_prefix: Cloud Storage prefix (gs://bucket/path/)
        file_extensions: Optional list of file extensions to include (e.g., ['.pdf'])
        
    Returns:
        GcsDocuments: Collection of GCS documents
    """
    # Listing a bucket requires the Cloud Storage client; this simplified
    # example uses a hard-coded file list instead.
    example_files = [
        f"{gcs_prefix}doc1.pdf",
        f"{gcs_prefix}doc2.pdf",
        f"{gcs_prefix}image1.jpg"
    ]
    
    mime_type_map = {
        '.pdf': 'application/pdf',
        '.jpg': 'image/jpeg',
        '.png': 'image/png',
        '.tiff': 'image/tiff'
    }
    
    # Normalize the optional extension filter for case-insensitive matching
    allowed = {ext.lower() for ext in file_extensions} if file_extensions else None
    
    documents = []
    for file_uri in example_files:
        # Determine MIME type from extension
        for ext, mime_type in mime_type_map.items():
            if file_uri.lower().endswith(ext):
                # Skip files whose extension is excluded by the filter
                if allowed is None or ext in allowed:
                    documents.append(GcsDocument(
                        gcs_uri=file_uri,
                        mime_type=mime_type
                    ))
                break
    
    return GcsDocuments(documents=documents)
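
Listing real bucket contents usually starts from the bucket name and object prefix rather than the full `gs://` URI. The sketch below splits a URI of the form `gs://bucket/path` into those two parts using plain string operations. `split_gcs_uri` is a hypothetical helper, not part of the Document AI or Cloud Storage libraries.

```python
def split_gcs_uri(gcs_uri: str) -> tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object_path)."""
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"Not a Cloud Storage URI: {gcs_uri!r}")
    # Everything up to the first '/' after the scheme is the bucket name
    bucket, _, object_path = gcs_uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in {gcs_uri!r}")
    return bucket, object_path


bucket_name, object_prefix = split_gcs_uri("gs://my-bucket/invoices/2024/")
print(bucket_name, object_prefix)
```

String partitioning is used instead of a URL parser so that characters such as `?` or `#` in object names are kept as part of the path.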

Geometry Types

BoundingPoly

from google.cloud.documentai.types import BoundingPoly, Vertex, NormalizedVertex

class BoundingPoly:
    """
    A bounding polygon for the detected image annotation.
    
    Attributes:
        vertices (Sequence[Vertex]): Vertices of the bounding polygon
        normalized_vertices (Sequence[NormalizedVertex]): Normalized vertices [0.0, 1.0]
    """
    
    def __init__(
        self,
        vertices: list[Vertex] = None,
        normalized_vertices: list[NormalizedVertex] = None
    ):
        """
        Initialize bounding polygon.
        
        Args:
            vertices: List of pixel-coordinate vertices
            normalized_vertices: List of normalized coordinate vertices
        """
        self.vertices = vertices or []
        self.normalized_vertices = normalized_vertices or []

class Vertex:
    """
    A vertex represents a 2D point in the image.
    
    Attributes:
        x (int): X coordinate in pixels
        y (int): Y coordinate in pixels
    """
    
    def __init__(self, x: int, y: int):
        """
        Initialize vertex with pixel coordinates.
        
        Args:
            x: X coordinate
            y: Y coordinate  
        """
        self.x = x
        self.y = y

class NormalizedVertex:
    """
    A vertex represents a 2D point with normalized coordinates.
    
    Attributes:
        x (float): X coordinate [0.0, 1.0]
        y (float): Y coordinate [0.0, 1.0]
    """
    
    def __init__(self, x: float, y: float):
        """
        Initialize normalized vertex.
        
        Args:
            x: Normalized X coordinate [0.0, 1.0]
            y: Normalized Y coordinate [0.0, 1.0]
        """
        self.x = x
        self.y = y

# Utility functions for geometry
def create_bounding_box(
    left: int, 
    top: int, 
    right: int, 
    bottom: int
) -> BoundingPoly:
    """
    Create a rectangular bounding polygon.
    
    Args:
        left: Left edge X coordinate
        top: Top edge Y coordinate  
        right: Right edge X coordinate
        bottom: Bottom edge Y coordinate
        
    Returns:
        BoundingPoly: Rectangular bounding polygon
    """
    vertices = [
        Vertex(x=left, y=top),      # Top-left
        Vertex(x=right, y=top),     # Top-right  
        Vertex(x=right, y=bottom),  # Bottom-right
        Vertex(x=left, y=bottom)    # Bottom-left
    ]
    
    return BoundingPoly(vertices=vertices)

def normalize_bounding_poly(
    bounding_poly: BoundingPoly,
    page_width: int,
    page_height: int
) -> BoundingPoly:
    """
    Convert pixel coordinates to normalized coordinates.
    
    Args:
        bounding_poly: Bounding polygon with pixel coordinates
        page_width: Page width in pixels
        page_height: Page height in pixels
        
    Returns:
        BoundingPoly: Bounding polygon with normalized coordinates
    """
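    # Guard against division by zero and nonsensical negative sizes
    if page_width <= 0 or page_height <= 0:
        raise ValueError("Page width and height must be positive")
    
    normalized_vertices = []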
    
    for vertex in bounding_poly.vertices:
        normalized_x = vertex.x / page_width
        normalized_y = vertex.y / page_height
        normalized_vertices.append(
            NormalizedVertex(x=normalized_x, y=normalized_y)
        )
    
    return BoundingPoly(normalized_vertices=normalized_vertices)
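
The reverse conversion, from normalized coordinates back to pixels, is what you typically need when cropping or drawing on a page image. A minimal sketch follows, assuming the page size in pixels is known. `NormalizedPoint` is a hypothetical stand-in for `NormalizedVertex`, and `to_pixel_box` is an illustrative helper, not a library function.

```python
from typing import List, NamedTuple, Tuple


class NormalizedPoint(NamedTuple):
    """Hypothetical stand-in for NormalizedVertex (x and y in [0.0, 1.0])."""
    x: float
    y: float


def to_pixel_box(
    points: List[NormalizedPoint], page_width: int, page_height: int
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) in pixels enclosing all points."""
    if not points:
        raise ValueError("At least one vertex is required")
    xs = [p.x * page_width for p in points]
    ys = [p.y * page_height for p in points]
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))


# Made-up polygon on a 1000 x 2000 pixel page
polygon = [
    NormalizedPoint(0.25, 0.25),
    NormalizedPoint(0.75, 0.25),
    NormalizedPoint(0.75, 0.5),
    NormalizedPoint(0.25, 0.5),
]
pixel_box = to_pixel_box(polygon, page_width=1000, page_height=2000)
print(pixel_box)
```

Taking the minimum and maximum of each axis yields an axis-aligned box even when the polygon itself is rotated or skewed.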

Processor and Processor Type Definitions

Processor

from enum import Enum

from google.cloud.documentai.types import Processor
from google.protobuf.timestamp_pb2 import Timestamp

class Processor:
    """
    The first-class citizen for Document AI.
    
    Attributes:
        name (str): Output only. Immutable. The resource name of the processor
        type_ (str): The processor type, e.g., OCR_PROCESSOR, INVOICE_PROCESSOR
        display_name (str): The display name of the processor
        state (Processor.State): Output only. The state of the processor
        default_processor_version (str): The default processor version
        processor_version_aliases (Sequence[ProcessorVersionAlias]): Version aliases
        process_endpoint (str): Output only. Immutable. The http endpoint for this processor
        create_time (Timestamp): Output only. The time the processor was created
        kms_key_name (str): The KMS key used to encrypt the processor
        satisfies_pzs (bool): Output only. Reserved for future use
        satisfies_pzi (bool): Output only. Reserved for future use
    """
    
    class State(Enum):
        """
        The possible states of the processor.
        
        Values:
            STATE_UNSPECIFIED: The processor state is unspecified
            ENABLED: The processor is enabled, i.e., has an enabled version
            DISABLED: The processor is disabled
            ENABLING: The processor is being enabled and will become ENABLED if successful
            DISABLING: The processor is being disabled
            CREATING: The processor is being created
            FAILED: The processor failed during creation or initialization of feature dependencies
            DELETING: The processor is being deleted
        """
        STATE_UNSPECIFIED = 0
        ENABLED = 1
        DISABLED = 2
        ENABLING = 3
        DISABLING = 4
        CREATING = 5
        FAILED = 6
        DELETING = 7

def get_processor_state_description(state: "Processor.State") -> str:
    """
    Get human-readable description of processor state.
    
    Args:
        state: Processor state enum value
        
    Returns:
        str: Description of the state
    """
    descriptions = {
        Processor.State.ENABLED: "Ready for processing documents",
        Processor.State.DISABLED: "Not available for processing",
        Processor.State.ENABLING: "Currently being enabled",
        Processor.State.DISABLING: "Currently being disabled", 
        Processor.State.CREATING: "Being created",
        Processor.State.FAILED: "Failed during creation or initialization",
        Processor.State.DELETING: "Being permanently deleted"
    }
    
    return descriptions.get(state, "Unknown state")
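
The `name` attribute of a processor is a resource name of the form `projects/{project}/locations/{location}/processors/{processor}`. The sketch below pulls the individual components out of such a name with a regular expression. `parse_processor_name` is a hypothetical helper written for illustration, and the project and processor IDs in the example are made up.

```python
import re

# Pattern for the processor resource name format described above
_PROCESSOR_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/processors/(?P<processor>[^/]+)$"
)


def parse_processor_name(name: str) -> dict:
    """Split a processor resource name into project, location, and processor."""
    match = _PROCESSOR_NAME.match(name)
    if match is None:
        raise ValueError(f"Not a processor resource name: {name!r}")
    return match.groupdict()


parts = parse_processor_name("projects/my-project/locations/us/processors/abc123")
print(parts)
```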

ProcessorType

from google.cloud.documentai.types import ProcessorType

class ProcessorType:
    """
    A processor type is responsible for performing a certain document understanding task on a certain type of document.
    
    Attributes:
        name (str): The resource name of the processor type
        type_ (str): The processor type, e.g., OCR_PROCESSOR, INVOICE_PROCESSOR
        category (str): The processor category
        available_locations (Sequence[LocationInfo]): The locations where this processor is available
        allow_creation (bool): Whether the processor type allows creation of new processor instances
        launch_stage (google.api.LaunchStage): Launch stage of the processor type
        sample_document_uris (Sequence[str]): Sample documents for this processor type
    """
    
    class LocationInfo:
        """
        Information about the availability of a processor type in a location.
        
        Attributes:
            location_id (str): The location ID (e.g., 'us', 'eu')
        """
        pass

# Common processor types
PROCESSOR_TYPES = {
    # General processors
    "OCR_PROCESSOR": {
        "display_name": "Document OCR", 
        "description": "Extracts text from documents and images"
    },
    "FORM_PARSER_PROCESSOR": {
        "display_name": "Form Parser",
        "description": "Extracts key-value pairs from forms"
    },
    
    # Specialized processors  
    "INVOICE_PROCESSOR": {
        "display_name": "Invoice Parser",
        "description": "Extracts structured data from invoices"
    },
    "RECEIPT_PROCESSOR": {
        "display_name": "Receipt Parser", 
        "description": "Extracts data from receipts"
    },
    "IDENTITY_DOCUMENT_PROCESSOR": {
        "display_name": "Identity Document Parser",
        "description": "Extracts data from identity documents"
    },
    "CONTRACT_PROCESSOR": {
        "display_name": "Contract Parser",
        "description": "Extracts key information from contracts"
    },
    "EXPENSE_PROCESSOR": {
        "display_name": "Expense Parser",
        "description": "Extracts data from expense documents"
    },
    
    # Custom processors
    "CUSTOM_EXTRACTION_PROCESSOR": {
        "display_name": "Custom Extraction Processor",
        "description": "Custom trained processor for specific document types"
    },
    "CUSTOM_CLASSIFICATION_PROCESSOR": {
        "display_name": "Custom Classification Processor", 
        "description": "Custom trained processor for document classification"
    }
}

def get_processor_type_info(processor_type: str) -> dict:
    """
    Get information about a processor type.
    
    Args:
        processor_type: Processor type identifier
        
    Returns:
        dict: Processor type information
    """
    return PROCESSOR_TYPES.get(processor_type, {
        "display_name": processor_type,
        "description": "Unknown processor type"
    })

Document Schema

DocumentSchema

from enum import Enum

from google.cloud.documentai.types import DocumentSchema

class DocumentSchema:
    """
    The schema defines the output of the processed document by a processor.
    
    Attributes:
        display_name (str): Display name to show to users
        description (str): Description of the schema
        entity_types (Sequence[DocumentSchema.EntityType]): Entity types that this schema produces
        metadata (DocumentSchema.Metadata): Metadata about the schema
    """
    
    class EntityType:
        """
        EntityType is the wrapper of a label of the corresponding model with detailed attributes and limitations for entity-based processors.
        
        Attributes:
            enum_values (DocumentSchema.EntityType.EnumValues): If specified, lists all the possible values for this entity
            display_name (str): User defined name for the type
            name (str): Name of the type
            base_types (Sequence[str]): The entity type that this type is derived from
            properties (Sequence[DocumentSchema.EntityType.Property]): Describes the nested structure, or composition, of an entity
        """
        
        class Property:
            """
            Defines properties that can be part of the entity type.
            
            Attributes:
                name (str): The name of the property
                display_name (str): User defined name for the property
                value_type (str): A reference to the value type of the property
                occurrence_type (DocumentSchema.EntityType.Property.OccurrenceType): Occurrence type limits the number of instances an entity type appears in the document
            """
            
            class OccurrenceType(Enum):
                """
                Types of occurrences of the entity type in the document.
                
                Values:
                    OCCURRENCE_TYPE_UNSPECIFIED: Unspecified occurrence type
                    OPTIONAL_ONCE: There will be zero or one instance of this entity type
                    OPTIONAL_MULTIPLE: The entity type can have zero or multiple instances
                    REQUIRED_ONCE: The entity type will have exactly one instance
                    REQUIRED_MULTIPLE: The entity type will have one or more instances
                """
                OCCURRENCE_TYPE_UNSPECIFIED = 0
                OPTIONAL_ONCE = 1
                OPTIONAL_MULTIPLE = 2  
                REQUIRED_ONCE = 3
                REQUIRED_MULTIPLE = 4

def create_invoice_schema() -> DocumentSchema:
    """
    Create a document schema for invoice processing.
    
    Returns:
        DocumentSchema: Schema for invoice documents
    """
    # Define entity types for invoice
    entity_types = [
        DocumentSchema.EntityType(
            name="invoice_date",
            display_name="Invoice Date",
            properties=[
                DocumentSchema.EntityType.Property(
                    name="date_value",
                    display_name="Date Value",
                    value_type="date",
                    occurrence_type=DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_ONCE
                )
            ]
        ),
        DocumentSchema.EntityType(
            name="invoice_number", 
            display_name="Invoice Number",
            properties=[
                DocumentSchema.EntityType.Property(
                    name="text_value",
                    display_name="Text Value", 
                    value_type="text",
                    occurrence_type=DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_ONCE
                )
            ]
        ),
        DocumentSchema.EntityType(
            name="total_amount",
            display_name="Total Amount",
            properties=[
                DocumentSchema.EntityType.Property(
                    name="money_value",
                    display_name="Money Value",
                    value_type="money", 
                    occurrence_type=DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_ONCE
                )
            ]
        )
    ]
    
    return DocumentSchema(
        display_name="Invoice Processing Schema",
        description="Schema for extracting key information from invoices",
        entity_types=entity_types
    )
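
Occurrence types can also be read as constraints on how many instances of each entity type should appear in a processed document. The sketch below checks a list of extracted entity type names against such constraints. It is an illustration only: the `OccurrenceType` enum here is a local stand-in mirroring the values listed above, and `check_occurrences` is a hypothetical helper, not something the library provides.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable, List


class OccurrenceType(Enum):
    """Local stand-in mirroring Property.OccurrenceType."""
    OCCURRENCE_TYPE_UNSPECIFIED = 0
    OPTIONAL_ONCE = 1
    OPTIONAL_MULTIPLE = 2
    REQUIRED_ONCE = 3
    REQUIRED_MULTIPLE = 4


# (minimum, maximum) instance counts; None means no upper bound
_BOUNDS = {
    OccurrenceType.OPTIONAL_ONCE: (0, 1),
    OccurrenceType.OPTIONAL_MULTIPLE: (0, None),
    OccurrenceType.REQUIRED_ONCE: (1, 1),
    OccurrenceType.REQUIRED_MULTIPLE: (1, None),
}


def check_occurrences(
    expected: Dict[str, OccurrenceType], found_types: Iterable[str]
) -> List[str]:
    """Return a message for every entity type whose count violates its bounds."""
    counts = Counter(found_types)
    problems = []
    for name, occurrence in expected.items():
        bounds = _BOUNDS.get(occurrence)
        if bounds is None:
            continue  # Unspecified occurrence type: nothing to enforce
        minimum, maximum = bounds
        count = counts[name]
        if count < minimum:
            problems.append(f"{name}: expected at least {minimum}, found {count}")
        elif maximum is not None and count > maximum:
            problems.append(f"{name}: expected at most {maximum}, found {count}")
    return problems


expected = {
    "invoice_date": OccurrenceType.REQUIRED_ONCE,
    "line_item": OccurrenceType.OPTIONAL_MULTIPLE,
    "total_amount": OccurrenceType.REQUIRED_ONCE,
}
# Made-up extraction result: a duplicated date and no total
problems = check_occurrences(expected, ["invoice_date", "invoice_date", "line_item"])
print(problems)
```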

Barcode Types

Barcode

from google.cloud.documentai.types import Barcode

class Barcode:
    """
    Encodes the detailed information of a barcode.
    
    Attributes:
        format_ (str): Format of the barcode (e.g., CODE_128, QR_CODE)
        value_format (str): Format of the barcode value (e.g., CONTACT_INFO, URL)
        raw_value (str): Raw value encoded in the barcode
    """
    
    # Common barcode formats
    FORMATS = {
        "CODE_128": "Code 128 linear barcode",
        "CODE_39": "Code 39 linear barcode", 
        "CODE_93": "Code 93 linear barcode",
        "CODABAR": "Codabar linear barcode",
        "DATA_MATRIX": "Data Matrix 2D barcode",
        "EAN_13": "EAN-13 linear barcode",
        "EAN_8": "EAN-8 linear barcode", 
        "ITF": "ITF (Interleaved 2 of 5) linear barcode",
        "QR_CODE": "QR Code 2D barcode",
        "UPC_A": "UPC-A linear barcode",
        "UPC_E": "UPC-E linear barcode",
        "PDF417": "PDF417 2D barcode",
        "AZTEC": "Aztec 2D barcode"
    }

def extract_barcodes_from_document(document: "Document") -> list[dict]:
    """
    Extract all barcodes from a processed document.
    
    Args:
        document: Processed Document object
        
    Returns:
        list[dict]: List of barcode information
    """
    barcodes = []
    
    for page_idx, page in enumerate(document.pages):
        for barcode_detection in page.detected_barcodes:
            barcode_info = {
                "page": page_idx + 1,
                "format": barcode_detection.barcode.format_,
                "value_format": barcode_detection.barcode.value_format,
                "raw_value": barcode_detection.barcode.raw_value,
                "layout": barcode_detection.layout
            }
            barcodes.append(barcode_info)
    
    return barcodes
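
Once barcodes are flattened into dictionaries like this, grouping them by format is a common next step. The sketch below works on plain dictionaries with the same `format` and `raw_value` keys as the function above. `group_barcodes_by_format` is a hypothetical helper and the sample values are made up.

```python
from collections import defaultdict
from typing import Dict, List


def group_barcodes_by_format(barcodes: List[dict]) -> Dict[str, List[str]]:
    """Map each barcode format to the raw values found in that format."""
    grouped = defaultdict(list)
    for barcode in barcodes:
        grouped[barcode["format"]].append(barcode["raw_value"])
    return dict(grouped)


# Made-up sample data in the shape of the barcode dictionaries
sample_barcodes = [
    {"page": 1, "format": "QR_CODE", "value_format": "URL", "raw_value": "https://example.com"},
    {"page": 1, "format": "CODE_128", "value_format": "TEXT", "raw_value": "ABC-123"},
    {"page": 2, "format": "QR_CODE", "value_format": "TEXT", "raw_value": "hello"},
]
grouped_barcodes = group_barcodes_by_format(sample_barcodes)
print(grouped_barcodes)
```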

Complete Document Analysis Example

from google.cloud.documentai.types import Document
from typing import Dict, List, Any

def comprehensive_document_analysis(document: Document) -> Dict[str, Any]:
    """
    Perform comprehensive analysis of a processed document.
    
    Args:
        document: Processed Document object
        
    Returns:
        Dict[str, Any]: Complete document analysis results
    """
    analysis = {
        "document_info": {
            "mime_type": document.mime_type,
            "text_length": len(document.text),
            "page_count": len(document.pages),
            "entity_count": len(document.entities),
            "has_tables": False,
            "has_form_fields": False,
            "has_barcodes": False
        },
        "pages": [],
        "entities": {},
        "tables": [],
        "form_fields": {},
        "barcodes": [],
        "text_styles": []
    }
    
    # Analyze pages
    for page_idx, page in enumerate(document.pages):
        page_info = {
            "page_number": page_idx + 1,
            "dimensions": {
                "width": page.dimension.width,
                "height": page.dimension.height,
                "unit": page.dimension.unit
            },
            "elements": {
                "blocks": len(page.blocks),
                "paragraphs": len(page.paragraphs),
                "lines": len(page.lines),
                "tokens": len(page.tokens)
            },
            "tables": len(page.tables),
            "form_fields": len(page.form_fields),
            "barcodes": len(page.detected_barcodes),
            "languages": [lang.language_code for lang in page.detected_languages]
        }
        
        analysis["pages"].append(page_info)
        
        # Update document-level flags
        if page.tables:
            analysis["document_info"]["has_tables"] = True
        if page.form_fields:
            analysis["document_info"]["has_form_fields"] = True  
        if page.detected_barcodes:
            analysis["document_info"]["has_barcodes"] = True
    
    # Analyze entities by type
    for entity in document.entities:
        entity_type = entity.type_
        if entity_type not in analysis["entities"]:
            analysis["entities"][entity_type] = []
        
        entity_info = {
            "text": entity.mention_text,
            "confidence": entity.confidence,
            "normalized_value": None
        }
        
        # Extract normalized value if available
        if entity.normalized_value:
            normalized = entity.normalized_value
            # Membership tests check which field is actually set, rather than
            # relying on the truthiness of nested message objects
            if "money_value" in normalized:
                money = normalized.money_value
                entity_info["normalized_value"] = {
                    "type": "money",
                    "currency": money.currency_code,
                    # units holds whole currency units, nanos the fractional part
                    "amount": money.units + money.nanos / 1e9
                }
            elif "date_value" in normalized:
                entity_info["normalized_value"] = {
                    "type": "date", 
                    "year": normalized.date_value.year,
                    "month": normalized.date_value.month,
                    "day": normalized.date_value.day
                }
            elif normalized.text:
                entity_info["normalized_value"] = {
                    "type": "text",
                    "value": normalized.text
                }
        
        analysis["entities"][entity_type].append(entity_info)
    
    # Extract tables
    for page_idx, page in enumerate(document.pages):
        for table_idx, table in enumerate(page.tables):
            table_data = {
                "page": page_idx + 1,
                "table_index": table_idx,
                "header_rows": len(table.header_rows),
                "body_rows": len(table.body_rows),
                "total_rows": len(table.header_rows) + len(table.body_rows)
            }
            analysis["tables"].append(table_data)
    
    # Extract form fields
    for page in document.pages:
        for form_field in page.form_fields:
            if form_field.field_name and form_field.field_name.text_anchor:
                field_name = extract_text_from_anchor(
                    document.text, form_field.field_name.text_anchor
                ).strip()
                
                field_value = ""
                if form_field.field_value and form_field.field_value.text_anchor:
                    field_value = extract_text_from_anchor(
                        document.text, form_field.field_value.text_anchor
                    ).strip()
                
                analysis["form_fields"][field_name] = {
                    "value": field_value,
                    "name_confidence": form_field.field_name.confidence,
                    "value_confidence": form_field.field_value.confidence if form_field.field_value else 0.0
                }
    
    # Extract barcodes
    analysis["barcodes"] = extract_barcodes_from_document(document)
    
    return analysis

def extract_text_from_anchor(full_text: str, text_anchor: "Document.TextAnchor") -> str:
    """Extract text using TextAnchor reference."""
    text_segments = []
    for segment in text_anchor.text_segments:
        # Unset integer fields read as 0, so an unset start_index means the
        # start of the text and a segment with no end_index is empty
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        text_segments.append(full_text[start_index:end_index])
    return "".join(text_segments)

def print_analysis_summary(analysis: Dict[str, Any]) -> None:
    """Print a summary of the document analysis."""
    info = analysis["document_info"]
    
    print("=== DOCUMENT ANALYSIS SUMMARY ===")
    print(f"MIME Type: {info['mime_type']}")
    print(f"Text Length: {info['text_length']:,} characters")
    print(f"Pages: {info['page_count']}")
    print(f"Entities: {info['entity_count']}")
    print(f"Has Tables: {'Yes' if info['has_tables'] else 'No'}")
    print(f"Has Form Fields: {'Yes' if info['has_form_fields'] else 'No'}")
    print(f"Has Barcodes: {'Yes' if info['has_barcodes'] else 'No'}")
    
    print("\n=== ENTITY TYPES ===")
    for entity_type, entities in analysis["entities"].items():
        print(f"{entity_type}: {len(entities)} instances")
    
    if analysis["tables"]:
        print("\n=== TABLES ===")
        for table in analysis["tables"]:
            print(f"Page {table['page']}: {table['total_rows']} rows")
    
    if analysis["form_fields"]:
        print("\n=== FORM FIELDS ===")
        for field_name, field_info in list(analysis["form_fields"].items())[:5]:
            print(f"{field_name}: {field_info['value']}")
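
Normalized money values follow `google.type.Money`, which splits an amount into whole `units` and fractional `nanos` (billionths of a unit). When exact arithmetic matters, combining the two into a `Decimal` avoids floating-point rounding. A minimal sketch follows; `money_to_decimal` is a hypothetical helper that takes the two integers directly so it runs without the client library.

```python
from decimal import Decimal

NANOS_PER_UNIT = Decimal(10**9)


def money_to_decimal(units: int, nanos: int) -> Decimal:
    """Combine whole units and nano-units into one exact decimal amount."""
    return Decimal(units) + Decimal(nanos) / NANOS_PER_UNIT


# 12 units and 750,000,000 nanos represent 12.75
amount = money_to_decimal(12, 750_000_000)
print(amount)
```

For negative amounts both parts carry the sign, so `money_to_decimal(-1, -750_000_000)` represents -1.75.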

This guide covers the core document types, structures, and schemas in Google Cloud Document AI, with type definitions and examples for working with processed documents.

Install with Tessl CLI

npx tessl i tessl/pypi-google-cloud-documentai
