# Document Types and Schemas

This guide covers the type system and document structures in Google Cloud Document AI: document representation, entity types, geometry, processor definitions, barcodes, and schema definitions.

## Core Document Structure

### Document Type

The `Document` type represents a processed document together with its extracted text, layout, and entities:

```python { .api }
from google.cloud.documentai.types import Document

class Document:
    """
    Represents a processed document with extracted text, layout, and entities.

    Attributes:
        text (str): UTF-8 encoded text extracted from the document
        pages (Sequence[Document.Page]): List of document pages
        entities (Sequence[Document.Entity]): Extracted entities
        text_styles (Sequence[Document.Style]): Text styling information
        shard_info (Document.ShardInfo): Information about the shard, if the document is sharded
        error (google.rpc.Status): Processing error information if any
        mime_type (str): Original MIME type of the document
        uri (str): Optional URI where the document was retrieved from
    """

    class Page:
        """
        Represents a single page in the document.

        Attributes:
            page_number (int): 1-based page number
            dimension (Document.Page.Dimension): Page dimensions
            layout (Document.Page.Layout): Page layout information
            detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages detected on page
            blocks (Sequence[Document.Page.Block]): Text blocks on the page
            paragraphs (Sequence[Document.Page.Paragraph]): Paragraphs on the page
            lines (Sequence[Document.Page.Line]): Text lines on the page
            tokens (Sequence[Document.Page.Token]): Individual tokens on the page
            visual_elements (Sequence[Document.Page.VisualElement]): Visual elements like images
            tables (Sequence[Document.Page.Table]): Tables detected on the page
            form_fields (Sequence[Document.Page.FormField]): Form fields detected on the page
            symbols (Sequence[Document.Page.Symbol]): Symbols detected on the page
            detected_barcodes (Sequence[Document.Page.DetectedBarcode]): Barcodes on the page
        """

        class Dimension:
            """
            Physical dimension of the page.

            Attributes:
                width (float): Page width in specified unit
                height (float): Page height in specified unit
                unit (str): Unit of measurement ('INCH', 'CM', 'POINT')
            """
            pass

        class Layout:
            """
            Layout information for a page element.

            Attributes:
                text_anchor (Document.TextAnchor): Text location reference
                confidence (float): Confidence score [0.0, 1.0]
                bounding_poly (BoundingPoly): Bounding box of the element
                orientation (Document.Page.Layout.Orientation): Text orientation
            """
            pass

        class Block:
            """
            A block of text on a page.

            Attributes:
                layout (Document.Page.Layout): Block layout information
                detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in block
                provenance (Document.Provenance): Processing provenance information
            """
            pass

        class Table:
            """
            A table detected on the page.

            Attributes:
                layout (Document.Page.Layout): Table layout information
                header_rows (Sequence[Document.Page.Table.TableRow]): Table header rows
                body_rows (Sequence[Document.Page.Table.TableRow]): Table body rows
                detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in table
            """

            class TableRow:
                """
                A single row in a table.

                Attributes:
                    cells (Sequence[Document.Page.Table.TableCell]): Cells in the row
                """
                pass

            class TableCell:
                """
                A single cell in a table.

                Attributes:
                    layout (Document.Page.Layout): Cell layout information
                    row_span (int): Number of rows this cell spans
                    col_span (int): Number of columns this cell spans
                    detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in cell
                """
                pass

        class FormField:
            """
            A form field (key-value pair) detected on the page.

            Attributes:
                field_name (Document.Page.Layout): Layout of the field name/key
                field_value (Document.Page.Layout): Layout of the field value
                name_detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in name
                value_detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in value
                value_type (str): Type of the field value
                corrected_key_text (str): Corrected key text if available
                corrected_value_text (str): Corrected value text if available
            """
            pass

    class Entity:
        """
        An entity extracted from the document.

        Attributes:
            text_anchor (Document.TextAnchor): Reference to entity text in document
            type_ (str): Entity type (e.g., 'invoice_date', 'total_amount')
            mention_text (str): Text mention of the entity
            mention_id (str): Unique mention identifier
            confidence (float): Confidence score [0.0, 1.0]
            page_anchor (Document.PageAnchor): Page reference for the entity
            id (str): Entity identifier
            normalized_value (Document.Entity.NormalizedValue): Normalized entity value
            properties (Sequence[Document.Entity]): Sub-entities or properties
            provenance (Document.Provenance): Processing provenance
            redacted (bool): Whether entity was redacted
        """

        class NormalizedValue:
            """
            Normalized representation of an entity value.

            Attributes:
                money_value (google.type.Money): Monetary value
                date_value (google.type.Date): Date value
                datetime_value (google.type.DateTime): DateTime value
                address_value (google.type.PostalAddress): Address value
                boolean_value (bool): Boolean value
                integer_value (int): Integer value
                float_value (float): Float value
                text (str): Text representation
            """
            pass

    class TextAnchor:
        """
        Text anchor referencing a segment of text in the document.

        Attributes:
            text_segments (Sequence[Document.TextAnchor.TextSegment]): Text segments
            content (str): Text content (if not referencing document.text)
        """

        class TextSegment:
            """
            A segment of text.

            Attributes:
                start_index (int): Start character index in document text
                end_index (int): End character index in document text
            """
            pass
```
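
Layout elements such as table cells carry no text of their own: each `layout.text_anchor` points at slices of `Document.text`. The sketch below shows one way to turn a table into rows of strings under that model. It uses `SimpleNamespace` stand-ins rather than real library objects so that it runs without credentials; `anchor_text`, `table_to_rows`, and `_cell` are hypothetical helper names, not part of the client library.

```python
from types import SimpleNamespace as NS


def anchor_text(document_text, text_anchor):
    """Concatenate the slices of document_text that the anchor points at."""
    return "".join(
        document_text[int(seg.start_index):int(seg.end_index)]
        for seg in text_anchor.text_segments
    )


def table_to_rows(document_text, table):
    """Return header rows followed by body rows as lists of cell strings."""
    rows = []
    for row in list(table.header_rows) + list(table.body_rows):
        rows.append(
            [anchor_text(document_text, cell.layout.text_anchor).strip()
             for cell in row.cells]
        )
    return rows


def _cell(start, end):
    # Hypothetical stand-in for Document.Page.Table.TableCell
    segment = NS(start_index=start, end_index=end)
    return NS(layout=NS(text_anchor=NS(text_segments=[segment])))


# Made-up document text and a one-header, one-body-row table over it
text = "Item Qty\nPen 2\n"
table = NS(
    header_rows=[NS(cells=[_cell(0, 5), _cell(5, 9)])],
    body_rows=[NS(cells=[_cell(9, 13), _cell(13, 15)])],
)
rows = table_to_rows(text, table)
print(rows)
```

With a real response, the same functions should accept `document.text` and an element of `page.tables`, assuming the attribute names listed above.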

### Document I/O Types

#### RawDocument

```python { .api }
import os
from typing import Optional

from google.cloud.documentai.types import RawDocument

class RawDocument:
    """
    Represents a raw document for processing.

    Attributes:
        content (bytes): Raw document content
        mime_type (str): MIME type of the document
        display_name (str): Optional display name for the document
    """

    def __init__(
        self,
        content: bytes,
        mime_type: str,
        display_name: Optional[str] = None
    ):
        """
        Initialize a raw document.

        Args:
            content: Raw document bytes
            mime_type: Document MIME type (e.g., 'application/pdf')
            display_name: Optional display name
        """
        self.content = content
        self.mime_type = mime_type
        self.display_name = display_name

# Example usage
def create_raw_document_from_file(file_path: str, mime_type: str) -> RawDocument:
    """
    Create RawDocument from a file.

    Args:
        file_path: Path to document file
        mime_type: MIME type of the document

    Returns:
        RawDocument: Raw document object
    """
    with open(file_path, "rb") as f:
        content = f.read()

    return RawDocument(
        content=content,
        mime_type=mime_type,
        # basename handles the platform's path separator
        display_name=os.path.basename(file_path)
    )
```
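
The example above expects the caller to supply the MIME type. As a minimal sketch, the fields of a `RawDocument` can instead be derived from the file itself. `raw_document_fields` and `SUFFIX_TO_MIME` are hypothetical names, and the suffix table is an assumed subset, not the authoritative list of formats a processor accepts.

```python
import tempfile
from pathlib import Path

# Assumed subset of supported formats; check the processor documentation
# for the formats your processor actually accepts.
SUFFIX_TO_MIME = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def raw_document_fields(file_path):
    """Build the content, mime_type and display_name fields from a local file."""
    path = Path(file_path)
    suffix = path.suffix.lower()
    if suffix not in SUFFIX_TO_MIME:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return {
        "content": path.read_bytes(),
        "mime_type": SUFFIX_TO_MIME[suffix],
        "display_name": path.name,
    }


# Demonstrate with a throwaway file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Invoice.PDF"
    sample.write_bytes(b"%PDF-1.4 sample")
    fields = raw_document_fields(sample)

print(fields["mime_type"], fields["display_name"])
```

The resulting dictionary is shaped so that it could be passed on as `RawDocument(**fields)`.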

#### GcsDocument

```python { .api }
from google.cloud.documentai.types import GcsDocument

class GcsDocument:
    """
    Represents a document stored in Google Cloud Storage.

    Attributes:
        gcs_uri (str): Cloud Storage URI (gs://bucket/path)
        mime_type (str): MIME type of the document
    """

    def __init__(self, gcs_uri: str, mime_type: str):
        """
        Initialize a GCS document reference.

        Args:
            gcs_uri: Cloud Storage URI
            mime_type: Document MIME type
        """
        self.gcs_uri = gcs_uri
        self.mime_type = mime_type

# Example usage
def create_gcs_documents_batch(
    gcs_uris: list[str],
    mime_types: list[str]
) -> list[GcsDocument]:
    """
    Create batch of GCS document references.

    Args:
        gcs_uris: List of Cloud Storage URIs
        mime_types: List of corresponding MIME types

    Returns:
        list[GcsDocument]: List of GCS document references
    """
    if len(gcs_uris) != len(mime_types):
        raise ValueError("Number of URIs must match number of MIME types")

    return [
        GcsDocument(gcs_uri=uri, mime_type=mime_type)
        for uri, mime_type in zip(gcs_uris, mime_types)
    ]
```
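
Since `gcs_uri` must follow the `gs://bucket/path` form, it can help to validate URIs before building `GcsDocument` objects. A minimal sketch, where `parse_gcs_uri` is a hypothetical helper and the bucket and object names are made up:

```python
def parse_gcs_uri(gcs_uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Not a Cloud Storage URI: {gcs_uri!r}")
    # Everything up to the first slash is the bucket, the rest is the object
    bucket, _, object_name = gcs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"URI must name a bucket and an object: {gcs_uri!r}")
    return bucket, object_name


bucket, object_name = parse_gcs_uri("gs://my-bucket/invoices/doc1.pdf")
print(bucket, object_name)
```

Rejecting malformed URIs locally gives a clearer error than waiting for the batch request to fail.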

#### GcsDocuments

```python { .api }
from typing import Optional

from google.cloud.documentai.types import GcsDocuments, GcsDocument

class GcsDocuments:
    """
    Collection of documents stored in Google Cloud Storage.

    Attributes:
        documents (Sequence[GcsDocument]): List of GCS documents
    """

    def __init__(self, documents: list[GcsDocument]):
        """
        Initialize GCS documents collection.

        Args:
            documents: List of GcsDocument objects
        """
        self.documents = documents

# Example usage
def create_gcs_documents_from_prefix(
    gcs_prefix: str,
    file_extensions: Optional[list[str]] = None
) -> GcsDocuments:
    """
    Create GcsDocuments from a Cloud Storage prefix.

    Args:
        gcs_prefix: Cloud Storage prefix (gs://bucket/path/)
        file_extensions: Optional list of file extensions to include (e.g., ['.pdf'])

    Returns:
        GcsDocuments: Collection of GCS documents
    """
    # Listing a real bucket requires the Cloud Storage client.
    # This simplified example assumes the file names are already known.
    documents = []

    # Example files (in practice, you'd list the bucket contents)
    example_files = [
        f"{gcs_prefix}doc1.pdf",
        f"{gcs_prefix}doc2.pdf",
        f"{gcs_prefix}image1.jpg"
    ]

    mime_type_map = {
        '.pdf': 'application/pdf',
        '.jpg': 'image/jpeg',
        '.png': 'image/png',
        '.tiff': 'image/tiff'
    }

    for file_uri in example_files:
        # Determine MIME type from extension
        for ext, mime_type in mime_type_map.items():
            if not file_uri.lower().endswith(ext):
                continue
            # Honor the optional extension filter
            if file_extensions is None or ext in file_extensions:
                documents.append(GcsDocument(
                    gcs_uri=file_uri,
                    mime_type=mime_type
                ))
            break

    return GcsDocuments(documents=documents)
```

## Geometry Types

### BoundingPoly

```python { .api }
from typing import Optional

from google.cloud.documentai.types import BoundingPoly, Vertex, NormalizedVertex

class BoundingPoly:
    """
    A bounding polygon for the detected image annotation.

    Attributes:
        vertices (Sequence[Vertex]): Vertices of the bounding polygon
        normalized_vertices (Sequence[NormalizedVertex]): Normalized vertices [0.0, 1.0]
    """

    def __init__(
        self,
        vertices: Optional[list[Vertex]] = None,
        normalized_vertices: Optional[list[NormalizedVertex]] = None
    ):
        """
        Initialize bounding polygon.

        Args:
            vertices: List of pixel-coordinate vertices
            normalized_vertices: List of normalized coordinate vertices
        """
        self.vertices = vertices or []
        self.normalized_vertices = normalized_vertices or []

class Vertex:
    """
    A vertex represents a 2D point in the image.

    Attributes:
        x (int): X coordinate in pixels
        y (int): Y coordinate in pixels
    """

    def __init__(self, x: int, y: int):
        """
        Initialize vertex with pixel coordinates.

        Args:
            x: X coordinate
            y: Y coordinate
        """
        self.x = x
        self.y = y

class NormalizedVertex:
    """
    A vertex represents a 2D point with normalized coordinates.

    Attributes:
        x (float): X coordinate [0.0, 1.0]
        y (float): Y coordinate [0.0, 1.0]
    """

    def __init__(self, x: float, y: float):
        """
        Initialize normalized vertex.

        Args:
            x: Normalized X coordinate [0.0, 1.0]
            y: Normalized Y coordinate [0.0, 1.0]
        """
        self.x = x
        self.y = y

# Utility functions for geometry
def create_bounding_box(
    left: int,
    top: int,
    right: int,
    bottom: int
) -> BoundingPoly:
    """
    Create a rectangular bounding polygon.

    Args:
        left: Left edge X coordinate
        top: Top edge Y coordinate
        right: Right edge X coordinate
        bottom: Bottom edge Y coordinate

    Returns:
        BoundingPoly: Rectangular bounding polygon
    """
    vertices = [
        Vertex(x=left, y=top),      # Top-left
        Vertex(x=right, y=top),     # Top-right
        Vertex(x=right, y=bottom),  # Bottom-right
        Vertex(x=left, y=bottom)    # Bottom-left
    ]

    return BoundingPoly(vertices=vertices)

def normalize_bounding_poly(
    bounding_poly: BoundingPoly,
    page_width: int,
    page_height: int
) -> BoundingPoly:
    """
    Convert pixel coordinates to normalized coordinates.

    Args:
        bounding_poly: Bounding polygon with pixel coordinates
        page_width: Page width in pixels
        page_height: Page height in pixels

    Returns:
        BoundingPoly: Bounding polygon with normalized coordinates
    """
    # Guard against division by zero on malformed page dimensions
    if page_width <= 0 or page_height <= 0:
        raise ValueError("Page width and height must be positive")

    normalized_vertices = []

    for vertex in bounding_poly.vertices:
        normalized_x = vertex.x / page_width
        normalized_y = vertex.y / page_height
        normalized_vertices.append(
            NormalizedVertex(x=normalized_x, y=normalized_y)
        )

    return BoundingPoly(normalized_vertices=normalized_vertices)
```
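
The opposite conversion, from normalized vertices back to a pixel rectangle, is useful when cropping or drawing over a page image. The following is a sketch under the assumption that the page's pixel width and height are known; `Point` is a hypothetical stand-in for `NormalizedVertex` and `to_pixel_box` is not a library function.

```python
from dataclasses import dataclass


@dataclass
class Point:
    # Hypothetical stand-in for NormalizedVertex (x and y in [0.0, 1.0])
    x: float
    y: float


def to_pixel_box(normalized_vertices, page_width, page_height):
    """Return the axis-aligned (left, top, right, bottom) box in pixels."""
    if not normalized_vertices:
        raise ValueError("At least one vertex is required")
    xs = [v.x * page_width for v in normalized_vertices]
    ys = [v.y * page_height for v in normalized_vertices]
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))


box = to_pixel_box(
    [Point(0.25, 0.5), Point(0.75, 0.5), Point(0.75, 0.625), Point(0.25, 0.625)],
    page_width=1000,
    page_height=800,
)
print(box)
```

Taking the minimum and maximum over all vertices keeps the result a rectangle even when the polygon is slightly rotated.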

## Processor and Processor Type Definitions

### Processor

```python { .api }
from enum import Enum

from google.cloud.documentai.types import Processor
from google.protobuf.timestamp_pb2 import Timestamp

class Processor:
    """
    The first-class citizen for Document AI.

    Attributes:
        name (str): Output only. Immutable. The resource name of the processor
        type_ (str): The processor type, e.g., OCR_PROCESSOR, INVOICE_PROCESSOR
        display_name (str): The display name of the processor
        state (Processor.State): Output only. The state of the processor
        default_processor_version (str): The default processor version
        processor_version_aliases (Sequence[ProcessorVersionAlias]): Version aliases
        process_endpoint (str): Output only. Immutable. The http endpoint for this processor
        create_time (Timestamp): Output only. The time the processor was created
        kms_key_name (str): The KMS key used to encrypt the processor
        satisfies_pzs (bool): Output only. Reserved for future use
        satisfies_pzi (bool): Output only. Reserved for future use
    """

    class State(Enum):
        """
        The possible states of the processor.

        Values:
            STATE_UNSPECIFIED: The processor state is unspecified
            ENABLED: The processor is enabled, i.e., has an enabled version
            DISABLED: The processor is disabled
            ENABLING: The processor is being enabled and will become ENABLED if successful
            DISABLING: The processor is being disabled
            CREATING: The processor is being created
            FAILED: The processor failed during creation or while disabling
            DELETING: The processor is being deleted
        """
        STATE_UNSPECIFIED = 0
        ENABLED = 1
        DISABLED = 2
        ENABLING = 3
        DISABLING = 4
        CREATING = 5
        FAILED = 6
        DELETING = 7

def get_processor_state_description(state: "Processor.State") -> str:
    """
    Get human-readable description of processor state.

    Args:
        state: Processor state enum value

    Returns:
        str: Description of the state
    """
    descriptions = {
        Processor.State.ENABLED: "Ready for processing documents",
        Processor.State.DISABLED: "Not available for processing",
        Processor.State.ENABLING: "Currently being enabled",
        Processor.State.DISABLING: "Currently being disabled",
        Processor.State.CREATING: "Being created",
        Processor.State.FAILED: "Failed to create or disable",
        Processor.State.DELETING: "Being permanently deleted"
    }

    # STATE_UNSPECIFIED and any future values fall through to the default
    return descriptions.get(state, "Unknown state")
```
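
The `name` attribute is a resource name of the form `projects/{project}/locations/{location}/processors/{processor}`. As a small sketch, the parts can be pulled out with a regular expression; `parse_processor_name` is a hypothetical helper and the IDs below are made up.

```python
import re

# Resource name layout: projects/{project}/locations/{location}/processors/{processor}
_PROCESSOR_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/processors/(?P<processor>[^/]+)$"
)


def parse_processor_name(name):
    """Split a processor resource name into project, location and processor IDs."""
    match = _PROCESSOR_NAME.match(name)
    if match is None:
        raise ValueError(f"Not a processor resource name: {name!r}")
    return match.groupdict()


parts = parse_processor_name("projects/my-project/locations/us/processors/abc123")
print(parts["location"])
```

The location component is what determines the regional endpoint, so extracting it is often the first step before creating a client.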

### ProcessorType

```python { .api }
from google.cloud.documentai.types import ProcessorType

class ProcessorType:
    """
    A processor type is responsible for performing a certain document understanding task on a certain type of document.

    Attributes:
        name (str): The resource name of the processor type
        type_ (str): The processor type, e.g., OCR_PROCESSOR, INVOICE_PROCESSOR
        category (str): The processor category
        available_locations (Sequence[ProcessorType.LocationInfo]): The locations where this processor is available
        allow_creation (bool): Whether the processor type allows creation of new processor instances
        launch_stage (google.api.LaunchStage): Launch stage of the processor type
        sample_document_uris (Sequence[str]): Sample documents for this processor type
    """

    class LocationInfo:
        """
        Information about the availability of a processor type in a location.

        Attributes:
            location_id (str): The location ID (e.g., 'us', 'eu')
        """
        pass

# Common processor types (illustrative subset; the types actually available
# depend on your project and location)
PROCESSOR_TYPES = {
    # General processors
    "OCR_PROCESSOR": {
        "display_name": "Document OCR",
        "description": "Extracts text from documents and images"
    },
    "FORM_PARSER_PROCESSOR": {
        "display_name": "Form Parser",
        "description": "Extracts key-value pairs from forms"
    },

    # Specialized processors
    "INVOICE_PROCESSOR": {
        "display_name": "Invoice Parser",
        "description": "Extracts structured data from invoices"
    },
    "RECEIPT_PROCESSOR": {
        "display_name": "Receipt Parser",
        "description": "Extracts data from receipts"
    },
    "IDENTITY_DOCUMENT_PROCESSOR": {
        "display_name": "Identity Document Parser",
        "description": "Extracts data from identity documents"
    },
    "CONTRACT_PROCESSOR": {
        "display_name": "Contract Parser",
        "description": "Extracts key information from contracts"
    },
    "EXPENSE_PROCESSOR": {
        "display_name": "Expense Parser",
        "description": "Extracts data from expense documents"
    },

    # Custom processors
    "CUSTOM_EXTRACTION_PROCESSOR": {
        "display_name": "Custom Extraction Processor",
        "description": "Custom trained processor for specific document types"
    },
    "CUSTOM_CLASSIFICATION_PROCESSOR": {
        "display_name": "Custom Classification Processor",
        "description": "Custom trained processor for document classification"
    }
}

def get_processor_type_info(processor_type: str) -> dict:
    """
    Get information about a processor type.

    Args:
        processor_type: Processor type identifier

    Returns:
        dict: Processor type information
    """
    return PROCESSOR_TYPES.get(processor_type, {
        "display_name": processor_type,
        "description": "Unknown processor type"
    })
```

## Document Schema

### DocumentSchema

```python { .api }
from enum import Enum

from google.cloud.documentai.types import DocumentSchema

class DocumentSchema:
    """
    The schema defines the output of the processed document by a processor.

    Attributes:
        display_name (str): Display name to show to users
        description (str): Description of the schema
        entity_types (Sequence[DocumentSchema.EntityType]): Entity types that this schema produces
        metadata (DocumentSchema.Metadata): Metadata about the schema
    """

    class EntityType:
        """
        EntityType is the wrapper of a label of the corresponding model with detailed attributes and limitations for entity-based processors.

        Attributes:
            enum_values (DocumentSchema.EntityType.EnumValues): If specified, lists all the possible values for this entity
            display_name (str): User defined name for the type
            name (str): Name of the type
            base_types (Sequence[str]): The entity type that this type is derived from
            properties (Sequence[DocumentSchema.EntityType.Property]): Describes the nested structure, or composition, of an entity
        """

        class Property:
            """
            Defines properties that can be part of the entity type.

            Attributes:
                name (str): The name of the property
                display_name (str): User defined name for the property
                value_type (str): A reference to the value type of the property
                occurrence_type (DocumentSchema.EntityType.Property.OccurrenceType): Limits how many instances of the entity type may appear in the document
            """

            class OccurrenceType(Enum):
                """
                Types of occurrences of the entity type in the document.

                Values:
                    OCCURRENCE_TYPE_UNSPECIFIED: Unspecified occurrence type
                    OPTIONAL_ONCE: There will be zero or one instance of this entity type
                    OPTIONAL_MULTIPLE: The entity type can have zero or multiple instances
                    REQUIRED_ONCE: The entity type will have exactly one instance
                    REQUIRED_MULTIPLE: The entity type will have one or more instances
                """
                OCCURRENCE_TYPE_UNSPECIFIED = 0
                OPTIONAL_ONCE = 1
                OPTIONAL_MULTIPLE = 2
                REQUIRED_ONCE = 3
                REQUIRED_MULTIPLE = 4

def create_invoice_schema() -> DocumentSchema:
    """
    Create a document schema for invoice processing.

    Returns:
        DocumentSchema: Schema for invoice documents
    """
    # Define entity types for invoice
    entity_types = [
        DocumentSchema.EntityType(
            name="invoice_date",
            display_name="Invoice Date",
            properties=[
                DocumentSchema.EntityType.Property(
                    name="date_value",
                    display_name="Date Value",
                    value_type="date",
                    occurrence_type=DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_ONCE
                )
            ]
        ),
        DocumentSchema.EntityType(
            name="invoice_number",
            display_name="Invoice Number",
            properties=[
                DocumentSchema.EntityType.Property(
                    name="text_value",
                    display_name="Text Value",
                    value_type="text",
                    occurrence_type=DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_ONCE
                )
            ]
        ),
        DocumentSchema.EntityType(
            name="total_amount",
            display_name="Total Amount",
            properties=[
                DocumentSchema.EntityType.Property(
                    name="money_value",
                    display_name="Money Value",
                    value_type="money",
                    occurrence_type=DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_ONCE
                )
            ]
        )
    ]

    return DocumentSchema(
        display_name="Invoice Processing Schema",
        description="Schema for extracting key information from invoices",
        entity_types=entity_types
    )
```
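
The occurrence types translate directly into minimum and maximum instance counts, which makes it possible to check extracted entities against a schema. The sketch below illustrates that mapping with a local mirror of the enum so it runs without the client library; `find_occurrence_violations`, the rule set, and the entity names are hypothetical.

```python
from collections import Counter
from enum import Enum


class OccurrenceType(Enum):
    # Local mirror of DocumentSchema.EntityType.Property.OccurrenceType
    OCCURRENCE_TYPE_UNSPECIFIED = 0
    OPTIONAL_ONCE = 1
    OPTIONAL_MULTIPLE = 2
    REQUIRED_ONCE = 3
    REQUIRED_MULTIPLE = 4


# (minimum, maximum) instance counts; None means unbounded
_BOUNDS = {
    OccurrenceType.OPTIONAL_ONCE: (0, 1),
    OccurrenceType.OPTIONAL_MULTIPLE: (0, None),
    OccurrenceType.REQUIRED_ONCE: (1, 1),
    OccurrenceType.REQUIRED_MULTIPLE: (1, None),
}


def find_occurrence_violations(entity_types, rules):
    """Compare how often each entity type occurs with its occurrence rule."""
    counts = Counter(entity_types)
    violations = []
    for name, occurrence in rules.items():
        if occurrence not in _BOUNDS:
            continue  # unspecified occurrence type: nothing to check
        low, high = _BOUNDS[occurrence]
        found = counts[name]
        if found < low:
            violations.append(f"{name}: expected at least {low}, found {found}")
        elif high is not None and found > high:
            violations.append(f"{name}: expected at most {high}, found {found}")
    return violations


rules = {
    "invoice_date": OccurrenceType.REQUIRED_ONCE,
    "invoice_number": OccurrenceType.REQUIRED_ONCE,
    "line_item": OccurrenceType.OPTIONAL_MULTIPLE,
}
# Entity type names as they might be collected from entity.type_
extracted = ["invoice_date", "invoice_date", "line_item", "line_item"]
violations = find_occurrence_violations(extracted, rules)
print(violations)
```

A check like this can flag documents for manual review when a required field is missing or duplicated.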

## Barcode Types

### Barcode

```python { .api }
from google.cloud.documentai.types import Barcode

class Barcode:
    """
    Encodes the detailed information of a barcode.

    Attributes:
        format_ (str): Format of the barcode (e.g., CODE_128, QR_CODE)
        value_format (str): Format of the barcode value (e.g., CONTACT_INFO, URL)
        raw_value (str): Raw value encoded in the barcode
    """

    # Common barcode formats
    FORMATS = {
        "CODE_128": "Code 128 linear barcode",
        "CODE_39": "Code 39 linear barcode",
        "CODE_93": "Code 93 linear barcode",
        "CODABAR": "Codabar linear barcode",
        "DATA_MATRIX": "Data Matrix 2D barcode",
        "EAN_13": "EAN-13 linear barcode",
        "EAN_8": "EAN-8 linear barcode",
        "ITF": "ITF (Interleaved 2 of 5) linear barcode",
        "QR_CODE": "QR Code 2D barcode",
        "UPC_A": "UPC-A linear barcode",
        "UPC_E": "UPC-E linear barcode",
        "PDF417": "PDF417 2D barcode",
        "AZTEC": "Aztec 2D barcode"
    }

def extract_barcodes_from_document(document: "Document") -> list[dict]:
    """
    Extract all barcodes from a processed document.

    Args:
        document: Processed Document object

    Returns:
        list[dict]: List of barcode information
    """
    barcodes = []

    for page_idx, page in enumerate(document.pages):
        for barcode_detection in page.detected_barcodes:
            barcode_info = {
                "page": page_idx + 1,
                "format": barcode_detection.barcode.format_,
                "value_format": barcode_detection.barcode.value_format,
                "raw_value": barcode_detection.barcode.raw_value,
                "layout": barcode_detection.layout
            }
            barcodes.append(barcode_info)

    return barcodes
```
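
The list of dictionaries returned by `extract_barcodes_from_document` is easy to post-process. As one possible sketch, the raw values can be grouped by barcode format; `group_values_by_format` is a hypothetical helper and the sample records are invented for illustration.

```python
from collections import defaultdict


def group_values_by_format(barcodes):
    """Map each barcode format to the raw values found with that format."""
    grouped = defaultdict(list)
    for item in barcodes:
        grouped[item["format"]].append(item["raw_value"])
    return dict(grouped)


# Invented records in the shape produced by extract_barcodes_from_document
sample = [
    {"page": 1, "format": "QR_CODE", "value_format": "URL", "raw_value": "https://example.com"},
    {"page": 1, "format": "CODE_128", "value_format": "TEXT", "raw_value": "A-1001"},
    {"page": 2, "format": "QR_CODE", "value_format": "URL", "raw_value": "https://example.org"},
]
grouped = group_values_by_format(sample)
print(grouped)
```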

## Complete Document Analysis Example

```python { .api }
from typing import Any, Dict

from google.cloud.documentai.types import Document

def comprehensive_document_analysis(document: Document) -> Dict[str, Any]:
    """
    Perform comprehensive analysis of a processed document.

    Args:
        document: Processed Document object

    Returns:
        Dict[str, Any]: Complete document analysis results
    """
    analysis = {
        "document_info": {
            "mime_type": document.mime_type,
            "text_length": len(document.text),
            "page_count": len(document.pages),
            "entity_count": len(document.entities),
            "has_tables": False,
            "has_form_fields": False,
            "has_barcodes": False
        },
        "pages": [],
        "entities": {},
        "tables": [],
        "form_fields": {},
        "barcodes": [],
        "text_styles": []
    }

    # Analyze pages
    for page_idx, page in enumerate(document.pages):
        page_info = {
            "page_number": page_idx + 1,
            "dimensions": {
                "width": page.dimension.width,
                "height": page.dimension.height,
                "unit": page.dimension.unit
            },
            "elements": {
                "blocks": len(page.blocks),
                "paragraphs": len(page.paragraphs),
                "lines": len(page.lines),
                "tokens": len(page.tokens)
            },
            "tables": len(page.tables),
            "form_fields": len(page.form_fields),
            "barcodes": len(page.detected_barcodes),
            "languages": [lang.language_code for lang in page.detected_languages]
        }

        analysis["pages"].append(page_info)

        # Update document-level flags
        if page.tables:
            analysis["document_info"]["has_tables"] = True
        if page.form_fields:
            analysis["document_info"]["has_form_fields"] = True
        if page.detected_barcodes:
            analysis["document_info"]["has_barcodes"] = True

    # Analyze entities by type
    for entity in document.entities:
        entity_type = entity.type_
        if entity_type not in analysis["entities"]:
            analysis["entities"][entity_type] = []

        entity_info = {
            "text": entity.mention_text,
            "confidence": entity.confidence,
            "normalized_value": None
        }

        # Extract normalized value if available
        if entity.normalized_value:
            if entity.normalized_value.money_value:
                money = entity.normalized_value.money_value
                entity_info["normalized_value"] = {
                    "type": "money",
                    "currency": money.currency_code,
                    # Money stores whole units plus nanos (10^-9 of a unit)
                    "amount": money.units + money.nanos / 1e9
                }
            elif entity.normalized_value.date_value:
                entity_info["normalized_value"] = {
                    "type": "date",
                    "year": entity.normalized_value.date_value.year,
                    "month": entity.normalized_value.date_value.month,
                    "day": entity.normalized_value.date_value.day
                }
            elif entity.normalized_value.text:
                entity_info["normalized_value"] = {
                    "type": "text",
                    "value": entity.normalized_value.text
                }

        analysis["entities"][entity_type].append(entity_info)

    # Extract tables
    for page_idx, page in enumerate(document.pages):
        for table_idx, table in enumerate(page.tables):
            table_data = {
                "page": page_idx + 1,
                "table_index": table_idx,
                "header_rows": len(table.header_rows),
                "body_rows": len(table.body_rows),
                "total_rows": len(table.header_rows) + len(table.body_rows)
            }
            analysis["tables"].append(table_data)

    # Extract form fields (fields without a name anchor are skipped)
    for page in document.pages:
        for form_field in page.form_fields:
            if form_field.field_name and form_field.field_name.text_anchor:
                field_name = extract_text_from_anchor(
                    document.text, form_field.field_name.text_anchor
                ).strip()

                field_value = ""
                if form_field.field_value and form_field.field_value.text_anchor:
                    field_value = extract_text_from_anchor(
                        document.text, form_field.field_value.text_anchor
                    ).strip()

                analysis["form_fields"][field_name] = {
                    "value": field_value,
                    "name_confidence": form_field.field_name.confidence,
                    "value_confidence": form_field.field_value.confidence if form_field.field_value else 0.0
                }

    # Extract barcodes (extract_barcodes_from_document is defined under Barcode Types)
    analysis["barcodes"] = extract_barcodes_from_document(document)

    return analysis

def extract_text_from_anchor(full_text: str, text_anchor: "Document.TextAnchor") -> str:
    """Extract text using TextAnchor reference."""
    text_segments = []
    for segment in text_anchor.text_segments:
        # An unset start_index means the segment starts at the beginning of the text
        start_index = int(segment.start_index) if segment.start_index else 0
        end_index = int(segment.end_index) if segment.end_index else len(full_text)
        text_segments.append(full_text[start_index:end_index])
    return "".join(text_segments)

def print_analysis_summary(analysis: Dict[str, Any]) -> None:
    """Print a summary of the document analysis."""
    info = analysis["document_info"]

    print("=== DOCUMENT ANALYSIS SUMMARY ===")
    print(f"MIME Type: {info['mime_type']}")
    print(f"Text Length: {info['text_length']:,} characters")
    print(f"Pages: {info['page_count']}")
    print(f"Entities: {info['entity_count']}")
    print(f"Has Tables: {'Yes' if info['has_tables'] else 'No'}")
    print(f"Has Form Fields: {'Yes' if info['has_form_fields'] else 'No'}")
    print(f"Has Barcodes: {'Yes' if info['has_barcodes'] else 'No'}")

    print("\n=== ENTITY TYPES ===")
    for entity_type, entities in analysis["entities"].items():
        print(f"{entity_type}: {len(entities)} instances")

    if analysis["tables"]:
        print("\n=== TABLES ===")
        for table in analysis["tables"]:
            print(f"Page {table['page']}: {table['total_rows']} rows")

    if analysis["form_fields"]:
        print("\n=== FORM FIELDS ===")
        # Show at most the first five fields
        for field_name, field_info in list(analysis["form_fields"].items())[:5]:
            print(f"{field_name}: {field_info['value']}")
```
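
Monetary entities are normalized to `google.type.Money`, which stores an amount as whole `units` plus `nanos` (billionths of a unit, carrying the same sign as `units`). When exact arithmetic matters, the two parts can be combined into a `Decimal` instead of a float. A minimal sketch, with `money_to_decimal` as a hypothetical helper:

```python
from decimal import Decimal

NANOS_PER_UNIT = Decimal(1_000_000_000)


def money_to_decimal(units, nanos=0):
    """Combine the units and nanos parts of a Money value into one Decimal."""
    return Decimal(units) + Decimal(nanos) / NANOS_PER_UNIT


# 1234 units and 560,000,000 nanos represent 1234.56
amount = money_to_decimal(units=1234, nanos=560_000_000)
# Negative amounts carry the sign on both parts
refund = money_to_decimal(units=-1, nanos=-750_000_000)
print(amount, refund)
```

Using `Decimal` avoids the rounding surprises that binary floats can introduce when amounts are later summed or compared.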

Together, these type definitions and examples cover the main document types, structures, and schemas in Google Cloud Document AI that developers work with when handling processed documents.