# Document Types and Schemas

This guide covers the comprehensive type system and document structures in Google Cloud Document AI, including document representation, entity types, geometry, and schema definitions.

## Core Document Structure

### Document Type

The `Document` type represents a processed document with all extracted information:

```python { .api }
from google.cloud.documentai.types import Document

class Document:
    """
    Represents a processed document with extracted text, layout, and entities.

    Attributes:
        text (str): UTF-8 encoded text extracted from the document
        pages (Sequence[Document.Page]): List of document pages
        entities (Sequence[Document.Entity]): Extracted entities
        text_styles (Sequence[Document.Style]): Text styling information
        shard_info (Document.ShardInfo): Information about document shards
        error (google.rpc.Status): Processing error information if any
        mime_type (str): Original MIME type of the document
        uri (str): Optional URI where the document was retrieved from
    """

    class Page:
        """
        Represents a single page in the document.

        Attributes:
            page_number (int): 1-based page number
            dimension (Document.Page.Dimension): Page dimensions
            layout (Document.Page.Layout): Page layout information
            detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages detected on page
            blocks (Sequence[Document.Page.Block]): Text blocks on the page
            paragraphs (Sequence[Document.Page.Paragraph]): Paragraphs on the page
            lines (Sequence[Document.Page.Line]): Text lines on the page
            tokens (Sequence[Document.Page.Token]): Individual tokens on the page
            visual_elements (Sequence[Document.Page.VisualElement]): Visual elements like images
            tables (Sequence[Document.Page.Table]): Tables detected on the page
            form_fields (Sequence[Document.Page.FormField]): Form fields detected on the page
            symbols (Sequence[Document.Page.Symbol]): Symbols detected on the page
            detected_barcodes (Sequence[Document.Page.DetectedBarcode]): Barcodes on the page
        """

        class Dimension:
            """
            Physical dimension of the page.

            Attributes:
                width (float): Page width in specified unit
                height (float): Page height in specified unit
                unit (str): Unit of measurement ('INCH', 'CM', 'POINT')
            """
            pass

        class Layout:
            """
            Layout information for a page element.

            Attributes:
                text_anchor (Document.TextAnchor): Text location reference
                confidence (float): Confidence score [0.0, 1.0]
                bounding_poly (BoundingPoly): Bounding box of the element
                orientation (Document.Page.Layout.Orientation): Text orientation
            """
            pass

        class Block:
            """
            A block of text on a page.

            Attributes:
                layout (Document.Page.Layout): Block layout information
                detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in block
                provenance (Document.Provenance): Processing provenance information
            """
            pass

        class Table:
            """
            A table detected on the page.

            Attributes:
                layout (Document.Page.Layout): Table layout information
                header_rows (Sequence[Document.Page.Table.TableRow]): Table header rows
                body_rows (Sequence[Document.Page.Table.TableRow]): Table body rows
                detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in table
            """

            class TableRow:
                """
                A single row in a table.

                Attributes:
                    cells (Sequence[Document.Page.Table.TableCell]): Cells in the row
                """
                pass

            class TableCell:
                """
                A single cell in a table.

                Attributes:
                    layout (Document.Page.Layout): Cell layout information
                    row_span (int): Number of rows this cell spans
                    col_span (int): Number of columns this cell spans
                    detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in cell
                """
                pass

        class FormField:
            """
            A form field (key-value pair) detected on the page.

            Attributes:
                field_name (Document.Page.Layout): Layout of the field name/key
                field_value (Document.Page.Layout): Layout of the field value
                name_detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in name
                value_detected_languages (Sequence[Document.Page.DetectedLanguage]): Languages in value
                value_type (str): Type of the field value
                corrected_key_text (str): Corrected key text if available
                corrected_value_text (str): Corrected value text if available
            """
            pass

    class Entity:
        """
        An entity extracted from the document.

        Attributes:
            text_anchor (Document.TextAnchor): Reference to entity text in document
            type_ (str): Entity type (e.g., 'invoice_date', 'total_amount')
            mention_text (str): Text mention of the entity
            mention_id (str): Unique mention identifier
            confidence (float): Confidence score [0.0, 1.0]
            page_anchor (Document.PageAnchor): Page reference for the entity
            id (str): Entity identifier
            normalized_value (Document.Entity.NormalizedValue): Normalized entity value
            properties (Sequence[Document.Entity]): Sub-entities or properties
            provenance (Document.Provenance): Processing provenance
            redacted (bool): Whether entity was redacted
        """

        class NormalizedValue:
            """
            Normalized representation of an entity value.

            Attributes:
                money_value (google.type.Money): Monetary value
                date_value (google.type.Date): Date value
                datetime_value (google.type.DateTime): DateTime value
                address_value (google.type.PostalAddress): Address value
                boolean_value (bool): Boolean value
                integer_value (int): Integer value
                float_value (float): Float value
                text (str): Text representation
            """
            pass

    class TextAnchor:
        """
        Text anchor referencing a segment of text in the document.

        Attributes:
            text_segments (Sequence[Document.TextAnchor.TextSegment]): Text segments
            content (str): Text content (if not referencing document.text)
        """

        class TextSegment:
            """
            A segment of text.

            Attributes:
                start_index (int): Start character index in document text
                end_index (int): End character index in document text
            """
            pass
```
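
The `Table`, `TableRow`, and `TableCell` types above hold layout references rather than cell text, so reading a table means resolving each cell's `text_anchor` against the document text. The sketch below shows one way to do that. It assumes stand-in objects built with `types.SimpleNamespace` in place of real `Document` messages, and the helper names `anchor_text`, `table_to_rows`, and `_cell` are illustrative rather than part of the library.

```python
from types import SimpleNamespace as NS


def anchor_text(full_text, text_anchor):
    """Join the text segments that a TextAnchor-like object points at."""
    return "".join(
        full_text[int(seg.start_index):int(seg.end_index)]
        for seg in text_anchor.text_segments
    )


def table_to_rows(full_text, table):
    """Return header and body rows of a Table-like object as lists of strings."""
    rows = []
    for row in list(table.header_rows) + list(table.body_rows):
        rows.append(
            [anchor_text(full_text, cell.layout.text_anchor).strip() for cell in row.cells]
        )
    return rows


def _cell(start, end):
    """Build a hypothetical stand-in for Document.Page.Table.TableCell."""
    segment = NS(start_index=start, end_index=end)
    return NS(layout=NS(text_anchor=NS(text_segments=[segment])))


# Stand-in data: the text and offsets are made up for illustration.
text = "Item Qty\nPen 2\n"
table = NS(
    header_rows=[NS(cells=[_cell(0, 5), _cell(5, 9)])],
    body_rows=[NS(cells=[_cell(9, 13), _cell(13, 15)])],
)
rows = table_to_rows(text, table)
```

With a real processed document, the same two functions could be called with `document.text` and each entry of `page.tables`, since only the attribute names listed above are used.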

### Document I/O Types

#### RawDocument

```python { .api }
import os
from typing import Optional

from google.cloud.documentai.types import RawDocument

class RawDocument:
    """
    Represents a raw document for processing.

    Attributes:
        content (bytes): Raw document content
        mime_type (str): MIME type of the document
        display_name (str): Optional display name for the document
    """

    def __init__(
        self,
        content: bytes,
        mime_type: str,
        display_name: Optional[str] = None
    ):
        """
        Initialize a raw document.

        Args:
            content: Raw document bytes
            mime_type: Document MIME type (e.g., 'application/pdf')
            display_name: Optional display name
        """
        self.content = content
        self.mime_type = mime_type
        self.display_name = display_name

# Example usage
def create_raw_document_from_file(file_path: str, mime_type: str) -> RawDocument:
    """
    Create RawDocument from a file.

    Args:
        file_path: Path to document file
        mime_type: MIME type of the document

    Returns:
        RawDocument: Raw document object
    """
    with open(file_path, "rb") as f:
        content = f.read()

    return RawDocument(
        content=content,
        mime_type=mime_type,
        # Use the file name (without directories) as the display name
        display_name=os.path.basename(file_path)
    )
```

#### GcsDocument

```python { .api }
from google.cloud.documentai.types import GcsDocument

class GcsDocument:
    """
    Represents a document stored in Google Cloud Storage.

    Attributes:
        gcs_uri (str): Cloud Storage URI (gs://bucket/path)
        mime_type (str): MIME type of the document
    """

    def __init__(self, gcs_uri: str, mime_type: str):
        """
        Initialize a GCS document reference.

        Args:
            gcs_uri: Cloud Storage URI
            mime_type: Document MIME type
        """
        self.gcs_uri = gcs_uri
        self.mime_type = mime_type

# Example usage
def create_gcs_documents_batch(
    gcs_uris: list[str],
    mime_types: list[str]
) -> list[GcsDocument]:
    """
    Create batch of GCS document references.

    Args:
        gcs_uris: List of Cloud Storage URIs
        mime_types: List of corresponding MIME types

    Returns:
        list[GcsDocument]: List of GCS document references
    """
    if len(gcs_uris) != len(mime_types):
        raise ValueError("Number of URIs must match number of MIME types")

    return [
        GcsDocument(gcs_uri=uri, mime_type=mime_type)
        for uri, mime_type in zip(gcs_uris, mime_types)
    ]
```

#### GcsDocuments

```python { .api }
from typing import Optional

from google.cloud.documentai.types import GcsDocuments, GcsDocument

class GcsDocuments:
    """
    Collection of documents stored in Google Cloud Storage.

    Attributes:
        documents (Sequence[GcsDocument]): List of GCS documents
    """

    def __init__(self, documents: list[GcsDocument]):
        """
        Initialize GCS documents collection.

        Args:
            documents: List of GcsDocument objects
        """
        self.documents = documents

# Example usage
def create_gcs_documents_from_prefix(
    gcs_prefix: str,
    file_extensions: Optional[list[str]] = None
) -> GcsDocuments:
    """
    Create GcsDocuments from a Cloud Storage prefix.

    Args:
        gcs_prefix: Cloud Storage prefix (gs://bucket/path/)
        file_extensions: Optional list of file extensions to include (e.g., ['.pdf'])

    Returns:
        GcsDocuments: Collection of GCS documents
    """
    # This would require Cloud Storage client to list files
    # Simplified example assuming we know the files
    documents = []

    # Example files (in practice, you'd list the bucket contents)
    example_files = [
        f"{gcs_prefix}doc1.pdf",
        f"{gcs_prefix}doc2.pdf",
        f"{gcs_prefix}image1.jpg"
    ]

    mime_type_map = {
        '.pdf': 'application/pdf',
        '.jpg': 'image/jpeg',
        '.png': 'image/png',
        '.tiff': 'image/tiff'
    }

    for file_uri in example_files:
        # Determine MIME type from extension
        for ext, mime_type in mime_type_map.items():
            if file_uri.lower().endswith(ext):
                # Honor the optional extension filter when one is given
                if file_extensions is None or ext in file_extensions:
                    documents.append(GcsDocument(
                        gcs_uri=file_uri,
                        mime_type=mime_type
                    ))
                break

    return GcsDocuments(documents=documents)
```
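
The extension lookup above relies on a hand-written map. As a possible alternative, Python's standard `mimetypes` module can infer the MIME type from the object name. This is a minimal sketch: the helper name `guess_document_mime_type`, the bucket names, and the fallback value are assumptions, and the result should still be checked against the formats your processor accepts.

```python
import mimetypes
import posixpath


def guess_document_mime_type(gcs_uri, default="application/octet-stream"):
    """Guess a MIME type from the object name at the end of a gs:// URI."""
    # Only the final path component matters for the extension lookup.
    object_name = posixpath.basename(gcs_uri)
    mime_type, _encoding = mimetypes.guess_type(object_name)
    return mime_type or default


pdf_type = guess_document_mime_type("gs://example-bucket/invoices/doc1.pdf")
png_type = guess_document_mime_type("gs://example-bucket/scans/page.png")
unknown_type = guess_document_mime_type("gs://example-bucket/notes/README")
```

Objects without a recognizable extension fall back to the default, which makes them easy to filter out before building a `GcsDocuments` collection.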

## Geometry Types

### BoundingPoly

```python { .api }
from typing import Optional

from google.cloud.documentai.types import BoundingPoly, Vertex, NormalizedVertex

class BoundingPoly:
    """
    A bounding polygon for the detected image annotation.

    Attributes:
        vertices (Sequence[Vertex]): Vertices of the bounding polygon
        normalized_vertices (Sequence[NormalizedVertex]): Normalized vertices [0.0, 1.0]
    """

    def __init__(
        self,
        vertices: Optional[list[Vertex]] = None,
        normalized_vertices: Optional[list[NormalizedVertex]] = None
    ):
        """
        Initialize bounding polygon.

        Args:
            vertices: List of pixel-coordinate vertices
            normalized_vertices: List of normalized coordinate vertices
        """
        self.vertices = vertices or []
        self.normalized_vertices = normalized_vertices or []

class Vertex:
    """
    A vertex represents a 2D point in the image.

    Attributes:
        x (int): X coordinate in pixels
        y (int): Y coordinate in pixels
    """

    def __init__(self, x: int, y: int):
        """
        Initialize vertex with pixel coordinates.

        Args:
            x: X coordinate
            y: Y coordinate
        """
        self.x = x
        self.y = y

class NormalizedVertex:
    """
    A vertex represents a 2D point with normalized coordinates.

    Attributes:
        x (float): X coordinate [0.0, 1.0]
        y (float): Y coordinate [0.0, 1.0]
    """

    def __init__(self, x: float, y: float):
        """
        Initialize normalized vertex.

        Args:
            x: Normalized X coordinate [0.0, 1.0]
            y: Normalized Y coordinate [0.0, 1.0]
        """
        self.x = x
        self.y = y

# Utility functions for geometry
def create_bounding_box(
    left: int,
    top: int,
    right: int,
    bottom: int
) -> BoundingPoly:
    """
    Create a rectangular bounding polygon.

    Args:
        left: Left edge X coordinate
        top: Top edge Y coordinate
        right: Right edge X coordinate
        bottom: Bottom edge Y coordinate

    Returns:
        BoundingPoly: Rectangular bounding polygon
    """
    vertices = [
        Vertex(x=left, y=top),      # Top-left
        Vertex(x=right, y=top),     # Top-right
        Vertex(x=right, y=bottom),  # Bottom-right
        Vertex(x=left, y=bottom)    # Bottom-left
    ]

    return BoundingPoly(vertices=vertices)

def normalize_bounding_poly(
    bounding_poly: BoundingPoly,
    page_width: int,
    page_height: int
) -> BoundingPoly:
    """
    Convert pixel coordinates to normalized coordinates.

    Args:
        bounding_poly: Bounding polygon with pixel coordinates
        page_width: Page width in pixels (must be non-zero)
        page_height: Page height in pixels (must be non-zero)

    Returns:
        BoundingPoly: Bounding polygon with normalized coordinates
    """
    normalized_vertices = []

    for vertex in bounding_poly.vertices:
        normalized_x = vertex.x / page_width
        normalized_y = vertex.y / page_height
        normalized_vertices.append(
            NormalizedVertex(x=normalized_x, y=normalized_y)
        )

    return BoundingPoly(normalized_vertices=normalized_vertices)
```
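
Going the other way, from normalized vertices back to pixels, can be useful when drawing boxes on a rendered page image. The following sketch assumes a small stand-in dataclass, `NormalizedPoint`, rather than the library's `NormalizedVertex` message, and the function name `to_pixel_box` and the page size are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NormalizedPoint:
    """Stand-in for NormalizedVertex: coordinates in [0.0, 1.0]."""
    x: float
    y: float


def to_pixel_box(points, page_width, page_height):
    """Return (left, top, right, bottom) in pixels enclosing the points."""
    if not points:
        raise ValueError("at least one vertex is required")
    xs = [round(p.x * page_width) for p in points]
    ys = [round(p.y * page_height) for p in points]
    return min(xs), min(ys), max(xs), max(ys)


# A box on a hypothetical 1000 x 2000 pixel page.
corners = [
    NormalizedPoint(0.25, 0.25),
    NormalizedPoint(0.75, 0.25),
    NormalizedPoint(0.75, 0.5),
    NormalizedPoint(0.25, 0.5),
]
box = to_pixel_box(corners, page_width=1000, page_height=2000)
```

Taking the minimum and maximum over all vertices yields an axis-aligned box even when the polygon itself is rotated, which is usually what image-cropping code expects.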

## Processor and Processor Type Definitions

### Processor

```python { .api }
from enum import Enum

from google.cloud.documentai.types import Processor
from google.protobuf.timestamp_pb2 import Timestamp

class Processor:
    """
    The first-class citizen for Document AI.

    Attributes:
        name (str): Output only. Immutable. The resource name of the processor
        type_ (str): The processor type, e.g., OCR_PROCESSOR, INVOICE_PROCESSOR
        display_name (str): The display name of the processor
        state (Processor.State): Output only. The state of the processor
        default_processor_version (str): The default processor version
        processor_version_aliases (Sequence[ProcessorVersionAlias]): Version aliases
        process_endpoint (str): Output only. Immutable. The HTTP endpoint for this processor
        create_time (Timestamp): Output only. The time the processor was created
        kms_key_name (str): The KMS key used to encrypt the processor
        satisfies_pzs (bool): Output only. Reserved for future use
        satisfies_pzi (bool): Output only. Reserved for future use
    """

    class State(Enum):
        """
        The possible states of the processor.

        Values:
            STATE_UNSPECIFIED: The processor state is unspecified
            ENABLED: The processor is enabled, i.e., has an enabled version
            DISABLED: The processor is disabled
            ENABLING: The processor is being enabled and will become ENABLED if successful
            DISABLING: The processor is being disabled
            CREATING: The processor is being created
            FAILED: The processor failed during creation or while disabling
            DELETING: The processor is being deleted
        """
        STATE_UNSPECIFIED = 0
        ENABLED = 1
        DISABLED = 2
        ENABLING = 3
        DISABLING = 4
        CREATING = 5
        FAILED = 6
        DELETING = 7

def get_processor_state_description(state: "Processor.State") -> str:
    """
    Get human-readable description of processor state.

    Args:
        state: Processor state enum value

    Returns:
        str: Description of the state
    """
    descriptions = {
        Processor.State.ENABLED: "Ready for processing documents",
        Processor.State.DISABLED: "Not available for processing",
        Processor.State.ENABLING: "Currently being enabled",
        Processor.State.DISABLING: "Currently being disabled",
        Processor.State.CREATING: "Being created",
        Processor.State.FAILED: "Failed to create or disable",
        Processor.State.DELETING: "Being permanently deleted"
    }

    # STATE_UNSPECIFIED and any unrecognized value fall through to the default
    return descriptions.get(state, "Unknown state")
```

### ProcessorType

```python { .api }
from google.cloud.documentai.types import ProcessorType

class ProcessorType:
    """
    A processor type is responsible for performing a certain document understanding task on a certain type of document.

    Attributes:
        name (str): The resource name of the processor type
        type_ (str): The processor type, e.g., OCR_PROCESSOR, INVOICE_PROCESSOR
        category (str): The processor category
        available_locations (Sequence[LocationInfo]): The locations where this processor is available
        allow_creation (bool): Whether the processor type allows creation of new processor instances
        launch_stage (google.api.LaunchStage): Launch stage of the processor type
        sample_document_uris (Sequence[str]): Sample documents for this processor type
    """

    class LocationInfo:
        """
        Information about the availability of a processor type in a location.

        Attributes:
            location_id (str): The location ID (e.g., 'us', 'eu')
        """
        pass

# Common processor types
PROCESSOR_TYPES = {
    # General processors
    "OCR_PROCESSOR": {
        "display_name": "Document OCR",
        "description": "Extracts text from documents and images"
    },
    "FORM_PARSER_PROCESSOR": {
        "display_name": "Form Parser",
        "description": "Extracts key-value pairs from forms"
    },

    # Specialized processors
    "INVOICE_PROCESSOR": {
        "display_name": "Invoice Parser",
        "description": "Extracts structured data from invoices"
    },
    "RECEIPT_PROCESSOR": {
        "display_name": "Receipt Parser",
        "description": "Extracts data from receipts"
    },
    "IDENTITY_DOCUMENT_PROCESSOR": {
        "display_name": "Identity Document Parser",
        "description": "Extracts data from identity documents"
    },
    "CONTRACT_PROCESSOR": {
        "display_name": "Contract Parser",
        "description": "Extracts key information from contracts"
    },
    "EXPENSE_PROCESSOR": {
        "display_name": "Expense Parser",
        "description": "Extracts data from expense documents"
    },

    # Custom processors
    "CUSTOM_EXTRACTION_PROCESSOR": {
        "display_name": "Custom Extraction Processor",
        "description": "Custom trained processor for specific document types"
    },
    "CUSTOM_CLASSIFICATION_PROCESSOR": {
        "display_name": "Custom Classification Processor",
        "description": "Custom trained processor for document classification"
    }
}

def get_processor_type_info(processor_type: str) -> dict:
    """
    Get information about a processor type.

    Args:
        processor_type: Processor type identifier

    Returns:
        dict: Processor type information
    """
    return PROCESSOR_TYPES.get(processor_type, {
        "display_name": processor_type,
        "description": "Unknown processor type"
    })
```

## Document Schema

### DocumentSchema

```python { .api }
from enum import Enum

from google.cloud.documentai.types import DocumentSchema

class DocumentSchema:
    """
    The schema defines the output of the processed document by a processor.

    Attributes:
        display_name (str): Display name to show to users
        description (str): Description of the schema
        entity_types (Sequence[DocumentSchema.EntityType]): Entity types that this schema produces
        metadata (DocumentSchema.Metadata): Metadata about the schema
    """

    class EntityType:
        """
        EntityType is the wrapper of a label of the corresponding model with detailed attributes and limitations for entity-based processors.

        Attributes:
            enum_values (DocumentSchema.EntityType.EnumValues): If specified, lists all the possible values for this entity
            display_name (str): User defined name for the type
            name (str): Name of the type
            base_types (Sequence[str]): The entity types that this type is derived from
            properties (Sequence[DocumentSchema.EntityType.Property]): Describes the nested structure, or composition, of an entity
        """

        class Property:
            """
            Defines properties that can be part of the entity type.

            Attributes:
                name (str): The name of the property
                display_name (str): User defined name for the property
                value_type (str): A reference to the value type of the property
                occurrence_type (DocumentSchema.EntityType.Property.OccurrenceType): Limits how many instances of the entity type may appear in the document
            """

            class OccurrenceType(Enum):
                """
                Types of occurrences of the entity type in the document.

                Values:
                    OCCURRENCE_TYPE_UNSPECIFIED: Unspecified occurrence type
                    OPTIONAL_ONCE: There will be zero or one instance of this entity type
                    OPTIONAL_MULTIPLE: The entity type can have zero or multiple instances
                    REQUIRED_ONCE: The entity type will have exactly one instance
                    REQUIRED_MULTIPLE: The entity type will have one or more instances
                """
                OCCURRENCE_TYPE_UNSPECIFIED = 0
                OPTIONAL_ONCE = 1
                OPTIONAL_MULTIPLE = 2
                REQUIRED_ONCE = 3
                REQUIRED_MULTIPLE = 4

def create_invoice_schema() -> DocumentSchema:
    """
    Create a document schema for invoice processing.

    Returns:
        DocumentSchema: Schema for invoice documents
    """
    # Define entity types for invoice
    entity_types = [
        DocumentSchema.EntityType(
            name="invoice_date",
            display_name="Invoice Date",
            properties=[
                DocumentSchema.EntityType.Property(
                    name="date_value",
                    display_name="Date Value",
                    value_type="date",
                    occurrence_type=DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_ONCE
                )
            ]
        ),
        DocumentSchema.EntityType(
            name="invoice_number",
            display_name="Invoice Number",
            properties=[
                DocumentSchema.EntityType.Property(
                    name="text_value",
                    display_name="Text Value",
                    value_type="text",
                    occurrence_type=DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_ONCE
                )
            ]
        ),
        DocumentSchema.EntityType(
            name="total_amount",
            display_name="Total Amount",
            properties=[
                DocumentSchema.EntityType.Property(
                    name="money_value",
                    display_name="Money Value",
                    value_type="money",
                    occurrence_type=DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_ONCE
                )
            ]
        )
    ]

    return DocumentSchema(
        display_name="Invoice Processing Schema",
        description="Schema for extracting key information from invoices",
        entity_types=entity_types
    )
```
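
The occurrence types above translate directly into count constraints. As an illustration, the sketch below checks how often each entity type was extracted against such constraints. It uses a local `Occurrence` enum that mirrors the values listed above and a hypothetical `check_occurrences` helper; neither is part of the client library, and the entity names are made up.

```python
from collections import Counter
from enum import Enum


class Occurrence(Enum):
    """Local mirror of the OccurrenceType values listed above."""
    OPTIONAL_ONCE = 1
    OPTIONAL_MULTIPLE = 2
    REQUIRED_ONCE = 3
    REQUIRED_MULTIPLE = 4


# Allowed (minimum, maximum) instance counts; None means no upper bound.
BOUNDS = {
    Occurrence.OPTIONAL_ONCE: (0, 1),
    Occurrence.OPTIONAL_MULTIPLE: (0, None),
    Occurrence.REQUIRED_ONCE: (1, 1),
    Occurrence.REQUIRED_MULTIPLE: (1, None),
}


def check_occurrences(entity_types, rules):
    """Return the names whose instance count violates its occurrence rule."""
    counts = Counter(entity_types)
    violations = []
    for name, occurrence in rules.items():
        low, high = BOUNDS[occurrence]
        count = counts[name]
        if count < low or (high is not None and count > high):
            violations.append(name)
    return violations


rules = {
    "invoice_date": Occurrence.REQUIRED_ONCE,
    "total_amount": Occurrence.REQUIRED_ONCE,
    "line_item": Occurrence.OPTIONAL_MULTIPLE,
}
# Entity type names as they might appear on extracted entities.
found = ["invoice_date", "line_item", "line_item", "invoice_date"]
violations = check_occurrences(found, rules)
```

Here `invoice_date` appears twice and `total_amount` is missing, so both are reported, while the repeated `line_item` is allowed.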

## Barcode Types

### Barcode

```python { .api }
from google.cloud.documentai.types import Barcode

class Barcode:
    """
    Encodes the detailed information of a barcode.

    Attributes:
        format_ (str): Format of the barcode (e.g., CODE_128, QR_CODE)
        value_format (str): Format of the barcode value (e.g., CONTACT_INFO, URL)
        raw_value (str): Raw value encoded in the barcode
    """

    # Common barcode formats
    FORMATS = {
        "CODE_128": "Code 128 linear barcode",
        "CODE_39": "Code 39 linear barcode",
        "CODE_93": "Code 93 linear barcode",
        "CODABAR": "Codabar linear barcode",
        "DATA_MATRIX": "Data Matrix 2D barcode",
        "EAN_13": "EAN-13 linear barcode",
        "EAN_8": "EAN-8 linear barcode",
        "ITF": "ITF (Interleaved 2 of 5) linear barcode",
        "QR_CODE": "QR Code 2D barcode",
        "UPC_A": "UPC-A linear barcode",
        "UPC_E": "UPC-E linear barcode",
        "PDF417": "PDF417 2D barcode",
        "AZTEC": "Aztec 2D barcode"
    }

def extract_barcodes_from_document(document: "Document") -> list[dict]:
    """
    Extract all barcodes from a processed document.

    Args:
        document: Processed Document object

    Returns:
        list[dict]: List of barcode information
    """
    barcodes = []

    for page_idx, page in enumerate(document.pages):
        for barcode_detection in page.detected_barcodes:
            barcode_info = {
                "page": page_idx + 1,
                "format": barcode_detection.barcode.format_,
                "value_format": barcode_detection.barcode.value_format,
                "raw_value": barcode_detection.barcode.raw_value,
                "layout": barcode_detection.layout
            }
            barcodes.append(barcode_info)

    return barcodes
```

## Complete Document Analysis Example

```python { .api }
from google.cloud.documentai.types import Document
from typing import Any, Dict

# Note: this example calls extract_barcodes_from_document() from the
# Barcode section above; it must be defined in the same module.

def comprehensive_document_analysis(document: Document) -> Dict[str, Any]:
    """
    Perform comprehensive analysis of a processed document.

    Args:
        document: Processed Document object

    Returns:
        Dict[str, Any]: Complete document analysis results
    """
    analysis = {
        "document_info": {
            "mime_type": document.mime_type,
            "text_length": len(document.text),
            "page_count": len(document.pages),
            "entity_count": len(document.entities),
            "has_tables": False,
            "has_form_fields": False,
            "has_barcodes": False
        },
        "pages": [],
        "entities": {},
        "tables": [],
        "form_fields": {},
        "barcodes": [],
        "text_styles": []
    }

    # Analyze pages
    for page_idx, page in enumerate(document.pages):
        page_info = {
            "page_number": page_idx + 1,
            "dimensions": {
                "width": page.dimension.width,
                "height": page.dimension.height,
                "unit": page.dimension.unit
            },
            "elements": {
                "blocks": len(page.blocks),
                "paragraphs": len(page.paragraphs),
                "lines": len(page.lines),
                "tokens": len(page.tokens)
            },
            "tables": len(page.tables),
            "form_fields": len(page.form_fields),
            "barcodes": len(page.detected_barcodes),
            "languages": [lang.language_code for lang in page.detected_languages]
        }

        analysis["pages"].append(page_info)

        # Update document-level flags
        if page.tables:
            analysis["document_info"]["has_tables"] = True
        if page.form_fields:
            analysis["document_info"]["has_form_fields"] = True
        if page.detected_barcodes:
            analysis["document_info"]["has_barcodes"] = True

    # Analyze entities by type
    for entity in document.entities:
        entity_type = entity.type_
        if entity_type not in analysis["entities"]:
            analysis["entities"][entity_type] = []

        entity_info = {
            "text": entity.mention_text,
            "confidence": entity.confidence,
            "normalized_value": None
        }

        # Extract normalized value if available
        if entity.normalized_value:
            if entity.normalized_value.money_value:
                entity_info["normalized_value"] = {
                    "type": "money",
                    "currency": entity.normalized_value.money_value.currency_code,
                    # Whole units only; the fractional part is in money_value.nanos
                    "amount": entity.normalized_value.money_value.units
                }
            elif entity.normalized_value.date_value:
                entity_info["normalized_value"] = {
                    "type": "date",
                    "year": entity.normalized_value.date_value.year,
                    "month": entity.normalized_value.date_value.month,
                    "day": entity.normalized_value.date_value.day
                }
            elif entity.normalized_value.text:
                entity_info["normalized_value"] = {
                    "type": "text",
                    "value": entity.normalized_value.text
                }

        analysis["entities"][entity_type].append(entity_info)

    # Extract tables
    for page_idx, page in enumerate(document.pages):
        for table_idx, table in enumerate(page.tables):
            table_data = {
                "page": page_idx + 1,
                "table_index": table_idx,
                "header_rows": len(table.header_rows),
                "body_rows": len(table.body_rows),
                "total_rows": len(table.header_rows) + len(table.body_rows)
            }
            analysis["tables"].append(table_data)

    # Extract form fields (fields without a resolvable name are skipped)
    for page in document.pages:
        for form_field in page.form_fields:
            if form_field.field_name and form_field.field_name.text_anchor:
                field_name = extract_text_from_anchor(
                    document.text, form_field.field_name.text_anchor
                ).strip()

                field_value = ""
                if form_field.field_value and form_field.field_value.text_anchor:
                    field_value = extract_text_from_anchor(
                        document.text, form_field.field_value.text_anchor
                    ).strip()

                analysis["form_fields"][field_name] = {
                    "value": field_value,
                    "name_confidence": form_field.field_name.confidence,
                    "value_confidence": form_field.field_value.confidence if form_field.field_value else 0.0
                }

    # Extract barcodes
    analysis["barcodes"] = extract_barcodes_from_document(document)

    return analysis

def extract_text_from_anchor(full_text: str, text_anchor: "Document.TextAnchor") -> str:
    """Extract text using TextAnchor reference."""
    text_segments = []
    for segment in text_anchor.text_segments:
        # An unset start_index means the segment starts at the beginning of the text
        start_index = int(segment.start_index) if segment.start_index else 0
        end_index = int(segment.end_index) if segment.end_index else len(full_text)
        text_segments.append(full_text[start_index:end_index])
    return "".join(text_segments)

def print_analysis_summary(analysis: Dict[str, Any]) -> None:
    """Print a summary of the document analysis."""
    info = analysis["document_info"]

    print("=== DOCUMENT ANALYSIS SUMMARY ===")
    print(f"MIME Type: {info['mime_type']}")
    print(f"Text Length: {info['text_length']:,} characters")
    print(f"Pages: {info['page_count']}")
    print(f"Entities: {info['entity_count']}")
    print(f"Has Tables: {'Yes' if info['has_tables'] else 'No'}")
    print(f"Has Form Fields: {'Yes' if info['has_form_fields'] else 'No'}")
    print(f"Has Barcodes: {'Yes' if info['has_barcodes'] else 'No'}")

    print("\n=== ENTITY TYPES ===")
    for entity_type, entities in analysis["entities"].items():
        print(f"{entity_type}: {len(entities)} instances")

    if analysis["tables"]:
        print("\n=== TABLES ===")
        for table in analysis["tables"]:
            print(f"Page {table['page']}: {table['total_rows']} rows")

    if analysis["form_fields"]:
        print("\n=== FORM FIELDS ===")
        # Show at most the first five fields
        for field_name, field_info in list(analysis["form_fields"].items())[:5]:
            print(f"{field_name}: {field_info['value']}")
```
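
The analysis above records only `money_value.units`, which is the whole-number part of the amount. `google.type.Money` also carries a `nanos` field holding the fractional part in billionths of a unit. A minimal sketch of combining the two follows, using a stand-in dataclass instead of the real message; the names `MoneyStandIn` and `money_to_decimal` are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class MoneyStandIn:
    """Stand-in with the same field names as google.type.Money."""
    currency_code: str
    units: int
    nanos: int


def money_to_decimal(money):
    """Combine whole units and nano units into one exact Decimal amount."""
    return Decimal(money.units) + Decimal(money.nanos) / Decimal(10**9)


total = money_to_decimal(MoneyStandIn(currency_code="USD", units=12, nanos=340_000_000))
refund = money_to_decimal(MoneyStandIn(currency_code="USD", units=-1, nanos=-750_000_000))
```

Using `Decimal` rather than `float` avoids binary rounding errors, which matters when amounts are later summed or compared.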

This guide covers the main document types, structures, and schemas in Google Cloud Document AI, with type definitions and practical examples for working with processed documents.