# Beta Features (v1beta3)

This guide covers beta features available in the v1beta3 API of Google Cloud Document AI, including dataset management, enhanced document processing, and experimental capabilities.

**⚠️ Beta Notice**: These features are in beta and may change or be removed in future versions. Use with caution in production environments.

## API Versions Overview

### V1beta3 vs V1 API

The v1beta3 API includes all v1 functionality plus additional experimental features:
```python { .api }
# V1 (Stable) - Production ready
from google.cloud.documentai import DocumentProcessorServiceClient
from google.cloud.documentai.types import Document, ProcessRequest

# V1beta3 (Beta) - Includes experimental features
from google.cloud.documentai_v1beta3 import DocumentProcessorServiceClient as DocumentProcessorServiceClientBeta
from google.cloud.documentai_v1beta3 import DocumentServiceClient  # Beta-only service
from google.cloud.documentai_v1beta3.types import Dataset, DatasetSchema  # Beta-only types
```
22
23
### Import Patterns
24
25
```python { .api }
# Beta document processing (enhanced features)
from google.cloud.documentai_v1beta3 import (
    DocumentProcessorServiceClient,
    DocumentProcessorServiceAsyncClient,
)

# Beta dataset management (exclusive to v1beta3)
from google.cloud.documentai_v1beta3 import (
    DocumentServiceClient,
    DocumentServiceAsyncClient,
)

# Beta-specific types
from google.cloud.documentai_v1beta3.types import (
    # Dataset types
    Dataset,
    DatasetSchema,
    DocumentMetadata,
    DocumentId,
    BatchDatasetDocuments,
    DocumentPageRange,

    # Enhanced document types
    RevisionRef,

    # Schema enhancements
    EntityTypeMetadata,
    FieldExtractionMetadata,
    PropertyMetadata,
    SummaryOptions,

    # Additional beta request/response types
    ImportProcessorVersionRequest,
    ImportProcessorVersionResponse,
    ImportProcessorVersionMetadata,
    BatchDeleteDocumentsRequest,
    BatchDeleteDocumentsResponse,
    BatchDeleteDocumentsMetadata,
    UpdateDatasetRequest,
    UpdateDatasetSchemaRequest,
    GetDatasetSchemaRequest,
    GetDocumentRequest,
    GetDocumentResponse,
    ImportDocumentsRequest,
    ImportDocumentsResponse,
    ImportDocumentsMetadata,
    UpdateDatasetOperationMetadata,

    # Enums
    DatasetSplitType,
    DocumentLabelingState,
)
```

## Dataset Management

### DocumentServiceClient

The `DocumentServiceClient` is exclusive to v1beta3 and provides dataset management capabilities for custom processor training.

```python { .api }
from google.cloud.documentai_v1beta3 import DocumentServiceClient
from google.cloud.documentai_v1beta3.types import (
    Dataset,
    DatasetSchema,
    UpdateDatasetRequest,
    ImportDocumentsRequest,
    GetDocumentRequest,
    ListDocumentsRequest,
    BatchDeleteDocumentsRequest,
)


class DocumentServiceClient:
    """
    Service for managing datasets and documents for training custom processors.

    This service is only available in the v1beta3 API.
    """

    def update_dataset(
        self,
        request: UpdateDatasetRequest,
        **kwargs
    ) -> Dataset:
        """
        Update dataset configuration.

        Args:
            request: Update dataset request with dataset and field mask

        Returns:
            Dataset: Updated dataset object
        """
        pass

    def import_documents(
        self,
        request: ImportDocumentsRequest,
        **kwargs
    ) -> "Operation":
        """
        Import documents into a dataset for training.

        Args:
            request: Import documents request

        Returns:
            Operation: Long-running operation for the import process
        """
        pass

    def get_document(
        self,
        request: GetDocumentRequest,
        **kwargs
    ) -> "Document":
        """
        Get document metadata and content from a dataset.

        Args:
            request: Get document request with document name

        Returns:
            Document: Document object with metadata
        """
        pass

    def list_documents(
        self,
        request: ListDocumentsRequest,
        **kwargs
    ) -> "ListDocumentsResponse":
        """
        List documents in a dataset.

        Args:
            request: List documents request with parent dataset

        Returns:
            ListDocumentsResponse: Paginated list of documents
        """
        pass

    def batch_delete_documents(
        self,
        request: BatchDeleteDocumentsRequest,
        **kwargs
    ) -> "Operation":
        """
        Delete multiple documents from a dataset.

        Args:
            request: Batch delete request with document names

        Returns:
            Operation: Long-running operation for the deletion
        """
        pass

    def get_dataset_schema(
        self,
        request: "GetDatasetSchemaRequest",
        **kwargs
    ) -> DatasetSchema:
        """
        Get the dataset schema definition.

        Args:
            request: Get schema request

        Returns:
            DatasetSchema: Schema definition for the dataset
        """
        pass

    def update_dataset_schema(
        self,
        request: "UpdateDatasetSchemaRequest",
        **kwargs
    ) -> DatasetSchema:
        """
        Update the dataset schema definition.

        Args:
            request: Update schema request

        Returns:
            DatasetSchema: Updated schema definition
        """
        pass


# Example usage
client = DocumentServiceClient()

# List documents in a dataset
parent = "projects/my-project/locations/us/processors/abc123/dataset"
request = ListDocumentsRequest(parent=parent)
response = client.list_documents(request=request)

for document_metadata in response.document_metadata:
    print(f"Document: {document_metadata.document_id.document_id}")
    print(f"State: {document_metadata.labeling_state}")
```

### Dataset Types

#### Dataset

```python { .api }
from google.cloud.documentai_v1beta3.types import Dataset


class Dataset:
    """
    A dataset of documents for training custom processors.

    Attributes:
        name (str): Dataset resource name
        display_name (str): Human-readable name
        description (str): Dataset description
        document_count (int): Number of documents in dataset
        satisfies_pzs (bool): Reserved for future use
        satisfies_pzi (bool): Reserved for future use
    """

    def __init__(
        self,
        display_name: str,
        description: str = ""
    ):
        """
        Initialize dataset.

        Args:
            display_name: Human-readable dataset name
            description: Optional description
        """
        self.display_name = display_name
        self.description = description


# Example usage
def create_training_dataset_config(
    display_name: str,
    description: str
) -> Dataset:
    """
    Create dataset configuration for custom processor training.

    Args:
        display_name: Dataset name
        description: Dataset description

    Returns:
        Dataset: Dataset configuration
    """
    return Dataset(
        display_name=display_name,
        description=description
    )
```

#### DatasetSchema

```python { .api }
from google.cloud.documentai_v1beta3.types import DatasetSchema


class DatasetSchema:
    """
    Schema definition for a dataset, specifying entity types and structure.

    Attributes:
        name (str): Schema resource name
        display_name (str): Human-readable schema name
        description (str): Schema description
        entity_types (Sequence[DocumentSchema.EntityType]): Entity types in schema
    """

    def __init__(
        self,
        display_name: str,
        entity_types: list["DocumentSchema.EntityType"],
        description: str = ""
    ):
        """
        Initialize dataset schema.

        Args:
            display_name: Schema name
            entity_types: List of entity type definitions
            description: Optional description
        """
        self.display_name = display_name
        self.entity_types = entity_types
        self.description = description


def create_custom_schema(
    schema_name: str,
    entity_definitions: list[dict]
) -> DatasetSchema:
    """
    Create a custom dataset schema for training.

    Args:
        schema_name: Name for the schema
        entity_definitions: List of entity type definitions; each dict has
            "name", "display_name", and an optional "properties" list

    Returns:
        DatasetSchema: Custom schema definition
    """
    from google.cloud.documentai_v1beta3.types import DocumentSchema

    entity_types = []

    for entity_def in entity_definitions:
        properties = []

        for prop_def in entity_def.get("properties", []):
            prop = DocumentSchema.EntityType.Property(
                name=prop_def["name"],
                display_name=prop_def["display_name"],
                value_type=prop_def["value_type"],
                occurrence_type=prop_def.get("occurrence_type", "OPTIONAL_ONCE")
            )
            properties.append(prop)

        entity_type = DocumentSchema.EntityType(
            name=entity_def["name"],
            display_name=entity_def["display_name"],
            properties=properties
        )
        entity_types.append(entity_type)

    return DatasetSchema(
        display_name=schema_name,
        entity_types=entity_types,
        description=f"Custom schema: {schema_name}"
    )


# Example usage
entity_definitions = [
    {
        "name": "contract_date",
        "display_name": "Contract Date",
        "properties": [
            {
                "name": "date_value",
                "display_name": "Date Value",
                "value_type": "date",
                "occurrence_type": "REQUIRED_ONCE"
            }
        ]
    },
    {
        "name": "contract_parties",
        "display_name": "Contract Parties",
        "properties": [
            {
                "name": "party_name",
                "display_name": "Party Name",
                "value_type": "text",
                "occurrence_type": "REQUIRED_MULTIPLE"
            }
        ]
    }
]

schema = create_custom_schema("Contract Analysis Schema", entity_definitions)
```

### Document Management

#### DocumentMetadata

```python { .api }
from google.cloud.documentai_v1beta3.types import (
    DocumentMetadata,
    DocumentId,
    DocumentLabelingState,
)


class DocumentMetadata:
    """
    Metadata for documents in a dataset.

    Attributes:
        document_id (DocumentId): Document identifier
        page_count (int): Number of pages in document
        dataset_type (DatasetSplitType): Dataset split type (TRAIN, TEST, etc.)
        labeling_state (DocumentLabelingState): Document labeling status
        display_name (str): Human-readable document name
    """
    pass


class DocumentId:
    """
    Identifier for a document within a dataset.

    Attributes:
        gcs_managed_doc_id (str): Cloud Storage managed document ID
        unmanaged_doc_id (str): User-managed document ID
        revision_ref (RevisionRef): Reference to document revision
    """
    pass


class DocumentLabelingState:
    """
    Enum describing document labeling status.

    Values:
        DOCUMENT_LABELING_STATE_UNSPECIFIED: Unspecified state
        DOCUMENT_LABELED: Document is labeled
        DOCUMENT_UNLABELED: Document is not labeled
        DOCUMENT_AUTO_LABELED: Document is auto-labeled
    """
    DOCUMENT_LABELING_STATE_UNSPECIFIED = 0
    DOCUMENT_LABELED = 1
    DOCUMENT_UNLABELED = 2
    DOCUMENT_AUTO_LABELED = 3


# Example usage
def list_dataset_documents(
    client: "DocumentServiceClient",  # forward ref: defined in the DocumentServiceClient snippet above
    project_id: str,
    location: str,
    processor_id: str
) -> list[DocumentMetadata]:
    """
    List all documents in a dataset with metadata.

    Args:
        client: DocumentServiceClient instance
        project_id: Google Cloud project ID
        location: Processor location
        processor_id: Processor ID

    Returns:
        list[DocumentMetadata]: List of document metadata
    """
    from google.cloud.documentai_v1beta3.types import ListDocumentsRequest

    # Build dataset parent path
    parent = f"projects/{project_id}/locations/{location}/processors/{processor_id}/dataset"

    request = ListDocumentsRequest(parent=parent)
    response = client.list_documents(request=request)

    # Materialize the (possibly paginated) response into a plain list
    return list(response.document_metadata)


def filter_labeled_documents(
    document_metadata_list: list[DocumentMetadata]
) -> list[DocumentMetadata]:
    """
    Filter documents that are labeled and ready for training.

    Args:
        document_metadata_list: List of document metadata

    Returns:
        list[DocumentMetadata]: Filtered labeled documents
    """
    return [
        doc for doc in document_metadata_list
        if doc.labeling_state == DocumentLabelingState.DOCUMENT_LABELED
    ]
```

## Enhanced Document Processing

### Import Processor Version (Beta)

```python { .api }
from google.cloud.documentai_v1beta3 import DocumentProcessorServiceClient
from google.cloud.documentai_v1beta3.types import ImportProcessorVersionRequest


def import_processor_version(
    project_id: str,
    location: str,
    processor_id: str,
    source_processor_version: str
) -> "Operation":
    """
    Import a processor version from another location or project (beta feature).

    Args:
        project_id: Target project ID
        location: Target location
        processor_id: Target processor ID
        source_processor_version: Source processor version to import

    Returns:
        Operation: Long-running operation for the import
    """
    client = DocumentProcessorServiceClient()

    # Build parent processor path
    parent = client.processor_path(project_id, location, processor_id)

    # Create import request
    request = ImportProcessorVersionRequest(
        parent=parent,
        processor_version_source=source_processor_version
    )

    # Start import operation
    operation = client.import_processor_version(request=request)

    print("Importing processor version...")
    print(f"Operation: {operation.operation.name}")

    return operation


# Example usage
operation = import_processor_version(
    project_id="target-project",
    location="us",
    processor_id="target-processor-id",
    source_processor_version="projects/source-project/locations/eu/processors/source-id/processorVersions/version-id"
)

# Monitor import progress
result = operation.result()  # Wait for completion
print(f"Import completed: {result}")
```

### Enhanced Schema Types

#### EntityTypeMetadata

```python { .api }
from google.cloud.documentai_v1beta3.types import EntityTypeMetadata


class EntityTypeMetadata:
    """
    Metadata for entity types in document schema (beta feature).

    Attributes:
        inactive (bool): Whether entity type is inactive
        description (str): Description of the entity type
    """

    def __init__(self, description: str = "", inactive: bool = False):
        """
        Initialize entity type metadata.

        Args:
            description: Entity type description
            inactive: Whether entity type is inactive
        """
        self.description = description
        self.inactive = inactive
```

#### SummaryOptions

```python { .api }
from google.cloud.documentai_v1beta3.types import SummaryOptions


class SummaryOptions:
    """
    Options for document summarization (beta feature).

    Attributes:
        length (SummaryOptions.Length): Summary length preference
        format_ (SummaryOptions.Format): Summary format preference
    """

    class Length:
        """Summary length options."""
        BRIEF = "BRIEF"
        MODERATE = "MODERATE"
        COMPREHENSIVE = "COMPREHENSIVE"

    class Format:
        """Summary format options."""
        PARAGRAPH = "PARAGRAPH"
        BULLETS = "BULLETS"
        STRUCTURED = "STRUCTURED"

    def __init__(
        self,
        length: str = "MODERATE",
        format_: str = "PARAGRAPH"
    ):
        """
        Initialize summary options.

        Args:
            length: Summary length preference
            format_: Summary format preference
        """
        self.length = length
        self.format_ = format_
```

## Beta Enums and Constants

### DatasetSplitType

```python { .api }
from google.cloud.documentai_v1beta3.types import DatasetSplitType


class DatasetSplitType:
    """
    Enum for dataset split types used in training (beta feature).

    Values:
        DATASET_SPLIT_TYPE_UNSPECIFIED: Unspecified split type
        DATASET_SPLIT_TRAIN: Training dataset
        DATASET_SPLIT_TEST: Test dataset
        DATASET_SPLIT_UNASSIGNED: Unassigned documents
    """
    DATASET_SPLIT_TYPE_UNSPECIFIED = 0
    DATASET_SPLIT_TRAIN = 1
    DATASET_SPLIT_TEST = 2
    DATASET_SPLIT_UNASSIGNED = 3


def categorize_documents_by_split(
    document_metadata_list: list["DocumentMetadata"]  # forward ref: see DocumentMetadata snippet
) -> dict[str, list["DocumentMetadata"]]:
    """
    Categorize documents by their dataset split type.

    Args:
        document_metadata_list: List of document metadata

    Returns:
        dict: Documents organized by split type ("train", "test",
            "unassigned", "unspecified")
    """
    categorized = {
        "train": [],
        "test": [],
        "unassigned": [],
        "unspecified": []
    }

    for doc in document_metadata_list:
        if doc.dataset_type == DatasetSplitType.DATASET_SPLIT_TRAIN:
            categorized["train"].append(doc)
        elif doc.dataset_type == DatasetSplitType.DATASET_SPLIT_TEST:
            categorized["test"].append(doc)
        elif doc.dataset_type == DatasetSplitType.DATASET_SPLIT_UNASSIGNED:
            categorized["unassigned"].append(doc)
        else:
            categorized["unspecified"].append(doc)

    return categorized
```

## Complete Beta Feature Example

### Custom Processor Training Workflow

```python { .api }
from google.cloud.documentai_v1beta3 import (
    DocumentServiceClient,
    DocumentProcessorServiceClient,
)
from google.cloud.documentai_v1beta3.types import (
    Dataset,
    DatasetSchema,
    ImportDocumentsRequest,
    TrainProcessorVersionRequest,
    ListDocumentsRequest,
)


def complete_custom_training_workflow(
    project_id: str,
    location: str,
    processor_type: str = "CUSTOM_EXTRACTION_PROCESSOR"
):
    """
    Complete workflow for training a custom processor using beta features.

    Uses ``create_custom_schema`` and ``filter_labeled_documents`` defined
    earlier in this guide.

    Args:
        project_id: Google Cloud project ID
        location: Processing location
        processor_type: Type of custom processor to train

    Returns:
        dict: Summary with processor ID, dataset path, and document counts
    """

    # Initialize clients
    doc_service = DocumentServiceClient()
    processor_service = DocumentProcessorServiceClient()

    print("=== CUSTOM PROCESSOR TRAINING WORKFLOW ===")

    # Step 1: Create processor for training
    print("1. Creating custom processor...")

    from google.cloud.documentai_v1beta3.types import CreateProcessorRequest, Processor

    parent = processor_service.common_location_path(project_id, location)
    processor = Processor(
        display_name="Custom Contract Processor",
        type_=processor_type
    )

    create_request = CreateProcessorRequest(
        parent=parent,
        processor=processor
    )

    created_processor = processor_service.create_processor(request=create_request)
    processor_id = created_processor.name.split('/')[-1]

    print(f"Created processor: {processor_id}")

    # Step 2: Setup dataset schema
    print("2. Creating dataset schema...")

    entity_definitions = [
        {
            "name": "contract_date",
            "display_name": "Contract Date",
            "properties": [
                {
                    "name": "date_value",
                    "display_name": "Date Value",
                    "value_type": "date",
                    "occurrence_type": "REQUIRED_ONCE"
                }
            ]
        },
        {
            "name": "contract_value",
            "display_name": "Contract Value",
            "properties": [
                {
                    "name": "money_value",
                    "display_name": "Money Value",
                    "value_type": "money",
                    "occurrence_type": "REQUIRED_ONCE"
                }
            ]
        },
        {
            "name": "party_names",
            "display_name": "Party Names",
            "properties": [
                {
                    "name": "text_value",
                    "display_name": "Text Value",
                    "value_type": "text",
                    "occurrence_type": "REQUIRED_MULTIPLE"
                }
            ]
        }
    ]

    schema = create_custom_schema("Contract Schema", entity_definitions)

    # Step 3: Import training documents
    print("3. Importing training documents...")

    dataset_parent = f"projects/{project_id}/locations/{location}/processors/{processor_id}/dataset"

    # Configure document import from Cloud Storage
    batch_documents_input_config = {
        "gcs_prefix": {"gcs_uri_prefix": "gs://my-training-bucket/contracts/"}
    }

    import_request = ImportDocumentsRequest(
        dataset=dataset_parent,
        batch_documents_input_config=batch_documents_input_config
    )

    import_operation = doc_service.import_documents(request=import_request)

    print("Importing documents...")
    import_result = import_operation.result()  # Wait for completion
    print("Documents imported successfully")

    # Step 4: Check dataset status
    print("4. Checking dataset status...")

    list_request = ListDocumentsRequest(parent=dataset_parent)
    list_response = doc_service.list_documents(request=list_request)

    total_docs = len(list_response.document_metadata)
    labeled_docs = len(filter_labeled_documents(list_response.document_metadata))

    print(f"Total documents: {total_docs}")
    print(f"Labeled documents: {labeled_docs}")

    # Step 5: Train processor version (if sufficient labeled data)
    if labeled_docs >= 10:  # Minimum for training
        print("5. Starting processor training...")

        processor_parent = processor_service.processor_path(
            project_id, location, processor_id
        )

        train_request = TrainProcessorVersionRequest(
            parent=processor_parent,
            processor_version={
                "display_name": "Contract Processor v1.0",
                "document_schema": schema
            },
            input_data={
                "training_documents": {
                    "gcs_prefix": {"gcs_uri_prefix": "gs://my-training-bucket/contracts/labeled/"}
                },
                "test_documents": {
                    "gcs_prefix": {"gcs_uri_prefix": "gs://my-training-bucket/contracts/test/"}
                }
            }
        )

        train_operation = processor_service.train_processor_version(request=train_request)

        print(f"Training started: {train_operation.operation.name}")
        print("Training typically takes several hours. Monitor progress using the operation name.")

    else:
        print(f"Insufficient labeled documents ({labeled_docs}). Need at least 10 for training.")

    return {
        "processor_id": processor_id,
        "dataset_parent": dataset_parent,
        "total_documents": total_docs,
        "labeled_documents": labeled_docs
    }
```

### Beta Feature Monitoring

```python { .api }
def monitor_beta_operations(
    project_id: str,
    location: str
) -> dict:
    """
    Monitor various beta operations and provide status.

    Args:
        project_id: Google Cloud project ID
        location: Processing location

    Returns:
        dict: Status of beta operations, keyed by operation category
    """
    from google.api_core import operations_v1
    from google.auth import default

    credentials, _ = default()
    operations_client = operations_v1.OperationsClient(credentials=credentials)

    # List all operations for the location
    name = f"projects/{project_id}/locations/{location}"

    beta_operations = {
        "import_documents": [],
        "train_processor": [],
        "import_processor_version": [],
        "other": []
    }

    for operation in operations_client.list_operations(name=name):
        op_info = {
            "name": operation.name,
            "done": operation.done,
            "error": operation.error.message if operation.error else None
        }

        # Categorize by operation type
        if "importDocuments" in operation.name:
            beta_operations["import_documents"].append(op_info)
        elif "trainProcessorVersion" in operation.name:
            beta_operations["train_processor"].append(op_info)
        elif "importProcessorVersion" in operation.name:
            beta_operations["import_processor_version"].append(op_info)
        else:
            beta_operations["other"].append(op_info)

    return beta_operations


def print_beta_status(project_id: str, location: str):
    """Print status of beta operations."""
    status = monitor_beta_operations(project_id, location)

    print("=== BETA OPERATIONS STATUS ===")

    for op_type, operations in status.items():
        print(f"\n{op_type.replace('_', ' ').title()} Operations ({len(operations)}):")

        for op in operations:
            status_text = "✓ Complete" if op["done"] else "⏳ Running"
            error_text = f" (Error: {op['error']})" if op["error"] else ""
            print(f"  - {op['name'].split('/')[-1]}: {status_text}{error_text}")


# Example usage
if __name__ == "__main__":
    # Run custom training workflow
    result = complete_custom_training_workflow(
        project_id="my-project",
        location="us"
    )

    print("\nWorkflow completed:")
    print(f"Processor ID: {result['processor_id']}")
    print(f"Dataset: {result['dataset_parent']}")
    print(f"Documents: {result['labeled_documents']}/{result['total_documents']} labeled")

    # Monitor operations
    print_beta_status("my-project", "us")
```

## Migration from V1 to V1beta3

### Compatibility Notes

```python { .api }
# V1 API (stable) - continues to work
from google.cloud.documentai import DocumentProcessorServiceClient as V1Client
from google.cloud.documentai.types import ProcessRequest as V1ProcessRequest

# V1beta3 API (beta) - includes all v1 functionality + beta features
from google.cloud.documentai_v1beta3 import DocumentProcessorServiceClient as V1Beta3Client
from google.cloud.documentai_v1beta3.types import ProcessRequest as V1Beta3ProcessRequest


def migrate_to_beta_client():
    """
    Example showing migration from v1 to v1beta3 client.

    V1beta3 client is backward compatible with v1 API calls.
    """

    # V1 approach (still works)
    v1_client = V1Client()

    # V1beta3 approach (recommended for new features)
    v1beta3_client = V1Beta3Client()

    # Both clients support the same core operations
    processor_name = "projects/my-project/locations/us/processors/abc123"

    # Same request works with both clients
    from google.cloud.documentai_v1beta3.types import RawDocument

    raw_doc = RawDocument(content=b"document content", mime_type="application/pdf")
    request = V1Beta3ProcessRequest(name=processor_name, raw_document=raw_doc)

    # Both calls work identically
    v1_result = v1_client.process_document(request=request)
    v1beta3_result = v1beta3_client.process_document(request=request)

    # But only v1beta3 client supports beta features
    try:
        # This only works with v1beta3 client
        import_operation = v1beta3_client.import_processor_version(
            # import request
        )
        print("Beta feature available")
    except AttributeError:
        print("Beta feature not available in v1 client")


# Best practice: Use v1beta3 for new projects to access all features
def recommended_client_usage():
    """Recommended pattern for using v1beta3 client."""

    # Use v1beta3 client for all operations
    client = V1Beta3Client()

    # Standard operations work normally
    # Beta operations are available when needed

    return client
```

This comprehensive guide covers all beta features available in Google Cloud Document AI v1beta3, including dataset management, enhanced processing capabilities, and migration strategies from the stable v1 API.