
tessl/pypi-google-cloud-videointelligence

Python client for the Video Intelligence API, which makes videos searchable and discoverable by extracting metadata with machine learning.


Results and Data Types

Structured data types for representing video analysis results. These classes contain annotations for detected objects, faces, text, speech, and other content with timestamps and confidence scores.

Capabilities

Core Response Types

Main response objects returned from video analysis operations.

class AnnotateVideoResponse:
    """
    Video annotation response. Contains annotation results for one or more videos.
    
    Attributes:
        annotation_results: Annotation results for all videos specified in AnnotateVideoRequest
    """
    
    annotation_results: MutableSequence[VideoAnnotationResults]

class VideoAnnotationResults:
    """
    Annotation results for a single video.
    
    Attributes:
        input_uri: Video file location in Google Cloud Storage
        segment_label_annotations: Label annotations on video level or user-specified segment level
        shot_label_annotations: Label annotations on shot level
        frame_label_annotations: Label annotations on frame level
        face_annotations: Deprecated. Face detection annotations (use face_detection_annotations instead)
        face_detection_annotations: Face detection annotations
        shot_annotations: Shot annotations, represented as a list of shots
        explicit_annotation: Explicit content annotation
        speech_transcriptions: Speech transcriptions
        text_annotations: OCR text detection and tracking
        object_annotations: Annotations for objects detected and tracked in the video
        logo_recognition_annotations: Annotations for logos detected, tracked, and recognized in the video
        person_detection_annotations: Person detection annotations
        celebrity_recognition_annotations: Celebrity recognition annotations (available in v1p3beta1)
        error: If processing failed, this field contains the details of the failure
    """
    
    input_uri: str
    segment_label_annotations: MutableSequence[LabelAnnotation]
    shot_label_annotations: MutableSequence[LabelAnnotation]
    frame_label_annotations: MutableSequence[LabelAnnotation]
    face_annotations: MutableSequence[FaceAnnotation]
    face_detection_annotations: MutableSequence[FaceDetectionAnnotation]
    shot_annotations: MutableSequence[VideoSegment]
    explicit_annotation: ExplicitContentAnnotation
    speech_transcriptions: MutableSequence[SpeechTranscription]
    text_annotations: MutableSequence[TextAnnotation]
    object_annotations: MutableSequence[ObjectTrackingAnnotation]
    logo_recognition_annotations: MutableSequence[LogoRecognitionAnnotation]
    person_detection_annotations: MutableSequence[PersonDetectionAnnotation]
    celebrity_recognition_annotations: MutableSequence[CelebrityRecognitionAnnotation]
    error: status_pb2.Status
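Before reading any annotations, it is worth checking the per-video `error` field. A minimal sketch of that check, using `types.SimpleNamespace` stand-ins for the proto messages (the URIs and error messages here are hypothetical):

```python
from types import SimpleNamespace

def failed_videos(annotation_results):
    """Return (input_uri, message) for each video whose processing failed.
    A google.rpc Status code of 0 means OK."""
    return [
        (r.input_uri, r.error.message)
        for r in annotation_results
        if r.error.code != 0
    ]

# Stand-ins mimicking VideoAnnotationResults and status_pb2.Status.
results = [
    SimpleNamespace(input_uri="gs://b/ok.mp4",
                    error=SimpleNamespace(code=0, message="")),
    SimpleNamespace(input_uri="gs://b/bad.mp4",
                    error=SimpleNamespace(code=3, message="invalid input")),
]
print(failed_videos(results))  # → [('gs://b/bad.mp4', 'invalid input')]
```

On a real response, the same helper can take `result.annotation_results` directly.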

Label Detection Results

Results from label detection analysis, including detected objects, activities, and concepts.

class LabelAnnotation:
    """
    Label annotation.
    
    Attributes:
        entity: Detected entity from Video Intelligence API
        category_entities: Common categories for the detected entity
        segments: All video segments where a label was detected
        frames: All video frames where a label was detected
    """
    
    entity: Entity
    category_entities: MutableSequence[Entity]
    segments: MutableSequence[LabelSegment]
    frames: MutableSequence[LabelFrame]

class LabelSegment:
    """
    Video segment level annotation results for label detection.
    
    Attributes:
        segment: Video segment where a label was detected
        confidence: Confidence that the label is accurate (0.0 to 1.0)
    """
    
    segment: VideoSegment
    confidence: float

class LabelFrame:
    """
    Video frame level annotation results for label detection.
    
    Attributes:
        time_offset: Time-offset, relative to the beginning of the video, corresponding to the video frame for this location
        confidence: Confidence that the label is accurate (0.0 to 1.0)
    """
    
    time_offset: duration_pb2.Duration
    confidence: float

class Entity:
    """
    Detected entity from Video Intelligence API.
    
    Attributes:
        entity_id: Opaque entity ID. Some IDs may be available in Google Knowledge Graph Search API
        description: Textual description, e.g., "Fixed-gear bicycle"
        language_code: Language code for description in BCP-47 format
    """
    
    entity_id: str
    description: str
    language_code: str
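The label types above compose naturally into summaries. A minimal sketch that maps each label description to its best segment confidence, using `SimpleNamespace` stand-ins for `LabelAnnotation` / `LabelSegment` (the label data is hypothetical):

```python
from types import SimpleNamespace

def summarize_labels(label_annotations, min_confidence=0.5):
    """Map each label description to its highest segment confidence,
    keeping only labels that clear the threshold."""
    summary = {}
    for label in label_annotations:
        best = max((s.confidence for s in label.segments), default=0.0)
        if best >= min_confidence:
            summary[label.entity.description] = best
    return summary

# Hypothetical stand-ins for LabelAnnotation objects.
labels = [
    SimpleNamespace(entity=SimpleNamespace(description="bicycle"),
                    segments=[SimpleNamespace(confidence=0.91),
                              SimpleNamespace(confidence=0.72)]),
    SimpleNamespace(entity=SimpleNamespace(description="tree"),
                    segments=[SimpleNamespace(confidence=0.31)]),
]
print(summarize_labels(labels))  # → {'bicycle': 0.91}
```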

Face Detection Results

Results from face detection and tracking analysis.

class FaceDetectionAnnotation:
    """
    Face detection annotation.
    
    Attributes:
        version: Feature version
        tracks: The face tracks with attributes
        thumbnail: The thumbnail of a person's face
    """
    
    version: str
    tracks: MutableSequence[Track]
    thumbnail: bytes

class FaceAnnotation:
    """
    Deprecated. No effect.
    
    Attributes:
        thumbnail: The thumbnail of a person's face
        segments: All video segments where a face was detected
    """
    
    thumbnail: bytes
    segments: MutableSequence[FaceSegment]

class FaceSegment:
    """
    Video segment level annotation results for face detection.
    
    Attributes:
        segment: Video segment where a face was detected
    """
    
    segment: VideoSegment

class FaceFrame:
    """
    Deprecated. No effect.
    
    Attributes:
        normalized_bounding_boxes: Normalized Bounding boxes in a frame
        time_offset: Time-offset, relative to the beginning of the video, corresponding to the video frame for this location
    """
    
    normalized_bounding_boxes: MutableSequence[NormalizedBoundingBox]
    time_offset: duration_pb2.Duration

class Track:
    """
    A track of an object instance.
    
    Attributes:
        segment: Video segment of a track
        timestamped_objects: The object with timestamp and attributes per frame in the track
        attributes: Optional. Attributes in the track level
        confidence: Optional. The confidence score of the tracked object
    """
    
    segment: VideoSegment
    timestamped_objects: MutableSequence[TimestampedObject]
    attributes: MutableSequence[DetectedAttribute]
    confidence: float

class TimestampedObject:
    """
    For tracking the object throughout the video.
    
    Attributes:
        normalized_bounding_box: Normalized Bounding box location of this object track for the frame
        time_offset: Time-offset, relative to the beginning of the video, corresponding to the video frame for this location
        attributes: Optional. The attributes of the object in the bounding box
        landmarks: Optional. The detected landmarks
    """
    
    normalized_bounding_box: NormalizedBoundingBox
    time_offset: duration_pb2.Duration
    attributes: MutableSequence[DetectedAttribute]
    landmarks: MutableSequence[DetectedLandmark]

class DetectedAttribute:
    """
    A generic detected attribute represented by name in string format.
    
    Attributes:
        name: The name of the attribute, for example, glasses, dark_glasses, mouth_open
        confidence: Detected attribute confidence (0.0 to 1.0)
        value: Text value of the detection result
    """
    
    name: str
    confidence: float
    value: str

class DetectedLandmark:
    """
    A generic detected landmark represented by name in string format and a 2D location.
    
    Attributes:
        name: The name of this landmark, for example, left_hand, right_shoulder
        point: The 2D point of the detected landmark using the normalized image coordinate system
        confidence: The confidence score of the detected landmark (0.0 to 1.0)
    """
    
    name: str
    point: NormalizedVertex
    confidence: float
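`Track` and `DetectedAttribute` are shared by the face, person, and logo features, so attribute filtering works the same way everywhere. A sketch that selects tracks carrying a named track-level attribute, with `SimpleNamespace` stand-ins (attribute names and scores are hypothetical):

```python
from types import SimpleNamespace

def tracks_with_attribute(tracks, attribute_name, min_confidence=0.5):
    """Return tracks whose track-level attributes include the named
    attribute at or above the confidence threshold."""
    matched = []
    for track in tracks:
        for attr in track.attributes:
            if attr.name == attribute_name and attr.confidence >= min_confidence:
                matched.append(track)
                break
    return matched

# Hypothetical stand-ins for Track / DetectedAttribute.
tracks = [
    SimpleNamespace(attributes=[SimpleNamespace(name="glasses", confidence=0.88)]),
    SimpleNamespace(attributes=[SimpleNamespace(name="glasses", confidence=0.20)]),
]
print(len(tracks_with_attribute(tracks, "glasses")))  # → 1
```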

Object Tracking Results

Results from object detection and tracking analysis.

class ObjectTrackingAnnotation:
    """
    Annotations corresponding to one tracked object.
    
    Attributes:
        entity: Entity to specify the object category that this track is labeled as
        confidence: Object category's labeling confidence of this track
        frames: Information corresponding to all frames where this object track appears
        segment: Non-streaming batch mode ONLY. Each object track corresponds to one video segment where it appears
        track_id: Streaming mode ONLY. In streaming mode, we do not know the end time of a tracked object before it is completed
        version: Feature version
    """
    
    entity: Entity
    confidence: float
    frames: MutableSequence[ObjectTrackingFrame]
    segment: VideoSegment
    track_id: int
    version: str

class ObjectTrackingFrame:
    """
    Video frame level annotations for object detection and tracking.
    
    Attributes:
        normalized_bounding_box: The normalized bounding box location of this object track for the frame
        time_offset: The timestamp of the frame in microseconds
    """
    
    normalized_bounding_box: NormalizedBoundingBox
    time_offset: duration_pb2.Duration

Text Detection Results

Results from optical character recognition (OCR) analysis.

class TextAnnotation:
    """
    Annotations related to one detected OCR text snippet.
    
    Attributes:
        text: The detected text
        segments: All video segments where OCR detected text appears
        version: Feature version
    """
    
    text: str
    segments: MutableSequence[TextSegment]
    version: str

class TextSegment:
    """
    Video segment level annotation results for text detection.
    
    Attributes:
        segment: Video segment where a text snippet was detected
        confidence: Confidence for the track of detected text
        frames: Information related to the frames where OCR detected text appears
    """
    
    segment: VideoSegment
    confidence: float
    frames: MutableSequence[TextFrame]

class TextFrame:
    """
    Video frame level annotation results for text annotation (OCR).
    
    Attributes:
        rotated_bounding_box: Bounding polygon of the detected text for this frame
        time_offset: Timestamp of this frame
    """
    
    rotated_bounding_box: NormalizedBoundingPoly
    time_offset: duration_pb2.Duration

Speech Transcription Results

Results from speech-to-text analysis.

class SpeechTranscription:
    """
    A speech recognition result corresponding to a portion of the audio.
    
    Attributes:
        alternatives: May contain one or more recognition hypotheses
        language_code: Output only. The BCP-47 language tag of the language in this result
    """
    
    alternatives: MutableSequence[SpeechRecognitionAlternative]
    language_code: str

class SpeechRecognitionAlternative:
    """
    Alternative hypotheses (a.k.a. n-best list).
    
    Attributes:
        transcript: Transcript text representing the words that the user spoke
        confidence: Output only. The confidence estimate between 0.0 and 1.0
        words: Output only. A list of word-specific information for each recognized word
    """
    
    transcript: str
    confidence: float
    words: MutableSequence[WordInfo]

class WordInfo:
    """
    Word-specific information for recognized words. Word-specific information is only populated if the client requests it.
    
    Attributes:
        start_time: Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word
        end_time: Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word
        word: The word corresponding to this set of information
        confidence: Output only. The confidence estimate between 0.0 and 1.0
        speaker_tag: Output only. A distinct integer value is assigned for every speaker within the audio
    """
    
    start_time: duration_pb2.Duration
    end_time: duration_pb2.Duration
    word: str
    confidence: float
    speaker_tag: int
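When speaker diarization is enabled, `speaker_tag` lets you rebuild a per-speaker transcript from the word list. A minimal sketch over `SimpleNamespace` stand-ins for `WordInfo` (the words are hypothetical):

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_speaker(words):
    """Concatenate recognized words into one transcript per speaker_tag."""
    by_speaker = defaultdict(list)
    for w in words:
        by_speaker[w.speaker_tag].append(w.word)
    return {tag: " ".join(tokens) for tag, tokens in by_speaker.items()}

# Hypothetical stand-ins for WordInfo.
words = [
    SimpleNamespace(word="hello", speaker_tag=1),
    SimpleNamespace(word="there", speaker_tag=1),
    SimpleNamespace(word="hi", speaker_tag=2),
]
print(group_by_speaker(words))  # → {1: 'hello there', 2: 'hi'}
```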

Explicit Content Detection Results

Results from explicit content detection analysis.

class ExplicitContentAnnotation:
    """
    Explicit content annotation (based on per-frame visual signals only).
    
    Attributes:
        frames: All video frames where explicit content was detected
        version: Feature version
    """
    
    frames: MutableSequence[ExplicitContentFrame]
    version: str

class ExplicitContentFrame:
    """
    Video frame level annotation results for explicit content.
    
    Attributes:
        time_offset: Time-offset, relative to the beginning of the video, corresponding to the video frame for this location
        pornography_likelihood: Likelihood of the pornography content
    """
    
    time_offset: duration_pb2.Duration
    pornography_likelihood: Likelihood
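Explicit-content frames are typically filtered by likelihood. A sketch assuming the API's `Likelihood` enum values (`VERY_UNLIKELY` = 1 through `VERY_LIKELY` = 5), with `SimpleNamespace`/`timedelta` stand-ins for `ExplicitContentFrame` (the frame data is hypothetical):

```python
from datetime import timedelta
from types import SimpleNamespace

LIKELY = 4  # Likelihood enum: LIKELY = 4, VERY_LIKELY = 5

def flag_explicit_frames(frames, threshold=LIKELY):
    """Return the time offsets (seconds) of frames whose
    pornography_likelihood is at or above the threshold."""
    return [
        f.time_offset.total_seconds()
        for f in frames
        if f.pornography_likelihood >= threshold
    ]

# Hypothetical stand-ins for ExplicitContentFrame.
frames = [
    SimpleNamespace(time_offset=timedelta(seconds=1.5), pornography_likelihood=2),
    SimpleNamespace(time_offset=timedelta(seconds=3.0), pornography_likelihood=5),
]
print(flag_explicit_frames(frames))  # → [3.0]
```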

Person Detection Results

Results from person detection analysis.

class PersonDetectionAnnotation:
    """
    Person detection annotation per video.
    
    Attributes:
        tracks: The detected tracks of a person
        version: Feature version
    """
    
    tracks: MutableSequence[Track]
    version: str
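There is no person-detection walkthrough in the usage examples below, so here is a minimal sketch that lists the time span of every detected person track, using `SimpleNamespace`/`timedelta` stand-ins for the proto messages (the data is hypothetical):

```python
from datetime import timedelta
from types import SimpleNamespace

def person_track_spans(person_annotations):
    """List (start_s, end_s) spans for every detected person track."""
    spans = []
    for annotation in person_annotations:
        for track in annotation.tracks:
            seg = track.segment
            spans.append((seg.start_time_offset.total_seconds(),
                          seg.end_time_offset.total_seconds()))
    return spans

# Hypothetical stand-ins for PersonDetectionAnnotation / Track / VideoSegment.
annotations = [SimpleNamespace(tracks=[
    SimpleNamespace(segment=SimpleNamespace(
        start_time_offset=timedelta(seconds=2),
        end_time_offset=timedelta(seconds=5))),
])]
print(person_track_spans(annotations))  # → [(2.0, 5.0)]
```

On a real response, pass `annotation_result.person_detection_annotations` to the helper.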

Logo Recognition Results

Results from logo detection and recognition analysis.

class LogoRecognitionAnnotation:
    """
    Annotation corresponding to one detected, tracked and recognized logo class.
    
    Attributes:
        entity: Entity category information to specify the logo class that all the logo tracks within this LogoRecognitionAnnotation are recognized as
        tracks: All logo tracks where the recognized logo appears
        segments: All video segments where the recognized logo appears
    """
    
    entity: Entity
    tracks: MutableSequence[Track]
    segments: MutableSequence[VideoSegment]
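A common question for logo recognition is total on-screen time, which falls out of the `segments` list. A sketch with `SimpleNamespace`/`timedelta` stand-ins for `LogoRecognitionAnnotation` (the brand and times are hypothetical):

```python
from datetime import timedelta
from types import SimpleNamespace

def logo_screen_time(logo_annotation):
    """Sum the duration (seconds) of all segments where the logo appears."""
    return sum(
        (seg.end_time_offset - seg.start_time_offset).total_seconds()
        for seg in logo_annotation.segments
    )

# Hypothetical stand-in for a LogoRecognitionAnnotation.
logo = SimpleNamespace(
    entity=SimpleNamespace(description="ExampleBrand"),
    segments=[
        SimpleNamespace(start_time_offset=timedelta(seconds=0),
                        end_time_offset=timedelta(seconds=3)),
        SimpleNamespace(start_time_offset=timedelta(seconds=10),
                        end_time_offset=timedelta(seconds=12)),
    ],
)
print(logo_screen_time(logo))  # → 5.0
```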

Celebrity Recognition Results (Beta)

Results from celebrity recognition analysis available in v1p3beta1.

class CelebrityRecognitionAnnotation:
    """
    Celebrity recognition annotation per video.
    
    Attributes:
        celebrity_tracks: The tracks detected from the input video, including recognized celebrities and other detected faces
    """
    
    celebrity_tracks: MutableSequence[CelebrityTrack]

class CelebrityTrack:
    """
    The annotation result of a celebrity face track.
    
    Attributes:
        celebrities: Top N match of the celebrities for the face in this track
        face_track: A track of a person's face
    """
    
    celebrities: MutableSequence[RecognizedCelebrity]
    face_track: Track

class RecognizedCelebrity:
    """
    The recognized celebrity with confidence score.
    
    Attributes:
        celebrity: The recognized celebrity
        confidence: Recognition confidence (0.0 to 1.0)
    """
    
    celebrity: Celebrity
    confidence: float

class Celebrity:
    """
    Celebrity definition.
    
    Attributes:
        name: The resource name of the celebrity (format: video-intelligence/kg-mid)
        display_name: The celebrity name
        description: Textual description of additional information about the celebrity
    """
    
    name: str
    display_name: str
    description: str

Geometric Data Types

Data types for representing spatial information in videos.

class NormalizedBoundingBox:
    """
    Normalized bounding box. The normalized vertex coordinates are relative to the original image. Range: [0, 1].
    
    Attributes:
        left: Left X coordinate
        top: Top Y coordinate
        right: Right X coordinate
        bottom: Bottom Y coordinate
    """
    
    left: float
    top: float
    right: float
    bottom: float
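Because the coordinates are normalized to [0, 1], drawing or cropping requires scaling by the frame size. A minimal sketch, with a `SimpleNamespace` stand-in for `NormalizedBoundingBox` (the box values and frame size are hypothetical):

```python
from types import SimpleNamespace

def to_pixel_box(bbox, frame_width, frame_height):
    """Convert a NormalizedBoundingBox ([0, 1] coordinates) into
    integer pixel coordinates for a frame of the given size."""
    return (
        round(bbox.left * frame_width),
        round(bbox.top * frame_height),
        round(bbox.right * frame_width),
        round(bbox.bottom * frame_height),
    )

# Hypothetical stand-in for a NormalizedBoundingBox.
box = SimpleNamespace(left=0.1, top=0.25, right=0.5, bottom=0.75)
print(to_pixel_box(box, 1920, 1080))  # → (192, 270, 960, 810)
```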

class NormalizedBoundingPoly:
    """
    Normalized bounding polygon for text (that might not be axis-aligned).
    
    Attributes:
        vertices: The bounding polygon vertices
    """
    
    vertices: MutableSequence[NormalizedVertex]

class NormalizedVertex:
    """
    A vertex represents a 2D point in the image. The normalized vertex coordinates are relative to the original image and range from 0 to 1.
    
    Attributes:
        x: X coordinate
        y: Y coordinate
    """
    
    x: float
    y: float

Progress and Status Types

Types for tracking operation progress and handling errors.

class AnnotateVideoProgress:
    """
    Video annotation progress. Included in the metadata field of the Operation returned by the GetOperation call of the google::longrunning::Operations service.
    
    Attributes:
        annotation_progress: Progress metadata for all videos specified in AnnotateVideoRequest
    """
    
    annotation_progress: MutableSequence[VideoAnnotationProgress]

class VideoAnnotationProgress:
    """
    Annotation progress for a single video.
    
    Attributes:
        input_uri: Video file location in Google Cloud Storage
        progress_percent: Approximate percentage processed thus far (0-100)
        start_time: Time when the request was received
        update_time: Time of the most recent update
        feature: Specifies which feature is being tracked if the request contains more than one feature
        segment: Specifies which segment is being tracked if the request contains more than one segment
    """
    
    input_uri: str
    progress_percent: int
    start_time: timestamp_pb2.Timestamp
    update_time: timestamp_pb2.Timestamp
    feature: Feature
    segment: VideoSegment
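While a long-running operation is in flight, this progress data is typically available via `operation.metadata.annotation_progress` (an assumption worth verifying against the client version you use). A minimal sketch that reduces it to a URI-to-percent map, with `SimpleNamespace` stand-ins for `VideoAnnotationProgress` (the URI is hypothetical):

```python
from types import SimpleNamespace

def progress_summary(annotation_progress):
    """Map each input URI to its percent complete (0-100)."""
    return {p.input_uri: p.progress_percent for p in annotation_progress}

# Hypothetical stand-in for a VideoAnnotationProgress entry.
progress = [SimpleNamespace(input_uri="gs://bucket/a.mp4", progress_percent=42)]
print(progress_summary(progress))  # → {'gs://bucket/a.mp4': 42}
```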

Usage Examples

Processing Label Detection Results

from google.cloud import videointelligence

# Perform label detection
client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.LABEL_DETECTION],
        "input_uri": "gs://your-bucket/your-video.mp4",
    }
)
result = operation.result(timeout=300)

# Process results
for annotation_result in result.annotation_results:
    print(f"Processing video: {annotation_result.input_uri}")
    
    # Segment-level labels
    for label in annotation_result.segment_label_annotations:
        print(f"\nLabel: {label.entity.description}")
        for segment in label.segments:
            start_time = segment.segment.start_time_offset.total_seconds()
            end_time = segment.segment.end_time_offset.total_seconds()
            print(f"  Segment: {start_time:.1f}s to {end_time:.1f}s (confidence: {segment.confidence:.2f})")
    
    # Frame-level labels
    for label in annotation_result.frame_label_annotations:
        print(f"\nFrame-level label: {label.entity.description}")
        for frame in label.frames[:5]:  # Show first 5 frames
            time_offset = frame.time_offset.total_seconds()
            print(f"  Frame at {time_offset:.1f}s (confidence: {frame.confidence:.2f})")

Processing Face Detection Results

# Process face detection results
for annotation_result in result.annotation_results:
    face_annotations = annotation_result.face_annotations
    
    for face_annotation in face_annotations:
        print("Face detected:")
        for segment in face_annotation.segments:
            start_time = segment.segment.start_time_offset.total_seconds()
            end_time = segment.segment.end_time_offset.total_seconds()
            print(f"  Time: {start_time:.1f}s to {end_time:.1f}s")
    
    # Newer face detection format
    for face_detection in annotation_result.face_detection_annotations:
        for track in face_detection.tracks:
            print(f"Face track (confidence: {track.confidence:.2f}):")
            for timestamped_object in track.timestamped_objects:
                time_offset = timestamped_object.time_offset.total_seconds()
                bbox = timestamped_object.normalized_bounding_box
                print(f"  {time_offset:.1f}s: bbox({bbox.left:.3f}, {bbox.top:.3f}, {bbox.right:.3f}, {bbox.bottom:.3f})")

Processing Speech Transcription Results

# Process speech transcription results
for annotation_result in result.annotation_results:
    for transcription in annotation_result.speech_transcriptions:
        print(f"Language: {transcription.language_code}")
        
        for alternative in transcription.alternatives:
            print(f"Transcript: {alternative.transcript}")
            print(f"Confidence: {alternative.confidence:.2f}")
            
            # Word-level information
            for word_info in alternative.words:
                start_time = word_info.start_time.total_seconds()
                end_time = word_info.end_time.total_seconds()
                print(f"  {word_info.word}: {start_time:.1f}s-{end_time:.1f}s (speaker: {word_info.speaker_tag})")

Processing Object Tracking Results

# Process object tracking results
for annotation_result in result.annotation_results:
    for object_annotation in annotation_result.object_annotations:
        print(f"Object: {object_annotation.entity.description}")
        print(f"Confidence: {object_annotation.confidence:.2f}")
        print(f"Track ID: {object_annotation.track_id}")
        
        # Show first few frames
        for frame in object_annotation.frames[:10]:
            time_offset = frame.time_offset.total_seconds()
            bbox = frame.normalized_bounding_box
            print(f"  {time_offset:.1f}s: ({bbox.left:.3f}, {bbox.top:.3f}) to ({bbox.right:.3f}, {bbox.bottom:.3f})")

Processing Text Detection Results

# Process text detection results
for annotation_result in result.annotation_results:
    for text_annotation in annotation_result.text_annotations:
        print(f"Detected text: {text_annotation.text}")
        
        for segment in text_annotation.segments:
            start_time = segment.segment.start_time_offset.total_seconds()
            end_time = segment.segment.end_time_offset.total_seconds()
            print(f"  Time: {start_time:.1f}s to {end_time:.1f}s (confidence: {segment.confidence:.2f})")
            
            # Frame-level information
            for frame in segment.frames:
                time_offset = frame.time_offset.total_seconds()
                print(f"    Frame at {time_offset:.1f}s")

Processing Celebrity Recognition Results (Beta)

# Process celebrity recognition results. These fields are populated only
# when the request used the v1p3beta1 client with the
# CELEBRITY_RECOGNITION feature.
for annotation_result in result.annotation_results:
    if annotation_result.celebrity_recognition_annotations:
        for celebrity_annotation in annotation_result.celebrity_recognition_annotations:
            print("Celebrity Recognition Results:")
            
            for celebrity_track in celebrity_annotation.celebrity_tracks:
                print(f"  Face track detected:")
                
                # Process recognized celebrities for this track
                for recognized_celebrity in celebrity_track.celebrities:
                    celebrity = recognized_celebrity.celebrity
                    confidence = recognized_celebrity.confidence
                    print(f"    Celebrity: {celebrity.display_name}")
                    print(f"    Confidence: {confidence:.2f}")
                    print(f"    Description: {celebrity.description}")
                    print(f"    Resource Name: {celebrity.name}")
                
                # Process face track information
                face_track = celebrity_track.face_track
                if face_track.segment:
                    start_time = face_track.segment.start_time_offset.total_seconds()
                    end_time = face_track.segment.end_time_offset.total_seconds()
                    print(f"    Track Duration: {start_time:.1f}s to {end_time:.1f}s")
                
                # Show first few timestamped objects
                for timestamped_obj in face_track.timestamped_objects[:5]:
                    time_offset = timestamped_obj.time_offset.total_seconds()
                    bbox = timestamped_obj.normalized_bounding_box
                    print(f"    {time_offset:.1f}s: bbox({bbox.left:.3f}, {bbox.top:.3f}, {bbox.right:.3f}, {bbox.bottom:.3f})")

Install with Tessl CLI

npx tessl i tessl/pypi-google-cloud-videointelligence
