Python client for the Google Cloud Video Intelligence API, which makes videos searchable and discoverable by extracting metadata through machine learning.
Structured data types for representing video analysis results. These classes contain annotations for detected objects, faces, text, speech, and other content with timestamps and confidence scores.
Main response objects returned from video analysis operations.
class AnnotateVideoResponse:
"""
Video annotation response. Contains annotation results for one or more videos.
Attributes:
annotation_results: Annotation results for all videos specified in AnnotateVideoRequest
"""
annotation_results: MutableSequence[VideoAnnotationResults]
class VideoAnnotationResults:
"""
Annotation results for a single video.
Attributes:
input_uri: Video file location in Google Cloud Storage
segment_label_annotations: Label annotations on video level or user-specified segment level
shot_label_annotations: Label annotations on shot level
frame_label_annotations: Label annotations on frame level
face_annotations: Face detection annotations
shot_annotations: Shot annotations which are represented as a list of shots
explicit_annotation: Explicit content annotation
speech_transcriptions: Speech transcription
text_annotations: OCR text detection and tracking
object_annotations: Annotations for list of objects detected and tracked in video
logo_recognition_annotations: Annotations for list of logos detected, tracked and recognized in video
person_detection_annotations: Person detection annotations
celebrity_recognition_annotations: Celebrity recognition annotations (available in v1p3beta1)
error: If processing failed, this field contains the details of the failure
"""
input_uri: str
segment_label_annotations: MutableSequence[LabelAnnotation]
shot_label_annotations: MutableSequence[LabelAnnotation]
frame_label_annotations: MutableSequence[LabelAnnotation]
face_annotations: MutableSequence[FaceAnnotation]
shot_annotations: MutableSequence[VideoSegment]
explicit_annotation: ExplicitContentAnnotation
speech_transcriptions: MutableSequence[SpeechTranscription]
text_annotations: MutableSequence[TextAnnotation]
object_annotations: MutableSequence[ObjectTrackingAnnotation]
logo_recognition_annotations: MutableSequence[LogoRecognitionAnnotation]
person_detection_annotations: MutableSequence[PersonDetectionAnnotation]
celebrity_recognition_annotations: MutableSequence[CelebrityRecognitionAnnotation]
error: status_pb2.Status
Results from label detection analysis, including detected objects, activities, and concepts.
class LabelAnnotation:
"""
Label annotation.
Attributes:
entity: Detected entity from Video Intelligence API
category_entities: Common categories for the detected entity
segments: All video segments where a label was detected
frames: All video frames where a label was detected
"""
entity: Entity
category_entities: MutableSequence[Entity]
segments: MutableSequence[LabelSegment]
frames: MutableSequence[LabelFrame]
class LabelSegment:
"""
Video segment level annotation results for label detection.
Attributes:
segment: Video segment where a label was detected
confidence: Confidence that the label is accurate (0.0 to 1.0)
"""
segment: VideoSegment
confidence: float
class LabelFrame:
"""
Video frame level annotation results for label detection.
Attributes:
time_offset: Time-offset, relative to the beginning of the video, corresponding to the video frame for this location
confidence: Confidence that the label is accurate (0.0 to 1.0)
"""
time_offset: duration_pb2.Duration
confidence: float
class Entity:
"""
Detected entity from Video Intelligence API.
Attributes:
entity_id: Opaque entity ID. Some IDs may be available in Google Knowledge Graph Search API
description: Textual description, e.g., "Fixed-gear bicycle"
language_code: Language code for description in BCP-47 format
"""
entity_id: str
description: str
language_code: str
Results from face detection and tracking analysis.
class FaceDetectionAnnotation:
"""
Face detection annotation.
Attributes:
version: Feature version
tracks: The face tracks with attributes
thumbnail: The thumbnail of a person's face
"""
version: str
tracks: MutableSequence[Track]
thumbnail: bytes
class FaceAnnotation:
"""
Deprecated. No effect.
Attributes:
thumbnail: The thumbnail of a person's face
segments: All video segments where a face was detected
"""
thumbnail: bytes
segments: MutableSequence[FaceSegment]
class FaceSegment:
"""
Video segment level annotation results for face detection.
Attributes:
segment: Video segment where a face was detected
"""
segment: VideoSegment
class FaceFrame:
"""
Deprecated. No effect.
Attributes:
normalized_bounding_boxes: Normalized Bounding boxes in a frame
time_offset: Time-offset, relative to the beginning of the video, corresponding to the video frame for this location
"""
normalized_bounding_boxes: MutableSequence[NormalizedBoundingBox]
time_offset: duration_pb2.Duration
class Track:
"""
A track of an object instance.
Attributes:
segment: Video segment of a track
timestamped_objects: The object with timestamp and attributes per frame in the track
attributes: Optional. Attributes in the track level
confidence: Optional. The confidence score of the tracked object
"""
segment: VideoSegment
timestamped_objects: MutableSequence[TimestampedObject]
attributes: MutableSequence[DetectedAttribute]
confidence: float
class TimestampedObject:
"""
For tracking the object throughout the video.
Attributes:
normalized_bounding_box: Normalized Bounding box location of this object track for the frame
time_offset: Time-offset, relative to the beginning of the video, corresponding to the video frame for this location
attributes: Optional. The attributes of the object in the bounding box
landmarks: Optional. The detected landmarks
"""
normalized_bounding_box: NormalizedBoundingBox
time_offset: duration_pb2.Duration
attributes: MutableSequence[DetectedAttribute]
landmarks: MutableSequence[DetectedLandmark]
class DetectedAttribute:
"""
A generic detected attribute represented by name in string format.
Attributes:
name: The name of the attribute, for example, glasses, dark_glasses, mouth_open
confidence: Detected attribute confidence (0.0 to 1.0)
value: Text value of the detection result
"""
name: str
confidence: float
value: str
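None of the samples later in this document read DetectedAttribute values. A minimal sketch of collecting them from person detection tracks, written as a plain helper so it works on any objects exposing the fields documented above (nothing here calls the API):

```python
def summarize_track_attributes(person_annotations):
    """Collect (name, value, confidence) for every track-level
    DetectedAttribute across person detection annotations.

    Accepts any sequence of annotation objects with .tracks, where each
    track has .attributes carrying .name / .value / .confidence.
    """
    rows = []
    for person in person_annotations:
        for track in person.tracks:
            for attribute in track.attributes:
                rows.append((attribute.name, attribute.value, attribute.confidence))
    return rows
```

With a real response you would pass `annotation_result.person_detection_annotations` (assuming the request enabled person detection with attributes); DetectedLandmark values hang off `track.timestamped_objects[i].landmarks` in the same way.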
class DetectedLandmark:
"""
A generic detected landmark represented by name in string format and a 2D location.
Attributes:
name: The name of this landmark, for example, left_hand, right_shoulder
point: The 2D point of the detected landmark using the normalized image coordinate system
confidence: The confidence score of the detected landmark (0.0 to 1.0)
"""
name: str
point: NormalizedVertex
confidence: float
Results from object detection and tracking analysis.
class ObjectTrackingAnnotation:
"""
Annotations corresponding to one tracked object.
Attributes:
entity: Entity to specify the object category that this track is labeled as
confidence: Object category's labeling confidence of this track
frames: Information corresponding to all frames where this object track appears
segment: Non-streaming batch mode ONLY. Each object track corresponds to one video segment where it appears
track_id: Streaming mode ONLY. In streaming mode, we do not know the end time of a tracked object before it is completed
version: Feature version
"""
entity: Entity
confidence: float
frames: MutableSequence[ObjectTrackingFrame]
segment: VideoSegment
track_id: int
version: str
class ObjectTrackingFrame:
"""
Video frame level annotations for object detection and tracking.
Attributes:
normalized_bounding_box: The normalized bounding box location of this object track for the frame
time_offset: The timestamp of the frame in microseconds
"""
normalized_bounding_box: NormalizedBoundingBox
time_offset: duration_pb2.Duration
Results from optical character recognition (OCR) analysis.
class TextAnnotation:
"""
Annotations related to one detected OCR text snippet.
Attributes:
text: The detected text
segments: All video segments where OCR detected text appears
version: Feature version
"""
text: str
segments: MutableSequence[TextSegment]
version: str
class TextSegment:
"""
Video segment level annotation results for text detection.
Attributes:
segment: Video segment where a text snippet was detected
confidence: Confidence for the track of detected text
frames: Information related to the frames where OCR detected text appears
"""
segment: VideoSegment
confidence: float
frames: MutableSequence[TextFrame]
class TextFrame:
"""
Video frame level annotation results for text annotation (OCR).
Attributes:
rotated_bounding_box: Bounding polygon of the detected text for this frame
time_offset: Timestamp of this frame
"""
rotated_bounding_box: NormalizedBoundingPoly
time_offset: duration_pb2.Duration
Results from speech-to-text analysis.
class SpeechTranscription:
"""
A speech recognition result corresponding to a portion of the audio.
Attributes:
alternatives: May contain one or more recognition hypotheses
language_code: Output only. The BCP-47 language tag of the language in this result
"""
alternatives: MutableSequence[SpeechRecognitionAlternative]
language_code: str
class SpeechRecognitionAlternative:
"""
Alternative hypotheses (a.k.a. n-best list).
Attributes:
transcript: Transcript text representing the words that the user spoke
confidence: Output only. The confidence estimate between 0.0 and 1.0
words: Output only. A list of word-specific information for each recognized word
"""
transcript: str
confidence: float
words: MutableSequence[WordInfo]
class WordInfo:
"""
Word-specific information for recognized words. Word-specific information is only populated if the client requests it.
Attributes:
start_time: Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word
end_time: Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word
word: The word corresponding to this set of information
confidence: Output only. The confidence estimate between 0.0 and 1.0
speaker_tag: Output only. A distinct integer value is assigned for every speaker within the audio
"""
start_time: duration_pb2.Duration
end_time: duration_pb2.Duration
word: str
confidence: float
speaker_tag: int
Results from explicit content detection analysis.
class ExplicitContentAnnotation:
"""
Explicit content annotation (based on per-frame visual signals only).
Attributes:
frames: All video frames where explicit content was detected
version: Feature version
"""
frames: MutableSequence[ExplicitContentFrame]
version: str
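The samples later in this document do not walk through explicit content results. A minimal sketch that flags frames by likelihood, assuming the request enabled `EXPLICIT_CONTENT_DETECTION`; the threshold uses the numeric Likelihood enum values (VERY_UNLIKELY=1 through VERY_LIKELY=5):

```python
def flag_explicit_frames(explicit_annotation, threshold=3):
    """Return (time_offset_seconds, likelihood) pairs for frames whose
    pornography_likelihood meets the threshold (3 == POSSIBLE in the
    Likelihood enum).
    """
    flagged = []
    for frame in explicit_annotation.frames:
        if frame.pornography_likelihood >= threshold:
            flagged.append((frame.time_offset.total_seconds(),
                            int(frame.pornography_likelihood)))
    return flagged
```

With a real response you would call `flag_explicit_frames(annotation_result.explicit_annotation)`.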
class ExplicitContentFrame:
"""
Video frame level annotation results for explicit content.
Attributes:
time_offset: Time-offset, relative to the beginning of the video, corresponding to the video frame for this location
pornography_likelihood: Likelihood of the pornography content
"""
time_offset: duration_pb2.Duration
pornography_likelihood: Likelihood
Results from person detection analysis.
class PersonDetectionAnnotation:
"""
Person detection annotation per video.
Attributes:
tracks: The detected tracks of a person
version: Feature version
"""
tracks: MutableSequence[Track]
version: str
Results from logo detection and recognition analysis.
class LogoRecognitionAnnotation:
"""
Annotation corresponding to one detected, tracked and recognized logo class.
Attributes:
entity: Entity category information to specify the logo class that all the logo tracks within this LogoRecognitionAnnotation are recognized as
tracks: All logo tracks where the recognized logo appears
segments: All video segments where the recognized logo appears
"""
entity: Entity
tracks: MutableSequence[Track]
segments: MutableSequence[VideoSegment]
Results from celebrity recognition analysis available in v1p3beta1.
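There is no logo-recognition walk-through in the samples later in this document. A minimal sketch for the LogoRecognitionAnnotation class above, assuming the response came from a request that enabled `LOGO_RECOGNITION`:

```python
def summarize_logos(logo_annotations):
    """Map each recognized logo description to the (start, end) seconds
    of every video segment where it appears.

    Accepts any sequence of annotation objects with .entity.description
    and .segments carrying start/end time offsets.
    """
    summary = {}
    for logo in logo_annotations:
        spans = [(seg.start_time_offset.total_seconds(),
                  seg.end_time_offset.total_seconds())
                 for seg in logo.segments]
        summary[logo.entity.description] = spans
    return summary
```

With a real response: `summarize_logos(annotation_result.logo_recognition_annotations)`; per-frame bounding boxes are available on each annotation's `tracks`, exactly as in the face detection sample.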
class CelebrityRecognitionAnnotation:
"""
Celebrity recognition annotation per video.
Attributes:
celebrity_tracks: The tracks detected from the input video, including recognized celebrities and other detected faces
"""
celebrity_tracks: MutableSequence[CelebrityTrack]
class CelebrityTrack:
"""
The annotation result of a celebrity face track.
Attributes:
celebrities: Top N match of the celebrities for the face in this track
face_track: A track of a person's face
"""
celebrities: MutableSequence[RecognizedCelebrity]
face_track: Track
class RecognizedCelebrity:
"""
The recognized celebrity with confidence score.
Attributes:
celebrity: The recognized celebrity
confidence: Recognition confidence (0.0 to 1.0)
"""
celebrity: Celebrity
confidence: float
class Celebrity:
"""
Celebrity definition.
Attributes:
name: The resource name of the celebrity (format: video-intelligence/kg-mid)
display_name: The celebrity name
description: Textual description of additional information about the celebrity
"""
name: str
display_name: str
description: str
Data types for representing spatial information in videos.
class NormalizedBoundingBox:
"""
Normalized bounding box. The normalized vertex coordinates are relative to the original image. Range: [0, 1].
Attributes:
left: Left X coordinate
top: Top Y coordinate
right: Right X coordinate
bottom: Bottom Y coordinate
"""
left: float
top: float
right: float
bottom: float
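Because the coordinates above are normalized to [0, 1], drawing or cropping requires scaling by the frame size. A small helper (the frame dimensions are whatever your decoder reports, not something the API returns):

```python
def to_pixel_box(bbox, frame_width, frame_height):
    """Convert a NormalizedBoundingBox (coordinates in [0, 1]) into
    absolute (left, top, right, bottom) pixel coordinates for a frame
    of the given size.
    """
    return (round(bbox.left * frame_width),
            round(bbox.top * frame_height),
            round(bbox.right * frame_width),
            round(bbox.bottom * frame_height))
```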
class NormalizedBoundingPoly:
"""
Normalized bounding polygon for text (that might not be axis-aligned).
Attributes:
vertices: The bounding polygon vertices
"""
vertices: MutableSequence[NormalizedVertex]
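Since the text polygon may be rotated, a common first step is collapsing it to an axis-aligned box. A sketch that works on any object exposing `.vertices` with `.x`/`.y` fields:

```python
def poly_to_bounds(poly):
    """Axis-aligned (left, top, right, bottom) bounds of a
    NormalizedBoundingPoly's vertices (still normalized to [0, 1]).
    """
    xs = [v.x for v in poly.vertices]
    ys = [v.y for v in poly.vertices]
    return (min(xs), min(ys), max(xs), max(ys))
```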
class NormalizedVertex:
"""
A vertex represents a 2D point in the image. Coordinates are normalized, relative to the original image, and range from 0 to 1.
Attributes:
x: X coordinate
y: Y coordinate
"""
x: float
y: float
Types for tracking operation progress and handling errors.
class AnnotateVideoProgress:
"""
Video annotation progress. Included in the metadata field of the Operation returned by the GetOperation call of the google::longrunning::Operations service.
Attributes:
annotation_progress: Progress metadata for all videos specified in AnnotateVideoRequest
"""
annotation_progress: MutableSequence[VideoAnnotationProgress]
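While an operation runs, this progress metadata can be read on each poll. A small formatting helper, assuming the `operation` object created in the label-detection sample later in this document (`google.api_core.operation.Operation` exposes `done()` and `metadata`):

```python
def format_progress(progress_metadata):
    """Render one status line per video from an AnnotateVideoProgress
    message (anything exposing .annotation_progress with .input_uri
    and .progress_percent).
    """
    return [f"{p.input_uri}: {p.progress_percent}% complete"
            for p in progress_metadata.annotation_progress]
```

In a polling loop you would print `format_progress(operation.metadata)` every few seconds until `operation.done()` returns True, then call `operation.result()`.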
class VideoAnnotationProgress:
"""
Annotation progress for a single video.
Attributes:
input_uri: Video file location in Google Cloud Storage
progress_percent: Approximate percentage processed thus far (0-100)
start_time: Time when the request was received
update_time: Time of the most recent update
feature: Specifies which feature is being tracked if the request contains more than one feature
segment: Specifies which segment is being tracked if the request contains more than one segment
"""
input_uri: str
progress_percent: int
start_time: timestamp_pb2.Timestamp
update_time: timestamp_pb2.Timestamp
feature: Feature
segment: VideoSegment
from google.cloud import videointelligence

# Perform label detection
client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.LABEL_DETECTION],
        "input_uri": "gs://your-bucket/your-video.mp4",
    }
)
result = operation.result(timeout=300)

# Process results
for annotation_result in result.annotation_results:
    print(f"Processing video: {annotation_result.input_uri}")

    # Segment-level labels
    for label in annotation_result.segment_label_annotations:
        print(f"\nLabel: {label.entity.description}")
        for segment in label.segments:
            start_time = segment.segment.start_time_offset.total_seconds()
            end_time = segment.segment.end_time_offset.total_seconds()
            print(f"  Segment: {start_time:.1f}s to {end_time:.1f}s (confidence: {segment.confidence:.2f})")

    # Frame-level labels
    for label in annotation_result.frame_label_annotations:
        print(f"\nFrame-level label: {label.entity.description}")
        for frame in label.frames[:5]:  # Show first 5 frames
            time_offset = frame.time_offset.total_seconds()
            print(f"  Frame at {time_offset:.1f}s (confidence: {frame.confidence:.2f})")

# Process face detection results
for annotation_result in result.annotation_results:
    face_annotations = annotation_result.face_annotations
    for face_annotation in face_annotations:
        print("Face detected:")
        for segment in face_annotation.segments:
            start_time = segment.segment.start_time_offset.total_seconds()
            end_time = segment.segment.end_time_offset.total_seconds()
            print(f"  Time: {start_time:.1f}s to {end_time:.1f}s")

    # Newer face detection format
    for face_detection in annotation_result.face_detection_annotations:
        for track in face_detection.tracks:
            print(f"Face track (confidence: {track.confidence:.2f}):")
            for timestamped_object in track.timestamped_objects:
                time_offset = timestamped_object.time_offset.total_seconds()
                bbox = timestamped_object.normalized_bounding_box
                print(f"  {time_offset:.1f}s: bbox({bbox.left:.3f}, {bbox.top:.3f}, {bbox.right:.3f}, {bbox.bottom:.3f})")

# Process speech transcription results
for annotation_result in result.annotation_results:
    for transcription in annotation_result.speech_transcriptions:
        print(f"Language: {transcription.language_code}")
        for alternative in transcription.alternatives:
            print(f"Transcript: {alternative.transcript}")
            print(f"Confidence: {alternative.confidence:.2f}")

            # Word-level information
            for word_info in alternative.words:
                start_time = word_info.start_time.total_seconds()
                end_time = word_info.end_time.total_seconds()
                print(f"  {word_info.word}: {start_time:.1f}s-{end_time:.1f}s (speaker: {word_info.speaker_tag})")

# Process object tracking results
for annotation_result in result.annotation_results:
    for object_annotation in annotation_result.object_annotations:
        print(f"Object: {object_annotation.entity.description}")
        print(f"Confidence: {object_annotation.confidence:.2f}")
        print(f"Track ID: {object_annotation.track_id}")

        # Show first few frames
        for frame in object_annotation.frames[:10]:
            time_offset = frame.time_offset.total_seconds()
            bbox = frame.normalized_bounding_box
            print(f"  {time_offset:.1f}s: ({bbox.left:.3f}, {bbox.top:.3f}) to ({bbox.right:.3f}, {bbox.bottom:.3f})")

# Process text detection results
for annotation_result in result.annotation_results:
    for text_annotation in annotation_result.text_annotations:
        print(f"Detected text: {text_annotation.text}")
        for segment in text_annotation.segments:
            start_time = segment.segment.start_time_offset.total_seconds()
            end_time = segment.segment.end_time_offset.total_seconds()
            print(f"  Time: {start_time:.1f}s to {end_time:.1f}s (confidence: {segment.confidence:.2f})")

            # Frame-level information
            for frame in segment.frames:
                time_offset = frame.time_offset.total_seconds()
                print(f"    Frame at {time_offset:.1f}s")

from google.cloud import videointelligence_v1p3beta1
# Process celebrity recognition results (available in v1p3beta1)
for annotation_result in result.annotation_results:
    if hasattr(annotation_result, 'celebrity_recognition_annotations'):
        for celebrity_annotation in annotation_result.celebrity_recognition_annotations:
            print("Celebrity Recognition Results:")
            for celebrity_track in celebrity_annotation.celebrity_tracks:
                print("  Face track detected:")

                # Process recognized celebrities for this track
                for recognized_celebrity in celebrity_track.celebrities:
                    celebrity = recognized_celebrity.celebrity
                    confidence = recognized_celebrity.confidence
                    print(f"    Celebrity: {celebrity.display_name}")
                    print(f"    Confidence: {confidence:.2f}")
                    print(f"    Description: {celebrity.description}")
                    print(f"    Resource Name: {celebrity.name}")

                # Process face track information
                face_track = celebrity_track.face_track
                if face_track.segment:
                    start_time = face_track.segment.start_time_offset.total_seconds()
                    end_time = face_track.segment.end_time_offset.total_seconds()
                    print(f"    Track Duration: {start_time:.1f}s to {end_time:.1f}s")

                # Show first few timestamped objects
                for timestamped_obj in face_track.timestamped_objects[:5]:
                    time_offset = timestamped_obj.time_offset.total_seconds()
                    bbox = timestamped_obj.normalized_bounding_box
                    print(f"      {time_offset:.1f}s: bbox({bbox.left:.3f}, {bbox.top:.3f}, {bbox.right:.3f}, {bbox.bottom:.3f})")

Install with Tessl CLI
npx tessl i tessl/pypi-google-cloud-videointelligence