# Multimodal Metrics

Metrics for evaluating multimodal AI systems, including video-audio synchronization and cross-modal quality assessment, for applications that involve multiple data modalities.

## Capabilities

### Video-Audio Synchronization

Metrics for evaluating lip-sync and audio-visual alignment quality.

```python { .api }
class LipVertexError(Metric):
    def __init__(
        self,
        mouth_map: List[int],
        validate_args: bool = True,
        **kwargs
    ): ...
```
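
To make the quantity concrete, the sketch below computes a lip vertex error in plain Python under the commonly used definition: for each frame, take the largest squared L2 distance among the lip vertices, then average over frames. It is an illustration only, not the torchmetrics implementation; the function name `lip_vertex_error_sketch`, the `mouth_map` index list, and the toy data are assumptions made for this example.

```python
from typing import List, Sequence

Vertex = Sequence[float]   # (x, y, z)
Frame = Sequence[Vertex]   # all vertices of one frame


def lip_vertex_error_sketch(
    pred: Sequence[Frame], gt: Sequence[Frame], mouth_map: List[int]
) -> float:
    """Mean over frames of the max squared L2 error among lip vertices."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    if not pred or not mouth_map:
        raise ValueError("need at least one frame and one lip vertex")
    per_frame_max = []
    for pred_frame, gt_frame in zip(pred, gt):
        # Squared Euclidean distance for each lip vertex in this frame.
        sq_dists = [
            sum((p - g) ** 2 for p, g in zip(pred_frame[i], gt_frame[i]))
            for i in mouth_map
        ]
        per_frame_max.append(max(sq_dists))
    return sum(per_frame_max) / len(per_frame_max)


# Two frames, three vertices; vertices 1 and 2 are treated as the lips.
gt = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
pred = [
    [(9.0, 9.0, 9.0), (1.0, 0.0, 2.0), (0.0, 1.0, 1.0)],  # lip errors: 4, 1
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 4.0, 0.0)],  # lip errors: 0, 9
]
print(lip_vertex_error_sketch(pred, gt, mouth_map=[1, 2]))  # (4 + 9) / 2 = 6.5
```

Vertex 0 is far off in the first frame but is not listed in `mouth_map`, so it does not affect the result.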

### Cross-Modal Quality Assessment

Deep-learning-based metrics for evaluating cross-modal quality. They require optional dependencies such as `transformers`.

```python { .api }
class CLIPScore(Metric):
    def __init__(
        self,
        model_name_or_path: str = "openai/clip-vit-large-patch14",
        **kwargs
    ): ...

class CLIPImageQualityAssessment(Metric):
    def __init__(
        self,
        model_name_or_path: str = "clip_iqa",
        data_range: float = 1.0,
        prompts: Tuple[Union[str, Tuple[str, str]], ...] = ("quality",),
        **kwargs
    ): ...
```
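
The arithmetic behind these scores can be sketched without loading a model. The example below assumes the embeddings are already available as plain vectors and applies the CLIPScore formula `max(100 * cos(E_image, E_text), 0)`. It also shows a prompt-pair quality score as a two-way softmax over a positive and a negative prompt. The helper names, the toy embeddings, and the logit scale of 100 in the quality score are assumptions for illustration, not the library's code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score_sketch(image_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """CLIPScore formula: max(100 * cos(E_image, E_text), 0)."""
    return max(100.0 * cosine_similarity(image_emb, text_emb), 0.0)


def clip_iqa_sketch(
    image_emb: Sequence[float],
    positive_emb: Sequence[float],
    negative_emb: Sequence[float],
    scale: float = 100.0,  # assumed logit scale
) -> float:
    """Softmax probability of the positive prompt against the negative one."""
    pos = scale * cosine_similarity(image_emb, positive_emb)
    neg = scale * cosine_similarity(image_emb, negative_emb)
    # Two-way softmax, written as a logistic function of the logit gap.
    return 1.0 / (1.0 + math.exp(neg - pos))


image = (1.0, 0.0)                               # hypothetical image embedding
print(clip_score_sketch(image, (1.0, 0.0)))      # aligned caption
print(clip_score_sketch(image, (-1.0, 0.0)))     # opposite caption, clamped to 0
print(clip_iqa_sketch(image, (1.0, 0.0), (0.0, 1.0)))  # close to 1
```

The clamp at zero means captions that point away from the image all score 0 rather than going negative, while the softmax keeps the quality score in the range 0 to 1.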

## Usage Examples

```python
import torch
from torchmetrics.multimodal import LipVertexError

# Lip vertex error for 3D talking-head evaluation.
# mouth_map lists the indices of the lip vertices in the mesh; the values
# below are placeholders, not a real mesh topology.
lve = LipVertexError(mouth_map=[0, 1, 2, 3, 4])

# Sample vertex sequences (time, vertices, coords)
preds = torch.randn(10, 100, 3)  # 10 frames, 100 vertices, x-y-z coords
target = torch.randn(10, 100, 3)

# Compute lip synchronization error
lve_score = lve(preds, target)
print(f"Lip Vertex Error: {lve_score:.4f}")

# CLIP Score (requires transformers)
try:
    from torchmetrics.multimodal import CLIPScore

    # Downloads the pretrained CLIP weights on first use
    clip_metric = CLIPScore()

    # Sample text and images
    images = torch.randint(0, 256, (4, 3, 224, 224), dtype=torch.uint8)
    texts = ["a photo of a cat", "a dog playing", "a beautiful sunset", "a city skyline"]

    # Compute CLIP score
    clip_score = clip_metric(images, texts)
    print(f"CLIP Score: {clip_score:.4f}")

except ImportError:
    print("CLIP metrics require 'transformers' package")
```

## Types

```python { .api }
VideoLandmarks = Tensor  # Shape: (time, vertices, coordinates), one 3D vertex sequence
TextPrompts = List[str]  # Text descriptions or prompts
```