# Multimodal Metrics

Metrics for evaluating multimodal AI systems that span multiple data modalities, including video-audio synchronization and cross-modal quality assessment.

## Capabilities


### Video-Audio Synchronization

Metrics for evaluating lip-sync and audio-visual alignment quality.

```python { .api }
class LipVertexError(Metric):
    def __init__(
        self,
        **kwargs
    ): ...
```


### Cross-Modal Quality Assessment

Deep learning-based metrics for evaluating cross-modal quality. These require optional dependencies (e.g. the `transformers` package).

```python { .api }
class CLIPScore(Metric):
    def __init__(
        self,
        model_name_or_path: str = "openai/clip-vit-base-patch16",
        **kwargs
    ): ...

class CLIPImageQualityAssessment(Metric):
    def __init__(
        self,
        model_name_or_path: str = "openai/clip-vit-base-patch16",
        **kwargs
    ): ...
```


## Usage Examples

```python
import torch
from torchmetrics.multimodal import LipVertexError

# Lip vertex error for video analysis
lve = LipVertexError()

# Sample video landmarks (batch, time, landmarks, coords)
preds = torch.randn(2, 10, 68, 2)  # 2 videos, 10 frames, 68 landmarks, x-y coords
target = torch.randn(2, 10, 68, 2)

# Compute lip synchronization error
lve_score = lve(preds, target)
print(f"Lip Vertex Error: {lve_score:.4f}")

# CLIP Score (requires transformers)
try:
    from torchmetrics.multimodal import CLIPScore

    clip_metric = CLIPScore()

    # Sample text and images
    images = torch.randint(0, 256, (4, 3, 224, 224), dtype=torch.uint8)
    texts = ["a photo of a cat", "a dog playing", "a beautiful sunset", "a city skyline"]

    # Compute CLIP score
    clip_score = clip_metric(images, texts)
    print(f"CLIP Score: {clip_score:.4f}")

except ImportError:
    print("CLIP metrics require 'transformers' package")
```
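To build intuition for what a lip-vertex-style error measures, the sketch below is a simplified stand-in written in plain torch — it is not the library's `LipVertexError` algorithm, just an assumed illustration that averages per-landmark Euclidean distances between predicted and target landmarks:

```python
import torch

def mean_landmark_l2(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Euclidean distance per landmark over the coordinate axis,
    # then averaged over batch, time, and landmarks.
    return torch.linalg.vector_norm(preds - target, dim=-1).mean()

preds = torch.zeros(2, 10, 68, 2)
target = torch.ones(2, 10, 68, 2)
err = mean_landmark_l2(preds, target)
print(f"{err:.4f}")  # each 2-D landmark is sqrt(2) apart, so the mean is ~1.4142
```

A lower value means predicted landmarks track the targets more closely; the actual metric applies the same idea specifically to the lip region.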


## Types

```python { .api }
VideoLandmarks = Tensor  # Shape: (batch, time, landmarks, coordinates)
TextPrompts = List[str]  # Text descriptions or prompts
```
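As a sketch of how the `VideoLandmarks` alias might be checked at runtime, the hypothetical helper below (not part of the library) verifies the documented `(batch, time, landmarks, coordinates)` layout, assuming 2-D or 3-D coordinates in the last dimension:

```python
import torch

def validate_video_landmarks(t: torch.Tensor) -> bool:
    # A VideoLandmarks tensor is rank-4: (batch, time, landmarks, coordinates),
    # with coordinates typically x-y (2) or x-y-z (3).
    return t.ndim == 4 and t.shape[-1] in (2, 3)

landmarks = torch.randn(2, 10, 68, 2)  # 2 videos, 10 frames, 68 landmarks, x-y
print(validate_video_landmarks(landmarks))  # True
```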