
# Vision-Language Model Integration


Support for integrating various vision-language models (VLMs) for zero-shot object detection and image analysis tasks. These models can perform object detection, segmentation, and other computer vision tasks using natural language prompts.

## Capabilities

### VLM Enums

Supported vision-language models with standardized interfaces.

```python { .api }
class VLM(Enum):
    """
    Enum specifying supported Vision-Language Models (VLMs).

    Attributes:
        PALIGEMMA: Google's PaliGemma vision-language model
        FLORENCE_2: Microsoft's Florence-2 vision-language model
        QWEN_2_5_VL: Qwen2.5-VL open vision-language model from Alibaba
        GOOGLE_GEMINI_2_0: Google Gemini 2.0 vision-language model
        GOOGLE_GEMINI_2_5: Google Gemini 2.5 vision-language model
        MOONDREAM: The Moondream vision-language model
    """
    PALIGEMMA = "paligemma"
    FLORENCE_2 = "florence_2"
    QWEN_2_5_VL = "qwen_2_5_vl"
    GOOGLE_GEMINI_2_0 = "gemini_2_0"
    GOOGLE_GEMINI_2_5 = "gemini_2_5"
    MOONDREAM = "moondream"

    @classmethod
    def list(cls) -> list[str]:
        """Return a list of all VLM string values."""

    @classmethod
    def from_value(cls, value: "VLM | str") -> "VLM":
        """Create a VLM enum member from a string value."""

@deprecated("LMM enum is deprecated, use VLM instead")
class LMM(Enum):
    """
    Deprecated. Use VLM instead.

    Enum specifying supported Large Multimodal Models (LMMs).
    """
    PALIGEMMA = "paligemma"
    FLORENCE_2 = "florence_2"
    QWEN_2_5_VL = "qwen_2_5_vl"
    GOOGLE_GEMINI_2_0 = "gemini_2_0"
    GOOGLE_GEMINI_2_5 = "gemini_2_5"
    MOONDREAM = "moondream"
```
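To illustrate how the `list()` and `from_value()` helpers behave, here is a minimal standalone sketch that mirrors the documented enum (two members only, with assumed implementations — not the real supervision class):

```python
from enum import Enum

class VLM(Enum):
    # Abbreviated stand-in for the documented enum.
    PALIGEMMA = "paligemma"
    FLORENCE_2 = "florence_2"

    @classmethod
    def list(cls) -> list[str]:
        # Collect the string value of every member.
        return [member.value for member in cls]

    @classmethod
    def from_value(cls, value) -> "VLM":
        # Accept either an existing member or its string value.
        if isinstance(value, cls):
            return value
        try:
            return cls(value)
        except ValueError:
            raise ValueError(f"Unknown VLM value: {value!r}")

print(VLM.list())                   # ['paligemma', 'florence_2']
print(VLM.from_value("paligemma"))  # VLM.PALIGEMMA
```

Accepting both `VLM` and `str` in `from_value` lets downstream functions normalize user input once and work with enum members internally.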

### VLM Parameter Validation

Utility for validating VLM parameters and result types.

```python { .api }
def validate_vlm_parameters(vlm: VLM | str, result: Any, kwargs: dict[str, Any]) -> VLM:
    """
    Validates the parameters and result type for a given Vision-Language Model (VLM).

    Args:
        vlm: The VLM enum or string specifying the model
        result: The result object to validate (type depends on VLM)
        kwargs: Dictionary of arguments to validate against required/allowed lists

    Returns:
        The validated VLM enum value

    Raises:
        ValueError: If the VLM, result type, or arguments are invalid
    """
```
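The validation pattern described above can be sketched as a standalone function. The required/allowed argument sets and expected result types below are illustrative assumptions, not the real supervision internals:

```python
from typing import Any

# Hypothetical per-model validation tables (assumed for this sketch).
REQUIRED_ARGS = {"florence_2": {"resolution_wh"}}
ALLOWED_ARGS = {"florence_2": {"resolution_wh"}}
EXPECTED_RESULT_TYPE = {"florence_2": dict}

def validate_vlm_parameters_sketch(vlm: str, result: Any, kwargs: dict) -> str:
    # Reject unknown models outright.
    if vlm not in REQUIRED_ARGS:
        raise ValueError(f"Unsupported VLM: {vlm!r}")
    # Each model expects a specific result type (e.g. dict for Florence-2).
    if not isinstance(result, EXPECTED_RESULT_TYPE[vlm]):
        raise ValueError(
            f"{vlm} expects result of type {EXPECTED_RESULT_TYPE[vlm].__name__}"
        )
    # Every required argument must be present...
    missing = REQUIRED_ARGS[vlm] - kwargs.keys()
    if missing:
        raise ValueError(f"Missing required arguments: {sorted(missing)}")
    # ...and no unknown arguments are allowed.
    unexpected = kwargs.keys() - ALLOWED_ARGS[vlm]
    if unexpected:
        raise ValueError(f"Unexpected arguments: {sorted(unexpected)}")
    return vlm

print(validate_vlm_parameters_sketch(
    "florence_2",
    {"<OD>": {"bboxes": [], "labels": []}},
    {"resolution_wh": (640, 480)},
))  # florence_2
```

Failing fast with `ValueError` before any parsing keeps error messages close to the misconfiguration rather than surfacing later as a shape mismatch.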

### Model-Specific Parsers

Functions to parse results from different vision-language models into standardized formats.

```python { .api }
def from_paligemma(
    result: str,
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from PaliGemma-formatted text and scale to specified resolution.

    Args:
        result: String containing PaliGemma-formatted locations and labels
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_qwen_2_5_vl(
    result: str,
    input_wh: tuple[int, int],
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Qwen2.5-VL formatted text.

    Args:
        result: String containing Qwen2.5-VL formatted locations and labels
        input_wh: Input image resolution (width, height)
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_florence_2(
    result: dict,
    resolution_wh: tuple[int, int]
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Florence-2 model results.

    Args:
        result: Dictionary containing Florence-2 model output
        resolution_wh: Target resolution (width, height) for scaling coordinates

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_google_gemini_2_0(
    result: str,
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Google Gemini 2.0 formatted text.

    Args:
        result: String containing Gemini 2.0 formatted locations and labels
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_google_gemini_2_5(
    result: str,
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Google Gemini 2.5 formatted text.

    Args:
        result: String containing Gemini 2.5 formatted locations and labels
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_moondream(
    result: dict,
    resolution_wh: tuple[int, int]
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Moondream model results.

    Args:
        result: Dictionary containing Moondream model output
        resolution_wh: Target resolution (width, height) for scaling coordinates

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """
```
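The core job these parsers share is extracting location tokens and scaling them to the target resolution. The self-contained sketch below shows that idea for a PaliGemma-style string, assuming `<locNNNN>` tokens on a 0–1023 grid in `y_min, x_min, y_max, x_max` order (both assumptions for illustration; the real `from_paligemma` handles the exact format):

```python
import re
import numpy as np

def parse_paligemma_like(result: str, resolution_wh: tuple[int, int]):
    """Toy parser: extract <locNNNN> boxes and scale them to resolution_wh."""
    w, h = resolution_wh
    # Four 4-digit location tokens followed by a label.
    pattern = re.compile(
        r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([\w ]+)"
    )
    boxes, labels = [], []
    for y1, x1, y2, x2, label in pattern.findall(result):
        # Tokens are normalized to a 1024-step grid; rescale to pixels.
        boxes.append([
            int(x1) / 1024 * w, int(y1) / 1024 * h,
            int(x2) / 1024 * w, int(y2) / 1024 * h,
        ])
        labels.append(label.strip())
    return np.array(boxes), np.array(labels)

xyxy, names = parse_paligemma_like(
    "<loc0100><loc0200><loc0800><loc0600> cat", resolution_wh=(1024, 1024)
)
print(xyxy)   # [[200. 100. 600. 800.]]
print(names)  # ['cat']
```

The same scale-from-normalized-grid step is why every parser takes `resolution_wh`: box coordinates only become pixel coordinates once the target image size is known.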

## Usage Examples

### Using PaliGemma for Object Detection

```python
import numpy as np
import supervision as sv

# PaliGemma result string
paligemma_result = "person <loc_123><loc_456><loc_789><loc_234> car <loc_345><loc_567><loc_890><loc_123>"

# Parse the results
xyxy, class_names, confidence = sv.from_paligemma(
    result=paligemma_result,
    resolution_wh=(1280, 720),
    classes=["person", "car", "bicycle"]  # Optional filtering
)

# Create a Detections object
detections = sv.Detections(
    xyxy=xyxy,
    class_id=np.arange(len(xyxy)),
    confidence=confidence
)

print(f"Found {len(detections)} objects")
```

### Working with Florence-2 Model

```python
import numpy as np
import supervision as sv

# Florence-2 model result dictionary
florence_result = {
    "<OD>": {
        "bboxes": [[100, 200, 300, 400], [500, 100, 800, 300]],
        "labels": ["person", "car"]
    }
}

# Parse Florence-2 results
xyxy, class_names, confidence = sv.from_florence_2(
    result=florence_result,
    resolution_wh=(1920, 1080)
)

# Create detections
detections = sv.Detections(
    xyxy=xyxy,
    class_id=np.arange(len(xyxy)),
    confidence=confidence
)

# Annotate the image (image: np.ndarray loaded elsewhere, e.g. with cv2.imread)
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()

annotated_image = box_annotator.annotate(image, detections)
annotated_image = label_annotator.annotate(annotated_image, detections)
```

### Validating VLM Parameters

```python
import supervision as sv

# Validate VLM configuration
try:
    vlm = sv.validate_vlm_parameters(
        vlm="florence_2",
        result={"<OD>": {"bboxes": [], "labels": []}},
        kwargs={"resolution_wh": (640, 480)}
    )
    print(f"Valid VLM: {vlm}")
except ValueError as e:
    print(f"Invalid VLM configuration: {e}")
```

### Integration with Detections.from_* Methods

The VLM parsers can be used with the core Detections class through custom integration:

```python
import numpy as np
import supervision as sv

def create_detections_from_vlm(vlm_type: str, result, **kwargs):
    """Helper function to create Detections from VLM results."""
    if vlm_type == "paligemma":
        xyxy, class_names, confidence = sv.from_paligemma(result, **kwargs)
    elif vlm_type == "florence_2":
        xyxy, class_names, confidence = sv.from_florence_2(result, **kwargs)
    # Add other VLM types...

    return sv.Detections(
        xyxy=xyxy,
        confidence=confidence,
        class_id=np.arange(len(xyxy)) if len(xyxy) > 0 else np.array([])
    )

# Usage (paligemma_result as defined in the PaliGemma example above)
detections = create_detections_from_vlm(
    vlm_type="paligemma",
    result=paligemma_result,
    resolution_wh=(1280, 720),
    classes=["person", "car"]
)
```

## Supported Tasks

### Florence-2 Supported Tasks

Florence-2 supports multiple computer vision tasks through different task prompts:

- `<OD>`: Object detection
- `<CAPTION_TO_PHRASE_GROUNDING>`: Caption to phrase grounding
- `<DENSE_REGION_CAPTION>`: Dense region captioning
- `<REGION_PROPOSAL>`: Region proposal generation
- `<OCR_WITH_REGION>`: OCR with region detection
- `<REFERRING_EXPRESSION_SEGMENTATION>`: Referring expression segmentation
- `<REGION_TO_SEGMENTATION>`: Region to segmentation
- `<OPEN_VOCABULARY_DETECTION>`: Open vocabulary detection
- `<REGION_TO_CATEGORY>`: Region to category classification
- `<REGION_TO_DESCRIPTION>`: Region to description
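Because Florence-2 keys its output dictionary by the task prompt (as in the `"<OD>"` example earlier), code handling multiple tasks can dispatch on that key. A small sketch, with an illustrative result shape rather than the exact model output:

```python
def extract_task_payload(result: dict, task: str) -> dict:
    """Return the output payload for the task prompt that was run."""
    if task not in result:
        raise KeyError(f"Result does not contain output for task {task!r}")
    return result[task]

# Hypothetical object-detection result keyed by its task prompt.
od_result = {"<OD>": {"bboxes": [[10, 20, 110, 220]], "labels": ["person"]}}

payload = extract_task_payload(od_result, "<OD>")
print(payload["labels"])  # ['person']
```

Checking for the task key up front turns a silent mismatch (e.g. running `<DENSE_REGION_CAPTION>` but reading `<OD>`) into an immediate, descriptive error.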