# Vision-Language Model Integration

Support for integrating various vision-language models (VLMs) for zero-shot object detection and image analysis. These models can perform detection, segmentation, and other computer vision tasks driven by natural language prompts.
## Capabilities

### VLM Enums

Supported vision-language models with standardized interfaces.
```python { .api }
class VLM(Enum):
    """
    Enum specifying supported Vision-Language Models (VLMs).

    Attributes:
        PALIGEMMA: Google's PaliGemma vision-language model
        FLORENCE_2: Microsoft's Florence-2 vision-language model
        QWEN_2_5_VL: Alibaba's Qwen2.5-VL open vision-language model
        GOOGLE_GEMINI_2_0: Google Gemini 2.0 vision-language model
        GOOGLE_GEMINI_2_5: Google Gemini 2.5 vision-language model
        MOONDREAM: The Moondream vision-language model
    """
    PALIGEMMA = "paligemma"
    FLORENCE_2 = "florence_2"
    QWEN_2_5_VL = "qwen_2_5_vl"
    GOOGLE_GEMINI_2_0 = "gemini_2_0"
    GOOGLE_GEMINI_2_5 = "gemini_2_5"
    MOONDREAM = "moondream"

    @classmethod
    def list(cls) -> list[str]:
        """Return a list of all VLM values."""

    @classmethod
    def from_value(cls, value: "VLM | str") -> "VLM":
        """Create a VLM enum member from a string value."""


@deprecated("LMM enum is deprecated, use VLM instead")
class LMM(Enum):
    """
    Deprecated. Use VLM instead.

    Enum specifying supported Large Multimodal Models (LMMs).
    """
    PALIGEMMA = "paligemma"
    FLORENCE_2 = "florence_2"
    QWEN_2_5_VL = "qwen_2_5_vl"
    GOOGLE_GEMINI_2_0 = "gemini_2_0"
    GOOGLE_GEMINI_2_5 = "gemini_2_5"
    MOONDREAM = "moondream"
```
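The `list`/`from_value` pattern above can be sketched in plain Python. This is a hypothetical minimal re-implementation of the described interface (`DemoVLM` is a stand-in name, not the library class), useful for seeing how string-to-enum coercion behaves:

```python
from enum import Enum


class DemoVLM(Enum):  # hypothetical stand-in for the VLM enum
    PALIGEMMA = "paligemma"
    FLORENCE_2 = "florence_2"

    @classmethod
    def list(cls) -> "list[str]":
        # Collect the string value of every member
        return [member.value for member in cls]

    @classmethod
    def from_value(cls, value: "DemoVLM | str") -> "DemoVLM":
        # Pass enum members through unchanged; look strings up by value
        if isinstance(value, cls):
            return value
        try:
            return cls(value)
        except ValueError:
            raise ValueError(f"Unknown VLM value: {value!r}")


print(DemoVLM.list())                   # ['paligemma', 'florence_2']
print(DemoVLM.from_value("paligemma"))  # DemoVLM.PALIGEMMA
```

Accepting either an enum member or its string value lets callers pass `"florence_2"` directly without importing the enum first.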
### VLM Parameter Validation

Utility for validating VLM parameters and result types.
```python { .api }
def validate_vlm_parameters(vlm: VLM | str, result: Any, kwargs: dict[str, Any]) -> VLM:
    """
    Validates the parameters and result type for a given Vision-Language Model (VLM).

    Args:
        vlm: The VLM enum or string specifying the model
        result: The result object to validate (type depends on the VLM)
        kwargs: Dictionary of arguments to validate against required/allowed lists

    Returns:
        The validated VLM enum value

    Raises:
        ValueError: If the VLM, result type, or arguments are invalid
    """
```
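The required/allowed-argument contract this validator enforces can be illustrated with a small standalone helper. The names below (`REQUIRED_ARGS`, `ALLOWED_ARGS`, `check_kwargs`) are hypothetical, chosen for illustration; they are not the library's internals:

```python
from typing import Any

# Hypothetical per-model argument contracts, for illustration only
REQUIRED_ARGS = {"paligemma": {"resolution_wh"}}
ALLOWED_ARGS = {"paligemma": {"resolution_wh", "classes"}}


def check_kwargs(model: str, kwargs: dict[str, Any]) -> None:
    """Raise ValueError if required args are missing or unknown args are passed."""
    missing = REQUIRED_ARGS[model] - kwargs.keys()
    if missing:
        raise ValueError(f"{model} requires arguments: {sorted(missing)}")
    unknown = kwargs.keys() - ALLOWED_ARGS[model]
    if unknown:
        raise ValueError(f"{model} does not accept: {sorted(unknown)}")


check_kwargs("paligemma", {"resolution_wh": (640, 480)})  # passes silently
try:
    check_kwargs("paligemma", {"classes": ["cat"]})
except ValueError as e:
    print(e)  # paligemma requires arguments: ['resolution_wh']
```

Failing fast on a bad `kwargs` dict surfaces configuration mistakes before any model output is parsed.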
### Model-Specific Parsers

Functions to parse results from different vision-language models into standardized formats.
```python { .api }
def from_paligemma(
    result: str,
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from PaliGemma-formatted text and scale to the specified resolution.

    Args:
        result: String containing PaliGemma-formatted locations and labels
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_qwen_2_5_vl(
    result: str,
    input_wh: tuple[int, int],
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Qwen2.5-VL formatted text.

    Args:
        result: String containing Qwen2.5-VL formatted locations and labels
        input_wh: Input image resolution (width, height)
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_florence_2(
    result: dict,
    resolution_wh: tuple[int, int]
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Florence-2 model results.

    Args:
        result: Dictionary containing Florence-2 model output
        resolution_wh: Target resolution (width, height) for scaling coordinates

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_google_gemini_2_0(
    result: str,
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Google Gemini 2.0 formatted text.

    Args:
        result: String containing Gemini 2.0 formatted locations and labels
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_google_gemini_2_5(
    result: str,
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Google Gemini 2.5 formatted text.

    Args:
        result: String containing Gemini 2.5 formatted locations and labels
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_moondream(
    result: dict,
    resolution_wh: tuple[int, int]
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Moondream model results.

    Args:
        result: Dictionary containing Moondream model output
        resolution_wh: Target resolution (width, height) for scaling coordinates

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """
```
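All of these parsers return `(xyxy, class_names, confidence)`, where `class_names` is an array of strings, while `Detections` expects an integer `class_id`. A small helper can bridge the two; `class_ids_from_names` is a hypothetical name introduced here for illustration, not part of the library:

```python
import numpy as np


def class_ids_from_names(class_names, classes):
    """Map class-name strings to integer ids via a reference list.

    Hypothetical helper: names absent from `classes` are assigned -1.
    """
    lookup = {name: idx for idx, name in enumerate(classes)}
    return np.array([lookup.get(name, -1) for name in class_names], dtype=int)


ids = class_ids_from_names(
    np.array(["person", "car", "person"]), classes=["person", "car"]
)
print(ids)  # [0 1 0]
```

Keeping the id assignment tied to a fixed `classes` list makes ids stable across frames, which matters if the detections later feed a tracker.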
## Usage Examples

### Using PaliGemma for Object Detection
```python
import numpy as np
import supervision as sv

# PaliGemma-formatted result string: four <loc####> tokens followed by a
# label, with multiple detections separated by ";"
paligemma_result = (
    "<loc0256><loc0256><loc0768><loc0768> person ; "
    "<loc0100><loc0200><loc0300><loc0400> car"
)

# Parse the results
xyxy, class_names, confidence = sv.from_paligemma(
    result=paligemma_result,
    resolution_wh=(1280, 720),
    classes=["person", "car", "bicycle"]  # Optional filtering
)

# Create a Detections object
detections = sv.Detections(
    xyxy=xyxy,
    class_id=np.arange(len(xyxy)),
    confidence=confidence
)

print(f"Found {len(detections)} objects")
```
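To see what the parser does under the hood, the `<loc####>` tokens can be decoded by hand. This sketch assumes PaliGemma's documented convention of a 0-1023 coordinate grid with values ordered y1, x1, y2, x2 (verify against the model card for your checkpoint); `decode_loc_tokens` is a hypothetical name:

```python
import re
import numpy as np


def decode_loc_tokens(segment: str, resolution_wh: "tuple[int, int]") -> np.ndarray:
    """Decode four <loc####> tokens into a pixel-space xyxy box.

    Assumes values on a 0-1023 grid, ordered y1, x1, y2, x2.
    """
    w, h = resolution_wh
    y1, x1, y2, x2 = [int(v) for v in re.findall(r"<loc(\d{4})>", segment)]
    # Scale each normalized coordinate to the target resolution
    return np.array([x1 / 1024 * w, y1 / 1024 * h, x2 / 1024 * w, y2 / 1024 * h])


box = decode_loc_tokens("<loc0256><loc0256><loc0768><loc0768> person", (1280, 720))
print(box)  # [320. 180. 960. 540.]
```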
207
208
### Working with Florence-2 Model
209
210
```python
211
import supervision as sv
212
213
# Florence-2 model result dictionary
214
florence_result = {
215
"<OD>": {
216
"bboxes": [[100, 200, 300, 400], [500, 100, 800, 300]],
217
"labels": ["person", "car"]
218
}
219
}
220
221
# Parse Florence-2 results
222
xyxy, class_names, confidence = sv.from_florence_2(
223
result=florence_result,
224
resolution_wh=(1920, 1080)
225
)
226
227
# Create detections
228
detections = sv.Detections(
229
xyxy=xyxy,
230
class_id=np.arange(len(xyxy)),
231
confidence=confidence
232
)
233
234
# Annotate the image
235
box_annotator = sv.BoxAnnotator()
236
label_annotator = sv.LabelAnnotator()
237
238
annotated_image = box_annotator.annotate(image, detections)
239
annotated_image = label_annotator.annotate(annotated_image, detections)
240
```
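For comparison, Qwen2.5-VL typically emits its detections as a JSON list, often wrapped in a markdown code fence, with `bbox_2d` and `label` keys (this format is an assumption to verify against your model's output). The raw payload can be recovered with the standard library before handing it to `sv.from_qwen_2_5_vl`:

```python
import json
import re

# Example raw model output (assumed format: fenced JSON with bbox_2d/label keys)
qwen_output = """```json
[
  {"bbox_2d": [100, 50, 400, 300], "label": "person"},
  {"bbox_2d": [500, 80, 700, 250], "label": "car"}
]
```"""

# Strip the markdown fence (if present) and parse the JSON payload
match = re.search(r"\[.*\]", qwen_output, re.DOTALL)
items = json.loads(match.group(0))
boxes = [item["bbox_2d"] for item in items]
labels = [item["label"] for item in items]
print(labels)  # ['person', 'car']
```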
### Validating VLM Parameters

```python
import supervision as sv

# Validate VLM configuration
try:
    vlm = sv.validate_vlm_parameters(
        vlm="florence_2",
        result={"<OD>": {"bboxes": [], "labels": []}},
        kwargs={"resolution_wh": (640, 480)}
    )
    print(f"Valid VLM: {vlm}")
except ValueError as e:
    print(f"Invalid VLM configuration: {e}")
```
### Integration with Detections.from_* Methods

The VLM parsers can be used with the core Detections class through custom integration:
```python
import numpy as np
import supervision as sv

def create_detections_from_vlm(vlm_type: str, result, **kwargs):
    """Helper function to create Detections from VLM results."""
    if vlm_type == "paligemma":
        xyxy, class_names, confidence = sv.from_paligemma(result, **kwargs)
    elif vlm_type == "florence_2":
        xyxy, class_names, confidence = sv.from_florence_2(result, **kwargs)
    # Add other VLM types here...
    else:
        raise ValueError(f"Unsupported VLM type: {vlm_type!r}")

    return sv.Detections(
        xyxy=xyxy,
        confidence=confidence,
        class_id=np.arange(len(xyxy)) if len(xyxy) > 0 else np.array([])
    )

# Usage
detections = create_detections_from_vlm(
    vlm_type="paligemma",
    result=paligemma_result,
    resolution_wh=(1280, 720),
    classes=["person", "car"]
)
```
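As more models are supported, an if/elif chain like this grows awkward; a parser registry keyed by model name is a common alternative. The sketch below uses a stub parser in place of the real `sv.from_*` functions so it stands alone (`PARSERS` and `parse_vlm_result` are hypothetical names):

```python
import numpy as np


# Stub parser standing in for the sv.from_* functions (illustration only)
def parse_stub(result, **kwargs):
    return np.empty((0, 4)), None, np.empty(0)


# Registry mapping model name -> parser callable
PARSERS = {
    "paligemma": parse_stub,
    "florence_2": parse_stub,
}


def parse_vlm_result(vlm_type: str, result, **kwargs):
    """Dispatch to the registered parser for the given VLM type."""
    try:
        parser = PARSERS[vlm_type]
    except KeyError:
        raise ValueError(f"Unsupported VLM type: {vlm_type!r}")
    return parser(result, **kwargs)


xyxy, class_names, confidence = parse_vlm_result("paligemma", "raw output")
print(xyxy.shape)  # (0, 4)
```

Registering parsers in a dict keeps the dispatch logic in one place and makes adding a new model a one-line change.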
## Supported Tasks

### Florence-2 Supported Tasks

Florence-2 supports multiple computer vision tasks through different task prompts:

- `<OD>`: Object Detection
- `<CAPTION_TO_PHRASE_GROUNDING>`: Caption to phrase grounding
- `<DENSE_REGION_CAPTION>`: Dense region captioning
- `<REGION_PROPOSAL>`: Region proposal generation
- `<OCR_WITH_REGION>`: OCR with region detection
- `<REFERRING_EXPRESSION_SEGMENTATION>`: Referring expression segmentation
- `<REGION_TO_SEGMENTATION>`: Region to segmentation
- `<OPEN_VOCABULARY_DETECTION>`: Open vocabulary detection
- `<REGION_TO_CATEGORY>`: Region to category classification
- `<REGION_TO_DESCRIPTION>`: Region to description
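As the Florence-2 example earlier shows, the model's result dict is keyed by the task token used as the prompt. A small helper (hypothetical, illustration only) can extract whichever task payload is present without hard-coding one token:

```python
# Subset of Florence-2 task tokens, for illustration
FLORENCE_TASKS = {"<OD>", "<OCR_WITH_REGION>", "<OPEN_VOCABULARY_DETECTION>"}


def extract_task_payload(result: dict):
    """Return (task_token, payload) for the task key present in a result dict."""
    for token in FLORENCE_TASKS:
        if token in result:
            return token, result[token]
    raise ValueError(f"No known task token in result keys: {list(result)}")


task, payload = extract_task_payload(
    {"<OD>": {"bboxes": [[0, 0, 10, 10]], "labels": ["cat"]}}
)
print(task)  # <OD>
```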