# Vision-Language Model Integration

Support for integrating various vision-language models (VLMs) for zero-shot object detection and image analysis. These models can perform detection, segmentation, and other computer vision tasks driven by natural language prompts.
## Capabilities

### VLM Enums

Supported vision-language models with standardized interfaces.
```python { .api }
class VLM(Enum):
    """
    Enum specifying supported Vision-Language Models (VLMs).

    Attributes:
        PALIGEMMA: Google's PaliGemma vision-language model
        FLORENCE_2: Microsoft's Florence-2 vision-language model
        QWEN_2_5_VL: Alibaba's Qwen2.5-VL open vision-language model
        GOOGLE_GEMINI_2_0: Google Gemini 2.0 vision-language model
        GOOGLE_GEMINI_2_5: Google Gemini 2.5 vision-language model
        MOONDREAM: The Moondream vision-language model
    """
    PALIGEMMA = "paligemma"
    FLORENCE_2 = "florence_2"
    QWEN_2_5_VL = "qwen_2_5_vl"
    GOOGLE_GEMINI_2_0 = "gemini_2_0"
    GOOGLE_GEMINI_2_5 = "gemini_2_5"
    MOONDREAM = "moondream"

    @classmethod
    def list(cls) -> list[str]:
        """Return a list of all VLM values."""

    @classmethod
    def from_value(cls, value: "VLM | str") -> "VLM":
        """Create a VLM enum member from a string value."""


@deprecated("LMM enum is deprecated, use VLM instead")
class LMM(Enum):
    """
    Deprecated. Use VLM instead.

    Enum specifying supported Large Multimodal Models (LMMs).
    """
    PALIGEMMA = "paligemma"
    FLORENCE_2 = "florence_2"
    QWEN_2_5_VL = "qwen_2_5_vl"
    GOOGLE_GEMINI_2_0 = "gemini_2_0"
    GOOGLE_GEMINI_2_5 = "gemini_2_5"
    MOONDREAM = "moondream"
```
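The `list`/`from_value` pattern above can be sketched in plain Python. This is a hypothetical minimal re-implementation of the described interface (`DemoVLM` is a stand-in name, not the library class), useful for seeing how string-to-enum coercion behaves:

```python
from enum import Enum


class DemoVLM(Enum):  # hypothetical stand-in for the VLM enum
    PALIGEMMA = "paligemma"
    FLORENCE_2 = "florence_2"

    @classmethod
    def list(cls) -> "list[str]":
        # Collect the string value of every member
        return [member.value for member in cls]

    @classmethod
    def from_value(cls, value: "DemoVLM | str") -> "DemoVLM":
        # Pass enum members through unchanged; look strings up by value
        if isinstance(value, cls):
            return value
        try:
            return cls(value)
        except ValueError:
            raise ValueError(f"Unknown VLM value: {value!r}")


print(DemoVLM.list())                   # ['paligemma', 'florence_2']
print(DemoVLM.from_value("paligemma"))  # DemoVLM.PALIGEMMA
```

Accepting either an enum member or its string value lets callers pass `"florence_2"` directly without importing the enum first.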
### VLM Parameter Validation

Utility for validating VLM parameters and result types.
```python { .api }
def validate_vlm_parameters(vlm: VLM | str, result: Any, kwargs: dict[str, Any]) -> VLM:
    """
    Validates the parameters and result type for a given Vision-Language Model (VLM).

    Args:
        vlm: The VLM enum or string specifying the model
        result: The result object to validate (type depends on the VLM)
        kwargs: Dictionary of arguments to validate against required/allowed lists

    Returns:
        The validated VLM enum value

    Raises:
        ValueError: If the VLM, result type, or arguments are invalid
    """
```
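The required/allowed-argument contract this validator enforces can be illustrated with a small standalone helper. The names below (`REQUIRED_ARGS`, `ALLOWED_ARGS`, `check_kwargs`) are hypothetical, chosen for illustration; they are not the library's internals:

```python
from typing import Any

# Hypothetical per-model argument contracts, for illustration only
REQUIRED_ARGS = {"paligemma": {"resolution_wh"}}
ALLOWED_ARGS = {"paligemma": {"resolution_wh", "classes"}}


def check_kwargs(model: str, kwargs: dict[str, Any]) -> None:
    """Raise ValueError if required args are missing or unknown args are passed."""
    missing = REQUIRED_ARGS[model] - kwargs.keys()
    if missing:
        raise ValueError(f"{model} requires arguments: {sorted(missing)}")
    unknown = kwargs.keys() - ALLOWED_ARGS[model]
    if unknown:
        raise ValueError(f"{model} does not accept: {sorted(unknown)}")


check_kwargs("paligemma", {"resolution_wh": (640, 480)})  # passes silently
try:
    check_kwargs("paligemma", {"classes": ["cat"]})
except ValueError as e:
    print(e)  # paligemma requires arguments: ['resolution_wh']
```

Failing fast on a bad `kwargs` dict surfaces configuration mistakes before any model output is parsed.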
### Model-Specific Parsers

Functions to parse results from different vision-language models into standardized formats.
```python { .api }
def from_paligemma(
    result: str,
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from PaliGemma-formatted text and scale to the specified resolution.

    Args:
        result: String containing PaliGemma-formatted locations and labels
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_qwen_2_5_vl(
    result: str,
    input_wh: tuple[int, int],
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Qwen2.5-VL formatted text.

    Args:
        result: String containing Qwen2.5-VL formatted locations and labels
        input_wh: Input image resolution (width, height)
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_florence_2(
    result: dict,
    resolution_wh: tuple[int, int]
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Florence-2 model results.

    Args:
        result: Dictionary containing Florence-2 model output
        resolution_wh: Target resolution (width, height) for scaling coordinates

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_google_gemini_2_0(
    result: str,
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Google Gemini 2.0 formatted text.

    Args:
        result: String containing Gemini 2.0 formatted locations and labels
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_google_gemini_2_5(
    result: str,
    resolution_wh: tuple[int, int],
    classes: list[str] | None = None
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Google Gemini 2.5 formatted text.

    Args:
        result: String containing Gemini 2.5 formatted locations and labels
        resolution_wh: Target resolution (width, height) for scaling coordinates
        classes: Optional list of valid class names for filtering

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """

def from_moondream(
    result: dict,
    resolution_wh: tuple[int, int]
) -> tuple[np.ndarray, np.ndarray | None, np.ndarray]:
    """
    Parse bounding boxes from Moondream model results.

    Args:
        result: Dictionary containing Moondream model output
        resolution_wh: Target resolution (width, height) for scaling coordinates

    Returns:
        Tuple of (xyxy, class_names, confidence_scores)
    """
```
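All of these parsers return `(xyxy, class_names, confidence)`, where `class_names` is an array of strings, while `Detections` expects an integer `class_id`. A small helper can bridge the two; `class_ids_from_names` is a hypothetical name introduced here for illustration, not part of the library:

```python
import numpy as np


def class_ids_from_names(class_names, classes):
    """Map class-name strings to integer ids via a reference list.

    Hypothetical helper: names absent from `classes` are assigned -1.
    """
    lookup = {name: idx for idx, name in enumerate(classes)}
    return np.array([lookup.get(name, -1) for name in class_names], dtype=int)


ids = class_ids_from_names(
    np.array(["person", "car", "person"]), classes=["person", "car"]
)
print(ids)  # [0 1 0]
```

Keeping the id assignment tied to a fixed `classes` list makes ids stable across frames, which matters if the detections later feed a tracker.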
## Usage Examples

### Using PaliGemma for Object Detection
```python
import numpy as np
import supervision as sv

# PaliGemma-formatted result string: four <loc####> tokens followed by a
# label, with multiple detections separated by ";"
paligemma_result = (
    "<loc0256><loc0256><loc0768><loc0768> person ; "
    "<loc0100><loc0200><loc0300><loc0400> car"
)

# Parse the results
xyxy, class_names, confidence = sv.from_paligemma(
    result=paligemma_result,
    resolution_wh=(1280, 720),
    classes=["person", "car", "bicycle"]  # Optional filtering
)

# Create a Detections object
detections = sv.Detections(
    xyxy=xyxy,
    class_id=np.arange(len(xyxy)),
    confidence=confidence
)

print(f"Found {len(detections)} objects")
```
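To see what the parser does under the hood, the `<loc####>` tokens can be decoded by hand. This sketch assumes PaliGemma's documented convention of a 0-1023 coordinate grid with values ordered y1, x1, y2, x2 (verify against the model card for your checkpoint); `decode_loc_tokens` is a hypothetical name:

```python
import re
import numpy as np


def decode_loc_tokens(segment: str, resolution_wh: "tuple[int, int]") -> np.ndarray:
    """Decode four <loc####> tokens into a pixel-space xyxy box.

    Assumes values on a 0-1023 grid, ordered y1, x1, y2, x2.
    """
    w, h = resolution_wh
    y1, x1, y2, x2 = [int(v) for v in re.findall(r"<loc(\d{4})>", segment)]
    # Scale each normalized coordinate to the target resolution
    return np.array([x1 / 1024 * w, y1 / 1024 * h, x2 / 1024 * w, y2 / 1024 * h])


box = decode_loc_tokens("<loc0256><loc0256><loc0768><loc0768> person", (1280, 720))
print(box)  # [320. 180. 960. 540.]
```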
207
208
### Working with Florence-2 Model
209
210
```python
211
import supervision as sv
212
213
# Florence-2 model result dictionary
214
florence_result = {
215
"<OD>": {
216
"bboxes": [[100, 200, 300, 400], [500, 100, 800, 300]],
217
"labels": ["person", "car"]
218
}
219
}
220
221
# Parse Florence-2 results
222
xyxy, class_names, confidence = sv.from_florence_2(
223
result=florence_result,
224
resolution_wh=(1920, 1080)
225
)
226
227
# Create detections
228
detections = sv.Detections(
229
xyxy=xyxy,
230
class_id=np.arange(len(xyxy)),
231
confidence=confidence
232
)
233
234
# Annotate the image
235
box_annotator = sv.BoxAnnotator()
236
label_annotator = sv.LabelAnnotator()
237
238
annotated_image = box_annotator.annotate(image, detections)
239
annotated_image = label_annotator.annotate(annotated_image, detections)
240
```
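For comparison, Qwen2.5-VL typically emits its detections as a JSON list, often wrapped in a markdown code fence, with `bbox_2d` and `label` keys (this format is an assumption to verify against your model's output). The raw payload can be recovered with the standard library before handing it to `sv.from_qwen_2_5_vl`:

```python
import json
import re

# Example raw model output (assumed format: fenced JSON with bbox_2d/label keys)
qwen_output = """```json
[
  {"bbox_2d": [100, 50, 400, 300], "label": "person"},
  {"bbox_2d": [500, 80, 700, 250], "label": "car"}
]
```"""

# Strip the markdown fence (if present) and parse the JSON payload
match = re.search(r"\[.*\]", qwen_output, re.DOTALL)
items = json.loads(match.group(0))
boxes = [item["bbox_2d"] for item in items]
labels = [item["label"] for item in items]
print(labels)  # ['person', 'car']
```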
### Validating VLM Parameters

```python
import supervision as sv

# Validate VLM configuration
try:
    vlm = sv.validate_vlm_parameters(
        vlm="florence_2",
        result={"<OD>": {"bboxes": [], "labels": []}},
        kwargs={"resolution_wh": (640, 480)}
    )
    print(f"Valid VLM: {vlm}")
except ValueError as e:
    print(f"Invalid VLM configuration: {e}")
```
### Integration with Detections.from_* Methods

The VLM parsers can be used with the core Detections class through custom integration:
```python
import numpy as np
import supervision as sv

def create_detections_from_vlm(vlm_type: str, result, **kwargs):
    """Helper function to create Detections from VLM results."""
    if vlm_type == "paligemma":
        xyxy, class_names, confidence = sv.from_paligemma(result, **kwargs)
    elif vlm_type == "florence_2":
        xyxy, class_names, confidence = sv.from_florence_2(result, **kwargs)
    # Add other VLM types here...
    else:
        raise ValueError(f"Unsupported VLM type: {vlm_type!r}")

    return sv.Detections(
        xyxy=xyxy,
        confidence=confidence,
        class_id=np.arange(len(xyxy)) if len(xyxy) > 0 else np.array([])
    )

# Usage
detections = create_detections_from_vlm(
    vlm_type="paligemma",
    result=paligemma_result,
    resolution_wh=(1280, 720),
    classes=["person", "car"]
)
```
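As more models are supported, an if/elif chain like this grows awkward; a parser registry keyed by model name is a common alternative. The sketch below uses a stub parser in place of the real `sv.from_*` functions so it stands alone (`PARSERS` and `parse_vlm_result` are hypothetical names):

```python
import numpy as np


# Stub parser standing in for the sv.from_* functions (illustration only)
def parse_stub(result, **kwargs):
    return np.empty((0, 4)), None, np.empty(0)


# Registry mapping model name -> parser callable
PARSERS = {
    "paligemma": parse_stub,
    "florence_2": parse_stub,
}


def parse_vlm_result(vlm_type: str, result, **kwargs):
    """Dispatch to the registered parser for the given VLM type."""
    try:
        parser = PARSERS[vlm_type]
    except KeyError:
        raise ValueError(f"Unsupported VLM type: {vlm_type!r}")
    return parser(result, **kwargs)


xyxy, class_names, confidence = parse_vlm_result("paligemma", "raw output")
print(xyxy.shape)  # (0, 4)
```

Registering parsers in a dict keeps the dispatch logic in one place and makes adding a new model a one-line change.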
## Supported Tasks

### Florence-2 Supported Tasks

Florence-2 supports multiple computer vision tasks through different task prompts:

- `<OD>`: Object Detection
- `<CAPTION_TO_PHRASE_GROUNDING>`: Caption to phrase grounding
- `<DENSE_REGION_CAPTION>`: Dense region captioning
- `<REGION_PROPOSAL>`: Region proposal generation
- `<OCR_WITH_REGION>`: OCR with region detection
- `<REFERRING_EXPRESSION_SEGMENTATION>`: Referring expression segmentation
- `<REGION_TO_SEGMENTATION>`: Region to segmentation
- `<OPEN_VOCABULARY_DETECTION>`: Open vocabulary detection
- `<REGION_TO_CATEGORY>`: Region to category classification
- `<REGION_TO_DESCRIPTION>`: Region to description
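As the Florence-2 example earlier shows, the model's result dict is keyed by the task token used as the prompt. A small helper (hypothetical, illustration only) can extract whichever task payload is present without hard-coding one token:

```python
# Subset of Florence-2 task tokens, for illustration
FLORENCE_TASKS = {"<OD>", "<OCR_WITH_REGION>", "<OPEN_VOCABULARY_DETECTION>"}


def extract_task_payload(result: dict):
    """Return (task_token, payload) for the task key present in a result dict."""
    for token in FLORENCE_TASKS:
        if token in result:
            return token, result[token]
    raise ValueError(f"No known task token in result keys: {list(result)}")


task, payload = extract_task_payload(
    {"<OD>": {"bboxes": [[0, 0, 10, 10]], "labels": ["cat"]}}
)
print(task)  # <OD>
```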