Guide for implementing Google Gemini API image understanding - analyze images with captioning, classification, visual QA, object detection, segmentation, and multi-image comparison. Use when analyzing images, answering visual questions, detecting objects, or processing documents with vision.
This skill enables Claude to use Google's Gemini API for advanced image understanding tasks including captioning, classification, visual question answering, object detection, segmentation, and multi-image analysis.
pip install google-genai (Python 3.9+)

The skill checks for GEMINI_API_KEY in this order:
1. Process environment variable (recommended)
   export GEMINI_API_KEY="your-api-key"
2. Skill directory: .claude/skills/gemini-vision/.env
   GEMINI_API_KEY=your-api-key
3. Project directory: .env or .gemini_api_key in project root
Security: Never commit API keys to version control. Add .env to .gitignore.
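The lookup order above can be sketched in Python as follows. This is a minimal illustration, not the skill's actual code: the names `find_api_key` and `_read_env_file` are hypothetical, the skill directory is assumed to be relative to the project root, and `.gemini_api_key` is assumed to contain only the bare key.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

KEY_NAME = "GEMINI_API_KEY"


def _read_env_file(path: Path, name: str) -> Optional[str]:
    """Return the value of a NAME=value line from a .env-style file, if any."""
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes; treat an empty value as missing.
            return value.strip().strip('"').strip("'") or None
    return None


def find_api_key(
    project_root: Path,
    skill_dir: Optional[Path] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Resolve the key using the three-step order described above."""
    env = os.environ if environ is None else environ

    # 1. Process environment variable (recommended).
    key = env.get(KEY_NAME)
    if key:
        return key

    # 2. Skill directory .env (assumed to live under the project root).
    if skill_dir is None:
        skill_dir = project_root / ".claude" / "skills" / "gemini-vision"
    key = _read_env_file(skill_dir / ".env", KEY_NAME)
    if key:
        return key

    # 3. Project root: .env first, then .gemini_api_key (assumed bare key).
    key = _read_env_file(project_root / ".env", KEY_NAME)
    if key:
        return key
    key_file = project_root / ".gemini_api_key"
    if key_file.is_file():
        return key_file.read_text(encoding="utf-8").strip() or None
    return None
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.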
# Analyze a local image
python scripts/analyze-image.py path/to/image.jpg "What's in this image?"
# Analyze from URL
python scripts/analyze-image.py https://example.com/image.jpg "Describe this"
# Specify model
python scripts/analyze-image.py image.jpg "Caption this" --model gemini-2.5-pro

# Detect objects
python scripts/analyze-image.py image.jpg "Detect all objects" --model gemini-2.0-flash

# Compare multiple images
python scripts/analyze-image.py img1.jpg img2.jpg "What's different between these?"

# Upload file
python scripts/upload-file.py path/to/large-image.jpg
# Use uploaded file
python scripts/analyze-image.py file://file-id "Caption this"

# List uploaded files
python scripts/manage-files.py list
# Get file info
python scripts/manage-files.py get file-id
# Delete file
python scripts/manage-files.py delete file-id

Images consume tokens based on size:
Token Formula:
crop_unit = floor(min(width, height) / 1.5)
tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258

Example: 960×540 image: crop_unit = 360, tiles = 3 × 2 = 6, total = 6 × 258 = 1,548 tokens
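As a quick check of the formula, the sketch below estimates the token cost of an image from its dimensions. It rounds each quotient up to a whole tile, which is what the 960×540 example implies. The function name `estimate_image_tokens` and the guard against a zero crop unit for tiny images are assumptions added for illustration. The sketch applies the formula uniformly and does not model any special cases the API may apply.

```python
import math

TOKENS_PER_TILE = 258  # per-tile cost stated in the formula above


def estimate_image_tokens(width: int, height: int) -> int:
    """Estimate image tokens from pixel dimensions using the tile formula."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # max(1, ...) is an added guard so very small images do not divide by zero.
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    # Partial tiles count as whole tiles, hence the ceil on each axis.
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE


if __name__ == "__main__":
    print(estimate_image_tokens(960, 540))  # 6 tiles x 258 = 1548
```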
Limits vary by tier (Free, Tier 1, 2, 3):
Common errors:
See the references/ directory for:
When implementing Gemini vision features:
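All scripts support the 3-step API key lookup: the process environment variable first, then the skill directory .env, then the project root (.env or .gemini_api_key).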
Run any script with --help for detailed usage instructions.
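For callers who prefer to call the API directly instead of going through the scripts, a minimal sketch using the google-genai package might look like this. The helper names `guess_mime_type` and `caption_image`, the default model, and the choice to send the image inline as bytes are illustrative assumptions. They do not describe how the bundled scripts work internally.

```python
import mimetypes
import os
from pathlib import Path


def guess_mime_type(path: str) -> str:
    """Guess an image MIME type from the file extension (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {path!r}")
    return mime


def caption_image(path: str, prompt: str, model: str = "gemini-2.0-flash") -> str:
    """Send one local image plus a text prompt and return the text reply."""
    # Imported lazily so the helper above works without the SDK installed.
    from google import genai
    from google.genai import types

    # Assumes the key is available in the process environment.
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    image_part = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_mime_type(path),
    )
    response = client.models.generate_content(
        model=model,
        contents=[image_part, prompt],
    )
    return response.text or ""


if __name__ == "__main__":
    import sys

    # Usage: python this_file.py path/to/image.jpg "What's in this image?"
    print(caption_image(sys.argv[1], sys.argv[2]))
```

Inline bytes suit small images. For large files, the upload workflow handled by scripts/upload-file.py is likely the better fit.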
Official Documentation: https://ai.google.dev/gemini-api/docs/image-understanding