Guide for implementing Google Gemini API image understanding - analyze images with captioning, classification, visual QA, object detection, segmentation, and multi-image comparison. Use when analyzing images, answering visual questions, detecting objects, or processing documents with vision.
This skill enables Claude to use Google's Gemini API for advanced image understanding tasks including captioning, classification, visual question answering, object detection, segmentation, and multi-image analysis.
pip install google-genai (Python 3.9+)
The skill checks for GEMINI_API_KEY in this order:
1. Process environment variable (recommended)
   export GEMINI_API_KEY="your-api-key"
2. Skill directory: .claude/skills/gemini-vision/.env
   GEMINI_API_KEY=your-api-key
3. Project directory: .env or .gemini_api_key in project root
Security: Never commit API keys to version control. Add .env to .gitignore.
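The lookup order above can be sketched in a few lines of Python. This is an illustration only, not the skill's actual code: the function names `find_api_key` and `read_key_from_file` are hypothetical, the skill directory is assumed to be resolved relative to the project root, and a `.gemini_api_key` file is assumed to hold the bare key on a single line.

```python
import os
from pathlib import Path


def read_key_from_file(path):
    """Return the key found in a .env-style or bare-key file, or None."""
    if not path.is_file():
        return None
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("GEMINI_API_KEY="):
            # .env style: GEMINI_API_KEY=value, optionally quoted
            return line.split("=", 1)[1].strip().strip("\"'") or None
        if "=" not in line:
            # Assumption: a line without "=" is a bare key (.gemini_api_key)
            return line
    return None


def find_api_key(project_root=".", environ=None):
    """Check the environment, then the skill directory, then the project root."""
    environ = os.environ if environ is None else environ
    key = environ.get("GEMINI_API_KEY")
    if key:
        return key
    root = Path(project_root)
    candidates = [
        root / ".claude" / "skills" / "gemini-vision" / ".env",  # skill directory
        root / ".env",                                           # project directory
        root / ".gemini_api_key",
    ]
    for candidate in candidates:
        key = read_key_from_file(candidate)
        if key:
            return key
    return None
```

Passing `environ` explicitly makes the function easy to exercise without touching the real process environment. The first source that yields a non-empty key wins, so an exported variable always overrides a key file.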
# Analyze a local image
python scripts/analyze-image.py path/to/image.jpg "What's in this image?"
# Analyze from URL
python scripts/analyze-image.py https://example.com/image.jpg "Describe this"
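The source of scripts/analyze-image.py is not reproduced in this guide. As a rough sketch of the kind of request such a script might send for a local file, the following uses the google-genai SDK (`genai.Client`, `types.Part.from_bytes`, `client.models.generate_content`). The helper names `guess_image_mime_type` and `analyze_image`, and the default model, are assumptions made for illustration.

```python
import mimetypes
import os
from pathlib import Path


def guess_image_mime_type(path):
    """Infer the MIME type from the file extension; reject non-image files."""
    mime_type, _ = mimetypes.guess_type(str(path))
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    return mime_type


def analyze_image(path, prompt, model="gemini-2.0-flash", api_key=None):
    """Send one local image plus a text prompt and return the text reply."""
    # Imported lazily so the MIME helper works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=api_key or os.environ["GEMINI_API_KEY"])
    image_part = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_image_mime_type(path),
    )
    response = client.models.generate_content(
        model=model,
        contents=[image_part, prompt],
    )
    return response.text
```

A call such as `analyze_image("path/to/image.jpg", "What's in this image?")` needs network access and a valid key. This sketch covers local files only; URL inputs and uploaded files are not handled.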
# Specify model
python scripts/analyze-image.py image.jpg "Caption this" --model gemini-2.5-pro
# Detect objects
python scripts/analyze-image.py image.jpg "Detect all objects" --model gemini-2.0-flash
# Compare multiple images
python scripts/analyze-image.py img1.jpg img2.jpg "What's different between these?"
# Upload file
python scripts/upload-file.py path/to/large-image.jpg
# Use uploaded file
python scripts/analyze-image.py file://file-id "Caption this"
# List uploaded files
python scripts/manage-files.py list
# Get file info
python scripts/manage-files.py get file-id
# Delete file
python scripts/manage-files.py delete file-id
Images consume tokens based on size:
Token Formula:
crop_unit = floor(min(width, height) / 1.5)
tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258
Example: a 960×540 image has crop_unit = 360, so tiles = ceil(2.67) × ceil(1.5) = 3 × 2 = 6 and total_tokens = 6 × 258 = 1,548.
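The token formula can be checked with a short Python function. This is a sketch under one stated assumption: each width and height ratio is rounded up to a whole number of tiles, which is the reading that reproduces the 6-tile example for a 960×540 image. The name `estimate_image_tokens` is hypothetical.

```python
import math

TOKENS_PER_TILE = 258


def estimate_image_tokens(width, height):
    """Estimate token cost from image dimensions using the formula above."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Guard against a zero crop unit for degenerate 1-pixel dimensions.
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    # Assumption: partial tiles count as whole tiles (round each ratio up).
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(960, 540))  # 1548
```

Treat the result as an estimate for budgeting rather than an exact bill; the usage metadata returned by the API is the authoritative count.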
Rate limits vary by usage tier (Free, Tier 1, Tier 2, Tier 3).
Common errors:
See the references/ directory for:
When implementing Gemini vision features:
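All scripts support the 3-step API key lookup: the process environment variable first, then the skill directory .env file, then the project directory (.env or .gemini_api_key).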
Run any script with --help for detailed usage instructions.
Official Documentation: https://ai.google.dev/gemini-api/docs/image-understanding