Autonomous agents are AI systems that can independently decompose goals, plan actions, execute tools, and self-correct without constant human guidance. The challenge isn't making them capable - it's making them reliable. Every extra decision multiplies failure probability.
This skill covers agent loops (ReAct, Plan-Execute), goal decomposition, reflection patterns, and production reliability. Key insight: compounding error rates kill autonomous agents. A 95% success rate per step drops to 60% by step 10. Build for reliability first, autonomy second.
2025 lesson: The winners are constrained, domain-specific agents with clear boundaries, not "autonomous everything." Treat AI outputs as proposals, not truth.
Alternating reasoning and action steps
When to use: Interactive problem-solving, tool use, exploration
The ReAct loop: Thought → Action → Observation, repeated until the agent reaches a final answer. Key: explicit reasoning traces make debugging possible.
```python
from langchain.agents import create_react_agent
from langchain_openai import ChatOpenAI

react_prompt = '''Answer the question using the following format:

Question: the input question
Thought: reason about what to do
Action: tool_name
Action Input: input to the tool
Observation: result of the action
... (repeat Thought/Action/Observation as needed)
Thought: I now know the final answer
Final Answer: the answer'''

agent = create_react_agent(
    llm=ChatOpenAI(model="gpt-4o"),
    tools=tools,
    prompt=react_prompt,
)

result = agent.invoke(
    {"input": query},
    config={"max_iterations": 10},  # Prevent runaway loops
)
```
```python
import os

from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.postgres import PostgresSaver

checkpointer = PostgresSaver.from_conn_string(
    os.environ["POSTGRES_URL"]
)

agent = create_react_agent(
    model=llm,
    tools=tools,
    checkpointer=checkpointer,  # Durable state
)

config = {"configurable": {"thread_id": "user-123"}}
result = agent.invoke({"messages": [query]}, config)
```
Separate planning phase from execution
When to use: Complex multi-step tasks, when full plan visibility matters
Two-phase approach: a planner first produces a complete step-by-step plan, then an executor runs each step in order.
Advantages: the full plan is visible, and reviewable, before anything executes.
Disadvantages: plans go stale when a step fails or the environment changes, so re-planning is needed.
```python
from langgraph.prebuilt import create_plan_and_execute_agent

planner_prompt = '''For the given objective, create a step-by-step plan.
Each step should be atomic and actionable.
Format: numbered list of steps.'''

executor_prompt = '''You are executing step {step_number} of the plan.
Previous results: {previous_results}
Current step: {current_step}
Execute this step using available tools.'''

agent = create_plan_and_execute_agent(
    planner=planner_llm,
    executor=executor_llm,
    tools=tools,
    replan_on_error=True,  # Re-plan if a step fails
)

config = {
    "configurable": {"thread_id": "task-456"},
    "interrupt_before": ["execute"],  # Pause before execution
}

plan = agent.invoke({"objective": goal}, config)

if human_approves(plan):
    result = agent.invoke(None, config)  # Continue from checkpoint
```
```python
def interleaved_execute(goal, max_steps=10):
    state = {"goal": goal, "completed": [], "remaining": [goal]}

    for step in range(max_steps):
        # Plan next action based on current state
        next_action = planner.plan_next(state)
        if next_action == "DONE":
            break

        # Execute and update state
        result = executor.execute(next_action)
        state["completed"].append((next_action, result))

        # Re-evaluate remaining work
        state["remaining"] = planner.reassess(state)

    return state
```
Self-evaluation and iterative improvement
When to use: Quality matters, complex outputs, creative tasks
Self-correction loop: generate → critique → refine, repeated until the output passes evaluation or an iteration limit is reached.
Also called: Evaluator-Optimizer, Self-Critique
```python
def reflect_and_improve(task, max_iterations=3):
    # Initial generation
    output = generator.generate(task)

    for i in range(max_iterations):
        # Evaluate output
        critique = evaluator.critique(
            task=task,
            output=output,
            criteria=[
                "Correctness",
                "Completeness",
                "Clarity",
            ],
        )
        if critique["passes_all"]:
            return output

        # Refine based on critique
        output = generator.refine(
            task=task,
            previous_output=output,
            critique=critique["feedback"],
        )

    return output  # Best effort after max iterations
```
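The `generator` and `evaluator` above are placeholders. The control flow itself can be exercised with hypothetical stubs (note the signature is widened here to take the stubs as arguments, purely for illustration):

```python
# Hypothetical stubs to exercise the reflect-and-improve control flow.
class StubGenerator:
    def generate(self, task):
        return "draft v1"

    def refine(self, task, previous_output, critique):
        # Bump the draft version on each refinement pass
        version = int(previous_output.rsplit("v", 1)[1]) + 1
        return f"draft v{version}"

class StubEvaluator:
    def critique(self, task, output):
        # Pretend the draft passes after two refinements
        passes = output.endswith("v3")
        return {"passes_all": passes, "feedback": "needs more detail"}

def reflect_and_improve(task, generator, evaluator, max_iterations=3):
    output = generator.generate(task)
    for _ in range(max_iterations):
        critique = evaluator.critique(task=task, output=output)
        if critique["passes_all"]:
            return output
        output = generator.refine(task, output, critique["feedback"])
    return output

print(reflect_and_improve("summarize", StubGenerator(), StubEvaluator()))
# → draft v3
```

The loop terminates either on a passing critique or on the iteration cap, never by accident.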
```python
from langgraph.graph import StateGraph

def should_continue(state):
    if state["iteration"] >= 3:
        return "end"
    if state["score"] >= 0.9:
        return "end"
    return "continue"

def build_reflection_graph():
    graph = StateGraph(ReflectionState)

    # Nodes
    graph.add_node("generate", generate_node)
    graph.add_node("reflect", reflect_node)
    graph.add_node("output", output_node)

    # Edges
    graph.add_edge("generate", "reflect")
    graph.add_conditional_edges(
        "reflect",
        should_continue,
        {
            "continue": "generate",  # Loop back
            "end": "output",
        },
    )

    return graph.compile()
```
```python
from langchain_openai import ChatOpenAI

generator = ChatOpenAI(model="gpt-4o")
evaluator = ChatOpenAI(model="gpt-4o-mini")  # Different perspective

# Or use a prebuilt criteria evaluator
from langchain.evaluation import load_evaluator
evaluator = load_evaluator("criteria", criteria="correctness")
```
Constrained agents with safety boundaries
When to use: Production systems, critical operations
Production agents need multiple safety layers: step limits, cost limits, action allowlists, and human approval for sensitive operations.
```python
class GuardedAgent:
    def __init__(self, agent, config):
        self.agent = agent
        self.max_cost = config.get("max_cost_usd", 1.0)
        self.max_steps = config.get("max_steps", 10)
        self.allowed_actions = config.get("allowed_actions", [])
        self.require_approval = config.get("require_approval", [])

    async def execute(self, goal):
        total_cost = 0
        steps = 0

        while steps < self.max_steps:
            # Get next action
            action = await self.agent.plan_next(goal)

            # Validate action is allowed
            if action.name not in self.allowed_actions:
                raise ActionNotAllowedError(action.name)

            # Check if approval needed
            if action.name in self.require_approval:
                approved = await self.request_human_approval(action)
                if not approved:
                    return {"status": "rejected", "action": action}

            # Estimate cost
            estimated_cost = self.estimate_cost(action)
            if total_cost + estimated_cost > self.max_cost:
                raise CostLimitExceededError(total_cost)

            # Execute with rollback capability
            checkpoint = await self.save_checkpoint()
            try:
                result = await self.agent.execute(action)
                total_cost += self.actual_cost(action)
                steps += 1
            except Exception:
                await self.rollback_to(checkpoint)
                raise

            if result.is_complete:
                break

        return {"status": "complete", "total_cost": total_cost}
```
```python
TASK_PERMISSIONS = {
    "research": ["web_search", "read_file"],
    "coding": ["read_file", "write_file", "run_tests"],
    "admin": ["all"],  # Rarely grant this
}

def create_scoped_agent(task_type):
    allowed = TASK_PERMISSIONS.get(task_type, [])
    tools = [t for t in ALL_TOOLS if t.name in allowed]
    return Agent(tools=tools)
```
```python
def trim_context(messages, max_tokens=4000):
    # Keep system message and recent messages
    system = messages[0]
    recent = messages[-10:]

    # Summarize middle if needed
    if len(messages) > 11:
        middle = messages[1:-10]
        summary = summarize(middle)
        return [system, summary] + recent

    return messages
```
Agents that survive failures and resume
When to use: Long-running tasks, production systems, multi-day processes
Production agents must: persist state outside the process, resume from the last checkpoint after a crash, and support pausing for human review.
LangGraph 1.0 provides this natively.
```python
import os

from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph

checkpointer = PostgresSaver.from_conn_string(
    os.environ["POSTGRES_URL"]
)

graph = StateGraph(AgentState)
agent = graph.compile(checkpointer=checkpointer)

config = {"configurable": {"thread_id": "long-task-789"}}
agent.invoke({"goal": complex_goal}, config)

# After a crash or restart, resume from the last checkpoint
state = agent.get_state(config)
if not state.is_complete:
    agent.invoke(None, config)  # Continues from checkpoint
```
```python
agent = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["critical_action"],  # Pause before
    interrupt_after=["validation"],        # Pause after
)

result = agent.invoke({"goal": goal}, config)

state = agent.get_state(config)
if human_approves(state):
    # Continue from pause point
    agent.invoke(None, config)
else:
    # Modify state and continue
    agent.update_state(config, {"approved": False})
    agent.invoke(None, config)
```
```python
# Inspect past states and rewind ("time travel")
history = list(agent.get_state_history(config))

past_state = history[5]
agent.update_state(config, past_state.values)

agent.invoke(None, config)
```
Severity: CRITICAL
Situation: Building multi-step autonomous agents
Symptoms: Agent works in demos but fails in production. Simple tasks succeed, complex tasks fail mysteriously. Success rate drops dramatically as task complexity increases. Users lose trust.
Why this breaks: Each step has independent failure probability. A 95% success rate per step sounds great until you compound it: roughly 77% overall success after 5 steps, 60% after 10, and 36% after 20.
This is the fundamental limit of autonomous agents. Every additional step multiplies failure probability.
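The arithmetic is one line: overall success is per-step success raised to the step count (illustrative numbers, not a benchmark):

```python
def chain_success(p: float, n: int) -> float:
    # Overall success of an n-step chain where each step
    # independently succeeds with probability p
    return p ** n

for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {chain_success(0.95, n):.0%}")
```

At 95% per step, 10 steps lands at ~60% overall, which is the figure cited above.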
Recommended fix:
```python
class RobustAgent:
    def execute_with_retry(self, step, max_retries=3):
        for attempt in range(max_retries):
            try:
                result = step.execute()
                if self.validate(result):
                    return result
            except Exception as e:
                if attempt == max_retries - 1:
                    raise
                self.log_retry(step, attempt, e)
```
Severity: CRITICAL
Situation: Running agents with growing conversation context
Symptoms: $47 to close a single support ticket. Thousands in surprise API bills. Agents getting slower as they run longer. Token counts exceeding model limits.
Why this breaks: Transformer compute scales quadratically with context length - double the context, quadruple the attention cost. And a long-running agent that re-sends its full conversation every turn pays for all past tokens again on each call, so cumulative spend grows quadratically with the number of turns.
Most agents append to context without trimming, so the context grows with every thought, tool call, and observation.
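To make the growth concrete, here is a sketch with hypothetical numbers (~200 tokens added per turn, a 2,000-token trimmed window) comparing tokens billed with and without trimming:

```python
# Hypothetical illustration: tokens billed over a conversation when the
# full history is re-sent every turn, vs. a trimmed context window.
def cumulative_tokens(turns, per_turn=200, window=None):
    total = 0
    for t in range(1, turns + 1):
        context = t * per_turn          # history grows each turn
        if window is not None:
            context = min(context, window)  # trimming caps the context
        total += context
    return total

full = cumulative_tokens(50)                 # grows ~quadratically
trimmed = cumulative_tokens(50, window=2000)  # grows ~linearly
print(full, trimmed)  # → 255000 91000
```

Even at 50 turns, the untrimmed agent sends nearly 3x the tokens; the gap widens with every additional turn.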
Recommended fix:
```python
class CostLimitedAgent:
    MAX_COST_PER_TASK = 1.00  # USD

    def __init__(self):
        self.total_cost = 0

    def before_call(self, estimated_tokens):
        estimated_cost = self.estimate_cost(estimated_tokens)
        if self.total_cost + estimated_cost > self.MAX_COST_PER_TASK:
            raise CostLimitExceeded(
                f"Would exceed ${self.MAX_COST_PER_TASK} limit"
            )

    def after_call(self, response):
        self.total_cost += self.calculate_actual_cost(response)


def trim_context(messages, max_tokens=4000):
    # Keep: system prompt + last N messages
    # Summarize: everything in between
    if count_tokens(messages) <= max_tokens:
        return messages

    system = messages[0]
    recent = messages[-5:]
    middle = messages[1:-5]

    if middle:
        summary = summarize(middle)  # Compress history
        return [system, summary] + recent

    return [system] + recent
```
Severity: CRITICAL
Situation: Moving from prototype to production
Symptoms: Impressive demo to stakeholders. Months of failure in production. Works for the founder's use case, fails for real users. Edge cases overwhelm the system.
Why this breaks: Demos show the happy path with curated inputs. Production means: messy inputs, users who ignore the happy path, edge cases the demo never exercised, and failures that compound.
Whatever the exact failure-rate numbers, the core problem is real. The gap between a working demo and a reliable production system is where projects die.
Recommended fix:
```python
import structlog

logger = structlog.get_logger()

class ObservableAgent:
    def execute(self, task):
        log = logger.bind(task_id=task.id)  # bind returns a new logger
        log.info("task_started")
        try:
            result = self._execute(task)
            log.info("task_completed", result=result)
            return result
        except Exception as e:
            log.error("task_failed", error=str(e))
            raise
```
Severity: HIGH
Situation: Agent can't complete task with available information
Symptoms: Agent invents plausible-looking data. Fake restaurant names on expense reports. Made-up statistics in reports. Confident answers that are completely wrong.
Why this breaks: LLMs are trained to be helpful and produce plausible outputs. When stuck, they don't say "I can't do this" - they fabricate. Autonomous agents compound this by acting on fabricated data without human review.
The agent that fabricated expense entries was trying to meet its goal (complete the expense report). It "solved" the problem by inventing data.
Recommended fix:
```python
def validate_expense(expense):
    # Cross-check with external sources
    if expense.restaurant:
        if not verify_restaurant_exists(expense.restaurant):
            raise ValidationError("Restaurant not found")

    # Check for suspicious patterns
    if expense.amount == round(expense.amount, -1):
        flag_for_review("Suspiciously round amount")


system_prompt = '''
For every factual claim, cite the specific tool output that supports it.
If you cannot find supporting evidence, say "I could not verify this"
rather than guessing.
'''
```
```python
from pydantic import BaseModel

class VerifiedClaim(BaseModel):
    claim: str
    source: str  # Must reference tool output
    confidence: float
```
Severity: HIGH
Situation: Connecting agent to external systems
Symptoms: Works with mock APIs, fails with real ones. Rate limits cause crashes. Auth tokens expire mid-task. Data format mismatches. Partial failures leave systems in inconsistent state.
Why this breaks: The companies promising "autonomous agents that integrate with your entire tech stack" haven't built production systems at scale. Real integrations have: rate limits, expiring auth tokens, inconsistent data formats, and partial failures that leave systems in inconsistent states.
Recommended fix:
```python
import asyncio

from tenacity import retry, stop_after_attempt, wait_exponential

class RobustAPIClient:
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=60),
    )
    async def call(self, endpoint, data):
        response = await self.client.post(endpoint, json=data)
        if response.status_code == 429:
            # Honor the server's backoff hint, then raise to trigger a retry
            retry_after = response.headers.get("Retry-After", 60)
            await asyncio.sleep(int(retry_after))
            raise RateLimitError()
        return response
```
```python
from datetime import datetime, timedelta

class TokenManager:
    def __init__(self):
        self.token = None
        self.expires_at = None

    async def get_token(self):
        if self.is_expired():
            self.token = await self.refresh_token()
        return self.token

    def is_expired(self):
        buffer = timedelta(minutes=5)  # Refresh early
        return datetime.now() > (self.expires_at - buffer)
```
Severity: HIGH
Situation: Agent with broad permissions
Symptoms: Agent deletes production data. Sends emails to wrong recipients. Makes purchases without approval. Modifies settings it shouldn't. Actions that can't be undone.
Why this breaks: Agents optimize for their goal. Without guardrails, they'll take the shortest path - even if that path is destructive. An agent told to "clean up the database" might interpret that as "delete everything."
Broad permissions + autonomy + goal optimization = danger.
Recommended fix:
```python
PERMISSIONS = {
    "research_agent": ["read_web", "read_docs"],
    "code_agent": ["read_file", "write_file", "run_tests"],
    "email_agent": ["read_email", "draft_email"],  # NOT send
    "admin_agent": ["all"],  # Rarely used
}

DANGEROUS_ACTIONS = [
    "delete_*",
    "send_email",
    "transfer_money",
    "modify_production",
    "revoke_access",
]

async def execute_action(action):
    if matches_dangerous_pattern(action):
        approval = await request_human_approval(action)
        if not approval:
            return ActionRejected(action)
    return await actually_execute(action)
```
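The `matches_dangerous_pattern` helper above is left undefined. One minimal way to support wildcard entries like `delete_*` is stdlib `fnmatch` (a self-contained sketch, not the skill's actual implementation):

```python
from fnmatch import fnmatch

# Shortened list for illustration; the real list lives in config
DANGEROUS_ACTIONS = [
    "delete_*",
    "send_email",
    "transfer_money",
]

def matches_dangerous_pattern(action_name: str) -> bool:
    # fnmatch gives shell-style wildcards: "delete_*" matches "delete_user"
    return any(fnmatch(action_name, pattern) for pattern in DANGEROUS_ACTIONS)

print(matches_dangerous_pattern("delete_user"))  # → True
print(matches_dangerous_pattern("read_file"))    # → False
```

Matching on action names is a first line of defense; arguments (recipients, amounts, paths) deserve their own checks.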
Severity: MEDIUM
Situation: Long-running agent tasks
Symptoms: Agent forgets earlier instructions. Contradicts itself. Loses track of the goal. Starts repeating itself. Model errors about token limits.
Why this breaks: Every message, observation, and thought consumes context. Long tasks exhaust the window. When context is truncated naively, the earliest content - usually the instructions and the goal - is lost first, producing exactly the symptoms above.
Recommended fix:
```python
class ContextManager:
    def __init__(self, max_tokens=100000):
        self.max_tokens = max_tokens
        self.messages = []

    def add(self, message):
        self.messages.append(message)
        self.maybe_compact()

    def maybe_compact(self):
        if self.token_count() > self.max_tokens * 0.8:
            self.compact()

    def compact(self):
        # Always keep: system prompt
        system = self.messages[0]
        # Always keep: last N messages
        recent = self.messages[-10:]
        # Summarize: everything else
        middle = self.messages[1:-10]
        if middle:
            summary = summarize_messages(middle)
            self.messages = [system, summary] + recent
```
Severity: MEDIUM
Situation: Agent fails mysteriously
Symptoms: "It just didn't work." No idea why agent failed. Can't reproduce issues. Users report problems you can't explain. Debugging is guesswork.
Why this breaks: Agents make dozens of internal decisions. Without visibility into each step, you're blind to failure modes. Production debugging without traces is impossible.
Recommended fix:
```python
import structlog

logger = structlog.get_logger()

class TracedAgent:
    def think(self, context):
        log = logger.bind(step="think")  # bind returns a new logger
        thought = self.llm.generate(context)
        log.info(
            "thought_generated",
            thought=thought,
            tokens=count_tokens(thought),
        )
        return thought

    def act(self, action):
        log = logger.bind(step="act", action=action.name)
        log.info("action_started")
        try:
            result = action.execute()
            log.info("action_completed", result=result)
            return result
        except Exception as e:
            log.error("action_failed", error=str(e))
            raise
```

```python
from langsmith import traceable

@traceable
def agent_step(state):
    # Automatically traced with inputs/outputs
    return next_state
```
Severity: ERROR
Autonomous agents must have maximum step limits
Message: Agent loop without step limit. Add max_steps to prevent infinite loops.
Severity: ERROR
Agents should track and limit API costs
Message: Agent uses LLM without cost tracking. Add cost limits to prevent runaway spending.
Severity: WARNING
Long-running agents need timeouts
Message: Agent invocation without timeout. Add timeout to prevent hung tasks.
Severity: ERROR
MemorySaver is for development only
Message: MemorySaver is not persistent. Use PostgresSaver or SqliteSaver for production.
Severity: WARNING
Agents that run multiple steps need checkpointing
Message: Multi-step agent without checkpointing. Add checkpointer for durability.
Severity: WARNING
Checkpointed agents need unique thread IDs
Message: Agent invocation without thread_id. State won't persist correctly.
Severity: WARNING
Agent outputs should be validated before use
Message: Agent output used without validation. Validate before acting on results.
Severity: INFO
Structured outputs are more reliable
Message: Consider using structured outputs (Pydantic) for more reliable parsing.
Severity: WARNING
Agents should handle and recover from errors
Message: Agent call without error handling. Add try/catch or error handler.
Severity: WARNING
Actions that modify state should be reversible
Message: Destructive action without rollback capability. Save state before modification.
Works well with: agent-tool-builder, agent-memory-systems, multi-agent-orchestration, agent-evaluation