Optimize LangChain application performance: latency, throughput, streaming, caching, batch processing, and connection pooling. Trigger: "langchain performance", "langchain optimization", "langchain latency", "langchain slow", "speed up langchain".
Optimize LangChain apps for production: measure baseline latency, implement caching, batch with concurrency control, stream for perceived speed, optimize prompts for fewer tokens, and select the right model for each task.
async function benchmark(
chain: { invoke: (input: any) => Promise<any> },
input: any,
iterations = 5,
) {
const times: number[] = [];
for (let i = 0; i < iterations; i++) {
const start = performance.now();
await chain.invoke(input);
times.push(performance.now() - start);
}
times.sort((a, b) => a - b);
return {
mean: (times.reduce((a, b) => a + b, 0) / times.length).toFixed(0) + "ms",
median: times[Math.floor(times.length / 2)].toFixed(0) + "ms",
p95: times[Math.floor(times.length * 0.95)].toFixed(0) + "ms",
min: times[0].toFixed(0) + "ms",
max: times[times.length - 1].toFixed(0) + "ms",
};
}
// Usage
const results = await benchmark(chain, { input: "test" }, 10);
console.table(results);

import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
const chain = ChatPromptTemplate.fromTemplate("{input}")
.pipe(new ChatOpenAI({ model: "gpt-4o-mini", streaming: true }))
.pipe(new StringOutputParser());
// Non-streaming: user waits 2-3s for full response
// Streaming: first token in ~200ms, user sees progress immediately
const stream = await chain.stream({ input: "Explain LCEL" });
for await (const chunk of stream) {
process.stdout.write(chunk);
}
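
Because streaming mainly improves perceived latency, the number worth tracking is time to first chunk rather than total time. The following is a minimal sketch of one way to measure it, assuming only that the stream is an `AsyncIterable<string>`, as the loop above suggests. `measureStream`, `StreamTiming`, and `fakeStream` are hypothetical names introduced for illustration and are not part of LangChain.

```typescript
// Hypothetical helper: records time to first chunk and total time for any
// async iterable of text chunks.
interface StreamTiming {
  firstChunkMs: number | null; // null if the stream produced no chunks
  totalMs: number;
  text: string;
}

async function measureStream(
  stream: AsyncIterable<string>,
  now: () => number = Date.now, // injectable clock, useful in tests
): Promise<StreamTiming> {
  const start = now();
  let firstChunkMs: number | null = null;
  let text = "";
  for await (const chunk of stream) {
    if (firstChunkMs === null) firstChunkMs = now() - start;
    text += chunk;
  }
  return { firstChunkMs, totalMs: now() - start, text };
}

// Stand-in for a model stream so the sketch runs without an API key.
async function* fakeStream(chunks: string[]): AsyncGenerator<string> {
  for (const chunk of chunks) yield chunk;
}
```

With a real chain, the call would look something like `await measureStream(await chain.stream({ input: "Explain LCEL" }))`.
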
// Express SSE endpoint for web apps (assumes `app` is an Express app with JSON body parsing enabled)
app.post("/api/chat/stream", async (req, res) => {
res.setHeader("Content-Type", "text/event-stream");
res.setHeader("Cache-Control", "no-cache");
res.setHeader("Connection", "keep-alive");
const stream = await chain.stream({ input: req.body.input });
for await (const chunk of stream) {
res.write(`data: ${JSON.stringify({ text: chunk })}\n\n`);
}
res.write("data: [DONE]\n\n");
res.end();
});

import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
const chain = ChatPromptTemplate.fromTemplate("Summarize: {text}")
.pipe(new ChatOpenAI({ model: "gpt-4o-mini" }))
.pipe(new StringOutputParser());
// `articles` is assumed to be a string[] of article texts defined elsewhere
const inputs = articles.map((text) => ({ text }));
// Sequential: ~10s for 10 items (1s each)
// const results = [];
// for (const input of inputs) results.push(await chain.invoke(input));
// Batch: ~2s for 10 items (parallel API calls)
const results = await chain.batch(inputs, {
maxConcurrency: 10,
});
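
To make the effect of a concurrency cap concrete, here is a rough sketch of the general technique: a fixed number of "lanes" each pull the next unprocessed item until the input runs out. This is not LangChain's implementation of `maxConcurrency`; `mapWithConcurrency` is a hypothetical helper used only to illustrate the idea.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit` calls
// in flight, returning results in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane claims the next unprocessed index until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

A lower limit trades throughput for fewer simultaneous requests, which is usually the first thing to adjust when a provider starts returning rate-limit errors.
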
// Benchmark comparison
console.time("sequential");
for (const i of inputs.slice(0, 5)) await chain.invoke(i);
console.timeEnd("sequential");
console.time("batch");
await chain.batch(inputs.slice(0, 5), { maxConcurrency: 5 });
console.timeEnd("batch");

// In-memory cache (single process, resets on restart)
const cache = new Map<string, string>();
async function cachedInvoke(
chain: any,
input: Record<string, any>,
): Promise<string> {
const key = JSON.stringify(input);
const cached = cache.get(key);
// Compare against undefined so a cached empty string still counts as a hit
if (cached !== undefined) return cached;
const result = await chain.invoke(input);
cache.set(key, result);
return result;
}
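
One caveat with using `JSON.stringify(input)` as the cache key is that it depends on property insertion order, so `{ a: 1, b: 2 }` and `{ b: 2, a: 1 }` produce different keys and miss each other's cache entries. A possible workaround, sketched here under the assumption that inputs are plain JSON-like values, is to sort object keys before serializing. `stableKey` is a hypothetical helper, not a LangChain function.

```typescript
// Hypothetical helper: serializes a JSON-like value with object keys sorted,
// so logically equal inputs map to the same cache key.
function stableKey(value: unknown): string {
  if (Array.isArray(value)) {
    // Array order is meaningful, so it is preserved.
    return "[" + value.map((item) => stableKey(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + stableKey(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  // Primitives defer to JSON.stringify; undefined becomes the string "null".
  return JSON.stringify(value) ?? "null";
}
```

Swapping `JSON.stringify(input)` for `stableKey(input)` in a function like `cachedInvoke` would be the only change needed.
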
// Cache hit: ~0ms (vs ~500-2000ms for API call)

# Python: built-in caching
from langchain_core.globals import set_llm_cache
from langchain_community.cache import SQLiteCache, InMemoryCache
# Option 1: In-memory (single process)
set_llm_cache(InMemoryCache())
# Option 2: SQLite (persistent, survives restarts)
set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))
# Option 3: Redis (distributed, production)
from langchain_community.cache import RedisCache
import redis
set_llm_cache(RedisCache(redis.Redis.from_url("redis://localhost:6379")))

import { ChatOpenAI } from "@langchain/openai";
// Fast + cheap: simple tasks, classification, extraction
const fast = new ChatOpenAI({
model: "gpt-4o-mini", // ~200ms TTFT, $0.15/1M input
temperature: 0,
});
// Powerful + slower: complex reasoning, code generation
const powerful = new ChatOpenAI({
model: "gpt-4o", // ~400ms TTFT, $2.50/1M input
temperature: 0,
});
// Route based on task (classifyChain, reasoningChain, and defaultChain are assumed to be defined elsewhere)
import { RunnableBranch } from "@langchain/core/runnables";
const router = RunnableBranch.from([
[(input: any) => input.task === "classify", classifyChain],
[(input: any) => input.task === "reason", reasoningChain],
defaultChain,
]);

// Shorter prompts = fewer input tokens = lower latency + cost
// BEFORE (150+ tokens):
const verbose = `You are an expert AI assistant specialized in software
engineering. Your task is to carefully analyze the following code and
provide a comprehensive review covering all aspects including...`;
// AFTER (~20 tokens; confirm output quality on your own test cases):
const concise = "Review this code. List issues and fixes:\n\n{code}";
// Token counting (Python)
// import tiktoken
// enc = tiktoken.encoding_for_model("gpt-4o-mini")
// print(len(enc.encode(prompt))) # check before deploying

| Optimization | Latency Improvement | Cost Impact |
|---|---|---|
| Streaming | First token 80% faster | Neutral |
| Caching | 99% on cache hit | Major savings |
| Batch processing | 50-80% for bulk ops | Neutral |
| gpt-4o-mini vs gpt-4o | ~2x faster TTFT | ~17x cheaper |
| Shorter prompts | 10-30% | 10-50% cheaper |
| maxConcurrency tuning | Near-linear throughput gain until rate limits apply | Neutral |
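
The "~17x cheaper" figure in the table follows directly from the input-token prices quoted in the model comments earlier in this document ($2.50 vs $0.15 per 1M input tokens). The sketch below works through that arithmetic. `inputCostUsd` is a hypothetical helper, it covers input tokens only, and the prices are copied from this document and may be out of date.

```typescript
// Input-token prices in USD per 1M tokens, copied from the model comments
// earlier in this document; check current pricing before relying on them.
const INPUT_PRICE_PER_MILLION = {
  "gpt-4o-mini": 0.15,
  "gpt-4o": 2.5,
} as const;

type PricedModel = keyof typeof INPUT_PRICE_PER_MILLION;

// Hypothetical helper: input-side cost only; output tokens are not included.
function inputCostUsd(model: PricedModel, inputTokens: number): number {
  return (inputTokens / 1_000_000) * INPUT_PRICE_PER_MILLION[model];
}

// 2.5 / 0.15 is about 16.7, which the table rounds to ~17x.
const ratio =
  INPUT_PRICE_PER_MILLION["gpt-4o"] / INPUT_PRICE_PER_MILLION["gpt-4o-mini"];
```

The same helper makes the prompt-shortening row easy to sanity check: cutting a prompt from 150 to 20 input tokens scales the input cost of that prompt by 20/150, whichever model is used.
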

| Error | Cause | Fix |
|---|---|---|
| Batch partially fails | Rate limit on some items | Lower maxConcurrency, add maxRetries |
| Stream hangs | Network timeout | Set timeout on model, handle disconnect |
| Cache stale data | Content changed upstream | Add TTL or version key to cache |
| High memory usage | Large cache | Use LRU eviction or Redis |
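
The last two rows of the table (stale data and high memory usage) can be addressed together in an in-memory cache. What follows is a minimal sketch, assuming a single Node.js process and string keys. `TtlLruCache` is a hypothetical class written for illustration, not a LangChain API; it relies on `Map` preserving insertion order to track recency.

```typescript
// Hypothetical in-memory cache with a size cap (LRU eviction) and a TTL.
class TtlLruCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful in tests
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

For the "version key" fix, one option is to prefix each key with a prompt or model version string, so that changing the prompt naturally invalidates older entries.
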
Use langchain-cost-tuning for cost optimization alongside performance.