**The Issue**
Generic prompt optimization treats every input the same way. A creative brainstorming prompt gets the same structural changes as a code generation request, which means you're either over-constraining creative work or under-specifying technical tasks. I needed a way to detect what I was actually trying to do with a prompt before deciding how to improve it—without manually tagging every request or building custom routing logic.
**What changed**
I built an intent detection system that reads your prompt once and routes it to the right optimization strategy automatically. When you send a prompt through the Prompt Optimizer, it runs through 6 specialized detection patterns—what I call Precision Locks—that identify whether you're doing creative work, technical implementation, data analysis, research, general tasks, or working with images and video. Each lock looks for different signals: structural markers like code blocks and file references for technical prompts, open-ended language patterns for creative work, citation requests and source requirements for research.
The system doesn't need training data or fine-tuning because it's pattern-based. I measured 91.94% overall accuracy on my own prompt history, with image and video detection hitting 96.4%. That accuracy matters because the wrong optimization strategy actively makes your prompt worse: adding creative flexibility to a code generation request introduces ambiguity that breaks the output. The detection happens in milliseconds, returns a semantic confidence score between 0.0 and 1.0, and costs nothing because I route the analysis through a free model by default.
Once the system knows your intent, it applies context-specific optimization goals. Technical prompts get structural precision and explicit constraints. Creative prompts get expanded possibility space and removed limitations. Research prompts get source verification requirements and citation formats. You don't configure any of this—the detection result automatically selects the right optimization approach, and you see exactly which lock triggered and why in the response metadata.
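As a rough illustration of that routing step, here is a minimal sketch. The dictionary and the function name `select_goals` are hypothetical; only the goal descriptions come from the paragraph above, and the fallback to clarity-only improvements is an assumption about how unlisted contexts are handled.

```python
# Hypothetical mapping from a detected context to its optimization goals.
# The goal wording mirrors the post; the structure is an assumption.
OPTIMIZATION_GOALS = {
    "technical": ["structural precision", "explicit constraints"],
    "creative": ["expanded possibility space", "removed limitations"],
    "research": ["source verification requirements", "citation formats"],
}


def select_goals(context_type):
    """Return the goals for a context; anything unlisted gets clarity only."""
    return OPTIMIZATION_GOALS.get(context_type, ["clarity improvements"])


print(select_goals("technical"))
print(select_goals("general"))
```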
**How it works**
The detection system runs a function called `detect_prompt_context`. When you call it, the system analyzes your prompt text against 6 concurrent pattern matchers:
```python
# Example call from Claude Desktop or any MCP client
detect_prompt_context(
    prompt_text="Write a Python function that validates email addresses using regex",
    analysis_depth="standard"
)
```
Each Precision Lock returns a confidence score. The technical lock looks for: code fence markers, file path patterns (/src/, .py, .js), function signatures, import statements, and explicit technical verbs like "implement", "debug", "refactor". The creative lock scans for: open-ended questions, exploratory language ("imagine", "brainstorm", "what if"), absence of constraints, and requests for multiple alternatives. The research lock detects: citation requirements, source verification requests, academic terminology, and fact-checking language.
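To make the idea of a lock concrete, here is a minimal sketch of what a pattern-based technical lock could look like. Everything in it is an assumption for illustration: the signal names, the regular expressions, and especially the scoring rule (fraction of signals matched), which is a stand-in and not necessarily how the real lock weights its markers.

```python
import re

# Hypothetical signal set for a technical lock, based on the markers listed above.
TECHNICAL_SIGNALS = {
    "code_fence": re.compile(r"`{3}"),
    "file_path": re.compile(r"/src/|\.py\b|\.js\b"),
    "function_signature": re.compile(r"\bdef\s+\w+\s*\(|\bfunction\s+\w+\s*\("),
    "import_statement": re.compile(r"^\s*(import|from)\s+\w+", re.MULTILINE),
    "technical_verb": re.compile(r"\b(implement|debug|refactor)\w*\b", re.IGNORECASE),
}


def technical_lock(prompt_text):
    """Return (score, matched_signal_names) for a prompt.

    Assumed scoring: the share of signals that matched at least once.
    """
    matched = [
        name for name, pattern in TECHNICAL_SIGNALS.items()
        if pattern.search(prompt_text)
    ]
    score = len(matched) / len(TECHNICAL_SIGNALS)
    return score, matched


score, matched = technical_lock("Please refactor /src/app.py")
print(score, matched)
```

The creative and research locks would follow the same shape with their own signal sets.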
The system aggregates scores across all 6 locks and returns the highest-confidence match. For the example above, the technical lock would score ~0.92 because of "Python function", "regex", and the implementation verb "validates". That score triggers the technical optimization strategy, which adds explicit input/output specifications, error handling requirements, and test case expectations to the optimized version.
I set the confidence threshold at 0.75. Below that, the system returns "general" as the detected context and applies minimal optimization: just clarity improvements without strategic changes. This prevents false positives from forcing the wrong optimization approach. The detection result includes: `context_type` (the winning lock), `confidence_score` (0.0-1.0), `detected_patterns` (which specific markers triggered), and `alternative_contexts` (other locks that scored above 0.5, useful for hybrid prompts).
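The aggregation and threshold logic can be sketched roughly as follows. The function name `build_detection_result` and its parameters are hypothetical; the four result keys and the 0.75 and 0.5 values come from the description above. One assumption worth flagging: when the result falls back to "general", this sketch keeps the top-scoring lock in `alternative_contexts` if it cleared 0.5, which the post does not specify.

```python
def build_detection_result(lock_scores, detected_patterns,
                           confidence_threshold=0.75, alternative_floor=0.5):
    """Pick the highest-scoring lock, falling back to 'general' below threshold."""
    winner = max(lock_scores, key=lock_scores.get)
    top_score = lock_scores[winner]
    context_type = winner if top_score >= confidence_threshold else "general"

    # Other locks above the floor, strongest first (assumed ordering).
    alternatives = sorted(
        (name for name, score in lock_scores.items()
         if name != context_type and score > alternative_floor),
        key=lock_scores.get,
        reverse=True,
    )
    return {
        "context_type": context_type,
        "confidence_score": top_score,
        "detected_patterns": detected_patterns.get(winner, []),
        "alternative_contexts": alternatives,
    }


result = build_detection_result(
    {"technical": 0.92, "creative": 0.55, "research": 0.2},
    {"technical": ["technical_verb"]},
)
print(result)
```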
The image/video lock works differently because visual content requests have distinct structural markers: file format mentions (.jpg, .mp4), visual terminology ("render", "frame", "resolution"), and media-specific constraints (aspect ratio, duration, color space). I measured 96.4% accuracy on this lock specifically because the pattern set is more constrained—there are fewer ways to request visual content compared to the open-ended nature of creative or research prompts.
**Metrics**
**Authentic Metrics from Production:**
- **evaluation_cost:** 0 (free model auto-selected)
- **context_types:** 7
- **semantic_score_range:** 0.0-1.0
**Deeper than just rewrites**
The hardest part was handling hybrid prompts—requests that legitimately span multiple contexts. "Write a creative story about a programmer debugging code" triggers both creative and technical locks with similar confidence scores. I initially tried weighted averaging, but that produced muddled optimization strategies that didn't serve either intent well. I switched to a primary-secondary approach: the system picks the highest-scoring lock as primary and exposes the second-highest as an alternative in the metadata. You can manually override if the auto-detection misses your actual intent.
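A minimal sketch of the primary-secondary selection with a manual override might look like this. The function name `rank_contexts` and the `override` parameter are hypothetical, and the scores in the usage line are made up to illustrate a hybrid prompt.

```python
def rank_contexts(lock_scores, override=None):
    """Return (primary, secondary) contexts.

    The primary is the highest-scoring lock unless the caller overrides it;
    the secondary is the best-scoring remaining lock.
    """
    if override is not None:
        if override not in lock_scores:
            raise ValueError(f"unknown context: {override}")
        primary = override
    else:
        primary = max(lock_scores, key=lock_scores.get)

    remaining = sorted(
        (name for name in lock_scores if name != primary),
        key=lock_scores.get,
        reverse=True,
    )
    secondary = remaining[0] if remaining else None
    return primary, secondary


# Illustrative scores for a hybrid creative/technical prompt.
scores = {"creative": 0.81, "technical": 0.78, "research": 0.2}
print(rank_contexts(scores))
print(rank_contexts(scores, override="technical"))
```

Unlike weighted averaging, this never blends two strategies: exactly one context drives the optimization, and the runner-up is only reported.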
I found edge cases where the detection was technically correct but strategically wrong. Short, ambiguous prompts like "improve this" or "make it better" score low across all locks because there's no content to analyze. The system returns "general" context, which is accurate but not useful—you need more specificity in the original prompt before optimization helps. I added a minimum token threshold (15 tokens) below which the system suggests prompt expansion before attempting optimization.
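The length guard can be sketched in a few lines. Note the assumption: the post does not say how tokens are counted, so this sketch uses whitespace-separated words as a crude stand-in for a real tokenizer. The name `needs_expansion` is hypothetical.

```python
MIN_TOKENS = 15  # threshold from the post


def needs_expansion(prompt_text, min_tokens=MIN_TOKENS):
    """Return True when the prompt is too short to classify meaningfully.

    Whitespace splitting approximates token counting here (an assumption).
    """
    return len(prompt_text.split()) < min_tokens


print(needs_expansion("improve this"))
print(needs_expansion(
    "Write a Python function that validates email addresses using regex "
    "and returns a clear error message for each failure"
))
```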
The confidence threshold took iteration to get right. I started at 0.85, which produced too many "general" classifications and missed obvious contexts. At 0.65, I got false positives: creative prompts misclassified as research because they mentioned "exploring ideas". A threshold of 0.75 balanced precision and recall in my own testing, but I exposed it as a configurable parameter (`confidence_threshold`) because different use cases have different tolerance for false positives versus false negatives.
**What I measured**
I measured 91.94% accuracy on my own prompt history—about 500 prompts spanning 6 months of daily use across code generation, content writing, and research tasks. The system correctly identified technical prompts 94% of the time, creative prompts 89% of the time, and research prompts 87% of the time. Image/video detection hit 96.4%, likely because those requests have more distinctive structural markers.
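For anyone wanting to run the same kind of measurement on their own history, here is a minimal sketch of computing overall and per-context accuracy from hand-labeled prompts. The function name and the sample pairs are hypothetical toy data, not the results reported above.

```python
from collections import defaultdict


def accuracy_by_context(samples):
    """Compute per-context and overall accuracy.

    `samples` is an iterable of (expected_context, predicted_context) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for expected, predicted in samples:
        totals[expected] += 1
        if expected == predicted:
            correct[expected] += 1

    per_context = {ctx: correct[ctx] / totals[ctx] for ctx in totals}
    total_count = sum(totals.values())
    overall = sum(correct.values()) / total_count if total_count else 0.0
    return per_context, overall


# Toy labels for illustration only.
toy_samples = [
    ("technical", "technical"),
    ("technical", "creative"),
    ("creative", "creative"),
    ("research", "research"),
]
per_context, overall = accuracy_by_context(toy_samples)
print(per_context, overall)
```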
The accuracy translated into cost reduction because correctly-detected prompts get optimized in ways that reduce token count and retry attempts. I measured a 40% reduction in my own API costs after routing all prompts through context detection. The savings came from two sources: technical prompts became more precise (fewer tokens, fewer clarification rounds), and creative prompts stopped getting over-constrained (fewer regeneration requests because the first output actually matched my intent).
The detection overhead is negligible—analysis completes in under 200ms on average, and I route it through a free model by default so the evaluation cost is zero. The semantic confidence scores proved useful for debugging misclassifications: when I saw a prompt score 0.68 for technical and 0.71 for creative, I knew the prompt itself was ambiguous and needed rewriting before optimization would help. That feedback loop—seeing the confidence scores in real time—improved how I write initial prompts, which compounded the optimization benefits.
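That ambiguity check is easy to automate. Here is a minimal sketch that flags a prompt when its top two lock scores sit too close together. The function name `is_ambiguous` and the 0.05 margin are assumptions chosen for illustration; the 0.68 and 0.71 scores are the ones from the example above.

```python
def is_ambiguous(lock_scores, margin=0.05):
    """Return True when the two highest lock scores are within `margin`."""
    top_two = sorted(lock_scores.values(), reverse=True)[:2]
    if len(top_two) < 2:
        return False
    return (top_two[0] - top_two[1]) < margin


print(is_ambiguous({"technical": 0.68, "creative": 0.71}))
print(is_ambiguous({"technical": 0.92, "creative": 0.30}))
```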
**Key Takeaways**
- Intent detection isn't a nice-to-have. It's what makes optimization actually work. Generic improvements either over-constrain creative work or under-specify technical tasks.
- Pattern-based detection (looking for structural markers like code blocks, citation requests, visual terminology) works without training data and hit 91.94% accuracy on my own prompt history.
- Confidence scores matter more than binary classification. A 0.68 technical score tells you the prompt is ambiguous and needs rewriting before optimization helps.
- Hybrid prompts need a primary-secondary approach, not weighted averaging. Pick the highest-scoring context and expose the runner-up in metadata for manual override.
- Simpler prompts see cost reductions (40% in my testing), which come from fewer retries and shorter prompts, not from the detection itself, which costs nothing when routed through a free model.
AI systems now depend on how effectively we engineer and evaluate prompts at scale. I've built a platform that removes the technical workload of moving from manual prompting to strategically automating the process: [https://promptoptimizer.xyz/](https://promptoptimizer.xyz/)
--- TOP COMMENTS ---
The part that clicked for me is that generic optimization is just formatting theater when the underlying intent is wrong. I've had more success changing the output contract than changing the wording. For code prompts, I add an explicit schema and an 'if ambiguous, list assumptions' clause. The schema catches most bad outputs before I see them. The assumptions clause turns vague responses into reviewable ones. What's your failure mode when the wrong optimization path runs?
---
Great idea, but the error rate is pretty high, no? ..