I built an open source LLM agent evaluation tool that works with any framework

Source: DEV Community
Every team building AI agents hits the same wall. You ship a LangChain agent. It works great in demos. Then it goes to production and quietly starts hallucinating, calling the wrong tools, or giving answers that have nothing to do with what it retrieved. You don't find out until a user complains.

The root cause is simple: there's no standard way to evaluate agent quality before and after every deploy. Every framework has its own story:

- LangChain has LangSmith, but it's a paid SaaS and only works with LangChain
- CrewAI has no eval tooling
- AutoGen has no eval tooling
- The OpenAI Agents SDK has basic tracing but no scoring

If you switch frameworks, you rebuild your eval setup from scratch. If you use multiple frameworks, you have no unified view. This is the problem I set out to solve.

Introducing EvalForge

EvalForge is a framework-agnostic LLM agent evaluation harness. You give it a trace JSON from any agent framework, it scores it on quality metrics, and returns a pass/fail result your CI pipeline can act on.
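To make the trace-in, verdict-out flow concrete, here is a minimal sketch of the idea. EvalForge's real trace schema, metric names, and API aren't shown in this post, so every field and function name below is an assumption for illustration only:

```python
import json

# A hypothetical agent trace; real frameworks emit richer structures.
trace = json.loads("""
{
  "framework": "langchain",
  "steps": [
    {"type": "tool_call", "tool": "search", "ok": true},
    {"type": "llm", "output": "Paris is the capital of France."}
  ]
}
""")

def score_trace(trace, threshold=1.0):
    """Toy metric: fraction of tool calls that succeeded.
    Returns a dict with the score and a pass/fail flag a CI job could use."""
    calls = [s for s in trace["steps"] if s["type"] == "tool_call"]
    score = sum(s["ok"] for s in calls) / len(calls) if calls else 1.0
    return {"tool_success": score, "passed": score >= threshold}

result = score_trace(trace)
# A CI step would exit non-zero when result["passed"] is False.
```

The key design point is that the scorer only depends on a normalized trace shape, not on any framework's internals, which is what makes a framework-agnostic harness possible.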