A regional third-party logistics provider came to us with a familiar problem: too many exceptions, not enough people.
Their exception management process was consuming six FTEs across three shifts. Every exception (damaged goods, address issues, carrier failures, inventory discrepancies) landed in a shared queue. Humans triaged, researched, and resolved each one manually.
Volume was growing. The team was burning out. They’d looked at traditional automation but couldn’t make the rule-based approaches handle the variability.
The Challenge
Order exceptions in 3PL operations are fundamentally messy. The same exception type — “address undeliverable” — might require:
- Contacting the consignee for address correction
- Rerouting to a local pickup point
- Returning to shipper
- Escalating to the shipper’s customer service team
The right resolution depends on the shipper’s preferences, the consignee’s history, the value of the shipment, the carrier contract terms, and a dozen other factors. This is exactly the kind of judgment-intensive work that rule-based systems can’t handle but LLMs can.
The Architecture
We built a three-agent pipeline:
Triage Agent: Classifies each exception by type, urgency, and complexity. Routes routine exceptions (those that match known patterns) to automated resolution. Routes complex exceptions to the Decision Agent.
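A minimal sketch of that routing step, assuming a hypothetical TriageResult shape and a placeholder set of known patterns. The field names and pattern names are illustrative, not the production schema, and the classification itself (which came from the LLM) is passed in as already-computed data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    AUTOMATED = "automated_resolution"
    DECISION_AGENT = "decision_agent"


@dataclass(frozen=True)
class TriageResult:
    # Hypothetical fields: the triage step classifies by type, urgency,
    # and complexity, but the exact schema is an assumption.
    exception_type: str
    urgency: str                    # e.g. "low", "normal", "high"
    complexity: str                 # e.g. "routine", "complex"
    pattern: Optional[str] = None   # matched known pattern, if any


# Placeholder pattern names for illustration only.
KNOWN_PATTERNS = {"missing_unit_number", "carrier_missed_pickup"}


def route(result: TriageResult) -> Route:
    """Send routine, pattern-matched exceptions to automated resolution."""
    if result.complexity == "routine" and result.pattern in KNOWN_PATTERNS:
        return Route.AUTOMATED
    # Anything unmatched or complex goes to the Decision Agent.
    return Route.DECISION_AGENT
```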
Decision Agent: For each exception, retrieves relevant context — shipper preferences, consignee history, carrier capabilities, contract terms — and determines the appropriate resolution action. Generates a resolution plan with confidence score.
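One way to keep the Decision Agent's output safe to act on is to validate it before anything downstream sees it. The sketch below assumes the model returns a JSON object with action, confidence, and rationale keys. That shape, the parse_plan helper, and the action names (taken from the address-undeliverable options above) are illustrative assumptions, not the actual schema.

```python
import json
from dataclasses import dataclass

# Illustrative action names based on the address-undeliverable example.
ALLOWED_ACTIONS = {
    "contact_consignee",
    "reroute_to_pickup_point",
    "return_to_shipper",
    "escalate_to_shipper_cs",
}


@dataclass(frozen=True)
class ResolutionPlan:
    action: str
    confidence: float   # expected range: 0.0 to 1.0
    rationale: str


def parse_plan(raw: str) -> ResolutionPlan:
    """Parse and validate a model response (assumed to be a JSON object)."""
    data = json.loads(raw)
    action = data["action"]
    confidence = float(data["confidence"])
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return ResolutionPlan(action, confidence, str(data.get("rationale", "")))
```

Rejecting malformed plans at this boundary means the Action Agent only ever receives an action it knows how to execute.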
Action Agent: Executes the resolution plan. Updates the TMS, sends notifications, creates carrier claims, or escalates to human review — depending on the plan and confidence threshold.
Human review queue: Receives exceptions where the plan's confidence falls below the threshold, or where the action's financial impact exceeds a set limit.
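The escalation rule reduces to a two-condition check. In this sketch the threshold values are placeholders (the real ones came out of tuning and are not stated here), and PlannedAction is a hypothetical type.

```python
from dataclasses import dataclass

# Placeholder values for illustration; not the tuned production thresholds.
CONFIDENCE_THRESHOLD = 0.85
FINANCIAL_IMPACT_LIMIT = 500.00


@dataclass(frozen=True)
class PlannedAction:
    action: str
    confidence: float
    financial_impact: float  # estimated cost of the action


def needs_human_review(
    plan: PlannedAction,
    confidence_threshold: float = CONFIDENCE_THRESHOLD,
    impact_limit: float = FINANCIAL_IMPACT_LIMIT,
) -> bool:
    """Escalate on low confidence OR on financial impact above the limit."""
    return (
        plan.confidence < confidence_threshold
        or plan.financial_impact > impact_limit
    )
```

Note that the two conditions are independent: a high-confidence plan still escalates if the money at stake is above the limit.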
Integration Points
The system integrates with:
- Their TMS (Manhattan Associates) for exception data and status updates
- Their carrier APIs for redelivery scheduling and claims
- Their customer portal for shipper preference data
- Email and SMS for notifications
All integrations are read/write. The agents don't just surface recommendations; they act.
Results at 12 Weeks
- 80% automation rate: 4 in 5 exceptions resolved without human intervention
- Average resolution time: 4 minutes (down from 47 minutes when handled manually)
- Resolution accuracy: 97.3%, vs. a 94.1% human baseline
- Escalation quality: When exceptions do escalate to humans, they arrive with full context and a recommended action — reducing human resolution time by 60%
The 6 FTEs were redeployed to higher-value work. The client’s exception management cost decreased by 71%.
What Made It Work
A few things that weren’t obvious at the start:
Confidence calibration mattered more than we expected. Getting the confidence threshold right — when to automate vs. when to escalate — was the most important tuning problem. Too aggressive and you get errors. Too conservative and you don’t get enough automation.
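The trade-off can be made concrete with a threshold sweep over labeled past decisions. This is a sketch under assumptions: records is a hypothetical list of (confidence, was_correct) pairs, and the function names are ours for illustration rather than the tooling we actually used.

```python
def sweep(records, thresholds):
    """For each threshold, return (threshold, automation_rate, error_rate).

    records: list of (confidence, was_correct) pairs from labeled exceptions.
    error_rate is measured only over the exceptions that would be automated.
    """
    total = len(records)
    rows = []
    for t in thresholds:
        automated = [ok for conf, ok in records if conf >= t]
        automation_rate = len(automated) / total if total else 0.0
        errors = sum(1 for ok in automated if not ok)
        error_rate = errors / len(automated) if automated else 0.0
        rows.append((t, automation_rate, error_rate))
    return rows


def pick_threshold(records, thresholds, max_error_rate):
    """Lowest threshold (most automation) whose error rate fits the budget."""
    for t, _automation_rate, error_rate in sweep(records, sorted(thresholds)):
        if error_rate <= max_error_rate:
            return t
    return None  # no candidate threshold meets the error budget
```

Lowering the threshold raises the automation rate and, typically, the error rate; the sweep shows how steep that exchange is at each candidate value.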
The audit trail was a feature, not overhead. Every resolution decision is logged with the full reasoning trace. This gave the operations team visibility they’d never had before — they could see exactly why each exception was resolved the way it was.
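As an illustration of what one logged decision might contain, here is a sketch that emits one JSON line per resolution. The field names and the build_audit_record helper are assumptions for the example, not the production log format.

```python
import json
from datetime import datetime, timezone


def build_audit_record(exception_id, action, confidence, reasoning_trace, escalated):
    """Return a single JSON line describing one resolution decision."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "exception_id": exception_id,
        "action": action,
        "confidence": confidence,
        "reasoning_trace": reasoning_trace,   # ordered list of reasoning steps
        "escalated_to_human": escalated,
    }
    # sort_keys keeps the output stable, which makes logs easier to diff.
    return json.dumps(record, sort_keys=True)
```

Writing one self-contained line per decision keeps the trail easy to search by exception ID when someone asks why a shipment was handled a particular way.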
Human review improved the model. The human review queue became a feedback mechanism. When reviewers flagged incorrect agent decisions, those flags became training data that improved the model's calibration over time.
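One simple way to check calibration from reviewer feedback is to bucket decisions by stated confidence and compare each bucket's mean confidence with how often reviewers agreed. This is a sketch of that idea, assuming reviews arrive as hypothetical (confidence, reviewer_agreed) pairs; it is not a description of our training process.

```python
from collections import defaultdict


def calibration_table(reviews, n_bins=10):
    """Group reviewed decisions into confidence bins and summarize each bin.

    reviews: iterable of (confidence, reviewer_agreed) pairs,
             with confidence in the range 0.0 to 1.0.
    """
    bins = defaultdict(list)
    for confidence, agreed in reviews:
        # Clamp so that confidence == 1.0 falls into the top bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, agreed))

    table = []
    for idx in sorted(bins):
        items = bins[idx]
        table.append({
            "bin": idx,
            "count": len(items),
            "mean_confidence": sum(c for c, _ in items) / len(items),
            "accuracy": sum(1 for _, ok in items if ok) / len(items),
        })
    return table
```

A bin where mean confidence sits well above observed accuracy indicates overconfidence in that range, which is where the threshold needs the most care.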
What We’d Do Differently
Start the confidence calibration work earlier. We spent the first three weeks building before we had a clear definition of what “good” looked like. Starting with the evaluation framework would have accelerated the tuning phase.