Introduction: Monitoring Beyond the Basics
In the high-stakes world of digital payments, uptime alone won’t cut it. True resilience means sensing the pulse of business health—tracking transaction outcomes across thousands of merchants, dozens of payment methods, and countless attribute permutations. Manual monitoring? Often noisy. Traditional SaaS tools? Incomplete and expensive.
That is why we built Overwatch. Originally a payment success-rate tracker, Overwatch has evolved into PayU’s intelligent observability layer, monitoring transaction volumes, refunds, Gross Merchandise Volume (GMV), and more. Designed and delivered by a nimble two-person team, this system now supports critical recovery decisions in real time.
Problem Scope: Complexity Meets Urgency
Back in 2022, our incumbent SaaS tool was cracking under pressure—it missed key alerts, covered limited metrics, and scaled costs without scaling value. And with contract renewal approaching, we needed to act fast.
In shadow testing, Overwatch caught incidents the legacy system missed. That proved its worth—instantly.
The challenges we faced:
- 3,000+ transactions/second, 20K+ Change Data Capture (CDC) events per second
- Multi-attribute permutations (merchant, issuer, integration flow, etc.)
- Alert fatigue from static thresholds and low actionability
- Irregular behavior patterns—festive surges, regional traffic waves, time-sensitive load variations
Architecture Overview: From Ingestion to Insight

Streaming & Processing
- Amazon Aurora CDC via Maxwell Daemon: Real-time capture
- Logstash → Kafka → Redis: Streaming events, reordering out-of-sync records, and ensuring consistency
- Redis buffering: Crucial for update-before-insert edge cases, where a row's update event arrives before its insert
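The update-before-insert case can be illustrated with a minimal sketch. It assumes each CDC event carries a row `id` and a `type` of either `insert` or `update`. The event shape and the `ReorderBuffer` class are hypothetical, and a plain in-memory dict stands in for Redis so the example runs on its own.

```python
from collections import defaultdict


class ReorderBuffer:
    """Parks updates that arrive before their row's insert (in-memory stand-in for Redis)."""

    def __init__(self):
        self.seen_inserts = set()         # row ids whose insert has already been emitted
        self.pending = defaultdict(list)  # row id -> updates waiting for the insert

    def process(self, event):
        """Return the events that are now safe to emit, in order."""
        row_id = event["id"]
        if event["type"] == "insert":
            self.seen_inserts.add(row_id)
            parked = self.pending.pop(row_id, [])
            return [event] + parked       # insert first, then parked updates in arrival order
        if row_id in self.seen_inserts:
            return [event]                # normal case: the insert was already seen
        self.pending[row_id].append(event)
        return []                         # hold the update until its insert shows up


buf = ReorderBuffer()
out = []
for ev in [
    {"type": "update", "id": 1, "status": "success"},   # arrives too early
    {"type": "insert", "id": 1, "status": "pending"},
    {"type": "update", "id": 1, "status": "refunded"},
]:
    out.extend(buf.process(ev))

print([(e["type"], e["status"]) for e in out])
```

In a real deployment the parked entries and the set of seen ids would presumably need an expiry so the buffer cannot grow without bound; that detail is left out here.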
TimescaleDB: The Engine Behind It All
After evaluating many options, TimescaleDB stood out for:
- Hypertables that partition data by time automatically as volume grows
- Continuous aggregates that keep rollups precomputed for fast queries
- Chunked storage and multi-tier retention policies to balance cost and performance
Data organization:
| Layer | Retention | Use Case |
| --- | --- | --- |
| Raw transactions | 2 weeks | Immediate insights |
| Minute-wise aggregates | 2 months | Trend analysis |
| Daily summaries | 1 year | Strategic decisions |
Most operational queries are high-cardinality and time-sensitive, and TimescaleDB handles them well.
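As a rough sketch of how the first two retention tiers might be declared, the helper below generates TimescaleDB statements for a raw hypertable and a minute-level continuous aggregate. The table name `transactions`, the columns `created_at`, `merchant_id`, and `status`, the view name `txn_minute`, and the refresh offsets are all assumptions for illustration, not the actual Overwatch schema.

```python
# Hypothetical names; only the retention intervals come from the tiers described above.
RAW_TABLE = "transactions"
TIME_COLUMN = "created_at"
MINUTE_VIEW = "txn_minute"


def raw_layer_sql():
    """Statements for the raw tier: a hypertable kept for 2 weeks."""
    return [
        f"SELECT create_hypertable('{RAW_TABLE}', '{TIME_COLUMN}');",
        f"SELECT add_retention_policy('{RAW_TABLE}', INTERVAL '2 weeks');",
    ]


def minute_layer_sql():
    """Statements for the minute tier: a continuous aggregate kept for 2 months."""
    return [
        f"""CREATE MATERIALIZED VIEW {MINUTE_VIEW}
WITH (timescaledb.continuous) AS
SELECT time_bucket(INTERVAL '1 minute', {TIME_COLUMN}) AS bucket,
       merchant_id,
       count(*) AS total,
       sum(CASE WHEN status = 'success' THEN 1 ELSE 0 END) AS succeeded
FROM {RAW_TABLE}
GROUP BY bucket, merchant_id;""",
        # Refresh window values below are illustrative guesses.
        f"SELECT add_continuous_aggregate_policy('{MINUTE_VIEW}', "
        "start_offset => INTERVAL '1 hour', "
        "end_offset => INTERVAL '1 minute', "
        "schedule_interval => INTERVAL '1 minute');",
        f"SELECT add_retention_policy('{MINUTE_VIEW}', INTERVAL '2 months');",
    ]


for statement in raw_layer_sql() + minute_layer_sql():
    print(statement)
```

The daily tier would follow the same pattern with a one-day bucket and a one-year retention interval.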
Smart Thresholding & Health Checks
Overwatch analyzes a rolling 15-day window per entity to build dynamic baselines. No manual configurations. No rigid rules.
- Detects meaningful deviations (not false alarms)
- Updates thresholds automatically as business patterns evolve
- Keeps teams informed—without overwhelming them
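One way a rolling baseline of this kind could work is sketched below. The source does not state which statistic Overwatch uses, so the mean-minus-k-standard-deviations rule, the function names, and the default `k=3.0` are assumptions chosen only to make the idea concrete.

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Lower bound for a success rate, derived from a rolling window of past values.

    history: success rates (0..1) for one entity, e.g. one value per day
             over the last 15 days.
    k: how many standard deviations below the mean counts as anomalous (assumed).
    """
    if len(history) < 2:
        return None  # not enough data to form a baseline
    return mean(history) - k * stdev(history)


def is_anomalous(current, history, k=3.0):
    """True when the current value falls below the entity's own baseline."""
    threshold = dynamic_threshold(history, k)
    return threshold is not None and current < threshold


# Fifteen illustrative daily success rates for one merchant (made-up numbers).
window = [0.92, 0.93, 0.91, 0.94, 0.92, 0.93, 0.92, 0.91,
          0.93, 0.94, 0.92, 0.93, 0.91, 0.92, 0.93]

print(is_anomalous(0.80, window))  # a sharp drop
print(is_anomalous(0.92, window))  # an ordinary day
```

Because the threshold is recomputed from each entity's own history, a merchant with naturally volatile traffic gets a wider band than a stable one, with no per-entity configuration.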
Actionable Alerts > Noisy Notifications
Anomalies go through a prioritization pipeline:
- Impact measured via duration, volume, deviation
- Fuzzy correlation joins related events—even without an attribute match
- Automated feedback loops trigger mitigation—real-time rerouting or traffic control
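The scoring and correlation steps above could look roughly like the sketch below. The `Anomaly` fields, the multiplicative impact formula, and the 120-second grouping gap are all hypothetical. Time proximity is used here as one simple way to link events that share no attribute; the source does not describe the actual correlation logic.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    entity: str         # e.g. "merchant=A" or "issuer=X"
    start: float        # epoch seconds
    end: float          # epoch seconds
    affected_txns: int  # transaction volume during the anomaly
    deviation: float    # fractional drop below baseline, e.g. 0.15

    def impact(self):
        """Assumed score: longer, larger, and deeper anomalies rank higher."""
        duration_min = (self.end - self.start) / 60
        return duration_min * self.affected_txns * self.deviation


def group_by_time(anomalies, gap_seconds=120):
    """Group anomalies that overlap or start within gap_seconds of a group's end,
    regardless of which attributes they carry."""
    groups = []
    for a in sorted(anomalies, key=lambda x: x.start):
        if groups and a.start <= groups[-1]["end"] + gap_seconds:
            groups[-1]["members"].append(a)
            groups[-1]["end"] = max(groups[-1]["end"], a.end)
        else:
            groups.append({"end": a.end, "members": [a]})
    return [g["members"] for g in groups]


def prioritize(anomalies):
    """Return anomaly groups ordered by total impact, highest first."""
    groups = group_by_time(anomalies)
    return sorted(groups, key=lambda g: sum(a.impact() for a in g), reverse=True)


events = [
    Anomaly("merchant=B", 3600, 3660, 50, 0.05),
    Anomaly("merchant=A", 0, 600, 1000, 0.2),
    Anomaly("issuer=X", 60, 300, 500, 0.1),
]
ranked = prioritize(events)
print([[a.entity for a in group] for group in ranked])
```

Here the merchant and issuer anomalies land in one group because they overlap in time, even though they share no attribute, and that group ranks above the small isolated dip.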
Overwatch does not just alert. It acts.
Dashboards & Routing
Built in Grafana, Overwatch’s UI is lean yet expressive. Clever hacks allow cross-dashboard context passing.
Alerts reach teams via:
- Microsoft Teams
Smart routing ensures attention goes where it is needed—without alert overload.
Metadata Monitoring: Observability of Observability
Every piece of Overwatch—from ingestion latencies to buffer depths—is itself monitored. Because when monitoring breaks, you need monitoring for that too.
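A self-check of this kind can be very small. The sketch below assumes the pipeline exposes its own metrics as a name-to-value mapping; the metric names and limits are hypothetical.

```python
# Hypothetical limits; real values would depend on the pipeline's normal behaviour.
LIMITS = {"ingestion_latency_seconds": 30, "redis_buffer_depth": 50_000}


def unhealthy_components(metrics, limits=LIMITS):
    """Return the self-monitoring metrics that exceed their limit or are missing."""
    problems = []
    for name, limit in limits.items():
        value = metrics.get(name)
        if value is None or value > limit:  # a silent metric is treated as a failure
            problems.append(name)
    return problems


print(unhealthy_components({"ingestion_latency_seconds": 4, "redis_buffer_depth": 120_000}))
```

Treating a missing metric as unhealthy matters here: a monitor that has stopped reporting should not look the same as one that is reporting normal values.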
Why Reinvent, Not Rent?
The previous SaaS tool:
- Missed key incidents
- Monitored only partial metrics
- Scaled pricing with zero empathy for business complexity
Overwatch’s rollout under contract pressure proved how fast focused engineering can deliver better outcomes.
People & Change: The Human Factor
Adoption was not automatic. Teams were used to legacy workflows. But Overwatch earned trust through results.
What started as a niche tool quickly became central to broader business metric monitoring—driven by user demand, not top-down mandates.
No AI copilots. Just clarity, iteration, and delivery.
Engineering Decisions & Takeaways
TimescaleDB’s Edge
- Query speed in milliseconds
- Predictable retention and costs
- Minimal ops handholding
Redis Buffering
- Critical for CDC stream integrity
- Prevents data misordering at scale
Zero-Ops Mandate
- No manual tuning
- Intelligent thresholds and alert logic built in
Trustable Alerts
- Grouping, impact scoring, fuzzy linking—all designed to help ops act
- Alert fatigue avoided, actionability prioritized
Speed > Feature Fatigue
- Deliver what solves the pain, iterate toward completeness
- Smart shortcuts (like Grafana hacks) mattered more than perfection
If We Had a Do-Over…
- Build resilience from day one: Automated fallbacks save real downtime—manual overrides do not scale.
- Start narrow, prove value, then scale: Precision-focused delivery earned trust and unlocked generalization.
- Consider team cost, not just infra cost: Alert clarity, developer time, and flexibility saved far more than just cloud bills.
Summary
Overwatch is PayU’s purpose-built system for scalable business observability. From payment success rates to system-level action, it is zero-ops by design and rapid-recovery by default.
It is a testament to what tight priorities, trust in iteration, and small teams with big ownership can achieve.
Requests to monitor broader business metrics arrived soon after launch, a sign of the trust Overwatch had earned and of its ability to scale.