Overwatch: PayU’s Journey to Intelligent Monitoring Without Ops Overhead

Introduction: Monitoring Beyond the Basics

In the high-stakes world of digital payments, uptime alone won’t cut it. True resilience means sensing the pulse of business health—tracking transaction outcomes across thousands of merchants, dozens of payment methods, and countless attribute permutations. Manual monitoring? Often noisy. Traditional SaaS tools? Incomplete and expensive.

That is why we built Overwatch. Originally a payment success-rate tracker, Overwatch has evolved into PayU’s intelligent observability layer, monitoring transaction volumes, refunds, Gross Merchandise Value (GMV), and more. Designed and delivered by a nimble two-person team, this system now supports critical recovery decisions in real time.

Problem Scope: Complexity Meets Urgency

Back in 2022, our incumbent SaaS tool was cracking under pressure—it missed key alerts, covered limited metrics, and scaled costs without scaling value. And with contract renewal approaching, we needed to act fast.

In shadow testing, Overwatch caught incidents the legacy system missed. That proved its worth—instantly.

The challenges we faced:

3,000+ transactions per second and 20,000+ Change Data Capture (CDC) events per second

Multi-attribute permutations (merchant, issuer, integration flow, etc.)

Alert fatigue from static thresholds and low actionability

Irregular behavior patterns—festive surges, regional traffic waves, time-sensitive load variations

Architecture Overview: From Ingestion to Insight

Streaming & Processing

Amazon Aurora CDC via Maxwell's Daemon: Real-time change capture

Logstash → Kafka → Redis: Streaming events, reordering out-of-sync records, and ensuring consistency

Redis buffering: Crucial for edge cases where an update event arrives before the insert it modifies
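The update-before-insert case can be sketched as follows. This is a minimal in-memory illustration, with a plain dict standing in for Redis. The event shape (`type`, `txn_id`, `data`) and the `ReorderBuffer` class are assumptions made for the example, not Overwatch's actual code.

```python
from collections import defaultdict
from typing import Any, Dict, List


class ReorderBuffer:
    """Parks updates that arrive before their insert (a dict stands in for Redis)."""

    def __init__(self) -> None:
        # Rows whose insert event has already been seen.
        self.rows: Dict[str, Dict[str, Any]] = {}
        # Updates waiting for an insert that has not arrived yet.
        self.pending: Dict[str, List[Dict[str, Any]]] = defaultdict(list)

    def handle(self, event: Dict[str, Any]) -> None:
        txn_id = event["txn_id"]
        if event["type"] == "insert":
            row = dict(event["data"])
            # Replay any early updates on top of the insert, in arrival order.
            for early in self.pending.pop(txn_id, []):
                row.update(early["data"])
            self.rows[txn_id] = row
        elif event["type"] == "update":
            if txn_id in self.rows:
                self.rows[txn_id].update(event["data"])
            else:
                # Insert not seen yet: buffer the update instead of dropping it.
                self.pending[txn_id].append(event)


buf = ReorderBuffer()
# The update reaches us first, then the insert it belongs to.
buf.handle({"type": "update", "txn_id": "t1", "data": {"status": "success"}})
buf.handle({"type": "insert", "txn_id": "t1", "data": {"status": "pending", "amount": 100}})
print(buf.rows["t1"])
```

In a real deployment the buffered entries would also need an expiry, so that updates whose insert never arrives do not accumulate forever.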

TimescaleDB: The Engine Behind It All

After we evaluated many options, TimescaleDB stood out for:

Hypertables that scale by partitioning data automatically by time

Continuous aggregates for blazing-fast queries

Chunked storage and multi-tier retention policies to balance cost and performance

Data organization:

Layer	Retention	Use Case
Raw transactions	2 weeks	Immediate insights
Minute-wise aggregates	2 months	Trend analysis
Daily summaries	1 year	Strategic decisions

Most operational queries are both high-cardinality and time-sensitive, and TimescaleDB handles them comfortably.
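To make the middle layer concrete, here is a conceptual sketch of what a minute-wise aggregate holds. In production this rollup happens inside the database; the plain Python below only illustrates the shape of the result, and the field names (`ts`, `merchant`, `status`) are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple


def minute_rollup(txns: Iterable[dict]) -> Dict[Tuple[datetime, str], dict]:
    """Group raw transactions into (minute, merchant) buckets with a success rate."""
    buckets: Dict[Tuple[datetime, str], dict] = defaultdict(
        lambda: {"total": 0, "success": 0}
    )
    for t in txns:
        # Truncate the timestamp to the start of its minute.
        minute = t["ts"].replace(second=0, microsecond=0)
        b = buckets[(minute, t["merchant"])]
        b["total"] += 1
        if t["status"] == "success":
            b["success"] += 1
    for b in buckets.values():
        b["success_rate"] = b["success"] / b["total"]
    return dict(buckets)


utc = timezone.utc
raw = [
    {"ts": datetime(2024, 1, 1, 10, 0, 5, tzinfo=utc), "merchant": "m1", "status": "success"},
    {"ts": datetime(2024, 1, 1, 10, 0, 40, tzinfo=utc), "merchant": "m1", "status": "failed"},
    {"ts": datetime(2024, 1, 1, 10, 1, 2, tzinfo=utc), "merchant": "m1", "status": "success"},
]
rollup = minute_rollup(raw)
for (minute, merchant), stats in sorted(rollup.items()):
    print(minute.isoformat(), merchant, stats)
```

Dashboards and health checks can then read a few small buckets instead of scanning every raw row, which is why the aggregate layers can be retained far longer than the raw data.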

Smart Thresholding & Health Checks

Overwatch analyzes a rolling 15-day window per entity to build dynamic baselines. No manual configurations. No rigid rules.

Detects meaningful deviations (not false alarms)

Updates thresholds automatically as business patterns evolve

Keeps teams informed—without overwhelming them
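One simple way to derive a dynamic baseline from a rolling window is sketched below. The rule used here (mean minus `k` standard deviations, with `k = 3`) is an assumption chosen for illustration; this post does not describe the exact statistics Overwatch applies, and the sample values are made up.

```python
from statistics import mean, pstdev
from typing import Sequence


def dynamic_lower_bound(history: Sequence[float], k: float = 3.0) -> float:
    """Lower threshold learned from the window: mean - k * stddev (assumed rule)."""
    return mean(history) - k * pstdev(history)


def is_anomalous(history: Sequence[float], current: float, k: float = 3.0) -> bool:
    """Flag the current value only when it falls below the learned baseline."""
    return current < dynamic_lower_bound(history, k)


# Hypothetical success rates (%) for one merchant, one value per day for 15 days.
window = [92.0, 93.5, 91.8, 92.7, 93.1, 92.2, 91.9, 93.0,
          92.5, 92.8, 93.3, 92.1, 92.6, 92.9, 93.2]

print(is_anomalous(window, 92.0))  # ordinary day-to-day variation
print(is_anomalous(window, 80.0))  # sharp drop in success rate
```

Because the bound is recomputed from the latest window, it moves with the entity's normal behavior, so nobody has to maintain a hand-tuned threshold per merchant.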

Actionable Alerts > Noisy Notifications

Anomalies go through a prioritization pipeline:

Impact is measured by duration, volume, and deviation

Fuzzy correlation joins related events—even without an attribute match

Automated feedback loops trigger mitigation—real-time rerouting or traffic control

Overwatch does not just alert. It acts.
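The prioritization steps can be sketched roughly as follows. Both the scoring formula (duration × volume × deviation) and the grouping rule (anomalies that start within a few minutes of each other are linked, regardless of attributes) are assumptions for illustration. The `Anomaly` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Anomaly:
    entity: str        # hypothetical label, e.g. "merchant=m1" or "issuer=bankA"
    start_min: int     # start time in minutes since an arbitrary reference point
    duration_min: int  # how long the deviation has lasted
    volume: int        # number of transactions affected
    deviation: float   # fractional drop from baseline, between 0 and 1

    @property
    def impact(self) -> float:
        # Assumed scoring rule: longer, larger, and deeper drops rank higher.
        return self.duration_min * self.volume * self.deviation


def correlate(anomalies: List[Anomaly], gap_min: int = 5) -> List[List[Anomaly]]:
    """Link anomalies that start within gap_min of the previous one, ignoring attributes."""
    groups: List[List[Anomaly]] = []
    for a in sorted(anomalies, key=lambda x: x.start_min):
        if groups and a.start_min - groups[-1][-1].start_min <= gap_min:
            groups[-1].append(a)
        else:
            groups.append([a])
    # Highest combined impact first, so the most urgent incident leads.
    return sorted(groups, key=lambda g: sum(x.impact for x in g), reverse=True)


incidents = correlate([
    Anomaly("merchant=m9", start_min=300, duration_min=2, volume=50, deviation=0.2),
    Anomaly("merchant=m1", start_min=100, duration_min=10, volume=500, deviation=0.4),
    Anomaly("issuer=bankA", start_min=102, duration_min=8, volume=2000, deviation=0.3),
])
for group in incidents:
    print([a.entity for a in group], round(sum(a.impact for a in group), 1))
```

Here the merchant and issuer anomalies share no attribute but start two minutes apart, so they surface as a single incident ranked above the small, isolated one.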

Dashboards & Routing

Built in Grafana, Overwatch’s UI is lean yet expressive. Clever hacks allow cross-dashboard context passing.

Alerts reach teams via Microsoft Teams.

Smart routing ensures attention goes where it is needed—without alert overload.

Metadata Monitoring: Observability of Observability

Every piece of Overwatch—from ingestion latencies to buffer depths—is itself monitored. Because when monitoring breaks, you need monitoring for that too.

Why Reinvent, Not Rent?

The previous SaaS tool:

Missed key incidents

Monitored only partial metrics

Scaled pricing with zero empathy for business complexity

Overwatch’s rollout under contract pressure showed how quickly focused engineering can deliver better outcomes.

People & Change: The Human Factor

Adoption was not automatic. Teams were used to legacy workflows. But Overwatch earned trust through results.

What started as a niche tool quickly became central to broader business metric monitoring—driven by user demand, not top-down mandates.

No AI copilots. Just clarity, iteration, and delivery.

Engineering Decisions & Takeaways

TimescaleDB’s Edge

Query speed in milliseconds

Predictable retention and costs

Minimal ops handholding

Redis Buffering

Critical for CDC stream integrity

Prevents data misordering at scale

Zero-Ops Mandate

No manual tuning

Intelligent thresholds and alert logic built in

Trustable Alerts

Grouping, impact scoring, fuzzy linking—all designed to help ops act

Alert fatigue avoided, actionability prioritized

Speed > Feature Fatigue

Deliver what solves the pain, iterate toward completeness

Smart shortcuts (like Grafana hacks) mattered more than perfection

If We Had a Do-Over…

Build resilience from day one: Automated fallbacks save real downtime—manual overrides do not scale.

Start narrow, prove value, then scale: Precision-focused delivery earned trust and unlocked generalization.

Consider team cost, not just infra cost: Alert clarity, developer time, and flexibility saved far more than just cloud bills.

Summary

Overwatch is PayU’s purpose-built system for scalable business observability. From payment success rates to system-level action, it is zero-ops by design and rapid-recovery by default.

It is a testament to what tight priorities, trust in iteration, and small teams with big ownership can achieve.

Requests to extend Overwatch to broader business metrics came soon after launch, a sign of its trustworthiness and scalability.