Overwatch: PayU’s Journey to Intelligent Monitoring Without Ops Overhead

Introduction: Monitoring Beyond the Basics 

In the high-stakes world of digital payments, uptime alone won’t cut it. True resilience means sensing the pulse of business health—tracking transaction outcomes across thousands of merchants, dozens of payment methods, and countless attribute permutations. Manual monitoring? Often noisy. Traditional SaaS tools? Incomplete and expensive. 

That is why we built Overwatch. Originally a payment success-rate tracker, Overwatch has evolved into PayU’s intelligent observability layer, monitoring transaction volumes, refunds, Gross Merchandise Value (GMV), and more. Designed and delivered by a nimble two-person team, the system now supports critical recovery decisions in real time.

Problem Scope: Complexity Meets Urgency 

Back in 2022, our incumbent SaaS tool was cracking under pressure—it missed key alerts, covered limited metrics, and scaled costs without scaling value. And with contract renewal approaching, we needed to act fast. 

In shadow testing, Overwatch caught incidents the legacy system missed. That proved its worth—instantly. 

The challenges we faced: 

  • 3,000+ transactions per second and 20K+ Change Data Capture (CDC) events per second
  • Multi-attribute permutations (merchant, issuer, integration flow, etc.) 
  • Alert fatigue from static thresholds and low actionability 
  • Irregular behavior patterns—festive surges, regional traffic waves, time-sensitive load variations 

Architecture Overview: From Ingestion to Insight 

Streaming & Processing 

  • Amazon Aurora CDC via Maxwell’s Daemon: Real-time capture
  • Logstash → Kafka → Redis: Streaming events, reordering out-of-sync records, and ensuring consistency 
  • Redis buffering: Crucial for update-before-insert edge cases 
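
The update-before-insert case can be illustrated with a minimal sketch. It assumes each CDC event carries a primary key, an event type, and a timestamp; the field names are hypothetical, and a plain in-memory dict stands in for Redis so the example runs on its own. It is not the production logic, only one way such a buffer could behave.

```python
from collections import defaultdict


class CdcReorderBuffer:
    """Parks UPDATE events that arrive before the INSERT for the same row.

    An in-memory dict stands in for Redis here (assumption for illustration).
    """

    def __init__(self):
        self.seen = set()                 # primary keys whose INSERT was released
        self.pending = defaultdict(list)  # pk -> updates waiting for their INSERT

    def accept(self, event):
        """Return the events that are now safe to apply, in order."""
        pk, kind = event["pk"], event["type"]
        if kind == "insert":
            self.seen.add(pk)
            # Release the insert first, then any parked updates, oldest first.
            parked = sorted(self.pending.pop(pk, []), key=lambda e: e["ts"])
            return [event] + parked
        if pk in self.seen:
            return [event]                # insert already applied: pass through
        self.pending[pk].append(event)    # update arrived early: park it
        return []


buf = CdcReorderBuffer()
early_update = {"pk": 42, "type": "update", "ts": 2}
late_insert = {"pk": 42, "type": "insert", "ts": 1}
print(buf.accept(early_update))                         # []
print([e["type"] for e in buf.accept(late_insert)])     # ['insert', 'update']
```

A real buffer would also need an expiry for parked updates whose insert never arrives; that is left out here.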

TimescaleDB: The Engine Behind It All

We evaluated many options, and TimescaleDB stood out for:

  • Hypertables that partition data by time automatically
  • Continuous aggregates for blazing-fast queries 
  • Chunked storage and multi-tier retention policies to balance cost and performance 

Data organization: 

Layer                    Retention   Use Case
Raw transactions         2 weeks     Immediate insights
Minute-wise aggregates   2 months    Trend analysis
Daily summaries          1 year      Strategic decisions

Most operational queries are high-cardinality and time-sensitive, and TimescaleDB handles them with ease.
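
The retention tiers listed above map naturally onto TimescaleDB's `add_retention_policy` function, and the minute-wise layer onto a continuous aggregate. The sketch below only renders the SQL as strings; the relation and column names (`transactions_raw`, `created_at`, `merchant_id`, `status`) are assumptions for illustration, not the actual schema.

```python
# Retention tiers from the data-organization table; relation names are hypothetical.
TIERS = {
    "transactions_raw": "2 weeks",
    "transactions_per_minute": "2 months",
    "transactions_per_day": "1 year",
}

# A minute-level continuous aggregate over the raw hypertable (assumed columns).
MINUTE_AGGREGATE_SQL = """
CREATE MATERIALIZED VIEW transactions_per_minute
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', created_at) AS bucket,
       merchant_id,
       count(*) AS attempts,
       sum(CASE WHEN status = 'success' THEN 1 ELSE 0 END) AS successes
FROM transactions_raw
GROUP BY bucket, merchant_id;
"""


def retention_statements(tiers):
    """Render one add_retention_policy call per storage tier."""
    return [
        f"SELECT add_retention_policy('{name}', INTERVAL '{keep}');"
        for name, keep in tiers.items()
    ]


for stmt in retention_statements(TIERS):
    print(stmt)
```

Keeping the tiers in one mapping makes the cost and performance trade-off visible in a single place: changing a retention window is a one-line edit.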

Smart Thresholding & Health Checks 

Overwatch analyzes a rolling 15-day window per entity to build dynamic baselines. No manual configurations. No rigid rules. 

  • Detects meaningful deviations (not false alarms) 
  • Updates thresholds automatically as business patterns evolve 
  • Keeps teams informed—without overwhelming them 
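
To make the idea of a dynamic baseline concrete, here is a minimal sketch. The article does not say which statistic Overwatch uses, so the mean-minus-k-standard-deviations rule and the value of `k` below are assumptions chosen purely for illustration.

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Lower alert bound derived from a rolling window of success rates.

    history: success rates (0..1) for one entity over the lookback window,
             for example one value per day for the last 15 days.
    k:       how many standard deviations count as a meaningful drop
             (an illustrative choice, not a value from the article).
    """
    baseline = mean(history)
    spread = stdev(history) if len(history) > 1 else 0.0
    return baseline - k * spread


def is_anomalous(current, history, k=3.0):
    """True when the current rate falls below the entity's own baseline band."""
    return current < dynamic_threshold(history, k)


# Fifteen daily success rates for one hypothetical merchant.
history = [0.90, 0.92] * 7 + [0.91]
print(is_anomalous(0.80, history))   # True: well below the usual range
print(is_anomalous(0.90, history))   # False: within normal variation
```

Because the threshold is recomputed from each entity's own recent history, it moves as the business pattern moves, with no per-entity configuration.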

Actionable Alerts > Noisy Notifications 

Anomalies go through a prioritization pipeline: 

  • Impact measured via duration, volume, deviation 
  • Fuzzy correlation joins related events—even without an attribute match 
  • Automated feedback loops trigger mitigation—real-time rerouting or traffic control 

Overwatch does not just alert. It acts. 
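
The prioritization steps can be sketched as follows. The scoring formula (a simple product of duration, volume, and deviation) and the time-proximity rule used for correlation are assumptions; the article names the inputs but not how they are combined, and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    entity: str         # hypothetical label, e.g. "merchant=acme"
    start: int          # epoch seconds
    end: int            # epoch seconds
    affected_txns: int  # transactions inside the anomalous window
    deviation: float    # fractional drop below baseline, e.g. 0.25


def impact_score(a):
    """Combine duration, volume, and deviation into one sortable number."""
    duration_min = (a.end - a.start) / 60
    return duration_min * a.affected_txns * a.deviation


def correlate(anomalies, max_gap=120):
    """Group anomalies whose windows overlap or start within max_gap seconds
    of the current group's end, whether or not their attributes match."""
    groups = []
    for a in sorted(anomalies, key=lambda x: x.start):
        if groups and a.start <= groups[-1]["end"] + max_gap:
            groups[-1]["members"].append(a)
            groups[-1]["end"] = max(groups[-1]["end"], a.end)
        else:
            groups.append({"end": a.end, "members": [a]})
    return [g["members"] for g in groups]


a = Anomaly("merchant=acme", 0, 600, 1000, 0.25)
b = Anomaly("issuer=bankx", 300, 900, 400, 0.5)
c = Anomaly("merchant=zed", 5000, 5060, 50, 0.5)

print(impact_score(a))                                           # 2500.0
print([[m.entity for m in g] for g in correlate([c, a, b])])
# [['merchant=acme', 'issuer=bankx'], ['merchant=zed']]
```

Here the merchant and issuer anomalies share no attribute, yet they are linked because they occur in the same time window, which is one plausible reading of "fuzzy" correlation.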

Dashboards & Routing 

Built in Grafana, Overwatch’s UI is lean yet expressive. Clever hacks allow cross-dashboard context passing. 

Alerts reach teams via: 

  • Email 
  • Microsoft Teams 

Smart routing ensures attention goes where it is needed—without alert overload. 

Metadata Monitoring: Observability of Observability

Every piece of Overwatch—from ingestion latencies to buffer depths—is itself monitored. Because when monitoring breaks, you need monitoring for that too. 
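
A self-check of this kind can be as simple as comparing pipeline metrics against limits. The sketch below is illustrative only: the metric names and limit values are hypothetical, not Overwatch's actual configuration.

```python
def pipeline_health(metrics, limits):
    """Return the names of pipeline metrics that exceed their limit.

    Metrics without a configured limit are never flagged.
    """
    return sorted(
        name for name, value in metrics.items()
        if value > limits.get(name, float("inf"))
    )


# Hypothetical readings and limits for the monitoring pipeline itself.
metrics = {"ingestion_lag_s": 42, "redis_buffer_depth": 120}
limits = {"ingestion_lag_s": 30, "redis_buffer_depth": 10_000}
print(pipeline_health(metrics, limits))   # ['ingestion_lag_s']
```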

Why Reinvent, Not Rent? 

The previous SaaS tool: 

  • Missed key incidents 
  • Monitored only partial metrics 
  • Scaled pricing with zero empathy for business complexity 

Overwatch’s rollout under contract pressure proved how fast focused engineering can deliver better outcomes. 

People & Change: The Human Factor 

Adoption was not automatic. Teams were used to legacy workflows. But Overwatch earned trust through results. 

What started as a niche tool quickly became central to broader business metric monitoring—driven by user demand, not top-down mandates. 

No AI copilots. Just clarity, iteration, and delivery. 

Engineering Decisions & Takeaways 

TimescaleDB’s Edge

  • Query speed in milliseconds 
  • Predictable retention and costs 
  • Minimal ops handholding 

Redis Buffering 

  • Critical for CDC stream integrity 
  • Prevents data misordering at scale

Zero-Ops Mandate 

  • No manual tuning 
  • Intelligent thresholds and alert logic built in 

Trustable Alerts 

  • Grouping, impact scoring, fuzzy linking—all designed to help ops act 
  • Alert fatigue avoided, actionability prioritized

Speed > Feature Fatigue

  • Deliver what solves the pain, iterate toward completeness
  • Smart shortcuts (like Grafana hacks) mattered more than perfection 

If We Had a Do-Over… 

  • Build resilience from day one: Automated fallbacks save real downtime—manual overrides do not scale. 
  • Start narrow, prove value, then scale: Precision-focused delivery earned trust and unlocked generalization. 
  • Consider team cost, not just infra cost: Alert clarity, developer time, and flexibility saved far more than just cloud bills. 

Summary 

Overwatch is PayU’s purpose-built system for scalable business observability. From payment success rates to system-level action, it is zero-ops by design and rapid-recovery by default. 

It is a testament to what tight priorities, trust in iteration, and small teams with big ownership can achieve. 

Teams asked for broader business-metrics monitoring soon after launch, a sign of Overwatch’s trustworthiness and scalability.
