The problem at country scale
Building infrastructure for the operational backbone of the Dominican Republic's largest payment network is not a software problem — it's a systems design problem. Every peso moving through merchant terminals, mobile wallets, bank integrations, and government disbursement channels flows through the same settlement layer. The margin for error at that scale is effectively zero.
The engineering challenge we faced wasn't building just another payment processing API. It was designing the entire operational intelligence layer — the layer that monitors, orchestrates, alerts, reconciles, and recovers — in real time, across a heterogeneous ecosystem of financial actors ranging from large commercial banks to corner-store merchants with mobile POS devices.
Starting with the right abstraction
Most payment infrastructure teams start with throughput. We started with operational surface area: how many distinct failure modes exist, how long do they take to surface, and who needs to act when they do?
In a country-scale network, the answer is uncomfortable. Failure modes are in the hundreds. Surfacing time, without intelligent monitoring, can be hours. And the actors who need to respond — compliance officers, fraud analysts, network operations teams, bank liaisons — are completely siloed from each other.
The first architectural decision was therefore not technical: we had to design a shared operational plane, a single intelligence layer that all actors could read from and write to in structured, real-time terms.
The operational intelligence architecture
We structured the operational infrastructure around five core capabilities:
1. Real-time event ingestion
Every transaction event — authorization, settlement, chargeback, reversal, dispute flag — is streamed into a unified event bus. The schema was standardized across all participant institutions, which required months of data normalization work upstream. Without this, downstream intelligence is meaningless.
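To make the idea of a standardized schema concrete, here is a minimal Python sketch of what a normalized event and a per-institution normalizer could look like. It is an illustration under stated assumptions, not the production schema: the field names (`id`, `txn`, `type`, `amount`, `timestamp`), the `normalize` helper, and the default currency are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from enum import Enum


class EventType(Enum):
    AUTHORIZATION = "authorization"
    SETTLEMENT = "settlement"
    CHARGEBACK = "chargeback"
    REVERSAL = "reversal"
    DISPUTE_FLAG = "dispute_flag"


@dataclass(frozen=True)
class TransactionEvent:
    event_id: str
    transaction_id: str
    event_type: EventType
    institution_id: str
    amount_minor: int      # amount in centavos, to avoid float rounding
    currency: str
    occurred_at: datetime  # always timezone-aware, stored as UTC


def normalize(raw: dict, institution_id: str) -> TransactionEvent:
    """Map one institution's raw record onto the shared schema.

    The raw field names used here are assumptions for illustration.
    """
    occurred = datetime.fromisoformat(raw["timestamp"])
    if occurred.tzinfo is None:
        raise ValueError("timestamp must carry a UTC offset")
    amount = Decimal(str(raw["amount"])) * 100
    return TransactionEvent(
        event_id=str(raw["id"]),
        transaction_id=str(raw["txn"]),
        event_type=EventType(raw["type"].strip().lower()),
        institution_id=institution_id,
        amount_minor=int(amount.to_integral_value()),
        currency=raw.get("currency", "DOP").upper(),
        occurred_at=occurred.astimezone(timezone.utc),
    )
```

Rejecting records at the normalization boundary (an unknown event type or a naive timestamp raises an error) keeps malformed data out of the event bus, so downstream consumers do not each need their own defensive parsing.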
2. Agentic monitoring loops
Rather than static dashboards with manual alert thresholds, we deployed autonomous monitoring agents — each responsible for a specific slice of the network. One agent watches settlement timing SLAs across all acquiring banks. Another monitors authorization decline rates by merchant category and geography. A third tracks chargeback velocity, comparing it against historical cohorts and seasonal patterns.
These agents don't just surface metrics. They reason about them. When a decline spike appears, the agent cross-references it against terminal firmware versions, network connectivity logs, and recent regulatory changes before escalating. This dramatically reduces false positive fatigue for human operators.
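The contextual check can be sketched in a few lines of Python. This is a deliberately simplified stand-in for the agents described above: it uses a z-score against recent history to detect a spike, then checks whether one terminal firmware version accounts for most of the declines. The thresholds, the `firmware` field, and the function names are assumptions for illustration.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Optional


def is_spike(history: list[float], current: float, z_threshold: float = 3.0) -> bool:
    """Flag the current decline rate if it sits far above the historical mean."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > z_threshold


def dominant_cause(declines: list[dict], key: str, share: float = 0.8) -> Optional[str]:
    """Return the value of `key` shared by at least `share` of declines, if any."""
    if not declines:
        return None
    value, count = Counter(d[key] for d in declines).most_common(1)[0]
    return value if count / len(declines) >= share else None


def assess(history: list[float], current: float, declines: list[dict]) -> dict:
    """Decide whether to escalate, and attach the most likely explanation."""
    if not is_spike(history, current):
        return {"escalate": False, "reason": "within normal range"}
    firmware = dominant_cause(declines, "firmware")
    if firmware is not None:
        return {"escalate": True, "reason": f"declines concentrated on firmware {firmware}"}
    return {"escalate": True, "reason": "unexplained decline spike"}
```

The same `dominant_cause` helper could be pointed at other context fields, such as a connectivity provider or region. The point is that the escalation carries a candidate explanation instead of a bare metric.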
3. Multi-tier alert orchestration
Alerts are inherently hierarchical in payment networks. A merchant experiencing elevated declines is a tier-1 alert. A bank experiencing elevated declines across all its merchants simultaneously is a tier-2 event. The same pattern appearing across all banks in a specific region of the country is a tier-3 national alert requiring regulatory notification.
The orchestration layer manages this tiering automatically, routing escalations to the right roles, generating pre-populated incident briefs, and initiating the appropriate playbooks — from automated retry logic at the transaction level to coordination calls with central bank operations at the systemic level.
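The tiering rule itself can be expressed compactly. The Python sketch below assumes a simplified model in which each merchant belongs to one bank and one region, and "affected" means the merchant currently shows elevated declines. The `Merchant` type and `classify_tier` function are hypothetical names, and a real implementation would presumably use thresholds rather than requiring every merchant to be affected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Merchant:
    merchant_id: str
    bank_id: str
    region: str


def classify_tier(affected: set[str], merchants: list[Merchant]) -> int:
    """Return 0 (no alert), 1 (merchant), 2 (bank-wide) or 3 (region-wide)."""
    if not affected:
        return 0
    by_bank: dict[str, set[str]] = {}
    by_region_bank: dict[str, dict[str, set[str]]] = {}
    for m in merchants:
        by_bank.setdefault(m.bank_id, set()).add(m.merchant_id)
        by_region_bank.setdefault(m.region, {}).setdefault(m.bank_id, set()).add(m.merchant_id)

    # Tier 3: in some region served by several banks, every bank's
    # merchants in that region are affected.
    for banks in by_region_bank.values():
        if len(banks) > 1 and all(ids <= affected for ids in banks.values()):
            return 3
    # Tier 2: at least one bank has all of its merchants affected.
    if any(ids <= affected for ids in by_bank.values()):
        return 2
    return 1
```

Checking the widest scope first means an event is always reported at its highest applicable tier, which matches the escalation order described above.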
4. Reconciliation intelligence
Settlement reconciliation in a multi-bank, multi-acquirer, multi-wallet environment is one of the hardest problems in payments. Traditional approaches rely on batch file comparisons run overnight. We replaced that with a continuous reconciliation engine that compares transaction state across all participant ledgers in near real time, flags discrepancies as they emerge, and assigns resolution priority based on value, age, and counterparty.
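As a rough illustration of the comparison and prioritization steps, the following Python sketch diffs two ledgers and scores a discrepancy. It assumes each ledger is a mapping from transaction id to a `(state, amount_minor)` pair. The scoring formula and the counterparty weight are invented for the example and are not the engine's actual policy.

```python
def find_discrepancies(ours: dict, theirs: dict) -> list[str]:
    """Transaction ids whose (state, amount) differ, or that exist on one side only."""
    all_ids = ours.keys() | theirs.keys()
    return sorted(t for t in all_ids if ours.get(t) != theirs.get(t))


def priority(amount_minor: int, age_hours: float, counterparty_weight: float = 1.0) -> float:
    """Illustrative score: larger, older, and higher-weight items rank first."""
    return (amount_minor / 100) * (1 + age_hours / 24) * counterparty_weight


# Usage: two ledgers that disagree on one amount and each hold an unmatched record.
ours = {"t1": ("settled", 1000), "t2": ("settled", 5000), "t3": ("authorized", 700)}
theirs = {"t1": ("settled", 1000), "t2": ("settled", 4900), "t4": ("settled", 300)}
open_items = find_discrepancies(ours, theirs)  # ["t2", "t3", "t4"]
```

Run continuously over a sliding set of recent transactions, a diff like this surfaces mismatches shortly after they occur instead of the morning after a batch run.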
5. Fraud signal aggregation
Country-scale fraud patterns don't live in individual institution data. They live in the correlations across institutions. A card tested at 50 merchants across 5 banks in 4 minutes is invisible to any single bank's fraud system. It's a clear signal at the network level.
We built a federated fraud intelligence layer that aggregates anonymized signals from all participant institutions, runs behavioral pattern matching against known fraud typologies, and generates network-wide risk scores in real time. High-risk signals trigger immediate coordinated responses across all participant institutions simultaneously.
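The card-testing example lends itself to a short sliding-window sketch in Python. It assumes events arrive roughly in timestamp order and that the card is identified by an anonymized token, not a raw card number. The class name, thresholds, and method signature are hypothetical.

```python
from collections import defaultdict, deque


class CardTestingDetector:
    """Flags a card token seen at many merchants and banks within a short window."""

    def __init__(self, window_s: int = 240, min_merchants: int = 50, min_banks: int = 5):
        self.window_s = window_s
        self.min_merchants = min_merchants
        self.min_banks = min_banks
        # card token -> deque of (timestamp_s, merchant_id, bank_id)
        self._events = defaultdict(deque)

    def observe(self, card_token: str, ts: float, merchant_id: str, bank_id: str) -> bool:
        events = self._events[card_token]
        events.append((ts, merchant_id, bank_id))
        # Drop events that have fallen out of the window.
        while events and events[0][0] <= ts - self.window_s:
            events.popleft()
        merchants = {m for _, m, _ in events}
        banks = {b for _, _, b in events}
        return len(merchants) >= self.min_merchants and len(banks) >= self.min_banks
```

No single bank can evaluate this condition alone, because the `banks` set is only meaningful when events from several institutions land in the same window.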
The hardest engineering decisions
Consistency vs. availability under network partition
The Dominican financial ecosystem has meaningful geographic connectivity variance. Rural merchant terminals often operate over degraded mobile connections. The architecture had to handle partial network partitions gracefully — meaning transactions in flight during connectivity loss needed guaranteed eventual consistency without manual intervention.
We chose an event-sourcing model with idempotent message delivery and a distributed state machine per transaction. Every state transition is logged immutably. Recovery after partition is deterministic. The trade-off is increased storage I/O and higher operational complexity — both of which we considered acceptable given the zero-tolerance requirement for settlement errors.
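A minimal Python sketch shows how idempotent delivery and a per-transaction state machine fit together. The state names and the allowed transitions are assumptions chosen for the example; the actual transaction lifecycle is likely richer.

```python
from typing import Optional

# Allowed transitions: current state -> set of valid next states (illustrative).
ALLOWED = {
    None: {"authorized"},
    "authorized": {"settled", "reversed"},
    "settled": {"charged_back"},
    "reversed": set(),
    "charged_back": set(),
}


class TransactionLog:
    """Append-only event log for one transaction, with idempotent apply."""

    def __init__(self) -> None:
        self._events: list[tuple[str, str]] = []  # (event_id, new_state)
        self._seen: set[str] = set()

    @property
    def state(self) -> Optional[str]:
        return self._events[-1][1] if self._events else None

    def apply(self, event_id: str, new_state: str) -> bool:
        if event_id in self._seen:
            return False  # duplicate delivery after a retry: safely ignored
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"invalid transition {self.state!r} -> {new_state!r}")
        self._events.append((event_id, new_state))
        self._seen.add(event_id)
        return True

    def replay(self) -> "TransactionLog":
        """Rebuild state from the log, as recovery after a partition would."""
        fresh = TransactionLog()
        for event_id, new_state in self._events:
            fresh.apply(event_id, new_state)
        return fresh
```

Because state is derived only from the ordered log, replaying the same events always yields the same result, which is what makes recovery deterministic. A terminal that resends an event after reconnecting cannot cause a double transition.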
Multi-institution trust boundaries
No bank was willing to expose raw transaction data to an operator of another bank. The privacy engineering challenge was significant. We solved this through a combination of purpose-limited data views (institutions only see aggregated network signals, never peer transaction data), cryptographic audit trails (every data access is logged and verifiable), and contractual data governance enforced at the infrastructure layer.
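One common way to make an access log verifiable is a hash chain, in which each entry commits to the one before it. The Python sketch below illustrates that general technique. It is an assumption about how such a trail could work, not a description of the mechanism actually deployed, and the entry fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64
FIELDS = ("actor", "resource", "ts", "prev")


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes identically.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(chain: list[dict], actor: str, resource: str, ts: str) -> dict:
    """Append an access record that commits to the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    body = {"actor": actor, "resource": resource, "ts": ts, "prev": prev}
    entry = {**body, "hash": _digest(body)}
    chain.append(entry)
    return entry


def verify(chain: list[dict]) -> bool:
    """Recompute every hash and link; any edited or removed entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        body = {k: entry[k] for k in FIELDS}
        if entry["prev"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True
```

With this structure, an auditor who holds only the latest hash can detect after-the-fact tampering with any earlier access record.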
Real-time vs. near-real-time for fraud signals
Full real-time fraud analysis at the authorization moment adds latency to every transaction. For a high-volume network, even 50 milliseconds at the authorization layer compounds into meaningful throughput degradation.
Our solution was a two-stage model. A lightweight synchronous risk score is computed at authorization time using cached signals (sub-5 ms P99 latency). A deeper asynchronous analysis runs post-authorization, updating the fraud intelligence layer and informing future authorization decisions. For the highest-risk segments (international cards, high-value transactions, new merchant categories), the synchronous stage is extended with a slightly richer real-time analysis.
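The split between the two stages can be sketched as follows in Python. This is a simplified, single-process illustration: the in-memory dictionary stands in for the cached signals, the queue stands in for whatever asynchronous pipeline is actually used, and the class name, threshold, and `analyze` callback are all hypothetical.

```python
import queue
from typing import Callable


class TwoStageScorer:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.cache: dict[str, float] = {}          # card token -> cached risk score
        self.pending: queue.Queue = queue.Queue()  # work for the asynchronous stage

    def authorize(self, card_token: str, amount_minor: int) -> bool:
        """Stage 1 (synchronous): a cache lookup only, no model call on the hot path."""
        score = self.cache.get(card_token, 0.0)
        self.pending.put((card_token, amount_minor))
        return score < self.threshold

    def run_deep_analysis(self, analyze: Callable[[str, int], float]) -> int:
        """Stage 2 (asynchronous): drain queued work and refresh cached scores."""
        processed = 0
        while True:
            try:
                card_token, amount_minor = self.pending.get_nowait()
            except queue.Empty:
                return processed
            self.cache[card_token] = analyze(card_token, amount_minor)
            processed += 1
```

The design choice is that the authorization path never waits on the expensive analysis. It only reads what earlier analyses have already written, so its latency is bounded by a lookup, while the deeper model influences the next decision and not the current one.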
What agentic AI changed
The 2024 version of this infrastructure used rule-based alerting with human review queues. The 2025 architecture adds autonomous agents that operate continuously between human interventions.
The most impactful change is in incident classification speed. When a payment network showed an unusual pattern in one region, an analyst used to have to manually correlate four different dashboards, consult two different teams, and write an incident brief, a process that took 20 to 40 minutes. The agentic architecture now produces a structured, pre-triaged incident brief in under 90 seconds, with supporting evidence, affected entity lists, and a recommended escalation path pre-populated.
The second biggest change is in cross-institution coordination. When a systemic event is detected, the agentic layer automatically notifies all relevant institution operations centers simultaneously, each with institution-specific context (e.g., Bank A sees its own affected merchant list; Bank B sees its own). This eliminated the phone-tree coordination that previously consumed 30 to 45 minutes at the start of every major incident.
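The institution-specific fan-out amounts to partitioning the affected entities by owner before anything is sent. A minimal Python sketch, assuming the incident carries a list of `(bank_id, merchant_id)` pairs and using a hypothetical `build_notifications` helper:

```python
from collections import defaultdict


def build_notifications(incident_id: str, affected: list[tuple[str, str]]) -> dict[str, dict]:
    """Build one payload per bank, containing only that bank's own merchants."""
    per_bank: dict[str, list[str]] = defaultdict(list)
    for bank_id, merchant_id in affected:
        per_bank[bank_id].append(merchant_id)
    return {
        bank_id: {"incident_id": incident_id, "affected_merchants": sorted(ids)}
        for bank_id, ids in per_bank.items()
    }
```

Building the payloads this way also respects the trust boundaries described earlier: no bank's notification ever contains another bank's merchant identifiers.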
Results that matter at this scale
The infrastructure is going live across the full participant network in mid-2026. The projected operational metrics at full deployment are:
• Mean time to detect (MTTD) for network anomalies: from 38 minutes to under 4 minutes
• Mean time to notify (MTTN) relevant stakeholders: from 45 minutes to under 2 minutes
• Reconciliation discrepancy resolution time: from 3-5 business days (batch) to same-day for 94% of cases
• False positive alert rate: reduced by 71% through the contextual reasoning layer
• Fraud detection coverage: 3.4× improvement in cross-institution fraud signal detection
What we learned about country-scale systems
The most important lesson is that operational intelligence at this scale is not a single product — it is a continuous negotiation between technical capability and institutional reality. The best algorithm for fraud detection means nothing if the institution trust framework doesn't allow data to flow. The most elegant reconciliation engine is irrelevant if the merchant terminal ecosystem can't produce clean transaction records.
The engineering work is 30% of the project. The other 70% is the organizational architecture: who owns what signal, who responds to what alert, how decisions are made at speed across institutions with their own governance structures and regulatory obligations.
We're proud of what was built. More importantly, we're proud of the operating model that surrounds it — because that's what makes it run.
