Skip to content

Fail-Open Resilience

AgentWatch is designed with a fundamental principle: our infrastructure never causes your production to go down. This is achieved through fail-open architecture at every layer.

What "Fail-Open" Means

When AgentWatch infrastructure is unreachable or experiencing issues, budget checks silently fail open — allowing API calls to proceed to the upstream provider without enforcement. This ensures your agents always work, even during AgentWatch outages.

Fail-Open Behavior by Mode

When agentwatch_primary_provider is NOT set, only budget checks and telemetry go through AgentWatch:

python
client = WatchedOpenAI(
    agentwatch_api_key="aw_live_xxx",
    agentwatch_session_budget_usd=2.00,
    agentwatch_enforcement_mode=True,
    # No agentwatch_primary_provider — SDK calls go directly to OpenAI
)

If AgentWatch is down:

  • Budget check HTTP call fails (timeout at 2s)
  • SDK catches exception, logs debug message
  • API call proceeds directly to OpenAI
  • Telemetry is silently dropped

Proxy Mode

When agentwatch_primary_provider is set, all traffic routes through AgentWatch:

python
client = WatchedOpenAI(
    agentwatch_api_key="aw_live_xxx",
    agentwatch_primary_provider="openai",
    agentwatch_session_budget_usd=2.00,
    agentwatch_enforcement_mode=True,
)

If AgentWatch is down:

  • The SDK automatically falls back to the direct provider URL
  • No manual intervention required

Configuration

Fail-open is the default behavior. To change it:

python
# Fail-open (default) — AgentWatch outage doesn't affect your app
client = WatchedOpenAI(
    agentwatch_enforcement_fail_open=True,  # default
    ...
)

# Fail-closed — AgentWatch outage blocks API calls
client = WatchedOpenAI(
    agentwatch_enforcement_fail_open=False,
    ...
)

What Happens During an Outage

ComponentBehavior
Budget checkFails open, API call proceeds
Telemetry loggingSilently dropped, no data loss
Anomaly detectionDisabled (requires edge processing)
DashboardMay show stale data
Compliance reportsSkipped for the period

Monitoring AgentWatch Health

Monitor AgentWatch availability via the health endpoint:

bash
curl https://agent-watch.dev/healthz
# Returns: ok

If this endpoint returns non-200, AgentWatch is experiencing issues and fail-open is active.