agent online · up.or.dead / ai-sre
two products / one stack / ai sre + uptime probes

HIRE AN AI SRE. FIRE THE 3 AM PAGE.

A Slack-native agent that investigates incidents against your logs, metrics, traces, errors and web events. Paired with 500ms uptime probes from 38 regions, so you find out before your users do — and know why before you finish your coffee.

slack · ms teams · claude code mcp · github
fig.01 · agent / autopsy_ready
agent: sre-0xDEAD
status: investigating
tools: logs · traces · errors
auto-merge: never
median MTTR: −63% (vs. legacy on-call)
auto-RCA hit rate: 94% (within top-3 hypotheses)
probe interval: 500ms (every region, every check)
regions worldwide: 38 (bare metal, no shared tenants)
AI SRE / SLACK NATIVE + 500MS UPTIME PROBES + HUMAN IN THE LOOP + NO AUTO-MERGE
section / 01 / product · ai sre

the ai sre // 3 am shift

One agent. It reads your stack the way a staff engineer would, but at machine speed. Tag it in Slack and it shows up with three ranked hypotheses, the traces to back them up, and a draft PR if you want one.
how the agent thinks
step / 01
observe

Watches deploys, error rates, latency, log anomalies and probe failures in real time. No dashboard required — it brings the signal to you.

step / 02
correlate

Cross-references signals against the last 50 deploys, recent migrations, and known runbooks. Ranks hypotheses by confidence with citations.

step / 03
propose

Drops a Slack thread summary, opens a PR with the suggested change, and waits. You decide what ships. The agent never merges on its own.
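
The correlate and propose steps above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual implementation: the `Hypothesis` class, the 0.0 to 1.0 confidence scale, and both function names are assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    # All fields are hypothetical; the real agent's data model is not public here.
    summary: str
    confidence: float                     # assumed scale: 0.0 to 1.0
    evidence: list[str] = field(default_factory=list)  # e.g. deploy shas, trace links


def rank_hypotheses(candidates, top_n=3):
    """Return at most top_n candidates, highest confidence first."""
    return sorted(candidates, key=lambda h: h.confidence, reverse=True)[:top_n]


def format_thread_summary(ranked):
    """Render ranked hypotheses as the lines of a chat thread summary."""
    lines = []
    for position, hyp in enumerate(ranked, start=1):
        line = f"Hypothesis {position} ({hyp.confidence:.0%}): {hyp.summary}"
        if hyp.evidence:
            line += f" [evidence: {', '.join(hyp.evidence)}]"
        lines.append(line)
    return "\n".join(lines)


candidates = [
    Hypothesis("Stripe webhook retries backing up redis queue", 0.14),
    Hypothesis("new query in checkout.PaymentRepo missing index", 0.78, ["sha 9f2a1c"]),
]
ranked = rank_hypotheses(candidates)
print(format_thread_summary(ranked))
```

Nothing in the sketch merges or deploys anything. It only sorts and formats, which matches the rule that the agent proposes and a human decides.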

live transcript

what a real incident looks like.

#incident-0xDEAD · slack · ● live
03:14:08  alert  p99 latency on api.checkout > 4000ms in us-east-1
03:14:42  human  @up-or-dead what's going on
03:14:44  agent  Correlating with last deploy (sha 9f2a1c, 17m ago) + recent error spike in payments-svc. Pulling traces.
03:14:51  agent  Hypothesis 1 (78%): new query in checkout.PaymentRepo missing index on (user_id, created_at). 12.4k slow rows scanned per req.
03:14:52  agent  Hypothesis 2 (14%): Stripe webhook retries backing up redis queue (lag 1.2M).
03:15:09  human  open a PR for the index
03:15:18  agent  PR #4412 opened → migrations/2026_idx_payment_repo.sql. Awaiting your review. I will NOT merge.
what the agent actually does

capabilities, no demos required.

cap / 01
human-in-the-loop
agentic root cause

Correlates deploys, errors, traces, metric trends and logs into ranked hypotheses with confidence scores and links to the evidence.

cap / 02
human-in-the-loop
plugs into your stack

Datadog, Grafana, Sentry, Linear, Notion, Honeycomb. Or ingest logs, metrics and traces directly via OpenTelemetry.

cap / 03
human-in-the-loop
slack · teams · mcp

Tag @up-or-dead in any incident thread, or call the agent from Claude Code, Cursor and Zed via the MCP server.

cap / 04
human-in-the-loop
pull requests, not panic

Gets you a PR with the suggested fix waiting in GitHub or GitLab. You review. The agent never merges on its own.

cap / 05
human-in-the-loop
drafts the post-mortem

Once resolved, the agent writes the first draft of the timeline, contributing factors and action items. Edit, ship, archive.

cap / 06
human-in-the-loop
runbook aware

Reads your existing runbooks and follows them. If a runbook is missing, it suggests one based on how you actually fixed the incident.

field report
“our 3 am pages dropped from nine per week to two. the agent already had the PR open by the time i was on the laptop.”
staff sre · public fintech · series d
section / 02 / product · uptime monitoring

uptime monitoring // every 500ms

Real probes. Bare metal. 38 regions. The agent watches the same signal you do — so when something breaks, it already knows.
live · last 30 seconds
service            region       p99       state
api.checkout       us-east-1    4218ms    DEGRADED
api.checkout       eu-west-2    184ms     OK
auth.sso           global       92ms      OK
webhook.stripe     us-east-1    8800ms    RETRY
edge.cdn           ap-south-1   61ms      OK
db.replica-3       eu-west-2    12ms      OK
queue.kafka-east   us-east-1    -         LAGGING
llm.gateway        global       1240ms    OK
monitor types
monitor / 01
HTTP / HTTPS

Status codes, headers, JSON-path assertions, response body diffing.

monitor / 02
TCP / UDP / ICMP

Raw socket probes for the ports your app actually listens on.

monitor / 03
SSL / TLS

Cert expiry, full chain validation, OCSP, mixed-content detection.

monitor / 04
DNS

Authoritative lookups, propagation across all 38 regions.

monitor / 05
Synthetic flows

Multi-step browser checks for login, checkout, signup, search.

monitor / 06
Agent traces

Probe LLM agents end-to-end: tools called, tokens, hallucinations.
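
Two of the monitor types above, HTTP assertions and certificate expiry, can be sketched in a few lines. This is an illustration under stated assumptions, not the product's check engine: `json_path`, `check_http`, and `days_until_cert_expiry` are hypothetical helpers, and the dotted path syntax is a simplification of real JSON-path. The only library call relied on is the standard library's `ssl.cert_time_to_seconds`.

```python
import ssl
from datetime import datetime, timezone


def json_path(doc, path):
    """Walk a dotted path such as 'data.items.0.id' through dicts and lists."""
    current = doc
    for key in path.split("."):
        current = current[int(key)] if isinstance(current, list) else current[key]
    return current


def check_http(status, headers, body, expect_status=200,
               expect_headers=None, expect_json=None):
    """Compare an already-fetched response against assertions; return failures."""
    failures = []
    if status != expect_status:
        failures.append(f"status {status} != {expect_status}")
    lowered = {name.lower(): value for name, value in headers.items()}
    for name, want in (expect_headers or {}).items():
        got = lowered.get(name.lower())  # header names are case-insensitive
        if got != want:
            failures.append(f"header {name}: {got!r} != {want!r}")
    for path, want in (expect_json or {}).items():
        try:
            got = json_path(body, path)
        except (KeyError, IndexError, ValueError, TypeError):
            failures.append(f"json path {path}: missing")
            continue
        if got != want:
            failures.append(f"json path {path}: {got!r} != {want!r}")
    return failures


def days_until_cert_expiry(not_after, now=None):
    """not_after is the 'notAfter' string format used by SSLSocket.getpeercert()."""
    expires_ts = ssl.cert_time_to_seconds(not_after)
    now_ts = (now or datetime.now(timezone.utc)).timestamp()
    return (expires_ts - now_ts) / 86_400


failures = check_http(
    200,
    {"Content-Type": "application/json"},
    {"data": {"status": "ok"}},
    expect_headers={"content-type": "application/json"},
    expect_json={"data.status": "ok"},
)
print(failures)  # [] when every assertion holds
```

Keeping the fetch separate from the assertions means the same check can run against a live response or a recorded one.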

public status page
status.yourcompany.com

Custom domain, your branding, incident timelines drafted by the AI SRE. No CSS injection. No iframe of shame. No "we are aware of an issue" pasted by a panicking human.

3 of 4 systems up · 1 degraded
on-call & escalation
SMS · voice · slack · siren

Rotations in plain English. Escalation policies you can read out loud. Optional physical desk-siren for the production team that thinks they're hardcore.

section / 03 / global probe network

38 regions. real metal.

Every region runs dedicated probe hardware on bare metal — no shared tenancy, no surprise noisy neighbors, no "actually it was AWS" excuses.
us-east-1
us-east-2
us-west-1
us-west-2
ca-central-1
sa-east-1
eu-west-1
eu-west-2
eu-west-3
eu-central-1
eu-north-1
eu-south-1
me-south-1
af-south-1
af-north-1
il-central-1
ap-south-1
ap-south-2
ap-northeast-1
ap-northeast-2
ap-northeast-3
ap-southeast-1
ap-southeast-2
ap-southeast-3
ap-southeast-4
ap-east-1
cn-north-1
cn-northwest-1
in-mum-1
in-del-1
jp-osa-1
au-syd-1
au-mel-1
br-gru-1
ar-eze-1
cl-scl-1
za-jnb-1
ng-lag-1
POPs: 38
Probes / day: 172M
Avg latency: 47ms
Tenant model: isolated
section / 04 / integrations & stack

one stack, two weapons.

datadog
grafana
sentry
linear
notion
github
gitlab
slack
ms teams
claude code
cursor
zed
opentelemetry
ebpf
pagerduty
incident.io
jira
honeycomb
section / 05 / trust & compliance

serious about your data.

trust / 01
SOC 2 type 2

Attestation available under NDA. We pass audits so you can.

trust / 02
GDPR · ISO 27001

EU residency option. Data centers under ISO 27001 controls.

trust / 03
no auto-actions

The agent suggests. It never merges, deploys, or rolls back without you.

section / 06 / frequently autopsied

questions, answered cold.

q.01
Can the AI SRE deploy fixes on its own?
No. Suggestions land as PRs in GitHub or tickets in Linear. A human reviews and merges. There is no auto-merge, no auto-rollback, no surprise prod changes.
q.02
What data does the agent see?
Only what you connect: logs, metrics, traces, error events, deploys, runbooks. Scoped, audited, revocable per integration. No model training on your data.
q.03
How are uptime probes different from the others?
Probes run on bare metal in 38 regions, fire every 500ms, and ship with the AI SRE that explains failures. Most competitors share tenancy and check every 60s.
q.04
Can I self-host?
Yes. The probe layer and the agent runtime ship as a self-hosted bundle for regulated environments. The MCP server is open source.
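
The interval comparison in q.03 works out as simple arithmetic. The sketch below assumes a single check in a single region and ignores retries or confirmation probes, which a real alerting pipeline would likely add; the function names are made up for this example.

```python
SECONDS_PER_DAY = 86_400


def probes_per_day(interval_s):
    """Probes fired per day by one check in one region at a fixed interval."""
    return SECONDS_PER_DAY / interval_s


def worst_case_detection_delay(interval_s):
    """An outage that starts right after a probe is first seen by the next one."""
    return interval_s


fast = probes_per_day(0.5)   # 500ms interval
slow = probes_per_day(60)    # 60s interval
print(fast, slow, fast / slow)
print(worst_case_detection_delay(0.5), worst_case_detection_delay(60))
```

At a 500ms interval one check fires 172,800 times a day per region, 120 times as often as a 60s check, and the worst-case gap before the first failing probe shrinks from 60 seconds to half a second.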
section / 07 / pricing

pick a tier. or don't.

SOLO
$0
forever, on us
  • 100 uptime monitors
  • 50 AI investigations / month
  • 1 region · 1 status page
  • community support
$ start_tier →
TEAM
most installed
$49
per dev / month
  • Unlimited monitors
  • Unlimited investigations
  • All 38 regions · 500ms
  • Slack, Teams & MCP
  • SOC 2, audit logs
$ start_tier →
FLEET
talk
self-host or VPC
  • Bring your own cloud
  • Private LLM routing
  • ISO 27001 · SSO · SCIM
  • Dedicated SRE for your SRE
book a call →

stop reading. start investigating.

Free for the first 100 monitors and 50 agent investigations / month. No card. No demo call. No LinkedIn DMs from a fake AE.