agent online · up.or.dead / ai-sre

v0.7.0-sre · slack · mcp · oncall · 2026-07-28 23:59:51Z

●two products / one stack/ai sre + uptime probes

HIRE AN AI SRE.

A Slack-native agent that investigates incidents against your logs, metrics, traces, errors and web events. Paired with 500ms uptime probes from 38 regions, so you find out before your users do — and know why before you finish your coffee.

slack · ms teams · claude code mcp · github

fig.01 · agent / autopsy_ready

agent: sre-0xDEAD
status: investigating
tools: logs · traces · errors
auto-merge: never

median MTTR: −63% vs. legacy on-call

auto-RCA hit rate: 94% within the top-3 hypotheses

probe interval: 500ms, every region, every check

regions worldwide: 38, bare metal, no shared tenants

AI SRE / SLACK NATIVE + 500MS UPTIME PROBES + HUMAN IN THE LOOP + NO AUTO-MERGE

●section / 01/product · ai sre

the ai sre // 3 am shift

One agent. Reads your stack the way a staff engineer would, but at machine speed. Tag it in Slack and it shows up with three ranked hypotheses, the traces to back them up, and a draft PR if you want one.

how the agent thinks

step

observe

Watches deploys, error rates, latency, log anomalies and probe failures in real time. No dashboard required — it brings the signal to you.

step

correlate

Cross-references signals against the last 50 deploys, recent migrations, and known runbooks. Ranks hypotheses by confidence with citations.

step

propose

Drops a Slack thread summary, opens a PR with the suggested change, and waits. You decide what ships. The agent never merges on its own.
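The correlate step above can be pictured as a small ranking routine. What follows is a minimal sketch, not the agent's actual implementation: the `Hypothesis` type, its fields, and `rank_hypotheses` are hypothetical names, the citation strings are placeholders, and the two confidence values are borrowed from the sample transcript on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    # Hypothetical shape: a one-line summary, a confidence in [0, 1],
    # and links to the evidence (traces, deploys, runbooks) that back it.
    summary: str
    confidence: float
    citations: list[str] = field(default_factory=list)


def rank_hypotheses(candidates: list[Hypothesis], top_n: int = 3) -> list[Hypothesis]:
    """Return the top_n candidates that carry evidence, highest confidence first."""
    cited = [h for h in candidates if h.citations]
    return sorted(cited, key=lambda h: h.confidence, reverse=True)[:top_n]


# Placeholder candidates; the citation identifiers are invented for illustration.
candidates = [
    Hypothesis("Stripe webhook retries backing up the redis queue", 0.14, ["metric:queue-lag"]),
    Hypothesis("Missing index on (user_id, created_at)", 0.78, ["deploy:9f2a1c", "trace:slow-query"]),
    Hypothesis("Uncited guess", 0.50),
]
ranked = rank_hypotheses(candidates)
for h in ranked:
    print(f"{h.confidence:.0%}  {h.summary}  [{', '.join(h.citations)}]")
```

Dropping uncited candidates before sorting is one way to honor "with citations": a guess without evidence never reaches the Slack thread, however confident it looks.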

live transcript

what a real incident looks like.

3.2s median time to first hypothesis

#incident-0xDEAD · slack · ● live

03:14:08 alert: p99 latency on api.checkout > 4000ms in us-east-1
03:14:42 human: @up-or-dead what's going on
03:14:44 agent: Correlating with last deploy (sha 9f2a1c, 17m ago) + recent error spike in payments-svc. Pulling traces.
03:14:51 agent: Hypothesis 1 (78%): new query in checkout.PaymentRepo missing index on (user_id, created_at). 12.4k slow rows scanned per req.
03:14:52 agent: Hypothesis 2 (14%): Stripe webhook retries backing up redis queue (lag 1.2M).
03:15:09 human: open a PR for the index
03:15:18 agent: PR #4412 opened → migrations/2026_idx_payment_repo.sql. Awaiting your review. I will NOT merge.

what the agent actually does

capabilities, no demos required.

cap / 01

human-in-the-loop

agentic root cause

Correlates deploys, errors, traces, metric trends and logs into ranked hypotheses with confidence scores and links to the evidence.

cap / 02

human-in-the-loop

plugs into your stack

Datadog, Grafana, Sentry, Linear, Notion, Honeycomb. Or ingest logs, metrics and traces directly via OpenTelemetry.

cap / 03

human-in-the-loop

slack · teams · mcp

Tag @up-or-dead in any incident thread, or call the agent from Claude Code, Cursor and Zed via the MCP server.

cap / 04

human-in-the-loop

pull requests, not panic

Gets you a PR with the suggested fix waiting in GitHub or GitLab. You review. The agent never merges on its own.

cap / 05

human-in-the-loop

drafts the post-mortem

Once resolved, the agent writes the first draft of the timeline, contributing factors and action items. Edit, ship, archive.

cap / 06

human-in-the-loop

runbook aware

Reads your existing runbooks and follows them. If a runbook is missing, it suggests one based on how you actually fixed the incident.

field report

“our 3 am pages dropped from nine per week to two. the agent already had the PR open by the time i was on the laptop.”

staff sre · public fintech · series d

●section / 02/product · uptime monitoring

uptime monitoring // every 500ms

Real probes. Bare metal. 38 regions. The agent watches the same signal you do — so when something breaks, it already knows.

live · last 30 seconds

service            region       p99      state
api.checkout       us-east-1    4218ms   DEGRADED
api.checkout       eu-west-2    184ms
auth.sso           global       92ms
webhook.stripe     us-east-1    8800ms   RETRY
edge.cdn           ap-south-1   61ms
db.replica-3       eu-west-2    12ms
queue.kafka-east   us-east-1    -        LAGGING
llm.gateway        global       1240ms

monitor types

monitor / 01

HTTP / HTTPS

Status codes, headers, JSON-path assertions, response body diffing.

monitor / 02

TCP / UDP / ICMP

Raw socket probes for the ports your app actually listens on.

monitor / 03

SSL / TLS

Cert expiry, full chain validation, OCSP, mixed-content detection.

monitor / 04

DNS

Authoritative lookups, propagation across all 38 regions.

monitor / 05

Synthetic flows

Multi-step browser checks for login, checkout, signup, search.

monitor / 06

Agent traces

Probe LLM agents end-to-end: tools called, tokens, hallucinations.
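To make the HTTP / HTTPS monitor (monitor / 01) concrete, here is a minimal sketch of how a status-code check and JSON-path assertions might be evaluated once a response is in hand. It is an assumption-laden illustration, not the product's check engine: `evaluate_http_check` and `json_path_get` are hypothetical names, and the path syntax is a simplified dotted form (`checks.0.healthy`), not full JSONPath.

```python
import json
from typing import Any


def json_path_get(document: Any, path: str) -> Any:
    """Resolve a dotted path such as 'checks.0.healthy' against parsed JSON."""
    current = document
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # numeric segment indexes into a list
        else:
            current = current[part]       # any other segment is a dict key
    return current


def evaluate_http_check(status: int, body: str, expected_status: int,
                        assertions: dict[str, Any]) -> list[str]:
    """Return a list of failure messages; an empty list means the check passed."""
    failures: list[str] = []
    if status != expected_status:
        # A wrong status fails the check outright; the body is not inspected.
        failures.append(f"status {status} != {expected_status}")
        return failures
    document = json.loads(body)
    for path, expected in assertions.items():
        try:
            actual = json_path_get(document, path)
        except (KeyError, IndexError, ValueError, TypeError):
            failures.append(f"{path}: missing")
            continue
        if actual != expected:
            failures.append(f"{path}: {actual!r} != {expected!r}")
    return failures


# Hypothetical health-endpoint response used only for illustration.
body = '{"status": "ok", "checks": [{"name": "db", "healthy": true}]}'
print(evaluate_http_check(200, body, 200, {"status": "ok", "checks.0.healthy": True}))
print(evaluate_http_check(503, body, 200, {"status": "ok"}))
```

Returning a list of failures rather than a boolean keeps the reason for a failed probe available to whatever explains it downstream.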

public status page

status.yourcompany.com

Custom domain, your branding, incident timelines drafted by the AI SRE. No CSS injection. No iframe of shame. No "we are aware of an issue" pasted by a panicking human.

3 of 4 systems up · 1 degraded

on-call & escalation

SMS · voice · slack · siren

Rotations in plain English. Escalation policies you can read out loud. Optional physical desk-siren for the production team that thinks they're hardcore.
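An escalation policy you can read out loud maps naturally onto a short ordered list. The sketch below is purely illustrative: `EscalationStep`, `POLICY`, `steps_due`, the delays, and the targets are all hypothetical, and only the four channel names come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step fires
    channel: str        # one of the channels named above: sms, voice, slack, siren
    target: str         # hypothetical rotation or desk identifier


# Hypothetical policy: "Slack the primary at once, SMS after 5 minutes,
# phone the secondary after 10, sound the desk siren after 15."
POLICY = [
    EscalationStep(0, "slack", "primary"),
    EscalationStep(5, "sms", "primary"),
    EscalationStep(10, "voice", "secondary"),
    EscalationStep(15, "siren", "team-desk"),
]


def steps_due(policy: list[EscalationStep], minutes_unacked: float) -> list[EscalationStep]:
    """Return every step whose delay has elapsed, in escalation order."""
    ordered = sorted(policy, key=lambda s: s.after_minutes)
    return [s for s in ordered if s.after_minutes <= minutes_unacked]


for step in steps_due(POLICY, minutes_unacked=7):
    print(f"{step.channel} -> {step.target}")
```

With seven minutes unacknowledged, the Slack and SMS steps are due and the voice call and siren are still pending.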

●section / 03/global probe network

38 regions. real metal.

Every region runs dedicated probe hardware on bare metal — no shared tenancy, no surprise noisy neighbors, no "actually it was AWS" excuses.

us-east-1 · us-east-2 · us-west-1 · us-west-2 · ca-central-1 · sa-east-1
eu-west-1 · eu-west-2 · eu-west-3 · eu-central-1 · eu-north-1 · eu-south-1
me-south-1 · af-south-1 · af-north-1 · il-central-1
ap-south-1 · ap-south-2 · ap-northeast-1 · ap-northeast-2 · ap-northeast-3
ap-southeast-1 · ap-southeast-2 · ap-southeast-3 · ap-southeast-4 · ap-east-1
cn-north-1 · cn-northwest-1 · in-mum-1 · in-del-1 · jp-osa-1
au-syd-1 · au-mel-1 · br-gru-1 · ar-eze-1 · cl-scl-1 · za-jnb-1 · ng-lag-1

POPs

Probes / day: 172M

Avg latency: 47ms

Tenant model: isolated

●section / 04/integrations & stack

one stack, two weapons.

datadog · grafana · sentry · linear · notion · github · gitlab · slack · ms teams
claude code · cursor · zed · opentelemetry · ebpf · pagerduty · incident.io · jira · honeycomb

●section / 05/trust & compliance

serious about your data.

trust / 01

SOC 2 type 2

Attestation available under NDA. We pass audits so you can.

trust / 02

GDPR · ISO 27001

EU residency option. Data centers under ISO 27001 controls.

trust / 03

no auto-actions

The agent suggests. It never merges, deploys, or rolls back without you.

●section / 06/frequently autopsied

questions, answered cold.

q.01

Can the AI SRE deploy fixes on its own?

No. Suggestions land as PRs in GitHub or tickets in Linear. A human reviews and merges. There is no auto-merge, no auto-rollback, no surprise prod changes.

q.02

What data does the agent see?

Only what you connect: logs, metrics, traces, error events, deploys, runbooks. Scoped, audited, revocable per integration. No model training on your data.

q.03

How are uptime probes different from the others?

Probes run on bare metal in 38 regions, fire every 500ms, and ship with the AI SRE that explains failures. Most competitors share tenancy and check every 60s.

q.04

Can I self-host?

Yes. The probe layer and the agent runtime ship as a self-hosted bundle for regulated environments. The MCP server is open source.
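The 500ms versus 60s comparison in q.03 is easy to work through. The sketch below assumes each check is probed from every one of the 38 regions at the stated interval, as "every region, every check" suggests; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def probes_per_day(interval_seconds: float, regions: int = 1) -> int:
    """Number of probes a single check fires per day across the given regions."""
    return int(SECONDS_PER_DAY / interval_seconds) * regions


def worst_case_detection_delay(interval_seconds: float) -> float:
    """An outage that begins just after a probe is not seen until the next one."""
    return interval_seconds


fast = probes_per_day(0.5, regions=38)  # 500ms interval
slow = probes_per_day(60, regions=38)   # 60s interval
print(fast, slow, fast // slow)
# prints: 6566400 54720 120
```

At 500ms, one check in one region fires 172,800 probes a day, 120 times as many as a 60s check, and the worst-case gap before a failure is observed shrinks from 60 seconds to half a second.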

●section / 07/pricing