Orbit — SRE Agent Architecture

Architecture

Main Request Flow

From @mention to response — follow the path of a Slack message through the entire serverless pipeline.

Slack / API Gateway

Lambda Functions

Step Functions

AgentCore Runtime

DynamoDB

Slack Workspace

@mention Orbit → API Gateway POST /slack/events
Approve / Reject → API Gateway POST /slack/actions

Two API Gateway HTTP routes receive all Slack traffic. Every request is verified with HMAC-SHA256 before any processing occurs. The Events route handles @mentions; the Actions route handles interactive button clicks from the HITL approval flow.
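
The signature check could look roughly like the sketch below. It follows Slack's published signing scheme (an HMAC-SHA256 over `v0:{timestamp}:{body}`, compared against the `X-Slack-Signature` header). The function name, the injected `now` parameter, and the 5-minute replay window are assumptions for illustration, not details taken from Orbit's code.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(signing_secret: str, timestamp: str, body: str,
                           signature: str, now: Optional[float] = None,
                           max_skew_seconds: int = 300) -> bool:
    """Return True only if the request carries a valid, recent Slack signature."""
    now = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except (TypeError, ValueError):
        return False
    # Reject stale timestamps to limit replay attacks (window is an assumption).
    if abs(now - request_time) > max_skew_seconds:
        return False
    base = f"v0:{timestamp}:{body}".encode("utf-8")
    expected = "v0=" + hmac.new(
        signing_secret.encode("utf-8"), base, hashlib.sha256
    ).hexdigest()
    # compare_digest runs in constant time, so it does not leak where strings differ.
    return hmac.compare_digest(expected.encode("utf-8"), (signature or "").encode("utf-8"))
```

In a Lambda handler, `timestamp` and `signature` would come from the `X-Slack-Request-Timestamp` and `X-Slack-Signature` headers, and `body` must be the raw, unparsed request body.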

Verification Lambda

1. HMAC-SHA256 signature check
2. Dedup via DynamoDB (1h TTL)
3. Start Step Functions
4. Return 200 within 3s

Timeout: 5 seconds
Why 3s? Slack retries if it doesn't get 200 within 3 seconds. This Lambda must ACK fast, then start async processing via Step Functions.
Dedup: DynamoDB table with 1-hour TTL prevents duplicate processing from Slack's retry mechanism (up to 3 retries). The event_id is used as the partition key for atomic conditional puts.
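
A minimal sketch of the dedup step, assuming a boto3 `Table` resource for slack-event-dedup is passed in. The TTL attribute name `expires_at` and the helper name `claim_event` are hypothetical. The conditional put succeeds only for the first writer of a given event_id, so a Slack retry is detected and skipped.

```python
import time
from typing import Optional

DEDUP_TTL_SECONDS = 3600  # 1-hour TTL, as described above


def claim_event(table, event_id: str, now: Optional[float] = None) -> bool:
    """Return True if this call is the first to see event_id, False for a duplicate."""
    now = time.time() if now is None else now
    try:
        table.put_item(
            Item={
                "event_id": event_id,
                # Hypothetical TTL attribute; DynamoDB expects epoch seconds.
                "expires_at": int(now) + DEDUP_TTL_SECONDS,
            },
            # The write is rejected atomically if the key already exists.
            ConditionExpression="attribute_not_exists(event_id)",
        )
        return True
    except Exception as exc:
        # botocore's ClientError exposes the error code on exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise
```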

Handle Interactivity Lambda

1. HMAC-SHA256 signature check
2. Atomic DynamoDB update
(prevents double-click race)
3. Update Slack message with decision

Timeout: 10 seconds
Race prevention: Uses DynamoDB ConditionExpression — only succeeds if status is still PENDING. Second click fails safely.
Two modes: Tool-level approval (agent polls DynamoDB) and workflow-level approval (SendTaskSuccess to Step Functions).
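
The race prevention described above can be sketched as a conditional update against a boto3 `Table` resource. The attribute names `status` and `decided_by` and the helper name are assumptions; `status` is a DynamoDB reserved word, which is why the sketch aliases it as `#s`.

```python
def record_decision(table, approval_id: str, decision: str, user_id: str) -> bool:
    """Store the first reviewer's decision; return False if one was already recorded."""
    if decision not in ("APPROVED", "REJECTED"):
        raise ValueError(f"unexpected decision: {decision}")
    try:
        table.update_item(
            Key={"approval_id": approval_id},
            UpdateExpression="SET #s = :decision, decided_by = :user",
            # Only succeeds while the approval is still pending.
            ConditionExpression="#s = :pending",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":decision": decision,
                ":user": user_id,
                ":pending": "PENDING",
            },
        )
        return True
    except Exception as exc:
        # botocore's ClientError exposes the error code on exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False  # a second click lost the race and changes nothing
        raise
```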

Step Functions (callback pattern)

▶ PostThinking — post "Thinking…" to Slack

▶ InvokeAgentWithCallback — waitForTaskToken

▶ PostResult — update thread with response

▶ Error handlers (5 catch states)

Callback pattern: Step Functions generates a unique task token and PAUSES at zero cost. The agent processes asynchronously and calls SendTaskSuccess when done.

Retry config: 6 attempts, 2s initial delay, 2x backoff, FULL jitter (prevents thundering herd).
Timeouts: 8h max execution, 1h heartbeat deadline.
Error states: PostAgentError, PostHeartbeatTimeout, PostTimeout, PostError, PostErrorNoThinking — each posts a specific error message back to the Slack thread.
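
To make the retry configuration above concrete, the sketch below writes it as a Python dict using Amazon States Language Retry field names and computes the delay ceiling for each retry. It reads "6 attempts" as `MaxAttempts: 6`, and the `ErrorEquals` value is a placeholder; both are assumptions. With FULL jitter, each actual delay is drawn at random between zero and the ceiling, which is what spreads out simultaneous retries.

```python
import random
from typing import List, Optional

RETRY_POLICY = {
    "ErrorEquals": ["States.TaskFailed"],  # placeholder; the real list is not given
    "IntervalSeconds": 2,
    "MaxAttempts": 6,
    "BackoffRate": 2.0,
    "JitterStrategy": "FULL",
}


def delay_ceilings(policy: dict) -> List[float]:
    """Upper bound of the wait before each retry: interval * rate ** n."""
    return [
        policy["IntervalSeconds"] * policy["BackoffRate"] ** n
        for n in range(policy["MaxAttempts"])
    ]


def sample_delays(policy: dict, rng: Optional[random.Random] = None) -> List[float]:
    """Simulate full jitter: each delay is uniform between 0 and its ceiling."""
    rng = rng or random.Random()
    return [rng.uniform(0, ceiling) for ceiling in delay_ceilings(policy)]
```

Under these assumptions the ceilings are 2, 4, 8, 16, 32, and 64 seconds, so the retries add at most about two minutes of waiting in total.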

invoke_agent Lambda

Generate session ID from Slack thread — sha256(channel:thread_ts)
Invoke AgentCore with task_token + prompt

Timeout: 30 seconds
Session ID: Deterministic — slack-thread-{sha256(channel:thread_ts)[:40]}. All messages in the same Slack thread share a session, enabling multi-turn conversation context.
Thread history: Fetches full thread via Slack conversations.replies API and passes it to AgentCore for context injection.
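
The session ID format above can be reproduced in a few lines. This is a sketch of the stated formula only; the function name is hypothetical.

```python
import hashlib


def session_id_for_thread(channel: str, thread_ts: str) -> str:
    """Derive a stable session ID so every message in a thread maps to one session."""
    digest = hashlib.sha256(f"{channel}:{thread_ts}".encode("utf-8")).hexdigest()
    # Keep the first 40 hex characters, as in the format described above.
    return f"slack-thread-{digest[:40]}"
```

Because the ID is a pure function of the channel and thread timestamp, no lookup table is needed to route a follow-up message to its existing session.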

AgentCore Runtime (Orbit)

Spawns background thread, returns ACK
Claude Opus 4.6 processes the request
Sends SFN heartbeats every 30 min
Calls SendTaskSuccess when done
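
The four steps above might be sketched as follows. The `sfn` argument is assumed to be a boto3 Step Functions client (`send_task_heartbeat`, `send_task_success`, and `send_task_failure` are real client methods). The function name, the `run_agent` callable, the ACK payload, and the output shape are hypothetical.

```python
import json
import threading


def handle_invocation(sfn, task_token, prompt, run_agent, heartbeat_seconds=1800):
    """Start the agent in the background and return an ACK immediately."""

    def worker():
        done = threading.Event()

        def beat():
            # wait() returns False on timeout, so a heartbeat goes out every interval
            # until the agent finishes and sets the event.
            while not done.wait(heartbeat_seconds):
                sfn.send_task_heartbeat(taskToken=task_token)

        threading.Thread(target=beat, daemon=True).start()
        try:
            result = run_agent(prompt)
            sfn.send_task_success(
                taskToken=task_token, output=json.dumps({"response": result})
            )
        except Exception as exc:
            # Report failures so the state machine can route to an error state.
            sfn.send_task_failure(
                taskToken=task_token, error="AgentError", cause=str(exc)[:256]
            )
        finally:
            done.set()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return {"status": "accepted"}, thread
```

The default of 1800 seconds matches the 30-minute heartbeat described above, which stays well inside the 1-hour heartbeat deadline.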

Tool Permission Guard (tool_guard_hook)

SAFE auto-allow — Read, Grep, CloudWatch, Lumigo, etc.

STRUCTURAL auto-deny — $(...), eval, | bash, exec, netcat

CATASTROPHIC auto-deny — fork bomb, mkfs, dd to device

DANGEROUS HITL approval — rm -rf, kill -9, untrusted URLs
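
A toy version of the four tiers, for bash commands only, might look like the sketch below. The regular expressions, the order in which tiers are checked, and the choice to treat anything unmatched as SAFE are all assumptions for illustration; the real tool_guard_hook is certainly more thorough.

```python
import re

CATASTROPHIC = [
    re.compile(r":\(\)\s*\{.*\};\s*:"),   # fork bomb
    re.compile(r"\bmkfs\b"),              # format a filesystem
    re.compile(r"\bdd\b.*\bof=/dev/"),    # dd writing to a device
]
STRUCTURAL = [
    re.compile(r"\$\("),                  # command substitution
    re.compile(r"\beval\b"),
    re.compile(r"\|\s*(ba)?sh\b"),        # pipe into a shell
    re.compile(r"\bexec\b"),
    re.compile(r"\b(nc|netcat)\b"),
]
DANGEROUS = [
    re.compile(r"\brm\s+-(rf|fr)\b"),
    re.compile(r"\bkill\s+-9\b"),
]


def classify(command: str) -> str:
    """Return the first matching tier, checking the deny tiers before HITL."""
    tiers = (
        ("CATASTROPHIC", CATASTROPHIC),
        ("STRUCTURAL", STRUCTURAL),
        ("DANGEROUS", DANGEROUS),
    )
    for tier, patterns in tiers:
        if any(pattern.search(command) for pattern in patterns):
            return tier
    return "SAFE"  # assumption: unmatched commands are auto-allowed
```

Checking the deny tiers first means a command such as `eval "rm -rf /"` is rejected outright rather than sent to a human for approval.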

Skills: cloudwatch-guide, datadog-guide, lumigo-guide, jira-guide, confluence-guide, embrace-guide, tacobell-store-api, tacobell-menu-api
MCP servers: CloudWatch, Jira, Confluence, Lumigo, Datadog, Embrace
Session persistence: Claude session ID stored at /tmp/claude_session_id for conversation continuity across invocations.
Thread context: Injects prior Slack messages into prompt (full, missed, or none based on session freshness). Truncated to 2,000 chars/message, 80,000 chars total.
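
The truncation limits above could be applied as in this sketch. The source does not say which messages are dropped when the total budget is exceeded; keeping the most recent ones is an assumption, as is the function name.

```python
from typing import List

MAX_CHARS_PER_MESSAGE = 2_000
MAX_CHARS_TOTAL = 80_000


def build_thread_context(messages: List[str]) -> List[str]:
    """Clip each message, then keep the newest ones that fit the total budget."""
    kept = []
    budget = MAX_CHARS_TOTAL
    # Walk newest-first so the oldest messages are the ones dropped (assumption).
    for text in reversed(messages):
        clipped = text[:MAX_CHARS_PER_MESSAGE]
        if len(clipped) > budget:
            break
        kept.append(clipped)
        budget -= len(clipped)
    kept.reverse()  # restore chronological order for the prompt
    return kept
```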

DynamoDB

slack-event-dedup
slack-approval-tokens

slack-event-dedup: Key: event_id, TTL: 1 hour. Prevents duplicate Slack event processing.
slack-approval-tokens: Key: approval_id, TTL: 24 hours. Stores HITL approval state, tool context, and Step Functions task tokens.

Safety

Human-in-the-Loop Approval

When the tool guard classifies a command as dangerous, the agent pauses and asks a human reviewer via Slack buttons. Fail-closed on timeout.

Agent detects danger

Tool classified as
DANGEROUS tier

The tool_guard_hook runs before every tool call. When a bash command matches dangerous patterns (rm -rf, kill -9, etc.) or a WebFetch targets an untrusted domain, the agent initiates the approval flow.

post_approval_request

Post Slack buttons
Store approval_id in DynamoDB

Generates a unique approval_id, stores the tool call context (command, arguments, reason) in DynamoDB, and posts a Slack message with [Approve] and [Reject] buttons to the thread.
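
As an illustration of that step, the sketch below builds both the DynamoDB item and a Slack Block Kit message with the two buttons. The block and button fields follow Slack's Block Kit format. The item attribute names, the `action_id` values, and the function name are hypothetical, and the 24-hour TTL is taken from the slack-approval-tokens description.

```python
import time
import uuid
from typing import Optional, Tuple

APPROVAL_TTL_SECONDS = 24 * 3600


def build_approval_request(command: str, reason: str, channel: str, thread_ts: str,
                           approval_id: Optional[str] = None,
                           now: Optional[float] = None) -> Tuple[dict, dict]:
    """Return (dynamodb_item, slack_message) for one pending approval."""
    approval_id = approval_id or uuid.uuid4().hex
    now = time.time() if now is None else now
    item = {
        "approval_id": approval_id,
        "status": "PENDING",
        "command": command,
        "reason": reason,
        "expires_at": int(now) + APPROVAL_TTL_SECONDS,  # hypothetical TTL attribute
    }

    def button(label: str, style: str, action_id: str) -> dict:
        return {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "style": style,
            "action_id": action_id,
            "value": approval_id,  # lets the click handler find the stored context
        }

    message = {
        "channel": channel,
        "thread_ts": thread_ts,
        "text": f"Approval needed: {command}",  # fallback text for notifications
        "blocks": [
            {"type": "section",
             "text": {"type": "mrkdwn",
                      "text": f"*Approval needed*\n`{command}`\n{reason}"}},
            {"type": "actions",
             "elements": [button("Approve", "primary", "approve"),
                          button("Reject", "danger", "reject")]},
        ],
    }
    return item, message
```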

Slack Buttons

Approve Reject
Reviewer clicks to decide

handle_interactivity

Atomic DynamoDB update
Prevents double-click

Uses DynamoDB ConditionExpression: only succeeds if status = PENDING. If two reviewers click simultaneously, only the first write wins. Updates the Slack message to show who approved/rejected and when.

DynamoDB

approval-tokens
Stores approval decision

Agent polls

Every 3s, 5 min timeout
Fail-closed on timeout

APPROVED: tool executes
REJECTED: tool denied, agent informed
TIMEOUT: tool denied (fail-closed)
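
The polling loop and its three outcomes can be sketched as below. `get_status` stands in for a DynamoDB read of the approval record and is hypothetical, as is the function name. The clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time


def wait_for_decision(get_status, poll_seconds=3.0, timeout_seconds=300.0,
                      sleep=time.sleep, clock=time.monotonic) -> str:
    """Poll until a reviewer decides; return APPROVED, REJECTED, or TIMEOUT."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in ("APPROVED", "REJECTED"):
            return status
        if clock() >= deadline:
            return "TIMEOUT"
        sleep(poll_seconds)


def tool_allowed(outcome: str) -> bool:
    # Fail-closed: only an explicit approval lets the tool run.
    return outcome == "APPROVED"
```

Treating every outcome other than APPROVED as a denial is what makes the flow fail-closed: an unexpected status value or a timeout can never let a tool run.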

Interactive

Tool Guard Playground

Type a bash command to see how the four-tier permission guard classifies it in real time. Structural shell bypasses and catastrophic commands are auto-denied, dangerous commands require HITL approval, and safe commands are auto-allowed.

Try these examples:

ls -la /var/log

cat /etc/hosts

rm -rf /tmp/cache

kill -9 1234

:(){ :|:& };:

mkfs.ext4 /dev/sda1

chmod 777 /etc/passwd

systemctl stop nginx

dd if=/dev/zero of=/dev/sda

python3 -c "import os"

kubectl get pods

sed -i 's/foo/bar/' config

xargs rm *.log

shutdown -h now

rm -rf /

echo test | bash

eval "rm -rf /"

bash -c "whoami"

nc -l 4444

Infrastructure

Lambda Functions

12 Python 3.12 Lambda functions on arm64. Lambdas needing slack_sdk share a Lambda Layer.

Function	Purpose	Timeout
verification	HMAC verify, dedup (DynamoDB), start Step Functions	5s
invoke_agent	Generate session ID, invoke AgentCore with task token	30s
post_to_slack	Post/update Slack messages, rate limit retry, chunking	30s
post_approval_request	Post Slack approval buttons, store token in DynamoDB	30s
handle_interactivity	Handle button clicks, atomic DynamoDB update, SFN callback	10s
resume_agent	Send approval decision to agent (workflow-level HITL)	30s
scheduled_trigger	Start proactive health check workflows on EventBridge schedule	10s
jira	Jira REST API integration (search, CRUD, transitions)	30s
confluence	Confluence REST API integration (search, CRUD, comments)	30s
lumigo	Lumigo Log API integration (search, aggregate, investigate)	420s
datadog	Datadog REST API (monitors, metrics, logs, incidents)	30s
embrace	Embrace Metrics API integration (crash data, session analytics)	30s

Capabilities

Agent Tools

65 tools across 9 categories. Every tool is classified by the permission guard.

Auto-Allow

HITL Required

MCP Servers

Skills

🛠

Built-in Claude Tools · 10 tools

Tool	Purpose	Permission
Read	Read file contents from the filesystem	AUTO
Write	Write or create files on disk	AUTO
Edit	Edit existing file contents in-place	AUTO
Glob	Find files by pattern matching	AUTO
Bash	Execute shell commands (4-tier classification)	SMART
WebSearch	Search the web for information	AUTO
WebFetch	Fetch URL content (trusted domains auto-allow, others HITL)	SMART
Skill	Load skill reference docs for guided tool usage	AUTO
Task / TaskList / TaskGet	Task management and progress tracking	AUTO
Notebook / NotebookEdit	Create and edit Jupyter-style notebooks	AUTO

📊

CloudWatch MCP · 9 tools · all auto-allow

Tool	Purpose	Permission
get_metric_data	Query raw metric datapoints and timeseries	AUTO
analyze_metric	Statistical analysis: avg, p50, p90, p99	AUTO
get_active_alarms	List currently firing CloudWatch alarms	AUTO
get_alarm_history	Retrieve alarm state-change history	AUTO
describe_log_groups	List and search CloudWatch log groups	AUTO
analyze_log_group	Summarize recent activity in a log group	AUTO
execute_log_insights_query	Run CloudWatch Logs Insights queries	AUTO
get_logs_insight_query_results	Retrieve Logs Insights query results	AUTO
get_recommended_metric_alarms	Get alarm recommendations for resources	AUTO

🐝

Datadog MCP · 14 tools · 3 HITL

Tool	Purpose	Permission
search_monitors	Search monitors by status, tags, or name	AUTO
get_monitor	Get full details for a specific monitor	AUTO
query_metrics	Query AWS API Gateway metric timeseries	AUTO
search_metrics	Search available metric names by prefix	AUTO
search_logs	Search and retrieve Datadog log entries	AUTO
search_events	Search Datadog events by time and tags	AUTO
list_incidents	List Datadog incidents with filters	AUTO
get_incident	Get incident details by ID	AUTO
list_dashboards	List dashboards, optionally filtered by title	AUTO
get_dashboard	Get dashboard details and widget summary	AUTO
list_downtimes	List currently scheduled downtimes	AUTO
mute_monitor	Mute a Datadog monitor	HITL
unmute_monitor	Unmute a Datadog monitor	HITL
schedule_downtime	Schedule a Datadog downtime window	HITL

🎯

Jira MCP · 8 tools · 4 HITL

Tool	Purpose	Permission
jira_search	Search Jira issues using JQL queries	AUTO
jira_get_issue	Get issue details by key (includes recent comments)	AUTO
jira_get_transitions	Get available status transitions for an issue	AUTO
jira_get_issue_sla	Get SLA information for JSM request issues	AUTO
jira_create_issue	Create a new Jira issue	HITL
jira_update_issue	Update fields on an existing issue	HITL
jira_transition_issue	Transition issue to a new status	HITL
jira_add_comment	Add a comment to a Jira issue	HITL

📖

Confluence MCP · 8 tools · 3 HITL

Tool	Purpose	Permission
confluence_search	Search Confluence pages using CQL queries	AUTO
confluence_get_page	Get page content by ID	AUTO
confluence_get_page_views	Get page view analytics	AUTO
confluence_get_comments	Get comments on a Confluence page	AUTO
confluence_get_page_children	Get child pages of a parent page	AUTO
confluence_create_page	Create a new Confluence page	HITL
confluence_update_page	Update an existing Confluence page	HITL
confluence_add_comment	Add a comment to a page	HITL

🔎

Lumigo MCP · 3 tools · all auto-allow

Tool	Purpose	Permission
lumigo_search_logs	Search Lambda logs by severity, resource, or free text	AUTO
lumigo_aggregate_logs	Aggregate log data: count, avg, p95, p99, timeseries	AUTO
lumigo_get_issue_details	Investigate issues with root cause analysis and stack traces	AUTO

📱

Embrace MCP · 3 tools · all auto-allow

Tool	Purpose	Permission
embrace_list_metrics	List available metric names from Embrace, optionally filtered by substring	AUTO
embrace_query_instant	Execute a PromQL instant query for current metric values	AUTO
embrace_query_range	Execute a PromQL range query for time-series metric data	AUTO

🌮

Taco Bell API Tools · 2 scripts via Bash

Script	Purpose	Permission
store_lookup.py	Taco Bell store locator: search by lat/lng, ZIP, or address. Returns store details, hours, and capabilities.	AUTO
menu_lookup.py	Taco Bell menu catalog API: search menu items by name, get item details, nutrition info, and pricing by store.	AUTO

📚

Skills (Reference Guides) · 8 skills

Skill	Covers	Type
cloudwatch-guide	CloudWatch metric queries, Log Insights syntax, alarm investigation playbooks	GUIDE
datadog-guide	Monitor search, metric query format, dashboard lookup, troubleshooting	GUIDE
lumigo-guide	Log search syntax, aggregation patterns, issue investigation workflows	GUIDE
jira-guide	JQL search, issue CRUD, project board conventions (ECRS, RDS)	GUIDE
confluence-guide	CQL search, page CRUD, space conventions (TR, ECOM)	GUIDE
embrace-guide	Embrace Metrics API, crash analytics, session investigation	GUIDE
tacobell-store-api	Store locator API endpoints, response schema, search examples	GUIDE
tacobell-menu-api	Menu catalog API endpoints, item search, nutrition data schema	GUIDE

Technology

Tech Stack

The building blocks behind Orbit.

🧠

Claude Opus 4.6

Frontier reasoning model powering all agent decisions

☁️

Bedrock AgentCore

AWS-managed agent runtime with session persistence

⚡

AWS Lambda

12 Python 3.12 functions on arm64 with shared layers

🔄

Step Functions

Callback pattern orchestration with zero-cost waits

🗀

DynamoDB

Event dedup + HITL approval state with TTL cleanup

💬

Slack API

Events + Interactivity with HMAC-SHA256 verification

📊

CloudWatch MCP

Metrics, logs, alarms via MCP server integration

🐝

Datadog

Monitors, metrics, logs, incidents via REST API Lambda

🔎

Lumigo

Log search, aggregation, and trace investigation

🎯

Jira

Issue search, CRUD, transitions, board management

📖

Confluence

Page search, CRUD, comments, space management

📱

Embrace

Mobile crash analytics, session metrics via PromQL

🌎

Terragrunt

Infrastructure as code for all AWS resources

Meet Orbit

Three steps. Zero overhead.

Slack Trigger

Agent Processing

Safe Response
