Meet Orbit

An SRE agent built on AWS Bedrock AgentCore. It lives in Slack, thinks with Claude Opus 4.6, and keeps your infrastructure in check — with human-in-the-loop safety for every dangerous action.

Slack Integration · Step Functions · Bedrock AgentCore · DynamoDB · Lambda (Python 3.12)
12
Lambdas
8
Skills
66
Trusted Domains
8hr
Max Runtime
17
Auto-Deny Rules
How It Works

Three steps. Zero overhead.

@mention Orbit in any Slack channel. It processes your request through a serverless pipeline with built-in safety rails.

1. Slack Trigger

User @mentions Orbit in a Slack thread. The message hits API Gateway, gets signature-verified, deduplicated, and kicks off a Step Functions workflow.

2. Agent Processing

Step Functions invokes the Orbit agent on AgentCore via callback pattern. Claude Opus 4.6 processes the request with access to CloudWatch, Datadog, Jira, Confluence, and more.

3. Safe Response

Every tool call passes through a four-tier permission guard. Structural shell bypasses and catastrophic commands are auto-denied, dangerous actions require Slack approval, and safe commands auto-allow. Responses are chunked and posted back to the thread.

Architecture

Main Request Flow

From @mention to response — follow the path of a Slack message through the entire serverless pipeline.

[Interactive diagram: main request flow. Legend: Slack / API Gateway, Lambda Functions, Step Functions, AgentCore Runtime, DynamoDB]
Slack Workspace
@mention Orbit → API Gateway POST /slack/events
Approve / Reject → API Gateway POST /slack/actions
Two API Gateway HTTP routes receive all Slack traffic. Every request is verified with HMAC-SHA256 before any processing occurs. The Events route handles @mentions; the Actions route handles interactive button clicks from the HITL approval flow.
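The HMAC-SHA256 check described above can be sketched as follows. This is a minimal illustration of Slack's documented v0 signing scheme, not Orbit's actual handler: the function name `verify_slack_signature`, its parameters, and the 5-minute replay window are assumptions made for the example.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(signing_secret: str, timestamp: str, body: str,
                           signature: str, now: Optional[float] = None,
                           max_skew_seconds: int = 300) -> bool:
    """Return True only if the request carries a valid, recent Slack signature.

    `timestamp` and `signature` are the X-Slack-Request-Timestamp and
    X-Slack-Signature header values; `body` is the raw request body.
    """
    now = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except (TypeError, ValueError):
        return False
    # Reject stale requests to limit replay attacks (window is an assumption).
    if abs(now - request_time) > max_skew_seconds:
        return False
    base_string = f"v0:{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), base_string,
                      hashlib.sha256).hexdigest()
    expected = f"v0={digest}"
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature.encode("utf-8"))
```

Both routes would run a check like this before any other work, so an unsigned or tampered request never reaches DynamoDB or Step Functions.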
Verification Lambda
1. HMAC-SHA256 signature check
2. Dedup via DynamoDB (1h TTL)
3. Start Step Functions
4. Return 200 within 3s
Timeout: 5 seconds
Why 3s? Slack retries if it doesn't get 200 within 3 seconds. This Lambda must ACK fast, then start async processing via Step Functions.
Dedup: DynamoDB table with 1-hour TTL prevents duplicate processing from Slack's retry mechanism (up to 3 retries). The event_id is used as the partition key for atomic conditional puts.
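The atomic conditional put used for dedup might look roughly like this. It is a sketch under assumptions: `table` is taken to be a boto3 DynamoDB `Table` resource for slack-event-dedup, and the function name `claim_event` and the TTL attribute name `expires_at` are hypothetical.

```python
import time
from typing import Optional


def claim_event(table, event_id: str, ttl_seconds: int = 3600,
                now: Optional[float] = None) -> bool:
    """Return True if this is the first time event_id is seen, else False."""
    now = time.time() if now is None else now
    try:
        table.put_item(
            # expires_at is a hypothetical TTL attribute (epoch seconds).
            Item={"event_id": event_id, "expires_at": int(now) + ttl_seconds},
            # The write succeeds only if no item with this key exists yet.
            ConditionExpression="attribute_not_exists(event_id)",
        )
        return True
    except Exception as exc:
        # With boto3 this is botocore.exceptions.ClientError; the error code
        # is read generically so the sketch has no hard dependency on it.
        error = getattr(exc, "response", {}).get("Error", {})
        if error.get("Code") == "ConditionalCheckFailedException":
            return False  # duplicate delivery (for example, a Slack retry)
        raise
```

A handler would start the Step Functions execution only when `claim_event` returns True, and still answer Slack with 200 for duplicates.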
Handle Interactivity Lambda
1. HMAC-SHA256 signature check
2. Atomic DynamoDB update (prevents double-click race)
3. Update Slack message with decision
Timeout: 10 seconds
Race prevention: Uses DynamoDB ConditionExpression — only succeeds if status is still PENDING. Second click fails safely.
Two modes: Tool-level approval (agent polls DynamoDB) and workflow-level approval (SendTaskSuccess to Step Functions).
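The race prevention above can be illustrated with a conditional update. This is a hedged sketch: `table` is assumed to be a boto3 `Table` resource for slack-approval-tokens, and `record_decision` plus the `status` and `decided_by` attribute names are assumptions, since the page only states that the write succeeds while the status is still PENDING.

```python
def record_decision(table, approval_id: str, decision: str, user_id: str) -> bool:
    """Record APPROVED or REJECTED once; return False if already decided."""
    if decision not in ("APPROVED", "REJECTED"):
        raise ValueError(f"unexpected decision: {decision!r}")
    try:
        table.update_item(
            Key={"approval_id": approval_id},
            UpdateExpression="SET #s = :decision, decided_by = :user",
            # Only the first click wins: the condition fails once the
            # status has moved away from PENDING.
            ConditionExpression="#s = :pending",
            # "status" is a DynamoDB reserved word, hence the #s alias.
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":decision": decision,
                ":user": user_id,
                ":pending": "PENDING",
            },
        )
        return True
    except Exception as exc:
        # botocore.exceptions.ClientError in a real boto3 setup.
        error = getattr(exc, "response", {}).get("Error", {})
        if error.get("Code") == "ConditionalCheckFailedException":
            return False  # second click: fail safely, change nothing
        raise
```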
Step Functions (callback pattern)
PostThinking — post "Thinking…" to Slack
InvokeAgentWithCallback (waitForTaskToken)
PostResult — update thread with response
Error handlers — 4 catch states
Callback pattern: Step Functions generates a unique task token and PAUSES at zero cost. The agent processes asynchronously and calls SendTaskSuccess when done.

Retry config: 6 attempts, 2s initial delay, 2x backoff, FULL jitter (prevents thundering herd).
Timeouts: 8h max execution, 1h heartbeat deadline.
Error states: PostAgentError, PostHeartbeatTimeout, PostTimeout, PostError, PostErrorNoThinking — each posts a specific error message back to the Slack thread.
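For orientation, the callback task could be expressed in Amazon States Language along these lines, written here as a Python dict. Only the numbers (6 attempts, 2s, 2x, FULL jitter, 1h heartbeat) and the state names come from this page; the payload keys, the retried error type, and the mapping of errors to catch targets are assumptions.

```python
import json

INVOKE_AGENT_STATE = {
    "Type": "Task",
    # The .waitForTaskToken suffix pauses the workflow until
    # SendTaskSuccess or SendTaskFailure is called with the token.
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
        "FunctionName": "invoke_agent",
        "Payload": {
            "task_token.$": "$$.Task.Token",  # hypothetical payload key
            "event.$": "$",                   # hypothetical payload key
        },
    },
    "HeartbeatSeconds": 3600,  # 1h heartbeat deadline
    "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],  # assumed error type
        "IntervalSeconds": 2,
        "MaxAttempts": 6,
        "BackoffRate": 2.0,
        "JitterStrategy": "FULL",
    }],
    "Catch": [  # assumed mapping of errors to the named error states
        {"ErrorEquals": ["States.HeartbeatTimeout"], "Next": "PostHeartbeatTimeout"},
        {"ErrorEquals": ["States.Timeout"], "Next": "PostTimeout"},
        {"ErrorEquals": ["States.ALL"], "Next": "PostError"},
    ],
    "Next": "PostResult",
}


def retry_delay_caps(retry: dict) -> list:
    """Upper bound of each retry delay; FULL jitter picks a random
    delay between 0 and this cap for every attempt."""
    return [retry["IntervalSeconds"] * retry["BackoffRate"] ** n
            for n in range(retry["MaxAttempts"])]


if __name__ == "__main__":
    print(json.dumps(INVOKE_AGENT_STATE, indent=2))
    print(retry_delay_caps(INVOKE_AGENT_STATE["Retry"][0]))
```

With these settings the delay caps double from 2 seconds up to 64 seconds across the six attempts. The 8h execution limit would sit at the state machine level rather than on this state.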
invoke_agent Lambda
Generate session ID from Slack thread — sha256(channel:thread_ts)
Invoke AgentCore with task_token + prompt
Timeout: 30 seconds
Session ID: Deterministic — slack-thread-{sha256(channel:thread_ts)[:40]}. All messages in the same Slack thread share a session, enabling multi-turn conversation context.
Thread history: Fetches full thread via Slack conversations.replies API and passes it to AgentCore for context injection.
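The deterministic session ID follows directly from the formula above. A small sketch, where the function name `session_id_for_thread` is an assumption:

```python
import hashlib


def session_id_for_thread(channel: str, thread_ts: str) -> str:
    """Derive a stable session ID from a Slack channel and thread timestamp."""
    digest = hashlib.sha256(f"{channel}:{thread_ts}".encode("utf-8")).hexdigest()
    # Keep the first 40 hex characters, as in the documented format.
    return f"slack-thread-{digest[:40]}"
```

Because the ID depends only on the channel and thread timestamp, every message in the same thread maps to the same session without any lookup table.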
AgentCore Runtime (Orbit)
Spawns background thread, returns ACK
Claude Opus 4.6 processes the request
Sends SFN heartbeats every 30 min
Calls SendTaskSuccess when done
Tool Permission Guard (tool_guard_hook)
SAFE auto-allow — Read, Grep, CloudWatch, Lumigo, etc.
STRUCTURAL auto-deny — $(...), eval, | bash, exec, netcat
CATASTROPHIC auto-deny — fork bomb, mkfs, dd to device
DANGEROUS HITL approval — rm -rf, kill -9, untrusted URLs
Skills: cloudwatch-guide, datadog-guide, lumigo-guide, jira-guide, confluence-guide, embrace-guide, tacobell-store-api, tacobell-menu-api
MCP servers: CloudWatch, Jira, Confluence, Lumigo, Datadog, Embrace
Session persistence: Claude session ID stored at /tmp/claude_session_id for conversation continuity across invocations.
Thread context: Injects prior Slack messages into prompt (full, missed, or none based on session freshness). Truncated to 2,000 chars/message, 80,000 chars total.
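The two truncation limits for thread context could be applied as in the sketch below. The page does not say which messages are dropped once the total budget is exhausted; this example assumes the oldest messages go first, and `build_thread_context` is a hypothetical name.

```python
MAX_CHARS_PER_MESSAGE = 2_000
MAX_CHARS_TOTAL = 80_000


def build_thread_context(messages: list) -> list:
    """Clip each message, then keep the newest ones that fit the budget.

    `messages` is a list of message texts ordered oldest to newest.
    """
    kept = []
    remaining = MAX_CHARS_TOTAL
    # Walk from newest to oldest so recent context survives (an assumption).
    for text in reversed(messages):
        clipped = text[:MAX_CHARS_PER_MESSAGE]
        if len(clipped) > remaining:
            break
        kept.append(clipped)
        remaining -= len(clipped)
    kept.reverse()  # restore chronological order
    return kept
```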
DynamoDB
slack-event-dedup
slack-approval-tokens
slack-event-dedup: Key: event_id, TTL: 1 hour. Prevents duplicate Slack event processing.
slack-approval-tokens: Key: approval_id, TTL: 24 hours. Stores HITL approval state, tool context, and Step Functions task tokens.
Safety

Human-in-the-Loop Approval

When the tool guard classifies a command as dangerous, the agent pauses and asks a human reviewer via Slack buttons. Fail-closed on timeout.

Agent detects danger
Tool classified as DANGEROUS tier
The tool_guard_hook runs before every tool call. When a bash command matches dangerous patterns (rm -rf, kill -9, etc.) or a WebFetch targets an untrusted domain, the agent initiates the approval flow.
post_approval_request
Post Slack buttons
Store approval_id in DynamoDB
Generates a unique approval_id, stores the tool call context (command, arguments, reason) in DynamoDB, and posts a Slack message with [Approve] and [Reject] buttons to the thread.
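A message with [Approve] and [Reject] buttons can be built with Slack Block Kit roughly as follows. This is an illustrative sketch: the function name `build_approval_message`, the `action_id` values, and the message wording are assumptions, and the DynamoDB write and the Slack API call are left out.

```python
import uuid
from typing import Optional


def build_approval_message(command: str, reason: str,
                           approval_id: Optional[str] = None):
    """Return (approval_id, blocks) for a Slack approval request."""
    approval_id = approval_id or uuid.uuid4().hex

    def button(label: str, style: str, action_id: str) -> dict:
        return {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "style": style,
            "action_id": action_id,  # hypothetical action IDs
            # The value travels back in the interactivity payload, so the
            # handler can look up the stored tool context by approval_id.
            "value": approval_id,
        }

    blocks = [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Approval needed*\nCommand: `{command}`\nReason: {reason}",
            },
        },
        {
            "type": "actions",
            "elements": [
                button("Approve", "primary", "approve_tool"),
                button("Reject", "danger", "reject_tool"),
            ],
        },
    ]
    return approval_id, blocks
```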
Slack Buttons
Approve / Reject
Reviewer clicks to decide
handle_interactivity
Atomic DynamoDB update
Prevents double-click
Uses DynamoDB ConditionExpression: only succeeds if status = PENDING. If two reviewers click simultaneously, only the first write wins. Updates the Slack message to show who approved/rejected and when.
DynamoDB
slack-approval-tokens
Stores approval decision
Agent polls
Every 3s, 5 min timeout
Fail-closed on timeout
APPROVED: tool executes
REJECTED: tool denied, agent informed
TIMEOUT: tool denied (fail-closed)
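The polling behaviour (every 3s, 5 min timeout, fail-closed) can be sketched as a small loop. `wait_for_decision` and its `get_status` callback are hypothetical; in practice the callback would read the approval item from DynamoDB. The `sleep` and `clock` parameters exist only to make the loop testable.

```python
import time
from typing import Callable


def wait_for_decision(get_status: Callable[[], str],
                      poll_seconds: float = 3.0,
                      timeout_seconds: float = 300.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Return True only on explicit approval; anything else denies the tool."""
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        status = get_status()
        if status == "APPROVED":
            return True
        if status == "REJECTED":
            return False
        sleep(poll_seconds)
    # Fail closed: no decision within the timeout means the tool is denied.
    return False
```

The important property is the final `return False`: a reviewer who never responds has the same effect as one who rejects.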
Interactive

Tool Guard Playground

Type a bash command to see how the four-tier permission guard classifies it in real time. Structural shell bypasses and catastrophic commands are auto-denied, dangerous commands require HITL approval, and safe commands auto-allow.

Try these examples:
ls -la /var/log
cat /etc/hosts
rm -rf /tmp/cache
kill -9 1234
:(){ :|:& };:
mkfs.ext4 /dev/sda1
chmod 777 /etc/passwd
systemctl stop nginx
dd if=/dev/zero of=/dev/sda
python3 -c "import os"
kubectl get pods
sed -i 's/foo/bar/' config
xargs rm *.log
shutdown -h now
rm -rf /
echo test | bash
eval "rm -rf /"
bash -c "whoami"
nc -l 4444
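A heavily simplified version of the four-tier classification might look like the sketch below. It covers only the handful of patterns named on this page, not the full rule set, and the regular expressions, the evaluation order, and the default-to-safe fallback are all assumptions rather than the real tool_guard_hook logic.

```python
import re
from enum import Enum


class Tier(Enum):
    SAFE = "auto-allow"
    STRUCTURAL = "auto-deny (structural bypass)"
    CATASTROPHIC = "auto-deny (catastrophic)"
    DANGEROUS = "HITL approval"


# Checked in order; the first matching tier wins (assumed precedence).
RULES = [
    (Tier.STRUCTURAL, [
        r"\$\(",               # command substitution
        r"\beval\b",
        r"\|\s*(ba)?sh\b",     # piping into a shell
        r"\bexec\b",
        r"\b(nc|netcat)\b",
    ]),
    (Tier.CATASTROPHIC, [
        r":\(\)\s*\{.*\};\s*:",    # classic fork bomb
        r"\bmkfs\b",
        r"\bdd\b.*\bof=/dev/",     # dd writing to a device
    ]),
    (Tier.DANGEROUS, [
        r"\brm\s+-\w*r\w*f",
        r"\bkill\s+-9\b",
    ]),
]


def classify(command: str) -> Tier:
    """Return the first tier whose patterns match, else SAFE."""
    for tier, patterns in RULES:
        if any(re.search(pattern, command) for pattern in patterns):
            return tier
    return Tier.SAFE


if __name__ == "__main__":
    for cmd in ["ls -la /var/log", "rm -rf /tmp/cache",
                "mkfs.ext4 /dev/sda1", "echo test | bash"]:
        print(f"{cmd!r}: {classify(cmd).name}")
```

A production guard would more likely parse the command than pattern-match it, and several of the examples above (such as sed -i or systemctl stop) are not covered by this sketch at all.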
Infrastructure

Lambda Functions

12 Python 3.12 Lambda functions on arm64. Lambdas needing slack_sdk share a Lambda Layer.

Function | Purpose | Timeout
verification | HMAC verify, dedup (DynamoDB), start Step Functions | 5s
invoke_agent | Generate session ID, invoke AgentCore with task token | 30s
post_to_slack | Post/update Slack messages, rate limit retry, chunking | 30s
post_approval_request | Post Slack approval buttons, store token in DynamoDB | 30s
handle_interactivity | Handle button clicks, atomic DynamoDB update, SFN callback | 10s
resume_agent | Send approval decision to agent (workflow-level HITL) | 30s
scheduled_trigger | Start proactive health check workflows on EventBridge schedule | 10s
jira | Jira REST API integration (search, CRUD, transitions) | 30s
confluence | Confluence REST API integration (search, CRUD, comments) | 30s
lumigo | Lumigo Log API integration (search, aggregate, investigate) | 420s
datadog | Datadog REST API (monitors, metrics, logs, incidents) | 30s
embrace | Embrace Metrics API integration (crash data, session analytics) | 30s
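The chunking that post_to_slack performs could work along these lines. The page gives no chunk size, so the 3,000-character default here is an assumption, as are the function name `chunk_text` and the preference for splitting at line breaks.

```python
def chunk_text(text: str, limit: int = 3000) -> list:
    """Split text into pieces of at most `limit` characters,
    preferring to break at a newline when one is available."""
    chunks = []
    remaining = text
    while len(remaining) > limit:
        # Look for the last newline that keeps the chunk within the limit.
        split_at = remaining.rfind("\n", 0, limit + 1)
        if split_at <= 0:
            split_at = limit  # no usable newline: hard split
        chunks.append(remaining[:split_at])
        remaining = remaining[split_at:].lstrip("\n")
    if remaining:
        chunks.append(remaining)
    return chunks
```

Each chunk would then be posted to the thread in order, with rate-limit retries handled around the individual posts.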
Capabilities

Agent Tools

65 tools across 9 categories. Every tool is classified by the permission guard.

55 Auto-Allow · 10 HITL Required · 6 MCP Servers · 8 Skills

🛠 Built-in Claude Tools (10)
Tool | Description | Permission
Read | Read file contents from the filesystem | AUTO
Write | Write or create files on disk | AUTO
Edit | Edit existing file contents in-place | AUTO
Glob | Find files by pattern matching | AUTO
Bash | Execute shell commands (4-tier classification) | SMART
WebSearch | Search the web for information | AUTO
WebFetch | Fetch URL content (trusted domains auto-allow, others HITL) | SMART
Skill | Load skill reference docs for guided tool usage | AUTO
Task / TaskList / TaskGet | Task management and progress tracking | AUTO
Notebook / NotebookEdit | Create and edit Jupyter-style notebooks | AUTO

📊 CloudWatch MCP (9 tools · all auto-allow)
Tool | Description | Permission
get_metric_data | Query raw metric datapoints and timeseries | AUTO
analyze_metric | Statistical analysis: avg, p50, p90, p99 | AUTO
get_active_alarms | List currently firing CloudWatch alarms | AUTO
get_alarm_history | Retrieve alarm state-change history | AUTO
describe_log_groups | List and search CloudWatch log groups | AUTO
analyze_log_group | Summarize recent activity in a log group | AUTO
execute_log_insights_query | Run CloudWatch Logs Insights queries | AUTO
get_logs_insight_query_results | Retrieve Logs Insights query results | AUTO
get_recommended_metric_alarms | Get alarm recommendations for resources | AUTO

🐝 Datadog MCP (14 tools · 3 HITL)
Tool | Description | Permission
search_monitors | Search monitors by status, tags, or name | AUTO
get_monitor | Get full details for a specific monitor | AUTO
query_metrics | Query AWS API Gateway metric timeseries | AUTO
search_metrics | Search available metric names by prefix | AUTO
search_logs | Search and retrieve Datadog log entries | AUTO
search_events | Search Datadog events by time and tags | AUTO
list_incidents | List Datadog incidents with filters | AUTO
get_incident | Get incident details by ID | AUTO
list_dashboards | List dashboards, optionally filtered by title | AUTO
get_dashboard | Get dashboard details and widget summary | AUTO
list_downtimes | List currently scheduled downtimes | AUTO
mute_monitor | Mute a Datadog monitor | HITL
unmute_monitor | Unmute a Datadog monitor | HITL
schedule_downtime | Schedule a Datadog downtime window | HITL

🎯 Jira MCP (8 tools · 4 HITL)
Tool | Description | Permission
jira_search | Search Jira issues using JQL queries | AUTO
jira_get_issue | Get issue details by key (includes recent comments) | AUTO
jira_get_transitions | Get available status transitions for an issue | AUTO
jira_get_issue_sla | Get SLA information for JSM request issues | AUTO
jira_create_issue | Create a new Jira issue | HITL
jira_update_issue | Update fields on an existing issue | HITL
jira_transition_issue | Transition issue to a new status | HITL
jira_add_comment | Add a comment to a Jira issue | HITL

📖 Confluence MCP (8 tools · 3 HITL)
Tool | Description | Permission
confluence_search | Search Confluence pages using CQL queries | AUTO
confluence_get_page | Get page content by ID | AUTO
confluence_get_page_views | Get page view analytics | AUTO
confluence_get_comments | Get comments on a Confluence page | AUTO
confluence_get_page_children | Get child pages of a parent page | AUTO
confluence_create_page | Create a new Confluence page | HITL
confluence_update_page | Update an existing Confluence page | HITL
confluence_add_comment | Add a comment to a page | HITL

🔎 Lumigo MCP (3 tools · all auto-allow)
Tool | Description | Permission
lumigo_search_logs | Search Lambda logs by severity, resource, or free text | AUTO
lumigo_aggregate_logs | Aggregate log data: count, avg, p95, p99, timeseries | AUTO
lumigo_get_issue_details | Investigate issues with root cause analysis and stack traces | AUTO

📱 Embrace MCP (3 tools · all auto-allow)
Tool | Description | Permission
embrace_list_metrics | List available metric names from Embrace, optionally filtered by substring | AUTO
embrace_query_instant | Execute a PromQL instant query for current metric values | AUTO
embrace_query_range | Execute a PromQL range query for time-series metric data | AUTO

🌮 Taco Bell API Tools (2 scripts via Bash)
Tool | Description | Permission
store_lookup.py | Taco Bell store locator: search by lat/lng, ZIP, or address. Returns store details, hours, and capabilities. | AUTO
menu_lookup.py | Taco Bell menu catalog API: search menu items by name, get item details, nutrition info, and pricing by store. | AUTO

📚 Skills (Reference Guides) (8 skills)
Skill | Description | Type
cloudwatch-guide | CloudWatch metric queries, Log Insights syntax, alarm investigation playbooks | GUIDE
datadog-guide | Monitor search, metric query format, dashboard lookup, troubleshooting | GUIDE
lumigo-guide | Log search syntax, aggregation patterns, issue investigation workflows | GUIDE
jira-guide | JQL search, issue CRUD, project board conventions (ECRS, RDS) | GUIDE
confluence-guide | CQL search, page CRUD, space conventions (TR, ECOM) | GUIDE
embrace-guide | Embrace Metrics API, crash analytics, session investigation | GUIDE
tacobell-store-api | Store locator API endpoints, response schema, search examples | GUIDE
tacobell-menu-api | Menu catalog API endpoints, item search, nutrition data schema | GUIDE

Technology

Tech Stack

The building blocks behind Orbit.

🧠 Claude Opus 4.6: Frontier reasoning model powering all agent decisions
☁️ Bedrock AgentCore: AWS-managed agent runtime with session persistence
AWS Lambda: 12 Python 3.12 functions on arm64 with shared layers
🔄 Step Functions: Callback pattern orchestration with zero-cost waits
🗀 DynamoDB: Event dedup + HITL approval state with TTL cleanup
💬 Slack API: Events + Interactivity with HMAC-SHA256 verification
📊 CloudWatch MCP: Metrics, logs, alarms via MCP server integration
🐝 Datadog: Monitors, metrics, logs, incidents via REST API Lambda
🔎 Lumigo: Log search, aggregation, and trace investigation
🎯 Jira: Issue search, CRUD, transitions, board management
📖 Confluence: Page search, CRUD, comments, space management
📱 Embrace: Mobile crash analytics, session metrics via PromQL
🌎 Terragrunt: Infrastructure as code for all AWS resources