Bedrock Budgeteer System Architecture
Overview
Bedrock Budgeteer is a comprehensive serverless budget monitoring and control system for AWS Bedrock API usage. It provides real-time cost tracking, automated threshold monitoring, and progressive access control to prevent budget overruns.
System Goals
- Real-time cost visibility: Compute token-based costs using the AWS Pricing API
- Proactive controls: Multi-level thresholds with alerts and graduated suspension system
- Operational safety: DLQs, audit trails, and idempotent workflows
- Separation of concerns: Clear boundaries across ingestion, evaluation, enforcement, and audit
High-Level Architecture
The system follows an event-driven serverless architecture with clear separation between data ingestion, processing, orchestration, and monitoring layers.
graph TB
subgraph "Event Sources"
BR[Bedrock API Calls]
IAM[IAM Changes]
CT[CloudTrail Events]
AC[AgentCore Lifecycle Events]
end
subgraph "Ingestion Layer"
EB[EventBridge Rules]
CF[CloudTrail]
KF[Kinesis Firehose]
CW[CloudWatch Logs]
end
subgraph "Processing Layer"
US[User Setup Lambda]
UC[Usage Calculator Lambda]
BM[Budget Monitor Lambda]
AL[Audit Logger Lambda]
PM[Pricing Manager Lambda]
end
subgraph "Storage Layer"
DDB[(DynamoDB Tables)]
S3[(S3 Bucket)]
SSM[SSM Parameters]
end
subgraph "AgentCore Layer"
ACS[AgentCore Setup Lambda]
ACBM[AgentCore Budget Monitor Lambda]
ACMGR[AgentCore Budget Manager Lambda]
ACIU[AgentCore IAM Utilities Lambda]
end
subgraph "Orchestration Layer"
SW[Suspension Workflow]
RW[Restoration Workflow]
ACSW[AgentCore Suspension Workflow]
ACRW[AgentCore Restoration Workflow]
SF[Step Functions]
end
subgraph "Monitoring Layer"
SNS[SNS Topics]
CD[CloudWatch Dashboards]
CA[CloudWatch Alarms]
end
BR --> CW
IAM --> CT
CT --> EB
AC --> EB
EB --> US
EB --> PM
EB --> ACS
CW --> KF
KF --> UC
US --> DDB
UC --> DDB
BM --> DDB
AL --> DDB
PM --> DDB
ACS --> DDB
ACBM --> DDB
ACMGR --> DDB
UC --> DDB
BM --> SF
ACBM --> SF
SF --> SW
SF --> RW
SF --> ACSW
SF --> ACRW
SW --> SNS
RW --> SNS
BM --> SNS
SNS --> CA
CA --> CD
Core Components
1. Data Storage Layer
DynamoDB Tables:
UserBudgets- Principal budget tracking and statusUsageTracking- Historical usage and cost dataAuditLogs- System audit trailPricing- Cached Bedrock model pricing data
Key Attributes:
// UserBudgets table
{
"principal_id": "string",
"budget_limit_usd": "number",
"spent_usd": "number",
"status": "active|suspended|grace_period",
"account_type": "bedrock_api_key|user|service",
"budget_refresh_date": "ISO8601",
"grace_deadline_epoch": "number"
}
// Pricing table
{
"model_id": "string",
"region": "string",
"input_tokens_per_1000": "decimal",
"output_tokens_per_1000": "decimal",
"ttl": "number"
}
2. Event Ingestion Layer
CloudTrail Configuration:
- Captures all Bedrock API calls (
InvokeModel,InvokeModelWithResponseStream) - Monitors IAM changes affecting Bedrock access
- Integrates with EventBridge for real-time processing
EventBridge Rules:
bedrock-usage: Routes Bedrock API events to processingiam-key-creation: Captures new API key creation for user setupiam-bedrock-permissions: Tracks permission changes
Kinesis Data Firehose:
- Streams Bedrock invocation logs to S3
- Invokes Usage Calculator Lambda for real-time processing
- Provides data transformation and enrichment
3. Core Processing Layer
Lambda Functions:
- User Setup Lambda
- Triggered by IAM API key creation events
- Initializes budget entries for new users
- Sets default budget limits and configuration
- Usage Calculator Lambda
- Processes Bedrock invocation logs from Firehose
- Calculates token costs using AWS Pricing API
- Updates budget consumption in real-time
- Budget Monitor Lambda
- Runs every 5 minutes via EventBridge schedule
- Evaluates budget thresholds (70% warning, 90% critical, 100% exceeded)
- Initiates grace period for budget violations
- Triggers suspension workflows for grace period expiry
- Pricing Manager Lambda
- Maintains current Bedrock model pricing data
- Refreshes pricing cache daily
- Handles fallback pricing when API unavailable
- Budget Refresh Lambda
- Daily schedule to reset expired budgets
- Triggers automatic restoration for suspended users
- Manages budget refresh cycles
- Audit Logger Lambda
- Processes all system events for audit trail
- Ensures compliance and traceability
4. Workflow Orchestration Layer
Step Functions State Machines:
- Suspension Workflow
graph TD A[Start] --> B[Send Grace Notification] B --> C[Grace Period Wait] C --> D[Send Final Warning] D --> E[Apply Full Suspension] E --> F[Update User Status] F --> G[Send Audit Event] G --> H[Success] - Restoration Workflow
graph TD A[Start] --> B[Validate Restoration Request] B --> C{Validation Passed?} C -->|Yes| D[Restore Access] C -->|No| E[Skip Restoration] D --> F[Validate Access Restoration] F --> G[Reset Budget Status] G --> H[Send Audit Event] H --> I[Success] E --> J[End]
Workflow Lambda Functions:
IAM Utilities: Manages IAM policy operationsGrace Period Notifications: Handles user notificationsPolicy Backup: Backs up original policies before suspensionRestoration Validation: Validates restoration eligibility
5. AgentCore Budgeting Layer
Gated by feature flag: enable_agentcore_budgeting in cdk.json.
The AgentCoreConstruct extends budget management to AWS Bedrock AgentCore runtimes. Unlike the API key flow (which detaches a single managed policy), AgentCore suspension snapshots all role policies, strips them, attaches a deny-all inline policy, and tags the role with a management marker.
DynamoDB Table:
agentcore-budgets— tracks per-runtime budgets and aGLOBAL_POOLsingleton row. Uses arole_arn-indexGSI so the usage_calculator can match caller role ARNs to registered runtimes.
Budget Model:
- A
GLOBAL_POOLsingleton row defines the total AgentCore spend pool. - Per-runtime rows track individual agent carve-outs from the global pool.
Lambda Functions:
- AgentCore Setup Lambda
- Triggered by lifecycle events from
bedrock-agentcore.amazonaws.comvia EventBridge - Registers new AgentCore runtimes in the agentcore-budgets table
- Initializes per-agent budget carve-outs from the global pool
- Triggered by lifecycle events from
- AgentCore Budget Monitor Lambda
- Runs every 5 minutes via EventBridge schedule
- Evaluates budget thresholds against both per-agent and global pool limits
- Initiates grace periods and triggers suspension workflows
- AgentCore Budget Manager Lambda
- Exposed via Function URL with IAM authentication
- Provides an API for managing agent budgets (CRUD operations on budget allocations)
- AgentCore IAM Utilities Lambda
- Manages role policy operations for suspension/restoration
- Suspension: snapshot all role policies, strip them, attach deny-all inline policy, tag role with
BedrockBudgeteerManaged - Restoration: delete deny-all policy, reattach managed policies, recreate inline policies, untag role, reset budget
Step Functions State Machines:
- AgentCore Suspension Workflow
- Grace notification → wait → strip all role policies → attach deny-all → tag role → update status
- AgentCore Restoration Workflow
- Validate restoration → delete deny-all → restore managed policies → recreate inline policies → untag role → reset budget
Integration with Existing Constructs:
- The usage_calculator Lambda performs an early routing check: if the caller role ARN matches a registered AgentCore runtime (via the
role_arn-indexGSI), usage is tracked in the agentcore-budgets table instead of user-budgets. - IAM permissions for role management (list/attach/detach/put/delete policies, tag/untag roles) are scoped to roles with the
BedrockBudgeteerManagedtag.
SSM Parameters (under /bedrock-budgeteer/global/agentcore/):
global_budget_limit_usd— total AgentCore spend poolgrace_period_seconds— seconds before suspension after budget exceededwarning_threshold_percent— warning alert thresholdcritical_threshold_percent— critical alert thresholddefault_per_agent_budget_usd— default budget for new agent runtimes
6. Monitoring and Alerting Layer
SNS Topics:
operational-alerts: System operational issuesbudget-alerts: Budget threshold violationshigh-severity: Critical system alerts
CloudWatch Dashboards:
- System Overview: Lambda metrics, DynamoDB performance
- Ingestion Pipeline: CloudTrail, EventBridge, Firehose metrics
- Workflow Orchestration: Step Functions execution metrics
- Business Metrics: Budget violations, user activity, costs
Multi-Channel Notifications:
- Email subscriptions
- Slack webhook integration
- SMS for high-severity alerts
- Generic webhook for external integrations
Data Flow Patterns
1. Real-time Usage Processing
sequenceDiagram
participant U as User
participant BR as Bedrock
participant CW as CloudWatch Logs
participant KF as Kinesis Firehose
participant UC as Usage Calculator
participant DDB as DynamoDB
U->>BR: Invoke Model
BR->>CW: Write Invocation Log
CW->>KF: Stream Log Event
KF->>UC: Transform & Process
UC->>DDB: Update Usage & Cost
Note over UC,DDB: Real-time budget tracking
2. Budget Violation Response
sequenceDiagram
participant BM as Budget Monitor
participant DDB as DynamoDB
participant EB as EventBridge
participant SF as Step Functions
participant IAM as IAM
participant SNS as SNS
BM->>DDB: Check Budget Status
Note over BM,DDB: Budget exceeded
BM->>DDB: Start Grace Period
BM->>EB: Publish Suspension Event
EB->>SF: Trigger Workflow
SF->>IAM: Apply Restrictions
SF->>SNS: Send Notifications
SF->>DDB: Update Status
3. Automatic Restoration
sequenceDiagram
participant BR as Budget Refresh
participant DDB as DynamoDB
participant EB as EventBridge
participant SF as Step Functions
participant IAM as IAM
BR->>DDB: Check Refresh Dates
Note over BR,DDB: Refresh period reached
BR->>EB: Publish Restoration Event
EB->>SF: Trigger Restoration
SF->>IAM: Restore Access
SF->>DDB: Reset Budget & Status
Security Architecture
IAM Roles and Permissions
Lambda Execution Role:
- DynamoDB: Full access to application tables
- S3: Read access to logs bucket
- SSM: Read access to configuration parameters
- Bedrock: API access for model information
- Pricing: API access for cost calculation
- CloudWatch: Custom metrics publishing
Step Functions Role:
- Lambda: Invoke workflow functions
- DynamoDB: Read/write access to tables
- EventBridge: Publish audit events
- SQS: Send messages to DLQ
Firehose Delivery Role:
- S3: Write access to logs bucket
- KMS: Encrypt/decrypt with customer keys
- Lambda: Invoke usage calculator
Encryption Strategy
At Rest:
- DynamoDB tables encrypted with customer-managed KMS key (optional)
- S3 buckets encrypted with KMS
- CloudWatch logs encrypted with KMS (optional)
In Transit:
- All API calls use HTTPS/TLS
- Inter-service communication encrypted by default
Configuration Management
SSM Parameter Hierarchy
/bedrock-budgeteer/
├── global/
│ ├── thresholds_percent_warn (70)
│ ├── thresholds_percent_critical (90)
│ ├── default_user_budget_usd (1)
│ ├── grace_period_seconds (300)
│ └── agentcore/
│ ├── global_budget_limit_usd
│ ├── grace_period_seconds
│ ├── warning_threshold_percent
│ ├── critical_threshold_percent
│ └── default_per_agent_budget_usd
└── production/
└── cost/
└── budget_refresh_period_days (30)
Environment-Specific Configuration
- Production: Strict thresholds, longer grace periods, comprehensive monitoring
- Development: Relaxed thresholds, shorter grace periods, basic monitoring
Operational Considerations
Fault Tolerance
Dead Letter Queues (DLQ):
- All Lambda functions have DLQ configured
- Failed executions captured for manual retry
- 14-day retention for investigation
Idempotent Operations:
- All workflows support safe retry
- State reconciliation prevents drift
- Unique event IDs prevent duplicate processing
Scalability
Auto-scaling Configuration:
- DynamoDB tables with read/write auto-scaling
- Lambda concurrency managed automatically
- Firehose buffering for batch processing
Cost Optimization:
- S3 lifecycle policies for log retention
- CloudWatch log retention policies
- DynamoDB on-demand billing for variable workloads
Monitoring and Observability
Metrics:
- System operational metrics
- Business KPI metrics
- Error rates and latencies
- Cost tracking metrics
Alerting:
- Threshold-based alarms
- Anomaly detection
- Multi-channel notifications
- Escalation procedures
Dashboards:
- Real-time system health
- Business metrics overview
- Operational troubleshooting views
- Cost analysis dashboards
Compliance and Auditing
Audit Trail
Comprehensive Logging:
- All budget changes logged
- IAM policy modifications tracked
- User suspension/restoration events
- System configuration changes
Retention Policy:
- Audit logs retained for compliance requirements
- S3 lifecycle policies for long-term storage
- DynamoDB TTL for automated cleanup
Data Privacy
Minimal Data Collection:
- Only necessary user identifiers stored
- No sensitive business data captured
- Encryption for data at rest and in transit
Disaster Recovery
Backup Strategy
DynamoDB:
- Point-in-time recovery disabled for cost optimization
- Automated backups via DynamoDB native features
- Cross-region replication not implemented (single-region design)
Configuration:
- SSM parameters backed up via infrastructure as code
- CDK deployment enables environment recreation
Recovery Procedures
Service Restoration:
- Redeploy CDK stack in target region
- Restore DynamoDB tables from backup
- Update SSM parameters as needed
- Validate service functionality
Data Recovery:
- DynamoDB table restoration from backup
- S3 data available across availability zones
- CloudWatch logs retention for investigation