I recently found a neat way to monitor all Step Function in an AWS account with minimal overhead.
Usually every AWS Step Function is deployed with a matching Cloudwatch Alarm, which is triggered when the Step Function fails.
ResourceComplianceMachineExecutionsFailed:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmDescription: !Sub |
Checks whether the ${AWS::StackName} state machine fails.
AlarmActions:
- !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmTopicName}"
ComparisonOperator: GreaterThanOrEqualToThreshold
EvaluationPeriods: 1
MetricName: ExecutionsFailed
Namespace: AWS/States
Period: 300
Statistic: Sum
Threshold: 1
TreatMissingData: missing
Unit: Count
Dimensions:
- Name: StateMachineArn
Value: !Ref YourStateMachine
This is a good start, but it has a few drawbacks:
There is a better way to monitor all Step Functions in an AWS account.
We can use an AWS EventBridge Rule to monitor specific State Changes of AWS Step Functions.
This EventBridge Rule triggers a Lambda function, which can e.g. parse the error message and include a link to the Step Function execution. This uses the AWS SAM framework.
StepFunctionEventParserFunction:
Condition: IsFrankfurt
Type: AWS::Serverless::Function
Properties:
Description: Event Parsing Lambda for Step Function Errors, sends parsed error message to SNS topic when a Step Function fails or times out.
CodeUri: src/notification/
Handler: step_function_event_parser.handler
Runtime: python3.8
Environment:
Variables:
SNS_ARN: !Ref AlarmTopic
Events:
Trigger:
Type: EventBridgeRule
Properties:
Pattern:
source:
- aws.states
detail-type:
- Step Functions Execution Status Change
detail:
status:
- FAILED
- TIMED_OUT