Tutorial / Cram Notes
The first step in remediation is to identify the issue accurately. AWS provides various monitoring tools to track and alert on system anomalies:
- Amazon CloudWatch is the go-to service for monitoring AWS resources and applications. It can monitor CPU usage, network traffic, and even custom metrics. When a metric goes beyond a defined threshold, CloudWatch can trigger alarms.
- AWS Config provides a detailed inventory of your AWS resources and their configurations, enabling you to audit changes and detect configurations that diverge from an established baseline.
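As a rough illustration of the alarm mechanism described above, the sketch below builds the keyword arguments for CloudWatch's PutMetricAlarm API (`put_metric_alarm` in boto3). The alarm name format, the 80% threshold, the two five-minute evaluation periods, and the SNS topic ARN are hypothetical values chosen for the example, not recommendations. The `cloudwatch` argument is assumed to be a boto3 CloudWatch client.

```python
def build_cpu_alarm(instance_id, topic_arn, threshold=80.0):
    """Return keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 2,   # breach must last two periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic to notify
    }


def create_cpu_alarm(cloudwatch, instance_id, topic_arn):
    """Create the alarm using a client such as boto3.client("cloudwatch")."""
    cloudwatch.put_metric_alarm(**build_cpu_alarm(instance_id, topic_arn))
```

Keeping the parameter construction separate from the API call makes the alarm definition easy to review and test without AWS credentials.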
Remediation Strategies
Once an issue is identified, the next step is to execute a remediation strategy. There are several common strategies:
- Manual Remediation: Having an engineer log in and address the issue directly. This is often used for complex problems requiring human judgment.
- Automated Remediation: Utilizing tools like AWS Systems Manager or custom scripts triggered by CloudWatch alarms to address well-understood and routine issues.
- Infrastructure as Code (IaC): Using tools like AWS CloudFormation or Terraform to manage infrastructure, so that you can roll back to a previous known-good state or re-provision resources when configuration drift or another undesired state is detected.
Using AWS Systems Manager for Remediation
AWS Systems Manager is a powerful service that can help with automatic remediation. Below is a high-level view of how to use Systems Manager to automatically remedy an issue:
- Set Up an Alarm: Create a CloudWatch alarm that monitors the desired metric.
- Create a Systems Manager Automation Document: Define the actions that need to be taken when an issue is detected. Actions could be restarting a service, replacing instances, or running a script.
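- Link the Alarm to the Automation Document: CloudWatch alarm actions cannot start an Automation document directly. Route the alarm's state change through an Amazon EventBridge rule (or an SNS topic that invokes an AWS Lambda function) whose target starts the Automation document.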
For example, if an EC2 instance becomes unresponsive, you could create an Automation Document that checks the instance status and reboots it if necessary.
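{
  "schemaVersion": "0.3",
  "description": "Check EC2 instance health and reboot the instance if its status check is not ok.",
  "parameters": {
    "InstanceId": {
      "type": "String",
      "description": "The instance ID to check."
    }
  },
  "mainSteps": [
    {
      "name": "assertInstanceHealth",
      "action": "aws:assertAwsResourceProperty",
      "isCritical": false,
      "onFailure": "step:rebootInstance",
      "nextStep": "verifyInstanceHealth",
      "inputs": {
        "Service": "ec2",
        "Api": "DescribeInstanceStatus",
        "InstanceIds": ["{{ InstanceId }}"],
        "PropertySelector": "$.InstanceStatuses[0].InstanceStatus.Status",
        "DesiredValues": ["ok"]
      }
    },
    {
      "name": "rebootInstance",
      "action": "aws:executeAwsApi",
      "onFailure": "Abort",
      "nextStep": "verifyInstanceHealth",
      "inputs": {
        "Service": "ec2",
        "Api": "RebootInstances",
        "InstanceIds": ["{{ InstanceId }}"]
      }
    },
    {
      "name": "verifyInstanceHealth",
      "action": "aws:waitForAwsResourceProperty",
      "timeoutSeconds": 600,
      "isEnd": true,
      "inputs": {
        "Service": "ec2",
        "Api": "DescribeInstanceStatus",
        "InstanceIds": ["{{ InstanceId }}"],
        "PropertySelector": "$.InstanceStatuses[0].InstanceStatus.Status",
        "DesiredValues": ["ok"]
      }
    }
  ]
}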
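One common way to connect an alarm to an Automation document is an Amazon EventBridge rule that matches the alarm's state change and targets the document. The sketch below is a minimal illustration under stated assumptions: `events` is a boto3 EventBridge client, `document_arn` is an automation-definition ARN for your document, `role_arn` is an IAM role that EventBridge may assume to start the automation, and the rule name and target ID are hypothetical.

```python
import json


def alarm_state_change_pattern(alarm_name):
    """Event pattern matching a CloudWatch alarm entering the ALARM state."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [alarm_name],
            "state": {"value": ["ALARM"]},
        },
    }


def create_remediation_rule(events, rule_name, alarm_name,
                            document_arn, role_arn, instance_id):
    """Create a rule that starts the Automation document for one instance."""
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(alarm_state_change_pattern(alarm_name)),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "run-automation",  # hypothetical target ID
            "Arn": document_arn,
            "RoleArn": role_arn,
            # Automation parameters are passed as lists of strings.
            "Input": json.dumps({"InstanceId": [instance_id]}),
        }],
    )
```

This version passes a fixed instance ID as constant input; a rule that serves many alarms would need an input transformer to extract the instance ID from the event instead.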
Continuous Compliance with AWS Config
To avoid deviations from the desired system state, you can use AWS Config rules to continuously evaluate your resources against predefined configurations. AWS Config can automate remediation for noncompliant resources by running SSM Automation Documents; custom logic in AWS Lambda functions can also be triggered from compliance-change events through Amazon EventBridge.
For instance, you could create a rule to check whether your Amazon S3 buckets have versioning enabled, and an Automation Document that enables versioning on any S3 bucket found noncompliant.
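The S3 versioning example can be sketched roughly as follows, assuming the AWS managed rule `S3_BUCKET_VERSIONING_ENABLED` and the AWS-owned runbook `AWS-ConfigureS3BucketVersioning`. The rule name, retry settings, and role ARN are hypothetical, and `config` is assumed to be a boto3 AWS Config client.

```python
def versioning_rule(rule_name="s3-bucket-versioning-enabled"):
    """Managed Config rule that flags S3 buckets without versioning."""
    return {
        "ConfigRuleName": rule_name,
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_VERSIONING_ENABLED",
        },
        "Scope": {"ComplianceResourceTypes": ["AWS::S3::Bucket"]},
    }


def versioning_remediation(rule_name, automation_role_arn):
    """Remediation that runs an SSM Automation document on each bad bucket."""
    return {
        "ConfigRuleName": rule_name,
        "TargetType": "SSM_DOCUMENT",
        "TargetId": "AWS-ConfigureS3BucketVersioning",
        "Parameters": {
            "AutomationAssumeRole": {
                "StaticValue": {"Values": [automation_role_arn]}
            },
            # RESOURCE_ID resolves to the noncompliant bucket's name.
            "BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}},
        },
        "Automatic": True,
        "MaximumAutomaticAttempts": 3,  # hypothetical retry settings
        "RetryAttemptSeconds": 60,
    }


def apply_versioning_compliance(config, automation_role_arn):
    """Create the rule, then attach the automatic remediation to it."""
    rule = versioning_rule()
    config.put_config_rule(ConfigRule=rule)
    config.put_remediation_configurations(
        RemediationConfigurations=[
            versioning_remediation(rule["ConfigRuleName"], automation_role_arn)
        ]
    )
```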
Conclusion
Remediating non-desired states in AWS involves identifying the issue, formulating an appropriate strategy, and then applying manual or automated fixes using AWS’s breadth of tools. By leveraging services like Amazon CloudWatch, AWS Systems Manager, and AWS Config, DevOps engineers can ensure that their systems return to the desired state with minimal disruption. Such practices are integral to the AWS Certified DevOps Engineer – Professional exam and are part of the foundational knowledge required to architect, operate, and troubleshoot complex cloud systems.
Practice Test with Explanation
True or False: AWS Config can be used to assess, audit, and evaluate the configurations of your AWS resources.
- Answer: True
Explanation: AWS Config provides a detailed view of the configuration of AWS resources in your account, and it can monitor and record configurations and evaluate them against desired configurations.
True or False: Amazon CloudWatch can execute automated actions based strictly on memory utilization metrics without any custom metrics.
- Answer: False
Explanation: Amazon CloudWatch does not provide memory utilization metrics for EC2 instances by default; these must be published to CloudWatch as custom metrics from within the instance, for example by the CloudWatch agent.
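To make the custom-metric point concrete, here is a minimal sketch of publishing memory utilization from a Linux instance with the PutMetricData API. In practice the CloudWatch agent usually does this for you. The namespace `Custom/System` and the metric name `MemoryUsedPercent` are hypothetical, and `cloudwatch` is assumed to be a boto3 CloudWatch client.

```python
def memory_used_percent(meminfo_text):
    """Compute used memory (%) from the contents of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def build_memory_datum(instance_id, used_percent):
    """Build one MetricData entry for PutMetricData."""
    return {
        "MetricName": "MemoryUsedPercent",  # hypothetical metric name
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Unit": "Percent",
        "Value": used_percent,
    }


def publish_memory_metric(cloudwatch, instance_id,
                          meminfo_path="/proc/meminfo"):
    """Read memory usage locally and publish it as a custom metric."""
    with open(meminfo_path) as handle:
        used = memory_used_percent(handle.read())
    cloudwatch.put_metric_data(
        Namespace="Custom/System",  # hypothetical namespace
        MetricData=[build_memory_datum(instance_id, used)],
    )
```

Once the metric exists, an alarm can be defined on it in the same way as on any built-in metric.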
Which AWS service allows you to automate remediation actions without human intervention when a CloudWatch alarm is triggered?
- A) AWS Lambda
- B) Amazon EC2
- C) Amazon S3
- D) AWS CloudTrail
Answer: A) AWS Lambda
Explanation: AWS Lambda can be configured to trigger from Amazon CloudWatch alarms to automate remediation actions without human intervention.
True or False: Using AWS Systems Manager, you can automate the process of patching your EC2 instances and on-premises servers.
- Answer: True
Explanation: AWS Systems Manager provides a Patch Manager that automates the process of patching managed instances with both security-related and other types of updates.
Which of the following is NOT a capability of AWS Systems Manager?
- A) Automated compliance checks
- B) Configuration management
- C) Auto-scaling EC2 instances
- D) Patch management
Answer: C) Auto-scaling EC2 instances
Explanation: Scaling EC2 capacity is handled by Amazon EC2 Auto Scaling, not by AWS Systems Manager.
True or False: Amazon Inspector can only assess applications for exposure, vulnerabilities, and deviations from best practices after an instance has been launched.
- Answer: False
Explanation: Amazon Inspector can assess the application for security vulnerabilities and deviations from best practices both during runtime and during the development and build stages.
In which situation would you use an AWS Lambda function in conjunction with Amazon CloudWatch Events (now Amazon EventBridge) to remediate a system state?
- A) To monitor S3 bucket access policies periodically
- B) To reroute traffic based on network latency
- C) To take EBS snapshots on a regular schedule
- D) To revert unauthorized security group changes automatically
Answer: D) To revert unauthorized security group changes automatically
Explanation: An AWS Lambda function can be triggered by a CloudWatch Events (EventBridge) rule that matches security group API calls, and can then automatically revert unauthorized modifications.
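A rough sketch of the security group scenario follows. It assumes the rule delivers a CloudTrail-backed `AuthorizeSecurityGroupIngress` event whose `detail.requestParameters` has the nested `items` shape shown in the comments, and that `ec2` is a boto3 EC2 client. It handles IPv4 CIDR ranges only, and the logic for deciding which changes count as unauthorized is deliberately left out.

```python
def to_ip_permissions(request_parameters):
    """Convert CloudTrail-style ipPermissions into the EC2 API format.

    Assumed input shape:
      {"groupId": "sg-...",
       "ipPermissions": {"items": [
           {"ipProtocol": "tcp", "fromPort": 22, "toPort": 22,
            "ipRanges": {"items": [{"cidrIp": "0.0.0.0/0"}]}}]}}
    """
    permissions = []
    items = request_parameters.get("ipPermissions", {}).get("items", [])
    for item in items:
        perm = {"IpProtocol": item["ipProtocol"]}
        # Ports are absent for rules that cover all protocols.
        if "fromPort" in item:
            perm["FromPort"] = item["fromPort"]
        if "toPort" in item:
            perm["ToPort"] = item["toPort"]
        ranges = item.get("ipRanges", {}).get("items", [])
        if ranges:
            perm["IpRanges"] = [{"CidrIp": r["cidrIp"]} for r in ranges]
        permissions.append(perm)
    return permissions


def revert_ingress(ec2, event):
    """Revoke the ingress rules that the triggering API call just added."""
    params = event["detail"]["requestParameters"]
    permissions = to_ip_permissions(params)
    if permissions:
        ec2.revoke_security_group_ingress(
            GroupId=params["groupId"], IpPermissions=permissions
        )
    return permissions
```

Inside a Lambda handler, `revert_ingress` would be called with the incoming event after checking that the caller or the rule itself is not on an allow list.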
True or False: AWS Systems Manager State Manager can be used to enforce a desired system state on a scheduled basis, such as ensuring that monitoring agents are always running.
- Answer: True
Explanation: AWS Systems Manager State Manager helps you maintain consistent configuration of your instances and ensure that your instances are in a desired state at specified intervals.
Which of the following actions can be performed by AWS Systems Manager Automation? (Select TWO)
- A) Patching operating systems
- B) Rolling back a resource configuration
- C) Horizontal scaling of Amazon RDS instances
- D) Modifying IAM roles and policies
- E) Horizontal scaling of Amazon EC2 Auto Scaling groups
Answer: A) Patching operating systems, B) Rolling back a resource configuration
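Explanation: Patching operating systems and rolling back a resource configuration are standard AWS Systems Manager Automation use cases. Horizontal scaling is handled by Amazon EC2 Auto Scaling and by Amazon RDS itself rather than by Automation, and managing IAM roles and policies is not an Automation capability in its own right, although a runbook can call other AWS APIs when it needs to.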
True or False: You can use an AWS Step Functions state machine to orchestrate a series of remediation actions that involve multiple AWS services.
- Answer: True
Explanation: AWS Step Functions coordinates multiple AWS services into serverless workflows so you can build and update apps quickly, and it can orchestrate complex remediation actions.
What feature of Amazon CloudWatch can dynamically respond to system state changes and execute remediation actions defined in AWS Systems Manager documents?
- A) CloudWatch Alarms
- B) CloudWatch Logs
- C) CloudWatch Events/EventBridge
- D) CloudWatch Metrics
Answer: C) CloudWatch Events/EventBridge
Explanation: CloudWatch Events/EventBridge can respond to state changes in AWS services and resources and can trigger automated actions like executing Systems Manager documents for remediation.
True or False: AWS Trusted Advisor can automatically resolve issues it identifies within your AWS environment.
- Answer: False
Explanation: AWS Trusted Advisor provides recommendations to help you follow AWS best practices, but it does not automatically resolve issues; manual or automated actions must be taken by the user or via other AWS services.
Interview Questions
Can you explain the principle of ‘infrastructure as code’ and how it helps in remediating a non-desired system state?
Infrastructure as Code (IaC) refers to the practice of managing and provisioning computing infrastructure through machine-readable definition files rather than physical hardware configuration or interactive configuration tools. IaC helps in remediating non-desired system states by allowing DevOps engineers to easily deploy consistent environments. If a system drifts from its desired state, the infrastructure can be quickly redeployed or corrected with the code defining the correct state.
How would you use AWS CloudFormation to revert a non-desired state of an AWS resource?
AWS CloudFormation allows engineers to manage AWS resources by describing them in templates which are YAML or JSON files. If a resource enters a non-desired state, you can update the stack using a previous or corrected version of the CloudFormation template, which will then change the current state back to the defined desired state.
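A minimal sketch of that workflow, assuming `cfn` is a boto3 CloudFormation client and `known_good_template` is the text of a previously working template: first check whether the stack has drifted, then re-apply the known-good template. The function names and polling interval are hypothetical.

```python
import time


def detect_drift(cfn, stack_name, poll_seconds=5):
    """Run drift detection and return the stack's drift status."""
    detection_id = cfn.detect_stack_drift(
        StackName=stack_name
    )["StackDriftDetectionId"]
    while True:
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            # e.g. "DRIFTED" or "IN_SYNC"; None if detection failed
            return status.get("StackDriftStatus")
        time.sleep(poll_seconds)


def revert_stack(cfn, stack_name, known_good_template):
    """Update the stack to a known-good template and wait for completion."""
    cfn.update_stack(StackName=stack_name, TemplateBody=known_good_template)
    cfn.get_waiter("stack_update_complete").wait(StackName=stack_name)
```

Note that CloudFormation rejects an update when the submitted template and parameters are identical to the current ones, so drift introduced outside CloudFormation may need a template change, or a manual fix to the resource, to correct.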
What are AWS Config rules and how can they be used for remediation?
AWS Config rules evaluate the configuration settings of AWS resources for compliance with preferred settings. When a resource is noncompliant, AWS Config can run an SSM Automation Document as an automatic remediation action, or the compliance change can trigger a Lambda function or send notifications for manual intervention.
Discuss the role of AWS Systems Manager in remediating non-desired system states.
AWS Systems Manager provides visibility and control over your AWS resources. It can be used to automate operational tasks to help ensure your systems remain in a desired state. For example, you can define State Manager configurations to automatically enforce desired system configuration and state on a scheduled basis.
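As a sketch of the State Manager example, the function below builds the parameters for a CreateAssociation call that periodically runs a shell command on tagged Linux instances to make sure a monitoring agent is running. The association name, tag values, and the choice of the CloudWatch agent's systemd unit are assumptions for illustration; `ssm` is assumed to be a boto3 Systems Manager client.

```python
def agent_association(tag_key, tag_value):
    """Parameters for a State Manager association that keeps an agent up."""
    command = (
        "systemctl is-active --quiet amazon-cloudwatch-agent "
        "|| systemctl start amazon-cloudwatch-agent"
    )
    return {
        "Name": "AWS-RunShellScript",  # AWS-provided command document
        "AssociationName": "ensure-cloudwatch-agent-running",  # hypothetical
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "ScheduleExpression": "rate(30 minutes)",
        "Parameters": {"commands": [command]},
    }


def create_agent_association(ssm, tag_key, tag_value):
    """Create the association with a client such as boto3.client("ssm")."""
    return ssm.create_association(**agent_association(tag_key, tag_value))
```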
Describe how Amazon CloudWatch can be used for system state remediation.
Amazon CloudWatch can monitor AWS resources and applications, providing alerts (CloudWatch Alarms) when it detects deviations from predefined metrics thresholds. These alerts can be configured to trigger automated actions using AWS Lambda or SNS topics for notifications or further workflow automation for remediation.
In the context of AWS, what is a rollback strategy and when would you implement one?
A rollback strategy in AWS is a process to revert your resources back to a previous known good state, typically after a failed deployment or when undesired behavior is detected. Rollback can be automated by services like AWS CodeDeploy, which can monitor deployments and automatically roll back if deployment fails or critical issues are detected.
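The CodeDeploy behavior described above can be sketched as follows: the deployment group is configured to roll back when a deployment fails or when a monitored CloudWatch alarm fires. The application, group, and alarm names are hypothetical, and `codedeploy` is assumed to be a boto3 CodeDeploy client.

```python
def rollback_settings(alarm_names):
    """Auto-rollback and alarm settings for a CodeDeploy deployment group."""
    return {
        "autoRollbackConfiguration": {
            "enabled": True,
            "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"],
        },
        "alarmConfiguration": {
            "enabled": True,
            "alarms": [{"name": name} for name in alarm_names],
        },
    }


def enable_auto_rollback(codedeploy, application, group, alarm_names):
    """Apply the rollback settings to an existing deployment group."""
    codedeploy.update_deployment_group(
        applicationName=application,
        currentDeploymentGroupName=group,
        **rollback_settings(alarm_names),
    )
```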
How can Elastic Beanstalk be used for maintaining a desired application state?
AWS Elastic Beanstalk provides environment configurations that define how an application should run. If an application state diverges, you can redeploy the application or update the environment configuration to restore the desired state. Elastic Beanstalk also supports versioning and can automatically handle application rollbacks to previous stable versions if needed.
Can you identify a tool within the AWS ecosystem to automate the remediation of non-compliant resources?
AWS Systems Manager Automation is a tool within the AWS ecosystem designed to automate operational tasks, making it easier to manage and remediate non-compliant resources. It uses pre-defined automation documents or your own that specify the actions AWS Systems Manager should take on your AWS resources.
How does the concept of blue/green deployments help ensure system state is always as desired?
Blue/green deployments involve running two identical production environments, only one of which (Blue) receives live traffic at any given time. When an update is needed, the new version is deployed to the inactive environment (Green). After testing, traffic is switched over to Green. If the new version has issues, traffic can be quickly routed back to Blue, so the system returns to a known stable state.
What is the purpose of Amazon Inspector in maintaining system state, and can it assist in remediation efforts?
Amazon Inspector is an automated security assessment service that helps improve the security and compliance of applications deployed on AWS. It can assess applications for vulnerabilities or deviations from best practices, and while it does not directly remediate issues, it integrates with other AWS services like Lambda for automated remediation based on findings.
In an AWS environment, what role does the AWS Personal Health Dashboard play in maintaining a desired system state?
The AWS Personal Health Dashboard, now part of the AWS Health Dashboard, provides alerts and guidance for AWS events that could impact your resources. It helps maintain the desired system state by notifying you of planned maintenance, unplanned issues, or changes that could affect your services, allowing you to proactively manage your environment.
How do you define a ‘non-desired system state’ within AWS and what are the first steps to investigate and remediate such a state?
A ‘non-desired system state’ in AWS is any configuration or behavior of AWS resources that deviates from the predefined settings, performance benchmarks, security posture, or application function that is considered ‘normal’ or ‘intended.’ The first steps to investigate such a state would include checking CloudWatch for alarms, reviewing AWS Config for non-compliant resources, analyzing logs (like CloudTrail and VPC flow logs), and assessing recent changes or deployments for possible causes. Once identified, the remediation steps would depend on the cause and might involve redeployments, correcting resource configurations, scaling resources, or applying security patches.
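The first two investigation steps can be sketched as a small triage helper, assuming `cloudwatch` and `config` are boto3 clients for CloudWatch and AWS Config. For brevity it reads only the first page of each response; a fuller version would paginate. The function names are hypothetical.

```python
def alarms_in_alarm_state(cloudwatch):
    """Names of metric alarms currently in the ALARM state (first page)."""
    response = cloudwatch.describe_alarms(StateValue="ALARM")
    return [alarm["AlarmName"] for alarm in response.get("MetricAlarms", [])]


def noncompliant_rules(config):
    """Names of Config rules with noncompliant resources (first page)."""
    response = config.describe_compliance_by_config_rule(
        ComplianceTypes=["NON_COMPLIANT"]
    )
    return [
        rule["ConfigRuleName"]
        for rule in response.get("ComplianceByConfigRules", [])
    ]


def triage(cloudwatch, config):
    """Collect a first snapshot of what is out of its desired state."""
    return {
        "alarms": alarms_in_alarm_state(cloudwatch),
        "noncompliant_rules": noncompliant_rules(config),
    }
```

The snapshot only narrows down where to look; CloudTrail and recent deployment history are still needed to find the cause.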