The Only SLO Template You'll Ever Need | John's Tips 2024W37
Use this template to help define and create your Service Level Objectives (SLOs) before you begin the work, and build a solid foundation for ongoing system reliability.
The purpose of the below is to act as a template ticket that you can take into your own project to help with the construction of your own SLOs.
Copy all of the below into your own Epic or Initiative, and then work as a team to remove the parts that you don’t need.
Context
Service Level Objectives (SLOs) track current platform performance against a baseline. To track SLOs we first need to be able to reliably calculate indicators across different dimensions, such as the availability of data served by the platform.
The purpose of Service Level Objectives is two-fold:
Provide transparency to the organization about the guarantees our platform offers on the dimensions that matter most to adopters of the service we provide.
Serve as an important enabler of internal decision making, highlighting the issues and product improvements we need to prioritize to either stay compliant with our SLOs or strengthen the guarantees they make.
In the distributed architectures many companies use these days, it is important to think about both your own SLOs and the SLOs of any potential downstream dependencies.
For example, if you identify that availability/reliability is critical to your project, you can use SLOs as a method of deciding whether a downstream dependency fits into your overall architecture.
Description
The goal of this initiative is to establish a set of SLOs to measure and track the performance and reliability of our system. By implementing these SLOs, either through automation or documentation, we aim to establish a reliable baseline that allows us to monitor and assess our progress over time.
For the first iteration it is important not to have too many metrics, but rather to start with something that is already available, e.g. metrics in Datadog or Splunk. You should also start with an assumption of what the consumers of the service care about (which metric or metrics best reflect their happiness when using the service).
To achieve this goal, we will focus on implementing SLOs in relevant dimensions that are crucial for evaluating the system’s performance. The metrics will be based on the following and will need to be accepted by other stakeholders (EM/PM) before implementation can begin.
(please remove/add as best suits your team/project)
System Logs
Performance Monitoring Tools
User Feedback
Error Tracking Systems
Load Testing
Service Level Agreements (SLAs)
Historical Performance Data
Synthetic Monitoring
These SLO dimensions may include, but are not limited to, the below:
(please remove/alter as best suits your team/project)
Availability: Measure the system’s uptime and ensure it meets the defined availability targets.
Performance: Track key performance indicators (KPIs) to assess the system’s responsiveness and efficiency.
Scalability: Monitor the system’s ability to handle increasing workloads and user demand without performance degradation.
Error Rates: Capture and analyze error rates to identify areas of improvement and reduce the occurrence of errors.
Latency: Measure the time it takes for the system to respond to user requests and ensure it falls within acceptable limits.
Throughput: Evaluate the number of transactions or requests the system can handle effectively within a given timeframe.
Security: Implement metrics to evaluate the system’s security posture and detect any vulnerabilities or breaches.
Compliance: Monitor the system’s adherence to regulatory and industry-specific compliance standards.
User Experience: Measure the system’s performance from the user’s perspective, considering factors like responsiveness, usability, and customer satisfaction.
By implementing metrics in these relevant dimensions, we will gain valuable insights into the system’s performance, identify areas for optimization, and make data-driven decisions to enhance the reliability and user experience of our solution.
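To make the availability dimension concrete, here is a minimal sketch of how an availability SLI could be computed and checked against a target. The function names and the request counts are illustrative assumptions, not a prescribed implementation; in practice the good/total event counts would come from whatever monitoring tool your team already has (Datadog, Splunk, etc.).

```python
# Hypothetical sketch: compute an availability SLI as the ratio of
# successful events to total events, then compare it to an SLO target.

def availability_sli(good_events: int, total_events: int) -> float:
    """Availability SLI: fraction of events that were successful."""
    if total_events == 0:
        return 1.0  # no traffic in the window: treat as fully available
    return good_events / total_events

def meets_slo(sli: float, target: float) -> bool:
    """True if the measured SLI satisfies the SLO target."""
    return sli >= target

# Illustrative numbers: 999,120 successful requests out of 1,000,000.
sli = availability_sli(good_events=999_120, total_events=1_000_000)
print(f"SLI: {sli:.4%}")             # SLI: 99.9120%
print(meets_slo(sli, target=0.999))  # True: above a 99.9% target
```

The same shape works for other dimensions; only the definition of a "good event" changes (e.g. "responded within 300 ms" for latency).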
For examples of SLOs from other projects around the company, you can simply look at the SLO dashboard <Add your own link here>
Job Stories
The following acceptance criteria are intentionally split into two job stories: the first concerns the implementation of the metrics, which will mainly be done by the engineers on the team, while the second concerns the specification and documentation of the SLOs, which could be picked up by anyone on the team, including managers.
Job Story #1
As the team responsible for maintaining infrastructure reliability of the new INFRASTRUCTURE NAME infrastructure, I want to ensure compliance with SLOs, create baseline metrics, define an error budget policy, and effectively respond to incidents so that we stay within the error budget and provide a reliable experience for our users.
Acceptance Criteria
When creating baseline metrics for the infrastructure, I want to define and measure service level indicators (SLIs) that will in turn underpin the SLOs, establishing a reliable baseline for service operation.
When defining the error budget policy, and knowing the limitations of Datadog’s SLO functionality, I want to set clear guidelines and thresholds for the allowable error rate or downtime, ensuring they align with business objectives and user expectations.
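The error budget itself is simple arithmetic: it is the "unreliability" the SLO target permits over the measurement window. The sketch below assumes a 99.9% target over a 30-day rolling window; both figures are illustrative, and the function names are hypothetical.

```python
# Hypothetical sketch of an error-budget calculation. A 99.9% target
# over 30 days allows about 43.2 minutes of "bad" time in the window.

WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window, in minutes

def error_budget_minutes(slo_target: float,
                         window_minutes: int = WINDOW_MINUTES) -> float:
    """Total allowed 'bad' minutes in the window for a given SLO target."""
    return (1.0 - slo_target) * window_minutes

def budget_remaining(slo_target: float, bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative = overspent)."""
    budget = error_budget_minutes(slo_target)
    return (budget - bad_minutes) / budget

print(error_budget_minutes(0.999))               # about 43.2 minutes
print(budget_remaining(0.999, bad_minutes=10.0))  # roughly 77% left
```

A written error budget policy then attaches actions to these numbers, e.g. freezing risky releases once the remaining budget drops below an agreed threshold.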
When an incident occurs that breaches the defined SLO threshold (External incident), and the error budget is impacted, I want to be alerted promptly to take appropriate action.
When an incident occurs that does not breach the defined SLO threshold (Internal incident), but the error budget is impacted, I want to be alerted promptly to take appropriate action.
Job Story #2
As the team responsible for maintaining infrastructure reliability of the new INFRASTRUCTURE NAME infrastructure, I want to establish a streamlined process for incident response and ensure effective reactions to SLO breaches or incidents.
Acceptance Criteria
When investigating and reporting incidents, I want to use predefined templates where possible to ensure consistency, efficiency, and thoroughness in documenting incident details, root cause analysis, impact assessment, and remedial actions. If suitable templates do not exist, I want to create them to establish a standardized incident reporting process.
When investigating the incident, I want to perform root cause analysis to identify the underlying issue and understand its impact on the error budget and SLIs.
When addressing the incident, I want to take necessary measures to stabilize the infrastructure within the defined response time and restore compliance with SLOs, considering the impact on the error budget.
When defining the stakeholder contact list, I should decide which stakeholders get notified in the case of internal and external incidents, so that I am not adding unnecessary, noisy communications for external stakeholders.
When communicating about the incident, I want to provide timely updates to the relevant stakeholders, including information about the incident’s impact on the error budget and the steps taken to resolve it, ensuring transparency and alignment with the defined error budget policy.
When documenting lessons learned from the incident, I want to capture insights and actions that will contribute to continuous improvement, including potential adjustments to the error budget policy or baseline metrics.
When planning long-term solutions, I want to prioritize initiatives such as automated load testing and enhanced monitoring capabilities to proactively manage the error budget, improve infrastructure reliability, and maintain alignment with the defined error budget policy.
I have a lot of other content on this website you might find useful. Simply check out the website headers for the subsections of interest.
For my past tips check out my past posts on Substack or check out the hashtag #JohnsTipOfTheWeek on LinkedIn.
I’d love it if you subscribed! I’m trying to build a bit of a following to help folks in the industry and make their jobs a little bit easier.