Site Reliability Engineering: Measuring and Managing Reliability

Duration: 2 Days (16 Hours)

Course Overview:

Service level indicators (SLIs) and service level objectives (SLOs) are essential tools for measuring and maintaining reliability in service-oriented environments. In this course, participants learn to create effective SLIs and SLOs for their services. They also learn to use an error budget to manage reliability: deciding how much unreliability is acceptable and using the remaining budget to make informed decisions. By the end of the course, participants will have the skills necessary to measure and manage reliability using SLIs, SLOs, and error budgets.

Intended Audience:

• DevOps specialists.
• Software developers.
• Product managers and application owners.
• IT business decision makers.

Course Objectives:

• Learn the best practices of Google SRE.
• Know the definitions of SLIs, SLOs, and SLAs and how they impact reliability.
• Understand how to set SLOs and SLIs.
• Understand error budgets.
• Analyze the risks associated with SLOs and the consequences of missing SLOs.

Course Outline:

The course includes presentations, demonstrations, and hands-on labs.

Module 1: Introduction to Site Reliability Engineering
• Introduction to Site Reliability Engineering.
• Understand the course objectives and overall structure.
• Understand the principles that underlie Site Reliability Engineering.

Module 2: Targeting Reliability
• Definition of SLAs and SLOs.
• Defining ‘good enough’ reliability.
• What to consider when setting SLOs for your application in your organization.

Module 3: Operating for Reliability
• Trading reliability vs. features.
• Understanding error budgets.
• Understand the trade-offs in having multiple SLIs for a given application.
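
As a rough illustration of the error budget idea in this module (a sketch, not course material): the budget is the share of the measurement window, or of events, that the SLO allows to be bad. The function names, the 30-day window, and the sample numbers are assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates within the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, valid_events: int) -> float:
    """Fraction of the error budget still unspent for an event-based SLI.

    A negative result means the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    allowed_bad = (1.0 - slo_target) * valid_events  # bad events the SLO permits
    actual_bad = valid_events - good_events          # bad events observed
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical service: 99.9% SLO over 30 days, 1,000,000 valid requests.
    print(f"budget: {error_budget_minutes(0.999):.1f} minutes")
    print(f"remaining: {budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

A team might ship features freely while the remaining fraction is comfortably positive and shift effort to reliability work as it approaches zero.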

Module 4: Choosing a Good SLI
• Defining types of SLIs.
• How to formulate SLI specifications.
• Setting targets for those SLIs.

Module 5: Developing SLOs and SLIs
• Setting SLOs, SLIs, and error budgets in a sample application.
• How to go from a user journey to an SLI implementation and an SLO target using a four-step process.

Module 6: Quantifying Risks to SLOs
• Characterizing and Analyzing Risk to SLOs.
• Resources for learning more about data analysis, machine learning, business process analysis, and optimization.
• Model risks in terms of time between failures, time to recovery, and impact percentage.
• Estimate the error budget cost of each risk using our Risk Analysis spreadsheet.
• Meet a desired SLO target by trading off engineering work to mitigate risks.
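
The risk model in this module can be sketched in a few lines. This is a simplified illustration, not the course's Risk Analysis spreadsheet: the `Risk` class, the formula (incidents per year × time to recovery × impact), and the two sample risks are all assumptions made for the example.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Risk:
    name: str
    tbf_days: float      # expected time between failures, in days
    ttr_minutes: float   # expected time to recovery, in minutes
    impact: float        # fraction of users or requests affected (0..1)

    def bad_minutes_per_year(self) -> float:
        """Expected error budget cost of this risk per year, in bad minutes."""
        incidents_per_year = 365 / self.tbf_days
        return incidents_per_year * self.ttr_minutes * self.impact


def annual_budget_minutes(slo_target: float) -> float:
    """Bad minutes per year that the SLO target allows."""
    return (1.0 - slo_target) * MINUTES_PER_YEAR


def fits_budget(risks: list[Risk], slo_target: float) -> bool:
    """True if the combined expected cost of all risks stays within the budget."""
    total = sum(r.bad_minutes_per_year() for r in risks)
    return total <= annual_budget_minutes(slo_target)


# Hypothetical risks, invented for illustration only.
risks = [
    Risk("bad config push", tbf_days=90, ttr_minutes=60, impact=0.5),
    Risk("zone outage", tbf_days=365, ttr_minutes=120, impact=1.0),
]

if __name__ == "__main__":
    # Most expensive risks first: these are the candidates for mitigation work.
    for r in sorted(risks, key=Risk.bad_minutes_per_year, reverse=True):
        print(f"{r.name}: {r.bad_minutes_per_year():.1f} bad minutes/year")
    for target in (0.999, 0.9999):
        print(f"SLO {target}: fits budget = {fits_budget(risks, target)}")
```

Under these assumed numbers, the two risks fit a 99.9% target but not a 99.99% target, so reaching the stricter target would require engineering work that lengthens time between failures, shortens recovery, or reduces impact.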

Module 7: Consequences of SLO Misses
• Documenting SLOs and Developing Error Budget Policies.
• Understand how to record and present SLI vs SLO data in a useful structure and format.
• Be able to list the essential components of an error budget policy.
• Enumerate possible options for reaction to an error budget overspend.

To get the most out of this course, participants should have:
• Familiarity with the development cycle of cloud applications.
• Familiarity with managing the response to outages.

