Note : Some of the content may seem repetitive, as it is the same subject presented by different people.
Source : Youtube : ⭐ very good one
SRE is still operationally driven, but it is more of an engineering role; you do operations work as well.
"CNN as a service" was used by healthcare.gov for monitoring. They watched the news to know whether their service was down.
Four golden signals
You should be able to see every other system's dashboard, since those systems could be your dependencies.
Google logs everything with a reasonable retention period. A log analysis tool with a SQL-like interface is more advantageous than grep (Google uses Dremel).
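To illustrate why a SQL-like interface beats grep for log analysis, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for a tool like Dremel, which is an assumption for illustration only; the table layout, the sample rows, and the function name `error_counts_by_service` are all hypothetical.

```python
import sqlite3

def error_counts_by_service(rows):
    """Count 5xx responses per service.

    rows: iterable of (timestamp, service, status_code) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (ts TEXT, service TEXT, status INTEGER)")
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)
    # Grouping and ordering like this is awkward with grep alone.
    query = """
        SELECT service, COUNT(*) AS errors
        FROM logs
        WHERE status >= 500
        GROUP BY service
        ORDER BY errors DESC, service ASC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

# Hypothetical sample log lines, already parsed into tuples.
sample_logs = [
    ("2024-01-01T10:00:00", "checkout", 200),
    ("2024-01-01T10:00:01", "checkout", 503),
    ("2024-01-01T10:00:02", "search", 500),
    ("2024-01-01T10:00:03", "checkout", 502),
    ("2024-01-01T10:00:04", "search", 200),
]

print(error_counts_by_service(sample_logs))
```

The same question asked with grep would need a pipeline of filters and counters per service, whereas here the aggregation is one declarative query.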
SLIs are metrics; the golden signals are a good place to start.
SLOs are target values or ranges for SLIs. (100% is not a good target.)
SRE is all about planning for failure. Things are going to break.
An SLA is a legal agreement, so if it is breached customers are going to come with breach-of-contract claims.
Error budgets are permissive. What good are the nines if you don't take them out every now and then? An error budget lets you balance reality vs reliability.
No budget = no pushes (change freeze)
If you have error budget, think about when is a good time to take extra risks. If you have a feature that is well tested but still carries some risk, you have a budget to spend on it.
This is how Google works when launching features or products: a launch is announced, and then the dates change because there is no budget left.
A canary release takes a small fraction of your typical workload. Set golden rules for the canary separately and monitor it. It's all about catching problems early.
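A minimal sketch of a canary check, assuming the "golden rule" being watched is the error rate. The function name `canary_verdict`, the 2x ratio, the 0.1% floor, and the minimum sample size are all hypothetical values chosen for illustration, not something stated in the talk.

```python
def canary_verdict(canary_errors, canary_total,
                   baseline_errors, baseline_total,
                   max_ratio=2.0, min_requests=100):
    """Compare the canary's error rate with the baseline fleet.

    Returns "wait" (not enough traffic yet), "promote", or "rollback".
    """
    if canary_total < min_requests:
        return "wait"
    canary_rate = canary_errors / canary_total
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    # Small absolute floor so a zero-error baseline does not fail
    # the canary on a single stray error.
    allowed_rate = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed_rate else "rollback"

print(canary_verdict(1, 50, 10, 10_000))      # too little canary traffic
print(canary_verdict(1, 1_000, 10, 10_000))   # same error rate as baseline
print(canary_verdict(50, 1_000, 10, 10_000))  # far worse than baseline
```

Because the canary only receives a small fraction of traffic, a bad release is caught while it affects few users.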
Source : Youtube
What is SRE ?
My own thoughts : Development teams don't manage an application once it goes to production; they just move on to building other new things. The operations team, on the other hand, manages the application until it is decommissioned. That is a lot of operational time for a few months of development.
Mostly, there is a lot more emphasis on development teams than on operations teams.
Business to development = Agile solves this. Development to operations = Devops solves this.
Interface Devops
SRE approach to operations
Error Budgets : Key principles of SRE
SLI/SLO/SLA
SLO = Target of SLIs aggregate over time
Error budgets
Practices of SRE
Metrics and monitoring
Capacity planning : Planning is vital. You should be able to predict and forecast. If you see something bad, roll back immediately and investigate later. Having capacity is one thing, but obtaining it swiftly and integrating it is equally important.
Change management
Emergency response
Culture
How to get started
Ways to get help
Source : Youtube
The Devops movement was all about breaking down the wall between developers & operators
Simplest way to break down the barriers
Devops manifesto
Reduce organizational silos
Accept failure as normal
Implement Gradual Changes
Leverage Tooling & Automation
Measure Everything
SRE evolved independently from Devops. Google believed SRE is the way to run, build and maintain production systems at scale.
Devops was built by a community and SRE was built by Google
Class SRE implements Devops
Devops | SRE |
---|---|
Reduce organizational silos | Share ownership with developers and use the same set of tooling as developers, so everybody contributes to getting the job done. |
Accept failure as normal | SLOs & error budgets. These force collaboration & conversation between product teams, developers, SREs, and even sales and post-sales. We also have to admit how reliable our system can realistically be. Blameless postmortems. |
Implement gradual changes | Move fast by making small iterative deployments. |
Leverage tooling and automation | Spend time on work that brings long-term value to your system. Think of everything that can possibly be automated. (TOIL) |
Measure everything | Measure even toil |
SLIs, SLOs and SLA
Who & Where they are involved
Do remember, the SLO should break before the SLA.
Error budgets
Even if you spend the effort to create a 100% reliable system, for the end user it may be available only, say, 99% of the time, because availability is bounded by the least reliable components the system depends on.
How risky my service can be depends on many things.
Fault tolerance
Availability
Competition
How fast you are trying to deliver
Acceptable risk should dictate the SLO
To be highly reliable, you need to increase the nines
If your focus is on new features and getting them out fast, you may need to give up some nines
For customers to trust you, your system needs to be reliable.
What happens when error budget is depleted ?
You can continue deploying, and your developers can continue building, but everything has to focus on reliability. New features cannot ship until reliability improves. Effectively, all work shifts from new features to improving the reliability of the system until the budget is replenished.
When developers say "this is an important feature, why can't I deploy it?": well, they can deploy, but they will lose SRE support, meaning the developers take over support themselves.
Toil
Suppose a report takes 15 minutes to create manually and would take around 20 hours to automate. If the report is made only once a year, it is not toil. Just document how to do it, as that will help others.
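The arithmetic behind the report example can be sketched as a simple break-even calculation. The function names are hypothetical; the 15 minutes, 20 hours, and once-a-year figures come from the example above, while the weekly case is an added assumption for contrast.

```python
def break_even_runs(manual_minutes, automation_hours):
    """Number of manual runs after which automating would have paid off."""
    return (automation_hours * 60) / manual_minutes

def years_to_pay_back(manual_minutes, automation_hours, runs_per_year):
    """Years until the automation effort is recovered."""
    return break_even_runs(manual_minutes, automation_hours) / runs_per_year

# Yearly report: 20 h of automation versus 15 min of manual work per run.
print(break_even_runs(15, 20))        # runs needed to break even
print(years_to_pay_back(15, 20, 1))   # at one run per year

# Hypothetical contrast: the same task done weekly pays back much sooner.
print(round(years_to_pay_back(15, 20, 52), 2))
```

At one run per year the automation needs 80 runs, so 80 years, to pay for itself, which is why documenting the steps is the better investment here.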
Source : Youtube
Each time an outage occurs, the time taken to resolve it should get shorter.
Source : Youtube : ⭐ Good things start @ 30:30
SRE Practices
SLO : Balancing stability & agility
Error budgets
Blameless postmortem
Capping and Eliminating toil
There are two options,
Toil typically involves continuous investment, but absolutely is the right path.
You don't need new people to do SRE
There is no such thing as SRE vs. Devops. SRE is a more specific implementation of the general class of DevOps. The two are complementary ideas.
Things you can do today to implement SRE
This is communicated to everyone in the organization so we are all on the same page, from individual contributors to vice-presidents.
We do that by defining SLOs in collaboration with product owners; by agreeing in advance, we make sure there is no confusion.
Every application has a unique set of requirements that dictate how reliable it has to be before customers no longer notice the difference. That means we can leave enough room for errors and enough room to ship features reliably.
SLI : Metrics over time, such as request latency, batch throughput per second, or the ratio of failed requests to total requests. Aggregate them over time and apply a percentile (for example the 99th) to get a concrete threshold, so that a single number tells you whether things are good or bad. Eg:- 95th percentile latency of homepage requests over the past 5 minutes < 300ms.
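A minimal sketch of evaluating the example SLI (95th percentile latency under 300 ms over a window). It uses the nearest-rank percentile definition, which is one of several common conventions and is an assumption here; the function names and sample data are hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def latency_sli_ok(latencies_ms, pct=95, threshold_ms=300):
    """True when the pct-th percentile latency in the window is under the threshold."""
    return percentile(latencies_ms, pct) < threshold_ms

# Hypothetical 5-minute windows of homepage request latencies (ms).
mostly_fast = [100] * 19 + [500]      # one slow request out of 20
too_many_slow = [100] * 18 + [500] * 2

print(latency_sli_ok(mostly_fast))
print(latency_sli_ok(too_many_slow))
```

Using a percentile instead of the mean keeps one very slow request from hiding behind many fast ones, while still tolerating a small tail.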
SLO : Integrate the SLI over a period of time, such as a year, against a target like 99.99%, to see whether the total downtime is more or less than about 52.6 minutes. SLOs are ranges. Eg:- The 95th percentile homepage SLI will succeed 99.9% of the time over the trailing year.
SLA : A business agreement between the customer and the service provider, typically based on SLOs. Eg:- Service credits are owed if the 95th percentile homepage SLI succeeds less than 99.5% of the time over the trailing year.
SLIs drive SLOs, which inform SLAs. The SLA should be more lenient than the SLO so that you get early warning.
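The relationship between the two thresholds can be sketched as a small classifier, using the 99.9% SLO and 99.5% SLA from the examples above. The function name and return labels are hypothetical.

```python
def classify_reliability(success_ratio, slo=0.999, sla=0.995):
    """Classify a measured success ratio against the SLO and the (looser) SLA."""
    if sla > slo:
        raise ValueError("the SLA should be more lenient than the SLO")
    if success_ratio >= slo:
        return "ok"
    if success_ratio >= sla:
        # Internal target missed, contract still intact: the early warning.
        return "slo_breached"
    return "sla_breached"

print(classify_reliability(0.9995))
print(classify_reliability(0.997))
print(classify_reliability(0.990))
```

The gap between the two thresholds is the window in which the team can react before customers are owed anything.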
Depending on the scale of incidents, you need to bring in specialists from other parts of the company
For long-running incidents, the incident commander (IC) may hand over to a successor in a different zone/region. Sometimes, due to complexity, the IC can give up the role and become the operations lead.
The SLO dictates the error budget: SLO (99.9%) 🡪 error budget (43.2 min/month)
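The 43.2 minutes figure follows directly from the SLO, assuming a 30-day month. A minimal sketch (the function name is hypothetical):

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for a given SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100

print(round(error_budget_minutes(99.9), 1))    # 30-day budget at 99.9%
print(round(error_budget_minutes(99.99), 2))   # one more nine, ten times less budget
```

Each additional nine cuts the budget by a factor of ten, which is why the target should come from what users actually need rather than from what sounds impressive.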
After exceeding the error budget, the product team can ask a vice-president for an exception, which can be granted only a few times a year.
An error budget is a must for everything in the stack, from top to bottom. This way you can determine how much error budget you have allocated for your dependencies and how much is allocated for your developers.
A system cannot be 100% reliable because its dependencies cannot all be available 100% of the time.
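A rough sketch of why dependencies cap reliability, assuming every dependency is a hard, serial dependency and that failures are independent (both simplifications). The availability figures are hypothetical.

```python
import math

def serial_availability(availabilities):
    """Availability of a request path that needs every component to be up."""
    return math.prod(availabilities)

# A perfectly reliable service sitting on three hypothetical dependencies.
own_service = 1.0
dependencies = [0.999, 0.995, 0.996]

overall = serial_availability([own_service] + dependencies)
print(f"{overall:.4%}")
```

Even with a flawless service, the user sees roughly 99% here, so part of the error budget is already spent by the dependencies before the developers ship anything.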
Toil | Overhead |
---|---|
Running scripts, commands, restarting services | Email, Expense reports, Meetings, Travelling |
Toil activity must be related to production service.
Characteristics of Toil
If an operator writes all the commands into a script and executes the script instead of running the commands manually, the operator has reduced the amount of toil; but since running the script is still not automated, some toil remains.
Manually carrying out a task in production is toil, but writing code to replace that manual action is not toil; it's project work.
Measuring Toil
In SRE, 'E' stands for engineering work, that’s what lets our organization scale and meet the demands of all application and services we support.
Risk analysis : A list of items that may cause an SLO violation
Scenario 1
A primary database needs to be backed up every month. The backup takes 120 minutes, and the database is offline during that time.
Bad minutes = 120 mins × 12 months × 100% of users = 1440 bad minutes / year
SLO = 99.5%, so the error budget is 0.5% of 525,600 minutes = 2628 minutes / year
This backup consumes about half (roughly 55%) of the error budget
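The Scenario 1 arithmetic can be reproduced with a short sketch. It assumes a 365-day year and reads the 99.5% figure as the SLO; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def annual_bad_minutes(outage_minutes, incidents_per_year, user_fraction):
    """Bad minutes per year, weighted by the fraction of users affected."""
    return outage_minutes * incidents_per_year * user_fraction

def budget_consumed(bad_minutes, slo_percent):
    """Fraction of the yearly error budget used up by the given bad minutes."""
    budget = MINUTES_PER_YEAR * (100 - slo_percent) / 100
    return bad_minutes / budget

backup_cost = annual_bad_minutes(120, 12, 1.0)
print(backup_cost)                                  # bad minutes per year
print(f"{budget_consumed(backup_cost, 99.5):.1%}")  # share of the budget
```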
Scenario 2
Every two weeks there is a slowdown on Friday; it takes 30 minutes for alerts to fire and another 30 minutes to resolve.
Calculated Expected Cost
TTD : Total Time to Detect; TTR : Total Time to Repair
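One common way to estimate the expected cost of a risk is (TTD + TTR) × incidents per year × fraction of users affected. This formula is an assumption here, since the notes do not record the exact calculation from the talk. Scenario 2 does not say how many users are affected, so the user fractions below are purely hypothetical.

```python
def expected_bad_minutes(ttd_minutes, ttr_minutes, incidents_per_year, user_fraction):
    """Expected bad minutes per year for one risk item."""
    return (ttd_minutes + ttr_minutes) * incidents_per_year * user_fraction

# Scenario 2: every two weeks (26 times a year), 30 min to detect, 30 min to repair.
print(expected_bad_minutes(30, 30, 26, 1.0))   # if every user is affected
print(expected_bad_minutes(30, 30, 26, 0.5))   # if half of the users are affected

# Faster alerting shrinks the cost without touching the repair itself.
print(expected_bad_minutes(5, 30, 26, 1.0))
```

Splitting the cost into detection and repair time shows where an investment helps most: better alerting lowers TTD, better tooling and runbooks lower TTR.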
Metrics are aggregate-type data about the performance of services, such as number of queries (counter), latency (distribution), and CPU load (gauge).
Postmortems should be in a machine-readable format (with metadata), so that you can track improvements in your incident response management process over time, identify meta-patterns in your outages, and build processes or technology to prevent or mitigate incidents in the future.
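A minimal sketch of what machine-readable postmortem metadata could look like. The field names, the `PostmortemRecord` class, and the sample incidents are all hypothetical; the point is only that structured records make trends computable.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class PostmortemRecord:
    incident_id: str
    minutes_to_detect: int
    minutes_to_resolve: int
    contributing_factors: list = field(default_factory=list)
    action_items: list = field(default_factory=list)

def to_json(record):
    """Serialize one postmortem so it can be stored and queried later."""
    return json.dumps(asdict(record), sort_keys=True)

def mean_minutes_to_resolve(records):
    """Track whether incident response is getting faster over time."""
    return sum(r.minutes_to_resolve for r in records) / len(records)

# Hypothetical incidents.
records = [
    PostmortemRecord("INC-1", 30, 90, ["bad config push"], ["add config validation"]),
    PostmortemRecord("INC-2", 10, 50, ["expired certificate", "missing alert"]),
]

print(to_json(records[0]))
print(mean_minutes_to_resolve(records))
```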
Postmortem
Use collaborative tools like Google Docs
Record things as you go, as this will help you roll back the changes you made as temporary repairs.
Make sure postmortem is blameless
In a distributed system there is no single root cause for a problem; there is usually more than one contributing factor. So write down every abnormal behavior.
File an issue in the issue tracker for each action item to make sure it gets prioritized in the future
Capture overall themes like,
Incident response management dashboard reporting
Source : Youtube
Two nice features of Error Budgets
Fix 1 : Common Staffing Pool
Fix 2 : SRE team will have only software engineers ( people who know coding )
Fix 3 : 50% caps on ops work
Fix 4 : Keep DEV team in rotation
Fix 5 : Speaking of DEV & Ops work
Fix 6 : SRE Portability
Two goals for each outage :
Minimize Damage
What is SLO ?
Typical SLOs
Error budget policy (some examples)
No new feature launches allowed
Sprint planning may only pull postmortem action items from the backlog
DEV team must meet with SRE team daily to outline their improvements
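An error budget policy like the examples above can be sketched as a simple release gate. The function name, the return labels, and the rule that reliability fixes are always allowed are assumptions for illustration; a real policy is whatever the teams agree on in advance.

```python
def release_decision(budget_minutes, consumed_minutes, is_reliability_fix=False):
    """Decide whether a change may ship under a simple error budget policy."""
    remaining = budget_minutes - consumed_minutes
    if remaining > 0:
        return "allow"
    # Budget exhausted: freeze features, but still let reliability work through.
    return "allow" if is_reliability_fix else "freeze"

print(release_decision(43.2, 10))                           # budget left
print(release_decision(43.2, 50))                           # budget exhausted, feature launch
print(release_decision(43.2, 50, is_reliability_fix=True))  # budget exhausted, postmortem action item
```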
SRE Principle #1
SRE Principle #2
Shared Responsibility
Regulating workload
Leadership buy-in
Automation
Site Reliability Engineering at Dropbox
Companies
Other SRE resources