Service scoring is a methodology for measuring service quality on a scale that quantifies the organization's standards. It can create a massive organizational drive for continuous improvement and healthy competitiveness. On the other hand, it can cause misalignment and lead to poor quality being delivered on a constant basis. In this blog piece, we're going to explore, with practical examples, how to score your services in a way that drives motivation.
How Scoring Works
The score should reflect the service quality based on the standards defined in the organization.
For each application, you usually pick one of two options:
Overall score - a single combined score across different categories.
2-3 Category Scores - Instead of one score, you have 2-3 scores, each representing a category like security, reliability, configuration, cost, etc.
The strategy you pick should be clear and reflect the organization's standards. A good scoring strategy will define the language teams use to quantify quality, so it's very important to pick one deliberately.
The two options don't have to be mutually exclusive: you can have an overall score that is a formula calculated from the category scores.
Each score is defined by a set of checks, each with its own priority. The category score will usually be a formula over all the checks in that category.
Example:
Overall score:
  ⅓ Security
    ½ 0 Critical CVEs
    ½ Redacted Logging
  ⅓ Reliability
    ⅓ Best Practices Implemented
    ⅓ Multiple Replicas
    ⅓ 0 downtime over the last 3 months
  ⅓ Cost
    1 Consumption vs. usage
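Expressed as code, this is just a weighted sum. Here's a minimal sketch, assuming every check evaluates to a simple pass/fail; the check names and weights mirror the example above and are otherwise hypothetical:

```python
# A minimal sketch of the weighted-score formula above. Check weights
# sum to 1 inside each category, and category weights sum to 1 overall.
# All names here are hypothetical.

CATEGORY_WEIGHTS = {"security": 1/3, "reliability": 1/3, "cost": 1/3}

CHECK_WEIGHTS = {
    "security": {"no_critical_cves": 1/2, "redacted_logging": 1/2},
    "reliability": {"best_practices": 1/3, "multiple_replicas": 1/3,
                    "no_downtime_3m": 1/3},
    "cost": {"consumption_vs_usage": 1.0},
}

def category_score(category: str, results: dict[str, bool]) -> float:
    """Weighted sum of pass/fail check results within one category."""
    weights = CHECK_WEIGHTS[category]
    return sum(weights[check] * passed for check, passed in results.items())

def overall_score(results: dict[str, dict[str, bool]]) -> float:
    """Overall score as the weighted sum of the category scores."""
    return sum(CATEGORY_WEIGHTS[cat] * category_score(cat, checks)
               for cat, checks in results.items())

# All security and cost checks pass; one reliability check fails:
print(overall_score({
    "security": {"no_critical_cves": True, "redacted_logging": True},
    "reliability": {"best_practices": True, "multiple_replicas": True,
                    "no_downtime_3m": False},
    "cost": {"consumption_vs_usage": True},
}))  # 1/3 + 1/3 * 2/3 + 1/3 = ~0.89
```

The same structure extends naturally to partial credit: a check can return a fraction between 0 and 1 instead of a strict pass/fail.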
How to Define a Good Score
Reflective - It should reflect the actual state of the service.
Bad example: defining the check as "using secured ports only" when the application doesn't listen on any port at all, so the check fails even though nothing is wrong.
Good example: defining the check as "not using unsecured ports", which also passes for an application with no open ports.
Actionable - Users can understand how to fix it.
Bad example: defining the check as "Kubernetes probes undefined".
Good example: defining the check as "Readiness probes defined" - the name states the exact desired state, so the fix is obvious (see the sketch after this list).
Achievable - Users must be able to achieve it. Let's be honest: no application can reach 100%, so we need to make sure users can reach 80% as easily as possible. If our platform doesn't allow teams to get to 80% easily, we may need to add services that help users achieve it.
Consensus-based - People need to feel connected to your score and believe the score and its checks are well defined. We're not looking for 100% agreement across the organization; more than 70% should be enough. Teams should trust that the score is going to protect them; the score is their insurance.
Open - Everyone can see their own score and even other teams' scores. It's important to share this kind of information and make it a visible part of the process. If someone wants to consume an API, it's better to check the service's reliability score before adding that API to a critical path.
Aligned with management - The company's management needs to prioritize this score. Adoption should move top->bottom, meaning managers push their teams to achieve good scores, but also bottom->top, so individual contributors push toward improving them.
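To make the Reflective and Actionable criteria concrete, here's a minimal sketch of the "Readiness probes defined" check running against a parsed pod spec; the function name and input shape are assumptions for illustration:

```python
# A hypothetical check that is both reflective and actionable: it
# inspects the actual pod spec, and its name tells the user exactly
# what to add in order to pass.

def readiness_probes_defined(pod_spec: dict) -> bool:
    """Check "Readiness probes defined": passes only when every
    container in the pod spec declares a readinessProbe."""
    containers = pod_spec.get("containers", [])
    return bool(containers) and all("readinessProbe" in c for c in containers)

# Example pod spec, as it would appear in a deployment manifest:
spec = {
    "containers": [
        {"name": "api",
         "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}}},
        {"name": "sidecar"},  # no probe -> the check fails, and the fix is obvious
    ]
}
print(readiness_probes_defined(spec))  # False
```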
The Platform Responsibility
There are two main roles for the platform:
Define and show the scores - be the platform where scores are managed and reported (see the sketch after this list). All of the common portals have a way to define scores, and some of the premium ones even offer campaign features that help you improve scores across the organization.
Enable improvements - the easier it is for users to achieve high scores, the higher the scores will be. If scorecards are a top priority in the organization, you should create ways to improve them easily and fast.
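As a rough illustration of the first role, here's a minimal sketch of scorecards being defined and reported openly; the structure and field names are assumptions, not any specific portal's format:

```python
from dataclasses import dataclass

@dataclass
class Check:
    name: str      # actionable name, e.g. "Readiness probes defined"
    weight: float  # the check's priority within the scorecard
    passed: bool   # latest evaluation result

@dataclass
class Scorecard:
    service: str
    checks: list[Check]

    def score(self) -> float:
        total = sum(c.weight for c in self.checks)
        return sum(c.weight for c in self.checks if c.passed) / total

def report(cards: list[Scorecard]) -> None:
    """The "show the scores" half: an open report anyone can read."""
    for card in sorted(cards, key=lambda c: c.score(), reverse=True):
        print(f"{card.service:<20} {card.score():.0%}")

report([
    Scorecard("payments-api", [Check("0 Critical CVEs", 0.5, True),
                               Check("Redacted Logging", 0.5, False)]),
    Scorecard("search", [Check("0 Critical CVEs", 0.5, True),
                         Check("Redacted Logging", 0.5, True)]),
])
```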
Score Groups
There are many score categories that can be defined; here are a few popular examples:
Security
Reliability
Performance
Scalability
Quality
Technology/Dependencies Deprecation
Resource Optimization
Test Coverage
Cost
Some of them can be grouped together or become sub-categories of others. It's all about picking what's important for your own organization.
Score Practical Examples
Security:
Check 1: The number of critical & high CVEs is below a defined maximum.
Check 2: The application doesn’t use unnecessary privileges
Reliability:
Check 1: The service has Kubernetes Pod Disruption Budgets defined to limit the number of simultaneous pod disruptions during maintenance or failures, ensuring service availability.
Check 2: The service has at least 5 replicas defined.
Performance:
Check 1: The application's p99 response time is below 40ms.
Check 2: The main DB queries take at most x ms.
Scalability:
Check 1: The service has autoscaling defined using HPA/KEDA.
Check 2: The service's startup time is below 30 seconds, allowing it to scale out quickly.
Technology/Dependencies Deprecation:
Check 1: There are no deprecated packages in the application.
Check 2: The service doesn’t use deprecated APIs.
Resource Optimization:
Check 1: The service uses more than 80% of its defined resources.
Check 2: The service uses HPA.
Cost:
Check: Based on the cost optimization score in your cost tools, define what counts as a good score for the service.
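To show how such checks translate into code, here's a minimal sketch of two of them; the input shapes (a scanner's severity summary and a parsed Kubernetes Deployment manifest) are assumptions:

```python
# Security, check 1: the number of critical & high CVEs is below a max.
# `scan_summary` is assumed to be a severity -> count mapping produced
# by whatever vulnerability scanner you use.
def cve_check(scan_summary: dict[str, int], max_allowed: int = 0) -> bool:
    critical_and_high = scan_summary.get("critical", 0) + scan_summary.get("high", 0)
    return critical_and_high <= max_allowed

# Reliability, check 2: the service has at least 5 replicas defined.
# `deployment` is assumed to be a parsed Kubernetes Deployment manifest.
def replicas_check(deployment: dict, min_replicas: int = 5) -> bool:
    return deployment.get("spec", {}).get("replicas", 1) >= min_replicas

print(cve_check({"critical": 0, "high": 2, "medium": 7}))  # False
print(replicas_check({"spec": {"replicas": 6}}))           # True
```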