Monday, September 2, 2013

Development, SE and SRE teams - why are all of them critical?

Have you ever wondered what happens when you have a very motivated team that is very wrong for a given job? The team works extremely hard, slogging through long hours, killing itself over weekends, and finally comes away dissatisfied with what it has (not) achieved. In other words, a square peg in a round hole. Sound familiar? Read on.

This case study is for the service engineering team in any company of 1000 people or more. In the industry, the service engineering team typically sets up the entire framework that takes the developer’s code to production. This involves setting up robust CI/CD (Continuous Integration/Continuous Delivery), automating some of the engineering and QA tests provided by the Dev team, post-deployment smoke tests, and a full-blown monitoring framework for all alerts around the code.

SE and SRE Teams
Most organizations combine SE and SRE into a single team. They may choose to call it SE or SRE. But make no mistake, this team has two different and very conspicuous flavors – engineering and operations. Because both flavors are built into one team, that team will have folks with great engineering competence as well as folks with an operations bias. However, in some companies, especially the larger ones, the two flavors may be two very different teams working separately under different leadership toward the common goals stated above.

In some places, the engineering-focused team is called DevOps and the operations team is called SRE. Yet other companies call the engineering-focused team the Service Engineering team and the operations team simply the operations team.

At Yahoo, the engineering-biased team is the Service Engineering team (SE team) and the operations-focused team is the Service Reliability Engineering team (SRE team).

When the teams are created ab initio, the work is segregated and defined for each team. The two teams are also seeded differently – the engineering-focused team will have more people who can code, understand the code, get into the innards of the code base and file bugs when we hit issues due to buggy code. They almost tell the developers “here is where your code is throwing an exception, please fix this part”. So they “can read” and “understand” the code, but since they do not “own” the code, they hand off the bug resolution to developers.

Then we have the SRE team, which is our first and second line of defense. It attends to all the alerts, provides first-level investigation and triage, and either resolves them or escalates them to the service engineering team. In a very mature SRE team, we expect 85-90% of alerts to be handled and resolved by the SRE team. The remaining 10-15% that are escalated to the SE team are mostly resolved by SEs. You can expect 1-2% of those alerts to be escalated to the development team.
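To make those percentages concrete, here is a rough sketch of the escalation funnel. The function name and the sample volume of 1000 alerts are made up for illustration, and I am assuming that the 1-2% figure applies to the alerts escalated to SE rather than to all alerts.

```python
def escalation_funnel(total_alerts, sre_resolve_rate, dev_escalation_rate):
    """Split an alert volume across the SRE, SE and development teams.

    Assumption: dev_escalation_rate is a fraction of the alerts that
    reached the SE team, not of the total alert volume.
    """
    resolved_by_sre = round(total_alerts * sre_resolve_rate)
    escalated_to_se = total_alerts - resolved_by_sre
    escalated_to_dev = round(escalated_to_se * dev_escalation_rate)
    resolved_by_se = escalated_to_se - escalated_to_dev
    return {
        "resolved_by_sre": resolved_by_sre,
        "escalated_to_se": escalated_to_se,
        "resolved_by_se": resolved_by_se,
        "escalated_to_dev": escalated_to_dev,
    }


# Hypothetical month: 1000 alerts, SRE resolves 85%, SE escalates 2% to dev.
funnel = escalation_funnel(1000, 0.85, 0.02)
print(funnel)
```

Under those assumptions, only a handful of alerts out of a thousand should ever interrupt a developer, which is what keeps development work plan driven.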

If you look at the work, SRE work is totally interrupt driven, while SE work is partially interrupt driven with a large part being planned work. The developer’s work, on the other hand, should be largely planned so he or she can totally focus on new features, new products and enhancements.

It is possible that over a period, the teams may morph into something better – at that point, we say “oh, this team has really matured into a fantastic SRE or SE or development team”. That is mostly possible if the teams are doing the type of work that they have been designed to do and in the manner (interrupt or plan driven) they have been conceptualized to do. So this is a good scenario, eh?

When things start going awry….
What happens when the scenario doesn’t turn out to be as good as we wanted it to be or as favorable to each team as we would have wished for? Well, then we have a challenge….

If we are not continuously monitoring our teams for the type of skill sets that seed them and the type of work that falls in their laps, it is very possible that the fiber of the team(s) may undergo a mutation – typically, for the worse.

Imagine a scenario where we have attrition in the SRE team. It could be due to leadership or lack of it, lack of good management, gaps in people’s expectations, the company not doing well, the workload being a pure killer…any number of reasons. As an organization, we fail to see this coming, and even after it happens, we do not backfill the attrition immediately or fast enough. The workload on the remaining people continues to increase, since attrition of people doesn’t necessarily translate into a reduction of workload. So, the remaining workforce comes under a resource crunch and work overload. This starts a downward spiral that, if not arrested well and fast enough, can pretty much annihilate the SRE team. Once the SRE team shrinks without a proportional reduction in the workload, we start spilling workload to the SE team. The SE team now suddenly discovers, much to its angst and disappointment, that it is the de facto team doing SRE work while expectations around SE work have not diminished at all. So the SE team starts focusing on totally operational work and bends over backward to keep the Site up.

During this time, the SE team has also changed its work routine as follows so it can accommodate the operations workload that has been thrust on it:
  • Stops going to the development team’s daily scrum meetings, since SE was up late last night, during weekends and over long weekends fighting operations issues and incidents
  • Doesn’t have time to do code review with the dev team
  • Doesn’t have time to build monitoring for the new feature that got pushed last evening through the CI/CD pipeline
  • Backlog of the SE related work starts building up
Not many people realize this, but during this time the SE team has also moved away from being largely driven by planned work through Kanban/Scrum to interrupt-driven work.

Slowly but steadily the SE team becomes the new SRE team. Management and leadership don’t really mind it – they cut the cost down in operating expenses by dismantling a full team that was called SRE team and in their minds and words, they have made the SE team very “efficient”.

The management is so focused on dollars that it misses the deep, dense forest for a few “shining” trees.

The whole change takes place over at least a year; it can’t happen over a few months.

The management is applauded and rewarded for cost cutting and the “success stories” are told and exchanged with other teams. No one yet understands the deeper damage this cost cutting and change has done to the SE team. By the time people and management realize this, the current leadership will have long moved on to a different role, a different company, or a different team to continue the good work there.

Now let us bring our focus back to the poor SE team that has been forced to morph into an SRE team. There are some very brilliant engineers in the SE team who are not happy with the current state of affairs and are waiting for management to put them out of their misery by recreating the SRE team. When they realize that management is not even thinking along those lines – of recreating the SRE team, reinvesting in the SRE people – they make up their minds and jump ship. Jump ship to a different team, a different company, anywhere but their current team.

Now the next phase kicks in: the SE team starts seeing the same attrition that the SRE team saw – reasons may be the same or different – but the good and brilliant team members are the first to leave. The immediate management of the SE team is suddenly running around like headless chickens, trying to firefight by getting enough resources so the Site can be kept up. In their rush to get any and every resource they can find, they naturally lower the hiring bar.

Once they lower the hiring bar, the once brilliant and industry recognized SE team starts hiring sub-standard material for that team. Please be aware that the new hires are not bad or incompetent people. When I say "sub-standard" it doesn't mean that they are incompetent. All I am saying is that they are not the right fit for the SE team. They may be pretty hard working but are ill-suited for the SE team which was until then seeded by brilliant engineers who fully understood how to take very "developer" code and change it into a very "production ready" code.

Now, this is where it gets interesting. In the first phase, we lost the SRE team. In the second phase, we forced the SE team to become SRE and SE combined into one. In the third phase, we made the SE team change totally into an SRE team. In the fourth phase, we started losing the SE team. In the fifth and final phase, we hired and seeded the SE team with a different level of people.

…here is the kicker: the SE team that started operating like an operations team is at its lowest morale since the workload is totally interrupt driven, and it has by now lost the respect of its development team. At this point, the development team pretty much agrees that the SE team doesn't do any level of investigation on any issue before chucking it over the fence to them.

…And the disease spreads to Development team as well….
So now, this is perhaps the last stage: the development team changes its workload from being totally plan driven to substantially interrupt driven. At this stage, a development team that was entirely focused on new products and new features is now forced to split its focus between new features/products and sustenance of existing products or site up. A development team that was able to bring to market at least two big products a year is now struggling to bring even one big product into beta.

The slow development team frustrates their management since their management is trying to catch up or stay ahead of competition. As a result, they are continuously changing roadmap and plan of record.

In the past, the development team used to complete a big product in 4-6 months, which allowed their management to do course corrections rapidly. Now, the course corrections still have to happen at the same pace as earlier, but the development team is moving at a very slow pace, so the same team is forced to absorb course corrections while its beta version is not even out. This results in what the development team largely sees as "scope creep" on the given project. This frustrates the developers, and they see their management as “directionless”.

Now the maladies and issues of the SRE and SE teams have become contagious and started hitting the development team as well. Developers, frustrated by their management’s constant change of direction, start leaving, causing a drain even bigger than the SRE and SE team attrition.

At this point the whole organization is paralyzed by these issues and slows down to the extent that, at some point, it comes to a grinding halt. At this juncture, all the teams are focused on keeping the site up.

How do we rebuild from here?
The first thing we need to do is to baseline our team to estimate the "damage" - to ascertain how much our team's bias has changed.

To measure where an SE, SRE or Development team stands, we can plot all the dimensions required for a good team of that type (SRE, SE with engineering bias, or Dev) on the X axis and then measure each of them on a scale of 0 through 10. Zero on the scale shows total absence of the specific dimension and 10 shows reasonably high proficiency in that dimension.

For the purpose of this article, I will focus only on Service Engineering team.

Let us also assume that hypothetically a very mature Service Engineering team in the industry will perhaps be at 7.5 and above.
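As a rough sketch of what such a baseline could look like, the snippet below scores a team on a few dimensions and averages them. The dimension names are made up for illustration, and taking a plain average as the team's overall score is my assumption; pick the dimensions and the aggregation that fit your own team.

```python
from statistics import mean


def team_score(dimension_scores):
    """Average per-dimension scores (each 0 through 10) into one number.

    Assumption: a plain, unweighted mean is used as the overall score.
    """
    for name, value in dimension_scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must be between 0 and 10, got {value}")
    return mean(dimension_scores.values())


# Hypothetical dimensions for a Service Engineering team.
se_team = {
    "reads and debugs application code": 8,
    "CI/CD automation": 7,
    "monitoring design": 9,
    "code review with dev team": 6,
}

score = team_score(se_team)
print(score)
```

A team scored this way can then be compared against the hypothetical 7.5 mark for a very mature Service Engineering team.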

There is a distinct difference between the Service Engineering team and the Service Reliability Engineering (SRE) team. Both are equally important for a company. At the risk of being shouted down by many, I would say that for a company, the SRE team is more important than the SE team. I draw the metaphor of a hospital. Every hospital has an Emergency Room or ER. Some places also call it a Trauma center. This place receives people who are in dire need of immediate first aid, without which they would die. They receive all the accident, heart attack and gun violence victims… They are America’s lifeline and the first line of response. Without these wonderful folks we would have hundreds of thousands of additional casualties in the US every year.

Then there is a medical system that does more of diagnostic and preventive medicine. These are also the people for whom every day is a “Monday” – they don’t have holidays.

These are the guys who save us daily.

Similarly, the SRE team is our first line of defense. These are the guys who receive the alerts and respond to them immediately. Depending on how egregious an alert is, and how critical a property is, their response may vary. For example, in case of a DoS attack, they may start an IRC channel, collaborate with other companies facing the same challenge and do everything to repel the attack. Actually, a discussion on DoS will need a separate book by itself. LOL!

The SE team is the medical system that has much more time than the ER or SRE team and can therefore focus on addressing the causes as opposed to just the symptoms.

Engineering-Operations Graphs or EO-factor

I call this graph the Engineering-Operations graph, or “EO factor” for short. This is like the pH factor that determines if a solution is acidic or basic. The pH factor of pure water is 7 and is considered neutral. A pH factor above 7 is considered alkaline or basic, and as the pH factor increases beyond 7, the basicity or alkalinity of the solution increases. Similarly, as it goes below 7, the acidity of the solution increases.

The EO factor should be read in the same way: as we move away from the “Operations Line”, the team becomes more biased and focused on a given track – operations or engineering. As you move north of this line, the team tends to become very engineering focused. In the same vein, if you move south of this line, the team has more operations competence.

This graph can be used by any team with different dimensions. For example, an SE or SRE team working on Big Data systems will have somewhat different dimensions than a team working on Mail or a team working on company landing home page. It is entirely up to the managers to figure out correct dimensions to measure their teams and use those dimensions to then chart out the future course. 

Figure: EO Factor

Build the new team...
Once you have ascertained the baseline, take remedial steps to rebuild it - slowly and steadily.

Seed it with the right skill set, keep the right type of workload, see to it that the work comes in the right manner - interrupt or planned - and keep irrigating and feeding it with the right talent. And above all, watch it carefully as shown below.

Watch the team's focus...

Let us take a scenario where you have the EO factor for a team and the team has existed for a long period. You are interested in seeing whether that team stays in the fiber that you created it for or if, over a period, it has changed its bias. A time series graph is a great way to see this kind of trend.

Figure: EO Factor Trending

This time series helps us to understand how our team is trending over a period. Remember, no side is good or bad in this graph. You want to keep a good watch on your team’s tendency to shift its bias from Operations to Engineering or vice versa. Depending on what bias the team was intended to be seeded with, you have a very compelling motivation to keep the bias on the same side. If you are not watchful and very mindful of this, teams always have a tendency to move from one focus to the other.
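If you record the EO factor at regular intervals, a simple way to put a number on the trend is the slope of a straight line fitted through the readings. The sketch below is only one way to do it; the quarterly readings are invented for illustration and the least-squares fit is my choice, not something the graph itself prescribes.

```python
def eo_trend(readings):
    """Return the least-squares slope of EO factor readings per period.

    A negative slope means the team is drifting toward operations,
    a positive slope means it is drifting toward engineering.
    """
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to compute a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(readings) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(readings))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical quarterly EO readings for an SE team sliding into operations work.
quarterly_eo = [7.0, 6.5, 6.0, 5.5]
print(eo_trend(quarterly_eo))
```

A steadily negative slope for an SE team is the early warning sign described earlier: the team is slowly turning into an SRE team.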

We should ensure that we seed an SE team so that it has an EO factor of about 6 or above, and that its EO factor always stays above that level. Similarly, we need to ensure that our SRE team has an EO factor below 5 or 5.5 so it keeps operations as its bias.
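These bands are easy to turn into an automated check. The sketch below uses 6 as the floor for an SE team and 5.5 as the ceiling for an SRE team; the function and constant names are made up, and you should tune the limits to your own teams.

```python
SE_FLOOR = 6.0      # an SE team should stay at or above this EO factor
SRE_CEILING = 5.5   # an SRE team should stay below this EO factor


def in_intended_band(team_kind, eo_factor):
    """Return True if the team's EO factor matches its intended bias."""
    if team_kind == "SE":
        return eo_factor >= SE_FLOOR
    if team_kind == "SRE":
        return eo_factor < SRE_CEILING
    raise ValueError(f"unknown team kind: {team_kind}")


# Hypothetical readings: an SE team that has slipped, an SRE team on track.
print(in_intended_band("SE", 5.2))
print(in_intended_band("SRE", 4.0))
```

Running a check like this every quarter gives you a reason to act long before the team's bias has fully flipped.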

An SRE team can be seeded with very brilliant engineering focused people as well. The challenge is that most likely the brilliant people do not like operations work and may again start rolling the same juggernaut which led to the attrition of SRE in the first place.

Conclusion
We always need to be aware and cognizant of the type of talent pool in every team and make sure that the right team is staffed for doing the right and intended job. This has to be consistently and continually monitored, else we get to a point where we need to make radical changes. Making changes and having those changes make an impact is a multi-year project and therefore painfully slow. Hence, a stitch in time may save nine a few years later.
