Deploying Site Reliability Engineering (SRE) throughout an organization has its challenges, both cultural and technical. We previously discussed how bringing an SRE practice inside Dell Digital, Dell’s IT organization, has improved the reliability and scalability of our eCommerce platforms. Today, we’re sharing more on how we’ve created a centralized SRE Enablement program and set out on a broader mission to help organizations across IT deploy an SRE approach to improve their site operations.
Eighteen months into our enablement effort, we are helping five IT organizations create SRE teams and implement SRE solutions and automation capabilities. Several more organizations want to start or mature their SRE capabilities.
Scaling SRE From the Center Out
In the wake of our successful eCommerce pilot, our SRE Enablement team created a Center of Excellence (COE) to provide the basis for an IT-wide outreach effort via a series of roadshows.
The COE lays out the SRE tools and best practices that help teams improve site reliability: setting up a real-time, end-to-end monitoring ecosystem on desktop and mobile devices, delivering intelligent proactive notifications, automating solutions for recurring issues, reducing operational effort, and reducing the mean time to find and the mean time to repair performance incidents.
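As a rough illustration of the two metrics mentioned above, here is a minimal sketch of how mean time to find and mean time to repair could be computed from incident timestamps. The field names (started, detected, resolved), the sample incidents and the choice to measure repair time from detection are all assumptions made for this example, not a description of Dell's tooling.

```python
from datetime import datetime, timedelta


def mean_delta(deltas):
    """Average a non-empty list of timedelta values."""
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_find(incidents):
    # Assumption: time to find = detection time minus the time the issue began.
    return mean_delta([i["detected"] - i["started"] for i in incidents])


def mean_time_to_repair(incidents):
    # Assumption: time to repair = resolution time minus detection time.
    return mean_delta([i["resolved"] - i["detected"] for i in incidents])


# Hypothetical incident records, for illustration only.
incidents = [
    {
        "started": datetime(2021, 3, 1, 10, 0),
        "detected": datetime(2021, 3, 1, 10, 10),
        "resolved": datetime(2021, 3, 1, 11, 0),
    },
    {
        "started": datetime(2021, 3, 2, 14, 0),
        "detected": datetime(2021, 3, 2, 14, 20),
        "resolved": datetime(2021, 3, 2, 14, 50),
    },
]

print("Mean time to find:", mean_time_to_find(incidents))
print("Mean time to repair:", mean_time_to_repair(incidents))
```

Teams define these metrics differently (some measure repair time from the start of the incident), so the definitions should be agreed on before numbers are compared across organizations.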
If your organization plans to pursue SRE enablement, I recommend creating a centralized place where you build products and formalize a practice that can be scaled with consistency. This will bring down the cost, as well as the time and effort, of making reliability a reality.
As we have added IT organizations to our SRE enablement effort, we have expanded our core team of site reliability engineers in the COE, which now has 35 team members overseeing our SRE products and processes.
Sizing Up SRE Maturity
In each case, the first step in helping a participating organization adopt an SRE strategy is to assess its SRE maturity. We first ask teams to assess themselves on any SRE work they may already be doing. We then evaluate them against a maturity assessment model that measures SRE fundamentals, including current operations monitoring capabilities, their track record in addressing issues, service level objectives, and current roles and responsibilities.
The maturity assessment generates a score that helps teams set priorities and define the cultural shift they need to make to move away from the traditional ticketing approach to site reliability and toward an engineering mindset.
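To make the idea of a maturity score concrete, here is a minimal sketch of a weighted scoring function. The dimension names are loosely based on the fundamentals listed above, but the weights, the 0 to 5 rating scale and the 0 to 100 output are hypothetical choices for illustration, not the actual assessment model.

```python
# Hypothetical dimensions and weights (they sum to 1.0); not the real model.
WEIGHTS = {
    "monitoring": 0.30,
    "incident_track_record": 0.30,
    "service_level_objectives": 0.25,
    "roles_and_responsibilities": 0.15,
}


def maturity_score(ratings, max_rating=5):
    """Return a 0-100 weighted score from per-dimension ratings (0..max_rating)."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= ratings[name] <= max_rating:
            raise ValueError(f"rating for {name} must be between 0 and {max_rating}")
    total = sum(WEIGHTS[name] * ratings[name] / max_rating for name in WEIGHTS)
    return round(100 * total, 1)


def weakest_dimensions(ratings, count=2):
    """Return the lowest-rated dimensions, as candidates for a team's first priorities."""
    return sorted(WEIGHTS, key=lambda name: ratings[name])[:count]


# Hypothetical self-assessment for one team.
team_ratings = {
    "monitoring": 4,
    "incident_track_record": 3,
    "service_level_objectives": 1,
    "roles_and_responsibilities": 2,
}

print("Maturity score:", maturity_score(team_ratings))
print("Focus areas:", weakest_dimensions(team_ratings))
```

A single number is useful for tracking progress over time, while the lowest-rated dimensions point to where a team might focus first.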
Once we have a maturity score, we help participating organizations create their own SRE team and develop the skills they need. In some cases, we work with the organization’s own engineers, who may come from various backgrounds, including software, architecture and networking. One organization we worked with, for example, started its SRE team with five engineers from differing backgrounds. Another team had only one, and we helped the leader create…