SRE SERVICE CATALOGUE


Site Reliability Engineers (SREs) sit at the intersection of software engineering and systems engineering. While there are potentially infinite permutations and combinations of how infrastructure and software components can be put together to achieve an objective, focusing on foundational skills allows SREs to work with complex systems and software, regardless of whether these systems are proprietary, third-party, open systems, run on cloud or on-prem infrastructure, etc. It is particularly important to gain a deep understanding of how these areas of systems and infrastructure relate to and interact with each other. The combination of software and systems engineering skills is rare and is generally built over time with exposure to a wide variety of infrastructure, systems, and software.

SREs bring in engineering practices to keep the site up. Each distributed system is an agglomeration of many components. SREs validate business requirements, convert them to SLAs for each of the components that constitute the distributed system, monitor and measure adherence to those SLAs, re-architect or scale out to mitigate or avoid SLA breaches, and feed these learnings back into new systems or projects, thereby reducing operational toil. SREs therefore play a vital role right from the day 0 design of the system.
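
As a rough illustration of converting a system-level target into per-component SLAs, the sketch below assumes the components sit in series (every request must pass through all of them) and fail independently. Both assumptions are simplifications, and the function names and figures are hypothetical.

```python
import math

def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(component_availabilities)

def per_component_target(system_target, n_components):
    """Equal availability each of n serial components needs to meet system_target."""
    return system_target ** (1.0 / n_components)

# Hypothetical system with three components in series.
components = [0.999, 0.9995, 0.9999]
print(f"composite availability: {serial_availability(components):.5f}")
print(f"per-component target for 99.9% overall: {per_component_target(0.999, 3):.6f}")
```

The composite figure is always lower than the weakest component, which is why a system-level target has to be translated into stricter targets for each part.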

Site reliability engineering (SRE) is a software engineering approach to IT operations. SRE teams use software as a tool to manage systems, solve problems, and automate operations tasks.

A site reliability engineer (SRE) creates a bridge between development and IT operations by taking on the tasks typically done by operations. Rather than being handled manually by an operations team, such tasks are given to engineers who use automation tools to solve problems and build scalable, reliable software systems.

Standardization and automation are at the heart of what an SRE does, especially as systems migrate to the cloud. SREs therefore often have a background in software engineering, systems engineering, or system administration, combined with IT operations experience.

Love DevOps? Wait until you meet SRE

You may have heard of a little company called Google. They invent cool stuff like driverless cars and elevators into outer space. Oh: and they develop massively successful applications like Gmail, Google Docs, and Google Maps. It’s safe to say they know a thing or two about successful application development, right?

They’re also the pioneers behind a growing movement called Site Reliability Engineering (SRE). SRE effectively ends the age-old battles between Development and Operations. It encourages product reliability, accountability, and innovation – minus the hallway drama you’ve come to expect in what can feel like Software Development High School.

How? Let’s look at the basics.

What Does a Site Reliability Engineer Do?

Site reliability engineering involves splitting time between operations and development. For example, a site reliability engineer may be involved with help desk tickets, on-call incidents, manual tasks, etc. In addition to that, a site reliability engineer may also spend their time on proactive projects, such as automation, improving system reliability, etc., trying to reduce the amount of manual work and ensuring all the components (infrastructure/hardware, middleware, software, etc.) that are required to keep the software deployments live are running efficiently.

What Tools do SREs Use?

The tools and software solutions that site reliability engineers use can vary greatly from organization to organization. One of the main reasons is that larger organizations typically have more people on the SRE team, so responsibilities and scope are divided among the team and each SRE has a more focused role. In turn, this reduces the range of tools and platforms each SRE uses. For example, in a larger enterprise organization, an SRE may just work in Jenkins all day, every day.

Where Can I Learn More about Site Reliability Engineering?

The term “Site Reliability Engineer” is attributed to Ben Treynor Sloss, now a Vice President of Engineering at Google. He was asked in 2003 to create and manage a team of seven engineers which eventually led him to create the new role/title. There are a few great online resources written by Ben and several other Google engineering team members that cover everything from the principles and tenets of SREs, SRE roles and responsibilities, to the evolution of the Site Reliability Engineering role and where it stands in today’s DevOps environments. No better way to learn more about site reliability engineering than from the individual and organization that created the role in the first place, right?

Should you consider this career path?

You can become an SRE regardless of your background in software or systems engineering, as long as you have solid foundations in both and a strong incentive for improving and automating. If you are a systems engineer and want to improve your programming skills, or if you are a software engineer and want to learn how to manage large-scale systems, this role is for you. Deepening your knowledge in both areas will give you a competitive edge and more flexibility for the future.

Site Reliability Engineering

SRE Road Map

1. Foundation: Understand Basic Concepts

Before diving into SRE-specific topics, you should have a solid understanding of:

Operating Systems (Linux/Unix)

  • Basic commands, file system, process management, etc.
  • Shell and system scripting (Bash, Python, etc.)

Networking Basics

  • IP addressing, DNS, HTTP/S, TCP/IP, SSL/TLS
  • Load balancing, firewalls, proxies, VPNs
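
To make the IP addressing item concrete, here is a small sketch using Python's standard `ipaddress` module to inspect a CIDR block and test whether an address falls inside it. The helper name `in_subnet` and the addresses are made up for illustration.

```python
import ipaddress

def in_subnet(address, cidr):
    """Return True if the IP address belongs to the given CIDR block."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)

# Hypothetical private network used only for this example.
network = ipaddress.ip_network("10.0.4.0/22")
print(network.netmask)          # subnet mask implied by the /22 prefix
print(network.num_addresses)    # total addresses in the block
print(in_subnet("10.0.5.17", "10.0.4.0/22"))
print(in_subnet("10.0.8.1", "10.0.4.0/22"))
```

Checks like this are handy when reasoning about firewall rules or load balancer allow-lists without having to do the bit arithmetic by hand.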

Programming/Scripting Languages

  • Scripting with Python, Bash, or Go
  • Basics of object-oriented programming
  • Understanding of APIs and microservices

Version Control (Git)

  • Basic Git operations (clone, commit, merge)
  • Working with remote repositories (GitHub, GitLab, Bitbucket)

2. System Administration and Automation

Once you’ve got the foundation, focus on automation and systems management, which are core aspects of SRE.

Infrastructure as Code (IaC)

  • Tools like Terraform, Ansible, Puppet, or Chef for automating infrastructure deployment
  • YAML/JSON configurations, templating

Configuration Management

  • Managing and automating server configurations using tools like Ansible, Chef, or Puppet

Containerization and Orchestration

  • Understanding Docker for creating containers
  • Kubernetes for container orchestration (pods, services, deployments, scaling)
  • Helm for Kubernetes package management

Cloud Platforms

  • Familiarity with AWS, GCP, Azure (Compute, Storage, Networking)
  • Services like EC2, S3, Lambda, Cloud Functions, Kubernetes Engine, Cloud Load Balancer, etc.

3. Reliability Engineering Core Skills

At this stage, you’ll start focusing more specifically on the SRE field.

Monitoring and Observability

  • Prometheus, Grafana, Nagios, or Zabbix for monitoring system health
  • Distributed tracing (e.g., Jaeger, Zipkin)
  • Log management tools (e.g., ELK stack or Fluentd)

Service Level Objectives (SLOs), Service Level Indicators (SLIs), Service Level Agreements (SLAs)

  • Understanding the key SRE principles (SLOs, SLIs, SLAs)
  • Setting up metrics that define reliability (availability, latency, error rate, etc.)
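
The relationship between an SLI, an SLO, and the resulting error budget can be sketched in a few lines. This is a minimal example assuming a request-based availability SLI; the function names, the 99% objective, and the request counts are hypothetical.

```python
def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_requests / total_requests

def error_budget_remaining(slo, good_requests, total_requests):
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0
    allowed_failures = (1.0 - slo) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures

# Hypothetical measurement window: 1000 requests, 5 failures, 99% SLO.
print(f"SLI: {availability_sli(995, 1000):.3f}")
print(f"budget remaining: {error_budget_remaining(0.99, 995, 1000):.2f}")
```

An SLA would typically be a looser, contractual version of the same objective, so a team that manages its error budget against the SLO has a safety margin before the SLA is at risk.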

Incident Management

  • On-call rotation best practices
  • Effective response to incidents, postmortem analysis, and learning from failures
  • Tools like PagerDuty, OpsGenie, or VictorOps for incident management

Automation and Self-Healing

  • Automating common tasks to reduce manual work (e.g., patching, scaling)
  • Building self-healing systems to respond automatically to failures (e.g., autoscaling, automated failover)

4. Advanced Concepts in SRE

As you advance, focus on more sophisticated topics and tools used at scale.

Distributed Systems

  • Deep understanding of distributed systems principles: CAP theorem, consensus algorithms (Paxos, Raft)
  • How to build resilient, scalable systems
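
Majority-based consensus protocols such as Raft and Paxos make progress only while a majority of nodes can communicate. The arithmetic behind cluster sizing is short enough to sketch; the function names are illustrative.

```python
def majority_quorum(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1

def tolerated_failures(cluster_size):
    """Nodes that can fail while a majority quorum is still reachable."""
    return cluster_size - majority_quorum(cluster_size)

for size in (3, 4, 5):
    print(size, majority_quorum(size), tolerated_failures(size))
```

Note that a four-node cluster tolerates no more failures than a three-node one, which is why odd cluster sizes are the usual choice.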

Chaos Engineering

  • Testing system resilience under failure conditions (e.g., Chaos Monkey, Gremlin)
  • How to perform controlled experiments to improve fault tolerance
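
The idea of a controlled experiment can be shown at small scale with a fault-injecting wrapper. This is only a sketch and does not use the API of Chaos Monkey or Gremlin; `with_fault_injection`, `fetch_price`, and the failure rate are hypothetical.

```python
import random

def with_fault_injection(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of calls raise an injected error."""
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault (chaos experiment)")
        return func(*args, **kwargs)

    return wrapper

def fetch_price(item):
    """Hypothetical dependency call that normally succeeds."""
    return {"widget": 10}.get(item, 0)

# Experiment: does the caller degrade gracefully when the dependency fails?
flaky_fetch = with_fault_injection(fetch_price, failure_rate=0.3, rng=random.Random(42))
fallbacks = 0
for _ in range(100):
    try:
        flaky_fetch("widget")
    except ConnectionError:
        fallbacks += 1  # a real caller would serve a cached or default value here
print(f"calls that needed the fallback path: {fallbacks}")
```

Seeding the random generator keeps the experiment repeatable, so a fix can be verified against the same sequence of injected failures.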

Performance Tuning

  • Techniques for improving performance at scale: database optimization, caching strategies, reducing latency
  • Tools like Redis, Varnish, CDNs
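
One of the simplest caching strategies is a time-to-live (TTL) cache, which is conceptually what an expiring key in Redis or a cached object in Varnish provides. The in-process sketch below is an illustration only; the `TTLCache` class and its injectable clock are assumptions made to keep the example self-contained and testable.

```python
import time

class TTLCache:
    """Cache values for ttl_seconds, recomputing them after they expire."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so tests can control time
        self._store = {}            # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # cache hit: entry has not expired yet
        value = compute()           # cache miss or expired: do the slow work
        self._store[key] = (now + self.ttl, value)
        return value

calls = []

def slow_lookup():
    """Hypothetical expensive operation, such as a database query."""
    calls.append(1)
    return "result"

cache = TTLCache(ttl_seconds=30)
cache.get_or_compute("report", slow_lookup)
cache.get_or_compute("report", slow_lookup)
print(f"slow lookups performed: {len(calls)}")
```

The TTL is the trade-off knob: a longer value cuts latency and backend load, while a shorter one limits how stale the served data can be.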

CI/CD Pipelines

  • Setting up continuous integration and delivery pipelines
  • Tools like Jenkins, GitLab CI, CircleCI, ArgoCD

Security Practices

  • Basic security practices, such as encryption, securing APIs, and identity management
  • Security tools and standards like Vault, Kubernetes RBAC, or OAuth

5. Real-World Experience

Nothing beats hands-on experience in building and maintaining real-world systems.

  • Participate in open-source projects or contribute to internal tools.
  • Engage in on-call rotations and incident response simulations.
  • Work with teams to improve system reliability and performance.

6. Soft Skills and Communication

SREs work closely with development, operations, and support teams, so communication is key.

  • Writing clear and concise documentation.
  • Analyzing incidents and writing postmortem reports.
  • Collaborating effectively with various teams (DevOps, developers, QA).

7. Continuous Learning

The world of SRE is constantly evolving, so continuous learning is vital.

  • Stay updated on the latest trends and tools.
  • Participate in SRE-focused communities and forums (e.g., SREcon).
  • Read books like Site Reliability Engineering by Google and The Site Reliability Workbook.

Certifications

While certifications are not mandatory, they can help in certain areas:

  • Google Professional Cloud DevOps Engineer
  • AWS Certified DevOps Engineer
  • Certified Kubernetes Administrator (CKA)

Operating Systems

Distributed Systems

Networking

Programming Languages

Python

Go

Web Servers

Nginx

Cluster Management

Cloud

Amazon AWS

Kubernetes

Continuous Integration | Continuous Delivery

Containers