SRE SERVICE CATALOGUE
Site Reliability Engineers (SREs) sit at the intersection of software engineering and systems engineering. Infrastructure and software components can be combined in practically unlimited ways to achieve an objective, but a focus on foundational skills lets SREs work with complex systems and software regardless of whether they are proprietary, third-party, or open, and whether they run in the cloud or on-premises. It is particularly important to gain a deep understanding of how these areas of systems and infrastructure relate to and interact with each other. The combination of software and systems engineering skills is rare and is generally built over time through exposure to a wide variety of infrastructure, systems, and software.
SREs bring engineering practices to the job of keeping the site up. Each distributed system is an agglomeration of many components. SREs validate business requirements and convert them to SLAs for each component of the distributed system. They monitor and measure adherence to those SLAs, and re-architect or scale out to mitigate or avoid SLA breaches. They then feed these learnings back into new systems and projects, which reduces operational toil. For this reason, SREs play a vital role from the day 0 design of a system onward.
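To make the idea of per-component SLAs concrete, here is a minimal Python sketch. It assumes a request needs every component in its path to be up and that components fail independently, which real systems only approximate. The component names, the targets, and the 30-day window are hypothetical values chosen for illustration; none of them come from a specific system.

```python
from math import prod

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def composite_availability(component_targets):
    """Availability of a request path that needs every component to be up.

    Assumes components fail independently of each other.
    """
    return prod(component_targets)


def allowed_downtime_minutes(target, period_minutes=MINUTES_PER_30_DAYS):
    """Downtime that an availability target permits over the period."""
    return (1.0 - target) * period_minutes


def budget_remaining(target, observed_downtime_minutes,
                     period_minutes=MINUTES_PER_30_DAYS):
    """Minutes of allowance left; a negative value means the target was missed."""
    allowed = allowed_downtime_minutes(target, period_minutes)
    return allowed - observed_downtime_minutes


if __name__ == "__main__":
    # Hypothetical per-component targets, for illustration only.
    components = {"load_balancer": 0.9999, "api": 0.999, "database": 0.9995}
    end_to_end_target = 0.998  # hypothetical business requirement

    achievable = composite_availability(components.values())
    print(f"Best-case end-to-end availability: {achievable:.4%}")
    print(f"Meets the requirement: {achievable >= end_to_end_target}")

    for name, target in components.items():
        minutes = allowed_downtime_minutes(target)
        print(f"{name}: {minutes:.1f} minutes of downtime allowed per 30 days")
```

The point of the sketch is that a chain of components is less available than its weakest member, so each component usually needs a stricter target than the end-to-end requirement. Comparing measured downtime against the allowance, as `budget_remaining` does, is one simple way to decide when to scale out or re-architect.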
Site reliability engineering (SRE) is a software engineering approach to IT operations. SRE teams use software as a tool to manage systems, solve problems, and automate operations tasks.
A site reliability engineer (SRE) creates a bridge between development and IT operations by taking on tasks typically done by operations. Instead of being handled by a separate operations team, these tasks go to engineers who use automation tools and solve problems by building scalable, reliable software systems.
Standardization and automation are at the heart of what an SRE does, especially as systems migrate to the cloud. SREs therefore often have a background in software engineering, systems engineering, or system administration, combined with IT operations experience.
Love DevOps? Wait until you meet SRE
You may have heard of a little company called Google. They invent cool stuff like driverless cars and elevators into outer space. Oh: and they develop massively successful applications like Gmail, Google Docs, and Google Maps. It’s safe to say they know a thing or two about successful application development, right?
They’re also the pioneers behind a growing movement called Site Reliability Engineering (SRE). SRE effectively ends the age-old battles between Development and Operations. It encourages product reliability, accountability, and innovation – minus the hallway drama you’ve come to expect in what can feel like Software Development High School.
How? Let’s look at the basics.
What Does a Site Reliability Engineer Do?
Site reliability engineering involves splitting time between operations and development. On the operations side, a site reliability engineer may handle help desk tickets, on-call incidents, and manual tasks. On the development side, they spend time on proactive projects such as automation and reliability improvements. The aim is to reduce the amount of manual work and to ensure that every component required to keep software deployments live (infrastructure and hardware, middleware, software, and so on) runs efficiently.
What Tools do SREs Use?
The tools and software solutions that site reliability engineers use can vary greatly from organization to organization. One of the main reasons is that larger organizations typically have more people on an SRE team, so responsibilities and scope are divided among the team and each role becomes more focused. This, in turn, narrows the range of tools and platforms each SRE uses. For example, in a large enterprise, an SRE may work in Jenkins all day, every day.
Where Can I Learn More about Site Reliability Engineering?
The term “Site Reliability Engineer” is attributed to Ben Treynor Sloss, now a Vice President of Engineering at Google. He was asked in 2003 to create and manage a team of seven engineers which eventually led him to create the new role/title. There are a few great online resources written by Ben and several other Google engineering team members that cover everything from the principles and tenets of SREs, SRE roles and responsibilities, to the evolution of the Site Reliability Engineering role and where it stands in today’s DevOps environments. No better way to learn more about site reliability engineering than from the individual and organization that created the role in the first place, right?
Should you consider this career path?
You can become an SRE whether your background is in software or systems engineering, as long as you have solid foundations in both and a strong motivation to improve and automate. If you are a systems engineer who wants to improve your programming skills, or a software engineer who wants to learn how to manage large-scale systems, this role is for you. Deepening your knowledge in both areas will give you a competitive edge and more flexibility for the future.
Site Reliability Engineering
- (Book) Site Reliability Engineering – https://landing.google.com/sre/book/index.html
- (Book) Site Reliability Workbook – https://landing.google.com/sre/workbook/toc/
- (Book) Building Secure and Reliable Systems – https://landing.google.com/sre/resources/foundationsandprinciples/srs-book/
- (Course) Intro to DevOps – https://br.udacity.com/course/intro-to-devops–ud611/
- (Course) Google Cloud Platform for Systems Operations – https://www.coursera.org/specializations/gcp-sysops
- (Course) Measuring and Managing Reliability – https://www.coursera.org/learn/site-reliability-engineering-slos
Operating Systems
- (Course) Introduction to Operating Systems – https://br.udacity.com/course/introduction-to-operating-systems–ud923/
- (Course) Advanced Operating Systems – https://br.udacity.com/course/advanced-operating-systems–ud189/
Automation
- (Tutorial) Ansible – https://www.digitalocean.com/community/tutorials/configuration-management-101-writing-ansible-playbooks
- (Course) Terraform – https://www.udemy.com/learn-devops-infrastructure-automation-with-terraform/learn
- A complete Ansible handbook filled with real-life IT automation use cases – https://ansiblehandbook.com/
- Other Books – https://www.techbeatly.com/category/books/
Distributed Systems
- (Tutorial) Introduction to Distributed Systems Design – http://www.hpcs.cs.tsukuba.ac.jp/~tatebe/lecture/h23/dsys/dsd-tutorial.html
Networking
- (Book) Understanding Linux Network Internals – http://shop.oreilly.com/product/9780596002558.do
Programming Languages
Python
- (Book) Learn Python 3 The Hard Way – https://learnpythonthehardway.org/python3/
- (Course) Developing Scalable Apps in Python – https://br.udacity.com/course/developing-scalable-apps-in-python–ud858/
Go
- (Book) The Go Programming Language – https://www.amazon.com/Programming-Language-Addison-Wesley-Professional-Computing/dp/0134190440
- (Webinar) Go Language for Ops and Site Reliability Engineering – https://www.youtube.com/watch?v=Q_H4hrUez80
- (Hands On) https://gopherlabs.kubedaily.com/
Production Web App
- (Tutorial) https://www.digitalocean.com/community/tutorial_series/building-for-production-web-applications
- (Book) Production Ready Microservices – https://www.amazon.com/gp/product/1491965975/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1491965975&linkCode=as2&tag=susanfowler-20&linkId=8e434210b002d00be8507454a75c11ff
Web Servers
Nginx
- (Course) Nginx Fundamentals – https://www.udemy.com/nginx-fundamentals/
Cluster Management
Kubernetes
- (Tutorial) Kubernetes Bootcamp – https://kubernetes.io/docs/tutorials/kubernetes-basics/
- Principles for Designing and Deploying Scalable Applications on Kubernetes – https://elastisys.com/designing-and-deploying-scalable-applications-on-kubernetes/
- Architecting Applications for Kubernetes – https://www.digitalocean.com/community/tutorials/architecting-applications-for-kubernetes
- Building Globally Distributed Services using Kubernetes Cluster Federation – https://kubernetes.io/blog/2016/10/globally-distributed-services-kubernetes-cluster-federation/
- Kubernetes best practices in production – https://learnk8s.io/production-best-practices
Continuous Integration | Continuous Delivery
- (Course) Continuously Deliver Better Software – https://www.udemy.com/learn-devops-continuously-deliver-better-software
Containers
- (Course) Docker for DevOps – https://www.udemy.com/docker-tutorial-for-devops-run-docker-containers