Site Reliability Engineering

How Google Runs Production Systems?

Table of Contents
Foreword
Preface
Part I – Introduction
Chapter 1 – Introduction
Chapter 2 – The Production Environment at Google, from the Viewpoint of an SRE
Part II – Principles
Chapter 3 – Embracing Risk
Chapter 4 – Service Level Objectives
Chapter 5 – Eliminating Toil
Chapter 6 – Monitoring Distributed Systems
Chapter 7 – The Evolution of Automation at Google
Chapter 8 – Release Engineering
Chapter 9 – Simplicity
Part III – Practices
Chapter 10 – Practical Alerting
Chapter 11 – Being On-Call
Chapter 12 – Effective Troubleshooting
Chapter 13 – Emergency Response
Chapter 14 – Managing Incidents
Chapter 15 – Postmortem Culture: Learning from Failure
Chapter 16 – Tracking Outages
Chapter 17 – Testing for Reliability
Chapter 18 – Software Engineering in SRE
Chapter 19 – Load Balancing at the Frontend
Chapter 20 – Load Balancing in the Datacenter
Chapter 21 – Handling Overload
Chapter 22 – Addressing Cascading Failures
Chapter 23 – Managing Critical State: Distributed Consensus for Reliability
Chapter 24 – Distributed Periodic Scheduling with Cron
Chapter 25 – Data Processing Pipelines
Chapter 26 – Data Integrity: What You Read Is What You Wrote
Chapter 27 – Reliable Product Launches at Scale
Part IV – Management
Chapter 28 – Accelerating SREs to On-Call and Beyond
Chapter 29 – Dealing with Interrupts
Chapter 30 – Embedding an SRE to Recover from Operational Overload
Chapter 31 – Communication and Collaboration in SRE
Chapter 32 – The Evolving SRE Engagement Model
Part V – Conclusions
Chapter 33 – Lessons Learned from Other Industries
Chapter 34 – Conclusion
Appendix A – Availability Table
Appendix B – A Collection of Best Practices for Production Services
Appendix C – Example Incident State Document
Appendix D – Example Postmortem
Appendix E – Launch Coordination Checklist
Appendix F – Example Production Meeting Minutes
Bibliography

	Levon Ritter on AWS DataSync vs S3 Sync
	Joe on AWS Bedrock AgentCore: Enterpr…
	ABDUL YASEEN BABA MO… on TSM
	Heather W on Puppet push Nagios
	Umesh Kumar on Yum gets ‘HTTPS Error 40…
	Pavel on Check Confluence team calendar…
	withanHdammit on Renew AWS credential for a lon…
	Unleashing the Power… on Image-Reader: A project to exp…
	Bob on Build docker image with kaniko…
	Voces De La Tierra on Puppet for Windows: Remote…

Site Reliability Engineering

Table of Contents

Published by Jackie Chen

Leave a comment Cancel reply

Table of Contents

Share this:

Related

Published by Jackie Chen

Leave a comment Cancel reply