Observability Engineering - Charity Majors

Publication date

2021-12-17

ISBN

9781492076421

Rating

★★★★★
Book description

Observability is critical for engineering, managing, and improving complex business-critical systems. By practicing observability, any software engineering team can gain a deeper understanding of system performance, perform ongoing maintenance, and ship the features its customers need. This practical book explains the value of observable systems and shows you how to build an observability-driven development practice.

Authors Charity Majors and Liz Fong-Jones from Honeycomb explain what constitutes good observability, show you how to make improvements from what you’re doing today, and provide practical dos and don’ts for migrating from legacy tooling, such as metrics monitoring and log management. You’ll also learn the impact observability has on organizational culture.

You’ll explore:

The value of practicing observability when delivering and managing complex cloud native applications and systems

The impact observability has across the entire software engineering cycle

Software ownership: how different functional teams help achieve system SLOs

How software developers contribute to customer experience and business impact

How to produce quality code for context-aware system debugging and maintenance

How data-rich analytics can help you find answers quickly when maintaining site reliability

Charity Majors is the cofounder and CTO at Honeycomb, and the coauthor of Database Reliability Engineering. Before that, she worked as a systems engineer and engineering manager for companies like Parse, Facebook, and Linden Lab.

Liz Fong-Jones is a developer advocate and site reliability engineer (SRE) with more than 17 years of experience. She is an advocate at Honeycomb for ...

Table of contents
Foreword
Preface
I. The Path to Observability
1. What Is Observability?
2. How Debugging Practices Differ Between Observability and Monitoring

User reviews
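Arguably the first book to summarize what infrastructure and the kernel need to do together to support a monitoring system. Most of it is fairly abstract and does not get into code or configuration such as Prometheus CRDs, but it does help in understanding monitoring systems. If you are curious about the metrics side, you could separately study the Kubernetes hardening guidance written by NIST (which also covers some less common IdP material).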
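Very insightful, though somewhat repetitive; the core concepts are fully covered in the first few chapters. The main authors all come from the same company, and their product is built on the same approach, so the book sometimes reads like product marketing. In short: for every user event, capture the full runtime context, send it to an analysis platform (such as the one the authors' company sells), and manage the system with SRE methodology, using error budget burn rate as the alerting signal. Compared with the traditional approach of watching predefined alert thresholds on top of a time-series database, the advantage is that it copes better with emergent, novel problems.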
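The burn-rate signal this review mentions can be illustrated with a little arithmetic. The following is only a minimal sketch under stated assumptions, not the book's or any vendor's implementation: the function names, the event counts, and the alert threshold are all hypothetical. It treats burn rate as the observed error rate in a window divided by the error budget allowed by the SLO target, so a value of 1.0 means the budget would be spent exactly over the SLO period.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed in one window.

    slo_target is the fraction of events that must succeed (e.g. 0.999).
    A result of 1.0 means the budget is being spent exactly on pace;
    larger values mean it will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = bad_events / total_events
    error_budget = 1.0 - slo_target  # allowed fraction of bad events
    return error_rate / error_budget


def should_alert(bad_events: int, total_events: int,
                 slo_target: float, threshold: float) -> bool:
    """Hypothetical alert rule: fire when the burn rate reaches threshold."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold


if __name__ == "__main__":
    # Illustrative numbers only: 5 failed events out of 1000 against a 99.9% SLO.
    rate = burn_rate(bad_events=5, total_events=1000, slo_target=0.999)
    print(f"burn rate: {rate:.2f}")
    print("alert:", should_alert(5, 1000, 0.999, threshold=2.0))
```

Unlike a fixed threshold on a single predefined metric, this rule is expressed in terms of the SLO itself, so any failure mode that makes user events fail counts against the budget, whether or not it was anticipated.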