Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Slow going. Game of Thrones premiere is this weekend, so this is a 100+ hour week for me. Let's do another chapter while I reboot servers!

Chapter 2: The Production Environment at Google, from the Viewpoint of an SRE

This chapter gives a high-level and extremely general overview of what Google's infrastructure looks like. Because the word "server" can ambiguously mean hardware or software, it begins by defining a machine as a piece of hardware or a VM, and a server as a piece of software that implements a service. I'll use the same terminology in this and every other chapter going forward.

Google does not have specific machines allocated to specific applications. Within each datacenter, they operate one or more clusters running a datacenter operating system (DCOS) called Borg, which is somewhat similar to Apache Mesos or to Google's open-source Kubernetes container scheduler project. (Presentations from the last several years suggest Google is developing or has already developed a successor to Borg called Omega, but this book refers to Borg, and I'll take it at its word for now.) The goal of Borg is to distribute applications across the cluster in as lightweight and efficient a way as possible. This removes friction for developers and operators while providing maximum elasticity and the ability to scale services up and down to meet demand. Because tasks can run anywhere, Google uses custom service discovery to locate all of its services rather than DNS naming.

Borg runs on top of a custom distributed filesystem called Colossus, the successor to GFS. If you're familiar with other distributed filesystems like HDFS or Gluster or Ceph, you know how this works -- the technical details are out of scope for this summary. Google runs other services on top of Colossus, including Bigtable, Google's petascale NoSQL store; Spanner, which provides an SQL-like query interface on top (like Hive, to continue the Hadoop metaphor); and others that this chapter doesn't name.

Because any application can run anywhere, the cluster is wired in a Clos crossbar topology to ensure plenty of bandwidth is available between any two endpoints. The network is software-defined, with the shortest routes between hops computed centrally and distributed to the network hardware. This is similar to how basically any large-scale cloud datacenter, like AWS, works. QoS is used heavily to ensure that important services get the bandwidth they need. Google runs several kinds of global software load balancers: for DNS, for user services, and for remote procedure calls. We'll get into this further in Chapter 20.

A few other key services: Chubby, a shared-nothing lock service based on Paxos, similar to Apache ZooKeeper; Borgmon, a monitoring program that we'll dive into in Chapter 10.

Most services communicate using remote procedure calls over an infrastructure called Stubby (gRPC is its open-source descendant). Even some local operations are performed over RPC to make them easier to refactor into separate services later.

All Google projects share the same monolithic repository aside from a handful of open-source projects (Android and Chrome), which use independent repos and standard tooling for interoperability reasons. This makes it easier for engineers to spot and fix bugs in components that they rely on. All code at Google goes through code review. Some projects use automated continuous deployment, where the service is automatically deployed to production if all of the test cases pass.

The chapter closes with a description of a sample service called Shakespeare that I'm not going to summarize here, because it's redundant.

DeaconBlues
Nov 9, 2011
Anyone know if the RHEL 7 study guide by Michael Jang is due out any time soon? I've had an alert set up with Amazon for months and I'm getting tired of waiting.

What's the second best study guide for training for EX200 and EX300? Intermediate Linux user here, willing to learn more and gain a cert or two.

Gucci Loafers
May 20, 2006

Ask yourself, do you really want to talk to pair of really nice gaudy shoes?


It's out and sitting on my desk.

DeaconBlues
Nov 9, 2011
Whoop whoop! I checked Amazon a few hours ago and it was still pending a release. I shall check again when I get time, cheers!

EDIT: Just checked and it's been out since 8th April on Amazon US. I have to wait another 4 days.

DeaconBlues fucked around with this message at 10:07 on Apr 15, 2016

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Chapter 3: Embracing Risk

Especially in an always-on mobile-first world, users are communicating with online services using unreliable devices and unreliable connections. If a user's phone is only 99% reliable, it doesn't make sense to obsess over the difference between 99.9% and 99.99% service availability (the authors of this chapter later use the term background error rate, which I really like). SRE tries to balance availability with development velocity.

The cost to add reliability to a system is not linear. It costs significantly more to get from 99.9% to 99.99% than from 95% to 99%, and so on. These costs are both material (hardware, software) and opportunity cost (time spent baking in more reliability is time that features aren't being written or code isn't being deployed). Realistic risk management allows Google to develop features and get them into the hands of users faster.

The elastic and geographically distributed nature of Google's workloads makes it hard to measure uptime using traditional measures, so they opt for the simpler approach: availability = successful requests / total requests. This also maps better to things like batch processing systems, which don't benefit whatsoever from traditional definitions of uptime.
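
To make that concrete, here's a quick sketch (mine, not the book's) of the request-based definition next to the downtime a time-based target would allow -- the function names and request counts are made up for illustration:

```python
# Sketch (mine, not from the book): request-based availability next to the
# downtime a time-based target would permit over a 30-day month.

def request_availability(successful: int, total: int) -> float:
    """Availability = successful requests / total requests."""
    return successful / total

def monthly_downtime_minutes(target: float, days: int = 30) -> float:
    """Minutes of downtime a time-based availability target allows."""
    return (1 - target) * days * 24 * 60

print(request_availability(9_999_912, 10_000_000))  # 0.9999912
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} allows {monthly_downtime_minutes(target):.1f} minutes/month of downtime")
# 99.00% -> 432.0, 99.90% -> 43.2, 99.99% -> 4.3: each extra nine cuts the
# budget by 10x, which is why the cost curve from the previous paragraph isn't linear.
```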

Google's risk management approach is framed in terms of risk tolerance, which is arrived at through SREs collaborating with the product owners. Risk tolerance is a continuum rather than a single number, because services can fail in different ways, with different impacts on different users. Availability targets are constantly shaped by looking at who is impacted, how, what their expectation is for that service's availability, and whether (and how much) they pay for it. Engineering problems are also considered, like whether it's worse for a service to have frequent small errors or failures, or occasional complete outages. Not all risk tradeoffs are necessarily failure-related, either: other metrics, like latency, are considered, especially for infrastructure services with multiple consumers. These consumers may have different priorities; one may prefer low latency while another prefers high throughput.

The chapter closes with a description of error budgets. This same description is repeated in basically every section of the book. I'm not going to waste more time.

Dr. Arbitrary
Mar 15, 2006

Bleak Gremlin
This thread isn't very active, but I do read your chapter summaries. You're not typing into a void here!

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Chapter 4: Service Level Objectives

Only four chapters in, and our material is already getting frustratingly redundant. The authors talk about the difference between Service Level Indicators, Objectives and Agreements -- you know the difference already. An interesting side point here is that Google creates synthetic outages of some infrastructure services like Chubby (the lock service) if availability is too high -- Chaos Monkey, basically.

SLIs tend to fall into one of the following templates:

  • User-facing serving systems: availability, latency, throughput
  • Storage systems: latency, availability, durability
  • Big data: throughput, end-to-end latency

And all these things should produce correct results, of course.
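
To make the SLI/SLO distinction a little more concrete, here's a minimal sketch (my own, with made-up services and targets) of recording targets per template and checking a measurement against them:

```python
# Minimal sketch (mine, with made-up services and targets): SLO targets keyed
# by the SLI templates above, plus a trivial compliance check.

SLOS = {
    # user-facing serving system
    "frontend": {"availability": 0.999, "latency_p99_ms": 400.0},
    # storage system
    "blobstore": {"availability": 0.9999, "latency_p99_ms": 50.0, "durability": 0.99999999999},
    # big data pipeline
    "daily_pipeline": {"end_to_end_latency_hours": 6.0},
}

def meets_slo(service: str, sli: str, measured: float) -> bool:
    """Latency-style SLIs must stay at or below the target; everything else at or above."""
    target = SLOS[service][sli]
    return measured <= target if "latency" in sli else measured >= target

print(meets_slo("frontend", "availability", 0.9992))   # True
print(meets_slo("frontend", "latency_p99_ms", 520.0))  # False: p99 over budget
```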

These indicators must be collected, of course. And we should be careful about how we aggregate them, because aggregation can sometimes produce really misleading results. For example, averages generally don't give you very actionable information. However, breaking the data into distributions and viewing your latency in terms of percentiles (50th, 85th, 95th, 99th) provides a much better view of your application's edge cases and long-tail behaviors. (Simply: if your 99th percentile latency is good, your typical user experience is almost certainly fine.)
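
Here's a toy example (mine, not from the book) of how an average hides the long tail that percentiles expose:

```python
# Toy example (mine): an average hides a long tail that percentiles expose.
import statistics

# 99 fast requests and one pathological one, latencies in milliseconds
latencies = [20] * 99 + [5000]

print(statistics.mean(latencies))  # 69.8 ms -- looks harmless
cuts = statistics.quantiles(latencies, n=100)  # 99 cut points, cuts[p-1] ~ pth percentile
for p in (50, 85, 95, 99):
    print(f"p{p}: {cuts[p - 1]:.1f} ms")
# p50/p85/p95 sit at 20.0 ms while p99 lands near the 5000 ms outlier --
# exactly the tail behavior the average smoothed away.
```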

Teams should try to make the SLIs simple to understand and standardized so that everyone on the team knows exactly what they mean.

Work backwards from desired objectives to specific indicators, rather than trying to define objectives based on whatever indicators you have lying around.

Keep things simple to understand. Have as few SLOs as you can get away with. SLOs aren't set in stone; set them to get an understanding of the system (and its users), but don't be afraid to change them if the targets turn out to be unreasonable in either direction.

There's a "Control Measures" section that's a restatement of Deming's Plan-Do-Check-Act cycle.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Chapter 5: Eliminating Toil

There's some fluff words here about why manual work is bad and automation is good. Skip this chapter.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Dr. Arbitrary posted:

This thread isn't very active, but I do read your chapter summaries. You're not typing into a void here!
Thanks! I learn better when I keep notes, but I might as well vomit my words into a place where other people might find them useful.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Chapter 6: Monitoring Distributed Systems

The chapter starts off with a bunch of definitions of terms. Then, in all seriousness, it follows up with a section called "Why Monitor?", because the authors of this book love wasting your time. Moving on.

Setting reasonable expectations for monitoring is really important. Even with standard monitoring tools and infrastructure throughout the company, within SRE teams of 10-12 engineers, 1-2 are typically "monitoring people" in charge of making the service operable. Google's monitoring systems tend towards fast and simple, with more investment being put into tools to perform analysis after a problem is discovered. Despite Google's new machine learning culture, they generally avoid making systems automatically determine thresholds or detect anomalies. Non-real-time systems doing things like capacity planning typically have more complexity baked in. Dependencies are typically not tracked within the monitoring system, because an application's dependencies can change constantly due to refactoring.

Monitoring should deal with both symptoms and causes. There are some words here about black-box and white-box monitoring, but the writing about the distinction is sloppy and, in my opinion, wrong (actual quote: "Therefore, white-box monitoring is sometimes symptom-oriented, and sometimes cause-oriented, depending on just how informative your white-box is"). A useful nugget is that telemetry should cover all the layers of the request-response cycle: if you don't know how fast the database is responding, you can't tell whether a performance problem is in the database or in the network between the app and the database.
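
That nugget is easy to act on: time the database call separately from the whole request so you can attribute slowness to the right layer. A rough sketch (mine, with a placeholder standing in for a real database client):

```python
# Rough sketch (mine, with a stand-in for a real DB client): time the database
# call separately from the whole request so slowness can be pinned on a layer.
import time

def handle_request(run_query):
    """run_query is a placeholder for whatever actually talks to the database."""
    request_start = time.monotonic()

    db_start = time.monotonic()
    rows = run_query("SELECT 1")  # hypothetical query
    db_ms = (time.monotonic() - db_start) * 1000

    # ... application work on `rows` would happen here ...

    total_ms = (time.monotonic() - request_start) * 1000
    # Emit both numbers: if total_ms is high but db_ms is low, the database
    # (and the network path to it) is off the hook.
    print(f"db={db_ms:.2f}ms total={total_ms:.2f}ms app={total_ms - db_ms:.2f}ms")
    return rows

handle_request(lambda sql: [("ok",)])  # stand-in for a real database client
```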

The four golden signals for monitoring:

  • Latency
  • Traffic
  • Errors
  • Saturation

(Recall Brendan Gregg's USE method, utilization/saturation/errors, which he's described in a number of performance talks.)
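
Here's a minimal sketch (mine, definitely not Borgmon) of what tracking the four golden signals for a single service might look like:

```python
# Minimal sketch (mine, not Borgmon): accumulate the four golden signals for
# one service and derive the numbers you'd actually alert on.
from dataclasses import dataclass, field

@dataclass
class GoldenSignals:
    latencies_ms: list = field(default_factory=list)  # latency
    requests: int = 0                                  # traffic
    errors: int = 0                                    # errors
    utilization: float = 0.0                           # saturation (0..1 of the limiting resource)

    def record(self, latency_ms: float, ok: bool) -> None:
        self.requests += 1
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

signals = GoldenSignals()
signals.record(35.0, ok=True)
signals.record(900.0, ok=False)
signals.utilization = 0.72       # e.g. fraction of worker threads in use
print(signals.error_rate())      # 0.5
```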

The author spends a few pages talking about long tails and how bad aggregations like averages aren't a good way to measure your performance. Sound familiar? The book suffers from a lack of cohesive editing or structure, so it does this every chapter.

Pick the right resolution for your measurements. Keep things simple so people can understand them and so alerts don't fire by mistake. Don't build monitoring systems that try to do too much, because they'll do it badly and become a maintenance burden.

Here are some questions to ask about your alerts that I'm going to reproduce verbatim, because they're good questions:

  • Will I ever be able to ignore this alert, knowing it's benign? When and why will I be able to ignore this alert, and how can I avoid this scenario?
  • Does this alert definitely indicate that users are being negatively affected? Are there detectable cases in which users aren't being negatively impacted, such as drained traffic or test deployments, that should be filtered out?
  • Can I take action in response to this alert? Is that action urgent, or could it wait until morning? Could the action be safely automated? Will that action be a long-term fix, or just a short-term workaround?
  • Are other people getting paged for this issue, therefore rendering at least one of the pages unnecessary?

The chapter closes with a pair of examples of times that SRE teams got alerted too much. One key takeaway: setting your SLOs lower can improve your product's availability by letting engineers actually focus on quality and underlying issues instead of fighting fires that maaaaaaaaybe really aren't that bad.

Overall, this chapter basically just talked about monitoring, and didn't give any particularly insightful approaches for monitoring distributed systems. Maybe later chapters on specific distributed systems will address this better. I was disappointed.
