George Wright
Nov 20, 2005

The NPC posted:

If you are storing team info in metadata, what are your namespace naming conventions? If there isn't team1-redis and team2-redis how do you prevent collisions?

For the record we are using team-app-env so we have webdev-homepage-dev, webdev-homepage-uat, finance-batch-dev etc. with each of these tied to an AD group for permissions. We include the environment in the name because we have 1 nonprod cluster.

Typically where I work, a service owns its own cache or data stores in K8s, so it doesn't make sense to name a namespace after the software powering that cache, let alone to give that cache a separate namespace from the application consuming it.

If the service name is bombadier, the namespace would be bombadier or bombadier-<env>. Within that namespace you would have your app deployment and your cache store(s) defined.

A database team managing a data store would either offer a shared data store in their own namespace, or they would have an operator that is allowed to create a data store in a service’s namespace.

Exceptions always exist. We do our best to steer teams toward best practices, but we still give people enough rope to learn lessons the hard way while we silently judge and tap on the best-practices docs they choose to ignore. At the end of the day it's their decisions, not ours, that typically cause downtime.
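
To make the shape of that concrete, here's a rough client-go sketch of stamping out one of those per-service namespaces with team/env labels so the RBAC or AD-group binding has something to hang off of. The names are illustrative, not anything we actually run, and in practice this would live in whatever provisioning tooling you already have:

code:
// provision-namespace.go - illustrative only; names are made up.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// One namespace per service per environment: <service>-<env>.
	// The labels are what the RBAC / AD-group bindings key off.
	ns := &corev1.Namespace{
		ObjectMeta: metav1.ObjectMeta{
			Name: "bombadier-dev",
			Labels: map[string]string{
				"team": "backend",
				"env":  "dev",
			},
		},
	}
	if _, err := clientset.CoreV1().Namespaces().Create(context.Background(), ns, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("created namespace bombadier-dev")
}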

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Hadlock posted:

Deployments are plenty enough organizational division in 85% of cases
I agree with the rest of your post, but could you clarify this? K8s RBAC is a problem that leaks sewage all over use cases that rely on partial match.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

Hadlock posted:

If the zookeeper app needs redis, there's a redis deployment in zookeeper-dev, zookeeper-staging, and zookeeper-prod namespaces (prod should be on a different cluster). If the platform team or the backend team owns zookeeper, that's fine, just update rbac for that user group

It would have to be a company-wide, ultra-high-performance HA redis cluster to need its own namespace. Deployments are plenty enough organizational division in 85% of cases

In my namespaces you have front end, back end, redis, memcached, and some kind of queue server all together. Most services are pretty low demand (max 100 MB memory) in the lower environments, so you just get your own dedicated redis and your dev environment closely mimics prod down to the config level

Cluster-wide stuff like Prometheus and Loki lives in a shared metrics namespace

Edit: teams don't get their own namespace playgrounds to build weird poo poo that sucks up resources and causes problems. Only services! If team B wants a slack bot/service it gets its own CI/CD and namespace and grafana dashboard just like prod. You can have any color car you want so long as it's black; you can deploy any service you want as long as it follows the deploy-and-monitor pattern of prod

So I read something like this and it seems utterly insane. How many nodes do you have in your cluster? What kind of SLAs do you have? How reliable are your infra services (redis, memcache, etc. )? What kind of resources are you throwing at your shared stuff like coredns? How do you not run into noisy neighbor issues all the time?

LochNessMonster
Feb 3, 2005

I need about three fitty


Blinkz0rz posted:

So I read something like this and it seems utterly insane. How many nodes do you have in your cluster? What kind of SLAs do you have? How reliable are your infra services (redis, memcache, etc. )? What kind of resources are you throwing at your shared stuff like coredns? How do you not run into noisy neighbor issues all the time?

I thought it was just me, but a setup like that looks like a recipe for disaster at the companies I've worked for.

So I’m really curious what kind of scale you’re running this at.

We usually had teams running one or more services. Each team ran their entire setup, so no shared resources. By default we didn't want teams to access each other's services unless explicitly exposed. Most services within a team did need to access each other.

For us it made sense to give each team their own namespace and keep things separated from other teams, while giving them more or less free rein within their own namespace (guardrails were of course in place).

Docjowles
Apr 9, 2009

Annoyed at the terraform AWS provider devs today. They released a new minor version that "fixes" an issue where you could add the same route to a VPC route table multiple times. Which, yeah, that probably shouldn't be allowed. But in practice it didn't hurt anything, it's not like you ended up with multiple routes in reality. AWS just silently ignored the subsequent attempts to create dupes. Now your terraform apply hard fails on the same code.

A module we wrote had a bug, and was creating some harmless dupe routes. I tried to upgrade the provider today and it broke the module. If I remove one of the duplicate declarations, terraform wants to delete the routes. A second plan/apply will restore them since the other declaration is still present. But this still means eating a 30 second network outage. I tried some fuckery with their moved{} syntax but it didn't help in this case, TF still insists on deleting the routes. The best workaround I came up with is manually doing a "terraform state rm" on the resources I am deleting from the code first so it doesn't want to delete them from AWS too. I can pin the provider version to the old version for a while but that's obviously not a long term solution. All of this sucks.

The change they've made is ~technically correct~ but it was not causing any issues whatsoever in practice. Why the hell would you stick this in a 0.01 point release and not sit on it until the next major version with all your other breaking changes :argh:

vanity slug
Jul 20, 2010

Docjowles posted:

Annoyed at the terraform AWS provider devs today. They released a new minor version that "fixes" an issue where you could add the same route to a VPC route table multiple times. Which, yeah, that probably shouldn't be allowed. But in practice it didn't hurt anything, it's not like you ended up with multiple routes in reality. AWS just silently ignored the subsequent attempts to create dupes. Now your terraform apply hard fails on the same code.

A module we wrote had a bug, and was creating some harmless dupe routes. I tried to upgrade the provider today and it broke the module. If I remove one of the duplicate declarations, terraform wants to delete the routes. A second plan/apply will restore them since the other declaration is still present. But this still means eating a 30 second network outage. I tried some fuckery with their moved{} syntax but it didn't help in this case, TF still insists on deleting the routes. The best workaround I came up with is manually doing a "terraform state rm" on the resources I am deleting from the code first so it doesn't want to delete them from AWS too. I can pin the provider version to the old version for a while but that's obviously not a long term solution. All of this sucks.

The change they've made is ~technically correct~ but it was not causing any issues whatsoever in practice. Why the hell would you stick this in a 0.01 point release and not sit on it until the next major version with all your other breaking changes :argh:

OpenTofu lets you use a removed block for this use case, no idea why Terraform hasn't added this

Docjowles
Apr 9, 2009

vanity slug posted:

OpenTofu lets you use a removed block for this use case, no idea why Terraform hasn't added this

drat that is really nice, thanks for the tip. I'm aware of opentofu but had not been following it closely. Didn't realize they had progressed from being a simple fork for license reasons to actually adding sweet new features.

LochNessMonster
Feb 3, 2005

I need about three fitty


vanity slug posted:

OpenTofu lets you use a removed block for this use case, no idea why Terraform hasn't added this

How is OpenTofu maturing? Last time I checked they were adding features like crazy.

Hadlock
Nov 9, 2004

Docjowles posted:

drat that is really nice, thanks for the tip. I'm aware of opentofu but had not been following it closely. Didn't realize they had progressed from being a simple fork for license reasons to actually adding sweet new features.

Just in time

https://www.techtarget.com/searchitoperations/news/366574475/HashiCorp-stock-rises-users-hearts-fall-on-sale-report

HashiCorp is apparently shopping for a buyer to go private. The article doesn't say much more, but one potential buyer, the article speculates without citing evidence, is Broadcom.

Hadlock
Nov 9, 2004

Vulture Culture posted:

I agree with the rest of your post, but could you clarify this? K8s RBAC is a problem that leaks sewage all over use cases that rely on partial match.

Trying to convey that inside a namespace, the largest divisional classification you need is a deployment for accessory services like redis etc. I wouldn't put all the services together in a flat hierarchy inside a single namespace, but also wouldn't split service-specific accessories out into their own namespace

Blinkz0rz posted:

So I read something like this and it seems utterly insane. How many nodes do you have in your cluster? What kind of SLAs do you have? How reliable are your infra services (redis, memcache, etc. )? What kind of resources are you throwing at your shared stuff like coredns? How do you not run into noisy neighbor issues all the time?

At my current job everything runs on like 15 pods across four nodes, so everything works out of the box, batteries included, which is why I wanted to find a reference architecture; it's small enough that not much needs to be modified to work

Previous job was e-commerce and we were running about 140-180 pods 60% of the time and would burst to ~500 based on whatever the marketing people were doing that day, spread across 12-25 nodes. We guaranteed 2 9s uptime during daytime hours and 98% off peak, but I don't think we ever went below 2 9s except for the memcache issue

We ran single-node(!) redis and an HA triplet of memcache; we only had a single memcache outage in two years: someone uploaded new code that was heavily reliant on memcache, and during a burst period we exceeded the 10 Gb/s network of the node long enough for Amazon to shut it off

No other issues

Also had a couple other services we inherited from a siloed team after that VP rage quit, some kind of custom zendesk plugin for the call center to do customer lookup and order status; the way it was designed, it was a pair of services with a bunch of unnecessary loopback calls that needed some assistance from our group to get working, but otherwise nothing exotic

Everything else lived either in a very sedate management/tooling cluster of ~8 nodes or in the dev cluster, which did get noisy from time to time, and then we had a "bombing range"

The bombing range had no SLA and you could do whatever you wanted at any time; my solution to fixing that cluster was deleting it in terraform, then adding it back, and if there were complaints, tapping the "no SLA" sign

Docjowles
Apr 9, 2009

Apparently base Terraform also has that "removed" feature since January. I have no idea why the hell it didn't come up in my google searching but it's in tf 1.7. Did they crib it from opentofu or the other way around?

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

Hadlock posted:

Trying to convey that inside a namespace, the largest divisional classification you need is a deployment for accessory services like redis etc. I wouldn't put all the services together in a flat hierarchy inside a single namespace, but also wouldn't split service-specific accessories out into their own namespace

At my current job everything runs on like 15 pods across four nodes, so everything works out of the box, batteries included, which is why I wanted to find a reference architecture; it's small enough that not much needs to be modified to work

Previous job was e-commerce and we were running about 140-180 pods 60% of the time and would burst to ~500 based on whatever the marketing people were doing that day, spread across 12-25 nodes. We guaranteed 2 9s uptime during daytime hours and 98% off peak, but I don't think we ever went below 2 9s except for the memcache issue

We ran single-node(!) redis and an HA triplet of memcache; we only had a single memcache outage in two years: someone uploaded new code that was heavily reliant on memcache, and during a burst period we exceeded the 10 Gb/s network of the node long enough for Amazon to shut it off

No other issues

Also had a couple other services we inherited from a siloed team after that VP rage quit, some kind of custom zendesk plugin for the call center to do customer lookup and order status; the way it was designed, it was a pair of services with a bunch of unnecessary loopback calls that needed some assistance from our group to get working, but otherwise nothing exotic

Everything else lived either in a very sedate management/tooling cluster of ~8 nodes or in the dev cluster, which did get noisy from time to time, and then we had a "bombing range"

The bombing range had no SLA and you could do whatever you wanted at any time; my solution to fixing that cluster was deleting it in terraform, then adding it back, and if there were complaints, tapping the "no SLA" sign

Oh ok yeah, then based on that, definitely not insane. For one small area of our product we have something like 4000-ish application pods alone running on 175 nodes, which I'm aware definitely warps my perspective on what's reasonable in terms of thinking about cluster topology and shared dependency resourcing.

LochNessMonster
Feb 3, 2005

I need about three fitty


Ah yeah, for a dozen node cluster that sounds perfectly fine.

Scaling to dozens/hundreds of nodes adds a few other dimensions to problems.

FutuerBear
Feb 22, 2006
Slippery Tilde

vanity slug posted:

OpenTofu lets you use a removed block for this use case, no idea why Terraform hasn't added this

fwiw, this is available in old school Terraform as well (since v1.7): https://developer.hashicorp.com/terraform/language/resources/syntax#removing-resources

Hadlock
Nov 9, 2004

Has anyone come up with a use for the new S3 directory buckets? I guess they use the S3 Express One Zone storage class

Seems like you could use them sort of like a medium-latency (0-9 ms) value store, or for caching file objects for applications
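
Something like this is what I'm picturing, anyway: a rough sketch with the aws-sdk-go-v2 S3 client, treating a directory bucket as a dumb key-value cache. The bucket name and keys are made up (directory buckets want that --<az-id>--x-s3 suffix, if I remember right):

code:
// s3-express-cache.go - hedged sketch, not production code.
package main

import (
	"context"
	"io"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// Made-up directory bucket name; they carry the AZ id and an --x-s3 suffix.
	bucket := "session-cache--use1-az4--x-s3"

	// Write a value like you'd SET a cache key.
	if _, err := client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String("sessions/user-1234"),
		Body:   strings.NewReader(`{"cart":["sku-1"]}`),
	}); err != nil {
		log.Fatal(err)
	}

	// Read it back like a GET.
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String("sessions/user-1234"),
	})
	if err != nil {
		log.Fatal(err)
	}
	defer out.Body.Close()
	val, _ := io.ReadAll(out.Body)
	log.Printf("cached value: %s", val)
}
Whether the tail latency is steady enough to lean on it like a real cache is the part I'm curious about.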

Gucci Loafers
May 20, 2006

Ask yourself, do you really want to talk to pair of really nice gaudy shoes?


Has anyone here worked with Azure Functions and Cosmos DB? I don't know if I'm finally losing it, but I find the concept, or at least the ability to implement a binding, freaking impossible. All I am trying to do is simply have my function app with an HTTP trigger query a single row (or document, or whatever Cosmos DB calls it) and increase its value. For whatever reason, the current tutorials no longer work and I don't get how I am supposed to decipher their documentation. Where does the code go exactly? How do I interpret the below article? Or is it because I am not a dev and don't know enough C#? :smith:

Azure Cosmos DB trigger and bindings

Zephirus
May 18, 2004

BRRRR......CHK

Gucci Loafers posted:

Has anyone here worked with Azure Functions and Cosmos DB? I don't know if I'm finally losing it, but I find the concept, or at least the ability to implement a binding, freaking impossible. All I am trying to do is simply have my function app with an HTTP trigger query a single row (or document, or whatever Cosmos DB calls it) and increase its value. For whatever reason, the current tutorials no longer work and I don't get how I am supposed to decipher their documentation. Where does the code go exactly? How do I interpret the below article? Or is it because I am not a dev and don't know enough C#? :smith:

Azure Cosmos DB trigger and bindings

Every time I've done anything more than 'put object' to Cosmos DB in Functions, I've created a CosmosClient using the SDK rather than using bindings. I'm not sure how much overhead this adds if you're doing something like durable functions, but it's easier for me than messing with extra inbound and outbound bindings.

https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/quickstart-dotnet?pivots=devcontainer-codespace#authenticate-the-client
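
For flavor, here's a very rough Go sketch of the same read-and-bump pattern the question described, going straight at the azcosmos SDK instead of bindings. I'm going from memory on the package surface, and the database/container/item names and partition key are all made up, so treat it as the shape of the thing rather than copy-paste; the .NET SDK in the link above is the same dance with ReadItemAsync/ReplaceItemAsync on a Container.

code:
// cosmos-counter.go - hedged sketch of read-increment-replace via the SDK.
package main

import (
	"context"
	"encoding/json"
	"log"
	"os"

	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

// counterDoc mirrors the made-up document: {"id": "...", "pk": "...", "value": N}
type counterDoc struct {
	ID    string `json:"id"`
	PK    string `json:"pk"`
	Value int    `json:"value"`
}

func main() {
	ctx := context.Background()

	client, err := azcosmos.NewClientFromConnectionString(os.Getenv("COSMOS_CONNECTION_STRING"), nil)
	if err != nil {
		log.Fatal(err)
	}
	// Made-up database and container names.
	container, err := client.NewContainer("appdb", "counters")
	if err != nil {
		log.Fatal(err)
	}

	pk := azcosmos.NewPartitionKeyString("global")

	// Read the current document.
	resp, err := container.ReadItem(ctx, pk, "page-views", nil)
	if err != nil {
		log.Fatal(err)
	}
	var doc counterDoc
	if err := json.Unmarshal(resp.Value, &doc); err != nil {
		log.Fatal(err)
	}

	// Bump the value and write it back.
	doc.Value++
	updated, err := json.Marshal(doc)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := container.ReplaceItem(ctx, pk, doc.ID, updated, nil); err != nil {
		log.Fatal(err)
	}
	log.Printf("counter is now %d", doc.Value)
}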

Hadlock
Nov 9, 2004

Me: debugs anything involving publicly hosted anything

It: unreliable bizarre behavior

Me: I'm not sure why, but 60% of the time "empty cache and hard reload" fixes it

20% of the time walking away, doing the dishes, empty cache and reload fixes it

The other 20% of the time: actually a configuration problem

For whatever reason the old existing CloudFront WAF/IP whitelist works perfectly; the apparently identical one built using terraform absolutely does not work :suicide:

5+ minute cycle time is murder for testing, too

Junkiebev
Jan 18, 2002


Feel the progress.

Gucci Loafers posted:

Has anyone here worked with Azure Functions and Cosmos DB? I don't know if I'm finally losing it, but I find the concept, or at least the ability to implement a binding, freaking impossible. All I am trying to do is simply have my function app with an HTTP trigger query a single row (or document, or whatever Cosmos DB calls it) and increase its value. For whatever reason, the current tutorials no longer work and I don't get how I am supposed to decipher their documentation. Where does the code go exactly? How do I interpret the below article? Or is it because I am not a dev and don't know enough C#? :smith:

Azure Cosmos DB trigger and bindings

Should you not use a bus of some sort for this? Distributed writes make me nervous in any “eventually-consistent” datastore.

Hadlock
Nov 9, 2004

Thread opinions on terminating TLS at the load balancer, or at the pod? We don't have any need to do packet inspection between the LB and the pod

Presumably terminating at the pod is best practice

Hadlock fucked around with this message at 07:56 on Mar 30, 2024

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
if you have encryption on your pod network, then either is okay, but otherwise, if you terminate at the LB, you will be sending unencrypted traffic on your local network

ime everyone does it though :ssh:

xzzy
Mar 5, 2009

I make my users terminate at the pod.

Nothing to do with security, the policy is 100% laziness because I don't wanna manage certs for them.

xzzy fucked around with this message at 15:17 on Mar 30, 2024

madmatt112
Jul 11, 2016

Is that a cat in your pants, or are you just a lonely excuse for an adult?

We terminate at the LB because it's better for us to manage the cert infrastructure and expose simple bindings to our devs than to expect them to roll their own bullshit across however many hundreds of microservices we run. Everything inside the environment is plain HTTP and nobody has to gently caress around with SSL connections and all that headache when talking to other internal services and troubleshooting them.

Probably more secure to keep it centrally managed, standardized, and observable than to keep tabs on every dev team’s cert implementations.

George Wright
Nov 20, 2005
If you’re handling PII or you’ve got a reliable, well used, integrated, and supported PKI, then you should terminate at the pod. Otherwise it’s easier to terminate at the LB and let your cloud provider deal with certs.

The Fool
Oct 16, 2003


We use self-signed certs for internal traffic

Our root is automatically installed and teams can self-service their certificates with terraform/venafi
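
From the consuming side there's nothing special either: if the root really is in the system trust store, plain http.Get just works. If a client needs to trust it explicitly, it's only a few lines; this is a minimal sketch and the CA path and URL are made up:

code:
// internal-client.go - hedged sketch; CA path and URL are made up.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Read the internal root CA that gets dropped onto every node/image.
	caPEM, err := os.ReadFile("/etc/ssl/internal/root-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("could not parse internal root CA")
	}

	// Client that only trusts the internal root for server certs.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}

	resp, err := client.Get("https://some-service.internal.example/healthz")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	log.Printf("%s: %s", resp.Status, body)
}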

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
We use istio and for the tls stuff it works great. Scaling istio to our capacity on the other hand...

kaaj
Jun 23, 2013

don't stop, carry on.

Blinkz0rz posted:

We use istio and for the tls stuff it works great. Scaling istio to our capacity on the other hand...

Just curious, what scale are you running Istio at? We have individual clusters with up to 6-8k pods (tens to low hundreds of clusters total) and even then scaling Istio is fun. Biggest offenders were big global namespaces with a high rate of churn, where updates in the mesh need to be propagated to a lot of other peers.

A lot of effort is being put into that overall and we still need to do more to feel that we’re ahead of potential issues.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Some of the developers we newly acquired are trying to force us to use istio because configuring a golang web server to terminate tls with a cert-manager mounted certificate in their pod is too hard.

I want to rip out their cowardly hearts and serve them - securely! - over the internet.

madmatt112
Jul 11, 2016

Is that a cat in your pants, or are you just a lonely excuse for an adult?

Revenge is a dish best served confidential, integral, and available :hai:

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

kaaj posted:

Just curious, what scale are you running Istio at? We have individual clusters with up to 6-8k pods (tens to low hundreds of clusters total) and even then scaling Istio is fun. Biggest offenders were big global namespaces with a high rate of churn, where updates in the mesh need to be propagated to a lot of other peers.

A lot of effort is being put into that overall and we still need to do more to feel that we’re ahead of potential issues.

Almost exactly that. Supposedly the move away from sidecar proxies will improve memory use and a lot of the startup race conditions but I'm not wholly convinced.

George Wright
Nov 20, 2005
We’ve resisted all service meshes and so far no one has had a compelling enough use case to consider one. We’re open to them if someone actually has a valid need for them, but it’s mostly been attempted cargo culting or resume driven development.

We don’t have the team size to support it and quite frankly we’ve still got larger problems to solve so we don’t want the distraction.

kalel
Jun 19, 2012

what are service meshes and what problems do istio sidecars solve? (address me as you would a five-year-old)

George Wright posted:

resume driven development.

lol, gotta remember that one

Warbird
May 23, 2012

America's Favorite Dumbass

The Iron Rose posted:

Some of the developers we newly acquired are trying to force us to use istio because configuring a golang web server to terminate tls with a cert-manager mounted certificate in their pod is too hard.

I want to rip out their cowardly hearts and serve them - securely! - over the internet.

Is it bad? The team I’m embedded with is using it for their K8s routing and it seems fine but everything about K8s is kinda awful so it may not stand out from the background suck.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Warbird posted:

Is it bad? The team I’m embedded with is using it for their K8s routing and it seems fine but everything about K8s is kinda awful so it may not stand out from the background suck.

I have no idea, I’ve never used a service mesh before. But I’m pretty sure it’s more work than:


code:
// https-server.go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

// Paths where cert-manager (or whatever issues the cert) mounts the keypair.
var (
	CertFilePath = "/mnt/certs/server-cert.pem"
	KeyFilePath  = "/mnt/certs/server-key.pem"
)

func httpRequestHandler(w http.ResponseWriter, req *http.Request) {
	w.Write([]byte("Hello, World!\n"))
}

func main() {
	// Load the TLS certificate and key from the mounted files.
	serverTLSCert, err := tls.LoadX509KeyPair(CertFilePath, KeyFilePath)
	if err != nil {
		log.Fatalf("Error loading certificate and key file: %v", err)
	}

	tlsConfig := &tls.Config{
		Certificates: []tls.Certificate{serverTLSCert},
	}
	server := http.Server{
		Addr:      ":4443",
		Handler:   http.HandlerFunc(httpRequestHandler),
		TLSConfig: tlsConfig,
	}
	defer server.Close()

	// Empty cert/key args because TLSConfig already carries the certificate.
	log.Fatal(server.ListenAndServeTLS("", ""))
}
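
One caveat worth flagging, speaking generally rather than about any particular setup: tls.LoadX509KeyPair reads the files once at startup, so when cert-manager renews the mounted certificate the pod keeps serving the old one until it restarts. A minimal sketch of the usual fix, reloading through tls.Config's GetCertificate hook, using the same hypothetical mount paths:

code:
// https-server-reload.go - same server shape, but the keypair is re-read on
// every TLS handshake so a renewed cert on disk gets picked up without a restart.
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

const (
	certFilePath = "/mnt/certs/server-cert.pem"
	keyFilePath  = "/mnt/certs/server-key.pem"
)

func main() {
	tlsConfig := &tls.Config{
		// Called once per handshake; naive but short. Cache the parsed pair
		// and only re-read on file change if handshake volume is high.
		GetCertificate: func(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
			cert, err := tls.LoadX509KeyPair(certFilePath, keyFilePath)
			if err != nil {
				return nil, err
			}
			return &cert, nil
		},
	}

	server := &http.Server{
		Addr: ":4443",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			w.Write([]byte("Hello, World!\n"))
		}),
		TLSConfig: tlsConfig,
	}
	log.Fatal(server.ListenAndServeTLS("", ""))
}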

kaaj
Jun 23, 2013

don't stop, carry on.

Blinkz0rz posted:

Almost exactly that. Supposedly the move away from sidecar proxies will improve memory use and a lot of the startup race conditions but I'm not wholly convinced.

A big win that brought noticeable improvements for us was using Sidecars (the CRD, not the proxies) to explicitly define the endpoints each workload needs to talk to.

We weirdly have a bunch of reasons to mesh (tens of teams, hundreds of microservices and monoliths in the mesh, FedRAMP, sensitive data, all that), so a mesh has its place here. But we have a dedicated team owning Istio and I can't imagine supporting it without a few engineers fully dedicated to that effort.

Really hope ambient will make a difference on resource consumption.

neosloth
Sep 5, 2013

Professional Procrastinator
We tried to run istio and it caused a bunch of outages and upgrade headaches with no tangible benefit. It sounds cool tho

Hadlock
Nov 9, 2004

Every company I've been at, someone wanted to do istio. Nobody was able to justify the engineering time, as we didn't have problems, or at least not problems big enough to necessitate it. Seems neat, though

And yeah, resume driven development is a very real thing. We had one guy just go completely off the rails trying to get promoted; it made my old boss literally rage quit. He was inventing all kinds of insane poo poo, like his own DSL templating system for ECS, when we already had Kubernetes in place. He ended up at a bitcoin dump, which makes total sense

Hadlock fucked around with this message at 10:12 on Apr 1, 2024

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

George Wright posted:

If you’re handling PII or you’ve got a reliable, well used, integrated, and supported PKI, then you should terminate at the pod. Otherwise it’s easier to terminate at the LB and let your cloud provider deal with certs.
Fun fact: the HIPAA Security Rule and many similar compliance regimes don't actually require encryption in transit for your private network. Terminating on-host is something you do to pass third-party audits.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

neosloth posted:

We tried to run istio and it caused a bunch of outages and upgrade headaches with no tangible benefit. It sounds cool tho
Most people shouldn't operate their own data planes, period.

I'm again going to say that if your host offers VPC Lattice or something like it, and you aren't either using it or trying to get it adopted, it's out of stubbornness and not because you're looking out for your users.

LochNessMonster
Feb 3, 2005

I need about three fitty


At a past company we ran istio. Not sure why; we didn't get any benefit out of it and it added a layer of complexity to troubleshooting. Probably resume driven development by the previous lead.

On a different note, I've been trying to get back into terraform after a few years of not using it and was looking into variable precedence. I haven't got the faintest idea what *.auto.tfvars files do differently than regular .tfvars files do, besides taking precedence. A quick google didn't turn up anything. The hashicorp docs also show them in the precedence list but give no other explanation of when/why you'd use them.

Is it purely for setting global vars that should be the same in each tf deployment?
