r u ready to WALK
Sep 29, 2001

the reason nagios runs for years without anyone touching it is that nobody wants to actively maintain it even with a gun to their head

it can sort of work if you write your own custom scripts that autogenerate all the config files for it, good luck maintaining that poo poo by hand though
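for a flavor of what those autogeneration scripts usually end up looking like, here's a minimal sketch in python — the inventory file, output path and check command are all made up, and it assumes the stock generic-host/generic-service templates exist:

code:
#!/usr/bin/env python3
# minimal sketch of a nagios object-config generator: reads a flat
# "hostname address" inventory file and writes host + PING service
# definitions. the paths and inventory format are hypothetical.
INVENTORY = "/etc/nagios/hosts.txt"          # hypothetical inventory file
OUTPUT = "/etc/nagios/conf.d/generated.cfg"  # picked up via cfg_dir

HOST_TMPL = """define host {{
    use        generic-host
    host_name  {name}
    address    {addr}
}}
"""

SERVICE_TMPL = """define service {{
    use                  generic-service
    host_name            {name}
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}}
"""

def main():
    blocks = []
    with open(INVENTORY) as inv:
        for line in inv:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, addr = line.split()
            blocks.append(HOST_TMPL.format(name=name, addr=addr))
            blocks.append(SERVICE_TMPL.format(name=name, addr=addr))
    with open(OUTPUT, "w") as out:
        out.write("\n".join(blocks))

if __name__ == "__main__":
    main()

and then you still get to validate with nagios -v and bounce the daemon every time the inventory changes, which is the part nobody wants to own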


pram
Jun 10, 2001
just lol that you still have to restart the whole loving thing to update anything

pram
Jun 10, 2001
also the latest version of opsview is shiiiiiitttttt

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

my stepdads beer posted:

my prometheus keeps corrupting its data because it's on an nfs share. that's fine because i only use it for some pretty graphs sometimes. prom really suffers from not having good examples for what I think are common scenarios.

anyway we use cacti for all our network stuff because of inertia. prom+snmp-exporter+grafana was tedious as hell. nagios for our alerting and it's OK but kind of a pain re: config files.

for "tracing" we use senty for catching our dumb php apps various issues and fatals

don’t use nfs for databases

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

pram posted:

also the latest version of opsview is shiiiiiitttttt

we use observium, op

Powerful Two-Hander
Mar 10, 2004

Mods please change my name to "Tooter Skeleton" TIA.


lancemantis, those were some good posts that I can relate to a lot


i spent 2+ hours this week on calls while various auditors asked to "take a screenshot of the code that would log an error" so that "we can check that data flows from system a to system b"

they got some poor fucker to sit there grepping logs while they screenshot that too. they asked me if I could 'show them a log of a message' and I said "we don't retain them because that would be pointless"

they also implied that we should archive all logs indefinitely so they could screenshot them from 6 months ago which is the last time we had an actual error what the actual gently caress

Jonny 290
May 5, 2005



[ASK] me about OS/2 Warp
dump nagios and get nargleOS, the best cat monitoring system out there

intense alerts at 4am when the system thinks it deserves a can of food
over 17 hours a day of efficient Sleep Mode

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
we use a combination of elk and the logging software we sell (dogfooding is good) for logging and datadog for monitoring. i think a small part still has some sensu + grafana for monitoring physical assets or something idk

cowboy beepboop
Feb 24, 2001

r u ready to WALK posted:

the reason nagios runs for years without anyone touching it is that nobody wants to actively maintain it even with a gun to their head

it can sort of work if you write your own custom scripts that autogenerate all the config files for it, good luck maintaining that poo poo by hand though

ya you use ansible to template the config files, it's easy. it's not good. but it works.

cowboy beepboop
Feb 24, 2001

Blinkz0rz posted:

we use a combination of elk and the logging software we sell (dogfooding is good) for logging and datadog for monitoring. i think a small part still has some sensu + grafana for monitoring physical assets or something idk

elk or graylog are cool up until the point you have to learn about maintaining an elasticsearch cluster

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

my stepdads beer posted:

elk or graylog are cool up until the point you have to learn about maintaining an elasticsearch cluster

yeah ama about maintaining an elk stack that processes a few tb of logs a day

it loving sucks

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

Blinkz0rz posted:

yeah ama about maintaining an elk stack that processes a few tb of logs a day

it loving sucks

having done this with a much smaller stack, all i can say is :gonk:

suffix
Jul 27, 2013

Wheeee!
prometheus works well enough in practice but the data model feels annoyingly sloppy

little things like interpolating just in case at every opportunity so your integer counters aren't integers anymore, range vector weirdness, deliberate special casing of the __name__ label

like a room where all the corners are the wrong angle but you just shrug and shove your furniture in anyway
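the counter thing is easy to demo: increase()/rate() extrapolate to the edges of the range window, so a counter that only ever moves in whole steps comes back fractional. rough sketch against the standard /api/v1/query endpoint — server address and metric name are made up:

code:
# query prometheus over HTTP and print an increase() over a counter;
# the result is almost never a whole number because of range-window
# extrapolation. URL and metric name are hypothetical.
import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9090"                          # hypothetical server
QUERY = 'increase(http_requests_total{job="app"}[5m])'  # hypothetical metric

url = PROM + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for series in data["data"]["result"]:
    _, value = series["value"]
    # typically prints something like 41.86666666666667 instead of 42
    print(series["metric"].get("instance", "?"), value)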

cowboy beepboop
Feb 24, 2001

Blinkz0rz posted:

yeah ama about maintaining an elk stack that processes a few tb of logs a day

it loving sucks

tb?! no thank you

post hole digger
Mar 21, 2011

pram posted:

just lol that you still have to restart the whole loving thing to update anything

ive spent all morning loving with this crap. whats a better free alternative

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

my bitter bi rival posted:

ive spent all morning loving with this crap. whats a better free alternative

better and free dont really go together in this world

post hole digger
Mar 21, 2011

uncurable mlady posted:

better and free dont really go together in this world

well nagios is free so it looks like im owned then.

cowboy beepboop
Feb 24, 2001

my bitter bi rival posted:

well nagios is free so it looks like im owned then.

prom's alerts and alertmanager seem good but i have never gotten around to migrating

Arcsech
Aug 5, 2008

Blinkz0rz posted:

yeah ama about maintaining an elk stack that processes a few tb of logs a day

it loving sucks

what are the biggest problems you hit?

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

Arcsech posted:

what are the biggest problems you hit?

tbh it was a combination of the logstash indexers using up too much memory and the process getting killed by the oom killer, extremely chatty applications generating a gently caress-ton of logs and overwhelming the cluster, along with the way that elasticsearch handles clustering.

the indexers dying was an easier problem to solve: at first we would just scale the autoscaling group down to 1 and let it scale itself back up (scaling was based on cpu usage.) eventually (after i left the team) they did something to the indexer configuration involving a master node which made things quite a bit more stable. i can ask someone about the fix if you're curious.

for chatty applications we would aggressively tear down autoscaling groups if we determined that they were overwhelming the logging cluster. this didn't happen much but the few times it did i'll be honest and say that it was super satisfying to tell a dev i was killing their app until they reduced the number of logs it generated.

in terms of elasticsearch clustering, we run the cluster in ec2 so when an instance is terminated or loses connectivity or for any other reason the cluster might lose a node, the entire cluster dedicates itself to moving the replicas of the shards that were on the terminated instances elsewhere so that the replication strategy can be maintained. this causes the cluster to go red which means that it won't process new logs until the pending tasks (shards being moved and replicated) complete. i'm sure there's a setting to tune or something but we were never able to figure out a way to tell es to only attempt to execute a small set of tasks while still ingesting data.

the good news is that none of that actually caused long term ingestion issues or data loss; instead the logstash indexer queues would keep backing up until the es cluster went green and then logs would eventually catch up. it wasn't great when logs went 15-30 minutes behind while teams were deploying new services and relied on logs being available to ensure service health but we got through it

tbh we did a pretty good job of building out the automation and monitoring around how we deployed elasticsearch. during my time with that team i remember a few really lovely on-call rotations where most of my time was spent trying to figure out how to get a red cluster to go green quicker but in terms of stability things were pretty good. i don't think we ever ended up actually losing data
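fwiw the knobs that paragraph is reaching for do exist, at least in less ancient versions: delayed allocation plus recovery throttling keep a single dead instance from turning into a full-cluster reshuffle. rough sketch against the REST API — the endpoint is made up and the setting names are worth double-checking against the docs for whatever version you're actually running:

code:
# illustrative only: set delayed shard allocation and recovery throttling
# so losing one node doesn't trigger an immediate mass shard shuffle.
# ES address is hypothetical.
import json
import urllib.request

ES = "http://localhost:9200"   # hypothetical cluster endpoint

def put(path, body):
    req = urllib.request.Request(
        ES + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# wait 5 minutes before reallocating shards away from a node that dropped
# out, instead of immediately copying everything elsewhere
put("/_all/_settings", {
    "settings": {"index.unassigned.node_left.delayed_timeout": "5m"}
})

# cap how hard recoveries can hammer the rest of the cluster
put("/_cluster/settings", {
    "transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 2,
        "indices.recovery.max_bytes_per_sec": "40mb",
    }
})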

pram
Jun 10, 2001
kafka is about 1000x shittier so count your blessings

Arcsech
Aug 5, 2008

Blinkz0rz posted:

tbh it was a combination of the logstash indexers using up too much memory and the process getting killed by the oom killer, extremely chatty applications generating a gently caress-ton of logs and overwhelming the cluster, along with the way that elasticsearch handles clustering.

the indexers dying was an easier problem to solve: at first we would just scale the autoscaling group down to 1 and let it scale itself back up (scaling was based on cpu usage.) eventually (after i left the team) they did something to the indexer configuration involving a master node which made things quite a bit more stable. i can ask someone about the fix if you're curious.

for chatty applications we would aggressively tear down autoscaling groups if we determined that they were overwhelming the logging cluster. this didn't happen much but the few times it did i'll be honest and say that it was super satisfying to tell a dev i was killing their app until they reduced the number of logs it generated.

in terms of elasticsearch clustering, we run the cluster in ec2 so when an instance is terminated or loses connectivity or for any other reason the cluster might lose a node, the entire cluster dedicates itself to moving the replicas of the shards that were on the terminated instances elsewhere so that the replication strategy can be maintained. this causes the cluster to go red which means that it won't process new logs until the pending tasks (shards being moved and replicated) complete. i'm sure there's a setting to tune or something but we were never able to figure out a way to tell es to only attempt to execute a small set of tasks while still ingesting data.

the good news is that none of that actually caused long term ingestion issues or data loss; instead the logstash indexer queues would keep backing up until the es cluster went green and then logs would eventually catch up. it wasn't great when logs went 15-30 minutes behind while teams were deploying new services and relied on logs being available to ensure service health but we got through it

tbh we did a pretty good job of building out the automation and monitoring around how we deployed elasticsearch. during my time with that team i remember a few really lovely on-call rotations where most of my time was spent trying to figure out how to get a red cluster to go green quicker but in terms of stability things were pretty good. i don't think we ever ended up actually losing data

thanks, this is very interesting

it surprises me that the loss of a single node would cause a cluster to go red and stop ingesting, unless you had an index with no replicas (don't do that) and actually had legit data loss. or if your discovery/minimum_master_nodes setup was off, i guess.

e: what major version, if you remember?
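that's also a quick thing to audit: a single lost node should only leave you yellow unless some index is running with zero replicas. small sketch using the health and _cat APIs, cluster address made up:

code:
# print overall cluster status and flag any index configured with zero
# replicas (the "don't do that" case). ES address is hypothetical.
import json
import urllib.request

ES = "http://localhost:9200"   # hypothetical cluster endpoint

def get(path):
    with urllib.request.urlopen(ES + path) as resp:
        return json.load(resp)

health = get("/_cluster/health")
print("cluster status:", health["status"])      # green / yellow / red

for idx in get("/_cat/indices?format=json&h=index,rep"):
    if int(idx["rep"]) == 0:
        print("no replicas:", idx["index"])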

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
iirc it was a very old version somewhere in the 1.7 area

Share Bear
Apr 27, 2004

oh no! promotheus.

ADINSX
Sep 9, 2003

Wanna run with my crew huh? Rule cyberspace and crunch numbers like I do?

thanks op didn't read

Progressive JPEG
Feb 19, 2003

pram posted:

kafka is about 1000x shittier so count your blessings

kafka is extremely good, maybe you're just holding it wrong

Silver Alicorn
Mar 30, 2008

𝓪 𝓻𝓮𝓭 𝓹𝓪𝓷𝓭𝓪 𝓲𝓼 𝓪 𝓬𝓾𝓻𝓲𝓸𝓾𝓼 𝓼𝓸𝓻𝓽 𝓸𝓯 𝓬𝓻𝓮𝓪𝓽𝓾𝓻𝓮
sure it’s great until you find you’ve transformed into a horrible bug-monster

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
i didn't know adam posted in yospos

Arcsech
Aug 5, 2008

Blinkz0rz posted:

iirc it was a very old version somewhere in the 1.7 area

gotcha

it is really hard to overstate how much better elasticsearch has gotten over the past couple years, and even better soon when the new consensus algo ships (no more minimum_master_nodes, among other things)

edit: full disclosure i guess, i have a vested interest in elasticsearch

Arcsech fucked around with this message at 21:21 on Mar 8, 2019

pram
Jun 10, 2001

Share Bear posted:

oh no! promotheus.

pram
Jun 10, 2001

Progressive JPEG posted:

kafka is extremely good, maybe you're just holding it wrong

lol no. it isnt. youve never used it for anything serious stfu. for example


1) kafka doesnt rebalance topics, ever. if a node is down thats it. the replica is just gone. it doesnt 'migrate' because this is 1998
2) kafka doesnt rebalance storage, ever. if you use JBOD it will just randomly put segments wherever it feels like. if a disk is full it just breaks
3) topic compaction impacts the entire cluster performance if its big enough. nothing you can do about it
4) will randomly break and require a full restart if it lags on the zookeeper state
https://issues.apache.org/jira/browse/KAFKA-2729
5) will effortlessly end up with two cluster controllers if one has degraded performance
6) will spend literal hours 'recovering' on a hard restart (kill) if you have compacted segments
7) replicating data to a replaced node will impact the entire cluster performance, hammering the socket server. and this cant be prevented BECAUSE
8) if you throttle performance it impacts the replica manager AND producers
9) leader rebalancing can still temporarily break producers


and more!

pram fucked around with this message at 05:54 on Mar 9, 2019

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
zookeeper is so loving cursed

pram
Jun 10, 2001
people dont believe that kafka doesnt migrate anything or rebalance anything. because elasticsearch does, people assume something like kafka (which is pure magic ftw) does too

but it literally doesnt. its all manual. if you want to reassign a partition replica, you have to do it yourself with the cli tools or some 3rd party thing. and the operation itself isnt transparent, it actually impacts all the consumers and producers while its doing it (tbf es does this too) its loving garbage
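to give a flavor of how manual it is, you write (or script) the reassignment json yourself and hand it to the stock cli tool. topic name, broker ids and the throttle value below are made up, and the --zookeeper form is the old-style invocation (newer releases take --bootstrap-server instead):

code:
# build a partition reassignment file for kafka-reassign-partitions.sh;
# topic and broker ids are hypothetical
import json

reassignment = {
    "version": 1,
    "partitions": [
        {"topic": "events", "partition": 0, "replicas": [1, 2]},
        {"topic": "events", "partition": 1, "replicas": [2, 3]},
        {"topic": "events", "partition": 2, "replicas": [3, 1]},
    ],
}

with open("reassign.json", "w") as f:
    json.dump(reassignment, f, indent=2)

# then something like:
#   kafka-reassign-partitions.sh --zookeeper zk1:2181 \
#       --reassignment-json-file reassign.json --execute --throttle 10000000
# and keep polling with --verify until it finishes, while every consumer
# and producer on the cluster feels the replication traffic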

Progressive JPEG
Feb 19, 2003

works fine on my machine(s)

ADINSX
Sep 9, 2003

Wanna run with my crew huh? Rule cyberspace and crunch numbers like I do?

pram posted:

lol no. it isnt. youve never used it for anything serious stfu. for example


1) kafka doesnt rebalance topics, ever. if a node is down thats it. the replica is just gone. it doesnt 'migrate' because this is 1998
2) kafka doesnt rebalance storage, ever. if you use JBOD it will just randomly put segments wherever it feels like. if a disk is full it just breaks
3) topic compaction impacts the entire cluster performance if its big enough. nothing you can do about it
4) will randomly break and require a full restart if it lags on the zookeeper state
https://issues.apache.org/jira/browse/KAFKA-2729
5) will effortlessly end up with two cluster controllers if one has degraded performance
6) will spend literal hours 'recovering' on a hard restart (kill) if you have compacted segments
7) replicating data to a replaced node will impact the entire cluster performance, hammering the socket server. and this cant be prevented BECAUSE
8) if you throttle performance it impacts the replica manager AND producers
9) leader rebalancing can still temporarily break producers


and more!

Huh, this is interesting. We were playing with kafka at old job because so many things support it and it has per-partition ordering. I knew it was a pain in the rear end to run one, but never knew the reasons why... so this is a lot of reasons.

We were working with Confluent to provide us with a managed instance... I guess they just do all this poo poo behind the scenes? I wonder how they'll do poo poo that actually affects cluster performance? Send the team a notification that its gonna happen? Just never do it?

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
tbh a lot of software people consider magical scaling wizardry is a nightmare and I’m convinced the people bringing it in flee before the consequences hit or have never used it beyond toy projects

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
like spark had a super broken memory model for quite a while, lots of the Hadoop stack is brittle and needs a lot of babysitting

like the noteworthy part of this stuff is that it helps make some stuff feasible but it isn’t “good”

ADINSX
Sep 9, 2003

Wanna run with my crew huh? Rule cyberspace and crunch numbers like I do?

lancemantis posted:

like spark had a super broken memory model for quite a while, lots of the Hadoop stack is brittle and needs a lot of babysitting

like the noteworthy part of this stuff is that it helps make some stuff feasible but it isn’t “good”

When you refer to a broken memory model, is that for spark streaming stuff where the application might leak memory over time? Or does the problem come up in batch execution? I haven't done much spark stuff so I'm curious.

We were able to come up with a pretty solid BIG DATA pipeline using a lot of managed google stuff... but... it was managed by someone else, for all the reasons listed in the thread.

During my interview with the Kinesis team I got the distinct impression that a lot of their job is fighting fires; I realized it's probably a lot more fun to USE these managed systems than it is to work on them

pram
Jun 10, 2001

ADINSX posted:

Huh, this is interesting. We were playing with kafka at old job because so many things support it and it has per-partition ordering. I knew it was a pain in the rear end to run one, but never knew the reasons why... so this is a lot of reasons.

We were working with Confluent to provide us with a managed instance... I guess they just do all this poo poo behind the scenes? I wonder how they'll do poo poo that actually affects cluster performance? Send the team a notification that its gonna happen? Just never do it?

yes we use confluent (the platform, not their cloud) and they said they basically made a bunch of proprietary additions for their managed service. in that sense its like redislabs cloud vs 'redis' in that you cant replicate it with off the shelf stuff (or even their own provided tools like replicator lol)

amazon msk is straight up vanilla kafka and i think its a big joke right now. same with the azure one, i think its literally just the hortonworks ambari kafka

if you have a single topic you wont have many issues. if you run multi-tenant clusters where people are doing compaction and exactly-once and theres 10000 different consumer groups then its a total shitshow


Progressive JPEG
Feb 19, 2003

pram posted:

if you run multi-tenant clusters where people are doing compaction and exactly-once and theres 10000 different consumer groups then its a total shitshow

Progressive JPEG posted:

maybe you're just holding it wrong
