MononcQc
May 29, 2007

uncurable mlady posted:

i wouldn't say it's just in your web app - our entire application is instrumented from webapp all the way down to the db. that said, it's kind of a massive pain in the rear end right now to get trace data from resources you don't directly manage because not everyone uses opentracing and even if they did, wire formats are very tracer dependent.

I have big opinions about what to trace and log, and where to put the probes -- about 15 pages worth of opinions -- that I put in one place at https://ferd.ca/operable-software.html

Mostly, if I have to TL;DR my views, it is that:
  • you have to be aware that usage of the system forces and creates diverging mental models for all users and operators
  • just tracking what was problematic in the past is a losing proposition that will result in unmaintainable messes that don't offer any insights
  • debugging is often not done by just flat out understanding the internals of the whole system, but through trying to understand your interactions with a given set of underlying components and abstractions. Digging a layer below requires coming up with a whole new mental model.
  • making "everything visible" is a stupid idea because it asks of people interacting with the system to know everything that may or may not be relevant no matter their expertise level
  • you therefore want to locate probes for debugging/tracing/logging/metrics at a layer below the one you're planning to interact with. For example, probes in your app should be so users or support figure out if they're configuring/using the app right. Probe in the framework (say middleware in a server) are for developers to figure out if the app they wrote is behaving right, and so on.
  • Building on a stack of abstractions that lack observability features forces you to cope by reinventing them at your own layer and is generally a nightmare

And so the idea is to think in terms of "operator experience" the same way we would do "user experience", and figure out patterns in which to lay information that talks to the different types and levels of expertise of users and operators. If you don't have that, you have a lot of data, but it's not necessarily going to be useful at all.
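
A rough sketch of what I mean by probes aimed at different layers (toy Python, every name here is made up, not something from the article):

import json, logging, time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("probes")

def emit(layer, event, **fields):
    """Emit one structured event tagged with the layer/audience it is meant for."""
    log.info(json.dumps({"layer": layer, "event": event, "ts": time.time(), **fields}))

# framework/middleware layer: for the developer of the app running on top of it
def middleware(handler):
    def wrapped(request):
        start = time.time()
        try:
            return handler(request)
        finally:
            emit("framework", "request_handled",
                 route=request.get("route"), duration_s=round(time.time() - start, 4))
    return wrapped

# application layer: for users/support to see if they're configuring/using it right
@middleware
def handle(request):
    if "api_key" not in request:
        emit("app", "missing_api_key", hint="set MYAPP_API_KEY in the config")
        return {"status": 400}
    return {"status": 200}

handle({"route": "/billing"})   # the app probe fires: missing_api_key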

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

uncurable mlady posted:

i wouldn't say it's just in your web app - our entire application is instrumented from webapp all the way down to the db. that said, it's kind of a massive pain in the rear end right now to get trace data from resources you don't directly manage because not everyone uses opentracing and even if they did, wire formats are very tracer dependent. that said, w3c is working on a tracecontext/tracedata specification that's intended to address this problem by standardizing headers and wire formats for context so you could have a situation where you're using some sort of managed ingress proxy or w/e and it'd be able to create spans as part of a trace that started on a client, etc. could also see the same thing at a managed db where the database service on the provider side is able to pick up traces incoming from the application and emit spans that you'd collect.

are you using tracing now? something home-brewed, or opentracing/opencensus?

not really tracing today. I have event data in ES that looks like spans, I think? (request A started routine P on node N, and another event when it completes), and time series data from node N and resources R, S, T that are slightly to tightly correlated (R can tag all requests from P, S can only show traffic from N, and T can only show high-level perf indicators.)

What I want is something to take this structured Elastic data, look at what resources are directly or indirectly used by that request, and show relevant TS data from Prometheus and log data from ES. If T crashes I want to be able to look at what requests are active in the system. Given 5 crashes, I want to bisect that down to see that requests like A were the only common requests in all five crashes; I'd also like to see that A are taking longer than normal because resource T is reporting high utilization, etc.
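
Roughly this kind of glue, if I ever get around to it (hand-wavey sketch; the endpoints, index and metric names are all invented):

import requests

ES = "http://localhost:9200"
PROM = "http://localhost:9090"

def events_for_request(request_id):
    """Span-like events: request A started routine P on node N, plus a completion event."""
    body = {"query": {"term": {"request_id": request_id}}, "size": 1000,
            "sort": [{"@timestamp": "asc"}]}
    hits = requests.post(f"{ES}/events-*/_search", json=body).json()["hits"]["hits"]
    return [h["_source"] for h in hits]

def timeseries_for(resource, start, end):
    """Matching resource utilization from Prometheus over the request's lifetime."""
    params = {"query": f'resource_utilization{{resource="{resource}"}}',
              "start": start, "end": end, "step": "15s"}
    return requests.get(f"{PROM}/api/v1/query_range", params=params).json()["data"]["result"]

events = events_for_request("A-1234")
resources = {e["resource"] for e in events if "resource" in e}
start, end = events[0]["@timestamp"], events[-1]["@timestamp"]
for r in resources:
    print(r, timeseries_for(r, start, end))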

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

abigserve posted:

my friend you never want to look into doing network or security monitoring

big lol that you decided to @mononcqc with this

abigserve
Sep 13, 2009

this is a better avatar than what I had before

Captain Foo posted:

big lol that you decided to @mononcqc with this

if he has any idea how to do it i'm all ears, i'll send it straight to our infosec team who are currently building a hadoop stack to try and deal with it

MononcQc
May 29, 2007

abigserve posted:

if he has any idea how to do it i'm all ears, i'll send it straight to our infosec team who are currently building a hadoop stack to try and deal with it

Captain Foo mentioned this because I used to work for the routing team at Heroku and maintained part of their logging stack for a while, and I now work at a security company helping them set up some IoT stuff for data acquisition, so it does make for a funny overlap. I have however not worked in infosec directly.

I don't know what exactly your team's doing, but going for hadoop for infosec and networking makes me think that they're trying to straight up do analytics on network traces or at least network metadata (connection endpoints, protocols/certs, payload sizes, etc.) -- so it'd be interesting to figure out what they're actually trying to accomplish. If it's a dragnet thing it's different from actual logging since you would probably have the ability to control some stuff there?

Most network software logs at least tend to have a semblance of structure, so they're not as bad a case as "<Timestamp> could not load user <id> because an exception happened", which essentially requires a full-text search to do anything with.
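
For contrast, the difference I mean is basically this (toy example, made-up fields):

import json, logging, datetime

logging.basicConfig(format="%(message)s", level=logging.INFO)
log = logging.getLogger("demo")

user_id, err = 4521, "ConnectionTimeout"

# the bad kind: only greppable, needs full-text search to answer anything
log.info("%s could not load user %s because an exception happened",
         datetime.datetime.utcnow().isoformat(), user_id)

# the better kind: structured, so "all timeouts for this user" is a filter, not a regex
log.info(json.dumps({"ts": datetime.datetime.utcnow().isoformat(),
                     "event": "user_load_failed", "user_id": user_id, "error": err}))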

abigserve
Sep 13, 2009

this is a better avatar than what I had before

MononcQc posted:

Captain Foo mentioned this because I used to work for the routing team at Heroku and maintained part of their logging stack for a while, and I now work at a security company helping them set up some IoT stuff for data acquisition, so it does make for a funny overlap. I have however not worked in infosec directly.

I don't know what exactly your team's doing, but going for hadoop for infosec and networking makes me think that they're trying to straight up do analytics on network traces or at least network metadata (connection endpoints, protocols/certs, payload sizes, etc.) -- so it'd be interesting to figure out what they're actually trying to accomplish. If it's a dragnet thing it's different from actual logging since you would probably have the ability to control some stuff there?

Most network software logs at least tend to have a semblance of structure, so they're not as bad a case as "<Timestamp> could not load user <id> because an exception happened", which essentially requires a full-text search to do anything with.

You pretty much nailed it, essentially they are trying to build an analytics stack that brings in a bunch of different data sources to be able to assist in root cause analysis and hunting sessions.

The issue is that little to none of the data set is structured, and often the structures that are in place are inconsistent. To illustrate with an example, say you want a consistent view of source IP addresses accessing any web page in the enterprise. You need to be able to parse the following different log formats (see the sketch at the end of this post):

- apache
- nginx
- iis
- netscaler
- firewall connections
- netflow records
- EDR

it's pretty hard. And that's after you ingest all of those data sets into a sane bucket that actually allows you to parse it in the first place.
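
Something like this is the shape of the normalization we'd need, per source (regexes and field names are illustrative only; real formats vary per device and config):

import re

PARSERS = {
    # apache/nginx-style access log: client IP is the first field
    "access_log": re.compile(r'^(?P<src_ip>\d+\.\d+\.\d+\.\d+) \S+ \S+ \[(?P<ts>[^\]]+)\]'),
    # firewall/netscaler-ish key=value export
    "kv_log": re.compile(r'src=(?P<src_ip>\d+\.\d+\.\d+\.\d+).*?dst=(?P<dst_ip>\d+\.\d+\.\d+\.\d+)'),
}

def normalize(source, line):
    """Map whatever the device logged onto one schema: {source, src_ip, raw}."""
    for name, rx in PARSERS.items():
        m = rx.search(line)
        if m:
            return {"source": source, "src_ip": m.group("src_ip"), "raw": line}
    # unknown format: keep it, but flag it so someone can write a parser later
    return {"source": source, "src_ip": None, "unparsed": True, "raw": line}

print(normalize("nginx", '10.1.2.3 - - [26/Feb/2019:21:25:00 +0000] "GET / HTTP/1.1" 200 512'))
print(normalize("firewall", 'action=allow src=10.1.2.3 dst=192.0.2.7 dport=443'))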

MononcQc
May 29, 2007

Right. So the two patterns there are essentially what they're doing and just hadooping the hell out of it, or otherwise treating individual logs as a stream that you have to process into a more standard format. Currently they're likely forwarding logs from specific servers to remote instances by reading them from disk first and then shoving them over a socket (or some syslog-like agent); the cheapest way to go would be to do that stream processing at the source as the agent reads up the logs from the disk and before it forwards them.

This requires deploying the agent to all instances (or as a sidecar), but from that point on, all the data is under a more standardized format, and you can shove it in hadoop or splunk or whatever.

E: do forward unknown/unformattable logs, but annotated as such within a structure, so that they can iteratively clean poo poo up
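
Sloppy sketch of what that source-side agent looks like (host, port, and formats invented; the real thing would be whatever syslog-ish forwarder you already run):

import json, socket, time

def tail(path):
    """Follow a file like `tail -f` and yield new lines as they appear."""
    with open(path) as f:
        f.seek(0, 2)                      # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line.rstrip("\n")

def to_record(line):
    try:
        obj = json.loads(line)
        if isinstance(obj, dict):
            return {"parsed": True, **obj}          # already structured? great
    except ValueError:
        pass
    return {"parsed": False, "raw": line}           # forward it anyway, flagged for cleanup

def run(path, host="logsink.internal", port=5140):
    sink = socket.create_connection((host, port))
    for line in tail(path):
        sink.sendall((json.dumps(to_record(line)) + "\n").encode())

# run("/var/log/myapp/app.log")   # would block; shown for shape only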

abigserve
Sep 13, 2009

this is a better avatar than what I had before

MononcQc posted:

Right. So the two patterns there are essentially what they're doing and just hadooping the hell out of it, or otherwise treating individual logs as a stream that you have to process into a more standard format. Currently they're likely forwarding logs from specific servers to remote instances by reading them from disk first and then shoving them over a socket (or some syslog-like agent); the cheapest way to go would be to do that stream processing at the source as the agent reads up the logs from the disk and before it forwards them.

This requires deploying the agent to all instances (or as a sidecar), but from that point on, all the data is under a more standardized format, and you can shove it in hadoop or splunk or whatever.

E: do forward unknown/unformattable logs, but annotated as such within a structure, so that they can iteratively clean poo poo up

You're spot on, their initial shot was to do stream processing on all the logs using Splunk but licensing on that side and the administrative overhead of maintaining correct splunk forwarder/stream configurations drove them to a dump-in-a-lake model.

Real talk, it's a loving mess and I don't envy whoever has to untangle it all into something that resembles a sane data structure.

CRIP EATIN BREAD
Jun 24, 2002

Hey stop worrying bout my acting bitch, and worry about your WACK ass music. In the mean time... Eat a hot bowl of Dicks! Ice T



Soiled Meat

Progressive JPEG posted:

avoid counters imo

counters are fine if they reset on a boundary like 30 seconds. then you can get a rate
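
the rate math is just the delta over the window, and a reset isn't fatal either (toy sketch, not how any particular tool implements it):

def rate(prev_value, prev_ts, cur_value, cur_ts):
    """Per-second rate between two counter samples; a drop means the counter reset."""
    delta = cur_value - prev_value
    if delta < 0:            # process restarted / counter reset: count from zero
        delta = cur_value
    return delta / (cur_ts - prev_ts)

print(rate(1000, 0, 1600, 30))   # 20 events/sec over a 30s window
print(rate(1600, 30, 90, 60))    # counter reset mid-window -> 3 events/sec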

CRIP EATIN BREAD
Jun 24, 2002

Hey stop worrying bout my acting bitch, and worry about your WACK ass music. In the mean time... Eat a hot bowl of Dicks! Ice T



Soiled Meat
also we write all our metrics to a kinesis stream and consume them elsewhere to put them in an influxdb database, which is queried using grafana.

doesn't have any perceptible overhead.
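
very rough shape of that pipeline (stream/db names invented, the consumer glue is elided):

import datetime, json
import boto3
from influxdb import InfluxDBClient   # influxdb 1.x python client

kinesis = boto3.client("kinesis")

def emit_metric(name, value, tags):
    """App side: fire-and-forget the metric onto the stream."""
    record = {"measurement": name, "tags": tags, "fields": {"value": value},
              "time": datetime.datetime.utcnow().isoformat() + "Z"}
    kinesis.put_record(StreamName="metrics", Data=json.dumps(record).encode(),
                       PartitionKey=tags.get("host", "unknown"))

def write_batch(records):
    """Consumer side: whatever reads the stream just writes the points to influx."""
    influx = InfluxDBClient(host="influx.internal", database="metrics")
    influx.write_points(records)

emit_metric("request_duration_ms", 42.0, {"host": "web-01", "route": "/checkout"})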

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
influxdb more like refluxdb because it gives you heartburn trying to run it

pram
Jun 10, 2001
my company uses splunk for logs and datadog for tracing/monitoring/alerts. these both cost millions of dollars a year. thanks for listening!

CRIP EATIN BREAD
Jun 24, 2002

Hey stop worrying bout my acting bitch, and worry about your WACK ass music. In the mean time... Eat a hot bowl of Dicks! Ice T



Soiled Meat

uncurable mlady posted:

influxdb more like refluxdb because it gives you heartburn trying to run it

i fired it up and its been running since, seems braindead simple?

carry on then
Jul 10, 2010

by VideoGames

(and can't post for 10 years!)

i was pulled aside to help work on our metrics library for java apps, but it turns out it was just a repackaging of dropwizard metrics so i went back to doing what i was doing before

maybe someday i'll work on something that uses it :shrug:

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

CRIP EATIN BREAD posted:

i fired it up and its been running since, seems braindead simple?

i actually dont know, i've never hosed with it. buddy of mine at work was griping about it being hard to use, maybe he's just bad

DONT THREAD ON ME
Oct 1, 2002

by Nyc_Tattoo
Floss Finder

uncurable mlady posted:

i actually dont know, i've never hosed with it. buddy of mine at work was griping about it being hard to use, maybe he's just bad

i've worked at 2 places with influx and at both places i heard zookeeper levels of ops moaning about it.

r u ready to WALK
Sep 29, 2001

influx is easy until you start trying to customize it with rollup policies, multiple retention schedules and other things that aren't part of the default settings

out of the box it just keeps everything forever with full detail instead of doing coarse averaging for historic data, which works pretty well for home setups with a couple hundred metrics on a minute interval but not so great with thousands of servers sending tens of thousands of metrics every few seconds.
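
the fix is basically explicit retention policies plus continuous queries for the downsampling, something along these lines (db/measurement names made up, influx 1.x style):

from influxdb import InfluxDBClient

client = InfluxDBClient(host="influx.internal", database="metrics")

# keep raw data 90 days instead of the default "forever at full resolution"
client.create_retention_policy("ninety_days", "90d", "1",
                               database="metrics", default=True)

# downsample into a long-term policy so historic queries hit coarse averages
client.query('CREATE RETENTION POLICY "two_years" ON "metrics" DURATION 730d REPLICATION 1')
client.query(
    'CREATE CONTINUOUS QUERY "cq_cpu_hourly" ON "metrics" BEGIN '
    'SELECT mean("value") AS "value" INTO "metrics"."two_years"."cpu_hourly" '
    'FROM "cpu" GROUP BY time(1h), * END'
)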

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
one of my previous jobs was basically working on the monitoring project for a large non-tech company and it gave me like PTSD from the horrendous politics of it

i'm strangely getting involved in it again at my current workplace but this is a much more well-scoped case

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
i could probably do some kind of experience based long-post on it because it sucked rear end

CRIP EATIN BREAD
Jun 24, 2002

Hey stop worrying bout my acting bitch, and worry about your WACK ass music. In the mean time... Eat a hot bowl of Dicks! Ice T



Soiled Meat

r u ready to WALK posted:

influx is easy until you start trying to customize it with rollup policies, multiple retention schedules and other things that aren't part of the default settings

out of the box it just keeps everything forever with full detail instead of doing coarse averaging for historic data, which works pretty well for home setups with a couple hundred metrics on a minute interval but not so great with thousands of servers sending tens of thousands of metrics every few seconds.

yeah we don't use the real-time metrics for anything other than monitoring. so we set the retention policy to 3 months and call it a day.

we have a few metrics that are condensed from a continuous query but they never have a problem?

only issue we ever had was when someone decided to use a random GUID as a tag and polluted a DB but it wasn't a big deal, i just deleted them.

Powerful Two-Hander
Mar 10, 2004

Mods please change my name to "Tooter Skeleton" TIA.


here's my log story:

we couldn't figure out what the gently caress was going on because the genius that wrote these service components logged every single work step, including the ones that just said "sleeping for 1 second", so our logger was completely jammed with useless stuff. also stuff that was actually a problem was sometimes logged as info

solution: lets stop logging so much trace poo poo! why not use the log level configuration per-environment to set this! it'll be great!

what actually happened: all logging got disabled completely and now we can't tell what the gently caress is going on, but in a new and exciting way!

also our logger is some home rolled poo poo that nobody knows how to install anymore so we have to hope it never breaks
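
for the record, the per-environment knob can work if a bad setting can't turn logging off entirely, something like (toy sketch, env var name invented):

import logging, os

def configure_logging():
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):        # typo'd or missing level? fall back, don't go dark
        level = logging.INFO
    logging.basicConfig(level=level,
                        format="%(asctime)s %(levelname)s %(name)s %(message)s")
    if level_name not in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"):
        logging.getLogger(__name__).warning("unknown LOG_LEVEL %r, using INFO", level_name)

configure_logging()
logging.getLogger("worker").debug("sleeping for 1 second")    # dropped unless DEBUG
logging.getLogger("worker").warning("queue backlog growing")  # always visible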

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

lancemantis posted:

i could probably do some kind of experience based long-post on it because it sucked rear end

you should!

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
so basically this was an early job for me so its a good example of being young and naive and letting a company screw you over pretty hard and how bad projects can be in these big organizations

this is like mid-late 2000s, when I came on they had a couple applications they used for monitoring -- one that was a licensed solution they used to keep an eye on web applications by hitting pages, which was becoming increasingly irrelevant thanks to being an old product and we were entering the age of javascript hell that it had no idea how to deal with (much less some of the Apache Wicket hell that floated around for a while)

the other was a home-grown java based application that had been written some years ago by some members of an "architecture group" that was later broken apart. It was a somewhat more "flexible" solution, that had a lot of different pieces of functionality to poll things from server metrics, linux daemon statuses, various types of message queues, some custom log scraping, accessing our in-house service API, etc etc and shove all this time-series oriented information in a standard database. it wasnt organized in such a horrible fashion that it was too terrible to comprehend and maintain, though it did take advantage of some weird libraries for unknown reasons, and it had some design decisions that were obviously intended to get ahead of any scalability issues that were probably flawed and wouldn't really work. it also had a cold-fusion based front end for some reason, likely because there were a number of cold-fusion apps in the company and the person that developed that portion was just going with what they knew. we actually rarely used it because in the end it didn't work very well as the number of things being monitored grew -- instead we used some other swing-based application that could better deal with things

i should also point out that operations folks and developers didn't really have much access to either one of these systems; in fact, trying to remember it now, im not sure if they could view the websites that sat in front of them at all -- the operational people might have just been able to see information from alerts that appeared in a legacy console they used, as well as some of those lovely operations status displays sprinkled around the building that CIOs love, and some daily reporting summary pages. the reasoning for any of this is management didn't "trust" any of them to handle these things

as you might have picked up, this place was kind of a technological mess -- there was a legacy mainframe system that had served as the core platform for a lot of the core operations of the company, as well as some of the other business operations, for many decades; there were also a large number of linux servers running various distros (which would eventually become a large number of virtualized linux servers), windows servers, hell there were even some tandems sitting around iirc though I think I never had to deal with those, and along with all these different systems were a multitude of application stacks. Some applications would be developed in-house, others would be contracted out, and in many cases a lot of work was being done by different organizations within the company outside of what would be considered the main IT department -- like marketing or accounting or something might create their own IT group or contract something out thanks to the terrible power dynamics of corporate america

batch database jobs, mainframe terminals, web applications, native applications, etc

the scope of these applications might just be some boring back office accounting and reporting stuff, or they might be something more critical or even safety-impacting to the field operations of the company -- so also all over the place

either way, the (small) group i was in had somehow become responsible for covering it all, whether it had been covered up to that point or not, or even trying to expand that coverage; we served as both the developers and support people -- this was a huge PITA because it meant dealing with an external-facing ticketing system where we had to perform tons of work for teams (since they weren't allowed to do any of it themselves) on top of our own development work

so that kind of lays out the situation i was in

Arcteryx Anarchist fucked around with this message at 21:25 on Feb 26, 2019

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
so year one for me was basically spent learning how to maintain the in-house monitoring system. i was the only real employee on this team outside of the manager, there were 4 other people, 3 of whom were contractors and 1 intern. i've kind of mentioned it before, but this company's use of on- and offshore contractors was pretty widespread, and i feel anymore that the way they were engaging in this was likely against labor regs but what do I know im not a lawyer

2 of the contractors had been on the project for a little while so they knew a lot more about these systems than I did (none of the people on the project were around when the systems were implemented) -- later both of them would not have their contracts renewed as the company went to "exclusive" contracting with a big firm, and they also wouldn't exactly be replaced; they were kind of backfilled by 2 offshore contractors that were more or less worthless since they worked completely different hours so they couldn't really handle doing any of the support ticket work without terrible turnaround times and they weren't that great at doing any of the development work either. im pretty sure the entire reason they were brought on was for the manager's career development, allowing them to claim to have managed offshore resources on a project.

there was also a newer version of the in-house monitoring application that was being developed by a person that had left the company and I basically back-filled the position of (though probably at a lower title and pay rate). in true corporate fashion it was basically the same application but completely re-written with a few changes to libraries and other bits mostly to address something that developer didn't like about the existing application, but also a few new features as well, including a more modern UI that could actually be used to manage the application. i also spent a good portion of my time trying to complete this new version, though in the end this would take a few years to finally release, after a change in management and even a complete re-location within the org chart and pushing back on a fair amount of feature creep.

there would of course be a third iteration of application re-write because all of the scaling flaws were still in it, but i didn't stick around long enough for the completion of that one

thanks to this also being around the time of the financial collapse, my compensation for handling this bumpy first year and keeping my head above the water and everything from falling apart was a 1% or 2% raise and I think a $1000 pre-tax bonus, and this was basically the story for a couple years until my management changed and I got a few title bumps and better comp bumps and bonuses

i think i had more or less been the victim of the career building of the previous management in these early days, as they had gone on to get an MBA and moved from being the head of my small group to maybe a director (basically one step below AVP) by the time I had left; they were on a surface level a very nice person which just kind of underlines how systemically evil this stuff can be

Arcteryx Anarchist fucked around with this message at 21:46 on Feb 26, 2019

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
now for some of the horrendous politics of this stuff

since my group had to do most of the work of managing all this monitoring in addition to the development of the systems to perform it, i got tangled up in the office politics of it all, and monitoring is easily one of those things that goes from being the most important thing in the world all the way up to the CIO level one day, to being derided as a meaningless cost center the next

problem 1: SLAs

so the organization had picked up on using the concept of SLAs for various key applications to help keep the people running those applications accountable and convey how important those applications were; the operations folks were in charge of keeping track of whether those SLAs were being met, and my team basically supplied data to provide insight for that

of course, this also meant that various people likely had meeting those SLAs as part of their performance evaluation, which really just creates an incentive for trying to weasel your way out of anything that makes it look like you might not be meeting your objectives, mostly through blame shifting, and if another application or some part of the application stack couldn't be blamed, then the monitoring itself would become the target because we had obviously not assigned correct blame or inaccurately reported something in their opinion. now, im not going to say they were entirely wrong in some cases with some of this blame shifting, outside of their terrible motivations for it. like many corporate application developers, all they really cared about was hitting shiny feature requirements while the underlying application might be a tech-debted, rotting, unstable mess -- but a mess they hoped they could shift responsibility to others for

sometimes that responsibility shifting might hit the level where they almost wanted our monitoring application to more or less become a reliability wizard for their own rotten application, somehow keeping it propped up through all failure modes, but thankfully this is one bit of scope creep we were able to successfully repeatedly push back on

but there were plenty of meetings that were basically some manager or other application person trying to bully operations people or us into "correcting" their reported SLA metrics (which they sadly often more or less won)

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
problem 2: dumb-ops

its not that the people in the operations center were actually dumb, its more that the company was kind of cheap and pretty much tied their hands; since the IT operations center is a 24-hour, 3-shift outfit, you can probably reduce some of your costs by reducing their responsibilities to argue for compensating them less

they pretty much had all their visibility into systems in the form of a legacy console that had likely started to be used in the late mainframe days, and some of those reporting dashboards developed by our group, and they were basically instructed to look into things when something was "bad" according to them

they had to troubleshoot things based on documentation (in some old lotus document database no less) that was of varying quality, and if they couldn't seem to resolve the issue, they would move on to calling developers; thats right, as a developer in this organization you had the perk of basically being expected to do 24/7/365 on-call support; you might be able to push off getting called until business hours if you could effectively argue that your app wasn't that important, but generally this didn't seem to happen because most managers can easily reason themselves into thinking any application is actually totally business critical. most teams were at least pretty good at rotating this responsibility among themselves though, unless you were on my tiny team where at one point it was basically rotated between me and the only other developer that was actually located in-office on a bi-weekly or maybe at one point monthly basis

of course, as the monitoring person, i would get called a lot

i had to be on the deployment calls, because of course applications don't want that downtime to count towards their SLAs and operations doesn't want to see alerts for applications that are "obviously" down, and since management didn't trust this stuff to be managed by anyone but us, we had to manage doing the blackouts for those windows

and of course a developer getting called at 2AM for a supposed problem, with no compensation for this, who looks into things and feels like it's not a "real" issue, wants to blame-shift the issue to us, so then I get called to either turn something off or tweak something else

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
problem 3: resources

of course, as an application that would bounce between being the most important thing to meaningless, the resources we were given were often third-rate

i was some green developer that ended up being in charge of an application that had performance and scalability needs that were quickly growing and I had little to no idea what I was doing there, and the other persons that might work on it were basically offshore or on-shore contractors with a background purely in the exact kind of unstable rotten business applications that we were trying to monitor, and the team was always pretty small, especially considering the size of the rest of the organization

we were eventually shoved onto the same virtualized machines they were trying to shove most of the other applications onto, and we had to share a database instance -- basically the monitoring was living in the same infrastructure space as the applications and infrastructure we were expected to monitor

one amusing anecdote from this was when there was a mysterious database throughput issue in the organization that we were even impacted by and were brought in to discuss because we were also expected to have some metrics on what might be happening even while being impacted as well (i think it ended up being some massive network and/or other backplane congestion from backups)

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
another fun anecdote: a kind of career recognition theft

so one particularly high-value operations-related large organizational effort decided to develop their own kind of monitoring front end; a super fancy set of dashboards and navigation tools to help provide visibility to all kinds of things about their sprawling mess of applications; the operations center was going to be expected to use this as their kind of source of record for application status in this particular area

they developed this whole thing with a team of their own that was probably 10 times or more larger than mine easily

and in the end it had no system to provide the data to back it -- it was basically just a fancy UI and I think a graph-database to help organize the displayed information

they then of course expected us to now magically integrate and provide the exact kind of data they wanted to back it, and used their much heavier organizational hitting power to bully us into it; it was a nightmare and I don't think it ever ended up getting finished and was probably a total waste of money

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
this was all in a place where the CIO regularly told the entire IT organization it was a Cost Center to curtail any demands that might come from it

i swear if i could go back today i would love to tell him to his face if its such a cost center I have a great idea to save the company tons of money by walking down to the datacenter right now and pulling the switch, then we can work on making back some money surplussing all that equipment -- hell we might even make a little money on the mainframe parts, plenty of other places still using that stuff and its probably hard to come by spares

this man made millions of dollars

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
anyways I think thats it for now

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

lancemantis posted:

another fun anecdote: a kind of career recognition theft

so one particularly high-value operations-related large organizational effort decided to develop their own kind of monitoring front end; a super fancy set of dashboards and navigation tools to help provide visibility to all kinds of things about their sprawling mess of applications; the operations center was going to be expected to use this as their kind of source of record for application status in this particular area

they developed this whole thing with a team of their own that was probably 10 times or more larger than mine easily

and in the end it had no system to provide the data to back it -- it was basically just a fancy UI and I think a graph-database to help organize the displayed information

they then of course expected us to now magically integrate and provide the exact kind of data they wanted to back it, and used their much heavier organizational hitting power to bully us into it; it was a nightmare and I don't think it ever ended up getting finished and was probably a total waste of money

lol this owns in a really lovely way

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

lancemantis posted:

anyways I think thats it for now

i appreciate you are posts

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe

uncurable mlady posted:

lol this owns in a really lovely way

what's also fun about this is they planned to sell/license this as part of the larger system they were building in this kind of plan to kind of be like the SAP of that particular space

i kind of wonder if they still think thats happening or its more of what I kind of always suspected -- a "good idea" that has toxic effects on how the applications are written and in the end it's not even possible to sell it and there's likely no market for it anyway

it was the kind of place that liked to promote from within, which might be seen as nice because it makes you feel like you can have a career there if you can play the politics and everything, but at the same time i feel like they had a serious blind spot in how competent they really were as an organization since they rarely had a lot of true outside perspective

like some of the "principle engineers/architects/whatever" might have been completely incompetent but given a huge amount of power, and they had no real idea of their incompetence because their entire career had been within an organization without any real external check of competency and now they had a big title and power and probably a fat salary so they felt on top of things

abigserve
Sep 13, 2009

this is a better avatar than what I had before
I was tasked recently with migrating several of our ancient monitoring servers to a newer distro, along with a bunch of one-off scripterinos that hook into it.

So I spent the week migrating all of the stuff into version control, automating the quite complex builds and documenting as much as possible. I spent a couple of days trying to discover all the dependencies by going through the servers but I had a fairly good picture leading in of the service and the dependencies it had. After all, we were responsible for the service and I knew what the team did and didn't use.

loving wrongo, turns out the entire thing forms the foundation of a web of other scripts, webpages, and alarms for an entirely different team, none of which is documented or in source control and I literally had no idea existed.

The worst part is it's no (technical) person's fault either; the person they have looking after it is an exceptionally good programmer and an all-around nice guy, but he's only seconded for a day per week and the second he walks in the door they are piling him with work that needs to be out the door TODAY ASAP

it's a cold splash of water; I've been dealing almost exclusively in new builds or drop-in replacements, so it feels like I've been living in a zen garden and now I'm back in the weeds.

tldr: gently caress monitoring

Perplx
Jun 26, 2004


Best viewed on Orgasma Plasma
Lipstick Apathy
my monitoring involves people calling me with their personal cell phone because the internet is down

DONT THREAD ON ME
Oct 1, 2002

by Nyc_Tattoo
Floss Finder

Perplx posted:

my monitoring involves people calling me with their personal cell phone because the internet is down

turn your monitoring on

cowboy beepboop
Feb 24, 2001

my prometheus keeps corrupting its data because it's on an nfs share. that's fine because i only use it for some pretty graphs sometimes. prom really suffers from not having good examples for what I think are common scenarios.

anyway we use cacti for all our network stuff because of inertia. prom+snmp-exporter+grafana was tedious as hell. nagios for our alerting and it's OK but kind of a pain re: config files.

for "tracing" we use senty for catching our dumb php apps various issues and fatals

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

my stepdads beer posted:

my prometheus keeps corrupting its data because it's on an nfs share. that's fine because i only use it for some pretty graphs sometimes. prom really suffers from not having good examples for what I think are common scenarios.

anyway we use cacti for all our network stuff because of inertia. prom+snmp-exporter+grafana was tedious as hell. nagios for our alerting and it's OK but kind of a pain re: config files.

for "tracing" we use senty for catching our dumb php apps various issues and fatals

nagios sucks cocks in hell

cowboy beepboop
Feb 24, 2001

it's fine. thing break -> email and sms sent. it also runs happily for years without anyone touching it

pram
Jun 10, 2001
nagios is extremely poo poo
