|
floatman posted: There was recently a jobs queue system implemented in SQL.

lol please tell me this was in sql server or oracle where there are proper queues built into the loving database engine that are miles better than trying to write your own from scratch
|
# ? Dec 21, 2018 12:21 |
|
|
|
for you sql gurus, how good is it at handling billions/trillions of very small rows (like <100 byte average length)? I am writing an application that does tens of thousands of rows per second, and a standard interface is much nicer than writing it raw to disk and also writing some handlers for retrieving it. The code is modular so I can write to anything; I'm going to try mongodb as well because it is web scale, and maybe hadoop too but I don't even know where to start with something like that
|
# ? Dec 21, 2018 12:21 |
|
wait "Since jobs were either waiting status or completed status, whenever a job was picked, the job record needed a lock to lock the job so if the SQL did run again while this job was running it wouldn't pick up the same job." they left the jobs in the "queue" ?!?!?!
|
# ? Dec 21, 2018 12:22 |
abigserve posted: for you sql gurus, how good is it at handling billions/trillions of very small rows (like <100 byte average length)?

heavily depends on data structure and intended uses
|
|
# ? Dec 21, 2018 13:07 |
|
it's network flow data, consists of a single solitary table and looks like this (sorry for the formatting)

code:
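(the schema block itself didn't survive the paste; going by the columns mentioned downthread, a last_switched timestamp plus separate IPv4/IPv6 address columns, a netflow-style table along these lines is a plausible guess, with every name here hypothetical)

code:
-- hypothetical reconstruction; the actual schema was lost in the paste.
-- column names are guesses based on the rest of the thread.
CREATE TABLE flows (
    first_switched  timestamptz,
    last_switched   timestamptz,   -- was accidentally nullable, see below
    src_addr_v4     inet,
    dst_addr_v4     inet,
    src_addr_v6     inet,
    dst_addr_v6     inet,
    src_port        integer,
    dst_port        integer,
    protocol        smallint,
    bytes           bigint,
    packets         bigint
);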
|
# ? Dec 21, 2018 13:22 |
abigserve posted: it's network flow data, consists of a single solitary table and looks like this (sorry for the formatting)

this is going to be a really lovely table to query like that, assuming you'll use last_switched, a nullable field, as the key
|
|
# ? Dec 21, 2018 13:33 |
|
cinci zoo sniper posted: this is going to be a really lovely table to query like that, assuming you'll use last_switched, a nullable field, as the key

thanks for noticing that, that was an error in my schema as it wasn't supposed to be a nullable field
|
# ? Dec 21, 2018 13:38 |
abigserve posted: thanks for noticing that, that was an error in my schema as it wasn't supposed to be a nullable field

if it's a non-nullable, monotonically increasing or decreasing field that will be used in every single query, then in postgres you can use last_switched as the partitioning key or as a BRIN index key. that should keep this table decently manageable at least until the low-to-mid billions of records
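a minimal sketch of both options, assuming postgres 10+ and a table named flows (all names are placeholders):

code:
-- declarative range partitioning on last_switched (postgres 10+)
CREATE TABLE flows (
    last_switched timestamptz NOT NULL
    -- ... the rest of the flow columns ...
) PARTITION BY RANGE (last_switched);

CREATE TABLE flows_2018_12 PARTITION OF flows
    FOR VALUES FROM ('2018-12-01') TO ('2019-01-01');

-- and/or a BRIN index: tiny on disk, effective when last_switched
-- correlates with physical row order (append-mostly inserts)
CREATE INDEX flows_2018_12_brin ON flows_2018_12 USING brin (last_switched);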
|
|
# ? Dec 21, 2018 14:08 |
|
cinci zoo sniper posted: if it's a non-nullable, monotonically increasing or decreasing field that will be used in every single query, then in postgres you can use last_switched as the partitioning key or as a BRIN index key. that should keep this table decently manageable at least until the low-to-mid billions of records

it'd be pretty easy to write psql as another supported datastore. I guess I should probably just bite the bullet, pay the :tenbux: for a db.t2.xlarge instance and a couple terabytes of storage, then just see what happens
|
# ? Dec 21, 2018 14:23 |
abigserve posted: it'd be pretty easy to write psql as another supported datastore. I guess I should probably just bite the bullet, pay the :tenbux: for a db.t2.xlarge instance and a couple terabytes of storage, then just see what happens

eh, for testing maybe. for production im not a fan of db.t2.xlarge or db.t2.2xlarge because they are postgresql 9.x. if you want to do partitions rather than brin indexing, 10 is more or less needed for you, and 11 is desired (which is only on rds preview right now i think).
|
|
# ? Dec 21, 2018 15:10 |
|
floatman posted: There was recently a jobs queue system implemented in SQL.

lmao what the gently caress

there was some process somewhere another team had that ran on a "last in first out" queue and apparently started backlogging itself around 2008 but nobody noticed, then when they moved it to a non-potato server a few years later it started blasting out 5 year old confirmation documents to clients, who were pissed

for my part I set off a massive load of alerts because I refactored a big old mess of sql data views/procs and made it about 10x faster, which caused our exporter to blast out messages so fast the receiver poo poo itself, that was pretty funny
|
# ? Dec 21, 2018 15:27 |
|
for the sake of discussion, let's assume that we actually needed a complicated priority system (otherwise the whole thing is trivial as you noted)

seems like the right approach would have been to separate the priority scoring from the actual job queue

for the former, you can run that overly-complicated query at the READ UNCOMMITTED transaction isolation level so it doesn't acquire any locks. it's not displaying data to the customer or anything, so dirty reads aren't a problem

then, with the job_id in hand from the query, you can set the isolation level back to the highest and make full use of every locking feature your database provides to actually update that job to "processing" status and ensure that it gets run one and only one time. you're only touching the job table at this point, so it doesn't affect the rest of the program
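a rough T-SQL sketch of that two-phase shape (the job table, columns, and scoring query are all made up for illustration):

code:
-- phase 1: the heavy scoring query, no locks taken; dirty reads are fine here
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
DECLARE @chosen_id int =
    (SELECT TOP (1) job_id
       FROM dbo.jobs
      WHERE status = 'waiting'
      ORDER BY priority_score DESC);   -- stand-in for the complicated priority logic

-- phase 2: short, strict transaction that actually claims the job
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
UPDATE dbo.jobs
   SET status = 'processing'
 WHERE job_id = @chosen_id
   AND status = 'waiting';   -- still waiting = nobody beat us to it
-- if @@ROWCOUNT = 0, someone else claimed it; go back to phase 1
COMMIT;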
|
# ? Dec 21, 2018 16:26 |
|
abigserve posted: it's network flow data, consists of a single solitary table and looks like this (sorry for the formatting)

Make the IPv6 entries only 12 bytes long and merge with the IPv4 columns to create the final result.
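a related way to get one slot per address without the manual byte-packing: postgres's inet type holds either family in a single column. a sketch, reusing the hypothetical column names from above:

code:
-- one inet column per endpoint stores v4 and v6 alike
CREATE TABLE flows_compact (
    last_switched timestamptz NOT NULL,
    src_addr      inet NOT NULL,
    dst_addr      inet NOT NULL
    -- ... the rest of the flow columns ...
);

-- merging the old split columns: coalesce picks whichever family is populated
INSERT INTO flows_compact (last_switched, src_addr, dst_addr)
SELECT last_switched,
       COALESCE(src_addr_v4, src_addr_v6),
       COALESCE(dst_addr_v4, dst_addr_v6)
  FROM flows;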
|
# ? Dec 21, 2018 16:33 |
|
that would leave open the case of "job got set to processing and never finished because it halted half way", though, which i guess you could solve via a time out or a db transaction or something. buutttt why not just have your queue read by a single-process dispatcher that passes the jobs to workers that do the actual thing and write success when they're done, using the job id directly? any time the dispatcher fails it'd just start up, pick up the next x jobs, and carry on. you don't have to worry about messing with db locking at all, and the workers don't need to know or care about the priority.
|
# ? Dec 21, 2018 16:34 |
|
NihilCredo posted: for the sake of discussion, let's assume that we actually needed a complicated priority system (otherwise the whole thing is trivial as you noted)

read uncommitted (aka nolock) won't work for that, since multiple processes performing the same query will get the same dirty data and then all attempt to pull the same row for processing. you need to use READPAST and UPDLOCK, which will read past locked rows and then lock only the row you intend to process.

Shaggar fucked around with this message at 16:42 on Dec 21, 2018
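the classic T-SQL shape of that dequeue, sketched with made-up table and column names:

code:
-- READPAST skips rows other workers already locked; UPDLOCK/ROWLOCK
-- take an update lock on just the one row we claim
WITH next_job AS (
    SELECT TOP (1) job_id, status
      FROM dbo.jobs WITH (READPAST, UPDLOCK, ROWLOCK)
     WHERE status = 'waiting'
     ORDER BY priority DESC, job_id
)
UPDATE next_job
   SET status = 'processing'
OUTPUT inserted.job_id;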
# ? Dec 21, 2018 16:37 |
|
Powerful Two-Hander posted: that would leave open the case of "job got set to processing and never finished because it halted half way", though, which i guess you could solve via a time out or a db transaction or something. buutttt why not just have your queue read by a single-process dispatcher that passes the jobs to workers that do the actual thing and write success when they're done, using the job id directly?

why would the single process dispatcher be any less likely to fail in the middle of processing than independent dispatchers? what if it assigns a job to a worker and the worker crashes? you have to handle retries somehow, and a single dispatcher doesn't really help.
|
# ? Dec 21, 2018 16:40 |
|
Shaggar posted: why would the single process dispatcher be any less likely to fail in the middle of processing than independent dispatchers? what if it assigns a job to a worker and the worker crashes? you have to handle retries somehow, and a single dispatcher doesn't really help.

the dispatcher doesn't have to worry about "other" processes touching the queue, so you don't have to mess with row locks to ensure that a job only gets processed once, and you take away the issue of "workers A and B both try to process job 1" because you've assigned job 1 to a specific worker first.

if the workers are distributed in different components that can fail independently then yeah, you could restart the dispatcher and reassign a job, but you can solve that by tagging a job to the worker, or just run the workers as separate threads in a single service. depends on volume I guess. i mean you could still get the dreaded "worker finished the job but failed before it could write completion"

edit: ofc at some point someone will complain it's slow and install a second dispatcher and you're back to square one

edit 2: why the gently caress am I posting all this I'm supposed to be on holiday
|
# ? Dec 21, 2018 16:46 |
|
the row locking problem is actually pretty easy to handle and the failure handling is far and away the hardest issue, especially when you need to avoid reprocessing effects. also if you design for multiple processes in the queue then you can easily handle scaling up by adding more processes.
|
# ? Dec 21, 2018 16:55 |
|
abigserve posted: for you sql gurus, how good is it at handling billions/trillions of very small rows (like <100 byte average length)?

better than handling trillions of large rows?

mongo will poo poo itself. don't use mongo. it is possible but you will regret it (actually this is true anytime mongo is under consideration for anything)

what i would do is write to elasticsearch, export completed indexes to parquet, and delete indexes from elasticsearch after a day or a week. kibana for NRT queries, and hive/presto serve historical queries. just like log aggregation
|
# ? Dec 21, 2018 16:57 |
|
Powerful Two-Hander posted: that would leave open the case of "job got set to processing and never finished because it halted half way", though, which i guess you could solve via a time out or a db transaction or something. buutttt why not just have your queue read by a single-process dispatcher that passes the jobs to workers that do the actual thing and write success when they're done, using the job id directly?

when you set it to processing, also set a column with the unique ID of the dispatcher. if that ID is the same between restarts, then a dispatcher that crashes can pick up its own jobs when it restarts. failing that, have a way to get the IDs of all active dispatchers (can be just a table with a last_seen time updated every minute) and treat as free any job marked as 'processing' by an absentee dispatcher

Shaggar posted: read uncommitted (aka nolock) won't work for that, since multiple processes performing the same query will get the same dirty data and then all attempt to pull the same row for processing. you need to use READPAST and UPDLOCK, which will read past locked rows and then lock only the row you intend to process.

you use read uncommitted when selecting the job_id to process

you use for update skip locked (or the mssql equivalent) when you actually update the job_id you found and set it to 'processing'

if the update fails, you assume someone who read your same data beat you to the job and goto step 1. you may have to run the heavy query multiple times, but this way you never actually lock the non-job-queue tables
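in postgres terms the two halves might look roughly like this (table, column, and parameter names are all assumptions):

code:
-- claim: take one waiting job, skipping rows other dispatchers hold locks on
UPDATE jobs
   SET status = 'processing', owner = :dispatcher_id
 WHERE job_id = (SELECT job_id
                   FROM jobs
                  WHERE status = 'waiting'
                  ORDER BY priority DESC
                  LIMIT 1
                    FOR UPDATE SKIP LOCKED)
RETURNING job_id;

-- sweep: free anything held by a dispatcher that hasn't checked in lately
UPDATE jobs
   SET status = 'waiting', owner = NULL
 WHERE status = 'processing'
   AND owner IN (SELECT id
                   FROM dispatchers
                  WHERE last_seen < now() - interval '5 minutes');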
|
# ? Dec 21, 2018 17:00 |
|
cinci zoo sniper posted: eh, for testing maybe. for production im not a fan of db.t2.xlarge or db.t2.2xlarge because they are postgresql 9.x. if you want to do partitions rather than brin indexing, 10 is more or less needed for you, and 11 is desired (which is only on rds preview right now i think).

i think he's a perfect fit for https://www.timescale.com/how-it-works
|
# ? Dec 21, 2018 17:05 |
|
NihilCredo posted: when you set it to processing, also set a column with the unique ID of the dispatcher. if that ID is the same between restarts, then a dispatcher that crashes can pick up its own jobs when it restarts. failing that, have a way to get the IDs of all active dispatchers (can be just a table with a last_seen time updated every minute) and treat as free any job marked as 'processing' by an absentee dispatcher

yeah that's kind of what i was thinking. pretty sure we actually had a system like this years ago, but the workers were set up so that if one died, they all refused to work until it was restarted. must have been unionised I guess.

also I apologise for derailing the terrible programmer stories and taking us all to the bikeshed
|
# ? Dec 21, 2018 17:08 |
NihilCredo posted: i think he's a perfect fit for https://www.timescale.com/how-it-works

never seen this before, curious. off the bat it seems like a decent fit, yeah, since the target workload is rather pristine. not sure if there can be much partitioning in space done, however
|
|
# ? Dec 21, 2018 17:18 |
|
NihilCredo posted: i think he's a perfect fit for https://www.timescale.com/how-it-works

oh that's a good one. closest thing i've ever used is influxdb, which is not suitable
|
# ? Dec 21, 2018 17:22 |
|
NihilCredo posted: i think he's a perfect fit for https://www.timescale.com/how-it-works

timecube has grown up
|
# ? Dec 21, 2018 18:26 |
|
NihilCredo posted: when you set it to processing, also set a column with the unique ID of the dispatcher. if that ID is the same between restarts, then a dispatcher that crashes can pick up its own jobs when it restarts. failing that, have a way to get the IDs of all active dispatchers (can be just a table with a last_seen time updated every minute) and treat as free any job marked as 'processing' by an absentee dispatcher

wouldn't this cause a race condition wrt the update? like, if 2 processes have selected the same job id and there's a delay between one process's selection of the job id and its attempt to update, you run the risk of the faster process successfully updating and committing before the later one tries its update. so when the second does its update, the first process's lock will be gone and the update will work.

or are you locking the row for the entire processing of the job? in which case there's no real reason to set it to a temporary status of "processing", because that status would be replaced by the final job status, or in the event of a failure the transaction will roll back to before it was selected.
|
# ? Dec 21, 2018 19:27 |
|
i seriously don't understand why you wouldn't just use a service broker queue if you're using mssql
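for anyone who hasn't touched it, a bare-minimum service broker setup looks something like this (service name and payload are invented):

code:
CREATE QUEUE dbo.JobQueue;
CREATE SERVICE JobService ON QUEUE dbo.JobQueue ([DEFAULT]);

-- enqueue
DECLARE @h uniqueidentifier;
BEGIN DIALOG CONVERSATION @h
    FROM SERVICE JobService
    TO SERVICE 'JobService'
    ON CONTRACT [DEFAULT]
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @h (N'{"job_id": 42}');

-- dequeue: blocks up to 5 seconds; locking and removal are atomic
DECLARE @msg nvarchar(max);
WAITFOR (
    RECEIVE TOP (1) @msg = CAST(message_body AS nvarchar(max))
    FROM dbo.JobQueue
), TIMEOUT 5000;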
|
# ? Dec 21, 2018 19:33 |
|
at some point im gonna use service broker to do pub sub instead of polling, but i'll still have the same table structure for reporting and app needs. service broker would basically be running the same query to get the list of items to queue tho.
|
# ? Dec 21, 2018 19:40 |
|
Shaggar posted: wouldn't this cause a race condition wrt the update? like, if 2 processes have selected the same job id and there's a delay between one process's selection of the job id and its attempt to update, you run the risk of the faster process successfully updating and committing before the later one tries its update. so when the second does its update, the first process's lock will be gone and the update will work.

each dispatcher runs something like "update jobs set processing = true, owner = @my_id where job_id = @chosen_id and processing = false". I'm on phone, so pretend i added the various id columns and skip locked clauses.

first dispatcher will see 1 row affected and do the job. second one will see 0 rows and go look for another chosen_id
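completed on a keyboard, the postgres flavour of that statement might look like this (the skip locked clause belongs on the earlier select that picks the job id; names remain made up):

code:
-- the claim; "processing = false" is the guard that makes the race safe:
-- the loser sees 0 rows affected and goes to pick another job
UPDATE jobs
   SET processing = true, owner = :my_id
 WHERE job_id = :chosen_id
   AND processing = false
RETURNING job_id;   -- no row back = beaten to it, goto step 1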
|
# ? Dec 21, 2018 20:19 |
|
ok i got you. that works. i wouldn't have thought to do it by rows affected.
|
# ? Dec 21, 2018 20:24 |
|
abigserve posted: it's network flow data, consists of a single solitary table and looks like this (sorry for the formatting)

what would be your search parameters for this?
|
# ? Dec 21, 2018 20:42 |
|
drop table imo
|
# ? Dec 21, 2018 20:48 |
|
|
|
Shaggar posted: at some point im gonna use service broker to do pub sub instead of polling, but i'll still have the same table structure for reporting and app needs. service broker would basically be running the same query to get the list of items to queue tho.

i was more responding to the op, but in your case why not just push messages to your queue when you update the operational table, instead of updating an operational table to query later to queue messages? seems like an extra pointless step to me.

if the problem is they can't be processed until a later time, use a conversation timer. if you're dealing with multiple dependencies, then check and only queue the message when all dependencies are satisfied.

if the messages need to be "cancellable" things get hairy, but you can still update the operational table to say "don't process that message, we nulled the process timestamp" and then double check when processing that message you queued up for 1 am. since you're always looking poo poo up via the pk of the table, the check is fast.

basically i don't see the point of turning operational tables into queues and i really have to stretch to come up with scenarios where it would be the only choice
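the conversation timer bit, for reference: it drops a DialogTimer message onto the conversation's queue when it fires, so the queue reader naturally wakes up for deferred work. a sketch, assuming @h already holds an open conversation handle:

code:
-- after TIMEOUT seconds, a message of type
-- http://schemas.microsoft.com/SQL/ServiceBroker/DialogTimer
-- arrives on this conversation's queue
BEGIN CONVERSATION TIMER (@h) TIMEOUT = 3600;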
|
# ? Dec 22, 2018 00:42 |
|
all of that said, queues are way overused too. i mostly use them in async scenarios, such as file drops and api requests coming in waves, to smooth out processing. ideally your system can just make whatever api calls it needs to and everything just works. read this i guess: https://ferd.ca/queues-don-t-fix-overload.html
|
# ? Dec 22, 2018 00:44 |
|
Cold on a Cob posted: i was more responding to the op, but in your case why not just push messages to your queue when you update the operational table, instead of updating an operational table to query later to queue messages? seems like an extra pointless step to me.

its easier to just handle control and statusing in a table vs an opaque message queue, and ive never seen a purpose-built message queue that isn't fragile as hell and that doesn't rely on the ability to dump everything in the queue and restart when things go wrong.
|
# ? Dec 22, 2018 00:57 |
|
well guys we better separate the queue from the database because what if the database server goes down? then how will the workers fetch jobs from the queue to process?? what do you mean the work of those jobs requires further queries to the database, i don't see your point???
|
# ? Dec 22, 2018 02:01 |
|
CRIP EATIN BREAD posted: what would be your search parameters for this?

It's at least last_switched, but can be any combination of columns. I'm writing the timescaledb backend now, I'll let you guys know how that goes!
|
# ? Dec 22, 2018 02:07 |
|
Shaggar posted: its easier to just handle control and statusing in a table vs an opaque message queue, and ive never seen a purpose-built message queue that isn't fragile as hell and that doesn't rely on the ability to dump everything in the queue and restart when things go wrong.

SQS is super robust. my job has been relying on SQS for about five years now and has had 0 problems. billions of messages per month, 1000x traffic spikes, once 300 misconfigured workers short polling in a loop, never had a problem.

probably helps that the features you get are put message, receive message, and delete received message
|
# ? Dec 22, 2018 02:22 |
|
|
|
Cold on a Cob posted: lol please tell me this was in sql server or oracle where there are proper queues built into the loving database engine

Yes.

Cold on a Cob posted: wait ... they left the jobs in the "queue" ?!?!?!

Yes
|
# ? Dec 22, 2018 02:25 |