DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!
On a related AWS note, apparently Provisioned IOPS is the way to go these days on EC2 for DB workloads.

You get a very unpredictable ~100 IOPS (16kB block reads) with regular EBS, but with PIOPS you can give yourself up to 4k *guaranteed* IOPS per drive, with none of the outrageous fluctuation I was mentioning above. That means if you were to stripe a few EBS volumes together, you'd have a lot of throughput. At that point, if considerable data is flowing in and out, you might start hitting the network throughput limit, since EBS traffic and the actual incoming DB calls go through the same pipe. Once you hit that, you can look at EBS-optimized instances, which get a separate pipe dedicated to EBS, so you're no longer sharing the network.

The things you learn..

This actually raises a question: in your experience, how much of a difference does artisan babysitting of a DB server (software + hardware) make in practice compared to using pre-packaged solutions like RDS, Heroku's hosted PG, etc.? I can't figure out whether those services would ever bother giving you the VERY BEST hardware, possibly outclassing the best hand-picked machines, or whether you get "meh" performance in exchange for the convenience of no maintenance.

DreadCthulhu fucked around with this message at 20:35 on Oct 13, 2013

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!
Thinking more on my aforementioned "incredibly slow InnoDB update" issue, I notice my InnoDB configuration has the "one file per table" option set to false, so duplicating the table contents into a new table wouldn't necessarily have sorted out whatever was making it so slow (and indeed it did not).

So I wonder, is the problem perhaps to do with some sort of horrible [de]fragmentation treatment when allocating large text fields? The data's usage pattern has essentially been a repeated "append about 10 characters" to each of 30 strings, up to a length of around 1500 characters each. If it's trying to allocate that space efficiently in a single pool, and has a somewhat naive system for it, that might explain the horrifying slowness of updates.

Because it wasn't slow at all when the database was new, and I can't imagine 32 rows is enough to cause a meltdown, but 4500 reusable spaces, each too small to fit the new data-lump, examined sequentially and scattered through a file that's constantly being randomly disc-accessed, might do it. Is that what InnoDB would do?

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

roomforthetuna posted:

Thinking more on my aforementioned "incredibly slow InnoDB update" issue,..

Because it wasn't slow at all when the database was new, and I can't imagine 32 rows is enough to cause a meltdown, but 4500 reusable spaces, each too small to fit the new data-lump, examined sequentially and scattered through a file that's constantly being randomly disc-accessed, might do it. Is that what InnoDB would do?
Nope, the standard size "issue" for MySQL is 8000 bytes, so you shouldn't even have to worry about your row being stored "off page", except perhaps for the rules field (depending on its size). After reading your initial post a few times, I came to realize that your update is of a single row. Updating a text field in a single row of a 100kB data set should never take anything close to a second, particularly since it's referenced by a single, primary key field.

Is this the only statement within the transaction, or are you hiding thousands of other things in there? What else is happening on the system? Do you have a bunch of waiting metadata locks contending for update? Check your logging settings and SHOW PROCESSLIST to see what is causing the delay. Assuming you can even use ed/vim/nano/cat/more on this system, and that these query times are so highly variable, it seems most likely that you are locking the row in some fashion and not releasing it.
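
Concretely, this is roughly what I'd run first, assuming MySQL 5.5/5.6 with stock InnoDB (nothing here is specific to your setup):
code:
-- who is connected and what state each thread is in
SHOW FULL PROCESSLIST;

-- the TRANSACTIONS and FILE I/O sections are the interesting bits here
SHOW ENGINE INNODB STATUS\G

-- open transactions, how long they've been running, and who is waiting on whom
SELECT * FROM information_schema.INNODB_TRX;
SELECT * FROM information_schema.INNODB_LOCK_WAITS;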

There are many other choices, "Using an 8MHz processor and an IBM WD-12 5.25" 10MB Internal Hard Drive" not the least among them, but you have some very serious issues if you have to blink more than once to append a nominal string to a text field.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

PhantomOfTheCopier posted:

Nope, the standard size "issue" for MySQL is 8000 bytes, so you shouldn't even have to worry about your row being stored "off page", except perhaps for the rules field (depending on its size). After reading your initial post a few times, I came to realize that your update is of a single row. Updating a text field in a single row of a 100kB data set should never take anything close to a second, particularly since it's referenced by a single, primary key field.

Is this the only statement within the transaction, or are you hiding thousands of other things in there? What else is happening on the system? Do you have a bunch of waiting metadata locks contending for update? Check your logging settings and SHOW PROCESSLIST to see what is causing the delay. Assuming you can even use ed/vim/nano/cat/more on this system, and that these query times are so highly variable, it seems most likely that you are locking the row in some fashion and not releasing it.

There are many other choices, "Using an 8MHz processor and an IBM WD-12 5.25" 10MB Internal Hard Drive" not the least among them, but you have some very serious issues if you have to blink more than once to append a nominal string to a text field.
Yeah, that's why I'm asking, because it's clearly hosed up. I already did a SHOW PROCESSLIST; the only process was the mysql command line I was testing with, and while I was testing there was nothing else in the transaction, just that single UPDATE being run on its own. (In the original code the transaction was just "start transaction, get the single row's data, modify the data, put the modified data back, commit", which is why I wanted the InnoDB row lock, so that two simultaneous updates could never both be operating from the same 'get', overwriting the first change with the second.)
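
For reference, the shape of that transaction is roughly this (table and column names invented for the example, not my real schema):
code:
START TRANSACTION;

-- row lock, so two concurrent updates serialize instead of both reading the same data
SELECT strings FROM game_state WHERE id = 42 FOR UPDATE;

-- (application appends ~10 characters to each of the 30 strings here)

UPDATE game_state SET strings = '...modified blob...' WHERE id = 42;
COMMIT;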

There may be something insanely slow about the hard drive of the server though - saving a file from vi often takes a second. If MyISAM is caching things in memory and writing out in the background or something then that might explain the ridiculous speed difference. It's a virtual private server so it's hard to say what the hardware is like.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

roomforthetuna posted:

... There may be something insanely slow about the hard drive of the server though - saving a file from vi often takes a second. If MyISAM is caching things in memory and writing out in the background or something then that might explain the ridiculous speed difference. It's a virtual private server so it's hard to say what the hardware is like.
It's likely that the VPS provider will tell you that you are using shared resources and should expect to see periodic and random performance degradation, so you'll want to run some more basic tests before sending them a complaint. In particular, if they have a database or MySQL person, they're likely to check all your performance settings which is a very nice idea but should honestly not matter for a single-row transaction in a 100kB database (unless you've done something silly like lowering the buffer pool size to 10 bytes). If they hand the issue to a systems administrator, they might recognize the SQL, but they'll be more interested in disk performance as a whole.

Find yourself any basic intro to using `dd` as a benchmarking tool (such as this) to get some basic timing information. Test reads and writes. See if you can narrow this misbehavior down to a fixed period of the day or not. Try to get some base CPU values as well (time to calculate the first 100000 primes, etc.). You'll want to demonstrate something specific (unless they really just have lots of spare time to help you), or that the disks are seemingly fine but MySQL isn't (which is a bit non-sensical, though it pains me to say it). If you get to that point, a full dump and restore may help (maybe without your configuration, since there could still be something hiding there).

There are a few other MySQLish possibilities, but I should think you'd have to be hiding lots of triggers and bloated indexes from us for them to be likely. Being on a virtual system, there's more potential trouble there. You could have bad neighbors, a RAID drive that's rebuilding, and a host of other things going on that you can't see.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

PhantomOfTheCopier posted:

Find yourself any basic intro to using `dd` as a benchmarking tool (such as this) to get some basic timing information. Test reads and writes. See if you can narrow this misbehavior down to a fixed period of the day or not. Try to get some base CPU values as well (time to calculate the first 100000 primes, etc.). You'll want to demonstrate something specific (unless they really just have lots of spare time to help you), or that the disks are seemingly fine but MySQL isn't (which is a bit non-sensical, though it pains me to say it). If you get to that point, a full dump and restore may help (maybe without your configuration, since there could still be something hiding there).
Thanks for that, that was good for confirming it's almost certainly something disc related.
(In case you're interested, the tests I did and results:
code:
$ dd bs=1M count=256 if=/dev/zero of=test conv=fdatasync
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 8.09512 s, 33.2 MB/s
$ rm test
$ dd bs=1K count=256 if=/dev/zero of=test conv=fdatasync
256+0 records in
256+0 records out
262144 bytes (262 kB) copied, 6.18214 s, 42.4 kB/s
$ dd bs=1K count=2 if=/dev/zero of=test conv=fdatasync
2+0 records in
2+0 records out
2048 bytes (2.0 kB) copied, 6.19403 s, 0.3 kB/s
$ rm test
$ dd bs=1K count=1 if=/dev/zero of=test conv=fdatasync
1+0 records in
1+0 records out
1024 bytes (1.0 kB) copied, 2.89831 s, 0.4 kB/s
$ rm test
$ dd bs=1K count=1 if=/dev/zero of=test conv=fdatasync
1+0 records in
1+0 records out
1024 bytes (1.0 kB) copied, 0.086591 s, 11.8 kB/s
$ rm test
$ dd bs=1K count=2 if=/dev/zero of=test conv=fdatasync
2+0 records in
2+0 records out
2048 bytes (2.0 kB) copied, 2.99339 s, 0.7 kB/s
Reading seemed relatively okay - 24MB in 1MB blocks in 8 seconds, 256K in 1K blocks in 0.75 seconds and 1K in a single block in 0.029 seconds, each reading from a different file that hasn't been used in months to make sure of no cache hits.

Clearly 3 seconds to write 1K of data is not expected behavior, and likely explains the lovely InnoDB performance. I'll give the hosts a prod about it. Thanks for your help with what turned out to not really be an SQL question at all!

Edit: and yeah, InnoDB does fsync every write; a similar dd test without fdatasync returns immediately, which would be a fair simulation of MyISAM's behaviour of not actively syncing at all.
Edit2: and InnoDB then fsyncs a log too, so that's already like 6 seconds on this system even if both files were infinitesimal and it's only one write each, with no rearranging data or anything.
Edit3: just wrote a quick C program to do a series of small writes and fsyncs and time them. It tends to go 12 seconds, 8 seconds, 3 seconds, then a bunch of near-instant ones (writing only 4 bytes at a time), so my guess is someone else on the same meta-server is spewing data to disc at several megabytes per second and my few bytes are getting queued behind their mass of data, hence the inconsistent long sync times. (The host says there was also a RAID rebuild at some point today, but that started after the slowness began and finished while the slowness was still going, so it's not to blame; more likely the crazy data spew is to blame for the RAID rebuild being needed in the first place!)
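
Edit4: for anyone else who lands here, the knob for how often InnoDB fsyncs its log is innodb_flush_log_at_trx_commit; purely as a band-aid on a terrible disk you can trade a little durability for speed (values per the MySQL docs, I haven't actually changed this on the box):
code:
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
-- 1 (default): write and fsync the log at every commit
-- 2: write at commit, fsync roughly once per second (can lose ~1s of commits on a crash)
SET GLOBAL innodb_flush_log_at_trx_commit = 2;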

roomforthetuna fucked around with this message at 22:05 on Oct 14, 2013

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

roomforthetuna posted:

$ dd bs=1K count=2 if=/dev/zero of=test conv=fdatasync
2+0 records in
2+0 records out
2048 bytes (2.0 kB) copied, 6.19403 s, 0.3 kB/s
Ye freaking :effort:s, it's worse than even I imagined. The forking 1541 drive on a Commodore 64 got 300 bytes per second!

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

PhantomOfTheCopier posted:

Ye freaking :effort:s, it's worse than even I imagined. The forking 1541 drive on a Commodore 64 got 300 bytes per second!
Ha ha, yeah, imagine how slow it would look if I'd had it write and sync just one byte!

Here's the script I used to examine what was going on a bit closer - http://pastebin.com/nKFpJpXe
From the numbers I got out of that it would appear to be writing about half a byte per second, in that it would take 8 seconds to write and sync four bytes.
But each iteration would get faster, with the last 12 iterations generally being near-instant, so my theory is some other fucker is constantly writing over 200MB per second to the drive, and my few bytes are just spending a long time waiting in the write queue. I wonder if there's any way VPS software could send big writers to the back of the queue to stop them loving up ordinary usage for other users. Probably not, since the write queue is largely in hardware.

Xae
Jan 19, 2005

Anyone have experience dealing with "PAGEIOLATCH_EX" issues in SQL Server 2008?

We have a warehouse sitting on a stupidly powerful box that spikes up to only 5-10 MB/s of IO and <10% CPU usage, then the loads get suspended due to PAGEIOLATCH_EX waits. This happens in cycles of roughly 10 seconds.

The disks are on Fibre Channel to the SAN, so 5-10 MB/s seems absurdly low; I have seen IO spikes of up to 500 MB/s at times in Activity Monitor.

The warehouse is loaded by using SSIS to load to a staging area, then the data is processed and inserted (in one giant query) into the final table. All the tables reside on the same data file, so is the read colliding with the write?

Erik Victory
Jan 26, 2013

Xae posted:

Anyone have experience dealing with "PAGEIOLATCH_EX" issues in SQL Server 2008?

We have a warehouse sitting on a stupidly powerful box that spikes up to only 5-10 MB/s of IO and <10% CPU usage, then the loads get suspended due to PAGEIOLATCH_EX waits. This happens in cycles of roughly 10 seconds.

The disks are on Fibre Channel to the SAN, so 5-10 MB/s seems absurdly low; I have seen IO spikes of up to 500 MB/s at times in Activity Monitor.

The warehouse is loaded by using SSIS to load to a staging area, then the data is processed and inserted (in one giant query) into the final table. All the tables reside on the same data file, so is the read colliding with the write?

An actual (not estimated) query plan as well as I/O statistics will basically answer this; there's no way for someone to give you an informed opinion otherwise. It's very likely due to the large inserts causing the table's indices to rearrange on insert - SQL Server can do any number of things to achieve this, including bulk loading all the indices into table spools and rewinding/rebinding and making a whole complex plan out of it. Even though more info is needed, you might try the lazy DBA method of dropping indices before the insert and rebuilding afterward. It probably has nothing to do with the disk I/O subsystem.
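
If you want to try that, it usually looks something like this (index and table names invented; don't disable the clustered index, or the table becomes unreadable):
code:
-- before the load: disable the nonclustered indexes
ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales DISABLE;

-- after the load: rebuild everything in one shot
ALTER INDEX ALL ON dbo.FactSales REBUILD;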

Xae
Jan 19, 2005

Erik Victory posted:

An actual (not estimated) query plan as well as I/O statistics will basically answer this; there's no way for someone to give you an informed opinion otherwise. It's very likely due to the large inserts causing the table's indices to rearrange on insert - SQL Server can do any number of things to achieve this, including bulk loading all the indices into table spools and rewinding/rebinding and making a whole complex plan out of it. Even though more info is needed, you might try the lazy DBA method of dropping indices before the insert and rebuilding afterward. It probably has nothing to do with the disk I/O subsystem.

We drop the indexes before the load and recreate them afterwards.

Erik Victory
Jan 26, 2013

Xae posted:

We drop the indexes before the load and recreate them afterwards.

Have you run a trace/captured the query plan? If it's not the indices it's possibly the query itself (which you described as "gigantic").

ManoliIsFat
Oct 4, 2002

Xae posted:

We drop the indexes before the load and recreate them afterwards.

Could be the nature of the data/indexes. IIRC, we had a lot of those EX page latches when we used to cluster the key around a GUID, with no padding, in a fairly big table. This would cause huge re-balancing of pages every so many inserts: because the GUID data was all random, pages would split, the indexes were fragmented as hell, and a grip of data got moved around at seemingly random times.
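
If you want to test that theory, the usual mitigations look roughly like this (illustrative DDL, not our actual table; NEWSEQUENTIALID() and FILLFACTOR are stock SQL Server features):
code:
-- sequential GUIDs append at the end of the clustered index instead of splitting random pages,
-- and the lower fill factor leaves padding in each page so inserts have room before a split
CREATE TABLE dbo.BigTable (
    Id      UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID(),
    Payload VARCHAR(200)     NOT NULL,
    CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED (Id) WITH (FILLFACTOR = 80)
);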

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!
How do most pros handle validation checks such as uniqueness from web applications? Right now I enforce uniqueness at the DB level (of course..), and also make a DB call before insertion to make sure the row doesn't exist. Obviously this fails in the case where two of these calls are made, but essentially no damage is done, as the second call will result in failure once the DB constraint is hit.

I was chatting with some experienced SQL folks though, and they told me that the above is pretty ghetto and that, in the simple uniqueness scenario above, I should just attempt inserts, and then catch and interpret the exception correctly in the web app. This way I get to make only 1 round-trip and don't have to account for the possibility of uniqueness violation in 2 different chunks of code.

Is the latter approach what most sane people do, or what's the deal? I know some frameworks frown upon that (like Rails), but that might be just their unique opinion.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

DreadCthulhu posted:

How do most pros handle validation checks such as uniqueness from web applications? Right now I enforce uniqueness at the DB level (of course..), and also make a DB call before insertion to make sure the row doesn't exist. Obviously this fails in the case where two of these calls are made, but essentially no damage is done, as the second call will result in failure once the DB constraint is hit.
No damage is done, but you're sending two queries when you only need one, slowing your thing down. I assume that's why people say it's not good, in addition to unnecessarily having two lots of code checking for the same error.

Personally, in that situation I just do INSERT, then if (bla->rows==0) as the only check. For a huge professional thing I might examine what kind of error caused no insert, but for things I make for myself I'm happy to just assume (or at least report to the client) that if nothing was inserted then it was a key-already-exists problem.
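
i.e. the whole thing hangs off a unique key, something like this (made-up table; a plain INSERT errors out with ER_DUP_ENTRY 1062 on a duplicate, while INSERT IGNORE just reports 0 affected rows, which is all my check cares about):
code:
CREATE TABLE users (
    id    INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    UNIQUE KEY uq_users_email (email)
);

INSERT IGNORE INTO users (email) VALUES ('goon@example.com');
-- second run: 0 rows affected, which I treat as "already exists"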

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

DreadCthulhu posted:

...make a DB call before insertion to make sure the row doesn't exist. Obviously this fails in the case where two of these calls are made...(like Rails)...
Yeah, this is called a race condition, and it plagues Rails types because MySQL seems to keep changing how it handles things like auto-incrementing fields, return values, upserts, replication key failures, blardy blar.

Let your real database do its job.

DreadCthulhu posted:

I should just attempt inserts, and then catch and interpret the exception correctly in the web app. This way I get to make only 1 round-trip and don't have to account for the possibility of uniqueness violation in 2 different chunks of code.
For even the simplest of applications, insert failures tend to result in finite-looped retries, possibly with exponential back-off timing. Applications that are more intelligent would be more concerned with the type of error returned. Note specifically that an INSERT can fail for reasons other than a uniqueness violation. In any case, your application should not be causing uniqueness violations; i.e., it should not be making choices about things like primary, randomly-or-sequentially assigned keys, because the database can handle the uniqueness and check constraints and return the values ultimately inserted to the application.
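
In Postgres terms, the "return the values ultimately inserted" part is just RETURNING (table invented for the example):
code:
CREATE TABLE accounts (
    id    serial PRIMARY KEY,   -- the database assigns the key
    email text   NOT NULL UNIQUE
);

INSERT INTO accounts (email)
VALUES ('goon@example.com')
RETURNING id;                   -- hand the generated key straight back to the application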

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

PhantomOfTheCopier posted:

For even the simplest of applications, insert failures tend to result in finite-looped retries, possibly with exponential back-off timing. Applications that are more intelligent would be more concerned with the type of error returned. Note specifically that an INSERT can fail for reasons other than a uniqueness violation. In any case, your application should not be causing uniqueness violations; i.e., it should not be making choices about things like primary, randomly-or-sequentially assigned keys, because the database can handle the uniqueness and check constraints and return the values ultimately inserted to the application.

The case I'm thinking of is, for simplicity's sake, a user trying to create an account with email X. Email X is in the system already, you get an exception from the driver, you check to see if it's the uniqueness violation code, then you report the specific error to the caller of the API. No amount of retrying is going to help here, and I'm not doing any sequential ID generation. I can also ignore most of the other insertion failures here and assume that if they happen, I just return a 500 to the caller, since I have no idea what just happened and don't care too much. Handling edge cases here doesn't buy me anything: the user will just retry later, or there's a real issue and I'll be fixing it right away. Am I missing anything?

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

PhantomOfTheCopier posted:

Note specifically that an INSERT can fail for reasons other than a uniqueness violation.
Are there any other reasons that are likely to be short-term and temporary, such that retrying is worthwhile? I can think of "drive is full", which isn't likely to get fixed without someone intervening, and maybe "database just got restarted after your connection was made". What else is there?

Pardot
Jul 25, 2001




Well, there are serialization failures which are temporary and should be retried.
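
In Postgres those surface as SQLSTATE 40001 (serialization_failure); nothing is wrong with the statements themselves, so the application simply re-runs the whole transaction. For example (with a made-up accounts table):
code:
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- may abort with 40001 under concurrent load; catch that and retry the block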

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!
I'm trying to make a healthy design decision here and could use your help. Still on the input validation topic.

In a web app situation, when a user attempts to update or insert something, I'm seeing three phases of validation that have to be done to ensure sanity. First you check input validity for basics such as type, length, range, presence, etc. Then you start making checks against the DB for things like "does the user have the right to make this call with these parameters?" Maybe the FK he's pointing to is actually not owned by this user and would lead to undesired badness. The third check is on the result of the insertion/update itself, and the consequent interpretation of a possible exception code. Each phase lets you short-circuit the rest of the operations and immediately return an error specific to the user.

Now, it's clear that one must decide how much logic needs to reside at the DB level and try to minimize business rule duplication. Uniqueness constraints are obvious, as per the earlier discussion here. Other checks, such as "I want at most x rows with this content in this table", seem to make the most sense as check constraints too, again to avoid concurrency issues.

I suspect that phases 2 and 3 could be merged into a sproc, but I'm not a huge fan of writing a ton of SQL in whatever clunky Postgres language (yes yes, there's v8 support), or of now needing a second test suite for this stuff. Making the db be dumb makes my testing existence very comfortable, but I wonder if I'm seriously missing out here.

Anybody got thoughts on the whole validation shebang?

Cocoa Crispies
Jul 20, 2001

Vehicular Manslaughter!

Pillbug

DreadCthulhu posted:

Making the db be dumb makes my testing existence very comfortable, but I wonder if I'm seriously missing out here.

Anybody got thoughts on the whole validation shebang?

The last big Rails project I did just used Postgres constraints, and not Rails validations. I handled the raised errors in the controller because I only had one action that modified records anyways. You can't have the DB be dumb; you will have race conditions that lead to stupid poo poo, and at that point you might as well be using a broken database for idiots to defend, like MySQL or Mongo.

DreadCthulhu posted:

The case I'm thinking of is, for simplicity's sake, a user trying to create an account with email X. Email X is in the system already, you get an exception from the driver, you check to see if it's the uniqueness violation code, then you report the specific error to the caller of the API.
If you do this pattern a lot, figure out a better way to put it on the model instead of in the controllers.
Ruby code:
# UsersController
  def create
    @user = User.new email: params[:email]
    success = false
    begin
      success = @user.save
    rescue ActiveRecord::RecordNotUnique => e
      # do the @user.errors dance to put the error on whichever column failed, read the docs
    end
    
    return redirect_to root_path if success
    
    render action: 'new'
  end

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

Cocoa Crispies posted:

The last big Rails project I did just used Postgres constraints, and not Rails validations. I handled the raised errors in the controller because I only had one action that modified records anyways. You can't have the DB be dumb; you will have race conditions that lead to stupid poo poo, and at that point you might as well be using a broken database for idiots to defend, like MySQL or Mongo.

Am I correctly understanding that a serializable isolation level transaction should be able to take care of most of the scenarios where I need a bunch of tables to be in sync and follow a certain set of constraints? That's a pretty cheap and relatively painless solution I'm already using for most insertions, but I'm curious whether I'd ever need to consider table locks etc. for some edge cases.

Cocoa Crispies posted:

If you do this pattern a lot, figure out a better way to put it on the model instead of in the controllers.

The way I have it currently implemented is that I have a macro that wraps every db call in a try catch block that examines the result and packages it into an error-monad-style vector of results or errors that the controller is then able to translate to the right format expected by the caller. API callers get their error back, and HTML form submitters get their error explanations on the page.

Cocoa Crispies
Jul 20, 2001

Vehicular Manslaughter!

Pillbug

DreadCthulhu posted:

Am I correctly understanding that a serializable isolation level transaction should be able to take care of most of the scenarios where I need a bunch of tables to be in sync and follow a certain set of constraints? That's a pretty cheap and relatively painless solution I'm already using for most insertions, but I'm curious whether I'd ever need to consider table locks etc. for some edge cases.

Dunno; the last couple apps I made with constraints could simply treat a uniqueness violation as a success and move on.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

DreadCthulhu posted:

Am I correctly understanding that a serializable isolation level transaction should be able to take care of most of the scenarios where I need a bunch of tables to be in sync and follow a certain set of constraints?
While that should be able to "take care" of those scenarios, you have not yet given a use case for it here. You should be able to keep a bunch of tables "in sync", follow constraints, and handle all typical insertions and updates, at the default, read committed isolation level. If your application can't facilitate retries and proper error handling with a normal rate of transactions at that level, there's no way it will be ready to handle serializable.

One of the benefits of a layered design is that you have multiple chances to prevent a mistake, reducing the likelihood of data corruption over time. None of the layers should particularly "trust" what it's being told, nor should it waste much time on things that belong elsewhere. Remember that there are different techniques for different purposes, so while you can easily cache a "group permissions array" in an application, for example, you need to program as if larger things like tables of user data always require a database call.

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!
Well, one situation is, to steal the scenario from SO, I have a table of photographers and a table of photos. I don't want photographers to have more than 10 photos each.

My options are, I think:

  • run two queries, one check and one insert, in a serializable transaction (this way two transactions can't both see the same count; there's a concrete sketch at the end of this post). On that note, do I have to manually retry here if there's any sort of collision, or is a failure a 100% guarantee that retrying won't help?
  • implement a before-save trigger with a table lock so that two counts can't happen at once, and raise an exception if one attempts to insert past the cap.

The "Rails way" was so much easier, too bad it didn't work :P

Now the question is, am I missing anything up there, and if not, which approach is less of a perf hog?
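
To make the first option concrete, this is roughly what I have in mind (schema invented for the example):
code:
BEGIN ISOLATION LEVEL SERIALIZABLE;

INSERT INTO photos (photographer_id, url)
SELECT 7, 'http://example.com/pic.jpg'
WHERE (SELECT count(*) FROM photos WHERE photographer_id = 7) < 10;

COMMIT;  -- inserts 0 rows if the cap is already hit; concurrent attempts can fail with 40001 and need a retry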

McGlockenshire
Dec 16, 2005

GOLLOCKS!
Which database are you using? If you're using something that supports the CHECK constraint, then that and a UDF might be an alternative to the save trigger.

Another alternative might be doing the insert through a stored proc that also does the check.

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

McGlockenshire posted:

Which database are you using?

This is specific to Postgres 9.3. Check constraints, from what I can tell, are unable to do anything that goes beyond that specific row. I was actually hoping to put a subquery in there (my inserts into that table are relatively infrequent), but PG won't let you do that.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

DreadCthulhu posted:

Well, one situation is, to steal the scenario from SO, I have a table of photographers and a table of photos. I don't want photographers to have more than 10 photos each...
Your proposals all make me very sad. While it doesn't address some issues in the generic case, on user creation (probably) create ten photos for the user (row=userid,NULL photoid). When the application wants to add a photo, take an update lock on the first userid/NULL row and add the image there. If a second transaction attempts to update the same null row, it will block, then fail, and your retry can handle the scenario of no free buckets. Likewise, if the user wishes to replace a photo, it's in a specific slot/bucket, so you select the row for update.

In other words, the "upload a photo" portion of the application has no functionality nor permissions to add rows to the photo table, only to issue updates to existing rows. This locks a single row, namely a single row per user, for uploads, instead of an entire table (which could mean lots of things, but yeah) or, :ohdear: the entire transaction engine.

(The generic case isn't fixed at "ten", but you could create a "pre-update" routine in the "add pictures" portion of the application that would add a chunk of free buckets up to a limit so there were always a certain number showing as free. You can also do forceful things like always dropping off the last/oldest, but users tend to prefer errors to those types of behaviors.)
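
A rough sketch of the "add a photo" path under that scheme, with invented table and column names (Postgres flavor):
code:
BEGIN;

-- grab the first free bucket for this user; a concurrent upload holding it will make this block
SELECT slot
  FROM photos
 WHERE userid = 7 AND photoid IS NULL
 ORDER BY slot
 LIMIT 1
   FOR UPDATE;

-- no row back means no free buckets; otherwise fill the slot we just locked
UPDATE photos SET photoid = 'abc123' WHERE userid = 7 AND slot = 1;

COMMIT;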

IT BEGINS
Jan 15, 2009

I don't know how to make analogies
Are there any best practices for loading files into a database that don't have a simple structure?

I've got a CSV file that has a set of records that represent invoices and charges. In the CSV file, these records look like the following:

pre:
invoice 1, Charge 1
Invoice 1, Charge 2
Invoice 3, Charge 3
Invoice 2, Charge 1
Basically, a bunch of fields pertaining to an invoice, then a bunch of fields pertaining to a single charge. The invoice fields are repeated for every charge in that invoice in this file. In the past, I've loaded these files into an intermediary table, then converted the full records into records in the separate invoice and charge tables. However, this seems fairly inefficient to me - I'm doing a bunch of inserts twice. Is there a better way?

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

The problem here is that this is an area of SQL I'm very unfamiliar with. I have done little to no real concurrency-conscious work with performance and medium-sized data sets in mind, so a lot of what I propose is likely not to scale in any sane way. Is there anywhere I can learn more about this specific topic? Even the Postgres performance book barely grazes it, although it does mention that count is a tricky beast.

ManoliIsFat
Oct 4, 2002

Vasja posted:

Are there any best practices for loading files into a database that don't have a simple structure?

I've got a CSV file that has a set of records that represent invoices and charges. In the CSV file, these records look like the following:

pre:
invoice 1, Charge 1
Invoice 1, Charge 2
Invoice 3, Charge 3
Invoice 2, Charge 1
Basically, a bunch of fields pertaining to an invoice, then a bunch of fields pertaining to a single charge. The invoice fields are repeated for every charge in that invoice in this file. In the past, I've loaded these files into an intermediary table, then converted the full records into records in the separate invoice and charge tables. However, this seems fairly inefficient to me - I'm doing a bunch of inserts twice. Is there a better way?

If it were me I'd slice and dice it in a C# program, group them in memory, then make the SQL inserts. If you're just bulk inserting the file into a working table, then making your charges and invoices records, I don't think there's a much better way to do it. What do you mean about making the inserts twice?

If I had to do it all in SQL:

1) Bulk insert into a working table or temp table.
2) Do a GROUP BY to get unique invoices, and insert them into the invoices table.
3) Do a JOIN from my working table onto the invoices table to get the PKs of the invoices (this assumes you're not just PKing on the invoice ID of whatever jackass w/ Excel is sending you, and have two columns: one called pk that's your autoinc clustering key, and a varchar for invoice_id).
code:
INSERT INTO charges (invoice_pk, ...)
SELECT invoices.pk, ...
FROM working_table
JOIN invoices ON working_table.invoice_id = invoices.invoice_id
(ps if you do this all in a big transaction, look to your specific DB on what type of isolation scope you'd want. I think you'd want to do SERIALIZABLE in MSSQL http://technet.microsoft.com/en-us/library/ms173763.aspx)
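
And step 2, the grouped insert into invoices, would be something like this (column names made up, since I don't know what's in the file):
code:
INSERT INTO invoices (invoice_id, customer_name, invoice_date)
SELECT invoice_id, MIN(customer_name), MIN(invoice_date)  -- invoice fields repeat on every charge row, so collapse them
FROM working_table
GROUP BY invoice_id;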

lord funk
Feb 16, 2004

What is a fast alternative to powf()? I'll be taking other steps to optimize like not calling the method so often, but even then I'm looking at a sizable number of calls to it. Unfortunately this isn't something that can be vectorized. On the plus side accuracy is not important.

edit: wow, holy nevermind. I started piecing apart what the bottleneck was and powf() had absolutely nothing to do with it.

lord funk fucked around with this message at 02:50 on Oct 26, 2013

IT BEGINS
Jan 15, 2009

I don't know how to make analogies
Thanks for the help. For the moment, I'll stick to doing the bulk insert option. I'd like to slice and dice it in PHP, but I feel this way will be faster and more clear.

raminasi
Jan 25, 2005

a last drink with no ice
I need some design advice. Say I've got some Buildings that I need to store in my database (this is SQLite for the moment, but if there's some better option out there I'd love to hear it). Each Building has a name, and an ID, and some location information, and whatever. That's all fine. What each Building also has is some time-series performance data, but the particular data isn't consistent across each Building. For example, Building A might have twelve monthly heating load values, and Building B might have 8760 hourly heating load values, and Building C might have both the monthly and hourly values, and they don't match. I don't have a list beforehand of what the potential data might be. How do I store this? Googling this just gives freshman-level "this is how to set up a many-to-many relationship" answers, which is not what I need.

Bognar
Aug 4, 2011

I am the queen of France
Hot Rope Guy
Easiest way* is to just convert it to a blob (binary data, text, etc.) and store it in a single column.

*up front, anyway

raminasi
Jan 25, 2005

a last drink with no ice
Are there any good practices for blobbing un-schemad data like this or do I just #yolo up some ad-hoc encoding?

Bognar
Aug 4, 2011

I am the queen of France
Hot Rope Guy
Really depends on what you're doing with it. However, you could just lean on your language of choice's serialization tools and convert it to JSON/XML/whatever.

Xae
Jan 19, 2005

GrumpyDoctor posted:

I need some design advice. Say I've got some Buildings that I need to store in my database (this is SQLite for the moment, but if there's some better option out there I'd love to hear it). Each Building has a name, and an ID, and some location information, and whatever. That's all fine. What each Building also has is some time-series performance data, but the particular data isn't consistent across each Building. For example, Building A might have twelve monthly heating load values, and Building B might have 8760 hourly heating load values, and Building C might have both the monthly and hourly values, and they don't match. I don't have a list beforehand of what the potential data might be. How do I store this? Googling this just gives freshman-level "this is how to set up a many-to-many relationship" answers, which is not what I need.

You can structure it.

BuildingID, MetricCD, FrequencyCD, MetricValue, TimeStamp

MetricCD would be the heating load metric, FrequencyCD would be a daily/monthly code, and the value goes in MetricValue. If there are separate sensors you can add a SourceID.
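
As an actual table definition that's something like this (types guessed, and assuming a buildings table with an integer id):
code:
CREATE TABLE building_metrics (
    building_id  INTEGER NOT NULL REFERENCES buildings(id),
    metric_cd    TEXT    NOT NULL,   -- e.g. 'heating_load'
    frequency_cd TEXT    NOT NULL,   -- e.g. 'hourly', 'monthly', 'annual'
    metric_value REAL    NOT NULL,
    time_stamp   TEXT    NOT NULL,
    PRIMARY KEY (building_id, metric_cd, frequency_cd, time_stamp)
);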

Xae fucked around with this message at 20:05 on Oct 27, 2013

Cocoa Crispies
Jul 20, 2001

Vehicular Manslaughter!

Pillbug

GrumpyDoctor posted:

Are there any good practices for blobbing un-schemad data like this or do I just #yolo up some ad-hoc encoding?

Is it truly untyped, or is it that each Building has 0…∞ Series, a Series is a measure name and measurement unit, a Series has 0…∞ Measures, which are a timestamp and a value?
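
i.e., in table form, something like this (names invented, SQLite-friendly, assuming a buildings table with an integer id):
code:
CREATE TABLE series (
    id          INTEGER PRIMARY KEY,
    building_id INTEGER NOT NULL REFERENCES buildings(id),
    name        TEXT    NOT NULL,   -- e.g. 'heating load'
    unit        TEXT    NOT NULL,   -- e.g. 'kWh'
    UNIQUE (building_id, name, unit)
);

CREATE TABLE measures (
    series_id INTEGER NOT NULL REFERENCES series(id),
    ts        TEXT    NOT NULL,
    value     REAL    NOT NULL,
    PRIMARY KEY (series_id, ts)
);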

raminasi
Jan 25, 2005

a last drink with no ice

Bognar posted:

Really depends on what you're doing with it. However, you could just lean on your language of choice's serialization tools and convert it to JSON/XML/whatever.

Do you mean just serialize for the blob column? That's a good idea.

Xae posted:

You can structure it.

BuildingID, MetricCD, FrequencyCD, MetricValue, TimeStamp

MetricCD would be the heating load metric, FrequencyCD would be a daily/monthly code, and the value goes in MetricValue. If there are separate sensors you can add a SourceID.

Cocoa Crispies posted:

Is it truly untyped, or is it that each Building has 0…∞ Series, a Series is a measure name and measurement unit, a Series has 0…∞ Measures, which are a timestamp and a value?

These are good analyses. I think the reason I didn't think of them is that the data isn't really timestamped, because this is simulated data, not real data. "Faking" timestamps for each hour of the year is plausible, but as the data points get sparser it gets weirder, and by the time a series has a single measure, because it's just the total annual heating load, a timestamp (dated when, January 1st?) seems downright silly. Of course, I could replace "timestamp" with "series index." Is that weird? That was kind of my original question.
