|
DreadCthulhu posted:Awful question of the day, would love to pick your brains on this: the product I work on allows people to answer questions. I need to record the # of times each of these uniquely identifiable questions were answered wrong/right and when, and then do some analytics on it. I'm on Postgres 9.2. 200 million rows in a year is really not huge. Assuming you don't have absurdly wide rows (with lots of TEXT columns) it'll take up a couple tens of gigabytes on disk. Table scans won't happen in less than a second but with appropriate indexing you could answer most of the queries you'll be putting into the app pretty rapidly. If you want more in-depth advice on this you should probably outline a bit more of the use case and the data you'll need to be storing.
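As a ballpark check on that "couple tens of gigabytes" figure (a sketch only; the per-row byte count is an assumption based on Postgres's roughly 24-byte tuple overhead plus a few narrow columns, not something from the post):

```python
# Back-of-envelope sizing for 200M narrow rows.
# Assumes ~24 bytes of per-tuple overhead plus a few int/timestamp/bool
# columns; real Postgres numbers vary with alignment and padding.
rows = 200_000_000
bytes_per_row = 24 + 4 + 4 + 4 + 8 + 1  # overhead + 3 ints + timestamp + bool
total_gb = rows * bytes_per_row / 1024**3
print(f"{total_gb:.1f} GB")  # single-digit GB of heap, before indexes
```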
|
# ? Jul 12, 2013 05:07 |
|
|

|
DreadCthulhu posted:I can think of a bunch of ways of collapsing some of this data into fewer rows, like creating daily counters per question rather than having pure raw data. However, I also don't want to do any sort of overcomplicated premature optimization, falling into YAGNI. Your problem here is that each datapoint you're looking at is a single bit. As far as I'm aware Postgres does not have an efficient way to deal with that and you'll end up with something like code:
Otherwise you may need to look at a system of batching. So record the rows as above and have a nightly job calculate daily totals and update an aggregate table. This way your analysis query can be fast but you can change the reporting at a later date if you need to. Having 200 million rows sitting on your disk isn't going to slow anything down unless you scan the whole table and flush your disk cache. quote:Part 2: on an unrelated note, is it ever ok NOT to have a primary id in a table? In the analytics example, I might not really ever need to identify a datapoint uniquely and might want to save 4-8 bytes. Is it cool to never create that column? If it's some sort of output table that nothing else in the db is ever going to need to refer to then sure.
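The raw-table-plus-nightly-rollup idea can be sketched like this (SQLite via Python for a runnable illustration; all table and column names here are invented, and the Postgres version would differ only in minor syntax):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE answer (
        question_id INTEGER NOT NULL,
        person_id   INTEGER NOT NULL,
        answered_on TEXT    NOT NULL,   -- date of the answer
        correct     INTEGER NOT NULL    -- 0 or 1
    );
    CREATE TABLE daily_totals (
        question_id INTEGER NOT NULL,
        day         TEXT    NOT NULL,
        n_right     INTEGER NOT NULL,
        n_wrong     INTEGER NOT NULL,
        PRIMARY KEY (question_id, day)
    );
""")
db.executemany(
    "INSERT INTO answer VALUES (?, ?, ?, ?)",
    [(1, 10, "2013-07-12", 1),
     (1, 11, "2013-07-12", 0),
     (1, 12, "2013-07-12", 1),
     (2, 10, "2013-07-12", 0)],
)

# The "nightly job": collapse raw rows into one aggregate row
# per question per day; the raw rows stay around in case the
# reporting requirements change later.
db.execute("""
    INSERT INTO daily_totals
    SELECT question_id, answered_on,
           SUM(correct), SUM(1 - correct)
    FROM answer
    GROUP BY question_id, answered_on
""")
print(db.execute(
    "SELECT * FROM daily_totals ORDER BY question_id").fetchall())
# question 1: 2 right, 1 wrong; question 2: 0 right, 1 wrong
```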
|
# ? Jul 12, 2013 10:10 |
|
DreadCthulhu posted:Awful question of the day... I need to record the # of times each of these uniquely identifiable questions were answered wrong/right and when... YAGNI... raw datapoints for now until I see the performance degrade to justify optimization? Also, is there an accepted way of "compacting" old data that might have little to no value at its original degree of granularity? For your part two, if you follow the table design you gave yourself at the beginning, the table structure given by Super Delegate does have a primary key, it just doesn't happen to be a single column constructed for that purpose. (This is a valid use of the term 'primary key' in PG, though it might be distinguished from an abstract 'compound key'.) Indeed, (SurveyResponseID,PersonID,QuestionID,AnswerDate,Correct) is your key, and there's nothing particularly wrong with that. While most tables benefit from a separate PK column, it is of little use here. As others have mentioned, your data set is really not that large. You have used a word that throws crowds into fits of hash stores and NoSQL chaos, but with a proper index or three, you should be able to get the data you need very quickly. Make sure your columns are all plain storage so you don't have to deal with extended column retrieval. That you should never UPDATE this data is very useful, so dinking with table and index storage parameters and considering CLUSTER may prove useful. For now, stick with what appears to be the most straightforward storage approach. The trouble with YAGNI here is that it is appealing from a runtime point of view, but you cannot 'unsummarize/deaggregate' data if you haven't stored it. There is a great deal of potential algorithm design and optimization hiding here, and until you have a notion of those needs, you'll be wasting time building useless procedures. Any optimizations that work for even a month of data will most likely still help you in future redesigns.
|
# ? Jul 12, 2013 12:36 |
|
Anaxandrides posted:What do you suggest instead, out of curiosity? Records are inserted using LINQ/EF half the time, so logic inside procedures/functions to audit is not possible. Without getting the devs to do it application-side, any ideas? Digest the statement log files that the database is already writing out to the file system using something that is not the database itself.
|
# ? Jul 12, 2013 13:47 |
|
Beelzebozo posted:Digest the statement log files that the database is already writing out to the file system using something that is not the database itself. Is this an honest answer? That it would be better to deconstruct transaction logs and then reinsert audit records to an audit table using some homebrew application than it would be to process the transactions as they come in? And how, then, is one to exclude the audit inserts from being processed the next time one analyzes the log? Create a custom user just to do the auditing? It seems to me that this is the very definition of making up work for oneself. I don't see any advantage at all in doing this. Can you please explain the benefits here?
|
# ? Jul 12, 2013 17:02 |
|
PhantomOfTheCopier posted:For your part two, if you follow the table design you gave yourself at the beginning, the table structure given by Super Delegate does have a primary key, it just doesn't happen to be a single column constructed for that purpose. (This is a valid use of the term 'primary key' in PG, though it might be distinguished from an abstract 'compound key'.) Indeed, (SurveyResponseID,PersonID,QuestionID,AnswerDate,Correct) is your key, and there's nothing particularly wrong with that. While most tables benefit from a separate PK column, it is of little use here. I do still highly recommend that a proper primary key and clustered index be created, otherwise you end up with a heap table. This table would be slow indeed for data aggregation if it existed as a heap.
|
# ? Jul 12, 2013 17:48 |
|
Anaxandrides posted:Is this an honest answer? That it would be better to deconstruct transaction logs and then reinsert audit records to an audit table using some homebrew application than it would be to process the transactions as they come in? And how, then, is one to exclude the audit inserts from being processed the next time one analyzes the log? Create a custom user just to do the auditing? Who said anything about reinserting audit records to an audit table in that same database?! The whole point is not to introduce a bunch of additional overhead to your OLTP system. Why do you need the audit trail there? No, take the logs off and process them separately, independent of your transaction system in either a separate logging shadow environment or in your data warehouse / OLAP system. If you're set on reinserting the audit data into the database it describes, then yes, you may as well just hang triggers everywhere. At that point it's just pay now vs. pay later. I'm merely suggesting you needn't pay at all, in that system.
|
# ? Jul 12, 2013 18:32 |
|
Beelzebozo posted:Who said anything about reinserting audit records to an audit table in that same database?! The whole point is not to introduce a bunch of additional overhead to your OLTP system. Why do you need the audit trail there? No, take the logs off and process them separately, independent of your transaction system in either a separate logging shadow environment or in your data warehouse / OLAP system. If you're set on reinserting the audit data into the database it describes, then yes, you may as well just hang triggers everywhere. At that point it's just pay now vs. pay later. I'm merely suggesting you needn't pay at all, in that system. Ah, I understand what you're saying here. I should have specified -- triggers are good for audit history that is visible to users of the selfsame application that is making the changes to be tracked. We do not use triggers for OLAP auditing or change tracking, in that sense. The audit tables I'm referring to here are used to display "history" to users on changes made, and are generally applied after-the-fact to snapshot-style tables that don't track history, rather than transactional-style tables where a new record is inserted for every change, and which self-audit. We do, however, all hate making triggers of this sort, as it usually signifies that a design failure has been made somewhere along the line, or that the users have requested some visibility on changes made -- it's not a real surprise that users ask for this sort of thing, but developers are often surprised. Generally speaking, transactional-style tables are preferred. But DBAs are not included in initial design discussions and meetings on new applications at the company I work for, so oftentimes these tables are presented as a fait accompli after major development work has already been done. So, with that in mind, if there's some idea out there to present this sort of functionality to users without the use of triggers, I am all for it.
|
# ? Jul 12, 2013 19:07 |
|
Anaxandrides posted:I do still highly recommend that a proper primary key and clustered index be created, otherwise you end up with a heap table. This table would be slow indeed for data aggregation if it existed as a heap. But why create a separate ID field when DreadCthulhu is likely to want to enforce unique+notnull on (s,p,q,d) in any case, and the index on ID doesn't apply to any queries planned for the data. Based on the earlier description, we can surmise needing at least an index on QuestionID, but would an additional ID PRIMARY KEY make those scans any faster? As this StackExchange discussion suggests, PostgreSQL is more likely to make use of leading columns in a multi-column index, so I still feel like a primary key of (q,p,s,d?) is ample for this case. (It seems an index on person may well be required in both cases.) Of course, what we haven't heard from DreadCthulhu is whether or not a good supply of one million faked data points has been inserted into the relevant database and a few of the approaches profiled for performance based on the needs of their actual algorithm. Assuming no data changes, nor really even any deletes except routine maintenance, it seems like CLUSTER plus proper partial indexes might make the existence or lack of a singular ID field imperceptible or superfluous. edit: vvvv I misread. Thanks for clarifying, Anaxandrides. We can now agree. PhantomOfTheCopier fucked around with this message at 04:57 on Jul 13, 2013 |
# ? Jul 12, 2013 22:22 |
|
Thanks for the tips everybody, very interesting read. Apologies if it came off as having you guys do the work, I should have posted my first take on this as a starting point. As PhantomOfTheCopier suggested, it'd make the most sense to do some testing on a larger dataset and see how it performs. And yes, 200 million rows isn't awful, and that's basically the worst case scenario. In reality I'm sure we'll get a lot fewer hits, at first at least. I like the aggregate table suggestion, it seems like it'd be a very common scenario for a lot of analytics products. As in, at some point you might no longer care about fine-grained data from 6 months ago, and you can remove a ton of rows from that table. As you guys pointed out, you'd mostly be saving space, not necessarily performance, if the table is sorted correctly by date. Just to add more detail, the app adaptively throws random questions at the student and she just answers them, up to a few hundred times a day. No such concept as batch, or session, or survey, which should make things a bit simpler. I'm thinking of going with something like: code:
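The code block above didn't survive the archive; a guess at the shape of the table being described, with invented column names (SQLite via Python just for a runnable illustration; the composite primary key follows PhantomOfTheCopier's suggestion):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One row per answer, no surrogate id, no session/survey concept.
# Column names and the composite key are guesses, not the poster's code.
db.execute("""
    CREATE TABLE answer (
        question_id INTEGER NOT NULL,
        student_id  INTEGER NOT NULL,
        answered_at TEXT    NOT NULL,
        correct     INTEGER NOT NULL,   -- 0 or 1
        PRIMARY KEY (question_id, student_id, answered_at)
    )
""")
db.execute("INSERT INTO answer VALUES (42, 7, '2013-07-12 23:17:00', 1)")
# The composite key doubles as the uniqueness guarantee: the same
# student answering the same question at the same instant is rejected.
try:
    db.execute("INSERT INTO answer VALUES (42, 7, '2013-07-12 23:17:00', 0)")
except sqlite3.IntegrityError:
    print("duplicate rejected")
```

In Postgres the same composite PRIMARY KEY also provides the leading-column multicolumn index discussed above.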
I'm going to have to brush up on "Use the Index, Luke" to get the multicolumn indexes right here for the different scenarios. I also haven't worked with CLUSTER before, so I'm going to have to read up on that.
|
# ? Jul 12, 2013 23:17 |
|
PhantomOfTheCopier posted:I guess I don't see the benefit of having a dedicated primary key using a sequence in this case. Adding that gives (ID,SurveyResponseID,PersonID,QuestionID,AnswerDate,Correct), a NOT NULL constraint to check, and a UNIQUE index on (ID) to check, update, and for queries, plus the sequence to nextval() on insertions. In contrast, I propose creating the table with PRIMARY KEY (question,person,survey) (and maybe date?), requiring a multicolumn NOT NULL constraint and forcing a UNIQUEness on (q,p,s,d?). I'm not suggesting that a separate ID field be created -- there's no need to overthink this. A primary key of (q,p,s,d) would be fine, so long as it's actually created as a clustered index. When I said that a proper primary key should be created, I meant "Don't treat this logic as an implicit primary key -- make sure you actually declare one programmatically and create it as a clustered index". Otherwise you end up with problems.
|
# ? Jul 12, 2013 23:18 |
|
DreadCthulhu posted:
Is category an attribute off the answer, or the question? If the latter, then it should be on the question table, not this one. I would still add an identity type column, and supporting indexes for the foreign key columns.
|
# ? Jul 13, 2013 00:37 |
|
Anaxandrides posted:Honestly, this sounds like a right tool for the right job question. Is there some reason why you can't rip through this data using Python or whatnot in order to extract it, given that you're working with a limited toolset? You know, I thought about this some more and I don't know why it didn't occur to me, but the easiest solution was just to take one of the nightly database dumps and restore it in a 9.1 server.
|
# ? Jul 13, 2013 04:07 |
|
RICHUNCLEPENNYBAGS posted:You know, I thought about this some more and I don't know why it didn't occur to me, but the easiest solution was just to take one of the nightly database dumps and restore it in a 9.1 server. poo poo, I was going to suggest that but assumed you had some reason you couldn't.
|
# ? Jul 13, 2013 12:45 |
|
Nth Doctor posted:Is category an attribute off the answer, or the question? If the latter, then it should be on the question table, not this one. That's a good observation, it's an attribute of question as you suggested. I'll move categories into their own relation and see if the two joins are making much of a difference. And yes, good reminder with the fk indexes!
|
# ? Jul 13, 2013 19:45 |
|
House has many Rooms. Room has many Chairs. Then, I want a HouseStuffHolder. A HouseStuffHolder can have one House, one Room, or one Chair. But: House has many HouseStuffHolders. Room has many HouseStuffHolders. Chair has many HouseStuffHolders.
House: id
Room: id, houseId
Chair: id, roomId
Does the following table sound stupid?
HouseStuffHolder: id, houseId, roomId, chairId
I don't like that there are three foreign-key columns when really only one should ever be filled in, because a HouseStuffHolder should belong to only one other "thing" (i.e., House, Room, or Chair). Is there a better way?
|
# ? Jul 15, 2013 08:08 |
|
Why would you want a HouseStuffHolder table at all? The table setup you gave seems ok.
pre:
House: ID
Room: ID, houseID
Chair: ID, roomID
pre:
House: ID
Room: ID, houseID
Chair: ID
Chair_MTM_Room: chairID, roomID
RichardA fucked around with this message at 09:40 on Jul 15, 2013 |
# ? Jul 15, 2013 09:32 |
|
I have a question which probably has a stupid simple answer but I can't figure it out. (SQL Server) Say I have these fields and values code:
So the result for the above example would be 5 and 10. Thanks
|
# ? Jul 15, 2013 13:29 |
|
Use a self-anti-join like so:SQL code:
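The SQL block above was lost in extraction. Reconstructing the idea on a guessed schema (an `id` column plus an integer `flag`, where the "problematic" ids are those that never have a `flag = 1` row; the real column names are unknown), using SQLite via Python so it runs:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (id INTEGER, flag INTEGER)")
db.executemany("INSERT INTO t VALUES (?, ?)",
               [(5, 0), (7, 0), (7, 1), (10, 0), (12, 1)])

# Self-anti-join: keep ids that have a flag=0 row but no matching
# flag=1 row anywhere in the table.
rows = db.execute("""
    SELECT DISTINCT a.id
    FROM t a
    LEFT JOIN t b ON b.id = a.id AND b.flag = 1
    WHERE a.flag = 0 AND b.id IS NULL
    ORDER BY a.id
""").fetchall()
print(rows)  # [(5,), (10,)]
```

With these made-up values the query returns 5 and 10, matching the expected result described in the question.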
Beelzebozo fucked around with this message at 13:48 on Jul 15, 2013 |
# ? Jul 15, 2013 13:41 |
|
Haha awesome thanks a lot. Gave me the 7 problematic rows instantly.
|
# ? Jul 15, 2013 13:51 |
|
Any reason to prefer an anti-join over using something like this? SQL code:
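The stripped block here was presumably the MAX-aggregate formulation mentioned later in the thread; assuming the same guessed schema of an `id` column plus an integer `flag` (SQLite via Python for a runnable sketch):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (id INTEGER, flag INTEGER)")
db.executemany("INSERT INTO t VALUES (?, ?)",
               [(5, 0), (7, 0), (7, 1), (10, 0), (12, 1)])

# Aggregate formulation: an id is "problematic" if its highest
# flag value is still 0, i.e. it never got a flag=1 row.
rows = db.execute("""
    SELECT id FROM t
    GROUP BY id
    HAVING MAX(flag) = 0
    ORDER BY id
""").fetchall()
print(rows)  # [(5,), (10,)]
```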
|
# ? Jul 15, 2013 14:05 |
|
RichardA posted:Any reason to prefer a anti join over using something like this? Nope, that works too. I honestly couldn't even guess which the MS SQL optimizer would prefer. I just tend to think in set logic and don't really like the look of a MAX aggregate over something which is ostensibly a BOOLEAN declared as an INTEGER. (I would've declared the column a BOOLEAN in postgres, but I'm willing to accept that there could be some valid raisin not to do that in MS SQL, about which I know very close to nothing.)
|
# ? Jul 15, 2013 14:20 |
|
I wasn't sure if this was the best place to post this or not. I was curious if anyone has been to one of these SQL in the city events sponsored by Red Gate (http://sqlinthecity.red-gate.com/). I am just trying to get an idea of what these are like. They seem like a marketing talk for their tools so I'm trying to see if it's worth going to even if you don't use their tools. If anyone has been to one I would love to hear your experience from it. I've been able to find a little bit of information on them but nothing really substantial. I apologize if this was the wrong place to ask this but I wasn't sure which thread would be the best fit.
|
# ? Jul 15, 2013 18:30 |
|
Beelzebozo posted:Nope, that works too. I honestly couldn't even guess which the MS SQL optimizer would prefer. I just tend to think in set logic and don't really like the look of a MAX aggregate over something which is ostensibly a BOOLEAN declared as an INTEGER. (I would've declared the column a BOOLEAN in postgres, but I'm willing to accept that there could be some valid raisin not to do that in MS SQL, about which I know very close to nothing.) I'm pretty sure the left anti-semi join is better performance in MSSQL, and it's certainly more readable and intuitive. Just as important.
|
# ? Jul 15, 2013 19:14 |
|
hi, i have a database which basically takes these "lots" that the company receives, and links them to a table describing all of the items in each lot. my item table contains a field called "Type", which can be several different values. It also contains a field called "LotNo", which acts as a foreign key to link it to the correct lot How can I make a check constraint to only allow one item of type "Miscellaneous" per lot? (the company basically throws whatever's left into a bin and weighs the whole thing as misc) There can be any number of items of different types, and there can be multiple miscellaneous items in the whole table, but only one miscellaneous item per lot.
|
# ? Jul 15, 2013 19:52 |
|
jryand posted:How can I make a check constraint to only allow one item of type "Miscellaneous" per lot? This comes up a lot. You can't write a CHECK constraint that considers tuples outside the scope of the one being checked. You need to use triggers to enforce this, which will burden your table with additional overhead. I've frankly never seen a solution to this problem that is clean enough for my liking. Welcome to the land of suboptimal decisions.
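For what it's worth, engines that support partial indexes (PostgreSQL, and SQLite 3.8+, used here via Python purely for a runnable sketch) can enforce this particular rule without triggers, using a partial unique index. Table and column names below are guesses at the poster's schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE item (
        lot_no INTEGER NOT NULL,
        type   TEXT    NOT NULL
    );
    -- At most one 'Miscellaneous' row per lot, enforced by the
    -- engine itself rather than a trigger.
    CREATE UNIQUE INDEX one_misc_per_lot
        ON item (lot_no) WHERE type = 'Miscellaneous';
""")
db.execute("INSERT INTO item VALUES (1, 'Miscellaneous')")
db.execute("INSERT INTO item VALUES (1, 'Widget')")         # other types: fine
db.execute("INSERT INTO item VALUES (2, 'Miscellaneous')")  # other lots: fine
try:
    db.execute("INSERT INTO item VALUES (1, 'Miscellaneous')")
except sqlite3.IntegrityError as e:
    print("second misc in lot 1 rejected:", e)
```

Whether the poster's DBMS supports this is unknown; in engines without partial indexes, triggers remain the usual fallback.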
|
# ? Jul 15, 2013 19:57 |
|
jryand posted:hi, i have a database which basically takes these "lots" that the company receives, and links them to a table describing all of the items in each lot. If this is SQL Server you could do something crazy like using an indexed view with a PK based on LotNo and item types, with a WHERE clause restricting to the Miscellaneous type.
|
# ? Jul 15, 2013 20:28 |
|
Safe and Secure! posted:House, Rooms, Chairs, Stuffs
pre:
Container: ID, Name, TypeID, ParentID
Stuff: ID, ContainerID, ItemID
Items: ID, etc.
ContainerTypes: ID, Name
pre:
House: ID, Name
Room: ID, Name, HouseID
Chair: ID, Name, RoomID
Stuff: ID, House, Room, Chair, etc
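If the three-nullable-FK table from the original question is kept anyway, a CHECK constraint can at least guarantee that exactly one parent reference is set (sometimes called an "exclusive arc"). A sketch in SQLite via Python, using the question's column names:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE house_stuff_holder (
        id      INTEGER PRIMARY KEY,
        houseId INTEGER,
        roomId  INTEGER,
        chairId INTEGER,
        -- exactly one of the three parent references must be set;
        -- each IS NOT NULL test evaluates to 0 or 1
        CHECK ((houseId IS NOT NULL) + (roomId IS NOT NULL)
             + (chairId IS NOT NULL) = 1)
    )
""")
db.execute("INSERT INTO house_stuff_holder VALUES (1, 5, NULL, NULL)")  # ok
try:
    db.execute("INSERT INTO house_stuff_holder VALUES (2, 5, 3, NULL)")
except sqlite3.IntegrityError:
    print("row with two parents rejected")
```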
|
# ? Jul 15, 2013 20:32 |
|
Beelzebozo posted:This comes up a lot. You can't write a CHECK constraint that considers tuples outside the scope of the one being checked. You need to use triggers to enforce this, which will burden your table with additional overhead. I've frankly never seen a solution to this problem that is clean enough for my liking. Welcome to the land of suboptimal decisions. Roll your own database.
In all seriousness though, the function space is just too large for a DBMS to offer arbitrary table-wide constraints natively.
|
# ? Jul 15, 2013 20:45 |
|
I have a stupid question that I'm not sure how to even Google for an answer because I'm not sure what it's called. Essentially what I need to do is this: code:
If someone can just give me a term to lookup on Google, I can probably go from there, I just don't know what to look for.
|
# ? Jul 16, 2013 20:51 |
|
Sudden Infant Def Syndrome posted:…and so on and so on. Google the term "transitive closure" and find out just how bad SQL intrinsically is at it.
|
# ? Jul 16, 2013 20:59 |
|
Sudden Infant Def Syndrome posted:I have a stupid question that I'm not sure how to even Google for an answer because I'm not sure what it's called. What RDBMS? In SQL Server I'd suggest a recursive CTE, other RDBMS may have similar constructs so let us know?
|
# ? Jul 16, 2013 21:11 |
|
Beelzebozo posted:the term "transitive closure" and find out just how bad SQL intrinsically is at it. Oh man
Molotov Cock Tale posted:What RDBMS? In SQL Server I'd suggest a recursive CTE, other RDBMS may have similar constructs so let us know? SQL Server 2012
|
# ? Jul 16, 2013 21:19 |
|
So I've got a question finally: My company has several instances of production servers with identical schema but different data on the different servers. We use replication (both merge and transactional) to keep some tables the same across the different shards. We've gotten to the point where replication is more of a hindrance than a benefit and are looking into replacing it with something else. Has anyone used Change Tracking in SQL Server 2008 to do something like this? Any other tools? Nth Doctor fucked around with this message at 21:25 on Jul 16, 2013 |
# ? Jul 16, 2013 21:22 |
|
Sudden Infant Def Syndrome posted:SQL Server 2012 Ok on SQL2012 you should be able to do this with a recursive CTE. The anchor is what you posted, union all to the same, on recipe.id = ingredient.recipeid or however you've done that.
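A sketch of that recursive CTE on a guessed recipe/ingredient schema (SQLite via Python so it runs; SQL Server 2012 syntax is nearly identical, except it omits the RECURSIVE keyword):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE ingredient (
        recipe_id     INTEGER NOT NULL,  -- the recipe this line belongs to
        item          TEXT,              -- a plain ingredient, or NULL...
        sub_recipe_id INTEGER            -- ...when the line is a sub-recipe
    );
    -- recipe 1 (cake) uses flour, eggs, and recipe 2 (frosting);
    -- recipe 2 uses sugar and butter.
    INSERT INTO ingredient VALUES
        (1, 'flour', NULL), (1, 'eggs', NULL), (1, NULL, 2),
        (2, 'sugar', NULL), (2, 'butter', NULL);
""")
rows = db.execute("""
    WITH RECURSIVE expanded(recipe_id, item, sub_recipe_id) AS (
        SELECT recipe_id, item, sub_recipe_id
        FROM ingredient WHERE recipe_id = 1          -- anchor: top recipe
        UNION ALL
        SELECT i.recipe_id, i.item, i.sub_recipe_id  -- recurse into subs
        FROM ingredient i
        JOIN expanded e ON i.recipe_id = e.sub_recipe_id
    )
    SELECT item FROM expanded WHERE item IS NOT NULL ORDER BY item
""").fetchall()
print(rows)  # [('butter',), ('eggs',), ('flour',), ('sugar',)]
```

All names here are invented for illustration; the anchor query corresponds to "what you posted" in the reply above, and the UNION ALL branch is the join on recipe id to sub-recipe id.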
|
# ? Jul 16, 2013 22:46 |
|
Nth Doctor posted:So I've got a question finally: We've used SQL2008 Change Tracking, yes, but I really recommend Change Data Capture if what you're looking for is selective replication. You can process CDC logs with SSIS, which makes for a much better and more atomic solution. This is my experience, at the least.
|
# ? Jul 16, 2013 22:56 |
|
Anaxandrides posted:We've used SQL2008 Change Tracking, yes, but I really recommend Change Data Capture if what you're looking for is selective replication. You can process CDC logs with SSIS, which makes for a much better and more atomic solution. This is my experience, at the least. We're going to be doing selective tables, but in most cases everything in the particular table will be shuffled around to the other instances.
|
# ? Jul 17, 2013 01:54 |
|
Nth Doctor posted:We're going to be doing selective tables, but in most cases everything in the particular table will be shuffled around to the other instances. Given this, what about replication is causing troubles for you?
|
# ? Jul 17, 2013 06:34 |
|
Anaxandrides posted:Given this, what about replication is causing troubles for you? Recovering from server crashes, replication takes so long to spin up that we blow our SLAs.
|
# ? Jul 17, 2013 12:28 |
|
|
|
Nth Doctor posted:Recovering from server crashes, replication takes so long to spin up that we blow our SLAs. Well, that's not the expected answer. What in the world is causing server crashes frequently enough that you're looking at moving away from replication in order to ease recovery?
|
# ? Jul 17, 2013 16:43 |