spitefulcrow
Jun 16, 2013

DreadCthulhu posted:

Awful question of the day, would love to pick your brains on this: the product I work on allows people to answer questions. I need to record the # of times each of these uniquely identifiable questions was answered wrong/right and when, and then do some analytics on it. I'm on Postgres 9.2.

The brute force approach would be to just create a new row each time a question is answered; that'd give one the most flexibility, but also bloat the table to massive size in no time, given that we have over 10k questions and a user can answer 100+ questions in a day. Say we have 5k daily users; assuming they all answer different questions, that'd be 100 * 5000 * 365, or roughly 200 million rows a year. If I need to do some real-time crunching on this (under 1s) to get something like "how much better are you doing than others within timeframe x", I'm not sure it could perform fast enough.

I can think of a bunch of ways of collapsing some of this data into fewer rows, like creating daily counters per question rather than having pure raw data. However, I also don't want to do any sort of overcomplicated premature optimization, falling into YAGNI.

Doing analytics on large tables seems like a very old, solved problem; I just haven't dealt with it before, so I could use advice. Should I just work with raw datapoints for now until I see the performance degrade to justify optimization? Also, is there an accepted way of "compacting" old data that might have little to no value at its original degree of granularity?

Part 2: on an unrelated note, is it ever ok NOT to have a primary id in a table? In the analytics example, I might not really ever need to identify a datapoint uniquely and might want to save 4-8 bytes. Is it cool to never create that column?

200 million rows in a year is really not huge. Assuming you don't have absurdly wide rows (with lots of TEXT columns), it'll take up a couple tens of gigabytes on disk: Postgres's per-row overhead is roughly 28 bytes, so a handful of integer/boolean/timestamp columns lands around 40 bytes per row, or about 8 GB of heap for 200 million rows, plus indexes. Table scans won't happen in less than a second, but with appropriate indexing you could answer most of the queries you'll be putting into the app pretty rapidly.

If you want more in-depth advice on this you should probably outline a bit more of the use case and the data you'll need to be storing.

Zombywuf
Mar 29, 2008

DreadCthulhu posted:

I can think of a bunch of ways of collapsing some of this data into fewer rows, like creating daily counters per question rather than having pure raw data. However, I also don't want to do any sort of overcomplicated premature optimization, falling into YAGNI.

Your problem here is that each datapoint you're looking at is a single bit. As far as I'm aware Postgres does not have an efficient way to deal with that and you'll end up with something like
code:
  fk_question_id integer not null,
  datetime timestamptz not null,
  correct boolean not null
which is 13 bytes wide, giving a ratio of 103:1 overhead to data, which is excessive. If you are sure, and I mean absolutely sure, that the analysis you are going to be performing in 18 months' time is only going to need total counts, then simply maintaining running totals (using stored procedures and transactions) will be fine.

Otherwise you may need to look at a system of batching. So record the rows as above and have a nightly job calculate daily totals and update an aggregate table. This way your analysis query can be fast but you can change the reporting at a later date if you need to. Having 200 million rows sitting on your disk isn't going to slow anything down unless you scan the whole table and flush your disk cache.
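
A rough sketch of that nightly rollup (the raw-table name and the aggregate table are assumptions, not something from the thread):
SQL code:
-- assumes the raw table above is called answers, and an aggregate table
-- daily_question_totals (question_id, day, n_correct, n_wrong) exists
INSERT INTO daily_question_totals (question_id, day, n_correct, n_wrong)
SELECT fk_question_id,
       date_trunc('day', datetime)::date AS day,
       sum(CASE WHEN correct THEN 1 ELSE 0 END),
       sum(CASE WHEN NOT correct THEN 1 ELSE 0 END)
FROM answers
WHERE datetime >= date_trunc('day', now() - interval '1 day')
  AND datetime <  date_trunc('day', now())
GROUP BY 1, 2;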

quote:

Part 2: on an unrelated note, is it ever ok NOT to have a primary id in a table? In the analytics example, I might not really ever need to identify a datapoint uniquely and might want to save 4-8 bytes. Is it cool to never create that column?

If it's some sort of output table that nothing else in the db is ever going to need to refer to then sure.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

DreadCthulhu posted:

Awful question of the day... I need to record the # of times each of these uniquely identifiable questions was answered wrong/right and when... YAGNI... raw datapoints for now until I see the performance degrade to justify optimization? Also, is there an accepted way of "compacting" old data that might have little to no value at its original degree of granularity?

Part 2: on an unrelated note, is it ever ok NOT to have a primary id in a table? In the analytics example, I might not really ever need to identify a datapoint uniquely and might want to save 4-8 bytes. Is it cool to never create that column?
I find this question awful only because I somehow equate it with "build this for me", but you've given enough info to warrant a few answers that are of general interest.

For your part two, if you follow the table design you gave yourself at the beginning, the table structure given by Super Delegate does have a primary key; it just doesn't happen to be a single column constructed for that purpose. (This is a valid use of the term 'primary key' in PG, though it might be distinguished from an abstract 'compound key'.) Indeed, (SurveyResponseID,PersonID,QuestionID,AnswerDate,Correct) is your key, and there's nothing particularly wrong with that. While most tables benefit from a separate PK column, it is of little use here.

As others have mentioned, your data set is really not that large. You have used a word ('analytics') that throws crowds into fits of hash stores and NoSQL chaos, but with a proper index or three, you should be able to get the data you need very quickly. Make sure your columns are all plain storage so you don't have to deal with extended (TOASTed) column retrieval. The fact that you should never UPDATE this data is a big advantage, so dinking with table and index storage parameters and considering CLUSTER may prove useful.
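
For instance, something like this (a sketch; the table and index names are placeholders, and fillfactor 100 only makes sense because the rows are never updated):
SQL code:
-- pack pages completely, since no UPDATE slack is needed
ALTER TABLE answers SET (fillfactor = 100);
-- physically order the table to match the index the hot queries use
CREATE INDEX answers_question_date_idx ON answers (question_id, answer_date);
CLUSTER answers USING answers_question_date_idx;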

For now, stick with what appears to be the most straightforward storage approach. The trouble with YAGNI here is that it is appealing from a runtime point of view, but you cannot 'unsummarize/deaggregate' data if you haven't stored it. There is a great deal of potential algorithm design and optimization hiding here, and until you have a notion of those needs, you'll be wasting time building useless procedures. Any optimizations that work for even a month of data will most likely still help you in future redesigns.

Beelzebozo
Nov 6, 2002

I don’t mean to sound bitter, cold, or cruel. But I am, so that’s how it comes out.

Anaxandrides posted:

What do you suggest instead, out of curiosity? Records are inserted using LINQ/EF half the time, so logic inside procedures/functions to audit is not possible. Without getting the devs to do it application-side, any ideas?

Digest the statement log files that the database is already writing out to the file system using something that is not the database itself.
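
On the PostgreSQL side the knobs for that look roughly like the snippet below; Anaxandrides is on SQL Server, where the analogue would be server-side traces or Extended Events, so treat this purely as an illustration of the approach:
code:
# postgresql.conf -- have the server write parseable statement logs
logging_collector = on
log_destination = 'csvlog'
log_statement = 'mod'    # log INSERT/UPDATE/DELETE for auditing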

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Beelzebozo posted:

Digest the statement log files that the database is already writing out to the file system using something that is not the database itself.

Is this an honest answer? That it would be better to deconstruct transaction logs and then reinsert audit records to an audit table using some homebrew application than it would be to process the transactions as they come in? And how, then, is one to exclude the audit inserts from being processed the next time one analyzes the log? Create a custom user just to do the auditing?

It seems to me that this is the very definition of making up work for oneself. I don't see any advantage at all in doing this. Can you please explain the benefits here?

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

PhantomOfTheCopier posted:

For your part two, if you follow the table design you gave yourself at the beginning, the table structure given by Super Delegate does have a primary key; it just doesn't happen to be a single column constructed for that purpose. (This is a valid use of the term 'primary key' in PG, though it might be distinguished from an abstract 'compound key'.) Indeed, (SurveyResponseID,PersonID,QuestionID,AnswerDate,Correct) is your key, and there's nothing particularly wrong with that. While most tables benefit from a separate PK column, it is of little use here.

I do still highly recommend that a proper primary key and clustered index be created, otherwise you end up with a heap table. This table would be slow indeed for data aggregation if it existed as a heap.

Beelzebozo
Nov 6, 2002

I don’t mean to sound bitter, cold, or cruel. But I am, so that’s how it comes out.

Anaxandrides posted:

Is this an honest answer? That it would be better to deconstruct transaction logs and then reinsert audit records to an audit table using some homebrew application than it would be to process the transactions as they come in? And how, then, is one to exclude the audit inserts from being processed the next time one analyzes the log? Create a custom user just to do the auditing?

It seems to me that this is the very definition of making up work for oneself. I don't see any advantage at all in doing this. Can you please explain the benefits here?

Who said anything about reinserting audit records to an audit table in that same database?! The whole point is not to introduce a bunch of additional overhead to your OLTP system. Why do you need the audit trail there? No, take the logs off and process them separately, independent of your transaction system in either a separate logging shadow environment or in your data warehouse / OLAP system. If you're set on reinserting the audit data into the database it describes, then yes, you may as well just hang triggers everywhere. At that point it's just pay now vs. pay later. I'm merely suggesting you needn't pay at all, in that system.

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Beelzebozo posted:

Who said anything about reinserting audit records to an audit table in that same database?! The whole point is not to introduce a bunch of additional overhead to your OLTP system. Why do you need the audit trail there? No, take the logs off and process them separately, independent of your transaction system in either a separate logging shadow environment or in your data warehouse / OLAP system. If you're set on reinserting the audit data into the database it describes, then yes, you may as well just hang triggers everywhere. At that point it's just pay now vs. pay later. I'm merely suggesting you needn't pay at all, in that system.

Ah, I understand what you're saying here. I should have specified -- triggers are good for audit history that is visible to users of the selfsame application that is making the changes to be tracked. We do not use triggers for OLAP auditing or change tracking, in that sense. The audit tables I'm referring to here are used to display "history" to users on changes made, and are generally applied after-the-fact to snapshot-style tables that don't track history, rather than transactional-style tables where a new record is inserted for every change, and which self-audit.

We do, however, all hate making triggers of this sort, as it usually signifies that a design failure was made somewhere along the line, or that the users have requested some visibility on changes made -- it's not a real surprise that users ask for this sort of thing, but developers are often surprised. Generally speaking, transactional-style tables are preferred. But DBAs are not included in initial design discussions and meetings on new applications at the company I work for, so oftentimes these tables are presented as a fait accompli after major development work has already been done.

So, with that in mind, if there's some idea out there to present this sort of functionality to users without the use of triggers, I am all for it.
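
For reference, the kind of history trigger being described looks roughly like this (T-SQL; every table and column name here is invented for illustration):
SQL code:
-- audit trigger on a snapshot-style table that doesn't track its own history
CREATE TRIGGER trg_Widget_Audit ON dbo.Widget
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.WidgetHistory (WidgetID, OldName, NewName, ChangedAt)
    SELECT d.WidgetID, d.Name, i.Name, GETDATE()
    FROM deleted AS d
    JOIN inserted AS i ON i.WidgetID = d.WidgetID;
END;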

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Anaxandrides posted:

I do still highly recommend that a proper primary key and clustered index be created, otherwise you end up with a heap table. This table would be slow indeed for data aggregation if it existed as a heap.
I guess I don't see the benefit of having a dedicated primary key using a sequence in this case. Adding that gives (ID,SurveyResponseID,PersonID,QuestionID,AnswerDate,Correct), a NOT NULL constraint to check, and a UNIQUE index on (ID) to check, update, and for queries, plus the sequence to nextval() on insertions. In contrast, I propose creating the table with PRIMARY KEY (question,person,survey) (and maybe date?), requiring a multicolumn NOT NULL constraint and forcing a UNIQUEness on (q,p,s,d?).

But why create a separate ID field when DreadCthulhu is likely to want to enforce unique+notnull on (s,p,q,d) in any case, and the index on ID doesn't apply to any queries planned for the data? Based on the earlier description, we can surmise needing at least an index on QuestionID, but would an additional ID PRIMARY KEY make those scans any faster? As this StackExchange discussion suggests, PostgreSQL is more likely to make use of leading columns in a multi-column index, so I still feel like a primary key of (q,p,s,d?) is ample for this case. (It seems an index on person may well be required in both cases.)
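
Spelled out as DDL, that proposal is something like this (a sketch; the table name and exact column spellings are assumptions based on the thread's shorthand):
SQL code:
CREATE TABLE survey_answers (
    survey_response_id integer     NOT NULL,
    person_id          integer     NOT NULL,
    question_id        integer     NOT NULL,
    answer_date        timestamptz NOT NULL,
    correct            boolean     NOT NULL,
    PRIMARY KEY (question_id, person_id, survey_response_id, answer_date)
);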

Of course, what we haven't heard from DreadCthulhu is whether or not a good supply of one million faked data points has been inserted into the relevant database and a few of the approaches profiled for performance based on the needs of their actual algorithm. Assuming no data changes, nor really even any deletes except routine maintenance, it seems like CLUSTER plus proper partial indexes might make the existence or lack of a singular ID field imperceptible or superfluous.

edit: vvvv I misread. Thanks for clarifying, Anaxandrides. We can now :cheers: together.

PhantomOfTheCopier fucked around with this message at 04:57 on Jul 13, 2013

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!
Thanks for the tips everybody, very interesting read. Apologies if it came off as having you guys do the work; I should have posted my first take on this as a starting point. As PhantomOfTheCopier suggested, it'd make the most sense to do some testing on a larger dataset and see how it performs. And yes, 200 million rows isn't awful, and that's basically the worst-case scenario. In reality I'm sure we'll get a lot fewer hits, at first at least.

I like the aggregate table suggestion; it seems like it'd be a very common scenario for a lot of analytics products. As in, at some point you might not care anymore about fine-grained data from 6 months ago, and you can remove a ton of rows from that table. As you guys pointed out, you'd mostly be saving space, not necessarily performance, if the table is sorted correctly by date.

Just to add more detail, the app adaptively throws random questions at the student, who might answer a few hundred of them in a day. There's no concept of a batch, session, or survey, which should make things a bit simpler. I'm thinking of going with something like:

code:
create table answers (  -- table name is just a placeholder
  question_id integer not null,
  student_id integer not null,
  category_id integer not null,  -- could be math, english, etc
  correct boolean not null,
  created_at timestamptz not null
);
Everything is NOT NULL, but there are no uniqueness constraints, since per design you could be answering the same question many times per day if it is given to you. Also it doesn't look like I'd need a serial ID of any kind, but I might be wrong. The core queries would be:

  • for category x, what questions did a student answer between dates a and b
  • for category x, how many questions were answered, answered correctly and answered incorrectly between dates a and b
  • for category x, what are the questions that are giving students the most trouble

I'm going to have to brush up on "Use The Index, Luke" to get the multicolumn indexes right here for the different scenarios. I also haven't worked with CLUSTER before, so I'm going to have to read up on that.
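
A possible starting point for those three queries, before any profiling (a sketch only; the index choices are guesses to validate with EXPLAIN on realistic data):
SQL code:
-- query 1: filter by category + student + date range
CREATE INDEX answers_cat_student_date_idx
    ON answers (category_id, student_id, created_at);
-- queries 2 and 3: aggregate per category (and question) over a date range
CREATE INDEX answers_cat_question_date_idx
    ON answers (category_id, question_id, created_at);

-- e.g. query 2: answered / correct / incorrect counts in a window
SELECT count(*)                                     AS answered,
       sum(CASE WHEN correct THEN 1 ELSE 0 END)     AS n_correct,
       sum(CASE WHEN NOT correct THEN 1 ELSE 0 END) AS n_incorrect
FROM answers
WHERE category_id = 42
  AND created_at >= '2013-07-01'
  AND created_at <  '2013-07-08';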

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

PhantomOfTheCopier posted:

I guess I don't see the benefit of having a dedicated primary key using a sequence in this case. Adding that gives (ID,SurveyResponseID,PersonID,QuestionID,AnswerDate,Correct), a NOT NULL constraint to check, and a UNIQUE index on (ID) to check, update, and for queries, plus the sequence to nextval() on insertions. In contrast, I propose creating the table with PRIMARY KEY (question,person,survey) (and maybe date?), requiring a multicolumn NOT NULL constraint and forcing a UNIQUEness on (q,p,s,d?).

But why create a separate ID field when DreadCthulhu is likely to want to enforce unique+notnull on (s,p,q,d) in any case, and the index on ID doesn't apply to any queries planned for the data? Based on the earlier description, we can surmise needing at least an index on QuestionID, but would an additional ID PRIMARY KEY make those scans any faster? As this StackExchange discussion suggests, PostgreSQL is more likely to make use of leading columns in a multi-column index, so I still feel like a primary key of (q,p,s,d?) is ample for this case. (It seems an index on person may well be required in both cases.)

Of course, what we haven't heard from DreadCthulhu is whether or not a good supply of one million faked data points has been inserted into the relevant database and a few of the approaches profiled for performance based on the needs of their actual algorithm. Assuming no data changes, nor really even any deletes except routine maintenance, it seems like CLUSTER plus proper partial indexes might make the existence or lack of a singular ID field imperceptible or superfluous.

I'm not suggesting that a separate ID field be created -- there's no need to overthink this. A primary key of (q,p,s,d) would be fine, so long as it's actually created as a clustered index. When I said that a proper primary key should be created, I meant "Don't treat this logic as an implicit primary key -- make sure you actually declare one programmatically and create it on a clustered index". Otherwise you end up with problems.
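
In T-SQL, "declare one programmatically" comes out to something like this (a sketch; the table name is assumed):
SQL code:
ALTER TABLE dbo.Answers
ADD CONSTRAINT PK_Answers
    PRIMARY KEY CLUSTERED (QuestionID, PersonID, SurveyResponseID, AnswerDate);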

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


DreadCthulhu posted:

code:
question_id integer
student_id integer
category_id integer -- could be math, english, etc
correct boolean
created_at timestamptz

Is category an attribute of the answer, or of the question? If the latter, then it should be on the question table, not this one.
I would still add an identity-type column, and supporting indexes for the foreign-key columns.

RICHUNCLEPENNYBAGS
Dec 21, 2010

Anaxandrides posted:

Honestly, this sounds like a right tool for the right job question. Is there some reason why you can't rip through this data using Python or whatnot in order to extract it, given that you're working with a limited toolset?

You know, I thought about this some more and I don't know why it didn't occur to me, but the easiest solution was just to take one of the nightly database dumps and restore it in a 9.1 server.

Zombywuf
Mar 29, 2008

RICHUNCLEPENNYBAGS posted:

You know, I thought about this some more and I don't know why it didn't occur to me, but the easiest solution was just to take one of the nightly database dumps and restore it in a 9.1 server.

poo poo, I was going to suggest that but assumed you had some reason you couldn't.

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

Nth Doctor posted:

Is category an attribute of the answer, or of the question? If the latter, then it should be on the question table, not this one.
I would still add an identity-type column, and supporting indexes for the foreign-key columns.

That's a good observation, it's an attribute of question as you suggested. I'll move categories into their own relation and see if the two joins are making much of a difference. And yes, good reminder with the fk indexes!

Safe and Secure!
Jun 14, 2008

OFFICIAL SA THREAD RUINER
SPRING 2013
House has many Rooms.
Room has many Chairs.

Then, I want a HouseStuffHolder.

A HouseStuffHolder can have one House, one Room, or one Chair.

But:

House has many HouseStuffHolders.
Room has many HouseStuffHolders.
Chair has many HouseStuffHolders.

House: id
Room: id, houseId
Chair: id, roomId

Does the following table sound stupid?

HouseStuffHolder:
id, houseId, roomId, chairId

I don't like that there are three foreign-key columns when really only one reference should be needed, because a HouseStuffHolder should belong to only one other "thing" (i.e., House, Room, or Chair). Is there a better way?

RichardA
Sep 1, 2006
.
Dinosaur Gum
Why would you want a HouseStuffHolder table at all? The table setup you gave seems ok.
pre:
House: ID
Room:  ID, houseID
Chair: ID, roomID
This allows each house to have many rooms and each room to have many chairs, while limiting each room to a single house and likewise each chair to a single room. If for some reason you need a many-to-many relationship instead, you can do something like this:
pre:
House: ID
Room:  ID, houseID
Chair: ID
Chair_MTM_Room: chairID, roomID
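
As DDL, that junction table is just this (a sketch using the post's names):
SQL code:
CREATE TABLE Chair_MTM_Room (
    chairID integer NOT NULL REFERENCES Chair (ID),
    roomID  integer NOT NULL REFERENCES Room (ID),
    PRIMARY KEY (chairID, roomID)
);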

RichardA fucked around with this message at 09:40 on Jul 15, 2013

MightyShadow
Oct 27, 2005
I have a question which probably has a stupid simple answer but I can't figure it out. (SQL Server)

Say I have these fields and values

code:
ID   StaticID   Active
 1      5          0
 2     10          0
 3     10          0
 4     11          1
 5     11          0
 6     20          1
I need a query that will show me which StaticIDs don't have at least one of their entries in the table with Active = 1.
So the result for the above example would be 5 and 10.
Thanks

Beelzebozo
Nov 6, 2002

I don’t mean to sound bitter, cold, or cruel. But I am, so that’s how it comes out.
Use a self-anti-join like so:
SQL code:
SELECT DISTINCT T.StaticID
FROM MyTable T
LEFT OUTER JOIN MyTable S
  ON S.StaticID = T.StaticID
  AND S.Active = 1
WHERE S.StaticID IS NULL
edit: Fiddle'd it for you – http://sqlfiddle.com/#!6/36ca7/3

Beelzebozo fucked around with this message at 13:48 on Jul 15, 2013

MightyShadow
Oct 27, 2005
Haha awesome thanks a lot. Gave me the 7 problematic rows instantly.

RichardA
Sep 1, 2006
.
Dinosaur Gum
Any reason to prefer an anti-join over using something like this?
SQL code:
SELECT static_id 
FROM   tbl 
GROUP  BY static_id 
HAVING Max(active) = 0;

Beelzebozo
Nov 6, 2002

I don’t mean to sound bitter, cold, or cruel. But I am, so that’s how it comes out.

RichardA posted:

Any reason to prefer an anti-join over using something like this?
SQL code:
SELECT static_id 
FROM   tbl 
GROUP  BY static_id 
HAVING Max(active) = 0;

Nope, that works too. I honestly couldn't even guess which the MS SQL optimizer would prefer. I just tend to think in set logic and don't really like the look of a MAX aggregate over something which is ostensibly a BOOLEAN declared as an INTEGER. (I would've declared the column a BOOLEAN in postgres, but I'm willing to accept that there could be some valid raisin not to do that in MS SQL, about which I know very close to nothing.)

RobotRob
Aug 7, 2007

Let's get weird, but not end of BSG weird.
I wasn't sure if this was the best place to post this or not. Has anyone been to one of the SQL in the City events sponsored by Red Gate (http://sqlinthecity.red-gate.com/)? They seem like a marketing talk for their tools, so I'm trying to get an idea of whether it's worth going even if you don't use their tools. If anyone has been to one, I'd love to hear about your experience; I've found a little information on them but nothing really substantial. Apologies if this was the wrong thread for this question.

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Beelzebozo posted:

Nope, that works too. I honestly couldn't even guess which the MS SQL optimizer would prefer. I just tend to think in set logic and don't really like the look of a MAX aggregate over something which is ostensibly a BOOLEAN declared as an INTEGER. (I would've declared the column a BOOLEAN in postgres, but I'm willing to accept that there could be some valid raisin not to do that in MS SQL, about which I know very close to nothing.)

I'm pretty sure the left anti-semi join performs better in MSSQL, and it's certainly more readable and intuitive. Just as important.

jryand
Jun 18, 2013
hi, i have a database which basically takes these "lots" that the company receives, and links them to a table describing all of the items in each lot.

my item table contains a field called "Type", which can take several different values. It also contains a field called "LotNo", which acts as a foreign key to link it to the correct lot

How can I make a check constraint to only allow one item of type "Miscellaneous" per lot?
(the company basically throws whatever's left into a bin and weighs the whole thing as misc)

There can be any number of items of different types, and there can be multiple miscellaneous items in the whole table, but only one miscellaneous item per lot.

Beelzebozo
Nov 6, 2002

I don’t mean to sound bitter, cold, or cruel. But I am, so that’s how it comes out.

jryand posted:

How can I make a check constraint to only allow one item of type "Miscellaneous" per lot?
(the company basically throws whatever's left into a bin and weighs the whole thing as misc)

There can be any number of items of different types, and there can be multiple miscellaneous items in the whole table, but only one miscellaneous item per lot.

This comes up a lot. You can't write a CHECK constraint that considers tuples outside the scope of the one being checked. You need to use triggers to enforce this, which will burden your table with additional overhead. I've frankly never seen a solution to this problem that is clean enough for my liking. Welcome to the land of suboptimal decisions.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


jryand posted:

hi, i have a database which basically takes these "lots" that the company receives, and links them to a table describing all of the items in each lot.

my item table contains a field called "Type", which can take several different values. It also contains a field called "LotNo", which acts as a foreign key to link it to the correct lot

How can I make a check constraint to only allow one item of type "Miscellaneous" per lot?
(the company basically throws whatever's left into a bin and weighs the whole thing as misc)

There can be any number of items of different types, and there can be multiple miscellaneous items in the whole table, but only one miscellaneous item per lot.

If this is SQL Server you could do something crazy like using an indexed view with a PK based on LotNo and item types, with a WHERE clause restricting to the Miscellaneous type.
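
Roughly like this (a sketch; table and column names are guesses from the post, and indexed views come with extra requirements such as SCHEMABINDING and two-part names):
SQL code:
CREATE VIEW dbo.vOneMiscPerLot
WITH SCHEMABINDING
AS
SELECT LotNo
FROM dbo.Item
WHERE [Type] = 'Miscellaneous';
GO
-- the unique clustered index makes a second Miscellaneous row
-- for the same lot fail at insert time
CREATE UNIQUE CLUSTERED INDEX IX_vOneMiscPerLot
    ON dbo.vOneMiscPerLot (LotNo);
On SQL Server 2008 and later, a filtered unique index (CREATE UNIQUE INDEX ... ON dbo.Item (LotNo) WHERE [Type] = 'Miscellaneous') achieves the same effect with less ceremony.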

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Safe and Secure! posted:

House, Rooms, Chairs, Stuffs :words:
Wait, isn't this equivalent to the species question, page 160 of this thread, depending on what you want to emphasize? In particular, if you're trying to focus on an easy solution to limit Stuff to a single container:
pre:
Container:  ID, Name, TypeID, ParentID
Stuff:  ID, ContainerID, ItemID
Items:  ID, etc.
ContainerTypes:  ID, Name
Or if you're trying to focus on the relationship between the containers:
pre:
House:  ID, Name
Room:  ID, Name, HouseID
Chair:  ID, Name, RoomID
Stuff:  ID, House, Room, Chair, etc
In the first case, you get to quickly deal with Stuff at the expense of doing more SQL work in support of your Container requirements. In the second case, you get your container structure for free at the expense of needing a multicolumn uniqueness constraint that treats nulls as equal (for example).

Amarkov
Jun 21, 2010

Beelzebozo posted:

This comes up a lot. You can't write a CHECK constraint that considers tuples outside the scope of the one being checked. You need to use triggers to enforce this, which will burden your table with additional overhead. I've frankly never seen a solution to this problem that is clean enough for my liking. Welcome to the land of suboptimal decisions.

Roll your own database :colbert:

In all seriousness though, the function space is just too large for a DBMS to offer arbitrary table-wide constraints natively.

Sudden Infant Def Syndrome
Oct 2, 2004

I have a stupid question that I'm not sure how to even Google for an answer because I'm not sure what it's called.

Essentially what I need to do is this:

code:
SELECT *
FROM Recipes R
	JOIN RecipeIngredients RI ON R.pk = RI.rec_fk
		JOIN Ingredients I ON I.pk = RI.ing_fk
Where the ingredient at the end can be another recipe which then needs to drill down and so on and so on.

If someone can just give me a term to lookup on Google, I can probably go from there, I just don't know what to look for.

Beelzebozo
Nov 6, 2002

I don’t mean to sound bitter, cold, or cruel. But I am, so that’s how it comes out.

Sudden Infant Def Syndrome posted:

…and so on and so on.

:google: the term "transitive closure" and find out just how bad SQL intrinsically is at it.

Molotov Cock Tale
Jun 30, 2010

Sudden Infant Def Syndrome posted:

I have a stupid question that I'm not sure how to even Google for an answer because I'm not sure what it's called.

Essentially what I need to do is this:

code:

SELECT *
FROM Recipes R
	JOIN RecipeIngredients RI ON R.pk = RI.rec_fk
		JOIN Ingredients I ON I.pk = RI.ing_fk

Where the ingredient at the end can be another recipe which then needs to drill down and so on and so on.

If someone can just give me a term to lookup on Google, I can probably go from there, I just don't know what to look for.

What RDBMS? In SQL Server I'd suggest a recursive CTE; other RDBMSes may have similar constructs, so let us know.

Sudden Infant Def Syndrome
Oct 2, 2004

Beelzebozo posted:

:google: the term "transitive closure" and find out just how bad SQL intrinsically is at it.

Oh man :suicide:

Molotov Cock Tale posted:

What RDBMS? In SQL Server I'd suggest a recursive CTE; other RDBMSes may have similar constructs, so let us know.

SQL Server 2012

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


So I've got a question finally:
My company has several instances of production servers with identical schema but different data on the different servers. We use replication (both merge and transactional) to keep some tables the same across the different shards. We've gotten to the point where replication is more of a hindrance than a benefit and are looking into replacing it with something else.
Has anyone used Change Tracking in SQL Server 2008 to do something like this? Any other tools?

Nth Doctor fucked around with this message at 21:25 on Jul 16, 2013

Molotov Cock Tale
Jun 30, 2010

Ok, on SQL 2012 you should be able to do this with a recursive CTE. The anchor is what you posted; UNION ALL to the same query, joined on recipe.id = ingredient.recipeid or however you've linked them.
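
The rough shape, as a sketch (it assumes Ingredients carries a nullable sub_recipe_fk pointing back at Recipes, which is a guess about the schema):
SQL code:
DECLARE @RootRecipe int = 1;  -- the recipe to expand

WITH RecipeTree AS (
    -- anchor: the direct ingredients of the starting recipe
    SELECT RI.rec_fk, I.pk AS ingredient_pk, I.sub_recipe_fk
    FROM Recipes R
    JOIN RecipeIngredients RI ON R.pk = RI.rec_fk
    JOIN Ingredients I        ON I.pk = RI.ing_fk
    WHERE R.pk = @RootRecipe
    UNION ALL
    -- recurse into ingredients that are themselves recipes
    SELECT RI.rec_fk, I.pk, I.sub_recipe_fk
    FROM RecipeTree RT
    JOIN RecipeIngredients RI ON RI.rec_fk = RT.sub_recipe_fk
    JOIN Ingredients I        ON I.pk = RI.ing_fk
)
SELECT * FROM RecipeTree;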

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Nth Doctor posted:

So I've got a question finally:
My company has several instances of production servers with identical schema but different data on the different servers. We use replication (both merge and transactional) to keep some tables the same across the different shards. We've gotten to the point where replication is more of a hindrance than a benefit and are looking into replacing it with something else.
Has anyone used Change Tracking in SQL Server 2008 to do something like this? Any other tools?

We've used SQL2008 Change Tracking, yes, but I really recommend Change Data Capture if what you're looking for is selective replication. You can process CDC logs with SSIS, which makes for a much better and more atomic solution. This is my experience, at least.
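
Turning CDC on is only a couple of calls (schema and table names here are placeholders):
SQL code:
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'SharedTable',   -- placeholder table name
    @role_name     = NULL;             -- no gating role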

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Anaxandrides posted:

We've used SQL2008 Change Tracking, yes, but I really recommend Change Data Capture if what you're looking for is selective replication. You can process CDC logs with SSIS, which makes for a much better and more atomic solution. This is my experience, at least.

We're going to be doing selective tables, but in most cases everything in the particular table will be shuffled around to the other instances.

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Nth Doctor posted:

We're going to be doing selective tables, but in most cases everything in the particular table will be shuffled around to the other instances.

Given this, what about replication is causing troubles for you?

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Anaxandrides posted:

Given this, what about replication is causing troubles for you?

When we're recovering from server crashes, replication takes so long to spin back up that we blow our SLAs.

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Nth Doctor posted:

When we're recovering from server crashes, replication takes so long to spin back up that we blow our SLAs.

Well, that's not the expected answer. What in the world is causing server crashes frequently enough that you're looking at moving away from replication in order to ease recovery?
