Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

I don't think there's an easy answer to this question, but figured I'd ask here anyway.

I'm categorizing items, in this case based on length. The problem being, the items in question are from several different vendors with no standardized way of communicating length (e.g. a field called Length with a number in it). Instead I've got things like '18 inch', '08.00 inch', '7.25 inch', etc etc etc. This has mostly worked because I'm generally only looking for >one< length and I can just put in the edge cases like so:
code:
INSERT INTO ProductCategory (ProductID,CategoryID)
SELECT p.ProductID,218 <-- 218 being the categoryid for something that's say, 10 inches long
FROM Product p
WHERE p.Name like '%10 inch%' or p.Name like '%10.00 inch%'
or p.name like '%010 inch%'
Messy, but mostly works. To give an idea of scale there's only about 300,000 rows in products so I can get away with messy crap like the above.

The problem is, I've now been asked to classify by 'long' and 'short'. Put simply, 'long' is > 24/24.00/024 inches and 'short' is < 10/10.00/010 inches. Is there a relatively straightforward way I can extract just the numeric and do a mathematical greater-than/less-than comparison? Only thing I can think of off-hand is to make a temporary table and then do a bunch of SUBSTRING fuckery based on if 'inch' is found in the p.Name text field but am hoping you geniuses have something better.

EDIT-Sorry, MSSQL 2008R2, p.Name is nvarchar(250)

Scaramouche fucked around with this message at 22:00 on Jul 29, 2013

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Scaramouche posted:

Messy, but mostly works. To give an idea of scale there's only about 300,000 rows in products so I can get away with messy crap like the above.

The problem is, I've now been asked to classify by 'long' and 'short'. Put simply, 'long' is > 24/24.00/024 inches and 'short' is < 10/10.00/010 inches. Is there a relatively straightforward way I can extract just the numeric and do a mathematical greater-than/less-than comparison? Only thing I can think of off-hand is to make a temporary table and then do a bunch of SUBSTRING fuckery based on if 'inch' is found in the p.Name text field but am hoping you geniuses have something better.

EDIT-Sorry, MSSQL 2008R2, p.Name is nvarchar(250)


This doesn't look too bad. Assuming that your examples are reflective of the actual data you have, I'd use PATINDEX to find where the space is and just LEFT(p.Name) the number out. Or is your data more heterogeneous than '18 inch', '08.00 inch', '7.25 inch'?

Salt n Reba McEntire
Nov 14, 2000

Kuparp.
Perhaps something like...

code:
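-- Approach: build a small tally (numbers) table, then use it to examine each
-- string one character at a time, keeping digits and '.' and discarding the rest.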
SELECT TOP 20 IDENTITY(INT,1,1) AS Tally
INTO #Tally
FROM master.dbo.syscolumns

DECLARE @TestInches AS TABLE (
	ID INT NOT NULL PRIMARY KEY,
	LengthField VARCHAR(50)
)

INSERT INTO @TestInches SELECT 1,'0.50 inches'
INSERT INTO @TestInches SELECT 2,'8 inch'
INSERT INTO @TestInches SELECT 3,'8.2"'
INSERT INTO @TestInches SELECT 4,'5.0inch'
INSERT INTO @TestInches SELECT 5,'"6.7555 in'

;WITH CleanUpInches (ID, LengthField, KeepChar, KillChar) AS (
	SELECT ID, LengthField,
		CASE WHEN SUBSTRING(LengthField,Tally,1) NOT LIKE '%[^0-9.]%' THEN substring(LengthField,Tally,1) end as KeepChar,
		CASE WHEN SUBSTRING(LengthField,Tally,1) LIKE '%[^0-9.]%' THEN substring(LengthField,Tally,1) end as KillChar
	FROM @TestInches,  #Tally
	WHERE LengthField LIKE '%[^0-9.]%' AND Tally <= LEN(LengthField)
)
SELECT DISTINCT LengthField,
	CAST(ISNULL(STUFF((SELECT '' + KeepChar FROM CleanUpInches C1 WHERE C1.ID = C2.ID FOR XML PATH ('')),1,0,''),'') AS DECIMAL(9,2)) AS CleanInches
FROM CleanUpInches C2

DROP TABLE #Tally
Assuming they're all inches under the textual fluff, of course. If not you'll have to do things a bit differently.

E: Obviously the TOP 20 needs to be the maximum width of the data (not the field width, just the MAX(LEN())), and you can squirt your database table in as a replacement for the table variable. This won't perform wonderfully, but it'll do a healthy job of stripping if run as a one-shot batch during a quiet moment. You may also need to strip extraneous periods prior to the cast.

Salt n Reba McEntire fucked around with this message at 14:12 on Jul 31, 2013

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Moogle posted:

Assuming they're all inches under the textual fluff, of course. If not you'll have to do things a bit differently.

E: Obviously the TOP 20 needs to be the maximum width of the data (not the field width, just the MAX(LEN())), and you can squirt your database table in as a replacement for the table variable. This won't perform wonderfully, but it'll do a healthy job of stripping if run as a one-shot batch during a quiet moment. You may also need to strip extraneous periods prior to the cast.

I'd personally go with something like...

code:
SELECT ProductID,
	CategoryID,
	CAST(LEFT(Name, PATINDEX('%[^0-9.]%', Name)) AS DECIMAL(10,8)) AS Inches -- '.' in the pattern so '7.25 inch' keeps its fraction
FROM Product
WHERE ISNUMERIC(LEFT(Name, PATINDEX('%[^0-9.]%', Name))) = 1
This assumes that data is pretty homogeneous in terms of the number always being before other characters. If this isn't the case, it should be pretty easy to flip this and UNION the results. Unless there's something weird in there like 'Length - 10 and 1/2 inch', this should work.
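
For the long/short buckets specifically, something in this spirit might work (purely a sketch: 300 and 301 are made-up CategoryIDs, and it assumes the number leads the name):
code:
;WITH Extracted AS (
	SELECT ProductID,
		-- leading number (digits and '.') off the front of the name; the appended
		-- space guarantees PATINDEX always finds a terminating character
		CASE WHEN ISNUMERIC(LEFT(Name, PATINDEX('%[^0-9.]%', Name + ' ') - 1)) = 1
			THEN CAST(LEFT(Name, PATINDEX('%[^0-9.]%', Name + ' ') - 1) AS DECIMAL(10,2))
		END AS Inches
	FROM Product
	WHERE Name LIKE '[0-9]%'	-- only rows where the number starts the string
)
INSERT INTO ProductCategory (ProductID, CategoryID)
SELECT ProductID,
	CASE WHEN Inches > 24 THEN 300 ELSE 301 END	-- 300/301 = made-up 'long'/'short' category IDs
FROM Extracted
WHERE Inches > 24 OR Inches < 10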

saint gerald
Apr 17, 2003
I have a troublesome SQL problem.

We have a set of tables that we create every quarter. My boss has asked me to produce views that union these tables into one big view to produce some reports that look over multiple quarters. (You could ask, why are these separate tables in the first place, which is a question to which there seems to be no satisfactory answer.)

So obviously I do
SELECT 'q1_1999' qtr, firsttable.* from firsttable
UNION ALL
SELECT 'q2_1999' qtr, secondtable.* from secondtable
...etc

But you know what's coming: over the years additional fields have been added that don't exist in earlier tables, so a simple UNION won't work unless I go through and painstakingly fill out the earlier tables with nulled fields. Given I have to do this a number of times for different sets of quarterly data, this is not a task I am relishing.

Is there a way to lessen the burden here? Could I do something with dynamic SQL, or a programming solution that would pull the table definitions and assemble the SQL for me? Is this a problem anyone has met before? I'm open to ideas. Backend is Oracle.

edit: OK, this is solved thanks to the all_tab_columns view and some dynamic SQL.
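
For anyone hitting the same wall, the shape of it was roughly the following (a sketch only: the table names here are invented, LISTAGG wants 11g, and the generated SELECTs still have to be glued together with UNION ALL and run as dynamic SQL or pasted into a CREATE VIEW):
code:
-- For each quarterly table, emit a SELECT listing the superset of columns,
-- substituting NULL for any column that particular table doesn't have.
WITH tabs AS (
  SELECT DISTINCT table_name
  FROM   all_tab_columns
  WHERE  table_name IN ('SALES_Q1_1999', 'SALES_Q2_1999')   -- invented names
), cols AS (
  SELECT DISTINCT column_name
  FROM   all_tab_columns
  WHERE  table_name IN ('SALES_Q1_1999', 'SALES_Q2_1999')
)
SELECT 'SELECT ''' || t.table_name || ''' AS qtr, '
       || LISTAGG(CASE WHEN a.column_name IS NULL
                       THEN 'NULL AS ' || c.column_name
                       ELSE c.column_name END, ', ')
          WITHIN GROUP (ORDER BY c.column_name)
       || ' FROM ' || t.table_name AS select_stmt
FROM   tabs t
CROSS JOIN cols c
LEFT  JOIN all_tab_columns a
       ON a.table_name = t.table_name AND a.column_name = c.column_name
GROUP  BY t.table_name;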

saint gerald fucked around with this message at 20:45 on Aug 2, 2013

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Moogle that is pretty awesome and strongly resembles what I've half-assed out so far, though unfortunately I'm going to have to add textual distinctions to it since, as Anaxandrides notes, the data isn't homogeneous. In fact, it's even less homogeneous than that, with an average field being like:
"Bright blue thing with the number 20 printed on it 13.5 inch and 16mm"

Thanks for your help guys, but don't waste any more time on it since I have (miraculously) actually been able to clean up the data at the source by making a big stink about it(!)

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

Scaramouche posted:

Moogle that is pretty awesome and strongly resembles what I've half-assed out so far, though unfortunately I'm going to have to add textual distinctions to it since, as Anaxandrides notes, the data isn't homogeneous. In fact, it's even less homogeneous than that, with an average field being like:
"Bright blue thing with the number 20 printed on it 13.5 inch and 16mm"
Ugh, I hate dealing with that sort of data. I wrote a browser plugin script to attempt to convert Amazon prices into a price per [fixed quantity], because it's a pain in the rear end figuring out which is better of "500g container, $20" or "2-pack of 5oz containers, $15". It only helps compare about four in every five entries though because the product titles are so jumbled.

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

roomforthetuna posted:

Ugh, I hate dealing with that sort of data. I wrote a browser plugin script to attempt to convert Amazon prices into a price per [fixed quantity], because it's a pain in the rear end figuring out which is better of "500g container, $20" or "2-pack of 5oz containers, $15". It only helps compare about four in every five entries though because the product titles are so jumbled.

Knowing how the data is heterogeneous actually makes this pretty trivial. Amazon's part catalog is so enormous that no one-size-fits-all solution can be found without really detailed analysis and logic. But Scaramouche's example of an "average" record is still pretty simple to rip through. It would just require deeply nested CASE statements and SUBSTRINGs. Trivial to write, but an absolute bear to debug/read if you don't know what it's meant for.

That is to say, probably bad (but functional) code.

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!
This authorization question has been bothering me for a while and I don't know how to solve it best.

Say you're implementing a mock Facebook: a web app calling into a DB with a users and a messages table. You don't want users to be able to access rows in the messages table that don't belong to them.

There are a bunch of ways of doing this. Simplistically, assuming the web app verified and validated the id of the user making the request (as in, you fully trust it), you can just throw in a WHERE user_id = ? (and user_id = ? for INSERTs) into every one of your queries and now you're preventing users from selecting other people's rows, and inserting as if they were another user.

This is fine and dandy until the messages table is no longer one join away from users but is instead, say, 10 tables away. Now you have a giant clusterfuck of a query. Yeah, you can probably ameliorate it with some magical query-building ORM fairy dust that will figure out the join for you behind the scenes, and the DB might even perform very well, but let's ignore that option for the sake of argument.

The other option is to create a "shortcut" table for each of these distant tables I want to ACL; the table would be a many-to-many between, e.g., messages and users, skipping everything in between. Now the problem is that every insert involves two tables, but perhaps that's not a biggie.

I think another option is to build a DB procedures layer between the application and the DB and have that layer handle access control. There are likely quite a few more approaches.

I'm wondering what has worked well for you in a web app situation like what I mentioned above. I'm looking for something that's first of all elegant and easy to maintain, and ideally good enough perf-wise.

DreadCthulhu fucked around with this message at 07:02 on Aug 3, 2013

v1nce
Sep 19, 2004

Plant your brassicas in may and cover them in mulch.
Generally speaking I'd store the user_id with the item in question, going straight off your messages example. I can't think of a situation in which your users table would be a long distance of joins away from the messages table, so maybe that's just a weak example. That said, a little duplication of IDs is a far better solution than a lot of joins.

As you already worked out, I wouldn't recommend the ORM solution because that's just simplifying the work you have to do, whereas you might still be running some awful query in the background that brings the DB to its knees. I don't like using frameworks where you can't easily do custom queries for exactly this reason.

In lieu of a better example, let's talk concepts. We'd normally keep the ACL element on the code side and not the database side; for instance we typically wouldn't do a join across an ACL table which imparts the permissions. By that, I mean we would NOT:
code:
SELECT * FROM
messages
INNER JOIN users on users.id = messages.user_id
INNER JOIN acl on acl.user_id = users.id AND messages.type = acl.message_type
WHERE 
users.id = '123'
Instead, our user permissions are usually stored in the session and loaded just the once when the user logs in. We'd then:
code:
SELECT * FROM
messages
WHERE 
messages.user_id = '123'
AND messages.type IN ('A', 'B')
where the A and B message types are part of the relevant permissions drawn from the user's permissions stored in the session.

Alternatively, we might have different queries depending on the user's permissions; it's not unheard of to have a getMessages() method in your model which checks the user's ACL and then runs the appropriate query. That way, your controller code still calls all the same stuff regardless of the user, but your model pays attention to the permissions.

Sometimes permissions checks in your query are fine; it depends entirely on what you're up to. Tool for the job and all that. Can you cite an example where you've got a many-table join, caused by permissions checks, that you'd want to collapse?

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

Thanks for the extensive reply, much appreciated. I definitely like the id duplication approach as it makes the query really simple, and depending on the situation it might be the very best choice. The duplication is probably a smaller deal than I imagine since you don't really change user ids and so there wouldn't be a situation where you'd need to go through dozens of tables and change each row in there.

There are a couple of situations in what I'm working on where this might suck a bit: one is obviously the extra space the extra column takes up for large analytics tables with raw data in them, but I suspect that's not worth fretting over until I hit terabytes, which, with 99.9% likelihood, I never will.

The other issue is potentially having multiple "owners" of that message row, where one id will not cut it anymore. The situation would be along the lines of this totalitarian example. You have:

- a bunch of admins with potentially overlapping sets of users they oversee (they can read their messages)
- admins to groups join table
- groups of users
- groups to users join table
- users
- messages owned by users

Now if you want to figure out, as an admin, which messages you have access to, you can't exactly tag every user's message with your id, since multiple admins might have access to that same message. How do you avoid doing 5 joins here? There must be a really obvious solution, but I'm just too green to see it.
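
For reference, the read path I'm trying to avoid looks roughly like this (all names invented; :admin_id is a bind parameter):
code:
SELECT m.*
FROM messages m
JOIN users u          ON u.id = m.user_id
JOIN groups_users gu  ON gu.user_id = u.id
JOIN groups g         ON g.id = gu.group_id
JOIN admins_groups ag ON ag.group_id = g.id
JOIN admins a         ON a.id = ag.admin_id
WHERE a.id = :admin_id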

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
I am parsing kill logs from a game server and trying to determine how many matches a playername has participated in (in SQL Server 2012).

I have data like:

code:
Table KillInfo

MatchID  Killer  Victim
1         Foo     Foo
1         Foo     Bar
1         Bar     Foo
1         Bar     Buzz
1         Bar     Foo
1         Foo     Buzz
2         Foo     Foo
2         Foo     Bar
2         Bar     Foo
2         Bar     Buzz
2         Bar     Foo
2         Foo     Buzz
3         Bar     Foo
3         Foo     Bar
3         Foo     Foo

What's the best way to determine the number of matches in which a playername appears (as either killer or victim)?

I have a kludged-together method where I select distinct MatchID, Killer and then union it with MatchID, Victim, then select distinct of that to purge duplicate records, then do a Name, COUNT(1) grouped by Name to get the total.

This seems really kludgy and inefficient.

Is there a better way to do this?


Agrikk fucked around with this message at 20:10 on Aug 4, 2013

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

Agrikk posted:

What's the best way to determine the number of matches in which a playername appears (as either killer or victim)?

As a completely off-the-top-of-my-head query,

SELECT COUNT( DISTINCT MatchID ) FROM KillInfo WHERE Killer = player OR Victim = player

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
Solved it in a line with no temp tables:

code:
select tem.player, count(1) as MatchesPlayed from
	(select  MatchID, Killer as player from KillInfo
		union 
	select  MatchID, Victim as player from KillInfo) as tem
group by tem.player
order by count(1) desc
Any gotchas here?

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

DreadCthulhu posted:

- a bunch of admins with potentially overlapping sets of users they oversee (they can read their messages)
- admins to groups join table
- groups of users
- groups to users join table
- users
- messages owned by users
Now if you want to figure out, as an admin, which messages you have access to...
Instead of going off the deep end, allow me to start off with the Professor's/Student's Nightmare/Bliss: "Not enough information". If the first step is to understand the problem, then we friendly goons have an inkling, but only you know what you want. Ergo, we move to step two, the generation of solutions. I'm not saying you haven't thought about any of this, but:

I propose that you think about all ways this could be done. You may already be stuck with the layout you describe, but I'm going to assume you aren't :iamafag: so there are a great many pertinent questions: Are administrators just users with special privileges, or an entirely different collection of people (presumably)? Can users be members of multiple groups? What privileges does group membership provide? Are groups recursive/hierarchical? Do administrators just have the ability to see things, or are there other privileges they do or might have? Is a message posted just "by a user", or is it by a user into a specific group (to which the user belongs)?

Generate ideas based on potential pragmatic models. Is this supposed to be like the Unix filesystem? Is it more like the SQL role system? Should administrators mirror SA moderators? Is it a corporate structure with workers, supervisors, managers, middle managers, middle supervisors, dumbjerkery, chief officers, seventeenth-level presidents, and bigwigs that are so far gone no one knows their title?

It's clear that one can propose all manner of theoretical contrivances --- no, your example is not contrived, just not really narrowed down in my mind --- that will require many joins. We frequently see examples approaching arbitrary tree structures, and those are certainly the simplest examples of this list of joins you've indicated. Separated tree structures are most likely to require joins by their very nature, whereas some of the above models get around that by flattening "administrator privilege".

For example, one could contrive: Employees -- Supervisors -- Managers -- Administrators as a pure hierarchy, and then create a completely separate "Quality Control" organization of overseers of each one of those levels: EmployeeWatchers -- SupervisorWatchers -- ManagerWatchers -- AdministratorWatchers. Getting "All employees under admins that AdminWatcher 552 watches" is clearly not so easy. If supervisors have multiple managers, it's more complicated, not to mention the question about who manages the ManagerWatchers and who watches them. :buddy:

You could likely make lots of progress by flattening a bit. Unix groups, for example, require explicit user membership and there's no recursive lookup to consider. SA permits everyone to see everything (well, as far as I know), but a system of privileges controls other changes. SQL roles, PG privileges, SELinux, and others, are all based on tagging each object so access/privilege control can be enforced, so there's a great deal more operational overhead.

Think about lots of possibilities, narrow it down to two or three models that seem to fit your needs, and then consider the functional and performance side of the issue. One could flatten your tables with triggers or materialized views, but not without first being very clear (with oneself) about the expected data set size, update cycle, and expected optimizations.
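
To make the "flatten with triggers or materialized views" option concrete, one could maintain something like the following next to the normalized tables (a sketch; every name is invented, and the population step would live in a trigger or in the application's message-insert path):
code:
-- One row per (admin, message) pair, so the admin's read query is a single lookup.
CREATE TABLE admin_message_access (
    admin_id   INT NOT NULL,
    message_id INT NOT NULL,
    PRIMARY KEY (admin_id, message_id)
);

-- On message insert, walk the group membership once at write time
-- (:new_message_id and :message_owner_id are bind parameters from the inserting code).
INSERT INTO admin_message_access (admin_id, message_id)
SELECT DISTINCT ag.admin_id, :new_message_id
FROM groups_users gu
JOIN admins_groups ag ON ag.group_id = gu.group_id
WHERE gu.user_id = :message_owner_id;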

MC Cakes
Dec 30, 2008
Our server has like 3.5TB worth of data stored in our mysql datadir, but when I zip a mysqldump the resulting files are worryingly small:

For example, we have a 650MB mysql data directory, and the resulting mysqldump.sql.gz file is only 95MB. Zip is supposed to only have like a 50% compression ratio, so I'm worried where the other 150MB are disappearing off to.

Could someone please reassure me that when you import from a mysqldump, some of the .myi index and .myd data tables are not represented 1:1 for size in the mysqldump, and that's why the mystery data is disappearing? I've tested a restore on one of our databases, but one test isn't very reassuring.

RoadCrewWorker
Nov 19, 2007

camels aren't so great

MC Cakes posted:

Zip is supposed to only have like a 50% compression ratio, so I'm worried where the other 150MB are disappearing off to.
Where are you getting this from? Compression depends heavily on the redundancy in your data; I regularly get >80% ratios on binary dumps of tables with similar or very clearly structured data, and that's not even counting the redundancy introduced by using human-readable SQL dump files for storage (which obviously don't include the raw index data except as implicitly defined SQL code).

MC Cakes
Dec 30, 2008
I was only able to find a handful of blogs that mentioned the average ~50% compression ratios of mysqldump files. The 50% was the result of a few different tests by bloggers on their own databases comparing 7z vs bz2, etc; I was perhaps erroneously assuming that my compression ratio would be similar to theirs.

Sudden Infant Def Syndrome
Oct 2, 2004

Could you not unzip on another server and compare?

MC Cakes
Dec 30, 2008
To clarify, the datadir is 660GB, the mysqldump is 500GB, and the mysqldump.gz is 100GB.

So it's about 80% compression, which was unexpected, plus there's already a ~15% reduction in size just from going from a directory of .frm, .myi, and .myd files to the dump.

Importing a mysqldump seems to restore the directory back to the original size (in that one test I did), but the compressed file was just so unexpectedly small that it alarmed me, and I was just looking for someone to pat me on the back and tell me everything's going to be OK.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

MC Cakes posted:

So it's about 80% compression, which was unexpected, plus there's already a ~15% reduction in size just from going from a directory of .frm, .myi, and .myd files to the dump.
Yeah, it really depends on what sort of data you have. If you've got a lot of image files and junk in BLOB columns then 50% would be pretty good. If you've got a lot of plain text, 80% wouldn't be surprising at all. And any index columns and stuff aren't going to be stored in the dump to begin with (because they can be recreated from the data), so if you have a lot of indices that's a huge saving right off the bat.

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

That was awesome, thank you. It actually took me a while to process what you said, sorry for the late response, but in the end it made a lot of sense. I definitely didn't put a lot of thought into the original schema design, and right now what I have works OK. Once (and if) we need to add a few more complex scenarios I'll certainly revisit the design and see if it can't be significantly simplified by imitating one of the existing patterns out there, as you suggested.

On an unrelated note, am I correctly understanding that in PG with WAL-E I could safely NOT base backup pg_xlogs (by say moving the folder outside of pgdata and symlinking to it) and still be able to do full recovery in case of meteor strike? If I'm getting this right, the WAL segments and the base backup without pg_xlogs should have everything I need to do full recovery.

WHERE MY HAT IS AT
Jan 7, 2011
I have a quick question about laying out databases/tables. I'm working on a side project which will have an SQL database containing all the books for all the programs at my school. Now, I've never done database design at all before, but I'm assuming it's a Bad Thing to just lump everything into one table, correct? Would the best way to break it down be to have a table for each program? Or should I be breaking stuff down even further and using a different database for each program and tables for each semester/year? All told I don't imagine it will be more than ~1500 books across ~150 programs or thereabouts.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

WHERE MY HAT IS AT posted:

I have a quick question about laying out databases/tables. I'm working on a side project which will have an SQL database containing all the books for all the programs at my school. Now, I've never done database design at all before, but I'm assuming it's a Bad Thing to just lump everything into one table, correct? Would the best way to break it down be to have a table for each program? Or should I be breaking stuff down even further and using a different database for each program and tables for each semester/year? All told I don't imagine it will be more than ~1500 books across ~150 programs or thereabouts.
I would say no, if these are the table separations you're considering then you should have it all in one table.

Separate tables are generally for separate columns, not separate blocks of data. If you're going to break it down it should be along those lines - eg. you might have a table for "programs" and a table for "books", and the books table would have an ID reference to the programs table to indicate which program it belongs to. Or, if a single book can belong to multiple programs, you'd have a third table "bookprogram" containing only a book ID and a program ID, to make those references.

But if the database's purpose is only to contain books, not programs, and the programs are just going to be referenced by ID, then you don't need a programs table.

Edit: there are some contexts where two or more identically structured tables are a worthwhile thing, but your example is not one of them.

roomforthetuna fucked around with this message at 21:23 on Aug 9, 2013

RoadCrewWorker
Nov 19, 2007

camels aren't so great
Yeah, 1500 is nothing, so you don't need to split it up because of data volume. Tables should generally model objects and their attributes to reduce redundancy (Book A is used in 2 programs; if all of its attributes are listed twice, someone could change one entry and create inconsistent and conflicting information) and make updates and queries easier.

WHERE MY HAT IS AT
Jan 7, 2011
Okay, that makes sense. Thanks a lot!
It seems to make the most sense to just shove it all in one table then. Now, barring any books being used for more than one year, there shouldn't be a problem with having the following columns, right?
Program - Year - Title - Price
This seems to make the most sense to me, since trying to model it as a list of programs with associated books would require a variable number of columns or something. The end game is to have the database accessible through a website where you pick your program/year from a drop-down and it shows you your books and how much they cost etc. So organizing it like that and having the query be like SELECT * FROM books WHERE programID='foo' AND year='bar'; would be the simplest way to do it, correct?

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

DreadCthulhu posted:

That was awesome... It actually took me a while to process what you said,... Once (and if) we need to add a few more complex scenarios I'll certainly revisit the design and see if it can't be significantly simplified by imitating one of the existing patterns out there, as you suggested.

On an unrelated note, am I correctly understanding that in PG with WAL-E I could safely NOT base backup pg_xlogs (by say moving the folder outside of pgdata and symlinking to it) and still be able to do full recovery in case of meteor strike? If I'm getting this right, the WAL segments and the base backup without pg_xlogs should have everything I need to do full recovery.
You're welcome. If I did one post per thought instead of lumping them all into a single page, I'd have a higher post count. Haha. Whatever you do, remember that things like tree structures or the examples given are the types of things that suffer from design lock-in. You may get stuck with inefficient extensions, and processing the data to model it in a different way might take considerable processing time.

For base backups, note that the documented procedure (24.3.3.5) is to remove pg_xlog anyway. On the other hand, if you're not appropriately using the -X option, you may miss transactions on high-traffic systems. The argument is that a base backup plus all WAL archives is sufficient to roll forward to a specific point in time. Note that point-in-time recovery takes time to execute (though there are some tricks to speed it up), and that's after you've transferred the base backup onto the recovery system in addition to all the archives (whether streamed in or flat copied). If your archived transaction logs exceed the size of a base backup (for >20GB clusters), you're getting to the point where you need to consider the balance between the two. In any case, it seems unwise to go more than a day without a base backup. Bleh, there are lots of exceptions here.
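
For what it's worth, the usual WAL-E wiring is roughly the following, per the README's example pattern (the envdir layout, paths, and 9.2 data directory are just placeholders):
code:
# postgresql.conf -- ship every completed WAL segment as it fills
wal_level = archive
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# periodic base backup (its pg_xlog contents aren't what you restore from)
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.2/main

# recovery.conf on the restore side -- pull WAL back during replay
restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'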

men with puns
Feb 8, 2010
Young Orc

WHERE MY HAT IS AT posted:

Okay, that makes sense. Thanks a lot!
It seems to make the most sense to just shove it all in one table then. Now, barring any books being used for more than one year, there shouldn't be a problem with having the following columns, right?
Program - Year - Title - Price
This seems to make the most sense to me, since trying to model it as a list of programs with associated books would require a variable number of columns or something. The end game is to have the database accessible through a website where you pick your program/year from a drop-down and it shows you your books and how much they cost etc. So organizing it like that and having the query be like SELECT * FROM books WHERE programID='foo' AND year='bar'; would be the simplest way to do it, correct?

If you make the table this way, each row becomes a combination of program + book, rather than just a single book. This will almost certainly cause some problems. For example, you couldn't enter data about books and programs independently of each other. If nothing else, you'll be entering duplicate / redundant data about books next year / semester, because some programs are probably going to re-use the same books they used before.

You probably want to normalize your database to avoid these problems. I don't know offhand of somewhere that explains normalization without a lot of jargon, but let me try to get close to the spirit of it like this: The main idea is to prevent different types of things from depending on each other (what if you want to create a record for a program that doesn't use any books?) or having redundant data (having to enter a book's name multiple times, or update the price data multiple times, because it's used again somewhere).

Each type of thing (book, program) should get its own table with a separate column for each type of data you care about. In the book table, there shouldn't be anything about programs that use it. If possible, include at least one thing that can uniquely identify the thing (ISBN, combination of program name + year, etc). If you create a unique index on this column (or for the programs table, the pair of columns for program name + year), you can prevent accidentally having duplicate records for things in the table. It's common to create another ID field (typically an integer that automatically counts up as you add rows, the details vary depending on your DB software) to serve as a PRIMARY KEY which is the main way to refer to a single row in your table. This field doesn't have any meaning outside of the database, but makes it easier to identify rows and relate tables to each other.

A relationship between two types of things (program A uses book B) is also a type of thing, and this relationship should also get its own table. The relationship table would contain copies of the primary keys used for your books and programs. Usually this means it will just have two columns: one for a program's id, one for a book's id. You can make an index on the columns of this table to prevent duplications too.
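
To make that concrete, the tables might look something like this (just a sketch; the auto-increment syntax for the id columns depends on your DB software):
code:
CREATE TABLE books (
    id    INT NOT NULL PRIMARY KEY,   -- auto-increment/IDENTITY/SERIAL, per your DB
    isbn  VARCHAR(17),
    title VARCHAR(255) NOT NULL,
    price DECIMAL(8,2),
    UNIQUE (isbn)
);

CREATE TABLE programs (
    id   INT NOT NULL PRIMARY KEY,    -- same deal
    name VARCHAR(100) NOT NULL,
    year INT NOT NULL,
    UNIQUE (name, year)
);

CREATE TABLE programs_books (
    program_id INT NOT NULL REFERENCES programs (id),
    book_id    INT NOT NULL REFERENCES books (id),
    PRIMARY KEY (program_id, book_id)
);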

Then, when someone wants to look up books for their programs, you start with the program name / year, find the matching record for their program, JOIN the relationship table, and finally get the data from the matching rows from the book table.

Something like:
code:
SELECT b.*
FROM   books b
       INNER JOIN programs_books pb
               ON b.id = pb.book_id
       INNER JOIN programs p
               ON pb.program_id = p.id
WHERE  p.name = 'program name'
       AND p.year = 2013
It ends up with a more complicated query, and multiple tables, but it should make it easier to enter the data, or work with it in other ways you may not have anticipated.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!
In this particular case it might make sense to have 'year' be a column in the relationship table, since programs may not be specific to a year, books may not be specific to a year, but the information you're wanting out is books-that-go-with-a-program-in-a-year. I can't see a problem with year being in that table, and I can't see a way it makes sense in either of the other tables.

Unless programs are each year-specific, in which case putting year in with the program would be the way to go.
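
That first option would look roughly like this (a sketch, reusing the table names from the post above):
code:
CREATE TABLE programs_books (
    program_id INT NOT NULL REFERENCES programs (id),
    book_id    INT NOT NULL REFERENCES books (id),
    year       INT NOT NULL,
    PRIMARY KEY (program_id, book_id, year)
);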

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Weird little question that I think relates to a PIVOT but I'm not sure, since I am never able to think those through. The way I use MSSQL is more programmatic; I make stored procs that update, insert, and transform things, which is why I have trouble on fiddly little 'end user' requests like this. Basically I have an orders table. I've been asked to prepare a spreadsheet that is intended to show a customer's email, and then the date of each order they've made on one line. So it would look like:
code:
Email | Order1Date | Order2Date | Order3Date | Order4Date | etc.
The problem of course being, the actual number of orders is not known. I can easily find emails that have placed multiple orders with something like this:
code:
SELECT
 Email,COUNT(*) 
FROM
 Orders
GROUP BY
 Email
ORDER BY 
 COUNT(*) DESC
And I can even find the first and last order using MAX(OrderDate), that kind of thing. What I'm not sure of is how to make this work on one row without either:
A) hand-coding a TON of sub-selects of (x) layers, where (x) is the max number of orders
B) something involving PIVOT/COALESCE that I can't seem to get off the ground because I keep running up against how to extract the various OrderDates and sort them.

Does anyone have any insight on how to proceed? You don't have to solve the entire thing, even just pointing to a good starting point could be helpful.

Salt n Reba McEntire
Nov 14, 2000

Kuparp.
You could use a STUFF & XML PATH in your select list to easily grab each order date grouped by email address, and then simply leave the order dates comma'd and deal with it in your presentation logic (which is simple enough to achieve, and you avoid the dynamic sql crap).

So you'll end up with:

[email], [order1],[order2],...,[ordern]

Stuff and path is easy to deal with; it's just something like (off the top of my head, so check syntax):

code:
SELECT [o].Email,
	STUFF((SELECT DISTINCT ','+CAST([_o].Date AS VARCHAR) 
		FROM Orders [_o] 
		WHERE [_o].Email = [o].Email 
		FOR XML PATH('')),1,1,'')
FROM Orders [o]
GROUP BY [o].Email
e: gawd, asleep when I wrote this.

Salt n Reba McEntire fucked around with this message at 22:48 on Aug 15, 2013

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Moogle posted:

code:
SELECT [o].Email,
	STUFF((SELECT DISTINCT ','+CAST([_o].Date AS VARCHAR) 
		FROM Orders [_o] 
		WHERE [_o].Email = [o].Email 
		FOR XML PATH('')),1,1,'')
FROM Orders [o]
GROUP BY [o].Email

Well that's frankly awesome; I didn't even need the STUFF because the first comma can act as the separator between email and the first order for CSV. Thanks a lot! I'm going to have to look into this XML PATH stuff I think because that solution came entirely out of left field for me.

more like dICK
Feb 15, 2010

This is inevitable.
I'm not super good at SQL, so I hope someone can help me with this. I've got a table with "duplicate" rows. The primary keys are still unique, but there are some rows that are otherwise identical.

Table looks like:

- Primary Key
- String
- String
- Timestamp

So duplicate rows would look like:
12345 | "Foo" | "Bar" | 2013-07-03 16:07:35.205
12346 | "Foo" | "Bar" | 2013-07-03 16:07:35.205

The three non-PK columns uniquely identify a row. No individual row is duplicated more than once.

Is there any easy way to find and delete these duplicates?

edit: It's DB2 if that makes a difference

edit2:

It looks like maybe something like this will work?

code:
DELETE FROM the_table t
WHERE EXISTS ( SELECT *
               FROM the_table t2
               WHERE t.column_one = t2.column_one
                 AND t.column_two = t2.column_two
                 AND t.column_three = t2.column_three
                 AND t.primary_key < t2.primary_key );

more like dICK fucked around with this message at 15:16 on Aug 16, 2013

Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe

more like dICK posted:

The three non-PK columns uniquely identify a row

Then put a loving unique constraint on it to avoid this poo poo in the future!
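
Something like this, using the placeholder names from your edit (once the duplicates are gone, obviously):
code:
ALTER TABLE the_table
    ADD CONSTRAINT the_table_uq UNIQUE (column_one, column_two, column_three);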

more like dICK
Feb 15, 2010

This is inevitable.
The database is literally older than I am; I don't control it. Thanks for your help.

Bad Munki
Nov 4, 2008

We're all mad here.


Maybe something like a group by on the three matching columns, selecting the max (or min, or whatever you prefer) of the actually-unique ID, and nesting that whole thing in a delete-where-not-in statement... Not exactly graceful, but it ought to function.

This is just off the top of my head and I always get group by statements wrong until I fiddle with them a bit, but I'm thinking something like:
code:
delete from yourtable where ID not in (select max(ID) from yourtable group by col1, col2, col3)
Again, that's just typing without thinking there, it's up to you to figure out if something like that will actually work in your case. I think the concept is right, though, because for the duplicates, it'll just grab one row from each, and for any non-duplicates, it'll grab them wholesale. And then delete everything else. You could do a quick check by replacing the 'delete' with a 'select *'.

And then once you've fixed it, yeah, put a constraint on there to prevent future borkage, as advised.

Sab669
Sep 24, 2009

I have a sort of database design question I think.

I've got an application where the user can enter a bunch of values and it performs a lengthy-but-basic calculation. Just basic arithmetic. But my customer wants to be able to modify some "constants" in this equation. For example, one of the formulas:

(((D54*0.30)+(E14 X E11 X 0.3464)+(E14 X E11 X 0.125)+95+35+5+((D54 X 0.95) X 1.11))+600) / E11


Taken from an Excel spreadsheet; all the 'variables' are coming from the end user entering them into the application, so that's not a problem. But they might be changing out all those actual numbers there.
I don't know if I should have 1 table with 5 different rows (5 formulas they want to change) and just have 10 columns or something for the new "constants". Unfortunately, not all of the formulas have the same number of constants so that's kind of ugly. But then the actual record for the form could just have a foreign key to which set of formulas to use? I have no idea :suicide:

Sab669 fucked around with this message at 17:02 on Aug 19, 2013

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Sab669 posted:

I have a sort of database design question I think... Variables, formulae, functions, oh my!

(((D54*0.30)+(E14 X E11 X 0.3464)+(E14 X E11 X 0.125)+95+35+5+((D54 X 0.95) X 1.11))+600) / E11
You need to ask yourself what is appropriate to store in the database given your needs. For example, if you have a small number of frequently-used formulas that appear in a number of queries, you can get away with using a stored procedure. Once the number of such grows rather large, however, that becomes somewhat non-optimal, particularly if you're expecting to use a query to resolve a function name and magically "dereference" it to execute as a procedure.

What is the nature of these constants, and the expected purpose of database storage for them? Are you expecting parameters to be varied by user, invoice, or billing cycle? Do you need an audit trail for parameter changes? Are you merely storing them because they are used again later?

Regarding the functions, there are security concerns dependent on the type of stored procedure you create, but it does not sound like you have enough function variation to warrant a more complicated approach to storage. I mean, you could create a table of named functions, their TeX representation, and any of a number of valid computational representations, followed by EXECUTE calls and some flappery to get the appropriate parameters, but that's no small amount of work, blows query optimization, and it's quite likely that you have a programming language of choice that would permit you to do all that work at the application layer.

Edit: Based on your response below, it sounds like you just need to save them in a shipping parameters table so you can select them, load them into a hash, and use them when you're generating output reports. Since you've mentioned no need to have all that generation happen within the database, there will be less pain if you stick with whatever approach you find easiest for those steps. While you can build a reporting view with the appropriate per-item columns for weight and size charges (assuming the appropriate data is already stored with each item), some values likely depend on aggregates, so you're really talking about a two-step (or more) report and a bunch of things that you'll likely want in revision control anyway, unless you really are setting out to hurt yourself.
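
A sketch of such a parameters table (all names and types invented):
code:
CREATE TABLE ShippingParameter (
    FormulaName   VARCHAR(50)   NOT NULL,  -- e.g. 'AirFreight'
    ParameterName VARCHAR(50)   NOT NULL,  -- e.g. 'PerPoundRate'
    ParamValue    DECIMAL(12,4) NOT NULL,
    EffectiveDate DATE          NOT NULL,  -- so old cost sheets stay reproducible
    PRIMARY KEY (FormulaName, ParameterName, EffectiveDate)
);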

PhantomOfTheCopier fucked around with this message at 02:38 on Aug 20, 2013

Sab669
Sep 24, 2009

Basically, it's a bunch of fees associated with shipping products from China to the US. So it's a list of fees relative to how it gets shipped: boat, air, combinations thereof, and a few other things (testing products for the presence of various metals).

Frankly the data won't change frequently at all; they might add a record per year, I'd imagine. But new data is new data. So this "Cost sheet" will use a bunch of these fees in factoring out the total cost of an item. Then in 6 months maybe DHL will charge X instead of Y per pound to ship a product, so they'll need to update these values and use them on future "cost sheets". We don't want to hardcode it, so we don't need to recompile & patch the software when these infrequent changes occur, obviously.

Red Red Red
Mar 7, 2013

This is Oracle-specific. We are switching from 10g to 11g and moving datacenters at the same time. Part of all of this change is that I have a task to change every occurrence of sysdate to current_date.

I have a lot of sql that does
code:
and order_date >= trunc(sysdate) 
and order_date < trunc(sysdate) + 1 
but when I replace the sysdate with current_date and do the same thing it just takes forever to run.

Is there a better way to optimize trunc(current_date)? I thought the two were the same, except that current_date returns the date of the local machine and sysdate returns the date of the DB server. Am I wrong?
