Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

I don't think there's an easy answer to this question, but figured I'd ask here anyway.

I'm categorizing items, in this case based on length. The problem being, the items in question are from several different vendors with no standardized way of communicating length (e.g. a field called Length with a number in it). Instead I've got things like '18 inch', '08.00 inch', '7.25 inch', etc etc etc. This has mostly worked because I'm generally only looking for >one< length and I can just put in the edge cases like so:
code:
INSERT INTO ProductCategory (ProductID,CategoryID)
SELECT p.ProductID,218 <-- 218 being the categoryid for something that's say, 10 inches long
FROM Product p
WHERE p.Name like '%10 inch%' or p.Name like '%10.00 inch%'
or p.name like '%010 inch%'
Messy, but mostly works. To give an idea of scale there's only about 300,000 rows in products so I can get away with messy crap like the above.

The problem is, I've now been asked to classify by 'long' and 'short'. Put simply, 'long' is > 24/24.00/024 inches and 'short' is < 10/10.00/010 inches. Is there a relatively straightforward way I can extract just the numeric and do a mathematical greater-than/less-than comparison? Only thing I can think of off-hand is to make a temporary table and then do a bunch of SUBSTRING fuckery based on if 'inch' is found in the p.Name text field but am hoping you geniuses have something better.

EDIT-Sorry, MSSQL 2008R2, p.Name is nvarchar(250)

Scaramouche fucked around with this message at 22:00 on Jul 29, 2013

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Scaramouche posted:

Messy, but mostly works. To give an idea of scale there's only about 300,000 rows in products so I can get away with messy crap like the above.

The problem is, I've now been asked to classify by 'long' and 'short'. Put simply, 'long' is > 24/24.00/024 inches and 'short' is < 10/10.00/010 inches. Is there a relatively straightforward way I can extract just the numeric and do a mathematical greater-than/less-than comparison? Only thing I can think of off-hand is to make a temporary table and then do a bunch of SUBSTRING fuckery based on if 'inch' is found in the p.Name text field but am hoping you geniuses have something better.

EDIT-Sorry, MSSQL 2008R2, p.Name is nvarchar(250)


This doesn't look too bad. Assuming that your examples are reflective of the actual data you have, I'd use PATINDEX to find where the space is and just LEFT(p.Name) the number out. Or is your data more heterogeneous than '18 inch', '08.00 inch', '7.25 inch'?

Salt n Reba McEntire
Nov 14, 2000

Kuparp.
Perhaps something like...

code:
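-- Approach: build a small tally (numbers) table, then use it to examine each
-- string one character at a time, keeping digits and '.' and discarding the rest.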
SELECT TOP 20 IDENTITY(INT,1,1) AS Tally
INTO #Tally
FROM master.dbo.syscolumns

DECLARE @TestInches AS TABLE (
	ID INT NOT NULL PRIMARY KEY,
	LengthField VARCHAR(50)
)

INSERT INTO @TestInches SELECT 1,'0.50 inches'
INSERT INTO @TestInches SELECT 2,'8 inch'
INSERT INTO @TestInches SELECT 3,'8.2"'
INSERT INTO @TestInches SELECT 4,'5.0inch'
INSERT INTO @TestInches SELECT 5,'"6.7555 in'

;WITH CleanUpInches (ID, LengthField, KeepChar, KillChar) AS (
	SELECT ID, LengthField,
		CASE WHEN SUBSTRING(LengthField,Tally,1) NOT LIKE '%[^0-9.]%' THEN substring(LengthField,Tally,1) end as KeepChar,
		CASE WHEN SUBSTRING(LengthField,Tally,1) LIKE '%[^0-9.]%' THEN substring(LengthField,Tally,1) end as KillChar
	FROM @TestInches,  #Tally
	WHERE LengthField LIKE '%[^0-9.]%' AND Tally <= LEN(LengthField)
)
SELECT DISTINCT LengthField,
	CAST(ISNULL(STUFF((SELECT '' + KeepChar FROM CleanUpInches C1 WHERE C1.ID = C2.ID FOR XML PATH ('')),1,0,''),'') AS DECIMAL(9,2)) AS CleanInches
FROM CleanUpInches C2

DROP TABLE #Tally
Assuming they're all inches under the textual fluff, of course. If not you'll have to do things a bit differently.

E: Obviously the TOP 20 needs to be the maximum width of the data (not the field width, just the MAX(LEN())), and you can squirt your database table in as a replacement for the table variable. This won't perform wonderfully, but it'll do a healthy job of stripping if run as a one-shot batch during a quiet moment. You may also need to strip extraneous periods prior to the cast.

Salt n Reba McEntire fucked around with this message at 14:12 on Jul 31, 2013

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Moogle posted:

Assuming they're all inches under the textual fluff, of course. If not you'll have to do things a bit differently.

E: Obviously the TOP 20 needs to be the maximum width of the data (not the field width, just the MAX(LEN())), and you can squirt your database table in as a replacement for the table variable. This won't perform wonderfully, but it'll do a healthy job of stripping if run as a one-shot batch during a quiet moment. You may also need to strip extraneous periods prior to the cast.

I'd personally go with something like...

code:
SELECT ProductID,
	CategoryID,
	CAST(LEFT(Name, PATINDEX('%[^0-9.]%', Name)) AS DECIMAL(10,8)) AS Inches -- '.' in the pattern so '7.25 inch' keeps its fraction
FROM Product
WHERE ISNUMERIC(LEFT(Name, PATINDEX('%[^0-9.]%', Name))) = 1
This assumes that data is pretty homogeneous in terms of the number always being before other characters. If this isn't the case, it should be pretty easy to flip this and UNION the results. Unless there's something weird in there like 'Length - 10 and 1/2 inch', this should work.
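
For the long/short buckets specifically, something in this spirit might work (purely a sketch: 300 and 301 are made-up CategoryIDs, and it assumes the number leads the name):
code:
;WITH Extracted AS (
	SELECT ProductID,
		-- leading number (digits and '.') off the front of the name; the appended
		-- space guarantees PATINDEX always finds a terminating character
		CASE WHEN ISNUMERIC(LEFT(Name, PATINDEX('%[^0-9.]%', Name + ' ') - 1)) = 1
			THEN CAST(LEFT(Name, PATINDEX('%[^0-9.]%', Name + ' ') - 1) AS DECIMAL(10,2))
		END AS Inches
	FROM Product
	WHERE Name LIKE '[0-9]%'	-- only rows where the number starts the string
)
INSERT INTO ProductCategory (ProductID, CategoryID)
SELECT ProductID,
	CASE WHEN Inches > 24 THEN 300 ELSE 301 END	-- 300/301 = made-up 'long'/'short' category IDs
FROM Extracted
WHERE Inches > 24 OR Inches < 10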

saint gerald
Apr 17, 2003
I have a troublesome SQL problem.

We have a set of tables that we create every quarter. My boss has asked me to produce views that union these tables into one big view to produce some reports that look over multiple quarters. (You could ask, why are these separate tables in the first place, which is a question to which there seems to be no satisfactory answer.)

So obviously I do
SELECT 'q1_1999' qtr, firsttable.* from firsttable
UNION ALL
SELECT 'q2_1999' qtr, secondtable.* from secondtable
...etc

But you know what's coming: over the years additional fields have been added that don't exist in earlier tables, so a simple UNION won't work unless I go through and painstakingly fill out the earlier tables with nulled fields. Given I have to do this a number of times for different sets of quarterly data, this is not a task I am relishing.

Is there a way to lessen the burden here? Could I do something with dynamic SQL, or a programming solution that would pull the table definitions and assemble the SQL for me? Is this a problem anyone has met before? I'm open to ideas. Backend is Oracle.

edit: OK, this is solved thanks to the all_tab_columns view and some dynamic SQL.
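
For anyone hitting the same wall, the shape of it was roughly the following (a sketch only: the table names here are invented, LISTAGG wants 11g, and the generated SELECTs still have to be glued together with UNION ALL and run as dynamic SQL or pasted into a CREATE VIEW):
code:
-- For each quarterly table, emit a SELECT listing the superset of columns,
-- substituting NULL for any column that particular table doesn't have.
WITH tabs AS (
  SELECT DISTINCT table_name
  FROM   all_tab_columns
  WHERE  table_name IN ('SALES_Q1_1999', 'SALES_Q2_1999')   -- invented names
), cols AS (
  SELECT DISTINCT column_name
  FROM   all_tab_columns
  WHERE  table_name IN ('SALES_Q1_1999', 'SALES_Q2_1999')
)
SELECT 'SELECT ''' || t.table_name || ''' AS qtr, '
       || LISTAGG(CASE WHEN a.column_name IS NULL
                       THEN 'NULL AS ' || c.column_name
                       ELSE c.column_name END, ', ')
          WITHIN GROUP (ORDER BY c.column_name)
       || ' FROM ' || t.table_name AS select_stmt
FROM   tabs t
CROSS JOIN cols c
LEFT  JOIN all_tab_columns a
       ON a.table_name = t.table_name AND a.column_name = c.column_name
GROUP  BY t.table_name;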

saint gerald fucked around with this message at 20:45 on Aug 2, 2013

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Moogle that is pretty awesome and strongly resembles what I've half-assed out so far, though unfortunately I'm going to have to add textual distinctions to it since, as Anaxandrides notes, the data isn't homogeneous. In fact, it's even less homogeneous than that, with an average field being like:
"Bright blue thing with the number 20 printed on it 13.5 inch and 16mm"

Thanks for your help guys, but don't waste any more time on it since I have (miraculously) actually been able to clean up the data at the source by making a big stink about it(!)

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

Scaramouche posted:

Moogle that is pretty awesome and strongly resembles what I've half-assed out so far, though unfortunately I'm going to have to add textual distinctions to it since, as Anaxandrides notes, the data isn't homogeneous. In fact, it's even less homogeneous than that, with an average field being like:
"Bright blue thing with the number 20 printed on it 13.5 inch and 16mm"
Ugh, I hate dealing with that sort of data. I wrote a browser plugin script to attempt to convert Amazon prices into a price per [fixed quantity], because it's a pain in the rear end figuring out which is better of "500g container, $20" or "2-pack of 5oz containers, $15". It only helps compare about four in every five entries though because the product titles are so jumbled.

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

roomforthetuna posted:

Ugh, I hate dealing with that sort of data. I wrote a browser plugin script to attempt to convert Amazon prices into a price per [fixed quantity], because it's a pain in the rear end figuring out which is better of "500g container, $20" or "2-pack of 5oz containers, $15". It only helps compare about four in every five entries though because the product titles are so jumbled.

Knowing how the data is heterogeneous actually makes this pretty trivial. Amazon's part catalog is so enormous that no one-size-fits-all solution can be found without really detailed analysis and logic. But Scaramouche's example of an "average" record is still pretty simple to rip through. It would just require deeply nested CASE statements and SUBSTRINGs. Trivial to write, but an absolute bear to debug/read if you don't know what it's meant for.

That is to say, probably bad (but functional) code.

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!
This authorization question has been bothering me for a while and I don't know how to solve it best.

Say you're implementing a mock Facebook: a web app calling into a DB with a users and a messages table. You don't want users to be able to access rows in the messages table that don't belong to them.

There are a bunch of ways of doing this. Simplistically, assuming the web app verified and validated the id of the user making the request (as in, you fully trust it), you can just throw in a WHERE user_id = ? (and user_id = ? for INSERTs) into every one of your queries and now you're preventing users from selecting other people's rows, and inserting as if they were another user.

This is fine and dandy until the messages table is no longer one join away from users but is instead, say, 10 tables away. Now you have a giant clusterfuck of a query. Yeah, you can probably ameliorate it with some magical query-building ORM fairy dust that will figure out the join for you behind the scenes, and the DB might even perform very well, but let's ignore that option for the sake of argument.

The other option is to create a "shortcut" table for each of these distant tables I want to ACL; the table would be a many-to-many between, e.g., messages and users, skipping everything in between. Now the problem is that every insert involves two tables, but perhaps that's not a biggie.

I think another option is to build a DB procedures layer between the application and the DB and have that layer handle access control. There are likely quite a few more approaches.

I'm wondering what has worked well for you in a web app situation like what I mentioned above. I'm looking for something that's first of all elegant and easy to maintain, and ideally good enough perf-wise.

DreadCthulhu fucked around with this message at 07:02 on Aug 3, 2013

v1nce
Sep 19, 2004

Plant your brassicas in may and cover them in mulch.
Generally speaking I'd store the user_id with the item in question, going straight off your messages example. I can't think of a situation in which your users table would be a long distance of joins away from the messages table, so maybe that's just a weak example. That said, a little duplication of IDs is a far better solution than a lot of joins.

As you already worked out, I wouldn't recommend the ORM solution because that's just simplifying the work you have to do, whereas you might still be running some awful query in the background that brings the DB to its knees. I don't like using frameworks where you can't easily do custom queries for exactly this reason.

In lieu of a better example, let's talk concepts. We'd normally keep the ACL element on the code side and not the database side; for instance we typically wouldn't do a join across an ACL table which imparts the permissions. By that, I mean we would NOT:
code:
SELECT * FROM
messages
INNER JOIN users on users.id = messages.user_id
INNER JOIN acl on acl.user_id = users.id AND messages.type = acl.message_type
WHERE 
users.id = '123'
Instead, our user permissions are usually stored in the session and loaded just the once when the user logs in. We'd then:
code:
SELECT * FROM
messages
WHERE 
messages.user_id = '123'
AND messages.type IN ('A', 'B')
where the A and B message types are part of the relevant permissions drawn from the user's permissions stored in the session.

Alternatively, we might have different queries depending on the user's permissions; it's not unheard of to have a getMessages() method in your model which checks the user's ACL and then runs the appropriate query. That way, your controller code still calls all the same stuff regardless of the user, but your model pays attention to the permissions.

Sometimes permissions checks in your query are fine; it depends entirely on what you're up to. Tool for the job and all that. Can you cite an example where you've got a many-table join, caused by permissions checks, that you'd want to collapse?

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

Thanks for the extensive reply, much appreciated. I definitely like the id duplication approach as it makes the query really simple, and depending on the situation it might be the very best choice. The duplication is probably a smaller deal than I imagine since you don't really change user ids and so there wouldn't be a situation where you'd need to go through dozens of tables and change each row in there.

There are a couple of situations in what I'm working on where this might suck a bit: one is obviously the extra space the extra column takes up for large analytics tables with raw data in them, but I suspect that's not worth fretting over until I hit terabytes, which, with 99.9% likelihood, I never will.

The other issue is potentially having multiple "owners" of that message row, where one id will not cut it anymore. The situation would be along the lines of this totalitarian example. You have:

- a bunch of admins with potentially overlapping sets of users they oversee (they can read their messages)
- admins to groups join table
- groups of users
- groups to users join table
- users
- messages owned by users

Now if you want to figure out, as an admin, which messages you have access to, you can't exactly tag every user's message with your id, since multiple admins might have access to that same message. How do you avoid doing 5 joins here? There must be a really obvious solution, but I'm just too green to see it.
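
For reference, the read path I'm trying to avoid looks roughly like this (all names invented; :admin_id is a bind parameter):
code:
SELECT m.*
FROM messages m
JOIN users u          ON u.id = m.user_id
JOIN groups_users gu  ON gu.user_id = u.id
JOIN groups g         ON g.id = gu.group_id
JOIN admins_groups ag ON ag.group_id = g.id
JOIN admins a         ON a.id = ag.admin_id
WHERE a.id = :admin_id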

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
I am parsing kill logs from a game server and trying to determine how many matches a playername has participated in (in SQL Server 2012).

I have data like:

code:
Table KillInfo

MatchID  Killer  Victim
1         Foo     Foo
1         Foo     Bar
1         Bar     Foo
1         Bar     Buzz
1         Bar     Foo
1         Foo     Buzz
2         Foo     Foo
2         Foo     Bar
2         Bar     Foo
2         Bar     Buzz
2         Bar     Foo
2         Foo     Buzz
3         Bar     Foo
3         Foo     Bar
3         Foo     Foo

What's the best way to determine the number of matches in which a playername appears (as either killer or victim)?

I have a kludged-together method where I select distinct MatchID, Killer and then union it with MatchID, Victim, then select distinct of that to purge duplicate records, then do a Name, COUNT(1) grouped by Name to get the total.

This seems really kludgy and inefficient.

Is there a better way to do this?


Agrikk fucked around with this message at 20:10 on Aug 4, 2013

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

Agrikk posted:

What's the best way to determine the number of matches in which a playername appears (as either killer or victim)?

As a completely off-the-top-of-my-head query,

SELECT COUNT( DISTINCT MatchID ) FROM KillInfo WHERE Killer = player OR Victim = player

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
Solved it in a line with no temp tables:

code:
select tem.player, count(1) as MatchesPlayed from
	(select  MatchID, Killer as player from KillInfo
		union 
	select  MatchID, Victim as player from KillInfo) as tem
group by tem.player
order by count(1) desc
Any gotchas here?

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

DreadCthulhu posted:

- a bunch of admins with potentially overlapping sets of users they oversee (they can read their messages)
- admins to groups join table
- groups of users
- groups to users join table
- users
- messages owned by users
Now if you want to figure out, as an admin, which messages you have access to...
Instead of going off the deep end, allow me to start off with the Professor's/Student's Nightmare/Bliss: "Not enough information". If the first step is to understand the problem, then we friendly goons have an inkling, but only you know what you want. Ergo, we move to step two, the generation of solutions. I'm not saying you haven't thought about any of this, but:

I propose that you think about all ways this could be done. You may already be stuck with the layout you describe, but I'm going to assume you aren't :iamafag: so there are a great many pertinent questions: Are administrators just users with special privileges, or an entirely different collection of people (presumably)? Can users be members of multiple groups? What privileges does group membership provide? Are groups recursive/hierarchical? Do administrators just have the ability to see things, or are there other privileges they do or might have? Is a message posted just "by a user", or is it by a user into a specific group (to which the user belongs)?

Generate ideas based on potential pragmatic models. Is this supposed to be like the Unix filesystem? Is it more like the SQL role system? Should administrators mirror SA moderators? Is it a corporate structure with workers, supervisors, managers, middle managers, middle supervisors, dumbjerkery, chief officers, seventeenth-level presidents, and bigwigs that are so far gone no one knows their title?

It's clear that one can propose all manner of theoretical contrivances --- no, your example is not contrived, just not really narrowed down in my mind --- that will require many joins. We frequently see examples approaching arbitrary tree structures, and those are certainly the simplest examples of this list of joins you've indicated. Separated tree structures are most likely to require joins by their very nature, whereas some of the above models get around that by flattening "administrator privilege".

For example, one could contrive: Employees -- Supervisors -- Managers -- Administrators as a pure hierarchy, and then create a completely separate "Quality Control" organization of overseers of each one of those levels: EmployeeWatchers -- SupervisorWatchers -- ManagerWatchers -- AdministratorWatchers. Getting "All employees under admins that AdminWatcher 552 watches" is clearly not so easy. If supervisors have multiple managers, it's more complicated, not to mention the question about who manages the ManagerWatchers and who watches them. :buddy:

You could likely make lots of progress by flattening a bit. Unix groups, for example, require explicit user membership and there's no recursive lookup to consider. SA permits everyone to see everything (well, as far as I know), but a system of privileges controls other changes. SQL roles, PG privileges, SELinux, and others, are all based on tagging each object so access/privilege control can be enforced, so there's a great deal more operational overhead.

Think about lots of possibilities, narrow it down to two or three models that seem to fit your needs, and then consider the functional and performance side of the issue. One could flatten your tables with triggers or materialized views, but not without first being very clear (with oneself) about the expected data set size, update cycle, and expected optimizations.
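
To make the "flatten with triggers or materialized views" option concrete, one could maintain something like the following next to the normalized tables (a sketch; every name is invented, and the population step would live in a trigger or in the application's message-insert path):
code:
-- One row per (admin, message) pair, so the admin's read query is a single lookup.
CREATE TABLE admin_message_access (
    admin_id   INT NOT NULL,
    message_id INT NOT NULL,
    PRIMARY KEY (admin_id, message_id)
);

-- On message insert, walk the group membership once at write time
-- (:new_message_id and :message_owner_id are bind parameters from the inserting code).
INSERT INTO admin_message_access (admin_id, message_id)
SELECT DISTINCT ag.admin_id, :new_message_id
FROM groups_users gu
JOIN admins_groups ag ON ag.group_id = gu.group_id
WHERE gu.user_id = :message_owner_id;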

MC Cakes
Dec 30, 2008
Our server has like 3.5TB worth of data stored in our mysql datadir, but when I zip a mysqldump the resulting files are worryingly small:

For example, we have a 650MB mysql data directory, and the resulting mysqldump.sql.gz file is only 95MB. Zip is supposed to only have like a 50% compression ratio, so I'm worried where the other 150MB are disappearing off to.

Could someone please reassure me that when you import from a mysqldump, some of the .myi index and .myd data tables are not represented 1:1 for size in the mysqldump, and that's why the mystery data is disappearing? I've tested a restore on one of our databases, but one test isn't very reassuring.

RoadCrewWorker
Nov 19, 2007

camels aren't so great

MC Cakes posted:

Zip is supposed to only have like a 50% compression ratio, so I'm worried where the other 150MB are disappearing off to.
Where are you getting this from? Compression depends heavily on the redundancy in your data; I regularly get >80% ratios on binary dumps of tables with similar or very clearly structured data, and that's not even counting the redundancy introduced by using human-readable SQL dump files for storage (which obviously don't include the raw index data except as implicitly defined SQL code).

MC Cakes
Dec 30, 2008
I was only able to find a handful of blogs that mentioned the average ~50% compression ratios of mysqldump files. The 50% was the result of a few different tests by bloggers on their own databases comparing 7z vs bz2, etc; I was perhaps erroneously assuming that my compression ratio would be similar to theirs.

Sudden Infant Def Syndrome
Oct 2, 2004

Could you not unzip on another server and compare?

MC Cakes
Dec 30, 2008
To clarify, the datadir is 660GB, the mysqldump is 500GB, and the mysqldump.gz is 100GB.

So it's about 80% compression, which was unexpected, plus there's already a ~15% reduction in size just from going from a directory of .frm, .myi, and .myd files to the dump.

Importing a mysqldump seems to restore the directory back to the original size (in that one test I did), but the compressed file was just so unexpectedly small that it alarmed me, and I was just looking for someone to pat me on the back and tell me everything's going to be OK.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

MC Cakes posted:

So it's about 80% compression, which was unexpected, plus there's already a ~15% reduction in size just from going from a directory of .frm, .myi, and .myd files to the dump.
Yeah, it really depends on what sort of data you have. If you've got a lot of image files and junk in BLOB columns then 50% would be pretty good. If you've got a lot of plain text, 80% wouldn't be surprising at all. And any index columns and stuff aren't going to be stored in the dump to begin with (because they can be recreated from the data), so if you have a lot of indices that's a huge saving right off the bat.

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

That was awesome, thank you. It actually took me a while to process what you said, sorry for the late response, but in the end it made a lot of sense. I definitely didn't put a lot of thought into the original schema design, and right now what I have works OK. Once (and if) we need to add a few more complex scenarios I'll certainly revisit the design and see if it can't be significantly simplified by imitating one of the existing patterns out there, as you suggested.

On an unrelated note, am I correctly understanding that in PG with WAL-E I could safely NOT base backup pg_xlogs (by say moving the folder outside of pgdata and symlinking to it) and still be able to do full recovery in case of meteor strike? If I'm getting this right, the WAL segments and the base backup without pg_xlogs should have everything I need to do full recovery.

WHERE MY HAT IS AT
Jan 7, 2011
I have a quick question about laying out databases/tables. I'm working on a side project which will have an SQL database containing all the books for all the programs at my school. Now, I've never done database design at all before, but I'm assuming it's a Bad Thing to just lump everything into one table, correct? Would the best way to break it down be to have a table for each program? Or should I be breaking stuff down even further and using a different database for each program and tables for each semester/year? All told I don't imagine it will be more than ~1500 books across ~150 programs or thereabouts.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

WHERE MY HAT IS AT posted:

I have a quick question about laying out databases/tables. I'm working on a side project which will have an SQL database containing all the books for all the programs at my school. Now, I've never done database design at all before, but I'm assuming it's a Bad Thing to just lump everything into one table, correct? Would the best way to break it down be to have a table for each program? Or should I be breaking stuff down even further and using a different database for each program and tables for each semester/year? All told I don't imagine it will be more than ~1500 books across ~150 programs or thereabouts.
I would say no, if these are the table separations you're considering then you should have it all in one table.

Separate tables are generally for separate columns, not separate blocks of data. If you're going to break it down it should be along those lines - eg. you might have a table for "programs" and a table for "books", and the books table would have an ID reference to the programs table to indicate which program it belongs to. Or, if a single book can belong to multiple programs, you'd have a third table "bookprogram" containing only a book ID and a program ID, to make those references.

But if the database's purpose is only to contain books, not programs, and the programs are just going to be referenced by ID, then you don't need a programs table.

Edit: there are some contexts where two or more identically structured tables are a worthwhile thing, but your example is not one of them.

roomforthetuna fucked around with this message at 21:23 on Aug 9, 2013

RoadCrewWorker
Nov 19, 2007

camels aren't so great
Yeah, 1500 is nothing, so you don't need to split it up because of data volume. Tables should generally model objects and their attributes to reduce redundancy (Book A is used in 2 programs; if all of its attributes are listed twice, someone could change one entry and create inconsistent and conflicting information) and make updates and queries easier.

WHERE MY HAT IS AT
Jan 7, 2011
Okay, that makes sense. Thanks a lot!
It seems to make the most sense to just shove it all in one table then. Now, barring any books being used for more than one year, there shouldn't be a problem with having the following columns, right?
Program - Year - Title - Price
This seems to make the most sense to me, since trying to model it as a list of programs with associated books would require a variable number of columns or something. The end game is to have the database accessible through a website where you pick your program/year from a drop-down and it shows you your books and how much they cost etc. So organizing it like that and having the query be like SELECT * FROM books WHERE programID='foo' AND year='bar'; would be the simplest way to do it, correct?

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

DreadCthulhu posted:

That was awesome... It actually took me a while to process what you said,... Once (and if) we need to add a few more complex scenarios I'll certainly revisit the design and see if it can't be significantly simplified by imitating one of the existing patterns out there, as you suggested.

On an unrelated note, am I correctly understanding that in PG with WAL-E I could safely NOT base backup pg_xlogs (by say moving the folder outside of pgdata and symlinking to it) and still be able to do full recovery in case of meteor strike? If I'm getting this right, the WAL segments and the base backup without pg_xlogs should have everything I need to do full recovery.
You're welcome. If I did one post per thought instead of lumping them all into a single page, I'd have a higher post count. Haha. Whatever you do, remember that things like tree structures or the examples given are the types of things that suffer from design lock-in. You may get stuck with inefficient extensions, and processing the data to model it in a different way might take considerable processing time.

For base backups, note that the documented procedure (24.3.3.5) is to remove pg_xlog anyway. On the other hand, if you're not appropriately using the -X option, you may miss transactions on high-traffic systems. The argument is that a base backup plus all WAL archives is sufficient to roll forward to a specific point in time. Note that point-in-time recovery takes time to execute (though there are some tricks to speed it up), and that's after you've transferred the base backup onto the recovery system in addition to all the archives (whether streamed in or flat copied). If your archived transaction logs exceed the size of a base backup (for >20GB clusters), you're getting to the point where you need to consider the balance between the two. In any case, it seems unwise to go more than a day without a base backup. Bleh, there are lots of exceptions here.
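
For what it's worth, the usual WAL-E wiring is roughly the following, per the README's example pattern (the envdir layout, paths, and 9.2 data directory are just placeholders):
code:
# postgresql.conf -- ship every completed WAL segment as it fills
wal_level = archive
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# periodic base backup (its pg_xlog contents aren't what you restore from)
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.2/main

# recovery.conf on the restore side -- pull WAL back during replay
restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'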

men with puns
Feb 8, 2010
Young Orc

WHERE MY HAT IS AT posted:

Okay, that makes sense. Thanks a lot!
It seems to make the most sense to just shove it all in one table then. Now, barring any books being used for more than one year, there shouldn't be a problem with having the following columns, right?
Program - Year - Title - Price
This seems to make the most sense to me, since trying to model it as a list of programs with associated books would require a variable number of columns or something. The end game is to have the database accessible through a website where you pick your program/year from a drop-down and it shows you your books and how much they cost etc. So organizing it like that and having the query be like SELECT * FROM books WHERE programID='foo' AND year='bar'; would be the simplest way to do it, correct?

If you make the table this way, each row becomes a combination of program + book, rather than just a single book. This will almost certainly cause some problems. For example, you couldn't enter data about books and programs independently of each other. If nothing else, you'll be entering duplicate / redundant data about books next year / semester, because some programs are probably going to re-use the same books they used before.

You probably want to normalize your database to avoid these problems. I don't know offhand of somewhere that explains normalization without a lot of jargon, but let me try to get close to the spirit of it like this: The main idea is to prevent different types of things from depending on each other (what if you want to create a record for a program that doesn't use any books?) or having redundant data (having to enter a book's name multiple times, or update the price data multiple times, because it's used again somewhere).

Each type of thing (book, program) should get its own table with a separate column for each type of data you care about. In the book table, there shouldn't be anything about programs that use it. If possible, include at least one thing that can uniquely identify the thing (ISBN, combination of program name + year, etc). If you create a unique index on this column (or for the programs table, the pair of columns for program name + year), you can prevent accidentally having duplicate records for things in the table. It's common to create another ID field (typically an integer that automatically counts up as you add rows, the details vary depending on your DB software) to serve as a PRIMARY KEY which is the main way to refer to a single row in your table. This field doesn't have any meaning outside of the database, but makes it easier to identify rows and relate tables to each other.

A relationship between two types of things (program A uses book B) is also a type of thing, and this relationship should also get its own table. The relationship table would contain copies of the primary keys used for your books and programs. Usually this means it will just have two columns: one for a program's id, one for a book's id. You can make an index on the columns of this table to prevent duplications too.
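
To make that concrete, the tables might look something like this (just a sketch; the auto-increment syntax for the id columns depends on your DB software):
code:
CREATE TABLE books (
    id    INT NOT NULL PRIMARY KEY,   -- auto-increment/IDENTITY/SERIAL, per your DB
    isbn  VARCHAR(17),
    title VARCHAR(255) NOT NULL,
    price DECIMAL(8,2),
    UNIQUE (isbn)
);

CREATE TABLE programs (
    id   INT NOT NULL PRIMARY KEY,    -- same deal
    name VARCHAR(100) NOT NULL,
    year INT NOT NULL,
    UNIQUE (name, year)
);

CREATE TABLE programs_books (
    program_id INT NOT NULL REFERENCES programs (id),
    book_id    INT NOT NULL REFERENCES books (id),
    PRIMARY KEY (program_id, book_id)
);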

Then, when someone wants to look up books for their programs, you start with the program name / year, find the matching record for their program, JOIN the relationship table, and finally get the data from the matching rows from the book table.

Something like:
code:
SELECT b.*
FROM   books b
       INNER JOIN programs_books pb
               ON b.id = pb.book_id
       INNER JOIN programs p
               ON pb.program_id = p.id
WHERE  p.name = 'program name'
       AND p.year = 2013
It ends up with a more complicated query, and multiple tables, but it should make it easier to enter the data, or work with it in other ways you may not have anticipated.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!
In this particular case it might make sense to have 'year' be a column in the relationship table, since programs may not be specific to a year, books may not be specific to a year, but the information you're wanting out is books-that-go-with-a-program-in-a-year. I can't see a problem with year being in that table, and I can't see a way it makes sense in either of the other tables.

Unless programs are each year-specific, in which case putting year in with the program would be the way to go.
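
That first option would look roughly like this (a sketch, reusing the table names from the post above):
code:
CREATE TABLE programs_books (
    program_id INT NOT NULL REFERENCES programs (id),
    book_id    INT NOT NULL REFERENCES books (id),
    year       INT NOT NULL,
    PRIMARY KEY (program_id, book_id, year)
);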

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Weird little question that I think relates to a PIVOT but I'm not sure, since I am never able to think those through. The way I use MSSQL is more programmatic; I make stored procs that update, insert, and transform things, which is why I have trouble on fiddly little 'end user' requests like this. Basically I have an orders table. I've been asked to prepare a spreadsheet that is intended to show a customer's email, and then the date of each order they've made on one line. So it would look like:
code:
Email | Order1Date | Order2Date | Order3Date | Order4Date | etc.
The problem of course being, the actual number of orders is not known. I can easily find emails that have placed multiple orders with something like this:
code:
SELECT
 Email,COUNT(*) 
FROM
 Orders
GROUP BY
 Email
ORDER BY 
 COUNT(*) DESC
And I can even find the first and last order using MAX(OrderDate), that kind of thing. What I'm not sure of is how to make this work on one row without either:
A) hand-coding a TON of sub-selects of (x) layers, where (x) is the max number of orders
B) something involving PIVOT/COALESCE that I can't seem to get off the ground because I keep running up against how to extract the various OrderDates and sort them.

Does anyone have any insight on how to proceed? You don't have to solve the entire thing, even just pointing to a good starting point could be helpful.

Salt n Reba McEntire
Nov 14, 2000

Kuparp.
You could use a STUFF & XML PATH in your select list to easily grab each order date grouped by email address, and then simply leave the order dates comma'd and deal with it in your presentation logic (which is simple enough to achieve, and you avoid the dynamic sql crap).

So you'll end up with:

[email], [order1],[order2],...,[ordern]

Stuff and path is easy to deal with; it's just something like (off the top of my head, so check syntax):

code:
SELECT [o].Email,
	STUFF((SELECT DISTINCT ','+CAST([_o].Date AS VARCHAR) 
		FROM Orders [_o] 
		WHERE [_o].Email = [o].Email 
		FOR XML PATH('')),1,1,'')
FROM Orders [o]
GROUP BY [o].Email
e: gawd, asleep when I wrote this.

Salt n Reba McEntire fucked around with this message at 22:48 on Aug 15, 2013

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Moogle posted:

code:
SELECT [o].Email,
	STUFF((SELECT DISTINCT ','+CAST([_o].Date AS VARCHAR) 
		FROM Orders [_o] 
		WHERE [_o].Email = [o].Email 
		FOR XML PATH('')),1,1,'')
FROM Orders [o]
GROUP BY [o].Email

Well that's frankly awesome; I didn't even need the STUFF because the first comma can act as the separator between email and the first order for CSV. Thanks a lot! I'm going to have to look into this XML PATH stuff I think because that solution came entirely out of left field for me.

more like dICK
Feb 15, 2010

This is inevitable.
I'm not super good at SQL, so I hope someone can help me with this. I've got a table with "duplicate" rows. The primary keys are still unique, but there are some rows that are otherwise identical.

Table looks like:

- Primary Key
- String
- String
- Timestamp

So duplicate rows would look like:
12345 | "Foo" | "Bar" | 2013-07-03 16:07:35.205
12346 | "Foo" | "Bar" | 2013-07-03 16:07:35.205

The three non-PK columns uniquely identify a row. No individual row is duplicated more than once.

Is there any easy way to find and delete these duplicates?

edit: It's DB2 if that makes a difference

edit2:

It looks like maybe something like this will work?

code:
DELETE FROM the_table t
WHERE EXISTS ( SELECT *
               FROM the_table t2
               WHERE t.column_one = t2.column_one
                 AND t.column_two = t2.column_two
                 AND t.column_three = t2.column_three
                 AND t.primary_key < t2.primary_key );

more like dICK fucked around with this message at 15:16 on Aug 16, 2013

Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe

more like dICK posted:

The three non-PK columns uniquely identify a row

Then put a loving unique constraint on it to avoid this poo poo in the future!
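
Something like this, using the placeholder names from your edit (once the duplicates are gone, obviously):
code:
ALTER TABLE the_table
    ADD CONSTRAINT the_table_uq UNIQUE (column_one, column_two, column_three);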

more like dICK
Feb 15, 2010

This is inevitable.
The database is literally older than I am; I don't control it. Thanks for your help.

Bad Munki
Nov 4, 2008

We're all mad here.


Maybe something like a group by on the three matching columns, selecting the max (or min, or whatever you prefer) of the actually-unique ID, and nesting that whole thing in a delete-where-not-in statement... Not exactly graceful, but it ought to function.

This is just off the top of my head and I always get group by statements wrong until I fiddle with them a bit, but I'm thinking something like:
code:
delete from yourtable where ID not in (select max(ID) from yourtable group by col1, col2, col3)
Again, that's just typing without thinking there, it's up to you to figure out if something like that will actually work in your case. I think the concept is right, though, because for the duplicates, it'll just grab one row from each, and for any non-duplicates, it'll grab them wholesale. And then delete everything else. You could do a quick check by replacing the 'delete' with a 'select *'.

And then once you've fixed it, yeah, put a constraint on there to prevent future borkage, as advised.

Sab669
Sep 24, 2009

I have a sort of database design question I think.

I've got an application where the user can enter a bunch of values and it performs a lengthy-but-basic calculation. Just basic arithmetic. But my customer wants to be able to modify some "constants" in this equation. For example, one of the formulas:

(((D54*0.30)+(E14 X E11 X 0.3464)+(E14 X E11 X 0.125)+95+35+5+((D54 X 0.95) X 1.11))+600) / E11


Taken from an Excel spreadsheet; all the 'variables' are coming from the end user entering them into the application, so that's not a problem. But they might be changing out all those actual numbers there.
I don't know if I should have 1 table with 5 different rows (5 formulas they want to change) and just have 10 columns or something for the new "constants". Unfortunately, not all of the formulas have the same number of constants so that's kind of ugly. But then the actual record for the form could just have a foreign key to which set of formulas to use? I have no idea :suicide:

Sab669 fucked around with this message at 17:02 on Aug 19, 2013

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Sab669 posted:

I have a sort of database design question I think... Variables, formulae, functions, oh my!

(((D54*0.30)+(E14 X E11 X 0.3464)+(E14 X E11 X 0.125)+95+35+5+((D54 X 0.95) X 1.11))+600) / E11
You need to ask yourself what is appropriate to store in the database given your needs. For example, if you have a small number of frequently-used formulas that appear in a number of queries, you can get away with using a stored procedure. Once the number of such grows rather large, however, that becomes somewhat non-optimal, particularly if you're expecting to use a query to resolve a function name and magically "dereference" it to execute as a procedure.

What is the nature of these constants, and the expected purpose of database storage for them? Are you expecting parameters to be varied by user, invoice, or billing cycle? Do you need an audit trail for parameter changes? Are you merely storing them because they are used again later?

Regarding the functions, there are security concerns dependent on the type of stored procedure you create, but it does not sound like you have enough function variation to warrant a more complicated approach to storage. I mean, you could create a table of named functions, their TeX representation, and any of a number of valid computational representations, followed by EXECUTE calls and some flappery to get the appropriate parameters, but that's no small amount of work, blows query optimization, and it's quite likely that you have a programming language of choice that would permit you to do all that work at the application layer.

Edit: Based on your response below, it sounds like you just need to save them in a shipping parameters table so you can select them, load them into a hash, and use them when you're generating output reports. Since you've mentioned no need to have all that generation happen within the database, there will be less pain if you stick with whatever approach you find easiest for those steps. While you can build a reporting view with the appropriate per-item columns for weight and size charges (assuming the appropriate data is already stored with each item), some values likely depend on aggregates, so you're really talking about a two-step (or more) report and a bunch of things that you'll likely want in revision control anyway, unless you really are setting out to hurt yourself.
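
A sketch of such a parameters table (all names and types invented):
code:
CREATE TABLE ShippingParameter (
    FormulaName   VARCHAR(50)   NOT NULL,  -- e.g. 'AirFreight'
    ParameterName VARCHAR(50)   NOT NULL,  -- e.g. 'PerPoundRate'
    ParamValue    DECIMAL(12,4) NOT NULL,
    EffectiveDate DATE          NOT NULL,  -- so old cost sheets stay reproducible
    PRIMARY KEY (FormulaName, ParameterName, EffectiveDate)
);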

PhantomOfTheCopier fucked around with this message at 02:38 on Aug 20, 2013

Sab669
Sep 24, 2009

Basically, it's a bunch of fees associated with shipping products from China to the US. So it's a list of fees relative to how it gets shipped: boat, air, combinations thereof, and a few other things (testing products for the presence of various metals).

Frankly the data won't change frequently at all; they might add a record per year, I'd imagine. But new data is new data. So this "Cost sheet" will use a bunch of these fees in factoring out the total cost of an item. Then in 6 months maybe DHL will charge X instead of Y per pound to ship a product, so they'll need to update these values and use them on future "cost sheets". We don't want to hardcode it, so we don't need to recompile & patch the software when these infrequent changes occur, obviously.

Red Red Red
Mar 7, 2013

This is Oracle-specific. We are switching from 10g to 11g and moving datacenters at the same time. Part of all of this change is that I have a task to change every occurrence of sysdate to current_date.

I have a lot of sql that does
code:
and order_date >= trunc(sysdate) 
and order_date < trunc(sysdate) + 1 
but when I replace the sysdate with current_date and do the same thing it just takes forever to run.

Is there a better way to optimize trunc(current_date)? I thought the two were the same, except that current_date returns the date of the local machine and sysdate returns the date of the DB server. Am I wrong?
