Xae
Jan 19, 2005

Tinyn posted:

I have a situation where I need to request some data by ID numbers, but I need to get the data for some N number of ID numbers. I could just call the simple stored proc N times, once for each ID number, but that performance sucks. Making a stored proc that takes 100 arguments, some of which might be NULL, to allow for however many ID numbers I need at a time is obviously retarded.

Dynamic SQL is technically an option, but the way the DBAs think, it really isn't. How can I create a stored proc that will handle this problem?

Who is your Database Vendor?


Options:
Pass an Array/Stack in.
Temporary Table
Batch Table: a two-column table with a Batch_ID and a key for the other ID numbers; pass the Batch_ID in to the function (rough sketch below).
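For the batch-table option, here's a rough sketch (generic SQL, all table and column names made up) of how the proc would consume a batch:
code:
-- caller inserts the N ids under one batch_id, then passes just that batch_id in
CREATE TABLE id_batch (
    batch_id  INT NOT NULL,
    item_id   INT NOT NULL
);

-- inside the proc: join against the batch instead of taking N arguments
SELECT  d.*
FROM    my_data d
        INNER JOIN id_batch b ON b.item_id = d.id
WHERE   b.batch_id = :batch_id;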


Xae
Jan 19, 2005

duck monster posted:

Copy and paste that poo poo, for instant View love. It's that easy. Also, views in theory are much faster than constantly feeding big queries over and over again into it. Pre-compilation or something. Not entirely sure why, but when we benchmarked it at work, the speed differences were outrageous at times.

Caveat: views tend to dislike this sort of thing: SELECT * FROM TABLE
You're better off using something like SELECT raraID, raraName, raraHeyMan FROM TABLE. This means you need to watch what happens with your views when you change the tables they are dependent on. If you add a field to a dependency, you might need to add that field to the view. I tend to keep a views.txt file with my code projects that has copies of all the views inside it.

One of the reasons views can sometimes perform better than ad-hoc query throwing is that the optimizer/planner will have a set plan for them. If you send things in at will, the DB engine may have to hard-parse: the database optimizes the query and generates a new plan. Often, for small, simple joins, the act of figuring out what the query wants takes longer than retrieving the data. If you are careful about how you write your SQL statements and always write them in nearly the same way, you can get a soft-parse, which doesn't incur as much overhead. This is the mechanism the view exploits to improve performance: since the SQL is executed the exact same way each time, you get soft-parses instead of hard-parses.
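A rough illustration of the idea, using Oracle-style bind variables and a made-up table:
code:
-- Each of these is textually different, so each one can trigger its own hard parse:
SELECT name FROM customers WHERE customer_id = 1001;
SELECT name FROM customers WHERE customer_id = 1002;

-- Written once with a bind variable, the statement text never changes,
-- so after the first hard parse every later execution is a cheap soft parse:
SELECT name FROM customers WHERE customer_id = :customer_id;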

Xae
Jan 19, 2005

chocojosh posted:

So essentially I want to do:
1) EXEC 'Some 40 line select statement that dynamically picks the table'
2) Insert the result of (1) into a temp table
3) sp_executesql 'Parameterized search query into the table created in (2)'.

Suggestions for step 2? Any links on how to make temporary tables in SQL Server 2005, and anything I should know?

(More details, but not important)

I've started a stored proc that starts with a fairly complicated select statement to retrieve a bunch of information. The problem is that I'm using the same stored proc for different tables and I use EXEC with the table name at the top (Yes, I know this is horrible, but I don't want to have to write a new stored proc for each table).

After I get all this data, I need to filter it by a bunch of search conditions, based on the user's input and passed as parameters to the stored proc. For this I want to use sp_executesql (because it prevents SQL injection, as I learned when I used exec and entered ' in my search condition). sp_executesql will not accept dynamic table names or field names to be entered.
I don't know if MSSQL does, but Oracle supports inserting from a select.
Edit: It looks like MSSQL supports it, with similar syntax.
Link
You could do:
code:
INSERT INTO my.insert_table(ins_col_1, ins_col_2, ins_etc)
(SELECT	Col_1,
	Col_2,
	etc
FROM	my.table
WHERE	condition = 1
)
It is generally easier and more efficient to have the database do the filtering than to do it yourself, unless you are filtering by non-indexed columns on a very large table.


Temp Table Link

Xae fucked around with this message at 19:53 on Nov 6, 2007

Xae
Jan 19, 2005

chocojosh posted:

Thanks for the link.

I'm wondering if there are any "gotchas" for using temp tables or table variables (and what's the difference between the two).


I'm not sure what you mean by

"It is generally easier and more efficient to have the Database do filtering than doing it your self. Unless you are filtering by non-indexed rows on a very large table."

This part of your first post:

quote:

After I get all this data, I need to filter it by a bunch of search conditions,

Made it sound like you wanted to return a large query and then filter it yourself.


The only gotcha I'm aware of with temp tables is a bug in Oracle if you execute too many transactions against one, but I haven't used them much in MSSQL.

Xae
Jan 19, 2005

chocojosh posted:

Xae: Essentially I need to first get all my data, and then filter it by doing a second select statement with dynamic where conditions (i.e. I build up a where clause for each field the user can possibly search by). By filtering I meant a second select statement that can remove rows from the temp table; not some kind of filtering in my programming language :) I honestly wish there was an easier solution, but not if I want to use sp_executesql which I need to use (this makes the choco very sad).

Is there a reason you can't do it all in one statement? Is this a two-step process, where a user selects one group and then refines? Whenever possible I try to avoid selecting the same data twice.

Xae
Jan 19, 2005

chocojosh posted:

replace(varchar_variable, '''', '''''')


Is that the *ONLY* thing I need to do to avoid SQL injection? Can you provide me with a link (not that I don't trust you, but I'd rather not leave security holes in my software due to my sloppiness)? If so, I may just use one long exec string instead of the table variable.

You would need to check for ';' as well, since they might try to submit a compound statement, e.g. "Select * from sys;Drop X;"
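For whatever parts sp_executesql can parameterize, binding the value means quotes and semicolons in it never become executable SQL in the first place. A minimal T-SQL sketch, with made-up table and column names:
code:
DECLARE @sql nvarchar(4000)
DECLARE @user_input nvarchar(100)
SET @user_input = N'O''Brien; DROP TABLE x'   -- hostile-looking input stays plain data
SET @sql = N'SELECT col_a, col_b FROM dbo.my_table WHERE col_a = @search'

EXEC sp_executesql @sql,
     N'@search nvarchar(100)',
     @search = @user_input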

Xae
Jan 19, 2005

jwnin posted:

This is a bit of a different SQL question, but I thought I'd ask it here.

Our database stores information about our customers. In general, there are very few times where we need to do queries across the entire customer base - what happens at one customer has absolutely no relation to the other ones, other than our people who are assigned to that customer.

As a result, our development team thinks that the best thing to do is to split each customer out into its own database. The catch is that there are around 20,000 customers per year, and each customer database would be about 100 MB in size.

In general I do not feel that this is a good idea, if only because at some point you start to reach SQL's native limitations - number of databases, SP reuse in caching, and even just raw performance for backup/restore/indexing operations. Beyond that, when you do eventually have some level of cross-customer querying/reporting to do, it will not be very performant to query this, even if you go with a data warehouse solution.

Can anyone provide me with some more ammo as to why we would not want to go down this path?
The key is to find the things all the customers have in common. You may need to make some tables more abstract. There are some hideously large databases that use SQL; I doubt you are anywhere near the limit.

Just because the data does not all need to be accessed at the same time doesn't mean that they should be in separate databases. They are logically the same entity type so they should logically be grouped together.

From personal experience I have worked on systems where 15 years before someone said "Just clone the DB and put it in a different schema, we will work something out later", and 15 years later the solution was to pay through the rear end for an outsider to come in and unfuck their mess.

Xae fucked around with this message at 06:24 on Nov 9, 2007

Xae
Jan 19, 2005

This week at work, reviewing some code, I saw the worst SQL statement in history. It was 350 lines long and had 7 selects in it. It UNION ALL'ed them together and then SELECT DISTINCT'ed off the union.


I took the first query, the main one, changed two joins to left outer joins, and deleted the next 300 lines of code.
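Roughly the shape of that rewrite, with made-up names (not the actual query):
code:
-- Before (the spirit of it): several near-identical selects glued together,
-- then SELECT DISTINCT over the whole union
SELECT a.id, a.name, b.detail
FROM   a INNER JOIN b ON b.a_id = a.id
UNION ALL
SELECT a.id, a.name, NULL
FROM   a
WHERE  NOT EXISTS (SELECT 1 FROM b WHERE b.a_id = a.id)
-- ...five more variations...

-- After: one query, with the optional table outer joined
SELECT a.id, a.name, b.detail
FROM   a
       LEFT OUTER JOIN b ON b.a_id = a.id;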


For fuck's sake, people, learn to use non-full/inner/equal joins.


See this or this for more information.

Xae fucked around with this message at 02:03 on Dec 31, 2007

Xae
Jan 19, 2005

code:
SELECT * FROM entries 
WHERE ID = (SELECT entryID FROM metadata 
                        WHERE fieldID = (SELECT ID FROM metafields WHERE name=:name))
Isn't that the same as

code:
SELECT	*
FROM	entries
    INNER JOIN metadata ON metadata.entryID = entries.ID
    INNER JOIN metafields ON metafields.ID = metadata.fieldID
WHERE	metafields.name = :name
It seems like you are using sub-queries instead of just doing a join, unless my still hung-over rear end is missing something.

Xae
Jan 19, 2005

Victor posted:

Xae, joins will possibly increase the number of fields returned, especially if one is performing a semi-join (joining table A against B, but only selecting columns from A). In fact, incorrectly using semi-joins causes people to "fix" the results by slapping in a DISTINCT, where the proper behavior would be to change from a join to an IN subquery.

Hrm, aside from the cock-up about not prefixing the table name on the *, I'm still not sure. I'm installing Oracle on my machine right now to check this out. drat server taking a poo poo :argh:

Edit:
Ran it against the now-default HR schema and I didn't see a difference between the results of these two queries.

code:
Select  locations.*
FROM	locations
    INNER JOIN countries ON countries.country_id = locations.country_Id
    INNER JOIN regions on regions.region_id = countries.region_id
Where	regions.region_id = 2
ORDER BY locations.location_id
code:
Select	*
FROM	locations
WHERE	country_id IN (SELECT country_id FROM countries WHERE region_Id = (Select region_id from regions where region_id = 2))
ORDER BY locations.location_id
There are some minor differences in the explain plan, but nothing I would get worked up about, as they both have similar costs and execution plans.

Xae fucked around with this message at 00:49 on Jan 3, 2008

Xae
Jan 19, 2005

Use a cursor. This is getting pretty sophisticated; you can probably use some SQL hack to do it, but in the long run doing it "the right way" will be best.

Xae
Jan 19, 2005

Get your DBAs fired. They are worthless and are providing nothing of use to your organization. They should be fired and replaced immediately.


I'm not joking. I'm too lazy to check, but I would wager that query has an implicit cursor in it.

If they won't fire them, switch jobs; you're crippling your development as a programmer, unless you really like it there or something...

Xae fucked around with this message at 02:04 on Jan 16, 2008

Xae
Jan 19, 2005

If you care about your data, option 1 is best.

If not, option 2 works decently enough.

Xae
Jan 19, 2005

cletus42o posted:

I've always used option 3, but then I never use constraints in my database - I handle all of that in my application. But maybe I should rethink that?

But I don't know, I've never had problems with it. The most common way I've used option 3 is in a "files" table, where a file can be attached to any other item in the system. I have the file table itself, which only stores info pertaining to the file, then a table I usually call file_item_jn - the join table has the file ID and two fields called foreign_table and foreign_key.

So I haven't regretted it in the places I've used it.

But then I did develop another system where I have separate join tables for each "type" - and then I have a view that unions them all and creates the type field for me (so a "virtual option 3"), since there are places where I have to provide a snapshot of all the data at once.

Yes, yes you should.

There is a reason why millions of dollars are put into R&D. The database will do constraints and transactions faster, more accurately, and just plain better than anything you can possibly conceive of.
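As a minimal illustration using the file_item_jn table mentioned above (column names assumed): the half of that design that points at a real table can be constrained with one line of DDL, and the database enforces it from then on. The foreign_table/foreign_key pair is exactly the part no FK can check, which is the argument against option 3.
code:
ALTER TABLE file_item_jn
    ADD CONSTRAINT fk_file_item_file
    FOREIGN KEY (file_id) REFERENCES files (id);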

Xae
Jan 19, 2005

minato posted:

Sure, but 99% of the time the FK points to the PK of the target table, and PKs should always have an index on them to ensure uniqueness.

I see them as just a good database design principle. They ensure my data integrity, they can help prevent silly coding errors, they can improve performance by ensuring I don't have crufty tables, and they help describe the relationship between two tables.

However like all principles, they sometimes need to be violated in the name of performance.

Foreign keys, by definition, point to the primary key (or a unique field, which could be a primary key). If it isn't a key, it is just a lovely denormalized field.

Having primary keys doesn't hurt performance, unless the developers are morons. So only semi-frequently does it hurt performance.

Most databases will automatically create an index on a column once it is the primary key anyway.

quote:

Yes, but constraints are not free.

Example: if you have a column in a table that references another table, and you put a FK constraint on that, you'd better put an index on it too - even if you have no reason to do so. Why? Because if you don't, when you delete rows from the target table the server will have no choice but to do a table scan of the other table to make sure you're not violating the constraint.

Another example: in the case discussed above, with the head table and a heterogeny of auxiliary tables, if each auxiliary table has a FK constraint back to the main table you'll have hell to pay when it's time to clean out a bunch of rows from the main table. (Actually what you'll have to do is drop all the constraints, clean out the crap, and then re-add the constraints - if you don't you'll be waiting days. This has happened to us and we now don't have those constraints.)

I believe that FK constraints are a good idea when appropriate and when you can afford them, but to blindly throw them into your schema whenever you have a cross-table reference is not wise.
It is not only wise, it is the loving point of a relational database system. Otherwise you could just get a really fast disk and use flat files; it would be cheaper and faster. The database will execute and maintain constraints faster than any system you can make.

Plus, if you can't be sure your data is accurate, why are you bothering to store it?

A full scan (which only happens if you don't bother making an index) is a required thing. Otherwise you breach data integrity. You have orphaned records. You are now hoping that something else doesn't take the old PK and completely gently caress your system. I prefer to rely on something more than hope and dumb luck...

Any system in which deletes would cause so much strain that it would be unworkable with today's hardware probably isn't deleting anyway. They simply end-date the record, set a delete flag, or otherwise execute a logical delete. Data has value, and you don't just throw it away.
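A minimal sketch of the point about indexing the referencing column (hypothetical parent/child tables): the FK keeps the integrity, and the index is what keeps deletes on the parent from scanning the child table.
code:
ALTER TABLE child
    ADD CONSTRAINT fk_child_parent
    FOREIGN KEY (parent_id) REFERENCES parent (id);

CREATE INDEX idx_child_parent_id ON child (parent_id);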

Xae fucked around with this message at 04:41 on Jan 17, 2008

Xae
Jan 19, 2005

Victor posted:

I'm afraid I have a hard time seeing how you have the experience to back this up. At least, where I work, sometimes we go without constraints, but have nightly integrity checks that run to ensure there are no orphaned records. This doesn't catch the circumstance where one can have orphan records created, incorrectly, that then get deleted before this integrity check runs. However, from my small amount of experience, I do not see this special case as a bug pattern common enough to worry about.

You are still running the checks. Is your machine so pegged on CPU that you can't afford to ensure that the data you are bothering to store is being stored correctly?

Let's face it, hardware is cheap; the programmer time to build and maintain your own custom system to keep integrity costs more than just buying a better box. Also, if you are that desperate for computing power, why are you deleting in the first place?

That isn't even touching on the fact that a single failure on an enterprise level system can cost serious cash.

Xae
Jan 19, 2005

Victor posted:

You're just assuming too much. The rule that programmer time is more valuable than computer cost is a good one, but it isn't all-covering, all-important, never-inapplicable. To give you an idea, checking a foreign key constraint can keep transactions open a tiny bit longer, which results in locks being held a bit longer, which can, if one is near a certain breaking point (think adding 1% to a fully packed highway), cause the system to go crazy. Moreover, dealing with constraints can be a royal PITA, due to having to delete in the right order, and having major issues with self-referencing tables.

If your system is at capacity, why not execute an immediate logical delete, then come back for the "real" delete during a batch window? It will relieve stress on your system and keep data integrity.

The database doesn't simply decide to do checks for a random reason. It does them because it needs to. It needs to make sure that the integrity of the database is being kept. If you do not check for child records you will have problems. Someone running an OLAP system is now getting records that should have been deleted counted in their statistics.

You don't need to throw big iron at things to solve performance problems. Even a $5k or $10k server is a hell of a machine these days. If an application takes a month of programmer time, it costs ~$10K between salary, benefits, and overhead. This is why throwing hardware at problems is popular: because it works. If you are working in a completely budget-constrained organization things may be different, but in that case you shouldn't delete immediately anyway, because it takes too many resources.

Constraints are a tool, and a very useful one. They may force you to change how you normally operate, but they do so for good reasons.
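A sketch of that two-step delete, with made-up table and column names: mark the row immediately, purge for real during a quiet batch window.
code:
-- online: cheap logical delete
UPDATE orders
SET    delete_flag = 'Y',
       end_date    = CURRENT_TIMESTAMP
WHERE  order_id = :order_id;

-- batch window: the physical delete, when the constraint checking is affordable
DELETE FROM orders
WHERE  delete_flag = 'Y'
  AND  end_date < :cutoff;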

Xae
Jan 19, 2005

m5 posted:

You're not familiar with the real world. The tables I'm talking about have tens of millions of rows in them. We delete stuff because we need to curb infinite growth in the database - I think it's well over 100GB now in backup form.

I have worked on systems ranging from a ~1995 Sun box to a 45-node Teradata array. I have worked on databases that were written before "relational" was an option. I am quite familiar with the real world. I mainly do remediation work. I come in when a company decides they want a data warehouse, only to realize that their OLTP side is a mess and can't be processed until they unfuck it. I see this crap about "oh, we can manage it better than the database" every goddamn time. Other popular excuses include "Well, it slows down the database!" or "We didn't want to write joins".



quote:

You only have orphaned records if the software has bugs in it, mostly.
So, you just have to trust your programmers to be perfect?

quote:

If there are a few orphaned records floating around among millions and millions of fine happy ones, I'm OK with that. I don't need my database to be a perfect shining jewel of integrity. I need it to mostly work, and to support the heavy load it's under with acceptable performance.
Do not lose sight of the purpose of a database. People base decisions on the data in your database. You are feeding them incorrect or incomplete information. The point of a database isn't just to be a tower of zeros and ones; it is so that the data can be accessed and used. Incorrect data is infinitely worse than no data.


There is a cost associated with using bad data, and it is almost always far greater than the cost of fixing the problem.

Xae
Jan 19, 2005

Part of the thing is that he is asking for design advice. You need to catch things like "will we need to partition" before you start making the system. If you don't, then you end up in a situation where you become trapped and can't fix it.

If he was already running a system I wouldn't advise changing it, but since the system isn't yet built it is best to design it correctly, so that when the inevitable happens you are in the best position to deal with the situation.

You design following "the rules", then break them where needed due to money, politics, or other "real" constraints.

Xae
Jan 19, 2005

quote:

Specifically, I have a Java system that uses ResultSet from the JDBC and PreparedStatement for every single query and I think this is causing real performance issues on huge (20G) databases where the code is doing hundreds of sequential queries. I suspect that cursor performance on PostgreSQL is really bad and should switch to using limit/offset, but I need more ideas to test.

That is your problem. Figuring out what a query wants, especially for a small query, will take longer than retrieving the data.

You need to figure out exactly what queries are being run against the database. There is no RDBMS on the planet that will perform great with bad code running against it.

The majority of tuning is done at the application/design level. It isn't done to database settings, or even individual queries. Your system has to be designed to perform well, and throwing hundreds of queries at a database is designing to kill the DB.
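A sketch of the design-level fix (hypothetical table names): replace the hundreds of sequential single-row lookups with one set-based statement and let the planner earn its keep.
code:
-- Instead of this, executed hundreds of times in a loop:
SELECT * FROM items WHERE id = :id;

-- ...fetch the whole working set in one round trip:
SELECT  i.*
FROM    items i
        INNER JOIN wanted_ids w ON w.item_id = i.id;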

Xae
Jan 19, 2005

Is there a reason why the SQL was being generated instead of hand-written? I know I'm a huge DB nerd, but I always like to write the SQL myself.

Xae
Jan 19, 2005

Victor posted:

That's actually not surprising -- I have seen very few people add a condition to a left join statement that has nothing to do with the join. It's a sad state of affairs, that SQL is so poorly taught... :saddowns:

Sad, but profitable. You can spend a couple of weeks, maybe a month, learning about SQL and how to properly use it, and everyone thinks you are some super genius.

Xae
Jan 19, 2005

Anyone have experience with unit testing frameworks for (PL/)SQL? I have looked at UTPLSQL and SQLUnit, but I figure there have to be more than just those two out there. Both of them look to be overly complex for just PL/SQL testing, as they both seem to be hacks to get JUnit to test SQL.

Xae
Jan 19, 2005

fletcher posted:

PHP/PDO question: If I have a table with an auto-incrementing id field, what is the fastest way to get the id of a new row I just threw in? Just do a SELECT? Or is there some shortcut?

edit: Another question - This query only returns articles that have comments, how do I get it to return the ones that don't have any comments? Looking at the joins tutorial it looks like LEFT OUTER JOIN is what I want, but the query returns the same thing when I try it.

code:
SELECT 
	COUNT(comment.id) AS numComments, 
	article.id 
	FROM article 
		JOIN comment ON article.id = comment.articleId 
	GROUP BY comment.articleId
Edit: Try specifying "LEFT" or "RIGHT" join instead of just "JOIN" and see if that helps.

Ghetto Method:
code:
    SELECT  article.id
    FROM    article
    WHERE   article.id NOT IN (SELECT comment.articleID FROM comment)
For the first question: What RDBMS are you using?

Xae fucked around with this message at 02:02 on Feb 27, 2008

Xae
Jan 19, 2005

fletcher posted:

I'm running MySQL 5.

Whoops, I phrased my question a little wrong. I want it to return articles with comments AND articles without comments in the same query, but count the # of comments if it does have any.

edit: specifying LEFT or RIGHT join gives me the same results

code:
SELECT  article.ID, COUNT(comment.ArticleID) AS numComments  -- count matches, so articles with no comments show 0
FROM	article
    LEFT JOIN comment ON article.ID = comment.ArticleID
GROUP BY article.ID
That should work. I don't have a DB handy to try it, so I probably hosed up the LEFT/RIGHT or the order.

If it doesn't, look at your data; it may be that you don't have any, or some other issue exists.

Xae
Jan 19, 2005

Toanek posted:

Today I was working on my forum project and while adding in the 'Show My Posts' function I got confused as to how I should format my database query. I initially wanted to sort it so it would return an array of topics ordered by the date of your most recent post in the topic. I eventually decided to sort it by the last post from any user in the topic, but I figured I'd still like to learn how to do it the initial way for well... the learning experience. I can't for the life of me seem to figure it out.
code:
SELECT   topics.id,
         topics.parent_id,
         topics.subject
FROM     topics
         INNER JOIN messages
           ON topics.id = messages.parent_id
WHERE    messages.poster = '" . $user->id . "'
GROUP BY topics.id
ORDER BY messages.date DESC
What I have here orders by the date of your first post in the topic, as opposed to the last.

What RDBMS lets you get away with that? Oracle will throw an exception if you try something like that.
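For the original goal (ordering by the date of your most recent post in each topic), something along these lines should keep both MySQL and stricter databases happy; a sketch against the tables in the quote, with the user id left as a placeholder:
code:
SELECT   topics.id,
         topics.parent_id,
         topics.subject,
         MAX(messages.date) AS last_post_date
FROM     topics
         INNER JOIN messages
           ON topics.id = messages.parent_id
WHERE    messages.poster = :user_id
GROUP BY topics.id,
         topics.parent_id,
         topics.subject
ORDER BY last_post_date DESC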

Xae
Jan 19, 2005

Victor posted:

When in doubt with this question, MySQL.

Well, there is another reason to hate the drat thing.

Xae
Jan 19, 2005

bitprophet posted:

QFT.

In my own experience - I've used both to an intermediate level both from a developer and sysadmin standpoint - Postgres just feels more mature, streamlined and sensible, versus MySQL feeling, well, like the PHP of databases.

But, like brae said, make up your own mind - Google a bit, and if that doesn't give you anything that makes sense, then come back here and I'm sure we'd be glad to give you a flamewar some insight :)

A while ago I went and thought of the reasons why MySQL is better than Postgres. I couldn't think of a single one. Postgres is more secure, more stable, and faster. What? Faster? Yes, when it matters. MySQL gives you back that basic query when there is 0 load a split second faster than Postgres, but when the machine is under a heavy load MySQL shits itself.

Xae
Jan 19, 2005

brae posted:

I don't know. Several companies have scaled in a big way using MySQL (Yahoo), and carefully choosing your engine type per-table in MySQL can help with scaling because you get a little more fine-grained control over locking mechanisms. My company uses MySQL and we've found it to be scalable. Granted, we haven't done a side-by-side with Postgres, but it's certainly fast enough and for certain optimized queries it is very, very fast even under heavy load.

But I would agree with bitprophet - on some of the more enterprise-y stuff like stored functions and procedures and replication, well, they work with MySQL, but there are a bunch of caveats (for example, stored procedures can't use result sets generated by other stored procedures, at least in the version of 5.0 we're using, and we recently had to upgrade from an earlier 5.0.x release because there were some nasty replication bugs with stored procedures in the version we were on). The same features feel a lot more robust in Postgres - I mean, MySQL only even got stored procedures in version 5. So, we've gotten this stuff to work in MySQL but it feels a little fragile sometimes. Also, MySQL will let you be sloppier in your SQL syntax (see above, selecting a non-aggregate column value from an aggregate query, what the gently caress) and so, at least in my experience, it is easier to write non-portable SQL in it.
http://tweakers.net/reviews/657/5/database-test-dual-intel-xeon-5160-pagina-5.html

The tweakers benchmarks are pretty enlightening. In particular, on page 6 you can see how badly MySQL just curls into a ball and cries for mommy on a Niagara (Sun) chip. You also see how quickly MySQL hits the limit of its engine and the requests per second start to decline as concurrency increases, whereas Postgres holds stable longer.


MySQL seems to be popular because it is popular. Things are written for it, and it lets you be sloppy, so people run it.

Xae
Jan 19, 2005

fletcher posted:

Great info about postgres guys, thanks. If I was going to try something new should I go postgres or oracle 10g? I don't think I will be going beyond their limitations any time soon.

I would go with Oracle. It definitely has more "quirks" to familiarize yourself with, and it has broader commercial use as well.

Xae
Jan 19, 2005

yaoi prophet posted:

Unindexed, I get identical plans:

code:
---------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)|
---------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   184 |  1656 |    56   (2)|
|*  1 |  TABLE ACCESS FULL| JOBS |   184 |  1656 |    56   (2)|
---------------------------------------------------------------
The oracle optimizer is pretty good most of the time, and crappy developers always try to second guess it. I can't imagine a scenario where the CBO is fooled into doing a worse scan using IN() than it would for a DECODE().

This. Trying to game the CBO usually ends up with you failing. There are times when the CBO just plain ol' fucks up, but those are rare. The query as you write it is almost never as it is executed.

I see the same poo poo with COUNT(1) vs COUNT(*) constantly, or people claiming that subselects are faster than joins or some other poo poo. The CBO is good, real good.

Xae
Jan 19, 2005

Little Brittle posted:

I have a table set up like this:

code:
--
-- Table structure for table `product_feed`
--

CREATE TABLE `product_feed` (
  `sku` varchar(24) NOT NULL,
  `brand` varchar(48) NOT NULL,
  `prodname` varchar(48) NOT NULL,
  `link` text NOT NULL,
  `keywords` text NOT NULL,
  `date` varchar(24) NOT NULL,
  `category` varchar(24) NOT NULL,
  UNIQUE KEY `sku` (`sku`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
How would I create efficient queries and indexes for partial matches? Say I wanted to search the 'brand,prodname,keywords' fields for keyword 'Sony'. Ideally, it would return all SKUs that contain that keyword in any of those three fields. I can't find a nice way to get partial matches without doing a full text search (MATCH AGAINST). Is there a more efficient way of doing it? How would I index the fields to put the least strain on the server?
It depends on your DB.
MS SQL Server
MySQL
Postgres
Oracle

Just google "Full Text Search" and the name of your DB, looks like MySQL.

Xae
Jan 19, 2005

Nibelheim posted:

Good day,

I need some help on a (relatively) simple SQL query. Here's the situation:

I have one table (Table1) containing an immense amount of data (think: 100,000+ records). This table has a field called "affected_item" which contains a certain PIN, plus random gibberish characters.

I have a second table (Table2) which contains one column, which is a list of PINs.

The goal is to find how many records from Table1 contain a PIN that's in Table2. Here's my query:

code:
SELECT COUNT(*)
FROM Table1
WHERE exists (SELECT pin from Table2 where Table1.affected_item LIKE '%' + Table2.pin + '%')
It's unfortunately returning 0. Is my logic faulty?

I've tried the JOIN approach, but to no avail. The problem comes from the fact that Table1's 'affected_item' field has 'RANDOM GIBBERISH' + PIN + 'RANDOM GIBBERISH'

Thanks !

Using LIKE can kill performance quickly. I would suggest finding a way to extract the PIN from the affected_item column. Play around with it a bit and see what performance different methods give you on small samples. Separating the PIN can be done either by truncating or with built-in functions like Oracle's TRANSLATE. If this comes up often, you might want to consider going to First Normal Form and placing the PIN in its own column.
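A sketch of the 1NF route in MSSQL-flavored SQL, assuming each affected_item contains at most one PIN from Table2: pay the wildcard scan once to populate a real column, index it, and the count becomes a plain join.
code:
ALTER TABLE Table1 ADD pin varchar(24) NULL;

-- one-time (or batch) population; the expensive LIKE match happens once
UPDATE t1
SET    t1.pin = t2.pin
FROM   Table1 t1
       INNER JOIN Table2 t2
         ON t1.affected_item LIKE '%' + t2.pin + '%';

CREATE INDEX idx_table1_pin ON Table1 (pin);

-- the original question is now a cheap indexed join
SELECT COUNT(*)
FROM   Table1 t1
       INNER JOIN Table2 t2 ON t2.pin = t1.pin;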

Xae
Jan 19, 2005

yaoi prophet posted:

You should think about whether or not you want the actual PDF's in your database. The alternative is to not store the actual files, but to load them to a more typical file server running a webserver, and then store the path to the file in the database. This is a bit of a religious war where I work, but the long and the short of it is that these two approaches have (as always) advantages and disadvantages:

In the database:
Advantages
- data integrity (your pdf's won't get deleted by a wayward shell script)
- ease of retrieval (just use the database connection, select the BLOB, presto!)
Disadvantages
- bloats the size of your table, can't be easily divided into multiple storage nodes without dealing with some form of database partitioning/clustering or using some form of clustered filesystem/san
- forces all requests for the file to be served by the database, turning your database into a file server, which it's not really built for

Outside the database:
Advantages
- more easily managed storage of files, can be done across multiple storage servers with no trouble
- keeps your tables normal sized
- requests for the file can be served over something like apache, which is built for exactly that
Disadvantages
- no data integrity enforced by the database (rm -rf *, oh poo poo)
- can't just SELECT your file out, have to SELECT out the location and then retrieve over http, which is sometimes a little more complicated depending on what you're doing
I prefer to keep them in the DB, because I am a DB whore. The ease of access on a database can be improved by using stored procedures to create a hybrid system. You can have one procedure that, when passed the name (PK) of a file and a location, will write out the file to that location for easy pickup. A complementary procedure can be used to replace the blob in the database with the one at the specified location.
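A sketch of the "write it out for pickup" half of that hybrid, in PL/SQL with made-up table and column names, using an Oracle DIRECTORY object for the location:
code:
CREATE OR REPLACE PROCEDURE export_file (
    p_file_name IN VARCHAR2,    -- PK of the file row
    p_directory IN VARCHAR2     -- Oracle DIRECTORY object to write into
) AS
    l_blob   BLOB;
    l_out    UTL_FILE.FILE_TYPE;
    l_buffer RAW(32767);
    l_amount BINARY_INTEGER;
    l_pos    INTEGER := 1;
    l_len    INTEGER;
BEGIN
    SELECT file_data INTO l_blob
    FROM   files
    WHERE  file_name = p_file_name;

    l_len := DBMS_LOB.GETLENGTH(l_blob);
    l_out := UTL_FILE.FOPEN(p_directory, p_file_name, 'wb', 32767);

    WHILE l_pos <= l_len LOOP
        l_amount := 32767;
        DBMS_LOB.READ(l_blob, l_amount, l_pos, l_buffer);  -- l_amount comes back as bytes actually read
        UTL_FILE.PUT_RAW(l_out, l_buffer, TRUE);
        l_pos := l_pos + l_amount;
    END LOOP;

    UTL_FILE.FCLOSE(l_out);
END;
/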

Xae
Jan 19, 2005

yaoi prophet posted:

Well, that solves some cases but not all. In our case this wouldn't solve the problem, because you are still left with the critical disadvantages of having all of your file data being served at some point via the database (even with your solution, which just adds another layer, it doesn't really take out the bottleneck), which means pushing a high volume of data through what is commonly an already overstressed component. It also means we'd be storing terabytes of file data as part of the database and it would make managing of table data exponentially more difficult. By storing the files on external SAN nodes we can trivially add more storage without database downtime.

I have no idea what kind of needs LightI3ulb has, though. For something small it's probably no problem, but I wanted to point out that this is frequently a hotly contested design point. At my office we refer to people who are "innies" vs those who are "outies".

You should be able to partition the table vertically if the volume of the files is too much for your database. Split the tablespace and have the blob table on a separate set of disks.
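In Oracle terms, a sketch of what that looks like at table-creation time (names made up): row data stays in the normal tablespace, the LOB segment goes on its own set of disks.
code:
CREATE TABLE file_store (
    file_id   NUMBER PRIMARY KEY,
    file_name VARCHAR2(255),
    file_data BLOB
)
TABLESPACE app_data
LOB (file_data) STORE AS (TABLESPACE blob_data);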

Xae
Jan 19, 2005

fletcher posted:

So will I end up having columns that are indexed multiple times depending on how I'm looking up the data?

Like:

indexA: (first_name, last_name)

indexB: (birthday, last_name)

Yes.

When in doubt, check out the Postgres docs:
http://www.postgresql.org/docs/8.3/static/indexes-bitmap-scans.html
http://www.postgresql.org/docs/8.3/static/indexes-multicolumn.html
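In Postgres syntax, the two indexes from the example would just be (table name assumed):
code:
CREATE INDEX idx_people_name     ON people (first_name, last_name);
CREATE INDEX idx_people_birthday ON people (birthday, last_name);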

Xae fucked around with this message at 04:19 on Apr 3, 2008

Xae
Jan 19, 2005

chocojosh posted:

Can you give me some reading material as to why? At my job the DBAs do not use triggers (I was told we have one in the entire application). I'd really like to know more about any dangers/problems with triggers.
Using triggers for DB mechanics = Good. Filling in surrogate primary keys from sequences and making sure watermark fields (mod_date, mod_by_user, creation_date, etc.) are populated are good uses.

Using triggers for business logic = Bad. Why? Because it is hidden. And if it is hidden it is forgotten. And unless you are home to the world's most careful programmers who always check the documentation, you will get burned.

It is far better design to have an explicit function that will update and manage a complex relationship than it is to hide the process in a dozen triggers. One of my previous employers abused triggers. Almost all logic was performed in triggers. It got to the point where no one really knew what was supposed to happen. It just did. Except when it didn't. Then all hell broke loose, because no one knew how something was supposed to be handled. And you had to jump through 4 triggers per statement, some of which altered other tables, meaning you then had to jump through 4 more triggers, and it goes on and on.
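A sketch of the "good" kind of trigger described above, in Oracle PL/SQL with made-up table and sequence names: it only fills in the surrogate key and the watermark columns, no business logic.
code:
CREATE OR REPLACE TRIGGER orders_watermark_trg
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW
BEGIN
    IF INSERTING THEN
        IF :NEW.order_id IS NULL THEN
            SELECT orders_seq.NEXTVAL INTO :NEW.order_id FROM dual;  -- surrogate PK
        END IF;
        :NEW.creation_date := SYSDATE;
    END IF;
    :NEW.mod_date    := SYSDATE;   -- watermark fields stay honest no matter who writes
    :NEW.mod_by_user := USER;
END;
/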

Xae
Jan 19, 2005

Stabby McDamage posted:

Thank you thank you thank you. I was rapidly losing hope. I'll try that out when I get back to work.


I'm rapidly finding that out. It took weeks to get one table loaded because of a particular index field having massive overhead. Is Postgre any better? This is single-user data analysis -- I'm doing machine learning on a dump of wikipedia.

Postgres tends to be faster when using constraints or with multiple users, more reliable, and to have more features. MySQL tends to be a bit easier to use and set up.

Xae
Jan 19, 2005

StonedogJones posted:

I have a text column in a table that has random commas in it.

How can I strip the commas out of each row in the column? I want the commas removed, but they can be replaced with a different character if that makes it easier. This is in a MS SQL 2000 database/server.

I think Replace() should do the trick

http://doc.ddart.net/mssql/sql70/ra-rz_6.htm


Xae
Jan 19, 2005

StonedogJones posted:

No, I don't think replace will do it. The column has different text in each row, so I can't search for a string and then replace something in that string, because the string is not constant for each row.

Alter for your flavor of SQL:

code:
UPDATE mytbl
   SET text_column = Replace(text_column, ',', ' ');
That should replace all commas with spaces.
