|
I bet a coworker of mine that I could make a grep that was faster than GNU grep, or at least equal, on any set of benchmarks he devised. The rules are that it has to perform as good as grep on all the benchmarks, and that he must think it's better -- if he thinks I exploited a loophole in the rules, for example with special cases that target these specific benchmarks, it doesn't count. Here are the benchmarks: code:Fortunately, there's no time limit, so I can't lose money on this wager. This thread will be about string searching and regular expressions and making "a faster grep".
|
| # ? Feb 1, 2013 05:55 |
|
|
| # ? May 23, 2013 18:30 |
|
multi-process/thread it to search the input in parallel?
|
| # ? Feb 1, 2013 14:34 |
|
Sylink posted:multi-process/thread it to search the input in parallel? We did this in school and the speedup was pretty remarkable on systems that could take advantage of it.
|
| # ? Feb 1, 2013 15:30 |
|
I've never read the source for grep so for all I know this idea is well-trodden territory, but it seems like it should be possible to analyze the regex and create a function which scans the input file and produces candidate lines to trial the regex on. On the surface it seems like this would be faster-than-or-equal to allowing your regex engine to scan the whole file. For example, all of these regexes are single line and many of them have a minimum possible length, maximum possible length, or marker start character. If you scan the input file for \n and the marker, you may be able to determine if it's possible for the regex to match on that line or not by the presence of the marker or the distance between \n's. So, if the min length of the regex is greater than the difference between subsequent \n's, it definitely won't match so don't trial the regex on this line. Or, If you find a marker character that's more than $regex_max_possible_length to the next \n, go ahead and trial the regex. Etc. I'm not sure how much faster it would be, but it strikes me as useful to think about how to reduce the possible input and identify definitely invalid segments that can be ignored. Also, this type of analysis should be parallelizable as well.
|
| # ? Feb 1, 2013 15:43 |
|
I assume everyone is familiar with the mailing list entry that talks about some of the innards of GNU grep
|
| # ? Feb 1, 2013 17:06 |
|
Low-hanging fruit: 1. Simplify the given regex. You don't want to search for .*.+.*.+ etc. when .{6,} is equivalent. 2. Only find the first instance on a line, since it doesn't look like you need to highlight anything. 3. Don't look at every character (Ideally you could reduce things to a BM variant - it doesn't look too hard to extend BM to cope with alternation and repetition, just as long as you're not nesting repeated stuff).
|
| # ? Feb 1, 2013 23:44 |
|
Otto Skorzeny posted:I assume everyone is familiar with the mailing list entry that talks about some of the innards of GNU grep and this classic http://ridiculousfish.com/blog/post...-treachery.html
|
| # ? Feb 2, 2013 01:17 |
|
tef posted:and this classic http://ridiculousfish.com/blog/post...-treachery.html OK, the "times I've tried to figure out what the hell that blog post is saying" counter is now 4. This guy has the most obnoxious and confusing writing style I've ever seen. He needs to learn to write English, not New York Times-ese.
|
| # ? Feb 2, 2013 01:28 |
|
Jabor posted:Low-hanging fruit: If you're writing your own regex engine right off the bat, you might be Doing It Wrong. I've gotten pretty good results from just using Go and its regex library.
|
| # ? Feb 2, 2013 01:54 |
|
floWenoL posted:I've gotten pretty good results from just using Go Google engineer spotted.
|
| # ? Feb 2, 2013 02:15 |
|
floWenoL posted:If you're writing your own regex engine right off the bat, you might be Doing It Wrong. When you re-organized you blog or something, a bunch of your old entries were lost. I think I remember one from a few years back where you discussed parallelizing FLAC encoding, any chance it's still floating around somewhere?
|
| # ? Feb 2, 2013 02:46 |
|
http://swtch.com/~rsc/regexp/ is a pretty good resource for regex implementation details.
|
| # ? Feb 2, 2013 12:08 |
|
shrughes posted:Google engineer spotted. It's not so much the language itself as it is its regexp library. krisis posted:http://swtch.com/~rsc/regexp/ is a pretty good resource for regex implementation details. Yeap. Russ Cox wrote RE2 and the Go regexp library, which I'm pretty sure is also a floWenoL fucked around with this message at Feb 3, 2013 around 09:33 |
| # ? Feb 3, 2013 07:58 |
|
Otto Skorzeny posted:When you re-organized you blog or something, a bunch of your old entries were lost. I think I remember one from a few years back where you discussed parallelizing FLAC encoding, any chance it's still floating around somewhere? Sure, I'll see if I can put it back up. Edit: Should be back up now. (Let me know if you still can't find it.) floWenoL fucked around with this message at Feb 3, 2013 around 09:42 |
| # ? Feb 3, 2013 07:58 |
|
floWenoL posted:Yeap. Russ Cox wrote RE2 and the Go regexp library, which I'm pretty sure is also a backtracking implementation. Since none of the test cases require backtracking (i.e., have backreferences) I'm pretty sure using a backtracking implementation (i.e., RE2, Go's regexp) would alone enable you to beat GNU grep. RE2 and the Go package are actually NFA based, which is more like the opposite of a backtracking algorithm.
|
| # ? Feb 3, 2013 08:28 |
|
Jabor posted:RE2 and the Go package are actually NFA based, which is more like the opposite of a backtracking algorithm. Yes, you're right. I think I meant to type "non-backtracking". Whoops!
|
| # ? Feb 3, 2013 09:32 |
|
Who implements regex matching using a backtracking algorithm? Doesn't that defeat the whole point of regular languages?
|
| # ? Feb 3, 2013 11:03 |
|
qntm posted:Who implements regex matching using a backtracking algorithm? Doesn't that defeat the whole point of regular languages? Most "regular expression" libraries have a bunch of extra features (stuff like backreferences, recursive patterns allowing to match arbitrarily deep levels of nesting, arbitrary zero-width lookaheads and lookbehinds) that are pretty straightforward with a backtracking algorithm, but a lot harder with an NFA one. Perl and nominally-Perl-compatible regex implementations alone probably mean that backtracking implementations are far more common than NFA-based ones.
|
| # ? Feb 3, 2013 11:50 |
|
According to my coworker who made this challenge, grep will construct an NFA on-the-fly as it evaluates a regular expression. Presumably if you pass -P or use backreferences things are a bit different.
|
| # ? Feb 3, 2013 13:07 |
|
Does it have to support all the different flags of GNU grep too? The conditions should be easier if you don't need to support extended regexes, for example. POSIX regexes could be implemented without a full-blown regex engine supported in most modern languages, but this isn't to say that just including a good regex engine in the first place wouldn't be able to do just as good as a POSIX targeted engine either. Also, an alternative way of going about the problem could be to just profile the hell out of grep on some data and try to understand how to optimize it instead of writing your own grep from scratch.
|
| # ? Feb 3, 2013 16:51 |
|
necrobobsledder posted:Does it have to support all the different flags of GNU grep too? The conditions should be easier if you don't need to support extended regexes, for example. No. The question of what regex support is needed is also rather fuzzy -- but I'm going for the extended regexes as documented in the manpage.
|
| # ? Feb 3, 2013 16:54 |
|
Are there any re engines that use a jit rather than being table driven? (other than pypy, of course)
|
| # ? Feb 3, 2013 19:16 |
|
tef posted:Are there any re engines that use a jit rather than being table driven? All the major JS engines have a JIT specially designed for regexes.
|
| # ? Feb 5, 2013 01:20 |
|
tef posted:Are there any re engines that use a jit rather than being table driven? redgrep http://code.google.com/p/redgrep/ linux.conf.au 2013 talk which may or may not be interesting: http://mirror.linux.org.au/linux.co...ves_to_LLVM.mp4
|
| # ? Feb 6, 2013 01:36 |
|
Suspicious Dish posted:OK, the "times I've tried to figure out what the hell that blog post is saying" counter is now 4. This guy has the most obnoxious and confusing writing style I've ever seen. He needs to learn to write English, not New York Times-ese. in short, grep optimizes for there being few partial matches for your given expression, which means that in data that has many partial matches gnu grep performs something like twice as slow as it would otherwise for about an 8% gain in speed in cases with few partial matches. That's all he's saying.
|
| # ? Feb 6, 2013 02:13 |
|
Right. The insight in Boyer-Moore is that, when considering any particular position, you should start by checking the last byte in the possible match, because it gives you information about length(needle) potential match positions instead of just one. That means that, if the last byte doesn't match, you can often move forward by more than just one byte, and the exact number to skip is easy to precompute for every possible last byte. Both grep and fish are doing this. Okay, next on the optimization list is unrolling. There are several reasons to unroll the loop, but the biggest is that, as you're skipping forward, you do need to keep checking that you haven't overrun your buffer. Instead of checking this every time you skip, though, you can just make sure that your buffer has a bit of extra room in it and then aggressively keep skipping, as long as all the biggest skips you do can't ultimately overrun your buffer. Again, both grep and fish are doing this. The naive way to unroll a loop is to just repeat the work of each iteration N times: that is, read the last byte, break out and do a complete check if it's a match, and otherwise consult the skip table. That still means doing a bunch of separate compares and branches in each iteration. What grep was doing was putting 0 in the skip table for the true last byte in the needle, so that if there was a match, the algorithm could actually just "skip forward" by 0 bytes, and therefore stay at the same position for the next unrolled iteration. That lets the algorithm drop more than half the comparisons in the unrolled loop body, at the cost of spinning for a few extra cycles before it branches to the complete check. In effect, it privileges skipping past data that does not match in the last byte over more quickly checking and reporting when it does find a match in the last byte.
|
| # ? Feb 6, 2013 03:34 |
|
ack is supposedly a better grep at least their url claims it is: http://betterthangrep.com/
|
| # ? Feb 6, 2013 05:09 |
|
text editor posted:ack is supposedly a better grep It has a better UI and it will search through source trees faster. It does this by only looking at source code files with recognizable extensions, skipping large binaries entirely, and also by doing the I/O better. I wouldn't be surprised if grep processes one file at a time. If you run them on a source tree, ack will be quicker, but if you run them on one long file, grep will be quicker.
|
| # ? Feb 6, 2013 06:30 |
|
text editor posted:ack is supposedly a better grep I replaced ack with the silver searcher https://github.com/ggreer/the_silver_searcher. It's noticeably faster and has a few other nice features. It does a few things suggested in this thread as well.
|
| # ? Feb 6, 2013 21:00 |
|
Jabor posted:RE2 and the Go package are actually NFA based, which is more like the opposite of a backtracking algorithm. I really don't understand this post. I thought non-deterministic finite automata WERE normally implemented with backtracking...? What's the main difference in Regex NFAs?
|
| # ? Feb 6, 2013 22:14 |
|
Disnesquick posted:I really don't understand this post. I thought non-deterministic finite automata WERE normally implemented with backtracking...? What's the main difference in Regex NFAs? The correct way to implement an NFA is to keep a collection of states, and then for each character work out what new states the automaton could be in. That way you look at each character once. The backtracking algorithm picks one path and goes down it as far as it can, and if it finds that it can't proceed then it "backtracks" to the nearest choice point and tries a different option. The problem here is that it can result in terrible performance for certain pathological input patterns - if there are a lot of choice points in the pattern, where several options look right but won't fail until the end of the pattern, then the algorithm will basically enumerate all the options at every input position, running over each character multiple times. In general backtracking behaves poorly when there are many partial matches
|
| # ? Feb 6, 2013 22:33 |
|
Jabor posted:The correct way to implement an NFA is to keep a collection of states, and then for each character work out what new states the automaton could be in. That way you look at each character once. I am aware of the definitions. The question was more with regards to the fact that an NFA is a construction that can be parsed via a backtracking algorithm. Saying an NFA is "opposite" to backtracking seems nonsensical because the two terms operate on different levels of conceptual hierarchy.
|
| # ? Feb 6, 2013 23:19 |
|
tef posted:Are there any re engines that use a jit rather than being table driven? .NET's RegexOptions.Compiled makes it emit CLR bytecode for a regex, which gets JITed as it runs.
|
| # ? Feb 6, 2013 23:21 |
|
Disnesquick posted:I am aware of the definitions. The question was more with regards to the fact that an NFA is a construction that can be parsed via a backtracking algorithm. Saying an NFA is "opposite" to backtracking seems nonsensical because the two terms operate on different levels of conceptual hierarchy. A backtracking algorithm is strictly stronger than an NFA, i.e. it can parse more stuff (e.g., backreferences).
|
| # ? Feb 6, 2013 23:50 |
|
Scaevolus posted:.NET's RegexOptions.Compiled makes it emit CLR bytecode for a regex, which gets JITed as it runs. Recently in #cobol, a user's company's server had a massive speedup by disabling RegexOptions.Compiled for a particular regular expression.
|
| # ? Feb 6, 2013 23:50 |
|
The compilation overhead easily exceeds the performance benefits if you aren't doing a lot of work with the regex, but if you're evaluating the same regex repeatedly, it still should show significant benefits.
|
| # ? Feb 7, 2013 19:23 |
|
Zhentar posted:The compilation overhead easily exceeds the performance benefits if you aren't doing a lot of work with the regex, but if you're evaluating the same regex repeatedly, it still should show significant benefits. What's interesting is that the regex wasn't being compiled repeatedly. (Or does C#'s RegexOptions.Compiled option actually recompile the regex every time the regex is used?) The regex in this case was stored in a global variable, constructed once and used many times.
|
| # ? Feb 7, 2013 21:27 |
|
Behold the powers of compilers. There's probably already an optimization to recognize uses of constant regexps and instead use a cached compile. That optimization can do some braindead analysis to figure out that the compiled regexp is only used in certain ways and so avoid some extra bookkeeping when (e.g.) you're ignoring one or more match groups. That analysis is thwarted when you manually put the regexp in a global, because reflection makes it challenging/impossible to do use analysis on objects actually stored in accessible variables.
|
| # ? Feb 7, 2013 22:35 |
|
It's only supposed to perform clever caching if you use the static Regex methods; if you use regex objects, the compiled version should be tied to the object's lifespan (which means repeatedly instantiating the regex gives you terrible performance).
|
| # ? Feb 8, 2013 00:32 |
|
|
| # ? May 23, 2013 18:30 |
|
Bust out LLVM, compile the regexp into actual machine code?
|
| # ? Feb 10, 2013 09:44 |


















