#218701 - 15/06/2004 10:39
A Regex curiosity
|
carpal tunnel
Registered: 29/08/2000
Posts: 14493
Loc: Canada
|
Yesterday, I managed to more than double the responsiveness of my GeoCaching WAP portal, by making a very simple change to how it was using regular expressions to parse the cache pages.
Previously, several AWK statements resembled this: gsub(".*images/WptTypes/","",text) On my 600Mhz server, for a 10KByte text value, this could sometimes take 15-25 seconds to execute under gawk. The simple change was to anchor the regex pattern, by stuffing a caret symbol at the beginning: gsub("^.*images/WptTypes/","",text) This change renders execution time to nearly instantaneous, the way it should work, without any change whatsoever to the meaning of the original regular expression used. I wonder if the gawk folks might be willing to add this to their regex parser: if (substr(pattern,1,2) == ".*") { pattern = "^"pattern }; Or perhaps even this: sub("^[.][*]","^&",pattern)
Cheers
Edited by mlord (15/06/2004 10:44)
|
Top
|
|
|
|
#218702 - 15/06/2004 10:43
Re: A Regex curiosity
[Re: mlord]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
What if you got rid of the ".*" altogether? It doesn't do anything as far as I can see.
Odd optimization nonetheless.
_________________________
Bitt Faulk
|
Top
|
|
|
|
#218703 - 15/06/2004 10:45
Re: A Regex curiosity
[Re: mlord]
|
pooh-bah
Registered: 31/08/1999
Posts: 1649
Loc: San Carlos, CA
|
I wonder if the gawk folks might be willing to add this to their regex parser
I'm not all that familiar with gawk, but wouldn't that break global matching (if their are multiple occurrences on one line)? I.e.
blah, blah pattern blah pattern blah
-Mike
|
Top
|
|
|
|
#218704 - 15/06/2004 10:45
Re: A Regex curiosity
[Re: wfaulk]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14493
Loc: Canada
|
No, the leading ".*" is needed to delete from the beginning of the string all the way to the end of the pattern, which is what that line tries to do. In reality, many of those lines are really using "gensub" or "sub" rather than gsub, though, which allows this to make a little more sense.
Cheers
|
Top
|
|
|
|
#218705 - 15/06/2004 10:47
Re: A Regex curiosity
[Re: mcomb]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14493
Loc: Canada
|
I don't think so, since ".*" will ALWAYS match from the start of the string if it is the first part of the regex -- the rule is that the longest matching pattern is always chosen.
Cheers
|
Top
|
|
|
|
#218706 - 15/06/2004 10:56
Re: A Regex curiosity
[Re: mlord]
|
pooh-bah
Registered: 31/08/1999
Posts: 1649
Loc: San Carlos, CA
|
the rule is that the longest matching pattern is always chosen.
Even if there are multiple results? I'm thinking of something like a perl global search and replace (s/pattern/not pattern/g), but I don't know how gawk does that sort of thing.
-Mike
|
Top
|
|
|
|
#218707 - 15/06/2004 10:58
Re: A Regex curiosity
[Re: mlord]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Oh, I see. You're replacing, not just matching.
_________________________
Bitt Faulk
|
Top
|
|
|
|
#218708 - 15/06/2004 11:02
Re: A Regex curiosity
[Re: mcomb]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
. matches anything, even if it's the pattern. I think gawk is always greedy, so ".*abc" against "123123abc123abc123" will match "123123abc123abc".
_________________________
Bitt Faulk
|
Top
|
|
|
|
#218709 - 15/06/2004 11:10
Re: A Regex curiosity
[Re: wfaulk]
|
pooh-bah
Registered: 31/08/1999
Posts: 1649
Loc: San Carlos, CA
|
. matches anything, even if it's the pattern. I think gawk is always greedy, so ".*abc" against "123123abc123abc123" will match "123123abc123abc".
Ah, OK. Yeah I was thinking of a non-greedy pattern ("^.*?abc" vs. ".*?abc").
-Mike
|
Top
|
|
|
|
#218710 - 15/06/2004 12:09
Re: A Regex curiosity
[Re: mlord]
|
carpal tunnel
Registered: 19/01/2002
Posts: 3584
Loc: Columbus, OH
|
While you're working on it, I've had some problems where some caches will display, and others will return an error message.
For example, when I try to view GCJ4KJ, I get a message that says "The Requested Page can not be displayed". Other cache pages in the area display fine, such as GC7D3B. I'm wondering if that is something returned from the server, or if it is my phone (SonyEricsson T226). I use this quite frequently when I'm out and about, and while I'm very thankful to have the service at all, you can imagine that it's mildly frustrating to not be able to view that one bit of info that may be the key to finding the cache. Let me know if I can supply any debugging info.
Thanks,
John
_________________________
~ John
|
Top
|
|
|
|
#218711 - 15/06/2004 12:44
Re: A Regex curiosity
[Re: JBjorgen]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14493
Loc: Canada
|
Mmm.. I actually have "full logging" on right now, and so I was able to scroll back and find your most recent attempt to view GCJ4KJ. Nothing funny about it that I can see -- my site returned a correclty formatted and sized page to cingular for you. Usually problems like this are getting the size wrong, but it looks like you were able to view other caches which returned even larger sized data, so that's not likely the problem here.
I could force it to return WML source text instead of bytecodes, and let Cingular compile it. Maybe that will solve it.
Try it now and let me know.
Cheers
|
Top
|
|
|
|
#218712 - 16/06/2004 09:58
Re: A Regex curiosity
[Re: mlord]
|
carpal tunnel
Registered: 19/01/2002
Posts: 3584
Loc: Columbus, OH
|
Works great now! Thanks a bunch Mark.
It must have been the browser on my phone, 'cause I tried it on Russmeister's SonyEricsson T616 on Cingular from the same building yesterday and it worked ok. Weird.
Edited by Meatballman (16/06/2004 10:04)
_________________________
~ John
|
Top
|
|
|
|
#218713 - 16/06/2004 20:17
Re: A Regex curiosity
[Re: mlord]
|
carpal tunnel
Registered: 17/12/2000
Posts: 2665
Loc: Manteca, California
|
It's working the way it's supposed to. By anchoring the search you are cutting down the number of matches being performed.
Unachored. A 1k file will yield 1k of matches being attempted. Once for every char. position in the file.
Anchored. Will only attempt as many matches as there are lines in the file.
Both are useful in the right context.
edit: PS. additional thought makes me think my math was too simple. I think there would be a lot more than 1k matches being made.
Edited by gbeer (16/06/2004 20:25)
_________________________
Glenn
|
Top
|
|
|
|
#218714 - 16/06/2004 20:23
Re: A Regex curiosity
[Re: gbeer]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14493
Loc: Canada
|
Yes, but it could be clever enough internally to realize that patterns which begin with dot-star are 100% equivalent to patterns that begin with anchor-dot-star, and much faster to execute.
Cheers
|
Top
|
|
|
|
#218715 - 16/06/2004 20:38
Re: A Regex curiosity
[Re: mlord]
|
carpal tunnel
Registered: 17/12/2000
Posts: 2665
Loc: Manteca, California
|
Consider something like .*abc.*xyz searching in "1234abc12312334xyz1234abc1231xyz"
If this was anchored you'd miss the second match.
edit: well maybe not. But I have to think that search results come out differently depending on what is searched.
Edited by gbeer (16/06/2004 20:49)
_________________________
Glenn
|
Top
|
|
|
|
#218716 - 16/06/2004 20:58
Re: A Regex curiosity
[Re: gbeer]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14493
Loc: Canada
|
Try it yourself. I think you'll be suprised by the outcome. This is gawk, we're discussing.
Cheers
|
Top
|
|
|
|
|
|