regular expressions in swish (was: Problem with "replace")

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Nov 30 2001 - 17:55:37 GMT
At 09:02 AM 11/30/01 -0800, Bill Moseley wrote:
>ReplaceRules replace ../swish-e/src/ foo
>
>
>> ./swish-e -c c -i ../../swish-e/src/1.html -v 0 -T properties regex
>
>Original String: '../../swish-e/src/1.html'
>replace ../../swish-e/src/1.html =~ /../swish-e/src//foo/: Matched
>replace 1.html =~ /../swish-e/src//foo/: No Match
>  Result String: '../foo1.html'

The reason "replace" shows up twice is that a ReplaceRules *replace* is a
global pattern replace.  In perl you just add a "g" modifier to do global
substitutions.  To emulate that I recursively call the pattern replacement
on the part of the source string that hasn't been processed yet, until
there are no more matches.

So above, the first "replace" leaves "1.html" unprocessed, and then the
second match tries again on that remaining string.
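
To make the idea concrete, here is a rough sketch in Python (not the
swish-e C code) of emulating a global replace with single-match searches,
written as a loop instead of recursion.  The function name is made up for
illustration, and Python's "re" module stands in for the regex library
swish-e uses, so treat it as an approximation only.

```python
import re

def replace_global(pattern, replacement, text):
    """Emulate a global substitution using only single-match searches."""
    regex = re.compile(pattern)
    out = []
    rest = text  # the part of the string not processed yet
    while rest:
        m = regex.search(rest)
        # Stop on no match; also stop on an empty match, which would
        # otherwise never consume any input.
        if m is None or m.end() == m.start():
            break
        # Keep the text before the match plus the expanded replacement,
        # then continue on whatever follows the match.
        out.append(rest[:m.start()] + m.expand(replacement))
        rest = rest[m.end():]
    out.append(rest)
    return "".join(out)

# Same input as the trace above.
print(replace_global(r"../swish-e/src/", "foo", "../../swish-e/src/1.html"))
# prints ../foo1.html
```

One thing to watch in this sketch: because each search restarts on the
tail, an anchored pattern such as ^a can match again at the start of the
tail, which is not what perl's "g" modifier does.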

I'd like to hear if there's a better way to do that than my recursive
routine.  Coming from perl, I was just hoping I could set a global flag and
be done with it.

There are two other behaviors I'm not so sure about.

First, regular expressions are put on a stack for each directive, and when
processing the regular expressions, they are done in order as in the config
file.  I can see where this might cause confusion.

For example, say you had paths starting with a digit (e.g. you had a bunch
of numbered directories) and you wanted to remove all leading digits.

  ReplaceRules remove ^[0-9]+

That works fine.  But let's say that later on, as in John Elser's example,
you wanted to remove a leading path, leaving only the file name.

  ReplaceRules regex !^.+/(.+)$!$1!

But if you ended up putting that ReplaceRules expression earlier in the
config, then the "remove" rule above would work on the *resulting* string.
So if you had a *basename* that started with a digit you might be in
trouble.  For example:

> cat c
ReplaceRules regex !^.+/(.+)$!$1!
# sometime later
ReplaceRules remove ^[0-9]+
 
> ./swish-e -c c -i ../../swish-e/src/1.html -v 0 -T properties regex
Indexing Data Source: "File-System"

Original String: '../../swish-e/src/1.html'
replace ../../swish-e/src/1.html =~ m[^.+/(.+)$][$1]: Matched
  Result String: '1.html'
replace 1.html =~ m[^[0-9]+][]: Matched
replace .html =~ m[^[0-9]+][]: No Match
  Result String: '.html'
          swishdocpath: 6 (  5) S: ".html"
          swishdocsize: 8 (  4) N: "0000000000107"
     swishlastmodified: 9 (  4) D: "2001-11-29 04:48:07"
Indexing done!

I can see this causing confusion.  I can imagine cases where you might
want to say, "once a ReplaceRules matches, don't process any more for this
file", and other cases where chaining the result of one ReplaceRules into
the source string of the next could be useful.
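
The two policies can be compared side by side with a small Python sketch.
This is only an illustration: the rule list mirrors the config file "c"
above (with $1 written as \1, as Python expects), and the
stop_after_first_match flag is hypothetical, not an existing swish-e
option.

```python
import re

# Rules in config-file order, as (pattern, replacement) pairs.
RULES = [
    (r"^.+/(.+)$", r"\1"),  # ReplaceRules regex !^.+/(.+)$!$1!
    (r"^[0-9]+", ""),       # ReplaceRules remove ^[0-9]+
]

def apply_rules(path, rules, stop_after_first_match=False):
    """Apply each rule in order to the result of the previous one.

    With stop_after_first_match=True (a hypothetical policy), processing
    ends as soon as one rule has matched.
    """
    for pattern, replacement in rules:
        path, count = re.subn(pattern, replacement, path)
        if count and stop_after_first_match:
            break
    return path

src = "../../swish-e/src/1.html"
print(apply_rules(src, RULES))                               # .html
print(apply_rules(src, RULES, stop_after_first_match=True))  # 1.html
```

Chaining reproduces the ".html" result from the trace, while stopping at
the first matching rule keeps "1.html".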

So, if anyone has an opinion about this, let me know.


Second, this has to do with pattern substitution and the requirement to match
an entire string.

There's a new directive called "ExtractPath" that can be used to extract
words out of the path, and then index those word(s) under a metaname.

The idea is if your documents are grouped by some type of category, and you
can tell that category by something in the path, then you can assign a
unique metaname value to that document.  http://search.apache.org works
that way.  It's indexing the entire server, but it's using ExtractPath to
assign a word to the "site" metaname.  Sometimes the word is extracted
directly out of the path and used in the replacement string ($1):

   ExtractPath site regex "!^/www/([^.]+).+$!$1!"

Sometimes it's just a pattern match and a word is used:

   ExtractPath site regex "!^/usr/local/share/gnats.+$!bugs!"

So then searching is like -w foo AND site=(bugs or httpd)

What I find kind of a pain is that your pattern must match the entire
source string for the result to be just the word you want.  That is, the
replacement only replaces the part that was matched.  In some ways it
would be nice if you could just say:

   ExtractPath site regex !gnats!bugs!

meaning that if "gnats" matches anyplace in the path, then replace the
*entire* string with the word "bugs".
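
Here is a small Python sketch of the difference between the two
behaviors.  The function names and the sample paths are invented for
illustration, returning None when nothing matches is an assumption, and
Python's "re" module stands in for whatever swish-e does internally.

```python
import re

def extract_matched_only(pattern, replacement, path):
    """Behavior as described above: replace only the part that matched."""
    m = re.search(pattern, path)
    if m is None:
        return None
    return path[:m.start()] + m.expand(replacement) + path[m.end():]

def extract_whole(pattern, replacement, path):
    """Proposed behavior: any match replaces the entire string."""
    m = re.search(pattern, path)
    if m is None:
        return None
    return m.expand(replacement)

path = "/usr/local/share/gnats/db/1234"  # hypothetical path

# A short pattern leaves the rest of the path in place today...
print(extract_matched_only(r"gnats", "bugs", path))
# prints /usr/local/share/bugs/db/1234

# ...so the pattern has to cover the whole string to get just the word.
print(extract_matched_only(r"^/usr/local/share/gnats.+$", "bugs", path))
# prints bugs

# With whole-string replacement the short pattern would be enough.
print(extract_whole(r"gnats", "bugs", path))
# prints bugs
```

Captures would still work under the proposed behavior, since the
replacement is expanded from the match before the rest of the string is
dropped.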


Bill Moseley
mailto:moseley@hank.org
Received on Fri Nov 30 17:56:11 2001