[swish-e] Why isn't my MetaName being used? Is the filter not being used?

From: Ben Ostrowsky <ben(at)not-real.benostrowsky.com>
Date: Tue Sep 18 2007 - 18:51:35 GMT

When I search by the 'name' metaname, I get this:

~/local/bin/swish-e -f ~/local/oldarchives.index -w "name=ostrowsky"
# SWISH format: 2.4.5
# Search words: name=ostrowsky
# Removed stopwords:
err: Unknown metaname: 'name'
.

I indexed the source with this command:
~/local/bin/swish-e -S prog -c ~/local/oldarchives.conf

~/local/oldarchives.conf contains these non-comment lines:
IndexFile /home/users/c/codi/local/oldarchivesmeta.index
SwishProgParameters /home/users/c/codi/local/spideroldarchives.conf
IndexDir /home/users/c/codi/local/lib/swish-e/spider.pl
MetaNames name email date subject
PropertyNames name email date subject
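
Note that IndexFile above names oldarchivesmeta.index, while the search command earlier passes ~/local/oldarchives.index to -f. If those are two different files, the search may be reading an older index that was built without these MetaNames. A minimal sketch that compares the two file names, assuming "~" expands to /home/users/c/codi; compare_index_names is a hypothetical helper and nothing here touches a real index:

```shell
#!/bin/sh
# Paths copied from the post. Assumption: "~" expands to /home/users/c/codi.
searched_index="/home/users/c/codi/local/oldarchives.index"        # passed to -f
configured_index="/home/users/c/codi/local/oldarchivesmeta.index"  # IndexFile directive

# compare_index_names A B (hypothetical helper): prints "match" when both
# paths end in the same file name, otherwise "differ".
compare_index_names() {
    if [ "$(basename "$1")" = "$(basename "$2")" ]; then
        echo "match"
    else
        echo "differ"
    fi
}

echo "searched:   $(basename "$searched_index")"
echo "configured: $(basename "$configured_index")"
echo "result:     $(compare_index_names "$searched_index" "$configured_index")"
```

If the names differ, pointing -f at the path given in IndexFile is a cheap thing to try before looking at the filter.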


/home/users/c/codi/local/spideroldarchives.conf contains these
non-comment lines:
  my ($filter_sub, $response_sub ) = swish_filter();
  @servers = ({
    base_url    => 'http://username:password@host/path/index.html',
    agent       => 'swish-e spider http://swish-e.org/',
    email       => 'ben@benostrowsky.com',
    filter_content => $filter_sub,
    test_url    => sub {
      if (($_[0]->path =~ /narchive/) && !($_[0]->path =~ /\.txt$/)) {
        return 1;
      }
      return 0;
    },
    test_response => sub {
      my $server = $_[1];
      $server->{no_index}++ if
        $_[0]->path =~ /(?:author|thread|subject|date|maillist|threads)\.html$/;
      return 1;
    },
    ignore_robots_file => 1,
    delay_sec   => 2,         # Delay in seconds between requests
    keep_alive  => 1,         # enable keep-alive requests
    use_cookies  => 1,
    debug => "url, info, headers"
    }
  );
  1;
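
The test_response callback is meant to skip the MHonArc index pages by name. In Perl, square brackets form a character class and a "|" inside them is a literal character, so a list of page names only acts as an alternation when it sits in a parenthesized group. A minimal sketch of the difference, using grep -E as a stand-in for Perl's regex engine (an assumption: the two agree on these constructs). The matches helper and the notes.html path are hypothetical:

```shell
#!/bin/sh
# matches PATTERN PATH (hypothetical helper): prints "match" or "no match".
# grep -E stands in for Perl here; both read [...] as a character class.
matches() {
    if printf '%s\n' "$2" | grep -Eq -- "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

# Bracketed form: one character from the set, then any character, then "html".
bracketed='[author|thread|subject|date|maillist|threads].html'
# Grouped form: one of the listed names, a literal dot, "html" at the end.
grouped='(author|thread|subject|date|maillist|threads)\.html$'

for path in /archives/narchive/2003/maillist.html \
            /archives/narchive/2003/msg05184.html \
            /archives/narchive/2003/notes.html; do
    echo "$path"
    echo "  bracketed: $(matches "$bracketed" "$path")"
    echo "  grouped:   $(matches "$grouped" "$path")"
done
```

The two forms agree on the numbered msgNNNNN.html pages, which is why the bracketed form can appear to work; they diverge on any page whose name ends in one of the letters inside the brackets.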

When I run swish-filter-test -verbose -content
http://username:password@host/path/document.html, I get the correct
document in return, with the metadata inserted in the <head> element
just as I intend:

<!-- MHonArc v2.6.8 -->
<!--X-Subject: RE: [HORIZON&#45;L] RPA ... -->
<!--X-From-R13: "Prgu Yebruyre" <oxebruyreNzhacy.bet> -->
<!--X-Date: Mon,  7 Mar 2005 12:16:35 &#45;0700 (MST) -->
<!--X-Message-Id: F4F5AB8DD4D5EE498C36D38DF809BEF6454078@exchange.munpl.org -->
<!--X-Content-Type: multipart/mixed -->
<!--X-Head-End-->
<!doctype html public "-//W3C//DTD HTML//EN">
<html>
<head>
<meta name="date" content="Mon, 7 Mar 2005 14:16:15 -0500" />
<meta name="email" content="bkroehler@munpl.org" />
<meta name="name" content="Beth Kroehler" />
<meta name="subject" content="RE: [HORIZON-L] RPA ..." />
<title>RE: [HORIZON-L] RPA ...</title>
<link rev="made" href="mailto:bkroehler@munpl.org">
</head>
<body>

I'm not sure how to verify whether the spider is actually invoking the
filter.  The end of the output looks like this:

?Testing 'filter_content' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05185.html'
+Passed all 1 tests for 'filter_content' user supplied function
?Testing 'test_response' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05184.html'
+Passed all 1 tests for 'test_response' user supplied function
>> +Fetched 2 Cnt: 8540 GET
http://www.codi.org/archives/narchive/2003/msg05184.html  200 OK
text/html 4475 parent:http://www.codi.org/archives/narchive/2003/maillist.html
depth:2
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05183.html'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05185.html'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05182.html'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05186.html'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/maillist.html'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/threads.html'
+Passed all 1 tests for 'test_url' user supplied function
External Program found: /home/users/c/codi/local/lib/swish-e/spider.pl
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 36,606 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
36,606 unique words indexed.
8 properties sorted.
8,527 files indexed.  66,651,072 total bytes.  5,456,505 total words.
Elapsed time: 00:16:03 CPU time: 00:01:25
Indexing done!
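
One way to check whether the filtered documents reach the indexer is to capture what the spider writes and count the meta tags in it. A minimal sketch, assuming the spider's output has been saved to a file first; count_meta_tags and report are hypothetical helpers, and the spider.pl invocation in the comment is an assumption to be checked against its documentation:

```shell
#!/bin/sh
# count_meta_tags FILE NAME (hypothetical helper): number of lines in FILE
# that carry <meta name="NAME" ...>.
count_meta_tags() {
    grep -c "<meta name=\"$2\"" "$1"
}

# report FILE: one count per metaname listed in the MetaNames directive.
report() {
    for name in name email date subject; do
        echo "$name: $(count_meta_tags "$1" "$name")"
    done
}

# Runs only when a capture file is given, e.g. after something like
#   ~/local/lib/swish-e/spider.pl ~/local/spideroldarchives.conf > spider.out
# (assumed invocation; check the spider.pl documentation).
if [ -n "${1:-}" ]; then
    report "$1"
fi
```

Counts near the number of fetched messages would suggest the filter output is reaching swish-e; counts of zero would point back at the spider configuration.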

So what am I forgetting?  Is the spider actually invoking the filter?
If so, what else do I need to do in order for it to index the
metadata?

Thanks!
Ben

-- 
"Don't get suckered in by the comments;
 they can be terribly misleading.
 Debug only code."  -- Dave Storer
Received on Tue Sep 18 14:51:37 2007