Re: Swish-E indexing

From: Ron Samuel Klatchko <rsk(at)not-real.corpmail.brightmail.com>
Date: Thu Feb 17 2000 - 21:44:45 GMT
Chris Humphries wrote:
> I tried indexing this file
> 
> IndexDir http://www.ifac.org/StandardsAndGuidance/FMAC/IMAP1.html
> 
> using the HTTP method.
> 
> It indexed just 2 words.
> 
> The file did not look unusual, so out of curiosity I tried using "get()" in
> a Perl program, saving the string out as a .htm file, and then passing this
> file to Swish-E - which then indexed 718 words.
> 
> Can anyone help explain to me why this should be so?

I'm not sure, but an interesting experiment to do is to use swishspider
to retrieve the URL.  You can use a command line like:

  /path/to/swishspider /var/tmp/data http://www.ifac.org/StandardsAndGuidance/FMAC/IMAP1.html

That would then generate these files:

  /var/tmp/data.contents
  /var/tmp/data.links
  /var/tmp/data.response

The contents file contains the actual data from the URL.  The only odd
thing I see is that (from the Unix perspective) the file is one long
line.  That shouldn't matter to the engine, but perhaps something odd is
going on.
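
One way to check the "one long line" observation is to count the
line-ending styles in the contents file.  The following is only a rough
sketch in Python; the function name newline_report is made up for
illustration, and the idea that bare carriage returns might be involved
is a guess, not something confirmed for this page.

```python
import sys


def newline_report(data: bytes) -> dict[str, int]:
    """Count line-ending styles in raw bytes (hypothetical helper)."""
    crlf = data.count(b"\r\n")        # DOS/Windows style
    lf = data.count(b"\n") - crlf     # Unix style, excluding CRLF pairs
    cr = data.count(b"\r") - crlf     # bare carriage returns (old Mac style)
    # Longest run of bytes between Unix newlines, i.e. the longest
    # "line" as a Unix tool would see it.
    longest = max(len(chunk) for chunk in data.split(b"\n"))
    return {"lf": lf, "crlf": crlf, "cr": cr, "longest_unix_line": longest}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python newline_report.py /var/tmp/data.contents
    with open(sys.argv[1], "rb") as fh:
        print(newline_report(fh.read()))
```

If "lf" comes back as zero and "cr" is large, the file uses bare
carriage returns, which would explain why it looks like a single line
on Unix.  Whether that has anything to do with the low word count is a
separate question.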

moo
------------------------------------------------------------
           Ron Samuel Klatchko - Software Jester
            Brightmail Inc - rsk@brightmail.com
Received on Thu Feb 17 16:54:14 2000