Re: Swish-E indexing

From: Chris Humphries <ChrisJMH(at)not-real.vermilion99.freeserve.co.uk>
Date: Fri Feb 18 2000 - 11:18:34 GMT

Ron, I tried this with the url mentioned below, and got these results:

.contents file had the contents of the html file (it was 6,426 bytes)

.links file was not created (the file does have links - could this be a 
problem?)

.response file contained just the text "200" (another file - which indexed 
correctly - returned "200
text/html, text/html; charset=iso-8859-1" in the response file. Could 
*this* be a/the problem?)

Thanks for suggesting this - I did not know the spider could be used in 
this way.

Chris Humphries

P.S.
Note for PC users - usage is:
swishspider var\tmp\data <url>
This could be confusing because at other times (for example, in the Swish-E 
.config file and in Perl files) local paths are written using forward 
slashes.


-----Original Message-----
From:	Ron Samuel Klatchko [SMTP:rsk@corpmail.brightmail.com]
Sent:	Thursday, February 17, 2000 9:45 PM
To:	Multiple recipients of list
Subject:	[SWISH-E] Re: Swish-E indexing

Chris Humphries wrote:
> I tried indexing this file
>
> IndexDir http://www.ifac.org/StandardsAndGuidance/FMAC/IMAP1.html
>
> using the HTTP method.
>
> It indexed just 2 words.
>
> The file did not look unusual, so out of curiosity I tried using "get()" in
> a Perl program, saving the string out as a .htm file, and then passing this
> file to Swish-E - which then indexed 718 words.
>
> Can anyone help explain to me why this should be so?

I'm not sure, but an interesting experiment to do is to use swishspider
to retrieve the URL.  You can use a command line like:

  /path/to/swishspider /var/tmp/data http://www.ifac.org/StandardsAndGuidance/FMAC/IMAP1.html

Which would then generate files:

  /var/tmp/data.contents
  /var/tmp/data.links
  /var/tmp/data.response

The contents file contains the actual data from the URL.  The only odd
thing I see is that (from the Unix perspective), the file is one long
line.  That shouldn't matter to the engine but perhaps something odd is
going on.
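
As a rough sketch (not part of Swish-E), the three generated files could be
checked with a few lines of Python. The function name inspect_spider_output
is made up for illustration. The code assumes the .response file holds the
status code on its first line and, when present, the content type on the
second. That layout is inferred from results reported in this thread, not
from the swishspider source.

```python
import os


def inspect_spider_output(prefix):
    """Summarize <prefix>.contents, <prefix>.links and <prefix>.response."""
    report = {}
    for suffix in ("contents", "links", "response"):
        path = f"{prefix}.{suffix}"
        if not os.path.exists(path):
            report[suffix] = None  # the file was never created
            continue
        with open(path, "rb") as fh:
            data = fh.read()
        # A line count of 1 for .contents means the page is one long line.
        report[suffix] = {"bytes": len(data), "lines": len(data.splitlines())}

    # Assumption: line 1 of .response is the status code, line 2 (if any)
    # is the content type.
    report["status"] = None
    report["content_type"] = None
    response_path = f"{prefix}.response"
    if os.path.exists(response_path):
        with open(response_path, "r", errors="replace") as fh:
            lines = [ln.strip() for ln in fh.read().splitlines() if ln.strip()]
        if lines:
            report["status"] = lines[0]
        if len(lines) > 1:
            report["content_type"] = lines[1]
    return report
```

Calling inspect_spider_output("/var/tmp/data") returns a dictionary. A None
entry for "links" or "content_type" points at a missing file or a missing
content type line, which may be worth comparing against a URL that indexes
correctly.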

moo
------------------------------------------------------------
           Ron Samuel Klatchko - Software Jester
            Brightmail Inc - rsk@brightmail.com
Received on Fri Feb 18 06:22:05 2000