Ron, I tried this with the url mentioned below, and got these results:
.contents file had the contents of the html file (it was 6,426 bytes)
.links file was not created (the file does have links - could this be a
.response file contained just the text "200" (another file - which indexed
correctly - returned "200
text/html, text/html; charset=iso-8859-1" in the response file. Could
*this* be a/the problem?)
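
To compare a URL that indexes properly with one that does not, the spider's output files can be checked side by side. Below is a minimal Python sketch, not part of Swish-E. It assumes the files share the prefix given on the swishspider command line plus the extensions .contents, .links and .response, and that the response file holds the status code on its first line and the content type on its second, as the two results above suggest. The function name inspect_spider_output is made up for illustration.

```python
from pathlib import Path


def inspect_spider_output(prefix):
    """Summarise the files swishspider is assumed to write for one URL.

    `prefix` is the local path given to swishspider; the three extensions
    below are the ones reported in this thread.
    """
    report = {"status": None, "content_type": None}
    for ext in (".contents", ".links", ".response"):
        path = Path(str(prefix) + ext)
        if path.is_file():
            report[ext] = {"exists": True, "bytes": path.stat().st_size}
        else:
            report[ext] = {"exists": False, "bytes": 0}

    response = Path(str(prefix) + ".response")
    if response.is_file():
        lines = response.read_text(errors="replace").splitlines()
        # Assumption: line 1 is the HTTP status, line 2 the content type.
        if lines:
            report["status"] = lines[0].strip()
        if len(lines) > 1:
            report["content_type"] = lines[1].strip()
    return report
```

For the problem URL this should report an existing .contents file, a missing .links file, a status of "200", and no content type. That last value is the difference from the file that indexed correctly.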
Thanks for suggesting this - I did not know the spider could be used in
Note for PC users - usage is:
swishspider var\tmp\data <url>
This could be confusing because at other times (for example, in the Swish-E
.config file and in Perl files) local paths are written using forward slashes.
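
If the two path styles have to be converted in a script, the separator can be swapped without string replacement. This is a small Python sketch using the standard pathlib module. The helper names are made up for illustration, and whether swishspider on a PC really requires backslashes is only what was observed above.

```python
from pathlib import PureWindowsPath


def to_backslashes(path_text):
    """Render a path with backslash separators, as in the PC usage above."""
    return str(PureWindowsPath(path_text))


def to_forward_slashes(path_text):
    """Render a path with forward slashes, as in config and Perl files."""
    return PureWindowsPath(path_text).as_posix()
```

PureWindowsPath accepts either separator as input, so both helpers work whichever style the path was originally written in.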
From: Ron Samuel Klatchko [SMTP:firstname.lastname@example.org]
Sent: Thursday, February 17, 2000 9:45 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Swish-E indexing
Chris Humphries wrote:
> I tried indexing this file
> IndexDir http://www.ifac.org/StandardsAndGuidance/FMAC/IMAP1.html
> using the HTTP method.
> It indexed just 2 words.
> The file did not look unusual, so out of curiosity I tried using "get()" in
> a Perl program, saving the string out as a .htm file, and then passing the
> file to Swish-E - which then indexed 718 words.
> Can anyone help explain to me why this should be so?
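
The fetch-and-save workaround described in the quoted message can be sketched as follows. This is a Python illustration of the same idea using the standard urllib module, not the Perl script that was actually used. The function name save_url and its parameters are hypothetical.

```python
import urllib.request


def save_url(url, out_path):
    """Fetch `url` and write the raw bytes to `out_path`.

    Returns the number of bytes written, so the size can be compared
    with the .contents file that the spider produces.
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(out_path, "wb") as handle:
        handle.write(data)
    return len(data)
```

The saved file can then be indexed from the file system instead of through the HTTP method, which is the comparison described in the quoted message.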
I'm not sure, but an interesting experiment to do is to use swishspider
to retrieve the URL. You can use a command line like:
Which would then generate files:
The contents file contains the actual data from the URL. The only odd
thing I see is that (from the Unix perspective), the file is one long
line. That shouldn't matter to the engine but perhaps something odd is
Ron Samuel Klatchko - Software Jester
Brightmail Inc - email@example.com
Received on Fri Feb 18 06:22:05 2000