Chris Humphries wrote:
> I tried indexing this file
>
> IndexDir http://www.ifac.org/StandardsAndGuidance/FMAC/IMAP1.html
>
> using the HTTP method.
>
> It indexed just 2 words.
>
> The file did not look unusual, so out of curiosity I tried using "get()" in
> a Perl program, saving the string out as a .htm file, and then passing this
> file to Swish-E - which then indexed 718 words.
>
> Can anyone help explain to me why this should be so?
I'm not sure, but an interesting experiment is to retrieve the URL
with swishspider yourself. You can use a command line like:
/path/to/swishspider /var/tmp/data http://www.ifac.org/StandardsAndGuidance/FMAC/IMAP1.html
That would then generate these files:
/var/tmp/data.contents
/var/tmp/data.links
/var/tmp/data.response
The contents file contains the actual data from the URL. The only odd
thing I see is that (from the Unix perspective) the file is one long
line. That shouldn't matter to the engine, but perhaps something odd is
going on.
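To check the one-long-line observation on your own copy, something like
the following could count the lines and measure the longest one. This is
only a rough sketch in Python, not part of Swish-E; the function name
line_stats is made up for illustration, and the path in the comment is
just the one from the example above.

```python
def line_stats(path):
    """Return (number_of_lines, length_of_longest_line) for the file at path.

    Lines are split on "\n" only, which is the Unix view of the file.
    """
    line_count = 0
    longest = 0
    # Read as bytes so unusual characters in the fetched page cannot
    # cause decoding errors.
    with open(path, "rb") as handle:
        for raw_line in handle:
            line_count += 1
            # Leave the trailing line terminator out of the length.
            longest = max(longest, len(raw_line.rstrip(b"\r\n")))
    return line_count, longest


# Hypothetical usage with the file generated by swishspider:
#   count, longest = line_stats("/var/tmp/data.contents")
#   print(count, "line(s), longest is", longest, "bytes")
```

A count of 1 (or 0, if there is no newline at all and the file is empty)
together with a very large longest-line value would confirm that the
whole page arrives as a single line.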
moo
------------------------------------------------------------
Ron Samuel Klatchko - Software Jester
Brightmail Inc - rsk@brightmail.com
Received on Thu Feb 17 16:54:14 2000