On Thu, Dec 07, 2006 at 10:39:12AM -0800, Terry Huss wrote:
> I have implemented Swish on my site quite some time ago and have run
> into a recurring problem with the indexed results. There are a couple
> files that simply are not being captured. I currently have the engine
> setup to use the HTTP method to access the files, and it works
> reasonably well. The two files in question are both PDFs and are
> located in a publicly accessible directory (along with 1,000 other
> reference documents). In a past attempt I dispersed the two files into
> "test" folders in 5 different directories, but again they were not found
> by Swish.
What version are you running? The http method didn't index PDFs by
default -- you had to use filters.
My suggestion is to make sure you have a recent version of swish, 2.4.3
or greater. Then use spider.pl for fetching your documents. It has
debugging options that will tell you what is being fetched and what
isn't (and why).
> Several questions for ya...
> Are there any known patterns to how the indexer moves through the
> directories?
For http? It follows links in your web pages.
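To make that concrete, here is a rough Python sketch of what "following links" implies. It is not spider.pl's code, just a toy breadth-first crawl over a made-up in-memory site (the example.com URLs, SITE, and ALL_FILES are all hypothetical). The point is that a file sitting in a public directory is never visited unless some fetched page links to it.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def reachable(pages, start_url):
    """Return every URL discovered by following links from start_url.

    `pages` maps URL -> HTML text. URLs missing from the map are treated
    as non-HTML documents (e.g. PDFs): they count as found, but are not
    parsed for further links.
    """
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            target = urldefrag(urljoin(url, href)).url
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: orphan.pdf exists on disk but no page links to it.
SITE = {
    "http://example.com/": '<a href="docs/">Docs</a>',
    "http://example.com/docs/": '<a href="a.pdf">A</a> <a href="../">Home</a>',
}
ALL_FILES = set(SITE) | {
    "http://example.com/docs/a.pdf",
    "http://example.com/docs/orphan.pdf",
}

found = reachable(SITE, "http://example.com/")
missed = ALL_FILES - found
print(sorted(missed))
```

If your two PDFs behave like orphan.pdf here, moving them to other directories won't help; they need a link from a page the spider actually fetches.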
> Are there properties to a particular directory/file which would cause
> the indexer to skip it?
Like being empty or a file type that can't be indexed?
> I feel like I am just rolling dice each time I run the indexer...is
> there any way to more closely dictate its performance?
How fast it runs? Well, there are a few delay options available, but
otherwise, it's dictated by how fast it can fetch and index the
documents on your hardware.
Or are you asking something else?
Received on Thu Dec 7 10:47:22 2006