Re: (not indexing some files)

From: Terry Huss <terry.huss(at)>
Date: Thu Dec 07 2006 - 18:58:52 GMT
I am running 2.4.3 and have only been able to get the HTTP access method
to work properly - the spider method would hang and spit out numerous
ambiguous errors.  I have included filters in the config file, and they
seem to do their job well.

My config file data and index results are as follows...


#SwishProgParameters C:\SWISH-e\spider.conf

# Swish can index a number of different types of documents.
# .config are text, and .pdf are converted (filtered) to xml:

TruncateDocSize 10000000
DefaultContents HTML2
FileFilter .pdf	C:\SWISH-E\lib\swish-e\pdftotext.exe  '"%p" -htmlmeta -'
FileFilter .doc C:\SWISH-E\lib\swish-e\catdoc.exe '-s8859-1 -d8859-1
IndexContents HTML2 .htm .html .shtml .aspx .cfm
IndexContents TXT2 .txt

StoreDescription HTML2 <body>
StoreDescription TXT2 2000

# Since the pdf2xml module generates xml for the PDF info fields and
# for the PDF content, let's use MetaNames
# Instead of specifying each metaname, let's let swish do it
#UndefinedMetaTags auto

MetaNames swishdocpath sitelimiter

#IndexOnly .pdf

IndexReport 3

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 1,681,722 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
1,681,722 unique words indexed.
5 properties sorted.
38,840 files indexed.  1,898,834,286 total bytes.  224,507,202 total
Elapsed time: 49:11:44 CPU time: 49:11:44
Indexing done!

On Thu, Dec 07, 2006 at 10:39:12AM -0800, Terry Huss wrote:
> I have implemented Swish on my site quite some time ago and have run 
> into a recurring problem with the indexed results.  There are a couple
> files that simply are not being captured.  I currently have the engine
> setup to use the HTTP method to access the files, and it works 
> reasonably well.  The two files in question are both PDFs and are 
> located in a publicly accessible directory (along with 1,000 other 
> reference documents).  The past attempt I dispersed the two files into
> "test" folders in 5 different directories, but again they were not 
> found by Swish.

What version are you running?  The http method didn't index pdfs by
default -- you had to use filters.

My suggestion is to make sure you have a recent version of swish, 2.4.3
or greater.  Then use for fetching your documents.  It has
debugging options that will tell you what is being fetched and what
isn't (and why).

> Several questions for ya...
> Are there any known patterns to how the indexer moves through the 
> directories?

For http?  It follows links in your web pages.
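
To make that concrete, here is a rough Python sketch of link-following
discovery.  It is not swish-e's code, only an illustration under stated
assumptions: the SITE dict, the example.com URLs and the reachable()
helper are all hypothetical stand-ins for a real web server and fetcher.
The point it shows is that a file sitting in a public directory is never
found unless some fetched page links to it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def reachable(site, start):
    """Return every URL reachable from `start` by following links.

    `site` is a hypothetical in-memory stand-in for a web server:
    a dict mapping URL -> HTML text ("" for non-HTML files).
    """
    seen = set()
    queue = [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in site:
            continue
        seen.add(url)
        parser = LinkCollector()
        parser.feed(site[url])
        for href in parser.hrefs:
            # Resolve relative links against the page they appear on.
            queue.append(urljoin(url, href))
    return seen


# Hypothetical site: b.pdf exists on the server but nothing links to it.
SITE = {
    "http://example.com/index.html": '<a href="docs/list.html">docs</a>',
    "http://example.com/docs/list.html": '<a href="a.pdf">A</a>',
    "http://example.com/docs/a.pdf": "",
    "http://example.com/docs/b.pdf": "",
}

found = reachable(SITE, "http://example.com/index.html")
missed = sorted(set(SITE) - found)
print(missed)
```

In this toy site a.pdf is discovered because list.html links to it, while
b.pdf is listed as missed even though it sits in the same directory.  If
the two PDFs in question are not linked from any page that gets fetched,
that alone would explain why they never show up in the index.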

> Are there properties to a particular directory/file which would cause 
> the indexer to skip it?

Like being empty or a file type that can't be indexed?

> I feel like I am just rolling dice each time I run the 
> there any way to more closely dictate its performance?

How fast it runs?  Well, there are a few delay options available, but
otherwise, it's dictated by how fast it can fetch and index the documents
on your hardware.

Or are you asking something else?

Bill Moseley

Received on Thu Dec 7 10:58:53 2006