Re: (not indexing some files)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Dec 07 2006 - 19:12:28 GMT

On Thu, Dec 07, 2006 at 10:58:39AM -0800, Terry Huss wrote:
> I am running 2.4.3 and have only been able to get the HTTP access to
> work properly - the spider method would hang and spit out numerous
> ambiguous errors.  I have included filters in the config file and it
> seems to perform that task well.

I'm pretty sure the spider doesn't spit out ambiguous errors.

moseley@bumby:~$ fgrep -i ambiguous swish-e/prog-bin/spider.pl.in 
moseley@bumby:~$ 

Yep.

You might try it again and note these suggestions when posting.

http://swish-e.org/docs/install.html#when_posting_please_provide_the_following_information_

You can run swish with -v3 and get quite a bit of output.  Not sure
how much you will see about filtering, but it will tell you what files
it is processing.  I'd try that first on the directories that are not
being indexed.  You can also use -T indexed_words to see what text is
actually being indexed for each file.  You might run that on a specific
file once you find one that isn't being indexed the way you expect.
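
For example, a rough sketch of those two runs.  The names swish.conf and
docs/missing.pdf are placeholders, not taken from your setup, and the
command lines are only printed so you can adjust them before running
anything:

```shell
# Placeholder names (assumptions): swish.conf and docs/missing.pdf.
CONFIG="swish.conf"
ONE_FILE="docs/missing.pdf"

# Full run over http; -v 3 lists each file as swish processes it.
VERBOSE_RUN="swish-e -c $CONFIG -S http -v 3"

# One local file through the fs method; -T indexed_words dumps the
# words swish actually indexes from that file.
TRACE_RUN="swish-e -c $CONFIG -S fs -i $ONE_FILE -T indexed_words"

echo "$VERBOSE_RUN"
echo "$TRACE_RUN"
```

The same switches should apply on Windows; only the paths change.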

I assume you have tried running the filter command from your config:

    C:\SWISH-E\lib\swish-e\pdftotext.exe  '"%p" -htmlmeta -'

on your pdf files directly (with the real file path in place of %p) and
that works, right?
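
In case it helps, here is a small sketch of that check as a shell
function.  It assumes pdftotext is on your PATH; the function name
check_pdf_text is made up for illustration.  Since -htmlmeta always
emits an HTML wrapper, the tags are stripped before counting words:

```shell
# Run the PDF filter by hand on one file and report how many words it
# produces.  A count near zero suggests the PDF has no extractable text
# (for example, a scanned image), so there is little for swish to index.
check_pdf_text() {
    pdf="$1"
    # Same argument order as the FileFilter line: file, -htmlmeta, "-" (stdout).
    if ! out=$(pdftotext "$pdf" -htmlmeta - 2>/dev/null); then
        echo "filter failed: $pdf"
        return 1
    fi
    # Crude tag stripping, then a word count of what is left.
    words=$(printf '%s\n' "$out" | sed 's/<[^>]*>//g' | wc -w | tr -d ' ')
    echo "$pdf: $words words extracted"
}
```

Call it as check_pdf_text path/to/file.pdf on each of the two PDFs that
are going missing.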

If you use spider.pl, don't also use FileFilter in your swish config.
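
A minimal sketch of a spider-only setup, assuming the conversion of
PDFs and Word files is left to the spider side rather than to swish.
The file names swish-prog.conf and spider.conf are placeholders:

```shell
# Write a swish config for the prog method with no filter directives.
cat > swish-prog.conf <<'EOF'
# Documents are fetched by spider.pl and handed to swish.
IndexDir spider.pl
SwishProgParameters spider.conf
IndexReport 3
EOF

# Guard against the mix-up: warn if a filter directive sneaks back in.
if grep -q '^[[:space:]]*FileFilter' swish-prog.conf; then
    echo "remove FileFilter when indexing with spider.pl" >&2
else
    echo "swish-prog.conf is spider-only"
fi
```

You would then index with something like
swish-e -c swish-prog.conf -S prog.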


> 
> My config file data and index results are as follows...
> 
> ----------------------------------
> IndexDir http://www.p2pays.org/
> 
> #IndexDir spider.pl
> #SwishProgParameters C:\SWISH-e\spider.conf
> 
> # Swish can index a number of different types of documents.
> # .config are text, and .pdf are converted (filtered) to xml:
> 
> TruncateDocSize 10000000
> DefaultContents HTML2
> FileFilter .pdf	C:\SWISH-E\lib\swish-e\pdftotext.exe  '"%p" -htmlmeta -'
> FileFilter .doc C:\SWISH-E\lib\swish-e\catdoc.exe '-s8859-1 -d8859-1 "%p"'
> IndexContents HTML2 .htm .html .shtml .aspx .cfm
> #.asp
> IndexContents TXT2 .txt
> 
> StoreDescription HTML2 <body>
> StoreDescription TXT2 2000
> 
> # Since the pdf2xml module generates xml for the PDF info fields and
> # for the PDF content, let's use MetaNames
> # Instead of specifying each metaname, let's let swish do it automatically.
> #UndefinedMetaTags auto
> 
> MetaNames swishdocpath sitelimiter
> 
> #IndexOnly .pdf
> 
> IndexReport 3
> ----------------------------------
> 
> ----------------------------------
> Removing very common words...
> no words removed.
> Writing main index...
> Sorting words ...
> Sorting 1,681,722 words alphabetically
> Writing header ...
> Writing index entries ...
>   Writing word text: Complete
>   Writing word hash: Complete
>   Writing word data: Complete
> 1,681,722 unique words indexed.
> 5 properties sorted.
> 38,840 files indexed.  1,898,834,286 total bytes.  224,507,202 total words.
> Elapsed time: 49:11:44 CPU time: 49:11:44
> Indexing done!
> ----------------------------------
> 
> -----Original Message-----
> From: swish-e@sunsite3.berkeley.edu
> [mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
> Sent: Thursday, December 07, 2006 1:47 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: (not indexing some files)
> 
> On Thu, Dec 07, 2006 at 10:39:12AM -0800, Terry Huss wrote:
> > I implemented Swish on my site quite some time ago and have run
> > into a recurring problem with the indexed results.  There are a couple
> > of files that simply are not being captured.  I currently have the
> > engine set up to use the HTTP method to access the files, and it works
> > reasonably well.  The two files in question are both PDFs and are
> > located in a publicly accessible directory (along with 1,000 other
> > reference documents).  On the last attempt I dispersed the two files
> > into "test" folders in 5 different directories, but again they were
> > not found by Swish.
> 
> What version are you running?  The http method didn't index pdfs by
> default -- you had to use filters.
> 
> My suggestion is to make sure you have a recent version of swish, 2.4.3
> or greater.  Then use spider.pl for fetching your documents.  It has
> debugging options that will tell you what is being fetched and what
> isn't (and why).
> 
>     http://swish-e.org/docs/spider.html
> 
> 
> 
> > Several questions for ya...
> >
> > Are there any known patterns to how the indexer moves through the 
> > directories?
> 
> For http?  It follows links in your web pages.
> 
> 
> > Are there properties to a particular directory/file which would cause 
> > the indexer to skip it?
> 
> Like being empty or a file type that can't be indexed?
> 
> 
> > I feel like I am just rolling dice each time I run the indexer...is 
> > there any way to more closely dictate its performance?
> 
> How fast it runs?  Well, there are a few delay options available, but
> otherwise, it's dictated by how fast it can fetch and index the documents
> on your hardware.
> 
> Or are you asking something else?
> 
> --
> Bill Moseley
> moseley@hank.org
> 
> Unsubscribe from or help with the swish-e list: 
>    http://swish-e.org/Discussion/
> 
> Help with Swish-e:
>    http://swish-e.org/current/docs
>    swish-e@sunsite.berkeley.edu

-- 
Bill Moseley
moseley@hank.org
Received on Thu Dec 7 11:12:28 2006