Re: (not indexing some files)

From: Terry Huss <terry.huss(at)not-real.ncmail.net>
Date: Thu Dec 07 2006 - 20:21:55 GMT
Thanks for the insight, Bill, but unfortunately the 7.0.5 Distiller has
been used on many of the located documents.  In fact, of the two in
question, one is being found and the other is not.  The only distinction
in versions is that the "found" one was distilled via Word and the "lost"
one was done through PowerPoint.  Which versions does xpdf not like?
We have been PDFing these documents over a 15-year span, so many a
version has been used!
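
On the version question: a PDF declares its version in its header
(for example `%PDF-1.4`), and "Acrobat 4 or 5 compatible" corresponds to
PDF 1.3 and 1.4.  Below is a minimal Python sketch for printing what each
file declares, so the "found" and "lost" documents can be compared.  The
function names are illustrative and are not part of Swish-e or xpdf, and
the sketch does not know which versions a given xpdf build accepts.

```python
import re

# A PDF header looks like "%PDF-1.4"; capture the major and minor numbers.
HEADER_RE = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_version(data):
    """Return (major, minor) from the header bytes, or None if absent."""
    # The header normally sits at byte 0. Searching the first 1024 bytes
    # is an assumption made here to tolerate leading junk in old files.
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pdf_version_of_file(path):
    """Read only the start of the file and report its declared version."""
    with open(path, "rb") as handle:
        return pdf_version(handle.read(1024))
```

Running `pdf_version_of_file` over both documents and comparing the
results against a file that does get indexed should show whether the
version is really what differs.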

-----Original Message-----
From: swish-e@sunsite3.berkeley.edu
[mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of William M Conlon
Sent: Thursday, December 07, 2006 2:59 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: (not indexing some files)

Very likely it's an issue with the PDF file, as xpdf does not read the
newest PDF versions.  Turn on debug in the spider and you'll see whether
that's the issue.  Or just open the PDF in Acrobat and save it as
Acrobat 4 or 5 compatible.


Bill Conlon

On Dec 7, 2006, at 11:12 AM, Bill Moseley wrote:

> On Thu, Dec 07, 2006 at 10:58:39AM -0800, Terry Huss wrote:
>> I am running 2.4.3 and have only been able to get the HTTP access to
>> work properly - the spider method would hang and spit out numerous
>> ambiguous errors.  I have included filters in the config file and it
>> seems to perform that task well.
>
> I'm pretty sure the spider doesn't spit out ambiguous errors.
>
> moseley@bumby:~$ fgrep -i ambiguous swish-e/prog-bin/spider.pl.in 
> moseley@bumby:~$
>
> Yep.
>
> You might try it again and note these suggestions when posting.
>
> http://swish-e.org/docs/install.html#when_posting_please_provide_the_following_information_
>
> You can run swish with -v3 and get quite a bit of output.  Not sure
> how much you will see about filtering, but it will tell you what files
> it is processing.  I'd try that first on the directories that are not
> being indexed.  You can also use -T indexed_words to see what text is
> actually being indexed for each file.  Might run that on a specific
> file when you find one that isn't being indexed like you think.
>
> I assume you have tried running:
>
>     C:\SWISH-E\lib\swish-e\pdftotext.exe  '"%p" -htmlmeta -'
>
> on your pdf files directly and that works, right?
>
> If you use spider.pl don't also use FileFilter in your swish config.
>
>
>>
>> My config file data and index results are as follows...
>>
>> ----------------------------------
>> IndexDir http://www.p2pays.org/
>>
>> #IndexDir spider.pl
>> #SwishProgParameters C:\SWISH-e\spider.conf
>>
>> # Swish can index a number of different types of documents.
>> # .config are text, and .pdf are converted (filtered) to xml:
>>
>> TruncateDocSize 10000000
>> DefaultContents HTML2
>> FileFilter .pdf C:\SWISH-E\lib\swish-e\pdftotext.exe  '"%p" -htmlmeta -'
>> FileFilter .doc C:\SWISH-E\lib\swish-e\catdoc.exe '-s8859-1 -d8859-1 "%p"'
>> IndexContents HTML2 .htm .html .shtml .aspx .cfm #.asp
>> IndexContents TXT2 .txt
>>
>> StoreDescription HTML2 <body>
>> StoreDescription TXT2 2000
>>
>> # Since the pdf2xml module generates xml for the PDF info fields and
>> # for the PDF content, let's use MetaNames
>> # Instead of specifying each metaname, let's let swish do it automatically.
>> #UndefinedMetaTags auto
>>
>> MetaNames swishdocpath sitelimiter
>>
>> #IndexOnly .pdf
>>
>> IndexReport 3
>> ----------------------------------
>>
>> ----------------------------------
>> Removing very common words...
>> no words removed.
>> Writing main index...
>> Sorting words ...
>> Sorting 1,681,722 words alphabetically
>> Writing header ...
>> Writing index entries ...
>>   Writing word text: Complete
>>   Writing word hash: Complete
>>   Writing word data: Complete
>> 1,681,722 unique words indexed.
>> 5 properties sorted.
>> 38,840 files indexed.  1,898,834,286 total bytes.  224,507,202 total words.
>> Elapsed time: 49:11:44 CPU time: 49:11:44
>> Indexing done!
>> ----------------------------------
>>
>> -----Original Message-----
>> From: swish-e@sunsite3.berkeley.edu
>> [mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
>> Sent: Thursday, December 07, 2006 1:47 PM
>> To: Multiple recipients of list
>> Subject: [SWISH-E] Re: (not indexing some files)
>>
>> On Thu, Dec 07, 2006 at 10:39:12AM -0800, Terry Huss wrote:
>>> I have implemented Swish on my site quite some time ago and have run
>>> into a recurring problem with the indexed results.  There are a couple
>>> files that simply are not being captured.  I currently have the engine
>>> setup to use the HTTP method to access the files, and it works
>>> reasonably well.  The two files in question are both PDFs and are
>>> located in a publicly accessible directory (along with 1,000 other
>>> reference documents).  The past attempt I dispersed the two files into
>>> "test" folders in 5 different directories, but again they were not
>>> found by Swish.
>>
>> What version are you running?  The http method didn't index pdfs by 
>> default -- you had to use filters.
>>
>> My suggestion is to make sure you have a recent version of swish
>> 2.4.3 or greater.  Then use spider.pl for fetching your documents.  It
>> has debugging options that will tell you what is being fetched and
>> what isn't (and why).
>>
>>     http://swish-e.org/docs/spider.html
>>
>>
>>
>>> Several questions for ya...
>>>
>>> Are there any known patterns to how the indexer moves through the
>>> directories?
>>
>> For http?  It follows links in your web pages.
>>
>>
>>> Are there properties to a particular directory/file which would 
>>> cause the indexer to skip it?
>>
>> Like being empty or a file type that can't be indexed?
>>
>>
>>> I feel like I am just rolling dice each time I run the indexer...is 
>>> there any way to more closely dictate its performance?
>>
>> How fast it runs?  Well, there are a few delay options available, but
>> otherwise, it's dictated by how fast it can fetch and index the
>> documents on your hardware.
>>
>> Or are you asking something else?
>>
>> --
>> Bill Moseley
>> moseley@hank.org
>>
>> Unsubscribe from or help with the swish-e list:
>>    http://swish-e.org/Discussion/
>>
>> Help with Swish-e:
>>    http://swish-e.org/current/docs
>>    swish-e@sunsite.berkeley.edu
>>
>>
>>
>
> --
> Bill Moseley
> moseley@hank.org
>
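
The quoted suggestion to run pdftotext directly on the PDF files can be
scripted over a whole directory.  What follows is a rough Python sketch,
assuming pdftotext is on the PATH (the thread uses
C:\SWISH-E\lib\swish-e\pdftotext.exe).  The helper names are hypothetical.
The -htmlmeta option from the config is left out on purpose, on the
assumption that its HTML wrapper would make an empty extraction look
non-empty.

```python
import subprocess


def filter_output(command, path, timeout=60):
    """Run a filter on one file; return (exit_code, extracted_text)."""
    # command is a list such as ["pdftotext"]. The file path and "-"
    # (write to stdout) are appended, mirroring the FileFilter arguments.
    result = subprocess.run(
        command + [path, "-"],
        capture_output=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout.decode("latin-1", errors="replace")


def report(command, paths):
    """Return the paths where the filter failed or produced no text."""
    suspects = []
    for path in paths:
        try:
            code, text = filter_output(command, path)
        except (OSError, subprocess.TimeoutExpired):
            # The filter could not be started, or it hung on this file.
            suspects.append(path)
            continue
        if code != 0 or not text.strip():
            suspects.append(path)
    return suspects
```

Calling `report(["pdftotext"], list_of_pdf_paths)` returns the files worth
a closer look with -v3 or -T indexed_words.
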
Received on Thu Dec 7 12:21:56 2006