
Re: (not indexing some files)

From: William M Conlon <bill(at)not-real.tothept.com>
Date: Thu Dec 07 2006 - 20:43:26 GMT
It's either PDF 1.5 or 1.6: files produced by Acrobat 6 or 7 are incompatible.

Whatever distiller you use, you need to set the job options to make
it compatible with xpdf, or you're SOL.
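
If you want to check what you've got before re-distilling: pdfinfo
(part of xpdf, usually right next to pdftotext -- I don't know offhand
whether the Windows swish-e bundle includes it) will report the
version of a file, e.g.

    pdfinfo somefile.pdf

Look at the "PDF version:" line it prints; 1.5 or 1.6 means Acrobat
6/7 output.  In Distiller the knob is the Compatibility drop-down in
the job options: pick Acrobat 4.0 (PDF 1.3) or Acrobat 5.0 (PDF 1.4).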

Bill


On Dec 7, 2006, at 12:21 PM, Terry Huss wrote:

> Thanks for the insight Bill, but unfortunately the 7.0.5 distiller has
> been used on many of the located documents.  In fact, of the two in
> question, one is being found and the other not.  The only distinction
> in versions is that the "found" one was distilled via Word and the
> "lost" one was done through Powerpoint.  What versions does xpdf not
> like?  We have been PDFing these documents over a 15-year span, so
> many a version has been used!
>
> -----Original Message-----
> From: swish-e@sunsite3.berkeley.edu
> [mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of William M Conlon
> Sent: Thursday, December 07, 2006 2:59 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: (not indexing some files)
>
> Very likely it's an issue with the pdf file, as xpdf does not read the
> newest pdf versions.  Turn on debug in the spider and you'll see
> whether that's the issue, or just open the pdf in Acrobat and save it
> as Acrobat 4 or 5 compatible.
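>
> If I remember right, the quickest way to turn debugging on is the
> SPIDER_DEBUG environment variable rather than editing the config,
> something like this (assuming the spider.pl / -S prog setup; the
> debug keywords are from memory, so check "perldoc spider.pl"):
>
>     set SPIDER_DEBUG=url,skipped,failed
>     swish-e -S prog -c swish.conf
>
> Skipped and failed URLs should then be reported along with the reason.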
>
>
> Bill Conlon
>
> On Dec 7, 2006, at 11:12 AM, Bill Moseley wrote:
>
>> On Thu, Dec 07, 2006 at 10:58:39AM -0800, Terry Huss wrote:
>>> I am running 2.4.3 and have only been able to get the HTTP access to
>>> work properly - the spider method would hang and spit out numerous
>>> ambiguous errors.  I have included filters in the config file and it
>>> seems to perform that task well.
>>
>> I'm pretty sure the spider doesn't spit out ambiguous errors.
>>
>> moseley@bumby:~$ fgrep -i ambiguous swish-e/prog-bin/spider.pl.in
>> moseley@bumby:~$
>>
>> Yep.
>>
>> You might try it again and note these suggestions when posting.
>>
>> http://swish-e.org/docs/install.html#when_posting_please_provide_the_following_information_
>>
>> You can run swish with -v3 and get quite a bit of output.  Not sure
>> how much you will see about filtering, but it will tell you what
>> files it is processing.  I'd try that first on the directories that
>> are not being indexed.  You can also use -T indexed_words to see what
>> text is actually being indexed for each file.  Might run that on a
>> specific file when you find one that isn't being indexed like you
>> think.
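>>
>> Concretely, something along these lines (I'm guessing at your config
>> path, and the URL is just a placeholder for one of the missing
>> documents):
>>
>>     swish-e -S http -c C:\SWISH-E\swish.conf -v 3
>>     swish-e -S http -c C:\SWISH-E\swish.conf -i <url-of-missing-pdf> -T indexed_words
>>
>> The second command dumps the words swish actually indexes for that
>> one document, if it fetches it at all.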
>>
>> I assume you have tried running:
>>
>>     C:\SWISH-E\lib\swish-e\pdftotext.exe  '"%p" -htmlmeta -'
>>
>> on your pdf files directly and that works, right?
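>>
>> That is, something along these lines from a command prompt (the file
>> name is just an example):
>>
>>     C:\SWISH-E\lib\swish-e\pdftotext.exe "C:\temp\some-missing-file.pdf" -htmlmeta -
>>
>> It should print the document's text wrapped in simple <html> markup
>> to the console.  If that comes out empty (or errors) for the two
>> problem PDFs but looks fine for ones that are found, the problem is
>> in the PDF conversion rather than the spidering.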
>>
>> If you use spider.pl, don't also use FileFilter in your swish config.
>>
>>
>>>
>>> My config file data and index results are as follows...
>>>
>>> ----------------------------------
>>> IndexDir http://www.p2pays.org/
>>>
>>> #IndexDir spider.pl
>>> #SwishProgParameters C:\SWISH-e\spider.conf
>>>
>>> # Swish can index a number of different types of documents.
>>> # .config are text, and .pdf are converted (filtered) to xml:
>>>
>>> TruncateDocSize 10000000
>>> DefaultContents HTML2
>>> FileFilter .pdf C:\SWISH-E\lib\swish-e\pdftotext.exe '"%p" -htmlmeta -'
>>> FileFilter .doc C:\SWISH-E\lib\swish-e\catdoc.exe '-s8859-1 -d8859-1 "%p"'
>>> IndexContents HTML2 .htm .html .shtml .aspx .cfm #.asp
>>> IndexContents TXT2 .txt
>>>
>>> StoreDescription HTML2 <body>
>>> StoreDescription TXT2 2000
>>>
>>> # Since the pdf2xml module generates xml for the PDF info fields and
>>> # for the PDF content, let's use MetaNames
>>> # Instead of specifying each metaname, let's let swish do it automatically.
>>> #UndefinedMetaTags auto
>>>
>>> MetaNames swishdocpath sitelimiter
>>>
>>> #IndexOnly .pdf
>>>
>>> IndexReport 3
>>> ----------------------------------
>>>
>>> ----------------------------------
>>> Removing very common words...
>>> no words removed.
>>> Writing main index...
>>> Sorting words ...
>>> Sorting 1,681,722 words alphabetically
>>> Writing header ...
>>> Writing index entries ...
>>>   Writing word text: Complete
>>>   Writing word hash: Complete
>>>   Writing word data: Complete
>>> 1,681,722 unique words indexed.
>>> 5 properties sorted.
>>> 38,840 files indexed.  1,898,834,286 total bytes.  224,507,202 total words.
>>> Elapsed time: 49:11:44 CPU time: 49:11:44
>>> Indexing done!
>>> ----------------------------------
>>>
>>> -----Original Message-----
>>> From: swish-e@sunsite3.berkeley.edu
>>> [mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
>>> Sent: Thursday, December 07, 2006 1:47 PM
>>> To: Multiple recipients of list
>>> Subject: [SWISH-E] Re: (not indexing some files)
>>>
>>> On Thu, Dec 07, 2006 at 10:39:12AM -0800, Terry Huss wrote:
>>>> I implemented Swish on my site quite some time ago and have run
>>>> into a recurring problem with the indexed results.  There are a
>>>> couple of files that simply are not being captured.  I currently
>>>> have the engine set up to use the HTTP method to access the files,
>>>> and it works reasonably well.  The two files in question are both
>>>> PDFs and are located in a publicly accessible directory (along with
>>>> 1,000 other reference documents).  On the last attempt I dispersed
>>>> the two files into "test" folders in 5 different directories, but
>>>> again they were not found by Swish.
>>>
>>> What version are you running?  The http method didn't index pdfs by
>>> default -- you had to use filters.
>>>
>>> My suggestion is to make sure you have a recent version of swish,
>>> 2.4.3 or greater.  Then use spider.pl for fetching your documents.
>>> It has debugging options that will tell you what is being fetched
>>> and what isn't (and why).
>>>
>>>     http://swish-e.org/docs/spider.html
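>>>
>>> A minimal spider.conf is just a small perl file; from memory it
>>> looks roughly like this (key names and the debug constants are off
>>> the top of my head -- the page above is authoritative):
>>>
>>>     @servers = (
>>>         {
>>>             base_url => 'http://www.yourserver.example/',
>>>             email    => 'you@yourdomain.example',
>>>             debug    => DEBUG_URL | DEBUG_SKIPPED | DEBUG_FAILED,
>>>         },
>>>     );
>>>     1;
>>>
>>> You then point swish at it with "IndexDir spider.pl",
>>> "SwishProgParameters spider.conf" and -S prog.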
>>>
>>>
>>>
>>>> Several questions for ya...
>>>>
>>>> Are there any known patterns to how the indexer moves through the
>>>> directories?
>>>
>>> For http?  It follows links in your web pages.
>>>
>>>
>>>> Are there properties to a particular directory/file which would
>>>> cause the indexer to skip it?
>>>
>>> Like being empty or a file type that can't be indexed?
>>>
>>>
>>>> I feel like I am just rolling dice each time I run the indexer...is
>>>> there any way to more closely dictate its performance?
>>>
>>> How fast it runs?  Well, there are a few delay options available,
>>> but otherwise it's dictated by how fast it can fetch and index the
>>> documents on your hardware.
>>>
>>> Or are you asking something else?
>>>
>>> --
>>> Bill Moseley
>>> moseley@hank.org
>>>
>>
>> --
>> Bill Moseley
>> moseley@hank.org
>>
>> Unsubscribe from or help with the swish-e list:
>>    http://swish-e.org/Discussion/
>>
>> Help with Swish-e:
>>    http://swish-e.org/current/docs
>>    swish-e@sunsite.berkeley.edu
>>
>
>
Received on Thu Dec 7 12:43:28 2006