Re: Strange indexing of word documents

From: Bill Conlon <bill(at)not-real.tothept.com>
Date: Tue Apr 12 2005 - 17:15:44 GMT

Start with "man catdoc".

Then try running catdoc from the shell and see what you get:

$ catdoc foo.doc

A good file just prints out.  I presume you'll get an error message if 
there's a problem.
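
To check for one particular word rather than reading the whole output, something like the following might help. It is only a rough sketch: filter_has_word is a made-up helper name, and the catdoc options in the usage comment are simply copied from the FileFilter line in the quoted config.

```shell
# filter_has_word WORD COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND with ARGS and succeeds (exit status 0)
# only if WORD appears, case-insensitively, in what the command prints.
filter_has_word() {
    word=$1
    shift
    # Discard the filter's warnings on stderr; only its text output matters.
    "$@" 2>/dev/null | grep -qi -- "$word"
}

# Usage (assumes catdoc is installed and foo.doc exists):
#   filter_has_word Dog catdoc -s8859-1 -d8859-1 foo.doc && echo "found"
```

If the word does not show up here, it is probably not reaching swish-e either, which would point at the filter step rather than at the index.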


On Tuesday, April 12, 2005, at 09:47  AM, Thomas Nyman wrote:

> Hi
>
> As a novice with catdoc .. how do I look at the output from catdoc?
>
>
> On 2005-04-12 at 18:43, Bill Conlon wrote:
>
>> When I spider Word files, I sometimes get messages from the catdoc
>> filter such as "this file has been fast-saved x times -- some info
>> may be lost".
>>
>> Look at the output from catdoc.
>>
>>
>> On Tuesday, April 12, 2005, at 09:35  AM, Thomas Nyman wrote:
>>
>>> Hi again all
>>>
>>> I received help getting catdoc to work and all seems well... or so I
>>> thought. I'm not getting any error messages, but I have noticed that
>>> certain Word documents are not being indexed. I have one test document
>>> which simply contains the words Dog and Computer, but when I search for
>>> "Dog" I receive no hits even though the document contains that word.
>>> If I search on document path and give the file name, swish-e finds the
>>> file. It looks like it's being indexed, but if that's the case, why am I
>>> not receiving any hits?
>>>
>>> This is my conf file
>>>
>>> IndexDir /usr/local/arkiv/
>>> IndexOnly .doc .txt .rtf
>>> IndexContents TXT .txt .doc .rtf
>>> StoreDescription HTML <body> 200000
>>> StoreDescription TXT 10000
>>> MetaNames swishdocpath swishtitle
>>> FileFilter .doc /usr/bin/catdoc "-s8859-1 -d8859-1 '%p'"
>>> ReplaceRules remove /usr/local/arkiv/
>>>
>>> and my .swishcgi.conf file
>>>
>>> return {
>>>
>>>          title => 'Dokument Arkiv',
>>>          swish_binary => '/usr/local/bin/swish-e',
>>>          swish_index => '/home/admin/swishindex/index.swish-e',
>>>          prepend_path => '/arkiv/'
>>>
>>> }
>>>
>>> Thomas
>>>
>>
>
Received on Tue Apr 12 10:15:44 2005