[swish-e] *some* pdf documents not indexed

From: Dr Michael Daly <gp(at)not-real.holisticgp.com.au>
Date: Sat Jul 28 2007 - 03:43:07 GMT
Dear list,
If anyone can solve this mystery, it would be great! Swish-e 2.4.5 (on
CentOS) is failing to index some PDF documents. Here is the config file:

    # Tell Swish-e what to index (same as -i switch above)
    #
    IndexDir /home/server_dir/Resources/Research/2007


    #Document Source Directives - refer www.swish-e.org/docs/swish-config.html
    #ReplaceRules [replace|remove|prepend|append|regex]
    ReplaceRules remove /home/server_dir/


    # Only the following type of files
    IndexOnly .htm .html .txt .doc .pdf

    # Tell Swish-e that .txt files are to use the text parser.
    IndexContents TXT* .txt

    # Otherwise, use the HTML parser
    DefaultContents HTML*

    # Ask libxml2 to report any parsing errors and warnings or
    # any UTF-8 to 8859-1 conversion errors
    ParserWarnLevel 9

    # index.swish-e is the default index file name, unless the
    # IndexFile directive is specified in this config file
    IndexFile /home/indices/for_index4.index
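
For comparison, here is a hedged sketch of how a Swish-e config can route PDF files through pdftotext before they are parsed. The config above, as posted, contains no FileFilter line, so .pdf files would fall through to DefaultContents HTML* and be read as raw bytes. The lines below are modelled on the FileFilter example in the Swish-e configuration documentation. They assume pdftotext is on the PATH and have not been tested against this installation.

```
    # Assumption: pdftotext is installed and on the PATH.
    # %p is replaced by the path of the file being indexed; "-" sends
    # the extracted text to stdout, where Swish-e reads it.
    FileFilter .pdf pdftotext "'%p' -"

    # The filter emits plain text, so parse .pdf with the text parser.
    IndexContents TXT* .pdf
```

If a filter were in effect, the -T indexed_words output would be expected to show words from the document text rather than tokens such as 'obj' and 'stream'.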


Checks I have made include:
1. pdftotext on the affected PDF documents: this works, i.e. it produces a
.txt version of each document that fails to be indexed
2. owner, group and other permissions on the missing documents: same as
those on documents that are not missing (764)
3. PDF version (1.3, 1.4): no consistent difference

4. this seems to work:
    swish-e -i /home/server_dir/Resources/Research/2007/Low_Purine_Diet_405.pdf -T indexed_words | less
e.g.
    Warning: Substituted 86 embedded null character(s) in file
    '/home/server_dir/Resources/Research/2007/Low_Purine_Diet_405.pdf'
    with a newline

    Adding:[1:swishdefault(1)]   'pdf'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   '1'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   '4'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   ''   Pos:4  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   '2'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   '0'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'obj'   Pos:7  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'stream'   Pos:8  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'h'   Pos:9  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'wk'   Pos:10  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   ''   Pos:11  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   ''   Pos:12  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   '5ip'   Pos:13  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'z'   Pos:14  Stuct:0x1 ( FILE )


5. this also seems to work:
    swish-e -i /home/server_dir/Resources/Research/2007/Low_Purine_Diet_405.pdf -c /home/indices/index4.conf -T indexed_words | less

(same output as above)

6. it makes no difference whether I search for the indexed words with the
GUI search tool:
    http://192.168.x.y/cgi-bin/swish.cgi

or just the command line:
    swish-e -f /home/indices/for_index4.index -w purine

7. I have enclosed the file (low purine diet...needed for people with gout)

8. /home/server_dir is shared by a Samba server and also served by an
Apache server, i.e. it can be accessed via Samba and also via URL
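
Checks 1 and 6 can be combined into a single comparison: list the PDFs whose extracted text contains a word, list the paths the index returns for that word, and print the difference. This is a minimal sketch, assuming a POSIX shell with sort, comm and mktemp. The function name missing_from_index and the file names on_disk.txt and indexed.txt are hypothetical, and the commented usage lines are untested against this setup.

```shell
# Print every path listed in file $1 that does not appear in file $2.
# Both files hold one path per line; order and duplicates do not matter.
missing_from_index() {
    on_disk_sorted=$(mktemp)
    indexed_sorted=$(mktemp)
    sort -u "$1" > "$on_disk_sorted"
    sort -u "$2" > "$indexed_sorted"
    # comm -23 keeps only the lines unique to the first file.
    comm -23 "$on_disk_sorted" "$indexed_sorted"
    rm -f "$on_disk_sorted" "$indexed_sorted"
}

# Hypothetical usage (assumes pdftotext and swish-e are installed, and that
# the -x '<swishdocpath>\n' output format is available in this version):
#
#   for f in /home/server_dir/Resources/Research/2007/*.pdf; do
#       pdftotext "$f" - | grep -qi purine && echo "$f"
#   done | sed 's|^/home/server_dir/||' > on_disk.txt
#
#   swish-e -f /home/indices/for_index4.index -w purine \
#       -x '<swishdocpath>\n' | grep -v -e '^#' -e '^\.$' > indexed.txt
#
#   missing_from_index on_disk.txt indexed.txt
```

The sed step strips the same /home/server_dir/ prefix that ReplaceRules removes, so that both lists use the same form of path.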

If anyone can solve this mystery, it would be great!

Regards
Michael Daly MB BS GradDip(Integrative Medicine) GradCert(Evidence Based
Practice) M Bus(Information Innovation) GradDip(Document Management)
http://www.holisticgp.com.au
Received on Sun Jul 29 01:02:22 2007