Re: IndexContents and StoreDescription for doc PDF PPT XLS files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Dec 02 2005 - 21:50:08 GMT
On Fri, Dec 02, 2005 at 01:30:25PM -0800, David Larkin wrote:
> I've got swish-e to index a directory with mixed content (HTML DOC PDF XLS PPT files) and swish.cgi produces half sensible output.
> 
> At first it gave "(null)" where i'd expect to see the context of the string i was searching for.
> 
> So, i added 
> 
> StoreDescription HTML* <body> 20000
> StoreDescription TXT* 20000
> StoreDescription XML* <desc> 20000
> 
> and the "(null)" dissapeared , but still no context 
> 
> so i added
> 
> IndexContents HTML* .htm .html .shtml
> IndexContents TXT* .txt .log .text
> IndexContents XML* .xml
> 
> and now i get the context i expect for HTM files.
> 
> Can i get it to work for other filetypes ?
> 
> The documentation suggests HTML,TXT,XML are only legal arguments to StoreDescription.

That allows assigning the IndexContents based on each parser.  I'm not
sure it makes much sense.

I assume you are using -S prog for indexing.  You should look at what
each file reports in its header.

For example:

~$ /usr/local/lib/swish-e/spider.pl default file:///home/moseley/050819-securing-mac-os-x-tiger.pdf  | head
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
Path-Name: file:///home/moseley/050819-securing-mac-os-x-tiger.pdf
Content-Length: 90697
Last-Mtime: 1126391567
Document-Type: HTML*    <<<<<<<---- notice this

<html>
<head>
<title>Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc</title>
<meta name="author" content="martin">
<meta name="creationdate" content="Fri Aug 19 13:07:33 2005">


There's a PDF file that was filtered into HTML.  So the header is telling
swish to use the HTML* parser.  That will *override* anything you set in
your swish config file.
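
To see which parser each document will get without reading the whole stream, it can help to pull out just the header block. Here is a minimal Python sketch, assuming the header layout shown in the example above (one "Name: value" per line, ended by a blank line). The function name read_prog_header is made up for illustration and is not part of swish-e.

```python
import io
import re

# Header lines look like "Document-Type: HTML*" (layout taken from the
# spider.pl output above; this is an assumption, not a swish-e API).
HEADER_LINE = re.compile(r"([A-Za-z][A-Za-z0-9-]*):\s*(.*)$")

def read_prog_header(stream):
    """Read one header block from a text stream and return it as a dict.

    Hypothetical helper: stops at the first blank line after at least
    one header, and ignores lines that do not look like headers.
    """
    headers = {}
    for line in stream:
        line = line.rstrip("\r\n")
        if not line.strip():
            if headers:
                break      # blank line ends the header block
            continue       # skip blank lines before the first header
        match = HEADER_LINE.match(line)
        if match:
            headers[match.group(1)] = match.group(2).strip()
    return headers

sample = """Path-Name: file:///home/moseley/050819-securing-mac-os-x-tiger.pdf
Content-Length: 90697
Last-Mtime: 1126391567
Document-Type: HTML*

<html>
"""
hdr = read_prog_header(io.StringIO(sample))
print(hdr["Document-Type"])  # prints: HTML*
```

If the value printed is HTML*, then the HTML* settings are the ones that apply to that file, whatever its original extension was.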

So to store the description for that you would need:

StoreDescription HTML* <body>

Again, here you can see that the description is indeed saved:

$ /usr/local/lib/swish-e/spider.pl default file:///home/moseley/050819-securing-mac-os-x-tiger.pdf  | swish-e -c c -S prog -i stdin -v0 -T properties
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

(sorry the spider output and swish-e output are mixed a bit)

Summary for: file:///home/moseley/050819-securing-mac-os-x-tiger.pdf
         Connection: Close:      1  (1.0/sec)
               Total Bytes: 90,697  (90697.0/sec)
                Total Docs:      1  (1.0/sec)
               Unique URLs:      1  (1.0/sec)
application/pdf->text/html:      1  (1.0/sec)
          swishdocpath: 6 ( 55) S: "file:///home/moseley/050819-securing-mac-os-x-tiger.pdf"
            swishtitle: 7 ( 58) S: "Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc"
          swishdocsize: 8 (  4) N: "90697"
     swishlastmodified: 9 (  4) D: "2005-09-10 15:32:47 PDT"
      swishdescription:10 (88766) S: "The natural choice for information security solutions A Corsaire White Paper: Securing Mac OS X Author Document Reference Document Revision Date Stephen de Vries Securing Mac OS X 10.4 Tiger v1.0.doc 1.0 Released 19 August 2005  Copyright 2000  2005 Corsaire Limited All Rights Reserved A Corsaire W ..."
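
When checking many files, the same test can be scripted by scanning the -T properties output for a swishdescription line. This is only a sketch in Python, assuming the line layout seen in the output above (property name, property number, a length in parentheses, a one-letter type, then the quoted value). That layout is inferred from this one sample, and parse_property_line is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   swishdocsize: 8 (  4) N: "90697"
# Layout inferred from the sample output above, not from swish-e docs.
PROP_LINE = re.compile(
    r'^\s*(\w+):\s*(\d+)\s*\(\s*(\d+)\)\s*([A-Z]):\s*"(.*)"\s*$'
)

def parse_property_line(line):
    """Return a dict for a property line, or None for any other line."""
    match = PROP_LINE.match(line)
    if match is None:
        return None
    name, number, length, kind, value = match.groups()
    return {
        "name": name,
        "number": int(number),
        "length": int(length),
        "type": kind,
        "value": value,
    }

def has_description(lines):
    """True if any line carries a non-empty swishdescription property."""
    for line in lines:
        prop = parse_property_line(line)
        if prop and prop["name"] == "swishdescription" and prop["value"]:
            return True
    return False
```

Summary lines such as "Total Docs" do not match the pattern and are skipped, so the mixed spider and swish-e output does not need to be separated first.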



-- 
Bill Moseley
moseley@hank.org

Received on Fri Dec 2 13:50:09 2005