RE: Filter, 2.0, ishtml

From: <Rainer.Scherg(at)>
Date: Tue Jul 18 2000 - 13:45:35 GMT

>> E.g.:  there are still "small" bugs:
>>     - Filesize is wrong (== 0) on filtered files.

>That could be a minor problem.  I actually check the file size, date,
>etc in my results script.

Of course, this could be done by the CGI script. But some features
should be handled by the index process (e.g. storing a short description of
a document: a meta tag or the first xx words of the doc).

>>     - No Title for filtered files (e.g.: PDF-Subject or Title Fields)
>This seems to be because of the ishtml file extension check.
>How is this for a temporary hack?  Make ishtml into a stub:
>int ishtml(filename)
>char *filename;
>{
>	return 1;
>}

>That handles the annoyance until we can fix it properly ;-)  I don't
>know if we (collectively) would want that in 2.0, however, it may be
>good to list as a "known bug."  I know some folks are anxious to get a
>2.0 release out (can't blame them :-)

This would only be a temporary solution, because HTML stores and handles titles
differently than e.g. PDF files (in what way should a filter script return
the document title?). This has to be discussed...

>>     - Checking only for HTML file on the extension
>>        'html, shtml, htm' e.g. fails, if  - as we do - you
>>        are using apache multiviews features. In this case filenames
>>        are named:, foo.htm.en,, etc.
>That's another problem I have.  I use content negotiation everywhere. 
>Listing dozens of file extensions is a problem.  I really want it to
>index files of type text/*, image/png, application/pdf, and others with
>filters converting the content into HTML or text as needed.  That's
>surely something for a future version.  

Even if Apache's handling of content negotiation is (IMO) braindead ("Error 406",
but that's another story...), more and more people are using content negotiation.
Swish should be able to take care of this.

A quick bugfix would be to check for ".html" etc. at the end of a filename,
and also for ".html." etc. within a filename. But as you described, this
does not fix the PHP problem.
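
A minimal sketch of that quick fix, to make the idea concrete. This is not
the actual swish code: the names ishtml_sketch and has_ext_component are
made up for illustration, and matching is case-sensitive here, which is an
assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `ext` (e.g. ".html") occurs in
 * `filename` either at the very end or directly followed by another '.',
 * as in "foo.html.en". Matching is case-sensitive (an assumption). */
static int has_ext_component(const char *filename, const char *ext)
{
    size_t ext_len = strlen(ext);
    const char *p = filename;

    while ((p = strstr(p, ext)) != NULL) {
        char next = p[ext_len];
        if (next == '\0' || next == '.')
            return 1;
        p++; /* partial hit such as ".htmlx": keep scanning */
    }
    return 0;
}

/* Hypothetical replacement for the extension check: accepts the three
 * extensions named in this thread, at the end or inside the filename. */
int ishtml_sketch(const char *filename)
{
    static const char *const exts[] = { ".html", ".shtml", ".htm" };
    size_t i;

    assert(filename != NULL); /* callers must pass a real filename */

    for (i = 0; i < sizeof exts / sizeof exts[0]; i++) {
        if (has_ext_component(filename, exts[i]))
            return 1;
    }
    return 0;
}
```

With this, "foo.html" and "foo.htm.en" are both accepted, while "foo.php3"
is still rejected, which is exactly the PHP problem mentioned above.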

IMO we need a Conf directive, like:
  ContentType   .php3$     HTML
  ContentType   .html$     HTML
  ContentType   .html.     HTML
  ContentType   .txt$      TEXT
  ContentType   .pdf$      TEXT  (returned by filter)
  ContentType   .xml       XML
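
How such a directive might be evaluated, as a rough sketch only. The
pattern semantics are my assumption, not a settled design: a trailing "$"
anchors the pattern at the end of the filename, anything else is a plain
substring match, "." is a literal dot (no regex), and the first matching
rule wins. The names content_rule, rule_matches and content_type_for are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One "ContentType <pattern> <type>" line from the proposed config. */
struct content_rule {
    const char *pattern;
    const char *type;
};

/* The example rules from the proposal, in the same order. */
static const struct content_rule rules[] = {
    { ".php3$", "HTML" },
    { ".html$", "HTML" },
    { ".html.", "HTML" },
    { ".txt$",  "TEXT" },
    { ".pdf$",  "TEXT" }, /* text as returned by a filter */
    { ".xml",   "XML"  },
};

/* Assumed semantics: trailing '$' means "filename ends with the rest of
 * the pattern"; otherwise the pattern may occur anywhere in the name. */
static int rule_matches(const char *pattern, const char *filename)
{
    size_t plen = strlen(pattern);
    size_t flen = strlen(filename);

    if (plen > 0 && pattern[plen - 1] == '$') {
        plen--; /* drop the anchor, compare against the tail */
        return plen > 0 && flen >= plen &&
               memcmp(filename + flen - plen, pattern, plen) == 0;
    }
    return plen > 0 && strstr(filename, pattern) != NULL;
}

/* Returns the type of the first matching rule, or NULL if none matches. */
const char *content_type_for(const char *filename)
{
    size_t i;

    assert(filename != NULL); /* callers must pass a real filename */

    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (rule_matches(rules[i].pattern, filename))
            return rules[i].type;
    }
    return NULL;
}
```

Under these assumptions "foo.html.en" is picked up by the ".html." rule and
"page.php3" by ".php3$", so neither MultiViews names nor PHP files need
special code in the indexer.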

The reverse configuration would also be possible (maybe better):

  NoContents      .avi .mpeg .wav .some-junk    # only path will be
  IndexContents   HTML  .html .htm .shtml   .htm.  .html. .shtml.   #index
  IndexContents   XML   .xml
  IndexContents   WAP   .wap
  IndexContents   TXT   .txt .txt.
  IndexContents   TXT   .pdf .doc .dot .xls    # (filters are returning TXT)

  FileFilter      .doc
  FileFilter      .dot
  FileFilter      .pdf
  FileFilter      .xls
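
To show how the three directives could interact for a single file, here is
a rough sketch. Everything in it is an assumption for illustration: the
names index_plan, ext_matches and plan_for are made up, the tables are an
abridged copy of the example above, an entry ending in "." is matched
anywhere in the filename while any other entry must match at the end, and
NoContents is checked first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define COUNT(a) (sizeof (a) / sizeof (a)[0])

/* What the indexer would do with one file (hypothetical structure). */
struct index_plan {
    int index_contents;  /* 0: only the path is indexed */
    int needs_filter;    /* 1: run an external filter first */
    const char *parser;  /* "HTML", "XML", "WAP", "TXT", or NULL */
};

struct ext_rule {
    const char *ext;
    const char *parser;
};

/* Abridged tables taken from the example directives. */
static const char *const no_contents[] = { ".avi", ".mpeg", ".wav" };
static const char *const file_filter[] = { ".dot", ".pdf", ".xls" };
static const struct ext_rule index_contents[] = {
    { ".html", "HTML" }, { ".htm", "HTML" }, { ".shtml", "HTML" },
    { ".htm.", "HTML" }, { ".html.", "HTML" }, { ".shtml.", "HTML" },
    { ".xml", "XML" },   { ".wap", "WAP" },
    { ".txt", "TXT" },   { ".txt.", "TXT" },
    { ".pdf", "TXT" },   { ".dot", "TXT" }, { ".xls", "TXT" },
};

/* Assumed rule: "ext." matches inside the name, "ext" only at the end. */
static int ext_matches(const char *filename, const char *ext)
{
    size_t elen = strlen(ext);
    size_t flen = strlen(filename);

    if (elen == 0)
        return 0;
    if (ext[elen - 1] == '.')
        return strstr(filename, ext) != NULL;
    return flen >= elen && strcmp(filename + flen - elen, ext) == 0;
}

struct index_plan plan_for(const char *filename)
{
    struct index_plan plan = { 0, 0, NULL };
    size_t i;

    assert(filename != NULL); /* callers must pass a real filename */

    /* NoContents wins: index the path only. */
    for (i = 0; i < COUNT(no_contents); i++)
        if (ext_matches(filename, no_contents[i]))
            return plan;

    /* FileFilter: the contents go through an external filter first. */
    for (i = 0; i < COUNT(file_filter); i++)
        if (ext_matches(filename, file_filter[i])) {
            plan.needs_filter = 1;
            break;
        }

    /* IndexContents: pick the parser for the (possibly filtered) data. */
    for (i = 0; i < COUNT(index_contents); i++)
        if (ext_matches(filename, index_contents[i].ext)) {
            plan.index_contents = 1;
            plan.parser = index_contents[i].parser;
            break;
        }

    return plan;
}
```

In this sketch "a.pdf" is filtered and then parsed as TXT, "foo.htm.en" is
parsed as HTML, and "clip.avi" is indexed by path only.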

This would make "IndexOnly" obsolete and would result in a redesign of the
engine... (would be a major change...). But if this is done in a modular
design, new parser engines could be installed in the future. So it could be
easy to decide to add a new parser engine (e.g. for WAP files) or to handle
this via external
(just some thoughts)

cu Rainer

 (hey someone with a correct footer line ;-)
,David Norris
  Dave's Web -
  Dave's Weather -
  ICQ Universal Internet Number - 412039
  E-Mail -

Received on Tue Jul 18 09:48:57 2000