Re: RE: Filter, 2.0, ishtml

From: David Norris <dave(at)>
Date: Tue Jul 18 2000 - 08:27:52 GMT

wrote:
> As I remember, the subroutine countwords (index.c) should do all the
> indexing of a file. This routine treats all input like HTML, so
> text input is an "HTML file with no tags".

This makes the most sense to me.  I see no need to check extensions.

> E.g.:  there are still "small" bugs:
>     - Filesize is wrong (== 0) on filtered files.

That could be a minor problem.  I actually check the file size, date,
etc. in my results script.

>     - No Title for filtered files (e.g.: PDF-Subject or Title Fields)

This seems to be because of the ishtml file extension check.

How is this for a temporary hack?  Make ishtml into a stub:
int ishtml(filename)
char *filename;
{
	/* Temporary hack: treat every file as HTML, whatever its extension. */
	return 1;
}

That handles the annoyance until we can fix it properly ;-)  I don't
know if we (collectively) would want that in 2.0; however, it may be
good to list it as a "known bug."  I know some folks are anxious to get
a 2.0 release out (can't blame them :-)

>     - Checking only for HTML file on the extension
>        'html, shtml, htm' e.g. fails, if  - as we do - you
>        are using apache multiviews features. In this case filenames
>        are named:, foo.htm.en,, etc.

That's another problem I have.  I use content negotiation everywhere. 
Listing dozens of file extensions is a problem.  I really want it to
index files of type text/*, image/png, application/pdf, and others with
filters converting the content into HTML or text as needed.  That's
surely something for a future version.  
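
To make the idea concrete, here is a rough sketch of what a type-based
check might look like in place of the extension check.  Nothing like
this exists in the current code as far as I know: the function names
(type_matches, type_is_indexable) and the whitelist are made up for
illustration, and it assumes something else (a filter, or the web
server) has already worked out the MIME type of the file.

```c
#include <string.h>

/* Hypothetical whitelist, built from the types named above. */
static const char *indexable_types[] = {
    "text/*", "image/png", "application/pdf", NULL
};

/* Return 1 if mime matches pattern.  A pattern ending in "/*" matches
 * any subtype of that major type.  The comparison is case-sensitive,
 * which is a simplification made for this sketch. */
static int type_matches(const char *pattern, const char *mime)
{
    size_t plen = strlen(pattern);

    if (plen >= 2 && strcmp(pattern + plen - 2, "/*") == 0)
        return strncmp(pattern, mime, plen - 1) == 0; /* through the '/' */
    return strcmp(pattern, mime) == 0;
}

/* Return 1 if a file of this MIME type should be indexed, else 0. */
int type_is_indexable(const char *mime)
{
    int i;

    for (i = 0; indexable_types[i] != NULL; i++)
        if (type_matches(indexable_types[i], mime))
            return 1;
    return 0;
}
```

With that, a file named foo.htm.en would be accepted as long as its
type comes back as text/html, no matter how many extensions follow.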

,David Norris
  Dave's Web -
  Dave's Weather -
  ICQ Universal Internet Number - 412039
  E-Mail -
Received on Tue Jul 18 01:25:17 2000