> As I remember, the subroutine countwords (index.c) should do all the
> indexing of a file. This routine treats all input like HTML, so
> text input is an "HTML file with no tags".
This makes the most sense to me. I see no need to check extensions.
> E.g.: there are still "small" bugs:
> - Filesize is wrong (== 0) on filtered files.
That could be a minor problem. I actually check the file size, date,
etc. in my results script.
> - No Title for filtered files (e.g.: PDF-Subject or Title Fields)
This seems to be because of the ishtml file extension check.
How is this for a temporary hack? Make ishtml into a stub:
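Roughly something like the following. This is only a sketch: it assumes ishtml takes the filename and returns nonzero for HTML, which may not match the real declaration in index.c, so check the actual signature before dropping it in.

```c
/* Temporary hack: report every file as HTML so the title is always
 * looked for, whatever the extension is.
 * ASSUMPTION: the parameter and return types here are guesses and may
 * differ from the real ishtml in index.c. */
int ishtml(const char *filename)
{
    (void)filename;  /* the extension is deliberately ignored */
    return 1;        /* always "yes, this is HTML" */
}
```

With a stub like this, a call such as ishtml("foo.htm.de") or ishtml("report.pdf") gives the same answer as ishtml("index.html").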
That handles the annoyance until we can fix it properly ;-) I don't
know whether we (collectively) would want that in 2.0; however, it may
be good to list it as a "known bug." I know some folks are anxious to
get a 2.0 release out (can't blame them :-)
> - Checking for HTML files only by extension
> ('html, shtml, htm') fails if, as we do, you are
> using Apache's MultiViews feature. In that case files
> are named: foo.htm.de, foo.htm.en, foo.htm.es, etc.
That's another problem I have. I use content negotiation everywhere.
Listing dozens of file extensions is a problem. I really want it to
index files of type text/*, image/png, application/pdf, and others with
filters converting the content into HTML or text as needed. That's
surely something for a future version.
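A rough sketch of what a type-based lookup of that kind might look like, purely for illustration. Everything in it is hypothetical: the struct, the function name lookup_filter, and the filter command names are made up and do not exist in the current code. Where the content type comes from (the server's type map, for instance) is left out.

```c
#include <stddef.h>
#include <string.h>

/* HYPOTHETICAL table: maps a content type to a filter command.
 * A pattern ending in '/' matches the whole family (e.g. text/...). */
struct filter_rule {
    const char *pattern;     /* content type, or "family/" prefix */
    const char *filter_cmd;  /* NULL means "index the file as-is" */
};

static const struct filter_rule rules[] = {
    { "text/",           NULL            },  /* already text or HTML */
    { "application/pdf", "my-pdf-filter" },  /* placeholder command name */
    { "image/png",       "my-png-filter" },  /* placeholder command name */
};

/* Returns 1 if the type should be indexed and stores its filter
 * command (possibly NULL) in *cmd; returns 0 for unknown types. */
int lookup_filter(const char *mime_type, const char **cmd)
{
    size_t i;

    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        const char *p = rules[i].pattern;
        size_t len = strlen(p);
        int match = (len > 0 && p[len - 1] == '/')
                        ? strncmp(mime_type, p, len) == 0  /* family match */
                        : strcmp(mime_type, p) == 0;       /* exact match */
        if (match) {
            *cmd = rules[i].filter_cmd;
            return 1;
        }
    }
    *cmd = NULL;
    return 0;
}
```

The point of keying on the content type is that foo.htm.de, foo.htm.en and foo.htm.es would all resolve to the same rule, so no list of extensions is needed.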
Dave's Web - http://www.webaugur.com/dave/
Dave's Weather - http://www.webaugur.com/dave/wx
ICQ Universal Internet Number - 412039
E-Mail - firstname.lastname@example.org
Received on Tue Jul 18 01:25:17 2000