RE: Filter, 2.0, ishtml

From: PropheZine Webmaster <bob(at)not-real.prophezine.com>
Date: Tue Jul 18 2000 - 14:28:05 GMT
Hi:

I see lots of talk about html extensions and even *.html. Please do not
forget .shtml extensions.

When our site was originally designed years ago, .shtml was used often.
Now we have parsing on for *.htm*

Bob

-----Original Message-----
From: swish-e@sunsite.berkeley.edu
[mailto:swish-e@sunsite.berkeley.edu]On Behalf Of
Rainer.Scherg@rexroth.de
Sent: Tuesday, July 18, 2000 9:46 AM
To: Multiple recipients of list
Subject: [SWISH-E] RE: Filter, 2.0, ishtml


Hi!


>> E.g.:  there are still "small" bugs:
>>     - Filesize is wrong (== 0) on filtered files.

>That could be a minor problem.  I actually check the file size, date,
>etc in my results script.

Of course, this could be done by the cgi script. But some features
should be done by the index process (e.g. storing a short description of
a document - Meta Tag or first xx words of the doc).


>>     - No Title for filtered files (e.g.: PDF-Subject or Title Fields)
>>
>This seems to be because of the ishtml file extension check.
>How is this for a temporary hack?  Make ishtml into a stub:
>
>int ishtml(char *filename)
>{
>	/* temporary stub: report every file as HTML, whatever its extension */
>	return 1;
>}

>That handles the annoyance until we can fix it properly ;-)  I don't
>know if we (collectively) would want that in 2.0, however, it may be
>good to list as a "known bug."  I know some folks are anxious to get a
>2.0 release out (can't blame them :-)

This would only be a temp. solution, because HTML stores/handles titles
differently than e.g. PDF files (in what way should a filter script return
the document title?) - has to be discussed...


>>     - Checking only for HTML file on the extension
>>        'html, shtml, htm' e.g. fails, if  - as we do - you
>>        are using apache multiviews features. In this case filenames
>>        are named:  foo.htm.de, foo.htm.en, foo.htm.es, etc.
>
>That's another problem I have.  I use content negotiation everywhere.
>Listing dozens of file extensions is a problem.  I really want it to
>index files of type text/*, image/png, application/pdf, and others with
>filters converting the content into HTML or text as needed.  That's
>surely something for a future version.

Even if Apache is (IMO) braindead in handling content negotiation ("Error
406" - but that's another story...), more and more people are using content
negotiation. Swish should be able to take care of this.

A quick bugfix would be checking for ".html" at the end of a filename, etc.
and also for ".html.", etc. within a filename. But as you described, this
would not fix the php problem.
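
As a rough sketch of that quick fix (not the actual swish-e code; has_ext
and the extension list are made up for illustration), ishtml could accept an
extension either at the very end of the name or followed by another dot. It
only looks at the part after the last '/', and it is case-sensitive:

```c
#include <string.h>

/* Hypothetical helper: returns 1 if `ext` occurs in `name` either at the
 * very end or directly followed by another '.', e.g. ".htm" in "foo.htm.de". */
static int has_ext(const char *name, const char *ext)
{
    size_t elen = strlen(ext);
    const char *p = name;

    while ((p = strstr(p, ext)) != NULL) {
        char next = p[elen];
        if (next == '\0' || next == '.')
            return 1;
        p++;    /* keep searching after this occurrence */
    }
    return 0;
}

/* Sketch of an extension check that also accepts multiviews-style names. */
int ishtml(const char *filename)
{
    static const char *exts[] = { ".html", ".htm", ".shtml" };
    const char *base = strrchr(filename, '/');
    size_t i;

    /* Only inspect the last path component, so that directory names
     * such as "www.html.d/" do not trigger a match. */
    base = base ? base + 1 : filename;

    for (i = 0; i < sizeof exts / sizeof exts[0]; i++)
        if (has_ext(base, exts[i]))
            return 1;
    return 0;
}
```

With this, foo.html, foo.shtml and foo.htm.de count as HTML, while foo.php3
still does not, which is the php problem mentioned above.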

IMO we need a Conf directive, like:
  ContentType   .php3$     HTML
  ContentType   .html$     HTML
  ContentType   .html.     HTML
  ContentType   .txt$      TEXT
  ContentType   .pdf$      TEXT  (returned by filter)
  ContentType   .xml       XML
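
To make the idea a bit more concrete, here is a minimal sketch of how such
ContentType rules might be matched. It assumes the patterns are literal text
where only a trailing '$' means "end of filename" (so not full regular
expressions), and that the first matching rule wins. struct content_rule,
rule_matches and content_type are hypothetical names, not existing swish-e
functions:

```c
#include <string.h>

/* One rule per ContentType line. */
struct content_rule {
    const char *pattern;  /* literal text; a trailing '$' anchors it to the end */
    const char *type;     /* parser type to use for matching files */
};

static const struct content_rule rules[] = {
    { ".php3$", "HTML" },
    { ".html$", "HTML" },
    { ".html.", "HTML" },
    { ".txt$",  "TEXT" },
    { ".pdf$",  "TEXT" },  /* TEXT as returned by a filter */
    { ".xml",   "XML"  },
};

/* Returns 1 if `filename` matches `pattern` under the assumptions above. */
static int rule_matches(const char *pattern, const char *filename)
{
    size_t plen = strlen(pattern);
    size_t flen = strlen(filename);

    if (plen > 0 && pattern[plen - 1] == '$') {
        plen--;  /* drop the anchor, then compare against the tail */
        return flen >= plen &&
               strncmp(filename + flen - plen, pattern, plen) == 0;
    }
    return strstr(filename, pattern) != NULL;  /* unanchored: anywhere */
}

/* Returns the type of the first matching rule, or NULL if none matches. */
const char *content_type(const char *filename)
{
    size_t i;

    for (i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (rule_matches(rules[i].pattern, filename))
            return rules[i].type;
    return NULL;
}
```

Here content_type("index.php3") and content_type("foo.html.de") would both
give "HTML", and a file like movie.avi gives NULL, i.e. no parser at all.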


The reverse kind of config would also be possible (maybe better):

  NoContents      .avi .mpeg .wav .some-junk    # only path will be stored...
  IndexContents   HTML  .html .htm .shtml   .htm.  .html. .shtml.   # index as HTML
  IndexContents   XML   .xml
  IndexContents   WAP   .wap
  IndexContents   TXT   .txt .txt.
  IndexContents   TXT   .pdf .doc .dot .xls    # (filters are returning TXT)

  FileFilter      .doc  doc-filter.sh
  FileFilter      .dot  doc-filter.sh
  FileFilter      .pdf  pdf-filter.sh
  FileFilter      .xls  xls-filter.sh

This would make "IndexOnly" obsolete and would result in a redesign of the
index/parser engine... (would be a major change...). But if this is done in
a modular design, new parser engines could be installed in the future. So it
could be easy to decide to add a new parser engine (e.g. for WAP files) or to
handle this via external filters.
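
One way the modular design could look, purely as a sketch: each parser
engine registers itself under a content type name, and the indexer looks the
engine up by type. All names here (parser_fn, register_parser, find_parser)
are hypothetical, and the two toy engines only count words to keep the
example short:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A parser engine takes the document text and returns a word count. */
typedef int (*parser_fn)(const char *text);

struct parser_engine {
    const char *type;   /* e.g. "HTML", "TXT", "WAP" */
    parser_fn   parse;
};

#define MAX_ENGINES 16

static struct parser_engine engines[MAX_ENGINES];
static size_t n_engines;

/* Adds an engine to the table; returns 0 on success, -1 if the table is full. */
int register_parser(const char *type, parser_fn parse)
{
    if (n_engines >= MAX_ENGINES)
        return -1;
    engines[n_engines].type = type;
    engines[n_engines].parse = parse;
    n_engines++;
    return 0;
}

/* Returns the engine registered for `type`, or NULL if there is none. */
parser_fn find_parser(const char *type)
{
    size_t i;

    for (i = 0; i < n_engines; i++)
        if (strcmp(engines[i].type, type) == 0)
            return engines[i].parse;
    return NULL;
}

/* Toy TXT engine: counts whitespace-separated words. */
int parse_txt(const char *text)
{
    int words = 0, in_word = 0;

    for (; *text; text++) {
        if (isspace((unsigned char)*text))
            in_word = 0;
        else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}

/* Toy HTML engine: same, but skips everything between '<' and '>'. */
int parse_html(const char *text)
{
    int words = 0, in_word = 0, in_tag = 0;

    for (; *text; text++) {
        unsigned char c = (unsigned char)*text;

        if (c == '<') {
            in_tag = 1;
            in_word = 0;
        } else if (c == '>') {
            in_tag = 0;
        } else if (in_tag) {
            continue;
        } else if (isspace(c)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}
```

Adding WAP support would then be one more register_parser("WAP", ...) call,
and a type with no registered engine (find_parser returns NULL) could be
handed to an external filter instead.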

(just some thoughts)

cu Rainer



 (hey someone with a correct footer line ;-)
--
,David Norris
  Dave's Web - http://www.webaugur.com/dave/
  Dave's Weather - http://www.webaugur.com/dave/wx
  ICQ Universal Internet Number - 412039
  E-Mail - dave@webaugur.com


Received on Tue Jul 18 10:31:21 2000