
Re: ishtml()

From: <Rainer.Scherg(at)>
Date: Mon Aug 21 2000 - 08:45:46 GMT
IMO the Filter directives should not include a content type.
There should be a separate directive for that.

(please see the thread dated 2000-07-18)

> IMO we need a Conf directive, like:
>   ContentType   .php3$     HTML
>   ContentType   .html$     HTML
>   ContentType   .html.     HTML
>   ContentType   .txt$      TEXT
>   ContentType   .pdf$      TEXT  (returned by filter)
>   ContentType   .xml       XML
> Also a vice versa config would be possible (maybe better):
>   NoContents      .avi .mpeg .wav .some-junk    # only path will be
>   IndexContents   HTML  .html .htm .shtml   .htm.  .html. .shtml. #index as
>   IndexContents   XML   .xml
>   IndexContents   WAP   .wap .wml
>   IndexContents   TXT   .txt .txt.
>   IndexContents   TXT   .pdf .poc .dot .xls    # (filters are returning
>   FileFilter      .doc
>   FileFilter      .dot
>   FileFilter      .pdf
>   FileFilter      .xls
> This would make "IndexOnly" obsolete and would result in a redesign of
> the index/parser engine... (would be a major change...). But if this
> is done in a modular design, new parser engines could be installed
> in the future. So it could be easy to decide to
> add a new parser engine (e.g. for WAP files) or to handle this via
> external filters.
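
To make the quoted ContentType idea concrete, here is a minimal Python sketch of how such a lookup might behave. It is not SWISH-E code. It assumes that each pattern is a regular expression searched against the file path and that the first matching rule wins; neither point is settled in the thread. The names CONTENT_TYPE_RULES and content_type are made up for illustration.

```python
import re

# Rules copied from the quoted proposal, in the order given there.
# Assumption: patterns are regular expressions, so "." matches any
# character and "$" anchors the match at the end of the path.
CONTENT_TYPE_RULES = [
    (r".php3$", "HTML"),
    (r".html$", "HTML"),
    (r".html.", "HTML"),
    (r".txt$", "TEXT"),
    (r".pdf$", "TEXT"),  # text as returned by a filter
    (r".xml", "XML"),
]


def content_type(path, rules=CONTENT_TYPE_RULES):
    """Return the type of the first rule whose pattern matches, else None."""
    for pattern, doc_type in rules:
        if re.search(pattern, path):
            return doc_type
    return None


if __name__ == "__main__":
    for name in ("index.php3", "index.html.en", "notes.txt", "movie.avi"):
        print(name, content_type(name))
```

A file that matches no rule yields None here; what the indexer should do in that case (skip it, or index only the path) is left open, as in the proposal.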

-----Original Message-----
From: []
Sent: Monday, August 21, 2000 10:22 AM
To: Multiple recipients of list
Subject: [SWISH-E] Re: ishtml()

Hi David

On 18 Aug 2000, at 19:57, David Norris wrote:

> I think ishtml() might qualify as a bug.  It doesn't seem to help
> anything.  Do you see any problems with assuming everything to be HTML? 
> No one seems to mention whether they think it is good or bad.  As
> SWISH-E becomes more powerful I think assuming plain text is very
> limiting.
I totally agree. I am thinking of adding more directives to the config 
file. One of them could be:

DefaultFileType Value

Possible values are: txt, html, xml, wap ...
If Value is html, ishtml() can always return 1.
To maintain backwards compatibility, the default value should be txt.
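
A rough Python sketch of the DefaultFileType idea, only to illustrate the intended behaviour. It is not the SWISH-E implementation: the function names, the "#" comment handling, and the suffix list used for the fallback check are assumptions made for this example.

```python
# Values named in this mail; the "..." in the list above is left open.
VALID_TYPES = {"txt", "html", "xml", "wap"}


def parse_default_file_type(config_text):
    """Return the DefaultFileType from a config text, or "txt" if absent."""
    default = "txt"  # keeps the old behaviour when the directive is missing
    for line in config_text.splitlines():
        # Assumption: "#" starts a comment in the config file.
        fields = line.split("#", 1)[0].split()
        if len(fields) == 2 and fields[0] == "DefaultFileType":
            value = fields[1].lower()
            if value not in VALID_TYPES:
                raise ValueError("unknown DefaultFileType: " + value)
            default = value
    return default


def is_html(path, default_type):
    """Stand-in for ishtml(): always true when the default type is html."""
    if default_type == "html":
        return True
    # Placeholder suffix check, not taken from the SWISH-E source.
    return path.lower().endswith((".html", ".htm", ".shtml"))


if __name__ == "__main__":
    default = parse_default_file_type("DefaultFileType html\n")
    print(default, is_html("README", default))
```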

> For example, all filters are assumed to be text.  I am using many
> filters which return HTML.

In the same way, we can extend FileFilter to:
FileFilter  <file-ext> <filter-program> <file-type>
If no file-type is given, then DefaultFileType should be used.
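
As a sketch of how the extended FileFilter line could be read, assuming whitespace-separated fields and an optional third argument (Python, for illustration only; parse_file_filter and the filter program name pdf2html are hypothetical):

```python
def parse_file_filter(line, default_type="txt"):
    """Split a FileFilter line into (extension, program, file type).

    The file type falls back to default_type (the DefaultFileType value)
    when the optional third argument is missing.
    """
    fields = line.split()
    if len(fields) not in (3, 4) or fields[0] != "FileFilter":
        raise ValueError("not a valid FileFilter line: " + line)
    ext, program = fields[1], fields[2]
    file_type = fields[3].lower() if len(fields) == 4 else default_type
    return ext, program, file_type


if __name__ == "__main__":
    # "pdf2html" is a made-up filter program name.
    print(parse_file_filter("FileFilter .pdf pdf2html HTML"))
    print(parse_file_filter("FileFilter .pdf pdf2html", default_type="txt"))
```

With this reading, a filter that returns HTML only has to name its output type once, and existing two-argument lines keep working unchanged.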

> I plan to spend some time on the stemmer.c and soundex.c this weekend. 
> I have been busy during the week.

Good luck.


Received on Mon Aug 21 04:49:44 2000