RE: Modifications

From: <Rainer.Scherg(at)>
Date: Mon Nov 20 2000 - 11:05:15 GMT
Hi Jose!

  I forgot to say: I made some remarks in the source marked with
  "$$"   (e.g. /* $$ ... */).

  Please read these comments. IMO these things have to be discussed.
  The comments can be removed when the problem is solved...

  Search with e.g. grep for "$$".

  e.g. concerning the routines  isoktitle () and isokhtml()  

> Hi Rainer,
> On 19 Nov 2000, at 18:45, wrote:
> > But here are some things which are still to be discussed & to be
> > done...
> > 
> > 
> > Swish-e:
> > 
> > ToDo and some questions for my understanding:
> > 
> >   - there is a "title" passed from "outside" to the index routines...
> >     this seems to be for historic reasons, when swish did only HTML.
> >     mostly the "title" contains the filepath.
> > 
> >     At this point, I let this still be untouched...
> >     but we should get rid of these relics.
> > 
> >     the title should be retrieved within the indexing routine for a
> >     doctype. (XML may be different to HTML or other types...)
> > 
> > 
> IMO, we can use this field to store the summary of the document.

Same point. The "Description" (IMO first words of a document
- not a summary or abstract) can only be retrieved by the
indexing routine for a document itself.
So: IMO we don't need this.

"Description" storage can be done as follows (standard way):

  index_string saves (somehow) the first n bytes of a document's
  words. This string can be stored as the description...

  For each document type there may be also an alternative method (e.g.
  HTML Meta-Tag "Description" can override this string).
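
  A minimal sketch of this idea, not taken from the swish-e source: the
  struct doc_desc, the functions desc_from_content and desc_override, and
  the limit DESC_MAX are all hypothetical names chosen for illustration.
  The default keeps the first bytes of the text (cut back to a whole
  word), and a doctype routine may override it later.

```c
#include <stddef.h>
#include <string.h>

#define DESC_MAX 16  /* hypothetical limit "n"; not a swish-e setting */

/* Hypothetical per-document record; not the real swish-e structure. */
struct doc_desc {
    char text[DESC_MAX + 1];
    int  from_meta;  /* 1 if a doctype-specific override replaced the default */
};

/* Default: keep at most the first DESC_MAX bytes of the document text,
   backing up to the previous space so no word is cut in half. */
void desc_from_content(struct doc_desc *d, const char *content)
{
    size_t n = strlen(content);

    if (n > DESC_MAX) {
        n = DESC_MAX;
        while (n > 0 && content[n] != ' ')
            n--;
        if (n == 0)
            n = DESC_MAX;  /* no usable word boundary: hard cut */
    }
    memcpy(d->text, content, n);
    d->text[n] = '\0';
    d->from_meta = 0;
}

/* Alternative method for a doctype, e.g. an HTML meta "Description".
   An empty or missing value leaves the default untouched. */
void desc_override(struct doc_desc *d, const char *meta)
{
    if (meta == NULL || *meta == '\0')
        return;
    strncpy(d->text, meta, DESC_MAX);
    d->text[DESC_MAX] = '\0';
    d->from_meta = 1;
}
```

  The generic routine would call desc_from_content for every document;
  only the HTML routine would call desc_override afterwards.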

> >   - "indextitleonly" (now: fprop->index_no_content) is not honoured in
> >     each index-routine (only the original one: countwords).
> >     Should be done.
> >
> What is the title for non-HTML docs? Perhaps for non-HTML docs this
> field should be interpreted as "indexsummaryonly".

The "indextitleonly" variable was triggered by the "NoContents" config
setting. So IMO the name of the variable was wrong in the first place.
Therefore I renamed it in the FileProp structure to fprop->index_no_content.

  The docs read as follows:

    NoContents *.suffix1 .suffix2 .suffix3 ...*
        This variable lets you control what files will have their
        contents indexed. If a file with a suffix in this list is
        indexed, only its file name (and not any words in the file)
        will be indexed. This is useful because normally swish-e
        will try to index the contents of every file, even files
        without words (such as images or movies). Suffix checking is

We can enhance this by saving the "doc title" instead of the filename.
But this has to be decided by the indexing routine for each document type
(e.g. not possible for a TXT doc).

But the point is that e.g. countwords_txt doesn't check this variable at
the moment and ignores the "NoContents" setting.
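
Roughly what I mean, as a sketch only. The struct file_prop below is a
cut-down stand-in for FileProp (only index_no_content is the real field
name; real_path is an assumption), and countwords_txt_sketch and
count_words are hypothetical helpers, not the actual swish-e routines:

```c
#include <ctype.h>
#include <stddef.h>

/* Cut-down, hypothetical stand-in for FileProp. Only the field name
   index_no_content comes from the discussion; real_path is assumed. */
struct file_prop {
    const char *real_path;
    int         index_no_content;  /* set when "NoContents" matched the suffix */
};

/* Count whitespace-separated words; stands in for real word indexing. */
static int count_words(const char *s)
{
    int n = 0, in_word = 0;

    for (; *s != '\0'; s++) {
        if (isspace((unsigned char)*s)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            n++;
        }
    }
    return n;
}

/* Text indexing routine that honours the flag: with NoContents set,
   only the file name is indexed and the buffer is never read. */
int countwords_txt_sketch(const struct file_prop *fprop, const char *buffer)
{
    if (fprop->index_no_content)
        return count_words(fprop->real_path);
    return count_words(buffer);
}
```

Every countwords_xxx routine would need the same early check, or the
check could be done once before dispatching to them.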

> > 
> >   - in routine "indexafile": DOCENTRY *e only contains the filename...
> >     (and the misplaced "title")
> >     What do we need this structure for?
> > 
> You are right, it is nonsense if title is equal to filepath. I need to
> look at the code because there are other functions affected by
> DOCENTRY like indexadir and addsortentry.

Yep, right.

indexadir has the same problem.

> I do not remember at this moment if there can be a situation with
> different filepath and title.

Yes there is (historically)...

When an HTML doc is indexed, the title var could contain the <Title> tag.
But this was done on the assumption that every doc is an HTML doc.

IMO this behavior has to be placed into countword_xxx.
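
A rough sketch of how that could look, with the title decided per
doctype instead of being passed in from outside. The enum doc_type and
the function doc_title are hypothetical, and the HTML branch is
deliberately simplified (it only matches a lowercase <title> without
attributes):

```c
#include <stddef.h>
#include <string.h>

enum doc_type { DOC_TXT, DOC_HTML };  /* hypothetical doctype tags */

/* Copy the document title into out (at most outsz - 1 bytes).
   HTML: the text between <title> and </title>, if both are present.
   Anything else, or HTML without a title: fall back to the file path. */
void doc_title(enum doc_type type, const char *path, const char *content,
               char *out, size_t outsz)
{
    const char *src = path;
    size_t len = strlen(path);

    if (type == DOC_HTML) {
        const char *start = strstr(content, "<title>");
        if (start != NULL) {
            const char *end;
            start += strlen("<title>");
            end = strstr(start, "</title>");
            if (end != NULL) {
                src = start;
                len = (size_t)(end - start);
            }
        }
    }
    if (outsz == 0)
        return;
    if (len >= outsz)
        len = outsz - 1;  /* truncate to fit the caller's buffer */
    memcpy(out, src, len);
    out[len] = '\0';
}
```

With this, a TXT doc always gets its path as title, even if its text
happens to contain something that looks like a title tag.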

cu - rainer

Received on Mon Nov 20 11:07:04 2000