RE: Modifications

From: <Rainer.Scherg(at)not-real.rexroth.de>
Date: Mon Nov 20 2000 - 11:05:15 GMT
Hi Jose!

BTW:
  I forgot to say: I made some remarks in the source marked with
  "$$"   (e.g. /* $$ ... */).

  Please read these comments. IMO these things have to be discussed.
  The comments can be removed when the problem is solved...

  Search with e.g. grep for "$$".

  e.g. concerning the routines  isoktitle () and isokhtml()  



> Hi Rainer,
> 
> On 19 Nov 2000, at 18:45, Rainer.Scherg@rexroth.de wrote:
> 
> 
> > But here are some things which are still to be discussed & to be
> > done...
> > 
> > 
> > Swish-e:
> > 
> > ToDo and some questions for my understanding:
> > 
> >   - there is a "title" passed from "outside" to the index routines...
> >     this seems to be for historic reasons, when swish did only HTML.
> >     mostly the "title" contains the filepath.
> > 
> >     At this point, I have left this untouched...
> >     but we should get rid of these relics.
> > 
> >     the title should be retrieved within the indexing routine for a
> >     doctype. (XML may be different to HTML or other types...)
> > 
> > 
> 
> IMO, we can use this field to store the summary of the document.

Same point. The "Description" (IMO first words of a document
- not a summary or abstract) can only be retrieved by the
indexing routine for a document itself.
So: IMO we don't need this.

"Description" storage can be done as follows (standard way):

  index_string saves (somehow) the first n bytes of the leading words
  of a document. This string can be stored as a description...

  For each document type there may be also an alternative method (e.g.
  HTML Meta-Tag "Description" can override this string).
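
A minimal sketch of the "Description" idea above, in C. The names
make_description and choose_description are hypothetical and not taken
from the swish-e source; the sketch assumes the document text is already
available as a NUL-terminated string and simply cuts at a byte limit
(possibly mid-word).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of swish-e): copy at most max_len bytes of
 * the leading words of text into out, collapsing each run of whitespace
 * into a single space. Returns the number of bytes written, excluding the
 * terminating NUL. */
size_t make_description(const char *text, char *out, size_t out_size,
                        size_t max_len)
{
    size_t n = 0;
    int pending_space = 0;

    assert(text != NULL && out != NULL && out_size > 0);
    if (max_len > out_size - 1)
        max_len = out_size - 1;          /* always leave room for the NUL */

    for (; *text != '\0'; text++) {
        if (isspace((unsigned char)*text)) {
            pending_space = (n > 0);     /* drops leading whitespace */
            continue;
        }
        /* Stop when the separator (if any) plus this byte would not fit. */
        if (n + (pending_space ? 1 : 0) + 1 > max_len)
            break;
        if (pending_space) {
            out[n++] = ' ';
            pending_space = 0;
        }
        out[n++] = *text;
    }
    out[n] = '\0';
    return n;
}

/* A doctype routine may supply its own description (e.g. the HTML meta tag
 * "Description"). If it does, that value overrides the leading-words
 * default; otherwise the default is used. */
const char *choose_description(const char *doc_specific,
                               const char *leading_words)
{
    if (doc_specific != NULL && doc_specific[0] != '\0')
        return doc_specific;
    return leading_words;
}
```

Each countwords-style routine could call make_description on the text it
has just parsed and pass any doctype-specific value to choose_description.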





> 
> >   - "indextitleonly" (now: fprop->index_no_content) is not honoured in
> >     each index-routine (only the original one: countwords).
> >     Should be done.
> >
> 
> What is the title for non-html docs? Perhaps for non-html docs this
> field should be interpreted as "indexsummaryonly"

The "indextitleonly" variable was triggered by the "NoContents" config
directive.
So IMO the name of the variable was wrong in the first place.
Therefore I renamed it in the FileProp structure to fprop->index_no_content.

  The docs read as follows:

    NoContents *.suffix1 .suffix2 .suffix3 ...*
        This variable lets you control what files will have their
        contents indexed. If a file with a suffix in this list is
        indexed, only its file name (and not any words in the file)
        will be indexed. This is useful because normally swish-e
        will try to index the contents of every file, even files
        without words (such as images or movies). Suffix checking is
        case-insensitive.


We can enhance this by saving the "doc title" instead of the filename.
But this has to be decided by the indexing routine for a document
(e.g. not possible for a TXT doc).

But the point is, e.g. countwords_txt doesn't check this variable at the
moment and ignores the "NoContents" settings.
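
To make the point concrete, here is a rough C sketch of how a per-doctype
routine could honour the setting. Everything in it is an assumption for
illustration: struct file_prop_sketch is a cut-down stand-in for FileProp
(only index_no_content is a name from this thread), and the function names
are hypothetical, not the actual swish-e code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test whether name ends with suffix, matching the
 * documented "Suffix checking is case-insensitive". */
int has_suffix_nocase(const char *name, const char *suffix)
{
    size_t nlen, slen, i;

    assert(name != NULL && suffix != NULL);
    nlen = strlen(name);
    slen = strlen(suffix);
    if (slen > nlen)
        return 0;
    name += nlen - slen;                 /* compare only the tail */
    for (i = 0; i < slen; i++) {
        if (tolower((unsigned char)name[i]) !=
            tolower((unsigned char)suffix[i]))
            return 0;
    }
    return 1;
}

/* Hypothetical: returns 1 if filename matches any NoContents suffix. */
int index_no_content_for(const char *filename,
                         const char *const *suffixes, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (has_suffix_nocase(filename, suffixes[i]))
            return 1;
    }
    return 0;
}

/* Cut-down stand-in for FileProp; the real structure has more fields. */
struct file_prop_sketch {
    const char *real_path;
    int index_no_content;
};

/* What every countwords-style routine would have to do: when the flag is
 * set, index only the file name and skip the contents. */
const char *text_to_index(const struct file_prop_sketch *fprop,
                          const char *contents)
{
    return fprop->index_no_content ? fprop->real_path : contents;
}
```

The check in text_to_index is the part that is currently missing from
routines such as countwords_txt.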


> > 
> >   - in routine "indexafile": DOCENTRY *e only contains the filename...
> >     (and the misplaced "title")
> >     What do we need this structure for?
> > 
> 
> You are right, it is nonsense if title is equal to filepath. I need to
> look at the code because there are other functions affected by
> DOCENTRY like indexadir and addsortentry.

Yep, right.

 indexadir has the same problem.


> I do not remember at this moment if there can be a situation with
> different filepath and title.

Yes there is (historically)...

When an HTML doc is indexed, the title var could contain the content of
the <Title> tag. But this was done on the assumption that each doc is an
HTML doc.

IMO this behavior has to be placed into countwords_xxx.
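
As a rough illustration of moving title handling into the per-doctype
routine, here is a C sketch. The enum, the function names and the fallback
to the file path are all assumptions for the example, not swish-e code, and
the tag search is deliberately naive (it does not handle attributes on the
title tag).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum doc_type_sketch { DOC_TXT, DOC_HTML };   /* hypothetical doctypes */

/* Case-insensitive substring search; returns NULL when not found. */
const char *find_nocase(const char *hay, const char *needle)
{
    size_t nlen = strlen(needle);

    for (; *hay != '\0'; hay++) {
        size_t i = 0;
        while (i < nlen && hay[i] != '\0' &&
               tolower((unsigned char)hay[i]) ==
               tolower((unsigned char)needle[i]))
            i++;
        if (i == nlen)
            return hay;
    }
    return NULL;
}

/* Copy the text between <title> and </title> into out (truncating if
 * needed). Returns 1 if a title was found, 0 otherwise. */
int html_title(const char *html, char *out, size_t out_size)
{
    const char *start, *end;
    size_t len;

    assert(html != NULL && out != NULL && out_size > 0);
    start = find_nocase(html, "<title>");
    if (start == NULL)
        return 0;
    start += strlen("<title>");
    end = find_nocase(start, "</title>");
    if (end == NULL)
        return 0;
    len = (size_t)(end - start);
    if (len > out_size - 1)
        len = out_size - 1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}

/* The doctype decides where the title comes from; anything without its
 * own notion of a title falls back to the file path. */
void doc_title(enum doc_type_sketch type, const char *path,
               const char *contents, char *out, size_t out_size)
{
    assert(path != NULL && out != NULL && out_size > 0);
    if (type == DOC_HTML && html_title(contents, out, out_size))
        return;
    strncpy(out, path, out_size - 1);
    out[out_size - 1] = '\0';
}
```

With this shape the caller no longer passes a "title" in from outside; an
XML routine could add its own branch without touching the others.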









cu - rainer




Received on Mon Nov 20 11:07:04 2000