RE: New alpha version swish-e-2.1.4

From: <Rainer.Scherg(at)not-real.rexroth.de>
Date: Wed Nov 15 2000 - 09:00:16 GMT
Hi Jose!

Thinking again about "Descriptions", I would not distinguish
between HTML and other doc types.

Normally, you want to have descriptions switched on or not.
IMO it's enough to have a config parameter describing the
length of the description being stored into the database.

I have to look at the code, but - as a first guess - the
description field can be stored along with the path data. So
we only need a routine to gather and store the description info.

The additional amount of uncompressed data is also acceptable:

   10000 files  x 200 bytes  would be about 2 MB, plus
   additional pointer storage if needed.

When storing the data in the existing swish database, we
need only one additional config statement:

  StoreDescription <count>      # 0 = default = off
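
A minimal sketch of what such a gathering routine might do, written in
Python for brevity (swish-e itself is C). The names make_description and
estimate_storage are hypothetical, and treating <count> as a character
count is an assumption; the thread leaves open whether it would count
characters or words.

```python
def make_description(text, count):
    """Return the first `count` characters of `text` with whitespace collapsed.

    A count of 0 means descriptions are switched off (the proposed default).
    """
    if count <= 0:
        return ""
    # Collapse runs of whitespace so newlines and indentation do not waste space.
    collapsed = " ".join(text.split())
    return collapsed[:count]


def estimate_storage(num_files, count):
    """Upper bound in bytes for uncompressed descriptions.

    Assumes one byte per character and ignores any pointer storage.
    """
    return num_files * count
```

With 10000 files and a count of 200, estimate_storage gives the 2,000,000
bytes mentioned above.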


---------


Another topic to discuss:

  Since filtering was implemented, I still have the problem
  that swish-e cannot retrieve the file size.
  This is because filtering is implemented as a pipe stream.

  Saving the filter output to a temp file would not resolve the
  problem, because you would get the size of the filtered output,
  not the size of the original file.

  I would like to see the query for the filesize removed from the
  countwordsXXX-routines and placed outside.

  This would fix the bug, but brings a small performance penalty
  due to the extra request for file information. This routine
  could also be used e.g. to retrieve and store last modification
  dates, etc.

 Any opinions?
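
 The idea can be sketched roughly as follows, in Python for brevity
 (swish-e itself is C). All function names here are hypothetical and
 the word counting is only a stand-in; the point is that the file
 information is queried once on the original file, before and
 independently of the filter pipe.

```python
import os
import subprocess


def get_file_info(path):
    """Query size and last-modification time of the original file."""
    st = os.stat(path)
    return {"size": st.st_size, "mtime": st.st_mtime}


def read_filtered(path, filter_cmd):
    """Feed the file through a filter command and return the filter's output."""
    with open(path, "rb") as src:
        result = subprocess.run(
            filter_cmd, stdin=src, stdout=subprocess.PIPE, check=True
        )
    return result.stdout


def index_file(path, filter_cmd=None):
    # The file information is gathered here, outside the word-counting step,
    # so it always describes the original file, filtered or not.
    info = get_file_info(path)
    if filter_cmd:
        data = read_filtered(path, filter_cmd)
    else:
        with open(path, "rb") as src:
            data = src.read()
    info["words"] = len(data.split())  # stand-in for the real word counting
    return info
```

 The extra os.stat call per file is the small performance penalty
 mentioned above.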


cu - rainer


-----Original Message-----
From: jmruiz@boe.es [mailto:jmruiz@boe.es]
Sent: Monday, November 13, 2000 6:22 PM
To: Rainer.Scherg@rexroth.de
Cc: swish-e@sunsite.berkeley.edu
Subject: Re: [SWISH-E] RE: New alpha version swish-e-2.1.4

> 
> 3. ------------------------
> 
> Wish-list:
> 
>  Right now I need a short description of the found pages on the result
>  page (say: first 200 chars or 40 words of the indexed text).
> 
>  When using file indexing, you can of course display parts of the
>  files. But this doesn't work with http spidering, and it is also not
>  the best way when displaying pdf or doc links.
> 

I have the same problem. But I can use properties because my files
are not html. 

I think that we must provide a way to store something useful in the
title, even if the docs are not html. Currently, for the filesystem
method, swish stores the file path in this field, so the title is
useless and using a property is just a workaround.

>  What are the chances to get the following functionality:
> 
>    Config:
>        StoreDescription  <number>  # 0 = None  >1 Char or word count

What do you think about ContentTitle? (Perhaps there is a more
descriptive word.)

ContentTitle [TXT|XML|HTML] field:length  

E.g., for a text file:
ContentTitle TXT :400  # No field -> take the first 400 characters

For an XML-like file:
ContentTitle XML desc:400  # Take the first 400 characters of the desc field.
                           # If length is omitted, the full length will be used.

For an HTML file:
ContentTitle HTML title  # As default...

And, for other docs, we can use something like...
DefaultTitle [field:length]

Of course, if none is specified, swish-e must work like 1.3 and 2.0.
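
A rough sketch of how the proposed field:length syntax could be parsed,
in Python for brevity (swish-e itself is C). The function names are
hypothetical, and ContentTitle is only the directive proposed in this
mail, not an existing swish-e option.

```python
def parse_title_spec(spec):
    """Split a 'field:length' spec into (field, length); either part may be absent."""
    field, sep, length = spec.partition(":")
    field = field.strip() or None
    length = int(length) if sep and length.strip() else None
    return field, length


def parse_content_title(line):
    """Parse 'ContentTitle <TYPE> <spec>' into (doc_type, field, length)."""
    line = line.split("#", 1)[0]  # drop a trailing comment
    directive, doc_type, spec = line.split()
    if directive != "ContentTitle":
        raise ValueError("not a ContentTitle line")
    return (doc_type.upper(),) + parse_title_spec(spec)
```

A missing field comes back as None (take the text from the start of the
document), and a missing length comes back as None (use the full length).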

If I can fix these two bugs this evening, I will release 2.1.8 tomorrow.

Here is the status of 2.1.8.
From the WishList:
- Show keywords (-k and its library and perl function) (done in 2.1.8)
- UseWords option (done in 2.1.8)
- Thread safe stemmer and SwishStem perl function (done in 2.1.8)

To do:
- Better XML filter (many people have asked for it). It must discard 
tags like <tag/> and may index things like <field prop="bla bla">. 
- Rainer's suggestion about description of files
- PHP extension. Everybody loves Rasmus' work, right?

Hard to do (because of the difficulty)
- Updating, inserting and deleting docs in an index file
- Numerical and date fields and new operators like
<, >, <=, >=... For good performance we need a btree
structure for these fields.

If I am missing something, please let me know.

cu
Jose


Received on Wed Nov 15 09:01:59 2000