Re: Electronic Data Management System

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Jan 12 2002 - 19:10:42 GMT
At 08:34 PM 01/12/02 +0200, Panos Adam wrote:
>a) An employee  has to scan quite a few articles from the daily press and
>store them somewhere.
>My idea was to have them in pdf format instead of going in OCR and have them
>stored either as plain text or html pages.

Oh, they are scanned.  So the docs are plain text, without any associated
meta data?

There are two parts to think about.  One is what data you have to work with.
That is, is it just plain text, or do you add "meta" data such as the date
published, date entered, author, and source (newspaper or magazine)?

If they are plain text files, then swish can index them, but you won't be
able to search on just titles, or rank hits in the titles higher than
normal words.

Now, if you scan the files, and then also add some extra meta data, like
title, date, their source, then you can index that too.  That makes it more
interesting because you can then limit your searches to a date range, or
search within a specific paper or magazine, and ranking is a bit more helpful.

The second part is how to get that data into swish.  If you scan your docs
and just store them as plain text files, then you can index those directly.
But if you scan the files and use some program to add the meta data
(date, title, author...), and it gets stored in an RDBMS, then you can write
a program to read the database, extract the data you want swish to
index, and let swish use that program to gather its input data.
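
To make that last option a bit more concrete, here is a rough sketch of
such a program.  Everything specific in it is made up for illustration:
the "articles" table, its columns, the "published" meta name, and the
in-memory SQLite database standing in for your real RDBMS.  The header
names (Path-Name, Content-Length, Last-Mtime, Document-Type) are what I
recall swish expecting from an external program, so check them against
the documentation for your version before relying on this.

```python
import html
import sqlite3
import sys


def format_record(path, title, published, body, mtime):
    """Return one document (headers, blank line, content) as bytes."""
    # Wrap the fields in minimal HTML so the title and the date can be
    # indexed separately from the body text.
    content = (
        "<html><head><title>{}</title>\n"
        '<meta name="published" content="{}">\n'
        "</head><body>\n{}\n</body></html>\n"
    ).format(html.escape(title), html.escape(published), html.escape(body))
    data = content.encode("utf-8")
    # Content-Length is counted in bytes, so measure the encoded content.
    headers = (
        "Path-Name: {}\n"
        "Content-Length: {}\n"
        "Last-Mtime: {}\n"
        "Document-Type: HTML\n"
        "\n"
    ).format(path, len(data), int(mtime))
    return headers.encode("utf-8") + data


def main():
    # Stand-in for the real database: an in-memory table with one
    # made-up row.  Replace with a connection to your own RDBMS.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE articles "
        "(id INTEGER, title TEXT, published TEXT, body TEXT, mtime INTEGER)"
    )
    db.execute(
        "INSERT INTO articles VALUES (?, ?, ?, ?, ?)",
        (1, "Sample headline", "2002-01-12", "Scanned article text.",
         1010862642),
    )
    query = "SELECT id, title, published, body, mtime FROM articles"
    for doc_id, title, published, body, mtime in db.execute(query):
        path = "articles/{}".format(doc_id)
        sys.stdout.buffer.write(
            format_record(path, title, published, body, mtime))


if __name__ == "__main__":
    main()
```

If I remember the options right, you would then run something like
"swish-e -S prog -i ./dump_articles.py" (the script name is hypothetical)
and list "published" under MetaNames in the config file so it can be
searched on its own.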

>But I really do not care as long as I have a fast machine response when I
>search for something. What would it be the format that suits best?

Searching speed should not be a problem.  I think with swish, if you can
index it you can search it.  (The key being how much memory you have for
indexing.)
For example (and this is just a made-up example), searching 100,000 files
would not be any problem, but you might need, say, 1/2 GB of memory to
index them.


>b) Use of the SWISH software package by another employee as the retrieval
>tool and here are some other questions:
>- How true the SWISH search results are given that it is a common problem
>among search engines to have tons of resulting pages but only few true
>matches? In other words how smart can it be? Is there any tuning method?

Depends on what you are asking.

Swish finds the words you are searching.  That's not a problem.

How well does swish rank?  That's a tough question.  First, it seems to
rank OK for most searches.  You can't compare it to something like Google,
where ranking is based on many factors (such as how many other pages link
to a given page), but it does basic ranking that checks how many times a
given word is in a document, and ranks some words higher than others (e.g.
title words rank much higher than other words).
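
Just to illustrate the general idea (this is not swish's actual ranking
code, and the title weight of 5 is a number I picked out of the air), a
toy version of "count the word, and count title hits for more" might look
like this:

```python
import re
from collections import Counter

TITLE_WEIGHT = 5  # arbitrary illustrative bonus, not a swish-e value


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, title, body):
    """Sum word counts for each query word, weighting title hits more."""
    title_counts = Counter(tokenize(title))
    body_counts = Counter(tokenize(body))
    return sum(
        body_counts[word] + TITLE_WEIGHT * title_counts[word]
        for word in tokenize(query)
    )


def rank(query, docs):
    """Return paths of matching docs, highest score first."""
    scored = [(score(query, d["title"], d["body"]), d["path"]) for d in docs]
    scored.sort(key=lambda pair: -pair[0])
    return [path for points, path in scored if points > 0]
```

With this, a document that has the search word once in its title outranks
one that only mentions it a couple of times in the body.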

Improving the way to customize the ranking is one of the top five "to do"
items for 2.2.

>- Can SWISH generate an alphabetical site map (index) ?

Not sure what you mean.  You can sort search results, yes.

>- Can I change easily all the English screen titles to Greek? (should i have
>a developer's version for doing this?)

swish-e is a command line program.  There is an example CGI program
included that you can customize any way you like.


>c) I am thinking of having my data simply stored in folders. Do you think
>that storing them in a data base (SQL maybe) would be better?

Depends on your data.  A database makes things easier (like selecting your
data ;), and provides some features like caching which may improve
performance.  But you have to look at your data and decide for yourself.
It doesn't matter to swish.

>d) Can I launch the SWISH-E application through ASP scripts?

I assume so.





-- 
Bill Moseley
mailto:moseley@hank.org
Received on Sat Jan 12 19:11:36 2002