Re: cygwin: email archive indexing problem

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Nov 23 2001 - 14:23:08 GMT
At 01:53 AM 11/23/2001 -0800, lanz+usenet@wsl.ch wrote:
>I have different problems and questions concerning swish-e:
>
>- How do you index an email archive with swish-e? Each file in the
>  directory is an email message - I think - mbox format (nnml spool
>  under gnus). Are there pre-configured filters or other tools to get
>  subject and other mail headers as properties? How do I instruct
>  swish-e to not index embedded mail attachments?

You write a program (such as a Perl script, since Perl has modules that
help parse mail), and you use swish-e's -S prog feature to index the
documents it produces.
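
As a rough illustration of what such a program might do, here is a minimal
Python sketch (not the script used for this list's archive).  It assumes
the -S prog convention of printing a small header block (Path-Name,
Content-Length, Last-Mtime, Document-Type) followed by a blank line and the
document; check those header names against the docs for your swish-e
version.  The function name, the meta tag names and the choice to keep only
text/plain parts are made up for the example.

```python
import email
import html
from email import policy


def mail_to_swish_record(path_name, raw, mtime):
    """Turn one raw mail message (bytes) into one -S prog style record."""
    msg = email.message_from_bytes(raw, policy=policy.default)

    # Keep only plain-text parts that are not marked as attachments.
    texts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        texts.append(part.get_content())

    subject = str(msg.get("Subject", ""))
    sender = str(msg.get("From", ""))

    # Hypothetical layout: headers become <meta> tags so that they can be
    # declared as MetaNames / PropertyNames in the swish-e config.
    doc = (
        "<html><head>"
        f"<title>{html.escape(subject)}</title>"
        f'<meta name="subject" content="{html.escape(subject)}">'
        f'<meta name="from" content="{html.escape(sender)}">'
        "</head><body><pre>"
        + html.escape("\n".join(texts))
        + "</pre></body></html>"
    ).encode("utf-8")

    # Content-Length is the size of the document in bytes, not characters.
    header = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(doc)}\n"
        f"Last-Mtime: {mtime}\n"
        "Document-Type: HTML\n"
        "\n"
    ).encode("utf-8")
    return header + doc
```

A driver would loop over the spool directory, write each record to stdout
in binary mode, and be run by swish-e with -S prog.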

The indexing script for the swish-e list archive indexes hypermail archive
files, which are HTML docs.  I decided to go the quick and easy route and
just use regular expression matching.  See
  http://swish-e.org/Discussion/search/index_hypermail.pl

(But don't use IE as it's brain dead about Content-Type headers)


>- With -S fs turned on, I do not get NoContents, FileRules or
>  FileMatch accepted by swish-e (a cygwin problem?).

Please, if something doesn't work, send a tiny document and config file so
someone can reproduce it.

> swish-e seems to
>  scan index.swish-e.temp and index.swish-e.prop.temp, or what does
>  the "Warning: Substitute possible embedded null character(s) in file
>  index.swish-e" (and index.swish-e.temp, index.swish-e.prop,
>  index.swish-e.prop.temp) mean? I have set "NoContents .swish-e .temp
>  .prop" in my config file.

The embedded null message means that your document probably has an embedded
null and was thus truncated.  (It really means that the file system said
that the document was X bytes long, but strlen(buf) says it's Y bytes, and
Y < X.)

It might also mean you are trying to index binary data that contains a null.
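
To see which files would trigger the warning, the X versus Y comparison
above can be reproduced outside swish-e.  This is only a sketch of the
idea in Python, not swish-e's code; the function name is made up.

```python
def embedded_null_check(data):
    """Return (size, c_strlen) for a file's bytes.

    size     -- what the file system reports (X)
    c_strlen -- what C's strlen() would see, i.e. the offset of the
                first null byte, or size if there is none (Y)
    """
    size = len(data)
    first_null = data.find(b"\x00")
    c_strlen = size if first_null == -1 else first_null
    return size, c_strlen
```

If c_strlen < size for a file, that file contains an embedded null and
would be cut off at that point.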

If you use the libxml2 parser you won't have this problem with HTML docs.

BTW - Many people do this:

IndexOnly .html .htm
NoContents .gif .jpeg

But swish will never see the .gif and .jpeg since it's only looking at .htm
and .html.
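
A toy model of that ordering, in Python.  It is not how swish-e is
implemented, just the logic described above: the IndexOnly suffix test
comes first, and NoContents is only consulted for files that pass it.
The function name and return strings are made up.

```python
def classify(filename, index_only, no_contents):
    """Say what would happen to a file under the two directives."""
    # IndexOnly: anything without a listed suffix is never looked at.
    if index_only and not filename.endswith(tuple(index_only)):
        return "skipped"
    # NoContents: only reached by files that passed IndexOnly.
    if filename.endswith(tuple(no_contents)):
        return "name only"
    return "full"
```

With IndexOnly set to .html .htm, classify("logo.gif", ...) gives
"skipped" no matter what NoContents says.  In this model the .gif and
.jpeg suffixes would also have to be listed in IndexOnly for NoContents
to have any effect on them.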


>- Even with option -e I often ran out of memory: "err: Ran out of
>  memory(could not allocate NNNN more bytes)!", even with IgnoreWords
>  instead of IgnoreLimit. Is swish-e not made for scanning of some
>  2000 email messages in a directory (some 2'000'000 words)? I have a
>  reasonable PC with 128MB RAM and free disk space.

128MB isn't really that much, but I think 2000 messages is not a problem (I
can index my /usr/doc (24,000 files) in about 70MB).  You probably want to
make sure you are using a current version from the swish-daily page.  You
should also make sure you are indexing only what you want to index.
Indexing raw mail messages as-is will pull in a lot of things you will
never search for.  If there are a lot of numbers in your source docs, then
maybe you don't need to index them.

I have seen that out of memory error in swish when it was indexing binary
data, but that was a while ago.

>- In WordCharacters I define an extended international character set
>  (only accented letters). Would it help to solve memory problems, if
>  I reduced this set of characters? I am not sure what exactly
>  happens with this extended character set combined with
>  TranslateCharacters set to :ascii7:? Is configuring both
>  redundant/wrong?

TranslateCharacters is done before WordCharacters is checked.  I doubt
using accented chars will change your memory usage much.  Not indexing
digits, as I said, might make a difference, though.
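
A small Python sketch of that order of operations.  The translation step
here uses Unicode decomposition as a stand-in for :ascii7:, which is an
assumption; swish-e has its own translation table.  The function names
are made up.

```python
import unicodedata


def translate_ascii7(text):
    """Approximate :ascii7: by stripping accents (stand-in only)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def tokenize(text, word_chars):
    """Translate first, then split on anything not in word_chars."""
    words, current = [], []
    for ch in translate_ascii7(text).lower():
        if ch in word_chars:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

In this model, by the time the word-character test runs the accents are
already gone, so listing accented letters in word_chars changes nothing.
Leaving the digits out of word_chars drops tokens such as "2001" from the
word list entirely.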


>- I tried different (development) versions of swish-e. On some
>  versions I also get a COALESCE_BUFFER_MAX_SIZE error, but
>  increasing the value in config.h (do not change this!) does not
>  help. Any idea?

Can you get together a small sample set of documents and a small config
file that demonstrates this and send it to me?  Jose will need to look at
that.




Bill Moseley
mailto:moseley@hank.org
Received on Fri Nov 23 14:23:51 2001