Re: cygwin: email archive indexing problem

From: <lanz+usenet(at)>
Date: Mon Nov 26 2001 - 15:56:54 GMT
>>>>> "Bill" == Bill Moseley <> writes:

  Bill> At 01:53 AM 11/23/2001 -0800, wrote:
  >> I have different problems and questions concerning swish-e:
  >> - How do you index an email archive with swish-e? Each file in
  >> the directory is an email message - I think - mbox format (nnml
  >> spool under gnus). Are there pre-configured filters or other tools
  >> to get subject and other mail headers as properties? How do I
  >> instruct swish-e to not index embedded mail attachments?

  Bill> You write a program (such as a perl script that has modules
  Bill> that will help parse mail), and you use swish-e's -S prog
  Bill> feature to index the documents.

  Bill> The indexing script for the swish-e archive indexes hypermail
  Bill> archive files, which are html docs.  I decided to go the quick
  Bill> and easy route and just use regular expression matching.  See

Hm, I'll try that later (I don't know perl). Thank you very much for
the script (idea).
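
As an illustration of the -S prog approach Bill describes, here is a
minimal Python sketch (not the script used for the swish-e archive). It
assumes swish-e reads each document from the program's standard output,
preceded by Path-Name and Content-Length headers and a blank line. The
function names extract_text and swish_record are made up for this
example. Only text/plain parts that are not attachments are kept, so
base64 encoded attachments never reach the indexer.

```python
import email
from email import policy


def extract_text(raw: bytes) -> str:
    """Return Subject, From and the plain-text parts of one mail message."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    pieces = [
        f"Subject: {msg.get('Subject', '')}",
        f"From: {msg.get('From', '')}",
    ]
    for part in msg.walk():
        # Skip multipart containers, attachments and non-text parts.
        if part.get_content_type() == "text/plain" and not part.is_attachment():
            pieces.append(part.get_content())
    return "\n".join(pieces)


def swish_record(path: str, raw: bytes) -> bytes:
    """Build one record: headers, blank line, then the document body.

    The header names are an assumption about the -S prog input format.
    """
    body = extract_text(raw).encode("utf-8")
    header = f"Path-Name: {path}\nContent-Length: {len(body)}\n\n"
    return header.encode("utf-8") + body
```

A driver would loop over the message files and write each
swish_record(...) result to sys.stdout.buffer. Whether subject and
sender should become swish-e properties instead of plain text is a
separate configuration choice that this sketch does not cover.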

  Bill> (But don't use IE as it's brain dead about Content-Type
  Bill> headers)

  >> - With -S fs turned on, I do not get NoContents, FileRules or
  >> FileMatch accepted by swish-e (a cygwin problem?).

  Bill> Please, if something doesn't work, send a tiny document and
  Bill> config file so someone can reproduce it.

With an entry like

FileMatch filename contains "^\d+$"

I get "err: Failed to complie regular expression '^d+$',
pattern. Error: 167970536" with my cygwin compiled swish-e system
(daily snapshot) under WinNT. Similarly for FileRules entries. An

NoContents .overview .temp .prop

seems to be ignored (at least for the .temp and .prop files; see
comment below).
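
The error message echoes the pattern as '^d+$', without the backslash,
which suggests the backslash was consumed before the regular expression
was compiled. The intent of the pattern (file names consisting only of
digits) can be written without any backslash, as in the Python sketch
below. The helper name is_message_file is made up for this example, and
the sketch does not verify that swish-e accepts the same pattern in a
FileMatch or FileRules line.

```python
import re

# Digits only, written as a character class so no backslash is needed.
DIGITS_ONLY = re.compile(r"[0-9]+")


def is_message_file(name: str) -> bool:
    """True for names such as '12345', False for 'index.swish-e.temp'."""
    return DIGITS_ONLY.fullmatch(name) is not None
```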

  >> swish-e seems to scan index.swish-e.temp and
  >> index.swish.e.prop.temp, or what does the "Warning: Substitute
  >> possible embedded null character(s) in file index.swish-e" (and
  >> index.swish-e.temp, index.swish-e.prop, index.swish-e.prop.temp)
  >> mean? I have set "NoContents .swish-e .temp .prop" in my config
  >> file.

  Bill> The embedded null message means that your document probably
  Bill> has an embedded null and was thus truncated.  (It really means
  Bill> that the file system said that the document was X bytes long,
  Bill> but strlen(buf) says it's Y bytes, and Y < X.)

  Bill> It might also mean you are trying to index binary data that
  Bill> contains a null.

I did ask swish-e to create the index in the indexed directory, and
the error message concerning the embedded null was on the swish-e
generated index temporary files! I store my index file in a different
directory now. ;-)
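
The check Bill describes (file size X versus strlen Y) can be
reproduced outside swish-e to find out which files trigger the warning.
The Python sketch below mimics C's strlen by stopping at the first NUL
byte; the function names are made up for this example and are not part
of swish-e.

```python
def c_strlen(data: bytes) -> int:
    """Length of data as C's strlen would see it: up to the first NUL."""
    first_nul = data.find(b"\x00")
    return len(data) if first_nul == -1 else first_nul


def has_embedded_null(data: bytes) -> bool:
    """True when strlen (Y) is shorter than the real size (X)."""
    return c_strlen(data) < len(data)
```

Reading each candidate file in binary mode and passing its contents to
has_embedded_null would flag binary files, such as index files sitting
in the indexed directory.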

  Bill> If you use the libxml2 parser you won't have this problem with
  Bill> HTML docs.

I have mail messages. Text files!?

  Bill> BTW - Many people do this:

  Bill> IndexOnly .html .htm
  Bill> NoContents .gif .jpeg

  Bill> But swish will never see the .gif and .jpeg since it's only
  Bill> looking at .htm and .html.

What is the IndexOnly syntax for just indexing files with NO extension
name? My mail messages are stored in files named NNNNN, where the N
are digits.

  >> - Even with option -e I often ran out of memory: "err: Ran out of
  >> memory(could not allocate NNNN more bytes)!", even with
  >> IgnoreWords instead of IgnoreLimit. Is swish-e not made for
  >> scanning of some 2000 email messages in a directory (some
  >> 2'000'000 words)? I have a reasonable PC with 128MB RAM and free
  >> disk space.

  Bill> 128MB isn't really that much, but I think 2000 messages is not
  Bill> a problem (I can index my /usr/doc (24,000 files) in about
  Bill> 70MB).  You probably want to make sure you are using a current
  Bill> version from the swish-daily page.  You should also make sure
  Bill> you are indexing what you want to index.  Just indexing mail
  Bill> messages will end up with a lot of things you will never
  Bill> search.  If there are a lot of numbers in your source docs,
  Bill> then maybe you don't need to index them.

  Bill> I have seen that out of memory error in swish when it was
  Bill> indexing binary data, but that was a while ago.

I use the latest development version of swish-e. The problems are
caused by the sometimes very large mail attachments embedded in the
mail message files (usually base64 encoded). It would be nice to have
a simple option (filter) in the swish-e configuration file that
would prevent scanning embedded mail attachments (I mean the base64
encoded parts of a mail message with Content-Type:
multipart/mixed). My workaround (config file) is:

DefaultContents TXT
# .overview is generated by Gnus nnml
NoContents .overview
MinWordLimit 3
MaxWordLimit 40
TruncateDocSize 100000
IgnoreWords /cygdrive/c/lanz/private/mail/gnus/expired/stopwords
# IgnoreLimit 50 50
TranslateCharacters :ascii7:
WordCharacters 0123456789abcdefghijklmnopqrstuvwxyz.-
BeginCharacters 0123456789abcdefghijklmnopqrstuvwxyz
EndCharacters 0123456789abcdefghijklmnopqrstuvwxyz
IgnoreFirstChar .-
IgnoreLastChar  .-

Even with digits in WordCharacters, I can index my mail archive now. The
crucial entry was TruncateDocSize 100000 (or MaxWordLimit? I think
40 is the default), which is not exactly what I want to do, but at
least I do not have the memory problems anymore! Usually, the mail
attachments are stored at the end of a mail message, so the first
(not encoded) part of the mail message still gets indexed?

  >> - In WordCharacters I define an extended international character
  >> set (only accented letters). Would it help to solve memory
  >> problems if I reduced this set of characters? I am not sure
  >> what exactly happens with this extended character set combined
  >> with TranslateCharacters set to :ascii7:? Is configuring both
  >> redundant/wrong?

  Bill> TranslateCharacters is done before WordCharacters is checked.
  Bill> I doubt using accented chars will change your memory usage
  Bill> much.  Not indexing digits, as I said, might make a
  Bill> difference, though.

That means, I could set 

TranslateCharacters :ascii7:
WordCharacters 0123456789abcdefghijklmnopqrstuvwxyz.-

and search for the string "Zürich"?
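
To make the question concrete: if accented letters are folded to plain
ASCII before WordCharacters is checked, a word such as "Zürich" would
be handled as "Zurich", both at indexing and at search time. The Python
sketch below shows one common way to do such folding (Unicode
decomposition, then dropping the non-ASCII marks). It is only an
approximation; the table swish-e uses for :ascii7: may map some
characters differently, and fold_to_ascii is a made-up name.

```python
import unicodedata


def fold_to_ascii(word: str) -> str:
    """Strip accents by decomposing characters and dropping non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```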

  >> - I tried different (development) versions of swish-e. On some
  >> versions I also get a COALESCE_BUFFER_MAX_SIZE error, but
  >> increasing the value in config.h (do not change this!) does not
  >> help. Any idea?

  Bill> Can you get together a small sample set of documents and a
  Bill> small config file that demonstrates this and send it to me?
  Bill> Jose will need to look at that.

I do get this error on files with very large base64 encoded mail
attachments, but not with TruncateDocSize set to 100000.

Many thanks, Adrian.
Received on Mon Nov 26 15:57:36 2001