>>>>> "Bill" == Bill Moseley <moseley@hank.org> writes:
Bill> At 01:53 AM 11/23/2001 -0800, lanz+usenet@wsl.ch wrote:
>> I have different problems and questions concerning swish-e:
>>
>> - How do you index an email archive with swish-e? Each file in
>> the directory is an email message, I think in mbox format (nnml
>> spool under gnus). Are there pre-configured filters or other tools
>> to get subject and other mail headers as properties? How do I
>> instruct swish-e to not index embedded mail attachments?
Bill> You write a program (such as a perl script that has modules
Bill> that will help parse mail), and you use swish-e's -S prog
Bill> feature to index the documents.
Bill> The indexing script for the swish-e archive indexes hypermail
Bill> archive files, which are html docs. I decided to go the quick
Bill> and easy route and just use regular expression matching. See
Bill> http://swish-e.org/Discussion/search/index_hypermail.pl
Hm, I'll try that later (I don't know perl). Thank you very much for
the script (idea).
Bill> (But don't use IE as it's brain dead about Content-Type
Bill> headers)
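For readers who do not know Perl either, the -S prog idea can be sketched in Python. This is only a rough sketch under stated assumptions, not the script Bill links to: it assumes one mail message per file, assumes that swish-e's prog input reads records made of `Path-Name:` and `Content-Length:` headers, a blank line, and then exactly that many bytes of content, and the function names (`text_body`, `prog_record`, `emit_directory`) are made up for illustration. Attachments are skipped by keeping only text/plain parts that are not marked as attachments.

```python
import email
import sys
from email import policy
from pathlib import Path


def text_body(raw_bytes):
    """Return selected headers plus the text/plain parts; attachments are skipped."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    lines = []
    for name in ("Subject", "From", "Date"):
        if msg[name]:
            lines.append(f"{name}: {msg[name]}")
    for part in msg.walk():
        # Multipart containers, base64 binaries, HTML etc. are all skipped here.
        if part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        try:
            lines.append(part.get_content())
        except (LookupError, UnicodeError):
            # Unknown or broken charset: leave this part out of the index.
            continue
    return "\n".join(lines)


def prog_record(path, text):
    """Build one record in the header/blank-line/content layout assumed above."""
    body = text.encode("utf-8")
    header = f"Path-Name: {path}\nContent-Length: {len(body)}\n\n"
    return header.encode("utf-8") + body


def emit_directory(directory, out):
    """Write one record per regular file in the directory to a binary stream."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            out.write(prog_record(str(path), text_body(path.read_bytes())))


if __name__ == "__main__" and len(sys.argv) == 2 and Path(sys.argv[1]).is_dir():
    emit_directory(sys.argv[1], sys.stdout.buffer)
```

The script would be named as the program to run under -S prog. Turning Subject and the other headers into real properties would need additional configuration that is not shown here.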
>> - With -S fs turned on, I do not get NoContents, FileRules or
>> FileMatch accepted by swish-e (a cygwin problem?).
Bill> Please, if something doesn't work, send a tiny document and
Bill> config file so someone can reproduce it.
With an entry like
FileMatch filename contains "^\d+$"
I get "err: Failed to complie regular expression '^d+$',
pattern. Error: 167970536" with my cygwin compiled swish-e system
(daily snapshot) under WinNT. Similarly for FileRules entries. An
entry
NoContents .overview .temp .prop
seems to be ignored (at least for the .temp and .prop files; see
comment below).
>> swish-e seems to scan index.swish-e.temp and
>> index.swish-e.prop.temp, or what does the "Warning: Substitute
>> possible embedded null character(s) in file index.swish-e" (and
>> index.swish-e.temp, index.swish-e.prop, index.swish-e.prop.temp)
>> mean? I have set "NoContents .swish-e .temp .prop" in my config
>> file.
Bill> The embedded null message means that your document probably
Bill> has an embedded null and was thus truncated. (It really means
Bill> that the file system said that the document was X bytes long,
Bill> but strlen(buf) says it's Y bytes, and Y < X.)
Bill> It might also mean you are trying to index binary data that
Bill> contains a null.
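The size-versus-strlen comparison Bill describes can be illustrated with a few lines of Python. This is a sketch of the idea only, not swish-e's code, and the function name `embedded_null_report` is made up: X is the byte count of the buffer, Y is the position of the first null byte, which is where a C strlen() would stop.

```python
def embedded_null_report(data: bytes):
    """Mimic the size-versus-strlen check on a document buffer."""
    size = len(data)                       # X: length reported for the file
    nul = data.find(b"\x00")               # position of the first null byte, or -1
    c_strlen = size if nul == -1 else nul  # Y: where a C strlen() would stop
    return size, c_strlen, c_strlen < size


if __name__ == "__main__":
    print(embedded_null_report(b"abc\x00def"))
    print(embedded_null_report(b"plain text"))
```

A file for which the last value is true contains at least one null byte, which is typical of binary data such as index files.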
I had asked swish-e to create the index inside the directory being
indexed, so the embedded-null warnings were about swish-e's own index
and temporary files! I store my index file in a different directory
now. ;-)
Bill> If you use the libxml2 parser you won't have this problem with
Bill> HTML docs.
I have mail messages. Text files!?
Bill> BTW - Many people do this:
Bill> IndexOnly .html .htm
Bill> NoContents .gif .jpeg
Bill> But swish will never see the .gif and .jpeg since it's only
Bill> looking at .htm and .html.
What is the IndexOnly syntax for indexing only files with NO file name
extension? My mail messages are stored in files named NNNNN, where the
N are digits.
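This does not answer the IndexOnly question, but if the messages are fed to swish-e by an external program (the -S prog route), the file selection can be done in that program instead. A minimal sketch, assuming the spool contains the all-digit message files next to other files such as .overview; the names `is_message_file` and `message_files` are hypothetical.

```python
import re
from pathlib import Path

# Explicit [0-9] rather than \d, so only ASCII digits are accepted.
NUMERIC_NAME = re.compile(r"[0-9]+")


def is_message_file(name: str) -> bool:
    """True only for names made entirely of digits, e.g. '12345'."""
    return NUMERIC_NAME.fullmatch(name) is not None


def message_files(directory):
    """Return the all-digit regular files in a directory, in numeric order."""
    found = [p for p in Path(directory).iterdir()
             if p.is_file() and is_message_file(p.name)]
    return sorted(found, key=lambda p: int(p.name))
```

With this filter, files such as .overview or index.swish-e.temp never reach the indexer at all.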
>> - Even with option -e I often ran out of memory: "err: Ran out of
>> memory(could not allocate NNNN more bytes)!", even with
>> IgnoreWords instead of IgnoreLimit. Is swish-e not made for
>> scanning some 2000 email messages in a directory (some
>> 2'000'000 words)? I have a reasonable PC with 128MB RAM and free
>> disk space.
Bill> 128MB isn't really that much, but I think 2000 messages is not
Bill> a problem (I can index my /usr/doc (24,000 files) in about
Bill> 70MB). You probably want to make sure you are using a current
Bill> version from the swish-daily page. You should also make sure
Bill> you are indexing what you want to index. Just indexing mail
Bill> messages will end up with a lot of things you will never
Bill> search. If there are a lot of numbers in your source docs,
Bill> then maybe you don't need to index them.
Bill> I have seen that out of memory error in swish when it was
Bill> indexing binary data, but that was a while ago.
I use the latest development version of swish-e. The problems are
caused by the sometimes very large mail attachments embedded in the
mail message files (usually base64 encoded). It would be nice to have
a simple option (filter) in the swish-e configuration file that
would prevent scanning of embedded mail attachments (I mean the
base64-encoded parts of a mail message with Content-Type:
multipart/mixed). My workaround (config file) is:
DefaultContents TXT
# .overview is generated by Gnus nnml
NoContents .overview
MinWordLimit 3
MaxWordLimit 40
TruncateDocSize 100000
IgnoreWords /cygdrive/c/lanz/private/mail/gnus/expired/stopwords
# IgnoreLimit 50 50
TranslateCharacters :ascii7:
PreSortedIndex
WordCharacters 0123456789abcdefghijklmnopqrstuvwxyzäëïöüÄËÏÖÜâêîôûÂÊÎÔÛáéíóúÁÉÍÓÚàèìòùÀÈÌÒÙãõñÃÕÑ.-
BeginCharacters 0123456789abcdefghijklmnopqrstuvwxyzäëïöüÄËÏÖÜâêîôûÂÊÎÔÛáéíóúÁÉÍÓÚàèìòùÀÈÌÒÙãõñÃÕÑ
EndCharacters 0123456789abcdefghijklmnopqrstuvwxyzäëïöüÄËÏÖÜâêîôûÂÊÎÔÛáéíóúÁÉÍÓÚàèìòùÀÈÌÒÙãõñÃÕÑ
IgnoreFirstChar .-
IgnoreLastChar .-
Even with digits in WordCharacters, I can index my mail archive now.
The crucial entry was TruncateDocSize 100000 (or was it MaxWordLimit?
I think 40 is the default). This is not exactly what I want to do, but
at least I do not have the memory problems anymore! Usually, the mail
attachments are stored at the end of a mail message, so the first
(not encoded) part of the mail message still gets indexed?
>> - In WordCharacters I define an extended international character
>> set (only accented letters). Would it help to solve memory
>> problems, if I reduced this set of characters? I am not sure
>> what exactly happens with this extended character set combined
>> with TranslateCharacters set to :ascii7:? Is configuring both
>> redundant/wrong?
Bill> TranslateCharacters is done before WordCharacters is checked.
Bill> I doubt using accented chars will change your memory usage
Bill> much. Not indexing digits, as I said, might make a
Bill> difference, though.
That means, I could set
TranslateCharacters :ascii7:
WordCharacters 0123456789abcdefghijklmnopqrstuvwxyz.-
and search for the string "Zürich"?
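The order Bill describes (translate first, then check WordCharacters) can be illustrated in Python. This is only a sketch under assumptions: the Unicode decomposition used here is a stand-in for whatever table swish-e uses for :ascii7:, lowercasing is added for the comparison, and the names `fold_to_ascii` and `is_indexable_word` are invented for the example. Whether the search string is translated the same way at query time is also an assumption.

```python
import unicodedata

# The reduced set proposed above: digits, ASCII letters, dot and hyphen.
WORD_CHARACTERS = set("0123456789abcdefghijklmnopqrstuvwxyz.-")


def fold_to_ascii(text: str) -> str:
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def is_indexable_word(word: str) -> bool:
    """Translate first, then check every character against WordCharacters."""
    folded = fold_to_ascii(word)
    return bool(folded) and all(ch in WORD_CHARACTERS for ch in folded)


if __name__ == "__main__":
    print(fold_to_ascii("Zürich"), is_indexable_word("Zürich"))
```

If the translation runs before the character check, the accented letters no longer need to be listed in WordCharacters, because they have already been folded by the time the check happens.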
>> - I tried different (development) versions of swish-e. On some
>> versions I also get a COALESCE_BUFFER_MAX_SIZE error, but
>> increasing the value in config.h (do not change this!) does not
>> help. Any idea?
Bill> Can you get together a small sample set of documents and a
Bill> small config file that demonstrates this and send it to me?
Bill> Jose will need to look at that.
I do get this error on files with very large base64 encoded mail
attachments, but not with TruncateDocSize set to 100000.
Many thanks, Adrian.
Received on Mon Nov 26 15:57:36 2001