I've separated the index into 12 indexes (one for each kind of file).
The index for the HTML files is the biggest one. The first time the
segmentation fault appeared I tried to split it further, but 20 or more
additional indexes are too many. The segmentation fault may reappear any day,
because every day there are more documents. Really, it's a workaround, not a
solution.
I went back to Swish-E 2.2.3 today, so our document server is online on our
intranet. (But I am still looking for a solution: we want to update to a
newer version of Swish-E. The alternative is to use another search engine,
which is not a good option ...)
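The split-index workaround can be sketched roughly as follows. This is only an illustration, not the setup actually used here: the helper names (split_into_batches, render_config, build_configs), the "part" prefix and the batch size are made up for the example, and it assumes directory names without spaces. The directives are the ones from the config file quoted further down in this thread.

```python
def split_into_batches(paths, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def render_config(dirs, index_file):
    """Render a minimal Swish-E config text for one partial index."""
    lines = [
        "IndexDir " + " ".join(dirs),   # assumes no spaces in directory names
        "IndexOnly .html",
        "IndexFile " + index_file,
        "IndexContents HTML .html",
        "DefaultContents HTML",
    ]
    return "\n".join(lines) + "\n"


def build_configs(dirs, batch_size, prefix="part"):
    """Return {config_filename: config_text}, one entry per batch of directories."""
    configs = {}
    batches = split_into_batches(sorted(dirs), batch_size)
    for n, batch in enumerate(batches, start=1):
        name = f"{prefix}{n:02d}"
        configs[name + ".conf"] = render_config(batch, f"./{name}.swish-e")
    return configs


if __name__ == "__main__":
    # Hypothetical subdirectories; replace with the real ones.
    for filename, text in build_configs(["docs/a", "docs/b", "docs/c"], 2).items():
        print("#", filename)
        print(text)
```

Each generated config would then be indexed on its own (swish-e -c part01.conf, and so on), and all partial indexes are named together at search time, e.g. swish-e -w someword -f part01.swish-e part02.swish-e.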
Thank you for your advice.
P.S.: Swish-E is difficult to handle if there are documents that don't
conform to the standard, for example HTML files with control characters or a
lot of additional meta tags. For instance, there is a segmentation fault (in
Swish-E 2.2.3) if "MetaNameAlias swishdescription body description abstract"
is in the configuration file. The error appears because there are some
documents which use the meta tags description and abstract. But all these
problems are solved in a usable way.
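One way to find the problematic files before indexing is to scan them for control characters. The following is a minimal sketch, not part of Swish-E: the function names are hypothetical, and it assumes that tab, line feed and carriage return are the only control bytes that should be tolerated.

```python
# Control bytes that are normal in text files: tab, line feed, carriage return.
ALLOWED = {0x09, 0x0A, 0x0D}


def find_control_bytes(data):
    """Return (offset, byte) pairs for ASCII control bytes other than ALLOWED."""
    return [
        (offset, byte)
        for offset, byte in enumerate(data)
        if (byte < 0x20 and byte not in ALLOWED) or byte == 0x7F
    ]


def scan_file(path):
    """Read a file as raw bytes and report suspicious control bytes."""
    with open(path, "rb") as handle:
        return find_control_bytes(handle.read())
```

Working on raw bytes avoids decoding errors in badly encoded files. Files for which scan_file returns a non-empty list could be cleaned or left out of the index.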
> A successful workaround that I'm using for very large sets of files is to
> index part of the files as a separate index and just specify multiple
> files when searching. Swish-e is fast enough that it doesn't seem to have
> had much of a negative impact.
> -----Original Message-----
> From: Dietmar Rabich [mailto:firstname.lastname@example.org]
> Sent: Tuesday, April 20, 2004 3:11 AM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: Segmentation fault while indexing
> it is a little bit difficult ... There are 3 paths with html files.
> path1: 3 html files
> path2: 38,278 html files
> path3: 64,168 html files
> The whole directory is 5,565,272 kByte - unzipped. And some of the files
> are confidential.
> Here is an extract of the old header file:
> > (I cannot contact you directly because of your email address)
> > If possible, could you gzip "path1" and "path2" and make them
> > available to me so I can try them?
> > cu
> > Jose
> > Dietmar Rabich wrote:
> > >Some more information:
> > >
> > >Swish-E crashes in many other cases too. In each case there are many
> > >documents to be indexed. Here is an example:
> > >
> > >..
> > >Removing very common words...
> > >no words removed.
> > >Writing main index...
> > >Sorting words ...
> > >Sorting 170,500 words alphabetically
> > >Writing header ...
> > >Writing index entries ...
> > > Writing word text: 20%Segmentation fault
> > >
> > >cu Dietmar.
> > >
> > >
> > >
> > >>I've just run into a problem while indexing HTML files. I have updated
> > >>Swish-E from version 2.2.3 to 2.4.2. Indexing with the old version
> > >>works fine. Now I get a message "segmentation fault".
> > >>
> > >>The config file is simple:
> > >>
> > >>IndexDir ../../path1 ../../path2
> > >>IndexOnly .html
> > >>IndexReport 3
> > >>IndexFile ./test.swish-e
> > >>IndexContents HTML .html
> > >>DefaultContents HTML
> > >>StoreDescription HTML <body> 2000
> > >>...
Received on Tue Apr 20 05:15:44 2004