Hi!
I'm trying to index a database of URLs and corresponding pages that somebody
spidered for me. I use the -S prog option with a little Perl script that just
puts out the path and size headers and then the page content. This works fine
with two of these databases, but the third one (the biggest one; the output to
swish is about 276 MB) gives me a segmentation fault. I was first using
2.1-dev-25-2002-07-16, then upgraded to 2002-08-03 and still got the same
result.
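
For context, the records such a script feeds to swish-e look roughly like
this. This is only a minimal Python sketch of the format, not the actual Perl
script: the function name format_record and the sample page are made up for
illustration, and it assumes the Path-Name and Content-Length headers followed
by a blank line, with the length counted in bytes rather than characters.

```python
import sys


def format_record(path: str, content: bytes) -> bytes:
    """Build one document record: headers, a blank line, then the raw content."""
    # Content-Length is the number of bytes in the body (an assumption of this
    # sketch), so the body is handled as bytes, not as decoded text.
    header = f"Path-Name: {path}\nContent-Length: {len(content)}\n\n"
    return header.encode("utf-8") + content


if __name__ == "__main__":
    # Hypothetical sample data standing in for the spidered database.
    pages = [
        ("http://www.example.com/", b"<html><body>Hello</body></html>\n"),
    ]
    for path, content in pages:
        # Write bytes directly so no newline translation or re-encoding
        # changes the byte count announced in the header.
        sys.stdout.buffer.write(format_record(path, content))
```

A length header that does not match the bytes actually written is one thing
worth ruling out with input of this size, since the indexer would then read
the next record from the wrong offset.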
Here's the output I get:
.
.
.
http://www.consors.de/home/ueber_consors/werbung/werbung_buchen/ - Using HTML2 parser - (1363 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 807572 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: ...^M Writing word text: 10%^M Writing word text: 20%^M
Writing word text: 30%^M Writing word text: 40%^M Writing word text: 50%
Writing word text: 60%^M Writing word text: 70%^M Writing word text: 80%
Writing word text: 90%^M Writing word text: 100%^M Writing word text: Complete
Writing word hash: ...^M Writing word hash: 10%^M Writing word hash: 20%^M
Writing word hash: 30%^M Writing word hash: 40%^M Writing word hash: 50%
Writing word hash: 60%^M Writing word hash: 70%^M Writing word hash: 80%
Writing word hash: 90%^M Writing word hash: 100%^M Writing word hash: Complete
Writing word data: ...^M Writing word data: 9%^M Writing word data: 19%^M
Writing word data: 29%^M Writing word data: 39%^M Writing word data: 49%
Segmentation fault
Here's my config:
# ----- Example 1 - limit by extension -------------
#
# Please see the swish-e documentation for
# information on configuration directives.
# Documentation is included with the swish-e
# distribution, and also can be found on-line
# at http://sunsite.berkeley.edu/SWISH-E/
#
#
# This example demonstrates how to limit
# indexing to just .htm and .html files.
#
#---------------------------------------------------
# By default, swish creates an index file in the current directory
# called "index.swish-e" (and swish uses this name by default when
# searching). This is convenient, but not always desired.
IndexFile /home/khalid/thomas/googleres.swish-e
# Although you can specify which files or directories to
# index on the command line with -i, it's common to specify
# it here. Note that these are relative to the current directory.
# Index two directories, "docs" (below current directory) and
# "/home/otherdocs", and within those directories (and all sub
# directories) index only files ending in .html and .htm.
#IndexDir /bin/cat
IndexDir /usr/bin/perl
Indexreport 3
ParserWarnLevel 3
obeyRobotsNoIndex yes
IndexContents HTML2 .htm .html .shtml
IndexContents XML2 .xml
DefaultContents HTML2
ReplaceRules replace http://
ConvertHTMLEntities yes
MetaNames swishtitle swishdocpath keywords address summary faculty
#MetaNames all
#MetaNames date
MetaNameAlias faculty fakultaet
MetaNameAlias keywords categories category gegenstand index info information inhalt issue kategorie kategorien keyords keyphrase keys keyswords keyword keywordsclassifics keywordsd pagetopic product products project projekt rubrik rubrik1 rubrik2 rubrikdescription schlagworte seitenthema sideinfo stichwort stichworte subcategories subject thema themen topic topics
# MetaNameAlias summary abstact abstract abstracts abstrct beschreibung classification content contents coverage decription desciption desciription describtion descriotion descripion descripition descripiton description descriptions descripton descrition deskription desription discription discritiption
#MetaNameAlias date changed createdat datepubication geaendert generated lastchange lastchanged lastmodified lastupdate pubdate published publishingdate pushdate released revised searchpublicationdate time timecreated timemodified timestamp updated updatedat updatet validfrom year
MetaNameAlias address addres adress adresse area city company contact contacts country email firma firmafax institution kontaktperson location lupdate modification modificationdate modified modifiedby modifieddate name org organiastion organisation organization ort placename plz postinfo region standort town who
MetaNameAlias swishtitle searchtitle subtitle titel title
#MetaNameAlias all swishdefault swishtitle keywords address summary links date faculty
HTMLLinksMetaname links
ImageLinksMetaName images # swishdefault?
# AbsoluteLinks yes
UndefinedMetaTags index
UndefinedXMLAttributes index
PropertyNames keywords links
PropertyNamesIgnoreCase keywords links swishdocpath swishtitle
#PropertyNamesComparecase
#PropertyNamesNumeric
#PropertyNamesDate
#PreSortedIndex
StoreDescription HTML <body>
#PropCompressionLevel 9?
#IgnoreTotalWordCountWhenRanking
WordCharacters abcdefghijklmnopqrstuvwxyzäöüß-0123456789
IgnoreFirstChar - ß
IgnoreLastChar -
Buzzwords C++ 2D 3D TCP/IP X11 X11R6 modula-2 C#
IgnoreWords File: /usr/share/doc/swish-e/examples/stopwords/german.txt
#UseWords File:???
#IgnoreLimit 80 ??
#IgnoreMetaTags
#IgnoreNumberChars 0123456789
#IndexComments
#TranslateCharacters ??
#BumpPositionCounterCharacters | ?
# If you wish to follow symbolic links use the following.
# Note that the default is "no". If you are indexing many
# files, and you do not have any symbolic links, you may
# still want to set this to "yes". This will avoid an extra
# lstat system call for every file while indexing.
#SwishProgParameters /home/khalid/thomas/muell.txt
SwishProgParameters /home/khalid/thomas/makeghoulindex.pl
# end of example
I hope somebody can help me!
Thanks
Khalid
Received on Sun Aug 4 16:10:19 2002