Segmentation Fault

From: Khalid Shukri <khalid(at)not-real.einblick.de>
Date: Sun Aug 04 2002 - 16:06:40 GMT
Hi!
I'm trying to index a database of URLs and their corresponding pages that
somebody spidered for me. I use the -S prog option with a little Perl script
that just prints the path and size headers followed by the page content. This
works fine with two of these databases, but the third one (the biggest; the
output fed to swish-e is about 276 MB) gives me a segmentation fault. I was
first using 2.1-dev-25-2002-07-16, then upgraded to 2002-08-03, and still got
the same result.
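
For readers unfamiliar with -S prog input, the kind of feeder described above can be sketched roughly as follows. This is not the original Perl script (which is not shown in the post) but a minimal Python illustration of the same idea: for each document, print a `Path-Name` header, a `Content-Length` header, a blank line, and then the content. The function names `write_document` and `feed` are hypothetical. The sketch assumes the size header must be the length of the body in bytes, so it works on `bytes` throughout.

```python
from typing import BinaryIO, Iterable, Tuple


def write_document(out: BinaryIO, path: str, content: bytes) -> None:
    """Write one document: headers, a blank line, then the raw body."""
    # Content-Length is computed from the encoded bytes, not from a
    # character count, so multi-byte characters are counted correctly.
    header = f"Path-Name: {path}\nContent-Length: {len(content)}\n\n"
    out.write(header.encode("utf-8"))
    out.write(content)


def feed(out: BinaryIO, records: Iterable[Tuple[str, bytes]]) -> None:
    """Write every (path, content) pair to the output stream in order."""
    for path, content in records:
        write_document(out, path, content)
```

A script built on this would call `feed(sys.stdout.buffer, records)`, where `records` comes from wherever the spidered pages are stored.
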
Here's the output I get:

.
.
.
http://www.consors.de/home/ueber_consors/werbung/werbung_buchen/ - Using HTML2 parser -  (1363 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 807572 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: ...^M  Writing word text:  10%^M  Writing word text:  20%^M  Writing word text:  30%^M  Writing word text:  40%^M  Writing word text:  50%
  Writing word text:  60%^M  Writing word text:  70%^M  Writing word text:  80%
  Writing word text:  90%^M  Writing word text: 100%^M  Writing word text: Complete
  Writing word hash: ...^M  Writing word hash:  10%^M  Writing word hash:  20%^M  Writing word hash:  30%^M  Writing word hash:  40%^M  Writing word hash:  50%
  Writing word hash:  60%^M  Writing word hash:  70%^M  Writing word hash:  80%
  Writing word hash:  90%^M  Writing word hash: 100%^M  Writing word hash: Complete
  Writing word data: ...^M  Writing word data:   9%^M  Writing word data:  19%^M  Writing word data:  29%^M  Writing word data:  39%^M  Writing word data:  49%
Segmentation fault
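
Since the crash happens only with the largest input, one thing that could be checked independently of swish-e is whether the stream produced by the script is well formed from start to finish. The following is a rough sketch of such a check, not a diagnosis of the crash above. It assumes each document consists of `Path-Name` and `Content-Length` headers, a blank line, and exactly `Content-Length` bytes of body. The function name `read_documents` is hypothetical, and treating header names case-insensitively is a choice made for this sketch only.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def read_documents(stream: BinaryIO) -> Iterator[Tuple[str, int]]:
    """Yield (path, length) for each document; raise ValueError if malformed."""
    while True:
        line = stream.readline()
        if not line:
            return  # clean end of stream between documents
        headers = {}
        # Collect header lines until the blank line that ends the block.
        while line not in (b"\n", b"\r\n"):
            if not line:
                raise ValueError("stream ended inside a header block")
            name, sep, value = line.decode("latin-1").partition(":")
            if not sep:
                raise ValueError(f"malformed header line: {line!r}")
            headers[name.strip().lower()] = value.strip()
            line = stream.readline()
        try:
            path = headers["path-name"]
            length = int(headers["content-length"])
        except (KeyError, ValueError) as exc:
            raise ValueError(f"bad headers: {headers!r}") from exc
        # The body must contain exactly as many bytes as the header claims.
        body = stream.read(length)
        if len(body) != length:
            raise ValueError(f"{path}: expected {length} bytes, got {len(body)}")
        yield path, length
```

Feeding the script's output through `read_documents(sys.stdin.buffer)` and counting the results would show whether every document in the 276 MB stream parses, and if not, roughly where the first inconsistency sits.
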

Here's my config:

# ----- Example 1 - limit by extension -------------
#
#  Please see the swish-e documentation for
#  information on configuration directives.
#  Documentation is included with the swish-e
#  distribution, and also can be found on-line
#  at http://sunsite.berkeley.edu/SWISH-E/
#
#
#  This example demonstrates how to limit
#  indexing to just .htm and .html files.
#
#---------------------------------------------------

# By default, swish creates an index file in the current directory
# called "index.swish-e" (and swish uses this name by default when
# searching).  This is convenient, but not always desired.

IndexFile /home/khalid/thomas/googleres.swish-e


# Although you can specify which files or directories to
# index on the command line with -i, it's common to specify
# it here.  Note that these are relative to the current directory.

# Index two directories, "docs" (below current directory) and
# "/home/otherdocs", and within those directories (and all sub
# directories) index only files ending in .html and .htm.

#IndexDir /bin/cat
IndexDir /usr/bin/perl

Indexreport 3
ParserWarnLevel 3

obeyRobotsNoIndex yes

IndexContents HTML2 .htm .html .shtml
IndexContents XML2 .xml
DefaultContents HTML2
ReplaceRules replace http://
ConvertHTMLEntities yes
MetaNames swishtitle swishdocpath keywords address summary faculty
#MetaNames all
#MetaNames date
MetaNameAlias faculty fakultaet
MetaNameAlias keywords categories category gegenstand index info information inhalt issue kategorie kategorien keyords keyphrase keys keyswords keyword keywordsclassifics keywordsd pagetopic product products project projekt rubrik rubrik1 rubrik2 rubrikdescription schlagworte seitenthema sideinfo stichwort stichworte subcategories subject thema themen topic topics
# MetaNameAlias summary abstact abstract abstracts abstrct beschreibung classification content contents coverage decription desciption desciription describtion descriotion descripion descripition descripiton description descriptions descripton descrition deskription desription discription discritiption
#MetaNameAlias date changed createdat datepubication geaendert generated lastchange lastchanged lastmodified lastupdate pubdate published publishingdate pushdate released revised searchpublicationdate time timecreated timemodified timestamp updated updatedat updatet validfrom year
MetaNameAlias address addres adress adresse area city company contact contacts country email firma firmafax institution kontaktperson location lupdate modification modificationdate modified modifiedby modifieddate name org organiastion organisation organization ort placename plz postinfo region standort town who
MetaNameAlias swishtitle searchtitle subtitle titel title
#MetaNameAlias all swishdefault swishtitle keywords address summary links date faculty
HTMLLinksMetaname links
ImageLinksMetaName images # swishdefault?
# AbsoluteLinks yes
UndefinedMetaTags  index
UndefinedXMLAttributes index
PropertyNames keywords links
PropertyNamesIgnoreCase keywords links swishdocpath swishtitle
#PropertyNamesComparecase
#PropertyNamesNumeric
#PropertyNamesDate
#PreSortedIndex
StoreDescription HTML <body>
#PropCompressionLevel 9?
#IgnoreTotalWordCountWhenRanking
WordCharacters abcdefghijklmnopqrstuvwxyz-0123456789
IgnoreFirstChar - 
IgnoreLastChar -
Buzzwords C++  2D 3D TCP/IP X11 X11R6 modula-2 C#
IgnoreWords File: /usr/share/doc/swish-e/examples/stopwords/german.txt
#UseWords File:???
#IgnoreLimit 80 ??
#IgnoreMetaTags
#IgnoreNumberChars 0123456789
#IndexComments
#TranslateCharacters ??
#BumpPositionCounterCharacters | ?
# If you wish to follow symbolic links use the following.
# Note that the default is "no".  If you are indexing many
# files, and you do not have any symbolic links, you may
# still want to set this to "yes".  This will avoid an extra
# lstat system call for every file while indexing.

#SwishProgParameters /home/khalid/thomas/muell.txt

SwishProgParameters /home/khalid/thomas/makeghoulindex.pl

# end of example


I hope somebody can help me!
Thanks
Khalid
Received on Sun Aug 4 16:10:19 2002