Hi,
We have been happily trialling swish-e on our client's HPUX 10.20 systems for
several months now, on sets of between 20 and 10000 files. After
realizing that we needed to exclude a small number of files in arbitrary
places from being indexed, I turned on obeyRobotsNoIndex (as the files were
already set up to block indexing from the web).
The indexing process then consistently core dumped on our largest data set
(at the same file every time) unless I forced it to use HTML parsing. Of
course obeyRobotsNoIndex then has no useful effect, as it requires the HTML2
parser (which I want to use for all indexing anyway). Not wanting to
end up maintaining a lot of arbitrary FileRules entries unnecessarily, I have
attempted to debug the problem. What I have found so far suggests
that a problem reported last year may still be present in the 2.2 code base.
From message 4541 in the archive:
Bill Moseley wrote:
> There was a bug in the code that handled removing files (when that no index
> meta tag is found swish has to back-out the additions to the index up to
> that point for the current file). But that should have been fixed. Maybe
> there's still another problem.
I have only been able to reproduce this on the large data set (9k+ files).
In my testing so far the segmentation violation occurs only if some files
with the
<meta name="robots" content="noindex"> tag
have been encountered earlier in the indexing process (skipped "due to Robots
Exclusion Rule in meta tag").
The segv occurs in the CompressCurrentLocEntry routine (compress.c) as
swish-e is indexing the next indexable file.
The exact point of failure is line 594: next = l->next
At that point the currentChunkLocation instance ptr (l) has been set to null
by a prior pass through the hash list walker (a 'for' loop that only
terminates when the entry->currentlocation marker matches 'l').
It appears that either the 'for' loop is lacking an extra loop termination
test or, more likely, a prior modification to the currentChunkLocationList
for the 'ENTRY' instance has failed to set the correct link when terminating
the currentChunkLocation chain. My guess is that this could have happened
during a remove_last_file_from_list() call made during processing of the
robot excluded files.
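To illustrate the failure mode I am describing, here is a minimal sketch in C.
The names (struct loc, struct entry, walk_chain) are hypothetical stand-ins of
my own and are not the actual swish-e structures or the code in compress.c. It
only shows how a walker that stops solely on a marker match runs off the end of
the chain when the marker no longer points at a node in that chain, and what an
extra termination test would look like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real swish-e LOCATION/ENTRY types. */
struct loc {
    struct loc *next;
    int filenum;
};

struct entry {
    struct loc *chunk_list;  /* head of the chain being walked */
    struct loc *current;     /* marker the walker expects to reach */
};

/*
 * Walk the chain until the marker is reached, as the loop described
 * above does. Returns the number of nodes visited before the marker,
 * or -1 if the chain ended without ever reaching the marker.
 */
static int walk_chain(const struct entry *e)
{
    const struct loc *l;
    const struct loc *next;
    int visited = 0;

    assert(e != NULL);

    for (l = e->chunk_list; l != e->current; l = next) {
        /* Extra termination test: without it, a stale marker means
         * l becomes NULL and the l->next below faults. */
        if (l == NULL)
            return -1;
        next = l->next;
        visited++;
    }
    return visited;
}
```

With a chain a -> b -> c and the marker on c, the walk visits two nodes. If c is
unlinked (b.next set to NULL) while the marker still points at c, the guarded
walk returns -1 where an unguarded one would dereference a null pointer. A
guard like this would only hide the symptom; if my guess is right, the real fix
belongs wherever the chain and marker get out of step.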
I have run out of time to follow this further on my own. Can I register this
as a bug on SourceForge, or is this posting sufficient to have someone more
familiar with the code investigate?
Some relevant info from gdb:
gdb /local/bin/swish-e-2.2.2 core
GNU gdb 5.2
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "hppa2.0-hp-hpux10.20"...
Core was generated by `swish-e-2.2.2'.
Program terminated with signal 11, Segmentation fault.
warning: The shared libraries were not privately mapped; setting a
breakpoint in a shared library will not work until you rerun the program.
Reading symbols from /usr/local/bin/swish-e...done.
Reading symbols from /usr/local/lib/libxml2.sl.7...done.
Reading symbols from /usr/lib/libM.1...done.
Reading symbols from /usr/local/lib/libz.sl...done.
Reading symbols from /usr/lib/libc.1...done.
Reading symbols from /usr/lib/libdld.1...done.
#0 CompressCurrentLocEntry (sw=0x40048be8, indexf=0x400e6530, e=0x401e89d4)
at compress.c:594
594 next = l->next;
(gdb) bt
#0 0x000492ec in CompressCurrentLocEntry (sw=0x40048be8, indexf=0x400e6530,
e=0x401e89d4) at compress.c:594
#1 0x000129d0 in _dmatherr () at index.c:940
#2 0x00034d84 in printfile (sw=0x40048be8,
filename=0x400d9b00 "/data/WWW/cwco/index.html") at fs.c:601
#3 0x00034ea8 in printfiles (sw=0x40048be8, e=0x400d98d0) at fs.c:642
#4 0x00034914 in indexadir (sw=0x40048be8,
dir=0x400ef920 "/data/WWW/cwco") at fs.c:445
#5 0x00034ff4 in printdirs (sw=0x40048be8, e=0x400ef6d0) at fs.c:680
#6 0x00034924 in indexadir (sw=0x40048be8, dir=0x400f1300
"/data/WWW/")
at fs.c:446
#7 0x00035220 in fs_indexpath (sw=0x40048be8, path=0x400f1300 "/data/WWW/")
at fs.c:733
#8 0x00029a1c in indexpath (sw=0x40048be8, path=0x400f1300
"/data/WWW/")
at file.c:193
#9 0x00010024 in cmd_index (sw=0x40048be8, params=0x400e5cf0) at swish.c:1121
#10 0x0000db3c in y1 () at swish.c:179
Regards
Peter
--
Peter Farmer | Custom XML software | Internet Engineering
Zveno Pty Ltd | Website XML Solutions | Training & Seminars
http://www.zveno.com/ | Open Source Tools | - XML XSL Tcl
Peter.Farmer@zveno.com +------------------------+---------------------
Ph. +61 8 92036380 | Mobile +61 417 906 851 | Fax +61 8 92036380
Received on Mon Jan 13 10:41:46 2003