Sorry for being unclear. We are running on Windows Server 2003, and our
swish-e is version 2 (July 2008).
Here is our problem:
A couple of days ago we made some changes to our website, one of which
affected everything, including the swish index database (.idx) file. When we
use the old .idx file, the search works fine with the new website, except
that the links are obviously no longer correct (because we also changed some
of the directory structure).
However, when we use the latest .idx file (we run a scheduled job every day
to reindex the site), the search results contain duplicates and some of the
URLs are simply wrong, such as:
www.domainname.com//news/article1.html (with an extra '/')
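One way to check whether the duplicate results are just the same page indexed
under two spellings of its URL (with and without the extra '/') is to
normalize the URLs and group them. The following is a minimal Python sketch
under that assumption. It is not part of swish-e or spider.pl, and the names
normalize_url and find_duplicates, as well as the sample list, are
hypothetical and chosen only for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Collapse runs of '/' in the path only, leaving 'http://' untouched."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def find_duplicates(urls):
    """Group URLs that become identical once the path is normalized."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize_url(url), []).append(url)
    # Keep only the groups where more than one spelling was seen.
    return {key: seen for key, seen in groups.items() if len(seen) > 1}


if __name__ == "__main__":
    # Hypothetical sample of URLs as they might appear in search results.
    sample = [
        "http://www.domainname.com//news/article1.html",
        "http://www.domainname.com/news/article1.html",
        "http://www.domainname.com/index.php?itemid=12",
    ]
    for normalized, spellings in find_duplicates(sample).items():
        print(normalized, "<-", spellings)
```

If the groups it reports match the duplicated search results, the extra '/'
is probably introduced while the site is spidered rather than at search time.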
So I assume that the problem is in the indexing configuration. Here is our
config file:
# Include our site-wide configuration settings:
# Specify the program to run
SwishProgParameters default http://www.domainname.com/index.html
# Tell swish how to parse the content
IndexContents HTML .htm .html .php
IndexContents TXT .txt .conf
StoreDescription HTML* <body> 1000
# These settings tell swish what defines a word.
# We only index words that include letters, numbers, a dash,
# or a period. (Not very realistic)
# These are the characters that are allowed in a "word".
# i.e. words are split on any character NOT found in WordCharacters
# We allow a period and a dash within words, but strip them
# from the beginning or end of a word. This is done after
# WordCharacters above is used to split words.
# Finally, resulting words must begin/end with one
# of the characters listed here
# Turn this on for a slight performance improvement
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.
# 4 is debugging. Can be overridden with -v on the command line
# Set the stopwords (words to ignore when searching and when indexing)
# Carefully think about this feature before using a list of stopwords
# You can list the words here:
# IgnoreWords of or and the a to
# Or you can use the compiled in defaults:
# IgnoreWords SwishDefault
# Or you can use a file that includes your own words:
IgnoreWords file: stopwords/english.txt
# Since we are using such a restrictive WordCharacters settings, we
# want to map eight-bit characters to ascii.
# For example, "resumé" will be indexed and searched as "resume".
# See docs for more info.
# We don't want phrase searches to work across sentences, plus
# we use the pipe "|" to force a break in phrases when indexing.
For spider.pl, we use the default file that came with our swish-e download.
On Thu, Sep 17, 2009 at 9:35 PM, Peter Karman <firstname.lastname@example.org> wrote:
> Ronny Rahardjo wrote on 9/17/09 10:04 PM:
> > Hi All,
> > Can someone help me to resolve our issue? The title and description
> > link to the wrong URL.
> > It goes to www.domainname.com/press/article123.html instead of
> > www.domainname.com/news/press/article123.html
> > Can you give me a pointer to which file I should check? Does it happen
> > during indexing?
> > Also, it cannot resolve any query string. For example,
> > www.domainname.com/index.php?itemid=12
> Both with this post and your previous one you have given us precious little
> information to work with. It would help if we knew:
> (1) windows or unix
> (2) what version of swish-e
> (3) how you are indexing (spider.pl, -S prog, -S fs, etc)
> (4) copies of all your config files.
> I'm guessing you have inherited this system from someone else, so the more
> you learn about how it is set up, the more we can help you.
> Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
> Users mailing list
Received on Fri Sep 18 02:20:23 2009