I am asking this for some friends who are running a swish-e site.
They are trying to control which pages on the site get indexed by using the
"obeyRobotsNoIndex yes" directive. It works in the sense that the pages
with the meta tag "noindex" do not get indexed. But some pages that do not
have that tag are not returned in the results even though they satisfy the
search. The pattern of what gets picked up and what does not is not
obvious.
When the obey directive is switched to "no", all pages that satisfy the
search (both with and without "noindex") are returned. They tried debugging
with -T PROPERTIES and -k [letter], and have confirmed that the fugitive
pages are being indexed; they are just not being returned by the search.
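In case it helps anyone reproduce this: one way to double-check which files
really carry the tag, independently of swish-e, is a small script along the
lines below. It is only a sketch. It assumes the tag has the usual form
<meta name="robots" content="noindex">, and the names has_noindex and
find_noindex_pages are made up for illustration; they are not part of swish-e.

```python
from html.parser import HTMLParser
from pathlib import Path


class RobotsMetaParser(HTMLParser):
    """Records whether a page carries <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute may hold several comma-separated directives.
        directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html_text):
    """Return True if the HTML text contains a robots noindex meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.noindex


def find_noindex_pages(root):
    """Return sorted paths of .html/.htm files under root that carry the tag."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in (".html", ".htm"):
            text = path.read_text(encoding="utf-8", errors="replace")
            if has_noindex(text):
                hits.append(str(path))
    return hits if not hits else sorted(hits)
```

Comparing the output of find_noindex_pages for the document root against the
list of pages missing from the search results should show whether any of the
missing pages carry the tag after all.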
They had been using a development version of swish-e from mid-July, so
yesterday it was upgraded to the new 2.2 release. The problem persists.
The parser is HTML2 using libxml2 2.4.22, and the operating
system is Solaris 2.8. Swish-e indexes via the file system, not via a
spider.
Any help would be appreciated.
Vernon Leighton
EXAMPLES:
Command line to index the site:
swish-e -c swish_main.conf -f swish_main.index
A sample search command that returns a different set of pages without the
robots tag depending on the setting of obeyRobotsNoIndex:
swish-e -w participation -f swish_test.index -p description -d ::
Portions of the swish-e configuration file:
IndexContents HTML2 .html .htm
DefaultContents HTML2
#INDEX ONLY FILES WITH THESE EXTENSIONS
IndexOnly .html .htm
obeyRobotsNoIndex yes
FollowSymLinks no
# TYPES OF DOCS NOT TO INDEX
NoContents .doc .gif .js .pdf .php .txt .xml
MetaNames subject
MetaNames description
###########################################
# Properties to be returned in the results
###########################################
StoreDescription HTML <description> 200
PropertyNameAlias swishdescription description
Received on Thu Sep 19 09:07:04 2002