At 07:30 AM 09/23/02 -0700, H Vernon Leighton wrote:
>As I said before, if you leave obeyRobotsNoIndex off, then swish-e will
>index all of the pages in the directory tree. However, with obeyRobotsNoIndex set
>to yes, it will index all of the pages properly, but it will not return all
>of them from a legitimate search. We have tried -T INDEXED_WORDS and other
>tests, and the pages are apparently being indexed, just not returned from
>a search.
>Say the word "testword" is in three documents: sample1.html, sample2.html
>and sample3.html. Say that swish-e indexes them in that exact order.
>If sample2.html has the tag <meta name="robots" content="noindex">, then in
>INDEXED_WORDS, both sample1.html and sample3.html appear in the index under
>"testword." However, if you do the search:
>swish-e -w testword -f swish_test.index
>you will only get sample3.html returned from the search.
Is this what you are describing? sample2.html has the noindex tag:
~/swish-e.2.2-patches/src > cat sample?.html
<title>This is sample1</title>
<title>This is sample2</title>
<meta name="robots" content="noindex">
<title>This is sample3</title>
Here's the config file:
~/swish-e.2.2-patches/src > cat c
IndexContents HTML2 .html
Now index, see that sample2.html is ignored:
~/swish-e.2.2-patches/src > ./swish-e -c c -i sample1.html sample2.html sample3.html
Parsing config file 'c'
Indexing Data Source: "File-System"
Checking file "sample1.html"...
sample1.html - Using HTML2 parser - (4 words)
Checking file "sample2.html"...
sample2.html - Using HTML2 parser - (Skipped due to Robots Excluion Rule in meta tag)
Checking file "sample3.html"...
sample3.html - Using HTML2 parser - (4 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 5 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
5 unique words indexed.
4 properties sorted.
2 files indexed. 297 total bytes. 11 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Now search. I get sample1 and sample3 as expected.
~/swish-e.2.2-patches/src > ./swish-e -w testword
# SWISH format: 2.2.1
# Search words: testword
# Number of hits: 2
# Search time: 0.000 seconds
# Run time: 0.050 seconds
1000 sample3.html "This is sample3" 86
1000 sample1.html "This is sample1" 86
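To compare runs like the one above without reading the output by eye, the hit lines can be pulled out with a short script. This is only a sketch: it assumes each hit line has the shape shown above (rank, path, quoted title, size) and that paths contain no spaces. The names parse_hits and HIT_LINE are made up for this example and are not part of swish-e.

```python
import re

# One hit line as it appears in the output above:
# rank, path, quoted title, size in bytes.
# Assumption: paths contain no whitespace.
HIT_LINE = re.compile(r'^(\d+)\s+(\S+)\s+"(.*)"\s+(\d+)$')


def parse_hits(output):
    """Return the paths of the hits, in output order, skipping '#' header lines."""
    paths = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = HIT_LINE.match(line)
        if match:
            paths.append(match.group(2))
    return paths


# The search output pasted above, verbatim.
SAMPLE_OUTPUT = """\
# SWISH format: 2.2.1
# Search words: testword
# Number of hits: 2
# Search time: 0.000 seconds
# Run time: 0.050 seconds
1000 sample3.html "This is sample3" 86
1000 sample1.html "This is sample1" 86
"""

if __name__ == "__main__":
    print(sorted(parse_hits(SAMPLE_OUTPUT)))
```

Sorting the parsed paths makes it easy to compare against the set of files that should come back, since the hits are not listed in indexing order.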
>If, however, the
>"noindex" tag appears in sample1.html and not in sample2.html, then you
>will get sample2.html and sample3.html returned in the search. If the
>"noindex" tag appears in sample3.html only, then you will get no results
>from the search for "testword". With obeyRobotsNoIndex off, you will get
>all three no matter where the noindex tag is.
Interesting. I don't see that at all in the test setup I'm using above.
There is special code we use to back-out a partially indexed file (when
"noindex" is found). That would be the only place I would suspect a
problem could be happening.
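For what it's worth, the three cases in the quoted report follow one pattern: the files that come back are exactly the ones indexed after the file carrying the noindex tag. The sketch below just tabulates the reported results and checks them against that guess. It is a hypothesis drawn from the report, not a diagnosis, and the names (REPORTED, files_after, expected_hits) are invented for this example.

```python
# Indexing order as described in the quoted report.
FILES = ["sample1.html", "sample2.html", "sample3.html"]

# Hits reported in the quoted message, keyed by the file carrying the noindex tag.
REPORTED = {
    "sample1.html": ["sample2.html", "sample3.html"],
    "sample2.html": ["sample3.html"],
    "sample3.html": [],
}


def expected_hits(noindex_file):
    """Correct behaviour: every file except the one tagged noindex."""
    return [name for name in FILES if name != noindex_file]


def files_after(noindex_file):
    """Hypothesis: only files indexed after the skipped one are returned."""
    return FILES[FILES.index(noindex_file) + 1:]


for tagged, hits in REPORTED.items():
    missing = [name for name in expected_hits(tagged) if name not in hits]
    fits = hits == files_after(tagged)
    print(tagged, "missing:", missing, "fits hypothesis:", fits)
```

If that pattern holds on a real run, it would point at whatever happens to the already-indexed files when a later file is backed out.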
Can you put together a sample of files like above that demonstrate this?
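Something along these lines would generate the three cases, one directory per position of the noindex tag. It is a rough sketch under several assumptions: the body text with "testword" and the directory names are made up, the config uses "obeyRobotsNoIndex yes" as described in the quoted message, and passing -f at index time to name the index file is assumed to match the search command quoted above. The swish-e step only runs if a swish-e binary is found on the PATH.

```python
import os
import shutil
import subprocess
import tempfile

NAMES = ["sample1.html", "sample2.html", "sample3.html"]
NOINDEX = '<meta name="robots" content="noindex">'
# Assumed config: the IndexContents line is from this thread, and the
# obeyRobotsNoIndex setting is the one described in the quoted report.
CONFIG = "IndexContents HTML2 .html\nobeyRobotsNoIndex yes\n"


def write_case(directory, noindex_name):
    """Write the three sample files and config 'c', tagging one file noindex."""
    os.makedirs(directory, exist_ok=True)
    for number, name in enumerate(NAMES, start=1):
        lines = ["<title>This is sample%d</title>" % number]
        if name == noindex_name:
            lines.append(NOINDEX)
        lines.append("<p>testword</p>")  # assumed body text
        with open(os.path.join(directory, name), "w") as handle:
            handle.write("\n".join(lines) + "\n")
    with open(os.path.join(directory, "c"), "w") as handle:
        handle.write(CONFIG)


def run_case(directory, swish="swish-e"):
    """Index the files in order, then search for testword; return the raw output."""
    index_cmd = [swish, "-c", "c", "-i"] + NAMES + ["-f", "swish_test.index"]
    subprocess.run(index_cmd, cwd=directory, capture_output=True, text=True)
    search_cmd = [swish, "-w", "testword", "-f", "swish_test.index"]
    result = subprocess.run(search_cmd, cwd=directory, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    base = tempfile.mkdtemp(prefix="noindex_test_")
    for tagged in NAMES:
        case_dir = os.path.join(base, "noindex_in_" + tagged.split(".")[0])
        write_case(case_dir, tagged)
        print("wrote", case_dir)
        if shutil.which("swish-e"):
            print(run_case(case_dir))
```

Tarring up the generated directories together with the real config file would give a self-contained test case to send along.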
>IndexOnly .html .htm
>NoContents .doc .gif .js .pdf .php .txt .xml
Those won't be indexed because you are only indexing .html and .htm files.
Received on Tue Sep 24 03:27:15 2002