I'm returning to a thread I started (which Peter kindly replied to) a
few weeks back - I was wondering why some of my HTML documents were
not having their titles found by swish-e.
As suggested, I created a test, and noticed that, sure enough, the
titles of the HTML documents I was indexing were mostly not being
found - that is because while many of the documents are straight HTML,
many use SSI variables for the title, so they look like this:
<!--#set var="title" value="Intranet: Directories" -->
Is there any way to get swish-e to use "Intranet: Directories" as the
document title, extracted from this SSI variable?
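One possible approach, sketched here under assumptions, is to preprocess each document before it reaches the indexer so that the SSI variable becomes an ordinary <title> element. The function name add_title_from_ssi is hypothetical, the pattern only handles the exact directive form shown above, and how the output gets fed to swish-e is not shown:

```python
import re

# Matches the SSI form shown above:
#   <!--#set var="title" value="Intranet: Directories" -->
SSI_TITLE = re.compile(
    r'<!--#set\s+var="title"\s+value="([^"]*)"\s*-->',
    re.IGNORECASE,
)
HAS_TITLE = re.compile(r"<title\b", re.IGNORECASE)


def add_title_from_ssi(document: str) -> str:
    """Prepend a <title> built from the SSI "title" variable (hypothetical helper).

    The document is returned unchanged if it has no such SSI directive
    or if it already contains a <title> element.
    """
    match = SSI_TITLE.search(document)
    if match is None or HAS_TITLE.search(document):
        return document
    return "<title>" + match.group(1) + "</title>\n" + document
```

Calling add_title_from_ssi() on the text of a page would then give the HTML parser a real <title> to pick up, while plain HTML pages that already have one pass through untouched.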
Thanks for any help!
Peter Karman wrote:
> Greg Keith wrote on 3/11/09 4:14 PM:
>> I want the document title returned as the first link, if there
>> is one - most of the documents I'm indexing are HTML, so there should
>> be a <title> tag for most of them. I am not clear on how to do this -
>> it looks like it should be the proper combination of specifying the
>> title_property in swish.cgi and the MetaNames directive in my
>> swish.conf. However, I don't know what the proper combination is - I
>> tried not having any MetaNames directive in the swish.conf, and
>> having title_property set to "title" rather than "swishtitle", but
>> this just produces a "(null)" result for each document found. My
>> swish.conf and swish.cgi are below.
>> Can anyone enlighten me?
> The MetaNames config option is irrelevant in this case. MetaNames are for
> limiting a query to certain *contexts*. PropertyNames are for returning
> *contents* of hits.
> The best thing to do is find a document you think *should* be returning
> and isn't, and then make a test case with it. Here's an example:
> [karpet@pekmac:~/tmp]$ swish-e -i title.html
> Indexing Data Source: "File-System"
> Indexing "title.html"
> Removing very common words...
> no words removed.
> Writing main index...
> Sorting words ...
> Sorting 6 words alphabetically
> Writing header ...
> Writing index entries ...
> Writing word text: Complete
> Writing word hash: Complete
> Writing word data: Complete
> 6 unique words indexed.
> 4 properties sorted.
> 1 file indexed. 94 total bytes. 6 total words.
> Elapsed time: 00:00:00 CPU time: 00:00:00
> Indexing done!
> [karpet@pekmac:~/tmp]$ swish-e -w hello
> # SWISH format: 2.5.6
> # Search words: hello
> # Removed stopwords:
> # Number of hits: 1
> # Search time: 0.000 seconds
> # Run time: 0.007 seconds
> 1000 title.html "this is the title" 94
> [karpet@pekmac:~/tmp]$ cat title.html
> <title>this is the title</title>
> <body>hello world</body>
> What you'll probably find, in the case of your HTML anyway, is that the
> HTML parser isn't finding your <title> tagset for some reason: it isn't
> there, or is named slightly differently, or...
Greg Keith - Web System Administrator greg.keith(-at-)noaa.gov
NOAA ESRL Physical Sciences Division http://www.esrl.noaa.gov/psd
R/PSD, 325 Broadway, Boulder, CO phone: 303-497-6645
Received on Thu Apr 2 16:52:57 2009