Re: [swish-e] Returning document title rather than file name in search results?

From: Greg Keith <Greg.Keith(at)not-real.noaa.gov>
Date: Thu Apr 02 2009 - 20:52:58 GMT
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
 
I'm returning to a thread I started (which Peter kindly replied to) a
few weeks back - I was wondering why some of my HTML documents were
not having their titles found by swish-e.

As suggested, I created a test, and noticed that, sure enough, the
titles of the HTML documents I was indexing were mostly not being
found - that is because while many of the documents are straight HTML,
many use SSI variables for the title, so they look like this:

<!--#set var="title" value="Intranet: Directories" -->

Is there any way to get swish-e to use "Intranet: Directories" as the
document title, extracted from this SSI variable?
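
One possible approach, sketched here under assumptions rather than taken
from swish-e itself, is to preprocess each page before it is indexed and
copy the SSI "title" value into a real <title> element, which the HTML
parser would then have a chance of picking up. The function name
add_title_from_ssi and the regular expressions are hypothetical, and only
the exact #set form shown above is handled. How the output is handed to
swish-e (a filter program, or a script feeding the -S prog input method)
is left open.

```python
import html
import re

# Matches: <!--#set var="title" value="Intranet: Directories" -->
# Assumption: the variable is literally named "title" and double-quoted.
SSI_TITLE = re.compile(
    r'<!--#set\s+var="title"\s+value="([^"]*)"\s*-->',
    re.IGNORECASE,
)
HAS_TITLE = re.compile(r"<title\b", re.IGNORECASE)
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def add_title_from_ssi(source: str) -> str:
    """Return source with a <title> built from the SSI "title" variable.

    The text is returned unchanged if it already has a <title> tag or if
    no SSI title variable is present.
    """
    if HAS_TITLE.search(source):
        return source
    match = SSI_TITLE.search(source)
    if match is None:
        return source
    # Escape &, < and > so the value is safe as element text.
    title_tag = "<title>" + html.escape(match.group(1), quote=False) + "</title>"
    head = HEAD_OPEN.search(source)
    if head is not None:
        # Insert right after the opening <head> tag.
        return source[: head.end()] + title_tag + source[head.end():]
    # No <head> found: prepend a minimal one.
    return "<head>" + title_tag + "</head>\n" + source
```

Calling add_title_from_ssi() on a page's text returns the page with the
title added; a small wrapper that reads stdin and writes stdout would make
it usable as a command-line filter.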

Thanks for any help!

Greg

Peter Karman wrote:
> Greg Keith wrote on 3/11/09 4:14 PM:
>> I want the document title returned as the first link, if there
>> is one - most of the documents I'm indexing are HTML, so there should
>> be a <title> tag for most of them. I am not clear on how to do this -
>> it looks like it should be the proper combination of specifying the
>> title_property in swish.cgi and the MetaNames directive in my
>> swish.conf. However, I don't know what the proper combination is - I
>> tried not having any MetaNames directive in the swish.conf, and
>> having title_property set to "title" rather than "swishtitle", but
>> this just produces a "(null)" result for each document found. My
>> swish.conf and swish.cgi are below.
>>
>> Can anyone enlighten me?
>>
>
> The MetaNames config option is irrelevant in this case. MetaNames are for
> limiting a query to certain *contexts*. PropertyNames are for returning
> *contents* of hits.
>
> The best thing to do is find a document you think *should* be returning
> a title and isn't, and then make a test case with it. Here's an example:
>
> [karpet@pekmac:~/tmp]$ swish-e -i title.html
> Indexing Data Source: "File-System"
> Indexing "title.html"
> Removing very common words...
> no words removed.
> Writing main index...
> Sorting words ...
> Sorting 6 words alphabetically
> Writing header ...
> Writing index entries ...
>   Writing word text: Complete
>   Writing word hash: Complete
>   Writing word data: Complete
> 6 unique words indexed.
> 4 properties sorted.
> 1 file indexed.  94 total bytes.  6 total words.
> Elapsed time: 00:00:00 CPU time: 00:00:00
> Indexing done!
> [karpet@pekmac:~/tmp]$ swish-e -w hello
> # SWISH format: 2.5.6
> # Search words: hello
> # Removed stopwords:
> # Number of hits: 1
> # Search time: 0.000 seconds
> # Run time: 0.007 seconds
> 1000 title.html "this is the title" 94
> .
> [karpet@pekmac:~/tmp]$ cat title.html
> <html>
>  <head>
>   <title>this is the title</title>
>  </head>
>  <body>hello world</body>
> </html>
>
>
> What you'll probably find, in the case of your HTML anyway, is that the
> swish-e HTML parser isn't finding your <title> tagset for some reason: it
> isn't there, or is named slightly differently, or...
>

- --
Greg Keith - Web System Administrator   greg.keith(-at-)noaa.gov
NOAA ESRL Physical Sciences Division  http://www.esrl.noaa.gov/psd
R/PSD, 325 Broadway, Boulder, CO         phone: 303-497-6645


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
 
iD8DBQFJ1SWq8IR34NeP2BwRAg8uAJ9LKg7RAP4hnrntL2M0e1SOdPS0EwCcDHA1
e7zKRCVr1e0CxD4OOhauNao=
=IxXV
-----END PGP SIGNATURE-----

_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Thu Apr 2 16:52:57 2009