StoreDescription / swishdescription field parsing wrong meta tags

From: Tref Gare <trefg(at)not-real.areeba.com.au>
Date: Tue Dec 17 2002 - 01:00:04 GMT
Hi all and thanks again for any assistance anyone may be able to give.

I'm indexing a bunch of HTML files (alongside some PDFs and JSPs) and am
having trouble getting StoreDescription to work as I'd expect.

As far as I can tell I've set up the config file to index htm and html
using HTML2 and to store the description as <swishdescription> using up
to 120 characters from the <body> tag.  However swish-e is
intermittently returning a swishdescription field with contents from one
of the meta tags on the page, namely the Generator tag.
Eg:
<META NAME="GENERATOR" CONTENT="PageID 667 - generated by RedDot 4.5
(SP3) - 4.5.3.14 - 2-K5b" />
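
To spell out what I expected StoreDescription to produce, here is a minimal
Python sketch. It is not how swish-e or its HTML2 parser works internally; it
only mimics the behaviour I assumed: collect the text found inside <body>,
collapse whitespace, and keep the first 120 characters. The names BodyText and
expected_description, and the idea of feeding it a whole page as a string, are
made up for illustration.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect character data that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <BODY> also matches.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Text in <head> (title, etc.) is ignored; meta CONTENT values are
        # attributes, not character data, so they never reach this method.
        if self.in_body:
            self.chunks.append(data)


def expected_description(html, limit=120):
    """Return up to `limit` characters of whitespace-collapsed body text."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return text[:limit]
```

Under this reading, nothing from a meta tag in <head> should ever end up in
the description, which is why the GENERATOR text surprised me. Note that the
sketch would also pick up script or style text placed inside <body>.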

As you can see from the config file below, I am also indexing a specific
meta tag named "description", but that part seems to be working fine.

My search string looks like this:

<snip>

swish-e.exe -m 15 -b 1 -f /WWW/ACMI/catalog/acmi.index -w "lola " -x "<swishdocpath>\t<swishtitle>\t<swishdescription>\t<swishdocsize>\t<description>\n"

</snip>


Here's an example of one of the swishdescriptions getting returned:

<snip>

PageID 531 - generated by RedDot 4.5 (SP3) - 4.5.3.14 - 2-K5b true Run
Lola Run Run Lola Run Run Lola Run Run Lola Run R

</snip>
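
In case it helps anyone reproduce this, the following Python sketch splits
result lines in the -x format above into named fields and flags the ones whose
swishdescription starts with the GENERATOR text. It rests on a few
assumptions: that header lines in the swish-e output start with "#", that the
result list ends with a line containing only ".", and that no field contains a
tab. The function names are hypothetical.

```python
# Field order matches the -x format string used in the search command.
FIELDS = (
    "swishdocpath",
    "swishtitle",
    "swishdescription",
    "swishdocsize",
    "description",
)


def parse_results(lines):
    """Turn tab-separated result lines into a list of dicts keyed by FIELDS."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines, "#" header lines and the assumed "." end marker.
        if not line or line.startswith("#") or line == ".":
            continue
        parts = line.split("\t")
        if len(parts) != len(FIELDS):
            # Not a result line in the expected shape; ignore it.
            continue
        rows.append(dict(zip(FIELDS, parts)))
    return rows


def looks_like_generator(row):
    """True when the stored description begins with the RedDot PageID text."""
    return row["swishdescription"].startswith("PageID ")
```

Running the affected paths through a check like this should show whether the
problem is limited to particular pages or templates.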


And here's the head of that page:

<snip>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><META NAME="GENERATOR" CONTENT="PageID 531 - generated by RedDot
4.5 (SP3) - 4.5.3.14 - 2-K5b" />
 <meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">
 <meta http-equiv="imagetoolbar" content="no">
 <meta http-equiv="MSThemeCompatible" content="no">
 <meta name="MSSmartTagsPreventParsing" content="true">
 <!-- metadata -->
 <title>Run Lola Run</title>
 <meta name="DC.Title" lang="en" content="Run Lola Run">
 <meta name="DC.Subject" scheme="to be advised before development"
content="Run Lola Run">
 <meta name="keywords" content="Run Lola Run">
 <meta name="DC.Description" lang="en" content="Run Lola Run">
 <meta name="Description" content="Run Lola Run">
 <meta name="DC.Creator" lang="en" content="corporateName=Australian
Centre for the Moving Image; address=Federation Square, Melbourne, VIC;
contact=+61 3 8663 2200">
 <meta name="DC.Publisher" lang="en" content="corporateName=Australia
Centre for the Moving Image">
 <meta name="DC.Date.modified" scheme="ISO8601" content="2002-11-21">

</snip>

The config file looks like this:

<snip>
IndexFile "C:/WWW/ACMI/catalog/acmi.index"
IndexDir .
IndexContents HTML2 .htm .html .jsp 
StoreDescription HTML2 <body> 120

IndexContents TXT2 .pdf
FileFilter .pdf "pdftotext" "%p -"
NoContents .gif .jpg .mdb .xml
IndexOnly .htm .html .jsp .pdf
FollowSymLinks yes
MetaNames description
PropertyNames description
ReplaceRules prepend "filesys"
ReplaceRules replace "filesys\." "http://devbox:88"
# ReplaceRules regex "/\x5c/\x2f/gi"
 ReplaceRules replace "\\\\" "/"
# This line tells swish-e not to index any folders with xml in the path,
# i.e. the xml folder and all its subfolders.
FileRules pathname contains xml
FileRules dirname contains WEB-INF

</snip>

Any guidance/clarification of where I'm going wrong will be much
appreciated.

Cheers
Tref

------------------------------------------------------
Tref Gare
Development Consultant
Areeba
Level 19/114 William St, Melbourne VIC 3000
email: trefg@areeba.com.au
phone: +61 3 9642 5553
fax: +61 3 9642 1335
website: http://www.areeba.com.au
------------------------------------------------------
Received on Tue Dec 17 01:00:21 2002