Re: [swish-e] How can I adjust the META names before an HTML document is indexed?

From: <harmo(at)not-real.valt.helsinki.fi>
Date: Mon Sep 10 2007 - 05:03:03 GMT
On 8 Sep 2007 at 19:36, Peter Karman wrote:
> If you're using the spider.pl or DirTree.pl with -S prog, then yes, you
> can filter the content with a regex and output additional <meta> tags 
> with the content.

I'm planning to write a -S prog program that does its own XML parsing 
and passes just plain text to swish for indexing. Is it possible to 
produce meta fields in this scenario? The text would not have any 
tags (no "<" or ">"). Of course I could write them out, but it seems 
like a waste to have swish parse it as XML a second time.

Something like outputting:
Path-Name: MYPATH
Content-Lines: NUMBER_OF_LINES
Last-Mtime: $mtime
Document-Type: TEXT
Meta: Subject=MYSUBJECT
Meta: AUTHOR=MYAUTHOR

DOCUMENT-CONTENT-TEXT



(I wishfully changed the Content-Length header to Content-Lines,
as calculating the number of bytes swish thinks the file contains can be a
bit tedious if I have some lines ending in CRLF and others with just CR or LF.
A number of lines would be much easier, also for swish, I think, if it reads
the input line by line. But this is not so important.)
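
For what it's worth, the byte count may be less tedious than it looks if the
program encodes the whole document to bytes first and measures that, since
the mix of line endings then stops mattering. A minimal sketch in Python,
assuming UTF-8 output; the function name format_record and the default
Document-Type value are made up for illustration, and the Meta: lines from
the wishful example above are left out because nothing here confirms that
swish accepts them:

```python
def format_record(path, text, mtime, doc_type="TXT*"):
    """Return one -S prog style record as bytes: headers, blank line, content.

    Assumptions: UTF-8 encoding, and "TXT*" as the Document-Type value;
    check both against the swish-e documentation for your setup.
    """
    # Encode first: the length must count bytes, not characters or lines,
    # so CRLF, CR and LF endings are all counted exactly as they are written.
    body = text.encode("utf-8")
    headers = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"
        f"Last-Mtime: {int(mtime)}\n"
        f"Document-Type: {doc_type}\n"
        "\n"  # blank line separates the headers from the content
    )
    return headers.encode("utf-8") + body
```

The result should be written to sys.stdout.buffer rather than sys.stdout,
so that Python does not translate the line endings after the length has
been computed.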
 .Timo
Received on Mon Sep 10 01:03:04 2007