Dennis Gerasimenko wrote on 7/27/10 12:50 PM:
> Hi.
>
> I am running Swish-E 2.4.7 (on RHEL5) and I am trying to skip a few HTML
> tags (specifically “script” as in <script ...></script>) inside HEAD and
> BODY, while parsing HTML files but, despite configuration directive
> “IgnoreMetaTags script style link select”, tag “script” is still being
> parsed. That generates many errors such as:
>
> error: Unexpected end tag : dt '<dt>' + listingName + '</dt>' +

The error seems to come from libxml2. I was able to reproduce it (see the end of
this email).
I'm not sure whether this is a bug in swish-e. To work around it, I would
probably just strip out the <script> tag content before handing the file to
swish-e. If you're using spider.pl or DirTree.pl, you could add a simple regex
that strips out the contents of the <script> tags:
    $buf =~ s,<script[^>]*>.*?</script>,,sgi;
See the 'filter_content' callback in spider.pl and the process_file() function
in DirTree.pl.
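
To see what that substitution does before wiring it into a callback, here is a minimal standalone sketch. The helper name strip_scripts() is made up for illustration and is not part of swish-e, spider.pl, or DirTree.pl; the sample document is a shortened copy of the test case further down. It uses .*? so that empty elements such as <script src="..."></script> are also matched on their own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_scripts() is a hypothetical helper name, not part of swish-e.
# It removes every <script>...</script> element, including its content.
sub strip_scripts {
    my ($html) = @_;

    # /s lets . match newlines, /g replaces all occurrences,
    # /i makes the tag name case-insensitive.
    $html =~ s,<script[^>]*>.*?</script>,,sgi;
    return $html;
}

my $html = <<'HTML';
<html>
<head>
<title>i have script</title>
<script type="text/javascript">
var foo = '<foo>bar</foo>';
</script>
</head>
<body>
<p>hello world</p>
</body>
</html>
HTML

# Print the document with the <script> element removed.
print strip_scripts($html);
```

The same substitution could be applied to the buffer inside whichever callback you use. This is a plain regex and not an HTML parser, so it assumes the literal string </script> does not appear inside the script body itself.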
Or use swish3 with a custom Aggregator where you filter the content yourself:
% swish3 -S MyAggregator -c conffile -i files
where MyAggregator looks like:
package MyAggregator;
use strict;
use base qw( SWISH::Prog::Aggregator::FS );

# Register the filter when the aggregator is initialized.
sub init {
    my $self = shift;
    $self->SUPER::init(@_);
    $self->set_filter( \&my_filter );
}

# Strip <script> elements from each document before it is indexed.
sub my_filter {
    my $doc = shift;
    my $buf = $doc->content;
    $buf =~ s,<script[^>]*>.*?</script>,,sgi;
    $doc->content($buf);
    return $doc;
}

1;
My test case is below.
[karpet@pekmac:~/tmp/nometa]$ cat script.html
<html>
<head>
<title>i have script</title>
<script type="text/javascript">
//<!-- noindex -->//
var foo = '<foo>bar</foo>';
//<!-- index -->//
</script>
</head>
<body>
<p>hello world</p>
</body>
</html>
[karpet@pekmac:~/tmp/nometa]$ cat conf
# Ignore select HTML tag
IgnoreMetaTags script style link select
[karpet@pekmac:~/tmp/nometa]$ swish-e -c conf -i script.html -T indexed_words parsed_words parsed_tags parsed_text properties -v9
Parsing config file 'conf'
Indexing Data Source: "File-System"
Indexing "script.html"
Checking file "script.html"...
script.html - Using DEFAULT (HTML2) parser - i have script
White-space found word 'i'
Adding:[1:swishdefault(1)] 'i' Pos:5 Stuct:0x7 ( HEAD TITLE FILE )
White-space found word 'have'
Adding:[1:swishdefault(1)] 'have' Pos:6 Stuct:0x7 ( HEAD TITLE FILE )
White-space found word 'script'
Adding:[1:swishdefault(1)] 'script' Pos:7 Stuct:0x7 ( HEAD TITLE FILE )
<script> (meta [no meta name defined] *Start Ignore*)
<script> (property [no meta name defined] *Start Ignore*)
script.html:6: error: Unexpected end tag : foo
var foo = '<foo>bar</foo>';
^
</script> (meta) end ignore
</script> (property) end ignore
hello world
White-space found word 'hello'
Adding:[1:swishdefault(1)] 'hello' Pos:14 Stuct:0x9 ( BODY FILE )
White-space found word 'world'
Adding:[1:swishdefault(1)] 'world' Pos:15 Stuct:0x9 ( BODY FILE )
(5 words)
swishdocpath: 6 ( 11) S: "script.html"
swishtitle: 7 ( 13) S: "i have script"
swishdocsize: 8 ( 8) N: "225"
swishlastmodified: 9 ( 8) D: "2010-07-30 09:54:53 CDT"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 5 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
5 unique words indexed.
4 properties sorted.
1 file indexed. 225 total bytes. 5 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
[karpet@pekmac:~/tmp/nometa]$ swish3 -S MyAggregator -c conf -i script.html -v
Indexing Data Source: "External-Program"
Indexing "stdin"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 5 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
5 unique words indexed.
4 properties sorted.
1 file indexed. 105 total bytes. 5 total words. # NOTICE byte count less
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
1 documents in 00:00:00
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Fri Jul 30 11:16:10 2010