Re: How to configure swish-e to index content/meta in OO.o?

From: Roman Chyla <chyla(at)not-real.knihovnabbb.cz>
Date: Tue May 31 2005 - 07:13:29 GMT
Hi,
I found the script (and found that my memory is not so good): it is
only a filter, and I did not use XML but HTML2. The main thing is that it works!
Hope that helps.

Config:

IndexFile sxw.index
IncludeConfigFile  .\conf\common.config
IndexDir d:/temp/test		
IndexOnly .sxw
DefaultContents HTML
IndexContents HTML2 .sxw .xml .sxwrtf      #<-- here
IndexContents TXT .txt

FollowSymLinks yes
IndexComments no
UndefinedMetaTags index                    #<-- here?
UndefinedXMLAttributes index

StoreDescription HTML2 <office:body> 1000   #<-- here
StoreDescription HTML <body> 1000
StoreDescription TXT 1000
FileFilter .sxw 'perl -w ./filter/simplesxw.pl' '"%P"'
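
For a Unix-style setup like the one in the quoted message below, the same idea (feed the concatenated meta.xml and content.xml to the HTML2 parser rather than the XML one) might look roughly like the fragment below. This is an untested sketch: the unzip path, the extension list and the FileFilterMatch line are copied from the quoted message, and combining them with HTML2 is an assumption, not something that was verified.

```
# Sketch only, untested: Unix paths and extensions taken from the quoted message
IndexContents HTML2 .sxw .sxc .sxi .odt
FileFilterMatch "/usr/bin/unzip" "-p \"%p\" meta.xml content.xml" /\.(sxw|sxc|sxi|odt)$/i
UndefinedMetaTags index
StoreDescription HTML2 <office:body> 1000
```

The concatenated output contains two XML documents back to back, which a strict XML parser may reject; a more forgiving HTML parser is the likely reason the HTML2 setting works here.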



Script:

#! "C:\Perl\bin\perl.exe"

use strict;

# Simple tool to extract text from .sxw files.
# Receives the path to a file and outputs the concatenated XML files
# (in UTF-8 encoding).
# OO.o stores files in UTF-8; if you want something else, you must
# convert the file yourself.
# My solution was not portable (nor very quick), so I did not add it.



my $input_file = shift || die "Usage: $0 <filename>\n";


my $slash = "\\";


#this is important, path to unzip utility "-p" says that it should
#pipe output to STDOUT
my $_util = "unzip -p";

# which files to extract from the .sxw file; only two of them are important
my $_conf = "meta.xml content.xml";


our $MYLOGFILE = 'sxwlog.txt';


#print STDERR " $input_file\n\n###!--------------------------------";


my @new = split(/\//, $input_file);
$input_file = join ($slash, @new);

#print STDERR "new name is - $input_file\n\n";


print STDERR "$_util $input_file $_conf\n";


# on my windows version of unzip "|" seems not to work; hope you linux
# guys have a better system.
# To read unzip's output, the "|" must come at the END of the command
# (a leading "|" opens the command for writing, so the loop reads nothing).
open (INPUT, "$_util \"$input_file\" $_conf |")
    || die "can't open $input_file: $!";
while (<INPUT>) {
    print $_;
    #print(replaceChar($_));   # you may do something with the contents
}
save_to_file("OK: $input_file\n");
close(INPUT) || die "can't close $input_file: $!";



# append a message to the log file, or to STDERR if no log file is set
sub save_to_file {
    my $str = shift;

    if ($MYLOGFILE) {
        open (MYFILE, ">>$MYLOGFILE") || die "Check MYLOGFILE: $!";
        print MYFILE $str;
        close (MYFILE) || die "Can't close $MYLOGFILE: $!";
    }
    else {
        print STDERR $str;
    }
}


# you have a chance to do some cleaning inside the xml file
sub replaceChar {
    my $str = shift;
    for ( $str ) {
        # s/&[sS]caron;//go;
    }
    return $str;
}

1;
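
For comparison, here is a smaller variant of the same filter, as a sketch only. It assumes `unzip` is on the PATH and that the three-argument form of `open` with the `-|` mode is available; the helper names `to_windows_path`, `unzip_command` and `extract_members` are made up for this illustration and are not part of swish-e or of the script above.

```perl
use strict;
use warnings;

# Convert forward slashes to backslashes, as the script above does for Windows.
sub to_windows_path {
    my ($path) = @_;
    (my $converted = $path) =~ s{/}{\\}g;
    return $converted;
}

# Build the unzip command line; "-p" makes unzip write the members to STDOUT.
sub unzip_command {
    my ($path, @members) = @_;
    return sprintf 'unzip -p "%s" %s', $path, join(' ', @members);
}

# Run unzip, copy its output to STDOUT, and return the number of lines copied.
sub extract_members {
    my ($path, @members) = @_;
    my $cmd = unzip_command($path, @members);
    open(my $in, '-|', $cmd) or die "can't run $cmd: $!";
    my $lines = 0;
    while (my $line = <$in>) {
        print $line;
        $lines++;
    }
    close($in) or die "unzip failed for $path (exit status $?)";
    return $lines;
}

# Only run when a file name is given on the command line.
if (@ARGV) {
    extract_members(to_windows_path($ARGV[0]), 'meta.xml', 'content.xml');
}
```

On a non-Windows system the call to `to_windows_path` would simply be dropped, since the path can be passed to unzip unchanged.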



Philip Young wrote:
> Hey,
> 
> I'm having a lot of frustration trying to get the meta.xml (document
> properties) and the content.xml indexed. I would like the
> content to be indexed into the "swishdefault" category (normal indexed
> content) and the document properties indexed with
> "UndefinedMetaTags auto".
> 
> So I'm just looking for a quick and dirty way to accomplish this task.
> Originally I thought of concatenating the two .xml files to be indexed,
> like so:
> 
> FileFilterMatch "/usr/bin/unzip" "-p \"%p\" meta.xml content.xml" /\.(sxw|sxc|sxi|odt)$/i
> 
> This line compiles and indexes with no syntax errors.  But the problem
> is it does not seem to index properly.
> 
> Anyone got any ideas on how to get the meta.xml and content.xml indexed?
> 
> My swish.conf file is located below.
> 
> Thank you,
> 
> 
> Philip Young
> 
> -- swish.conf --
> IndexDir	/var/www/test
> IndexFile	/var/www/test/index.swish-e
> IndexName	Documents
> IndexOnly	.xml .htm .html .txt .doc .rtf .sxw .sxc .sxi .odt 
> DefaultContents	TXT
> SwishProgParameters -S fs
> 
> ReplaceRules replace /var/www/test /test
> ExtractPath subject regex !^/test/([^/]+)/.*$!$1!
> 
> # Allow extra searching by title, path
> metanames swishtitle swishdocpath
> UndefinedMetaTags auto
> 
> IndexContents TXT* .pdf
> FileFilter .pdf "/usr/bin/pdftotext" "'%p' -"
> #SWISH::Filter .pdf "/usr/bin/pdftotext" "'%p' -"
> 
> IndexContents TXT* .doc
> FileFilter .doc "/usr/bin/catdoc" "-s8859-1 -d8859-1 '%p'"
> #SWISH::Filter .doc "/usr/bin/catdoc" "-s8859-1 -d8859-1 '%p'"
> 
> IndexContents TXT* .rtf
> FileFilter .doc "/usr/bin/catdoc" "'%p'"
> #SWISH::Filter .doc "/usr/bin/catdoc" "'%p'"
> 
> FileFilterMatch "/usr/bin/unzip" "-p \"%p\" meta.xml" /\.(sxw|sxc|sxi|odt)$/i
> IndexContents XML* .sxw .sxc .sxi .odt
> StoreDescription XML* <text:p>
> 
> FileFilterMatch "/usr/bin/unzip" "-p \"%p\" content.xml" /\.(sxw|sxc|sxi|odt)$/i
> IndexContents XML* .sxw .sxc .sxi .odt
> StoreDescription XML* <text:p>
> 
> 
> 
> 
Received on Tue May 31 00:13:30 2005