Hi again!
As discussed on IRC, here are some stats from my swish-e setups. The
swish configs are the ones I posted a few hours ago; you can find them
at the end of this mail.
Wikis (moin-moin, through the filesystem; no HTTP access is done for this
indexing):
Sorting 53,352 words alphabetically
53,352 unique words indexed.
Sorting property: swishdocpath
Sorting property: swishtitle
Sorting property: swishdocsize
Sorting property: swishlastmodified
Sorting property: swishdescription
5 properties sorted.
2,091 files indexed. 16,440,593 total bytes. 926,703 total words.
Elapsed time: 00:00:08 CPU time: 00:00:08
hardware/software spec:
Debian 3.1, kernel 2.6.16 on dual Intel(R) Xeon(TM) CPU 2.80GHz, 2025 MB RAM
Filesystem:
Sorting 2,618,998 words alphabetically
2,618,998 unique words indexed.
5 properties sorted.
26,361 files indexed. 3,559,089,581 total bytes. 53,755,796 total words.
Elapsed time: 00:35:18 CPU time: 00:08:17
hardware/software spec:
Debian 4.0, kernel 2.6.18 on 8x Intel(R) Xeon(R) CPU E5420 @ 2.50GHz, 7964 MB RAM
!! It's a Virtual Environment which shares system resources with 9 other VEs.
Request Tracker (pgsql through LAN):
Sorting 59,507 words alphabetically
59,507 unique words indexed.
Sorting property: swishdocpath
Sorting property: swishtitle
Sorting property: swishdocsize
Sorting property: swishlastmodified
Sorting property: swishdescription
5 properties sorted.
13,437 files indexed. 32,635,567 total bytes. 1,714,068 total words.
Elapsed time: 00:00:34 CPU time: 00:00:26
hardware/software spec:
Debian 4.0, kernel 2.6.18 OpenVZ, on a dual Intel(R) Pentium(R) 4 CPU 2.80GHz, 2023 MB RAM
!! It's a Virtual Environment which shares system resources with 8 other VEs.
The database is a PostgreSQL one, with about 7,000 tickets.
I don't know whether these numbers are really significant, but... maybe ;)
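To compare the three runs, the figures above can be turned into rates. This is only a minimal sketch, assuming elapsed wall-clock time is the right denominator; the script and its helper names (seconds, throughput) are illustrative and not part of swish-e:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Figures copied from the three runs above:
# [ label, files indexed, total bytes, elapsed time hh:mm:ss ]
my @runs = (
    [ 'Wikis',           2_091,  16_440_593,    '00:00:08' ],
    [ 'Filesystem',      26_361, 3_559_089_581, '00:35:18' ],
    [ 'Request Tracker', 13_437, 32_635_567,    '00:00:34' ],
);

# Convert hh:mm:ss to a number of seconds.
sub seconds {
    my ( $h, $m, $s ) = split /:/, shift;
    return $h * 3600 + $m * 60 + $s;
}

# Return (files per second, megabytes per second) for one run.
sub throughput {
    my ( $files, $bytes, $elapsed ) = @_;
    my $secs = seconds($elapsed);
    return ( $files / $secs, $bytes / $secs / 1_000_000 );
}

for my $run (@runs) {
    my ( $label, @figures ) = @$run;
    my ( $fps, $mbps ) = throughput(@figures);
    printf "%-16s %8.1f files/s %6.2f MB/s\n", $label, $fps, $mbps;
}
```

Using CPU time instead of elapsed time would give a different picture for the filesystem run, where the two differ a lot (00:35:18 elapsed against 00:08:17 CPU).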
Regards
C. Jeanneret
Jeanneret Internux cjeanneret@internux.ch
Av. des Alpes 123 +41 78 748 03 02
1814 La Tour-de-Peilz +41 21 550 02 09
>
> Walking a file server, with OpenDocument + MS Office + PDF support:
>
> # replace /path/to/files with local mount point
> ReplaceRules regex #/path/to/files/#/mnt/local_mount_point/#ig
>
> IndexOnly .txt .htm .html .pdf .ods .odt .odp .doc .xls .ppt .pps .sxw .sxc .sxg .xml
> IndexDir /path/to/files/
>
> FileRules filename contains /.\$/
> IndexFile /path/to/index/fileserver.swish-e
> MinWordLimit 3
>
>
> # XML files and associated
> FileFilterMatch "/usr/bin/unzip" "-p %p content.xml" /\.(sxw|sxc|sxg|ods|odt|odp)$/i
>
> IndexContents XML* .sxw .sxc .sxg .ods .odt .xml .odp
> StoreDescription XML* <office:body> 20000
>
> IndexContents TXT* .xls .doc .pps .ppt .txt .pdf
> StoreDescription TXT* 20000
>
> IndexContents HTML* .html .htm
> StoreDescription HTML* <body> 20000
>
> # DOC files
> FileFilterMatch "/usr/bin/catdoc" "-b %p | recode -p -q -f ..latin1" /\.(doc)$/i
> # XLS files
> FileFilterMatch "/usr/bin/xls2csv" "-x %p | recode -p -q -f ..latin1" /\.(xls)$/i
> # PPT/PPS
> FileFilterMatch "/usr/bin/catppt" " %p | recode -p -q -f ..latin1" /\.(ppt|pps)$/i
> # PDF files
> FileFilterMatch "/usr/bin/pdftotext" " -q %p - | recode -p -q -f ..latin1" /\.(pdf)$/i
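
The filters above only work if the external programs are installed. A small sketch to check that before a long indexing run; the program paths are the ones from the FileFilterMatch lines, while the script and the missing_programs helper are hypothetical additions (recode is called without a path in the pipelines, so it is not checked here):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Filter programs named in the FileFilterMatch lines of the config.
my @filters = qw(
    /usr/bin/unzip
    /usr/bin/catdoc
    /usr/bin/xls2csv
    /usr/bin/catppt
    /usr/bin/pdftotext
);

# Return the programs from the argument list that are not executable files.
sub missing_programs {
    return grep { !-x $_ } @_;
}

my @missing = missing_programs(@filters);
print @missing ? "Missing: @missing\n" : "All filters found\n";
```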
>
> ___________________________________________________________________________________
>
> RT Spider :
>
> #!/usr/bin/perl -w
> use strict;
>
> use DBI;
> use Compress::Zlib;
> use Time::Local;
> use Locale::Recode;
>
>
> my $dbh = DBI->connect( "dbi:Pg:dbname=rtdb;host=HOST","USER","PASSWORD", { RaiseError => 1 } );
>
> my $sth = $dbh->prepare("select ti.id,ti.subject,at.content,at.created
> from tickets ti, transactions tr, attachments at
> where ti.status <> 'deleted' and tr.objectid=ti.id and at.transactionid=tr.id
> and at.contenttype like 'text/%'
> and (tr.type = 'Comment' or tr.type = 'CommentEmailRecord' or tr.type = 'Create')");
>
> $sth->execute();
>
> while ( my( $id, $title,$ticket,$date) = $sth->fetchrow_array ) {
>
> my $uncompressed = uncompress( $ticket );
> my $unix_date = unixtime( $date );
>
> my $cd = Locale::Recode->new (from => 'UTF-8', to => 'ISO-8859-15');
> $cd->recode($ticket);
>
> my $content = <<EOF;
> <html>
> <head>
> <title>
> RT - $title
> </title>
> <meta http-equiv="content-type" content="text/html;charset=iso-8859-15" />
> </head>
> <body>
> $ticket
> </body>
> </html>
> EOF
>
>
> my $length = length $content;
>
> print <<EOF;
> Content-Length: $length
> Last-Mtime: $unix_date
> Path-Name: http://mydomain.wxt/Ticket/Display.html?id=$id
> Document-Type: HTML
>
> EOF
> print $content;
>
> }
>
> sub unixtime {
> my ( $y, $m, $dh ) = split /-/, shift;
> my ($d, $hms) = split / /, $dh;
> my ($h,$i,$s) = split /:/,$hms;
> return timelocal($s,$i,$h,$d,$m-1,$y-1900);
> };
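
A note on the date conversion: unixtime() above splits the string by hand and uses timelocal. If the database returns fractional seconds, or if the timestamps are stored in UTC (which I believe is the case for RT, but please verify on your install), a slightly more tolerant variant could look like this. The name to_epoch and its $utc flag are mine, only a sketch:

```perl
use strict;
use warnings;
use Time::Local qw(timegm timelocal);

# Parse 'YYYY-MM-DD HH:MM:SS'; anything after the seconds (fractional
# part, time zone suffix) is ignored. Returns nothing if the string
# does not look like a timestamp.
# $utc true  => interpret the value as UTC
# $utc false => interpret the value as local time (as unixtime() does)
sub to_epoch {
    my ( $stamp, $utc ) = @_;
    my ( $y, $m, $d, $h, $i, $s ) =
        $stamp =~ /^(\d{4})-(\d{2})-(\d{2})[ T](\d{2}):(\d{2}):(\d{2})/
        or return;
    return $utc
        ? timegm( $s, $i, $h, $d, $m - 1, $y )
        : timelocal( $s, $i, $h, $d, $m - 1, $y );
}
```

In the spider it would replace the call as `my $unix_date = to_epoch( $date, 1 );`.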
>
> swish.conf :
>
> IndexFile /path/to/rt.swish-e
>
> DefaultContents HTML
> StoreDescription HTML <body> 200000
> MetaNames swishdocpath swishtitle
>
> MinWordLimit 3
>
>
> Command line to run this :
>
> swish-e -c /path/to/config/file/swish.conf -S prog -i /path/to/rt_spider.pl
>
> _______________________________________________________________________
>
> Moin-moin wiki indexer (through filesystem)
>
> #!/usr/bin/perl
> use File::Find;
> use Locale::Recode;
> use strict;
>
> sub wanted {
> return if -d;
> return unless /text_html$/;
>
> my $mtime = (stat)[9];
>
> open( my $fh, '<', $_ ) or die "Cannot open $_: $!";
>
> my $content = '';
> while ( my $l = <$fh> ) {
> chomp($l);
> # join lines with a space so words at line ends do not run together
> $content .= "$l ";
> }
> close $fh;
>
> my $cd = Locale::Recode->new(from => 'UTF-8', to => 'ISO-8859-15');
> $cd->recode($content);
> $content = "<body>$content</body>";
>
> my $size = length $content;
>
> print <<EOF;
> Content-Length: $size
> Last-Mtime: $mtime
> Path-Name: $_
>
> EOF
> print "$content";
> }
>
> find({ wanted => \&wanted, no_chdir => 1, },'.', );
>
>
> swish config file :
>
> IndexFile /path/to/my/indexes/all_wikis.swish-e
>
> DefaultContents HTML*
> StoreDescription HTML* <body> 200000
> ConvertHTMLEntities yes
>
> MinWordLimit 2
>
> ReplaceRules regex !^.*/doc/wikis/!!
> ReplaceRules remove data/
> ReplaceRules remove cache/
> ReplaceRules remove pages/
> ReplaceRules remove /text_html
> ReplaceRules remove /pagelinks
>
> ReplaceRules replace \(2f\) \/
> ReplaceRules replace \(2e\) \.
> ReplaceRules replace \(2d\) \-
>
> ReplaceRules regex /\(([a-z0-9]{2})([a-z0-9]{2})\)/%$1%$2/gi
> ReplaceRules prepend 'http://my.domain.org/'
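
To see what the ReplaceRules chain above does to a path, here is a rough Perl imitation of it. It is only a sketch: it assumes that "remove" and "replace" act on every occurrence of the string, the function name wiki_url is mine, and the sample path (/srv/doc/wikis/mywiki/...) is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply the same steps as the ReplaceRules lines, in the same order.
sub wiki_url {
    my ($path) = @_;

    # ReplaceRules regex !^.*/doc/wikis/!!
    $path =~ s!^.*/doc/wikis/!!;

    # ReplaceRules remove ...
    for my $part ( 'data/', 'cache/', 'pages/', '/text_html', '/pagelinks' ) {
        $path =~ s/\Q$part\E//g;
    }

    # ReplaceRules replace: moin-moin encodes some characters as (hex)
    $path =~ s/\(2f\)/\//g;
    $path =~ s/\(2e\)/./g;
    $path =~ s/\(2d\)/-/g;

    # Remaining two-byte sequences become URL escapes, e.g. (c3a9) -> %c3%a9
    $path =~ s/\(([a-z0-9]{2})([a-z0-9]{2})\)/%$1%$2/gi;

    # ReplaceRules prepend
    return 'http://my.domain.org/' . $path;
}

print wiki_url('/srv/doc/wikis/mywiki/data/pages/Foo(2f)Bar/cache/text_html'), "\n";
```

With the sample path, the page directory Foo(2f)Bar ends up as the sub-page URL mywiki/Foo/Bar under http://my.domain.org/.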
>
>
> Command line :
>
> /path/to/swish_filter/filter.pl | swish-e -c /path/to/swish-wiki.config -i stdin -S prog
>
Received on Fri Apr 18 11:11:32 2008