php support aka PropertyNamesNoStripChars

From: todd breslow <todd(at)not-real.viatraveldesign.com>
Date: Sat Sep 20 2003 - 01:01:13 GMT
I'm calling the swish-e binary from PHP and I'm surprised at how well
it works. It would be nice for my application to take advantage of the
PropertyNamesNoStripChars directive, but this is not possible when
using the binary. Any update on native PHP support?
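
For readers driving the binary from a script, here is a minimal sketch of parsing its text output. It is written in Perl to match the other code in this thread, but the same line-by-line approach should carry over to PHP. It assumes the default result format (rank, path, quoted title, size); the sample text and the name parse_swish_output are made up for illustration, and in practice the text would come from running something like swish-e -w <words> -f <index>.

```perl
use strict;
use warnings;

# Parse text captured from the swish-e binary's standard output.
# Assumption: default result format of rank, path, "title", size.
sub parse_swish_output {
    my ($text) = @_;
    my @hits;
    for my $line ( split /\n/, $text ) {
        next if $line =~ /^#/;       # header lines start with '#'
        last if $line eq '.';        # a lone '.' ends the output
        last if $line =~ /^err:/;    # error messages are prefixed "err:"
        if ( $line =~ /^(\d+)\s+(.+?)\s+"(.*)"\s+(\d+)$/ ) {
            push @hits, { rank => $1, path => $2, title => $3, size => $4 };
        }
    }
    return @hits;
}

# Made-up sample output, for illustration only.
my $sample = <<'END';
# Search words: affidavit
# Number of hits: 2
1000 http://localhost/affidavit.pdf "Affidavit" 20480
512 http://localhost/index.htm "Home" 1024
.
END

for my $hit ( parse_swish_output($sample) ) {
    print "$hit->{rank}  $hit->{path}  ($hit->{title})\n";
}
```

The parser stops at the first error line rather than raising an exception, so a search with no results simply yields an empty list.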

On Friday, September 19, 2003, at 06:34 PM, swish-e@sunsite.berkeley.edu wrote:

> 			    SWISH-E Digest 1571
>
> Topics covered in this issue include:
>
>   1) Re: Filtering problems
> 	by Bill Moseley <moseley@hank.org>
>   2) Re: index file word-list and fuzzy searching
> 	by Bill Moseley <moseley@hank.org>
>   3) FW: Re: Filtering problems
> 	by "Klingensmith, Rick" <klingensmith@hr.msu.edu>
>
> ----------------------------------------------------------------------
>
> Topic No. 1
>
> Date: Thu, 18 Sep 2003 16:05:16 -0700
> From: Bill Moseley <moseley@hank.org>
> To: Multiple recipients of list <swish-e@sunsite.berkeley.edu>
> Subject: Re: Filtering problems
> Message-ID: <20030918230516.GE30162@hank.org>
>
> On Thu, Sep 18, 2003 at 03:23:31PM -0700, Klingensmith, Rick wrote:
>
>> I've caused myself some more problems with filtering PDF documents, I
>> believe. I've installed the latest Windows install exe on my test
>> server and modified windows_fork in filter.pm. This was to get around
>> a memory issue that started, which we couldn't solve. Now I'm getting
>> the following error message when swish-e tries to index a PDF:
>>
>> retrieving http://35.8.31.67/affidavit.pdf (1)...
>>
>> Can't locate object method "convert" via package "SWISH::Filter" at
>> C:/Swish-E/swishspider line 149.
>
> I already responded to Rick by email, but for the list (and archive):
>
> SWISH::Filter was updated.  Before, to filter a document, you called
>
>    $filtered = $filter->filter(...)
>
> which returned true or false.  But that's not a very Object Oriented
> interface so I added a new method:
>
>    $doc = $filter->convert(...)
>
> which returns an object "$doc".
>
> The programs swishspider and spider.pl were updated to use that new
> interface.
>
> Rick's problem (so I assume) is that he's using a new version of
> swishspider, but an old version of SWISH::Filter.  I assume that
> happened because he's got a "use lib" line in swishspider pointing to
> an old version of SWISH::Filter.
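
The version mismatch described above can be checked without running the spider. The sketch below is only an illustration: Old::Filter is a hypothetical stand-in that has the old method but not the new one, defined inline so the example runs on its own. Against a real installation the same can() test would be applied to SWISH::Filter after loading it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an out-of-date filter module:
# it offers the old filter() method but no convert().
package Old::Filter;
sub new    { return bless {}, shift }
sub filter { return 1 }

package main;

# Report whether a class offers a method, without calling it.
sub supports {
    my ( $class, $method ) = @_;
    return $class->can($method) ? 1 : 0;
}

for my $method (qw(filter convert)) {
    my $status = supports( 'Old::Filter', $method ) ? 'available' : 'missing';
    print "Old::Filter->$method: $status\n";
}

# For a module loaded from disk, %INC records which file was used, e.g.
#   print $INC{'SWISH/Filter.pm'}, "\n";   # after "use SWISH::Filter;"
```

If the path reported by %INC is not the directory the installer wrote to, a stale "use lib" line is a likely suspect.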
>
> But swishspider is an exception in that it doesn't automatically point
> to where SWISH::Filter is installed.  In other words, swishspider
> doesn't use SWISH::Filter by default because (unlike spider.pl)
> swishspider runs for each document spidered.  That would mean loading
> SWISH::Filter (and all the associated filter modules) over and over.
>
> The better solution is to use spider.pl instead of swishspider.
>
> Much of the work in getting 2.4.0 released is getting Windows to
> install (and use) things in their right place.  So perhaps that was
> the problem.
>
> Why doesn't Microsoft follow Apple's lead and replace their OS with
> BSD?
>
> --  
> Bill Moseley
> moseley@hank.org
>
>
> ------------------------------
>
> Topic No. 2
>
> Date: Thu, 18 Sep 2003 16:07:47 -0700
> From: Bill Moseley <moseley@hank.org>
> To: Masoud Pirnazar <amp834@rqinc.com>
> Cc: Multiple recipients of list <swish-e@sunsite.berkeley.edu>
> Subject: Re: index file word-list and fuzzy searching
> Message-ID: <20030918230747.GF30162@hank.org>
>
> On Thu, Sep 18, 2003 at 03:27:51PM -0700, Masoud Pirnazar wrote:
>> two related questions
>>
>> (1)
>> is there a way of searching the list of terms in the index file, e.g.
>> to see that "MyIndex" has the words (apple, pear, watermelon) in it?
>> (treating the "index" as if it was a dictionary or thesaurus)
>
> That's exactly what swish-e does.
>
>> some kind of api such as "start at word >= 'banana'", and "read next"
>> would do it (maybe a "give a count of total # words in the index")
>
> The header of each search will tell you the number of words in the
> index.
>
>   # Total Words: 15209
>   # Total Files: 1252
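
A small sketch of picking those counts out of captured output. It assumes only the "# Name: value" header shape shown in the two lines above; the function name parse_swish_headers is made up for this example.

```perl
use strict;
use warnings;

# Collect "# Name: value" header lines into a hash.
sub parse_swish_headers {
    my ($text) = @_;
    my %header;
    for my $line ( split /\n/, $text ) {
        next unless $line =~ /^\s*#\s*([^:]+):\s*(.*)$/;
        $header{$1} = $2;
    }
    return %header;
}

my %h = parse_swish_headers("# Total Words: 15209\n# Total Files: 1252\n");
print "words=$h{'Total Words'} files=$h{'Total Files'}\n";
```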
>
>> (2)
>> is there any kind of fuzzy searching, e.g. "apple" with one spelling
>> error acceptable, e.g. "appie" would still match.
>
> Yes.
>
>> (3)
>> any support for "near", e.g. "apple" within 3 words of "banana"
>
> No, not yet.
>
>
> --  
> Bill Moseley
> moseley@hank.org
>
>
> ------------------------------
>
> Topic No. 3
>
> Date: Fri, 19 Sep 2003 15:04:28 -0400
> From: "Klingensmith, Rick" <klingensmith@hr.msu.edu>
> To: "'swish-e@sunsite.berkeley.edu'" <swish-e@sunsite.berkeley.edu>
> Subject: FW: Re: Filtering problems
> Message-ID: <47C544CEBAFBA74EB2FEC55410BEFBBB11F968@hrnt2.hr.msu.edu>
>
> OK, I've seen the light and am switching over to use spider.pl. So far
> I've gotten it to use SwishSpiderConfig.pl and point to my local host to
> find 4 URLs (which is correct). However, it is not indexing the output,
> and it's probably an issue with the filter object? Here is the output
> from swish-e using the following command line:
>
> C:\SWISH-E>C:\Swish-E\swish-e -S prog -c C:\Swish-E\conf\siteindexpl.config
>
> Warning: Configuration setting for TmpDir 'C:/Inetpub/Indexes/Temp' will be overridden by environment setting 'C:\DOCUME~1\klingen2\LOCALS~1\Temp'
> Indexing Data Source: "External-Program"
> Indexing "prog-bin/spider.pl"
> External Program found: ./prog-bin/spider.pl
> C:\SWISH-E\prog-bin\spider.pl: Reading parameters from 'SwishSpiderConfig.pl'
>
>  -- Starting to spider: http://localhost/ --
> ?Testing 'test_url' user supplied function #1 'http://localhost/'
> +Passed all 1 tests for 'test_url' user supplied function
> ?Testing 'test_response' user supplied function #1 'http://localhost/'
> +Passed all 1 tests for 'test_response' user supplied function
> ?Testing 'test_url' user supplied function #1 'http://localhost/affidavit.pdf'
> +Passed all 1 tests for 'test_url' user supplied function
> ?Testing 'test_url' user supplied function #1 'http://localhost/TerminationChecklist.pdf'
> +Passed all 1 tests for 'test_url' user supplied function
> ?Testing 'test_url' user supplied function #1 'http://localhost/OEHowTo.pdf'
> +Passed all 1 tests for 'test_url' user supplied function
> ! Found 3 links in http://localhost/index.htm
>
> ?Testing 'filter_content' user supplied function #1 'http://localhost/'
> ?Testing 'test_response' user supplied function #1 'http://localhost/affidavit.pdf'
> +Passed all 1 tests for 'test_response' user supplied function
> ?Testing 'filter_content' user supplied function #1 'http://localhost/affidavit.pdf'
> ?Testing 'test_response' user supplied function #1 'http://localhost/TerminationChecklist.pdf'
> +Passed all 1 tests for 'test_response' user supplied function
> ?Testing 'filter_content' user supplied function #1 'http://localhost/TerminationChecklist.pdf'
> ?Testing 'test_response' user supplied function #1 'http://localhost/OEHowTo.pdf'
> +Passed all 1 tests for 'test_response' user supplied function
>
> Summary for: http://localhost/
> Connection: Keep-Alive: 3  (1.5/sec)
>                Skipped: 4  (2.0/sec)
>            Unique URLs: 4  (2.0/sec)
>
> Removing very common words...
> no words removed.
> Writing main index...
> err: No unique words indexed!
> .
>
> I can understand the TmpDir warning, which is no problem. I'm not sure
> how to get Swish-e to actually build the index. This is my
> siteindexpl.config file:
>
> # Include our site-wide configuration settings:
>
> IncludeConfigFile conf/settings.config
>
> # Specify the program to run
> IndexDir prog-bin/spider.pl
>
>
> # When running under the "prog" document source method you can
> # pass a list of parameters to the program (specified with -i or IndexDir).
>
> # If a parameter is passed to spider.pl, it will use that as the
> # configuration file.
>
> # As a special case, the word "default" followed by URL(s).
> # In this case the spider will use default settings to spider the
> # provided URLs.
>
> # SwishProgParameters default http://35.8.31.67
> # SwishProgParameters default http://www.hr.msu.edu/hrsite
>
> # Note: the default used by spider.pl is SwishSpiderConfig.pl.
> # See prog-bin/SwishSpiderConfig.pl for examples
> # that include filtering PDF and MS Word documents.
>
> # (default /var/tmp)  The location of a writeable temp directory
> # on your system.  The HTTP access method tells the Perl helper to
> # place its files there.  The default is defined in src/config.h and
> # depends on the current OS.
>
> TmpDir C:/Inetpub/Indexes/Temp
>
> # Tell swish how to parse the content
> DefaultContents HTML
> IndexContents HTML .htm .html
> FileFilter .pdf filter-bin/pdf2html
> IndexContents HTML .pdf
>
> IndexComments no
>
> # Just to make it interesting, let's modify the URL that gets indexed:
> # replace http://swish-e.org/ => http://localhost/
>
> # ReplaceRules replace swish-e.org localhost
>
>
>
>
> This is the SwishSpiderConfig.pl file:
>
> #--------------------- Global Config ----------------------------
>
> #  @servers is a list of hashes -- so you can spider more than one site
> #  in one run (or different parts of the same tree)
> #  The main program expects to use this array (@SwishSpiderConfig::servers).
>
>   ### Please do not spider these examples -- spider your own servers, with permission ####
>
> @servers = (
>
>
> #============================================================================
>     # This is a more advanced example that uses more features,
>     # such as ignoring some file extensions, and only indexing
>     # some content-types, plus filters PDF and MS Word docs.
>     # The call-back subroutines are explained a bit more below.
>     {
>         skip        => 0,  # set to 1 to skip spidering this server
>         debug       => DEBUG_INFO,  # print some debugging info to STDERR
>
>       #  debug       => DEBUG_URL,  # print some debugging info to STDERR
>
>
>       #  base_url        => 'http://www.swish-e.org/',
>         base_url        => 'http://localhost/',
>       #  base_url        => 'http://www.hr.msu.edu/hrsite/',
>         email           => 'webmaster@hr.msu.edu',
>         link_tags       => [qw/ a frame /],
>         delay_sec       => 30,        # Delay in seconds between requests
>         max_files       => 50,
>         max_indexed     => 20,        # Max number of files to send to swish for indexing
>
>         max_size        => 1_000_000,  # limit to 1MB file size
>         max_depth       => 10,         # spider only ten levels deep
>         keep_alive      => 1,
>
>         test_url        => sub { $_[0]->path !~ /\.(?:gif|jpeg)$/ },
>
>         test_response   => sub {
>             my $content_type = $_[2]->content_type;
>             my $ok = grep { $_ eq $content_type } qw{ text/html text/plain application/pdf };
>
>             # This might be used if you only wanted to index PDF files, yet still spider.
>             #$_[1]->{no_index} = $content_type ne 'application/pdf';
>
>             return 1 if $ok;
>             print STDERR "$_[0] wrong content type ( $content_type )\n";
>             return;
>         },
>
>         filter_content  => [ \&pdf],
>     },
>
>
>
> );
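
The test_response callback in the config above can be exercised on its own, which may help separate content-type problems from filter problems. This is a sketch under assumptions: Stub::Response is a hypothetical stand-in for the response object that spider.pl passes, and the callback keeps the argument order the config uses (URI, server hash, response).

```perl
use strict;
use warnings;

# Hypothetical stand-in for the response object passed to the callback.
package Stub::Response;
sub new          { my ( $class, $type ) = @_; return bless { type => $type }, $class }
sub content_type { return $_[0]{type} }

package main;

# Content types that should be passed on for indexing.
my %allowed = map { $_ => 1 } qw( text/html text/plain application/pdf );

# Same argument order as in the config: URI, server hash, response.
my $test_response = sub {
    my ( $uri, $server, $response ) = @_;
    return $allowed{ $response->content_type } ? 1 : 0;
};

for my $type (qw( text/html application/pdf image/gif )) {
    my $ok = $test_response->( 'http://localhost/', {}, Stub::Response->new($type) );
    print "$type => ", ( $ok ? 'index' : 'skip' ), "\n";
}
```

A hash lookup replaces the grep over the list; the accepted types are the same three.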
>
> I have not changed the global functions in SwishSpiderConfig.pl from
> the Windows daily build download dated 9/17/2003.
>
> Swish_filter.pl is in the c:/swish-e/filter-bin/ subdirectory, and I've
> also tried moving it to the c:/swish-e directory where I'm executing the
> command to run swish-e, and received the same results.
>
> My filter.pm module is located in the C:/Swish-e/filters/swish/
> subdirectory. I have applied the following lines of code to the
> windows_fork subroutine:
>
> sub windows_fork {
>     my ( $self, @args ) = @_;
>
>
>     require IPC::Open2;
>     my ( $rdrfh, $wtrfh );
>
>     # Added these three lines per instructions from Bill Moseley 7/29/2003
>     my $path = join " ", @args;
>     open FH, "$path|" or die $!;
>     return \*FH;
>
>     my @command = map { s/"/\\"/g; qq["$_"] }  @args;
>
> I get the same results with or without the lines even after moving the
> module to the C:/Swish-e directory.
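
Two things stand out in the windows_fork excerpt above. With the three added lines in place, the sub returns at "return \*FH", so the quoting line after it is never reached. And when that line does run, its s/// edits @args in place, because $_ aliases each element inside map. Below is a minimal sketch of the same quoting step working on copies. The name quote_args and the sample arguments are made up, and no claim is made that backslash-escaping is sufficient for every Windows shell.

```perl
use strict;
use warnings;

# Escape embedded double quotes and wrap each argument in double quotes,
# working on copies so the caller's list is left unchanged.
sub quote_args {
    my @quoted;
    for my $arg (@_) {
        ( my $copy = $arg ) =~ s/"/\\"/g;
        push @quoted, qq["$copy"];
    }
    return @quoted;
}

my @args = ( 'prog', 'two words', 'say "hi"' );
print join( ' ', quote_args(@args) ), "\n";
print "original third argument is still: $args[2]\n";
```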
>
> Where am I going wrong or do I need to give you more information?
>
> Rick
>
>
> ------------------------------
>
> End of SWISH-E Digest 1571
> **************************
Received on Sat Sep 20 01:01:26 2003