
Re: not ignoring content (leave those files alone!)

From: Linda W. (that's swishey, not squishey!) <swishey(at)not-real.tlinx.org>
Date: Sun Jun 11 2006 - 20:55:27 GMT
Bill Moseley wrote:
> Now, most people spider their sites, and the spider can look at the
> Content-Type header to determine what to index.
---
	I'm trying to index local file systems, not a website.
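	(For reference, what I'm after is the plain file-system mode, which
the docs say is the default (-S fs).  Something along these lines, where the
path and index name are just placeholders:

    # swish.conf -- minimal file-system indexing
    IndexDir   /srv/docs
    IndexFile  ./index.swish-e

	and then run "swish-e -c swish.conf".)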


> What gets filtered depends on what you might have installed.  IIRC,
> xpdf and catdoc are included in the windows build, whereas building from
> source you have to install those separately.  So, if you use the
> spider you will likely not have all these problems.
---
	spider seems designed for a website, not a local file system.
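	(Although, if I read the config docs right, the plain file-system
mode can still run filters per suffix with FileFilter, assuming the
converters are installed -- e.g.:

    FileFilter  .pdf  pdftotext  "'%p' -"
    FileFilter  .doc  catdoc     "'%p'"

	where %p is the path of the file being indexed.  The pdftotext line
is from the docs; the catdoc one is my guess at the equivalent.)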


> The DirTree.pl program that's included with the distribution makes
> use of SWISH::Filter.  It simply scans the file system (like the
> default mode of swish), but it will filter based on mime type just
> like spidering.  So, that may be much easier if you want to scan the
> file system instead of spider a web site.
----
	The problem, I think as you mentioned, was that "NoContents" will
still look through binary files to find a title (or content type).  It
seems to take a long time on large binary files.  In the one directory
I have scanned so far, it took 5 minutes just to plow through one 2.2 MB file.
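	If I'm reading the config docs right, one way to keep that scan away
from the big binaries is to be explicit about which suffixes get parsed at
all, e.g. (the suffix lists here are just placeholders):

    IndexOnly        .htm .html .txt .pdf
    IndexContents    HTML* .htm .html
    DefaultContents  TXT*
    NoContents       .gif .jpg .iso

	so anything not listed in IndexOnly is skipped outright, and the
NoContents suffixes should only contribute their file names.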

	Maybe NoContents would be better named "FileMetaOnly" -- would that
make it clearer that the file may still be scanned by the default scanner for
HTML tags?


> 
> perldoc DirTree.pl for some details, but it's not a very complex
> script.
---
	It's my first shot at some of this, and I wanted to try the simpler
(though less efficient) methods first, to verify and get comfortable with the
basic functions.
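	(My understanding from the docs is that DirTree.pl runs in prog mode,
roughly:

    swish-e -c swish.conf -S prog -i ./DirTree.pl

	with the directory to scan handed to the script -- via
SwishProgParameters in the config, if I have that right.  I still need to
read the perldoc to confirm.)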


> If you want the details of SWISH::Filter see:
> 
>     http://swish-e.org/docs/filter.html
> 
> The INSTALL doc has examples of indexing, and one is spidering.
> Might save yourself a lot of time if you follow those instructions.
----
	I looked through all of them before writing my first conf file.
At some point the nomenclature for conf files switches to "config", which I
found a bit confusing.  With the main config file being referred to as
swish.conf, I straight away looked for *.conf, and didn't pick up the examples
in the config dir until a later pass...


> http://swish-e.org/docs/install.html#spidering_and_searching_with_a_web_form_
> 
> My only comment is *I* probably would not use the swish.cgi script.
> It's a bit bloated with features.  I think it's easier to just write
> a simple search script -- maybe use the search.cgi script for ideas.
---
	"Teaching" scripts don't have to be the most efficient, though
efficient examples of "best practices", are certainly a great aide.
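	For a start, a plain command-line query against the index would
probably do for me, something like:

    swish-e -f ./index.swish-e -w 'linux AND kernel' -m 10

	with a thin CGI wrapper (or SWISH::API) later if needed.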


	Seems like it has been a while since the last release.  Is that
expected to remain the case in the near future, or do you think there will be
more frequent releases coming up?

thanks,
linda
Received on Sun Jun 11 13:55:29 2006