Re: Q: Swish-E foreign language character support

From: Kati Gäbler <katigaebler(at)not-real.topmail.de>
Date: Mon Feb 05 2001 - 22:34:17 GMT
Hello,

Thanks for the advice on the latest security hole in that script. Fortunately,
I couldn't even make the -d option or the modified script work using the
cut-n-paste method! I think I will advise my hosting provider to offer a
safer script.

Also, I have some ideas I'd like to contribute to the Swish developers on
this list, in case some of the features don't already exist.

For example, suppose an index has been created from 100 HTML pages or so, all
from one domain. Imagine it's a company with two floors, or two departments.
About half of the company is on the first floor, and the other half on the
second floor. Every HTML page carries a label with some unique keyword,
wherever the website author feels like placing it, e.g. in some META tag:

<meta name="department" content="first_floor">

Or within some script tag:

<script language="javascript">
var department = 'first_floor';
</script>

Or it could just be a keyword or phrase contained somewhere in the files.
Would it be possible for Swish to index ONLY those FILES where the spider
finds the keyword? Or, the other way around, only those where the keyword
does NOT exist?

Forgive me if it can already be done. If not, I suppose it could easily be
done by some shell script that searches the files and adds the matching
URLs to some Swish configuration file.
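
A minimal sketch of that filtering step, written in JavaScript only because
that is the language used elsewhere in this message. The function name and the
{ url, html } page objects are made up for illustration; nothing here is part
of Swish, and how the resulting URL list is handed to the indexer is left open.

```javascript
// Hypothetical helper: returns the URLs of pages whose HTML contains the
// keyword, or (with invert = true) the URLs of pages that do NOT contain it.
function selectPagesByKeyword(pages, keyword, invert = false) {
  const needle = keyword.toLowerCase();
  return pages
    .filter((page) => page.html.toLowerCase().includes(needle) !== invert)
    .map((page) => page.url);
}

// Example input, using the META label from above.
const samplePages = [
  { url: "http://www.blabla.com/pagethis.html",
    html: '<meta name="department" content="first_floor">' },
  { url: "http://www.blabla.com/hello.html",
    html: '<meta name="department" content="second_floor">' },
];

// One URL per line, ready to be pasted into whatever file list is used.
console.log(selectPagesByKeyword(samplePages, "first_floor").join("\n"));
```

Calling it with the third argument set to true gives the opposite list, i.e.
the pages where the keyword does not appear.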

Another thing that might be useful would be if the spider could recognize and
ignore any frameset files or, the reverse, index only framesets, as the
administrator prefers. Framesets can be pretty useless from an indexing point
of view because they carry little content. In some cases they may be more
useful, as the administrator could have more control over the content of the
META info tags, if it's well specified. I guess this depends on how a site is
built in the first place and whether it is workable in a no-frames state.
Maybe Swish could therefore avoid indexing any files containing part of a
string that is always found in a frameset file, e.g.:

<frame src=

Or just:

</frameset>

Either of the two should do the job, I think (of course, it wouldn't work for
a site that actually contains code examples of frameset files).
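
To make the idea concrete, here is a rough sketch of such a check. The function
names and the "skip" / "only" mode values are invented for this example; it
simply looks for the two strings mentioned above and is not how Swish works
today.

```javascript
// Matches either of the two marker strings: "<frame src=" or "</frameset>".
const FRAMESET_MARKER = /<frame\s+src\s*=|<\/frameset>/i;

// Hypothetical check: true if the HTML looks like a frameset file.
function looksLikeFrameset(html) {
  return FRAMESET_MARKER.test(html);
}

// mode "only" keeps just the framesets; any other mode (e.g. "skip") drops them.
function filterFramesets(pages, mode) {
  const wantFramesets = mode === "only";
  return pages.filter((page) => looksLikeFrameset(page.html) === wantFramesets);
}

const examplePages = [
  { url: "index.html", html: '<frameset cols="20%,80%"><frame src="menu.html"></frameset>' },
  { url: "menu.html", html: "<body><a href=\"hello.html\">Hello</a></body>" },
];
console.log(filterFramesets(examplePages, "skip").map((page) => page.url));
```

A page that shows frameset code with the brackets escaped (&lt;frame src=)
would not be matched, since the check needs a literal "<", but a page with
unescaped example code would still be caught.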

Another thing that might be useful would be to index, or not index, files
based on specific characters in the filename only (not just the suffix),
meaning anything after the last "/" of the URL. It could for example be any
uppercase letters, defined as [ABCDEFG...] etc., or whatever else the
administrator specifies in a Swish config option.

I guess that could also be done with an external shell script that traverses
directories and adds any such files to the list of files to be excluded or
something. Although I think it would be better if it could be done with Swish
indexing commands, if Swish is meant to be a search ENGINE and not a manually
updated index. Maybe other people also use naming structures for groups of
page names, like me.
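
As an illustration of the filename rule, the following sketch tests only the
part of each URL after the last "/". The helper names are hypothetical and the
pattern is whatever the administrator would choose; here it is "contains an
uppercase letter".

```javascript
// Returns everything after the last "/" of the URL.
function fileNameOf(url) {
  return url.slice(url.lastIndexOf("/") + 1);
}

// Hypothetical filter: keeps URLs whose filename matches the pattern, or
// (with keepMatches = false) those whose filename does not match.
// The pattern should not use the "g" flag, which makes test() stateful.
function filterByFileName(urls, pattern, keepMatches = true) {
  return urls.filter((url) => pattern.test(fileNameOf(url)) === keepMatches);
}

const urls = [
  "http://www.blabla.com/Docs/pagethis.html",
  "http://www.blabla.com/docs/PageThis.html",
];
// Uppercase letters in a directory name are ignored; only the filename counts.
console.log(filterByFileName(urls, /[A-Z]/));
```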

Lastly, I'm still searching for a simple front-end. The one I found earlier
was simple, but it could not show descriptions and it was insecure. I tried
to implement many others without much success, as most of them seem very hard
to modify, with too many options etc. No offense in case any of you wrote
them; they are probably great for an expert programmer, but too complex for
the average Webmaster to jump into. I just couldn't imagine that such a
sophisticated search tool would be so simple to set up on the command line
whilst the front-end is so difficult (for the non-expert). If there were
simpler and more portable front-end examples to choose from, needing only
perl 5 and not requiring installation of various non-standard modules or other
libraries that don't exist in the regular hosting situation, my guess is
that Swish would have a hundred times more installations!

Therefore, I would like to contribute a simple idea for a simple interface
that I think might work well, without too many options, designed more for the
average surfer. Such a script could consist of a plain .HTML form and a .CGI
script. The actual search result could look something like this:

Your search for "blabla" returned "X" number of hits.

1. Some Title Link
   Description of the page... - [modification date]
   http://www.blabla.com/pagethis.html
  
2. Another Title
   Description of another page.. - [modification date]
   http://www.blabla.com/hello.html
 
3. Yet Another Title
   Description Blabla... - [modification date]
   http://www.blabla.com/whatever.html
 
   etc....

   page: 1 2 3

The script could return each result with the description taken from the
META name="description" content tag, cut to the first 150 characters or so
(a common length on search engines).

[modification date] would just be the last date the file was modified.
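
Here is a small sketch of how one hit could be formatted that way. The field
names (title, description, modified, url) are assumptions made for the example
and not the names Swish uses in its output.

```javascript
// Cuts a description to maxLength characters and marks the cut with "...".
function truncateDescription(text, maxLength = 150) {
  if (text.length <= maxLength) return text;
  return text.slice(0, maxLength).trimEnd() + "...";
}

// Formats one hit as the three-line layout shown in the mock-up above.
function formatHit(position, hit) {
  return [
    `${position}. ${hit.title}`,
    `   ${truncateDescription(hit.description)} - [${hit.modified}]`,
    `   ${hit.url}`,
  ].join("\n");
}

console.log(formatHit(1, {
  title: "Some Title Link",
  description: "Description of the page",
  modified: "2001-02-05",
  url: "http://www.blabla.com/pagethis.html",
}));
```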

Possible features on the HTML form page could include:

Allow the user to choose how many hits ("X") to display per page from within
the search form, via a drop menu or whatever. But this parameter could also be
placed in a hidden form tag so that the administrator can fix it.
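
The paging itself could be as simple as the following sketch. The default of
10 hits and the upper limit of 50 are arbitrary values picked for the example;
the limits are there because a form value, hidden or not, can be changed by the
user.

```javascript
// Hypothetical paging helper: returns the hits for one page plus the page
// numbers needed for a "page: 1 2 3" line. Form values arrive as strings, so
// both numbers are parsed and clamped to sane ranges.
function paginate(hits, page, perPage) {
  const size = Math.min(Math.max(parseInt(perPage, 10) || 10, 1), 50);
  const pageCount = Math.max(Math.ceil(hits.length / size), 1);
  const current = Math.min(Math.max(parseInt(page, 10) || 1, 1), pageCount);
  const start = (current - 1) * size;
  return { current, pageCount, hits: hits.slice(start, start + size) };
}

// 25 made-up hits, 10 per page, asking for page 3.
const allHits = Array.from({ length: 25 }, (_, i) => `hit ${i + 1}`);
const result = paginate(allHits, "3", "10");
console.log(result.hits.length, "hits on page", result.current, "of", result.pageCount);
```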

Another useful feature could be to allow the user to select which index to
search, e.g. first_floor or second_floor, using a drop menu. Modifying
the value of the form input fields could simply be done with client-side
javascript; the input select value could represent the index to search.

For example something like this:

<script language="javascript">
function changeInput(object){
document.chooseIndex.mySelection.value=object.options[object.selectedIndex].value;
}
</script>

<form name="chooseIndex">

<!-- this input type would be "hidden" and obviously not type="text" -->
<input type="text" value="SwishIndex1" name="mySelection">

<select name="selectName" onchange="changeInput(this.form.selectName)">
<option value="bothFloors.swish">Search both floors
<option value="secondFloor.swish">Search Second Floor
<option value="firstFloor.swish">Search first Floor
</select>
</form>

<noscript>
Do something else in case Javascript is disabled..
</noscript>

I guess the above could be done on the CGI level too, but I just wouldn't 
know how to, and who doesn't have Javascript browsers nowadays anyway!?

Likewise, the search index choice could be fixed only to one option, by a 
hidden form field in case the administrator prefers that.
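
For the server side, one cautious way to handle the chosen index might look
like the sketch below. It is only an illustration: the short keys ("both",
"first", "second") are invented, and the point is simply that the form sends a
short name which the script looks up in a fixed list, rather than sending a
filename that gets passed straight on to the search command.

```javascript
// Fixed list of indexes the script is willing to search (names from the form above).
const ALLOWED_INDEXES = {
  both: "bothFloors.swish",
  first: "firstFloor.swish",
  second: "secondFloor.swish",
};

// Hypothetical lookup: unknown or tampered values fall back to the default index.
function resolveIndex(choice, fallback = "both") {
  const known = Object.prototype.hasOwnProperty.call(ALLOWED_INDEXES, choice);
  return known ? ALLOWED_INDEXES[choice] : ALLOWED_INDEXES[fallback];
}

console.log(resolveIndex("first"));
console.log(resolveIndex("../../etc/passwd"));
```

If the administrator wants to fix the search to one index, the script can
ignore the form value and always use the same key.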

There would be no need to return details like the file size, I think.

And for a more complex one: I don't know what ranking system Swish uses or
how it could be converted into something else, but a GIF star/relevance
system, something like the one used on the following search engine, would be
cool!

http://www.irt.org/cgi-bin/htwrap?method=and&format=builtin-long&sort=score&config=htdig&restrict=&exclude=&words=search+forms
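
Just to show the kind of conversion I mean, here is a sketch that turns a rank
into one to five stars. It assumes that rank values run from 0 to 1000, which I
have not checked against Swish, so the maxRank value is a guess that would need
adjusting. Text stars stand in for the GIF images.

```javascript
// Hypothetical conversion: maps a rank in 0..maxRank onto 1..maxStars stars.
function rankToStars(rank, maxRank = 1000, maxStars = 5) {
  const clamped = Math.min(Math.max(rank, 0), maxRank);
  return Math.max(1, Math.round((clamped / maxRank) * maxStars));
}

// Text stand-in for the star images: filled stars first, then placeholders.
function starString(rank) {
  const stars = rankToStars(rank);
  return "*".repeat(stars) + "-".repeat(5 - stars);
}

console.log(starString(1000), starString(600), starString(10));
```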

Thanks for listening to my ideas!

Regards,
Kati

"Help fight continental drift."

On Monday 05 February 2001 04:59, Bill Moseley wrote:
> At 04:41 PM 02/04/01 -0800, Kati Gäbler wrote:
> >On Sunday 04 February 2001 11:57, Rainer.Scherg@rexroth.de wrote:
> >Only one more detail; I would like some advice on how to include a meta
> >description of the page in the search results. Currently, this is what the
> >result page would look like when using the CGI script as it came:
>
> This isn't the help you are looking for, but this script is really not fit
> for use as a CGI script, and I would avoid using it unless you are running
> on a closed network where you trust everyone.  The pipe open is insecure,
> and looks like it's written for Perl 4.
>
> There are some better scripts on the SWISH-E web site, I believe, but I
> haven't looked at them for some time.  Take a look at lookup.cgi -- I
> haven't really looked at any of the others (but I would not recommend
> swish-cgi.pl due to the same pipe open problem).
>
> Also, take a look at SWISH and SWISH::Fork on CPAN.
> http://search.cpan.org/doc/HANK/SWISH-0.04/SWISH.pm
> http://search.cpan.org/doc/HANK/SWISH-Fork-0.08/Fork.pm
>
> If that's an interface you can work with let me know and I'll get updated
> versions put up on CPAN (that version has a bug in the timeout feature, and
> the interface to the swish headers and properties has been redesigned to
> work with features of SWISH-E 2.2).  But if you are not a perl programmer
> then that might be more work to get up and running.
>
> To answer your question, use the -d switch to specify a delimiter in your
> results, and then split on that:
>
> So (untested and not recommended!)
>
> Instead of this:
>   open(SWISH, "$command|");
> do
>
>   # WARNING insecure
>   open( SWISH, "$command -d :: -p $property_name|")
> 	or die "Failed to open swish";
>
> And then instead of all this mess:
>
>   elsif (/^[0-9]/) {
>     chop;
>     # can't simply split because spaces can exist in title
>     $firstspace = index("$_", "\ ", 0);
>     if ($firstspace == -1) {
>       next;
>     }
>     $secondspace = index("$_", "\ ", ($firstspace+1));
>     if ($secondspace == -1) {
>       next;
>     }
>     $lastspace = rindex("$_", "\ ");
>     if ($lastspace == -1) {
>       next;
>     }
>     $rank = substr($_, 0, $firstspace);
>     $url = substr($_, ($firstspace+1), ($secondspace-$firstspace-1));
>     $title = substr($_, ($secondspace+1), ($lastspace-$secondspace-1));
>     $numbytes = substr($_, ($lastspace+1));
>     print "$rank <a href=\"$url\">$title</a> ($numbytes bytes)<br>\n";
>   }
> }
>
> Do something like this:
>
>   elsif (/^[0-9]/) {
>       chomp;  # this could be right below the while()
>
>       my ( $rank, $url, $title, $numbytes, $property ) = split /::/;
>
>       print qq[$rank <a href="$url">$title</a> ($numbytes bytes)<br>],
>             "<blockquote>$property</blockquote>";
>   }
>
> Again, untested and not advised.
>
>
>
> Bill Moseley
> mailto:moseley@hank.org
Received on Mon Feb 5 22:37:57 2001