Hello,
Thanks for the advice on the security hole in that script. Fortunately, I
couldn't even get the -d option or the modified script to work using the
cut-and-paste method! I think I will advise my hosting provider to offer a
safer script.
Also, I have some ideas I'd like to contribute to the Swish developers on
this list, in case some of these features don't already exist.
For example, suppose an index is created from 100 or so HTML pages, all on
one domain. Imagine it's a company with two floors, or two departments:
about half of the company is on the first floor and the other half on the
second. Every HTML page carries a label with some unique keyword, placed
wherever the website author likes, e.g. in a META tag:
<meta name="department" content="first_floor">
Or within some script tag:
<script language="javascript">
var department = 'first_floor';
</script>
Or it could just be a keyword or phrase contained somewhere in the files.
Would it be possible for Swish to index ONLY those FILES where the spider
finds the keyword? Or the other way around, only those where the keyword does
NOT exist? Forgive me if this can already be done. If not, I suppose it could
easily be done by a shell script that searches the files and adds the
URLs to some Swish configuration file.
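The shell-script route mentioned above might look roughly like the sketch
below. It assumes the pages are available as local *.html files; the function
name list_by_keyword and its arguments are made up for illustration, and how
the resulting list is handed to Swish is left open.

```shell
#!/bin/sh
# Hypothetical helper: list the *.html files under DIR that contain KEYWORD
# (mode "with", the default) or that do not contain it (mode "without").
# Usage: list_by_keyword DIR KEYWORD [with|without]
list_by_keyword() {
    dir=$1
    keyword=$2
    mode=${3:-with}
    find "$dir" -type f -name '*.html' | sort | while IFS= read -r file; do
        # -F treats the keyword as a fixed string, -q suppresses output
        if grep -q -F -- "$keyword" "$file"; then
            found=with
        else
            found=without
        fi
        if [ "$found" = "$mode" ]; then
            printf '%s\n' "$file"
        fi
    done
}
```

Calling it as list_by_keyword ./htdocs first_floor (with ./htdocs standing in
for the real document root) would print one matching path per line.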
Another thing that might be useful would be if the spider could recognize and
ignore any frameset files or, the reverse, index only framesets, as the
administrator prefers. Framesets can be pretty useless from an indexing point
of view because they carry little content, although in some cases they may be
more useful, as the administrator could have more control over the META info
tags if they are well specified. I guess this depends on how a site is built
in the first place and whether it works in a no-frames state. Maybe Swish
could therefore avoid indexing any file containing a string that is always
found in a frameset file, e.g.:
<frame src=
Or just:
</frameset>
Either of the two should do the job, I think (of course, it wouldn't work for
a site that actually contains code examples of frameset files).
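Outside of Swish, the same string test could be sketched in shell as below.
The function names are invented for this example, and the check is only as
reliable as the two strings above: a page that merely quotes frameset markup
would be misclassified.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if FILE contains either of
# the two telltale strings, matched case-insensitively.
is_frameset() {
    grep -q -i -E '<frame[[:space:]]+src=|</frameset>' "$1"
}

# Hypothetical helper: print the *.html files under DIR that are NOT framesets.
list_non_framesets() {
    find "$1" -type f -name '*.html' | sort | while IFS= read -r file; do
        if ! is_frameset "$file"; then
            printf '%s\n' "$file"
        fi
    done
}
```

Dropping the "!" in list_non_framesets would give the reverse behaviour,
listing only the frameset files.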
Another thing that might be useful would be to index, or not index, files
whose filename contains specific characters (not just the suffix), the
filename being anything after the last "/" of the URL. The characters could,
for example, be any uppercase letters, defined as [ABCDEFG...] etc., or
whatever else the administrator specifies in a Swish config option.
I guess that could also be done with an external shell script that traverses
directories and adds any such files to a list of files to be excluded, or
something similar. I think it would be better, though, if it could be done
with Swish indexing commands, if Swish is meant to be a search ENGINE and not
a manually updated index. Maybe other people also use naming structures for
groups of page names, like me.
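As a rough illustration of the external-script variant, the sketch below
lists files whose name after the last "/" contains a character from a given
set. The function name is hypothetical, and the set is passed as the inside
of a bracket expression, e.g. '[:upper:]' for uppercase letters or 'ABC' for
specific ones.

```shell
#!/bin/sh
# Hypothetical helper: list files under DIR whose filename (the part after
# the last "/") contains at least one character from CHARS.
# Usage: list_by_filename_chars DIR CHARS
list_by_filename_chars() {
    # [^/]* cannot cross a "/", so only the last path component is tested
    find "$1" -type f | sort | grep -E "/[^/]*[$2][^/]*\$"
}
```

Using grep -v -E in place of grep -E would list the files that do not match,
which is the form an exclusion list would need.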
Lastly, I'm still searching for a simple front-end. The one I found earlier
was simple, but it couldn't show descriptions and it was insecure. I tried to
implement many others without much success, as most of them seem very hard to
modify, with too many options etc. No offense in case any of you wrote them;
they are probably great for an expert programmer, but too complex for the
average Webmaster to jump into. I just couldn't imagine that such a
sophisticated search tool would be so simple to set up on the command line
whilst the front-end is so difficult (for the non-expert). If there were
simpler and more portable front-end examples to choose from, needing only
Perl 5 and not requiring installation of various non-standard modules or
other libraries that don't exist in a regular hosting situation, my guess is
that there would be a hundred times more Swish installations!
Therefore, I would like to contribute a simple idea for a simple interface
that I think might work well, without too many options, designed more for the
average surfer. Such a script could consist of a plain .HTML form and a .CGI
script. The actual search result could look something like this:
Your search for "blabla" returned "X" number of hits.
1. Some Title Link
Description of the page... - [modification date]
http://www.blabla.com/pagethis.html
2. Another Title
Description of another page.. - [modification date]
http://www.blabla.com/hello.html
3. Yet Another Title
Description Blabla... - [modification date]
http://www.blabla.com/whatever.html
etc....
page: 1 2 3
The script could return the results with the description taken from the
META name="description" content tag, limited to the first 150 characters or
so (this is a usual length on search engines). [modification date] would
just be the date the file was last modified.
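Cutting a description down to that length is a one-liner in shell. The sketch
below assumes the description text has already been extracted from the page;
the function name is made up, and cut -c may count bytes rather than
characters on some systems, which matters for non-ASCII text.

```shell
#!/bin/sh
# Hypothetical helper: print the first MAX characters of TEXT (default 150).
# Usage: truncate_description TEXT [MAX]
truncate_description() {
    printf '%s\n' "$1" | cut -c "1-${2:-150}"
}
```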
Possible features on the HTML form page could include:
Allow the user to choose the number of hits displayed per page from within
the search form, with a drop menu or whatever. This parameter could also be
placed in a hidden form field so that the administrator can fix it.
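The arithmetic behind the "page: 1 2 3" line and the hits-per-page setting is
small enough to sketch here. Both function names are hypothetical; hits are
numbered from 1 and PER_PAGE is assumed to be a positive integer.

```shell
#!/bin/sh
# Hypothetical helper: number of result pages for HITS hits, PER_PAGE per page
# (integer division rounded up).
# Usage: page_count HITS PER_PAGE
page_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical helper: print "FIRST LAST", the range of hits shown on PAGE.
# Usage: page_range HITS PER_PAGE PAGE
page_range() {
    hits=$1
    per_page=$2
    page=$3
    first=$(( (page - 1) * per_page + 1 ))
    last=$(( page * per_page ))
    # the last page may be only partly filled
    if [ "$last" -gt "$hits" ]; then
        last=$hits
    fi
    echo "$first $last"
}
```

With 25 hits at 10 per page, page_count gives 3 pages and the third page
covers hits 21 to 25.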
Another useful feature could be to allow the user to select which index to
search, e.g. first_floor or second_floor, using a drop menu. Modifying
the value of the form input field could simply be done with client-side
JavaScript; the selected value could represent the index to search.
For example, something like this:
<script language="javascript">
function changeInput(object){
  document.chooseIndex.mySelection.value=object.options[object.selectedIndex].value;
}
</script>
<form name="chooseIndex">
  <!-- this input type would be "hidden" and obviously not type="text" -->
  <input type="text" value="SwishIndex1" name="mySelection">
  <select name="selectName" onchange="changeInput(this.form.selectName)">
    <option value="bothFloors.swish">Search both floors
    <option value="secondFloor.swish">Search Second Floor
    <option value="firstFloor.swish">Search first Floor
  </select>
</form>
<noscript>
Do something else in case Javascript is disabled..
</noscript>
I guess the above could be done on the CGI level too, but I just wouldn't
know how to, and who doesn't have a JavaScript browser nowadays anyway!?
Likewise, the search index choice could be fixed to only one option by a
hidden form field, in case the administrator prefers that.
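On the CGI level, one possible approach is to compare the submitted value
against a fixed list and fall back to a default for anything else, so that
form input never reaches the command line unchecked. The sketch below shows
only that idea, in shell; the function name is invented, the index names are
the ones from the form above, and a real CGI script would more likely do the
same check in Perl.

```shell
#!/bin/sh
# Hypothetical helper: print the index file to search for a submitted value.
# Only names listed here are accepted; anything else yields the default.
# Usage: choose_index SUBMITTED_VALUE
choose_index() {
    case $1 in
        bothFloors.swish|firstFloor.swish|secondFloor.swish)
            printf '%s\n' "$1" ;;
        *)
            printf '%s\n' "bothFloors.swish" ;;
    esac
}
```

Because the list lives in the script, the administrator can also pin the
search to a single index simply by leaving one name in the case pattern.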
There would be no need to return details like the file size, I think.
And for a more complex one: I don't know what ranking system Swish uses and
how it could be converted into something else. For example, a GIF
star/relevance system, something like that used on the following search
engine, would be cool!
http://www.irt.org/cgi-bin/htwrap?method=and&format=builtin-long&sort=score&config=htdig&restrict=&exclude=&words=search+forms
Thanks for listening to my ideas!
Regards,
Kati
"Help fight continental drift."
On Monday 05 February 2001 04:59, Bill Moseley wrote:
> At 04:41 PM 02/04/01 -0800, Kati Gäbler wrote:
> >On Sunday 04 February 2001 11:57, Rainer.Scherg@rexroth.de wrote:
> >Only one more detail; I would like some advice on how to include a meta
> >description of the page in the search results. Currently, this is what the
> >result page would look like when using the CGI script as it came:
>
> This isn't the help you are looking for, but this script is really not fit
> for use as a CGI script, and I would avoid using it unless you are running
> on a closed network where you trust everyone. The pipe open is insecure,
> and looks like it's written for Perl 4.
>
> There are some better scripts on the SWISH-E web site, I believe, but I
> haven't looked at them for some time. Take a look at lookup.cgi -- I
> haven't really looked at any of the others (but I would not recommend
> swish-cgi.pl due to the same pipe open problem).
>
> Also, take a look at SWISH and SWISH::Fork on CPAN.
> http://search.cpan.org/doc/HANK/SWISH-0.04/SWISH.pm
> http://search.cpan.org/doc/HANK/SWISH-Fork-0.08/Fork.pm
>
> If that's an interface you can work with let me know and I'll get updated
> versions put up on CPAN (that version has a bug in the timeout feature, and
> the interface to the swish headers and properties has been redesigned to
> work with features of SWISH-E 2.2). But if you are not a perl programmer
> then that might be more work to get up and running.
>
> To answer your question, use the -d switch to specify a delimiter in your
> results, and then split on that:
>
> So (untested and not recommended!)
>
> Instead of this:
> open(SWISH, "$command|");
> do
>
> # WARNING insecure
> open( SWISH, "$command -d :: -p $property_name|")
> or die "Failed to open swish";
>
> And then instead of all this mess:
>
> elsif (/^[0-9]/) {
>     chop;
>     # can't simply split because spaces can exist in title
>     $firstspace = index("$_", "\ ", 0);
>     if ($firstspace == -1) {
>         next;
>     }
>     $secondspace = index("$_", "\ ", ($firstspace+1));
>     if ($secondspace == -1) {
>         next;
>     }
>     $lastspace = rindex("$_", "\ ");
>     if ($lastspace == -1) {
>         next;
>     }
>     $rank = substr($_, 0, $firstspace);
>     $url = substr($_, ($firstspace+1), ($secondspace-$firstspace-1));
>     $title = substr($_, ($secondspace+1), ($lastspace-$secondspace-1));
>     $numbytes = substr($_, ($lastspace+1));
>     print "$rank <a href=\"$url\">$title</a> ($numbytes bytes)<br>\n";
> }
> }
>
> Do something like this:
>
> elsif (/^[0-9]/) {
>     chomp; # this could be right below the while()
>
>     my ( $rank, $url, $title, $numbytes, $property ) = split /::/;
>
>     print qq[$rank <a href="$url">$title</a> ($numbytes bytes)<br>],
>           "<blockquote>$property</blockquote>";
> }
>
> Again, untested and not advised.
>
>
>
> Bill Moseley
> mailto:moseley@hank.org
Received on Mon Feb 5 22:37:57 2001