Re: Funky, unknown errors

From: Alan Ivey <ai4891(at)>
Date: Fri Jul 02 2004 - 15:16:46 GMT
>> # "DefaultContents should be used over
>> for using the internal spider"
>> # - too bad it makes the results page on swish.cgi
>> terrible! It shows the HTML tags!

> I don't understand that comment.  If you index HTML
> text then it won't parse the html.  DefaultContents 
> just sets the default document type (which is used
> for things like StoreDescription) but swish-e would
> use the HTML parser by default.

That was a note for me. I had that in there without
the quotes so I would know why I wrote it. After
running swish-e, it should show the HTML tags. That
was when I had it as DefaultContents. The rest of it
was the same, I just swapped the word Default for
Index. Question, just to make sure I understand you
correctly: If I were to have it as IndexContents TXT*
html, it would NOT show the HTML tags? Because, as
IndexContents HTML* .html it DOES show the HTML tags
in the search results. I used IndexContents HTML*
html because that was the example I saw in the docs.

Also DefaultContents, in my case, would be good for
specifying non-HTML files that I want displayed as
HTML files, correct? For some reason, I never made a
good distinction in my head between the two, no matter
how many times I read the docs (my fault, not

>> I wrote my own FileFilter. Like I said, I'm not an
>> expert by any means, so the only way I was able to
>> get the results I wanted was to duplicate what I had
>> while testing with the command line. And since you
>> can't do anything like "FileFilter .ppt command |
>> command" (pipe), I wrote a simple bash script to do
>> it. The perl script was written to remove the HTML
>> tags from the ppthtml, since I really just want
>> ppt-to-text. This is my humble attempt at it :)

>All you should have to do is make sure it's parsed as

I did, but as I explained before, with HTML pages and
IndexContents, it was displaying the HTML tags. When I
switched to DefaultContents, it stopped doing that,
but I went ahead and kept the script because ppthtml
had the page's TITLE as the doc path, and it looked
stupid in the results page to have
/var/www/html/file.ppt; which, again, is not a swish-e
problem, but a ppthtml problem.

> Didn't someone post a Powerpoint filter not too long

Well, it works just doing FileFilter ppthtml "'%p -'"
but, I wanted to remove the HTML tags to make it text.
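
For illustration only, the tag-stripping step could be sketched like
this in Python. This is not the Perl script mentioned above, which is
not shown here. The names TextExtractor and html_to_text are made up
for the example, and dropping the TITLE text is an assumption based on
the doc-path complaint.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside title/script/style."""

    # Assumption: the TITLE is dropped because it only holds the file path.
    SKIP = {"title", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.parts).split())


# Hypothetical sample resembling converter output; not real ppthtml output.
sample = (
    "<html><head><title>/var/www/html/file.ppt</title></head>"
    "<body><h1>Slide 1</h1><p>Hello &amp; welcome</p></body></html>"
)
print(html_to_text(sample))
```

A wrapper script could pass the converter's output through
html_to_text before handing the result to the indexer.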

> Regardless, you would just need to look at the
> and see what the character code is.

Forgive me, but I'm not familiar with how to do this.
I understand what the different encoding types are,
but as far as changing them, I'm not sure how to do
that. The time I saw it occur most was for a simple
hyphen. Would that be a problem on my end or on the
website's box?
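
One way to see what the character code is, sketched in Python. The
sample bytes are hypothetical, and it is only an assumption that the
"hyphen" is really a typographic dash saved in a Windows code page.
The helper names and the list of encodings to try are made up for the
example.

```python
import unicodedata


def decode_guess(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (encoding, text) for the first encoding that decodes raw."""
    for enc in encodings:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


def describe_chars(text):
    """List (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# Hypothetical bytes: byte 0x96 is an en dash in Windows-1252,
# but it is not valid on its own in UTF-8.
raw = b"pre\x96post"
encoding, text = decode_guess(raw)
print(encoding, describe_chars(text))
```

If the listing shows a dash other than the plain ASCII hyphen (code
45), the file most likely came from an editor that substituted a
typographic dash, which would point at the source document rather than
the indexing setup.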

Again, thank you so much for taking time to help me!!

Received on Fri Jul 2 08:17:04 2004