>> # "DefaultContents should be used over
>> for using the internal spider"
>> # - too bad it makes the results page on swish.cgi
>> terrible! It shows the HTML tags!
> I don't understand that comment. If you index HTML
> text then it won't parse the html. DefaultContents
> just sets the default document type (which is used for
> things like StoreDescription) but swish-e would
> use the HTML parser by default.
That was a note to myself. I had it in there without
the quotes so I would know why I wrote it. After
running swish-e, it would show the HTML tags. That
was when I had it as DefaultContents. The rest of it
was the same; I just swapped the word Default for
Index. A question, just to make sure I understand you
correctly: if I were to have it as IndexContents TXT*
.html, it would NOT show the HTML tags? Because as
IndexContents HTML* .html it DOES show the HTML tags
in the search results. I used IndexContents HTML*
.html because that was the example I saw in the docs.
Also DefaultContents, in my case, would be good for
specifying non-HTML files that I want displayed as
HTML files, correct? For some reason, I never made a
good distinction in my head between the two, no matter
how many times I read the docs (my fault, not
>> I wrote my own FileFilter. Like I said, I'm not an
>> expert by any means, so the only way I was able to get
>> the results I wanted was to duplicate what I had
>> while testing with the command line. And since you
>> can't do anything like "FileFilter .ppt command |
>> command" (pipe), I wrote a simple bash script to do
>> it. The perl script was written to remove the HTML
>> tags from the ppthtml, since I really just want
>> ppt-to-text. This is my humble attempt at it :)
>All you should have to do is make sure it's parsed as
I did, but as I explained before, with HTML pages and
IndexContents, it was displaying the HTML tags. When I
switched to DefaultContents, it stopped doing that,
but I went ahead and kept the script because ppthtml
had the page's TITLE as the doc path, and it looked
stupid in the results page to have
/var/www/html/file.ppt; which, again, is not a swish-e
problem, but a ppthtml problem.
> Didn't someone post a Powerpoint filter not too long
Well, it works just doing FileFilter ppthtml "'%p -'"
but I wanted to remove the HTML tags to make it text.
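
A minimal sketch of that tag-stripping step, written in Python
rather than Perl and not the actual script discussed here. The
names strip_tags and _TextExtractor and the sample input are made
up for illustration, and dropping the TITLE text is an assumption
based on the doc-path complaint above:

```python
from html.parser import HTMLParser

# Tags whose text content is dropped entirely (assumption: the
# ppthtml TITLE only holds the file path, so it is not wanted).
SKIPPED_TAGS = ("script", "style", "title")


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string, without tags."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flushes any text still buffered
    return parser.text()


sample = (
    "<html><head><title>/var/www/html/file.ppt</title></head>"
    "<body><h1>Slide 1</h1><p>Hello &amp; welcome</p></body></html>"
)
print(strip_tags(sample))
```

To use something like this as a filter, the same function could
be fed from standard input and print to standard output. Note
that text chunks are joined with a space, so a word split by an
inline tag would come out as two words.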
> Regardless, you would just need to look at the
> and see what the character code is.
Forgive me, but I'm not familiar with how to do this.
I understand what the different encoding types are,
but as far as changing them, I'm not sure how to do
that. Where I saw it occur most was with a simple
hyphen. Would that be a problem on my end or on the
Again, thank you so much for taking time to help me!!
Received on Fri Jul 2 08:17:04 2004