Re: returning HTML code

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Tue May 31 2005 - 21:10:49 GMT
Nicholas W. Miller scribbled on 5/31/05 4:05 PM:

> Peter,
> 
> How do you "pre-convert" your tags?  Is this a command in your config  
> file?
> 

no, it's not a config option. You'd have to use -S prog with a filtering script 
that processes your html before swish-e sees it (spider.pl is one example of a 
-S prog program). NOTE that such an approach would likely have other, less 
desirable consequences: the words would be indexed from the "filtered" html 
rather than from your "raw" html.

example:

instead of indexing this:

<tag>foo</tag>

as 'foo'

it would be indexed as the single word:

<tag>foo</tag>

(assuming you had <>/ as WordCharacters).

so searching for 'foo' would fail in the example above, because swish-e would 
treat it as all one word.
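
A rough sketch of the kind of filtering script meant above, in Python rather than Perl. It is only an illustration under assumptions, not a tested swish-e recipe: the script name escape_html.py and the helpers wrap_escaped() and prog_record() are made up for this example, and the header names (Path-Name, Content-Length, Document-Type) are what I understand the -S prog input format to expect, so check them against the docs for your version. It escapes <, > and & in each file and wraps the result in a minimal html page, so the stored description comes back with the markup intact.

```python
#!/usr/bin/env python3
"""Sketch of a -S prog filter: escape raw HTML so the markup itself is stored."""
import html
import sys
from pathlib import Path


def wrap_escaped(raw_html: str) -> str:
    """Escape <, > and & and wrap the result in a minimal HTML page."""
    escaped = html.escape(raw_html, quote=False)
    return "<html>\n<body>\n" + escaped + "\n</body>\n</html>\n"


def prog_record(path: str, raw_html: str) -> bytes:
    """Build one document record: headers, a blank line, then the content."""
    body = wrap_escaped(raw_html).encode("utf-8")
    headers = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"  # length in bytes, not characters
        "Document-Type: HTML2\n"  # assumed parser name, matching the config in this thread
        "\n"
    )
    return headers.encode("utf-8") + body


if __name__ == "__main__":
    # Each command-line argument is treated as an HTML file to filter.
    for name in sys.argv[1:]:
        raw = Path(name).read_text(encoding="utf-8", errors="replace")
        sys.stdout.buffer.write(prog_record(name, raw))
```

You would then point swish-e at the script instead of at the files, something like "swish-e -S prog -i ./escape_html.py -c c", with the file names handed to the script through SwishProgParameters in the config. The caveat above still applies: with <>/ as WordCharacters, '<tag>foo</tag>' is indexed as one word.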


> Thanks,
> 
> Nick
> 
> On May 31, 2005, at 9:53 AM, Peter Karman wrote:
> 
>> no.
>>
>> swish-e ignores tags and converts entities when saving properties.
>>
>> you could, however, pre-convert your tags and & in order to abuse that
>> feature:
>>
>> karman@topaz08 17% swish-e -i foo.html -c c -v3
>> Parsing config file 'c'
>> Indexing Data Source: "File-System"
>> Indexing "foo.html"
>>
>> Checking file "foo.html"...
>>    foo.html - Using HTML2 parser -  (4 words)
>>
>> Removing very common words...
>> no words removed.
>> Writing main index...
>> Sorting words ...
>> Sorting 3 words alphabetically
>> Writing header ...
>> Writing index entries ...
>>    Writing word text: Complete
>>    Writing word hash: Complete
>>    Writing word data: Complete
>> 3 unique words indexed.
>> 5 properties sorted.
>> 1 file indexed.  61 total bytes.  4 total words.
>> Elapsed time: 00:00:00 CPU time: 00:00:00
>> Indexing done!
>> karman@topaz08 18% swish-e -w test -p swishdescription
>> # SWISH format: 2.4.3
>> # Search words: test
>> # Removed stopwords:
>> # Number of hits: 1
>> # Search time: 0.000 seconds
>> # Run time: 0.020 seconds
>> 1000 foo.html "foo.html" 61 "<bar>my test</bar>"
>> .
>> karman@topaz08 19% cat c
>> StoreDescription HTML2 <body>
>> IndexContents HTML2 .html
>> karman@topaz08 20% cat foo.html
>> <html>
>> <body>
>> &lt;bar&gt;my test&lt;/bar&gt;
>> </body>
>> </html>
>> karman@topaz08 21%
>>
>>
>> Nicholas W. Miller wrote on 05/31/2005 10:58 AM:
>>
>>> Hello,
>>>
>>> Is it possible to configure swish-e to return a page's HTML code
>>> instead of just the page's visible text?
>>>
>>> Thanks,
>>>
>>> Nick
>>>
>>
>> -- 
>>   Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
>>

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Tue May 31 14:10:49 2005