Hi,
Lookup (a search engine written in Perl, based on swish-e 1.3.3)
includes an HTML parser module (also Perl) that extracts the first
300 bytes of text from an HTML page.
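
For illustration, here is a minimal sketch of that extraction step. It
is written in Python, not Perl, and it is not Lookup's actual module:
the names TextExtractor and extract_abstract are made up for this
example, and skipping <script>, <style> and <title> content is an
assumption about what counts as "text".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data, ignoring script, style and title content."""

    SKIP = {"script", "style", "title"}  # assumption: not part of the abstract

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse all whitespace runs.
        return " ".join(" ".join(self._chunks).split())


def extract_abstract(html, max_bytes=300):
    """Return at most max_bytes bytes of plain text from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = parser.text().encode("utf-8")[:max_bytes]
    # Drop a multi-byte character that the byte cut may have split.
    return raw.decode("utf-8", errors="ignore")
```

Cutting on bytes rather than characters keeps the stored size
predictable, at the cost of possibly losing one trailing character.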
Lookup extracts the abstract from the file system at search time.
Another approach is to use a modified spider
(http://www.lhsc.on.ca/swish-e/) that stores the abstracts in a GDBM
file, which is more efficient.
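
The idea of storing abstracts in a DBM file can be sketched roughly as
follows. This is a Python illustration under stated assumptions, not
the code of the modified spider: the function names are hypothetical,
keys are assumed to be document paths or URLs, and Python's generic
dbm module is used, which picks whatever backend is installed (GDBM
only where available).

```python
import dbm


def store_abstracts(db_path, abstracts):
    """At index time: write {document key: abstract} pairs to a DBM file."""
    with dbm.open(db_path, "c") as db:  # "c" creates the file if missing
        for key, abstract in abstracts.items():
            db[key.encode("utf-8")] = abstract.encode("utf-8")


def lookup_abstract(db_path, key, default=""):
    """At search time: fetch one abstract without re-parsing the HTML."""
    with dbm.open(db_path, "r") as db:
        value = db.get(key.encode("utf-8"))
    return value.decode("utf-8") if value is not None else default
```

With this layout the search script does a single keyed read per hit
instead of opening and parsing every returned file.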
Lookup is available at http://bas.antraciet.nl/lookup
Bas Meijer
>Has anyone done a HTML library for outputting and parsing HTML documents?
>
>> -----Original Message-----
>> From: swish-e@sunsite.berkeley.edu
>> [mailto:swish-e@sunsite.berkeley.edu]On Behalf Of Luke Ross
>> Sent: Wednesday, 6 December 2000 04:18
>> To: Multiple recipients of list
>> Subject: [SWISH-E] RE: Formatting the output from Swish-E
>>
>>
>> Hi
>>
>> On Sun, 3 Dec 2000, Patrick Dunford wrote:
>>
>> > A third option might be to have PHP parse each returned file and
>> > extract the HTML from the file... haven't looked at this in detail
>> > but theoretically it might be possible.
>>
>> I looked at this, but it was nigh-on impossible for server-parsed and
>> included files :)
>>
>> Regards,
>>
>> Luke
>> --
>> Luke Ross (Fizzy Razzer) - lukeross@sys3175.co.uk
>> Visit http://lcr.sys3175.co.uk for geek code, other addresses,
>> web page etc.
>>
>>
--
-- /''' Bas Meijer, Antraciet
c-OO WEB: http://bas.antraciet.nl WAP: http://wmpp.net
\ > Kerkstraat 19 Postbus 256 1400 AG Bussum.NL
\&& tel. +31 35 7502100 fax. +31 35 7502111
Received on Wed Dec 6 10:18:24 2000