Lookup (a Perl search engine based on swish-e 1.3.3) includes an HTML
parser module, also in Perl, that extracts the first 300 bytes of text
from an HTML page.
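
To illustrate the idea (this is not Lookup's Perl module, just a minimal
sketch in Python), the text of a page can be collected with the standard
html.parser module and cut to 300 bytes. The names TextExtractor and
abstract are made up for this example, and the choice to skip script and
style content and to treat the input as UTF-8 is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def abstract(html, limit=300):
    """Return at most `limit` bytes (UTF-8) of the page's text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the abstract is one clean line.
    text = " ".join(" ".join(parser.parts).split())
    raw = text.encode("utf-8")[:limit]
    # Drop a multi-byte character that the cut may have split in half.
    return raw.decode("utf-8", errors="ignore")
```

Note that the page title ends up in the abstract as well, since it is
ordinary text to the parser; a real module might want to handle it
separately.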
Lookup extracts the abstract from the file system at search time. Another
approach is to use a modified spider (http://www.lhsc.on.ca/swish-e/)
that stores the abstracts in a GDBM file, which is more efficient.
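
The store-at-index-time approach could look roughly like the sketch
below. It is not the modified spider's code: it uses Python's dbm.dumb
module as a stand-in for GDBM because it is always available, and the
function names and the use of the URL as the key are assumptions made
for this example.

```python
import dbm.dumb


def store_abstracts(db_path, abstracts):
    """Write a {url: abstract} mapping to a dbm file at index time."""
    with dbm.dumb.open(db_path, "c") as db:
        for url, text in abstracts.items():
            db[url.encode("utf-8")] = text.encode("utf-8")


def lookup_abstract(db_path, url):
    """Fetch one abstract at search time; None if the URL is unknown."""
    with dbm.dumb.open(db_path, "r") as db:
        value = db.get(url.encode("utf-8"))
    return value.decode("utf-8") if value is not None else None
```

At search time this costs one keyed lookup per result, with no need to
open and parse the original HTML file.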
Lookup is available at http://bas.antraciet.nl/lookup
>Has anyone done a HTML library for outputting and parsing HTML documents?
>> -----Original Message-----
>> From: firstname.lastname@example.org
>> [mailto:email@example.com]On Behalf Of Luke Ross
>> Sent: Wednesday, 6 December 2000 04:18
>> To: Multiple recipients of list
>> Subject: [SWISH-E] RE: Formatting the output from Swish-E
>> On Sun, 3 Dec 2000, Patrick Dunford wrote:
>> > A third option might be to have PHP parse each returned file and
>> > extract the HTML from the file... haven't looked at this in detail
>> > but theoretically it might be possible.
>> I looked at this, but it was nigh-on impossible for server-parsed and
>> included files :)
>> Luke Ross (Fizzy Razzer) - firstname.lastname@example.org
>> Visit http://lcr.sys3175.co.uk for geek code, other addresses,
>> web page etc.
-- /''' Bas Meijer, Antraciet
c-OO WEB: http://bas.antraciet.nl WAP: http://wmpp.net
\ > Kerkstraat 19 Postbus 256 1400 AG Bussum.NL
\&& tel. +31 35 7502100 fax. +31 35 7502111
Received on Wed Dec 6 10:18:24 2000