Re: Using swish-e with one structured document

From: Michael Peters <mpeters(at)not-real.plusthree.com>
Date: Thu Jul 27 2006 - 14:09:42 GMT
Richard Lewis wrote:
> On Tuesday 25 July 2006 17:44, Richard Lewis wrote:
>> Is it possible to get it to say /where/ in a document it found a result?
>> And, even better, get it to say what the id attribute of the parent element
>> of the matching word was?
>>
> So is this just not possible? Or is it really easy and obvious and have I 
> just missed it in the docs?
> 
> One problem I'll have with splitting the large documents into fragments is the 
> amount of space having lots of small files takes up (i.e. a lot more than the 
> sum of their sizes). Potential solutions include using the XFS filesystem 
> (rather than ext3) or putting them in an ISO image and loopback mounting it.

How large are these files and how many individual files would you create from
them? I can't honestly think of a time when I've been concerned about the disk
space when I break up a document into multiple smaller ones. I'm usually
more concerned about the CPU time spent processing those large files. Finding a
fragment with a given id inside a large document would be much slower than
simply opening the file that holds that fragment, if each fragment were in its
own file.

> The other thing I've just thought of is using the -S prog option when creating 
> the index and using an XPath or possibly XSLT processing tool to extract the  
> document fragments for indexing. This would allow me to index each fragment 
> with its @id attribute.

This is the approach I would take. If something that you want to index does not
fit into swish-e's model very well (in this case 1 document == 1 hit) then
filters are a good place to look. By running your files through a filter first,
you can rearrange them into whatever you want.
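
As a rough illustration of what such a filter has to emit, here is a minimal
Python sketch of one record for the -S prog input. It assumes the header names
from the swish-e documentation (Path-Name, Content-Length, Last-Mtime,
Document-Type); the function name prog_record and the "file#id" naming for
Path-Name are made up for this example, so check the details against your
swish-e version.

```python
import time
from typing import Optional


def prog_record(path_name: str, content: str, mtime: Optional[float] = None) -> bytes:
    """Format one document for swish-e's -S prog input (assumed layout):
    header lines, a blank line, then exactly Content-Length bytes of content."""
    body = content.encode("utf-8")
    stamp = int(mtime if mtime is not None else time.time())
    headers = [
        f"Path-Name: {path_name}",        # what swish-e reports as the hit
        f"Content-Length: {len(body)}",   # length in bytes, not characters
        f"Last-Mtime: {stamp}",           # seconds since the epoch
        "Document-Type: XML*",            # assumption: parse the body as XML
    ]
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body
```

The filter would write one such record per fragment to stdout (for example
with sys.stdout.buffer.write), so each fragment becomes its own "document" and
therefore its own hit.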

I would probably add a custom tag to each chunk you pass to swish-e that
indicates how to find that chunk again in the document it came from (the
filename plus an XPath expression would probably be enough), and then use that
tag in your results.
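
A minimal sketch of that idea in Python, using only the standard library. The
wrapper element and tag names (fragment, source_file, source_xpath) and the
function name are invented for illustration, and it assumes every chunk you
care about is an element with an id attribute. The tags would presumably also
need to be declared in the swish-e config (e.g. via PropertyNames) to show up
in results.

```python
import copy
import xml.etree.ElementTree as ET


def fragments_with_locator(xml_text: str, filename: str):
    """Yield (id, xml_string) for each element that carries an id attribute,
    wrapped together with tags recording where the fragment came from."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        frag_id = elem.get("id")
        if frag_id is None:
            continue
        chunk = copy.deepcopy(elem)
        chunk.tail = None  # drop text that followed the element in its parent
        wrapper = ET.Element("fragment")  # hypothetical wrapper element
        ET.SubElement(wrapper, "source_file").text = filename
        # Simple locator; assumes id values are unique and contain no quotes.
        ET.SubElement(wrapper, "source_xpath").text = f"//*[@id='{frag_id}']"
        wrapper.append(chunk)
        yield frag_id, ET.tostring(wrapper, encoding="unicode")
```

Note that an element with an id nested inside another element with an id will
be emitted twice in this sketch, once on its own and once as part of its
parent, so you may want to pick a single level of the document to split on.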

-- 
Michael Peters
Developer
Plus Three, LP
Received on Thu Jul 27 07:09:42 2006