Re: [swish-e] command line Swish - returning search results

From: at <Peter>
Date: Mon, 03 Oct 2011 11:04:21 +0100
On 14/09/11 04:25, Klaus Gruen wrote:
> Thanks, but we do have a single denormalized file.  I want to return the
> records, or fields within the records, that pattern matches with the
> keyword/wildcards used... to that end how do you suggest we structure
> the XML file then, we have 30M records, having 1 file per record doesn't
> make sense.

If we are talking about XML, you need to use XML terminology. XML
doesn't have "records" or "fields", it has elements.

Are the files that you index all in XML and all the same document type?

>> Also, for the xml structure, swish-e is pretty flexible in that
>> you can define your own XML, I assume.... Hoping something like
>> this works (let me know if it wouldn't work):
>>
>> <record>
>>    <field1>Cisco Systems
>>    </field1>
>>    <field2>1000 Tasman Drive
>>    </field2>
>> </record>
>> <record>
>>    <field1>Microsoft
>>    </field1>
>>    <field2>200 Microsoft Ave.
>>    </field2>
>> </record>
> 
> Note that you'll want to have one <record></record> set per file in
> order for swish-e to return meaningful results.

XML requires a single top-level element, so you may want to wrap the
output in something like

<results>
  <company>
    <name>Cisco Systems</name>
    <address>1000 Tasman Drive</address>
  </company>
  <company>
    <name>Microsoft</name>
    <address>200 Microsoft Ave.</address>
  </company>
</results>

Keep leading and trailing newlines out of the element values, and if
possible give the elements meaningful names. Once the output is
well-formed XML, you can post-process it with a number of XML tools. We
use lxprintf (part of the LTxml2 toolkit from
http://www.ltg.ed.ac.uk/software/ltxml2) and XSLT2 (in Cocoon) to
post-process Swish-e output.
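
The wrapping and trimming described above could be sketched roughly as
follows, assuming Python and its standard xml.etree.ElementTree module.
The function name build_results is hypothetical, and the element names
simply follow the example above; this is not part of Swish-e.

```python
import xml.etree.ElementTree as ET


def build_results(rows):
    """Wrap (name, address) pairs in a single <results> root element.

    Hypothetical helper: the element names follow the example above.
    """
    root = ET.Element("results")
    for name, address in rows:
        company = ET.SubElement(root, "company")
        # strip() removes leading/trailing whitespace, including newlines
        ET.SubElement(company, "name").text = name.strip()
        ET.SubElement(company, "address").text = address.strip()
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")


print(build_results([
    ("Cisco Systems\n", "1000 Tasman Drive\n"),
    ("Microsoft\n", "200 Microsoft Ave.\n"),
]))
```

Building the tree with a library rather than concatenating strings also
takes care of escaping characters such as & and < in the values.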

The real problem is getting Swish-e to return whole elements in its
results. I haven't cracked this one yet. Meta names let you restrict the
search, but the output is still only one line per file. At the moment, a
result like:

160
/var/www/xml/profiles/docs/researchprofiles/X999/pflynn/Publications.xml
"Publications.xml" 56151

means *each hit* has to be visited, the file opened, and something like
lxprintf used to extract all the <article> elements (or whatever I
specify) containing the hits (and there may be many per file). This is
slow. I thought it might be solved by asking Swish-e to use
StoreDescription, but this appears to strip the markup, so it cannot be
used to control the output context.
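
To make the per-hit cost concrete, here is a minimal sketch of that
visit-and-extract loop in Python rather than lxprintf. It assumes each
hit arrives on one line as rank, path, quoted title and size (the
example above appears to be wrapped by the mail client), and that a
plain substring test is an acceptable stand-in for the real match. All
function names are hypothetical.

```python
import shlex
import xml.etree.ElementTree as ET


def parse_result_line(line):
    """Split one 'rank path "title" size' result line into its parts.

    Assumes the path contains no unquoted spaces.
    """
    rank, path, title, size = shlex.split(line)
    return int(rank), path, title, int(size)


def elements_containing(root, tag, term):
    """Return the serialized <tag> elements under root that mention term.

    This is a plain case-insensitive substring test, not a
    reimplementation of Swish-e's own matching.
    """
    needle = term.lower()
    found = []
    for elem in root.iter(tag):
        text = "".join(elem.itertext())
        if needle in text.lower():
            # strip() drops whitespace that followed the element in the file
            found.append(ET.tostring(elem, encoding="unicode").strip())
    return found


def extract_hits(result_lines, tag, term):
    """Visit every file named in the results and collect matching elements."""
    for line in result_lines:
        rank, path, _title, _size = parse_result_line(line)
        root = ET.parse(path).getroot()  # one full parse per hit: the slow part
        for fragment in elements_containing(root, tag, term):
            yield rank, path, fragment
```

Every hit costs a full parse of its file, which is exactly why this
approach does not scale to large result sets.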

> I wonder if an RDBMS is better suited to what you're trying to do than a
> full-text indexer is. With an RDBMS you get transactions and incremental
> additions/deletions, and the speed of something like SQLite is decent in
> comparison to a full-text search.

No, if you want full control over searching an XML document, you need to
use an XML search engine like eXist. An rDBMS is for rectangular data
like spreadsheets, and if your XML is in that format it might help, but
if your XML consists of normal text documents, then an rDBMS is of very
limited value. Beware of rDBMSs claiming to be "XML-ready" and similar
marketing puff: most of them just store the XML as a blob. Even the ones
that check for well-formedness on input and output don't tend to give
you access to the inner element structure during searching unless they
provide a real XML engine -- in which case you're probably as well off
using an XML indexing and search engine to start with.

We covered a lot of this at last year's XML Summerschool (see
http://xmlsummerschool.com/curriculum-2010/xslt-and-xquery-2010/) but
it's a fast-moving field.

///Peter
Received on Mon Oct 03 2011 - 10:04:30 GMT