Re: Indexing xml files that has another included xml file

From: Peter Karman <karman(at)not-real.cray.com>
Date: Wed Sep 08 2004 - 22:29:41 GMT
I believe (Bill will correct me) that even with libxml2 as your parser 
(XML2), swish-e does not follow external entities or XIncludes in your 
XML. The xmllint tool that comes with libxml2 does resolve them (given the 
right catalog/options), but I don't think swish-e calls the library 
functions that resolve external files. Seems like a good todo item...
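
One possible workaround until then is to expand the entities yourself before indexing. Here is a minimal sketch, assuming Python's standard xml.sax (expat) parser; the helper name expanded_text is made up for illustration and is not part of swish-e or libxml2. It only collects the text content, so element names (and therefore MetaNames) are not preserved:

```python
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class TextCollector(ContentHandler):
    """Collects character data, including text pulled in via entities."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def expanded_text(path):
    """Return the text of the XML file at `path` with external entities expanded.

    Hypothetical helper for illustration; not part of swish-e or libxml2.
    """
    parser = xml.sax.make_parser()
    # External general entities are off by default since Python 3.7.1;
    # only enable this for files you trust.
    parser.setFeature(feature_external_ges, True)
    handler = TextCollector()
    parser.setContentHandler(handler)
    # Relative SYSTEM identifiers are looked up next to the parsed file.
    parser.parse(path)
    return "".join(handler.chunks)
```

Calling expanded_text("example.base") in the directory that also holds other.data should return the text of both files combined, which could then be written out to a file for swish-e to index. On the command line, xmllint's --noent option substitutes entities in a similar way.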

Edgard Pineda wrote on 9/8/04 5:17 PM:

> Hello All!
> 	I'm trying to index this example XML file (named
> "example.base"):
> 
> <?xml version="1.0" encoding="iso-8859-1"?>
> <!DOCTYPE article [
>     <!ENTITY xmlfrag SYSTEM "other.data" >
> ]>
> <article>
>   &xmlfrag;
> </article>
> 
> In the same directory I have the file "other.data", which has a lot of XML
> content.
> I created the following config file, "swishe_conf":
> 
> # Indexable data
> IndexDir /test/swish-e_test
> IndexFile /test/swish-e_test/index
> IndexOnly .base
>  
> # Name and description
> IndexName "Testing Index"
> IndexDescription "Generated by Swish-e 2.4.0"
>  
> # XML
> MetaNames hl1 hl2
> PropertyNames hl1 hl2
> IndexReport 3
> FollowSymLinks yes
>  
> # Ranking
> IgnoreTotalWordCountWhenRanking yes
> IndexComments 0
>  
> # Spanish characters
> TranslateCharacters áéíóúüñÁÉÍÓÚÜÑ aeiouunAEIOUUN
>  
> # Stopwords
> #MinWordLimit 3
> MaxWordLimit 30
> 
> 
> Then I run:
> 
>>swish-e -v 3 -c swishe_conf -f index.tmp
> 
> Parsing config file 'swishe_conf'
> Indexing Data Source: "File-System"
> Indexing "/home/proy/devel/test/swish-e_test"
>  
> Checking dir "/home/proy/devel/test/swish-e_test"...
>   23640.base - Using DEFAULT (HTML2) parser -  (1 words)
>  
> Removing very common words...
>   Getting IgnoreLimit stopwords: Complete
> no words removed.
> Writing main index...
> Sorting words ...
> Sorting 1 words alphabetically
> Writing header ...
> Writing index entries ...
>   Writing word text: Complete
>   Writing word hash: Complete
>   Writing word data: Complete
> 1 unique word indexed.
> 6 properties sorted.
> 1 file indexed.  143 total bytes.  1 total words.
> Elapsed time: 00:00:00 CPU time: 00:00:00
> Indexing done!
> 
> I want swish-e to include the XML file other.data in the indexed file
> and show me several keywords... but then I run:
> 
> 
>>swish-e -f index.tmp -k '*'
> 
> # SWISH format: 2.4.0
> index.tmp: xmlfrag
> 
> :(
> 
> What should I do so that swish-e includes the file specified in 
> 'ENTITY xxx SYSTEM "somefile"' when indexing XML files?
> 
> Thanks in advance for your help!!
> 
> Edgard Pineda.

-- 
Peter Karman  651-605-9009  karman@cray.com
Received on Wed Sep 8 15:30:03 2004