Question on how Swish-e is parsing words out of a

From: David Wood <dwood(at)not-real.inter.nl.net>
Date: Thu Dec 18 2003 - 05:52:06 GMT

For the HTML doc content listed at the bottom of this message, if I run:

/opt/swish-e/bin/swish-e -T PARSED_WORDS -v 3 -i blah.html -f blah.idx

Swish-e's output is:

=== START OUTPUT ===

Indexing Data Source: "File-System"
Indexing "blah.html"

Checking file "blah.html"...
   blah.html - Using DEFAULT (HTML2) parser - White-space found word 'December'
White-space found word '2003'
White-space found word 'PSG'
White-space found word 'Playbook'
White-space found word 'product_families=Monitors,Desktop'
White-space found word 'PCs,Desktop'
White-space found word 'PCs'
White-space found word 'Options'
White-space found word 'and'
White-space found word 'Accessories,Handheld'
White-space found word 'PCs,Handheld'
White-space found word 'PCs'
White-space found word 'Options'
White-space found word 'and'
White-space found word 'Accessories,Mobile'
White-space found word 'PCs,Notebook'
White-space found word 'PCs'
White-space found word 'Options'
White-space found word 'and'
White-space found word 'Accessories,Tablet'
White-space found word 'PCs,Thin'
White-space found word 'Clients,Thin'
White-space found word 'Clients'
White-space found word 'Options'
White-space found word 'and'
White-space found word 'Accessories,Windows'
White-space found word 'NT'
White-space found word 'Workstations,Windows'
White-space found word 'Workstations,Workstations'
White-space found word 'Options'
White-space found word 'and'
White-space found word 'Accessories'
White-space found word 'product_lines='
White-space found word 'marketing_programs='
  (50 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 24 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text: Complete
   Writing word hash: Complete
   Writing word data: Complete
24 unique words indexed.
4 properties sorted.
1 file indexed.  457 total bytes.  50 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!

=== END OUTPUT ===


Why is Swish-e finding the "words" listed above, for example, 
'product_families=Monitors,Desktop'?  Neither '_' nor '=' is in WORDCHARS, 
so those strings should be getting broken into component words, shouldn't they?
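
To make the expectation concrete, here is a minimal Python sketch (not Swish-e code) of the two-stage split I have in mind: first on white space, then on any character outside the word-character set. The word-character set used here (ASCII letters and digits only) and the function names are assumptions made for illustration; the text is the title plus the <pre> content of the document below.

```python
import re

# Title text plus the <pre> body of blah.html, with the markup stripped by hand.
TEXT = """December 2003 PSG Playbook
product_families=Monitors,Desktop PCs,Desktop PCs Options and
Accessories,Handheld PCs,Handheld PCs Options and Accessories,Mobile
PCs,Notebook PCs Options and Accessories,Tablet PCs,Thin Clients,Thin
Clients Options and Accessories,Windows NT Workstations,Windows
Workstations,Workstations Options and Accessories
product_lines=
marketing_programs=
"""

# Assumption: word characters are ASCII letters and digits only,
# so '_', '=' and ',' all act as separators.
WORD_RUN = re.compile(r"[A-Za-z0-9]+")


def whitespace_tokens(text):
    """Stage 1: split on white space only."""
    return text.split()


def split_on_wordchars(token):
    """Stage 2: break one token into lowercased runs of word characters."""
    return [w.lower() for w in WORD_RUN.findall(token)]


tokens = whitespace_tokens(TEXT)
words = [w for t in tokens for w in split_on_wordchars(t)]

# Prints: 34 50 24
print(len(tokens), len(words), len(set(words)))
```

The 34 stage-1 tokens are the same strings that PARSED_WORDS reports above, while the stage-2 counts (50 words, 24 unique) agree with the "50 total words" and "24 unique words indexed" lines in the summary. That may mean the "White-space found word" lines show only an intermediate step and the strings are split further before indexing, but I would appreciate confirmation.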


Swish-e version is:  SWISH-E 2.4.0, on HP-UX 11.0.


Thanks for any insight...

Cheers,

David






=== START DOC CONTENT ===

<html>
<head>
<title>December 2003 PSG Playbook</title>

</head>

<body>

<pre>


product_families=Monitors,Desktop PCs,Desktop PCs Options and 
Accessories,Handheld PCs,Handheld PCs Options and Accessories,Mobile 
PCs,Notebook PCs Options and Accessories,Tablet PCs,Thin Clients,Thin 
Clients Options and Accessories,Windows NT Workstations,Windows 
Workstations,Workstations Options and Accessories
product_lines=
marketing_programs=

</pre>

</body>
</html>

=== END DOC CONTENT ===






Received on Thu Dec 18 05:52:15 2003