Newline sometimes ignored in TXT-parser?

From: Arne Georg Gleditsch <argggh(at)>
Date: Sun Nov 25 2001 - 13:13:10 GMT
Hi, all.  During testing of Swish-E (CVS version, checked out
2001-11-25) I encountered this oddity:

  $ echo DefaultContents TXT > test.config
  $ /home/argggh/src/ping/swish-e/src/swish-e -S fs -i 2.4.15-pre6/CREDITS -v 5 -c test.config
  Indexing Data Source: "File-System"
  Indexing "2.4.15-pre6/CREDITS"
  Checking file "2.4.15-pre6/CREDITS"...
    CREDITS - Using TXT parser -  (11689 words)
  Removing very common words...
  no words removed.
  Writing main index...
  Sorting words ...
  Sorting 5199 words alphabetically
  Writing header ...
  Writing index entries ...
    Writing word text: Complete
    Writing word hash: Complete
    Writing word data: Complete
  5199 unique words indexed.
  4 properties sorted.
  1 file indexed.  77693 total bytes.
  Elapsed time: 00:00:00 CPU time: 00:00:00
  Indexing done!
  $ /home/argggh/src/ping/swish-e/src/swish-search -w 'Henderson'
  # SWISH format: 2.1-dev-24
  # Search words: Henderson
  err: no results
  $ /home/argggh/src/ping/swish-e/src/swish-search -w 'Henderson*'
  # SWISH format: 2.1-dev-24
  # Search words: Henderson*
  # Number of hits: 1
  # Search time: 0.001 seconds
  # Run time: 0.006 seconds
  1000 2.4.15-pre6/CREDITS "CREDITS" 77693
  $ /home/argggh/src/ping/swish-e/src/swish-search -w 'HendersonE'
  # SWISH format: 2.1-dev-24
  # Search words: HendersonE
  # Number of hits: 1
  # Search time: 0.000 seconds
  # Run time: 0.007 seconds
  1000 2.4.15-pre6/CREDITS "CREDITS" 77693
  $ grep -A2 Henderson 2.4.15-pre6/CREDITS
  N: Richard Henderson

The file indexed is CREDITS from the Linux 2.4.15-pre6 source code
distribution.  I suppose any Linux version in the 2.4 series will be
similar enough to exhibit this as well.  If that turns out not to be
the case, I'll mail the exact file used here to anyone who wants to
test this.

I also have a small wish for the Swish-E developers: I'd love to be
able to feed swish-e the file contents to index on stdin.  This would
work just like "-S prog", except driven by the program gathering the
file contents rather than by swish-e.  As of now I am kludging it by
starting swish-e like this from my gatherer:

  swish-e -S prog -i /bin/cat [..]

and then feeding stdin of this process with the stuff to be indexed.
This works, but it's a bit gross.
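
For anyone wanting to try the same kludge, here is a rough sketch of
what the gatherer side might write to that stdin.  It assumes the
"-S prog" document format is a Path-Name header and a Content-Length
header, then a blank line, then exactly that many bytes of content;
check this against the docs for your version.  The emit_doc name is
made up for illustration.

```shell
# Emit one document in the header-plus-body form assumed for "-S prog":
# Path-Name and Content-Length headers, a blank line, then the content.
emit_doc() {
    path=$1
    # wc -c may pad its output with whitespace on some systems; strip it.
    size=$(wc -c < "$path" | tr -d '[:space:]')
    printf 'Path-Name: %s\nContent-Length: %s\n\n' "$path" "$size"
    cat "$path"
}

# Hypothetical use from a gatherer, reusing the command line above:
#   for f in "$@"; do emit_doc "$f"; done | swish-e -S prog -i /bin/cat [..]
```

Content-Length is counted in bytes, so the content has to be written
unmodified after the blank line.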

Received on Sun Nov 25 13:14:12 2001