On Wed, Feb 08, 2006 at 06:11:35AM -0800, Mike.Fountain@worldspan.com wrote:
> A couple of questions on indexing errors and indexing time:
> How fast should indexing run on a properly configured system? It's taking
> me a little over 1 minute to index about 200 files. I'm using DirTree
> piped to the index command. Off the top of my head, I think this box is an
> 800 MHz CPU with 128 MB RAM running Ubuntu Linux.
> Is 1 minute or so for that few files ok? I've got nothing to compare it
> to, so no idea how fast an index should run.
I find if indexing takes longer than the time it takes to drink two
martinis then it doesn't really matter anyway.
> Watching the detailed output, it looks like what really slows down the
> indexing is PDF files. Some of them parse quickly with errors, some of them
> seem to grind for quite a while before spitting out an error:
> /www/pages/support/vendors/cisco/6500arch.pdf - Using HTML2 parser -
> (16364 words)
> Error (6594296): Internal: got 'EI' operator
> Error (9730251): Internal: got 'EI' operator
> /www/pages/support/vendors/cisco/Catalyst 4500 Update 2.pdf - Using HTML2
> parser - (9963 words)
> Error (2539944): Unknown operator '£'
> Error (2539944): Internal: got 'EI' operator
> Error (7208699): Unknown operator 'c¬'
> Error (7208699): Internal: got 'EI' operator
Those are errors generated by Xpdf while indexing your pdf. Maybe
you need a newer version of Xpdf? Maybe it doesn't understand
something valid in your pdf? Maybe your pdfs were generated by a
It might be worth contacting the author of Xpdf to ask, and offering
them your PDF file for testing.
> The other question I have is: are the search features of the web site
> available while the site is reindexing, say if my site grows to the point
> where it takes 5-10 minutes to index? Or does it build a temp file and then
> do a quick swap of the indexes once it's done?
It writes the index to temporary files then renames them at the end.
Since there's more than one file that makes up the index there's a
race condition -- a short amount of time when the index could be
opened and it would generate an error. I've never seen or heard of that
happening, but if you are worried about it you might instead write a
script to reindex into a new directory then rename the directory.
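
A rough sketch of such a script, in Python. Everything named here is an
assumption for illustration: the function reindex_and_swap, the ".new" and
".old" directory suffixes, and the build_index callback that stands in for
however you actually run the indexer. It builds the new index beside the
live directory, then renames it into place, and leaves the live index
alone if the indexing run fails.

```python
import os
import shutil
from pathlib import Path
from typing import Callable


def reindex_and_swap(live_dir: Path, build_index: Callable[[Path], None]) -> None:
    """Build a fresh index next to live_dir, then rename it into place.

    build_index is a hypothetical callback that writes all index files
    into the directory it is given (for example by running the indexer).
    """
    live_dir = Path(live_dir)
    # Sibling directories share a filesystem, so the renames below are
    # plain renames rather than copies.
    new_dir = live_dir.parent / (live_dir.name + ".new")
    old_dir = live_dir.parent / (live_dir.name + ".old")

    # Clear leftovers from an earlier interrupted run.
    if new_dir.exists():
        shutil.rmtree(new_dir)
    new_dir.mkdir()

    try:
        # This may take minutes; searches keep using live_dir meanwhile.
        build_index(new_dir)
    except BaseException:
        # Indexing failed: discard the partial index, keep the live one.
        shutil.rmtree(new_dir, ignore_errors=True)
        raise

    if old_dir.exists():
        shutil.rmtree(old_dir)
    if live_dir.exists():
        os.rename(live_dir, old_dir)   # move the current index aside
    os.rename(new_dir, live_dir)       # put the new index in its place
    shutil.rmtree(old_dir, ignore_errors=True)


# Hypothetical usage (paths are made up; -c and -f are swish-e's
# config-file and index-file switches):
#
#   import subprocess
#   reindex_and_swap(
#       Path("/var/lib/swish/index"),
#       lambda d: subprocess.run(
#           ["swish-e", "-c", "swish.conf", "-f", str(d / "index.swish-e")],
#           check=True,
#       ),
#   )
```

Note that the swap is still two renames, so there is a very brief moment
when the live directory name does not exist. If that matters, pointing the
search script at a symlink and replacing the symlink is one alternative.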
Received on Wed Feb 8 06:26:45 2006