Re: Index / Prop files over 2GB

From: Mike Kralec <mkralec(at)not-real.sbgnet.com>
Date: Fri Oct 29 2004 - 11:47:58 GMT
Here you go.  This is the *output*:

Indexing Data Source: "File-System"
Indexing "/archive/asheville"
Indexing "/archive/baltimore"
Indexing "/archive/birmingham"
Indexing "/archive/central"
Indexing "/archive/cincinnati"
Indexing "/archive/flint"
Indexing "/archive/greensboro"
Indexing "/archive/lasvegas"
Indexing "/archive/milwaukee"
Indexing "/archive/nashville"
Indexing "/archive/oklahoma"
Indexing "/archive/pittsburgh"
Indexing "/archive/portland"
Indexing "/archive/raleigh"
Indexing "/archive/rochester"
Indexing "/archive/tampa"
Indexing "/archive/wggb-springmass"
Indexing "/archive/buffalo"
Indexing "/archive/champaign"
Indexing "/archive/wics-springill"
Indexing "/archive/cedarrapids"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 813,347 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
813,347 unique words indexed.
Sorting property: swishdocpath
Sorting property: swishtitle
Sorting property: swishdocsize
Sorting property: swishlastmodified
Sorting property: swishdescription
Sorting property: year_month
Sorting property: market
7 properties sorted.
6,049,227 files indexed.  3,155,419,777 total bytes.  490,801,524 total words.
Elapsed time: 09:37:42 CPU time: 06:14:06
Indexing done!
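
For a rough sense of throughput, the totals in the report can be turned into rates. This is only a sketch: the three figures are copied from the output above, and the variable names are made up for illustration.

```shell
# Figures copied from the indexing report above.
files=6049227
bytes=3155419777
elapsed="09:37:42"

# Convert HH:MM:SS to seconds; awk reads "09" as the number 9.
secs=$(echo "$elapsed" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')

# Integer division, so both results are rounded down.
files_per_sec=$(( files / secs ))
bytes_per_file=$(( bytes / files ))

echo "elapsed seconds: $secs"
echo "files per second: $files_per_sec"
echo "average bytes per file: $bytes_per_file"
```

That works out to roughly 174 files per second of wall-clock time, at an average of about 521 bytes per file.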

*Here's the indexing script:*
#!/bin/bash

# THIS WILL INDEX EVERYTHING, STARTING WITH NOTHING

cd /archive/setup
/bin/touch index.timestamp
# Send both stdout and stderr to the report ("2>&1 > file" would capture stdout only)
/usr/local/bin/swish-e -e -c swish.conf -f avid.index.new > index.report 2>&1
/bin/mv -f avid.index.new avid.index
/bin/mv -f avid.index.new.prop avid.index.prop
/bin/rm -f avid.index.new
/bin/rm -f avid.index.new.prop
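
A side note on the report files: in bash the order of redirections matters. `2>&1 > file` leaves stderr wherever stdout was pointing before, while `> file 2>&1` captures both streams in the file. A minimal sketch of the difference follows; `emit` and the scratch file names are hypothetical stand-ins, not part of the setup described here.

```shell
# emit is a hypothetical stand-in for any command that writes to both streams.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

workdir=$(mktemp -d)

# "2>&1 > file": stderr is first pointed at the current stdout (here, the
# command substitution), then stdout alone is redirected to the file.
leaked=$(emit 2>&1 > "$workdir/a.report")
a_contents=$(cat "$workdir/a.report")

# "> file 2>&1": stdout goes to the file first, then stderr follows it there.
emit > "$workdir/b.report" 2>&1
b_lines=$(grep -c '' "$workdir/b.report")

echo "escaped the first report: $leaked"
echo "captured by the first report: $a_contents"
echo "lines in the second report: $b_lines"

rm -rf "$workdir"
```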

*Here's the reindexing script:*
#!/bin/bash

# THIS IS THE REINDEXING SCRIPT, WILL REINDEX ANYTHING WITH A TIMESTAMP NEWER THAN index.timestamp

cd /archive/setup
/bin/touch index.timestamp.new
# Send both stdout and stderr to the report ("2>&1 > file" would capture stdout only)
/usr/local/bin/swish-e -e -c swish.conf -N index.timestamp -f avid.index.new > reindex.report 2>&1
/usr/local/bin/swish-e -M avid.index avid.index.new avid.tmp >> reindex.report 2>&1
/bin/mv -f avid.tmp avid.index
/bin/mv -f avid.tmp.prop avid.index.prop
/bin/rm -f avid.index.new
/bin/rm -f avid.index.new.prop
/bin/cp -p index.timestamp.new index.timestamp
/bin/rm -f index.timestamp.new
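
To preview which files a nightly run is likely to pick up, the same "newer than the timestamp file" comparison can be approximated with `find -newer`. This is a sketch under assumptions: it runs against a scratch tree with hypothetical file names, and it assumes the `-N` selection is based on modification time in the same way `find -newer` is.

```shell
# Hypothetical scratch tree standing in for /archive; all names are illustrative.
root=$(mktemp -d)
mkdir -p "$root/tampa/2004/10"

# Give each file an explicit modification time so the comparison is deterministic.
touch -t 200410280000 "$root/tampa/2004/10/old.html"
touch -t 200410281200 "$root/index.timestamp"
touch -t 200410290000 "$root/tampa/2004/10/new.html"

# List the HTML files modified after the timestamp file.
changed=$(find "$root" -name '*.html' -newer "$root/index.timestamp")
echo "$changed"

rm -rf "$root"
```

Against the real tree, the equivalent would be to run the `find` line with `/archive` and the actual `index.timestamp` in place of the scratch paths.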

*Here's the swish.conf*
# Swish-e Configuration File

# Let me know if there are problems
IndexAdmin mkralec@sbgnet.com

# Tell swish what to index
IndexDir /archive/asheville
IndexDir /archive/baltimore
IndexDir /archive/birmingham
IndexDir /archive/central
IndexDir /archive/cincinnati
IndexDir /archive/flint
IndexDir /archive/greensboro
IndexDir /archive/lasvegas
IndexDir /archive/milwaukee
IndexDir /archive/nashville
IndexDir /archive/oklahoma
IndexDir /archive/pittsburgh
IndexDir /archive/portland
IndexDir /archive/raleigh
IndexDir /archive/rochester
IndexDir /archive/tampa
IndexDir /archive/wggb-springmass
IndexDir /archive/buffalo
IndexDir /archive/champaign
IndexDir /archive/wics-springill
IndexDir /archive/cedarrapids

# Only index HTML files
IndexOnly .html

# Parse .html files with the HTML2 parser
IndexContents HTML2 .html

# Tell swish what to save the index as
IndexFile avid.index

# Don't index published.html
FileRules filename is published\.html

# Store the body as the description
StoreDescription HTML2 <body>

# Setup market meta name extraction
ExtractPath market regex !^/archive/([^/]+)/.*$!$1!

# Setup year_month meta name extraction
ExtractPath year_month regex !^/archive/[^/]+/([^/]+)/([^/]+)/.*$!$1$2!

# Let swish know about important fields
MetaNames date trt tapenumber

# Let's use the following for search sorting
PropertyNames year_month market

# Ignore Words found to be repetitive
IgnoreWords unknown tape text code archive time cues production date news format

# Index words of at least 2 characters
MinWordLimit 2
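
To see what the two ExtractPath patterns pull out of a path, the same regular expressions can be tried with `sed`. This is only an illustration: the sample path is hypothetical and assumes a /archive/<market>/<year>/<month>/ layout, and `sed` writes capture groups as \1 where the config uses $1.

```shell
# Hypothetical sample path; the directory layout is an assumption.
path="/archive/tampa/2004/10/story.html"

# Same pattern as the "market" ExtractPath line.
market=$(echo "$path" | sed -E 's!^/archive/([^/]+)/.*$!\1!')

# Same pattern as the "year_month" ExtractPath line: the second and third
# path components are joined with nothing between them.
year_month=$(echo "$path" | sed -E 's!^/archive/[^/]+/([^/]+)/([^/]+)/.*$!\1\2!')

echo "market: $market"
echo "year_month: $year_month"
```

For the sample path this yields "tampa" and "200410". A value shaped like the second one sorts chronologically as a plain string, which fits its use as a sort property.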

Mike

Bill Moseley wrote:

>On Thu, Oct 28, 2004 at 04:31:46AM -0700, Mike Kralec wrote:
>
>>FYI, I just wanted to say that 2.5.2 compiled with the large file
>>support is working great for me with a little over 6 million indexed
>>files.  I'm re-indexing nightly and merging works fine also.  I'm up
>>to around 2.5GB with the prop file now.
>
>Pushing the envelope, I see.  Once again, thanks Jose!
>
>Can you post output from indexing to see number of files/words and
>indexing time.

Received on Fri Oct 29 04:48:02 2004