swish-e incremental and -S prog (patch and problem)

From: Dobrica Pavlinusic <dpavlin(at)not-real.rot13.org>
Date: Fri Dec 10 2004 - 00:05:27 GMT

Hi!

I'm writing a Perl module which creates swish-e index(es). I started last
summer with the noble idea of splitting indexes into slices to enable
re-indexing of just one part, but a lot of other things prevented me from
finishing it.

Lately, there has been some talk about swish-e incremental indexing and it
seemed to be just what I was looking for, so I gave it a try.

Since my module runs swish-e using -S prog, I needed the ability to
change the mode between Index, Update and Remove on the fly. The attached
patch proposes a new Update-Mode: header with values of Update,
Remove or Index (which correspond to the swish-e modes available from
the command line). I also think that Update-Mode: probably isn't the best
name for the new header.
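
For illustration, here is a minimal sketch of how a -S prog program might
format one record that carries the proposed header. It assumes Update-Mode:
is simply written alongside the usual Path-Name: and Content-Length: headers,
followed by a blank line and the document body. The helper name
format_record and the sample values are hypothetical and not part of swish-e.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of swish-e): format one -S prog record.
 * mode is the value for the proposed Update-Mode: header
 * ("Index", "Update" or "Remove"), or NULL to leave the header out.
 * Returns the number of bytes written, or -1 if buf is too small. */
static int format_record(char *buf, size_t size, const char *path,
                         const char *mode, const char *body)
{
    int n;

    assert(buf != NULL && path != NULL && body != NULL);

    if (mode)
        n = snprintf(buf, size,
                     "Path-Name: %s\nContent-Length: %lu\nUpdate-Mode: %s\n\n%s",
                     path, (unsigned long) strlen(body), mode, body);
    else
        n = snprintf(buf, size,
                     "Path-Name: %s\nContent-Length: %lu\n\n%s",
                     path, (unsigned long) strlen(body), body);

    /* snprintf reports the length it needed; treat truncation as an error */
    return (n < 0 || (size_t) n >= size) ? -1 : n;
}
```

A program would write the resulting buffer to stdout for each document.
Because the patch stores the value in sw->Index->update_mode, the mode
appears to stay in effect for later documents until another Update-Mode:
header changes it.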

Other changes in this patch include:

- report the number of removed files with -T INDEX_HEADER (this probably
  should have #ifdef USE_BTREE around it, but the data structure doesn't
  have it, so I didn't add it either)
- index the same file again after it has been removed
- support for external programs (flush_stream) in several places
- fix for a negative number of files in the index after an update (I think
  that it shouldn't increment the number of removed files, because the
  update re-adds the file anyway)
- verbose message in remove mode

I think that the verbosity level of the messages from the Update-Mode:
header should be 3 and not 2, but I'm not sure.

Having said all that, I also have a problem: when indexing, removing and
then updating the index again with the same file set, I get a segmentation
fault after a while.

I'm attaching two scripts which can be used to reproduce this behavior. They
try to index, remove and then update swish-e's own documentation again, ten
times over. The segfault happens after the third iteration. I suspect that it
has something to do with my attempt to update deleted files, but I'm in a
dark alley here.

Any help would be greatly appreciated.

Why does the documentation claim that incremental indexing is much slower?
In my tests, it's a bit faster:

17,933 files indexed.  127,571,399 total bytes.  17,851,935 total words.
Elapsed time: 00:01:43 CPU time: 00:01:06

vs normal swish-e:

17,933 files indexed.  127,571,399 total bytes.  17,851,935 total words.
Elapsed time: 00:01:47 CPU time: 00:01:02
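
Converted to seconds, the elapsed time drops from 107 s to 103 s (about
3.7%), while the CPU time rises from 62 s to 66 s (about 6.5%). The small
sketch below shows that arithmetic; the helper names to_seconds and
percent_diff are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Convert an "HH:MM:SS" string, as printed by swish-e, to seconds.
 * Returns -1 if the string does not match that format. */
static long to_seconds(const char *hms)
{
    int h, m, s;

    assert(hms != NULL);
    if (sscanf(hms, "%d:%d:%d", &h, &m, &s) != 3)
        return -1;
    return h * 3600L + m * 60L + s;
}

/* Difference of a relative to b, in percent (negative: a is smaller). */
static double percent_diff(long a, long b)
{
    return 100.0 * (double) (a - b) / (double) b;
}
```

Calling percent_diff(to_seconds("00:01:43"), to_seconds("00:01:47")) gives
the elapsed-time difference of the incremental run relative to the normal
run. A single run of each is a small sample, so the gap may be within noise.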



-- 
Dobrica Pavlinusic               2share!2flame            dpavlin@rot13.org
Unix addict. Internet consultant.             http://www.rot13.org/~dpavlin

Attachment: swish-e_update_mode.diff


Index: extprog.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/extprog.c,v
retrieving revision 1.52
diff -u -w -r1.52 extprog.c
--- extprog.c	10 Sep 2003 18:57:39 -0000	1.52
+++ extprog.c	9 Dec 2004 23:08:17 -0000
@@ -395,6 +395,42 @@
                 continue;
             }
 
+	    /* new Update-Mode: [Update|Remove|Index] header
+	     * for this to work, swish-e has to be compiled with incremental option and
+	     * in update mode (-u) so that index is opened in read/write mode
+	     * dpavlin 2004-12-09
+	     */
+
+            if (strncasecmp(line, "Update-Mode", 11) == 0)
+            {
+                char *x = strchr(line, ':');
+                if (!x)
+                    progerr("Failed to parse Update-Mode '%s'", line);
+
+                x = str_skip_ws(++x);
+                if (!*x)
+                    progerr("Failed to parse Update-Mode header '%s'", line);
+
+		/* should we dump error here? It seem to work without update mode! - dpavlin */
+		if (sw->Index->update_mode != MODE_UPDATE && sw->Index->update_mode != MODE_REMOVE)
+			progwarn("Update-Mode header is supported only if swish-e is invoked in update (-u) mode");
+
+		if ( strncasecmp(x, "Update", 6) == 0 ) {
+			sw->Index->update_mode = MODE_UPDATE;
+			if ( sw->verbose >= 2 ) printf( "Update mode: %s (MODE_UPDATE)\n", x );
+		} else if ( strncasecmp(x, "Remove", 6) == 0 ) {
+			sw->Index->update_mode = MODE_REMOVE;
+			if ( sw->verbose >= 2 ) printf( "Update mode: %s (MODE_REMOVE)\n", x );
+		} else if ( strncasecmp(x, "Index", 5) == 0 ) {
+			sw->Index->update_mode = MODE_UPDATE;
+			if ( sw->verbose >= 2 ) printf( "Update mode: %s (MODE_UPDATE)\n", x );
+		} else {
+			progerr("Unknown Update-Mode: %s", x);
+		}
+
+                continue;
+            }
+
             progwarn("Unknown header line: '%s' from program %s", line, prog);
 
         }
Index: headers.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/headers.c,v
retrieving revision 1.10
diff -u -w -r1.10 headers.c
--- headers.c	9 Nov 2004 23:03:53 -0000	1.10
+++ headers.c	9 Dec 2004 23:08:18 -0000
@@ -56,6 +56,7 @@
     {  "Saved as",          SWISH_STRING, 2,  offsetof( INDEXDATAHEADER, savedasheader ) },
     {  "Total Words",       SWISH_NUMBER, 2,  offsetof( INDEXDATAHEADER, totalwords ) },
     {  "Total Files",       SWISH_NUMBER, 2,  offsetof( INDEXDATAHEADER, totalfiles ) },
+    {  "Removed Files",     SWISH_NUMBER, 2,  offsetof( INDEXDATAHEADER, removedfiles ) },
     {  "Total Word Pos",    SWISH_NUMBER, 2,  offsetof( INDEXDATAHEADER, total_word_positions ) },
     {  "Indexed on",        SWISH_STRING, 2,  offsetof( INDEXDATAHEADER, indexedon ) },
     {  "Description",       SWISH_STRING, 2,  offsetof( INDEXDATAHEADER, indexd ) },
Index: index.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/index.c,v
retrieving revision 1.232
diff -u -w -r1.232 index.c
--- index.c	6 Dec 2004 22:08:41 -0000	1.232
+++ index.c	9 Dec 2004 23:08:26 -0000
@@ -826,18 +826,23 @@
             if ( fi.prop_index )
                 efree( fi.prop_index );
 
-            /* New file is the same or older. Skip it */
-            if (ret >= 0)
+            /* New file is the same or older and not deleted, skip it */
+            if (ret >= 0 && DB_CheckFileNum(sw,old_filenum,indexf->DB))
             {
                if (sw->verbose >= 3)
                    printf(" - Update mode - File same or older - (Skipping it)\n\n");
+		/* external program must seek past data */
+		if (fprop->fp)
+		    flush_stream( fprop );
 
                 return;
             }
             else
             {  /* Remove old filenum and continue */
                 DB_RemoveFileNum(sw,old_filenum,indexf->DB);
-                cur_index->header.removedfiles++;
+		/* 2004-12-09 dpavlin -- this seem to change number of files into
+		 * negative number, so I commented it */
+		/*  cur_index->header.removedfiles++; */
             }
         }
         break;
@@ -848,13 +853,20 @@
 
         DB_ReadFileNum(sw,&old_filenum,fprop->real_path,strlen(fprop->real_path),indexf->DB);
         /* If exits a previous file with the same real_path remove it */
-        if(old_filenum)
+	/* 2004-12-09 dpavlin -- added check if file is deleted */
+        if(old_filenum && DB_CheckFileNum(sw,old_filenum,indexf->DB))
         {
+	    if (sw->verbose >= 3)
+                printf(" - Remove mode - File removed\n");
             IndexFILE   *cur_index = sw->indexlist;
             /* Remove old filenum and continue */
             DB_RemoveFileNum(sw,old_filenum,indexf->DB);
             cur_index->header.removedfiles++;
+
         }
+        /* external program must seek past data */
+        if (fprop->fp)
+            flush_stream( fprop );
         return;
         break;
     }
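
The header parsing added to extprog.c above can be tried outside swish-e
with a standalone sketch like the following. It is an approximation under
stated assumptions: the enum values are illustrative stand-ins for swish-e's
own MODE_* constants, errors are returned instead of calling progerr(), and,
unlike the patch (which maps Index to MODE_UPDATE), Index gets its own value
here so the three inputs can be told apart.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Illustrative stand-ins; swish-e defines its own MODE_* constants. */
enum sketch_mode {
    SKETCH_INVALID = -1,
    SKETCH_INDEX,
    SKETCH_UPDATE,
    SKETCH_REMOVE
};

/* Parse a header line such as "Update-Mode: Remove".
 * Returns SKETCH_INVALID if the line is not a well-formed
 * Update-Mode header. */
static enum sketch_mode parse_update_mode(const char *line)
{
    const char *x;

    assert(line != NULL);

    if (strncasecmp(line, "Update-Mode", 11) != 0)
        return SKETCH_INVALID;

    x = strchr(line, ':');
    if (!x)
        return SKETCH_INVALID;

    /* skip the colon and any whitespace before the value */
    for (x++; *x && isspace((unsigned char) *x); x++)
        ;

    /* prefix matches, as in the patch: "Updated" is accepted as "Update" */
    if (strncasecmp(x, "Update", 6) == 0)
        return SKETCH_UPDATE;
    if (strncasecmp(x, "Remove", 6) == 0)
        return SKETCH_REMOVE;
    if (strncasecmp(x, "Index", 5) == 0)
        return SKETCH_INDEX;

    return SKETCH_INVALID;
}
```

Like the patch, this compares only the leading characters of the value, so
trailing text after the keyword is silently ignored; a stricter parser could
reject it.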


************************************************************
Non-text elements of this multipart message
have been deleted to make the message conform
with the policies of this list
************************************************************

Received on Thu Dec 9 16:05:32 2004