(no subject)

From: <Rainer.Scherg(at)not-real.rexroth.de>
Date: Sun Nov 19 2000 - 17:47:04 GMT
Hi Jose,

in a separate mail you'll get the new swish source files as
a tar archive (the mail is too large to post to the mailing list)...


I've done some redesign of the code, rearranged routines and 
implemented the FileProperty-Routine. First tests are working fine...


Files touched:
   swish.h
   file.c/h
   fs.c/h
   http.c/h
   index.c/h
   txt.c/h
   xml.c/h
   


But here are some things which are still to be discussed & to be done...


Swish-e:

ToDo and some questions for my understanding:

  - there is a "title" passed from "outside" to the index routines...
    this seems to be for historic reasons, from when swish only did HTML.
    In most cases the "title" just contains the filepath.

    For now I have left this untouched...
    but we should get rid of these relics.

    The title should be retrieved within the indexing routine for a doctype.
    (XML may differ from HTML or other types...)


  - "indextitleonly" (now: fprop->index_no_content) is not honoured by every
    index routine (only by the original one: countwords).
    This should be fixed.


  - in routine "indexafile": DOCENTRY *e only contains the filename...
    (and the misplaced "title")
    What do we need this structure for?

  - The new routine "read_stream" (file.c) needs a redesign IMO, though the
    idea is good!  (Jose, please have a look at my thoughts.)

     -------------------
     char *read_stream(FILE *fp,int filelen)
     {
         int c=0,offset=0,bufferlen=0;
         unsigned char *buffer;

         if(filelen)
         {
             buffer=emalloc(filelen+1);
             vread(buffer,1,filelen,fp);
             buffer[filelen]='\0';
         } else {    /* if we are reading from a popen call, filelen is 0 */

             buffer=emalloc((bufferlen=MAXSTRLEN)+1);
             while((c=fgetc(fp))!=EOF)
             {
                 if(offset==bufferlen)
                 {
                     bufferlen+=MAXSTRLEN;
                     buffer=erealloc(buffer,bufferlen+1);
                 }
                 buffer[offset++]=(unsigned char)c;
             }
             buffer[offset]='\0';
         }
         return (char *)buffer;
     }
     -------------------------
     
     1. fgetc is usually slow. We should use fread (or vread) instead.
     2. MAXSTRLEN is too small.
        On large documents this means frequent reallocs and, in the worst
        case, moving the memory block each time (behavior: roughly
        quadratic slowdown, since more data gets copied with every step).
        The min. buffer size should be larger (at least 64K, 128K or 256K).
     3. We could handle this in a standard way without filelen being passed:
        when the buffer size is large enough, a realloc on e.g. html docs
        will hardly ever occur...
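
     To make points 1 to 3 a bit more concrete, here is a rough sketch of
     what I have in mind. It is not the swish code: the names
     read_stream_chunked and READ_CHUNK are made up, it uses plain
     malloc/realloc instead of emalloc/erealloc, and doubling the buffer
     on each realloc is my own assumption (a fixed 64K step would also
     fit point 2).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define READ_CHUNK (64 * 1024)   /* hypothetical name; 64K start size as in point 2 */

/* Read fp up to EOF into a NUL-terminated heap buffer.
   No filelen is needed, so files and popen streams take the same path.
   Returns NULL if memory runs out; the caller frees the result. */
char *read_stream_chunked(FILE *fp, size_t *out_len)
{
    size_t cap = READ_CHUNK, len = 0, n;
    char *buf, *tmp;

    assert(fp != NULL);
    buf = malloc(cap + 1);               /* +1 for the trailing '\0' */
    if (buf == NULL)
        return NULL;

    /* fread fetches a whole block per call instead of one char (point 1) */
    while ((n = fread(buf + len, 1, cap - len, fp)) > 0) {
        len += n;
        if (len == cap) {                /* buffer full: grow before next read */
            cap *= 2;
            tmp = realloc(buf, cap + 1);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
        }
    }
    buf[len] = '\0';
    if (out_len != NULL)
        *out_len = len;
    return buf;
}
```

     With a 64K start size, a typical html doc fits into the first
     buffer and never triggers a realloc at all.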
 
    
  - http.c: The last modification date still has to be retrieved from a
    document.
    At this point the last mod date for this method is zero.

  - Saving the last modification date in the indexfile (for results...)

  - Description stuff to do...

  - Defining a new result format (what do the parameters look like?)

  - Get rid of old K&R-C stuff  (e.g.  #define _AP)
    
  - Add comments to make swish easier to understand for further
    development

  - Idea: Feed in a filelist (like filters, e.g. a find-cmd output) to
          index.
          This would be a new indexing method (like file_system_indexing and
          http_spidering). Should be easy to implement - but no priority...
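
          Just to illustrate the filelist idea, a minimal sketch, assuming
          one path per line on the input stream. All names here
          (index_from_filelist, path_fn, remember_path) are hypothetical
          and not part of swish; the callback stands in for whatever
          routine would really index a single file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: called once per path read from the list. */
typedef void (*path_fn)(const char *path, void *ctx);

/* Read one path per line from `list` and hand each one to `fn`.
   Returns the number of paths passed on. Lines longer than the
   buffer would be split, which a real version has to handle. */
size_t index_from_filelist(FILE *list, path_fn fn, void *ctx)
{
    char line[4096];
    size_t count = 0;

    assert(list != NULL && fn != NULL);
    while (fgets(line, sizeof line, list) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (line[0] == '\0')
            continue;                          /* skip empty lines */
        fn(line, ctx);
        count++;
    }
    return count;
}

/* Stand-in for the real per-file indexing call: remembers the last path.
   ctx must point to a buffer of at least 4096 bytes. */
void remember_path(const char *path, void *ctx)
{
    strcpy((char *)ctx, path);
}
```

          With stdin as the list, the output of a find command could
          simply be piped in.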

  - Docs... also to be done (lots of...)


cu rainer



Received on Sun Nov 19 17:48:42 2000