Skip to main content.
home | support | download

Back to List Archive

Re: Select files for incremental indexing

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Mon Oct 09 2006 - 13:22:04 GMT
thanks for this thorough report.

Sounds like what you're saying is that ctime is not a sufficient measure of 
whether a file has changed or not. Something more like:

http://search.cpan.org/~jeremy/File-Signature-1.009/Signature.pm

Undoubtedly this approach would require more overhead in the fs method, and 
perhaps it would be easier to implement in something like DirTree.pl or similar. 
Each file's signature would have to be stored/cached somewhere, either in the 
index or in some external file/db.

My opinion is that this level of checking is a Good Idea but should not be a 
feature of Swish-e itself. Swish-e is an indexer/search tool, and really ought 
to index whatever it is asked to. The -S prog method implements this idea.

I have some more ideas on this approach which I'll post as I start to make 
Swish3 development visible on this list.

Again, that's my opinion and I welcome discussion here on the topic.

cheers,
pek

Dmitri V. Ivanov scribbled on 10/6/06 2:58 PM:
> On Thu, Oct 05, 2006 at 12:33:57PM -0700, Dmitri V. Ivanov wrote:
> 
>> *********************************************************************
>> Due to deletion of content types excluded from this list by policy,
>> this multipart message was reduced to a single part, and from there
>> to a plain text message.
>> *********************************************************************
> 
> It seems I've failed to send html version...
> 
> There is a my opinion about 'Why' and 'How' and next to it my script.
> It seems to me too slow. It sorts all files with directory whereas to
> sort directories is sufficient. But to sort all files seems to be
> usefull to get list of deleted files like uniq.
> 
> Why and How:
> =========
> 
>    First of all about file timestamps and their update [1]in common and table
>    with details:
> 
>    +-------------------------------------------------+
>    |               |  timestamps marked for update   |
>    |    syscall    |---------------------------------|
>    |               |       file        | parent dir  |
>    |---------------+-------------------+-------------|
>    | [2]chdir      |                   |             |
>    |---------------| -                 | -           |
>    | [3]fchdir     |                   |             |
>    |---------------+-------------------+-------------|
>    | [4]chmod      |                   |             |
>    |---------------| ctime             | -           |
>    | [5]fchmod     |                   |             |
>    |---------------+-------------------+-------------|
>    | [6]chown      |                   |             |
>    |---------------|                   |             |
>    | [7]fchown     | ctime             | -           |
>    |---------------|                   |             |
>    | [8]lchown     |                   |             |
>    |---------------+-------------------+-------------|
>    | [9]close      | -                 | -           |
>    |---------------+-------------------+-------------|
>    | [10]creat     | atime,ctime,mtime | ctime,mtime |
>    |---------------+-------------------+-------------|
>    | [11]execve    | atime             | -           |
>    |---------------+-------------------+-------------|
>    | [12]fcntl     | -                 | -           |
>    |---------------+-------------------+-------------|
>    | [13]ftruncate |                   |             |
>    |---------------| ctime,mtime       | -           |
>    | [14]truncate  |                   |             |
>    |---------------+-------------------+-------------|
>    | [15]fstat     |                   |             |
>    |---------------|                   |             |
>    | [16]stat      | -                 | -           |
>    |---------------|                   |             |
>    | [17]lstat     |                   |             |
>    |---------------+-------------------+-------------|
>    | [18]fsync     |                   |             |
>    |---------------| -                 | -           |
>    | [19]fdatasync |                   |             |
>    |---------------+-------------------+-------------|
>    | [20]link      | ctime             | ctime,mtime |
>    |---------------+-------------------+-------------|
>    | [21]lseek     | -                 | -           |
>    |---------------+-------------------+-------------|
>    | [22]mknod     | atime,ctime,mtime | ctime,mtime |
>    |---------------+-------------------+-------------|
>    | [23]mkdir     | atime,ctime,mtime | ctime,mtime |
>    |---------------+-------------------+-------------|
>    | [24]mmap      | *                 | -           |
>    |---------------+-------------------+-------------|
>    | [25]munmap    | -                 | -           |
>    |---------------+-------------------+-------------|
>    | [26]msync     | *                 | -           |
>    |---------------+-------------------+-------------|
>    | [27]open      | *                 | *           |
>    |---------------+-------------------+-------------|
>    | [28]pread     |                   |             |
>    |---------------|                   |             |
>    | [29]read      | atime             | -           |
>    |---------------|                   |             |
>    | [30]readv     |                   |             |
>    |---------------+-------------------+-------------|
>    | [31]pwrite    |                   |             |
>    |---------------|                   |             |
>    | [32]write     | ctime,mtime       | -           |
>    |---------------|                   |             |
>    | [33]writev    |                   |             |
>    |---------------+-------------------+-------------|
>    | [34]rename    | implementation    | ctime,mtime |
>    |---------------+-------------------+-------------|
>    | [35]rmdir     | -                 | ctime,mtime |
>    |---------------+-------------------+-------------|
>    | [36]readlink  | *                 | -           |
>    |---------------+-------------------+-------------|
>    | [37]readdir   | atime             | -           |
>    |---------------+-------------------+-------------|
>    | readahead     | ?                 | ?           |
>    |---------------+-------------------+-------------|
>    | [38]symlink   | *                 | *           |
>    |---------------+-------------------+-------------|
>    | sendfile      | ?                 | ?           |
>    |---------------+-------------------+-------------|
>    | [39]unlink    | -                 | ctime,mtime |
>    |---------------+-------------------+-------------|
>    | [40]utime     | ctime             | -           |
>    +-------------------------------------------------+
> 
>    Summary:
> 
>    Syscalls writing to file data updates mtime (creat, truncate, ftruncate,
>    mknod, mkdir, pwrite, write, writev). mtime isn't updated when no data
>    changed (notably link, unlink, rename). ctime is updated with any changes
>    with file data and metadata.
> 
>    rename syscall hasn't strong rules for updating file timestamps at
>    standart. With linux it updates ctime. I hope that most other systems do
>    so, but i'm not sure.
> 
>    Second. We look into this table and say yourself: "Well. Just select files
>    with ctime newer than. It's all". It's not.
> 
>    We know that directory is file and it can be renamed. When we rename
>    directory we don't update timestamps of files it points. Additionally we
>    can look into our table and see that ctime (of parent directory) is
>    changed often. It's not sufficient to look only to timestamps.
> 
>    Lets look into metadata: there is three timestamps, uid,gid,access mask
>    and inode number (and pointer to file data). inode number or [41]File
>    serial number that is unique (per-filesystem).
> 
>    Let assume now, that our program rememberes directory name to inode number
>    relations. If name (of parent directory we check) doesn't exist or
>    associated inode number is other as remembered it's surely other
>    directory. Otherwise there is two cases:
> 
>     1. It's the same directory (old good pepperbox).
>     2. Old directory was deleted, then new (empty) directory was created and
>        it was filled with files (but they must have newer ctime).
> 
>    For case of rename updating ctime we have criteria: ctime plus parent dir
>    with it's inode number.
> 
>    Otherwise (rename don't update ctime of file) we must have remembered
>    associations between all files names and their inode numbers.
> 
>    The fact that GNU tar uses only directory names persuades that most *nixes
>    does update ctime with rename.
> 
> References
> 
>    Visible links
>    1. http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_07
>    2. http://www.opengroup.org/onlinepubs/009695399/functions/chdir.html
>    3. http://www.opengroup.org/onlinepubs/009695399/functions/fchdir.html
>    4. http://www.opengroup.org/onlinepubs/009695399/functions/chmod.html
>    5. http://www.opengroup.org/onlinepubs/009695399/functions/fchmod.html
>    6. http://www.opengroup.org/onlinepubs/009695399/functions/chown.html
>    7. http://www.opengroup.org/onlinepubs/009695399/functions/fchown.html
>    8. http://www.opengroup.org/onlinepubs/009695399/functions/chown.html
>    9. http://www.opengroup.org/onlinepubs/009695399/functions/close.html
>   10. http://www.opengroup.org/onlinepubs/009695399/functions/creat.html
>   11. http://www.opengroup.org/onlinepubs/009695399/functions/exec.html
>   12. http://www.opengroup.org/onlinepubs/009695399/functions/fcntl.html
>   13. http://www.opengroup.org/onlinepubs/009695399/functions/ftruncate.html
>   14. http://www.opengroup.org/onlinepubs/009695399/functions/truncate.html
>   15. http://www.opengroup.org/onlinepubs/009695399/functions/fstat.html
>   16. http://www.opengroup.org/onlinepubs/009695399/functions/stat.html
>   17. http://www.opengroup.org/onlinepubs/009695399/functions/lstat.html
>   18. http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html
>   19. http://www.opengroup.org/onlinepubs/009695399/functions/fdatasync.html
>   20. http://www.opengroup.org/onlinepubs/009695399/functions/link.html
>   21. http://www.opengroup.org/onlinepubs/009695399/functions/lseek.html
>   22. http://www.opengroup.org/onlinepubs/009695399/functions/mknod.html
>   23. http://www.opengroup.org/onlinepubs/009695399/functions/mkdir.html
>   24. http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
>   25. http://www.opengroup.org/onlinepubs/009695399/functions/munmap.html
>   26. http://www.opengroup.org/onlinepubs/009695399/functions/msync.html
>   27. http://www.opengroup.org/onlinepubs/009695399/functions/open.html
>   28. http://www.opengroup.org/onlinepubs/009695399/functions/pread.html
>   29. http://www.opengroup.org/onlinepubs/009695399/functions/read.html
>   30. http://www.opengroup.org/onlinepubs/009695399/functions/readv.html
>   31. http://www.opengroup.org/onlinepubs/009695399/functions/pwrite.html
>   32. http://www.opengroup.org/onlinepubs/009695399/functions/write.html
>   33. http://www.opengroup.org/onlinepubs/009695399/functions/writev.html
>   34. http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
>   35. http://www.opengroup.org/onlinepubs/009695399/functions/rmdir.html
>   36. http://www.opengroup.org/onlinepubs/009695399/functions/readlink.html
>   37. http://www.opengroup.org/onlinepubs/009695399/functions/readdir.html
>   38. http://www.opengroup.org/onlinepubs/009695399/functions/symlink.html
>   39. http://www.opengroup.org/onlinepubs/009695399/functions/unlink.html
>   40. http://www.opengroup.org/onlinepubs/009695399/functions/utime.html
>   41. http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_175
> =========
> 
> My script
> =========
> 
> #!/usr/bin/perl -w
> use strict;
> use integer;
> 
> # comparison function for comparing paths. Character weight is its code value
> # except for slash. Slash is lessiest character. There is no hope it will work
> # with multibyte chars.
> # ----------------------------------------------------------------------------
> sub cmpnoi18n($$)
> {   my ($leftstr,$rightstr) = @_;
>     my ($zerros) = ("$leftstr" ^ "$rightstr") =~ m/(^[\x00]*)[^\x00]/;
>     if (not defined($zerros)) { #strinngs are different from first char or same
> 	return (ord($leftstr) <=> ord($rightstr));}
>     else {
> 	my $leftlen = length($leftstr);
> 	my $rightlen = length($rightstr);
> 	my $minlen = ($leftlen < $rightlen) ? $leftlen : $rightlen;
> 	my $zerrolen = length($zerros);
> 	if ($zerrolen == $minlen) { # one of different trailers is empty string
> 	    return ($leftlen <=> $rightlen);}
> 	else { # both different trailers aren't empty
> 	    $leftstr = substr $leftstr,$zerrolen,1;
> 	    $rightstr = substr $rightstr,$zerrolen,1;
> 	    return -1 if ($leftstr eq '/'); # make '/' less than any other char to don't have difference
> 	    return 1 if ($rightstr eq '/'); # beetwin sort of partial filenames and compare full
> 	    return (ord($leftstr) <=> ord($rightstr));
> 	};
>     };
> };
> # ----------------------------------------------------------------------------
> # Utilitary function to get need file stat subset. 
> # Input: filepath suitable for # calling stat(), time to compare timestamps with
> # Output: array with filetype, boolean cnewer, inode number.
> # filetype is f for regular file, l for symlink, d for directory where we can 
> # stat entries, D for directory where we cant and o for other file types.
> # ----------------------------------------------------------------------------
> sub filestats($$) {
>     my ($filename,$oldtime) = @_;
>     my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks) = lstat($filename);
>     my $cnewer = $ctime >= $oldtime;
>     my $filetype = 'o';
>     $filetype = (-r _ && -x _) ? 'd' : 'D' if -d _;
>     $filetype = 'f' if -f _;
>     $filetype = 'l' if -l _;
>     return ($filetype,$cnewer,$ino);
> };
> # -----------------------------------------------------------------------------
> # Utility function to fill level (meaning directory level) structure. Returns
> # pointer to this filled structure.
> # level: hash with elements:
> # 	name: path to directory from root of processed subtree
> # 	entries: array with directory entries sorted with cmpnoi18n weighting
> # 		function.
> # 	unknown: boolean value representing the fact that directory wasn't 
> # 		known at previous run.
> # 	entnum: number of elements for entries array
> # 	entindex: number of element from entries that must be processed next
> # 	parent: pointer to level structure representing parent directory
> # Input: boolean dirunknown, path to processed subtree, path to processed
> # directory relating to root of processed subtree and, optionally pointer to
> # level structure representing parent directory
> # ---------------------------------------------------------------------------
> sub getlevel {
>     my ($dirunknown,$path2tree,$Dirname) = @_[0..2];
>     my $parent = (scalar @_ > 3) ? $_[3] : 0;
>     opendir(Dir,"${path2tree}/$Dirname");
>     my $level = { name => $Dirname,
> 	  entries => [ sort cmpnoi18n readdir(Dir) ],
> 	  unknown => $dirunknown,
> 	  entnum => 0,
> 	  entindex => 0,
> 	  parent => $parent
> 	  };
>     closedir Dir;
>     $level->{entnum} = scalar @{$level->{entries}};
>     $level->{name} = $level->{name} . '/' if $level->{name} gt '';
>     return $level;
> };
> # -----------------------------------------------------------------------------
> # Example set file process function. Return pointer to anonymous function,
> # that will process supplied file depending of it's type and the fact, that file
> # was changed since previous run
> # Input: Mask of file types to proceed (substring of 'dflo'),
> # 	File name to write newer files names to and, optionally,
> # 	File name fo write all files names.
> # -----------------------------------------------------------------------------
> sub setfileproc ($$:$) {
>     my ($TypeMask,$NewerFilesListName) = @_[0,1];
>     my ($FullFilesList,$NewerFilesList);
>     my $FullFilesListName = '';
>     if (scalar @_ > 2) {
> 	$FullFilesListName = shift @_;
> 	open $FullFilesList,">$FullFilesListName";
>     };
>     open $NewerFilesList,">$NewerFilesListName";
>     # -------------------------------------------------------------------------
>     # Returned anonymous function that will actually process file names.
>     # Here filename is only string that will be printed, fileisnewer is
>     # boolean, and filetype is one char from 'dflo'
>     # -------------------------------------------------------------------------
>     return sub ($$$) {
> 	my ($filename,$fileisnewer,$filetype) = @_;
> 	if ((index $TypeMask,$filetype,0) >= $[ ) {
> 	    print $FullFilesList "$filename\n" if ($FullFilesListName ne '');
> 	    print $NewerFilesList "$filename\n" if $fileisnewer;
> 	};
>     };
> 
> };
> # -----------------------------------------------------------------------------
> # Main working function. 
> # input: $treepath is a path to processed subtree root
> # 	 $NewSnapshootName is filename to write new snapshoot data
> # 	 $fileproc is a pointer to function to process file names
> # 	 $WorkSnapshootName is filename of used snapshoot file.
> # Output: none.
> # Snapshoot file format:
> # 1st line - timestamp from runtime slash $treepath inode number
> # other lines - directory slash it's inode number
> # lines since second is sorted with cmpnoi18n comparison
> # Tasks:
> # 1. Generate NewSnapshoot file for next run.
> # 2. Call $fileproc function for each file with $treepath and it's
> # subdirectories setting file type and isnewer values.
> # filetype: see filestats()
> # isnewer: boolean == (<parent directory unknown at workSnapshoot> || 
> # (<parent directory inode> != <inode from workSnapshoot>) ||
> # (<file ctime> >= <time from workSnapshoot))
> # --------------------------------------------------------------------
> sub findnewer ($$$:$) {
>     my ($treepath,$NewSnapshootName,$fileproc) = @_[0..2];
>     my $WorkSnapshoot;
>     my ($oldname,$oldino,$oldtime,$WorkSnapshootName) = ('',0,0,'');
>     if (scalar @_ > 3) { 
> 	$WorkSnapshootName= $_[3]; 
> 	($oldtime,$oldino)= <$WorkSnapshoot> =~ m@^(.+)/(\d+)$@ if (open $WorkSnapshoot,"<$WorkSnapshootName");
> 	# $oldtime is a time to compare ctime's with, $oldino is always inode
> 	# number read from workSnapshoot
>     }; 
>     open my $NewSnapshoot, ">$NewSnapshootName" or die "Can't open file to write new snapshoot";
>     my ($filetype,$cnewer,$ino) = filestats($treepath,$oldtime);
>     my $newtime = time;
>     print $NewSnapshoot "${newtime}/$ino\n"; #write first line to snapshoot
>     my $dirunknown = $ino != $oldino;
>     my $curparent=&getlevel($dirunknown,$treepath,'');
>     while (($curparent->{entindex} < $curparent->{entnum}) || ref $curparent->{parent}) {
> 	while ($curparent->{entindex} < $curparent->{entnum}) {
> 	    my $filename = $curparent->{entries}[$curparent->{entindex}++];
> 	    next if ($filename eq '..' || $filename eq '.');
> 	    $filename = $curparent->{name} . $filename;
> 	    ($filetype,$cnewer,$ino) = filestats(($treepath . '/' . $filename),$oldtime);
> 	    my $fileisnewer = ($curparent->{unknown} || $cnewer);
> 	    &$fileproc($filename,$fileisnewer,($filetype =~ s/D/d/));
> 	    if ($filetype =~ m/[dD]/) {
> 		if ($filetype eq 'd') {
> 		    print $NewSnapshoot "${filename}/$ino\n";
> 		    # check directory against WorkSnapshoot
> 		    my $dircmp = &cmpnoi18n($oldname,$filename);
> 		    while ($dircmp < 0 && ($WorkSnapshootName ne '') && (not eof($WorkSnapshoot))) {
> 			    ($oldname,$oldino) = <$WorkSnapshoot> =~ m@^(.+)/(\d+)$@;
> 			    $dircmp = cmpnoi18n($oldname,$filename);
> 		    };
> 		    $dirunknown = ($dircmp != 0 || $ino != $oldino);
> 		    # ----------------------------------------------
> 		    $curparent = &getlevel($dirunknown,$treepath,$filename,$curparent);
> 		}
> 		else { warn "Directory ${treepath}/$filename isn't readable or executeable. Skipping";};
> 	    };
> 	};
> 	$curparent = $curparent->{parent} if (ref $curparent->{parent});
>     };
> };
> 
> # example usage
> my $WorkSnapshootName = 'wsnsh';
> my $NewSnapshootName = 'nsnsh';
> my $treepath = '/home/dima';
> my $types = 'dflo';
> my $fileproc = &setfileproc($types,'-');
> &findnewer($treepath,$NewSnapshootName,$fileproc,$WorkSnapshootName);
> 
> ========
> Example searches for files changed since we generated file ./wsnsh with
> subtree /home/dima printing changed files to STDOUT and writing new
> snapshoot into ./nsnsh. If wsnsh does not exist it will think that all
> files within subtree are changed.
> 
> WBR
> Dmitri Ivanov
> 

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Mon Oct 9 06:22:12 2006