Skip to main content.
home | support | download

Back to List Archive

Re: Select files for incremental indexing

From: Dmitri V. Ivanov <dima(at)not-real.intex.spb.ru>
Date: Fri Oct 06 2006 - 19:58:38 GMT
On Thu, Oct 05, 2006 at 12:33:57PM -0700, Dmitri V. Ivanov wrote:

> *********************************************************************
> Due to deletion of content types excluded from this list by policy,
> this multipart message was reduced to a single part, and from there
> to a plain text message.
> *********************************************************************

It seems I've failed to send html version...

There is a my opinion about 'Why' and 'How' and next to it my script.
It seems to me too slow. It sorts all files with directory whereas to
sort directories is sufficient. But to sort all files seems to be
usefull to get list of deleted files like uniq.

Why and How:
=========

   First of all about file timestamps and their update [1]in common and table
   with details:

   +-------------------------------------------------+
   |               |  timestamps marked for update   |
   |    syscall    |---------------------------------|
   |               |       file        | parent dir  |
   |---------------+-------------------+-------------|
   | [2]chdir      |                   |             |
   |---------------| -                 | -           |
   | [3]fchdir     |                   |             |
   |---------------+-------------------+-------------|
   | [4]chmod      |                   |             |
   |---------------| ctime             | -           |
   | [5]fchmod     |                   |             |
   |---------------+-------------------+-------------|
   | [6]chown      |                   |             |
   |---------------|                   |             |
   | [7]fchown     | ctime             | -           |
   |---------------|                   |             |
   | [8]lchown     |                   |             |
   |---------------+-------------------+-------------|
   | [9]close      | -                 | -           |
   |---------------+-------------------+-------------|
   | [10]creat     | atime,ctime,mtime | ctime,mtime |
   |---------------+-------------------+-------------|
   | [11]execve    | atime             | -           |
   |---------------+-------------------+-------------|
   | [12]fcntl     | -                 | -           |
   |---------------+-------------------+-------------|
   | [13]ftruncate |                   |             |
   |---------------| ctime,mtime       | -           |
   | [14]truncate  |                   |             |
   |---------------+-------------------+-------------|
   | [15]fstat     |                   |             |
   |---------------|                   |             |
   | [16]stat      | -                 | -           |
   |---------------|                   |             |
   | [17]lstat     |                   |             |
   |---------------+-------------------+-------------|
   | [18]fsync     |                   |             |
   |---------------| -                 | -           |
   | [19]fdatasync |                   |             |
   |---------------+-------------------+-------------|
   | [20]link      | ctime             | ctime,mtime |
   |---------------+-------------------+-------------|
   | [21]lseek     | -                 | -           |
   |---------------+-------------------+-------------|
   | [22]mknod     | atime,ctime,mtime | ctime,mtime |
   |---------------+-------------------+-------------|
   | [23]mkdir     | atime,ctime,mtime | ctime,mtime |
   |---------------+-------------------+-------------|
   | [24]mmap      | *                 | -           |
   |---------------+-------------------+-------------|
   | [25]munmap    | -                 | -           |
   |---------------+-------------------+-------------|
   | [26]msync     | *                 | -           |
   |---------------+-------------------+-------------|
   | [27]open      | *                 | *           |
   |---------------+-------------------+-------------|
   | [28]pread     |                   |             |
   |---------------|                   |             |
   | [29]read      | atime             | -           |
   |---------------|                   |             |
   | [30]readv     |                   |             |
   |---------------+-------------------+-------------|
   | [31]pwrite    |                   |             |
   |---------------|                   |             |
   | [32]write     | ctime,mtime       | -           |
   |---------------|                   |             |
   | [33]writev    |                   |             |
   |---------------+-------------------+-------------|
   | [34]rename    | implementation    | ctime,mtime |
   |---------------+-------------------+-------------|
   | [35]rmdir     | -                 | ctime,mtime |
   |---------------+-------------------+-------------|
   | [36]readlink  | *                 | -           |
   |---------------+-------------------+-------------|
   | [37]readdir   | atime             | -           |
   |---------------+-------------------+-------------|
   | readahead     | ?                 | ?           |
   |---------------+-------------------+-------------|
   | [38]symlink   | *                 | *           |
   |---------------+-------------------+-------------|
   | sendfile      | ?                 | ?           |
   |---------------+-------------------+-------------|
   | [39]unlink    | -                 | ctime,mtime |
   |---------------+-------------------+-------------|
   | [40]utime     | ctime             | -           |
   +-------------------------------------------------+

   Summary:

   Syscalls writing to file data updates mtime (creat, truncate, ftruncate,
   mknod, mkdir, pwrite, write, writev). mtime isn't updated when no data
   changed (notably link, unlink, rename). ctime is updated with any changes
   with file data and metadata.

   rename syscall hasn't strong rules for updating file timestamps at
   standart. With linux it updates ctime. I hope that most other systems do
   so, but i'm not sure.

   Second. We look into this table and say yourself: "Well. Just select files
   with ctime newer than. It's all". It's not.

   We know that directory is file and it can be renamed. When we rename
   directory we don't update timestamps of files it points. Additionally we
   can look into our table and see that ctime (of parent directory) is
   changed often. It's not sufficient to look only to timestamps.

   Lets look into metadata: there is three timestamps, uid,gid,access mask
   and inode number (and pointer to file data). inode number or [41]File
   serial number that is unique (per-filesystem).

   Let assume now, that our program rememberes directory name to inode number
   relations. If name (of parent directory we check) doesn't exist or
   associated inode number is other as remembered it's surely other
   directory. Otherwise there is two cases:

    1. It's the same directory (old good pepperbox).
    2. Old directory was deleted, then new (empty) directory was created and
       it was filled with files (but they must have newer ctime).

   For case of rename updating ctime we have criteria: ctime plus parent dir
   with it's inode number.

   Otherwise (rename don't update ctime of file) we must have remembered
   associations between all files names and their inode numbers.

   The fact that GNU tar uses only directory names persuades that most *nixes
   does update ctime with rename.

References

   Visible links
   1. http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_07
   2. http://www.opengroup.org/onlinepubs/009695399/functions/chdir.html
   3. http://www.opengroup.org/onlinepubs/009695399/functions/fchdir.html
   4. http://www.opengroup.org/onlinepubs/009695399/functions/chmod.html
   5. http://www.opengroup.org/onlinepubs/009695399/functions/fchmod.html
   6. http://www.opengroup.org/onlinepubs/009695399/functions/chown.html
   7. http://www.opengroup.org/onlinepubs/009695399/functions/fchown.html
   8. http://www.opengroup.org/onlinepubs/009695399/functions/chown.html
   9. http://www.opengroup.org/onlinepubs/009695399/functions/close.html
  10. http://www.opengroup.org/onlinepubs/009695399/functions/creat.html
  11. http://www.opengroup.org/onlinepubs/009695399/functions/exec.html
  12. http://www.opengroup.org/onlinepubs/009695399/functions/fcntl.html
  13. http://www.opengroup.org/onlinepubs/009695399/functions/ftruncate.html
  14. http://www.opengroup.org/onlinepubs/009695399/functions/truncate.html
  15. http://www.opengroup.org/onlinepubs/009695399/functions/fstat.html
  16. http://www.opengroup.org/onlinepubs/009695399/functions/stat.html
  17. http://www.opengroup.org/onlinepubs/009695399/functions/lstat.html
  18. http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html
  19. http://www.opengroup.org/onlinepubs/009695399/functions/fdatasync.html
  20. http://www.opengroup.org/onlinepubs/009695399/functions/link.html
  21. http://www.opengroup.org/onlinepubs/009695399/functions/lseek.html
  22. http://www.opengroup.org/onlinepubs/009695399/functions/mknod.html
  23. http://www.opengroup.org/onlinepubs/009695399/functions/mkdir.html
  24. http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
  25. http://www.opengroup.org/onlinepubs/009695399/functions/munmap.html
  26. http://www.opengroup.org/onlinepubs/009695399/functions/msync.html
  27. http://www.opengroup.org/onlinepubs/009695399/functions/open.html
  28. http://www.opengroup.org/onlinepubs/009695399/functions/pread.html
  29. http://www.opengroup.org/onlinepubs/009695399/functions/read.html
  30. http://www.opengroup.org/onlinepubs/009695399/functions/readv.html
  31. http://www.opengroup.org/onlinepubs/009695399/functions/pwrite.html
  32. http://www.opengroup.org/onlinepubs/009695399/functions/write.html
  33. http://www.opengroup.org/onlinepubs/009695399/functions/writev.html
  34. http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
  35. http://www.opengroup.org/onlinepubs/009695399/functions/rmdir.html
  36. http://www.opengroup.org/onlinepubs/009695399/functions/readlink.html
  37. http://www.opengroup.org/onlinepubs/009695399/functions/readdir.html
  38. http://www.opengroup.org/onlinepubs/009695399/functions/symlink.html
  39. http://www.opengroup.org/onlinepubs/009695399/functions/unlink.html
  40. http://www.opengroup.org/onlinepubs/009695399/functions/utime.html
  41. http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_175
=========

My script
=========

#!/usr/bin/perl -w
use strict;
use integer;

# comparison function for comparing paths. Character weight is its code value
# except for slash. Slash is lessiest character. There is no hope it will work
# with multibyte chars.
# ----------------------------------------------------------------------------
sub cmpnoi18n($$)
{   my ($leftstr,$rightstr) = @_;
    my ($zerros) = ("$leftstr" ^ "$rightstr") =~ m/(^[\x00]*)[^\x00]/;
    if (not defined($zerros)) { #strinngs are different from first char or same
	return (ord($leftstr) <=> ord($rightstr));}
    else {
	my $leftlen = length($leftstr);
	my $rightlen = length($rightstr);
	my $minlen = ($leftlen < $rightlen) ? $leftlen : $rightlen;
	my $zerrolen = length($zerros);
	if ($zerrolen == $minlen) { # one of different trailers is empty string
	    return ($leftlen <=> $rightlen);}
	else { # both different trailers aren't empty
	    $leftstr = substr $leftstr,$zerrolen,1;
	    $rightstr = substr $rightstr,$zerrolen,1;
	    return -1 if ($leftstr eq '/'); # make '/' less than any other char to don't have difference
	    return 1 if ($rightstr eq '/'); # beetwin sort of partial filenames and compare full
	    return (ord($leftstr) <=> ord($rightstr));
	};
    };
};
# ----------------------------------------------------------------------------
# Utilitary function to get need file stat subset. 
# Input: filepath suitable for # calling stat(), time to compare timestamps with
# Output: array with filetype, boolean cnewer, inode number.
# filetype is f for regular file, l for symlink, d for directory where we can 
# stat entries, D for directory where we cant and o for other file types.
# ----------------------------------------------------------------------------
sub filestats($$) {
    my ($filename,$oldtime) = @_;
    my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks) = lstat($filename);
    my $cnewer = $ctime >= $oldtime;
    my $filetype = 'o';
    $filetype = (-r _ && -x _) ? 'd' : 'D' if -d _;
    $filetype = 'f' if -f _;
    $filetype = 'l' if -l _;
    return ($filetype,$cnewer,$ino);
};
# -----------------------------------------------------------------------------
# Utility function to fill level (meaning directory level) structure. Returns
# pointer to this filled structure.
# level: hash with elements:
# 	name: path to directory from root of processed subtree
# 	entries: array with directory entries sorted with cmpnoi18n weighting
# 		function.
# 	unknown: boolean value representing the fact that directory wasn't 
# 		known at previous run.
# 	entnum: number of elements for entries array
# 	entindex: number of element from entries that must be processed next
# 	parent: pointer to level structure representing parent directory
# Input: boolean dirunknown, path to processed subtree, path to processed
# directory relating to root of processed subtree and, optionally pointer to
# level structure representing parent directory
# ---------------------------------------------------------------------------
sub getlevel {
    my ($dirunknown,$path2tree,$Dirname) = @_[0..2];
    my $parent = (scalar @_ > 3) ? $_[3] : 0;
    opendir(Dir,"${path2tree}/$Dirname");
    my $level = { name => $Dirname,
	  entries => [ sort cmpnoi18n readdir(Dir) ],
	  unknown => $dirunknown,
	  entnum => 0,
	  entindex => 0,
	  parent => $parent
	  };
    closedir Dir;
    $level->{entnum} = scalar @{$level->{entries}};
    $level->{name} = $level->{name} . '/' if $level->{name} gt '';
    return $level;
};
# -----------------------------------------------------------------------------
# Example set file process function. Return pointer to anonymous function,
# that will process supplied file depending of it's type and the fact, that file
# was changed since previous run
# Input: Mask of file types to proceed (substring of 'dflo'),
# 	File name to write newer files names to and, optionally,
# 	File name fo write all files names.
# -----------------------------------------------------------------------------
sub setfileproc ($$:$) {
    my ($TypeMask,$NewerFilesListName) = @_[0,1];
    my ($FullFilesList,$NewerFilesList);
    my $FullFilesListName = '';
    if (scalar @_ > 2) {
	$FullFilesListName = shift @_;
	open $FullFilesList,">$FullFilesListName";
    };
    open $NewerFilesList,">$NewerFilesListName";
    # -------------------------------------------------------------------------
    # Returned anonymous function that will actually process file names.
    # Here filename is only string that will be printed, fileisnewer is
    # boolean, and filetype is one char from 'dflo'
    # -------------------------------------------------------------------------
    return sub ($$$) {
	my ($filename,$fileisnewer,$filetype) = @_;
	if ((index $TypeMask,$filetype,0) >= $[ ) {
	    print $FullFilesList "$filename\n" if ($FullFilesListName ne '');
	    print $NewerFilesList "$filename\n" if $fileisnewer;
	};
    };

};
# -----------------------------------------------------------------------------
# Main working function. 
# input: $treepath is a path to processed subtree root
# 	 $NewSnapshootName is filename to write new snapshoot data
# 	 $fileproc is a pointer to function to process file names
# 	 $WorkSnapshootName is filename of used snapshoot file.
# Output: none.
# Snapshoot file format:
# 1st line - timestamp from runtime slash $treepath inode number
# other lines - directory slash it's inode number
# lines since second is sorted with cmpnoi18n comparison
# Tasks:
# 1. Generate NewSnapshoot file for next run.
# 2. Call $fileproc function for each file with $treepath and it's
# subdirectories setting file type and isnewer values.
# filetype: see filestats()
# isnewer: boolean == (<parent directory unknown at workSnapshoot> || 
# (<parent directory inode> != <inode from workSnapshoot>) ||
# (<file ctime> >= <time from workSnapshoot))
# --------------------------------------------------------------------
sub findnewer ($$$:$) {
    my ($treepath,$NewSnapshootName,$fileproc) = @_[0..2];
    my $WorkSnapshoot;
    my ($oldname,$oldino,$oldtime,$WorkSnapshootName) = ('',0,0,'');
    if (scalar @_ > 3) { 
	$WorkSnapshootName= $_[3]; 
	($oldtime,$oldino)= <$WorkSnapshoot> =~ m@^(.+)/(\d+)$@ if (open $WorkSnapshoot,"<$WorkSnapshootName");
	# $oldtime is a time to compare ctime's with, $oldino is always inode
	# number read from workSnapshoot
    }; 
    open my $NewSnapshoot, ">$NewSnapshootName" or die "Can't open file to write new snapshoot";
    my ($filetype,$cnewer,$ino) = filestats($treepath,$oldtime);
    my $newtime = time;
    print $NewSnapshoot "${newtime}/$ino\n"; #write first line to snapshoot
    my $dirunknown = $ino != $oldino;
    my $curparent=&getlevel($dirunknown,$treepath,'');
    while (($curparent->{entindex} < $curparent->{entnum}) || ref $curparent->{parent}) {
	while ($curparent->{entindex} < $curparent->{entnum}) {
	    my $filename = $curparent->{entries}[$curparent->{entindex}++];
	    next if ($filename eq '..' || $filename eq '.');
	    $filename = $curparent->{name} . $filename;
	    ($filetype,$cnewer,$ino) = filestats(($treepath . '/' . $filename),$oldtime);
	    my $fileisnewer = ($curparent->{unknown} || $cnewer);
	    &$fileproc($filename,$fileisnewer,($filetype =~ s/D/d/));
	    if ($filetype =~ m/[dD]/) {
		if ($filetype eq 'd') {
		    print $NewSnapshoot "${filename}/$ino\n";
		    # check directory against WorkSnapshoot
		    my $dircmp = &cmpnoi18n($oldname,$filename);
		    while ($dircmp < 0 && ($WorkSnapshootName ne '') && (not eof($WorkSnapshoot))) {
			    ($oldname,$oldino) = <$WorkSnapshoot> =~ m@^(.+)/(\d+)$@;
			    $dircmp = cmpnoi18n($oldname,$filename);
		    };
		    $dirunknown = ($dircmp != 0 || $ino != $oldino);
		    # ----------------------------------------------
		    $curparent = &getlevel($dirunknown,$treepath,$filename,$curparent);
		}
		else { warn "Directory ${treepath}/$filename isn't readable or executeable. Skipping";};
	    };
	};
	$curparent = $curparent->{parent} if (ref $curparent->{parent});
    };
};

# example usage
my $WorkSnapshootName = 'wsnsh';
my $NewSnapshootName = 'nsnsh';
my $treepath = '/home/dima';
my $types = 'dflo';
my $fileproc = &setfileproc($types,'-');
&findnewer($treepath,$NewSnapshootName,$fileproc,$WorkSnapshootName);

========
Example searches for files changed since we generated file ./wsnsh with
subtree /home/dima printing changed files to STDOUT and writing new
snapshoot into ./nsnsh. If wsnsh does not exist it will think that all
files within subtree are changed.

WBR
Dmitri Ivanov
Received on Fri Oct 6 12:58:40 2006