Image Indexing

From: Bill Conlon <bill(at)not-real.tothept.com>
Date: Tue Mar 22 2005 - 03:33:29 GMT
Here's a filter (at the bottom) for indexing images, i.e., grabbing 
the IPTC info such as caption, date, and copyright.  It requires 
Image::IPTCInfo.

If spidering, include the 'img' tag in the link_tags option in 
spider.pl!
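
As a rough sketch of what that might look like in a spider config file 
(the URL and email address are placeholders, and the surrounding 
@servers declaration should follow whatever your existing config 
already uses):

```perl
# Spider config sketch: only link_tags is the point here.
# base_url and email are hypothetical placeholder values.
our @servers = (
    {
        base_url  => 'http://localhost/demo_images.html',
        email     => 'swish@example.com',
        # Extract links from <img> tags as well as <a> and <frame>
        link_tags => [qw/ a frame img /],
    },
);

1;
```

If your config also has a test_url callback that skips image 
extensions, it would need to let .jpg URLs through.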

The easiest thing to do was to write the IPTC data into the <body> 
using IPTCInfo's ExportXML function.  The HTML2 parser seems to handle 
this nicely.  (I suppose anyone who wants meta tags in the <head> can 
follow the example of ExportXML and wrap the content in meta tags.)
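
A minimal sketch of the meta-tag variant, not part of the filter 
below.  The helper names (escape_html, meta_name, meta_tags) and the 
sample values are made up for illustration, and the name 
normalization ('/' to '-', space to '_') is only inferred from the 
MetaNames in the indexing summary further down:

```perl
use strict;
use warnings;

# Escape characters that are significant inside an HTML attribute value.
sub escape_html {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Normalize an IPTC field name into something usable as a meta name
# (assumption: lowercase, '/' becomes '-', space becomes '_').
sub meta_name {
    my ($name) = @_;
    $name = lc $name;
    $name =~ tr{/ }{-_};
    return $name;
}

# Turn field => value pairs into <meta> tags, one per line.
# Undefined or empty values are skipped; keys are sorted for stable output.
sub meta_tags {
    my (%fields) = @_;
    my $out = '';
    for my $field ( sort keys %fields ) {
        my $value = $fields{$field};
        next unless defined $value && length $value;
        $out .= sprintf qq{<meta name="%s" content="%s" />\n},
            meta_name($field), escape_html($value);
    }
    return $out;
}

# Hypothetical sample values; in the filter these would come from
# calls such as $info->Attribute('caption/abstract').
print meta_tags(
    'caption/abstract' => 'A van full of burgers',
    'credit'           => 'A & B Photos',
);
```

The returned string could then go into the <head> in place of (or 
alongside) the <title> line.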

Here's a little html file that displays the demo images provided with 
Image::IPTCInfo:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
	<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
	<title>Demo Images</title>
</head>
<body>
<img src="demo_images/dog.jpg" alt="dog demo image" id="dog" 
width="200" height="228" border="0" /><br />
<img src="demo_images/burger_van.jpg" alt="burger van demo image" 
id="van" width="200" height="228" border="0" /><br />

</body>
</html>

Here's the indexing summary:

Summary for: http://***********/demo_images.html
Connection: Close:     3  (3.0/sec)
       Total Bytes: 2,363  (2363.0/sec)
        Total Docs:     3  (3.0/sec)
       Unique URLs:     3  (3.0/sec)
**Adding automatic MetaName 'image' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'date_created' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'country-primary_location_name' found in 
file 'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'object_name' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'credit' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'writer-editor' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'special_instructions' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'caption-abstract' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'by-line_title' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'by-line' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'city' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'province-state' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'category' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'headline' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'keywords' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'keyword' found in file 
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'supplemental_categories' found in file 
'http://********/demo_images/burger_van.jpg'
**Adding automatic MetaName 'supplemental_category' found in file 
'http://********/demo_images/burger_van.jpg'
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 45 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text: Complete
   Writing word hash: Complete
   Writing word data: Complete
45 unique words indexed.
4 properties sorted.
3 files indexed.  2,363 total bytes.  175 total words.
Elapsed time: 00:00:01 CPU time: 00:00:00
Indexing done!

Here's a query:

# swish-e -w burger* -f iptc.swish-e
# SWISH format: 2.4.3
# Search words: burger*
# Removed stopwords:
# Number of hits: 2
# Search time: 0.000 seconds
# Run time: 0.016 seconds
1000 http://*********/demo_images/burger_van.jpg "A van full of burgers 
awaits mysterious buyers in a dark parking lot." 932
1000 http://**********/demo_images/dog.jpg "A large dog awaits burger 
delivery from the back of his Porsche 928." 864
.

Important note:  The mimetypes list needs to be expanded!  As written 
it matches only image/jpeg.  I haven't checked to see which (maybe 
all?) graphic formats support IPTC.  Certainly gif.
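
For illustration, the mimetypes entry is a list of regular expressions 
that get matched against a document's content type, so expanding it 
means adding patterns.  The sketch below shows only the matching; the 
image/tiff entry and the handles() helper are hypothetical, and I'm 
not claiming Image::IPTCInfo reads IPTC data from anything but JPEG 
without testing:

```perl
use strict;
use warnings;

# Candidate patterns; image/tiff is an untested, hypothetical addition.
my @mimetypes = ( qr!image/jpeg!, qr!image/tiff! );

# Return true if any pattern matches the given content type.
sub handles {
    my ($content_type) = @_;
    return scalar grep { $content_type =~ $_ } @mimetypes;
}

for my $type (qw{ image/jpeg image/tiff text/html }) {
    printf "%-12s %s\n", $type, handles($type) ? 'handled' : 'skipped';
}
```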

==============================IPTC2html.pm =============

package SWISH::Filters::IPTC2html;
use strict;
use vars qw/ $VERSION /;
use Image::IPTCInfo;

$VERSION = '0.01';
sub new {
    my ( $class ) = @_;
    my $self = bless {
        # list of MIME types this filter handles
        mimetypes   => [ qr!image/jpeg! ],
    }, $class;
    return $self->use_modules( qw/ Image::IPTCInfo / );
}

sub filter {
    my ( $self, $doc ) = @_;
    my $file = $doc->fetch_filename;

    # Create new info object
    my $info = Image::IPTCInfo->new($file);

    # Skip files that have no IPTC data
    return unless defined $info;

    # Use the caption as the title.  It may be missing, and it must be
    # escaped before being embedded in the HTML.
    my $caption = $info->Attribute('caption/abstract');
    $caption = '' unless defined $caption;
    for ($caption) { s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; }
    my $headers = "<title>$caption</title>\n";

    # Update the document's content type
    $doc->set_content_type( 'text/html' );
    my $xml = $info->ExportXML('image');

    my $txt = <<EOF;
<html>
<head>
$headers
</head>
<body>
$xml
</body>
</html>
EOF


    return \$txt;
}



1;
__END__

=head1 NAME

SWISH::Filters::IPTC2html - Perl extension for filtering image files with Swish-e

=head1 DESCRIPTION

This is a plug-in module that uses the Perl Image::IPTCInfo package to 
extract meta-data into html for indexing by Swish-e.

This filter plug-in requires the Image::IPTCInfo package available at:

    http://search.cpan.org/~jcarter/Image-IPTCInfo-1.9/IPTCInfo.pm


=head1 AUTHOR

Bill Conlon

=head1 SEE ALSO

L<SWISH::Filter>

=head1 SUPPORT

Please contact the Swish-e discussion list.
http://swish-e.org/

=cut
Received on Mon Mar 21 19:33:32 2005