Here's a filter (at the bottom) for indexing images, i.e., grabbing
the IPTC info such as caption, date, and copyright. It requires
Image::IPTCInfo.
If you are spidering, add 'img' to the link_tags option in your
spider.pl configuration, so the spider follows <img> tags to the
image files!
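As a rough sketch, a spider.pl config file with 'img' added might look like the following. The base_url and email values are placeholders, not the ones I used. Any other settings needed to hand the fetched images to the filter are left out here.

```perl
# SwishSpiderConfig.pl: minimal sketch, values below are placeholders
@servers = (
    {
        base_url  => 'http://localhost/demo_images.html',  # placeholder URL
        email     => 'webmaster@localhost',                # placeholder address
        # extract links from <img> tags in addition to <a> and <frame>
        link_tags => [qw/ a frame img /],
    },
);
1;
```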
The easiest thing to do was to write the IPTC data into the <body>
using IPTCInfo's ExportXML method. The HTML2 parser seems to handle
this nicely. (I suppose anyone who wants meta tags in the <head> can
follow the example of ExportXML and wrap the content in meta tags.)
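For anyone who wants the meta tag route, here is a minimal sketch of the wrapping step. It is not part of the filter below. The helper names (escape_html, iptc_to_meta_tags) and the sample values are made up for illustration. It assumes you have already collected the attributes into a hash, e.g. with calls to $info->Attribute().

```perl
use strict;
use warnings;

# Escape the characters that are unsafe inside an HTML attribute value.
sub escape_html {
    my ($text) = @_;
    my %entity = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: turn a hash of IPTC attribute names and values
# into <meta> tags suitable for the <head> of the generated document.
sub iptc_to_meta_tags {
    my (%attributes) = @_;
    my @tags;
    for my $name ( sort keys %attributes ) {
        my $value = $attributes{$name};
        # Skip attributes that are missing or empty.
        next unless defined $value && length $value;
        # Lower-case the name, turn spaces into underscores and slashes
        # into hyphens, e.g. 'caption/abstract' becomes 'caption-abstract'.
        ( my $meta_name = lc $name ) =~ tr{ /}{_-};
        push @tags, sprintf '<meta name="%s" content="%s" />',
            escape_html($meta_name), escape_html($value);
    }
    return join "\n", @tags;
}

# Sample values are invented for the example.
print iptc_to_meta_tags(
    'caption/abstract' => 'A van full of burgers',
    'headline'         => 'Burgers & "fries"',
), "\n";
```

The returned string could then go into $headers in the filter instead of (or next to) the <title> line.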
Here's a little html file that displays the demo images provided with
Image::IPTCInfo:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="content-type" content="text/html;
charset=iso-8859-1" />
<title>Demo Images</title>
</head>
<body>
<img src="demo_images/dog.jpg" alt="dog demo image" id="dog"
width="200" height="228" border="0" /><br />
<img src="demo_images/burger_van.jpg" alt="burger van demo image"
id="van" width="200" height="228" border="0" /><br />
</body>
</html>
Here's the indexing summary:
Summary for: http://***********/demo_images.html
Connection: Close: 3 (3.0/sec)
Total Bytes: 2,363 (2363.0/sec)
Total Docs: 3 (3.0/sec)
Unique URLs: 3 (3.0/sec)
**Adding automatic MetaName 'image' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'date_created' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'country-primary_location_name' found in
file 'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'object_name' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'credit' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'writer-editor' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'special_instructions' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'caption-abstract' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'by-line_title' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'by-line' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'city' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'province-state' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'category' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'headline' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'keywords' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'keyword' found in file
'http://********/demo_images/dog.jpg'
**Adding automatic MetaName 'supplemental_categories' found in file
'http://********/demo_images/burger_van.jpg'
**Adding automatic MetaName 'supplemental_category' found in file
'http://********/demo_images/burger_van.jpg'
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 45 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
45 unique words indexed.
4 properties sorted.
3 files indexed. 2,363 total bytes. 175 total words.
Elapsed time: 00:00:01 CPU time: 00:00:00
Indexing done!
Here's a query:
# swish-e -w burger* -f iptc.swish-e
# SWISH format: 2.4.3
# Search words: burger*
# Removed stopwords:
# Number of hits: 2
# Search time: 0.000 seconds
# Run time: 0.016 seconds
1000 http://*********/demo_images/burger_van.jpg "A van full of burgers
awaits mysterious buyers in a dark parking lot." 932
1000 http://**********/demo_images/dog.jpg "A large dog awaits burger
delivery from the back of his Porsche 928." 864
.
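Each hit in the output above is a single line (wrapped here by the mailer) holding the rank, the path, the quoted title, and the size. If you want to post-process the results in Perl, something like this sketch would split a line into fields. The helper name parse_result_line and the example.com host are placeholders, and it assumes the default output format and paths without whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: split one default-format swish-e result line
# into its fields. Returns nothing for header lines and the final ".".
sub parse_result_line {
    my ($line) = @_;
    # rank, path (no whitespace), quoted title, size in bytes
    return unless $line =~ /^(\d+)\s+(\S+)\s+"(.*)"\s+(\d+)\s*$/;
    return { rank => $1, path => $2, title => $3, size => $4 };
}

# example.com stands in for the masked host name.
my $line = '1000 http://example.com/demo_images/dog.jpg '
         . '"A large dog awaits burger delivery from the back of his Porsche 928." 864';

if ( my $hit = parse_result_line($line) ) {
    print "$hit->{rank}\t$hit->{path}\t$hit->{title}\n";
}
```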
Important Note: The mimetypes list needs to be expanded! I haven't
checked to see which (maybe all?) graphic formats support IPTC.
Certainly gif.
==============================IPTC2html.pm =============
package SWISH::Filters::IPTC2html;
use strict;
use vars qw/ $VERSION /;
use Image::IPTCInfo;
$VERSION = '0.01';
sub new {
my ( $class ) = @_;
my $self = bless {
mimetypes => [ qr!image/jpeg! ], # list of types this filter handles
}, $class;
return $self->use_modules( qw/ Image::IPTCInfo / );
}
sub filter {
my ( $self, $doc ) = @_;
my $file = $doc->fetch_filename;
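# Create new info object
my $info = Image::IPTCInfo->new($file);
# Bail out if the file had no IPTC data
return unless defined $info;
# Use the caption as the document title (it may be missing)
my $caption = $info->Attribute('caption/abstract');
$caption = '' unless defined $caption;
# Escape characters that are special in HTML
$caption =~ s/&/&amp;/g;
$caption =~ s/</&lt;/g;
$caption =~ s/>/&gt;/g;
my $headers = "<title>$caption</title>\n";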
# update the document's content type
$doc->set_content_type( 'text/html' );
my $xml = $info->ExportXML('image');
my $txt = <<EOF;
<html>
<head>
$headers
</head>
<body>
$xml
</body>
</html>
EOF
return \$txt;
}
1;
__END__
=head1 NAME

SWISH::Filters::IPTC2html - Perl extension for filtering image files
with Swish-e

=head1 DESCRIPTION

This is a plug-in module that uses the Perl Image::IPTCInfo package to
extract metadata into HTML for indexing by Swish-e.

This filter plug-in requires the Image::IPTCInfo package, available at:

http://search.cpan.org/~jcarter/Image-IPTCInfo-1.9/IPTCInfo.pm

=head1 AUTHOR

Bill Conlon

=head1 SEE ALSO

L<SWISH::Filter>

=head1 SUPPORT

Please contact the Swish-e discussion list.

http://swish-e.org/

=cut
Received on Mon Mar 21 19:33:32 2005