Re: Novice question: unknown MetaNames error

From: Julie Wetherill <julie(at)not-real.gentoo.harvard.edu>
Date: Fri Jan 16 2004 - 17:11:29 GMT
>
> > Anyway, I do have a related problem that maybe you can explain. I need to
> > retrieve on metadata embedded in PDFs. Adobe uses Dublin Core tags
> > (dc:description, dc:title, dc:creator). I can't get swish-e to recognize
> > these as metanames (whether these are in PDFs or in HTML).
>
>$ cat c
>MetaNames dc:description
>
>$ cat 1.html
>hello
>$ swish-e -c c -i 1.html -v0 -T indexed_words
>     Adding:[1:swishdefault(1)]   'b'   Pos:2  Stuct:0x7 ( HEAD TITLE FILE )
>     Adding:[1:swishdefault(1)]   'title'   Pos:3  Stuct:0x7 ( HEAD TITLE FILE )
>     Adding:[1:dc:description(10)]   'foo'   Pos:6  Stuct:0x85 ( META HEAD FILE )
>     Adding:[1:swishdefault(1)]   'hello'   Pos:9  Stuct:0x9 ( BODY FILE )
>
>$ swish-e -w dc:description=foo
># SWISH format: 2.4.1
># Search words: dc:description=foo
># Removed stopwords:
># Number of hits: 1
># Search time: 0.001 seconds
># Run time: 0.043 seconds
>1000 1.html "<b>title" 125
>.
>
>
> > Warning: Substituted possible embedded null character(s) in file
> > '/home/hul/htdocs/ois/systems/aleph/docs/test/serial_claiming_in_Aleph.pdf'
>
>Looks like you are not filtering the pdf files.

Oops. When I truncated the config file for testing, I dropped the filtering 
directive. But it is there in the full config file:

FileFilter .pdf /usr/local/apache/swish/filter-bin/_pdf2html.pl

and I still get no results when searching on dc:description. I have
copied the whole session below. So maybe there is a problem with the PDF
filter? --julie
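
One way to narrow this down would be to save the HTML that the PDF filter emits and check whether it contains the expected meta tags at all. The following is only a rough sketch, not part of swish-e: it assumes the filter writes ordinary HTML with <meta name="..." content="..."> tags, and the names MetaCollector and find_metas, as well as the sample document, are made up for illustration.

```python
import sys
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.metas = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        if name is not None:
            # Assumption: meta names are compared case-insensitively.
            self.metas[name.lower()] = attr_map.get("content") or ""


def find_metas(html_text):
    """Return a dict mapping meta name -> content for the given HTML text."""
    parser = MetaCollector()
    parser.feed(html_text)
    parser.close()
    return parser.metas


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Path to a file holding the saved filter output (hypothetical).
        with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
            text = fh.read()
    else:
        # Made-up sample, used only when no file is given.
        text = ('<html><head><meta name="dc:description" content="foo">'
                '</head><body>hello</body></html>')
    metas = find_metas(text)
    for wanted in ("dc:description", "title", "creator"):
        print(wanted, "->", metas.get(wanted, "(not present)"))
```

If dc:description shows up as "(not present)" for the filtered PDF output, the filter is probably not passing the metadata through, and the MetaNames setting would not be the place to look.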


sylvia{julie}15: cat metadata3.conf
# DIRECTIVES COMMON to  HTTP and FILESYSTEM METHODS
###################################################

IndexDir /home/hul/htdocs/ois/systems/aleph/docs/test/
# For the FileSystem Method:
# This is a space-separated list of files and
# directories you want indexed. You can specify
# more than one of these directives.
#
# For the HTTP Method:
# Use the URL's from which you want the spidering
# to begin.
# NOTE: use html files rather than directories
# for this method.

IndexFile /usr/local/apache/swish-indexes/metadata3.index
# This is what the generated index file will be.

IndexName "Aleph document index"
IndexDescription "Index of Aleph staff documentation"
#IndexPointer "http://sunsite/~ghill/swish/index.html"
#IndexAdmin "Giulia Hill, (ghill@library.berkeley.edu)"
# Extra information you can include in the index file.

MetaNames dc:description title creator
# List of all the meta names used in the files to index; must be on one line.
# If there are no metanames, DO NOT delete the line.

PropertyNames dc:description title creator

IndexReport 3
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.

FollowSymLinks yes
# Put "yes" to follow symbolic links in indexing, else "no".

ReplaceRules remove "/home/hul/htdocs/"
#ReplaceRules replace "[a-z_0-9]*_m.*\.html" "index.html"
#ReplaceRules replace "/home/oisprivate/htdocs/" "/"
# ReplaceRules allow you to make changes to file pathnames
# before they're indexed. This directive uses C library
# regex.h regular expressions.
# NOTE: do not use replace <string> "" to remove a string,
# use remove <string> instead - you might get a core dump otherwise.

#MinWordLimit 5
# Set the minimum length of an indexable word. Every shorter word
# will not be indexed.
# Commenting out the line will give the defaults

#MaxWordLimit 5
# Set the maximum length of an indexable word. Every longer word
# will not be indexed.
# Commenting out the line will give the defaults

#WordCharacters abcdefghijklmnopqrstuvwxyz\&#;0123456789.@|,-'"[](~!@$%^{}_+?
# WORDCHARS is a string of characters which SWISH permits to
# be in words. Any strings which do not include these characters
# will not be indexed. You can choose from any character in
# the following string:
#
# abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'"`~,.[]{}()
#
# Note that if you omit "0123456789&#;" you will not be able to
# index HTML entities. DO NOT use the asterisk (*), lesser than
# and greater than signs (<), (>), or colon (:).
#
# Including any of these four characters may cause funny things to happen.
# NOTE: Do not escape \ nor " and they cannot be the first letter in the string
# Commenting out the line will give the defaults

#BeginCharacters m"
# Of the characters that you decide can go into words, this is
# a list of characters that words can begin with. It should be
# a subset of (or equal to) WordCharacters
# Same rule of syntax as for WordCharacters

#EndCharacters \"\
# Of the characters that you decide can go into words, this is
# a list of characters that words can end with. It should be
# a subset of (or equal to) WordCharacters
# Same rule of syntax as for WordCharacters

#IgnoreLastChar
# Array that contains the chars that, if considered valid in the middle of
# a word, need to be disregarded when at the end. It is important to also
# set the given chars in the ENDCHARS array, otherwise the word will not
# be indexed because it is considered invalid.
# Commenting out the line will give the defaults
# NOTE: if " is the first char in the string it needs to be escaped with \
# Do not escape otherwise

#IgnoreFirstChar
# Array that contains the chars that, if considered valid in the middle of
# a word, need to be disregarded when at the beginning. This was to solve
# the problem of parentheses when there is no space between ( and the
# beginning of the word.
# Remember to add the char's to the BEGINCHARS list also.
# Commenting out the line will give the defaults
# NOTE: if " is the first char in the string it needs to be escaped with \
# Do not escape otherwise

IgnoreLimit 50 1000
# This automatically omits words that appear too often in the files
# (these words are called stopwords). Specify a whole percentage
# and a number, such as "80 256". This omits words that occur in
# over 80% of the files and appear in over 256 files. Comment out
# to turn off auto-stopwording.

#IgnoreWords SwishDefault
# The IgnoreWords option allows you to specify words to ignore.
# Comment out for no stopwords; the word "SwishDefault" will
# include a list of default stopwords. Words should be separated by spaces
# and may span multiple directives.

IndexComments 0
# This option lets the user decide whether to index the comments in the files.
# Default is 1. Set to 0 if comment indexing is not required.

##################################
# DIRECTIVES for FILESYSTEMS ONLY
# Comment out if using HTTP
###################################

IndexOnly .html .pdf
# Only files with these suffixes will be indexed.

NoContents .gif .xbm .au .mov .mpg .ps
# Files with these suffixes will not have their contents indexed -
# only their file names will be indexed.

FileFilter .pdf /usr/local/apache/swish/filter-bin/_pdf2html.pl

FileRules pathname contains BudgRep
#FileRules pathname contains .*dir1
#FileRules filename contains # % ~ .bak .orig .old old.
#FileRules title contains construction example pointers
#FileRules directory contains .htaccess
#FileRules filename is index
# Files matching the above criteria will *not* be indexed.
# The pattern matching uses the C library regex.h

################################
# DIRECTIVES for HTTP METHOD ONLY
# Comment out if using FILESYSTEM
##################################

#MaxDepth 5
#(default 5)  This defines how many links the spider should
#follow before stopping.  A value of 0 configures the spider to
#traverse all links

#Delay 60
#(default 60)  The number of seconds to wait between issuing
#requests to a server.

#TmpDir /tmp
#(default /var/tmp)  The location of a writeable temp directory
#on your system.  The HTTP access method tells the Perl helper to place
#its files there.

#SpiderDirectory /home/ghill/swishRon/src/
#(default ./)  The location of the Perl helper
#script.  Remember, if you use a relative directory, it is relative to
#your directory when you run SWISH-E, not to the directory that SWISH-E
#is in.

#EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
#EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu
#(default nothing)  This allows you to deal with
#servers that respond to multiple DNS names.  Each line should have
#a list of all the method/names that should be considered equivalent.
#If you have multiple directives, each one defines its own set of equivalent
#servers.
sylvia{julie}16: swish-e.new -c metadata3.conf -i /home/hul/htdocs/ois/systems/aleph/docs/test
Indexing Data Source: "File-System"
Indexing "/home/hul/htdocs/ois/systems/aleph/docs/test"

Checking dir "/home/hul/htdocs/ois/systems/aleph/docs/test"...
   acq-approval_plan_titles.pdf - Using DEFAULT (HTML) parser -  (292 words)
   bestpractice_vendor_code_not_active.pdf - Using DEFAULT (HTML) parser -  (285 words)
   cat-rept-xpo-fail.html - Using DEFAULT (HTML) parser -  (322 words)
   print-setup-circacq.bk.html - Using DEFAULT (HTML) parser -  (262 words)
   serial_claiming_in_Aleph.pdf - Using DEFAULT (HTML) parser -  (1707 words)
   cat-authrec-conflicts.pdf - Using DEFAULT (HTML) parser -  (373 words)
   cres_dataentryguidelines.pdf - Using DEFAULT (HTML) parser -  (2736 words)

In dir "/home/hul/htdocs/ois/systems/aleph/docs/test/_notes":

In dir "/home/hul/htdocs/ois/systems/aleph/docs/test/_baks":

In dir "/home/hul/htdocs/ois/systems/aleph/docs/test/_baks/_notes":

Removing very common words...
   Getting IgnoreLimit stopwords: Complete
no words removed.
Writing main index...
Sorting words ...
Sorting 1043 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text: Complete
   Writing word hash: Complete
   Writing word data: Complete
1043 unique words indexed.
7 properties sorted.
7 files indexed.  859549 total bytes.  5977 total words.
Elapsed time: 00:00:03 CPU time: 00:00:02
Indexing done!
sylvia{julie}17: swish-e.new -w dc:description=acquisitions -f metadata3.index
# SWISH format: 2.2.3
# Search words: dc:description=acquisitions
err: no results 
Received on Fri Jan 16 17:19:13 2004