Hi Bill!
OK, we can require the keywords to be written on one line (no multi-line
values); that was no problem.
The problem is that the meta tag "keywords" comes out as ">eta
name="keywords" (see below)!
swish-filter-test produces the following output:
swishe@local:~/swish-e/bin> ./swish-filter-test -content /daten/intranet/krunet/keywords.pdf
Document /daten/intranet/krunet/keywords.pdf was filtered.
Document: /daten/intranet/krunet/keywords.pdf
(/daten/intranet/krunet/keywords.pdf)
Content-Type: text/html
Parser type: HTML*
>Filter used: SWISH::Filters::Pdf2HTML=HASH(0x845ec40)
( application/pdf -> text/html )
<html>
<head>
<meta name="author" content="Rieser Nachrichten">
<meta name="creationdate" content="Sun Mar 6 18:27:40 2005">
<meta name="encrypted" content="no">
<meta name="file_size" content="21374 bytes">
">eta name="keywords" content="Förderprogramm LEADER+
<meta name="moddate" content="Tue Mar 22 09:38:43 2005">
<meta name="optimized" content="yes">
<meta name="page_size" content="595 x 842 pts (A4)">
<meta name="pages" content="1">
<meta name="pdf_version" content="1.5">
<meta name="producer" content="Acrobat Distiller 6.0.1 (Windows)">
<meta name="subject" content="LEADER+-Projekte sollen durch Netzwerk
verbunden und für alle nutzbar gemacht werden">
<meta name="tagged" content="yes">
<meta name="title" content="Neuer Schwung für Monheimer Alb">
</head>
<body>
<pre>
</pre>
</body>
</html>
swishe@local:~/swish-e/bin>
Swish-e seems to index the keywords ('Förderprogramm' and 'LEADER+'):
swishe@local:~/swish-e/bin> swish-e -T index_words -S fs -c /home/swishe/swish-e/conf/swish.fs.kr.conf
Indexing Data Source: "File-System"
Indexing "/srv/www/htdocs/krunet"
Checking dir "/srv/www/htdocs/krunet"...
leer.pdf - Using HTML2 parser - White-space found word
'http://localhost/krunet/keywords.pdf'
White-space found word 'Path-Name:'
White-space found word '/srv/www/htdocs/krunet/keywords.pdf'
White-space found word 'Content-Length:'
White-space found word '861'
White-space found word 'Last-Mtime:'
White-space found word '1111480257'
White-space found word 'Document-Type:'
White-space found word 'HTML*'
White-space found word 'Neuer'
White-space found word 'Schwung'
White-space found word 'für'
White-space found word 'Monheimer'
White-space found word 'Alb'
White-space found word 'Förderprogramm'
White-space found word 'LEADER+'
White-space found word 'Path-Name:'
White-space found word '/srv/www/htdocs/krunet/keywords.pdf'
White-space found word 'Content-Length:'
White-space found word '861'
White-space found word 'Last-Mtime:'
White-space found word '1111480257'
White-space found word 'Document-Type:'
White-space found word 'HTML*'
(40 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 28 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
28 unique words indexed.
6 properties sorted.
1 file indexed. 21,374 total bytes. 52 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
swishe@local:~/swish-e/bin>
My config file is below:
#####################################################
# Swish-e config to index the Intranet BZA files #
# #
# Use swish-e for indexing the /krunet folder #
#####################################################
IndexDir /srv/www/htdocs/krunet
# Specify the program /folder to run
FollowSymLinks yes
# Follow symbolic links in indexing
IndexName "Intranet DLE Krumbach"
IndexDescription "Index der Intranet Krumbach Dateien auf dem bza Rechner."
# Name and description of the index
IndexFile /home/swishe/swish-e/index/intra_kr_fs.index
# Index file name
FileFilter .pdf /home/swishe/swish-e/lib/swish-e/DirTree.pl
# Using DirTree.pl for filtering .pdf files
IndexContents HTML2 .htm .html .pdf
IndexContents TXT2 .doc .xls
# With HTML2 use libxml2 library (recommended)
# With TXT2 for Word and Excel files
IndexOnly .htm .html .pdf .txt .doc .xls
# Index only files ending in .htm, ... .
IgnoreWords File: /home/swishe/swish-e/share/doc/swish-e/examples/conf/stopwords/german.txt
# words to be ignored by indexing
ReplaceRules replace "/srv/www/htdocs/" "http://localhost/"
# Allows you to make changes to file pathnames before they're indexed.
# These changed file names or URLs will be returned in search results.
PropertyNamesDate created_on
# tell Swish that you have a property called created_on, and that it's a timestamp
#PropertyNames title author
# List of meta tags names that can be retrieved with the -p option.
# Index size increases as by the formula in the manual.
# Comment out if no PropertyNames. Case insensitive
PropertyNameAlias swishtitle title
# alias title to swishtitle
Metanames swishtitle swishdocpath swishlastmodified keywords
# Allow extra searching by title, path, date
UndefinedMetaTags ignore
# By default, undefined meta names are indexed as plain text
# This feature can change this behaviour. Here we say
# don't index text in metatags unless defined in MetaNames
MetaNames automatic
# MetaNames first author
# List of all the meta names used in the file to index, must be on one line.
# If no metanames DO NOT delete the line.
# New in 2.0 -> automatic option will extract metanames dynamically
StoreDescription TXT* 200000
StoreDescription HTML* <body> 200000
# Set StoreDescription for each parser
# to display context with search results
FileRules pathname contains '/0_'
# Don't index the directory with "0_"
FileRules filename contains '/0_' linker_frame
# And don't index any files with "0_" and "linker_frame.htm" # and 'Kopie von 0_hauptseite'
IndexReport 3
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.
# 4 is debugging. Can be overridden with -v on the command line
ParserWarnLevel 1
# Sets the error level when using the libxml2 parser for XML and HTML.
# libxml2 will point out structural errors in your documents.
# 0 = no report 1 = fatal errors 2 = errors 3 = warnings
But searching for the words 'LEADER' or 'Förderprogramm' returns no results!
Is my config file wrong?
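The comments on UndefinedMetaTags and MetaNames in the config above can be read as a small decision rule. Here is a minimal Python sketch of that reading. It models only what the comments say, not Swish-e's actual implementation; `meta_disposition` is a hypothetical helper, and treating names as case insensitive is an assumption.

```python
def meta_disposition(tag, metanames, undefined_meta_tags="index"):
    """Decide what happens to the text of one <meta> tag.

    Models only the config comments: tags listed in MetaNames are indexed
    under that name; other tags are indexed as plain text unless
    UndefinedMetaTags is set to 'ignore'.
    """
    # Assumption: meta names are compared case insensitively.
    listed = {name.lower() for name in metanames}
    if tag.lower() in listed:
        return "metaname"
    if undefined_meta_tags == "ignore":
        return "skip"
    return "plain"


# Values taken from the Metanames line of the config above.
metanames = ["swishtitle", "swishdocpath", "swishlastmodified", "keywords"]

keywords_result = meta_disposition("keywords", metanames, "ignore")
author_result = meta_disposition("author", metanames, "ignore")
```

Under this reading, "keywords" is indexed under its own meta name while "author" is skipped, so a query limited to that name (for example -w keywords=LEADER) may behave differently from a plain word search.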
Why does swish-filter-test display the following content?
..
">eta name="keywords" content="Förderprogramm LEADER+
..
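A possible cause (an assumption, not verified on this PDF): the keywords value may end in a carriage return, so the terminal jumps back to column 0 and prints the closing "> over the leading <m. The Python sketch below simulates that effect. The `render` helper and the sample line are hypothetical.

```python
def render(line: str) -> str:
    """Simulate how a terminal displays a line containing carriage returns.

    A '\r' moves the cursor back to column 0; characters printed afterwards
    overwrite whatever is already shown in those columns.
    """
    cells = []
    col = 0
    for ch in line:
        if ch == "\r":
            col = 0
            continue
        if col < len(cells):
            cells[col] = ch      # overwrite an already printed character
        else:
            cells.append(ch)     # extend the visible line
        col += 1
    return "".join(cells)


# Hypothetical filter output line: the content value ends in '\r'.
raw = '<meta name="keywords" content="Förderprogramm LEADER+\r">'

shown = render(raw)               # what the terminal would display
cleaned = raw.replace("\r", "")   # the tag with the carriage return removed
```

If this is what happens, the tag itself is intact and only its display is misleading; piping the filter output through a hex dump would show whether a carriage return is really there.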
With regards
Leonard Scheermann
>On Mon, Mar 21, 2005 at 06:30:49PM +0100, Scheermann Leonard wrote:
>> pdfinfo parses just the first line of "Keywords":
>>
>> pdfinfo keywords.pdf
>> "
>> Title: Neuer Schwung für Monheimer Alb
>> Subject: LEADER+-Projekte sollen durch Netzwerk verbunden und für
>> alle nutzbar gemacht werden
>
>Is that wrapped from your mail program or did pdfinfo wrap that?
>
>> Keywords: Förderprogramm LEADER+
>
>So, pdfinfo truncated that?
>
>A quick google turned up this:
>
> http://www.tug.org/pipermail/pdftex/2003-December/004649.html
>
>You might try asking the author of xpdf about this.
>
Received on Tue Mar 22 01:42:02 2005