[swish-e] Question about searching for PDF documents using Swish-e (revised)

From: at <Michael>
Date: Thu, 20 Mar 2014 14:29:39 -0500
To whom it may concern: 

I would like to use Swish-e to search for PDF documents within a directory or within a database. However, I am having some difficulty doing so.

The version of Swish-e I am running is SWISH-E 2.4.7.
The version of Linux I am running is: 
Linux 2.6.32-431.5.1.el6.x86_64 #1 SMP Wed Feb 12 00:41:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux


Here is my swish-e configuration file: 
#Example Swish-e Configuration file
#Define *what* to index
#IndexDir can point to directories and/or files
#Here it is pointing to the current directory
#Swish-e will also recurse into sub-directories.
#But only index the .html files
#commented out by mike 3/19/14 IndexOnly .html

IndexDir .
IndexOnly .config .pdf
FileFilterMatch pdftotext "'%p' -" /\.pdf$/


#Define META tags
MetaNames meta1 meta2 meta3


Whenever I run the command

swish-e -c swish.config

(swish.config is the name of my configuration file) to build the index, I get the following output:

Note: this is the command that I run: swish-e -c swish-e.conf
Indexing Data Source: "File-System"
Indexing "."

Warning: Substituted 2209 embedded null character(s) in file './MEX000030001.2012.2.00.L.06.30.PDF' with a newline


Warning: Substituted 2068 embedded null character(s) in file './MEX000030001.2012.1.00.L.03.31.PDF' with a newline


Warning: Substituted 2080 embedded null character(s) in file './MEX000030001.2012.3.00.L.09.30.PDF' with a newline


Warning: Substituted 225100 embedded null character(s) in file './MEX000030435.2012.A.00.L.12.31.PDF' with a newline

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 401,598 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
401,598 unique words indexed.
4 properties sorted.
6 files indexed.  9,251,632 total bytes.  2,106,651 total words.
Elapsed time: 00:00:10 CPU time: 00:00:10
Indexing done!


Once indexing finishes, I run the following command to see whether I can search for a PDF or for a word within a PDF document:

[root(at)not-real.zeus tests]# swish-e -w valores
# SWISH format: 2.4.7
# Search words: valores
# Removed stopwords:
err: no results
.

The above result is incorrect, because the word valores does appear in one of the PDFs.

Also, here is another example of the output when I try to search for a PDF by name:

[root(at)not-real.zeus tests]# swish-e -w MEX000030001.2012.1.00.L.03.31.PDF
# SWISH format: 2.4.7
# Search words: MEX000030001.2012.1.00.L.03.31.PDF
# Removed stopwords:
err: no results
.


The PDF docs that I am trying to search for are in the same directory as my swish.config file, which is the directory I am indexing. Does that make a difference?
What would be the best way to fix this so that I can correctly index PDF docs and also search for and find PDFs using Swish-e?

Do I need to use the FileFilter directive? Bear in mind that this is my first time using Swish-e.
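
One thing that may be worth checking, offered only as an untested sketch: the file names in the warnings above end in uppercase .PDF, while the pattern /\.pdf$/ in my configuration is lowercase. If that pattern is matched case-sensitively, the pdftotext filter would never run and the raw PDF bytes would be indexed instead, which might also explain the embedded null character warnings. Assuming that is the cause, and assuming pdftotext is installed and on the PATH, the filter line could be given a case-insensitive pattern:

```
# Assumption: the filter was being skipped because the files end in .PDF (uppercase).
# The trailing "i" after the closing slash makes the pattern case-insensitive.
FileFilterMatch pdftotext "'%p' -" /\.pdf$/i
```

If the filter is then applied, I would expect the null character warnings to go away when the index is rebuilt. If they remain, the cause is probably something else.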

If anyone would be so kind as to assist me with my issue, I would greatly appreciate it. 

Mike

_______________________________________________
Users mailing list
Users(at)not-real.lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Thu Mar 20 2014 - 19:29:41 GMT