(no subject)

From: <adivey1(at)not-real.cox.net>
Date: Fri May 28 2004 - 16:47:49 GMT
*Note* Sorry for all the "undisclosed"s. I'm an intern with a government contracting agency, so I don't know what is allowed to be public.

This first method (-S http) doesn't descend into the directories at all. It only fetches robots.txt and the root page.

- Here is my configuration file. I run it with...
# swish-e -c undisclosed.conf -v 3 -S http

IndexFile		undisclosed.index
IndexName		"Undisclosed"
IndexPointer		http://undisclosed/
IndexAdmin 		webmaster
IndexDir		http://undisclosed/
IndexContents		HTML*	.htm .html
IndexContents		TXT*	.txt .pdf
StoreDescription 	HTML* <body> 20000
StoreDescription 	TXT* <body> 20
IgnoreWords		www http a an the of and or
MetaNames		swishdocpath swishtitle
FileFilter 		.pdf pdftotext "'%p' -"
FileFilter		.html "/bin/cat"   "'%p'"

- My results are as follows...
Parsing config file 'undisclosed.conf'
Indexing Data Source: "HTTP-Crawler"
Indexing "http://undisclosed/"
Now fetching [http://undisclosed/robots.txt]...Status: 404.
retrieving http://undisclosed/ (0)...
sleeping 5 seconds before fetching http://undisclosed/
Now fetching [http://undisclosed/]...Status: 200. text/html
 - Using DEFAULT (HTML2) parser -  (3 words)
 
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 6 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
6 unique words indexed.
5 properties sorted.
1 file indexed.  1,133 total bytes.  9 total words.
Elapsed time: 00:00:08 CPU time: 00:00:00
Indexing done!

- I don't know why that isn't working. Anyway, I switched to the spider.pl method. I didn't edit spider.pl at all, and here is my config file...

IndexFile		undisclosed.index
IndexName		"Undisclosed"
IndexPointer		http://undisclosed/
IndexAdmin 		webmaster
IndexDir		spider.pl
SwishProgParameters 	default http://undisclosed/
IndexContents		HTML*	.htm .html
IndexContents		TXT*	.txt .pdf
StoreDescription 	HTML* <body> 20000
StoreDescription 	TXT* <body> 20
IgnoreWords		www http a an the of and or
MetaNames		swishdocpath swishtitle
FileFilter 		.pdf pdftotext "'%p' -"
FileFilter		.html "/bin/cat"   "'%p'"

- But this way, it doesn't get any PDFs! See the results...
# swish-e -c undisclosed.conf -v 3 -S prog
Parsing config file 'undisclosed.conf'
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
http://undisclosed/no_flash.html - Using HTML2 parser -  (130 words)
(about 15 other html pages...)
http://undisclosed/working/index.html - Using HTML2 parser -  (315 words)
 --- *** THEN THE SYSTEM HANGS HERE FOR ABOUT 2 OR 3 MINUTES! *** ---
Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
http://undisclosed/visions/ESE_Strategy2003.pdf - Using HTML2 parser -  (no words indexed)
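
The pdftotext errors above ("May not be a PDF file") suggest that whatever reached the filter may not have been valid PDF data. One way to narrow that down, as a rough sketch and not something from the original run, is to save a copy of the document and look for the %PDF- signature near the start of the file. Python is assumed here purely for illustration; `looks_like_pdf`, `peek`, and the local file name in the usage note are hypothetical.

```python
def looks_like_pdf(path, window=1024):
    """Return True if the %PDF- signature appears in the first `window` bytes.

    A PDF normally begins with this signature; checking a small window
    rather than only byte 0 is a lenient assumption made for this sketch.
    """
    with open(path, "rb") as fh:
        head = fh.read(window)
    return b"%PDF-" in head


def peek(path, count=80):
    """Return the first `count` bytes so a non-PDF payload can be inspected."""
    with open(path, "rb") as fh:
        return fh.read(count)
```

For example, calling `looks_like_pdf("ESE_Strategy2003.pdf")` on a locally saved copy, and `peek(...)` if it returns False, would show whether the data is really a PDF or something else, such as an HTML page or a truncated download.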

- Why does the PDF end up going through the HTML2 parser with no words indexed? I specified pdftotext as the FileFilter for .pdf and listed .pdf under IndexContents TXT*! So let's tweak the config file, then!!

IndexContents		HTML*	.htm .html pdf
IndexContents		TXT*	.txt
FileFilter 		.pdf pdf2html "'%p' -"
(everything else the same)

- Then I'll run swish-e again and get the following...
# swish-e -c undisclosed.conf -v 3 -S prog
Parsing config file 'undisclosed.conf'
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
http://undisclosed/no_flash.html - Using HTML2 parser -  (130 words)
(about 15 other html pages...)
http://undisclosed/working/index.html - Using HTML2 parser -  (315 words)
 --- *** THEN THE SYSTEM HANGS HERE FOR ABOUT 5 OR 6 MINUTES! *** ---
sh: line 1: pdf2html: command not found
http://undisclosed/visions/ESE_Strategy2003.pdf - Using HTML2 parser -  (no words indexed)

- Pdf2HTML is in SWISH-E's filters directory! 
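
The "command not found" message comes from the shell, which only searches the directories listed in PATH and matches names exactly, including case. A file sitting in a filters directory is therefore not automatically runnable as `pdf2html`. A minimal sketch for checking what the shell would find, assuming Python is available; `find_filter` is a hypothetical helper and the directory in the usage note is a placeholder, not the real install location.

```python
import os
import shutil


def find_filter(name, extra_dirs=()):
    """Return the full path of executable `name`, or None if not found.

    Searches the current PATH first, then any extra directories given.
    The lookup uses the name exactly as passed.
    """
    parts = [os.environ.get("PATH", "")] + list(extra_dirs)
    search_path = os.pathsep.join(p for p in parts if p)
    return shutil.which(name, path=search_path)
```

For example, `find_filter("pdf2html")` compared with `find_filter("pdf2html", ["/path/to/swish-e/filters"])` would show whether the program is missing from PATH, named differently, or not marked executable. The same check applies to `pdftotext`.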

- I'm completely lost. I wish there were some sample configurations... I've been reading the docs all day and still don't know what I'm doing wrong. It can't be permissions, because I'm running as root. Please help.

Thanks
Received on Fri May 28 09:47:50 2004