I finally found time to get back to trying swish-e in a Windows
environment. Hopefully the information below will trigger an idea or a
path I can pursue to complete this project. (If you need any more
information regarding setup and usage, please let me know.)
OS: Windows 2000, fully patched.
Swish-e: v 2.4.2
pdftotext: v 3.00
pdfinfo: v 3.00
perl: v5.6.1 ActiveState build 638
When I run "swish-filter-test", pdftotext is found and loaded successfully.
I've tried two different approaches to this issue:
OPTION ONE
index_port.bat
"C:\Program Files\SWISH-E\swish-e.exe"
-S prog -v 3 -c
"C:\Program Files\SWISH-E\indexes\Port\port.config"
-f "C:\Program Files\SWISH-E\indexes\Port\index.swish-e"
port.config
DefaultContents HTML2
IndexContents HTML* .asp .htm .html .shtml .pdf
StoreDescription HTML* <body> 320
IndexDir perl.exe
TmpDir "C:\\Progra~1\\SWISH-E\\indexes\\Tmp\\"
SwishProgParameters
"C:\\Progra~1\\SWISH-E\\lib\\swish-e\\spider.pl"
"C:\\Progra~1\\SWISH-E\\indexes\\Port\\port.spider"
ReplaceRules remove http://local.dev.port.com
port.spider
@servers = (
    {
        base_url  => 'http://local.dev.port.com',
        email     => 'tony@2plus2.com',
        delay_sec => 1,
        debug     => DEBUG_URL | DEBUG_INFO | DEBUG_FAILED |
                     DEBUG_SKIPPED,
        # other spider settings described below
    },
);
The output for this option is a bit strange. While it attempts to index
the site, it fails to record a word count for any page after the 16th
link. That link is a PDF, and the spider appears to lock up while
analyzing it: it goes on to fetch all the other links it finds, but fails
to index those pages.
>> +Fetched 1 Cnt: 15
http://local.dev.port.com/newsroom/pressrel/pressrel_163.asp
200 OK text/html 23871
parent:http://local.dev.port.com
! Found 0 links in
http://local.dev.port.com/newsroom/pressrel/pressrel_163.asp
(585 words)
http://local.dev.port.com/newsroom/pressrel/pressrel_163.asp
- Using HTML2 parser - sleeping 1 seconds
>> +Fetched 1 Cnt: 16 http://local.dev.port.com/pdf/publ_notice2.pdf
200 OK application/pdf 44239
parent:http://local.dev.port.com
(1357 words)
http://local.dev.port.com/pdf/publ_notice2.pdf
- Using HTML2 parser - (59 words) sleeping 1 seconds
>> +Fetched 1 Cnt: 17
http://local.dev.port.com/newsroom/pressrel/pressrel_162.asp
200 OK text/html 20249
parent:http://local.dev.port.com
! Found 0 links in
http://local.dev.port.com/newsroom/pressrel/pressrel_162.asp
sleeping 1 seconds
>> +Fetched 1 Cnt: 18 http://local.dev.port.com/portnyou/offi_seni.asp
200 OK text/html 30543
parent:http://local.dev.port.com
! Found 1 links in http://local.dev.port.com/portnyou/offi_seni.asp
sleeping 1 seconds
As you can see, from #17 on there are no word counts. At the end of #16
there is this weirdness: "- Using HTML2 parser - (59 words) sleeping 1
seconds". It seems like commands are stepping on each other. Then from #17
on, it fails to index the fetched pages. The summary of work looks like this:
Summary for: http://local.dev.port.com
Connection: Close: 991 (0.9/sec)
Duplicates: 21,123 (19.5/sec)
Off-site links: 3,501 (3.2/sec)
Skipped: 21 (0.0/sec)
Total Bytes: 138,730,121 (128098.0/sec)
Total Docs: 968 (0.9/sec)
Unique URLs: 992 (0.9/sec)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 1,459 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
1,459 unique words indexed.
5 properties sorted.
16 files indexed. 435,308 total bytes. 5,753 total words.
Elapsed time: 00:18:06 CPU time: 00:18:06
Indexing done!
Even though it found 992 pages, it only indexed 16. I can't get swish to
throw any other errors that appear to be relevant to the issue. Any
suggestions where to start looking are appreciated.
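To pin down exactly where the word counts stop, the -v 3 output can be scanned mechanically. The following is only a rough sketch in Python: the line patterns are assumptions taken from the excerpt above (">> +Fetched ... Cnt: N" and "(N words)"), and the function names parse_spider_log and unindexed are made up for illustration.

```python
import re

# Patterns guessed from the log excerpt in this post; adjust if the real
# output differs.
FETCHED = re.compile(r"^>>\s*\+Fetched\s+\d+\s+Cnt:\s*(\d+)\s*(\S+)?")
WORDS = re.compile(r"\((\d+) words\)")
URL = re.compile(r"https?://\S+")

SAMPLE = """\
>> +Fetched 1 Cnt: 15
http://local.dev.port.com/newsroom/pressrel/pressrel_163.asp
200 OK text/html 23871
(585 words)
- Using HTML2 parser - sleeping 1 seconds
>> +Fetched 1 Cnt: 16 http://local.dev.port.com/pdf/publ_notice2.pdf
200 OK application/pdf 44239
(1357 words)
- Using HTML2 parser - (59 words) sleeping 1 seconds
>> +Fetched 1 Cnt: 17
http://local.dev.port.com/newsroom/pressrel/pressrel_162.asp
200 OK text/html 20249
sleeping 1 seconds
>> +Fetched 1 Cnt: 18 http://local.dev.port.com/portnyou/offi_seni.asp
200 OK text/html 30543
sleeping 1 seconds
"""


def parse_spider_log(lines):
    """Group log lines into one record per fetched URL."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        match = FETCHED.match(line)
        if match:
            # A new fetch starts; the URL may be on this line or the next.
            current = {"cnt": int(match.group(1)),
                       "url": match.group(2),
                       "words": []}
            records.append(current)
            continue
        if current is None:
            continue
        if current["url"] is None:
            url = URL.match(line)
            if url:
                current["url"] = url.group(0)
                continue
        # Collect every "(N words)" report seen before the next fetch.
        current["words"].extend(int(n) for n in WORDS.findall(line))
    return records


def unindexed(records):
    """Return the records that never reported a word count."""
    return [r for r in records if not r["words"]]


if __name__ == "__main__":
    for record in unindexed(parse_spider_log(SAMPLE.splitlines())):
        print(record["cnt"], record["url"])
```

Run against the full log, this lists every fetched URL that never got a word count, which makes it easier to confirm whether the PDF at #16 really is the last document indexed.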
OPTION TWO
index_port.bat
"C:\Program Files\SWISH-E\swish-e.exe"
-S http -v 3 -c
"C:\Program Files\SWISH-E\indexes\Port\port.config"
-f "C:\Program Files\SWISH-E\indexes\Port\index.swish-e"
port.config
IndexDir http://local.dev.port.com
TmpDir "C:\\Progra~1\\SWISH-E\\indexes\\Tmp\\"
IndexOnly .asp .htm .html .shtml .pdf
FileFilter .pdf pdftotext "'%p' -"
IndexContents HTML* .asp .htm .html .shtml *.pdf
StoreDescription HTML* <body> 320
This is where it gets stranger. Swish will index all .asp/.htm files fine,
but fails to open the temp files for any PDFs it encounters.
sleeping 5 seconds before fetching
http://local.dev.port.com/pdf/database_ucp.pdf
Now fetching [http://local.dev.port.com/pdf/database_ucp.pdf]...Status:
200. application/pdf
- Using DEFAULT (HTML2) parser
- Error: Couldn't open file
''C:\Progra~1\SWISH-E\indexes\Tmp\swishspider@1108.contents''
(no words indexed)
retrieving http://local.dev.port.com/pdf/real_ccr.pdf (5)...
sleeping 5 seconds before fetching http://local.dev.port.com/pdf/real_ccr.pdf
Now fetching [http://local.dev.port.com/pdf/real_ccr.pdf]...Status: 200.
application/pdf
- Using DEFAULT (HTML2) parser
- Error: Couldn't open file
''C:\Progra~1\SWISH-E\indexes\Tmp\swishspider@1108.contents''
(no words indexed)
The file it is unable to open always has the same name. I've been able to
pause the processing and see that swishspider@1108.contents does exist in
the C:\Progra~1\SWISH-E\indexes\Tmp\ directory, and its permissions are
not out of whack. I'm guessing that pdftotext is not feeding the processed
PDF back to the spider correctly, and therefore it either sets up a
zero-length file or is actually saving to "somewhere else".
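One way to test that guess is to check the temp file while the run is paused. The doubled quotes in the error message might also mean the file name reached pdftotext with the single quotes from the FileFilter line still attached. That is an unconfirmed assumption, not something the logs prove. The Python sketch below uses made-up helper names (describe_temp_file, compare_quoted) and reports whether the bare name and the single-quoted name each exist, and how large they are.

```python
import os


def describe_temp_file(path):
    """Return (exists, size_in_bytes); size is None if the file is missing."""
    if not os.path.isfile(path):
        return (False, None)
    return (True, os.path.getsize(path))


def compare_quoted(path):
    """Look up the bare name and the same name wrapped in literal quotes."""
    return {
        "bare": describe_temp_file(path),
        # Assumption under test: the program was handed 'path', quotes
        # included, rather than path itself.
        "single_quoted": describe_temp_file("'" + path + "'"),
    }


if __name__ == "__main__":
    # Path copied from the error message above.
    temp = r"C:\Progra~1\SWISH-E\indexes\Tmp\swishspider@1108.contents"
    for label, (exists, size) in compare_quoted(temp).items():
        print(label, "exists" if exists else "missing", size)
```

If the bare name exists with a sensible size while the quoted name is missing, the quoting in the FileFilter line would be the first thing to look at. A zero-length bare file would point back at the fetch step instead.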
The summary of work looks like this:
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 12,770 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
12,770 unique words indexed.
5 properties sorted.
980 files indexed. 324,873,230 total bytes. 278,746 total words.
Elapsed time: 00:18:32 CPU time: 00:18:32
Indexing done!
This time it successfully indexes the "other" pages, but of course the PDFs
are not indexed.
Common to both examples, the StoreDescription directive does not appear to
be acted on. No descriptions are available via <swishdescription>; I get a
date/time string (e.g., " Local Time : 1:12:01 PM PT") instead. Swish also
does not appear to honor the IndexOnly / IndexContents directives: it
attempts to index the PDF anyway. It grabs the file, then errors on
"invalid mime type". Is this correct behaviour? I would think that swish
would skip the file because the .pdf extension is not in the approved list.
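To make the expectation concrete, this is the kind of suffix check I assumed would run before a file is fetched. It is only a Python sketch of the expected behaviour, not how swish-e actually implements IndexOnly, and has_allowed_suffix is a hypothetical name.

```python
import posixpath
from urllib.parse import urlparse


def has_allowed_suffix(url, allowed_suffixes):
    """Return True if the URL path ends in one of the allowed suffixes."""
    path = urlparse(url).path
    # splitext yields "" for paths without an extension, such as the site
    # root, so those are reported as not allowed by this simple check.
    extension = posixpath.splitext(path)[1].lower()
    return extension in {suffix.lower() for suffix in allowed_suffixes}


if __name__ == "__main__":
    allowed = [".asp", ".htm", ".html", ".shtml"]
    for url in ("http://local.dev.port.com/portnyou/offi_seni.asp",
                "http://local.dev.port.com/pdf/publ_notice2.pdf"):
        print(url, has_allowed_suffix(url, allowed))
```

With .pdf left out of the list, a check like this would drop the PDF before any download or MIME type test takes place.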
If anyone wants to hit/index an example of the site in question:
http://test.portofoakland.com
This URL is a replica of the live site and should respond exactly the same.
--
Anthony Baratta
Received on Wed Sep 22 15:22:46 2004