Ignoring session ids when distinguishing files as being different

From: Stefan Seiz <TalkLists(at)not-real.index-s.de>
Date: Tue Feb 22 2005 - 11:30:49 GMT
Hi,

I am new to Swish-e, coming from HtDig.

While evaluating swish-e, I discovered two show-stoppers for our environment.

1)
Our site is served dynamically, and the app server includes session ids in
URLs, which I cannot turn off.

These session ids change, and thus the swish-e crawler.pl will recognize
pages as being different, although they are in fact the same pages (just
the session id changes).

Using htdig, I could work around this problem by one simple configuration
option:
    url_rewrite_rules: (.*)&pb-id=.* \\1
    (where pb-id=XXXXX is my session id)
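
To illustrate what that htdig rule does, here is a minimal Python sketch of
the same rewrite. The example URLs and the function name strip_session_id
are made up for illustration; this is neither htdig nor swish-e code.

```python
import re

# Same pattern as the htdig rule: keep everything before "&pb-id=" and
# drop the rest of the URL (including anything that follows the session id).
SESSION_ID_RULE = re.compile(r"(.*)&pb-id=.*")

def strip_session_id(url: str) -> str:
    """Return the URL with the trailing pb-id session parameter removed."""
    return SESSION_ID_RULE.sub(r"\1", url)

# Two hypothetical URLs for the same page, differing only in session id.
a = strip_session_id("http://www.example.com/page?doc=42&pb-id=AAAA1111")
b = strip_session_id("http://www.example.com/page?doc=42&pb-id=BBBB2222")
print(a)
print(a == b)
```

After the rewrite both URLs compare equal, which is what lets a crawler
treat them as one page.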

Is there anything similar in swish-e to make it ignore the session id when
it decides whether two files are different?

2) Password protected PDF files.
All our PDFs are protected with the same password, so I can easily pass a
password to the command line options of pdftotext.

So I tried modifying
    /usr/local/lib/swish-e/perl/SWISH/Filters/Pdf2HTML.pm
and tried to add "-opw MyPasswd" to the call to $self->run_pdftotext, but
failed miserably. I tried many different variations of adding the -opw
option to pdftotext.

Can anyone help me out with how I need to add the -opw option to the call to
pdftotext?
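
For reference, the pdftotext call in question can be sketched outside of
swish-e as below. This is only a sketch in Python: the file name
protected.pdf and the helper names are hypothetical, and it does not show
how Pdf2HTML.pm hands options to run_pdftotext.

```python
import subprocess

def pdftotext_command(pdf_path: str, owner_password: str) -> list[str]:
    """Build the argument list for pdftotext with an owner password."""
    # "-opw" takes the owner password as its own argument; the trailing "-"
    # tells pdftotext to write the extracted text to stdout.
    return ["pdftotext", "-opw", owner_password, pdf_path, "-"]

def extract_text(pdf_path: str, owner_password: str) -> str:
    """Run pdftotext and return its output (needs pdftotext on the PATH)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, owner_password),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Show the command that would be run for a hypothetical file.
print(pdftotext_command("protected.pdf", "MyPasswd"))
```

Note that the option and its value are two separate list items here. When a
command is run without a shell, "-opw MyPasswd" passed as a single argument
is not split; whether that is what goes wrong in Pdf2HTML.pm is only a guess.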

    
Thanks!
 
--
Stefan Seiz <http://www.StefanSeiz.com>
Spamto: <bin@imd.net>
Received on Tue Feb 22 03:30:51 2005