Hi All,
I'm still working on this issue since my last post on 3/9/04. I have
made some progress, but I now need to get Swish-e to index the files,
which it is not doing. To recap:
> I want to include data [html] from another site in my
> index. My commission sales of the other sites products go
> thru the other site but data on available products doesn't
> show up in my index.
>
> My current plan is as follows:
>
> Use wget to mirror the section of the other site over to
> mine. This will give a set of files under
> http://www.afana.com/www.othersite.com/afl/
This is done. All the files are .asp files but saved as .asp.html to
make them visible to Swish-e.
> Then run Swish-E against that. Then on display of the index
> I will need to transform the URL's, presumably with
> ReplaceRules?? e.g.:
>
> I will have an URL such as:
> http://www.afana.com/www.othersite.com/afl/video_detail.asp?vid_id=338
> and have to transform it to:
>
> http://www.othersite.com/cgi-bin/at.pl?a=195711&e=/afl/video_detail.asp?vid_id=338
I have these regex rules in place in my swconfig.conf:
ReplaceRules regex !afana.com/www.sportsdelivered.com/(.+).html!sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/$1!
ReplaceRules regex !http: //www.sportsdelivered.com/afl/(.+)!http: //www.sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/$1!
(deliberate space inserted after http: above to avoid e-mail program
converting this to a URL)
And to the extent I can test them they do the job.
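
For anyone who wants to try the rules outside Swish-e, here is a rough sketch in Python. It is only an approximation: Python's `re` module is not the regex engine Swish-e uses, the `$1` back-reference is rewritten as `\1`, the deliberate space after `http:` is removed, and I am assuming each rule replaces only the first match. The `apply_rules` helper is a made-up name for illustration.

```python
import re

# The two ReplaceRules from swconfig.conf, transcribed for Python's re module.
# Assumption: "$1" in Swish-e corresponds to "\1" here, and the rules are
# applied in the order they appear in the config file.
RULES = [
    (r"afana.com/www.sportsdelivered.com/(.+).html",
     r"sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/\1"),
    (r"http://www.sportsdelivered.com/afl/(.+)",
     r"http://www.sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/\1"),
]


def apply_rules(url):
    """Apply each rule once, in order, and return the rewritten URL."""
    for pattern, replacement in RULES:
        url = re.sub(pattern, replacement, url, count=1)
    return url
```

Under those assumptions, `apply_rules("http://www.afana.com/www.sportsdelivered.com/afl/index.asp.html")` gives `http://www.sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/afl/index.asp`. One thing the sketch shows: the second rule, as written, turns `http://www.sportsdelivered.com/afl/index.asp` into `...at.pl?a=195711&e=/index.asp`, without the `/afl/` segment, which may or may not be what is wanted.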
The problem now is that Swish-e does not appear to be indexing the
whole of the necessary directory:
http://www.afana.com/www.othersite.com/afl/
When I do a search to look for files with sportsdelivered.com in the URL
the only thing it finds is the index file:
http://www.sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/afl/index.asp
(which is the correct regex transform of:
http://www.afana.com/www.sportsdelivered.com/afl/index.asp.html )
i.e.: from an actual search using
http://www.afana.com/swish-e/lib/swish-e/swish.cgi it finds only this:
2 Sports Delivered -- rank: 940
Australian Football Video AFL Name a Game HOME MORE VIDEOS CONTACT
Player Profiles Compilations Ansett Cup WEG Grand Finals Seasons
Highlights Team Highlights Double Packs Triple Packs DVD Club Histories
Club Gift Packs Adelaide Brisbane Carlton Collingwood Essendon
Fremantle Geelong Hawthorn Kangaroos Melbourne Port Adelaide
Richmond St.Kilda Sydney W. Bulldogs W. Coast Search Restricted by
category
Last Modified Date:
Document Size: 72087
Document Path:
http://www.sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/afl/index.asp
Apparently, the other 600 files in my directory are skipped. Because
they are extracted from the dynamically generated pages at the other
site, they aren't necessarily linked in a "spiderable" chain from the
index file, but all of them need to be indexed.
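
One idea I have not tried yet is to generate a single page that links to every mirrored file, so the spider has a chain to follow. A minimal sketch in Python, where the function name `build_link_page` and the directory and URL arguments are placeholders of mine and not anything Swish-e provides:

```python
import os
from html import escape
from urllib.parse import quote


def build_link_page(mirror_dir, base_url):
    """Return an HTML page with one link per .html file under mirror_dir."""
    rel_paths = []
    for root, _dirs, files in os.walk(mirror_dir):
        for name in files:
            if name.endswith(".html"):
                full = os.path.join(root, name)
                rel = os.path.relpath(full, mirror_dir)
                rel_paths.append(rel.replace(os.sep, "/"))

    items = []
    for rel in sorted(rel_paths):
        # Percent-encode characters such as "?" that the mirror may have
        # left in file names; "/" is kept as the path separator.
        url = base_url.rstrip("/") + "/" + quote(rel)
        items.append('<li><a href="%s">%s</a></li>'
                     % (escape(url, quote=True), escape(rel)))

    return "<html><body><ul>\n%s\n</ul></body></html>\n" % "\n".join(items)
```

The returned text would be written to a file somewhere the spider already reaches. Note that a file name containing `?` ends up as `%3F` in the link, so the path stored in the index may contain the encoded form and the ReplaceRules might need adjusting to match.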
So, any thoughts on what the best way to go about this is? Do I run
another index job and then merge the indexes or can I do something to
get these included? Here's my index cron job at present:
$HOME/public_html/swish-e/bin/swish-e -S prog -c $HOME/public_html/swish-e/bin/swconfig.conf
and swconfig.conf contains this:
---
IndexDir spider.pl
NoContents .gif .jpg .png .cgi .pl .log .jar .ico .js .class .log .sql .csv .dir .idx .dat
IndexContents HTML* .htm .html .shtm .shtml .css
IndexContents TXT* .txt .text
IndexContents XML* .xml .wml .rdf .rss
DefaultContents HTML
SwishProgParameters /home/afana/public_html/swish-e/lib/swish-e/SwishSpiderConfig.pl http://www.afana.com
IndexReport 1
ParserWarnLevel 1
IndexFile /home/afana/public_html/swish-e/website.index
obeyRobotsNoIndex yes
---
Any ideas?
-Rob de Santos
-Columbus, Ohio USA
Chairman of the Board,
Australian Football Association of North America (AFANA)
ph: 1-888-4AFANA1 (North America) (1-888-423-2621)
ph: 1-614-338-0002 (outside NA)
e-mail: rdesantos(at)not-real.afana.com web: <http://www.afana.com>
Contents of this message may not be posted
to the web or "blogged" without prior permission.
Received on Wed Apr 14 09:16:47 2004