
Re: Indexing files without an extension

From: dennis lastor <dennis.lastor(at)not-real.gmail.com>
Date: Tue Feb 07 2006 - 05:32:06 GMT
When I run the spider I get the following error(s):

Summary for: http://Internal_Site/WebHome
Connection: Close: 1  (1.0/sec)
      Unique URLs: 1  (1.0/sec)
       robots.txt: 1  (1.0/sec)
...
(summary repeated for all internal sites)
...

Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!

However, that page has several lines of text, links, etc., but the
indexer doesn't appear to pick up anything except pages with DOC or
PDF extensions (which is great, but I would like the wiki pages
indexed as well).
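
One thing I have been meaning to check is what Content-Type the wiki
actually sends for these extensionless pages, since as far as I
understand the spider passes that header along and swish-e picks its
parser from it. A quick standalone check would be something like the
sketch below (nothing swish-e specific, just plain LWP, which spider.pl
itself uses; the URL is one of the pages from my earlier mail):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Fetch one extensionless wiki page and report what the server
    # says about it.
    my $ua  = LWP::UserAgent->new( agent => 'content-type-check' );
    my $res = $ua->get('http://Internal_Site/Page_With_Text');

    print 'Status:       ', $res->status_line, "\n";
    print 'Content-Type: ', ( $res->header('Content-Type') || '(none)' ), "\n";
    print 'Body length:  ', length( $res->content ), " bytes\n";

If that comes back with a status or Content-Type I don't expect, that
might explain why those pages are being skipped.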

The debugging docs talk a lot about the CGI debug options, but I
cannot find anything on debugging the indexing itself other than
setting the verbosity to '3' in the config file.
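
The only other thing I have tried is running the spider on its own and
capturing what it hands to swish-e, along the lines of the command
below (guessing at the standalone invocation from the "default"
parameters in my config, and piping to a file so I can read through it):

    perl spider.pl default http://Internal_Site/WebHome > spider.out

If the wiki page text shows up in spider.out, the spider side seems
fine and the problem is presumably in how swish-e parses what it
receives; if not, the page is being skipped before it ever reaches the
indexer.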

Any help is appreciated.

Dennis

On 2/6/06, Bill Moseley <moseley@hank.org> wrote:
>
> On Mon, Feb 06, 2006 at 07:23:33PM -0800, dennis lastor wrote:
> > I am trying to index a wiki page that contains links to other wiki
> > pages without extensions.
> >
> > For example one of the pages could be
> > http://internal_site/Page_With_Text
>
> Should be no problem if you are using the spider.
>
>
> > I have read through several of the FAQs and threads but have not
> > been able to find anything on this topic.  I have no trouble
> > indexing PDFs, DOCs, TXT, HTML, etc, and everything works GREAT!  I
> > would just like to index these pages without extensions.
>
> What's the problem?
>
>
> >
> > I am using the "prog" method by running:
> >
> > swish-e -S prog -c swish.conf
> >
> > My swish.conf looks like:
> >
> > # Example for spidering
> > # Use the "spider.pl" program included with Swish-e
> > IndexDir spider.pl
> >
> >
> > #Path to filters
> > FilterDir /tool/bin/
>
> Don't need that.
>
>
> >
> >
> > # Define what sites to index.  Just add to the bottom of this
> >
> > SwishProgParameters default http://Internal_Site/WegPage1 \
> >                             http://Internal_Site/WebPage2 \
> >                             http://Internal_Site/WebPage3
>
>
> >
> > # ? DefaultContents HTML2
> > IndexContents HTML* .htm .html .shtml .pdf .doc .ppt .xls
>
> Should not need that.  The spider tells swish what parser to use.
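
Understood. Since these pages have no extension at all, I had been
wondering whether the DefaultContents directive (a fallback parser for
paths that don't match an IndexContents extension, if I'm reading the
config docs right) is closer to what I want, e.g.

    DefaultContents HTML*

but I will leave both directives out if the spider already reports the
content type.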
>
> > Whenever I run swish-e it correctly indexes all of the PDFs,
> > etc., but not the internal wiki sites (without extensions);
> > it says there are no unique words to index.
>
> What happens if you point the spider at one of those wiki pages?
> Turn on debugging as described in the docs.
>
>
> > I am also not sure if the 'CompressPositions yes' will compress
> > the index files or not.
>
> Ignore that setting for now.
>
> --
> Bill Moseley
> moseley@hank.org
>
> Unsubscribe from or help with the swish-e list:
>    http://swish-e.org/Discussion/
>
> Help with Swish-e:
>    http://swish-e.org/current/docs
>    swish-e@sunsite.berkeley.edu
>
>



Received on Mon Feb 6 21:32:08 2006