Re: 2.4.3 Refuses to Index Virtual Host

From: fh oregon <linux(at)not-real.frankhunt.com>
Date: Sun Apr 10 2005 - 23:29:28 GMT
I could chase all the files in the tree but, as suggested below, I would 
most likely generate duplicates.  There are also links between parts of 
the file system to improve performance and shorten pathnames (and to fix 
hacks I've made over the past six years or so).  As it stands now, I am 
not seeing any duplication (at least none that I have discovered so far), 
and everything is working.
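
For anyone who does index straight from the file system, here is a minimal
sketch of one way to keep links between directories from producing duplicate
entries: resolve every path to its real location and keep only the first
name seen for each file. It is plain Python, not part of swish-e; the
function name unique_files is made up for illustration, and it assumes a
Unix-like system where such links are symlinks.

```python
import os

def unique_files(root):
    """Yield one path per real file under root, following symlinks.

    A file reachable through several links is yielded once, under
    whichever name the walk happens to reach first.
    """
    seen_dirs = set()
    seen_files = set()
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        real_dir = os.path.realpath(dirpath)
        if real_dir in seen_dirs:
            # Already visited through another name: do not descend again.
            # This also protects against symlink loops.
            dirnames[:] = []
            continue
        seen_dirs.add(real_dir)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            real_path = os.path.realpath(path)
            if real_path not in seen_files:
                seen_files.add(real_path)
                yield path
```

The resulting list could then be handed to whatever indexer is in use, so
each document is indexed once no matter how many links point at it.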

Not sure why anyone would want to search my site anyway, but it was an 
interesting exercise and I learned a lot.

-f

Peter Karman wrote:

>caveat: you'd have to deal with the fact that recursion would include the 2nd 
>site's files within the first index, since the filetree includes both sites.
>
>Peter Karman scribbled on 4/10/05 6:13 PM:
>
>>if you have access to the filesystem where the files are stored, is there some 
>>advantage to using the spider at all?
>>
>>otherwise you could do:
>>
>>   swish-e -i /path/to/site1 -c config1
>>
>>and
>>
>>   swish-e -i /path/to/site1/site2 -c config2
>>
>>which would be both faster and create two different indexes for searching.
>>
>>fh oregon scribbled on 4/10/05 6:02 PM:
>>
>>>My goal here is to have the main site and the virtual site(s) indexed 
>>>and searchable.  After mulling this over, I came up with a way to fake 
>>>out the indexer.  As a test, I placed a (hidden) link on the main page 
>>>pointing directly to the /SFCC directory and !!!  It looks like it is 
>>>all working now.  I need to do more testing.
>>>
>>>-fh
>>>
>>>Bill Moseley wrote:
>>>
>>>>On Sat, Apr 09, 2005 at 11:10:12AM -0700, fh oregon wrote:
>>>>
>>>>>The root of the site (frankhunt.com) is /web/httpd/htdocs.  Within that 
>>>>>directory is the main index.html as well as a few other html documents 
>>>>>and directories for other parts of the site.  One of those directories is 
>>>>>/web/httpd/htdocs/SFCC, which is the root of the 
>>>>>siliconforestcorvetteclub.com domain.
>>>>>
>>>>Again, the spider has NO knowledge of your directory structure.  If
>>>>you spider frankhunt.com and no pages on frankhunt.com link into SFCC
>>>>under that host name, then it won't spider them.
>>>>
>>>>Try it yourself.  Go to frankhunt.com and only click on links that
>>>>include frankhunt.com as the host name.  That's all that will be
>>>>indexed.  That link to SFCC does not use the same host name.
>>>>
>>>>Look, you also link to http://www.fs.fed.us/gpnf/volcanocams/msh/ --
>>>>do you expect that to get indexed?  And everything it links to, also?
>>>>
>>>>Sounds like you are not clear on how web servers map directories.

-- 
Frank Hunt
Confused Linux Admin
Received on Sun Apr 10 16:29:29 2005
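
Bill's point in the thread above, that the spider follows only links whose
host name matches the site being spidered, can be illustrated with a rough
sketch. This is plain Python, not the swish-e spider's actual code; the
function name same_host_links and the about.html link are hypothetical,
chosen only for the example.

```python
from urllib.parse import urljoin, urlsplit

def same_host_links(page_url, hrefs):
    """Return the links on a page that stay on the page's own host."""
    host = urlsplit(page_url).hostname
    kept = []
    for href in hrefs:
        # Resolve relative links against the page they appear on.
        absolute = urljoin(page_url, href)
        if urlsplit(absolute).hostname == host:
            kept.append(absolute)
    return kept

links = same_host_links(
    "http://frankhunt.com/index.html",
    [
        "/SFCC/",                                      # same host: followed
        "http://siliconforestcorvetteclub.com/",       # other host: skipped
        "http://www.fs.fed.us/gpnf/volcanocams/msh/",  # other host: skipped
        "about.html",                                  # hypothetical page
    ],
)
print(links)
```

Under this rule, a link to /SFCC/ on the main page is followed because it
resolves to the frankhunt.com host name, which is consistent with the
hidden-link workaround described in the thread, while a link to the club's
own domain is not. Note that this sketch compares host names exactly, so
www.frankhunt.com and frankhunt.com would count as different hosts.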