On 30/09/10 17:19, Juan Salvador Castejón wrote:
> Hi all,
> We are thinking about using swish-e but I am not sure if it's the
> best option in our case. The truth is that I used swish-e some
> years ago, when it wasn't possible to index and search
> simultaneously or to index incrementally, and I had several problems
> that led me to reject it.
I wouldn't recommend searching and indexing simultaneously anyway :-)
> On the web site I have seen that most of the issues I found at that
> time are now solved, but I would like to describe my case in
> order to get your opinion on whether swish-e is the right choice.
> There are about 1 million documents (PDFs, Word, Excel,... no HTML)
> stored on a huge shared disk. Each user has his own directory where
> he stores his own files and some additional directories which can be
> shared among several users (departments, workgroups, etc.).
> We would like users to be able to search only those documents they
> have access to.
We do this for our campus search (about one third of the number of
files, though) by indexing each area separately (schools, depts, admin,
etc) and then merging selected ones to create indexes for each group,
and finally merging those to create a master index. Last night's run
started at 5.45am and finished at 8.57am (so I'm going to make it start
a bit earlier :-) The longest bit is the final master merge at the end:
that took 31 mins for the 400+ separate indexes it created. Indexing
includes HTML, XML, PDF, and DOC files.
> The time needed to index the whole domain should
> be less than 24h if possible.
Unless you have a lot of large non-text documents (eg PDF and DOC), that
should be OK (ie if we do 300,000 mixed docs in 3hrs on an old server, I
would expect 1M docs in under 9hrs on a faster one).
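
As a rough sanity check on that estimate, here is a small Python sketch of the arithmetic. It assumes indexing time scales linearly with document count, which is a simplification (merge steps and large PDF/DOC files will not scale so neatly). The function name and the `speedup` factor for a newer machine are illustrative assumptions, not measurements.

```python
def estimated_hours(n_docs, ref_docs=300_000, ref_hours=3.0, speedup=1.0):
    """Linearly extrapolate indexing time from a reference run.

    ref_docs / ref_hours is the observed run (300,000 docs in 3 hrs);
    speedup is a hypothetical factor for a faster machine (1.0 = same box).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    docs_per_hour = ref_docs / ref_hours * speedup
    return n_docs / docs_per_hour


if __name__ == "__main__":
    for n in (1_000_000, 2_000_000):
        print(f"{n:>9,} docs: about {estimated_hours(n):.1f} hrs at the old server's rate")
```

At the old server's rate this gives about 10 hrs for 1M docs and about 20 hrs for 2M, so either figure fits a 24hr window as long as scaling stays roughly linear; any real speedup from newer hardware only improves on that.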
> I know it is not much information but given this quantity of
> documents (2M) and the security restrictions, would you recommend
> swish-e, or should I look for something else?
Hold on, you said 1M earlier. Is it 2M or 1M? 2M sounds like it should
still index in under 24hrs assuming a modern machine with plenty of
memory and fast disk access.
Provided you have a robust way to disaggregate the files by access
rights, you can index each owner's own files, and then merge groups of
them to create dept/group indexes, and then merge those to make the
master index (if needed).
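
The merge hierarchy described above could be planned along these lines. This is only a sketch under stated assumptions: the owner-to-group mapping, the `indexes/` directory, and the file naming are all hypothetical, each owner is assumed to belong to exactly one group, and the per-owner indexes are assumed to exist already. It only builds the `swish-e -M` command lines (merge mode: input indexes first, output index last) and does not run anything.

```python
from collections import defaultdict


def build_merge_plan(owner_group, index_dir="indexes"):
    """Return (group_cmds, master_cmd) as argv lists; nothing is executed.

    owner_group maps an owner name to a single group name (a simplification;
    shared directories would need their own entries).
    """
    members = defaultdict(list)
    for owner, group in sorted(owner_group.items()):
        # Hypothetical naming scheme for the per-owner indexes.
        members[group].append(f"{index_dir}/owner_{owner}.index")

    group_cmds = {}
    for group, inputs in sorted(members.items()):
        out = f"{index_dir}/group_{group}.index"
        # swish-e -M index1 index2 ... output_index
        group_cmds[group] = ["swish-e", "-M", *inputs, out]

    # The master index is merged from the group indexes (last argv item of each).
    group_outputs = [cmd[-1] for cmd in group_cmds.values()]
    master_cmd = ["swish-e", "-M", *group_outputs, f"{index_dir}/master.index"]
    return group_cmds, master_cmd


if __name__ == "__main__":
    owners = {"alice": "sales", "bob": "sales", "carol": "it", "dave": "it"}
    groups, master = build_merge_plan(owners)
    for cmd in groups.values():
        print(" ".join(cmd))
    print(" ".join(master))
```

Search access then comes down to which index a user's query is pointed at: their own, their group's, or the master. A group with a single owner has nothing to merge, so a real script would want to handle that case separately.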
Users mailing list
Received on Fri Oct 1 06:16:02 2010