One of the areas of research that has always fascinated me is searching and linguistics, specifically machine comprehension of human questions. The current state of technology in this field is both very advanced (e.g. Google) and yet, at the same time, very limited (machines do not really understand the question; it's all algorithms, patterns and the like). In Clearspace we've updated our search code to take advantage of as much of the available search technology as we had time to incorporate.

 

As is true of all our products, our core search technology is based upon the excellent Lucene search library, which we updated to the latest release to gain some new features and benefits, as well as the usual set of bug fixes. New in Clearspace, however, is a completely redesigned API around Lucene which provides some clear benefits over what we had available previously. I'd like to highlight a few noteworthy features that we've added.

 

Combined search API. This means that you can search blog posts, forum messages and wiki documents (or any combination of those) in a single call to the API. While this may not seem like much, it is in fact quite an improvement over how searching was accomplished in the Integrated Server product (our only other product that had multiple content types to search over). When you execute a general search in the Integrated Server you are in fact executing multiple searches at the same time over two separate Lucene indexes - one for kb content, one for forum content. That approach has consequences for performance and limits the flexibility of how search results can be displayed. The new approach is faster and lets you simply execute a search and display the results regardless of the content type of each result.
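
To make the idea concrete, here is a minimal sketch in plain Java of a single index that stores the content type next to each entry, so one call can cover any combination of types and return one ranked list. This is not the Clearspace API and it does not use Lucene; every class and method name below (ContentType, Entry, CombinedIndex) is hypothetical, and the scoring is a deliberately naive word count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical content types, named after the three kinds of content in the post.
enum ContentType { BLOG_POST, FORUM_MESSAGE, WIKI_DOCUMENT }

// One indexed item: its content type, an id, and its text.
final class Entry {
    final ContentType type;
    final String id;
    final String text;

    Entry(ContentType type, String id, String text) {
        this.type = type;
        this.id = id;
        this.text = text;
    }
}

// A toy single index holding every content type (illustrative only).
final class CombinedIndex {
    private final List<Entry> entries = new ArrayList<>();

    void add(ContentType type, String id, String text) {
        entries.add(new Entry(type, id, text));
    }

    // Naive score: how many of the query words occur in the entry's text.
    private static int score(Entry e, String[] words) {
        Set<String> tokens = new HashSet<>(Arrays.asList(e.text.toLowerCase().split("\\W+")));
        int s = 0;
        for (String w : words) {
            if (tokens.contains(w)) s++;
        }
        return s;
    }

    // One call, any combination of types, one list of ids ordered by score.
    List<String> search(String query, Set<ContentType> types) {
        String[] words = query.toLowerCase().split("\\W+");
        List<Entry> hits = new ArrayList<>();
        for (Entry e : entries) {
            if (types.contains(e.type) && score(e, words) > 0) hits.add(e);
        }
        // Highest score first; List.sort is stable, so ties keep insertion order.
        hits.sort(Comparator.comparingInt((Entry e) -> -score(e, words)));
        List<String> ids = new ArrayList<>();
        for (Entry e : hits) ids.add(e.id);
        return ids;
    }
}
```

Passing EnumSet.allOf(ContentType.class) searches everything, while EnumSet.of(ContentType.FORUM_MESSAGE) narrows the same call to forum content. Because all the hits come back in one list, the caller never has to merge results from separate indexes.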

 

[Lucene|http://jivesoftware.com/blog/wp-content/uploads/2007/01/search.png]

 

Find Similar searches. We've built into the new API the ability to query on any blog post, message or document to find other content in the system that is similar to it. We've taken full advantage of this in many places in the UI to display 'More Like This' type links, which helps to automatically link related content together.
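
For readers curious what "similar" can mean in practice, here is a minimal sketch of one common approach: turn each piece of content into a term-frequency vector and rank the other content by cosine similarity to the source. This is an assumption made for illustration, not a description of how Clearspace does it, and the SimilarFinder class and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy "find similar" helper (illustrative only, not the Clearspace API).
final class SimilarFinder {
    // Content id -> term-frequency vector, kept in insertion order.
    private final Map<String, Map<String, Integer>> docs = new LinkedHashMap<>();

    void add(String id, String text) {
        docs.put(id, termFrequencies(text));
    }

    // Count how often each lower-cased word occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity between two term-frequency vectors (0 when either is empty).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Ids of up to `limit` other items, most similar first.
    // The source itself and items with no shared terms are left out.
    List<String> findSimilar(String sourceId, int limit) {
        Map<String, Integer> source = docs.get(sourceId);
        if (source == null) return new ArrayList<>();
        List<String> ids = new ArrayList<>();
        for (String id : docs.keySet()) {
            if (!id.equals(sourceId) && cosine(source, docs.get(id)) > 0) ids.add(id);
        }
        ids.sort((x, y) -> Double.compare(cosine(source, docs.get(y)), cosine(source, docs.get(x))));
        return ids.subList(0, Math.min(limit, ids.size()));
    }
}
```

A 'More Like This' link would then amount to calling findSimilar with the id of the content being viewed and a small limit. A production system would also weight rare terms more heavily and ignore very common words, which this sketch skips.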

 

 

Pluggability. While we do our best to make the search built into Clearspace as good as we can, we understand that corporations often have existing search implementations that they will want to use. We've therefore adopted two approaches that we feel will cover most requirements in this regard. The first is that searching is exposed as a web service, which allows corporations to easily search Clearspace content from external applications. The second is that the core implementation of search in Clearspace is completely pluggable, so that if you have a Google search appliance it's quite possible (with some coding, of course) to replace the built-in Lucene implementation with one that hooks into your appliance.
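
As a rough illustration of what "pluggable" means here, the sketch below hides the search engine behind an interface so that the rest of the application never knows which implementation is answering. All of the names (SearchProvider, BuiltInSearchProvider, ExternalApplianceProvider, SearchService) are hypothetical and are not the actual Clearspace extension points. The "external" provider is a stub that returns a canned result rather than calling a real appliance.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point: anything that can answer a query.
interface SearchProvider {
    List<String> search(String query);
}

// Stand-in for the built-in implementation: a case-insensitive substring match.
final class BuiltInSearchProvider implements SearchProvider {
    private final List<String> titles;

    BuiltInSearchProvider(List<String> titles) {
        this.titles = titles;
    }

    @Override
    public List<String> search(String query) {
        List<String> hits = new ArrayList<>();
        for (String t : titles) {
            if (t.toLowerCase().contains(query.toLowerCase())) hits.add(t);
        }
        return hits;
    }
}

// Stand-in for an adapter to an external engine. A real adapter would send the
// query over the network; this stub only shows where that code would live.
final class ExternalApplianceProvider implements SearchProvider {
    @Override
    public List<String> search(String query) {
        List<String> hits = new ArrayList<>();
        hits.add("external result for: " + query);
        return hits;
    }
}

// The rest of the application talks only to this class.
final class SearchService {
    private SearchProvider provider;

    SearchService(SearchProvider provider) {
        this.provider = provider;
    }

    // Swapping the engine is a single call; callers of search() are unaffected.
    void setProvider(SearchProvider provider) {
        this.provider = provider;
    }

    List<String> search(String query) {
        return provider.search(query);
    }
}
```

The "some coding" mentioned above corresponds to writing one class like ExternalApplianceProvider; nothing that calls SearchService.search has to change.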

 

Distributed searching. The search implementation in Clearspace has been written so that, in the future, customers will be able to set up one or more separate servers dedicated solely to searching. Or, if they do not want to do that, they'll be able to have search queries executed in the normal cluster by whichever server happens to be the least busy at the moment. While I had only a hand in this work (most of the credit for it must go to Gaston Dombiak, who is probably best known for his work on the Wildfire XMPP server), I think this is perhaps one of the most technically interesting features we've added. Unfortunately, given time constraints, distributed searching will not make it into the initial release of Clearspace - look forward to word of it in future releases.
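
Since this feature has not shipped, the following is only a sketch of the routing idea described above, under my own assumptions: use the dedicated search servers if any are configured, otherwise fall back to the normal cluster, and in either case pick the node with the fewest queries in flight. The SearchNode and SearchDispatcher classes are hypothetical, and "least busy" is measured here by a simple counter rather than real server load.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// A server that can run queries; activeQueries is a stand-in for "how busy".
final class SearchNode {
    final String name;
    final AtomicInteger activeQueries = new AtomicInteger();

    SearchNode(String name) {
        this.name = name;
    }

    // Pretend to run the query, tracking it as in flight while it executes.
    String execute(String query) {
        activeQueries.incrementAndGet();
        try {
            return name + " handled: " + query;
        } finally {
            activeQueries.decrementAndGet();
        }
    }
}

// Routes each query to a node (illustrative only).
final class SearchDispatcher {
    private final List<SearchNode> clusterNodes;
    private final List<SearchNode> dedicatedNodes;

    SearchDispatcher(List<SearchNode> clusterNodes, List<SearchNode> dedicatedNodes) {
        this.clusterNodes = clusterNodes;
        this.dedicatedNodes = dedicatedNodes;
    }

    // Prefer dedicated search servers; otherwise use the normal cluster.
    // Within the chosen pool, pick the node with the fewest active queries.
    SearchNode pickNode() {
        List<SearchNode> pool = dedicatedNodes.isEmpty() ? clusterNodes : dedicatedNodes;
        SearchNode best = null;
        for (SearchNode n : pool) {
            if (best == null || n.activeQueries.get() < best.activeQueries.get()) best = n;
        }
        if (best == null) throw new IllegalStateException("no search nodes configured");
        return best;
    }

    String search(String query) {
        return pickNode().execute(query);
    }
}
```

In a real cluster the load figure would have to be shared between servers and the query sent over the network, which is where most of the actual engineering effort lies.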

 

We have a lot of ideas for future improvements we can make to searching in Clearspace - hopefully I'll be able to find the time to blog about some of those in the future!