During the last couple weeks of the 2.0 development cycle we pushed some really helpful search improvements (some of them bug fixes) into Clearspace. There are a number of posts scattered around our intranet (which is called Brewspace) where the improvements and bug fixes are discussed, but I don't believe they made it into the documentation (the changelog lists a number of improvements, but doesn't describe what they are). Hence this blog post.
First, in 2.0, the default search operator was changed to 'AND' from 'OR', the end result being that if you did a search like this:
Clearspace would look for all the blog posts, discussions and documents that contained the term "clearspace" AND contained the term "openid". Way back in Clearspace 1.0, the thought was that we should deviate from what Google does (they AND the terms you input) because we're not searching the entire web; our thinking was that most of our installations would only have a couple thousand documents, blog posts and threads, so we never wanted a search for 'clearspace openid ldap' to return nothing if there was a document that discussed two of the three. The reality was that with 'OR' as the operator, a search query in Clearspace almost always returned more than 500 results (the maximum number of results we would return): in fact, the more words you used in your query, the more likely you'd end up with a large number of results, which in theory is great (we found a bunch of stuff for you!) but in practice doesn't make for a great user experience (thirty-four pages of search results? come on!). One of the articles (discussed below) had this to say about lots of search results:
Users sometimes measure the success of a query primarily by the number of results it returns. If they feel the number is too large, they add more terms in an effort to bring back a more manageable set.
So not only did "OR" produce more results per query, but if you tried to refine your query by adding a term, the number of results would actually grow, not shrink, which is the opposite of what you'd expect.
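To see why the default matters, here's a toy sketch (made-up postings lists, not Clearspace code) of how adding terms affects each operator:

```python
# Toy inverted index: which docs contain each term (hypothetical data)
postings = {
    "clearspace": {1, 2, 3, 5},
    "openid": {2, 5, 8},
    "ldap": {4, 5},
}

# OR takes the union, which grows as you add terms;
# AND takes the intersection, which shrinks
or_two = postings["clearspace"] | postings["openid"]    # 5 docs
or_three = or_two | postings["ldap"]                    # 6 docs
and_two = postings["clearspace"] & postings["openid"]   # 2 docs
and_three = and_two & postings["ldap"]                  # 1 doc
```

With 'OR', every extra term can only add results; with 'AND', every extra term can only narrow them, which is what users refining a query actually expect.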
The funny thing about changing this default is that when we turned it on for Brewspace (our intranet; no other changes had been made at that point), a number of people noticed right away and were amazed at the 'improvement'. It's really crazy how something as simple as "AND" versus "OR" could make such a big difference in user experience.
Before I move on, here are a couple interesting articles I found that talk about search as it relates to user experience:
Greg Linden, an ex-Amazon guy, wrote a great blog post a couple months ago that summarized a talk Marissa Mayer gave about Google's results page and the number of results per page, and added some notes from his own experience at Amazon. The bottom line from the presentation? Speed kills. The faster we can return search results, the happier Clearspace users will be (although the comments on that post tell a potentially different story; don't miss 'em).
A BBC News article from 2006 had this to say about search results:
At most, people will go through three pages of results before giving up, found the survey by Jupiter Research and marketing firm iProspect. It also found that a third of users linked companies in the first page of results with top brands. It also found 62% of those surveyed clicked on a result on the first page, up from 48% in 2002. Some 90% of consumers clicked on a link in these pages, up from 81% in 2002. And 41% of consumers changed engines or their search term if they did not find what they were searching for on the first page.
Takeaway? Relevant results are more important than many results.
The guys and gals at Boxes and Arrows have some really great articles on search patterns:
The essential problem of search (too many irrelevant results) has not gone away.
More and more, our ongoing research is telling us that Search has to be perfect. Users expect it to "just work" the first time, every time.
One thing that was easy to add, and which has come up a couple of times, was search by author. I'm happy to report that in 2.1 we added the ability to search for content authored by a specific user. So just like you can click 'more options' on the search results page today and choose what types of content you want to search, you'll be able to select a user whose content you want to find and filter the results using that selection. Side note: that functionality has always been in our API, and if you're a URL hacker like I am, you can perform Clearspace searches using a pretty URL like this:
or if you want to search for any content written by the user 'aaron' that contains the word 'openID', you'd use this URL:
Another thing that I believe has worked in the past but that we haven't talked about is a simple syntax for search. Much like the operators you can use in Google (e.g., 'site:jivesoftware.com lucene' will find all the references to 'lucene' on the domain 'jivesoftware.com'), we now support the following operators: 'subject:', 'body:', 'tags:' and 'attachmentstext:'. While I admit they're not the most user-friendly things to type, they do give advanced users a little more flexibility. For example, you can now ignore tags by doing a search like this: 'subject:lucene OR body:lucene'. The search syntax operators are scoped to be documented in the search tips that sit right alongside the search box. Again, this is for 2.1.
Those were the improvements. Now the bug fixes (which just so happen to also really improve your searching experience).
Search stemming doesn't seem to be working (CS-3645): Not sure how long this wasn't working, but the existence of this bug meant that if you put the word "error" in a document and then did a search for "errors", our search engine wouldn't find your document. Read more about stemming if you're curious about that sort of thing. If you're seeing this bug, make sure that a) you upgrade to at least Clearspace 2.x and b) you're using a stemming indexer (the default analyzer does not stem). You can change the indexer by going to the admin console --> system --> settings --> search --> search settings tab --> indexer type.
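To see what the bug cost you, here's a toy illustration of exact matching versus stemmed matching (a deliberately naive suffix-stripper; Lucene's stemming analyzers implement the full Porter algorithm):

```python
def naive_stem(word):
    # Toy stemmer: strips a few common English suffixes. Illustration only;
    # nothing like a real Porter stemmer.
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

doc_terms = ["an", "error", "occurred"]
query = "errors"

exact_hit = query in doc_terms  # no stemming: "errors" != "error", miss
stemmed_hit = naive_stem(query) in {naive_stem(t) for t in doc_terms}  # hit
```

With stemming, both the indexed terms and the query term are reduced to a common root at index and query time, so "errors" finds "error".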
Group search results by thread setting in admin console doesn't change search behavior (CS-3656): I'm not sure how long this feature has been around, but there is a search setting in the admin console that lets you group all the messages in a thread into a single result on the search results page, so that the messages in one thread don't overwhelm the results (since messages share the subject and tags of their thread, that actually happened quite a bit). This was fixed in 2.0; I highly recommend turning it on in your instance if you haven't already.
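The effect of that setting can be sketched like this (hypothetical hit tuples, not Clearspace's actual result objects): keep only the best-scoring message from each thread.

```python
# Hypothetical search hits as (score, thread_id, subject) tuples
hits = [
    (0.9, "t1", "Re: bananas"),
    (0.8, "t1", "Re: bananas"),
    (0.7, "t2", "banana bread?"),
    (0.6, "t1", "Re: bananas"),
]

# Walk the hits best-first; the first hit seen per thread is its best one
grouped = {}
for score, thread_id, subject in sorted(hits, reverse=True):
    grouped.setdefault(thread_id, (score, subject))

results = [(tid, score, subject) for tid, (score, subject) in grouped.items()]
```

Four raw hits collapse into two results, one per thread, instead of thread 't1' taking three slots on the results page.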
Search updates to better balance queries across content types (CS-3638): Some improvements were made toward this issue in 2.0.0, but it's 100% fixed in 2.0.3 and 2.1. There were two really big but really hard-to-see problems with the way we were executing our search queries. First, some quick background on how a search query against all content types is performed in Clearspace. We have a single Lucene index for all the content in Clearspace (there is a separate index for user data, but that's a different story), so when a search for 'bananas' is executed, we did something like this (don't read too much into the language I'm using; I'm just trying to illustrate how it works at a 30,000-foot level):
1. Get blog posts that match the query: find all the blog posts where the subject matches 'bananas' OR the body matches 'bananas' OR the tags match 'bananas' OR the attachments match 'bananas' OR the blogID matches 'bananas'.
2. Get discussions that match the query: find all the messages where the subject matches 'bananas' OR the body matches 'bananas' OR the tags match 'bananas' OR the attachments match 'bananas' OR the threadID matches 'bananas'.
3. Get documents that match the query: find all the documents where the subject matches 'bananas' OR the body matches 'bananas' OR the summary matches 'bananas' OR the fieldsText matches 'bananas' OR the tags match 'bananas' OR the attachments match 'bananas' OR the documentID matches 'bananas'.
4. Merge the results from steps 1-3, using the relevance score of each item in the result set as the comparator.
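The steps above can be sketched as follows (made-up scores and titles; the three per-type queries are reduced to precomputed lists):

```python
# Results of the three per-type queries for 'bananas' (hypothetical data):
# each hit is a (relevance_score, title) pair
blog_hits = [(0.91, "Bananas 101"), (0.40, "Fruit news")]
thread_hits = [(0.85, "Re: bananas"), (0.62, "Banana bread?")]
doc_hits = [(0.77, "Banana import policy")]

# Final step: merge the per-type result sets, ordered by relevance score
merged = sorted(blog_hits + thread_hits + doc_hits,
                key=lambda hit: hit[0], reverse=True)
```

The merge only produces a sensible ordering if the scores coming out of the three queries are actually comparable, which is exactly the assumption that turned out to be wrong.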
The assumption we made when writing this code was that the scores Lucene returns for items of all content types would be relatively similar. More concretely: if I had a document and a blog post which for some reason had identical content, I'd expect both to have the exact same Lucene relevance score if they came up in the results of a search. That assumption turned out to be wrong, not once but twice.
First, as you can see from glancing at the sample queries above, we searched a different number of fields per content type. Who cares, right? Why would the number of fields you search on influence anything? Turns out that Lucene cares: the way scoring works in Lucene, if you search on ten fields and only get a hit on two of them in document 'X', the resulting relevance score will be less than for blog post 'Y' where you search on five fields and get a hit on two of them. It makes perfect sense when you think about it: it's just like the tests you had in school. Getting 4 out of 5 on a test works out to 80%, about a B. If you got 4 out of 10 on a test, that's 40%: you failed. You probably called them grades... maybe sometimes even a score, which just so happens to be exactly how Lucene refers to the relevance that a particular document has to a given query (if you're curious about how Lucene does scoring / relevance, you should check out the JavaDoc for the Similarity class and also read this document on scoring). Anyway, this behavior was fixed in 2.0: now when we execute a search on mixed content, we search the exact same fields for each content type: subject, body, tags, attachmentsText.
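The mechanism here is the coord() factor in Lucene's classic Similarity: the fraction of query clauses that matched. A minimal sketch, assuming the default overlap / maxOverlap formula:

```python
def coord(overlap, max_overlap):
    # Lucene's classic coord() factor: the fraction of query clauses
    # that matched, used to reward fuller matches
    return overlap / max_overlap

# Document 'X': 2 hits out of 10 searched fields
# Blog post 'Y': 2 hits out of 5 searched fields
x_factor = coord(2, 10)  # 0.2
y_factor = coord(2, 5)   # 0.4 -- Y scores higher, all else being equal
```

Searching the same set of fields for every content type makes maxOverlap identical across types, so this factor no longer skews the merged results.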
The second assumption that turned out to be wrong was just as hard to see. I illustrated above how we search in Clearspace: we run a search for each content type and then merge the results of those searches into a single result set. To query for just blog posts that match the word 'token', we run a query that looks like this in Lucene:
+objectType:38 +(subject:token body:token attachmentsText:token tags:token)
It kind of looks like a SQL subselect: get all the items where one of subject, body, tags or attachmentsText matches and then, from those results, only return the ones where the objectType is 38 (which is the int that JiveConstants.BLOGPOST refers to). The thing that killed us here was the outer statement:
a) when Lucene executes a query, it computes a query weight and field weight for each statement in your query and multiplies those two values together to get the total weight for that statement and
b) the query weight for a clause is basically a measure of how rare that clause's term is in the index: roughly, the total number of items divided by the number of items that match the term (in this case, the number of items whose 'objectType' is '38'), which means that
c) content objects that you have less of (in our case: blog posts) will tend to have a much higher score than content objects that you have a lot of (in our case: documents). Again, this makes sense: terms that appear in the index a relatively small number of times are in some sense rare, so they should get a relatively higher weight. Regardless, it turns out there's a really easy fix for this problem as well: you can boost specific fields in your query like this:
and you can effectively neuter a field by boosting it to zero:
which means that Lucene will still look for all the items in the index whose subject is 'token', but the weight it would usually assign to the field 'subject' will no longer influence the score that the resulting item receives.
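The rarity effect behind (b) and (c) comes from inverse document frequency. Here's a sketch of the classic Lucene idf formula (DefaultSimilarity), with made-up index counts:

```python
import math

def idf(num_docs, doc_freq):
    # Classic Lucene idf: terms that appear in fewer docs get a higher weight
    return 1 + math.log(num_docs / (doc_freq + 1))

# Hypothetical index with 10,000 items: 400 blog posts, 8,000 documents
blog_weight = idf(10_000, 400)   # ~4.2: the blog post objectType term is rare
doc_weight = idf(10_000, 8_000)  # ~1.2: the document objectType term is common
```

The fix leans on Lucene's boost syntax: appending `^0` to a clause (for example, writing a type restriction as `objectType:38^0`) keeps it as a filter while zeroing out its contribution to the score.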
We're continually looking to improve the search tools in Clearspace. If you're seeing something you don't expect or if there's something cool you'd like us to add, please pipe up in our Support and Feature Discussion spaces here on Jivespace.