How does search work?

     

    Introduction

    This document describes how Jive search works, to help you search more effectively. It mainly covers the search available to regular users: spotlight search, main content search, user search, and place search. It also describes the scoring algorithm, which can help you tune your content to improve its standing in search results.

     

    Vocabulary

    A few terms that may not be familiar to you:

    • Field: A single piece of information within the content or user profile you're searching.  For example, in a document, you have the title/subject, the content, and the tags.  For a user, you have first name, last name, expertise, tags, and many more.
    • Place: An area of your site that contains content: spaces/communities, social groups, projects, etc.
    • Spotlight search: The search box included in the header of every page. It pops up a limited number of results from content, users and places.

     

    Basic search algorithm

    We use the Lucene search framework, which we've configured extensively to suit our needs. Here are some key points about how Lucene's scoring algorithm works:

    • The score depends heavily on how many words are in the field that contains the word you're searching for. For example, suppose you write a 20,000-word essay that makes a single reference to the movie "Finding Nemo", and another document in the system (or a status update, blog post, thread, etc.) is only 50 words long and also includes "Finding Nemo". The thinking is that the shorter item is more relevant to a query for "nemo".
    • We do an "AND" search on content (like Google) which means that all terms must be present.  We do an "OR" search on users.
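
    The effect of field length can be sketched with the length normalization used by Lucene's classic TF-IDF scoring, where a field's norm is roughly 1/sqrt(number of terms in the field). This is a simplified illustration, not Jive's actual configuration: the class and method names are made up for the example, and a real score includes other factors (such as how rare the term is across the index).

```java
// Simplified illustration of length normalization (hypothetical class, not Jive code).
class LengthNormSketch {

    // Classic Lucene-style length norm: shorter fields get a larger factor.
    static double lengthNorm(int termsInField) {
        return 1.0 / Math.sqrt(termsInField);
    }

    // Term-frequency factor multiplied by the length norm; all other factors are omitted.
    static double score(int termFrequency, int termsInField) {
        return Math.sqrt(termFrequency) * lengthNorm(termsInField);
    }

    public static void main(String[] args) {
        System.out.println("50-word post:      " + score(1, 50));
        System.out.println("20,000-word essay: " + score(1, 20000));
    }
}
```

    With a single mention of the term in each, the 50-word post gets the larger score because its norm is larger.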

     

    Stemming

    "Stemming" means looking for the root of words - for example, if you type in "looked", it will also search for "look". In 4.5 and before, we don't do stemming by default, but it can be turned on: the administrator can go into the Admin Console and change the search analyzer from English, Non Stemming, to English Stemming. You will need to rebuild the search index when you do this. (Note that rebuilding the search index can take a very long time and can be very resource intensive, so we recommend you do this during off-peak hours to avoid impacting users.)

     

    In 5.0, with the move to Solr, the analyzer must be set statically in an XML file on the file system, which means it cannot be changed dynamically. 5.0 uses the stemming analyzer by default.
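
    As a rough picture of what stemming does, here is a deliberately naive suffix stripper. It is only a sketch: the real English stemming analyzer uses a much more careful algorithm, and the class name and suffix list here are assumptions made for the example.

```java
import java.util.Locale;

// Naive illustration of stemming (hypothetical; not the analyzer Jive ships with).
class NaiveStemmerSketch {
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so short words such as "red" are left alone.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }
}
```

    With this sketch, "looked", "looking" and "looks" all reduce to "look", so a search for any of them would match the others.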

     

    The spotlight search at the top of each page does a slightly different search than the main search page: the spotlight search adds a wildcard (*) to the end of your search term (since it searches as you type and expects that you may not have finished typing yet), but the main search page doesn't. Lucene doesn't apply many of the filters that we use in a normal query to wildcard queries. One of the main differences is that wildcard queries are not stemmed, and no stop words are removed. In addition, compound words aren't split up the way they are in a regular, non-wildcard search. For example, if you do a normal search for "MyImportantDocument.txt" or "my_important_document.txt", your search term will be split up into [my] [important] [document] [txt], and will match content that contains all of those words. In a wildcard search, though, your search term will remain intact, and will only match content that matches the full search term.
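
    The difference between the two kinds of query can be sketched as follows. The splitting rules (break on punctuation and on lower-to-upper case changes) and the prefix comparison are assumptions chosen to reproduce the example above; they are not the actual Lucene filters, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of compound-word splitting versus a wildcard (prefix) match.
class CompoundWordSketch {

    // Normal search: split on non-alphanumeric characters and on lower-to-upper case changes.
    static List<String> splitCompound(String term) {
        String spaced = term.replaceAll("([a-z])([A-Z])", "$1 $2");
        List<String> tokens = new ArrayList<>();
        for (String part : spaced.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Wildcard search: the typed text stays whole and is compared as a prefix.
    static boolean wildcardMatches(String typedSoFar, String indexedTerm) {
        return indexedTerm.toLowerCase(Locale.ROOT)
                .startsWith(typedSoFar.toLowerCase(Locale.ROOT));
    }
}
```

    Here splitCompound("MyImportantDocument.txt") yields [my, important, document, txt], while wildcardMatches("important", "MyImportantDocument.txt") is false because the typed text is not a prefix of the whole term.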

     

    These details also apply to other places in the application where you use type-ahead search and the results are automatically brought up.  For example, on the "Browse" page, when you use the "type to filter by text" box, it uses the same wildcard search that spotlight search does.

     

    @mentioning

    Similar to spotlight search, when you @mention something, Jive takes what you've typed in so far and adds a wildcard to it; like spotlight search, this means that no stemming is done with this search. The main difference from spotlight search is that @mentioning only searches the title (of content or a place) and the username, name and email of a user.

     

    Useful settings

    • Search blog and document comments: by default we do search blog and document comments, but this can be changed with the system properties document.searchComments.enabled and blog.searchComments.enabled.
    • Useful settings you can change in Admin Console > System > Settings > Search > Search Settings:
      • Search attachments (such as Word documents and PDFs) - turned on by default
      • Default search query date range
      • Default Indexer Type
    • Synonyms: You can define common synonyms for your particular system - for example, "docs" and "documentation".  To add synonyms, go to Admin Console > System > Settings > Search > Synonyms, enter a pair of words separated by a comma in the Synonyms box, then click Add Synonym.
    • Stop words: these are words that are ignored by the search engine (such as "the" and "of"). You can add your own stop words in Admin Console > System > Settings > Search > Stop Words. The default stop words (which can't be removed without customization) are:
        public static final String[] STOP_WORDS = {
              "a", "and", "are", "as", "at", "be", "but", "by",
              "do", "for", "i", "if", "in", "into", "is", "it",
              "me", "my", "no", "not", "of", "on", "or", "s", "such",
              "t", "that", "the", "their", "then", "there", "these",
              "they", "this", "to", "was", "will", "with", "you"
        };
    
    
    
    
    

     

    Note that changing either synonyms or stop words requires you to rebuild the search index; in 5.0 you also need to restart the application after making the change and before rebuilding the search index.
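
    To show what ignoring stop words means in practice, here is a small sketch that drops them from a query before searching. It uses only a subset of the default list above, splits on whitespace only, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of stop-word removal; not the actual analyzer code.
class StopWordSketch {
    // A subset of the default stop words listed above.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "and", "for", "in", "is", "of", "the", "to"));

    // Lowercases the query, splits on whitespace, and keeps only non-stop words.
    static List<String> removeStopWords(String query) {
        List<String> kept = new ArrayList<>();
        for (String token : query.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

    A query such as "The history of search" is reduced to [history, search], so only those two terms need to match.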

     

    Non-content search

    In addition to searching for content, you can also search for users and places (such as spaces, social groups and projects).  There are some important differences to note with these types of search.

     

    User search includes searching for users through the front end, as well as searching for users in the admin console People tab.

     

    As mentioned earlier, user search uses "OR" instead of "AND", so adding more pieces of a user's name or profile won't narrow down your results; it will expand them.
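
    The difference between the two matching modes can be sketched like this. The class and method names are made up for the example; real queries are evaluated by the search engine, not by set lookups.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of "AND" versus "OR" matching.
class AndOrSketch {

    // Content search ("AND"): every query term must be present.
    static boolean matchesAll(Set<String> fieldTerms, List<String> queryTerms) {
        return fieldTerms.containsAll(queryTerms);
    }

    // User search ("OR"): any single query term is enough.
    static boolean matchesAny(Set<String> fieldTerms, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (fieldTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }
}
```

    For a profile containing "jane" and "doe", the query "jane smith" fails the "AND" test but passes the "OR" test, which is why extra terms widen user search results.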

     

    In 4.5, by default the application uses phonetic ("fuzzy") search. This works as follows: the search index stores the metaphone encoding of each user's name, which breaks the name down into its basic phonetic blocks. Then, when you search for a term, we also get the metaphone encoding of the search term, and compare it to the metaphone encoding of the names. We give matches that are only phonetic a "negative boost" of 0.6, which should ensure that an exact match receives a higher search score than a phonetic match.

     

    This is great for situations where you don't know the exact spelling of a name.  For example: consider a situation where you think a user's name is Nick Smith, but actually in Jive it's stored as Nik Smyth.  Without phonetic search, you wouldn't find this user.  However, both of their metaphone encodings are NKSM, so you would get a match with phonetic searches.
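
    The scoring side of this can be sketched as follows. The crudeKey method is only a stand-in so that the example runs on its own: it is NOT the real Metaphone algorithm and produces different codes than the ones quoted above. The class, the method names, and the way the 0.6 factor is applied to a base score are assumptions for illustration.

```java
import java.util.Locale;

// Hypothetical sketch of down-weighting phonetic-only matches.
class PhoneticBoostSketch {
    static final double PHONETIC_ONLY_BOOST = 0.6;

    // Crude stand-in for a phonetic encoder; this is NOT the real Metaphone algorithm.
    static String crudeKey(String name) {
        String s = name.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        s = s.replace("CK", "K").replace('Y', 'I');
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean vowelOrH = "AEIOUH".indexOf(c) >= 0;
            // Keep the first letter; afterwards drop vowels and H.
            if (i == 0 || !vowelOrH) {
                key.append(c);
            }
        }
        return key.toString();
    }

    // Exact matches keep the full score; phonetic-only matches are scaled down.
    static double score(String query, String storedName, double baseScore) {
        if (query.equalsIgnoreCase(storedName)) {
            return baseScore;
        }
        if (crudeKey(query).equals(crudeKey(storedName))) {
            return baseScore * PHONETIC_ONLY_BOOST;
        }
        return 0.0;
    }
}
```

    In this sketch, searching for "Nick Smith" still finds "Nik Smyth", but with a lower score than a user whose name is spelled exactly as typed.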

     

    However, it can sometimes cause confusion when you get a lot of phonetic matches that don't make intuitive sense. One example: a user searches for XML, intending to find people who have that term in their job titles. However, most of the results are people with names like Samuel.

     

    In 4.5 you can disable phonetic search with the system property people.search.fuzzy.enabled set to false.  In 5.0, this is not enabled by default; instead, we use synonyms for common nicknames: for example, if you search for "Robert", the results should also return users named "Bob".
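
    Nickname synonyms can be pictured as a simple query expansion. The table below is invented for the example (the real synonym list is not given in this document), as are the class and method names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of nickname expansion for user search.
class NicknameSynonymSketch {
    // Made-up nickname table; the actual list used by the product may differ.
    static final Map<String, List<String>> NICKNAMES = new HashMap<>();
    static {
        NICKNAMES.put("robert", Arrays.asList("bob", "rob"));
        NICKNAMES.put("william", Arrays.asList("bill", "will"));
    }

    // Returns the lowercased name plus any nicknames configured for it.
    static Set<String> expand(String name) {
        String key = name.toLowerCase(Locale.ROOT);
        Set<String> terms = new LinkedHashSet<>();
        terms.add(key);
        terms.addAll(NICKNAMES.getOrDefault(key, Collections.<String>emptyList()));
        return terms;
    }
}
```

    A search for "Robert" would then be run as robert OR bob OR rob, while a name with no entry in the table is searched as typed.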

     

    Profile fields

    Another important thing to know about user search is that profile fields are grouped into combined search fields according to their privacy settings. (For example, the contents of all profile fields that are set to "everyone" are together in one search field, and the contents of all profile fields that are set to "connections" are in another search field.) This is important mainly because it means that profile fields will, in most cases, receive lower relative scores than names, due to the length of the search field that they're in. It also means that you can't search for a specific profile field the way you can search a content field with a query like subject:support.
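
    The grouping can be sketched as building one combined text field per visibility level. The input shape (pairs of visibility and value), the sample values, and the class name are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of grouping profile fields by privacy setting.
class ProfileFieldSketch {

    // Builds one combined search field per visibility level.
    // Each input row is {visibility, value}.
    static Map<String, String> groupByVisibility(String[][] profileFields) {
        Map<String, String> searchFields = new LinkedHashMap<>();
        for (String[] field : profileFields) {
            // Append the value to whatever is already stored for this visibility level.
            searchFields.merge(field[0], field[1], (a, b) -> a + " " + b);
        }
        return searchFields;
    }
}
```

    Because every "everyone" field ends up in the same long field, a match on one profile value competes with all the other words in that field, which is why it tends to score lower than a match on a short name field.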

     

    When searching places, such as spaces, social groups, etc., we don't just search the title; we also search the description and the tags.  These are all lumped together into one field.

     

    The same search algorithm applies here; a field that contains 5 words, one of which is a match, will receive a higher score than a field that contains 25 words, one of which is a match.  If you're having trouble getting your place to show up at the top of a search for a particular term, be sure to use the search term in the title, description and tag field as many times as possible, with as few other words as possible.

     

    Tweaking your search

    • You can add wildcards (*) to your search.
      • Note that wildcards can't be used as the first character of a search. This means that you can't search for all users with a particular email domain, for example - a search for '*@jivesoftware.com' will return no results (unless you have a user who has the exact string '@jivesoftware.com' as part of their profile, such as their username).
    • You can also search specific fields.  For example, subject:support will search for content with support in the subject.
    • The default search range is "all", but you can choose a different search range if you're only looking for more recent items.
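
    The two syntax rules above can be sketched as simple checks. This is not the actual query parser; the class and method names are hypothetical, and the sketch only handles a single field:value pair.

```java
// Hypothetical sketch of the wildcard and field-query rules.
class QuerySyntaxSketch {

    // A wildcard may appear in a term, but not as its first character.
    static boolean isValidWildcardTerm(String term) {
        return !term.isEmpty() && term.charAt(0) != '*';
    }

    // Splits "field:value" into {field, value}; a bare term gets a null field.
    static String[] parseFieldQuery(String query) {
        int colon = query.indexOf(':');
        if (colon <= 0) {
            return new String[] {null, query};
        }
        return new String[] {query.substring(0, colon), query.substring(colon + 1)};
    }
}
```

    For example, "supp*" passes the wildcard check and "*@jivesoftware.com" does not, while "subject:support" is split into the field "subject" and the term "support".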

     

     

    Differences between 4.5, 5.0, 6.0, 7.0 and Cloud

    Many changes and improvements were made in the 5.0 search implementation, and these changes are also in the 6.0 on-premise search. This section covers the biggest differences. A more comprehensive document is available here: Search in 4.x, 5.x, 6.x, 7.x and Cloud

     

    As of 6.0, Jive also offers a cloud-based search service, which has better features (highlights outlined below) and will iterate more quickly. More on that can be found here: Search in Jive 6 FAQs

     

    Content search fields

    In 4.5, we lump all of the content fields into one search field: title, body and tags. We do this so that we can match one word from the search in the title field and another in the body. In 5.0, we still do this, but we also index each field individually. This means that the shorter fields (like title) have more weight than longer fields (like the body field). Because of this, a match on words in the title or tags is likely to rank higher than a match in the body - assuming the body of the content is longer than the title and tags, of course.

     

    For the cloud search service (available in 6.0 and cloud versions of Jive), additional improvements have been made, specifically:

    Status Updates

    Since status updates have no subject field, we only incorporate the relevance match of the body index for status updates. This prevents them from dominating search results.

    Exact word matches in the subject

    We've added additional weight to exact subject matches so that they are ranked higher.

    Word proximity

    When the query terms are next to each other (and in the same order) in a document, that document will be ranked higher.
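
    A minimal sketch of the idea: check whether the query terms appear consecutively and in order, and raise the score if they do. The boost factor of 2.0 is purely an assumption (the actual weighting is not documented here), as are the class and method names.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a word-proximity boost.
class ProximitySketch {

    // True when the query terms occur consecutively, in order, in the document.
    static boolean containsPhrase(List<String> docTerms, List<String> queryTerms) {
        return !queryTerms.isEmpty()
                && Collections.indexOfSubList(docTerms, queryTerms) >= 0;
    }

    // Made-up boost factor of 2.0 for documents containing the terms as a phrase.
    static double applyProximityBoost(double baseScore, List<String> docTerms,
                                      List<String> queryTerms) {
        return containsPhrase(docTerms, queryTerms) ? baseScore * 2.0 : baseScore;
    }
}
```

    With the query "finding nemo", a document containing "... watched finding nemo today" gets the boost, while one containing "nemo was finding dory" matches both terms but is not boosted.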

    Jive Find

    Additionally, cloud search introduces the concept of Jive Find, which is described in more detail here: Search in Jive 6 FAQs

    Search Framework

    In 4.5, we use Lucene directly.  In 5.0, we use Solr, which is a wrapper for Lucene.  In general, this change only affects the back-end, not the way that indexing or the search algorithm works - this shouldn't be something that affects users' results.

     

    In 6.0, there are two options for how to access search - on premise or cloud search. More on the 6.0 options here: https://community.jivesoftware.com/docs/DOC-67914

    The tag field

    In 5.0, we discontinued the normalization of the tag field. This means that the search algorithm in 5.0 doesn't take into account how many words are in the tag field: if one document has a single tag, "search", and another document has 5 tags, one of which is "search", the two documents will receive the same score from the tags field when searching for the term "search".
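
    The effect of dropping normalization can be sketched by scoring the tag field with and without a length norm. The 1/sqrt(n) norm is the classic Lucene-style factor used here as an assumption; the class and method names are hypothetical and all other scoring factors are left out.

```java
import java.util.List;

// Hypothetical sketch of tag scoring with and without length normalization.
class TagNormSketch {

    // Scores a tag field for a term; with normalize=false the number of tags is ignored.
    static double tagScore(List<String> tags, String term, boolean normalize) {
        if (!tags.contains(term)) {
            return 0.0;
        }
        return normalize ? 1.0 / Math.sqrt(tags.size()) : 1.0;
    }
}
```

    Without normalization, a document with the single tag "search" and a document with five tags including "search" score the same; with normalization, the single-tag document would score higher.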

     

     

    Lucene FAQ

    Great resource for answering your Lucene questions: http://wiki.apache.org/lucene-java/LuceneFAQ#Lucene_FAQ