
Authors: bill.zhang, Senior Search Engineer at Jive Software, and matt.wheeler, Senior Software Engineer at Jive Software

 

Over the past year we have done some experimentation with different index structures to efficiently support multiple tenants (or customers) in a single search index.  We thought we'd share some of our findings.

 

Multi-tenant search with Lucene

There are many ways to achieve multi-tenancy using Lucene. The regular way is to create a tenant ID field and apply a filter on that field to every query. In this post, we describe another approach: prefixing each field name with a tenant ID so that every tenant has its own set of fields in the index. We believe this approach leads to much better performance than the regular approach.

 

Implementing multi-tenancy using a filter field

Today most enterprise search engines support "fielded search" (Apache Lucene - Query Parser Syntax): when you search for a keyword you can specify the "field" of the keyword, which roughly means which part of the document the keyword must appear in. For example, when you search for the keyword "jive", the query that gets executed internally is a fielded query that might look like:

"subject:jive body:jive"

The actual query is more complex than that, but this is enough to illustrate the idea in this blog. The query string tells an internal index searcher: "retrieve the documents where 'jive' appears either in the 'subject' or the 'body'". To support the fielded query, during the indexing process we split the documents into multiple fields: subject, body, tags, author_id, and so on (Apache Lucene - Index File Formats). Additionally, search engines support the definition of "filters", i.e., objects used to restrict which documents may be returned during searching (Filtering a Lucene Search). Therefore, we can put the tenant ID of the content's owner in a field "tenantId" in every index document. Then for each query we specify a filter on tenantId to restrict the results to that tenant:

"FilterChain:[tenantId:12345]"

 

An alternative approach to multi-tenancy

We experimented with an alternative approach to multi-tenancy. Instead of using a filter to restrict the results we prefixed field names with the tenant ID.  Thus the query becomes:

"12345_subject:jive 12345_body:jive"

This may seem wasteful, as there would potentially be tens of thousands of fields in a single index. Below we describe the performance of such an indexing scheme. Search engines usually define a <field, text> pair as a term (Term (Lucene 3.6.2 API)), and their most basic function is to retrieve a "posting list", i.e., a list of docs that contain a term, for any given term. Since "12345_subject:jive" can be parsed into a single term, the docs containing the term can be retrieved quickly -- much quicker than retrieving the much bigger posting list of the term "subject:jive".
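
The same query under the per-tenant-field scheme can be sketched as follows (again Lucene 4.x API; the helper and field names are illustrative). At index time every field name is prefixed with the owning tenant's ID, and at query time the keywords are expanded against that tenant's fields only, so no tenant filter is needed:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class PerTenantFields {
  // "subject" for tenant 12345 becomes the physical field "12345_subject".
  static String tenantField(String tenantId, String logicalField) {
    return tenantId + "_" + logicalField;
  }

  // Index time: add content under the tenant-prefixed field names.
  static Document buildDocument(String tenantId, String subject, String body) {
    Document doc = new Document();
    doc.add(new TextField(tenantField(tenantId, "subject"), subject, Field.Store.YES));
    doc.add(new TextField(tenantField(tenantId, "body"), body, Field.Store.NO));
    return doc;
  }

  // Query time: "12345_subject:jive 12345_body:jive" -- no tenant filter needed,
  // because no other tenant's documents contain these terms at all.
  static BooleanQuery buildQuery(String tenantId, String keyword) {
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term(tenantField(tenantId, "subject"), keyword)), BooleanClause.Occur.SHOULD);
    query.add(new TermQuery(new Term(tenantField(tenantId, "body"), keyword)), BooleanClause.Occur.SHOULD);
    return query;
  }

  public static void main(String[] args) {
    // prints something like: 12345_subject:jive 12345_body:jive
    System.out.println(buildQuery("12345", "jive"));
  }
}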

 

This approach has two advantages when compared with the filter-field approach:

  1. Performance. Besides producing a much shorter posting list per term, it also avoids the need to inject a tenant filter into every query; and
  2. Relevance. Mixing multi-tenant data in a single field reduces the accuracy of search relevance, because document frequency is computed across all tenants: a term that is rare for one tenant but common for another looks equally common to both (Similarity (Lucene 3.6.2 API)).

To our knowledge this is a new approach that has not been published.  Here we describe our findings applying it to Lucene.


Concerns

We started by identifying a number of factors in this approach that would cause various Lucene memory and disk structures to explode in size as the number of fields grew:

  1. Lucene allocates an in-memory array of normalization factors (norms) for each field.
  2. Lucene allocates a bit cache for filters per term/field combination.
  3. Lucene creates a field cache for each sortable field.
  4. Lucene's on-disk index size.

 

Basic Experiments

The experiment was to set up an index that had 100 tenants (customers), each with 100,000 documents. The experiment then walked through a random dictionary of terms. In order to exercise the boolean logic, the queries were of the form <query> OR <last query>, as sketched below.
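
A rough sketch of the query generation (hypothetical harness code, not the actual benchmark driver): each dictionary term is OR'ed with the previous one before being searched.

import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class WalkDictionary {
  // Walk a random dictionary and issue "<query> OR <last query>" against one field,
  // so that each search exercises Lucene's boolean (disjunction) logic.
  static void run(IndexSearcher searcher, String field, List<String> dictionary) throws Exception {
    Query lastQuery = null;
    for (String word : dictionary) {
      Query query = new TermQuery(new Term(field, word));
      if (lastQuery != null) {
        BooleanQuery or = new BooleanQuery();
        or.add(query, BooleanClause.Occur.SHOULD);
        or.add(lastQuery, BooleanClause.Occur.SHOULD);
        query = or;
      }
      searcher.search(query, 10);
      lastQuery = new TermQuery(new Term(field, word));
    }
  }
}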

 

Basic Index Sizes (without any optimization)

In the table the columns represent the following Lucene data structures:

  • frq   Contains the list of docs which contain each term along with frequency
  • cfs   An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles
  • tim   Part of the term dictionary, stores term info
  • prx   Stores position information about where a term occurs in the index
Index Scheme | index disk size | frq | cfs | tim | prx
Just simple fields (not per tenant) | 10 Gb | 3.2 Gb | 19 Mb | 15 Mb | 6.1 Gb
With per-tenant fields: 100 tenants each with 100,000 docs | 16 Gb | 5.9 Gb | 1.8 Gb | 1.2 Gb | 6.1 Gb
With per-tenant fields: 10,000 tenants each with 1000 docs | | 7.0 Gb | 19 Gb ** | 2.3 Gb | 6.8 Gb

** This norm file can be compressed. See below.

 

Basic Query Results Timings

The table below shows the latency of random queries against two different index builds, with and without per-tenant fields. With per-tenant fields the queries are 3-4 times faster.

 

Index Scheme | Total time (ms) of searching 10,000 times | Total time (ms) of searching 20,000 times
Just simple fields (not per tenant) | 13,743 | 26,442
With per-tenant fields: 100 tenants each with 100,000 docs | 4,716 | 8,209


 

 

Specific Solutions & More Experiments

Norm compression

When we use per-tenant fields, the size of the norm file (.nrm) is proportional to the number of docs times the number of tenants. As we saw in the above experiments, when there is a large number of customers in a single index the norm file becomes very large.

 

Norms are stored as a per-field array with one entry per document. Thus when a document does not have any terms for a field, the array still contains a zero value for that document. This isn't a big deal when most of the documents in the index have a value for each field, but it is a problem when each field has relatively few documents (a sparse index). With per-tenant fields, each logical field (e.g., title) becomes one physical field per tenant, and every one of those fields carries a norm entry for every document in the index, even though only that tenant's documents actually use it.

 

We developed a scheme to compress the norm file. The basis of the idea is that although a single index may contain data from multiple tenants, any single index document contains data from only one tenant. Therefore, for n tenants, instead of having n arrays of norms for the same logical field (e.g., title), we can combine all norms of the field into one array. In this way, the size of the norm data no longer depends on the number of customers -- just the number of documents.
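
The codec-level plumbing is beyond the scope of this post, but the heart of the compression is just a key mapping, sketched below in plain Java (the class and method names are hypothetical): all per-tenant variants of a logical field share one norms array indexed by document, and since a document belongs to exactly one tenant, no two tenants ever write to the same slot.

import java.util.HashMap;
import java.util.Map;

public class CompressedNorms {
  private final int maxDoc;
  // One norms array per *logical* field (e.g. "title"), not per tenant-prefixed field.
  private final Map<String, byte[]> normsByLogicalField = new HashMap<String, byte[]>();

  public CompressedNorms(int maxDoc) {
    this.maxDoc = maxDoc;
  }

  // "12345_title" -> "title": strip the tenant prefix to find the logical field.
  static String logicalField(String field) {
    int underscore = field.indexOf('_');
    return underscore < 0 ? field : field.substring(underscore + 1);
  }

  // Record the norm for (field, doc). Because each document belongs to exactly one
  // tenant, at most one tenant ever writes to a given slot, so sharing is safe.
  public void setNorm(String field, int doc, byte norm) {
    String logical = logicalField(field);
    byte[] norms = normsByLogicalField.get(logical);
    if (norms == null) {
      norms = new byte[maxDoc];
      normsByLogicalField.put(logical, norms);
    }
    norms[doc] = norm;
  }

  public byte getNorm(String field, int doc) {
    byte[] norms = normsByLogicalField.get(logicalField(field));
    return norms == null ? 0 : norms[doc];
  }
}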

 

We built a prototype based on this idea and experimented with it by indexing the same input file with different numbers of tenants and documents, with norm compression enabled:

Number of docs per tenant | Number of customers | .doc | .pos | .tim | _nrm.cfs
1000 | 1000 | 557M | 659M | 457M | 2.95M
1000 | 2000 | 1.16G | 1.32G | 0.92G | 5.89M
1000 | 4000 | 2.44G | 2.64G | 1.87G | 11.78M
400,000 | 10 | 1.95G | 2.71G | 0.28G | 11.87M

Notes:

  1. In rows 1, 2 and 3 the norm file size increases because of the increase in the total number of docs.
  2. Lucene changed file names in 4.1. For details see: org.apache.lucene.codecs.lucene41 (Lucene 4.1.0 API)

 

For comparison, we also tried indexing the same input file, without the norm compression:

Number of docs per tenant | Number of customers | .doc | .pos | .tim | _nrm.cfs
1000 | 1000 | 557M | 659M | 457M | 2.95G

 

By combining the 1000 per-tenant norms for each logical field into a single on-disk norms array, we made the norms file 1000 times smaller (2.95M versus 2.95G).

 

Sorting by Field Value

Sorting by field value, e.g., by title, is a common function of enterprise search engines. However, given how Lucene implements sorting (based on FieldCache), it does not scale as the number of fields used in sorting increases. To solve this problem, we experimented with a "hybrid" approach: use per-tenant fields for querying and storing content, but use simple fields (not per-tenant) for sorting and faceting.
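
A minimal sketch of the hybrid scheme (Lucene 4.x API; the field names, such as title_sort, are illustrative): content is indexed and queried under per-tenant fields, while a single shared, non-tokenized field serves every tenant for sorting.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class HybridSorting {
  // Index time: per-tenant fields for searchable content, one shared field for sorting.
  static Document buildDocument(String tenantId, String title, String body) {
    Document doc = new Document();
    doc.add(new TextField(tenantId + "_title", title, Field.Store.YES));
    doc.add(new TextField(tenantId + "_body", body, Field.Store.NO));
    // Shared, non-tokenized sort key: one field cache entry for the whole index
    // instead of one per tenant.
    doc.add(new StringField("title_sort", title.toLowerCase(), Field.Store.NO));
    return doc;
  }

  // Query time: the per-tenant field already restricts results to one tenant,
  // so sorting on the shared field cannot leak other tenants' documents.
  static TopDocs searchSortedByTitle(IndexSearcher searcher, String tenantId, String keyword)
      throws Exception {
    Sort byTitle = new Sort(new SortField("title_sort", SortField.Type.STRING));
    return searcher.search(new TermQuery(new Term(tenantId + "_title", keyword)), 10, byTitle);
  }
}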

 

We implemented a "sort by title" function with both the hybrid approach and the per-tenant approach and ran experiments with them. Below are the details of the experiment:

 

Configuration:

  • Hardware: Mac Pro, 16 GB memory, SSD hard drive, 2.3 GHz i7 CPU
  • Software: Mac OSX 10.8.2, Java 1.6
  • JVM: -Xmx1024m

Indexing setup:

  • 4000 tenants each with 1000 docs
  • the same input file as in previous experiments
  • added a new, non-tokenized field to support sorting by title
  • the value of the new field is unique for each doc

Search setup:

  • 1000 tenants each with 10,000 queries
  • The same group of queries was executed for "sort by relevance" and "sort by title"
  • 32.1% of queries return non-zero hits
Approach | .doc | .pos | .tim | Query Latency, Sort by Relevance (microseconds) | Query Latency, Sort by Title (microseconds)
Hybrid (per-tenant fields for querying & simple fields for sorting) | 2.45G | 2.64G | 2.19G | 73.4 | 78.5
Per-tenant fields, without sorting field (and therefore doesn't support sorting by title) | 2.44G | 2.64G | 1.87G | | Doesn't support
Per-tenant fields, including sorting field | 2.45G | 2.64G | 2.20G | | OUT OF MEMORY!

Filtering (by numeric range and/or by field value)

Filtering is another common function of enterprise search engines. In this group of experiments we implemented two filters, filtering by date and filtering by content type, in multiple ways and tested their performance. The implementations differ in:

  1. Tenant ID handling: the hybrid approach (per-tenant fields for querying and simple fields for filtering) versus per-tenant fields for everything
  2. Filter strategy: how filtering is done internally, a very new Lucene feature. See http://lucene.apache.org/core/4_1_0/core/index.html?org/apache/lucene/index/IndexWriter.html
  3. Caching (a sketch of these variants follows this list):
    • No caching: filter chain would be [NumericRangeFilter, TermsFilter] to apply the date and content type filter.
    • Use field cache: [FieldCacheRangeFilter, FieldCacheTermsFilter]
    • Cache individual filters: [CachingWrapperFilter(NumericRangeFilter), CachingWrapperFilter(TermsFilter)]
    • Cache filter chain: CachingWrapperFilter([NumericRangeFilter, TermsFilter])
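
These combinations can be assembled roughly as follows (a sketch against the Lucene 4.1 API; the field names dateField and contentType are illustrative, a QueryWrapperFilter stands in for the TermsFilter, and we believe the "query first" and "leap frog" rows in the results below correspond to FilteredQuery's filter strategies introduced in Lucene 4.1):

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.BooleanFilter;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.NumericRangeFilter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

public class FilterChainSketch {

  // Date filter [minDate, maxDate) plus content-type filter, chained with a BooleanFilter
  // (from the lucene-queries module). The raw chain corresponds to the "no caching" rows;
  // wrapping the whole chain in one CachingWrapperFilter corresponds to "caching filter chain".
  // The "field cache" rows would swap in FieldCacheRangeFilter / FieldCacheTermsFilter, and
  // "caching individual filters" would wrap each filter in its own CachingWrapperFilter.
  static Filter buildFilterChain(long minDate, long maxDate, String contentType, boolean cacheChain) {
    Filter dateFilter = NumericRangeFilter.newLongRange("dateField", minDate, maxDate, true, false);
    Filter typeFilter = new QueryWrapperFilter(new TermQuery(new Term("contentType", contentType)));

    BooleanFilter chain = new BooleanFilter();
    chain.add(dateFilter, BooleanClause.Occur.MUST);
    chain.add(typeFilter, BooleanClause.Occur.MUST);
    return cacheChain ? new CachingWrapperFilter(chain) : chain;
  }

  // "Default" rows: let FilteredQuery pick its own strategy.
  static Query withDefaultStrategy(Query query, Filter chain) {
    return new FilteredQuery(query, chain);
  }

  // "query first" row: force FilteredQuery.QUERY_FIRST_FILTER_STRATEGY; the "leap frog"
  // rows use LEAP_FROG_FILTER_FIRST_STRATEGY / LEAP_FROG_QUERY_FIRST_STRATEGY instead.
  static Query withQueryFirstStrategy(Query query, Filter chain) {
    return new FilteredQuery(query, chain, FilteredQuery.QUERY_FIRST_FILTER_STRATEGY);
  }
}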

Configuration:

  • Hardware: Mac Pro, 16 GB memory, SSD hard drive, 2.3 GHz i7 CPU
  • Software: Mac OSX 10.8.2, Java 1.6
  • JVM: -Xmx1024m

Indexing setup:

  • 500 tenants each with 10,000 docs
  • the same input file as in previous experiments
  • added a new field to support filtering by date, and its value is a random date from 2009 to 2013 (both inclusive)
  • added a new field to support filtering by type, and its value is a random int (converted to string) from 0 to 9 (both inclusive)

Search setup:

  • 500 tenants each with 2000 queries
  • Date filter is applied on every query, and the date range is from 2011/01/01 (inclusive) to 2012/01/01 (exclusive)
  • Content type filter is applied on every query, and the content type is randomly chosen from all possible types
  • Without filtering: 69.3% of queries return non-zero hits
  • With filtering: about 56% of queries return non-zero hits
Per-tenant Fields? | Filter Strategy | Filter Caching | Query Latency (microseconds) | Comment
No Filtering | N/A | N/A | 94.5 |
Hybrid | Default | No caching | 20-30 milliseconds |
Hybrid | Default | Field cache | 80-100 milliseconds | Slower than not caching at all!
Hybrid | Default | Caching individual filters | 10-20 milliseconds |
Hybrid | Default | Caching filter chain | 89.2 | Faster than no filtering!
Hybrid | query first | Caching filter chain | 61.5 | Best of all
Hybrid | leap frog filter first | Caching filter chain | 62.9 |
Hybrid | leap frog query first | Caching filter chain | 62.2 |
Per-customer fields | Default | No caching | 774.5 | Not too bad!
Per-customer fields | Default | Caching filter chain | OOM |
