0 Replies Latest reply on Sep 10, 2015 8:30 AM by taylybb

    Crawling: Deep Paging Problem

    taylybb

      While crawling our Jive-based production environment, we experienced several errors and frequent halts, and we have not yet been able to complete a full crawl.  After further research, we think we have found a possible cause and a potential solution: the deep paging problem. The problem gets worse as the number of documents to crawl grows (data types: content, people, places, announcements), and our environment has several hundred thousand of them.
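
      To give a rough sense of why this scales so badly, here is a back-of-the-envelope sketch. It assumes a simplified cost model in which a request at offset start with page size rows makes the server collect the top start + rows results, whereas a cursor-style request only collects rows results. The function names and the 300,000 figure are ours, for illustration only; they are not measurements from our Jive instance.

```python
def offset_paging_cost(total_docs: int, rows: int) -> int:
    """Results the server collects in total when paging with start/rows.

    Simplified model (an assumption): the request at offset `start`
    has to collect the top `start + rows` results before returning
    the last `rows` of them.
    """
    cost = 0
    for start in range(0, total_docs, rows):
        cost += min(start + rows, total_docs)
    return cost


def cursor_paging_cost(total_docs: int, rows: int) -> int:
    """Results collected in total when each request only needs `rows` results."""
    cost = 0
    for start in range(0, total_docs, rows):
        cost += min(rows, total_docs - start)
    return cost


if __name__ == "__main__":
    # Illustrative numbers only: 300,000 documents fetched 100 at a time.
    print(offset_paging_cost(300_000, 100))
    print(cursor_paging_cost(300_000, 100))
```

      Under this model, crawling 300,000 documents in pages of 100 means collecting 450,150,000 results with offset paging versus 300,000 with a cursor, which would fit the slowdowns and halts we see as the crawl gets deeper.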

       

      We found out that Jive 5.0 onward moved from using the Lucene search library directly to using Solr, a search server that is itself built on Lucene. Solr 4.7 and later (the latest is now Solr 5.3) provides a solution to the deep paging problem called cursorMark. Essentially, it avoids the unnecessary overhead of working through all the prior pages of results again before reaching the target page.  If Jive is using Solr 4.7 or later, we would like to know whether this feature is being utilized in the Jive REST API. If it is not, or if a different optimization approach such as a filter query is used instead, we would like cursorMark to be implemented.
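
      For reference, this is roughly how we understand a cursorMark loop against a plain Solr /select endpoint. It is a minimal sketch and not the Jive REST API: we do not know how Jive queries Solr internally, and the base URL, the "id" field, and the function names are assumptions. The parts that come from Solr itself are that the first request sends cursorMark=*, that the sort must include the uniqueKey field, that each response carries a nextCursorMark, and that the crawl is finished when nextCursorMark equals the cursorMark that was sent.

```python
import json
from typing import Any, Callable, Dict, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

Fetch = Callable[[Dict[str, str]], Dict[str, Any]]


def solr_fetch(base_url: str) -> Fetch:
    """Build a fetch function for a Solr core URL (hypothetical, e.g.
    "http://localhost:8983/solr/mycore")."""
    def fetch(params: Dict[str, str]) -> Dict[str, Any]:
        with urlopen(f"{base_url}/select?{urlencode(params)}") as resp:
            return json.load(resp)
    return fetch


def iterate_with_cursor(fetch: Fetch, query: str, sort: str,
                        rows: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield every matching document using Solr's cursorMark paging.

    `sort` must include the uniqueKey field (assumed here to be "id"),
    for example "id asc", so that the ordering is unambiguous.
    """
    cursor = "*"  # "*" asks Solr to start from the beginning
    while True:
        response = fetch({
            "q": query,
            "sort": sort,
            "rows": str(rows),
            "cursorMark": cursor,
            "wt": "json",
        })
        for doc in response["response"]["docs"]:
            yield doc
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            # Solr returns the same mark once there is nothing more to read.
            break
        cursor = next_cursor
```

      A crawler would then call something like iterate_with_cursor(solr_fetch(base_url), "*:*", "id asc") and process the documents as they arrive, without ever sending a large start offset.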

       

      Below are some links explaining the problem and potential solution. Thank you.

       

      https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

      https://dzone.com/articles/solr-47-efficient-deep

      http://solr.pl/en/2011/07/18/deep-paging-problem/

      http://yonik.com/solr/paging-and-deep-paging/

      https://community.jivesoftware.com/docs/DOC-50253