One of our third-party technical partners is trying to crawl our Jive instance, specifically the discussions. Because of the immense amount of content and the resulting pagination, they are having a difficult time getting past the first few pages. I'm looking for help using the Jive API to crawl this content. Any information I can pass on to the vendor would be appreciated.
Jive already indexes content and uses Lucene in the background, so maybe they can somehow use the existing Lucene index. If not, they could limit their crawl to HTML elements that have certain IDs and classes. They could possibly also restrict which links the crawler follows by using regular expressions. These pages definitely have a lot going on, so they will probably want to limit what they crawl.
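To make the regular-expression idea a bit more concrete, here is a minimal sketch in Python using only the standard library. It pulls links out of a fetched page and keeps only the ones that match an allow-list pattern. The `/thread/<number>` pattern, the `LinkCollector` and `extract_links` names, and the example host are all assumptions for illustration, so check the real URL scheme on your instance before relying on any of it.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical pattern: assumes discussion URLs look like /thread/12345.
# Adjust it to whatever your instance actually uses.
THREAD_LINK = re.compile(r"^/thread/\d+$")


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags that match an allow-list pattern."""

    def __init__(self, allow):
        super().__init__()
        self.allow = allow
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only allowed links, and skip ones already seen on this page.
        if href and self.allow.match(href) and href not in self.links:
            self.links.append(href)


def extract_links(html, base_url, allow=THREAD_LINK):
    """Return absolute URLs for the links in `html` that match `allow`."""
    parser = LinkCollector(allow)
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]
```

The crawler would then queue only what `extract_links(page_html, "https://community.example.com")` returns, so profile pages, sort and filter links, and other navigation never get followed. The same approach could be narrowed further to links inside elements with particular IDs or classes, but that needs nesting tracking that this sketch leaves out.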