Could someone at Jive who has time tell me how to solve this problem?
I have an explanation for you about the mbox behavior you are seeing. When you import test2.mbox first, a dummy parent is created for F. Then, when you import test1.mbox, the "root" thread is created. However, when message E is imported it is seen as the replacement for F's dummy parent and gets inserted as the root message of the second thread. At that point E/F is its own thread, since the dummy message that E replaced was a thread root. The import process does not try to find a parent for E since E already seems to be a thread root.
In general, a good rule of thumb is to try and import messages in the order they were created/sent. The import process tries hard to resolve the original thread structure of the messages, but there are cases which can cause this type of behavior.
In other words, if I'm interpreting this correctly, Jive threading is not robust against messages arriving out of order. Since out-of-order arrival is normal behavior for e-mail and NNTP gateways, I would consider it a bug.
Greg, I'm not completely sure that your explanation is correct in this case. The mbox files in question are exports from legacy forum software, and the legacy software has sorted the messages into what it considers to be threaded order. It ensures that every reply appears after what it believes the parent to be. So your explanation only makes sense if Jive disagrees with the old software about parents. That is entirely possible, because in the real world, the References headers of different messages sometimes conflict or contain loops or other invalid data.
In any case, the mbox files are already sorted into a pretty good order, good enough that it's difficult for us to improve on. Any other ideas?
What Ngai has outlined with the test mbox files is a known limitation of the current gateway threading behavior. The limitation I'm talking about is the inability of a gateway to 'insert' a message prior to an already existing message.
In the above case, when test2.mbox is imported prior to test1.mbox, the gateway sees that there is a missing parent message (<vstkc2y2ubtt@legacy>), for which it creates a dummy parent. When test1.mbox is imported, that dummy message is replaced with the real message. However, that message's parent message is not created in that thread, since that would require 'inserting' a message at the beginning of a thread. Instead, it's created as a new thread.
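To make the ordering problem concrete, here is a minimal sketch of one order-independent approach. It is not Jive's gateway code; `Node`, `import_message`, `thread_roots`, and the short message IDs are all hypothetical. The idea is to keep one table of nodes keyed by Message-ID, fill a dummy node in place when the real message arrives, and still link that message to its own parent even if it was a thread root until then.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    msg_id: str
    is_dummy: bool = True              # True until the real message arrives
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def import_message(table: Dict[str, Node], msg_id: str,
                   parent_id: Optional[str]) -> None:
    """Add one message and link it to its parent, whatever the arrival order."""
    node = table.setdefault(msg_id, Node(msg_id))
    node.is_dummy = False              # an existing placeholder is filled in place
    if parent_id is None or node.parent is not None:
        return
    parent = table.setdefault(parent_id, Node(parent_id))
    # Refuse a link that would create a loop (invalid References data).
    ancestor: Optional[Node] = parent
    while ancestor is not None:
        if ancestor is node:
            return
        ancestor = ancestor.parent
    node.parent = parent
    parent.children.append(node)


def thread_roots(table: Dict[str, Node]) -> List[str]:
    """Message-IDs of all nodes that currently have no parent."""
    return sorted(n.msg_id for n in table.values() if n.parent is None)


# Hypothetical IDs: F replies to E, E replies to D.
table: Dict[str, Node] = {}
import_message(table, "F", "E")        # reply first: E becomes a dummy root
import_message(table, "D", None)
import_message(table, "E", "D")        # E is filled in and re-parented under D
print(thread_roots(table))
```

Because the late-arriving E is attached under D instead of staying a root, both import orders end with a single thread. Whether this maps cleanly onto the existing database schema is a separate question that the sketch does not address.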
I'm looking into solutions to have this behavior corrected in the 4.x series of Jive Forums. For now the only way to truly fix this issue is to manually update the threading information in the database after an import - a tricky task to get right and not one that really should be attempted. I'll think about this issue some more today to see if I can think of a more practical solution.
Yes, I think I understand it.
RFC 850 prescribes that References: headers should be complete, but in the real world they are sometimes truncated. That means that, given a single message, it is impossible to be sure what the root message of its thread is. As I understand it, Jive attempts to create that root message, and attempts to skip over intermediate parents until and unless one of them actually arrives.
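As a rough illustration of "skipping over intermediate parents", the sketch below picks the closest ancestor that has already arrived. This is my reading of the behavior, not Jive's code; `parse_references`, `nearest_known_ancestor`, and the example IDs are hypothetical.

```python
from typing import List, Optional, Set


def parse_references(header: str) -> List[str]:
    """Split a References header value into Message-IDs, oldest first."""
    return [tok for tok in header.split()
            if tok.startswith("<") and tok.endswith(">")]


def nearest_known_ancestor(references: List[str],
                           known: Set[str]) -> Optional[str]:
    """Return the closest ancestor that has already arrived, or None."""
    for msg_id in reversed(references):   # last entry is the direct parent
        if msg_id in known:
            return msg_id
    return None


full = parse_references("<a@example> <b@example> <c@example>")
truncated = parse_references("<b@example> <c@example>")

# Only the root has arrived, so the intermediate parents are skipped.
print(nearest_known_ancestor(full, {"<a@example>"}))
# With a truncated header the first entry is not the true root.
print(truncated[0])
```

The second case is the troublesome one: the first ID in a truncated header looks like a root but is not, so any root created from it is only a guess until more messages arrive.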
Years ago I ran some numbers on methods of grouping e-mail and newsgroup messages into threads (excluding the question of organizing messages within a thread). I compared a method using Message-ID with References and In-Reply-To headers versus a method using a canonicalized subject line. I counted a threading error whenever what human readers considered one topic was split into two threads, or two or more topics were jammed into one thread.
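For readers unfamiliar with the subject-line method, here is a minimal sketch of what canonicalizing a subject can look like. The exact rules used in that comparison are not reproduced here; stripping "Re:"/"Fwd:" prefixes, collapsing whitespace, and lowercasing are assumptions, and `canonical_subject` and `group_by_subject` are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One or more reply/forward markers such as "Re:", "RE[2]:", "Fwd:", "Fw:".
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*(?:\[\d+\])?\s*:\s*)+",
                     re.IGNORECASE)


def canonical_subject(subject: str) -> str:
    """Strip reply/forward prefixes, collapse whitespace, and lowercase."""
    stripped = _PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()


def group_by_subject(
        messages: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (message_id, subject) pairs by canonical subject."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for msg_id, subject in messages:
        groups[canonical_subject(subject)].append(msg_id)
    return dict(groups)


threads = group_by_subject([
    ("<1@example>", "Import problem"),
    ("<2@example>", "Re: Import problem"),
    ("<3@example>", "RE: Fwd: import  problem"),
    ("<4@example>", "Regarding imports"),
])
for subject, ids in threads.items():
    print(subject, ids)
```

This method ignores the header chain entirely, so it is unaffected by truncated or conflicting References, but it will merge unrelated topics that happen to share a subject.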
For my data at the time, I measured that both methods made many errors, and that the subject line method was significantly more accurate overall. In large part that was because people often unhelpfully start a new topic by replying to an existing message.
Presumably some method taking both pieces of information into account would be able to do even better from a human-centered point of view.