5 Replies Latest reply: Oct 26, 2004 11:32 AM by Jay Scott

    subsequently importing mbox problem

      Hi,

       

      When I import mbox files one after another, some messages that are older than the messages already in Jive are not threaded correctly. Could you tell me how to fix this problem?

      I attach the data that I used for testing, for your convenience.

      I have test1.mbox, test2.mbox, and test3.mbox. test3.mbox is the combination of test2.mbox and test1.mbox and is used for verification only, so test3.mbox shows the thread pattern that should appear in Jive. However, importing test1.mbox after test2.mbox has already been imported does not give me the same pattern that test3.mbox gives. It seems completely wrong.

       

      The correct thread pattern should look like this:

      (root)
      -A
      -E
      -F
      -C
      -D

      However, it gives me this:

      (root)
      -A
      -D
      -C
      (E)
      -F

      If one message has a different subject, the threading behaves differently again.

       

      My concern is that the mbox import behavior is so unpredictable. Could you tell me which import scenarios work correctly?

      We need your advice to fine-tune our data migration plan.

       

      Ngai

        • Re: subsequently importing mbox problem

          Hi,

           

          Could someone at Jive find time to tell me how to solve this problem?

           

          Ngai

            • Re: subsequently importing mbox problem

              Hi Ngai,

               

              I have an explanation for you about the mbox behavior you are seeing. When you import test2.mbox first, a dummy parent is created for F. Then, when you import test1.mbox, the "root" thread is created. However, when message E is imported, it is seen as the replacement for F's dummy parent and gets inserted as the root message of the second thread. At that point E/F is its own thread, since the dummy message that E replaced was a thread root. The import process does not try to find a parent for E, since E already seems to be a thread root.

               

              In general, a good rule of thumb is to try and import messages in the order they were created/sent. The import process tries hard to resolve the original thread structure of the messages, but there are cases which can cause this type of behavior.
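One way to follow this rule of thumb when the messages are spread across several mbox files is to merge them and sort by the Date header before importing. The following is a minimal Python sketch under stated assumptions: `sort_by_date` is a hypothetical helper, not part of Jive, and it assumes every message carries a parseable Date header with an explicit time zone.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def sort_by_date(messages):
    """Return messages ordered by their Date header, oldest first.

    Hypothetical helper for illustration. A message with a missing or
    malformed Date header raises an exception here; real data would
    need a fallback for such messages.
    """
    return sorted(messages, key=lambda m: parsedate_to_datetime(m["Date"]))
```

The standard library's `mailbox.mbox(path)` can be iterated to obtain the messages of each file, so the messages of test1.mbox and test2.mbox could be collected into one list, sorted this way, and written back out as a single file for import.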

               

              Regards,

              Greg

                • Re: subsequently importing mbox problem

                  In other words, if I'm interpreting this correctly, Jive threading is not robust against messages arriving out of order. Since that is normal behavior of e-mail and nntp gateways, I would consider it a bug.

                   

                  Greg, I'm not completely sure that your explanation is correct in this case. The mbox files in question are exports from legacy forum software, and the legacy software has sorted the messages into what it considers to be threaded order. It ensures that every reply appears after what it believes the parent to be. So your explanation only makes sense if Jive disagrees with the old software about parents. That is entirely possible, because in the real world, the References headers of different messages sometimes conflict or contain loops or other invalid data.

                   

                  In any case, the mbox files are already sorted into a pretty good order, good enough that it's difficult for us to improve on. Any other ideas?

                    • Re: subsequently importing mbox problem

                      Jay,

                       

                      What Ngai has outlined with the test mbox files is a known limitation of the current gateway threading behavior. The limitation I'm talking about is the inability of a gateway to 'insert' a message prior to an already existing message.

                       

                      In the above case, when test2.mbox is imported prior to test1.mbox, the gateway sees that there is a missing parent message (<vstkc2y2ubtt@legacy>), for which it creates a dummy parent. When test1.mbox is imported, that dummy message is replaced with the real message; however, that message's parent message is not created in that thread, since that would require 'inserting' a message at the beginning of a thread. Instead, it's created as a new thread.
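The sequence described above can be mimicked with a toy model. This is a minimal Python sketch, not Jive's actual gateway code: `Node`, `ToyImporter`, and the parent links used in the usage note (E replies to root, F replies to E) are hypothetical and chosen only to illustrate the limitation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    msg_id: str
    parent: Optional["Node"] = None
    dummy: bool = False
    children: list = field(default_factory=list)


class ToyImporter:
    """Hypothetical importer that cannot insert a message before an existing one."""

    def __init__(self):
        self.nodes = {}

    def import_message(self, msg_id, parent_id=None):
        existing = self.nodes.get(msg_id)
        if existing is not None:
            # The real message replaces its placeholder. The placeholder was
            # already a thread root, so no parent lookup is attempted and
            # parent_id is ignored: this is the limitation being modeled.
            existing.dummy = False
            return existing
        node = Node(msg_id)
        self.nodes[msg_id] = node
        if parent_id is not None:
            parent = self.nodes.get(parent_id)
            if parent is None:
                # Missing parent: create a dummy that acts as a thread root.
                parent = Node(parent_id, dummy=True)
                self.nodes[parent_id] = parent
            node.parent = parent
            parent.children.append(node)
        return node

    def thread_roots(self):
        """Return the ids of all messages that have no parent, sorted."""
        return sorted(n.msg_id for n in self.nodes.values() if n.parent is None)
```

In this model, importing F (parent E) first, then root, then E (parent root) leaves two thread roots, E and root. Importing root, E, F in that order leaves root as the only thread root.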

                       

                      I'm looking into solutions to have this behavior corrected in the 4.x series of Jive Forums. For now the only way to truly fix this issue is to manually update the threading information in the database after an import - a tricky task to get right and not one that really should be attempted. I'll think about this issue some more today to see if I can think of a more practical solution.

                       

                       

                      Regards,

                       

                      Bruce Ritchie

                        • Re: subsequently importing mbox problem

                          Yes, I think I understand it.

                           

                          RFC 850 prescribes that References: headers should be complete, but in the real world they are sometimes truncated. That means that, given a single message, it is impossible to be sure what the root message of its thread is. As I understand it, Jive attempts to create that root message, and attempts to skip over intermediate parents until and unless one of them actually arrives.
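To make the ambiguity concrete, here is a minimal Python sketch of how threading hints are commonly read from a single message's headers. `thread_hints` is a hypothetical helper, not Jive's implementation, and the precedence it uses (last References entry first, In-Reply-To as a fallback) is one common convention rather than the only one.

```python
import re

# A message id is text in angle brackets with no whitespace inside.
_MSG_ID = re.compile(r"<[^<>\s]+>")


def thread_hints(references="", in_reply_to=""):
    """Return (root_guess, parent) for one message; either may be None.

    By convention the first References entry is the oldest known ancestor
    and the last is the direct parent. If the header was truncated, the
    first entry is only a guess at the thread root.
    """
    ids = _MSG_ID.findall(references or "")
    reply_ids = _MSG_ID.findall(in_reply_to or "")
    parent = ids[-1] if ids else (reply_ids[0] if reply_ids else None)
    root_guess = ids[0] if ids else parent
    return root_guess, parent
```

When the References header has lost its leading entries, `root_guess` names an intermediate ancestor rather than the true root, and nothing in the message itself reveals that.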

                           

                          Years ago I ran some numbers on methods of grouping e-mail and newsgroup messages into threads (excluding the question of organizing messages within a thread). I compared a method using Message-ID with References and In-Reply-To headers versus a method using a canonicalized subject line. I counted a threading error when I found a case where what human readers considered one topic was split into two threads, or where two or more topics were jammed into one thread.

                           

                          For my data at the time, I measured that both methods made many errors, and that the subject line method was significantly more accurate overall. In large part that was because people often unhelpfully start a new topic by replying to an existing message.
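For illustration, grouping by a canonicalized subject line might look like the following minimal Python sketch. The canonicalization rules here (dropping leading "Re:", "Fwd:", and "AW:" prefixes, collapsing whitespace, lowercasing) are assumptions, not the rules used in the measurement described above, and `canonical_subject` and `group_by_subject` are hypothetical names.

```python
import re
from collections import defaultdict

# One or more leading reply/forward markers such as "Re:", "RE[2]:", "Fwd:".
_PREFIX = re.compile(r"^\s*((re|fwd?|aw)(\[\d+\])?\s*:\s*)+", re.IGNORECASE)


def canonical_subject(subject):
    """Strip reply/forward prefixes, collapse whitespace, and lowercase."""
    stripped = _PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()


def group_by_subject(messages):
    """Group (msg_id, subject) pairs into threads keyed by canonical subject."""
    threads = defaultdict(list)
    for msg_id, subject in messages:
        threads[canonical_subject(subject)].append(msg_id)
    return dict(threads)
```

A grouping like this keeps a new topic separate even when it was started by replying to an unrelated message, provided the author changed the subject line. It would also merge unrelated threads that happen to share a subject.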

                           

                          Presumably some method taking both pieces of information into account would be able to do even better from a human-centered point of view.