<div dir="ltr">Matthias and Team,<div><br></div><div>Many, many thanks for your help in this matter.  We&#39;ll proceed with your recommended approach.  We will also probably stop doing upgrades on running clusters.  It seems it might be safer to do the full procedure (bringing down the whole cluster for the upgrade).</div>

<div><br></div><div>If you happen to find the cause of this bug at some point, of course I would be interested to know (especially if you fix it).</div><div><br></div><div>Thanks again,</div><div>Chris</div></div><div class="gmail_extra">

<br><br><div class="gmail_quote">On Tue, Sep 24, 2013 at 8:00 AM, Matthias Radestock <span dir="ltr">&lt;<a href="mailto:matthias@rabbitmq.com" target="_blank">matthias@rabbitmq.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Chris,<br>

<br>

(putting list back on cc)<div class="im"><br>

<br>

On 23/09/13 21:42, Chris wrote:<br>

</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">

Thank you very much for taking a look at these logs.  It is a strange<br>

bug for sure!  I guess I have two goals, really:<br>

<br></div>

  * To get the customer back up and running in the least disruptive way<br>

  * To help you guys understand what happened since I know it&#39;s no fun<div class="im"><br>

    to have mystery bugs in your product. ;-)<br>

<br>

Regarding #1, if there is not a minimally disruptive way, I am assuming<br>

I will need to reset all nodes and rebuild the cluster.<br>

</div></blockquote>

<br>

You should be able to recover the &#39;not found&#39; bindings by a) recreating the queues (as you already have) and then b) stopping the entire cluster (i.e. all nodes must be down at the same time) and restarting it.<div class="im">

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Regarding #2, if you need anything else from me, please let me know!<br>

</blockquote>

<br></div>

There is no smoking gun in the logs, so the likely source of the problem is some edge-case error in the mirroring and/or recovery logic. That may take us a while to track down. I don&#39;t think there&#39;s any other info we need from your running cluster. Thanks for reporting this issue.<br>


<br>

<br>

Regards,<br>

<br>

Matthias.<br>

</blockquote></div><br></div>