Hmm, that's not the whole story though. If it were just the one router failure (in reality a hypervisor failure), we'd be in a much better position, but it's combined with two modem failures. The ETiger->SnoDEM modem died over the winter and needs replacement; that link has been down for a month or more now. Most recently, the Tukwila->Baldi modem has been losing connectivity frequently. We've implemented an automatic mitigation for that, but it still produces sporadic downtime windows of a few minutes. I'd just like to move that modem to a NetMetal 5. Our servers are also being affected by instability in the Quagga routing software, which we need to replace with a more stable alternative like BIRD. Lastly, the Baldi emergency uplink is only configured to go to Westin and Corvallis, not Tukwila.

We could have avoided the DNS outages too if the anycast groups had been populated with more of the available servers. I believe the lack of good automation for server build-outs is causing the deployment lag here.

The network is designed to withstand failures, even multiple failures, but we've got many broken things right now that need fixing. Once those are fixed, I would really love to see some folks get behind improving our monitoring, deployment, and diagnostic automation. Networks like this won't scale unless they're almost completely automated and simple to manage. I would not mind at all if we even rolled back some features until we can re-implement them in 100% automated ways.

As important as all this is, I still think the deep penetration project takes precedence, so I can't drop that work in favor of this. Aside from helping out on the simple break-fix stuff, I mean.

--Bart

On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote:

Thanks for the update, Nigel.

On Wed, Mar 9, 2016 at 10:17 PM, Nigel Vander Houwen <nigel@nigelvh.com> wrote:

Hello All,

Just wanted to send out a quick notice here. We’ve had a failure at our Seattle edge router, which we’re still investigating. In the meantime, our Tukwila edge router is still providing connectivity, but you may notice higher latencies or issues reaching things. If you find things you can’t reach, please let me know, as we’d like to make sure the redundancy is working, while we’re working to resolve the issues we’re investigating with the Seattle edge router.

Nigel
_______________________________________________
PSDR mailing list
PSDR@hamwan.org
http://mail.hamwan.net/mailman/listinfo/psdr

--

Ryan Turner