Hmm, that's not the whole story though. If it were just the one router
failure (in reality a hypervisor failure), we'd be in a much better
position, but it's combined with two modem failures. We had the
ETiger->SnoDEM modem die over the winter, and it needs
replacement. That link has been down for a month or more now. More
recently, the Tukwila->Baldi modem has been losing connectivity
frequently. We've implemented an automatic mitigation for that, but
it still produces sporadic short downtime windows of a few minutes.
I'd just like to move that modem to a NetMetal 5. Our
servers are also being affected by instability in the Quagga routing
software. We need to replace this with a more stable alternative,
like BIRD. Lastly, the Baldi emergency uplink is configured to go
only to Westin and Corvallis, not Tukwila.

We could also have avoided the DNS outages if the anycast groups were
populated with more of the available servers. I believe the lack of
good automation for server build-outs is causing the deployment lag
here.

The network is designed to withstand failures, even multiple
failures, but we've got many broken things right now that need
fixing. Once those are fixed, I would really love to see some folks
get behind improving our monitoring, deployment and diagnostic
automation. Networks like this won't scale unless they're almost
completely automated and simple to manage. I would not mind at all
if we rolled back some features until we can re-implement them in
fully automated ways.

As important as all this is, I still think the deep penetration
project takes precedence, so I can't drop that work in favor of
this. Aside from helping out on the simple break-fix stuff, I mean.

--Bart

On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote:
Thanks for the update, Nigel.
_______________________________________________
PSDR mailing list
PSDR@hamwan.org
http://mail.hamwan.net/mailman/listinfo/psdr