Hello All,
Just wanted to send out a quick notice here. We’ve had a failure at our Seattle edge router, which we’re still investigating. In the meantime, our Tukwila edge router is still providing connectivity, but you may notice higher latencies or issues reaching things. If you find things you can’t reach, please let me know, as we’d like to make sure the redundancy is working, while we’re working to resolve the issues we’re investigating with the Seattle edge router.
Nigel
Thanks for the update, Nigel.
On Wed, Mar 9, 2016 at 10:17 PM, Nigel Vander Houwen <nigel@nigelvh.com> wrote:
-- Ryan Turner
Hmm that's not the whole story though. If it were just the 1 router failure (in reality a hypervisor failure), we'd be in a much better position, but it's combined with 2 other modem failures. We had the ETiger->SnoDEM modem die over the winter, and it needs replacement. That link has been down for a month or more now. And most recently we're having the Tukwila->Baldi modem lose connectivity frequently. We've implemented an automatic mitigation for that, but it still produces sporadic short downtime windows of a few minutes. I'd just like to move that modem to a NetMetal 5. Our servers are also being affected by instability in the Quagga routing software. We need to replace this with a more stable alternative, like BIRD. Lastly, the Baldi emergency uplink is only configured to go to Westin and Corvallis, but not Tukwila.
We could have avoided DNS outages too, if the anycast groups were populated with more of the available servers. I believe lack of good automation for server build-outs is causing the deployment lag here.
The network is designed to withstand failures, even multiple failures, but we've got many broken things right now that need fixing. After that fixing, I would really love to see some folks get behind improving our monitoring, deployment and diagnostic automation. Networks like this won't scale unless they're nearly completely automated and simple to manage. I would not mind at all if we even rolled back some features until we can get them re-implemented in 100% automated ways.
As important as all this is, I still think the deep penetration project takes precedence, so I can't drop that work in favor of this. Aside from helping out on the simple break-fix stuff, I mean.
--Bart
On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote:
_______________________________________________ PSDR mailing list PSDR@hamwan.org http://mail.hamwan.net/mailman/listinfo/psdr
Bart,
You touch on a few things that have been “niggling” at the back of my mind for quite a while now – most of them come down in one way or another to overall reliability (of HamWAN) for EMCOMM, which most know has been my main driver for supporting the effort.
There’s been a TON of great work done and quite frankly, I’ve been amazed that HamWAN has gone as far and fast as it has, particularly for a “ham” effort.
At the same time we’ve slowly been adding and attracting the attention of various EMCOMM organizations with the promise and potential of redundant, reliable, resilient communications when “the big one” hits. Obviously not everything in HamWAN is expected to survive a major quake or other event, but even pockets of reliable, high-speed communication are more than what can be accomplished via voice relays.
All of which brings us back to the current outage and discussion. There have been several outages in key places since we began. Last year SnoDEM was all but stranded due to a Haystack modem failure and other events at the same time. Now we have a similar situation in a different place brought on by multiple failures or weaknesses. In other instances I’ve been told we’ve had outages via misconfigured devices or other reasons. Even in a perfect world, human error happens.
I believe HamWAN would benefit from somewhat of a shift in operating philosophy that would create two separate departments or divisions – operations and development.
Operations responsibilities
1) Provide day to day monitoring of network resources and conditions
2) Manage (admin) those portions of the network that are designated as “in production”. This should be the majority of the network.
3) Provide communications and coordination of network maintenance
4) Maintain an active inventory of all operational (production) sites, site hardware, and site access information.
5) Maintain and manage all production site device configurations and config change management.
6) Coordinate implementation of new functionality introduced by the Development department with appropriate monitoring, end-user communication, etc
7) Recommend topics and technologies to be explored by the Development team to enhance operational stability and delivery of new features to the network.
8) Document technologies, methods, and tools selected for use (and why) from an operational standpoint.
9) Maintain an active inventory of spare hardware to support all sites.
10) Establish a plan to correct ALL key site failures within XXXX days.
11) Coordinate with Development to actively inject and test network failures and redundancy capabilities.
12) Coordinate with Development to enhance HamWAN’s ability to operate in “pockets” when portions of the network fail in an earthquake – i.e. – each “island” stays operational with as many services as possible
Development responsibilities
1) Continued exploration of new hardware, software, and network management tools (Quagga vs BIRD, Metals vs QRTs, etc)
2) Conduct experimentation with new hardware and software on separate network resources where possible, or in coordination with Operations on the larger network (more on this below).
3) Document technologies, methods, and tools explored and indicate pros/cons of each where possible.
4) Continued exploration, analysis, and documentation of available antenna and shielding designs
5) Exploration of new antenna designs and/or other hardware?
6) Exploration of new frequencies and how they are affected by terrain, vegetation, weather, etc
7) This particular list can go on FOREVER
The distinction here is largely mental, but it’s important. It is entirely probable to have the same people in both groups, yet having the separation is important if HamWAN wishes to be taken seriously as a services provider to the EMCOMM community. Any benefits from that would also improve service for ALL HamWAN users.
Having EMCOMM onboard is important. Not only does it provide a needed service to them, but if critical mass can be achieved it gives HamWAN access to multiple sites in every city and county. In turn though, HamWAN as a network needs to be reliable in the “customer’s” eyes. This means that infrastructure is managed with uptime as the highest priority, experimentation is managed to minimize adverse production impacts, and equipment failures are identified and corrected quickly.
This is admittedly a fair amount of work. Much of it I suspect is already underway – maybe just not quite in this format. Additional help will definitely be useful. Everyone involved only has so much time available, and they should be able to focus on those items that are important to them. I believe the above framework (or something similar) begins to put some useful structure in place that continues to shape HamWAN from being the “wild west” of amateur and network “geek” exploration into the reliable, commercial grade, disaster resistant, amateur platform it aims to be - while still allowing amateurs to push the limits of technology like they are meant to.
If the above (or something similar) is of interest to the current directors and group as a whole, we can easily create a similar worklist from which individuals on the sidelines can start picking things they can help bring about.
Just ideas. Not saying they’re perfect, but it’s a start. Any other thoughts?
Cheers,
Rob Salsgiver – NR3O
From: PSDR [mailto:psdr-bounces@hamwan.org] On Behalf Of Bart Kus
Sent: Friday, March 11, 2016 12:56 AM
To: psdr@hamwan.org
Subject: Re: [HamWAN PSDR] Service Impact Notice
Bart, Rob,
The biggest problem I see here is time resources. I brought this up to Bart off list, but there’s a continuing struggle to either have time to do the work yourself, or get other people to do the work.
I deployed all of our monitoring and logging infrastructure, and I can say as a fact it’s been a struggle to get anyone to even do the basic work of adding new devices to the existing monitoring system, even after providing tutorials. This has gotten a bit better in very recent history, but it remains an issue.
Automation is absolutely something we need to put more work into. Ryan and I have already put a bunch of work into this, which again, we have struggled to get folks to pick up, use, and contribute to.
Modems breaking happens, and site access can be a significant problem. The East Tiger-SnoDEM link that Bart called out has been known down, but we can’t feasibly get that replaced in the middle of winter. Hopefully soon that can be taken care of.
We can try to treat this like a production network all we want, but the reality is that we have effectively one part time staff trying to do, as Rob put it, both the Operations and Development work.
The reality is that this is a network with VERY limited admin resources, which get split up to do various important things, the 900MHz work included, but that leaves even less available to do any day to day work. This isn’t our full time job, we’re not paid, we all have lives and families, we have VERY few people that actually volunteer to do any of the work, so the reality is there’s a lot we have a hard time getting to. Reality puts us much closer to “best effort” than “production”, and until we get more time/resources to do the work, it’s going to continue to be a struggle.
If folks want to volunteer, I’d be happy to put them on improvements in monitoring, automation, and fixing things in the existing production network.
Nigel
On Mar 11, 2016, at 09:11, Rob Salsgiver <rob@nr3o.com> wrote:
I'll echo the time constraints. We're looking at core infrastructure deployment for Georgia, USA and have a lot of generalized interest in the project. We're experiencing similar volunteer constraints and have yet to begin full operations. I can only picture how physical network operations are going to proceed and suffer once those deployments start.
Regards,
Sam Kuonen, KK4UVL
On Fri, Mar 11, 2016, 12:29 PM Nigel Vander Houwen <nigel@nigelvh.com> wrote:
After reading Nigel and Sam's responses, I can *TOTALLY* sympathize, having "been there and done that" myself. At a minimum, I guess what is needed is to set appropriate (i.e. realistic) expectations for emcomm (and other) organizations who are contemplating hooking up to HamWAN for the expected benefit. Quite frankly, this is why, from the outset, I requested a "/24" network. I perceived that HamWAN is pretty much what it is, and that to deliver on the reliability hopes and expectations of the inexperienced, non-technical folks who were excited about HamWAN, we would have to have non-HamWAN redundancy built into anything we implement. One of the realities I am facing is that to get the $upport we need to buy equipment and implement a broadband network -- sold on the benefits of capability *AND* reliability -- I/we need to capitalize on their excitement and do our best to fulfill the needs for which we are commissioned.
As a side note, one of the issues in my discussion about network address space allocation was whether we would really need a "/24" all to ourselves. That remains to be seen, but given the limitations of the space considering the number of municipalities in the area -- and indeed beyond for whom the "44-net" has been allocated -- a plan for the emcomm community that allows for non-HamWAN "backup" is sorely needed. I think it may be that the "/24" we were allocated may very well evolve into a "sub-regional" network with its own backup. For example, maybe our "/24" eventually encompasses the Seattle Eastside communities and has both a link to the larger HamWAN network as well as a "local" backup gateway to the Internet for the sub-region. Again, it's all about setting realistic expectations for what HamWAN is and, equally important, what it isn't (yet anyway).
I think this would be a great year to have a gathering of the HamWAN implementers at DCC. It's in St. Petersburg this fall. https://www.tapr.org/dcc.html It would be a good time to exchange ideas and best practices.
On Fri, Mar 11, 2016 at 10:25 AM, Sam Kuonen <sam.kuonen@gmail.com> wrote:
I'll echo the time constraints. We're looking at core infrastructure deployment for Georgia, USA and have a lot of generalized interest in the project.
We're experiencing similar volunteer constraints and have yet to begin full operations. I can only picture how physical network operations are going to proceed and suffer once those deployments start.
Regards,
Sam Kuonen, KK4UVL
John D. Hays K7VE PO Box 1223, Edmonds, WA 98020-1223 <http://k7ve.org/blog> <http://twitter.com/#!/john_hays> <http://www.facebook.com/john.d.hays>
On 3/11/16 1:54 PM, John D. Hays wrote:
I think this would be a great year to have a gathering of the HamWAN implementers at DCC. It's in St. Petersburg this fall. https://www.tapr.org/dcc.html
We could do a BOF (break out forum) if you want, and perhaps a dinner. I'll be at Dayton this year talking about HamWAN, and I'd love to present a few slides of what people in the various regions are doing.
--
Bryan Fields
727-409-1194 - Voice
727-214-2508 - Fax
http://bryanfields.net
Replying to the latest fully-quoted message instead of Ed's, but Ed, your observations are spot on.
Rob, I think the concept of network ops is finished both in the industry and for HamWAN. In the industry, we're working at such enormous scales that you cannot possibly staff enough people to do any of the ops tasks manually. Even if you did, the unavoidable human failure rate would cripple your resulting system.
In HamWAN, we have the same problems as industry (albeit at a microscopic scale), but additionally requiring staff to operate things is an adoption hurdle. We don't have the incentive of wages to staff these required job functions. Combine that with a general lack of computer/network knowledge in the ham community and you're doomed, even if you did manage to gather enough well-meaning people to support you.
This problem isn't unique to the Puget Sound Data Ring. Everyone else trying to implement a HamWAN will face the same challenges, as Ed correctly points out. We need to make the leap from phase 1 to phase 2 (see Ed's email), because we've been successful enough (yay!) to grow to such a scale that we're starting to fail at phase 1.
HamWAN has so far delivered interfacing standards, and a bunch of docs that educate people on suggestions (not standards) for how to configure the non-standardized parts of your network. That's a good starting point, but now that we know our standard ideas work reasonably well, it's time to take on the additional task of making them self-implementing in new HamWAN instances. This means a lot of software development.
And therein lies the problem. In this project we have maybe 2 people who can help write the software required. For us to successfully make the leap from phase 1 to phase 2, we've got to become attractive to people who write software. A team of 6-10 folks would give us a good chance at making the leap. I'm not sure how to do recruiting for this, but don't let that be the seminal question of this email.
I'd like to hear from people if they agree with the direction shift I've proposed here.
--Bart
On Fri, Mar 11, 2016, 12:29 PM Nigel Vander Houwen <nigel@nigelvh.com> wrote:
Bart, Rob,
The biggest problem I see here is time resources. I brought this up to Bart off list, but there’s a continuing struggle to either have time to do the work yourself, or get other people to do the work.
I deployed all of our monitoring and logging infrastructure, and I can say as a fact it’s been a struggle to get anyone to even do the basic work of adding new devices to the existing monitoring system, even after providing tutorials. This has gotten a bit better in very recent history, but it remains an issue.
Automation is absolutely something we need to put more work into. Ryan and I have already put a bunch of work into this, which again, we have struggled to get folks to pick up, use, and contribute to.
Modems breaking happens, and site access can be a significant problem. The East Tiger-SnoDEM link that Bart called out has been known down, but we can’t feasibly get that replaced in the middle of winter. Hopefully soon that can be taken care of.
We can try to treat this like a production network all we want, but the reality is that we have effectively one part time staff trying to do, as Rob put it, both the Operations and Development work.
The reality is that this is a network with VERY limited admin resources, which get split up to do various important things, the 900MHz work included, but that leaves even less available to do any day to day work. This isn’t our full time job, we’re not paid, we all have lives and families, we have VERY few people that actually volunteer to do any of the work, so the reality is there’s a lot we have a hard time getting to. Reality puts us much closer to “best effort” than “production”, and until we get more time/resources to do the work, it’s going to continue to be a struggle.
If folks want to volunteer, I’d be happy to put them on improvements in monitoring, automation, and fixing things in the existing production network.
Nigel
Thanks Ed/Sam/Nigel/Bart/all for the thoughts.
First and foremost I want to make sure that I am in NO way criticizing the huge amount of effort that has gotten us this far, OR the efforts of ANY individual who put their time and energy into the project. As a past casualty of throwing myself into “the cause” I know all too well the toll it can take, and I sincerely appreciate the time everyone has invested.
It is also precisely for that reason that I brought the topic up. Like many other well-intended projects, HamWAN has reached a threshold that it must pass in order to continue to grow, or that it must stay below and be content with what it has achieved. Without some form of change, HamWAN (and HamWANs throughout) face the same risks as a lot of other amateur-related efforts – namely to “die on the vine” due to lack of support and eventual “creator” exhaustion.
I would also hate to see all of this effort, goodwill, and potential be wasted if HamWAN were to fold or otherwise fail to reach its potential. The vision is valid, and the potential in the 2nd and 3rd generations of amateur WAN connectivity and communications is truly enormous. It is only fair that we continue to mold the idea of HamWAN as necessary to see that the work that has been done is not in vain, and that the efforts of the creators are rewarded and grown in the future.
/sap /sermon
I submit that the separation of Operations and Development is not as much a staffing need as it is a MENTAL need – although additional people devoted to both areas would obviously be beneficial.
In the beginning EMCOMM was identified as an amateur-related activity that could really benefit from what HamWAN could bring to the table. Conversely, EMCOMM also has the potential to provide HamWANs with access to high value sites to deploy and establish coverage. These needs are highly symbiotic and also carry responsibilities.
There have been numerous amateur efforts to implement digital communications that have been successful in pockets yet failures overall. D-star, packet, you name it. Most of it boils down to complexity for the end-point operator. How many hams know (or care to know) how to tweak a TNC, only to have RF, radio parameters, TNC parameters, or the operator on the other end keep a message from being completed? Over the years a lot of time and expense has been put into amateur “solutions” by Served Agencies, only to have them lie unused today.
HamWAN shares some of those risks, but not all. The use of current (and cutting edge) networking technology eliminates one of the pitfalls of packet – namely obsolescence. Yes, packet still has its place but it’s not high speed, high volume data communications.
In order to be accepted in EOCs and other EMCOMM sites, HamWAN must “mentally” be better than the alternatives. It has to be reliable, robust, and treated like a serious disaster resource platform. While one of the attractions of HamWAN is to provide an environment for amateur experimentation, it CANNOT be done on parts of a network that we are asking EMCOMM entities to help fund, provide sites for, and operate on behalf of. If EOCs are providing cell sites that are down 10, 20, or 50% of the time due to network “experimentation”, misconfigurations, or hardware failures that we don’t have the manpower to support, there is little incentive to spend public money on yet another amateur “solution” that will not work when it’s actually needed.
Conversely, we need to “build” an environment that requires as little human “babysitting” as possible, while allowing the necessary experimentation to continue advancing the art.
As mentioned in a couple of emails, expectation management is key. This applies both ways. We are asking to be invited, but also need to evolve to where we (HamWAN) can be counted on.
We can go the path of setting up parallel but separate networks for EMCOMM and HamWAN as an experimentation platform. Unfortunately we either double the hardware at each site to support both (and the support needs), or we lose potential sites due to lack of supporting entity(ies).
I argue that having EMCOMM supported as a key component of ALL HamWANs in a single network is the most beneficial to both communities. EMCOMM gets a disaster communications solution that doesn’t exist elsewhere at a cost that doesn’t sink taxpayer funded budgets, and for that “price” amateur-based HamWANs get access to premium sites that would not otherwise be available.
If this model still makes sense now that HamWAN is in year 3 or 4, then it is merely a question of how best to set things up to continue moving forward.
The points of separating Operations and Development are more mental and organizational in nature. By creating the distinction, you accomplish the following:
1) Offload (with documentation and guidance) the daily maintenance tasks to others in manageable bites – even more so with greater automation
2) Establish that network stability and reliable operation take priority over all else.
3) Development and change management happen under a framework within a group / email list. i.e. – avoid the situations where changes are implemented partway because the only guy working on it has 20 minutes after work before going to bed, and then something elsewhere goes down via unintended consequences. It’s not a slam on anyone, it’s basic change management that we all use at work.
4) Involve more people by creating roles that are less technically encompassing in scope – i.e. – not everyone needs to be a telecom, Microsoft, or Amazon network engineer to be able to contribute.
5) And this list can go on…
To Bart’s most recent points I have no disagreement that as much as possible can (and should) be automated.
At the end of the day though, somebody (or a few somebodies) has to call the shots on when and how network changes, maintenance, and support happen. This is more from a standpoint of making fundamental changes to the infrastructure design. Easy examples would be changes in routing protocols, global router settings changes, manual changes to accommodate temporary network conditions or outages that may unintentionally conflict with automated features, etc.
I didn’t get a copy of Ed’s email that discussed Phase 1 and Phase 2 issues, so I’ll have to pass on that. Initially I didn’t get Nigel’s response either until I got it in one of Bart’s replies. Not sure what’s going on there.
Where do we (HamWAN) have the current list of topics that are under development and/or need help? Probably 50% of the time I have IRC running in the background at work, and most of the conversations I see with regard to a specific topic are people engaged in that particular task, but never a running list of what’s being worked on or what could use help. While I haven’t been through the reworked website with a fine tooth comb, I don’t remember seeing those topics there either. The flip side is I also understand that the few who have been working on them typically don’t want to take the time to keep publishing lists either, so it’s somewhat of a catch-22.
In my personal case, in the dinosaur age I was a pretty skilled admin and could hold my own in most areas. I’ve been away quite a while in management roles and would have some catching up to do for sure, but probably could be of use somewhere. Once upon a time I knew my way around Cisco IOS, today I’m re-learning Mikrotik <g>. Others on the list I’m sure have other skills that might be able to help, but where do they start that they can be useful?
Lastly the topic of writing software. I understand the advantages of having something that does exactly what you want and it’s also a part of experimentation and advancing the art.
At the same time, anything that is custom either needs the creator to be immortal to continue support into the future, or it needs to be documented to the level of being able to train your replacement. Google works wonders for looking up support information on off the shelf products. Not as well for custom software. I’m not saying it can’t be used, only that there are definite trade-offs.
Ok. I’ll stop for now. Thanks for taking the time to read & respond.
Cheers,
Rob
I believe lack of good automation for server build-outs is causing the deployment lag here. The network is designed to withstand failures, even multiple failures, but we've got many broken things right now that need fixing. After that fixing, I would really love to see some folks get behind improving our monitoring, deployment and diagnostic automation. Networks like this won't scale unless they're nearly completely automated and simple to manage. I would not mind at all if we even rolled back some features until we can get them re-implemented in 100% automated ways. As important as all this is, I still think the deep penetration project takes precedence, so I can't drop that work in favor of this. Aside from helping out on the simple break-fix stuff, I mean. --Bart On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote: Thanks for the update, Nigel. On Wed, Mar 9, 2016 at 10:17 PM, Nigel Vander Houwen <nigel@nigelvh.com <mailto:nigel@nigelvh.com> > wrote: Hello All, Just wanted to send out a quick notice here. We’ve had a failure at our Seattle edge router, which we’re still investigating. In the meantime, our Tukwila edge router is still providing connectivity, but you may notice higher latencies or issues reaching things. If you find things you can’t reach, please let me know, as we’d like to make sure the redundancy is working, while we’re working to resolve the issues we’re investigating with the Seattle edge router. 
Nigel _______________________________________________ PSDR mailing list <mailto:PSDR@hamwan.org> PSDR@hamwan.org <http://mail.hamwan.net/mailman/listinfo/psdr> http://mail.hamwan.net/mailman/listinfo/psdr -- Ryan Turner _______________________________________________ PSDR mailing list <mailto:PSDR@hamwan.org> PSDR@hamwan.org <http://mail.hamwan.net/mailman/listinfo/psdr> http://mail.hamwan.net/mailman/listinfo/psdr _______________________________________________ PSDR mailing list PSDR@hamwan.org <mailto:PSDR@hamwan.org> http://mail.hamwan.net/mailman/listinfo/psdr _______________________________________________ PSDR mailing list PSDR@hamwan.org <mailto:PSDR@hamwan.org> http://mail.hamwan.net/mailman/listinfo/psdr _______________________________________________ PSDR mailing list PSDR@hamwan.org <mailto:PSDR@hamwan.org> http://mail.hamwan.net/mailman/listinfo/psdr
HamWAN isn't some magical network that never has failures. If that's the impression emcomm organizations are being sold, then that needs to stop. HamWAN is just as susceptible to component failure as any commercial network out there. The main difference is that we get to design and scale the network to have an emphasis on reliability instead of maximizing subscribers for profit.
We also have the huge advantage of maintaining our own infrastructure. That way, when parts fail they can be fixed by someone from our own community instead of waiting for a corporation to prioritize our issue in relation to their other customers. Can you imagine how long it would take a commercial provider to fix your issue after the big one? On the other hand, hams are always prepared for the big one and can likely be deployed much more quickly.
As the PSDR gets larger, it gets more resilient. That's why it took several things falling over simultaneously to cause an impact.
While I agree with some of the ideas in this thread, I think we already meet the bar for a lot of them. We've always treated the network as "production" with the potential for customer impact, which is why this thread was started in the first place. I don't believe we've ever had an impacting event because someone was "experimenting" with something.
In the end, this really is an experimental network and needs to remain so in order to recruit and train more hams into it. The emcomm organizations shouldn't be excited about HamWAN because it's more reliable than their commercial networks. They should instead be excited by the fact that it's another community that can support them in a disaster.
-Cory NQ1E
On Fri, Mar 11, 2016 at 1:12 PM, Bart Kus <me@bartk.us> wrote:
Replying to the latest fully-quoted message instead of Ed's, but Ed, your observations are spot on.
Rob, I think the concept of network ops is finished both in the industry and for HamWAN. In the industry, we're working at such enormous scales that you cannot possibly staff enough people to do any of the ops tasks manually. Even if you did, the unavoidable human failure rate would cripple your resulting system. In HamWAN, we have the same problems as industry (albeit at a microscopic scale), but additionally requiring staff to operate things is an adoption hurdle. We don't have the incentive of wages to staff these required job functions. Combine that with a general lack of computer/network knowledge in the ham community and you're doomed, even if you did manage to gather enough well-meaning people to support you.
This problem isn't unique to the Puget Sound Data Ring. Everyone else trying to implement a HamWAN will face the same challenges, as Ed correctly points out. We need to make the leap from phase 1 to phase 2 (see Ed's email), because we've been successful enough (yay!) to grow to such a scale that we're starting to fail at phase 1.
HamWAN has so far delivered interfacing standards, and a bunch of docs that educate people on suggestions (not standards) for how to configure the non-standardized parts of your network. That's a good starting point, but now that we know our standard ideas work reasonably well, it's time to take on the additional task of making them self-implementing in new HamWAN instances. This means a lot of software development.
And therein lies the problem. In this project we have maybe 2 people who can help write the software required. For us to successfully make the leap from phase 1 to phase 2, we've got to become attractive to people who write software. A team of 6-10 folks would give us a good chance at making the leap.
I'm not sure how to do recruiting for this, but don't let that be the seminal question of this email. I'd like to hear from people if they agree with the direction shift I've proposed here.
--Bart
On 3/11/2016 10:25 AM, Sam Kuonen wrote:
I'll echo the time constraints. We're looking at core infrastructure deployment for Georgia, USA and have a lot of generalized interest in the project.
We're experiencing similar volunteer constraints and have yet to begin full operations. I can only picture how physical network operations are going to proceed and suffer once those deployments start.
Regards,
Sam Kuonen, KK4UVL
On Fri, Mar 11, 2016, 12:29 PM Nigel Vander Houwen <nigel@nigelvh.com> wrote:
Bart, Rob,
The biggest problem I see here is time resources. I brought this up to Bart off list, but there’s a continuing struggle to either have time to do the work yourself, or get other people to do the work.
I deployed all of our monitoring and logging infrastructure, and I can say as a fact it’s been a struggle to get anyone to even do the basic work of adding new devices to the existing monitoring system, even after providing tutorials. This has gotten a bit better in very recent history, but it remains an issue.
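The chore described above, noticing which deployed devices never made it into monitoring, can be partly automated. The following is only a minimal sketch, assuming the site inventory and the monitoring system can each be exported as a plain list of hostnames; the function name and the example.net hostnames are hypothetical and do not describe HamWAN's actual tooling.

```python
def missing_from_monitoring(inventory, monitored):
    """Return inventory hostnames that have no monitoring entry, sorted."""
    def normalise(names):
        # Ignore case and stray whitespace so "Edge1 " matches "edge1".
        return {name.strip().lower() for name in names if name.strip()}

    return sorted(normalise(inventory) - normalise(monitored))


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    inventory = ["edge1.example.net", "sector2.example.net", "ptp3.example.net"]
    monitored = ["EDGE1.example.net"]
    for host in missing_from_monitoring(inventory, monitored):
        print("not monitored:", host)
```

Run on a schedule, a check like this would turn "please remember to add your device" into a short list someone can act on.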
Automation is absolutely something we need to put more work into. Ryan and I have already put a bunch of work into this, which again, we have struggled to get folks to pick up, use, and contribute to.
Modems breaking happens, and site access can be a significant problem. The East Tiger-SnoDEM link that Bart called out has been known down, but we can’t feasibly get that replaced in the middle of winter. Hopefully soon that can be taken care of.
We can try to treat this like a production network all we want, but the reality is that we have effectively one part time staff trying to do, as Rob put it, both the Operations and Development work.
The reality is that this is a network with VERY limited admin resources, which get split up to do various important things, the 900MHz work included, but that leaves even less available to do any day to day work. This isn’t our full time job, we’re not paid, we all have lives and families, we have VERY few people that actually volunteer to do any of the work, so the reality is there’s a lot we have a hard time getting to. Reality puts us much closer to “best effort” than “production”, and until we get more time/resources to do the work, it’s going to continue to be a struggle.
If folks want to volunteer, I’d be happy to put them on improvements in monitoring, automation, and fixing things in the existing production network.
Nigel
On Mar 11, 2016, at 09:11, Rob Salsgiver <rob@nr3o.com> wrote:
Bart,
You touch on a few things that have been “niggling” at the back of my mind for quite a while now – most of them come down in one way or another to overall reliability (of HamWAN) for EMCOMM, which most know has been my main driver for supporting the effort.
There’s been a TON of great work done and quite frankly, I’ve been amazed that HamWAN has gone as far and fast as it has, particularly for a “ham” effort.
At the same time we’ve slowly been adding and attracting the attention of various EMCOMM organizations with the promise and potential of redundant, reliable, resilient communications when “the big one” hits. Obviously not everything in HamWAN is expected to survive a major quake or other event, but even pockets of reliable, high-speed communication are more than what can be accomplished via voice relays.
All of which brings us back to the current outage and discussion. There have been several outages in key places since we began. Last year SnoDEM was all but stranded due to a Haystack modem failure and other events at the same time. Now we have a similar situation in a different place brought on by multiple failures or weaknesses. In other instances I’ve been told we’ve had outages via misconfigured devices or other reasons. Even in a perfect world, human error happens.
I believe HamWAN would benefit from somewhat of a shift in operating philosophy that would create two separate departments or divisions – operations and development.
Operations responsibilities
1) Provide day to day monitoring of network resources and conditions
2) Manage (admin) of those portions of the network that are designated as “in production”. This should be the majority of the network.
3) Provide communications and coordination of network maintenance
4) Maintain an active inventory of all operational (production) sites, site hardware, and site access information.
5) Maintain and manage all production site device configurations and config change management.
6) Coordinate implementation of new functionality introduced by the Development department with appropriate monitoring, end-user communication, etc
7) Recommend topics and technologies to be explored by the Development team to enhance operational stability and delivery of new features to the network.
8) Document technologies, methods, and tools selected for use (and why) from an operational standpoint.
9) Maintain an active inventory of spare hardware to support all sites.
10) Establish a plan to correct ALL key site failures within XXXX days.
11) Coordinate with Development to actively inject and test network failures and redundancy capabilities.
12) Coordinate with Development to enhance HamWAN’s ability to operate in “pockets” when portions of the network fail in an earthquake – i.e. – each “island” stays operational with as many services as possible
Development responsibilities
1) Continued exploration of new hardware, software, and network management tools (Quagga vs BIRD, Metals vs QRTs, etc)
2) Conduct experimentation with new hardware and software on separate network resources where possible, or in coordination with Operations on the larger network (more on this below).
3) Document technologies, methods, and tools explored and indicate pros/cons of each where possible.
4) Continued exploration, analysis, and documentation of available antenna and shielding designs
5) Exploration of new antenna designs and/or other hardware?
6) Exploration of new frequencies and how they are affected by terrain, vegetation, weather, etc
7) This particular list can go on FOREVER
The distinction here is largely mental, but it’s important. It is entirely probable to have the same people in both groups, yet having the separation is important if HamWAN wishes to be taken seriously as a services provider to the EMCOMM community. Any benefits from that would also improve service for ALL HamWAN users.
Having EMCOMM onboard is important. Not only does it provide a needed service to them, but if critical mass can be achieved it gives HamWAN access to multiple sites in every city and county. In turn though, HamWAN as a network needs to be reliable in the “customer’s” eyes. This means that infrastructure is managed with uptime as the highest priority, experimentation is managed to minimize adverse production impacts, and equipment failures are identified and corrected quickly.
This is admittedly a fair amount of work. Much of it I suspect is already underway – maybe not just quite in this format. Additional help will definitely be useful. Everyone involved only has so much time available, and they should be able to focus on those items that are important to them. I believe the above framework (or something similar) begins to put some useful structure in place that continues to shape HamWAN from being the “wild west” of amateur and network “geek” exploration into the reliable, commercial grade, disaster resistant, amateur platform it envisions to be - while still allowing amateurs to push the limits of technology like they are meant to.
If the above (or something similar) is of interest to the current directors and group as a whole, we can easily create a similar worklist that individuals on the sidelines can start picking things they can help with to help bring about.
Just ideas. Not saying they’re perfect, but it’s a start. Any other thoughts?
Cheers,
Rob Salsgiver – NR3O
From: PSDR [mailto:psdr-bounces@hamwan.org] On Behalf Of Bart Kus
Sent: Friday, March 11, 2016 12:56 AM
To: psdr@hamwan.org
Subject: Re: [HamWAN PSDR] Service Impact Notice
Hmm that's not the whole story though. If it were just the 1 router failure (in reality a hypervisor failure), we'd be in a much better position, but it's combined with 2 other modem failures. We had the ETiger->SnoDEM modem die over the winter, and it needs replacement. That link has been down for a month or more now. And most recently we're having the Tukwila->Baldi modem lose connectivity frequently. We've implemented an automatic mitigation for that, but it still produces sporadic short downtime windows of a few minutes. I'd just like to move that modem to a NetMetal 5. Our servers are also being affected by instability in the Quagga routing software. We need to replace this with a more stable alternative, like BIRD. Lastly, the Baldi emergency uplink is only configured to go to Westin and Corvallis, but not Tukwila.
We could have avoided DNS outages too, if the anycast groups were populated with more of the available servers. I believe lack of good automation for server build-outs is causing the deployment lag here.
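One way to catch under-populated anycast groups before an outage exposes them is a periodic membership check. This is a rough sketch under stated assumptions: the group data is a hand-built dictionary mapping each service to its member servers and an up/down flag, the threshold of two healthy members is arbitrary, and none of the names reflect how HamWAN actually tracks its anycast groups.

```python
def thin_anycast_groups(groups, min_members=2):
    """Return {service: healthy_count} for groups below min_members.

    `groups` maps a service name to a dict of {server_name: is_up}.
    """
    thin = {}
    for service, members in groups.items():
        # Count only the servers currently reported as up.
        healthy = sum(1 for is_up in members.values() if is_up)
        if healthy < min_members:
            thin[service] = healthy
    return thin


if __name__ == "__main__":
    # Hypothetical services and servers, for illustration only.
    groups = {
        "dns": {"server-a": True, "server-b": False},
        "ntp": {"server-a": True, "server-b": True},
    }
    for service, count in thin_anycast_groups(groups).items():
        print(f"{service}: only {count} healthy member(s)")
```

A report like this would flag a group that is one failure away from an outage while there is still time to add servers.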
The network is designed to withstand failures, even multiple failures, but we've got many broken things right now that need fixing. After that fixing, I would really love to see some folks get behind improving our monitoring, deployment and diagnostic automation. Networks like this won't scale unless they're nearly completely automated and simple to manage. I would not mind at all if we even rolled back some features until we can get them re-implemented in 100% automated ways.
As important as all this is, I still think the deep penetration project takes precedence, so I can't drop that work in favor of this. Aside from helping out on the simple break-fix stuff, I mean.
--Bart
On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote:
Thanks for the update, Nigel.
On Wed, Mar 9, 2016 at 10:17 PM, Nigel Vander Houwen <nigel@nigelvh.com> wrote:
Hello All,
Just wanted to send out a quick notice here. We’ve had a failure at our Seattle edge router, which we’re still investigating. In the meantime, our Tukwila edge router is still providing connectivity, but you may notice higher latencies or issues reaching things. If you find things you can’t reach, please let me know, as we’d like to make sure the redundancy is working, while we’re working to resolve the issues we’re investigating with the Seattle edge router.
Nigel
--
Ryan Turner
_______________________________________________
PSDR mailing list
PSDR@hamwan.org
http://mail.hamwan.net/mailman/listinfo/psdr
In response to Nigel's points -- contributions of "time" and general participation should also be encouraged. I am happy to "do my part" as I get more up-to-speed on things.
Rob (and all), I apologize for the length of this e-mail and encourage any who are interested to grab a beverage of their preference and maybe give a listen when you can... ;-)
It was funny to me that as I was reading your proposed overview of ideas on how to improve on HamWAN's reliability through a culture shifted towards (essentially) keeping "development" out of the "production" network core, I was thinking "yes, and ..." and as I continued reading, you were already ahead and literally wrote what I had been thinking. So, I would say "Hear! Hear!" because as one who is essentially representing the emcomm community here, I could not agree more. A few "high-up" folks in the City of Redmond were totally jazzed about HamWAN and said "we want that," but to really deliver on the "unsaid expectations" behind such statements it is imperative to have a culture and mindset along the lines of what you are suggesting.
Years ago, I started and built the ISP Northwest Nexus and that was EXACTLY our philosophy. Indeed, we explained it to an advertising agency so they could come up with some appropriate ideas for promoting our service. The campaign we went with -- our most successful over the years we operated before being acquired -- had the tagline "When reliability just isn't negotiable." with the picture of a bungee jumper in mid-dive (I think inspired by us talking about customers relying on our "link" to the net being "solid infrastructure" -- the rope attached to a solid structure as it were -- with backup generators, monitoring, etc.). That's the mindset of folks in emcomm.
My guess -- again I'm a newcomer here -- is that there are some that are striving for that and some who, in the grand tradition of amateur radio, are keen on experimenting with the technology and advancing the "art" as it were. Both are needed -- as you point out.
My perception is the HamWAN organization isn't huge -- a handful of dedicated enthusiasts who have done really well implementing something of real value to both the ham community as well as to the community in general.
Over the decades of my experience in the high-tech industry I have found that organizations go through (speaking very broadly) three phases of growth: 1) the "scramble" to "get something up and running or to market," 2) the "bringing order out of chaos" where some organization is accomplished to solidify the foundation upon which to build -- what I believe you are essentially proposing, and 3) the "ongoing operations" based on well-defined processes and procedures that are adhered to with discipline.
It is a fact (well, my observations) that people "cut out" for one phase are much less enthusiastic about having much to do with the other phases. Folks thriving in phase 1 -- the creative innovators -- sometimes even abhor the discipline of phase 3 feeling "shackled" and "confined." One of my biggest challenges was helping folks grow with the organization in ways that supported its continued success and retained them and the value of their contribution. People necessarily had to be flexible and move into roles appropriate for "who they were" without feeling marginalized or sidelined. Indeed, both functions that you have outlined are *critical* to the growth and ongoing success of any organization. One extreme example of this was the creation of Bell Labs, but I digress...
I think the biggest, most realistic and doable, "next steps" that the HamWAN organization could take would be to: 1) adopt and agree on a culture of organizational discipline to maintain a safe segregation (as much as possible anyway) between the "operational" vs. the "developmental" portions of the network; and 2) formally designate which portions of the network are considered "in production" and treat them accordingly (priority for restoration after failure, etc.).
The church I attend has an espresso bar and while all the beverages are technically free, there are "suggested donations" for each that are for familiar amounts. I do not think it unreasonable to suggest that organizations that are linking to HamWAN *FOR* the reliability it promises strongly consider making an on-going (e.g. annual) donation to help with the on-going development of the network both in coverage *and* robustness. Even an up-front contribution would, of course, be totally reasonable.
I'm not saying that HamWAN should be some sort of "business" per se, but to consider "taking a page or two" out of the playbooks of successful organizations -- both for-profit as well as non-profit. I think HamWAN's success would blossom and be a model for many other areas. The foundation has been started with good standards of equipment and design -- phase 1 if you will -- being open to an intentional move towards a phase 2 would be a Good Thing.
Just my $0.02 adjusted for inflation -- YMMV... ;-)
-Ed (WB7UBD)
participants (9)
- Bart Kus
- Bryan Fields
- Cory (NQ1E)
- Ed Morin
- John D. Hays
- Nigel Vander Houwen
- Rob Salsgiver
- Ryan Elliott Turner
- Sam Kuonen