SD-WAN

  • 1.  Routing to Servers during a Disaster Recovery Scenario

    Posted 09-18-2020 14:04

    A couple of years ago we had an outage that brought down both ISPs at Site A, where our production VM servers are hosted, and none of our other locations could access the servers for over a day. We really want to shrink that downtime if something like that were to happen again. Right now we have a cluster of host servers for all of our production VMs at Site A, and a DR server node at Site B that receives periodic snapshots of those VMs. If Site A were to go down, we could log into the DR node at Site B and spin up the VM servers we need access to. Here are a couple of network diagrams showing the current network and failover scenario:

    Current DR Network Diagram 128T.pdf

    Current DR Failover 128T.pdf

    Some problems with this current setup:

    IP addressing and routing: The static IPs and gateways of our servers are configured for the Site A network, so when they spin up at Site B they would not be able to network properly. The applications on client computers at other locations are configured to connect to these servers at their current IP addresses, either statically or via DNS, so re-addressing the servers for the Site B network would be a huge pain point.
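To make the addressing problem concrete, here is a minimal sketch using Python's standard `ipaddress` module. The subnets, server IP, and gateway below are hypothetical placeholders (the real Site A and Site B ranges are not listed in this post); the point is only that a static IP and gateway taken from the Site A range fall outside the Site B LAN.

```python
import ipaddress

# Hypothetical subnets for illustration; the real ranges are not given in the thread.
SITE_A_LAN = ipaddress.ip_network("10.1.10.0/24")
SITE_B_LAN = ipaddress.ip_network("10.2.10.0/24")


def fits_local_lan(server_ip, gateway_ip, local_lan):
    """Return True if both the server's static IP and its gateway sit on the local LAN."""
    return (
        ipaddress.ip_address(server_ip) in local_lan
        and ipaddress.ip_address(gateway_ip) in local_lan
    )


# A restored snapshot keeps the static settings it had at Site A (hypothetical values).
server_ip, gateway_ip = "10.1.10.25", "10.1.10.1"

print("Works at Site A:", fits_local_lan(server_ip, gateway_ip, SITE_A_LAN))
print("Works at Site B:", fits_local_lan(server_ip, gateway_ip, SITE_B_LAN))
```

A restored VM that fails this check at Site B either has to be re-addressed or has to sit behind something that presents the Site A network to it.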

    Port forwards: These are configured through Router A and its external IP addresses, so they would need to be configured on Router B as well. I'm sure there are other issues that we would need to deal with, but these two were the first big ones I've been thinking about.

    To prepare for a DR scenario I'm proposing setting up a 128T router VM, hosted in our VM server cluster, that would handle the routing for the production VM servers. This router would also have snapshots syncing to the server node at Site B, so in the event of a DR failover we would boot up the snapshot of the router VM on the Site B DR node, and it would provide the same network for the VM servers as they had at Site A. I assume it would work much like having a 128T router within a home network, and the routing would pretty much fix itself for Sites C and D. Here are some network diagrams for this proposed scenario:

    Purposed DR Network Diagram 128T.pdf
    Purposed DR Failover 128T.pdf


    Do you think this would be the best way to handle this? Can you think of any issues with it? Or do you have a better way to handle this DR scenario?
    I'm ready to start setting up a test environment for this if it is the way to go; otherwise I'd love to talk through any better ways we could handle it. Please let me know what you think, or if you'd like any clarification on my network diagrams.



    ------------------------------
    Austin Stoffel
    Systems Administrator
    ------------------------------


  • 2.  RE: Routing to Servers during a Disaster Recovery Scenario

    Posted 09-23-2020 10:32
    I would really have to whiteboard it out and ask questions. One gotcha will be IP routing / weighting. If you overlap IP space at both locations, then a route must exist. I'm not positive, but I think you can set up 128T routers in a failover scenario so the route (and the broadcast IP ranges) only goes live should Site A fail. The other gotcha is working for X days on Site B - what are you doing to fail that information back over?
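As a rough picture of the weighting idea, here is a generic Python sketch, not 128T configuration. The `Route` class, the priority values, the site names, and the subnet are all hypothetical; it only shows the same prefix being known at two sites, with the standby path selected once the primary is marked down.

```python
from dataclasses import dataclass


@dataclass
class Route:
    prefix: str    # server subnet being advertised (hypothetical value below)
    site: str      # site whose router advertises the prefix
    priority: int  # lower value is preferred
    is_up: bool    # whether that site's router is currently reachable


def select_route(routes):
    """Return the preferred route among those that are up, or None if none are."""
    live = [r for r in routes if r.is_up]
    return min(live, key=lambda r: r.priority) if live else None


# Normal operation: Site A serves the prefix, Site B is a standby that is not live.
normal = [
    Route("10.1.10.0/24", "site-a", priority=10, is_up=True),
    Route("10.1.10.0/24", "site-b", priority=20, is_up=False),
]

# After a failover: Site A is unreachable and the Site B copy has been booted.
failover = [
    Route("10.1.10.0/24", "site-a", priority=10, is_up=False),
    Route("10.1.10.0/24", "site-b", priority=20, is_up=True),
]

print(select_route(normal).site)
print(select_route(failover).site)
```

How the "only goes live on failure" part is actually expressed on a 128T router is something to confirm against the product documentation.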

    If you are running VMware - and bandwidth is sufficient between locations - you can activate HA mode, in which each packet from Site A is sent to Site B. If Site A goes offline, B becomes primary, and there might be a small latency spike or a few lost packets during the switchover. Once Site A comes back online, they sync the data in the background and switch A back to primary.

    You may be able to leverage Site B as a router --> MPLS or some other direct connection that may be offered, and add that as a path back to Site A.

    ------------------------------
    Jeff Bragdon
    ------------------------------



  • 3.  RE: Routing to Servers during a Disaster Recovery Scenario

    Posted 09-29-2020 12:27
    @Jeff overlapping IP space isn't really a factor in my proposed design. In the proposed design, if Site A is going to be down long enough to warrant a failover, I would spin up a snapshot of the 128T router VM that routes for the VM servers' VLAN. Since it would be a router within a router (double NAT), it would then connect out to the conductor, and all the other routers would learn how to connect back. This is similar to how we have some 128T routers connected from within users' home networks.

    Failing the data back is not a concern, only routing within 128T.

    We are not running VMware; we use Scale Computing and are not using any kind of HA mode. We are just scheduling snapshots of VMs to go to our Site B node. If we decide that an outage is going to last long enough to justify a failover, then we will spin up the servers at Site B and hopefully have a solution figured out to route to those servers at Site B.

    We do not have any MPLS connections; we utilize 128T as our SD-WAN.

    This is for our DR plan, so it doesn't matter why Site A goes down, whether it's network related, a fire in the server room, or an asteroid impact. We just want to configure our routing to work whether the router and servers are up at Site A or up at Site B.

    ------------------------------
    Austin Stoffel
    Systems Administrator
    ------------------------------