Active-Active Data Center Design
Defining an active-active data center strategy is not easy when the network, server, and compute teams involved do not usually collaborate on infrastructure planning. Above all, an active-active data center design requires the end-to-end technology stack to work together cohesively, and it usually needs an enterprise-level architecture drive to establish the idea. In practice, it means providing availability and traffic load sharing for applications across DCs, with the following key use cases:
- Business Continuity
- Mobility and load sharing
- Consistent policy and fast provisioning capability across
As part of the active-active strategy, application load-sharing scenarios must be defined. For example, some applications may be active in DC site-1 with their failover instances on standby in DC site-2, while others may be active in DC site-2 with their failover instances on standby in DC site-1.
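The load-sharing scenario above can be sketched as a small placement table. This is only an illustration: the application names, the site names, and the helpers `serving_site` and `apps_per_site` are hypothetical and do not come from any product.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Placement:
    """Where one application normally runs and where its standby lives."""
    app: str
    active_site: str
    standby_site: str


def serving_site(p: Placement, failed: Iterable[str] = ()) -> Optional[str]:
    """Return the site serving the app, or None if both of its sites are down."""
    down = set(failed)
    if p.active_site not in down:
        return p.active_site
    if p.standby_site not in down:
        return p.standby_site
    return None


def apps_per_site(placements: List[Placement],
                  failed: Iterable[str] = ()) -> Dict[str, List[str]]:
    """Group applications by the site that serves them for a given failure set."""
    down = set(failed)
    result: Dict[str, List[str]] = {}
    for p in placements:
        site = serving_site(p, down)
        if site is not None:
            result.setdefault(site, []).append(p.app)
    return result


# Hypothetical placement: each site is active for one app and standby for the other.
PLACEMENTS = [
    Placement("app-1", "site-1", "site-2"),
    Placement("app-2", "site-2", "site-1"),
]
```

In normal operation each site serves one application, so neither site is idle. Calling `apps_per_site(PLACEMENTS, failed=["site-1"])` moves both applications to site-2, which is why the surviving site has to be sized for the combined load.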
Active-Active Data Center Technical Requirements
Below are the generic technical requirements to be considered when formulating the active-active data center design.
In addition to the above, the following are the major building blocks and the associated considerations for an active-active data center design.
Active-active Transport Technologies
Transport technologies are the interconnects between the data centers. Link- and device-level redundancy is part of the transport domain, which provides HA and resiliency across the sites. This could include redundancy for multiplexers, GPONs, DCI network devices, and dark fibers, diverse POPs to survive a POP failure, and 1+1 protection schemes for devices, cards, and links.
The list below contains the major considerations when designing a transport solution to interconnect data centers:
- Recovery from various types of failure scenarios: link, module, and node failures, etc.
- Link latency and application round-trip requirements for traffic between DCs
- Bandwidth requirements and associated scalability factors
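To get a feel for the latency consideration above, here is a rough back-of-the-envelope sketch of propagation delay over fiber. It assumes light travels at roughly 200,000 km/s in fiber (about two thirds of its speed in vacuum), which is a common rule of thumb and not a measured value. The function names are hypothetical. Real routes are longer than the straight-line distance, and transport and network equipment add delay on top, so treat the result as a lower bound.

```python
# Assumption: signal speed in optical fiber is about 200,000 km/s,
# i.e. roughly 5 microseconds per kilometre in one direction.
FIBER_KM_PER_SECOND = 200_000.0


def propagation_rtt_ms(route_km: float) -> float:
    """Round-trip propagation delay in milliseconds for a fiber route."""
    one_way_seconds = route_km / FIBER_KM_PER_SECOND
    return 2.0 * one_way_seconds * 1000.0


def max_route_km(rtt_budget_ms: float) -> float:
    """Longest fiber route whose propagation delay alone fits the RTT budget."""
    one_way_seconds = (rtt_budget_ms / 1000.0) / 2.0
    return one_way_seconds * FIBER_KM_PER_SECOND
```

With these assumptions, a 100 km route costs about 1 ms of round-trip time from propagation alone, before any device or protocol overhead is counted.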
Active-Active Network Services
Network services interconnect all the devices in the data centers by performing the required traffic switching and routing functions. The network should forward application traffic and share load without disruption. It should also support application mobility across the data centers by providing a pervasive gateway, L2 extension, and ingress and egress path optimization. It is also worth noting that the SDN solutions of most major network vendors currently provide an integrated VXLAN overlay to achieve L2 extension, path optimization, and gateway mobility.
The following are the major considerations when designing active-active network services:
- Recovery from various types of failure scenarios: link, module, and network device failures, etc.
- Pervasive Gateway across the infrastructure: Gateway availability local to the DC and across the DCs
- Stretching the L2 domain: Ability to extend the L2 domain (VLAN or VXLAN) between the DCs
- Consistent Policy: Network policies are consistent across the on-premises infrastructure and the various cloud infrastructures. These policies could include naming, segmentation rules for integrating various L4-L7 services, hypervisor integration, etc.
- Path Optimization: Ingress and egress
- Centralized Management: Centralized provisioning of the network policies and management (e.g., inventory, troubleshooting, AAA capabilities, backup and restore, traffic flow analysis, and capacity dashboards)
Active-Active L4-L7 Services
Building active-active L4-L7 services across DCs is always expensive, as it requires placing security and ADC devices in both DCs. Global traffic managers, application policy controllers, load balancers, and firewalls are the major solutions to consider in this space. These need to be deployed at different tiers to protect the perimeter, extranet, WAN, core server farm, UAT segment, etc. Also note that most leading L4-L7 services vendors currently offer clustering of their products across DCs. Clustering allows the members to share L4-L7 policies and traffic load while providing seamless failover in case of issues.
The major considerations related to L4-L7 services design are below:
- Recovery from various types of failure scenarios: link, module, and L4-L7 device failures, etc.
- Consistent Policy: L4-L7 policies are consistent across the on-premises infrastructure and the various clouds. This could include the naming of the policies, L4-L7 rules for various traffic types, etc.
- Centralized Management: Centralized provisioning of the network policies and management (e.g., inventory, troubleshooting, AAA capabilities, backup and restore, traffic flow analysis, and capacity dashboards)
Active-Active Storage Services
Storage and the related networking solutions are one of the main pillars of active-active data center design. It means that the storage in both DCs serves applications, so the design should be able to accept read and write requests without interruption. It is therefore also important to have real-time data mirroring and seamless failover capability across DCs. Some of the major considerations related to storage design are below:
- Recovery from various types of storage failure scenarios, such as single-disk, storage array, and storage controller failures, and split-brain scenarios
- Synchronous vs. asynchronous replication: With synchronous replication, data is written to the primary storage and the replica simultaneously, and the write is acknowledged only once both copies are committed. Because of that, it consumes more bandwidth, is sensitive to inter-DC latency, and typically requires dedicated FC links
- Storage high availability and redundancy: Storage replication factors, number of disks available for redundancy, etc.
- Storage network failure scenarios: link, module, and network device failures, etc.
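The difference between synchronous and asynchronous replication can be illustrated with a deliberately simplified latency model. It assumes the local and remote writes are issued in parallel and ignores queuing, protocol round trips, and array-specific behaviour. The function name and all parameter values are hypothetical.

```python
def write_ack_latency_ms(local_commit_ms: float,
                         remote_commit_ms: float,
                         inter_dc_rtt_ms: float,
                         synchronous: bool) -> float:
    """Estimate when the application sees a write acknowledged.

    Simplified model (an assumption, not a vendor formula):
    - asynchronous: acknowledged as soon as the local copy is committed;
    - synchronous: acknowledged only when both copies are committed, with
      the remote copy costing one inter-DC round trip plus its own commit.
    """
    if not synchronous:
        return local_commit_ms
    return max(local_commit_ms, inter_dc_rtt_ms + remote_commit_ms)
```

For example, with a 0.5 ms commit time on each array and a 2 ms inter-DC round trip, the model gives 0.5 ms for an asynchronous write and 2.5 ms for a synchronous one. This is why the link latency between the DCs matters so much for stretched storage.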
Active-Active Server Virtualization
Server virtualization has evolved over the years, and organizations are now even moving to microservices and containers. The main consideration here is to extend hypervisor/container clusters across the DCs to achieve seamless movement and failover of virtual machine and container instances. The dominant players in this space are VMware, Docker, and Microsoft. There are others as well, such as KVM and Kubernetes (container management).
Below are some of the key considerations when it comes to server virtualization:
- A virtualization platform that can form a cross-DC virtual host cluster
- An HA function to protect the VMs, with affinity rules that prefer local hosts under normal operating conditions
- Deploy the same service on VMs in both DCs so that when a host machine is unavailable, VMs in the other DC can take over the load in real time
- Compute nodes across the DCs are provisioned symmetrically, with the resources required for failover
- Centralized management of computing resources and hypervisors
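The point about provisioning enough resources for failover can be checked with a small capacity calculation. The sketch below is only an illustration under stated assumptions: capacity and load use one abstract unit (for example vCPUs), the whole load of a failed site moves to the surviving sites, and the function name is hypothetical.

```python
from typing import Dict, List


def unsurvivable_site_failures(capacity: Dict[str, float],
                               load: Dict[str, float]) -> List[str]:
    """Return the sites whose failure would leave the rest short of capacity.

    For each site in turn, assume it fails and check whether the remaining
    sites together can carry the total load of all sites.
    """
    total_load = sum(load.values())
    problems: List[str] = []
    for failed in capacity:
        surviving = sum(c for site, c in capacity.items() if site != failed)
        if surviving < total_load:
            problems.append(failed)
    return problems
```

For two symmetric sites with 100 units of capacity each, running 40 and 50 units of load returns an empty list, so either site can absorb the other. Running 70 and 50 units returns both sites, meaning neither can take over the combined load of 120.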
Active-Active Applications Deployment
The infrastructure is built for applications to function, so it is important to ensure high availability of the applications across DCs, with the ability to fail over and to serve users from the nearest location. The key is to have the Web, App, and DB tiers available at both data centers, so that if the application fails in either DC, it can fail over and continue.
The following are some of the major considerations:
- Deploy the Web services on a virtual machine (VM) or a physical machine, with multiple servers forming independent clusters per DC
- Deploy the App services on a virtual machine (VM) or a physical machine, with multiple servers in the DC forming a cluster or multiple cross-DC servers forming a cluster (preferably with different IP-based access, if the application supports distributed deployment)
- Deploy databases, preferably on physical machines, to form a cross-DC cluster (active-standby or active-active), e.g., Oracle RAC, DB2, or SQL Server with Windows Server Failover Cluster (WSFC)
Summary
The diagram below summarizes the active-active data center design components.
Active-active data center design requires the network, storage, L4-L7 services, compute, virtualization, and application components of the architecture to work together. Seamless availability and operation of the business applications when the infrastructure fails in either data center is a key factor. When it comes to cost, operating active-active data centers is more expensive than disaster recovery, but only by about 20%, while delivering 35% more capacity and enabling non-stop operations. This improves uptime, performance, and asset utilization.
For further reading, I would recommend the following Cisco Live presentation: https://www.ciscolive.com/c/dam/r/ciscolive/apjc/docs/2016/pdf/BRKDCT-2615.pdf
Finally, please don’t miss the blog post "Nutanix Solutions from an architectural perspective".
This is one of the most comprehensive articles on the topic I have come across. Thanks for putting all the pieces together.
Hi James, thanks for the comment. Yes, the active-active DC discussion is always a cross-domain, architectural topic, and the design should keep the end-to-end objective in mind.
A very well explained topic. I much appreciate the time and effort in putting it together.
Thank you Eddie
Very Good Stuff Muhammad, great job.
Thank you, glad to know you like the post
How do you handle latency for interconnect communication between each pair of active-standby or active-active databases running in different DCs?
Hi Chaiyasit S., the latency requirement varies with the scenario. For active-standby, higher latency is acceptable, but during failover the link should still be able to meet the data transfer requirements. For an active-active scenario, however, the application requirements need to be considered properly. For example, as far as I know, if you stretch Oracle across sites, they mandate no more than 10 ms RTT.
https://docs.oracle.com/middleware/12213/wls/WLCAG/weblogic_ca_best_stch.htm#WLCAG-GUID-E5687E48-B57A-49CB-AF2E-E7BF55078D93
-MM-
Thank you very much
Hi Muhammad,
I must say you have beautifully documented your articles.
Thank you, Rajendra. We will make sure to continue the same, and your comments are really important to us.
This is a great resource Muhammad. Just when I was looking out for something similar. Thank you!
Excellent article
Salam Mohammed, I have some doubts about active-active from the storage stretched-cluster side. In many incidents it takes both DCs down together, which makes it difficult to recover from either data center. The limitation of 10 ms RTT, together with the higher cost of DWDM links, adds up. I would rather have active at the software layer but faster recovery from the DR site. Do you recommend a solution, or should I just move to active / hot standby?
One thing we should all keep in mind about active-active design is that it is NOT about the same application session, or the same application, being served from both sites at the same time. The idea is to share traffic load across the sites, such that application 1 is served from site 1 and a second application is served from site 2. The same applies to storage: one site's storage could be primary for certain applications while the second site is primary for the rest. This way neither site is idle, and with proper capacity planning, if one site fails, it fails over to the other and vice versa. In all cases we cannot write to both sites at the same time for the same application; the write happens at one site and is synchronised to the second site for availability. Hope this answers your question, and thanks for asking.
This is wrong on so many levels. Stretched DC architecture with multi-layer DCIs is a direct abandonment of fundamental network design principles.
Thanks for the comment. This is an existing production design implemented for an enterprise, and in principle it has never had an issue. That said, you should have end-to-end domain experience. Again, active-active design is all about making the infrastructure ready on both sides so that it helps in failover, not about serving the same application from both sites at the same time.
It depends on how you do it. We have customers running without any issues; in fact, this design gives you the best failover options.
Can I run an active/active network on data centers that are 500 miles apart?
Salam Muhammad
Hope you are doing well. With the latest in VXLAN, ACI, etc., are there any updates to this writeup?
Hi Charlie, I don't recommend running active-active DCs 500 miles apart if your design has to handle a lot of east-west traffic. However, if you use active-active from an application hosting perspective, why not.
Hi Nadeem, VXLAN helps to tunnel the same subnet across the DCs. However, as you may have noted, the considerations are not limited to just reachability or the protocols. If all the elements need to have failover and extension between the DCs (it is up to the design), it is always better to have a short distance over which you can extend a 40G pipe between the DCs.
Hi Issa,
Overall, the active-active design is bandwidth intensive, and latency is a key consideration, especially when we talk about infrastructure failover, cross-site sync, etc. If DC and DR meet your business requirements, no issues.