May 28, 2019

Active-Active Data Center Design

By Muhammad Marakkoottathil Enterprise, DC networking and Storage 25 Comments

Defining an active-active data center strategy is not an easy task when the network, server, and compute teams do not usually collaborate on infrastructure planning. Most importantly, an active-active data center design requires the end-to-end technology stack to work together cohesively, and it usually needs an enterprise-level architecture drive to establish the idea. In practice, active-active means providing availability and traffic load sharing for applications across DCs, with the following key use cases:

Business Continuity
Mobility and load sharing
Consistent policy and fast provisioning capability across

As part of the active-active strategy, application load-sharing scenarios must be defined. For example, some applications may be active in DC site-1 with their failover instances on standby in DC site-2, while others may be active in DC site-2 with their failover instances on standby in DC site-1.
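The load-sharing scenario above can be sketched as a small placement table. This is only an illustration: the application names, the site labels, and the helper functions are hypothetical and not part of any vendor tool.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Placement:
    """Where an application normally runs and where its standby lives."""
    app: str
    active_site: str
    standby_site: str


def serving_site(p: Placement, failed_sites: Set[str]) -> Optional[str]:
    """Return the site that serves the app, or None if both sites are down."""
    if p.active_site not in failed_sites:
        return p.active_site
    if p.standby_site not in failed_sites:
        return p.standby_site
    return None


def load_by_site(placements: Iterable[Placement],
                 failed_sites: Iterable[str] = ()) -> Dict[Optional[str], List[str]]:
    """Group applications by the site that currently serves them."""
    failed = set(failed_sites)
    result: Dict[Optional[str], List[str]] = {}
    for p in placements:
        result.setdefault(serving_site(p, failed), []).append(p.app)
    return result


# Hypothetical applications split across the two sites.
PLACEMENTS = [
    Placement("billing", active_site="site-1", standby_site="site-2"),
    Placement("crm", active_site="site-2", standby_site="site-1"),
]
```

In normal operation each site serves its own applications, so neither site is idle. Calling `load_by_site(PLACEMENTS, {"site-1"})` shows every application landing on site-2, which is the load the surviving site has to be sized for.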

Active-Active Data-center Technical Requirement:

Below are the generic technical requirements to be considered when formulating the active-active datacenter design.

Active-Active Data center design – Technical Requirement Summary

In addition to the above, the following are the major building blocks and the associated considerations for an active-active data center design.

Active-active Transport Technologies

Transport technologies interconnect the data centers. Link-level and device-level redundancy is part of the transport domain, which provides HA and resiliency across the sites. This could include redundancy for multiplexers, GPONs, DCI network devices, and dark fibers, diverse POPs to survive a POP failure, and 1+1 protection schemes for devices, cards, and links.

The list below contains the major considerations when designing a transport solution to interconnect data centers:

Recovery from various types of failure scenarios: link failure, module failure, node failure, etc.
Link latency and application round-trip requirements for the traffic between DCs
Bandwidth requirements and associated scalability factors
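As a rough way to reason about the latency consideration, the sketch below estimates the propagation round-trip time over a fiber path and compares it with an application budget. It assumes the common rule of thumb of about 5 microseconds of one-way delay per kilometer of fiber; the function names, the budget values, and the equipment allowance are hypothetical, and the real fiber route is usually longer than the straight-line distance.

```python
# Rule-of-thumb one-way propagation delay in optical fiber (assumption).
FIBER_US_PER_KM = 5.0


def propagation_rtt_ms(fiber_km: float) -> float:
    """Round-trip propagation delay in milliseconds for a fiber path."""
    return 2 * fiber_km * FIBER_US_PER_KM / 1000.0


def fits_budget(fiber_km: float, rtt_budget_ms: float,
                equipment_rtt_ms: float = 0.0) -> bool:
    """True if propagation plus an equipment allowance fits the RTT budget."""
    return propagation_rtt_ms(fiber_km) + equipment_rtt_ms <= rtt_budget_ms
```

For example, `propagation_rtt_ms(100)` gives about 1 ms for a 100 km path, before any switching, multiplexing, or queuing delay is added through `equipment_rtt_ms`.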

Active-Active Network Services

Network services interconnect all the devices in the data centers by performing the required traffic switching and routing functions. The network should forward application traffic and share load without disruption, and it should support application mobility across the data centers by providing a pervasive gateway, L2 extension, and ingress and egress path optimization. It is worth noting that most major network vendors' SDN solutions currently provide an integrated VxLAN overlay to achieve L2 extension, path optimization, and gateway mobility.

The following are the major considerations when designing active-active network services:

Recovery from various types of failure scenarios: link, module, and network device failure, etc.
Pervasive Gateway across the infrastructure: Gateway availability local to the DC and across the DCs
Stretching the L2 domain: Ability to extend the L2 domain (VLAN or VxLAN) between the DCs
Consistent Policy: Network policies are consistent across the on-premises infrastructure and the various cloud infrastructures. These policies could include naming, segmentation rules for integrating various L4-L7 services, hypervisor integration, etc.
Path Optimizations: Ingress and egress
Centralized Management: Centralized provisioning of the network policies and management (e.g., inventory, troubleshooting, AAA capabilities, backup and restore, traffic flow analysis, capacity dashboards, etc.)
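One way to picture the consistent-policy requirement is a drift check that compares the policy sets exported from two sites. The sketch below assumes policies are available as simple name-to-value mappings; the policy names and values are hypothetical and do not come from any particular controller.

```python
from typing import Any, Dict, Tuple


def policy_drift(site_a: Dict[str, Any],
                 site_b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {policy_name: (value_at_a, value_at_b)} for every mismatch.

    A policy that is missing at one site shows up with None on that side.
    """
    drift: Dict[str, Tuple[Any, Any]] = {}
    for name in sorted(set(site_a) | set(site_b)):
        a, b = site_a.get(name), site_b.get(name)
        if a != b:
            drift[name] = (a, b)
    return drift


# Hypothetical segmentation policies exported from each site.
SITE_1 = {"web-to-app": "permit tcp/8443", "app-to-db": "permit tcp/1521"}
SITE_2 = {"web-to-app": "permit tcp/8443", "app-to-db": "permit tcp/1522"}
```

An empty result means the two sites agree. With the sample data above, only `app-to-db` is reported, which is the kind of mismatch that tends to surface during a failover if it is not caught earlier.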

Active-Active L4-L7 Services

Building active-active L4-L7 services across DCs is always an expensive task, as it requires placing security and ADC devices in both DCs. Global traffic managers, application policy controllers, load balancers, and firewalls are the major solutions to consider in this space. These need to be deployed at different tiers to protect the perimeter, extranet, WAN, core server farm, UAT segment, etc. Note also that most of the leading L4-L7 services vendors currently offer clustering of their products across DCs. Clustering allows the members to share L4-L7 policies and traffic load while providing seamless failover in case of issues.

The major considerations related to L4-L7 services design are below:

Recovery from various types of failure scenarios: link, module, and L4-L7 device failure, etc.
Consistent Policy: L4-L7 policies are consistent across the on-premises infrastructure and the various clouds. This could include the naming of the policies, L4-L7 rules for various traffic types, etc.
Centralized Management: Centralized provisioning of the network policies and management (e.g., inventory, troubleshooting, AAA capabilities, backup and restore, traffic flow analysis, capacity dashboards, etc.)
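The role of a global traffic manager in this design can be illustrated with a simplified site-selection function: send the client to the closest site that is currently healthy. This is a minimal sketch, not how any specific GTM product works; the health map, the latency map, and the function name are assumptions for illustration.

```python
from typing import Dict, Optional


def pick_site(health: Dict[str, bool],
              client_latency_ms: Dict[str, float]) -> Optional[str]:
    """Return the healthy site with the lowest latency to the client.

    Sites with no latency measurement are treated as the farthest option.
    Returns None when no site is healthy.
    """
    healthy = [site for site, ok in health.items() if ok]
    if not healthy:
        return None
    return min(healthy, key=lambda s: client_latency_ms.get(s, float("inf")))
```

With both sites healthy the client is steered to the nearer one; when that site fails its health check, the same call returns the other site, which is the failover behavior the clustering and GTM layers are meant to provide.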

Active-Active Storage Services

Storage and the related networking solutions are one of the main pillars of active-active data center design. Active-active here means that the storage in both DCs serves applications, so the design should be able to accept read and write requests without interruption. It is therefore also important to have real-time data mirroring and seamless failover capability across DCs. Some of the major considerations related to storage design are below:

Recovery from various types of storage failure scenarios, such as single disk, storage array, and storage controller failures, as well as split-brain scenarios
Synchronous vs. asynchronous replication: With synchronous replication, data is written to the primary storage and the replica simultaneously. Because of that, it consumes more bandwidth and typically requires dedicated FC links
Storage high availability and redundancy: Storage replication factors, number of disks available for redundancy, etc.
Storage network failure scenarios: Link, module, and network device failure, etc.
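The difference between the two replication modes can be shown with a toy model. In the sketch below, a synchronous write is acknowledged only after the replica has the block, while an asynchronous write is acknowledged immediately and shipped later. The class and method names are hypothetical, and the model ignores real-world details such as ordering guarantees, link failures, and write latency.

```python
from collections import deque
from typing import Deque, List


class ReplicatedVolume:
    """Toy model of a volume replicated between two sites."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: List[bytes] = []
        self.replica: List[bytes] = []
        self._pending: Deque[bytes] = deque()

    def write(self, block: bytes) -> str:
        """Store a block on the primary and replicate it per the mode."""
        self.primary.append(block)
        if self.synchronous:
            # Acknowledge only after the remote copy is in place.
            self.replica.append(block)
        else:
            # Acknowledge now; the block is shipped on the next drain.
            self._pending.append(block)
        return "ack"

    def drain(self) -> None:
        """Ship all queued blocks to the replica (asynchronous mode)."""
        while self._pending:
            self.replica.append(self._pending.popleft())

    def unreplicated(self) -> int:
        """Number of acknowledged blocks the replica does not have yet."""
        return len(self.primary) - len(self.replica)
```

`unreplicated()` is the amount of acknowledged data that would be lost if the primary site failed at that instant: always zero in synchronous mode, and whatever has not yet been drained in asynchronous mode.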

Active-Active Server Virtualization

Server virtualization has evolved over the years, and organizations are now even moving to microservices and containers. The main consideration here is to extend hypervisor/container clusters across the DCs to achieve seamless movement and failover of virtual machine and container instances. The dominant players in this space are VMware, Docker, and Microsoft, and there are others as well, such as KVM and Kubernetes (container management).

Below are some of the key considerations when it comes to server virtualization

Virtualization platform to form a cross-DC virtual host cluster
HA function to protect the VMs, with affinity rules that prefer local hosts under normal operating conditions
Deploy the same service on VMs in both DCs so that when a host machine is unavailable, VMs in the other DC can take over the load in real time
The compute nodes across the DCs are provisioned with a symmetric configuration and the resources required for failover
Centralized management of computing resources and hypervisors
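The requirement that compute nodes carry enough resources for failover can be checked with simple arithmetic: a surviving site must be able to carry the combined load of all sites. The sketch below assumes capacity and load are expressed in the same abstract unit (for example vCPUs); the numbers and the function name are hypothetical.

```python
from typing import Dict


def can_absorb_failover(capacity: Dict[str, float],
                        load: Dict[str, float]) -> Dict[str, bool]:
    """For each site, report whether it could carry the total load alone."""
    total_load = sum(load.values())
    return {site: cap >= total_load for site, cap in capacity.items()}


# Hypothetical symmetric sites, each running below half of its capacity.
CAPACITY = {"site-1": 100.0, "site-2": 100.0}
LOAD = {"site-1": 40.0, "site-2": 50.0}
```

With the sample values both sites pass, because 90 units of total load fit within 100 units of capacity at either site. If the combined load grows beyond a single site's capacity, the check fails for that site even though day-to-day operation still looks healthy.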

Active-Active Applications Deployment

The infrastructure is built for applications to function, so it is important to ensure the high availability of the applications across DCs, including the ability to fail over and to serve users from the closest location. The key is to have the Web, App, and DB tiers available at both data centers, so that if the application fails in either DC, it can fail over and continue operating.

The following are some of the major considerations:

Deploy the Web services on a virtual machine (VM) or a physical machine, with multiple servers forming independent clusters per DC
Deploy the App services on a virtual machine (VM) or a physical machine, with multiple servers in the DC forming a cluster or multiple cross-DC servers forming a cluster (preferably with different IP-based access, if the application supports distributed deployment).
Deploy databases preferably on physical machines to form a cross-DC cluster (active-standby or active-active), e.g., Oracle RAC, DB2, or SQL Server with Windows Server Failover Clustering (WSFC).

Summary

The diagram below summarizes the active-active data center design components.

Active-active data center design full stack network components

An active-active data center design requires the network, storage, L4-L7 services, compute, virtualization, and application components to work together. The key factor is seamless availability and operation of the business applications if the infrastructure in any one of the data centers fails. When it comes to cost, operating active-active data centers is more expensive than disaster recovery, but only by about 20%, while delivering 35% more capacity and enabling non-stop operations. The result is improved uptime, enhanced performance, and optimum asset utilization.

For further reading, I would recommend the following Cisco Live presentation: https://www.ciscolive.com/c/dam/r/ciscolive/apjc/docs/2016/pdf/BRKDCT-2615.pdf

Finally, please don't miss the blog post on Nutanix solutions from an architectural perspective.

About Author

Muhammad Marakkoottathil(MM)

Expert in the field of SDN, cloud computing, virtualization, and active-active data center design and migration. Passionate about helping organizations achieve their digital transformation objectives, with 15+ years of experience in designing, deploying, and managing heterogeneous network solutions across industry verticals. Major industry certifications: Cisco CCIE, CCDP, VMware VCAP-NV_DESIGN, TOGAF, ITIL, NUTANIX NCSE, Google Cloud Architect, Azure Fundamentals. For more info, please visit my LinkedIn page: https://www.linkedin.com/in/contactmm/

25 Comments

James

This is one of the most comprehensive articles on the topic I have come across. Thanks for putting all the pieces together.
June 8, 2019

Muhammad Marakkoottathil

Hi James, thanks for the comments. Yes, the active-active DC discussion is always a cross-domain/architectural topic, and one should have the end-to-end objective in mind when designing it.
June 9, 2019

eddie nugent

A very well explained topic. Much appreciate the time and effort in putting it together.
October 17, 2019

Muhammad Marakkoottathil

Thank you Eddie
May 26, 2020

J George

Very Good Stuff Muhammad, great job.
November 7, 2019

Muhammad Marakkoottathil

Thank you, glad to know you like the post
May 26, 2020

Chaiyasit S.

How do you handle latency for the interconnect communication between each pair of active-standby or active-active databases running in different DCs?
June 21, 2020

Muhammad Marakkoottathil

Hi Chaiyasit S., the latency requirement varies based on the scenario. For active-standby, it is OK to have higher latency, but during failover it should be able to cater to the data transfer requirements. However, when it comes to an active-active scenario, the application requirement needs to be considered properly. For example, I know that if you are stretching Oracle across sites, they mandate no more than 10 ms RTT.

https://docs.oracle.com/middleware/12213/wls/WLCAG/weblogic_ca_best_stch.htm#WLCAG-GUID-E5687E48-B57A-49CB-AF2E-E7BF55078D93

-MM-
June 22, 2020

Chaiyasit

Thank you very much
June 24, 2020

Rajendra Prasad

Hi Muhammad,
I must say you have beautifully documented your articles.
August 2, 2020

Muhammad Marakkoottathil

Thank you, Rajendra. We will make sure to continue the same, and your comments are really important to us.
August 7, 2020

Charanjit Singh

This is a great resource Muhammad. Just when I was looking out for something similar. Thank you!
August 7, 2020
Issa

Excellent article
February 11, 2022
Issa

Salam Mohammed, I have some doubts about active-active from the storage stretched-cluster side. In many incidents it takes both DCs down together, which makes it difficult to recover from either data center. The limitation of 10 ms RTT, with the higher cost of DWDM links, adds up. I would have active from the software layer but faster recovery from the DR site. Do you recommend a solution, or should I just move to active/hot standby?
February 11, 2022

Muhammad Marakkoottathil

One thing we should all keep in mind about active-active design is that it is NOT about the same application session or the same application being served from both sites at the same time! The idea is to share traffic load across the sites, such as application 1 being served from site 1 and a second application being served from site 2. The same applies to storage: one site's storage could be primary for certain applications, whereas the second site is primary for the rest. This way neither site is idle, and with proper capacity planning, if one site fails, its load fails over to the other, and vice versa. In all cases we cannot write at both sites at the same time for the same application; the write happens at one site and is synchronised to the second site for availability. Hope this answers your questions, and thanks for asking.
October 22, 2024

Rostislav Rusev

This is wrong on so many levels. Stretched DC architecture with multi-layer DCIs is a direct abandonment of fundamental network design principles.
May 10, 2023

Muhammad Marakkoottathil

Thanks for the comment. This is an existing production design implemented for an enterprise, and in principle it has never had an issue. That said, you should have end-to-end domain experience. Again, active-active design is all about making the infrastructure ready on both sides so that it helps in failover, not about loading the same application at the same time from both sites.
October 22, 2024

Riachad Dickinson

It depends on how you do it. We have customers running without any issues; in fact, this design gives you the best failover options.
November 3, 2023
Charlie

Can I run an active/active network across data centers that are 500 miles apart?
March 13, 2024
Nadeem

Salam Muhammad

Hope you are doing good. With the latest in VxLAN, ACI, etc., any updates on this writeup?
March 30, 2024
Muhammad Marakkoottathil

Hi Charlie, I don't recommend running active-active DCs 500 miles apart if your design has to handle lots of east-west traffic. However, if you use active-active from an application hosting perspective, why not.
June 22, 2024
Muhammad Marakkoottathil

Hi Nadeem, VxLAN helps to tunnel the same subnet across the DCs. However, as you may have noted, the considerations are not limited to just reachability or the protocols. If all the elements need failover and extension between the DCs (it is up to the design), it is always better to have a short distance where you can extend a 40G pipe between the DCs.
June 22, 2024
Muhammad Marakkoottathil

Hi Issa,

If you look at it overall, the active-active design is bandwidth intensive and latency is a key consideration, especially when it comes to infrastructure failover, cross-site sync, etc. If your business requirements are met with DC and DR, no issues.
June 22, 2024

Network Bachelor
