What is EVPN/VXLAN

In this article I will explain what EVPN/VXLAN is and why the features of this technology look attractive to me for use in the data center. I will not plunge deeply into the technical details, dwelling on them only as far as is needed to get acquainted with the technology. Almost everything I deal with in this article relates in one way or another to carrying OSI layer 2 traffic between devices in the same broadcast domain. There are many practical tasks that can be solved comfortably with such a capability; one of the most familiar examples is the migration of virtual machines within one or several data centers. And if some time ago any conversation about this inevitably slid into a discussion of the problems and inconveniences of a shared broadcast domain, now, on the contrary, we can reflect on this task from the standpoint of new opportunities, prospects, and convenience.

The data center industry is characterized by a particularly high density of applications, multiplied by consumer expectations, so the choice of the network subsystem architecture matters especially. In conversations with customers I usually suggest translating the abstract characteristics of the network subsystem into an assessment of their impact on the customer's business tasks and of the risks of specific events in the network's life cycle. This lets familiar values be viewed from the customer's own subject area: for example, an abstract 2 seconds of packet loss during a topology change, a switchover, or routine maintenance turns into a quite tangible 2.5 gigabytes of lost data on every 10G port. Current trends toward consolidating services and raising port speeds sharpen the problem of reliability and stability, since the price of a second of downtime, measured in business metrics, is multiplied along the speed steps 1G-10G-25G-40G-100G.
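
To make the arithmetic explicit, here is a minimal sketch (Python; the 2-second outage is just the example figure from above) of how downtime converts into lost data per port at each speed step:

```python
# Data that would have traversed one port during an outage:
# lost_bytes = link_speed_bits_per_s * outage_s / 8.

def lost_gigabytes(link_gbps: float, outage_s: float) -> float:
    """Traffic volume, in gigabytes, that one port drops during the outage."""
    return link_gbps * outage_s / 8  # gigabits -> gigabytes

for speed in (1, 10, 25, 40, 100):
    print(f"{speed:>3}G port, 2 s outage: {lost_gigabytes(speed, 2):6.2f} GB lost")
```

For a 10G port this gives the 2.5 GB quoted above; at 100G the same two seconds already cost 25 GB.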

In a traditional OSI layer 2 network with multiple paths between switches, we are forced to use STP, MLAG, or stacking. I will briefly describe the distinctive traits of these technologies so that the preconditions for a new data center network architecture become clear. STP reduces effective throughput by simply disabling ports, and its convergence time draws serious complaints as well. STP has other disadvantages that many people know as well as I do, so in my view no implementation of STP has a place in the data center under any pretext.

Looking at the development prospects of MLAG as the core of a data center raises many questions. One must clearly understand that an MLAG topology, by definition, cannot contain more than two aggregation switches and, properly built, requires a horizontal link of doubled capacity between them. Because traffic inside and between the chassis cannot be segmented or steered, applying this technology in a geographically distributed environment may be difficult. Sizing the resources needed to prevent a "dual master" event, based on a substantive analysis of failure scenarios, can heavily inflate the hardware side of the solution. Although the MLAG aggregation switches continue to function as separate devices, they are extremely tightly coupled in the physical topology, so a close look at the operational details sometimes makes the technology far less attractive.

Operating a stack as the core of a data center carries the risk of keeping all the eggs in one basket: despite the attractiveness and apparent simplicity of one control plane per stack, software errors do occur, and you are not insured against them. Worse, such an error usually cannot be confined to a fault domain of acceptable size; a stack is not merely one large switch but one large fault domain. Nor should it be overlooked that a stack's capabilities are often limited by its weakest component, since by definition many switching-chip tables must be identical on all devices. A stack operated in a distributed environment has the same set of fault-tolerance issues described above for the MLAG case. The ring topology, to which some manufacturers of stackable switches are limited, tends to load the stack ports closer to the uplink boundary more heavily; depending on your traffic profile, this can be a problem that also has to be taken into account when calculating the actual oversubscription.

The term VXLAN stands for Virtual eXtensible Local Area Network, and you will find many similarities that let you think of VXLAN much as you think of VLAN. The difference is that VXLAN is a tunnel (a one-way virtual connection between two switches); to be more precise, it is Ethernet-over-UDP encapsulation.
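
To make the "Ethernet over UDP" phrase concrete, here is a minimal sketch of the encapsulation as RFC 7348 defines it: an 8-byte VXLAN header (flags plus a 24-bit VNI) is prepended to the original frame, and the result travels as the payload of an ordinary outer UDP/IP packet between the tunnel endpoints (the IANA-assigned destination port is 4789). The helper name is mine:

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned destination port (RFC 7348)

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header to an Ethernet frame.

    Header layout (RFC 7348): 8 bits of flags (I flag set means the VNI
    is valid), 24 reserved bits, a 24-bit VNI, and 8 more reserved bits.
    """
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI is a 24-bit value")
    flags = 0x08  # I flag: VNI field is valid
    header = struct.pack("!I", flags << 24) + struct.pack("!I", vni << 8)
    return header + inner_frame

# The result would then be carried as UDP payload between the two
# tunnel endpoints (VTEPs) over the routed network.
payload = vxlan_encapsulate(b"\x00" * 64, vni=10042)
print(len(payload))  # 8-byte header + 64-byte dummy frame = 72
```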

The usual transfer of layer 2 traffic between hosts is based on knowledge of where MAC addresses are located; intermediate switches store this in a table and update their view of the network by "blind" learning from traffic as it passes. In the VXLAN world much of this changes: to carry layer 2 traffic, intermediate switches no longer have to operate at layer 2 within a single broadcast domain. Tunneling Ethernet traffic allows transit switches to abstract away from MAC addresses and to do their job within an IP fabric of arbitrary topology, purely according to routing rules. Each switch in the data center works completely independently of its neighbors, which is reminiscent of the distributed operation of the Internet, since maintenance on each device is strictly local in nature. A flexible physical topology lets you fit the traffic profile of a particular data center precisely, and the free choice of the types and number of interconnect ports lets you build networks with a chosen level of oversubscription, up to fully non-blocking.

When traffic from host A to host B reaches the edge switch as a layer 2 frame, the frame is packed inside an IP packet and sent across the IP network to the edge switch on the other side, where it is unpacked. The era of separating the network into transport and service components is usually counted from the implementation of a BGP Free Core based on MPLS technologies in carrier networks. In the data center this trend materialized relatively recently as a way to solve scalability problems and simplify operations; the components of such an architecture are usually called the Underlay, the transport network, and the Overlay, the service or virtual network. The Network Virtualization Overlays architecture is described in more detail in [1, 2]. Separating the functions of the data center network allows services to be configured independently of the transport component. You no longer have to chain a VLAN across every transit switch; instead you describe the Overlay network by configuring the service only where it is actually delivered, while Overlay traffic within the service travels through tunnels built on top of the Underlay network. Conversely, it is entirely expected that Underlay activities, such as changing the physical topology or adding and removing switches and links, have zero effect on the configuration of the service component. The core, or transport component, of the data center network is freed from processing and storing service state: the table space of the core switches' chips is used exclusively for building connectivity, forwarding traffic, and ensuring fast convergence, while service state is present only at the edge of the data center network.

At about the same time, large telecom operators and leading manufacturers of network equipment were finishing the development of a new standard for carrying layer 2 traffic over MPLS networks. VPLS had accumulated a critical mass of unmet requirements in tasks such as:

- Active-Active multihoming with per-flow load balancing;
- fast convergence that does not depend on the number of MAC addresses;
- optimized handling of multicast traffic;
- ease of provisioning;
- flexible VPN topologies and policies.

These requirements were implemented in the new standard called Ethernet VPN; the full text of the requirements can be found in [3].

A little later it was proposed to adapt the carrier model of EVPN over MPLS for use in IP data center networks. To do this, some of the standard's procedures had to be aligned with the peculiarities of VXLAN encapsulation and the hop-by-hop routing paradigm, and the taxonomy of PE device types had to be extended somewhat: the MPLS implementation assumes that all PEs support virtual network routing functions, while the VXLAN implementation adds a simplified PE type that only switches within a virtual network [4].

VXLAN and EVPN are standards [5, 6], so you can hope for consistent operation in a multi-vendor network [7]. The first standard is interesting because it describes the details of Ethernet traffic encapsulation; it concerns the things we see "on the wire". The second standard is much more voluminous: it describes the rules for exchanging MAC address information in the manner of IP prefixes and proposes a model for supporting multiple tenants or services on a common network. No separate protocol was invented for all this; the authors decided to extend BGP with new entities. Put simply, VXLAN is the encapsulation on the wire, the data plane, while EVPN is the set of rules by which switches carry layer 2 traffic over the IP network, the control plane.
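
As an illustration of "MAC addresses as IP prefixes", here is a toy model of the key EVPN route, the Type-2 MAC/IP advertisement from RFC 7432. The field names follow the standard conceptually, but this is my simplified data structure, not a wire format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MacIpRoute:
    """Simplified model of an EVPN Type-2 (MAC/IP Advertisement) route."""
    rd: str            # route distinguisher, keeps tenants/services apart
    esi: str           # Ethernet Segment Identifier (all zeros if single-homed)
    mac: str           # the MAC address being advertised
    ip: Optional[str]  # optional IP binding of the same host
    vni: int           # VXLAN Network Identifier (plays the role of the label)
    next_hop: str      # VTEP address of the advertising switch

route = MacIpRoute(
    rd="10.0.0.1:100",
    esi="00:00:00:00:00:00:00:00:00:00",
    mac="52:54:00:aa:bb:cc",
    ip="192.0.2.10",
    vni=10042,
    next_hop="10.0.0.1",
)
print(route)
```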

Having only VLAN tags at your disposal, you can operate a number of services or segments bounded above by a 12-bit number, i.e. 4096. VXLAN allocates 24 bits to identify segments, so the maximum number of VNI instances (the VXLAN analogue of the VLAN tag) is 16 million. Do not think of the first number as unattainable or of the second as uselessly large: in a data center network with many services, an estimate of the number of required segments may well approach the upper bound of the VLAN count. In this quantitative respect the comparison resembles IPv4 versus IPv6.
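
The numbers themselves are a one-liner to verify:

```python
# 12-bit VLAN tag vs 24-bit VNI.
vlan_ids = 2 ** 12
vni_ids = 2 ** 24
print(f"VLAN IDs: {vlan_ids}")               # 4096
print(f"VNIs:     {vni_ids:,}")              # 16,777,216
print(f"ratio:    {vni_ids // vlan_ids}x")   # 4096x
```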

EVPN/VXLAN supports carrying layer 2 traffic over multiple paths (ECMP). If you recall that VXLAN is a tunnel, it becomes clear that this property is inherited quite naturally: it is enough to put the tunnel's destination into the IP header of the outer packet, and transit switches get the opportunity to spread traffic flows across all available paths. VXLAN encapsulation is specifically designed so that distributing flows does not require serious inspection of transit packets. The transport header is used for this purpose: the entropy field, the outer UDP source port, serves as the flow identifier. This field is filled in at the edge switches, so transit switches do not have to look deep into the packet contents, which lowers the requirements on their switching chips. This means that you can build high-performance IP fabrics from general-purpose switches.
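
RFC 7348 recommends deriving the outer UDP source port from a hash of the inner frame's headers, so that ordinary per-flow ECMP on transit switches keeps one flow on one path while spreading different flows across all paths. A minimal sketch of that idea (the hash choice and port-range handling here are illustrative):

```python
import zlib

def outer_udp_source_port(inner_headers: bytes) -> int:
    """Map a hash of the inner frame onto the ephemeral port range,
    so transit switches get flow entropy from the outer header alone."""
    ephemeral_min, ephemeral_max = 49152, 65535
    span = ephemeral_max - ephemeral_min + 1
    return ephemeral_min + zlib.crc32(inner_headers) % span

# Two different inner flows usually land on different outer ports,
# so per-flow ECMP spreads them without deep packet inspection.
flow_a = bytes.fromhex("525400aabbcc" "525400112233")  # inner dst + src MAC
flow_b = bytes.fromhex("525400aabbcc" "525400445566")
print(outer_udp_source_port(flow_a), outer_udp_source_port(flow_b))
```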

I have already written about how Overlay networks affect the scalability of the data center network core; let's look at another aspect of this concept: reducing the amount of broadcast traffic. This part of EVPN's functionality gives rise to the most misunderstandings, but in reality everything is simple. EVPN switches learn MAC addresses in software, through the exchange of BGP messages: after a new MAC address appears on an access port, the switching tables are synchronized across the entire network. By the time the server starts sending traffic, the switching tables already contain up-to-date information, so the remote side does not need to replicate a frame with an unknown unicast destination address to all data center access ports. This is a significant difference from the classical method of "blind" table filling as traffic arrives: it not only saves bandwidth on links and access ports, but also removes the overhead of processing layer 2 frames on servers that should never have received them.
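
A toy model of the difference (my own sketch, not any vendor's implementation): when a leaf learns a MAC locally, it "advertises" it to every peer, so remote tables are populated before any data traffic arrives and unknown-unicast flooding never happens:

```python
class Leaf:
    """Toy EVPN leaf: local MAC learning is propagated in the control plane."""

    def __init__(self, name: str, vtep_ip: str, fabric: list):
        self.name, self.vtep_ip = name, vtep_ip
        self.mac_table = {}  # MAC -> local port or remote VTEP
        self.fabric = fabric
        fabric.append(self)

    def learn_local(self, mac: str, port: str) -> None:
        self.mac_table[mac] = port
        # The equivalent of a BGP EVPN Type-2 advertisement to all peers:
        for peer in self.fabric:
            if peer is not self:
                peer.mac_table[mac] = f"vxlan -> {self.vtep_ip}"

fabric = []
leaf1, leaf2 = Leaf("leaf1", "10.0.0.1", fabric), Leaf("leaf2", "10.0.0.2", fabric)
leaf1.learn_local("52:54:00:aa:bb:cc", "eth1")
print(leaf2.mac_table)  # already populated: no unknown-unicast flooding
```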

Software learning of MAC addresses also makes them more mobile: the standard provides procedures, free of the drawbacks of broadcast replication, for preparing the switching tables when a MAC address moves from one access port to another, as usually happens during the live migration of a virtual machine.
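
RFC 7432 handles moves with a sequence number carried in the MAC Mobility extended community: the advertisement from the new location carries a higher number, so every switch can tell fresh state from stale without any flooding. A toy model:

```python
# MAC -> (sequence number, location); the higher sequence number wins.
best = {}

def receive_advertisement(mac: str, seq: int, location: str) -> None:
    if mac not in best or seq > best[mac][0]:
        best[mac] = (seq, location)  # newer advertisement replaces the old one

receive_advertisement("52:54:00:aa:bb:cc", 0, "vtep 10.0.0.1")  # VM boots
receive_advertisement("52:54:00:aa:bb:cc", 1, "vtep 10.0.0.2")  # VM migrates
receive_advertisement("52:54:00:aa:bb:cc", 0, "vtep 10.0.0.1")  # stale, ignored
print(best)  # {'52:54:00:aa:bb:cc': (1, 'vtep 10.0.0.2')}
```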

Convergence time and reaction to topology changes should be considered separately for the Overlay and Underlay components. For the Underlay network, try to adhere to the generally accepted practice of building routed networks; applied to an IP fabric, it comes down to a few simple recommendations:

- connect the switches with routed point-to-point links;
- run one simple, well-understood routing protocol with ECMP enabled, so that all equal-cost paths carry traffic;
- use fast failure detection, for example BFD, so that rerouting does not have to wait for protocol timers.
Of greater interest for the topic of this article is how the software model of MAC learning is applied in the Overlay network in the scenario of an access port coming up or going down. The access switches to which the servers are connected exchange not only MAC address information but also topological information. A set of access ports on two or more switches connected to the same server is designated by an Ethernet segment number (Ethernet Segment Identifier, ESI), which is unique within the broadcast domain. This allows remote switches to associate a destination MAC address with an Ethernet segment number and to send traffic to any of the switches that declared themselves attached to that segment, even if they never explicitly advertised the MAC address on their access port.
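
This behavior is known in the standard as aliasing. A toy model of the lookup chain (identifiers are illustrative): the MAC resolves to an ESI, the ESI resolves to every VTEP that advertised attachment to the segment, and a per-flow hash picks one of them:

```python
mac_to_esi = {"52:54:00:aa:bb:cc": "ESI-1"}
# Populated from per-segment advertisements, not from per-MAC ones:
esi_members = {"ESI-1": ["10.0.0.1", "10.0.0.2"]}

def pick_vtep(mac: str, flow_hash: int) -> str:
    members = esi_members[mac_to_esi[mac]]
    return members[flow_hash % len(members)]  # per-flow ECMP across the segment

print(pick_vtep("52:54:00:aa:bb:cc", flow_hash=7))   # 10.0.0.2
print(pick_vtep("52:54:00:aa:bb:cc", flow_hash=10))  # 10.0.0.1
```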

When a port fails or is disconnected, the switch signals the event with a single withdrawal message for the Ethernet segment, so remote switches can update their switching tables without waiting for a whole set of withdrawal messages to be generated and processed for each MAC address individually.
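
Continuing the previous toy model, this "mass withdrawal" is a single update on the segment, after which the path for every MAC behind it is repaired at once:

```python
esi_members = {"ESI-1": {"10.0.0.1", "10.0.0.2"}}
macs_on_esi = {"ESI-1": ["52:54:00:aa:bb:cc", "52:54:00:11:22:33"]}

def withdraw_segment(esi: str, vtep: str) -> None:
    """One per-segment withdrawal fixes forwarding for all MACs behind it."""
    esi_members[esi].discard(vtep)

withdraw_segment("ESI-1", "10.0.0.1")
for mac in macs_on_esi["ESI-1"]:
    print(mac, "->", sorted(esi_members["ESI-1"]))  # only 10.0.0.2 remains
```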

The text so far has mostly concerned the transfer of OSI layer 2 traffic, but this is far from the only area where EVPN applies; now I will explain why this technology is described as a way of integrating routing and switching. Let's think about what prevents the presence of two or more routers in one layer 2 network segment. There may be many answers, but one of them sounds like this: at OSI layer 2 there is no way to send traffic along multiple paths. A switch needs to know exactly which port leads to the MAC address of the default router, since in a classic Ethernet network this address cannot be active on different ports at the same time. But, as we saw earlier, the EVPN developers implemented Active-Active traffic forwarding on top of the MAC information distribution mechanism, so two or more switches can announce to their EVPN neighbors the local presence of the default router's MAC address within the VXLAN network. Routing between networks can then be carried out in Active-Active mode on multiple switches that have declared themselves default routers. These devices behave like a distributed router, each part of which has the same IP and MAC address, an Anycast Gateway; the problem turns into an advantage. EVPN provides mechanisms for exchanging not only MAC but also IP information, so the tables and behavior of the distributed router will be identical on all of its components.
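
A toy model of the Anycast Gateway (addresses are illustrative): every leaf owns the same gateway IP and MAC, so a host resolves ARP once and keeps the same binding no matter which leaf it lands behind after a migration:

```python
GATEWAY_IP = "192.0.2.1"
GATEWAY_MAC = "00:00:5e:00:01:01"  # shared by every leaf in the fabric

class AnycastLeaf:
    def __init__(self, name: str):
        self.name = name

    def arp_reply(self, target_ip: str):
        # Each leaf answers locally for the shared gateway address.
        return GATEWAY_MAC if target_ip == GATEWAY_IP else None

for leaf in (AnycastLeaf("leaf1"), AnycastLeaf("leaf2")):
    print(f"{leaf.name} answers ARP for {GATEWAY_IP}: {leaf.arp_reply(GATEWAY_IP)}")
```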

Applied to the symmetric topology of a data center, routing is usually implemented in one of four ways:


The first option is characterized by fairly simple application inspection (Security Insertion) and Service Chaining mechanisms [8], since the service block of the data center network subsystem interacts with a relatively small number of Spine switches, where the traffic delivery policies are implemented.

In the second option, routing is performed on the Leaf switches, and each Leaf routes its locally attached Overlay networks. The Spine switches then work purely at the IP level as fabric transit and can additionally serve as BGP route reflectors (BGP-RR).

To summarize: VXLAN builds Overlay networks on top of a routed IP underlay, and EVPN gives them a control plane, together forming EVPN/VXLAN fabrics. Active-Active forwarding and the Anycast Gateway remove the classic limitations of Ethernet, which is why, in my opinion, EVPN/VXLAN deserves close attention for the data center.

[1] RFC 7364, "Problem Statement: Overlays for Network Virtualization"

[2] RFC 7365, "Framework for Data Center (DC) Network Virtualization"

[3] RFC 7209, "Requirements for Ethernet VPN (EVPN)"

[4] draft-ietf-bess-evpn-overlay, "A Network Virtualization Overlay Solution Using EVPN"

[5] RFC 7348, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks"

[6] RFC 7432, "BGP MPLS-Based Ethernet VPN"

[7] EANTC, "Multi-Vendor Interoperability Test", 2017

[8] draft-ietf-bess-service-chaining-03, "Service Chaining Using Virtual Networks with BGP VPNs"

Source: https://habr.com/ru/post/352564/

