NETAPP vs VMWARE FLOW CONTROL DILEMMA

vQuicky

> Performance improvements were seen in environments running ESXi 5.1 with NetApp storage and 10G switches after flow control was disabled.

> VMware recommends leaving flow control enabled while NetApp best practice recommends disabling it if using 10G switches.

> VMware recommends investigating pause frames – a large number of them indicates an underlying problem.


 

inDepth

We recently saw some random datastore drops and other issues in our virtualized environment, which uses NetApp storage on the back end. Upon investigation and some deep-diving, we found that flow control was enabled across the entire stack.

There is another article worth reading that talks about NetApp sending too many pause frames.

Below is what NetApp has to say about flow control in its best practice guide.

Flow control is a low-level process for managing the rate of data transmission between two nodes to
prevent a fast sender from overrunning a slow receiver. Flow control can be configured on ESX/ESXi
servers, FAS storage arrays, and network switches. For modern network equipment, especially 10GbE
equipment, NetApp recommends turning off flow control and allowing congestion management to be
performed higher in the network stack. For older equipment, typically GbE with smaller buffers and
weaker buffer management, NetApp recommends configuring the endpoints, ESX servers, and NetApp
arrays with the flow control set to “send.”

So NetApp says that for modern 10GbE equipment, flow control should be turned off and congestion should be managed higher up in the stack. This makes sense now that applications are smart enough to manage data throughput and flow themselves. Hardware flow control is not application-aware unless you write your application to look out for it, and that may not be a good application design technique.

On the flip side, VMware actually asks you to leave flow control enabled by default in the hypervisor. Below is what VMware says about flow control.

Note: By default, flow control is enabled on all network interfaces in VMware ESX and ESXi. This is the preferred configuration. If there are large numbers of pause frames in an environment, it is usually indicative of an underlying issue that should be investigated.

VMware advises investigating pause frames on the hypervisor; if you see “too many,” there may be an issue lower in the stack. But there is no definite number for what “too many” really means. Below is what VMware says.

Pause Frames are related to Ethernet flow control and are used to manage the pacing of data transmission on a network segment. Sometimes, a sending node (ESX/ESXi host, switch, etc) may transmit data faster than another node can accept it. In this case, the overwhelmed network node can send pause frames back to the sender, pausing the transmission of traffic for a brief period of time.

If flow control is undesired in an environment, the support for flow control can be disabled for a given network interface or driver on the ESX/ESXi hosts. The method differs for different drivers.
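Before deciding either way, it helps to check whether pause frames are actually being exchanged on your uplinks. A hedged sketch of how this can be done from the ESXi shell (the vmnic name is illustrative, and whether pause counters appear at all depends on the NIC driver):

```shell
# List the physical NICs to find the one carrying storage traffic
esxcfg-nics -l

# Dump driver statistics for that NIC and filter for pause counters.
# Counter names vary by driver; some drivers expose none at all.
ethtool -S vmnic0 | grep -i pause
```

If the pause counters climb steadily while a datastore is under load, that is the symptom both vendors are describing.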

So, all in all, my recommendation is to have flow control disabled on your hosts, switches, and even on the storage end. This can remove a lot of problems, including dropped LUNs and NFS mounts.
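For the switch and storage layers, here is a hedged sketch of what disabling flow control can look like (Cisco NX-OS and Data ONTAP syntax shown for illustration; the interface and node names are made up, so check your own platform's documentation before applying anything):

```shell
# Cisco NX-OS switch port (commands run in interface configuration mode)
#   interface Ethernet1/1
#   flowcontrol receive off
#   flowcontrol send off

# NetApp Data ONTAP 7-Mode: disable flow control on a 10GbE port
ifconfig e0a flowcontrol none

# Clustered Data ONTAP equivalent
network port modify -node node01 -port e0a -flowcontrol-admin none
```

Flow control is negotiated per link, so it should be changed on both ends of each link to avoid a mismatched configuration.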

To disable flow control on the hypervisor, the FlowControl module parameter takes the following values:

0 – No flow control.
1 – Flow control on received (RX) traffic only.
2 – Flow control on transmitted (TX) traffic only.
3 – Flow control on both received and transmitted traffic.

ESX/ESXi 3.0 to 4.1:
  • To disable flow control for all ports of a quad-port Intel e1000 network interface, set the option FlowControl=0,0,0,0 on the e1000 module:

    # esxcfg-module -s FlowControl=0,0,0,0 e1000
  • To disable flow control on Intel adapter 1 but enable transmit-only flow control on adapter 2:

    # esxcfg-module -s FlowControl=0,2 e1000
  • To enable transmit-only flow control on Intel adapter 1 and adapter 4:

    # esxcfg-module -s FlowControl=2,,,2 e1000
  • To revert the changes back to the original setting without flow control:

    # esxcfg-module -s "" ModuleName

    You can confirm that the command cleared the option for the particular module by running:

    # esxcfg-module -g ModuleName

    You should see output similar to:

    ModuleName options = ''

ESXi 5.0:
  • To disable flow control for all ports of a quad-port Intel e1000 network interface:

    # esxcli system module parameters set --module e1000 --parameter-string "FlowControl=0"
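To confirm the change, you can query the module's current parameters. A hedged sketch (the list command is the standard esxcli counterpart to the set command above; the note about reloading is a general expectation for driver module options rather than something stated in the KB):

```shell
# Show the options currently set on the e1000 module;
# the FlowControl parameter should now show a value of 0
esxcli system module parameters list --module e1000

# Module options are typically read when the driver loads,
# so a NIC driver reload or host reboot may be needed to apply them.
```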

Feel free to comment or correct me 🙂

The VMware KB article

The NetApp best practices guide.

9 Thoughts on “NETAPP vs VMWARE FLOW CONTROL DILEMMA”

  1. hey, good info here… I am currently in the design phase of a FAS3250, Cisco nexus 5k with 10gbe and Dell blades for ESX 5.1.
    I have my 10gbe VIFS created, they are shared. Jumbo is turned on on the fas and switches and vmware.
    now this flow control is confusing. I read netapp’s TR’s best practice to keep it Flowcontrol SEND on Filer and Receive on Switch side… is this what you have or you turned flow control off on VIF and switch port side? Let me know, thanks dude

    • Sorry for the late reply – yeah we have flow control turned off. Have you done that and found anything different? or better? I would turn off flow control because it has constantly been the issue for us.

  2. Hey there! Awesome article. Thanks a lot for sharing your experience. I’ve one question for you, before turning off Flow Control on all devices (NetApp, VMware Nodes, Network Switch): does any of the devices require a reboot for changes to take effect? Thanks!

  3. Hi, just curious… what protocols are you using in your VMware environment and what types of data stores were dropping? iSCSI? NFS? Both? Thanks

    • Hey Carlos, we only ran NFS for that specific customer, and that is where I saw this happen. However, I won't be surprised to see it happen with iSCSI as well. Think about it: just like the NFS mount, I present iSCSI over the same network. The issue here is with flow control on the switches, and I believe it will affect both iSCSI and NFS the same way.

  4. Jeff on June 24, 2015 at 3:54 pm said:

    Great article. What can cause an ESXI host to send pause frames. I see a lot of rx pause frames on the switch side overnight when one vm is being backed up ?

    • admin on July 9, 2015 at 11:46 am said:

      I think if ESXi is sending pause frames, that could mean a lot of things; my gut says the host is too busy to process incoming frames, causing it to send pause frames back. Check the CPU and resource utilization of the host.

  5. You're not going to use both links between an ESX host and the NetApp with teaming. You're only going to use one link per source or destination address. NetApp doesn't support exporting the same NFS share to multiple IPs using IP aliasing, so this is a real limitation for teaming. You may be able to do it, but it's not supported. So you really are limited to 1 Gbps, even if you have aggregated multiple links.
