Cisco HyperFlex Stretched Cluster Performance Testing – IOMeter

I built my first HyperFlex hybrid stretched cluster last week and was interested in getting some IO benchmarks. Just how many IOPS can I get out of the system? I had previously done similar testing on a vSAN ready node and could easily reach the maximum IOPS of that system, so I was interested to see if I could do the same here.

I had the latest HyperFlex HX240c-M5SX nodes with 40Gb UCS 6332 Fabric Interconnects. Each HyperFlex node has 40Gb links to each UCS Fabric Interconnect, so network bandwidth is definitely not going to be the bottleneck.

HyperFlex M5 40Gb Stretched Cluster Configuration

  • 2 sites – Site1 and Site2 – stretched via VXLAN. Latency between sites is <1 ms.
  • Each site has a pair of 40Gb UCS Fabric Interconnects (total 4 x UCS FI6332 – 2 per site).
  • Each site has 4 x HyperFlex HX240c-M5SX converged nodes (total 8 x HX240M5 – 4 per site).
  • Each server has 1 x 1.6TB cache SSD and 8 x 1.8TB capacity SAS HDDs.
  • Available storage capacity (before deduplication and compression savings) is 24TB; a rough capacity calculation follows this list.
  • Compression and deduplication are enabled by default on HyperFlex.
  • HyperFlex stretched clusters have a concept of datastore locality, where one of the sites is nominated as the master site for each datastore.
  • 2 datastores are created, one with locality set to each site (HX-site1-DS01 and HX-site2-DS01).
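
The quoted 24TB can be sanity-checked with a quick back-of-the-envelope calculation. This is only a sketch: the replication factor of 2 per site and the ~15% filesystem/metadata overhead are my assumptions for illustration, not figures taken from HX Connect.

```python
# Rough capacity check for the stretched cluster described above.
# The replication factor and overhead values are assumptions, not HX Connect figures.

NODES = 8                     # 4 converged nodes per site x 2 sites
CAPACITY_DISKS_PER_NODE = 8   # 1.8TB SAS HDDs (the cache SSD does not count towards capacity)
DISK_TB = 1.8

RF_PER_SITE = 2               # assumed replication factor within each site
SITES = 2                     # data is mirrored across both sites
FS_OVERHEAD = 0.15            # assumed filesystem/metadata overhead (~15%)

raw_tb = NODES * CAPACITY_DISKS_PER_NODE * DISK_TB
copies = RF_PER_SITE * SITES
usable_tb = raw_tb / copies * (1 - FS_OVERHEAD)

print(f"Raw disk capacity: {raw_tb:.1f} TB")     # 115.2 TB
print(f"Copies per block:  {copies}")            # 4
print(f"Usable estimate:   {usable_tb:.1f} TB")  # ~24.5 TB, in line with the 24TB above
```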

Test Scenario 1 – 10 VMs (10 x 100GB disk usage) / Disk locality

  • 10 x VMs (2 vCPU / 4GB RAM / 100GB disk) running Windows 2016 Standard with IOMeter.
  • VM placement on the 8-node ESXi cluster is as follows:
    • iometer01 and iometer09 – Host site1-esx01 – Disk locality adhered
    • iometer02 – Host site1-esx02 – Disk locality adhered
    • iometer03 – Host site1-esx03 – Disk locality adhered
    • iometer04 – Host site1-esx04 – Disk locality adhered
    • iometer05 and iometer10 – Host site2-esx01 – Disk locality adhered
    • iometer06 – Host site2-esx02 – Disk locality adhered
    • iometer07 – Host site2-esx03 – Disk locality adhered
    • iometer08 – Host site2-esx04 – Disk locality adhered
      *Disk locality adhered – the VM's disks are placed on the datastore whose master (locality) site matches the site of the ESXi host running the VM.
  • IOMeter is set up on each VM with 5 worker threads.
  • Testing is based on an 8K block size with a 67% read / 33% write ratio and a 50/50 sequential/random distribution.
  • As each VM has a 100GB disk, the 10 VMs total 1TB, which provides a basis for estimating IOPS/TB (a small result-aggregation sketch follows this list).
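
To arrive at the cluster-wide numbers I simply summed the per-VM IOMeter results. A minimal sketch of that arithmetic is below; the file name and column names (read_iops, write_iops) are hypothetical placeholders, as IOMeter's own export format will differ.

```python
import csv

# Sum per-VM IOMeter results into cluster-wide totals and an IOPS/TB figure.
# The CSV file name and column names are placeholders for whatever format
# you export the per-VM results in.

PROVISIONED_TB = 10 * 0.1  # 10 VMs x 100GB disk = 1TB

def aggregate(path: str) -> None:
    read_iops = write_iops = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            read_iops += float(row["read_iops"])
            write_iops += float(row["write_iops"])
    total = read_iops + write_iops
    print(f"Read IOPS:  {read_iops:,.0f}")
    print(f"Write IOPS: {write_iops:,.0f}")
    print(f"Total IOPS: {total:,.0f}")
    print(f"IOPS/TB:    {total / PROVISIONED_TB:,.0f}")

if __name__ == "__main__":
    aggregate("iometer_per_vm_results.csv")
```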

Test Results

  • Test results show a total of approximately 20K IOPS (13K read + 6.5K write) for the 1TB workload run.
    However, this is not the maximum IOPS of the system, as total IOPS increases further whilst cloning tasks are running.
  • Performance stats per VM are shown below:
    VM iometer07 resides on site2-esx03 alone and displays approximately 2K total IOPS (1.4K read + 0.6K write).
  • VMs iometer05 and iometer10 both reside on site2-esx01 and each displays approximately 2K total IOPS (1.4K read + 0.7K write), similar to VM iometer07. This demonstrates the distributed caching ability of HyperFlex, whereby each VM consumes the cache of the entire cluster rather than only the cache on the host where the VM resides.
  • Compression and deduplication stats are below.
    As each IOMeter VM is deployed from the same VM template, maximum deduplication and compression savings can be achieved, as shown by the 96% storage optimization figure.
    The used capacity displayed is only 200GB, whereas the deployed VMs equate to 1TB in total (a rough reconciliation of these figures follows this list).
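
At first glance, 200GB used against 1TB deployed looks like an 80% saving rather than 96%. One possible way to reconcile the two, and this is purely my assumption rather than a documented HX Connect formula, is that the optimization figure is measured against the logical footprint including the replica copies (four copies of each block in a stretched cluster):

```python
# Rough reconciliation of the 96% storage optimization figure.
# Assumption: the savings are measured against the logical footprint including
# replicas (RF2 per site x 2 sites = 4 copies). This is a guess at the formula,
# not something taken from Cisco documentation.

deployed_tb = 1.0   # 10 x 100GB thick-provisioned VMs
copies = 4          # assumed: RF2 at each of the two sites
used_tb = 0.2       # ~200GB used capacity reported by HX Connect

logical_tb = deployed_tb * copies
savings = 1 - used_tb / logical_tb
print(f"Estimated optimization: {savings:.0%}")  # 95%, close to the reported 96%
```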

Test Scenario 2 – 10 VMs (10 x 100GB disk usage) + Storage vMotion

  • The test from the previous scenario continues to run, and Storage vMotion is initiated for all 10 VMs at once.
    NOTE: This should never be done in a production environment, as simultaneous Storage vMotion tasks introduce significant storage latency.
  • The aim of this test is to see whether the total IOPS can be increased further by using Storage vMotion (a sketch of scripting the migrations follows this list).
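
For reference, kicking off all the migrations at once can be scripted. The sketch below uses pyVmomi; the vCenter address, credentials, VM names and target datastore are placeholders for this environment, and in practice the migrations were simply started from the vSphere Client.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Minimal pyVmomi sketch: start a Storage vMotion for each iometer VM.
# The vCenter address, credentials and target datastore below are placeholders.

VC, USER, PWD = "vcenter.lab.local", "administrator@vsphere.local", "password"
TARGET_DS = "HX-site2-DS01"
VM_NAMES = [f"iometer{i:02d}" for i in range(1, 11)]

ctx = ssl._create_unverified_context()   # lab only; validate certificates in production
si = SmartConnect(host=VC, user=USER, pwd=PWD, sslContext=ctx)
content = si.RetrieveContent()

def find(vimtype, name):
    """Return the first managed object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

datastore = find(vim.Datastore, TARGET_DS)
for vm_name in VM_NAMES:
    vm = find(vim.VirtualMachine, vm_name)
    spec = vim.vm.RelocateSpec(datastore=datastore)
    vm.RelocateVM_Task(spec)             # fire off all 10 migrations without waiting
    print(f"Storage vMotion started for {vm_name}")

Disconnect(si)
```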

Test Results

  • Test results show a total of 67K IOPS for the 1TB workload run plus Storage vMotion, which is more than triple the total IOPS from the previous test.
  • There is a massive latency spike at the start, when the Storage vMotion tasks are initiated for all 10 VMs.
  • During the latency spike, the IOPS for 2 of the VMs dropped to around 300, whilst the other VMs were able to maintain approximately 1.5K IOPS each.

Test Scenario 3 – 10 VMs (10 x 100GB disk usage) / No disk locality

  • The test from scenario 1 is run again, but this time each VM's disks are placed on the datastore whose master site is opposite to that of the ESXi host running the VM.

Test Results

  • Test results show a total of approximately 10K IOPS, which is half of what was achieved in test scenario 1.
  • Performance stats per VM are below:
    VM iometer07, still residing on the same ESXi host as in test scenario 1 but with the datastore master now located at the opposite site, shows only approximately 1K IOPS.
  • VMs iometer05 and iometer10 show similar results.

Conclusion

The HyperFlex design, which uses a distributed cache across the cluster, allows for better performance once the cluster is large enough, compared with other hyper-converged products that only use the cache on the local host where the VM resides. The distributed cache design allows for more VMs per host and better scalability. 10 VMs running IOMeter continuously were insufficient to reach the maximum IOPS of the system.

It was found that disk locality is important in achieving the maximum IOPS. If disk locality is not adhered to, the total IOPS for that VM is roughly halved.

10 VMs with deployed storage equivalent to 1TB, with disk locality adhered, were able to achieve a total of approximately 20K IOPS. Assuming the same scaling holds as more VMs and storage are provisioned, this equates to roughly 20K IOPS/TB.
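
As a quick worked example of that arithmetic (the full-capacity figure is a naive linear projection under that assumption, not a measurement):

```python
# Worked example of the IOPS/TB estimate. The projected figure is a naive linear
# extrapolation under the stated assumption, not a measured value, and the next
# paragraph explains why it may not hold in practice.

total_iops = 20_000
provisioned_tb = 1.0            # 10 VMs x 100GB

iops_per_tb = total_iops / provisioned_tb
print(f"Measured:  {iops_per_tb:,.0f} IOPS/TB")

capacity_tb = 24                # cluster capacity from the configuration section
print(f"If linear: {iops_per_tb * capacity_tb:,.0f} IOPS at {capacity_tb}TB provisioned")
```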

However, this assumption may not hold true due to the high level of deduplication and compression, and also when additional capacity disks are added to a converged node whilst the cache disk remains unchanged.

Performance stats will need to be collected once this cluster runs production workloads, as these will be a better representation of customer production environments.

The results from these tests look promising, and I personally cannot wait to see the performance in a real production scenario.

NOTE: A vSphere stretched cluster requires vSphere Enterprise Plus licensing.

Caveats

The tests were conducted using the IOMeter application on VMs that were cloned from the same template. This allows for maximum compression and deduplication, which would not be achieved in standard production environments. Therefore, the compression and deduplication results should NOT be used as a benchmark, but rather as the maximum savings achievable if all VMs in the cluster were cloned from a single template.

The IOMeter tests were based on an 8K block size with a 67% read / 33% write ratio and a 50/50 sequential/random distribution, which may not be representative of actual production workloads.

NOTE: All VMs are thick provisioned as per the vSphere defaults; thin-provisioned VMs were not used.
