VMmark

Before we take a look at our own virtualization benchmarking, let us look at the VMmark scores available at the time of writing (beginning of August 2010).

VMware VMmark

It is interesting to note that most of the AMD “Istanbul” Opteron servers benchmarked were using DDR2-667, which somewhat limited their VMmark scores, as consolidated virtualized servers have higher bandwidth demands than most “natively running” servers. The dual Opteron 6176 has the same number of cores as the quad Opteron 8439, and those cores are identical; only the uncore part has changed. So from a pure processing power point of view, the dual Opteron 6176 should be about 15% slower. In reality, the dual-socket server is 3% faster than the older quad-socket server. This shows that VMmark really benefits from the improved memory subsystem: the support for DDR3-1333 memory essentially doubles the bandwidth and lowers latency. That is still not enough to beat the Intel armada, as the fastest “Westmere” Xeon is about 16% faster than the best “Magny-Cours” Opteron.

The quad Xeon X7560 leaves everything behind in VMmark, offering more than twice the performance of any dual configuration. Virtualization favors high core counts: you are running many different applications that, most of the time, do not have to exchange data, which reduces the thread synchronization overhead. Even so, the scores the Xeon X7560 achieves are impressive. Of course, this is VMmark, an industry benchmark, and the results also depend on how much time and effort is spent tuning it. Since the introduction of the Xeon X7500 series, the VMmark scores have already improved by 7% (from 70.78 to 75.77). Let us check out vApus Mark II, where each platform is treated the same.

vApus Mark II

vApus Mark II—VMware ESX 4.0

The overall picture remains the same, although there are some clear differences. First of all, the Opteron “Magny-Cours” and the Xeon “Westmere” are closer to each other: the difference between the two best server CPUs with a “decent” TDP is only 4%. But the real surprise is the landslide victory of the X7560. Let us analyze the results in more detail.

For the OLAP test, we took a dual Xeon X5570 without Hyper-Threading as the reference. The reason is that the VM got eight vCPUs, and we compare it with a native server that has eight cores. For the web test, we used two Xeon X5570 cores as the reference, in other words a Xeon X5570 cut in two. The OLTP scores, obtained in a VM with four virtual CPUs, use the Swingbench scores of one Xeon X5570 as the reference. The reason we chose the Xeon “Nehalem” as the reference is that this server CPU is the natural yardstick for all new server CPUs: at its launch (March 2009) it outperformed all contemporary server CPUs by a large margin.

Let us take a look at the more detailed results per VM. The vApus Mark II score is a geometric mean of the per-VM results.

CPU config    Tiles   OLAP (1 VM)   Web (3 VMs)   OLTP (1 VM)   vApus Mark II score
Dual 6174       2        57%           30%           22%              67.5
Dual 6136       2        45%           23%           14%              48.6
Dual X7560      2        58%           51%           32%              91.8
Dual X5670      2        53%           43%           19%              70.0
Dual L5640      2        48%           33%           15%              57.6
Quad X7560      2        73%           73%           39%             118.6
Quad X7560      4        47%           50%           29%             162.7

(Per-VM percentages are relative to the Xeon X5570 reference described above.)
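
As a sanity check on the scores above: they can be reproduced fairly closely by taking the geometric mean of the three columns (with the Web column counted once) and multiplying by the number of tiles. A minimal Python sketch under that assumption; this is a reconstruction from the published numbers, not necessarily the exact vApus Mark II formula:

```python
def vapus_score(olap: float, web: float, oltp: float, tiles: int) -> float:
    """Approximate vApus Mark II score: geometric mean of the per-VM results
    (each expressed relative to the Xeon X5570 reference), scaled by the
    number of tiles. A reconstruction, not the official formula."""
    geo_mean = (olap * web * oltp) ** (1.0 / 3.0)
    return geo_mean * tiles * 100

# Dual Opteron 6174, 2 tiles: published score 67.5
print(round(vapus_score(0.57, 0.30, 0.22, tiles=2), 1))  # ~67.0 (table percentages are rounded)

# Quad Xeon X7560, 4 tiles: published score 162.7
print(round(vapus_score(0.47, 0.50, 0.29, tiles=4), 1))  # ~163.4
```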

The ESX scheduler works with Hardware Execution Contexts (HECs), which map to one logical (Hyper-Threading) or physical core. In our current test, more HECs are demanded than are available, so this test is quite hard on the ESX scheduler. We still have to investigate why the OLTP scores are quite a bit lower than those of the other VMs. This VM is the most disk intensive and as such requires more VMkernel time than the others, which might explain why less processing power is left for the application running inside the VM. Another reason is that this application requires more “co-scheduling”: in OLTP applications, threads rarely run independently and have to synchronize frequently. In that case it is important that each virtual CPU gets equal processing power. If one vCPU gets ahead of the others, a thread may end up waiting longer than necessary for another to release a spinlock.
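
To make the spinlock scenario concrete, here is a small, purely illustrative Python sketch (not part of the benchmark or of ESX): one thread holds a toy spinlock and is artificially delayed, standing in for a vCPU that the hypervisor has descheduled, while a second thread burns CPU time spinning until the lock is released.

```python
import threading
import time

lock_held = True  # shared flag acting as a toy spinlock

def holder(delay_s: float) -> None:
    """The lock owner: the sleep stands in for a vCPU that is descheduled."""
    global lock_held
    time.sleep(delay_s)   # 'descheduled' for delay_s seconds
    lock_held = False     # lock finally released

def spinner() -> None:
    """Another vCPU busy-waiting on the lock: pure wasted CPU time."""
    spins = 0
    start = time.perf_counter()
    while lock_held:      # spin until the holder releases the lock
        spins += 1
    waited_ms = (time.perf_counter() - start) * 1000
    print(f"spun {spins} times, wasted {waited_ms:.1f} ms")

# The longer the holder stays 'descheduled', the more CPU the spinner wastes;
# co-scheduling both threads (vCPUs) keeps this window short.
t_spin = threading.Thread(target=spinner)
t_hold = threading.Thread(target=holder, args=(0.05,))
t_spin.start()
t_hold.start()
t_hold.join()
t_spin.join()
```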

Although ESX 3.5 and 4.0 feature “relaxed co-scheduling”, the best performance for these kinds of applications is achieved when the scheduler can co-schedule the syncing threads. The fact that the system with the highest logical core count gets the best percentages in the OLTP VM is another indication that the co-scheduling issue may play an important role. Notice how the dual Xeon X7560 with 32 threads does significantly better than the higher-clocked Xeon X5670 (24 threads) when running the OLTP VM: while the overall performance of the dual Xeon X7560 is 31% better than that of the Xeon X5670 (91.8 vs. 70.0), its OLTP performance is almost 70% (!) better. Another indication is consistency: the differences between the VMs are much smaller on the dual Xeon X7560.

The AMD systems show a similar picture. The 16-core dual Opteron 6136, despite its decent 2.4GHz clock speed, delivers the lowest OLTP performance, as it has the fewest threads to offer the scheduler. The dual Opteron 6174 runs at a 9% lower clock speed but has 24 cores to offer, and the result is that the OLTP VM performs a lot better (more “perfect” co-scheduling is possible): we noticed 57% higher OLTP performance. The OLTP VM was even faster on the dual 6174 with its 24 “real” cores than on the Xeon X5670. Although this is only circumstantial evidence, we have strong indications that transactional workloads favor high core and thread counts.

Our measurements show that the quad Xeon X7560 is about 2.3 times faster than the best dual-socket platforms. That makes a single quad Xeon X7560 server a very interesting alternative to buying two dual-CPU servers for virtualization consolidation.

vApus Mark II Conclusion

Comments

  • haplo602 - Wednesday, August 11, 2010

    This is one of the bottlenecks of your virtualised environment. A storage solution is only the limit if you do not use it as it was designed to be used.

    The more IO-demanding applications you have, the less virtualisation is going to offer any benefits. Usually CPU power is the last issue, after network, disk and memory.

    I had a good laugh at the opening page. High-end servers are high end not because of the increased performance but because of the better management and disaster tolerance/recovery they offer. After all, they use the same CPUs and memory as the low-end servers; just about everything else is different (OLRAD, hot swap/plug of almost anything except memory and CPU).
  • webdev511 - Thursday, August 12, 2010

    Well, if you're willing to spend some more money on solid state (if you go with two twelve-core CPUs you'll save on licences), you could stuff four of the new Fusion-io 1.28 TB Duo drives into the box, map them as system drives, and then use attached storage for big files.
  • SomeITguy - Wednesday, August 11, 2010

    No offense intended, and I know this will put you on the defensive, but it sounds to me like the "development environment" was ill-conceived in the design phase. You obviously overbought on processor power. The first step in designing an environment is knowing what your apps need. You can't just buy servers and then whine about how poorly the performance matches the overall system capability...

    At my last job I had Citrix Xen on HP blades with 53xx and 54xx CPUs, running about 150 production VMs, and on the order of 300+ total with R&D and QA. The company had no money, and because of that we only ran local storage for the OS and most functions. The shared data we did have was on NetApps, and that alone constantly spiked to 25k+ IOPS. I can't remember where each blade sat on IOPS, but it was high. I was able to balance resource utilization for most of the day to about the ~60% level, with spikes hitting the high 80s, so no resources were being overly wasted. To do this effectively takes time and patience; you need to economize. 12 VMs on a blade with 16GB of memory was not unheard of...

    Then there is the whole ESX thing, eh, I won't get into that. Again, you need to know what is going to run on the servers before you spend (waste) money.

    In my experience, it's typical that managers just override the lowly sysadmin's advice, take a vendor's word over that of the sysadmin who manages the app, or a business unit buys you the equipment without consulting you and then says "here, make it work".

    Overall, I thought the article was good. It is just a guide, not a bible.
  • davegraham - Tuesday, August 10, 2010

    So, I'm sitting here with a spanking new Dell R815, which is a quad-socket G34 system and is shipping today with AMD Opteron 6176 SE parts...so this article is outdated even before it begins. (Oh, did I mention it's only 2RU?)

    I'm also very curious as to what the underlying storage is for all these tests, as it definitely can have an impact on the serviceability of the testing.

    I'm curious as to the details per VM as well...IOMMU choices, HT sharing, NUMA settings, as well as the version of ESX being used?

    dave
  • JohanAnandtech - Wednesday, August 11, 2010

    "So, i'm sitting here with a spanking new Dell R815 which is a quad socket G34 system and is shipping today w/ AMD Opteron 6176SE parts...so, this article is outdated even before it begins. (oh, did i mention it's only 2RU?)"

    Testing servers is not like testing video cards. I cannot plug the R815 into a ready-installed Windows PC and push a "Servermark" button; it does not work that way, as you indicate yourself. A complete storage system must be set up, and in many cases ESX fails to install the first time on a brand new server. We also perform a whole battery of monitoring tests, for example to confirm that the DQL (disk queue length) is low enough.

    The storage system we use for the 4-tile test is an 8-disk SSD system for the OLTP tests (described in this article). The VMs themselves sit on a separate RAID controller connected to a Promise JBOD, which holds eight 15,000 rpm SAS disks. The only really disk-intensive app in this test is Swingbench, and by making sure both data and logs get their own separate SSD, we achieve DQLs under 0.1. There is a lot more to the Oracle config, but if you are interested, we can share the parameter file.

    Anyway, the low DQL and the fact that we scale well from 2 to 4 tiles show that we are not limited by the disks.
  • davegraham - Wednesday, August 11, 2010

    Johan,

    I work with VMware for a living, doing platform testing for the product I support. ;) Consequently, I'm very well aware of the requirements for testing VMware and the various and sundry components within the server. Hence my slightly critical view of what you're doing here.

    Appreciate the response on the storage...again, all well and good with that explanation.

    I'll put my quad-socket 6176 SE system against your 7500 system any day, and I'll enjoy a lower rack footprint, lower power consumption, and a positively brilliant VMware experience. ;)

    Keep up the good work.

    dave
  • blue_falcon - Wednesday, August 11, 2010

    If you want to do a similar 2U config, try the R810; it only has 32 DIMM sockets but is nearly identical to the R910.
  • mapesdhs - Tuesday, August 10, 2010


    Johan, how would this system compare to a low-end quad-socket Altix UV 10 (max RAM = 512GB)?

    Ian.
  • JohanAnandtech - Wednesday, August 11, 2010

    I have never tested an SGI server, so I cannot say for sure, but the hardware looks (and probably is) identical to what we have tested here.
  • Casper42 - Wednesday, August 11, 2010

    Due to the way Dell implemented the memory on their latest quad-socket machines, if you run two CPUs with the FlexMem bridge you get full memory bandwidth, but half of the memory sockets are further away from the CPU because of the extra trace length of going to the empty CPU socket and through the FlexMem bridge.

    When you put in four CPUs, you only get half the memory bandwidth of an Intel reference design, because the traces that would normally go to the empty CPU socket and through the FlexMem bridge now go essentially nowhere, as the CPU in that socket needs the access instead.

    I would say try IBM or HP. Just beware that IBM does some weird stuff with their MAX5 memory expansion module that can also cause additional memory latency for some of the DIMM sockets and not others.
