Nutanix: No More SAN

It is not a secret that even though a SAN comes with all the virtues of centralized data, network storage comes with a network bandwidth and latency penalty. By simply attaching a flash array directly to a system (DAS), we can measure the extra latency compared to SAN (Storage Area Network). It amounts to between 0.3 and 0.8 ms depending on whether you use Fibre Channel or iSCSI over copper wires.

Average Latency of an SSD, DAS vs Networked storage

So the minimal latency was 50% to 100% higher in a lightly loaded SAN than when the same SSD was running inside the server. However, this was the minimal latency. This can quickly grow to several milliseconds when the network load goes up.

Nutanix believes that virtualized servers should use local storage, clustered together in a virtual storage pool. Each of the virtual machines connects to a storage VM. That storage VM is typically an iSCSI target inside a VM, also called a VSA or Virtual Storage Appliance. The VSA on each server node are clustered together by the Nutanix Distributed File System (NDFS). NDFS makes sure that if one node dies, the other nodes are still able to access the necessary files to run.

The VSA also leverages the latest flash technology. The most accessed data is on a Fusion-IO or Intel S3700 SSDs, depending on the Nutanix node model. The “colder” (not frequently accessed) data is automatically transferred to the SATA terabyte disks. It's basically another level of caching, only with larger data caches than we see in the desktop world.

Using what seems to be a Supermicro Twin or Twin² chassis, even an entry level four node Nutanxi NX-3050 should support up to 400 virtual desktops with a power consumption of about 1.1 KW. Compare this with your typical SAN array that typically needs 700W just for one midrange array, and you probably need several expansion modules before you can even think about supporting 400 virtual desktops.

Unfortunately, we cannot verify the claims of Nutanix right now, but our experience tells us that from a power and performance point of view it will be very hard for the typical "server plus SAN infrastructure" to beat the much simpler “integrate everything inside a dense server” platform. The only disadvantage is that the number of DIMM slots inside such server nodes is limited. That is why even the largest Nutanix hosts do not support more than 256GB per node, which might be a limitation in some virtualization environments.

Starting at $22000 per node, the Nutanix nodes are hardly cheap, but since you don’t need a SAN the total investment is a lot lower than the traditional approach, especially for virtual desktops. Nutanix seems to have convinced quite a few people as it claims it is the fastest growing IT infrastructure startup ever, with an $80 million annual run rate. Now they just need to prove they have the reliability and support infrastructure to win over additional customers.

The Winds of Change CloudFounders: No More RAID
Comments Locked


View All Comments

  • Brutalizer - Sunday, August 11, 2013 - link

    "That's because ZFS has had a minimal impact on the professional storage market."

    That is ignorant. If you had followed the professional storage market, you would have known that ZFS is the most widely deployed storage system in the Enterprise. ZFS systems manages 3-5x more data than NetApp, and manages more data than NetApp and EMC Isilon combined. ZFS is the future and eating other's cake:
  • blak0137 - Monday, August 5, 2013 - link

    The Amplidata Bitspread data protection scheme sounds alot like the OneFS filesystem on Isilon.

    A note on the NetApp section, the NVRAM does not store the hottest blocks, rather it is only used for correlating writes to allow destaging entire raid group wide stripes onto disk at once. This utilization of NVRAM in NetApp, along with the write characteristics of the WAFL filesystem, allows RAID-DP (NetApp's slightly customized version of RAID-6) to have similar write performance as RAID-10 with a much smaller usable space penalty up to approximately 85-90% space utilization. Read cache is always held in RAM on the controller and the FlashCache (formerly PAM) cards supplement that RAM-based cache. A thing to remember about the size of the FlashCache cards is that the space still benefits from the data efficiency features of Data OnTap, such as deduplication and compression, and as such applications such as VDI get a massive boost in performance.
  • enealDC - Monday, August 5, 2013 - link

    I think you also need to discuss the effect of OSS or very low cost solutions that can be built on white box hardware. Those cause far greater disruptions than anything I can think of!
    SCST and COMSTAR to name a few.
  • Ammohunt - Monday, August 5, 2013 - link

    One thing i didn't see mention is that in the good old days you spread the I/O out across many spindles which was a huge advantage SCSI which was geared towards such a configuration. As drive sizes have increased the spindles have reduced adding more latency. The fact is that expensive SSD type storage systems are not needed in most medium sized businesses. Their data needs can in most cases be served by spectacularly by using a well architected tiered storage model.
  • mryom - Monday, August 5, 2013 - link

    There's some thing missing - take a look at Pernix Data - That's disruptive and also vSphere 5.5 gonna be a game changer. Software Defined Storage is the way forward - We just need space for more disks in blade servers
  • davegraham - Tuesday, August 6, 2013 - link

    SDS is an EMC-marchitecture discussion (a la ViPR). I'd suggest that you avoid conflating what a marketing talking head discusses with technology can actually do. :)
  • Kevin G - Monday, August 5, 2013 - link

    My understanding withenterprise storage isn't necessarily the hardware but rather the software interface and support that comes with it. NetApp for example will dial home and order replacements for failed hard drives for you. Various interfaces I've used allow for the logical creation multiple arrays across multiple controllers each using a different RAID type. I have no sane reason why some one would want to do that but the option is there and supported for the crazies.

    As far as performance goes, NVMe and SATA Express are clearly the future. I'm surprised that we haven't see any servers with hot swap mini-PCIe slots. With two lanes going to each slot, a single socket Sandy Bridge-e chip could support twenty of those small form factor cards in the front of a 1U server. At 500 GB a piece, that is 10 TB of preformatted storage, not far off of the 16 TB preformatted possible today using hard drives. Cost of course will be more expensive than disk but speeds are ludicrous.

    Going with standard PCIe form factors for storage only makes sense if there are tons of channels connected to the controlller and are PCIe native. So far the majority of offers stick a hardware RAID chip with several SATA SSD controllers onto a PCIe card and call it a day.

    Also for the enterprice market, it would be nice to a PCIe SSD have an out of band management port that communicates via Ethernet and can fully function if the switch on the other end supports power over ethernet. The entire host could be fried but data could still potentially be recovered. Also works great for hardware configuration like on some Areca cards.
  • youshotwhointhewhatnow - Monday, August 5, 2013 - link

    The first link on "Cloudfounders: No More RAID" appears to be broken ( Catastrophe.pdf).

    I read through the second link on that page (the Intel paper). I wouldn't consider that paper as unbiased considering Intel is clearly trying to use it to sell more Xeon chips. Regardless, I don't think your statement "mathematically proven that the Reed-Solomon based erasure codes of RAID 6 are a dead end road for large storage systems" is justified. Sure RAID6 will eventually give way to RAID7 (or RAIDZ2 in ZFS terms), but this still uses Reed-Solomon codes. The Intel paper just shows that RAID6+1 has much worse efficiency with slightly worse durability compared to Bitspread. The same could be said for RAID7 (instead of Bitspread), which really should have been part of the comparison.

    Another strange statement in the Intel paper is "Traditional erasure coding schemes implemented by competitive storage solutions have limited device-level BER protection (e.g., 4 four bit errors per device)". Umm, with non-degraded RAID6 you could have as many UREs as you like provided less than three occur on the same stripe (or less than two for a degraded array). Again RAID7 allows even more UREs in the same stripe.

    This is not to say that the Bitspread technique isn't interesting, but you seem to be a little to quick to drink the kool-aid.
  • name99 - Tuesday, August 6, 2013 - link

    I imagine the reason people are quick to drink the koolaid is that convolutional FEC codes have proved how well they work through much wireless experience. Loss of some Amplidata data is no different from puncturing, and puncturing just works --- we experience it every time we use WiFi or cell data.

    I also wouldn't read too much into Intel's support here. Obviously running a Viterbi algorithm to cope with a punctured convolutional code is more work than traditional parity-type recovery --- a LOT more work. And obviously, the first round of software you write to prove to yourself that this all works, you're going to write for a standard CPU. Intel is the obvious choice, and they're going to make a big deal about how they were the obvious choice.

    BUT the obvious next step is to go to Qualcomm or Broadcom and ask them to sell you a Viterbi cell, which you put on a SOC along with an ARM front-end, and hey presto --- you have a $20 chip you can stick in your box that's doing all the hard work of that $1500 Xeon.

    The point is, convolutional FEC is operating on a totally different dimension from block parity --- it is just so much more sophisticated, flexible, and powerful. The obvious thing that is being trumpeted here is destruction of one of more blocks in the storage device, but that's not the end of the story. FEC can also handle point bit errors. Recall that a traditional drive (HD or SSD) has its own FEC protecting each block, but if enough point errors occur in the block, that FEC is overwhelmed and the device reports a read error. NOW there is an alternative --- the device can report the raw bad data up to a higher level which can combine it with data from other devices to run the second layer of FEC --- something like a form of Chase combining.

    Convolutional codes are a good start for this, of course, but the state of the art in WiFi and telco is LDPCs, and so the actual logical next step is to create the next device based not on a dedicated convolutional SOC but on a dedicated LDPC SOC. Depending on how big a company grows, and how much clout they eventually have with SSD or HD vendors, there's scope for a whole lot more here --- things like using weaker FEC at the device level precisely because you have a higher level of FEC distributed over multiple devices --- and this may allow you a 10% or more boost in capacity.
  • meorah - Monday, August 5, 2013 - link

    you forgot another implication of scale-out software design. namely, the ability to bypass flash completely and store your most performance intensive workloads that use your most expensive software licensing directly in-memory. 16 gigs to run the host, the other 368 gigs as a nice RAM drive.

Log in

Don't have an account? Sign up now