Tintri, Hyper-V, Veeam and the Reddit Thread

Update – Here are the links to the YouTube videos showing Hyper-V backup using both Commvault and Veeam.

Commvault

Veeam

Original Post

A few weeks back, a Tintri customer posted a comment on Reddit regarding Tintri, Hyper-V and Veeam.

https://www.reddit.com/r/sysadmin/comments/41uje6/tintri_veeam_hyperv_scvmm_and_woe/

The comment went something like this –

For all the love they show each other about compatibility, Hyper-V is not supported by Veeam on Tintri VM Stores. This is due to Tintri not having a hardware VSS provider. Lots of finger pointing to find the answer.

Also, Tintri’s SMB3 implementation has a few major gaps. Folder inheritance isn’t a thing right now (slated for bugfix in 4.2), which means if you add hosts to a cluster, you have to fix permissions on all existing subfolders.

On top of all that, SCVMM can create shares using SMI-S, but cannot delete them. You have to delete the share on the VM Store and then rescan the storage provider.

Edit: I forgot to mention their powershell modules have no consistency between them. One uses Verb-TTNoun, the other uses Verb-TintriNoun.

There were a lot of comments and sub-comments on the thread. Some were accurate and others were not. We tried to post a response to the thread, but because of Reddit’s strict policies, none of our comments ever showed up. As a result, the thread has caused some confusion amongst some of our customers and prospects.

So I wanted to take this opportunity to respond to the thread, clarify a few things and update everyone on the current status.

  • Backing up using a backup application through Remote VSS

The Tintri VMstore’s Microsoft VSS integration did lack some functionality needed for remote VSS-based backups of Hyper-V (that is, for backup applications such as Veeam and Commvault). This will be fully supported in an upcoming release that is in QA and will be available to customers soon.

  • Folder inheritance over SMB

Folder inheritance over SMB has been supported since Tintri OS 3.1.1 in the context of storage for virtual machines. There is a specific case where this wasn’t handled correctly and, as the customer pointed out, it is rectified in the same upcoming Tintri OS update mentioned above.

  • Removing SMB share using SCVMM

There is an issue removing SMB shares via SCVMM (SMI-S) and that has also been fixed in the update mentioned above. As the customer pointed out, it is still possible to remove the SMB share through the VMstore UI.

  • Inconsistency in the naming of Powershell cmdlets

To clarify, there is no inconsistency here; this is by design. The Verb-TintriNoun PowerShell cmdlets belong to the Tintri Automation Toolkit (a free download from our support site) and are used for customer automation around Tintri products.

The Verb-TTNoun cmdlets are a collection of cmdlets for validating certain Microsoft environmental factors, not specific to Tintri, that can impact Hyper-V functionality. The latter are primarily used by Tintri field and support technicians and for some automation. The separate ‘TT’ namespace avoids confusion or overlap with other modules.

As always, Tintri is committed to its multi-hypervisor story, including Hyper-V, and we have several large customers who have successfully deployed Tintri with Hyper-V and are enjoying the benefits of Tintri’s VM-aware functionality in their environments. We apologize for the inconvenience this has caused our customers. We have ensured that all of these issues are ironed out as part of the upcoming Tintri OS release.

PS: Although the customer didn’t mention anything about his company, we believe the customer contacted support and received an update directly from the support team.

@storarch


Some thoughts on NetApp’s acquisition of Solidfire

Yesterday, NetApp announced that they have entered into a definitive agreement to acquire Solidfire for $875M in an all-cash transaction. Having spent more than 7 years at NetApp, I thought I would provide my perspective on the deal.

As we all know, NetApp had a three-way strategy around flash. First, All-Flash FAS for customers looking to get the benefits of the data-management-feature-rich ONTAP, but with better performance and lower latency. Second, E-Series for customers looking to use application-side features with a platform that delivered raw performance. And third, FlashRay for customers looking for something designed from the ground up for flash that could use denser, cheaper flash media to deliver a lower-cost alternative with inline space efficiency and data-management features.

The Solidfire acquisition is the replacement for the FlashRay portion of the strategy. The FlashRay team took forever to get a product out the door and then, surprisingly, couldn’t even deliver on HA. The failure to deliver on FlashRay is definitely alarming, as NetApp had some quality engineers working on it. Solidfire gives NetApp a faster time (?) to market (relatively speaking). Here is why I think Solidfire made the most sense for NetApp –

  • Solidfire gives NetApp an arguably highly scalable block-based product (at least on paper). Solidfire’s Fibre Channel approach is a little funky, but let’s ignore it for now.
  • Solidfire is one of the vendors out there that has native integration with cloud which plays well with NetApp’s Data Fabric vision.
  • Solidfire is only the second flash product out there designed from the ground up that can do QoS. I am not a fan, as you can read here, but they are better than the pack. (You know which is the other one – Tintri hybrid and all-flash VMstores with a more granular per-VM QoS, of course.)
  • Altavault gives NetApp a unified strategy to back up all NetApp products, so the all-flash product no longer has to work with SnapVault or other ONTAP functionality. The field teams would still like to see tighter integration with SnapManager and the like, but since most modern products make good use of APIs, it should not be difficult. (One of the key reasons NetApp wanted to develop an all-flash product internally was that they wanted it to work with ONTAP – you are not surprised, are you?)
  • Solidfire has a better story than some of the other traditional all-flash vendors out there around service providers, which is a big focus within NetApp.
  • Solidfire’s openness around using Element OS with any HW, not just the Dell and Cisco HW it can use today. I want to add here that, from what I have gathered, Solidfire has more control over what type of HW one can use, and it’s not as open as some of the other solutions out there.
  • And yes, Solidfire would have been much cheaper than other more established alternatives out there making the deal sweeter.

I won’t go into where Solidfire as a product misses the mark. You can find those details around the internet. Look here and here.

Setting technology aside, one of the big challenges for NetApp would be execution at the field level. The NetApp field sales team always leads with ONTAP, and the optimization of ONTAP for all-flash would make it difficult for the Solidfire product to gain mindshare unless leadership puts a specific strategy in place to change this behavior. Solidfire would be going from a sales team that woke up every day to sell and create opportunity for its product to a team that historically hasn’t sold anything other than ONTAP. Hopefully, NetApp can get around this and execute in the field. At least that’s what Solidfire employees would be hoping for.

What’s next for NetApp? I can’t remember exactly, but I think someone on Twitter, a blog or a podcast mentioned that NetApp may go private in the coming year(s). Although it sounds crazy, I think it’s the only way for companies like NetApp/EMC to restructure and remove the pressure of delivering on top-line growth, especially with falling storage costs, improvements in compute hardware, the move towards more software-centric sales, utility-based pricing models and cloud.

From a Tintri standpoint, the acquisition doesn’t change anything. We believe that flash is just a medium, and products like Solidfire, Pure Storage, XtremeIO or any product that uses LUNs and volumes as the abstraction layer have failed to grab the opportunity to bring a change of approach for handling modern workloads in the datacenter. LUNs and volumes were designed specifically for physical workloads, and we have made them work with virtual workloads through overprovisioning and constant babysitting. Flash just throws a lot of performance at the problem and contributes to overprovisioning. Whether customers deploy a Solidfire, a Pure Storage or an XtremeIO, there will be no change; it would just delay the inevitable. So pick your widget based on the incumbent in your datacenter or based on price.

If you want to fix the problem, remove the pain of constantly managing & reshuffling storage resources and make storage invisible then talk to Tintri.

Contact us and we will prove that we will drive down CAPEX (up to 5x) and OPEX (up to 52x) and save you time with VM-aware storage.


While you are at it, don’t forget to check out our Simplest Storage Guarantee here.


Cheers..

@storarch


Choosing analytics: built-in, on-premises, or cloud-based

With the announcement of Tintri Analytics, we delivered on our vision: comprehensive, application-centric, real-time analytics (using fully integrated on-prem and cloud-based solutions) that provide predictive, actionable insights based on up to three years of historical system data.

Customers can now automatically (or manually) group VMs based on applications to analyze application profiles, run what-if scenarios, and model workload growth in terms of performance, capacity and flash working sets.

When you consider a storage solution refresh, analytics probably tops your list of needed features. It simplifies IT’s job, makes IT more productive and helps organizations save time and money.

The question is, what type of analytics should an organization look to have—built-in, on-premises or cloud-based? If you are just getting started, any sort of analytics would be great! Most storage vendors have an on-premises and/or a cloud-based solution. But an ideal storage product should have all three, as each of them has its own irreplaceable use case. Let’s take a look at each one.

Built-in analytics for auto-tuning

Built-in analytics that the system uses for self-tuning are uncommon in the industry. Tintri’s unique auto-QoS capability is a great example: it uses built-in analytics, available at the vDisk level, to logically divide up all the storage resources and allocate the right share of the right type of resource (flash, CPU, network buffers, etc.) to each vDisk. By doing this, a Tintri VMstore ensures that each vDisk is isolated from the others, without noisy neighbors.
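As a rough sketch only (in Python, and not Tintri’s actual implementation), the idea of carving per-vDisk performance reserves out of a shared pool might look like this; the class, vDisk names and numbers are purely illustrative:

```python
# Illustrative sketch: per-vDisk reserves are allocated out of a free pool, so a
# newly admitted vDisk never takes reserves already committed to existing vDisks.

class ReservePool:
    def __init__(self, total_reserves=100.0):
        self.total = total_reserves          # whole system = 100% of performance reserves
        self.allocations = {}                # vdisk_id -> reserved share (%)

    def free(self):
        return self.total - sum(self.allocations.values())

    def admit(self, vdisk_id, estimated_share):
        """Reserve a share for a new vDisk from the free pool only."""
        if estimated_share > self.free():
            raise RuntimeError("insufficient free reserves; existing vDisks keep theirs")
        self.allocations[vdisk_id] = estimated_share

pool = ReservePool()
pool.admit("sql-01.vmdk", 12.5)   # share estimated from the observed IO profile
pool.admit("vdi-17.vhdx", 3.0)
print(f"free performance reserves left: {pool.free():.1f}%")
```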

Operationally, this simplifies architecture, as the IT team doesn’t have to figure out the number of LUNs/volumes, their sizes, which workloads would work well together and so on. It can focus on just adding VMs to a datastore/storage repository as long as it has capacity and performance headroom available (as shown by the Tintri dashboard).

On-premises real-time analytics

On-prem analytics are extracted from a storage system by a built-in or external application deployed within the environment. Admins can consult these real-time analytics to help troubleshoot a live problem or store them for historical information. Admins can further use these analytics to help their storage solution deliver a prescriptive approach to placing workloads, and provide short-term historical data for trending, reporting and chargeback.

Tintri VMstore takes advantage of its built-in analytics to deliver an on-prem solution for analytics through both the VMstore GUI and Tintri Global Center. Up to a month of history can be imported into software like vRealize Operations, Nagios, Solarwinds and more.

Of course, customers don’t have to wait before they can see these analytics—unlike with cloud-based analytics, they can monitor systems in real-time.

Cloud-based predictive analytics

Cloud-based analytics help customers with long-term trending, what-if scenarios, predictive and comparative analytics. But not all cloud-based analytics are created equal. Some just show the metrics, while others let you trend storage capacity and performance. But the majority of them can’t go application-granular across multiple hypervisors, especially in a virtual environment. They’re just statistical guesswork based on LUN/volume data.

And that’s where Tintri Analytics separates itself from the pack. With a VM-aware approach, we understand applications, group them automatically and provide great insights across customers’ data.

Your IT team wants to be proactive, working on solving business problems instead of doing mundane day-to-day tasks. That’s why each of these three categories of analytics is a must-have. With Tintri Analytics, Tintri is committed to reducing the pressure on storage and system admins, and helping to grow, not stall, your organization.

Cheers..

@storarch

What’s New in All-Flash?

Today, Tintri announced the Tintri VMstore T5000 All-Flash series—the world’s first all-flash storage system that lets you work at the VM level—leading a launch that includes Tintri OS 4.0, Tintri Global Center 2.1 and VMstack, Tintri’s partner-led converged stack. Since its inception in 2008, Tintri has delivered differentiated and innovative features and products for next-generation virtualized datacenters. And we’re continuing the trend with the game-changing All-Flash VM-Aware Storage (VAS).

Other all-flash vendors claim all-flash can be a solution for all workloads—a case of “if all you have is a hammer then everything looks like a nail.” Or, they’ll argue that all-flash can augment hybrid deployments, with the ability to pin or move entire LUNs and volumes.


But not all workloads in a LUN or volume may have the same needs for flash, performance and latency. So just as we’ve reinvented storage over the past four years, Tintri’s ready to reinvent all-flash. Here’s how:

  • No LUNs. Continuing the Tintri tradition, the T5000 series eliminates LUNs and volumes, letting you focus on applications. We’re welcoming VMs to the all-flash space across multiple hypervisors.
  • Unified management. Aside from standalone installations, the T5000 series can also augment the T800, and vice-versa. Admins can now manage VMs across hybrid-flash and all-flash platforms in a single pool through Tintri Global Center (TGC), with full integration.
  • Fully automated policy-based infrastructure through TGC, with support from vDisk-granular analytics and VM-granular self-managed service groups.

With access to vDisk-granular historical performance data, SLAs and detailed latency information, customers can decide which workloads can benefit from all-flash vs hybrid-flash—especially when our hybrid-flash delivers 99-100% from flash.

But we hear you, storage admins: you want to go into the weeds. Surprise—we’re happy to help. Here’s what else the T5000 series can offer you:

  • Space savings from inline dedupe, compression, cloning and thin provisioning.
  • NVDIMMs, NTB, 10G and more of the latest hardware advancements.
  • Enterprise reliability exceeding 99.999% uptime.
  • Scale of up to 112,000 VMs, 2.3PB and up to 5.4M IOPS (random 60:40 R:W, 8K) in a single TGC implementation. (These are real-life numbers, not 100% read numbers.)
  • VM-granular snapshots, cloning and replication.
  • vDisk-level Dynamic QoS to eliminate noisy neighbors and ensure peak performance.
  • VM-level Manual QoS to set up performance SLAs through min and max IOPS.
  • vDisk (VMDKs, VHDs)-level data synchronization across VMs for test and dev or any operations requiring periodic copying of data.
  • VM-level replication, backup and transfer between Hybrid Flash and All-Flash systems.
  • VM-granular performance analytics with end-to-end latency visualization that includes host, network, storage, contention and throttle latency.

Today, Tintri continues our solid roadmap of business-relevant innovations in storage for modern workloads. We changed the game for hybrid-flash—and we’re doing it again for all-flash.

Cheers,
@storarch

The need for a Game Changer in All-Flash Storage

The all-flash space has been abuzz lately with a slew of vendors announcing new developments:

  • Solidfire announced new nodes, a software-only implementation (which oddly comes without complete hardware freedom) and a new program around its Flash Forward guarantee.
  • Pure Storage announced an update to its Flash Array lineup and a program around Evergreen Storage.
  • HP announced its 20K 3PAR line up, basically a hardware refresh.
  • EMC announced software updates to XtremeIO and a lot of other flashy stuff in ScaleIO and DSSD (typical of EMC to think ahead and have multiple bets).
  • NetApp re-launched All-Flash FAS with new pricing to complement the rich data services that ONTAP brings to the table, and has been pounding its chest about how ONTAP is the best thing to happen to all-flash arrays.

(Time will tell what happens to FlashRay, which is apparently being positioned in a different category (cheaper, simpler to use). Going by my experience, it’ll be a tough sell internally to move sales teams away from selling ONTAP, especially now that they have an optimized All-Flash FAS. (They should thank Gartner for that.) Contrary to popular belief, NetApp has had different products for different workloads in its portfolio (FAS, StorageGrid, E-Series, AltaVault, FlashRay), but where it has suffered, in my opinion, is in educating and convincing the NetApp field sales teams to sell anything other than ONTAP. The problem is made worse by loyal NetApp customers who want everything to work with or within ONTAP.)

The Theme

If we look at most of the announcements, we see a unifying theme: Bigger, Faster, Cheaper and Better. This mostly results from new HW technologies (compute), increasing flash capacities and reduction in the price of flash. From a software standpoint, the newer products are catching up to add all the functionalities that traditional products have had. Traditional products (like HP 3PAR, NetApp FAS) are optimizing the code for flash and taking advantage of their already existing data services and application integrations. From a hardware standpoint, eventually every vendor will catch up to each other as they adopt the newer hardware.

Where is the Differentiator?

If we compare the all-flash offerings from various vendors, most of them have similar features: dedupe, compression, snapshots, clones, replication, LUN/volume-based QoS and some sort of application and cloud integration. Each vendor only does one feature or another better, and they all struggle to find a big differentiator. When that happens, it’s marketing that starts to innovate more than engineering, and we start seeing messages like this:

  • We provide better space savings (6x vs. 5x) (yes, that’s around 10% better)
  • Our space savings technology never goes post-process (okay, but the other vendor is 10% better for savings)
  • We provide Evergreen Flash (marketing spin on a creative sales rep doing something at the time of a refresh – made even easier by flash)
  • Our Flash Forward program is unique in the industry (another marketing spin)
  • We are the only vendor that provides cloud integration (not true)
  • Designed from the ground up for flash (flash is a medium and traditional products can be optimized for flash—but faster performance/response times or flash longevity don’t necessarily need a ground-up design in all cases. I say this even though the flash layer and spinning drives have completely different block layouts on the Tintri VMstore, with the flash layer designed specifically with flash in mind)
  • We have the cheapest flash solution (when nothing works, talk price)

Running out of ideas?

It’s like everyone is running out of ideas. None of these vendors have taken a “completely different” approach—and their product can be better than others’ only for a limited time. Eventually, everyone will catch up to each other. If you take the same road your competitors do, your results can’t be much different.

But we can’t expect traditional vendors to take a different approach unless they’re developing a new product without any baggage. Younger product companies definitely have a chance to be different. Still, most of these younger companies have taken a safe approach based on 30-year-old constructs and abstractions that are not required in the modern datacenter—mainly LUNs and volumes, with all the challenges associated with them. These constructs worked great for some of the traditional workloads, but they require a lot of assumptions to be made when architecting storage in a modern datacenter (RAID group size, block size, queue depths, LUN/volume sizes, number of LUNs/volumes, number of workloads per LUN/volume, grouping based on data protection needs, etc.). Modern workloads are no longer tied to LUNs/volumes, which also poses a huge problem, especially for architectures that are designed with these constructs in mind.

Now, because the traditional vendors and the younger vendors used the same approach, it has become a contest between the two: traditional vendors are trying to optimize their products for flash, and newer vendors are trying to add functionality to match that of traditional vendors. As I see it, the scale tips toward the traditional vendors as far as storage with a traditional approach is concerned—because instead of changing the game, the younger vendors decided to play the game of the traditional vendors.


Need to be Different, not Better

Historically, the startups that make a difference are the ones that take a different approach. Data Domain, for example, defined a new model for backups. Even NetApp took a filesystem approach to storage (for file and block), enabling a completely different implementation of technologies like snapshots, clones and primary-storage dedupe. Now everyone has started to have some sort of filesystem layer and has caught up to the extent that the lines are all blurred. NetApp is feeling the pressure now, but it took a long time for vendors to get there. There are many other examples, including ones outside the storage industry (think Uber, Airbnb, Facebook).

While starting out different is great, it is important for any vendor to stay different and keep reinventing itself (through acquisition or innovation) based on changing needs. They should not get bogged down by a “things are working well, why change anything?” mentality.

Being different changes the possibilities and gives a chance for products to stand out. It allows companies to change the game and the table stakes. It allows companies to ‘change the experience’ which is what we use to evaluate any product.

As far as the all-flash market is concerned, there is a need for a product with a different approach. A product that can change the game and bring new possibilities. The need is for something designed from the ground up for the modern datacenter (and modern applications), rather than something that is just designed from the ground up for flash. Flash is just a medium, and mediums change. It’s Flash today, it may be something else tomorrow.

Cheers,

@storarch

Improving Storage QoS Accuracy and Performance-based Chargeback/Showback

A few weeks back, I wrote a series of blog posts (Part 1, Part 2, Part 3) on how Tintri simplifies chargeback/showback for service providers (SPs). With the release of manual quality of service (QoS) per virtual machine (VM) and the introduction of normalized IOPS, Tintri has made that value proposition for SPs even better.

Tintri Storage QoS

As we all know, Tintri is the only storage platform that has an always-on dynamic QoS service that ensures QoS at a vDisk level. As part of this new functionality, Tintri customers can manually configure QoS at a VM level.

QoS on Tintri systems is implemented on normalized IOPS (more on this below), and customers can configure min and/or max settings for individual VMs. The minimum setting guarantees performance when the system is under contention (when you are at more than 100% on the Performance Reserves bar), and the maximum setting places an upper limit on performance for the VM. The latency visualization gets an enhancement as well, with support for contention and throttle latency visualizations that ensure QoS doesn’t become a liability.
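To make the min/max semantics concrete, here is a hedged Python sketch of how a scheduler could treat a VM’s IO in a given interval under such a policy; the policy fields and logic are illustrative assumptions, not Tintri’s actual implementation:

```python
# Illustrative only: min guarantees a floor under contention, max is a hard ceiling.
from dataclasses import dataclass

@dataclass
class VmQosPolicy:
    min_norm_iops: int = 0        # guaranteed under contention
    max_norm_iops: int = 10**9    # throttle above this

def classify(observed_norm_iops, policy, system_contended):
    """Return what the scheduler would do for this VM in this interval."""
    if observed_norm_iops > policy.max_norm_iops:
        return "throttle"                       # would surface as 'throttle latency'
    if system_contended and observed_norm_iops < policy.min_norm_iops:
        return "prioritize"                     # protect the guaranteed floor
    return "pass"

print(classify(1800, VmQosPolicy(min_norm_iops=500, max_norm_iops=1500),
               system_contended=False))          # -> "throttle"
```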


If you want to read more about QoS, head over to the blog post here. There is also a great video posted here.

Normalized IOPS

Normalized IOPS are measured at a granularity of 8K by a reporting mechanism that translates standard IOPS into 8K IOPS. This helps create a single scale to measure the performance of various VMs/applications. So, in addition to reporting the standard IOPS per VM/vDisk, the VMstore also reports normalized IOPS for the VMs. So how does the Tintri VMstore report Normalized IOPS? Here are a few examples –

If an application/VM is doing 1000 IOPS @ 8K block size, the VMstore would report it as 1000 Normalized IOPS. Similarly, the Normalized IOPS for an application doing 1000 IOPS @ 16K block size would be reported as 2000 Normalized IOPS. Taking a few more examples –

1000 IOPS @ 12K would be equal to 2000 Normalized IOPS as well (rounded up to the next 8K multiple),

and 1000 IOPS @ 32K would be reported as 4000 Normalized IOPS.
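The arithmetic above fits in a few lines. The sketch below assumes each IO counts as ceil(block size / 8K) normalized IOs, i.e. rounding up to the next 8K boundary, which is consistent with the 12K example:

```python
import math

# Illustrative sketch of the normalized-IOPS arithmetic described above.
NORMALIZED_BLOCK = 8 * 1024  # 8K reference block size

def normalized_iops(iops, block_size_bytes):
    return iops * math.ceil(block_size_bytes / NORMALIZED_BLOCK)

assert normalized_iops(1000, 8 * 1024) == 1000
assert normalized_iops(1000, 16 * 1024) == 2000
assert normalized_iops(1000, 12 * 1024) == 2000   # 12K rounds up to 16K
assert normalized_iops(1000, 32 * 1024) == 4000
```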

Why use normalized IOPS?

  • As we all know, different applications have different block sizes. Normalized IOPS allows us to understand the real workload generated by various applications and help create an apples-to-apples comparison between applications.
  • It also makes QoS predictable. When we set up QoS using normalized IOPS, we know exactly what the result will be, instead of getting a skewed result because of the block size of the application.
  • It gives one single parameter for SPs to implement performance-based chargeback/showback. So, instead of considering IOPS, block size, and throughput, and then trying to do some sort of manual reporting and inconsistent chargeback/showback, the SPs get the measurements out of the box.

Let’s use an example to see how SPs can take advantage of the new functionality.

[Screenshot: per-VM IOPS, Normalized IOPS and Reserve % as shown in the VMstore UI]

In the above screenshot, we have three VMs and we can see the IOPS and Normalized IOPS for each of these VMs. If we look at just the IOPS, we would be inclined to think that the VM SatSha_tingle is putting the highest load on the system, and that it is 2.7x the VM SatSha_tingle-02. But if we look at the Normalized IOPS, we know the real story. The VM SatSha_tingle-02 is almost 1.5x of SatSha_tingle. This is also reflected in the reserves allocated by the system to the VMs under Reserve%.

In an SP environment, without normalized IOPS, the SP would either end up charging less for SatSha_tingle-02 or would have to look at block size and do some manual calculation to understand the real cost of running the VM. But with Normalized IOPS, the SP can standardize on one parameter for charging based on performance, making its chargeback/showback more accurate and more predictable.

Since Normalized IOPS are also used for setting up QoS, SPs can now guarantee predictable performance to their customers through min- and max-IOPS-based QoS. With Normalized IOPS, SPs now have four different ways to chargeback/showback on Tintri systems: provisioned space, used space, reserves and min/max Normalized IOPS. Each of these brings more accuracy and predictability to an SP’s chargeback/showback model, which directly affects their profitability.
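As a purely illustrative sketch (the rate, tenant and VM names are made up), chargeback against normalized IOPS as the single billable unit could be as simple as:

```python
# Hypothetical performance-based chargeback using normalized IOPS as the one unit to bill against.
RATE_PER_1000_NORM_IOPS = 0.75   # made-up monthly rate

monthly_peak_norm_iops = {
    "tenant-a/sql-prod": 12_000,
    "tenant-a/web-farm": 4_500,
    "tenant-b/analytics": 26_000,
}

for vm, norm_iops in monthly_peak_norm_iops.items():
    charge = norm_iops / 1000 * RATE_PER_1000_NORM_IOPS
    print(f"{vm}: {norm_iops} normalized IOPS -> {charge:.2f} per month")
```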

Cheers,

@storarch

Why all the modern Storage QoS implementations are not good enough

Storage QoS is starting to become a key functionality today, driven mainly by the push towards increased resource utilization and therefore resource sharing. In the past, storage systems had dedicated RAID groups for LUNs, and applications had LUNs dedicated to them. This worked really well in terms of guaranteed IOPS and isolation. As disk sizes increased and storage technology matured, we started sharing the drives amongst the LUNs, and these LUNs were no longer isolated from each other. Virtualization made the situation even worse, because not only were the disks shared amongst LUNs, the LUNs themselves were shared by multiple workloads (VMs). This resulted in noisy-neighbor problems that impacted these LUN- and volume-based storage systems.

A few storage vendors have some form of manual QoS functionality built into the storage OS. Tintri, for example, built from the beginning an architecture that enables always-on, fully automatic, dynamic QoS at the individual vDisk level (think VMDKs, VHDs, etc.). It ensures automatic storage resource reservation at the vDisk level (based on our built-in IO analytics engine) so that every vDisk gets the performance it needs at sub-ms response times. The architecture is designed such that a new vDisk gets its performance from the free reserves available in the system, so at no point does a vDisk that needs more performance impact an existing vDisk. The approach is different from the traditional approach of manually setting up QoS at a LUN/volume level, but it is highly effective for IT organizations that don’t want to hand-hold the storage system. Tintri is the only storage product out there that has always-on, dynamic QoS enabled within all its storage appliances.


Having said that, setting up QoS manually does have its place in the service provider (SP) space, as well as in some private cloud implementations where the SP (public or private) doesn’t want to give everyone unlimited performance. These SPs want to be able to sell, say, a Platinum service to their customers and do it dynamically, on the fly, without even moving the workload. So, coming back to my original point about the QoS implemented by storage vendors today, here are the reasons why I say it is not good enough –

Granularity Challenge

In today’s datacenters, workloads are virtual, and clouds are not implemented without virtualization. In these virtualization-enabled datacenters, dealing with LUNs/volumes is a pain. LUNs were brought into the industry 30-40 years back, when workloads were physical, and we kept using these LUNs/volumes even with virtualized workloads because that’s what the storage systems knew. In a virtual environment, a LUN has multiple workloads running in it, so implementing QoS on LUNs has no advantage for virtual workloads, whether it is being implemented for isolation or for chargeback. VVols would change this (only for vSphere), but there is still a long way to go there, as VVols don’t support all the vSphere features and not all vendors have a practically deployable implementation.

The result is that VMs in a LUN/volume end up sharing the IOPS limit set up at the LUN/volume level and therefore end up interfering with each other.
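A toy example of why that hurts (the numbers are made up): with a single cap on the whole LUN, one noisy VM squeezes everything sharing it, whereas per-VM or per-vDisk limits would only cap the noisy VM:

```python
# Illustrative only: a shared LUN-level cap split naively among the VMs on that LUN.
LUN_LIMIT_NORM_IOPS = 5000
demand = {"noisy-vm": 6000, "app-vm": 800, "vdi-vm": 400}

total_demand = sum(demand.values())
for vm, want in demand.items():
    got = min(want, LUN_LIMIT_NORM_IOPS * want / total_demand)
    print(f"{vm}: wants {want}, gets ~{got:.0f} under the shared LUN cap")
# per-VM (or per-vDisk) limits would cap only 'noisy-vm' and leave the others alone
```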

The IOPS Dilemma

Storage QoS is implemented using IOPS. One can combine it with MB/s, but only a few vendors allow you to do that. Usually, it is just one or the other.

Now here is the problem: IOPS can mean different things depending on the block size. If I am limiting a LUN/volume to 1000 IOPS, here is what it could mean –

4K Block size means 4MB/s

8K Block size means 8MB/s

64K Block size means 64MB/s

The same 1000 IOPS can mean 16x more load on a system at a 64K block size vs. 4K. That is a lot of difference for a service provider to take into account when deciding the pricing for a service. In some cases, even a large number of small-block IOs may impact storage more than large-block IOs. Some vendors can combine the IOPS limit with throughput to get around this to some extent, but ideally service providers want one unit to bill against and a single scale to measure everyone. Microsoft’s implementation of normalized IOPS is a great example of such a metric.
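The arithmetic behind that 16x gap is simple enough to show directly (a quick sketch, treating 1 MB as 1000 KB to match the round numbers above):

```python
# Same IOPS cap, very different back-end load depending on block size.
IOPS_LIMIT = 1000
for block_kb in (4, 8, 64):
    mb_per_s = IOPS_LIMIT * block_kb / 1000   # 1 MB ~ 1000 KB for round numbers
    print(f"{IOPS_LIMIT} IOPS @ {block_kb}K = {mb_per_s:.0f} MB/s")
# 64K vs 4K is a 16x difference in throughput for the same IOPS number
```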

The Throttling Effect

Some storage systems using QoS on LUNs have this problem to deal with, specifically when a host has more than one LUN from the same storage system mapped through an HBA. When one implements a QoS limit on a specific LUN and that LUN tries to go above the limit, it gets throttled by the storage system. The IOs get queued up at the HBA level, and at that point the host starts to throttle IO not just to the LUN in question but to all the LUNs coming from that storage system, thinking the storage system is not able to take the load being sent. This makes it practically impossible to implement QoS at an individual LUN level without impacting other LUNs.


The Visibility and Analytics Challenge

Most storage vendors have QoS more as a checkbox, with very few real-world deployments. The reason is that QoS is really complex to implement, and there are more chances of getting it wrong than right. QoS has to be implemented as a strategy and across all workloads. The challenge is that once someone gets it wrong, it is not easy to fix and requires involving vendor support teams to determine the cause. Some vendors sell professional services around this, which makes it a really expensive feature to implement.

The other point is that QoS itself can become a cause of latency, either because of the max limits set on a workload or because of contention resulting from the cumulative minimum guaranteed IOPS set on various workloads exceeding the overall performance capability of the storage system. Ideally, storage systems should give more insight into QoS and its impact on various workloads, so that if someone complains of latency or a drop in performance, the IT team is quickly able to pinpoint the reason. None of the storage vendors provide advanced, user-friendly analytics for QoS today, and that is one of the biggest inhibitors to real-world adoption of QoS.

To summarize, the QoS offered by storage vendors today is not granular enough; it doesn’t have a single scale on which to measure or apply QoS guarantees/limits; it doesn’t ensure performance fair-share; it doesn’t guarantee isolation; and storage providers don’t have the necessary analytics to make it easy to implement and then troubleshoot QoS-related issues. I think it’s time to address these challenges so that QoS can be widely accepted and implemented in the datacenter.

Cheers..

@storarch