How does Tintri deliver 99% IO from Flash?

This comes up a lot in my conversations with customers, prospects and partners. Delivering 99% of IO from Flash is one of Tintri's biggest differentiators; it allows us to maximize the value of Flash and deliver sub-millisecond latencies to workloads.

One of the toughest jobs for any product that has both spinning disks and Flash is to keep the Flash filled with meaningful data in order to maximize the IO served from Flash. This is something that engineering teams really struggle with, resulting in products that deliver less than 50% of IO from Flash in real-life situations.

Tintri delivers this with the help of four key technologies –

  • Flash First Design
  • Inline Compression and Deduplication
  • VM Granular IO Tracking and Working Set Management
  • Block Demotion based on least-frequently-accessed data vs. FIFO or recently accessed data

Flash First Design…

Tintri's Flash First design works differently from tiering solutions. Data is written to and retained on Flash, and is later moved to SATA based on very granular working set management, which we will discuss later in the post. This is unlike other implementations that use spinning disks for initial placement, or that simply copy contents to Flash on the initial write to avoid a read miss on first access.
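Here is a minimal sketch of that contrast in write paths, assuming a hypothetical block-store interface (the class and method names are purely illustrative, not Tintri code); the figures below show the same idea pictorially:

```python
# Hypothetical sketch of the two write paths; class and method names are
# illustrative only, not Tintri code.

class TraditionalTieringArray:
    """Writes land on spinning disk; hot blocks are promoted to flash later."""

    def __init__(self):
        self.flash, self.disk = {}, {}

    def write(self, block_id, data):
        self.disk[block_id] = data            # initial placement on disk

    def promote_hot_blocks(self, hot_block_ids):
        for block_id in hot_block_ids:        # periodic promotion pass
            if block_id in self.disk:
                self.flash[block_id] = self.disk[block_id]


class FlashFirstArray:
    """Writes land on flash and stay there; only cold blocks are demoted."""

    def __init__(self):
        self.flash, self.disk = {}, {}

    def write(self, block_id, data):
        self.flash[block_id] = data           # initial placement on flash

    def demote_cold_blocks(self, cold_block_ids):
        for block_id in cold_block_ids:       # working-set driven demotion to SATA
            if block_id in self.flash:
                self.disk[block_id] = self.flash.pop(block_id)
```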

[Figure: Traditional Storage Tiering]

[Figure: Tintri Storage]
Inline Deduplication and Compression …

Tintri utilizes inline deduplication and compression for multiple advantages. It not only allows Tintri to avoid write amplification but also helps it maximize the amount of data that is kept on Flash. The typical space savings that we have observed on the systems reporting back to Tintri range from 5x-10x, resulting in a Flash to spinning disk ratio that ranges from 1:1 to 1:2, versus the 1:20 you see on most of the platforms out there. This helps the system in a big way, as the data can now be retained on Flash for a very long time.
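As a rough illustration of how an inline write path can deduplicate and compress data before it ever lands on Flash, here is a minimal sketch; the SHA-256 fingerprinting and zlib compression are assumptions for the example, not a description of Tintri's actual on-disk format:

```python
import hashlib
import zlib

# Hypothetical inline dedup + compression write path; SHA-256 fingerprints
# and zlib are assumptions for the example, not Tintri's on-disk format.

flash_store = {}        # fingerprint -> compressed block already on flash
block_index = {}        # logical address -> fingerprint

def write_block(logical_addr, data: bytes):
    fingerprint = hashlib.sha256(data).hexdigest()
    if fingerprint not in flash_store:                  # dedup: store unique content once
        flash_store[fingerprint] = zlib.compress(data)  # compress before it hits flash
    block_index[logical_addr] = fingerprint             # logical block points at content

def read_block(logical_addr) -> bytes:
    return zlib.decompress(flash_store[block_index[logical_addr]])
```

Because duplicate content is stored once and everything is compressed before placement, the same physical Flash holds several times more logical data, which is what drives the 1:1 to 1:2 Flash-to-disk ratio mentioned above.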

VM Granular IO Tracking and Working Set Management…

What we discussed above was the easy part. The more complicated and difficult part of getting 99% IO from Flash is the granular working set management at a vDisk level. Tintri assigns Flash at a vDisk level: every vDisk has a portion of Flash assigned to it, similar to server-side flash products where you manually assign Flash capacity to VMs. The only difference is that this is done at the storage level and the allocation changes dynamically. What helps Tintri here is the ability to track every IO at a VM level. As the IOs come in, they are stored in Flash. The system works on calculating the working set size for the VM in order to allocate the required Flash capacity to it. Below you see two examples.

In the first example, you see a 175GB+ VM, and the graph shows us that it needs around 60GB of Flash allocated in order to hit Flash 100% of the time.

[Figure: Example VM1]

In the second example, we have another VM, 140GB+ in size, and the graph shows us that it needs only 5GB of Flash allocated in order to hit Flash 100% of the time.

[Figure: Example VM2]

These graphs are from real Tintri systems running VM workloads.

Now, this Flash capacity is the reserve per VM, but definitely not what the VM is limited to. The VM can and does go above its reserved Flash capacity based on how much Flash is available in the system. The reserved Flash capacity is one of the things that the Performance Gauge in the Tintri GUI is based upon. Remember that the system is utilizing inline compression and deduplication to maximize the effective Flash capacity, reduce the number of IOs and reduce Flash wear.
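For a feel of how per-vDisk IO tracking might translate into a working set size and a Flash reservation, here is a minimal sketch; the 8K block granularity matches the eviction section below, but the tracking structure and the headroom factor are illustrative assumptions, not Tintri's actual algorithm:

```python
from collections import defaultdict

BLOCK_SIZE = 8 * 1024   # 8K blocks, matching the eviction granularity below

# Hypothetical per-vDisk working-set tracking; the tracking structure and the
# headroom factor are assumptions, not Tintri's actual algorithm.

accessed_blocks = defaultdict(set)   # vdisk_id -> unique block numbers touched

def record_io(vdisk_id, offset, length):
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for block in range(first, last + 1):
        accessed_blocks[vdisk_id].add(block)

def working_set_bytes(vdisk_id):
    return len(accessed_blocks[vdisk_id]) * BLOCK_SIZE

def flash_reservation(vdisk_id, headroom=1.2):
    # Reserve the measured working set plus some headroom; a VM can still
    # use more flash whenever the system has spare capacity.
    return int(working_set_bytes(vdisk_id) * headroom)
```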

Block Eviction…

When it is time to evict data from Flash, unlike some other storage vendors, Tintri evicts blocks (at 8K granularity) based on which are least frequently accessed, not based on recently accessed blocks or an algorithm as simple as FIFO. This allows the system to ensure that the most important blocks stay on Flash, and also that a batch process or a wild VM doesn't spoil the Flash map.
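A minimal sketch of least-frequently-accessed eviction at 8K granularity might look like the following; the counters and the demote_to_disk helper are illustrative assumptions, not Tintri's implementation:

```python
import heapq

# Hypothetical least-frequently-accessed eviction at 8K block granularity;
# the counters and the demote_to_disk helper are illustrative only.

access_count = {}   # block_id -> number of accesses while resident on flash

def record_access(block_id):
    access_count[block_id] = access_count.get(block_id, 0) + 1

def demote_to_disk(block_id):
    pass  # placeholder for the actual data movement to SATA

def evict(n_blocks):
    """Demote the n least frequently accessed blocks.

    A FIFO or recency-based policy would instead evict the oldest or least
    recently touched blocks, letting a one-off batch job or a wild VM push
    the real working set out of flash.
    """
    victims = heapq.nsmallest(n_blocks, access_count, key=access_count.get)
    for block_id in victims:
        access_count.pop(block_id)
        demote_to_disk(block_id)
```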

So as you can see, delivering 99% IO from Flash is a fairly complicated task, and Tintri simplifies it for the user by auto-tuning the system. The VM awareness helps a lot. And as I have mentioned before, VM awareness is a lot more than delivering snapshot or cloning functionality at a VM level. It is technologies like these where Tintri differentiates. This requires storage that is laser-focused on virtualized workloads. A general fileserver has to store .txt/.doc/mpeg/.ppt files natively in the same storage as virtualized workloads and therefore has to compromise. The same is true for any other general-purpose SAN storage, as it is not aware of anything above the storage layer and has to cater to all types of workloads. Tintri uses VM awareness to reduce the total cost of ownership by delivering sub-millisecond latencies without resorting to all-Flash storage systems.

Cheers..

@storarch

6 thoughts on “How does Tintri deliver 99% IO from Flash?”

  1. Nice article! I have a few questions though.
    1) What are the typical access latencies observed? As per my understanding, Tintri presents itself as NFS-based storage, which is a very heavy protocol in terms of the number of RPC calls made. In addition to that, every guest has to go through the VMware storage stack for each storage access. Does this increase latency?
    2) If all writes are made to flash first, wouldn’t it wear out the flash faster?
    3) The graphs shown here about working sets are nice, but how long does it take to learn/realize the true working set of the VM?

  2. Hi Chandra

    Thanks for the comment.

    1) Tintri uses an NFS stack that is optimized to run VMs and vDisks. We typically see sub-ms response times and show a nice breakdown of latencies across Host, Storage and Network in our interface. Because of NFS we are also able to bypass VMFS and any layer it involves.
    2) Tintri uses a filesystem that is designed for Flash. The flash filesystem understands flash media and is optimized to run on flash with a complete understanding of flash boundaries, garbage collection and flash write cycles. In addition to that, inline deduplication and compression help us reduce the amount of IO that hits flash by 4-8x. So writing to flash doesn’t have any adverse effect on the life of the flash.
    3) The working set doesn’t stay constant. It changes dynamically, and so do our reserves. The reserves are calculated based on a ‘clock’ algorithm, and the hand of the clock moves with every IO. Having said that, one doesn’t have to worry about IO latency, as the data is stored on Flash and kept there, with inline dedup and compression helping significantly reduce the data footprint. Remember that we are doing the reverse of what other vendors do – moving stale data to disk vs. moving active data to flash.
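    For readers unfamiliar with ‘clock’ style algorithms, here is a generic textbook sketch of the idea; it is not Tintri’s implementation, just an illustration of how a sweeping hand and per-block reference bits can identify cold blocks cheaply:

```python
# Generic textbook CLOCK sketch, not Tintri's implementation: every IO sets a
# reference bit; the hand sweeps the ring, clearing bits, and the first block
# found with its bit already clear is considered cold.

class Clock:
    def __init__(self, block_ids):
        self.blocks = list(block_ids)
        self.ref_bit = {b: False for b in self.blocks}
        self.hand = 0

    def on_io(self, block_id):
        self.ref_bit[block_id] = True        # every IO marks the block as referenced

    def find_cold_block(self):
        while True:
            block = self.blocks[self.hand]
            self.hand = (self.hand + 1) % len(self.blocks)
            if self.ref_bit[block]:
                self.ref_bit[block] = False  # second chance: clear and move on
            else:
                return block                 # cold block, safe to shrink the reserve
```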

  3. Thanks Satinder for your detailed reply.
    1) I was referring to the overheads of the NFS protocol in general. The overhead increases if there are a lot of small I/Os that are very diverse. Has Tintri made any changes to NFS itself?
    2) If we assume that flash cache is 10% of the total VM storage, it means 90% of the data is stored on the disks. In order to get a 99% hit rate from flash cache, the apps running on the VMs need to access that 10% of total data that is hot, and the change in working set can only be up to 1% of total data size per second, otherwise you can’t have a 99% hit rate from cache. What applications or environments follow this pattern?
    3) I am a bit puzzled by the sub-ms response time that you get. Here is why. If we assume that we are getting 99% of hits from flash, that still leaves 1% of data coming from disk. The disk latencies are a lot higher than a millisecond, sometimes 5 to 10 ms. This will distort the latency numbers. So I think the latency you mentioned is probably the average latency, not the best-case latency.
    4) As far as inline dedup and compression are concerned, they will surely help in more effective use of flash cache for Tier 2 or Tier 3, but are you getting such big dedup savings in Tier 1? I was thinking Tintri-type storage is used more in Tier 1 scenarios. Maybe I got it wrong.
    5) As far as the working set goes, let me explain my point more clearly. If you get the working set in reverse order as you are saying, then it means you are starting with almost the entire data set in cache and then slowly removing unwanted or unreferenced data from flash cache. This sounds good, but in reality it means a large portion of cache is going to hold useless data for a long time. That is why I was asking how long it takes to get to the steady state shown in the graphs. Otherwise your flash cache cannot function effectively for a long time. Let me give an example. Let’s assume you have a total storage size of 25 TB and a flash cache of 2.5 TB, which is 10%. Let’s also assume that you have 1000 VMs using this total storage. You can only fit 10% of the VMs in the cache, but what will happen to the other 90%? They can’t use flash cache until those 10% of VMs figure out their working set and evict unwanted data from cache.
    6) Dedup also adds overhead such as processing and the penalty incurred when dedup’d data gets modified, etc. This overhead in latency and CPU cost will increase as the total size of data increases. In addition, you need to store dedup tables in flash, which reduces your flash. So are you doing dedup within the VM or across all VMs?

    • 1) When you say small IOs, what type of IOs are you referring to? Do you mean metadata IOs? Tintri’s NFS stack is optimized for VMs, and 99% of VM IOs are data IOs. You can read more here: http://www.tintri.com/blog/2011/07/how-nfs-behavior-changes-in-a-virtualized-environment

      2) First off – we don’t use Flash as cache. Flash is a permanent repository for a lot of VMs, either completely or for the majority of their blocks. So thinking of Flash with Tintri as cache, and then assuming everything that follows from that, would be wrong. As far as a working set of 10% is concerned, you would be surprised how many applications have a working set of less than 5%. Having said that, with Tintri the effective amount of Flash is not 10%. It is much more than that because of deduplication and compression on Flash, which is anywhere between 4x and 10x, making the effective Flash anywhere from 50% to 100% of the total capacity.

      3) Most of the VMs get 100% of their IO from flash, and the VMs that get 99% do so because someone is trying to access snapshot data or very inactive data. We measure and display how much IO is coming from disk, and in those cases the latencies may go higher than 1 ms, but it is neither in a range that is unacceptable to the application (typically within 5ms) nor for an extended time period. The system is smart in terms of managing data on disks. If one does need the extra 1% to come from flash as well, then we allow customers to pin the workload to flash. It is about making a choice – if that extra 1% of IO (for inactive data and for snapshots) from flash is so important to the business that one can justify 4x the cost, then it makes sense to go for it.

      4) No, you absolutely didn’t get it wrong there. We are not only helping customers with their Tier-1 applications but also helping them virtualize more and more of those Tier-1 applications. This is because of good performance, predictability, application isolation and insight across the infrastructure. With us, customers are not limited to virtualizing only Tier-1 applications: because of the low capex, they can get all of the above-mentioned benefits across all of their virtualized applications.

      I am not sure what you think of as Tier-1. My definition of a Tier-1 app is any application where downtime can cause significant loss to either the business or its services. One can have any type of application in that category. Since you said Tier-1 applications don’t get good dedup results, I am assuming that you are talking about DBs. You are right that DBs don’t get good dedup results, but they get very good compression. As I said before, our Flash is not cache, and one can have complete VMs, or the majority of a VM, living in Flash for their entire life. So I would not assume anything based on that understanding.

      5) Again, our Flash is not cache and the effective size of our Flash is not 10%. So most of what you mentioned in point #5 here is not applicable. Also, we don’t mind keeping cold data in Flash as long as the system thinks it can keep that data in Flash for a VM without impacting other VMs. Funny that you mention unwanted/useless data, because the all-flash vendors want everyone to put all of that in flash all the time. You won’t like my answer to “How much time does the system take to learn?” as my answer is “It depends.” We are never in a hurry to calculate, especially because we are so efficient with Flash usage.

      6) Our dedup is across all VMs.
      Overheads and penalties are measured against a baseline. What is the baseline you are assuming here? Our dedup and compression are always on. It is in the system’s DNA. Our baseline is with dedup and compression turned on. It is not a system where dedup and compression were retrofitted. So there is no question of any overhead or penalty.

      The other things you mentioned in your comment were CPU cost and storing metadata in Flash. Those are things that are known to us, so we incorporate them into the design of the hardware. It is all about ensuring the system has enough CPU cores and enough Flash (not that the metadata needs TBs of Flash). Having metadata in Flash has a lot of merits that far outweigh the cost of a very small amount of additional Flash. The same is true for dedup and compression. Dedup and compression not only help in increasing the effective size of Flash but also in reducing the number of IOs that hit Flash. Remember that the system was designed from the ground up for multi-core operation with dedup and compression always turned on, along with a layout designed for Flash.

      -Satinder

  4. Pingback: Simplifying Storage Chargeback/Showback with Tintri – Part 3 (Performance) | Virtual Data Blocks

  5. Pingback: The industry is validating Tintri – Another one comes through | Virtual Data Blocks
