Data Care for Industrial Apps: Firmware Addresses Insufficient Lifespan of Flash

Flash memory degrades with normal usage. How quickly this occurs depends upon a range of different factors. If a memory card is to reliably perform its task over a period of years without needing to be replaced – as with an industrial embedded memory solution, or perhaps in an outside location such as a telecommunications system – then it must be of a higher quality than a standard product.

(Photo courtesy: Swissbit AG)

In this case, electronic components with an extended temperature range and robust workmanship are typical requirements. More important still, the storage media must be able to preserve data reliably for a long time. Durability requires the active assistance of the controller: the internal operation of a flash memory device is crucial to how long it remains writable without losing any data. To understand this, one needs to know how flash memory devices age.

Gradual deterioration

The cells of a NAND flash can only withstand a limited number of program and erase cycles. Why? The programming voltage generates a tunnel effect by which electrons are pushed onto the floating gate or charge trap layer, where they are stored. The problem is this: over a large number of programming cycles, high-energy electrons also accumulate in the oxide layer. As a result, the threshold voltage shifts over time to the point where the cell is ultimately no longer readable.

An aging cell: Electrons accumulate in the tunnel oxide layer, causing the threshold voltage to change gradually. Cracks in the tunnel oxide layer create leakage current paths that allow the charge to drain away. Read errors increase to the point where the block becomes a “bad block” that needs to be retired. (Photo courtesy: Swissbit AG)

There is also a further aging effect: conductive paths form in the oxide layer, causing the cell to gradually lose its charge, and with it the stored bit. This effect is exacerbated by high temperatures. Experiments on a 25 nm Multi-Level Cell (MLC) NAND have shown that, after 50% of the allowed erase cycles have been consumed, retention falls to about 25% of its initial value at 55 °C. At 85 °C it falls to below 10%.

The effect also grows stronger the closer the cell comes to its maximum number of program/erase cycles (P/E cycles). The impact on retention is huge: while new single-level cell (SLC) and multi-level cell NANDs both reliably provide 10 years of retention, this figure declines to only one year by the end of their lifetimes. With MLC NANDs, however, this point is reached after 3,000 P/E cycles, and with SLCs only after 100,000. That is another reason why SLC is so popular for especially challenging applications.

Stress during reading

In terms of retention, data is secure over a longer period if as little erasing and rewriting as possible takes place. However, it would be a mistake to presume that a data medium that is primarily read does not age.

Here, a second aspect has to be taken into account, one that leads to read errors and indirectly to wear of the NAND cells. Cells close to the cell being programmed are stressed during writing: they show a slightly increased voltage (Program Disturb). Reading causes a similar stress (Read Disturb). Here, it is the neighbouring pages that accumulate the charge.

Over time, the stored potential in these cells increases, resulting in read errors, which disappear after the block has been erased. Because the charge involved is lower, the effect is weaker during reading than during writing, but bit errors occur here as well. They are compensated for by the Error Correcting Code (ECC) and have to be fixed by erasing the relevant block. Note that this effect is particularly pronounced in applications that read the same data repeatedly. This means that even within a read-only memory, blocks need to be erased and pages rewritten again and again in order for errors to be corrected.
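One way to picture Read Disturb management is a read counter per block. The following is a minimal sketch only: the class names and the limit of 100,000 reads are hypothetical illustrations, not values from any real controller, whose thresholds depend on the NAND in use.

```python
from dataclasses import dataclass, field

# Hypothetical limit: reads a block tolerates before its data is relocated.
READ_DISTURB_LIMIT = 100_000


@dataclass
class Block:
    read_count: int = 0
    erase_count: int = 0


@dataclass
class ReadDisturbTracker:
    limit: int = READ_DISTURB_LIMIT
    blocks: dict = field(default_factory=dict)

    def on_read(self, block_id: int) -> bool:
        """Record one page read; return True if the block should be refreshed."""
        block = self.blocks.setdefault(block_id, Block())
        block.read_count += 1
        return block.read_count >= self.limit

    def on_refresh(self, block_id: int) -> None:
        """Data was copied elsewhere and the block erased: reset the counter."""
        block = self.blocks.setdefault(block_id, Block())
        block.read_count = 0
        block.erase_count += 1
```

The point of the sketch is that a refresh costs one erase cycle, so even a purely read workload slowly consumes P/E cycles.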

Data maintenance

The bits, stored as charge differences in the NAND cells, are constantly at risk of degrading. Manufacturers of flash media intended for use in machines, industrial installations or vehicles rely on dedicated processes to ensure the integrity of the stored data. A combination of mechanisms such as ECC Monitoring, Read Disturb Management and Background Data Refresh ensures that all of the stored data is monitored and, if necessary, refreshed. In that way, system failures can be avoided from the outset. Data integrity should be guaranteed without the participation of the host application. For this reason, these processes run autonomously on the memory card.

When bit read errors become frequent, the ECC initially serves to trigger the rewriting of the block concerned and the erasure of the defective block. This mechanism, however, only works when the host application issues a read request. Creeping corruption of data that has not been read for some time remains untreated. For this reason, advanced Data Care Management searches for potential errors independently of requests by the application.
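A background scan of this kind can be sketched in a few lines. This is an illustration under stated assumptions: `count_bit_errors` and `refresh_block` are hypothetical controller hooks, and the threshold is an arbitrary example value that would sit below the ECC correction limit in practice.

```python
from typing import Callable, Iterable, List


def background_scan(written_blocks: Iterable[int],
                    count_bit_errors: Callable[[int], int],
                    refresh_block: Callable[[int], None],
                    refresh_threshold: int) -> List[int]:
    """Read every written block and refresh those with too many bit errors.

    count_bit_errors and refresh_block are hypothetical controller hooks;
    refresh_threshold should sit below the ECC correction limit so that
    data is rewritten while it is still correctable.
    """
    refreshed = []
    for block_id in written_blocks:
        if count_bit_errors(block_id) > refresh_threshold:
            refresh_block(block_id)   # copy corrected data to a fresh block
            refreshed.append(block_id)
    return refreshed


# Example with made-up error counts per block.
errors = {0: 2, 1: 30, 2: 0, 3: 41}
moved = []
result = background_scan(errors, errors.get, moved.append, refresh_threshold=24)
```

Because the scan visits every written block, it also catches data that the host has not requested for a long time.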

Data Care Management counteracts gradual data loss. In the background, all the written blocks are read and, in cases of too many bit errors, they are copied, repaired and written again. (Photo courtesy: Swissbit AG)

To do this, all of the written pages, including the firmware and the mapping table of the Flash Translation Layer (FTL), are read in the background and refreshed. There are various triggers for this process of precautionary error correction. It can be initiated after a predefined number of device starts, with the refresh itself delayed as long as necessary so as not to disturb any boot process. Another trigger depends on the number of P/E cycles that have been carried out: at the beginning of the lifespan a refresh is rarely started, but as the number of P/E cycles increases, the interval between refreshes shrinks. To counteract the consequences of the Read Disturb effect, the controller also uses the amount of data read as a cue to copy data to fresh blocks.
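The three triggers can be combined in a simple check. The sketch below is hypothetical throughout: the field names, the linear shrinking of the interval, and every limit except the 3,000 P/E cycles quoted earlier for MLC are illustrative assumptions, not Swissbit firmware behaviour.

```python
from dataclasses import dataclass


@dataclass
class DeviceState:
    power_cycles_since_refresh: int
    average_pe_cycles: int
    days_since_refresh: float
    bytes_read_since_refresh: int


def refresh_interval_days(pe_cycles, rated_pe_cycles,
                          longest=180.0, shortest=7.0):
    """Shrink the refresh interval linearly as P/E cycles are consumed."""
    wear = min(max(pe_cycles / rated_pe_cycles, 0.0), 1.0)
    return longest - wear * (longest - shortest)


def due_triggers(state, rated_pe_cycles=3000,
                 power_cycle_limit=50, read_limit_bytes=10**12):
    """Return the names of all refresh triggers that have fired."""
    due = []
    if state.power_cycles_since_refresh >= power_cycle_limit:
        due.append("power_cycles")
    interval = refresh_interval_days(state.average_pe_cycles, rated_pe_cycles)
    if state.days_since_refresh >= interval:
        due.append("wear_interval")
    if state.bytes_read_since_refresh >= read_limit_bytes:
        due.append("read_volume")
    return due
```

A fresh device in this model waits up to 180 days between refreshes, while a fully worn one is refreshed weekly.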

The number of repeated read attempts is a very important trigger. Bits that were not recognized at the first attempt can often still be read by the Read Retry mechanism, which increases the threshold voltage step by step. This compensates for errors that occur due to differences between read and write temperatures. However, a successful retry is also treated as a warning, because read errors can be caused by aging as well as by the Read Disturb effect. Finally, there is an example of an application-specific refresh trigger that takes into account the effect of higher temperatures on retention: has the data medium gone unused for a number of days at high temperatures? If so, start a refresh!
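The Read Retry idea can be sketched as a loop over threshold steps. This is a simplified illustration: `read_page` is a hypothetical hook that returns the raw bit-error count at a given threshold step, and the number of steps is an arbitrary assumption.

```python
from typing import Callable, Optional, Tuple


def read_with_retry(read_page: Callable[[int], int],
                    ecc_limit: int,
                    max_steps: int = 8) -> Tuple[Optional[int], bool]:
    """Try increasing read-threshold steps until ECC can correct the page.

    read_page(step) is a hypothetical hook returning the raw bit-error count
    when the page is sensed at threshold step `step` (0 = default).
    Returns (step that succeeded or None, refresh_recommended).
    """
    for step in range(max_steps + 1):
        if read_page(step) <= ecc_limit:
            # Any retry is treated as a warning sign, so recommend a refresh.
            return step, step > 0
    return None, True  # uncorrectable at every step
```

A page that reads cleanly at step 0 needs no action; a page that only becomes readable after a retry is returned to the host but flagged for refresh.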

The storage life of flash memory devices decreases significantly at high temperatures. Specialist manufacturers, such as Swissbit, provide devices in which a series of mechanisms counteract decay. (Photo courtesy: Swissbit AG)

Flash memory for industry

Aging effects and ways to address “data fade” have been explained above. It is clear that a series of internal processes is needed to solve these problems, and that these processes alone do not solve them completely. Garbage collection, which consolidates dispersed data in order to free up entire blocks, and wear levelling, which distributes writes evenly across the cells, are just two examples of processes involved in read, write and erase activities.

Another interesting challenge for producers who want to sell long-lasting flash storage devices is the Write Amplification Factor (WAF). This is the ratio of the amount of data actually written to the flash memory to the amount of user data sent by the host computer.

It is a measure of how efficiently a flash controller works. Reducing the WAF is key to increasing the flash memory’s lifespan. The difference between sequential and random access, and the size of the data chunks in relation to page and block sizes, are factors that influence WAF. The reason for these relationships is the way flash memory devices function: pages within a block have to be written one after the other, but blocks can only be erased as a whole. In the standard procedure, logical addresses are mapped to physical addresses at block level. This is highly efficient for sequential data because the pages of a given block can be written sequentially. An example is continuously recorded video data. With random data, however, pages are written to many different blocks, and in some circumstances every internal overwrite of a single page may require an entire block to be erased.

This results in a higher WAF and a reduced lifetime. Page-based mapping is therefore preferable for non-sequential data. In other words, the firmware ensures that data from disparate sources can be written sequentially to the pages of a single block. This reduces the number of erase operations and thus prolongs lifetime and improves write performance. The drawback of this method is that it requires a larger FTL mapping table, though this can be compensated for by an integrated DRAM.
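The difference between the two mapping schemes can be illustrated with a deliberately crude cost model. It is a worst-case sketch, not a description of any real FTL: with block mapping, every write that does not continue the sequential stream is assumed to rewrite the whole block, and with page mapping garbage collection is ignored entirely. The function names are made up for this example.

```python
def nand_page_writes(host_pages, pages_per_block, mapping):
    """Count NAND page programs for a stream of logical page numbers.

    Toy worst-case model: with block mapping, any write that does not continue
    the sequential stream forces the whole block to be rewritten; with page
    mapping, every write is appended to the open block (garbage collection
    is ignored here).
    """
    if mapping == "page":
        return len(host_pages)
    if mapping != "block":
        raise ValueError("mapping must be 'page' or 'block'")
    total, prev = 0, None
    for lpn in host_pages:
        sequential = prev is None or lpn == prev + 1
        total += 1 if sequential else pages_per_block
        prev = lpn
    return total


def waf(host_pages, pages_per_block, mapping):
    """Write amplification: NAND page writes divided by host page writes."""
    return nand_page_writes(host_pages, pages_per_block, mapping) / len(host_pages)


sequential_stream = list(range(256))   # e.g. continuously recorded video
scattered_stream = [0, 100, 5, 200]    # small random updates
```

In this model the sequential stream has a WAF of 1 under both schemes, while the scattered stream is amplified heavily under block mapping and not at all under page mapping.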

Increasing efficiency

What many do not know is that WAF rises significantly with the degree of utilization of the data medium. Why? Because the more data is stored on a flash device, the more data the firmware needs to move from one place to another. Page-based mapping is advantageous here as well.
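The link between utilization and WAF can be reproduced with a small simulation. The following is a toy page-mapped FTL with greedy garbage collection, written for illustration only: the geometry (64 blocks of 32 pages), the uniform random workload and the victim selection policy are assumptions, and real controllers are considerably more sophisticated.

```python
import random


def simulate_waf(utilization, num_blocks=64, pages_per_block=32,
                 host_writes=20_000, seed=0):
    """Toy page-mapped FTL with greedy garbage collection; returns the WAF."""
    logical_pages = int(num_blocks * pages_per_block * utilization)
    if not 0 < logical_pages <= (num_blocks - 3) * pages_per_block:
        raise ValueError("utilization leaves too little spare area for this model")

    rng = random.Random(seed)
    valid = [set() for _ in range(num_blocks)]  # live logical pages per block
    used = [0] * num_blocks                     # pages programmed since last erase
    where = {}                                  # logical page -> current block
    free = list(range(1, num_blocks))           # erased blocks
    state = {"open": 0, "nand_writes": 0}

    def program(lpn):
        if used[state["open"]] == pages_per_block:
            state["open"] = free.pop()          # open block is full: take a new one
        blk = state["open"]
        if lpn in where:
            valid[where[lpn]].discard(lpn)      # old copy becomes invalid
        valid[blk].add(lpn)
        where[lpn] = blk
        used[blk] += 1
        state["nand_writes"] += 1

    def collect():
        # Greedy: reclaim the full block holding the fewest live pages.
        full = [b for b in range(num_blocks)
                if b != state["open"] and used[b] == pages_per_block]
        victim = min(full, key=lambda b: len(valid[b]))
        for lpn in list(valid[victim]):
            program(lpn)                        # relocation costs extra NAND writes
        used[victim] = 0                        # erase the victim
        free.append(victim)

    def host_write(lpn):
        while len(free) < 2:                    # keep a reserve for relocation
            collect()
        program(lpn)

    for lpn in range(logical_pages):            # precondition: fill once
        host_write(lpn)
    state["nand_writes"] = 0                    # measure only the overwrite phase
    for _ in range(host_writes):                # uniformly random overwrites
        host_write(rng.randrange(logical_pages))
    return state["nand_writes"] / host_writes
```

Calling `simulate_waf(0.5)` and `simulate_waf(0.9)` compares a half-full device with a nearly full one. The fuller the device, the more live pages each reclaimed block still holds, and the more data has to be moved for every host write.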

Some manufacturers have yet another adjustment mechanism at their disposal, known as over-provisioning: the portion of a flash memory device that is reserved solely for background activities. Seven percent of an SSD is normally reserved for this purpose, which corresponds to the difference between binary and decimal gigabyte figures. Reserving 12% of the data medium for over-provisioning instead of 7% has an astonishing effect: an SSD with 12% over-provisioning that is otherwise identical to another has up to 80% higher endurance. The difference is even more pronounced in combination with a DRAM. An endurance comparison of two otherwise identical SSDs based on MLC NAND chips has shown that the 60 GB Swissbit F-60 durabit with an integrated DRAM achieved a value 6.6 times higher than the 64 GB F-50 device without additional DRAM. For the 240 and 265 GB versions, the value is in fact ten times higher. (The comparison was based on TBW, terabytes written, under the enterprise workload. TBW is the total quantity of data, in TB, that can be written during the lifespan; the enterprise workload represents the most demanding conditions that the JEDEC standards organisation has defined.)
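The figure of seven percent can be checked with a little arithmetic. The sketch below assumes the commonly used definition of over-provisioning as spare capacity relative to user capacity; the function names are made up for this example.

```python
GIB = 2**30   # binary gigabyte, the unit in which NAND capacity is built
GB = 10**9    # decimal gigabyte, the unit in which capacity is advertised


def over_provisioning_pct(raw_bytes, user_bytes):
    """Spare area as a percentage of the user-visible capacity."""
    return (raw_bytes - user_bytes) / user_bytes * 100


def user_capacity_bytes(raw_bytes, op_pct):
    """User capacity left when op_pct percent over-provisioning is reserved."""
    return raw_bytes / (1 + op_pct / 100)


# 64 GiB of NAND sold as a 64 GB drive: the inherent binary/decimal gap.
inherent = over_provisioning_pct(64 * GIB, 64 * GB)   # about 7.37 percent

# The same NAND with 12 percent reserved exposes less capacity to the host.
user_at_12 = user_capacity_bytes(64 * GIB, 12)
```

The price of higher over-provisioning is visible in the second calculation: the same flash chips yield a smaller advertised capacity.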

What the controller does

High-quality components and good workmanship should go hand in hand in flash storage devices for industrial applications. To a large extent, lifespan and reliable data storage depend on what the controller does. Using advanced firmware, Swissbit has been able to extend the lifespan of memory cards and SSDs considerably and to counteract unavoidable aging. Developers and manufacturers should look for this if durability is important in their own applications.
