

Recovery Point and Time Objectives are key components of any company’s IT tactical plan. The ability to quickly and accurately recover critical data is necessary, whether the need arises from an outage, legal discovery or a disaster. Virtual tape solutions like those delivered by COPAN Systems have greatly increased a company’s ability to meet these objectives with lightning-fast backup and restore, more reliable access to data and seamless deployment with existing infrastructure. However, the amount of data one could afford to store on one of these systems has been limited, until now. With Single Instance Repository for MAID (SIR-M), the virtual tape cost paradigm has been shattered, enabling IT organizations to store 20 times the data or more in a single footprint with low environmental impact.
Data deduplication (also known as single instance, common factoring or capacity optimized storage) technologies strive to reduce the amount of duplicate data being backed up and then stored. The technologies identify and eliminate common data in and across backup streams. Eliminating the common objects reduces the resulting storage requirement. COPAN believes that data deduplication can provide significant value to customers if the correct approach is taken.
Customer Benefits
The two major benefits of data deduplication are:

However, if not implemented correctly, data deduplication can create serious customer issues including:
There are three other approaches to architecting data deduplication into your backup solution.
There are two primary ways for a storage device or appliance to perform deduplication: offline or inline. There are pros and cons to each, but when it comes to enterprise-class operations, COPAN Systems' offline approach is the only way to meet IT objectives. With SIR-M, the deduplication (data reduction) process can run whenever the IT organization chooses, without impacting critical backup windows. COPAN Systems' SIR-M + Enterprise MAID offers the following benefits:
The client-based approach has a number of merits. First, if you’re interested in a D2D approach rather than VTL, Symantec and Tivoli offer SIR or SIR-like capabilities within the backup (BU) application. By implementing SIR at the BU client, the customer will dramatically decrease the amount of BU data sent across the network and then stored. It does not make sense to add another level of SIR to such a solution. The client-based approach also gives the customer one throat to choke when it comes to data recovery. Because the BU application itself is responsible for SIR, there is no separate downstream deduplication stage that could lose data and cause catastrophic recovery issues.
Appliance-based, pre-data-write-to-storage solution. In this option the customer will implement a SAN-based SIR appliance or a series of them. These appliances deduplicate the BU data stream after it is sent by the client and before it is written to disk. The appliance uses a VTL image to spoof the BU application and computes the common data hash prior to writing to disk. The benefit of this implementation is simply that processing moves from the client to a single-purpose appliance. This may be justified if the client compute platform is architecturally set and it is easier to move the compute function to an appliance. The drawbacks of this solution are more significant. First, there is no network savings between the BU client and the appliance. Second, the appliance has a performance limitation tied to its ability to compute the hash, and overall BU write performance will be significantly reduced. Third, deduplication is separated from the BU application, so if data is lost or corrupted within the deduplication engine, the BU application will have no way to recover.
Storage Platform Based Approach
In this option, deduplication runs as a post-BU-write process within a storage platform. The drawback of this solution is the separation of deduplication from the BU application and the resulting risk of data loss or corruption that is unknown to the BU application. The benefits of this solution are powerful. First, you avoid the BU performance barrier by moving the SIR actions to after the BU application has streamed its write to the storage platform. This allows the full performance of a storage-bandwidth-centric system to be applied to the backup itself. Second, the compute engine requirements, and thus the cost of the system, are reduced because this method performs the deduplication/SIR function as a background task outside of the BU window. Once SIR is complete, the system has minimized the amount of data that would need to be sent to a DR site.
All deduplication products do essentially the same thing: look at data in "chunks" and store only a single copy of each unique chunk. A key attribute of our deduplication technology is that the process runs offline, after the backup completes. SIR reads virtual tape cartridges from the library, analyzes the contents, and establishes a repository of unique blocks of data.
The original virtual tape cartridge is then replaced in the VTL with something we call a Virtual Index Tape that is only a fraction of the size of the original. The space previously occupied by duplicate data in the library is then freed to keep much more data online for longer periods of time.
The deduplication process itself is fairly straightforward. It begins with a module we call the virtual tape scanner reading data from a virtual tape cartridge.
The scanner analyzes the data in variable-sized blocks and uses the industry-standard SHA-1 hashing algorithm to calculate an index value based on the contents of the data. The value is then looked up in an index table, which is pre-allocated and structured for fast lookup, to determine whether the data is already stored in the repository. If it is not, the data is placed into the repository and the index table is updated.
In either event, the index value for the data is returned so that the virtual index tape can be constructed. The virtual index tape will occupy only a fraction of the space required for the original virtual tape cartridge since it only contains metadata and repository index pointers.
SIR is backup tape format-aware for maximum data deduplication efficiency. SIR is not confused by extra information the backup program puts on the virtual tape cartridge. This format-awareness also allows SIR to examine data using different size blocks for different file types to ensure maximum detection of duplicate data.
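The scan-and-index steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not COPAN's implementation: the function name `scan_tape`, the in-memory dictionary standing in for both the repository and its index table, and the fixed 4-byte chunk size are all assumptions made for the example (the text describes variable-sized, format-aware blocks).

```python
import hashlib

# Hypothetical fixed chunk size, chosen only to keep the example readable.
# The product described in the text uses variable-sized blocks instead.
CHUNK_SIZE = 4


def scan_tape(tape: bytes, repository: dict, chunk_size: int = CHUNK_SIZE) -> list:
    """Scan one virtual tape and return its list of index values.

    `repository` maps a SHA-1 hex digest to the single stored copy of a chunk.
    The returned list plays the role of a (simplified) virtual index tape.
    """
    index_tape = []
    for offset in range(0, len(tape), chunk_size):
        chunk = tape[offset:offset + chunk_size]
        key = hashlib.sha1(chunk).hexdigest()  # index value derived from content
        if key not in repository:              # lookup: is the chunk already stored?
            repository[key] = chunk            # no: store it and update the index
        index_tape.append(key)                 # either way, record the index value
    return index_tape


repository = {}
monday = scan_tape(b"AAAABBBBCCCC", repository)
tuesday = scan_tape(b"AAAABBBBDDDD", repository)
```

In this toy run the two tapes hold six chunks in total, but only four unique chunks end up in the repository, because the first two chunks of each tape are identical and are stored once.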

Single Instance Scan:
The file data is extracted and added to the repository. File data is replaced with links to the extracted data.

After Single Instance Scan:
The shadow virtual tapes contain only the backup metadata with the links to the repository file data entries. The links are the keys to retrieve the data when needed.
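As a rough illustration of how links act as retrieval keys, the sketch below rebuilds a tape image from a shadow tape. The record layout (`"meta"` and `"link"` tuples), the helper names `store` and `restore`, and the dictionary-based repository are hypothetical and exist only for this example; the actual on-tape format is not described in the text.

```python
import hashlib


def store(chunk: bytes, repository: dict) -> str:
    """Store file data once and return the link (its SHA-1 hex digest)."""
    key = hashlib.sha1(chunk).hexdigest()
    repository.setdefault(key, chunk)  # keep only the first copy of identical data
    return key


def restore(shadow_tape: list, repository: dict) -> bytes:
    """Rebuild the original tape contents from metadata records and links."""
    out = bytearray()
    for kind, value in shadow_tape:
        if kind == "meta":
            out += value                 # backup metadata is kept on the shadow tape
        else:
            out += repository[value]     # a link is the key into the repository
    return bytes(out)


repository = {}
shadow_tape = [
    ("meta", b"[hdr:report.doc]"),
    ("link", store(b"file contents", repository)),
    ("meta", b"[hdr:copy.doc]"),
    ("link", store(b"file contents", repository)),
]
restored = restore(shadow_tape, repository)
```

Both files on this shadow tape point at the same repository entry, yet the restored image contains the file data twice, exactly as the backup application originally wrote it.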

Summary
Deduplication technology is still new to the market, but awareness has been raised by the so-called "hype factor" that surrounds bleeding-edge technology. Businesses looking to reap the benefits of Single Instance Repository type technology should consider all the options, and their pros and cons, prior to any technology purchase.