Data Duplication And Deduplication
1. Duplication
There is no single agreed definition of a duplicate. One common definition treats two records as duplicates when they contain exactly the same terms in the same sequence, regardless of formatting differences. In other words, the two copies either do not differ at all or differ only in formatting, while the content itself is identical.
In any case, data duplication happens all the time. In large data warehouses it is practically inevitable, because millions of records are gathered at very short intervals.
Several approaches have been implemented to counter data duplication. One approach is to hand-code rules that filter incoming data so that duplicates are not stored. Other approaches apply machine learning techniques or more advanced business intelligence applications. The accuracy of these methods varies, and for very large data collections some of them may be too complex and too expensive to deploy in full.
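The "same content, different formatting" definition can be sketched in a few lines of Python. This is only an illustration: which differences count as formatting (here, whitespace and letter case) is an assumption, and the function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Strip the differences we assume are 'formatting only':
    runs of whitespace and letter case."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_duplicate(a: str, b: str) -> bool:
    """Two texts are duplicates if they match after normalization."""
    return normalize(a) == normalize(b)


# Same terms in the same sequence, different formatting.
print(is_duplicate("Data  Warehouse\n", "data warehouse"))  # True
# Same terms, different sequence: not a duplicate under this definition.
print(is_duplicate("data warehouse", "warehouse data"))     # False
```

A real system would choose its normalization rules to fit its own data, for example by also stripping punctuation or markup.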
Data warehousing involves a process called ETL, which stands for extract, transform and load. During the extraction phase, large volumes of data arrive at the warehouse from several sources. The system behind the warehouse then consolidates the data so that records from each source system, whatever their format, can be read consistently by the warehouse's data consumers.
A data warehouse is basically a database, and when millions of records arrive from other sources, unintentional duplication of records can hardly be avoided. In the data warehousing community, finding duplicated records within large databases has long been a persistent problem and has become an area of active research. Many research efforts have tried to address the problems caused by duplicate contamination of data.
Despite all these countermeasures and the best efforts to clean data, data duplication will never be totally eliminated. It is therefore important to understand its impact on the quality of a data warehouse implementation. In particular, duplicates can skew content distribution: a record that appears several times is counted several times in any aggregate built on top of it.
Some application systems have duplicate detection functions. These functions calculate a hash value for a piece of data or a group of data such as a document. Each document is then checked for duplication by looking up its hash value in either an in-memory hash table or a persistent lookup system. Commonly used hash functions include MD2, MD5, and the SHA family. They are easy to calculate over data or documents of arbitrary length and have a low probability of accidental collision. MD2, MD5 and SHA-1 are no longer considered collision-resistant against deliberate attacks, however, so newer members of the SHA family such as SHA-256 are the safer choice where inputs may be hostile.
Data duplication is also related to problems like plagiarism detection and clustering. Plagiarism, however, may involve either exact duplication or mere similarity to a certain document; a plagiarized document may copy the abstract idea rather than the word-for-word content. Clustering, on the other hand, is a method for grouping data that share similar characteristics. It is used for fast retrieval of relevant information from a database.
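The hash lookup described above can be sketched as follows. This is a minimal illustration, not any particular product's implementation: it assumes documents are plain strings, uses an in-memory dictionary as the lookup system, and picks SHA-256 from Python's standard hashlib module. The function name find_duplicates is hypothetical.

```python
import hashlib


def find_duplicates(documents):
    """Return (duplicate_index, first_seen_index) pairs for exact duplicates."""
    seen = {}          # hash value -> index of the first document with it
    duplicates = []
    for i, doc in enumerate(documents):
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append((i, seen[digest]))
        else:
            seen[digest] = i
    return duplicates


docs = ["alpha", "beta", "alpha", "gamma", "beta"]
print(find_duplicates(docs))  # [(2, 0), (4, 1)]
```

Swapping the dictionary for a database table keyed on the hash value would give the persistent variant of the same lookup.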
2. Deduplication
Data deduplication (often called "intelligent compression" or "single-instance storage") is a method of reducing storage needs by eliminating redundant data. Only one unique instance of the data is actually retained on storage media, such as disk or tape. Redundant data is replaced with a pointer to the unique data copy. For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy. In this example, a 100 MB storage demand could be reduced to only one MB.
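The email attachment example can be reproduced with a small sketch. It is an assumption-laden toy: the class SingleInstanceStore and the mailbox paths are hypothetical, the "storage media" is a Python dictionary, and one megabyte is taken as 1,000,000 bytes.

```python
import hashlib


class SingleInstanceStore:
    """Keeps one copy of each unique payload; every name is just a pointer."""

    def __init__(self):
        self.blobs = {}     # hash value -> the single stored copy
        self.pointers = {}  # item name -> hash value of its content

    def put(self, name: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # stored only the first time
        self.pointers[name] = digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.pointers[name]]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


attachment = b"x" * 1_000_000  # one 1 MB attachment
store = SingleInstanceStore()
for i in range(100):
    store.put(f"mailbox-{i}/report.pdf", attachment)

print(len(store.pointers), store.stored_bytes())  # 100 1000000
```

All 100 names remain retrievable, yet only one megabyte of payload is held, which matches the 100 MB to 1 MB reduction described above.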
Data deduplication offers other benefits. Lower storage space requirements save money on disk expenditures. More efficient use of disk space also allows for longer disk retention periods, which provides better recovery time objectives (RTO) for a longer time and reduces the need for tape backups. Data deduplication also reduces the data that must be sent across a WAN for remote backups, replication, and disaster recovery.
Data deduplication can generally operate at the file or block level. File deduplication eliminates duplicate files (as in the example above), but it is not a very efficient means of deduplication, because a file that differs by even one byte is stored again in full. Block deduplication looks within a file and saves only the unique blocks. Each chunk of data is processed using a hash algorithm such as MD5 or SHA-1. This produces a hash number that identifies the chunk and is stored in an index. If a file is updated, only the changed data is saved. That is, if only a few bytes of a document or presentation are changed, only the changed blocks are saved; the changes don't constitute an entirely new file. This behavior makes block deduplication far more efficient. However, block deduplication takes more processing power and uses a much larger index to track the individual pieces.
Hash collisions are a potential problem with deduplication. When a piece of data receives a hash number, that number is compared with the index of existing hash numbers. If the hash number is already in the index, the piece of data is considered a duplicate and does not need to be stored again. Otherwise the new hash number is added to the index and the new data is stored. In rare cases, the hash algorithm may produce the same hash number for two different chunks of data. When such a collision occurs, the system won't store the new data because it sees that its hash number already exists in the index. This is called a false positive, and it can result in data loss. Some vendors combine hash algorithms to reduce the possibility of a hash collision. Some vendors also examine metadata to identify data and prevent collisions.
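Block-level deduplication can be sketched like this. The sketch makes several assumptions that are not from the text: blocks have a fixed size (kept tiny here so the output is easy to follow), the index is a dictionary, and the hash is SHA-256. The names store_file and restore are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # bytes; unrealistically small, for demonstration only


def split_blocks(data: bytes, size: int = BLOCK_SIZE):
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def store_file(data: bytes, index: dict):
    """Add a file's blocks to the index.

    Returns the file's 'recipe' (ordered list of block hashes) and the
    number of blocks that actually had to be stored.
    """
    recipe, new_blocks = [], 0
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in index:
            index[digest] = block
            new_blocks += 1
        recipe.append(digest)
    return recipe, new_blocks


def restore(recipe, index: dict) -> bytes:
    """Rebuild a file from its recipe."""
    return b"".join(index[digest] for digest in recipe)


index = {}
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBXXXXDDDD"  # same file with one block changed
recipe1, new1 = store_file(v1, index)
recipe2, new2 = store_file(v2, index)
print(new1, new2)  # 4 1
```

Storing the updated file adds only the one changed block, while both versions can still be restored from their recipes. Fixed-size blocks only catch changes made in place; inserting bytes shifts every later block boundary.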
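The false positive described above, and one way to guard against it, can be shown with a deliberately weak hash. Everything here is an assumption made for illustration: weak_hash is a hypothetical function chosen so that collisions are easy to trigger, and byte-for-byte comparison on a hash match is one possible safeguard, not necessarily what any vendor does.

```python
def weak_hash(chunk: bytes) -> int:
    """Deliberately poor hash (sum of bytes mod 256) so collisions are easy."""
    return sum(chunk) % 256


def store_unverified(chunks):
    """Trust the hash alone: a colliding chunk is silently dropped."""
    index = {}
    for chunk in chunks:
        index.setdefault(weak_hash(chunk), chunk)
    return index


def store_verified(chunks):
    """On a hash match, compare the bytes before treating it as a duplicate."""
    index = {}  # hash number -> list of distinct chunks sharing it
    for chunk in chunks:
        bucket = index.setdefault(weak_hash(chunk), [])
        if chunk not in bucket:
            bucket.append(chunk)
    return index


chunks = [b"ab", b"ba", b"ab"]  # b"ab" and b"ba" collide under weak_hash

unverified = store_unverified(chunks)
verified = store_verified(chunks)

print(len(unverified))                                   # 1 (b"ba" was lost)
print(sum(len(bucket) for bucket in verified.values()))  # 2
```

The verified variant costs an extra comparison on every match, which is the trade-off for never discarding data that merely shares a hash number.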
In actual practice, data deduplication is often used in conjunction with other forms of data reduction such as conventional compression and delta differencing. Taken together, these three techniques can be very effective at optimizing the use of storage space.