Data cleaning is the detection and removal (or correction) of errors and inconsistencies in a data set or database, caused by corruption or inaccurate entry of the data. Incomplete, inaccurate or irrelevant data is identified and then either replaced, modified or deleted.
Incorrect or inconsistent data can cause a number of problems that lead to false conclusions, so data cleaning can be an important element in some data analysis situations. However, data cleaning is not without risks and problems, including the loss of important information or valid data.
A large variety of tools is available to support data cleaning. Additionally, many statistical programs have data validation built in, which can pick up some errors automatically, for example non-valid variable codes.
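Even without a dedicated tool, this kind of validation is easy to sketch in a few lines. The following is a minimal illustration in plain Python, using a hypothetical categorical attribute and a made-up set of valid codes; it simply flags any recorded value that is not in the allowed set.

```python
# Hypothetical example: the set of valid codes for a categorical attribute.
valid_codes = {"brown", "blue", "green", "grey"}

# Made-up recorded values, one of which ("bbrown") is not a valid code.
recorded = ["brown", "blue", "bbrown", "green", "blue"]

# A non-valid code is easy to detect automatically by checking membership.
invalid = [value for value in recorded if value not in valid_codes]
print(invalid)  # ['bbrown']
```

An invalid value found this way can then be corrected by hand or rejected, as discussed below.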
Even when the data is in the standard form it cannot be assumed that it is error free. In real-world datasets erroneous values can be recorded for a variety of reasons, including measurement errors, subjective judgements and the malfunctioning or misuse of automatic recording equipment.

Erroneous values can be divided into those which are possible values of the attribute and those which are not. Although usage of the term noise varies, in this book we will take a noisy value to mean one that is valid for the data set, but is incorrectly recorded. For example, the number 69.72 may accidentally be entered as 6.972, or a categorical attribute value such as brown may accidentally be recorded as another of the possible values, such as blue. Noise of this kind is a perpetual problem with real-world data.

A far smaller problem arises with noisy values that are invalid for the data set, such as 69.7X for 6.972 or bbrown for brown. We will consider these to be invalid values, not noise. An invalid value can easily be detected and either corrected or rejected.

It is hard to see even very ‘obvious’ errors in the values of a variable when they are ‘buried’ amongst, say, 100,000 other values. In attempting to ‘clean up’ data it is helpful to have a range of software tools available, especially to give an overall visual impression of the data, in which some anomalous values or unexpected concentrations of values may stand out. However, in the absence of special software, even some very basic analysis of the values of variables may be helpful. Simply sorting the values into ascending order (which for fairly small datasets can be accomplished using just a standard spreadsheet) may reveal unexpected results, such as the following.
– A numerical variable may only take six different values, all widely separated. It would probably be best to treat this as a categorical variable rather than a continuous one.
– All the values of a variable may be identical. The variable should be treated as an ‘ignore’ attribute.
– All the values of a variable except one may be identical. It is then necessary to decide whether the one different value is an error or a significantly different value. In the latter case the variable should be treated as a categorical attribute with just two values.
– There may be some values that are outside the normal range of the variable. For example, the values of a continuous attribute may all be in the range 200 to 5000 except for the highest three values which are 22654.8, 38597 and 44625.7. If the data values were entered by hand a reasonable guess is that the first and third of these abnormal values resulted from pressing the initial key twice by accident and the second one is the result of leaving out the decimal point. If the data were recorded automatically it may be that the equipment malfunctioned. This may not be the case but the values should certainly be investigated.
– We may observe that some values occur an abnormally large number of times. For example, if we were analysing data about users who registered for a web-based service by filling in an online form, we might notice that the ‘country’ part of their addresses took the value ‘Albania’ in 10% of cases. It may be that we have found a service that is particularly attractive to inhabitants of that country. Another possibility is that users who registered either failed to choose from the choices in the country field, causing a (not very sensible) default value to be taken, or did not wish to supply their country details and simply selected the first value in a list of options. In either case it seems likely that the rest of the address data provided for those users may be suspect too.
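The basic checks described above need nothing more than sorting and counting. The sketch below illustrates them in plain Python on made-up values: sorting brings anomalous extremes to the end of the list, counting distinct values hints at a variable that is really categorical (or should be ignored), and frequency counts expose values that occur abnormally often.

```python
from collections import Counter

# Made-up values for a continuous attribute whose normal range is
# roughly 200 to 5000, plus three suspiciously large values.
values = [231.4, 480.0, 3950.2, 212.7, 22654.8, 38597.0, 44625.7, 4999.9]

# Sorting into ascending order pushes the anomalous values to the end,
# where they stand out for investigation.
print(sorted(values)[-3:])  # [22654.8, 38597.0, 44625.7]

# Very few distinct values may suggest the variable is really categorical;
# a single distinct value suggests an 'ignore' attribute.
print(len(set(values)))

# Frequency counts can expose a value occurring abnormally often,
# such as a default country selected in an online form.
countries = ["UK", "Albania", "France", "Albania", "Albania", "Germany"]
print(Counter(countries).most_common(1))  # [('Albania', 3)]
```

None of this proves that the flagged values are wrong; as with the examples above, it only identifies values that should certainly be investigated.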