Why is Data Preprocessing needed, and which techniques are used for Data Preprocessing?

Why is Data Preprocessing needed?

  • Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources.
  • Low-quality data will lead to low-quality mining results. How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?
  • Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.
  • Example
    • Imagine that you are a manager at AllElectronics and have been charged with analyzing the company’s data with respect to your branch’s sales.
  • You immediately set out to perform this task. You carefully inspect the company’s database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis.
  • Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded.
  • Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions.
  • In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data); inaccurate or noisy (containing errors, or values that deviate from the expected); and inconsistent (e.g., containing discrepancies in the department codes used to categorize items).
  • The above example illustrates three of the elements defining data quality: accuracy, completeness, and consistency.
  • Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses.
  • There are many possible reasons for inaccurate data (i.e., having incorrect attribute values). The data collection instruments used may be faulty.
  • There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value “January 1” displayed for birthday). This is known as disguised missing data. Errors in data transmission can also occur.
  • There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date).
  • Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data.
  • Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted.
  • Furthermore, the recording of the data history or modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
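
The three quality problems described above (incomplete, inaccurate, and inconsistent data) can be made concrete with a small audit. The following is only a minimal sketch in plain Python: the records, the field names (`on_sale`, `dept`, `birthday`), and the helper functions are all hypothetical and chosen to mirror the AllElectronics example, not taken from any real system.

```python
from collections import Counter

# Hypothetical sales records; every field name and value here is illustrative.
records = [
    {"item": "TV", "price": 499.0, "units_sold": 3, "on_sale": None,
     "dept": "ELEC", "birthday": "1985-07-12"},
    {"item": "Radio", "price": None, "units_sold": 5, "on_sale": None,
     "dept": "elec", "birthday": "1900-01-01"},
    {"item": "Camera", "price": 250.0, "units_sold": -2, "on_sale": None,
     "dept": "Electronics", "birthday": "1900-01-01"},
]

def missing_counts(rows):
    """Incomplete data: count unrecorded (None) values per attribute."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return dict(counts)

def disguised_missing(rows, field, suffix):
    """Possible disguised missing data: rows whose value ends with a default."""
    return [i for i, row in enumerate(rows)
            if row[field] is not None and row[field].endswith(suffix)]

def below_minimum(rows, field, minimum):
    """Inaccurate data: rows whose value falls below a plausible minimum."""
    return [i for i, row in enumerate(rows)
            if row[field] is not None and row[field] < minimum]

def distinct_codes(rows, field):
    """Inconsistent data: list every spelling used for a code attribute."""
    return sorted({row[field] for row in rows if row[field] is not None})

missing = missing_counts(records)
suspect_birthdays = disguised_missing(records, "birthday", "-01-01")
negative_units = below_minimum(records, "units_sold", 0)
dept_codes = distinct_codes(records, "dept")

print("Missing values per attribute:", missing)
print("Rows with default-looking birthday:", suspect_birthdays)
print("Rows with negative units sold:", negative_units)
print("Department code spellings:", dept_codes)
```

A birthday of January 1 can of course be genuine, so the flagged rows are only candidates for review, not confirmed errors.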

Data Preprocessing Methods/Techniques

  • Data Cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
  • Data Integration combines data from multiple sources into a coherent data store, as in data warehousing.
  • In Data Transformation, the data are transformed or consolidated into forms appropriate for mining.
  • Data Reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
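
The four techniques listed above can be illustrated end to end on a toy data set. This is a sketch under stated assumptions, not a definitive pipeline: the `sales` rows, the `catalog` lookup, and the specific choices (mean imputation for cleaning, a key lookup for integration, min-max normalization for transformation, and aggregation for reduction) are illustrative picks, and each technique has many other variants.

```python
from statistics import mean

# Hypothetical inputs: transaction rows from one source, item names from another.
sales = [
    {"item_id": 1, "price": 10.0, "units": 2},
    {"item_id": 2, "price": None, "units": 1},   # missing price
    {"item_id": 1, "price": 10.0, "units": 3},
    {"item_id": 3, "price": 40.0, "units": 4},
]
catalog = {1: "cable", 2: "mouse", 3: "keyboard"}

def clean(rows):
    """Data cleaning: fill missing prices with the mean of the observed prices."""
    fill = mean(r["price"] for r in rows if r["price"] is not None)
    return [{**r, "price": fill if r["price"] is None else r["price"]}
            for r in rows]

def integrate(rows, names):
    """Data integration: attach the item name held in a second source."""
    return [{**r, "name": names[r["item_id"]]} for r in rows]

def transform(rows):
    """Data transformation: min-max normalize price into the range [0, 1]."""
    low = min(r["price"] for r in rows)
    high = max(r["price"] for r in rows)
    span = (high - low) or 1.0  # avoid division by zero when all prices match
    return [{**r, "price_norm": (r["price"] - low) / span} for r in rows]

def reduce_rows(rows):
    """Data reduction: aggregate transactions into total units per item."""
    totals = {}
    for r in rows:
        totals[r["name"]] = totals.get(r["name"], 0) + r["units"]
    return totals

cleaned = clean(sales)
integrated = integrate(cleaned, catalog)
transformed = transform(integrated)
reduced = reduce_rows(transformed)

print(reduced)
```

Aggregation is used here as the reduction step because it shrinks four transaction rows to three per-item totals while still answering a "units sold per item" query; sampling or attribute selection would be alternative choices.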
Raju Singhaniya
Oct 14, 2021