Data Pre-processing

Data Pre-processing

Data pre-processing is an often neglected but important step in the data mining process. The phrase "Garbage In, Garbage Out" is particularly applicable to data mining and machine learning projects. Data gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data is first and foremost before running an analysis.[1]

If there is much irrelevant and redundant information present or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.[2]

References

  1. ^ Pyle, D., 1999. Data Preparation for Data Mining. Morgan Kaufmann Publishers, Los Altos, California.
  2. ^ S. Kotsiantis, D. Kanellopoulos, P. Pintelas, "Data Preprocessing for Supervised Leaning", International Journal of Computer Science, 2006, Vol 1 N. 2, pp 111-117.

Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать реферат

Look at other dictionaries:

  • Data binning — is a data pre processing technique used to reduce the effects of minor observation errors. The original data values which fall in a given small interval, a bin, are replaced by a value representative of that interval, often the central value. It… …   Wikipedia

  • Data mining — Not to be confused with analytics, information extraction, or data analysis. Data mining (the analysis step of the knowledge discovery in databases process,[1] or KDD), a relatively young and interdisciplinary field of computer science[2][3] is… …   Wikipedia

  • Data Protection Act 1998 — The Data Protection Act 1998 is a United Kingdom Act of Parliament which defines UK law on the processing of data on identifiable living people. It is the main piece of legislation that governs the protection of personal data in the UK. Although… …   Wikipedia

  • Data General — Industry Computer Fate Acquired Successor EMC Corporation Founded 1968 …   Wikipedia

  • Pre-flight (printing) — Pre flighting is a term used in the printing industry to describe the process of confirming that the digital files required for the printing process are all present, valid, correctly formatted, and of the desired type. The term originates from… …   Wikipedia

  • Processing — Apparu en 2001 Auteurs Ben Fry et Casey Reas …   Wikipédia en Français

  • Data quality — Data are of high quality if they are fit for their intended uses in operations, decision making and planning (J. M. Juran). Alternatively, the data are deemed of high quality if they correctly represent the real world construct to which they… …   Wikipedia

  • Data, context and interaction — (DCI) is a paradigm used in computer software to program systems of communicating objects. Its goals are: To improve the readability of object oriented code by giving system behavior first class status; To cleanly separate code for rapidly… …   Wikipedia

  • Data warehouse appliance — In computing, a data warehouse appliance consists of an integrated set of servers, storage, operating system(s), DBMS and software specifically pre installed and pre optimized for data warehousing (DW). Alternatively, the term can also apply to… …   Wikipedia

  • pre-Columbian civilizations — Introduction       the aboriginal American Indian (Mesoamerican Indian) cultures that evolved in Meso America (part of Mexico and Central America) and the Andean region (western South America) prior to Spanish exploration and conquest in the 16th …   Universalium

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”