Data profiling

Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:

  1. Find out whether existing data can easily be used for other purposes
  2. Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category
  3. Give metrics on data quality, including whether the data conforms to particular standards or patterns
  4. Assess the risk involved in integrating data into new applications, including the challenges of joins
  5. Assess whether metadata accurately describes the actual values in the source database
  6. Understand data challenges early in any data-intensive project, so that late surprises are avoided; finding data problems late in the project can lead to delays and cost overruns
  7. Provide an enterprise view of all data, for uses such as master data management, where key data is needed, or data governance, for improving data quality

Data Profiling in Relation to Data Warehouse/Business Intelligence Development

Introduction

Data profiling is an analysis of the candidate data sources for a data warehouse to clarify the structure, content, relationships and derivation rules of the data.[1] Profiling helps to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata.[2] Thus the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not.[3] The result of the analysis is used both strategically, to determine suitability of the candidate source systems and give the basis for an early go/no-go decision, and tactically, to identify problems for later solution design, and to level sponsors’ expectations.[1]

How to do Data Profiling

Data profiling utilizes different kinds of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, and variation, as well as other aggregates such as count and sum. Additional metadata obtained during data profiling may include data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition.[2][4][5] The metadata can then be used to discover problems such as illegal values, misspellings, missing values, varying value representations, and duplicates. Different analyses are performed at different structural levels. For example, single columns can be profiled individually to understand the frequency distribution of values and the type and use of each column. Embedded value dependencies can be exposed in a cross-column analysis. Finally, overlapping value sets, possibly representing foreign key relationships between entities, can be explored in an inter-table analysis.[2] Normally, purpose-built tools are used for data profiling to ease the process.[1][2][4][5][6][7] The computational complexity increases when going from single-column, to single-table, to cross-table structural profiling; therefore, performance is an evaluation criterion for profiling tools.[3]
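The single-column and inter-table analyses described above can be sketched in a few lines of Python. This is a hypothetical illustration, not a real profiling tool; the sample values and the simple numeric-pattern check are invented for the example:

```python
import re
from statistics import mean

def profile_column(values):
    """Collect basic profiling metadata for one column of raw string values."""
    non_null = [v for v in values if v is not None]
    # Values that parse as numbers -- evidence for an inferred numeric type.
    numeric = [float(v) for v in non_null if re.fullmatch(r"-?\d+(\.\d+)?", v)]
    return {
        "count": len(values),
        "null_count": values.count(None),
        "distinct_count": len(set(non_null)),
        "numeric_share": len(numeric) / len(non_null) if non_null else 0.0,
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": mean(numeric) if numeric else None,
    }

def inclusion_ratio(child_values, parent_keys):
    """Share of child values found in the parent key set; a ratio near 1.0
    flags an overlapping value set, i.e. a foreign-key candidate."""
    parent_set = set(parent_keys)
    return sum(v in parent_set for v in child_values) / len(child_values)

# Hypothetical sample column: order quantities with a null and a bad value.
quantities = ["12", "7", None, "30", "7", "abc", "15"]
print(profile_column(quantities))

# Inter-table check: do order customer ids appear among customer keys?
print(inclusion_ratio(["C1", "C2", "C2", "C9"], ["C1", "C2", "C3"]))  # 0.75
```

A real profiling tool would additionally sample large tables, infer richer type and pattern information, and test all column pairs for inclusion dependencies, which is where the cross-table cost mentioned above comes from.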

When to Conduct Data Profiling

According to Kimball,[1] data profiling is performed several times and with varying intensity throughout the data warehouse development process. A light profiling assessment should be undertaken as soon as candidate source systems have been identified, right after the acquisition of the business requirements for the DW/BI. The purpose is to clarify at an early stage whether the right data is available at the right level of detail and whether anomalies can be handled subsequently. If this is not the case, the project might have to be canceled.[1] More detailed profiling is done prior to the dimensional modeling process, in order to see what it will require to convert the data into the dimensional model, and extends into the ETL system design process to establish what data to extract and which filters to apply.[1]

Benefits of Data Profiling

The benefits of data profiling are improved data quality, a shorter implementation cycle for major projects, and an improved understanding of the data for its users.[7] Discovering business knowledge embedded in the data itself is one of the significant benefits derived from data profiling.[3] Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.[7] Although data profiling is effective, remember to find a suitable balance and not slip into "analysis paralysis".[3][7]

References

  1. ^ a b c d e f Ralph Kimball et al. (2008), The Data Warehouse Lifecycle Toolkit, Second Edition, Wiley Publishing, Inc., pp. 297, 376.
  2. ^ a b c d David Loshin (2009), Master Data Management, Morgan Kaufmann Publishers, pp. 94–96.
  3. ^ a b c d David Loshin (2003), Business Intelligence: The Savvy Manager's Guide, Getting Onboard with Emerging IT, Morgan Kaufmann Publishers, pp. 110–111.
  4. ^ a b Erhard Rahm and Hong Hai Do (2000), "Data Cleaning: Problems and Current Approaches", Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society, Vol. 23, No. 4, December 2000.
  5. ^ a b Ranjit Singh, Kawaljeet Singh et al. (2010), "A Descriptive Classification of Causes of Data Quality Problems in Data Warehousing", IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No. 2, May 2010.
  6. ^ Ralph Kimball (2004), "Kimball Design Tip #59: Surprising Value of Data Profiling", Kimball Group, September 14, 2004.
  7. ^ a b c d Jack E. Olson (2003), Data Quality: The Accuracy Dimension, Morgan Kaufmann Publishers, pp. 140–142.

Wikimedia Foundation. 2010.
