Metadata discovery

Metadata discovery is the process of using automated tools to discover the semantics of the data elements in data sets. The process usually ends with a set of mappings between the data source elements and a centralized metadata registry.

Metadata discovery is also known as metadata scanning.

Data source formats for metadata discovery

Data sets may come in a variety of forms, including:

  1. Relational databases
  2. Spreadsheets
  3. XML files
  4. Web services
  5. Software source code written in Fortran, JOVIAL, COBOL, Assembler, RPG, PL/I, Easytrieve, Java, C#, C++, or any of hundreds of other programming languages
  6. Unstructured text documents such as Microsoft Word or PDF files
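
Before any matching can happen, a discovery tool has to harvest candidate element names from each source. The following is a minimal sketch for two of the formats above, assuming a SQLite database stands in for the relational source; the helper names `column_names` and `xml_element_names` are hypothetical and only the Python standard library is used.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import List


def column_names(conn: sqlite3.Connection) -> List[str]:
    """Return 'table.column' names for every table in a SQLite database."""
    names = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        for row in conn.execute(f'PRAGMA table_info("{table}")'):
            names.append(f"{table}.{row[1]}")
    return names


def xml_element_names(xml_text: str) -> List[str]:
    """Return the distinct element tags of an XML document, in first-seen order."""
    seen = []
    for element in ET.fromstring(xml_text).iter():
        if element.tag not in seen:
            seen.append(element.tag)
    return seen


# Illustrative sources (assumed for this example, not taken from a real system).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (PersonBirthDate TEXT, sex TEXT)")
db_names = column_names(conn)
xml_names = xml_element_names(
    "<Person><PersonBirthDate>1970-01-01</PersonBirthDate></Person>"
)
```

The harvested names are the input to the matching algorithms described in the next section.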

A taxonomy of metadata matching algorithms

Automated metadata discovery techniques fall into three distinct categories:

Lexical Matching

  1. Exact match - data element linkages are made based on the exact name of a column in a database, the name of an XML element, or a label on a screen. For example, if a database column has the name "PersonBirthDate" and a data element in a metadata registry also has the name "PersonBirthDate", automated tools can infer that the database column has the same semantics (meaning) as the data element in the metadata registry.
  2. Synonym match - the discovery tool is given not just a single name but a set of synonyms for each data element.
  3. Pattern match - the tool is given a set of lexical patterns that it can match. For example, the tool may search for "*gender*" or "*sex*".
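
The three lexical strategies can be sketched as one lookup that tries them in order of strictness. This is a minimal illustration, not any vendor's implementation: the `REGISTRY` contents, the `lexical_match` function, and the choice of case-insensitive wildcard matching are all assumptions made for the example.

```python
from fnmatch import fnmatchcase
from typing import Optional, Tuple

# Hypothetical registry: element name -> known synonyms and wildcard patterns.
REGISTRY = {
    "PersonBirthDate": {
        "synonyms": {"DateOfBirth", "DOB"},
        "patterns": ["*birth*date*"],
    },
    "PersonGenderCode": {
        "synonyms": {"Gender"},
        "patterns": ["*gender*", "*sex*"],
    },
}


def lexical_match(source_name: str) -> Optional[Tuple[str, str]]:
    """Return (registry element, match kind), trying exact, synonym, then pattern."""
    # 1. Exact match on the registered element name.
    if source_name in REGISTRY:
        return source_name, "exact"
    # 2. Synonym match against each element's synonym set.
    for element, entry in REGISTRY.items():
        if source_name in entry["synonyms"]:
            return element, "synonym"
    # 3. Pattern match, done on the lower-cased name so patterns ignore case.
    lowered = source_name.lower()
    for element, entry in REGISTRY.items():
        if any(fnmatchcase(lowered, pattern) for pattern in entry["patterns"]):
            return element, "pattern"
    return None
```

With this registry, `lexical_match("DOB")` resolves through the synonym set and `lexical_match("cust_sex_cd")` through the "*sex*" pattern. Pattern matches are the weakest evidence of the three: a column named "MiddlesexOffice" would also match "*sex*", so a tool would typically flag such matches for human review.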

Semantic Matching

Semantic matching uses the meaning of names, rather than their spelling, to associate target data with registered data elements.

  1. Semantic similarity - this algorithm relies on a database of conceptual nearness between words. For example, the WordNet system can rank how close words are conceptually to each other: the terms "Person", "Individual" and "Human" may be highly similar concepts.
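
The idea of conceptual nearness can be illustrated with a path-based score over an "is-a" hierarchy. The sketch below does not call WordNet; the `HYPERNYM` links are a small hand-made stand-in assumed for the example, and the score (one divided by one plus the number of hops between two terms) is one simple choice among many.

```python
# Toy "is-a" links standing in for a lexical database such as WordNet.
# These links are illustrative assumptions, not WordNet's actual hierarchy.
HYPERNYM = {
    "individual": "person",
    "human": "person",
    "person": "organism",
    "organism": "entity",
    "date": "abstraction",
    "abstraction": "entity",
}


def ancestors(term):
    """Return {ancestor: hops} for a term, including the term itself at 0 hops."""
    chain = {}
    hops = 0
    while term is not None:
        chain[term] = hops
        term = HYPERNYM.get(term)
        hops += 1
    return chain


def path_similarity(a, b):
    """Score in (0, 1]: 1 / (1 + hops between a and b); 0.0 if unrelated."""
    up_a, up_b = ancestors(a.lower()), ancestors(b.lower())
    common = set(up_a) & set(up_b)
    if not common:
        return 0.0
    distance = min(up_a[c] + up_b[c] for c in common)
    return 1.0 / (1.0 + distance)


def closest_element(source_term, registry_terms):
    """Pick the registry term that is conceptually nearest to the source term."""
    return max(registry_terms, key=lambda t: path_similarity(source_term, t))
```

In this toy hierarchy "Human" is nearer to "Person" than to "Date", so `closest_element("Human", ["Date", "Person"])` selects "Person". A real tool would replace the hand-made links with a full lexical database.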

Statistical Matching

Statistical matching uses statistics about the data in the data source itself to derive similarities with registered data elements.

  1. Distinct value analysis - by analyzing all the distinct values in a column, a similarity to a registered data element may be established. For example, if a column has only the two distinct values 'male' and 'female', it could be mapped to 'PersonGenderCode'.
  2. Data distribution analysis - by analyzing the distribution of values within a single column and comparing it with the distributions of known data elements, a semantic linkage could be inferred.
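
Both statistical techniques can be sketched with standard set and frequency measures. The example below assumes Jaccard overlap for distinct value analysis and total variation distance for distribution comparison; neither measure is prescribed by the text above, and `REGISTERED_VALUES`, the 0.5 threshold, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical value sets for registered data elements.
REGISTERED_VALUES = {
    "PersonGenderCode": {"male", "female"},
    "YesNoIndicator": {"yes", "no"},
}


def normalize(value):
    """Compare values case-insensitively and without surrounding whitespace."""
    return str(value).strip().lower()


def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_by_distinct_values(column, threshold=0.5):
    """Return (element, score) for the best-overlapping element, or None."""
    distinct = {normalize(v) for v in column}
    best = max(
        REGISTERED_VALUES,
        key=lambda name: jaccard(distinct, REGISTERED_VALUES[name]),
    )
    score = jaccard(distinct, REGISTERED_VALUES[best])
    return (best, score) if score >= threshold else None


def distribution(column):
    """Relative frequency of each normalized value in a column."""
    counts = Counter(normalize(v) for v in column)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


def total_variation(p, q):
    """Distance between two distributions: 0.0 if identical, 1.0 if disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

Here `match_by_distinct_values(["Male", "female", "male "])` maps the column to 'PersonGenderCode'. For distribution analysis, a tool would compute `total_variation` between a column's `distribution` and a stored reference profile for each registered element, and treat a small distance as evidence of a semantic linkage.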

Vendors

The following vendors (listed in alphabetical order) provide metadata discovery and metadata mapping software and solutions:

  • Esquire Innovations (see [7])
  • IBM
  • InfoLibrarian Corporation (see [1])
  • Masai Technologies (see [2])
  • Revelytix (see [3])
  • Silver Creek Systems (see [4])
  • Sypherlink: Harvester (see [5])
  • Unicorn Systems (see [6])

References

  1. Devarakonda, R., Palanisamy, G., Wilson, B., and Green, J., "Mercury: reusable metadata management, data discovery and access system", Earth Science Informatics (Springer Berlin / Heidelberg) 3 (1): 87–94, doi:10.1007/s12145-010-0050-7

Wikimedia Foundation. 2010.
