Unstructured data

Unstructured data (or unstructured information) refers to (usually) computerized information that either does not have a data structure or has one that is not easily usable by a computer program. The term distinguishes such information from data stored in fielded form in databases or annotated (semantically tagged) in documents.

The term is imprecise: software that creates machine-processable structure exploits the linguistic, auditory, and visual structure that is inherent in all forms of human communication.[1] This inherent structure can be inferred from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Examples of "unstructured data" may include audio, video, and unstructured text such as the body of an e-mail message, Web page, or word processor document.

In 1998, Merrill Lynch cited estimates that as much as 80% of all potentially usable business information originates in unstructured form.[2] Such estimates may not be based on primary research, but they are nonetheless widely accepted.[3]

Data with some form of structure may be characterized as unstructured if its structure is not helpful for the desired processing task. For example, an HTML Web page is tagged, but HTML mark-up is typically designed solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page. XHTML tagging does allow machine processing of elements although it typically does not capture or convey the semantic meaning of tagged terms.
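The point about HTML can be illustrated with a minimal sketch using Python's standard `html.parser` module. The sample page, the class name `TagAndTextCollector`, and the product fields are hypothetical and chosen only for illustration. The parser recovers which tag encloses each piece of text, but nothing in the mark-up identifies a value as a price or a shipping time.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Record each text fragment together with its innermost enclosing tag."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open tags, innermost last
        self.pairs = []    # (enclosing tag, text) tuples in document order

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop open tags up to and including the one being closed.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            enclosing = self._stack[-1] if self._stack else None
            self.pairs.append((enclosing, text))


# Hypothetical page: the tags describe layout (heading, bold, paragraph) only.
page = (
    "<html><body><h1>Acme Widget</h1>"
    "<p><b>Price:</b> 19.99 USD</p>"
    "<p><b>Ships:</b> 3 days</p></body></html>"
)

collector = TagAndTextCollector()
collector.feed(page)
collector.close()
pairs = collector.pairs

for tag, text in pairs:
    print(tag, "|", text)
```

The output pairs tags such as `b` and `p` with text fragments. A program that wants the price still has to interpret the human-readable label "Price:", which is exactly the kind of task for which the page counts as unstructured.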

Much unstructured data is noisy text. Spontaneous communication (such as e-mail, SMS, blogs, and web pages) contains noisy text, and processing steps such as automatic speech recognition introduce further noise. Noise in text is defined as any kind of difference between the surface form of a coded representation of the text and the intended, correct, or original text.
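One way to make this definition of noise concrete is to count the character edits that separate the observed text from the intended text. The sketch below uses Levenshtein edit distance for this purpose. That choice of measure is an assumption made for illustration and is not taken from the article, and the function names `edit_distance` and `noise_rate` are hypothetical.

```python
def edit_distance(observed: str, intended: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions that turn `observed` into `intended`."""
    previous = list(range(len(intended) + 1))
    for i, obs_char in enumerate(observed, start=1):
        current = [i]
        for j, int_char in enumerate(intended, start=1):
            substitution = previous[j - 1] + (obs_char != int_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def noise_rate(observed: str, intended: str) -> float:
    """Edits per character of the intended text (0.0 means no noise)."""
    if not intended:
        return float(len(observed) > 0)
    return edit_distance(observed, intended) / len(intended)


# Classic textbook pair: three edits separate the two strings.
print(edit_distance("kitten", "sitting"))  # 3
print(noise_rate("recieve", "receive"))
```

A measure like this only applies when the intended text is known, for example in an evaluation set with reference transcripts. For spontaneous text, the intended form usually has to be estimated.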

Dealing with unstructured data

Data mining, text analytics, and noisy text analytics are different methods used to find patterns in, or otherwise interpret, this information. Common techniques for structuring text usually involve manual tagging with metadata or part-of-speech tagging for further text-mining-based structuring. UIMA provides a common framework for processing this information to extract meaning and create structured data about the information.
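As a rough illustration of tagging text with metadata, the sketch below pulls a few fielded values out of a free-text message using regular expressions. It is a simplified stand-in for the techniques named above, not an example of UIMA or of part-of-speech tagging. The patterns, the function name `tag_with_metadata`, and the sample message are all hypothetical.

```python
import re

# Deliberately simple patterns; real-world text needs far more robust rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")    # ISO dates only
AMOUNT_RE = re.compile(r"\$\d+(?:\.\d{2})?")      # dollar amounts only


def tag_with_metadata(text: str) -> dict:
    """Return a structured record describing selected contents of `text`."""
    return {
        "emails": EMAIL_RE.findall(text),
        "dates": DATE_RE.findall(text),
        "amounts": AMOUNT_RE.findall(text),
    }


message = (
    "Hi Dana, please send the $1200.50 invoice to "
    "billing@example.com by 2010-03-15. Thanks, Lee"
)
record = tag_with_metadata(message)
print(record)
```

The resulting dictionary can be stored in fielded form alongside the original message, which then becomes searchable by date, amount, or address even though its body remains free text.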

Notes

# [http://www.intelligententerprise.com/showArticle.jhtml?articleID=59301538 Structure, Models and Meaning: Is "unstructured" data merely unmodeled?], Intelligent Enterprise, 1 March 2005.
# [http://emarkets.grm.hia.no/gem/Topic7/eip_ind.pdf Christopher C. Shilakes and Julie Tylman, "Enterprise Information Portals"], Merrill Lynch, 16 November 1998.
# [http://clarabridge.com/default.aspx?tabid=137&ModuleID=635&ArticleID=551 Unstructured Data and the 80 Percent Rule], Clarabridge Bridgepoints, 2008 Q3.

See also

*UIMA
*Data mining
*Metadata
*Noisy text

External links

* [http://blogs.ittoolbox.com/database/soup/archives/005588.asp Database soup: Unstructured data as an oxymoron]
* [http://www.dmreview.com/article_sub.cfm?articleId=1009161 Two Worlds of Data – Unstructured and Structured]


Wikimedia Foundation. 2010.
