Correlation database

A correlation database is a database management system (DBMS) that is data-model-independent and designed to efficiently handle unplanned, ad hoc queries in an analytical system environment. It was developed in 2005 by database architect Joseph Foley, whose background includes more than 30 years of research and development work in data warehousing and business intelligence across a variety of industries.

Unlike relational database management systems, which use a record-based storage approach, or column-oriented databases, which use a column-based storage method, a correlation database uses a value-based storage (VBS) architecture in which each unique data value is stored only once and an auto-generated indexing system maintains the context for all values.^[1]

Contents

1 Structure of the Correlation DBMS

2 Comparison of DBMS Storage Structures

2.1 Storage in RDBMS

2.2 Storage in column-oriented databases

2.3 Storage in CDBMS

3 Advantages and Disadvantages of the CDBMS

4 References

Structure of the Correlation DBMS

Because a correlation DBMS (CDBMS) stores each unique data value only once, the physical database is significantly smaller than an equivalent relational or column-oriented database, even without data compression techniques. Above approximately 30GB, a correlation database may become smaller than the raw data set.^{[citation needed]}

The VBS model used by a CDBMS consists of three primary physical sets of objects that are stored and managed:

a data dictionary (metadata);

an indexing and linking data set (additional metadata); and

the actual data values that comprise the stored information.

In the VBS model, each unique value in the raw data is stored only once; therefore, the data is always normalized at the level of unique values.^[2] This eliminates the need to normalize data sets in the logical schema.

Data values are stored together in ordered sets based on data types: all integers in one set, characters in another, etc. This optimizes the data handling processes that access the values.

In addition to typical data values, the data value store contains a special type of data for storing relationships between tables. This functions similarly to foreign keys in RDBMS structures, but with a CDBMS, the relationship is known by the dictionary and stored as a data value, making navigation between tables completely automatic.

The data dictionary contains typical metadata plus additional statistical data about the tables, columns and occurrences of values in the logical schema. It also maintains information about the relationships between the logical tables. The index and linking storage includes all of the data used to locate the contents of a record from the ordered values in the data store.
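The split between dictionary statistics and typed, ordered value sets can be illustrated with a minimal sketch. It is not the storage engine of any actual CDBMS: the function name `load_table`, the in-memory dictionaries, and the three sample customer rows are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def load_table(table_name, columns, rows):
    """Toy loader: returns (dictionary, value_sets).

    dictionary -- metadata plus per-column occurrence statistics
    value_sets -- unique values grouped by data type, each set kept sorted
    """
    dictionary = {
        "table": table_name,
        "columns": list(columns),
        "occurrences": Counter(),  # (column, value) -> number of occurrences
    }
    unique_by_type = defaultdict(set)
    for row in rows:
        for column, value in zip(columns, row):
            dictionary["occurrences"][(column, value)] += 1
            # Each unique value is kept once, in the set for its data type.
            unique_by_type[type(value).__name__].add(value)
    value_sets = {t: sorted(vs) for t, vs in unique_by_type.items()}
    return dictionary, value_sets


# Hypothetical sample data (three customer rows).
columns = ["Cust ID", "Name", "City", "State"]
rows = [
    (12222, "ABC Corp", "Minneapolis", "MN"),
    (19434, "A1 Mfg", "Duluth", "MN"),
    (20523, "J&J Inc", "St. Paul", "MN"),
]
dictionary, value_sets = load_table("customers", columns, rows)
```

In this sketch the integers and the strings end up in separate sorted lists, and the dictionary records that "MN" occurs three times in the State column even though the value itself is held only once.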

While not a RAM-based storage system, a CDBMS is designed to use as much RAM as the operating system can provide. For large databases, additional RAM improves performance. Generally, 4GB of RAM will provide optimized access times for up to about 100 million records, and 8GB of RAM is adequate for databases up to 10 times that size.^[3] Because the incremental RAM consumed decreases as the database grows, 16GB of RAM will generally support databases containing up to approximately 20 billion records.
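The sizing figures above imply a falling amount of RAM per record as the database grows. The short calculation below works that out; it assumes decimal gigabytes (10^9 bytes) and treats the quoted figures as exact, which they are not.

```python
GB = 10**9  # assumption: decimal gigabytes

# (RAM in bytes, approximate record count) taken from the figures quoted above;
# "10 times that size" is read as 1 billion records.
sizing = [
    (4 * GB, 100_000_000),
    (8 * GB, 1_000_000_000),
    (16 * GB, 20_000_000_000),
]

# RAM available per record at each tier.
bytes_per_record = [ram / records for ram, records in sizing]

for (ram, records), ratio in zip(sizing, bytes_per_record):
    print(f"{ram // GB:>2} GB for {records:>14,} records: {ratio:g} bytes of RAM per record")
```

Under these assumptions the ratio drops from 40 bytes per record to 8 and then to 0.8, which is consistent with the claim that incremental RAM consumption decreases with database size.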

Comparison of DBMS Storage Structures

The sample records shown below illustrate the physical differences in the storage structures used in relational, column-oriented and correlation databases.

Cust ID   Name       City          State
12222     ABC Corp   Minneapolis   MN
19434     A1 Mfg     Duluth        MN
20523     J&J Inc    St. Paul      MN

Storage in RDBMS

The record-based structure used in an RDBMS stores the elements of each row adjacent to one another. Variations such as clustered indexing may change the sequence of the rows, but all rows, columns and values are stored as they appear in the table. The above table might be stored as:

12222,ABC Corp,Minneapolis,MN;19434,A1 Mfg,Duluth,MN;20523,J&J Inc,St. Paul,MN
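A record-based layout of this kind can be sketched in a few lines. The function name `serialize_rows` and the comma and semicolon delimiters are illustrative assumptions that mirror the example string, not the on-disk format of any particular RDBMS.

```python
def serialize_rows(rows):
    """Row-oriented layout: all fields of one record are written together."""
    return ";".join(",".join(str(value) for value in row) for row in rows)


rows = [
    (12222, "ABC Corp", "Minneapolis", "MN"),
    (19434, "A1 Mfg", "Duluth", "MN"),
    (20523, "J&J Inc", "St. Paul", "MN"),
]
row_layout = serialize_rows(rows)
print(row_layout)
```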

Storage in column-oriented databases

In the column-based structure, elements of the same column are stored adjacent to each other. Consecutive duplicates within a single column may be automatically removed or compressed efficiently.

12222,19434,20523;ABC Corp,A1 Mfg,J&J Inc;Minneapolis,Duluth,St. Paul;MN,MN,MN
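The column-based layout, and the compression of consecutive duplicates it makes possible, can be sketched as follows. Run-length encoding is used here as one common way to collapse consecutive duplicates; it is an assumption for illustration, and the function names are hypothetical.

```python
from itertools import groupby


def serialize_columns(rows):
    """Column-oriented layout: all values of one column are written together."""
    columns = zip(*rows)  # transpose rows into columns
    return ";".join(",".join(str(value) for value in column) for column in columns)


def run_length_encode(column):
    """Collapse runs of consecutive equal values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


rows = [
    (12222, "ABC Corp", "Minneapolis", "MN"),
    (19434, "A1 Mfg", "Duluth", "MN"),
    (20523, "J&J Inc", "St. Paul", "MN"),
]
column_layout = serialize_columns(rows)
state_runs = run_length_encode([row[3] for row in rows])
```

Here the State column, which holds three consecutive "MN" values, encodes to a single pair, while a column with no repeats gains nothing from the encoding.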

Storage in CDBMS

In the VBS structure used in a CDBMS, each unique value is stored once and given an abstract (numeric) identifier, regardless of the number of occurrences or locations in the original data set. The original data set is then reconstructed by referencing those identifiers. The correlation index may resemble the storage below. Note that the value "MN", which occurs multiple times in the data above, is included only once. As the amount of repeated data grows, this benefit multiplies.

1:12222,2:19434,3:20523,4:ABC Corp,5:A1 Mfg,6:J&J Inc,7:Minneapolis,8:Duluth,9:St. Paul,10:MN

The records in our example table above can then be expressed as:

11:[1,4,7,10],12:[2,5,8,10],13:[3,6,9,10]
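The numbering used in this example can be reproduced with a short sketch. It assumes that identifiers are handed out column by column in first-seen order and that record identifiers continue after the last value identifier; both are inferred from the example rather than from a documented CDBMS algorithm, and the function names are hypothetical.

```python
def build_value_store(rows):
    """Assign one identifier per unique value, then express rows as identifier lists."""
    value_ids = {}
    for column in zip(*rows):  # walk the data column by column
        for value in column:
            if value not in value_ids:  # each unique value is stored only once
                value_ids[value] = len(value_ids) + 1
    first_record_id = len(value_ids) + 1
    records = {
        first_record_id + offset: [value_ids[value] for value in row]
        for offset, row in enumerate(rows)
    }
    return value_ids, records


def reconstruct(value_ids, records):
    """Rebuild the original rows by following identifiers back to values."""
    values_by_id = {identifier: value for value, identifier in value_ids.items()}
    return [tuple(values_by_id[i] for i in ids) for ids in records.values()]


rows = [
    ("12222", "ABC Corp", "Minneapolis", "MN"),
    ("19434", "A1 Mfg", "Duluth", "MN"),
    ("20523", "J&J Inc", "St. Paul", "MN"),
]
value_ids, records = build_value_store(rows)
```

With the three sample rows, twelve field values reduce to ten stored values, and `reconstruct(value_ids, records)` returns the original rows.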

This correlation process is a form of database normalization. Just as some benefits of column-oriented storage can be achieved within an RDBMS, some benefits of a correlation database can be achieved through database normalization. In a traditional RDBMS, however, normalization requires work in the form of table configuration, stored procedures, and SQL statements. A database is considered a correlation database when it naturally expresses a fully normalized schema without this extra configuration. As a result, a correlation database may have more focused optimizations for this fully normalized structure.

The correlation process is also similar to what occurs in a text-search-oriented inverted index.
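The resemblance to an inverted index can be made concrete with a small sketch. It takes the record-to-identifier mapping from the example above as given and inverts it into a mapping from each value identifier to the records that contain it; the function name `invert` is hypothetical.

```python
from collections import defaultdict


def invert(records):
    """Map each value identifier to the list of record identifiers that use it."""
    postings = defaultdict(list)
    for record_id, value_ids in records.items():
        for value_id in value_ids:
            postings[value_id].append(record_id)
    return dict(postings)


# Record identifiers and value identifiers as in the example above.
records = {11: [1, 4, 7, 10], 12: [2, 5, 8, 10], 13: [3, 6, 9, 10]}
postings = invert(records)
```

In this sketch the identifier for "MN" (10) points to all three records, much as a frequent term in a text index points to every document that contains it.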

Advantages and Disadvantages of the CDBMS

For analytical data warehouse applications, a CDBMS has several advantages over alternative database structures. First, because the database engine itself indexes all data and auto-generates its own schema on the fly while loading, it can be implemented quickly and is easy to update. There is no need for physical pre-design and no need to ever restructure the database. Second, a CDBMS enables the creation and execution of complex queries, such as associative queries ("show everything that is related to x"), that are difficult if not impossible to model in SQL. Most importantly, a CDBMS is optimized for executing ad hoc queries, that is, queries not anticipated during the data warehouse design phase.^[4]
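An associative query of the "show everything that is related to x" kind can be sketched over a value index. This is an illustration under simplifying assumptions (a single table held in memory, "related" meaning "appears in the same record"), not the query engine of an actual CDBMS; the function names are hypothetical.

```python
from collections import defaultdict


def build_postings(rows):
    """Index every value by the positions of the rows in which it occurs."""
    postings = defaultdict(set)
    for row_number, row in enumerate(rows):
        for value in row:
            postings[value].add(row_number)
    return postings


def related_to(x, rows, postings):
    """Return every value that shares at least one record with x."""
    related = set()
    for row_number in postings.get(x, ()):
        related.update(rows[row_number])
    related.discard(x)  # x itself is not reported as related
    return related


rows = [
    ("12222", "ABC Corp", "Minneapolis", "MN"),
    ("19434", "A1 Mfg", "Duluth", "MN"),
    ("20523", "J&J Inc", "St. Paul", "MN"),
]
postings = build_postings(rows)
duluth_related = related_to("Duluth", rows, postings)
```

Because every value is indexed at load time, the same lookup works for any starting value, whether it is a customer ID, a city, or a state, without the query having been planned for in advance.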

A CDBMS has two drawbacks in comparison to database alternatives. First, unlike relational databases, which can be used in a wide variety of applications, a correlation database is designed specifically for analytical applications and does not provide transaction management features; it cannot be used for transactional processing. Second, because it indexes all data during the load process, the physical load speed of a CDBMS is slower than that of relational or column-oriented structures. However, because it eliminates the need for logical or physical pre-design, the overall "time to use" of a CDBMS is generally similar to or somewhat faster than that of alternative structures.

References

[1] Raab, David M. "Analytical Database Options". Information Management Magazine, 1 July 2008.

[2] Raden, Neil. "Databases ALIVE". Intelligent Enterprise, 18 April 2008.

[3] Powell, James E. "Illuminate's Correlation Database Accelerates, Expands BI Queries". Enterprise Systems Journal, 9 April 2008.

[4] Swoyer, Steven. "In Depth: Closing the Ad Hoc Query Performance Gap for Good". Enterprise Systems Journal, 9 July 2008.

