Canonicalization

Canonicalization

:"Not to be confused with Canonization."

In computer science, canonicalization (abbreviated c14n, where 14 represents the number of letters between the C and the N) is a process for converting data that has more than one possible representation into a "standard" canonical representation. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

Examples

Links in Wikipedia

As an example, Wikipedia uses canonicalization in its processing of links between articles (see ). The first letter in the article name is capitalized, leading and trailing spaces are removed, and embedded whitespace is replaced by underscores. For example: Egg_salad egg salad egg_salad all refer to the same article.

Web servers

Canonicalization of filenames is important for computer security. For example, a web server may have a security rule stating "only execute files under the cgi directory (C:inetpubwwwrootcgi-bin)". The rule is enforced by checking that the path starts with "C:inetpubwwwrootcgi-bin", and if it does, the file is executed.

Should "C:inetpubwwwrootcgi-bin......WindowsSystem32cmd.exe" be executed? No, because this trick path goes back up the directory hierarchy, not staying within cgi-bin. Accepting it at face value would be an error due to failure to canonicalize the filename to a unique (simplest) representation, namely: C:WindowsSystem32cmd.exe, before doing the path check. This type of fault is called a directory traversal vulnerability.

Unicode

Variable-length encodings in the Unicode standard, in particular UTF-8, have more than one possible encoding for most common characters [http://www.ietf.org/rfc/rfc2279.txt] . This makes string validation more complicated, since every possible encoding of each string character must be considered. A software implementation which does not consider all character encodings runs the risk of accepting strings considered invalid in the application design, which could cause bugs or allow attacks. The solution is to allow a single encoding for each character. Canonicalization is then the process of translating every string character to its single allowed encoding. An alternative is for software to determine whether a string is canonicalized, and then reject it if it is not. In this case, in a client/server context, the canonicalization would be the responsibility of the client.

Canonicalization in mathematics

In mathematics, objects are sometimes converted to canonical forms. One application is in combinatorics, where the number of canonical forms can be counted. The technique of general position in geometry is similar: many proofs begin by showing that an arbitrary object under consideration can be rearranged so that its points are arranged in a convenient manner.

Canonical forms are also used in mathematical logic. A first-order formula can be put into many standards forms, including prenex normal form, conjunctive normal form, disjunctive normal form, and algebraic normal form.

References

See also

* Normal form


Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать курсовую

Look at other dictionaries:

  • canonicalization — noun standardization, normalization …   Wiktionary

  • c14n — canonicalization …   Glossary of chat acronyms & text shorthand

  • XML Signature — (also called XMLDsig , XML DSig , XML Sig ) is a W3C recommendation that defines an XML syntax for digital signatures. Functionally, it has much in common with PKCS#7 but is more extensible and geared towards signing XML documents. It is used by… …   Wikipedia

  • Directory traversal — A directory traversal (or path traversal) is to exploit insufficient security validation / sanitization of user supplied input file names, so that characters representing traverse to parent directory are passed through to the file APIs.The goal… …   Wikipedia

  • Directory traversal attack — A directory traversal (or path traversal) consists in exploiting insufficient security validation / sanitization of user supplied input file names, so that characters representing traverse to parent directory are passed through to the file APIs.… …   Wikipedia

  • Defensive programming — is a form of defensive design intended to ensure the continuing function of a piece of software in spite of unforeseeable usage of said software. The idea can be viewed as reducing or eliminating the prospect of Murphy s Law having effect.… …   Wikipedia

  • DomainKeys Identified Mail — (DKIM) is a method for associating a domain name to an email message, thereby allowing a person, role, or organization to claim some responsibility for the message. The association is set up by means of a digital signature which can be validated… …   Wikipedia

  • Canonical — is an adjective derived from . Canon comes from the Greek word kanon , rule (perhaps originally from kanna reed , cognate to cane ), and is used in various meanings. Basic, canonic, canonical : reduced to the simplest and most significant form… …   Wikipedia

  • International Chemical Identifier — The IUPAC International Chemical Identifier (InChI, pronounced INchee ) is a textual identifier for chemical substances, designed to provide a standard and human readable way to encode molecular information and to facilitate the search for such… …   Wikipedia

  • URL normalization — (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”