Apriori algorithm

Apriori algorithm

In computer science and data mining, Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation). Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no timestamps (DNA sequencing).

As is common in association rule mining, given a set of "itemsets" (for instance, sets of retail transactions, each listing individual items purchased), the algorithm attempts to find subsets which are common to at least a minimum number C (the cutoff, or confidence threshold) of the itemsets. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as "candidate generation"), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

Apriori uses breadth-first search and a tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k-1. Then it prunes the candidates which have an infrequent sub pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.

Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have spawned other algorithms. Candidate generation generates large numbers of subsets (the algorithm attempts to load up the candidate set with as many as possible before each scan). Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal subset S only after all 2^-1 of its proper subsets.

Example

This example suggests the process of selecting or generating a list of likely ordered serial candidate item sets.The technique's goal is to construct a set of k node ordered serial item sets from k-1 length item sets.For example, with k = 4, suppose there are two such sets of length k-1...:A ightarrow B ightarrow C, and :A ightarrow B ightarrow D, two candidate item sets are generated, namely :A ightarrow B ightarrow C ightarrow Dand :A ightarrow B ightarrow D ightarrow C.

Algorithm

Association rule mining is to find out association rules that satisfy the predefinedminimum support and confidence from a given database. The problem is usuallydecomposed into two subproblems. One is to find those itemsets whose occurrencesexceed a predefined threshold in the database; those itemsets are called frequent orlarge itemsets. The second problem is to generate association rules from those largeitemsets with the constraints of minimal confidence. Suppose one of the large itemsetsis Lk, Lk = {I1, I2, … , Ik}, association rules with this itemsets are generated in thefollowing way: the first rule is {I1, I2, … , Ik-1}⇒ {Ik}, by checking the confidencethis rule can be determined as interesting or not. Then other rule are generated bydeleting the last items in the antecedent and inserting it to the consequent, further theconfidences of the new rules are checked to determine the interestingness of them.Those processes iterated until the antecedent becomes empty. Since the second subproblemis quite straight forward, most of the researches focus on the first subproblem.The Apriori algorithm finds the frequent sets L In Database D.

* Find frequent set L_{k-1}.
* Join Step.
** C_k is generated by joining L_{k-1}with itself
* Prune Step.
** Any (k-1) -itemset that is not frequent cannot be a subset of a frequent k -itemset, hence should be removed.

where
* (C_k: Candidate itemset of size k)
* (L_k: frequent itemset of size k)

Apriori Pseudocode

"Apriori" (T,varepsilon): L_1 gets { "large 1-itemsets that appear in more than varepsilon transactions" }: k gets 2:: "while" L_{k-1} eq varnothing ::: C_k gets "Generate"(L_{k-1})::: "for transactions" t in T:::: C_t gets Subset(C_k,t):::: "for candidates" c in C_t::::: mathrm{count} [c] gets mathrm{count} [c] +1::: L_k gets { c in C_k | ~ mathrm{count} [c] geq varepsilon }::: k gets k+1 : "return" igcup_k L_k

References

* Agrawal R, Imielinski T, Swami AN. "Mining Association Rules between Sets of Items in Large Databases." "SIGMOD". June 1993, 22(2):207-16, [http://portal.acm.org/ft_gateway.cfm?id=170072&type=pdf&coll=GUIDE&dl=portal,ACM&CFID=11111111&CFTOKEN=2222222 pdf] .
* Agrawal R, Srikant R. "Fast Algorithms for Mining Association Rules", "VLDB". Sep 12-15 1994, Chile, 487-99, [http://www.acm.org/sigmod/vldb/conf/1994/P487.PDF pdf] , ISBN 1-55860-153-8.
* Mannila H, Toivonen H, Verkamo AI. "Efficient algorithms for discovering association rules." "AAAI Workshop on Knowledge Discovery in Databases (SIGKDD)". July 1994, Seattle, 181-92, [http://www.softlab.ntua.gr/facilities/public/AD/DM/Efficient_Algorithms_for_Discovering_Association_Rules.ps ps] .
* S. Kotsiantis, D. Kanellopoulos, Association Rules Mining: A Recent Overview, GESTS International Transactions on Computer Science and Engineering, Vol.32 (1), 2006, pp. 71-82 (http://www.math.upatras.gr/~esdlab/en/members/kotsiantis/association%20rules%20kotsiantis.pdf)


Wikimedia Foundation. 2010.

Игры ⚽ Поможем сделать НИР

Look at other dictionaries:

  • GSP Algorithm — ( Generalized Sequential Pattern algorithm) is an algorithm used for sequence mining. The algorithms for solving sequence mining problems are mostly based on the a priori (level wise) algorithm. One way to use the level wise paradigm is to first… …   Wikipedia

  • Association rule learning — In data mining, association rule learning is a popular and well researched method for discovering interesting relations between variables in large databases. Piatetsky Shapiro[1] describes analyzing and presenting strong rules discovered in… …   Wikipedia

  • A priori — may refer to: * A priori (languages), a type of constructed language * A priori (statistics), a knowledge of the actual population * A priori and a posteriori (philosophy), used to distinguish two types of propositional knowledge *Apriori… …   Wikipedia

  • Sequence mining — is concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. It is usually presumed that the values are discrete, and thus Time series mining is closely related, but usually… …   Wikipedia

  • Entity-attribute-value model — (EAV), also known as object attribute value model and open schema is a data model that is used in circumstances where the number of attributes (properties, parameters) that can be used to describe a thing (an entity or object ) is potentially… …   Wikipedia

  • Entity–attribute–value model — (EAV) is a data model to describe entities where the number of attributes (properties, parameters) that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest. In… …   Wikipedia

  • DIC — may refer to: Contents 1 Science 2 Medicine 3 Entertainment 4 Other uses Science Differential interference cont …   Wikipedia

  • Dynamic itemset counting — An extension to Apriori algorithm used to reduce number of scans on the dataset. S. Brin R. Motwani, J.Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. Sigmod 1997 …   Wikipedia

  • Règle d'association — Dans le domaine du data mining la recherche des Règles d Association est une méthode populaire étudiée d une manière approfondie dont le but est de découvrir des relations ayant un intérêt pour le statisticien entre deux ou plusieurs variables… …   Wikipédia en Français

  • K-medoids — The K medoids algorithm is a clustering algorithm related to the K means algorithm and the medoidshift algorithm. Both the K means and K medoids algorithms are partitional (breaking the dataset up into groups) and both attempt to minimize squared …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”