Co-occurrence networks

Co-occurrence networks are generally used to provide a graphic visualization of potential relationships between people, organizations, concepts or other entities represented within written material. The generation and visualization of co-occurrence networks have become practical with the advent of electronically stored text amenable to text mining.

By way of definition, a co-occurrence network is the collective interconnection of terms based on their paired presence within a specified unit of text. Networks are generated by connecting pairs of terms using a set of criteria defining co-occurrence. For example, terms A and B may be said to “co-occur” if they both appear in a particular article. Another article may contain terms B and C. Linking A to B and B to C creates a co-occurrence network of these three terms. Rules to define co-occurrence within a text corpus can be set according to desired criteria. For example, a more stringent criterion for co-occurrence may require a pair of terms to appear in the same sentence.
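
The A, B, C example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name build_cooccurrence_network, the sample texts and the simple word-level matching are all hypothetical choices made for the example.

```python
import re
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(units, dictionary):
    """Count how often each pair of dictionary terms shares a unit of text.

    `units` is a list of strings (articles, sentences, ...) and
    `dictionary` is the list of terms to look for. Matching is a naive,
    case-sensitive whole-word comparison (an assumption of this sketch).
    """
    edges = Counter()
    for unit in units:
        words = set(re.findall(r"\w+", unit))
        present = sorted(term for term in dictionary if term in words)
        # Every pair of terms found in the same unit becomes (or reinforces) an edge.
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


# Hypothetical texts standing in for two articles.
articles = ["A was mentioned alongside B.", "B was later linked to C."]
network = build_cooccurrence_network(articles, ["A", "B", "C"])
print(dict(network))
```

Passing individual sentences instead of whole articles as the units applies the more stringent same-sentence criterion without changing the function.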

Co-occurrence networks can be created for any given list of terms (any dictionary) in relation to any collection of texts (any text corpus). Co-occurring pairs of terms can be called “neighbors” and these often group into “neighborhoods” based on their interconnections. Individual terms may have several neighbors. Neighborhoods may connect to one another through at least one individual term or may remain unconnected.

Individual terms are, within the context of text mining, symbolically represented as text strings. In the real world, the entity identified by a term normally has several symbolic representations. It is therefore useful to consider terms as being represented by one primary symbol and up to several synonymous alternative symbols. Occurrence of an individual term is established by searching for each known symbolic representation of the term. The process can be augmented through natural language processing (NLP) algorithms that interrogate segments of text for possible alternatives such as variations in word order, spacing and hyphenation. NLP can also be used to identify sentence structure and categorize text strings according to grammar (for example, categorizing a string of text as a noun based on a preceding string of text known to be an article).
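
The search for a primary symbol and its alternatives might look like the following sketch. It only tolerates differences in spacing and hyphenation through a regular expression; it does not handle word-order variation or any real NLP. The function names and the small symbol table are illustrative assumptions.

```python
import re


def variant_pattern(symbol):
    """Build a regex that matches a symbol whether its parts are
    separated by spaces, hyphens, or nothing at all."""
    parts = [re.escape(p) for p in re.split(r"[\s-]+", symbol.strip()) if p]
    return r"\b" + r"[\s-]*".join(parts) + r"\b"


def find_terms(text, symbols_by_term):
    """Return the primary symbols of all terms that occur in `text`.

    `symbols_by_term` maps a primary symbol to a list of alternative
    symbols; a hit on any of them counts as an occurrence of the term.
    """
    found = set()
    for primary, alternatives in symbols_by_term.items():
        for symbol in [primary, *alternatives]:
            if re.search(variant_pattern(symbol), text, flags=re.IGNORECASE):
                found.add(primary)
                break
    return found


# Illustrative symbol table (an assumption of this example).
symbols = {
    "tumor necrosis factor": ["TNF", "TNF alpha"],
    "interleukin 6": ["IL-6"],
}
found = find_terms("Levels of TNF-alpha and IL6 rose.", symbols)
print(found)
```

Reporting the primary symbol rather than the matched string means that every synonym contributes to the same node of the network.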

Graphic representation of co-occurrence networks allows them to be visualized and inferences to be drawn regarding relationships between entities in the domain represented by the dictionary of terms applied to the text corpus. Meaningful visualization normally requires simplification of the network. For example, networks may be drawn such that the number of neighbors connecting to each term is limited. The criteria for limiting neighbors might be based on the absolute number of co-occurrences or on more subtle criteria such as the “probability” of co-occurrence or the presence of an intervening descriptive term.
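
One simple way to limit neighbors by the absolute number of co-occurrences is sketched below. The rule used here, keeping an edge only when each endpoint ranks the other among its strongest links, is one possible choice assumed for the example; it guarantees that no term keeps more than max_neighbors neighbors. The function name and the counts are hypothetical.

```python
from collections import defaultdict


def limit_neighbors(edge_counts, max_neighbors):
    """Keep an edge only if each endpoint ranks the other among its
    `max_neighbors` most frequent co-occurrence partners."""
    by_term = defaultdict(list)
    for (a, b), count in edge_counts.items():
        by_term[a].append((count, b))
        by_term[b].append((count, a))

    top = {}
    for term, links in by_term.items():
        # Highest counts first; ties broken by name for reproducibility.
        links.sort(key=lambda item: (-item[0], item[1]))
        top[term] = {other for _, other in links[:max_neighbors]}

    return {
        (a, b): count
        for (a, b), count in edge_counts.items()
        if b in top[a] and a in top[b]
    }


# Hypothetical co-occurrence counts.
counts = {("A", "B"): 5, ("A", "C"): 1, ("A", "D"): 3, ("C", "D"): 2}
pruned = limit_neighbors(counts, 2)
print(pruned)
```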

Quantitative aspects of the underlying structure of a co-occurrence network may also be informative, such as the overall number of connections between entities, the clustering of entities into sub-domains, and structural cues that help detect synonyms[1].
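
The simplest of these measures, the overall number of connections and the number of connections per entity, can be computed as in the following sketch. The function name and the edge list are assumptions made for illustration.

```python
from collections import Counter


def connection_summary(edges):
    """Return the total number of edges and the number of neighbors per term.

    `edges` is a set of (term, term) pairs, each pair listed once.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return len(edges), degree


# Hypothetical network with two unconnected groups.
edges = {("A", "B"), ("B", "C"), ("D", "E")}
total, degree = connection_summary(edges)
print(total, dict(degree))
```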

Some working applications of the co-occurrence approach are available to the public through the internet. PubGene is an example of an application that addresses the interests of the biomedical community by presenting networks based on the co-occurrence of genetics-related terms as these appear in MEDLINE records[2][3]. The web site Name Base is an example of how human relationships can be inferred by examining networks constructed from the co-occurrence of personal names in newspapers and other texts (as in Ozgur et al.[4]).

Networks of information are also used to help organize and focus publicly available information for law enforcement and intelligence purposes (so-called "open source intelligence" or OSINT). Related techniques include co-citation networks as well as the analysis of hyperlink and content structure on the internet (such as in the analysis of web sites connected to terrorism[5]).

See also Takada et al.[6] and Liu and Chua[7].


  1. ^ Cohen AM, Hersh WR, Dubay C, Spackman K: "Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts." BMC Bioinformatics 2005, 6:103
  2. ^ Jenssen TK, Laegreid A, Komorowski J, Hovig E: "A literature network of human genes for high-throughput analysis of gene expression." Nature Genetics 2001 May; 28(1):21-8. PMID 11326270
  3. ^ Grivell L: "Mining the bibliome: searching for a needle in a haystack? New computing tools are needed to effectively scan the growing amount of scientific literature for useful information." EMBO reports 2002 Mar; 3(3):200-3. doi:10.1093/embo-reports/kvf059 PMID 11882534
  4. ^ Ozgur A, Cetin B, Bingol H: "Co-occurrence Network of Reuters News" (15 Dec 2007)
  5. ^ Zhou Y, Reid E, Qin J, Chen H, Lai G: "US Domestic Extremist Groups on the Web: Link and Content Analysis"
  6. ^ Takada H, Saito K, Yamada T, Kimura M: "Analysis of Growing Co-occurrence Networks." SIG-KBS (Journal Code: X0831A) 2006, Vol. 73rd, pages 117-122 (in Japanese)
  7. ^ Liu, Chua T-S: "Building semantic perceptron net for topic spotting." Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, 2001; 378-385

Wikimedia Foundation. 2010.
