Hamshahri Corpus

The Hamshahri Corpus is a sizable Persian (Farsi) corpus based on the Iranian newspaper Hamshahri, one of the first online Persian newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the DBRG Group [http://ece.ut.ac.ir/dbrg/] of the University of Tehran.

The corpus was created by crawling the online news articles on Hamshahri's website and processing the HTML pages into a standard text corpus for modern information retrieval experiments.
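
The HTML-to-text step can be illustrated with a minimal sketch. The source does not describe the actual tools or pipeline used to build the corpus, so everything here is an assumption: the class `TextExtractor`, the function `html_to_text`, and the sample page are hypothetical, and the code relies only on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering a non-visible element: ignore its contents.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by the markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample page, not taken from the corpus.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>همشهری</h1><p>Sample   article body.</p></body></html>"
)
plain = html_to_text(sample)
```

A real crawler would also need to fetch the pages, detect their character encoding, and strip navigation and advertising blocks. None of that is shown here.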

The collection contains more than 160,000 articles covering the following subject categories: politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, sports, etc. Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average of 1.8 KB.

The corpus is available for download in several formats [http://ece.ut.ac.ir/dbrg/Hamshahri/]:
* Tagged Text: 560 MB
* In SQL Server 2000 Tables: 712 MB

See also

* Bijankhan Corpus
* Persian Today Corpus
* Text corpus
* Information Retrieval

External links

* [http://ece.ut.ac.ir/dbrg/ DBRG Group Website]
* [http://ece.ut.ac.ir/dbrg/Hamshahri/ The Homepage of Hamshahri Corpus] (In English)


Wikimedia Foundation. 2010.
