- Hamshahri Corpus
The Hamshahri Corpus is a sizable Persian (Farsi) corpus based on the Iranian newspaper Hamshahri, one of the first online Persian newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the DBRG Group [http://ece.ut.ac.ir/dbrg/] of the University of Tehran. The corpus was created by crawling the online news articles on Hamshahri's website and processing the HTML pages into a standard text corpus for modern information retrieval experiments.
The collection contains more than 160,000 articles covering the following subject categories: politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, sports, etc. Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average of 1.8 KB.
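Size figures like the minimum, maximum and average quoted above can be computed for any document collection with a few lines of code. The following is a minimal sketch, not the procedure used by the corpus authors. The function name `size_stats_kb`, the UTF-8 encoding and the sample strings are all assumptions made for illustration, and the real corpus files are not reproduced here.

```python
from statistics import mean


def size_stats_kb(documents, encoding="utf-8"):
    """Return (min, max, mean) document size in KB for an iterable of strings.

    The encoding is an assumption: Persian text takes more than one byte
    per character in UTF-8, so byte size depends on the encoding chosen.
    """
    sizes = [len(doc.encode(encoding)) / 1024 for doc in documents]
    if not sizes:
        raise ValueError("no documents given")
    return min(sizes), max(sizes), mean(sizes)


# Hypothetical stand-in documents, not taken from the corpus.
sample = ["a" * 512, "b" * 2048, "c" * 1024]
smallest, largest, average = size_stats_kb(sample)
print(f"min={smallest:.2f} KB  max={largest:.2f} KB  mean={average:.2f} KB")
```

Measuring encoded bytes rather than character counts keeps the result comparable to on-disk sizes such as those quoted for the corpus.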
The corpus is available for download in several formats [http://ece.ut.ac.ir/dbrg/Hamshahri/]:
* Tagged Text: 560 MB
* In SQL Server 2000 Tables: 712 MB
See also
* Bijankhan Corpus
* Persian Today Corpus
* Text corpus
* Information Retrieval
External links
* [http://ece.ut.ac.ir/dbrg/ DBRG Group Website]
* [http://ece.ut.ac.ir/dbrg/Hamshahri/ The Homepage of Hamshahri Corpus] (In English)
Wikimedia Foundation. 2010.