OCRopus

Developer(s): Thomas Breuel, DFKI

Initial release: 9 April 2007^[1]

Preview release: 0.4.4 (alpha) / May 1, 2010

Written in: C++ and Lua

Operating system: Linux, Mac OS X

Type: Optical character recognition

License: Apache License v2.0

Website: http://code.google.com/p/ocropus/

OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License, Version 2.0. Its design is highly modular: components are implemented as plugins, so they can be swapped out easily.

Development of OCRopus is led by Thomas Breuel of the German Research Centre for Artificial Intelligence (DFKI) in Kaiserslautern, Germany, and is sponsored by Google.

OCRopus is developed for Linux. However, users have reported success running it on Mac OS X, and an application called TakOCR^[2] installs OCRopus on Mac OS X and provides a simple droplet interface.

Contents

1 How it works

2 History

3 Usage

4 See also

5 References

6 External links

How it works

OCRopus is an OCR system that combines pluggable layout analysis, pluggable character recognition, and pluggable language modeling. It is aimed primarily at high-volume document conversion, in particular for Google Book Search, but also at desktop and office use and at use by visually impaired people.

OCRopus originally used Tesseract as its only character recognition plugin, but as of the 0.4 release it uses its own engine.^[3] The pluggable design is especially useful for expanding functionality to include additional languages and writing systems. OCRopus also contains disabled code for a handwriting recognition engine, which may be repaired in the future.

OCRopus's layout analysis plugin does image preprocessing and layout analysis: it chops up the scanned document and passes the sections to a character recognition plugin for line-by-line or character-by-character recognition.
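The hand-off between a layout analysis plugin and a character recognition plugin can be pictured with a minimal Python sketch. Everything in it is hypothetical and chosen for illustration: the names `Pipeline`, `blank_row_layout`, `upper_recognizer` and `identity_recognizer` are not part of OCRopus, and a "page" is modeled as a list of text rows rather than a scanned image.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types for illustration only; these are not OCRopus classes.
Page = List[str]   # a "scanned page", modeled here as rows of text
Line = str         # one segmented text line

LayoutAnalyzer = Callable[[Page], List[Line]]
LineRecognizer = Callable[[Line], str]


def blank_row_layout(page: Page) -> List[Line]:
    """Toy layout analysis: drop empty rows and trim the margins."""
    return [row.strip() for row in page if row.strip()]


def upper_recognizer(line: Line) -> str:
    """Toy stand-in for a character recognizer: upper-cases the line."""
    return line.upper()


def identity_recognizer(line: Line) -> str:
    """A second toy recognizer, used to show that components can be swapped."""
    return line


@dataclass
class Pipeline:
    """Chains one layout component and one recognition component."""
    layout: LayoutAnalyzer
    recognizer: LineRecognizer

    def run(self, page: Page) -> List[str]:
        # Layout analysis chops the page into lines; each line is then
        # passed to the recognizer on its own.
        return [self.recognizer(line) for line in self.layout(page)]


if __name__ == "__main__":
    page = ["  hello ", "", "world"]
    print(Pipeline(blank_row_layout, upper_recognizer).run(page))
    print(Pipeline(blank_row_layout, identity_recognizer).run(page))
```

Because the pipeline depends only on the two callable signatures, replacing the recognizer does not require any change to the layout step, which is the property the plugin design is meant to provide.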

As of the alpha release, OCRopus uses language modeling code from OpenFST,^[4] another Google-supported project. This dependency is optional as of version pre-0.4.

History

Release history:^[5]

Initial announcement - 9 April 2007^[6]

0.1.0 (alpha) - 22 Oct 2007

0.1.1 (alpha) - 14 Dec 2007 (improved build system)

0.2 (alpha 2) - 31 May 2008

0.3 (alpha 3) - 16 Oct 2008^[5]

pre-0.4 (alpha 4) - May 2009 (available for download)^[7]

0.4.3 - July 2009

0.4.4 - March 2010


Usage

OCRopus can be used from the command line or from within gscan2pdf. Once installed, it is invoked by specifying the input images, and it writes hOCR (an HTML-based format) to standard output. If more precise control is needed, options can be given on the command line to perform specific operations (e.g. recognizing a single line).
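Since hOCR is ordinary HTML with OCR information carried in `class` and `title` attributes, the output can be post-processed with a standard HTML parser. The following Python sketch extracts the bounding box and text of each `ocr_line` element. The sample fragment is hand-written for illustration and is not actual OCRopus output; the helper names `HocrLineParser` and `parse_hocr_lines` are hypothetical, and the code assumes properly nested elements inside each line.

```python
import re
from html.parser import HTMLParser

# Hand-written hOCR fragment for illustration; not actual OCRopus output.
SAMPLE_HOCR = """
<html><body>
<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 10 20 300 50'>Hello world</span>
<span class='ocr_line' title='bbox 10 60 280 90'>second line</span>
</div>
</body></html>
"""

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLineParser(HTMLParser):
    """Collects (bbox, text) pairs for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0     # nesting depth inside the current ocr_line
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested element (e.g. a word-level span) inside a line.
            self._depth += 1
            return
        attr = dict(attrs)
        if "ocr_line" in (attr.get("class") or "").split():
            match = BBOX_RE.search(attr.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None
            self._chunks = []
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._chunks).strip()
                self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def parse_hocr_lines(hocr):
    """Return a list of (bbox, text) tuples, one per recognized line."""
    parser = HocrLineParser()
    parser.feed(hocr)
    return parser.lines


if __name__ == "__main__":
    for bbox, text in parse_hocr_lines(SAMPLE_HOCR):
        print(bbox, text)
```

In practice the string passed to `parse_hocr_lines` would be whatever was captured from the program's standard output, for example by redirecting it to a file and reading that file.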

See also


Document Layout Analysis

References

^[1] Announcing the OCRopus Open Source OCR System (Thomas Breuel, OCRopus Project Leader)

^[2] TakOCR website

^[3] OCRopus doesn't even link with Tesseract by default

^[4] Official OpenFST website

^[5] Release notes

^[6] Announcing the OCRopus Open Source OCR System (Thomas Breuel, OCRopus Project Leader)

^[7] Announcements - new repositories available

External links

OCRopus (project page on Google Code)

OCRopus Wiki

IUPR Publication Server (papers behind many of the algorithms used in OCRopus)

OCRopus course (outline of OCRopus code and how to contribute)

Optical character recognition software

Free software
CuneiForm · GOCR · Ocrad · OCRFeeder · OCRopus · Tesseract

Proprietary software
ExperVision · FineReader · Microsoft Office Document Imaging · OmniPage · Readiris · ReadSoft · SimpleOCR · SmartScore · VueScan

See also
List of optical character recognition software

