Monday, March 3, 2014

Optical Character Recognition (OCR)


1 - When a scanner reads a document, it converts the dark elements of the page - text and graphics - into a bitmap, an array of square pixels that are either active (black) or inactive (white). Because the pixels are larger than the finest details of the text, this process degrades the thin ends of the characters, much as a fax machine does. This degradation causes most of the problems for optical character recognition (OCR) systems.
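To make the conversion concrete, here is a minimal sketch of turning grayscale samples into a 1-bit bitmap. The function name `binarize`, the fixed threshold of 128 and the sample values are assumptions for illustration; a real scanner driver may work quite differently.

```python
def binarize(gray, threshold=128):
    """Turn a grayscale grid (0 = black, 255 = white) into a 1-bit bitmap.

    1 is an active (black) pixel, 0 is an inactive (white) pixel.
    The threshold value is an assumption for this sketch.
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# A thin diagonal stroke whose faint ends (values 200 and 190) are lighter
# than the threshold, so they disappear from the bitmap.
gray = [
    [200, 255, 255, 255],
    [255,  40, 255, 255],
    [255, 255,  30, 255],
    [255, 255, 255, 190],
]

bitmap = binarize(gray)
for row in bitmap:
    print("".join("#" if pixel else "." for pixel in row))
```

The two faint end pixels are lost, which is the kind of degradation of thin strokes described above.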


2 - The OCR program reads the bitmap generated by the scanner and examines the areas of active and inactive pixels on the page; in effect, it maps the page's white space. This allows the program to separate blocks such as paragraphs, columns, headings and graphics. The white space between lines of text within a block defines the baseline of each line, a detail essential for recognizing the characters in the text.
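One simple way to find lines of text from white space is to look for runs of pixel rows that contain ink, separated by fully blank rows. The sketch below assumes a clean, unskewed bitmap; `find_text_lines` is a hypothetical helper, not the method of any particular OCR program.

```python
def find_text_lines(bitmap):
    """Return (top, bottom) row indices for each run of rows containing ink."""
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # a blank row closes the line
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(bitmap) - 1))
    return lines


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
]

text_lines = find_text_lines(page)
print(text_lines)  # [(1, 2), (5, 5)]
```

The bottom row of each run is only a rough stand-in for the baseline, since descenders extend below it; a fuller implementation would have to account for that.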

3 - In the first pass at converting the image into text, the program attempts to recognize each character by comparing it, pixel by pixel, with character templates stored in memory. The templates consist of complete sets - numbers, punctuation and extended characters - of common fonts such as 12-point Courier and the IBM Selectric typeface set. Since this technique requires a very close match, character attributes such as bold and italic must also be identical for a character to be recognized. A poor-quality scan does not give good results in this pass.
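The pixel-by-pixel comparison can be sketched as follows. The tiny 3x3 templates, the names `match_score` and `matrix_match`, and the 0.9 acceptance threshold are all assumptions for illustration; the sketch also assumes every glyph has already been scaled to the template size.

```python
# Hypothetical 3x3 templates; real template sets cover whole fonts.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def match_score(glyph, template):
    """Fraction of pixels that are identical in glyph and template."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return same / total


def matrix_match(glyph, templates=TEMPLATES, min_score=0.9):
    """Return the best-matching character, or None if no match is close enough."""
    best_char, best_score = None, 0.0
    for char, template in templates.items():
        score = match_score(glyph, template)
        if score > best_score:
            best_char, best_score = char, score
    return best_char if best_score >= min_score else None


print(matrix_match(["100", "100", "111"]))  # L
print(matrix_match(["110", "100", "111"]))  # None (one stray pixel: 8/9 < 0.9)
```

A single stray pixel is enough to reject the match in this toy case, which illustrates why a very close match, and therefore a good scan, is needed.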

4 - The unrecognized characters undergo a more thorough and time-consuming process known as feature extraction. The program calculates the x-height of the text - the height of the lowercase letter x - and examines each character's combination of straight lines, curves and enclosed areas, as in the letters o or b. OCR programs know, for example, that a character with a descender below the baseline and an enclosed area above it is most likely a lowercase g. As the program builds a working alphabet from each new character it encounters, the recognition speed increases.
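The g rule can be sketched with two very simple features: whether the part of the glyph at or above the baseline contains a closed loop, and whether any ink falls below the baseline. All function names here are hypothetical, and the fallback guess of "o" for a loop without a descender is a deliberate oversimplification.

```python
def parse(rows):
    """Convert rows of '#' and '.' into a grid of 1 (ink) and 0 (white)."""
    return [[1 if c == "#" else 0 for c in row] for row in rows]


def has_enclosed_area(glyph):
    """True if some white pixel cannot be reached from the glyph's border."""
    h, w = len(glyph), len(glyph[0])
    seen = set()
    # Start a flood fill from every white pixel on the border.
    stack = [(y, x) for y in range(h) for x in range(w)
             if (y in (0, h - 1) or x in (0, w - 1)) and glyph[y][x] == 0]
    while stack:
        y, x = stack.pop()
        if (y, x) in seen:
            continue
        seen.add((y, x))
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < h and 0 <= nx < w and glyph[ny][nx] == 0:
                stack.append((ny, nx))
    white = sum(row.count(0) for row in glyph)
    return white > len(seen)  # unreached white pixels form a closed loop


def has_descender(glyph, baseline_row):
    """True if any ink lies below the baseline row."""
    return any(any(row) for row in glyph[baseline_row + 1:])


def guess_character(glyph, baseline_row):
    """Toy rule set: loop plus descender suggests g, loop alone suggests o."""
    loop = has_enclosed_area(glyph[:baseline_row + 1])
    if loop and has_descender(glyph, baseline_row):
        return "g"
    if loop:
        return "o"
    return None


g_glyph = parse([".###.", "#...#", ".####", "....#", ".###."])
o_glyph = parse([".###.", "#...#", ".###."])

print(guess_character(g_glyph, baseline_row=2))  # g
print(guess_character(o_glyph, baseline_row=2))  # o
```

Real feature extraction uses many more features (stroke directions, curves, proportions relative to the x-height) and does not depend on an exact pixel match, which is why it can handle fonts that template comparison misses.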

5 - Because these two processes do not ultimately decipher every character, OCR programs use one of two methods to deal with the remaining "hieroglyphics". Some OCR programs mark the unrecognized characters with a special character - such as ~, # or @ - and quit. It is then necessary to use a word processor to locate these special characters and correct them manually. Other OCR programs display an enlarged bitmap on the screen and ask the user to press the key corresponding to the character in question, which then replaces the bitmap.
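The first method, marking what could not be read, might look like the sketch below. It assumes the recognizer yields `None` for each character it could not identify; `assemble_text` and `marker_positions` are hypothetical names.

```python
MARKER = "~"  # placeholder written in place of an unrecognized character


def assemble_text(recognized):
    """Join per-character results, writing MARKER where recognition failed."""
    return "".join(ch if ch is not None else MARKER for ch in recognized)


def marker_positions(text):
    """Positions a person would need to review and correct by hand."""
    return [i for i, ch in enumerate(text) if ch == MARKER]


text = assemble_text(["s", "c", None, "n", "n", "e", "r"])
print(text)                    # sc~nner
print(marker_positions(text))  # [2]
```

One limitation of this approach is that a marker character that genuinely appears in the document cannot be told apart from a placeholder.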

6 - Other OCR programs also invoke a special spell checker to look for obvious errors and to find possible alternatives for words that contain unrecognized characters. For example, to an OCR program the number 1 and the letter l look very similar, as do the number 5 and the letter S, or the pair cl and the letter d. A word like acclimate could come out as acdimate. The spell checker recognizes these typical OCR errors and fixes them.
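A spell checker aware of OCR confusions can be sketched as a lookup that swaps confusable fragments back and tests the result against a word list. The confusion table below covers only the pairs mentioned in the text, and the four-word dictionary is a made-up stand-in for a real one.

```python
# What the OCR produced -> what was possibly on the page.
CONFUSIONS = {"1": "l", "5": "S", "d": "cl"}

# Hypothetical tiny word list; a real checker would load a full dictionary.
DICTIONARY = {"acclimate", "class", "Sale", "level"}


def candidates(word):
    """Yield variants of word with one confusable fragment swapped back."""
    for seen, intended in CONFUSIONS.items():
        start = word.find(seen)
        while start != -1:
            yield word[:start] + intended + word[start + len(seen):]
            start = word.find(seen, start + 1)


def correct(word):
    """Return a dictionary word reachable by one swap, else the word unchanged."""
    if word in DICTIONARY:
        return word
    for candidate in candidates(word):
        if candidate in DICTIONARY:
            return candidate
    return word


print(correct("acdimate"))  # acclimate
print(correct("1evel"))     # level
print(correct("5ale"))      # Sale
```

This sketch only tries a single substitution per word; handling several errors in one word would require combining swaps or using a more general edit-distance search.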

7 - Most OCR programs can save the converted document as ASCII text or in a format that the best-known word processors and spreadsheets can recognize.

Source: Evolution of Computers
