Optical Character Recognition

Introduction

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artefacts, and apply techniques such as machine translation, text-to-speech and text mining to it.

OCR is a field of research in pattern recognition, artificial intelligence and computer vision. Early OCR systems required calibration to read a specific font: they had to be programmed with images of each character and worked on one font at a time. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems can reproduce formatted output that closely approximates the original scanned page, including images, columns and other non-textual components.

Dr. B.N.C.P.E. , Dept. Of Comp. Sci. & Technology,Ytl

What is OCR?

OCR, or optical character recognition, is a software tool that converts scanned documents into text-searchable files. It is increasingly common for documents to be scanned so that they can be conveniently viewed and shared electronically. However, a scan is merely an image capture of the original document, so it cannot be edited or searched in any way. This reduces efficiency, since employees have to correct or search through multiple pages manually. OCR solves this problem by making the document text searchable.

OCR Solutions

Any office can benefit from the advantages that come with OCR. One experienced provider of OCR software is CVISION Technologies. Its products can perform multiple functions in addition to OCR, such as PDF conversion and compression. CVISION's OCR products are stated to have an accuracy rate above 99% and to process up to 20 pages per second, figures presented as well ahead of competing products. The company also offers a free 30-day trial.

History:

In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by Paul W. Handel, who obtained a US patent on OCR in 1933. In 1955, the first commercial system was installed at the Reader's Digest.

In 1974 Ray Kurzweil started the company Kurzweil Computer Products, Inc. and led development of the first omni-font optical character recognition system, a computer program capable of recognizing text printed in any normal font. He decided that the best application of this technology would be a reading machine for the blind, which would allow blind people to have a computer read text to them out loud. This device required the invention of two enabling technologies: the CCD flatbed scanner and the text-to-speech synthesizer.

In 1978 Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload paper legal and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercializing paper-to-computer text conversion. Kurzweil Computer Products became a subsidiary of Xerox known as ScanSoft, now Nuance Communications.

From 1992 to 1996, commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) conducted the most authoritative of the annual tests of OCR accuracy for five consecutive years. ISRI is a research and development unit of the University of Nevada, Las Vegas. It was established in 1990 with funding from the U.S. Department of Energy, and its mission is to foster the improvement of automated technologies for understanding machine-printed documents.

The goal of Optical Character Recognition (OCR) is to classify optical patterns (often contained in a digital image) corresponding to alphanumeric or other characters. The process of OCR involves several steps, including segmentation, feature extraction, and classification. Each of these steps is a field unto itself, and is described briefly here in the context of a Matlab implementation of OCR.

Text capture is the process of converting analogue text-based resources into digitally recognisable text resources. These digital text resources can be represented in many ways, such as searchable text in indexes that identify documents or page images, or as full-text resources. An essential first stage in any text capture process is to create a scanned image of the page. This provides the base for all other processes. The next stage may then be to use Optical Character Recognition to convert the text content into a machine-readable format.
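As a rough illustration of the segmentation, feature extraction and classification steps named above, the following Python sketch runs all three on a tiny binary image. It is not the Matlab implementation the text refers to: the three-glyph font, the row and column ink-count features, and the nearest-template classifier are all assumptions chosen to keep the example small.

```python
# Toy OCR pipeline: segmentation -> feature extraction -> classification.
# The font, the features and all function names are illustrative assumptions.

FONT = {  # hypothetical 5-row glyphs, "1" = ink, "0" = background
    "I": ["1", "1", "1", "1", "1"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def to_matrix(rows):
    """Convert strings of '0'/'1' into a list of integer rows."""
    return [[int(ch) for ch in row] for row in rows]


def render(text):
    """Build a page image: glyphs side by side, one blank column between them."""
    return to_matrix(["0".join(FONT[ch][y] for ch in text) for y in range(5)])


def segment_columns(image):
    """Segmentation: cut the image wherever a completely blank column occurs."""
    width = len(image[0])
    inked = [any(row[x] for row in image) for x in range(width)]
    glyphs, start = [], None
    for x, has_ink in enumerate(inked + [False]):  # sentinel closes the last run
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            glyphs.append([row[start:x] for row in image])
            start = None
    return glyphs


def extract_features(glyph):
    """Feature extraction: ink count per row followed by ink count per column."""
    rows = [sum(r) for r in glyph]
    cols = [sum(r[x] for r in glyph) for x in range(len(glyph[0]))]
    return rows + cols


TEMPLATES = {label: extract_features(to_matrix(rows)) for label, rows in FONT.items()}


def classify(features):
    """Classification: nearest template by squared distance."""
    def distance(label):
        template = TEMPLATES[label]
        if len(template) != len(features):  # different glyph width
            return float("inf")
        return sum((a - b) ** 2 for a, b in zip(features, template))
    return min(TEMPLATES, key=distance)


def recognize(image):
    return "".join(classify(extract_features(g)) for g in segment_columns(image))


if __name__ == "__main__":
    print(recognize(render("TILL")))
```

Real systems use far richer features and classifiers, but the division of labour between the three stages is the same.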

Optical Character Recognition (OCR) is a type of document image analysis in which a scanned digital image containing either machine-printed or handwritten script is input into an OCR software engine and translated into an editable, machine-readable digital text format (like ASCII text). OCR works by first pre-processing the digital page image into its smallest component parts, using layout analysis to find text blocks, sentence/line blocks, word blocks and character blocks. Other features such as lines, graphics and photographs are recognised and discarded. The character blocks are then further broken down into component parts, pattern-recognized and compared to the OCR engine's large dictionary of characters from various fonts and languages. Once a likely match is made it is recorded, and the characters in the word block are recognized in turn until all likely characters have been found for the word block. The word is then compared to the OCR engine's large dictionary of complete words that exist for that language.
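The final two steps described above, collecting likely characters for each character block and then checking the word against a dictionary, can be sketched in Python as follows. The candidate scores, the small dictionary and the function name are hypothetical; a real engine would use much larger dictionaries and a smarter search than trying every combination.

```python
from itertools import product


def best_dictionary_word(candidates, dictionary):
    """Pick the best-scoring dictionary word for one word block.

    candidates: one dict per character block, mapping a likely character
                to a match score (higher is better); values are assumed.
    dictionary: set of valid words for the language.
    Falls back to the top-scoring character of each block if no combination
    of candidates forms a dictionary word.
    """
    best_word, best_score = None, float("-inf")
    for combo in product(*(block.items() for block in candidates)):
        word = "".join(ch for ch, _ in combo)
        score = sum(s for _, s in combo)
        if word in dictionary and score > best_score:
            best_word, best_score = word, score
    if best_word is None:
        best_word = "".join(max(block, key=block.get) for block in candidates)
    return best_word


# Hypothetical match scores (0-100) for a five-character word block.
blocks = [
    {"c": 90, "e": 40},
    {"1": 60, "l": 50},   # the digit "1" scores higher than the letter "l"
    {"e": 80, "c": 30},
    {"a": 90},
    {"r": 70, "n": 60},
]
words = {"clear", "clean", "dear"}

if __name__ == "__main__":
    print(best_dictionary_word(blocks, words))
```

Taken character by character, the top choices spell "c1ear"; the dictionary check is what recovers "clear".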

Building Blocks of OCR:

High-contrast image data is acquired from metal or another hard surface or, in some advanced systems, from paper of varying roughness and reflectivity, on which characters are impressed or printed. The optical scanner applies normal illumination, and a linear photodiode array detects light reflected normal to the surface within a narrow acceptance angle, so that the characters appear dark and the background light. The detector signal is pre-processed to remove non-uniform background variations and yield image data which can be fed to conventional character recognition equipment. This is the main software part of the system. In this signals are processed and given
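To make the pre-processing step concrete, here is a minimal Python sketch of removing a non-uniform background from one scan line before thresholding. The source does not say which method the scanner uses; estimating the background as a sliding-window maximum (reasonable because characters are darker than the background) is an assumption, as are the sample values, the window size and the threshold.

```python
def remove_background(scan_line, window=3):
    """Divide each sample by a local background estimate.

    The background is taken as the maximum reflectance inside a sliding
    window centred on the sample (an assumed method, not the source's).
    """
    half = window // 2
    normalised = []
    for i, value in enumerate(scan_line):
        lo, hi = max(0, i - half), min(len(scan_line), i + half + 1)
        background = max(scan_line[lo:hi])
        normalised.append(value / background if background else 0.0)
    return normalised


def binarise(normalised, threshold=0.6):
    """Mark samples well below the local background as ink (1)."""
    return [1 if v < threshold else 0 for v in normalised]


# Hypothetical detector readings: the background brightens from 50 to 230
# across the line, with dark character strokes at positions 1 and 5.
scan_line = [50, 40, 110, 140, 170, 100, 230]

if __name__ == "__main__":
    print(binarise(remove_background(scan_line)))
```

No single global threshold separates ink from background in this signal, because the stroke at position 5 (value 100) is brighter than the bare background at position 0 (value 50); normalising against the local background first removes that problem.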

Types of OCR System

Desktop & Server OCR Software

OCR software and ICR software are analytical artificial intelligence systems that consider sequences of characters rather than whole words or phrases. Based on the analysis of sequential lines and curves, OCR and ICR make "best guesses" at characters, using database look-up tables to closely associate or match the strings of characters that form words.

Web OCR & Online OCR

As information technology has developed, the platform on which people use software has changed from the single PC to multiple platforms: PC, Web-based, cloud computing and mobile devices. After 30 years of development, OCR software began to adapt to these new application requirements. Web OCR, also known as Online OCR or Web-based OCR service, is a newer trend that serves larger volumes and larger groups of users than desktop OCR. Internet and broadband technologies have made Web OCR and Online OCR practically available to both individual users and enterprise customers. Since 2000, some major OCR vendors have offered Web OCR and Online OCR software, and a number of new entrants have seized the opportunity to develop innovative Web-based OCR services, some of which are free of charge.

Application-Oriented OCR

As OCR technology has been applied more and more widely in paper-intensive industries, it has had to face more complex images in the real world, for example complicated backgrounds, degraded images, heavy noise, paper skew, picture distortion, low resolution, interference from grids and lines, and text images containing special fonts, symbols and glossary words. All of these factors affect the stability of an OCR product's recognition accuracy. In recent years, the major OCR technology providers have begun to develop dedicated OCR systems, each for a special type of image. They combine various optimization methods related to the special image, such as business rules, standard expressions, glossaries or dictionaries, and the rich information contained in color images, to improve recognition accuracy. This strategy of customizing OCR technology is called Application-Oriented OCR or "Customized OCR", and is widely used in fields such as Business-card OCR, Invoice OCR, Screenshot OCR, ID card OCR, Driver-license OCR or Auto plant OCR.
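A small Python sketch can show how business rules improve accuracy in Application-Oriented OCR. It post-processes raw OCR output for a single invoice-number field whose format is known in advance. The field format (`INV-` followed by six digits), the look-alike tables and the function name are all hypothetical and are not taken from any particular product.

```python
import re

# Characters that OCR engines commonly confuse, mapped according to what
# the field allows at that position (assumed tables, for illustration only).
DIGIT_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"})
LETTER_LOOKALIKES = str.maketrans({"0": "O", "1": "I", "5": "S", "8": "B"})

INVOICE_NUMBER = re.compile(r"INV-\d{6}")  # hypothetical business rule


def correct_invoice_number(raw):
    """Apply field-specific rules to raw OCR text.

    Returns the corrected invoice number, or None if the text cannot be
    made to satisfy the expected format and needs human review.
    """
    text = raw.strip().replace(" ", "")
    prefix, separator, digits = text.partition("-")
    if not separator:
        return None
    candidate = (
        prefix.upper().translate(LETTER_LOOKALIKES)
        + "-"
        + digits.translate(DIGIT_LOOKALIKES)
    )
    return candidate if INVOICE_NUMBER.fullmatch(candidate) else None


if __name__ == "__main__":
    print(correct_invoice_number("1NV-OO42l7"))
```

Because the rule knows that the prefix holds letters and the suffix holds digits, it can resolve confusions such as "O" versus "0" that a general-purpose engine cannot decide from the glyph shape alone.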

Current state of OCR technology

Recognition of Latin-script, typewritten text is still not 100% accurate even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 71% to 98%. Total accuracy can be achieved only by human review. Other areas, including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially East Asian characters, which can have many strokes for a single character), are still the subject of active research.

On-line character recognition is sometimes confused with Optical Character Recognition (see Handwriting recognition). OCR is an instance of off-line character recognition, where the system recognizes the fixed static shape of the character, while on-line character recognition instead recognizes the dynamic motion during handwriting. For example, on-line recognition, such as that used for gestures in the Penpoint OS or the Tablet PC, can tell whether a horizontal mark was drawn right-to-left or left-to-right. On-line character recognition is also referred to by other terms such as dynamic character recognition, real-time character recognition, and Intelligent Character Recognition or ICR.

On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years (see Tablet PC history). Among these are the input devices for personal digital assistants such as those running Palm OS. The Apple Newton pioneered this product. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be trained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.

Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the Amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, for example, allowing greater accuracy. The shapes of individual cursive characters themselves simply do not contain enough information to recognise all handwritten cursive script accurately (at better than 98%).
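The cheque example can be sketched in Python to show why a small dictionary helps: each noisy word is snapped to the closest entry in a short lexicon of number words. The lexicon contents, the similarity cutoff and the function name are assumptions; the matching itself uses `difflib.get_close_matches` from the Python standard library, which is a general string-similarity tool rather than a handwriting recognizer.

```python
import difflib

# A deliberately small lexicon: words that can appear on the amount line
# of a cheque (an illustrative, incomplete list).
AMOUNT_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty",
    "fifty", "sixty", "seventy", "eighty", "ninety", "hundred", "thousand",
    "and",
]


def snap_to_lexicon(raw_words, lexicon=AMOUNT_WORDS, cutoff=0.6):
    """Replace each recognized word by its closest lexicon entry.

    Words with no entry above the similarity cutoff become None, which
    marks them for manual review.
    """
    snapped = []
    for word in raw_words:
        matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
        snapped.append(matches[0] if matches else None)
    return snapped


if __name__ == "__main__":
    # Hypothetical noisy character-level output for "three hundred and fifty".
    print(snap_to_lexicon(["thre", "hundrcd", "and", "fifly"]))
```

With only thirty possible words, a misread character rarely makes one amount word look more like another, which is not true for a full-language dictionary.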

It is necessary to understand that OCR is a basic technology that is also used in advanced scanning applications. Because of this, an advanced scanning solution can be unique and patented, and not easily copied, despite being based on this basic OCR technology.

For more complex recognition problems, intelligent character recognition systems are generally used, as artificial neural networks can be made indifferent to both affine and non-linear transformations. One technique that is having considerable success in recognising difficult words and character groups, within documents otherwise amenable to computer OCR, is to submit them automatically to humans through the reCAPTCHA system.

Use of OCR in different sectors

Practical Applications

In recent years, OCR technology has been applied across the entire spectrum of industries, revolutionizing the document management process. OCR has enabled scanned documents to become more than just image files, turning them into fully searchable documents with text content that is recognized by computers. With the help of OCR, people no longer need to retype important documents manually when entering them into electronic databases. Instead, OCR extracts the relevant information and enters it automatically. The result is accurate, efficient information processing in less time.

Banking

The uses of OCR vary across different fields. One widely known application is in banking, where OCR is used to process checks without human involvement. A check can be inserted into a machine, the writing on it is scanned instantly, and the correct amount of money is transferred. This technology has nearly been perfected for printed checks, and is fairly accurate for handwritten checks as well, though it occasionally requires manual confirmation. Overall, this reduces wait times in many banks.

Legal

In the legal industry, there has also been a significant movement to digitize paper documents. To save space and eliminate the need to sift through boxes of paper files, documents are being scanned and entered into computer databases. OCR further simplifies the process by making documents text-searchable, so that they are easier to locate and work with once in the database. Legal professionals now have fast, easy access to a huge library of documents in electronic format, which they can find simply by typing in a few keywords.

Healthcare

Healthcare has also seen an increase in the use of OCR technology to process paperwork. Healthcare professionals have to deal with large volumes of forms for each patient, including insurance forms as well as general health forms. To keep up with all of this information, it is useful to enter the relevant data into an electronic database that can be accessed as necessary. Form processing tools, powered by OCR, are able to extract information from forms and put it into databases, so that every patient's data is promptly recorded. As a result, healthcare providers can focus on delivering the best possible service to every patient.

OCR in Other Industries

OCR is widely used in many other fields, including education, finance, and government agencies. OCR has made countless texts available online, saving money for students and allowing knowledge to be shared. Invoice imaging applications are used in many businesses to keep track of financial records and prevent a backlog of payments from piling up. In government agencies and independent organizations, OCR simplifies data collection and analysis, among other processes. As the technology continues to develop, more and more applications are found for OCR, including increased use of handwriting recognition. Furthermore, other technologies related to OCR, such as barcode recognition, are used daily in retail and other industries.

Optical character recognition software applications:

Automatic number plate recognition
Computational linguistics
Computer vision
Digital library
Digital pen
Handwriting
Music OCR
Optical mark recognition

Advantages

The advantages of optical character recognition are numerous. It increases the efficiency and effectiveness of office work. The ability to search through content instantly is immensely useful, especially in an office that has to deal with high-volume scanning or a high document inflow. Conversion is quick and accurate, so the document's content remains intact while time is saved. Scanning and file compression are done easily. Workflow improves, since employees no longer have to waste time on manual labour, and work is completed more quickly and efficiently.

Disadvantages

OCR cannot always convert text or data accurately. Stray marks or voids within the scan area that are not supposed to be there can cause recognition errors. The price of OCR software and hardware is very high, and high accuracy and capability are not available from low-priced software and hardware. OCR software is not efficient at recognizing handwriting, or fonts that closely resemble handwriting. High-accuracy OCR software can read more than 400 characters per second, approximately, and generates fewer OCR errors than ordinary OCR software, but errors still occur. Therefore, anyone adopting an OCR process has to maintain a separate workstation for correcting OCR errors.

Conclusion

We have implemented an OCR system using only a mobile phone for all tasks.

The system can convert images of documents set in the Bookman Old Style font, at any size, to text. It has an accuracy of around 75%, and recognition completes within 23 minutes. The system is invariant to font size and to the height at which the camera is placed above the document. The accuracy and recognition are not yet sufficient for practical use, and significant improvement may be required.
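The text does not say how the accuracy figure was measured. One common way to score OCR output, shown here only as an assumed example, is character accuracy: one minus the edit distance between the reference text and the OCR output, divided by the length of the reference. The function names below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Fraction of reference characters recovered, floored at zero."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


if __name__ == "__main__":
    # One wrong character out of four gives an accuracy of 0.75.
    print(character_accuracy("abcd", "abxd"))
```

Under this measure, an accuracy of 75% means roughly one character in four needs correcting, which is consistent with the judgement that the system is not yet practical.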
