Solving CAPTCHA with OCR

Some websites require passing a CAPTCHA to access their content. As I have written before, these can be solved using the deathbycaptcha API; however, for large websites with many CAPTCHAs this becomes prohibitively expensive. For example, solving 1 million CAPTCHAs with this API would cost $1390.

Fortunately, many CAPTCHAs are weak and can be solved by cleaning the image and running simple OCR. Here are some example CAPTCHA images from a recent website I worked with:

Helpfully, the distracting marks are lighter than the text, so the image can be thresholded to isolate the text:
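As a minimal sketch of that thresholding step (not necessarily the code used for this article), the following uses PIL's Image.point with a lookup table. The function names, the file paths passed in, and the default cutoff of 128 are my own assumptions; the cutoff would need tuning per website.

```python
def threshold_table(cutoff=128):
    """Return a 256-entry lookup table mapping grey levels to black or white.

    Levels below the cutoff (the darker text) become 0 (black); everything
    else (the lighter distracting marks and background) becomes 255 (white).
    """
    return [0 if level < cutoff else 255 for level in range(256)]


def clean_captcha(in_path, out_path, cutoff=128):
    """Convert an image to greyscale, threshold it, and save the result."""
    # Imported here so threshold_table() can be used without PIL installed.
    from PIL import Image

    image = Image.open(in_path).convert("L")  # "L" = 8-bit greyscale
    image.point(threshold_table(cutoff)).save(out_path)
```

Calling clean_captcha("captcha.png", "captcha_clean.png", cutoff=100) (hypothetical file names) would write a black-and-white copy in which only pixels darker than grey level 100 remain as text.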

Now the resulting images can be passed to an OCR program to extract the text. Here are results from 3 popular open source OCR tools:
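One way to call the three tools from Python is through their command-line programs. This is only a sketch: it assumes the tesseract, gocr and ocrad executables are installed and on the PATH, that the installed Tesseract accepts "stdout" as its output base, and that the cleaned image has been saved in a format each tool can read (gocr and ocrad traditionally expect PNM files). The helper names are hypothetical.

```python
import subprocess


def build_ocr_command(tool, image_path):
    """Return the argument list that asks `tool` to print recognised text."""
    commands = {
        # "stdout" as the output base makes tesseract print instead of writing a file.
        "tesseract": ["tesseract", image_path, "stdout"],
        # gocr takes the input file through -i.
        "gocr": ["gocr", "-i", image_path],
        # ocrad takes the input file as a positional argument.
        "ocrad": ["ocrad", image_path],
    }
    if tool not in commands:
        raise ValueError("unsupported OCR tool: %s" % tool)
    return commands[tool]


def run_ocr(tool, image_path):
    """Run one OCR tool on an image and return its output without whitespace."""
    result = subprocess.run(
        build_ocr_command(tool, image_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
        check=True,
    )
    # CAPTCHA text has no spaces, so drop newlines and stray blanks.
    return "".join(result.stdout.split())
```

Keeping the command construction separate from the subprocess call makes it easy to add tool-specific options later without touching the code that runs them.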

              Captcha 1   Captcha 2   Captcha 3   Result
Actual text   7rrg5       hirbZ       izi3b
Tesseract     7rrq5       hirbZ       izi3b       2 / 3
Gocr          7rr95       _i_bz       izi3b       1 / 3
Ocrad         7rrgS       hi_bL       iLi3b       0 / 3

Excellent results. Getting 100% accuracy is not necessary when solving CAPTCHAs: real people make mistakes too, so a website will simply respond with another CAPTCHA to solve.
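Because a wrong guess just produces a fresh CAPTCHA, the solver can simply loop until a guess is accepted. Here is a rough sketch of that idea; the three callables are hypothetical placeholders for whatever code downloads the image, runs the OCR, and submits the form on a particular website.

```python
def solve_with_retries(fetch_captcha, solve, submit, max_attempts=5):
    """Keep requesting and solving CAPTCHAs until one is accepted.

    All three callables are hypothetical and supplied by the caller:
      fetch_captcha() -> image for a fresh CAPTCHA
      solve(image)    -> guessed text
      submit(text)    -> True if the website accepted the guess
    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        guess = solve(fetch_captcha())
        if submit(guess):
            return attempt
    return None
```

The max_attempts limit is an assumption on my part; it stops the loop from hammering a website whose CAPTCHAs the OCR cannot read at all.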

Tesseract only confused ‘g’ with ‘q’, and Gocr thought that ‘g’ was a ‘9’, which is understandable. Even though Ocrad did not get any correct on this small sample set, it was close every time. All of this was without training on the font or fixing the text orientation.
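To put numbers on "close every time", the results listed above can be scored per character as well as per CAPTCHA. This sketch assumes the first row of results is the text actually shown in the images, and that every guess has the same length as the real text (true for this sample).

```python
ACTUAL = ["7rrg5", "hirbZ", "izi3b"]
RESULTS = {
    "Tesseract": ["7rrq5", "hirbZ", "izi3b"],
    "Gocr": ["7rr95", "_i_bz", "izi3b"],
    "Ocrad": ["7rrgS", "hi_bL", "iLi3b"],
}


def score(guesses, actual):
    """Return (CAPTCHAs fully correct, characters wrong) for one tool."""
    solved = sum(g == a for g, a in zip(guesses, actual))
    wrong_chars = sum(
        gc != ac
        for g, a in zip(guesses, actual)
        for gc, ac in zip(g, a)
    )
    return solved, wrong_chars


for tool, guesses in RESULTS.items():
    solved, wrong_chars = score(guesses, ACTUAL)
    print("%-9s %d / 3 solved, %d of 15 characters wrong" % (tool, solved, wrong_chars))
```

On this sample Tesseract gets 1 of 15 characters wrong, while Gocr and Ocrad each get 4 wrong, so Ocrad's 0 / 3 hides the fact that it read most characters correctly.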

If you are interested, the Python code used is available for download here. It depends on PIL for image processing and on each of the OCR tools.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

What is OCR

Optical Character Recognition (OCR) is the process of converting printed materials into text or word processing files that can be easily edited and stored. The technology allows such materials to be stored in far less space than the hard copies occupy. OCR has had a huge impact on the way information is stored, shared and edited. Before OCR, if someone wanted to turn a book into a word processing file, each page had to be typed out word for word.

OCR technology requires both hardware and software. In addition, sophisticated OCR systems require an additional circuit board in the computer itself to complete the process. An optical scanner scans the text on a page and breaks it down into a grid of dots called a bitmap. The software then translates this bitmap into computer text; it can read most common fonts and distinguish where lines start and stop.
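To make the bitmap idea concrete, here is a small illustration of how software might work out where lines of text start and stop: any row of dots containing ink belongs to a line, and a fully blank row separates two lines. This is a simplified sketch of the general idea, not the method of any particular OCR product, and the tiny page below is made-up sample data.

```python
def find_text_lines(bitmap):
    """Return (first_row, last_row) pairs for each run of rows containing ink.

    `bitmap` is a list of rows, each a list of 0 (blank) or 1 (ink) dots.
    """
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # ink ran to the bottom edge
        lines.append((start, len(bitmap) - 1))
    return lines


# Made-up 5 x 4 bitmap: two rows of ink, a blank row, then one more row of ink.
page = [
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(find_text_lines(page))  # prints [(1, 2), (4, 4)]
```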

While optical character recognition has made huge advances in recent years, it still does not always perform well in recognizing handwriting or fonts that look similar to handwriting. There are systems within the banking industry that use OCR technology to try to read the amounts on hand-written checks, to go along with the computer’s ability to read the routing and account numbers.

To give an idea of the power of OCR, it can help to take a look at a real-world example. Imagine a police department that has all its criminal records stored in vast file cabinets. Although scanning millions of pages would be an expensive and time-consuming undertaking, the benefits are huge.

Once the OCR system has converted the pages into computer-readable text, a detective, for example, could search through the entire history in a few seconds. Manually finding a particular record might not be too difficult, but imagine a detective trying to search for all the crimes committed at a certain intersection between 8:00 and 8:30. This example only scratches the surface of the power of searchable text, and it is only one reason that many companies and institutions are spending millions of dollars to OCR their legacy data.