How advanced OCR found new life in big data systems


Optical character recognition is an old technology: it has been in wide use since the 1970s, when it powered the first reading machines that gave visually impaired people access to printed text. Since then, OCR has come a long way, and it can now interpret documents of varying quality and formats, making it a valuable tool for data management applications.

Today, OCR in combination with natural language processing (NLP) allows businesses to perform complex data extraction tasks. OCR converts scanned pages into machine-readable text, and NLP then classifies that text and extracts meaning from it. And while most modern documents are born digital – no OCR needed – this combined approach makes it possible to pull archival business materials retained only in print into the data stream.
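That division of labor can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `ocr_text` sample is invented and stands in for output from a real OCR engine (such as Tesseract), and a regular-expression pass stands in for the NLP step.

```python
import re

# Hypothetical OCR output for an archival invoice; in practice this text
# would come from an OCR engine such as Tesseract (e.g. via pytesseract).
ocr_text = """
INVOICE  No. 1042   Date: 03/14/1987
Widget, brass        12 @ $1.50     $18.00
Total due:                          $18.00
"""

def extract_fields(text: str) -> dict:
    """Minimal stand-in for the NLP step: classify the document and
    pull structured fields (date, monetary amounts) out of raw text."""
    doc_type = "invoice" if "INVOICE" in text.upper() else "unknown"
    date = re.search(r"\b\d{2}/\d{2}/\d{4}\b", text)
    amounts = [float(m) for m in re.findall(r"\$(\d+\.\d{2})", text)]
    return {
        "type": doc_type,
        "date": date.group(0) if date else None,
        "total": max(amounts) if amounts else None,
    }

fields = extract_fields(ocr_text)
# fields == {"type": "invoice", "date": "03/14/1987", "total": 18.0}
```

A real deployment would swap the regular expressions for a trained NLP model, but the shape of the pipeline – raw page in, structured record out – stays the same.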

Old Content Makes A Comeback

Today, when a company prints a catalog, it first designs a digital document, converts it to a PDF, and then prints and binds it. Many older companies, however, hold catalogs, budgets, and other documents that were never digitized: they were produced on typewriters or never converted to newer formats. These documents are potentially useful, but they're unsearchable. For years that has been a roadblock; it no longer has to be.

So what exactly can businesses learn from their old documents? Sales strategy is at the top of the list. By using OCR to make old catalog listings searchable and NLP to identify key terms, businesses can track the language used to advertise products over time, see whether that language has changed, and understand how those changes reflect a product's evolution or the brand's positioning.
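A toy sketch of that analysis in Python, using only the standard library: the catalog snippets here are invented, and a simple word-frequency count stands in for a real NLP key-term extractor.

```python
import re
from collections import Counter

# Invented catalog snippets standing in for OCR'd pages from two eras.
catalogs = {
    1975: "Durable steel hand tools. Durable and dependable, built to last.",
    2015: "Smart connected tools. Smart, lightweight, eco-friendly design.",
}

STOPWORDS = {"a", "and", "the", "to", "of"}

def key_terms(text: str, n: int = 3) -> list:
    """Return the n most frequent non-stopword terms in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

trends = {year: key_terms(text) for year, text in catalogs.items()}
# e.g. trends[1975][0] == "durable", trends[2015][0] == "smart"
```

Comparing the term lists year over year surfaces exactly the kind of shift the article describes – from "durable" and "dependable" toward "smart" and "eco-friendly".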

Other documents that businesses can analyze with OCR are old budgets and sales charts. This kind of financial data can help companies assess how political and economic changes have affected revenue and spending. Yes, there was a major recession in 2008, but what about longer-term economic cycles? When businesses work only with recent financial data, they risk missing hazards and opportunities alike.

Enhanced Accuracy

One reason that OCR was rarely used until recently is that it wasn't especially reliable. Even when programs reached about 95% accuracy in the early 2000s, businesses ran the risk of the software producing documents with major mistakes – at 95% character accuracy, a page of 2,000 characters can still contain around 100 errors – and with numerals in particular, such errors are labor intensive to identify and correct. Analysts would do just as well entering the data by hand. Now that scan accuracy has improved significantly, the resulting data is more valuable, and analysts only need to cross-reference scans against the original documents when something in the content doesn't make sense.

NLP has also helped increase the accuracy of OCR scans. For example, older OCR programs might read chart lines as the letter ‘L’ or number ‘1.’ NLP is context dependent, however, so it can identify if something is a chart or graph, whether it’s reading a bill or an invoice, and other types of nuanced content.
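A crude version of that context dependence can be illustrated in Python: only tokens that already look numeric get their confusable characters remapped, so "Total" keeps its "l" while "$1Ol.5O" becomes "$101.50". The confusion table and the heuristic are illustrative assumptions, not drawn from any particular OCR engine.

```python
# Common OCR confusions between letters and digits (illustrative table).
CONFUSABLE = {"l": "1", "I": "1", "O": "0", "S": "5", "B": "8"}

def looks_numeric(token: str) -> bool:
    """Context check: does every character look like part of a number?"""
    core = token.strip("$%:")
    return bool(core) and all(
        c.isdigit() or c in CONFUSABLE or c in ".," for c in core
    )

def fix_token(token: str) -> str:
    """Remap confusable characters only in numeric-looking tokens."""
    if looks_numeric(token):
        return "".join(CONFUSABLE.get(c, c) for c in token)
    return token

line = "Total: $1Ol.5O for lO units"
cleaned = " ".join(fix_token(t) for t in line.split())
# cleaned == "Total: $101.50 for 10 units"
```

Real NLP-assisted OCR uses far richer context than this token-level heuristic – which would, for instance, misread the word "SO" as "50" – but the principle is the same: the surrounding content decides how an ambiguous glyph is interpreted.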

Next Gen Text Solutions

Report scraping – pulling content from documents – is easy when all of your files are already digital, but for old reports and catalogs that aren't ready for a digital reader, businesses need a comprehensive text analytics platform. Such platforms are essentially a basic form of AI, trained to read documents, yet they rely on far less expensive technology than general-purpose AI systems. Still, these OCR+NLP solutions will go the distance.

Rather than letting valuable documents sit unused in filing cabinets, get them back into circulation with affordable OCR technology. After all, with the rise of big data, companies have been extracting low-value content from every interaction – while high-value data sits archived in a back closet. That information is far more likely to be actionable than much of what modern website analytics tools glean from minor interactions, and it's part of a bigger picture that most companies haven't tapped yet.
