It scans documents, tags them and sorts them according to given parameters. When asked, it searches the stored data and invoices: simply tell the machine to do so and it will find the required information. Welcome to the office of the future. A team of IT experts from Masaryk University is tackling the task of extracting information from scanned documents, which would push today’s copying and printing machines closer to so-called document management systems. To verify one of their methods, they used Proof of Concept support.
“For quite some time now, we’ve been collaborating with Konica Minolta, which develops and manufactures large scanning and printing machines. They want to make their scanners more intelligent so that they not only turn documents into digital images but also recognize their contents,” says Assoc. Prof. Aleš Horák about the OCR Miner project, the aim of which is to mine data from scanned documents.
Under his direction, the team of experts from the Faculty of Informatics focused on a specific task within the Proof of Concept project: extracting data from scanned invoices. The type of document they chose posed quite a challenge. “As far as extracting specific data is concerned, invoices are relatively complicated. Let’s have a look at our test data, which consist of 1,000 invoices from around 50 companies all over the world. One can see how extremely diverse the layouts of the invoices are. They still contain 10 to 20 basic types of information; however, these pieces of information are arranged in unexpected combinations, shapes, and formats. Some invoices are almost a work of art,” jokes Mr. Horák.
Thanks to the financial support from the project, the researchers managed to come up with a prototype of a tool that recognizes specific types of data such as addresses or sums of money. “Within the Proof of Concept, a set of tools was created, each of which specializes in one language technology. The prototype extracts information from invoices with a success rate of around 80%. After changing over to a better OCR base, the success rate could be even higher,” says Mr. Horák.
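To make the idea of recognizing "sums of money" and measuring a success rate more concrete, here is a minimal Python sketch. It is not the OCR Miner method, which the article does not describe in detail. The regular expression, the three currency codes, the field names and the sample invoice text are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern (an assumption, not the OCR Miner approach): a number
# with optional thousands separators and two decimals, then a currency code.
AMOUNT_RE = re.compile(
    r"(?P<amount>\d{1,3}(?:[ ,.]\d{3})*[.,]\d{2})\s*(?P<currency>EUR|USD|CZK)"
)


def extract_amounts(text):
    """Return (amount_string, currency) pairs found in OCR output text."""
    return [(m.group("amount"), m.group("currency")) for m in AMOUNT_RE.finditer(text)]


def success_rate(predicted, expected):
    """Share of expected fields whose predicted value matches exactly."""
    hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return hits / len(expected)


# Invented sample text standing in for the OCR output of one invoice.
sample = "Total due: 1,250.00 EUR\nVAT 21 %: 262.50 EUR"
amounts = extract_amounts(sample)

# Invented ground truth; the invoice number is deliberately not extracted.
expected = {"total": "1,250.00", "vat": "262.50", "currency": "EUR", "invoice_number": "2016-041"}
predicted = {"total": amounts[0][0], "vat": amounts[1][0], "currency": amounts[0][1]}
rate = success_rate(predicted, expected)
```

In this toy case three of four expected fields are found, so the rate is 0.75. A real system would have to cope with OCR errors and the layout diversity described above, which is where a purely pattern-based approach like this one breaks down.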
The prototype currently works with Czech and English. Konica Minolta, which collaborated on the project, would like to cover more world languages. “Naturally, their primary focus is on languages other than Czech; however, the company has a relatively large R&D centre in Brno and from this point of view, the tool makes good sense to them. Moreover, our mother tongue is one of the more complex languages, which shows in extracting information from text as well. This is especially clear in comparison with English, which is much more schematic and therefore easier to search in,” explains Mr. Horák.
The success rate of the solution developed by the scientists from Brno is comparable with that of other tools developed abroad. “We’re aware of several attempts at a similar task. Usually, they have a success rate between 70 and 90% in the case of English. We took a slightly different approach, especially by integrating certain methods of language analysis. In a relatively short time we achieved what is, according to the specialized literature, one of the best results there are,” adds Mr. Horák.
The scientists would like to continue working on this project, which will make scanners and printers smarter. Another goal is to broaden the range of documents the tool is able to recognize. “Now we’re going to work on extracting information from contracts. We assume that the basic structure of the tool we have created within the PoC will be applicable to other types of documents as well,” says Mr. Horák.
Implementing this technology in commercial products is the company’s job. That’s why Mr. Horák cannot say when it will happen, but it could be in just a couple of years. It is not only about improving devices similar to those that are already on the market; the entire technological field is developing rapidly. “Even nowadays, copying machines are fitted with computers and are therefore capable of processing data in a much more complex way than regular scanners or copying machines. The vision is to turn regular copying machines into so-called document management systems offering complex processing of documents, including data analysis, categorization and intelligent searching. We’ve also started working on query systems for man-machine communication. One day, these may also be integrated into such devices. Instead of using a form-based interface, one could communicate with the machine directly,” says Mr. Horák.
In the last five years, development in the field of machine learning and data processing has sped up considerably. Who is currently more accurate, man or machine? “In many fields, it is the machine. For example, in processing large volumes of data, which people are simply not capable of. A while ago we saw a computer beat one of the best Go players in the world. However, sometimes it is not clear what ‘more accurate’ means. For instance, there is an application for recognizing the contents of pictures with an accuracy of 99%, whereas the human accuracy measured on a testing sample was only 96%. Even though it might seem that people are 3% less accurate than the machine, it is not that simple. How can we say that the machine is right if people can’t agree on what it is that these 3% of images actually show?” concludes Mr. Horák.
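Mr. Horák's closing point can be illustrated with a small Python sketch. The labels below are invented for five imaginary images and are not data from the article. The sketch only shows that a measured accuracy depends on whose labels are treated as the correct answer.

```python
def accuracy(predictions, reference):
    """Fraction of items where the prediction equals the reference label."""
    matches = sum(p == r for p, r in zip(predictions, reference))
    return matches / len(reference)


# Invented toy labels (an assumption for illustration only).
# The two human annotators disagree on the third image.
annotator_a = ["cat", "dog", "fox", "cat", "dog"]
annotator_b = ["cat", "dog", "wolf", "cat", "dog"]
machine = ["cat", "dog", "wolf", "cat", "dog"]

machine_vs_a = accuracy(machine, annotator_a)    # annotator A as reference
machine_vs_b = accuracy(machine, annotator_b)    # annotator B as reference
human_agreement = accuracy(annotator_a, annotator_b)
```

With annotator A as the reference the machine scores 0.8, and with annotator B it scores 1.0, while the two people agree with each other on only 0.8 of the images. The same predictions yield different accuracies, which is why a gap of a few percent between man and machine is hard to interpret on its own.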