Cleaning up OCR'd documents

Note from the developer:

CafeTran does a pretty good job of handling OCR'd Word files. The only problem I've encountered so far is with accented characters. The solution is simple: select all text in Word and re-apply the font already in use to the selection. Then save the document and create a CafeTran project with the OCR'd Word document.

Word documents created by OCR software often contain a large number of formatting commands, which the software inserts so that the Word document resembles the paper original as closely as possible. When you want to translate such a document, however, these formatting commands get in the way.

Excellent solutions are available for Word for Windows: Translators Tools and CodeZapper. However, these advanced macro suites cannot be used in Word:mac.

For Word:mac, you can use the solution described in this article. It also works in Word for Windows.

Creating a simple solution for Word:mac

Download this Word document, which contains the code of a macro that cleans up OCR'd documents. Install the macro and run it on every OCR'd document that you want to import into CafeTran.

NOTE: The macro will preserve the following types of inline formatting:

  • bold
  • italics
  • underline

The macro will ask you for the TMX language code of your target language and set that language in all parts (stories) of your document.
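The cleanup described above can be illustrated with a small sketch. This is not the macro's actual code (the macro itself is in the downloadable Word document and runs inside Word); it is a simplified Python model, written under the assumption that the text consists of formatting "runs" and that the cleanup drops everything except bold, italics and underline, sets one language on every run, and merges neighbouring runs that end up formatted identically. The names Run, clean_runs, font_size and spacing are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    """A stretch of text with uniform formatting (hypothetical model)."""
    text: str
    bold: bool = False
    italic: bool = False
    underline: bool = False
    font_size: Optional[float] = None  # typical OCR noise, dropped on cleanup
    spacing: Optional[float] = None    # typical OCR noise, dropped on cleanup
    lang: Optional[str] = None


def _same_format(a: Run, b: Run) -> bool:
    """True if two cleaned runs share all preserved properties."""
    return (a.bold, a.italic, a.underline, a.lang) == (
        b.bold, b.italic, b.underline, b.lang)


def clean_runs(runs: List[Run], target_lang: str) -> List[Run]:
    """Keep only bold/italic/underline, set the language, merge neighbours."""
    cleaned: List[Run] = []
    for run in runs:
        if not run.text:
            continue  # empty runs carry no content
        # Rebuild the run with only the preserved properties and the language.
        kept = Run(text=run.text, bold=run.bold, italic=run.italic,
                   underline=run.underline, lang=target_lang)
        if cleaned and _same_format(cleaned[-1], kept):
            # Same formatting as the previous run: join the text.
            cleaned[-1] = replace(cleaned[-1], text=cleaned[-1].text + kept.text)
        else:
            cleaned.append(kept)
    return cleaned


if __name__ == "__main__":
    ocr_runs = [
        Run("Hel", font_size=11.5),
        Run("lo ", font_size=12.0, spacing=0.3),
        Run("world", bold=True, font_size=12.0),
    ]
    for run in clean_runs(ocr_runs, "nl-NL"):
        print(repr(run.text), run.bold, run.lang)
```

In this model, the two plain fragments that the OCR software split because of slightly different font sizes become a single run, while the bold word stays separate. Font size is discarded, which matches the remark that you may have to reassign it manually.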

You may have to reassign the font size for the document manually. Perhaps in a later version the macro will do this for you too.

Examples

Simulated OCR'd document in Word:

1.png

Cleaned OCR'd document in Word:

2.png

Simulated OCR'd document in CafeTran:

a.png
b.png

Cleaned OCR'd document in CafeTran:

c.png
d.png

More examples

Before running the macro:

1a.png

and:

2a.png

After running the macro:

3a.png

and:

4a.png