Manga OCR And Language Analysis

Our purpose is to OCR manga and perform a language analysis of the dialects used in them. This lets us examine not only the significance of dialects in the day-to-day lives depicted in the manga, but also the different types of dialect and how understandable they are.

In this project we tried to use OCR to identify and compare dialects of Japan, focusing on a few manga. Our search isolated two manga: one in Kansai-ben (Love Com) and one in Gotō-ben (Barakamon). We limited ourselves to these two dialects, first because finding manga that contain enough dialect is tedious, and second because dialects are not evenly represented in the manga world; Kansai-ben is the most common. After scanning several chapters from both manga, we picked out the most frequent inflections, words and expressions and compared them to standard Japanese. As will be shown, OCR-ing Japanese is an error-prone task: if the scans are not of excellent quality, character recognition will not be as effective. To work around these complications we combined several methods, mainly Tesseract and Capture2Text, with manual input as a last resort. Once the scanning process was completed, we tried to identify the specific features of each dialect.
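The workflow described above (OCR a page, then look for recurring dialect expressions) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the use of the pytesseract wrapper, the function names `ocr_page` and `count_markers`, and the short `KANSAI_MARKERS` list are all assumptions made for the example. The OCR step assumes that Tesseract, pytesseract, Pillow and the `jpn_vert` language data are installed.

```python
from collections import Counter

# Hypothetical marker list, for illustration only. These are common
# Kansai-ben expressions, not the list used in the project.
KANSAI_MARKERS = ["あかん", "ほんま", "めっちゃ", "やねん"]


def ocr_page(image_path):
    """OCR one scanned page of vertical Japanese text.

    Assumption: pytesseract and Pillow are installed and Tesseract has
    the jpn_vert (vertical Japanese) language data available.
    """
    import pytesseract
    from PIL import Image

    # --psm 5 asks Tesseract to treat the image as a single uniform
    # block of vertically aligned text.
    return pytesseract.image_to_string(
        Image.open(image_path), lang="jpn_vert", config="--psm 5"
    )


def count_markers(text, markers):
    """Count non-overlapping occurrences of each marker in the text."""
    # OCR output often contains stray spaces and line breaks inside
    # words, so all whitespace is removed before matching.
    cleaned = "".join(text.split())
    counts = Counter()
    for marker in markers:
        n = cleaned.count(marker)
        if n:
            counts[marker] = n
    return counts


# Made-up sample text standing in for OCR output.
sample = "ほんまに\nあかんて。めっちゃ 好きやねん。ほんまやで。"
print(count_markers(sample, KANSAI_MARKERS).most_common())
```

Plain substring counting is deliberately naive: it cannot tell a dialect form from an identical string inside another word, so results would still need to be checked by hand against a reference on the dialect.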

Even though we are Japanology students, our understanding of Japanese dialects is limited, so we had to back up our findings with sources, which helped us pick out the inflections specific to each dialect.

The project was technically challenging: as stated above, OCR and kanji do not pair well. There is certainly a lot of potential in OCR for Asian languages, and this project was a good foot in the door.

