This topic is semi-related to my other topic about converting lines, but this time it's about text. I've got some text that is actually a gigantic number of paths. I would like to convert that into real text, to make it editable and maintainable. It should also decrease the file size dramatically.
So basically it's a kind of OCR, but from vectors rather than pixels... But here's the catch: there are Japanese Kanji characters in it as well that I also need converted. My personal understanding of their character set is very limited (I can read and write the Kana quite easily, but Kanji is a whole new beast to conquer), which makes "re-typing" everything into text objects a pain in the rear end. Doable, but very, very lengthy.
I wonder if this can be done more-or-less automatically. Even a rough conversion would be most helpful, something that takes away the bulk of the work... Any advice?
Converting paths to real text?
Re: Converting paths to real text?
Hmm, OCR for vector images... That might actually give you some advantages over OCR for raster images. I'm not quite sure how OCR works (I've only read some stuff on hidden Markov models, or the application of wavelets for character recognition), so I can't say if there are any advantages and if so, which ones.
Anyway, it's going to be quite difficult to accomplish your purpose. There are just so many Kanji! Perhaps you could limit the Kanji you check to about 2000, the most used ones? I'm talking about these ones: Jōyō kanji.
By the way, I just started learning Japanese myself. Indeed, the Kana are quite easy, but Kanji take a lot of time. Perhaps you can use Jack Halpern's approach to look up Kanji? Link for Amazon (note that there will be a new version in 2013). It's my main source for looking up unknown Kanji! It even has a page on Wikipedia. Another way is by selecting the radicals present in a Kanji, like on this interactive website.
Finally, a quick Google search gave me this topic on Stack Overflow: click. Of the mentioned OCR engines, I've only used Tesseract a few times (for Dutch and English). You would have to train it in order to recognize Kanji (and that would take quite some time because of their number).
I would, however, advise implementing something yourself that uses one of the two lookup methods above (Halpern's SKIP or lookup by radical). Then it would boil down to recognizing the components/parts/elements of a single Kanji, and subsequently looking it up in a database. How much programming experience do you have?
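To make the "recognize the components, then look it up" idea a bit more concrete, here is a minimal sketch in Python. It assumes the hard part (recognizing which components appear in a glyph) has already been done somehow. The tiny `COMPONENTS` table and the function name `lookup_by_components` are made up purely for illustration; a real tool would load a full radical decomposition data set instead.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical, hand-made table: Kanji -> set of components it contains.
# Only a handful of entries, for illustration; not a real database.
COMPONENTS: Dict[str, FrozenSet[str]] = {
    "明": frozenset({"日", "月"}),
    "時": frozenset({"日", "土", "寸"}),
    "休": frozenset({"人", "木"}),
    "林": frozenset({"木"}),
    "好": frozenset({"女", "子"}),
    "男": frozenset({"田", "力"}),
}


def lookup_by_components(
    recognized: Iterable[str],
    table: Dict[str, FrozenSet[str]] = COMPONENTS,
) -> List[str]:
    """Return every Kanji in the table that contains all recognized components.

    Candidates with the fewest unmatched components come first, so an
    exact match ends up at the top of the list.
    """
    wanted = frozenset(recognized)
    matches = [kanji for kanji, parts in table.items() if wanted <= parts]
    return sorted(matches, key=lambda kanji: (len(table[kanji] - wanted), kanji))


if __name__ == "__main__":
    print(lookup_by_components(["日"]))        # every entry containing 日
    print(lookup_by_components(["日", "月"]))  # narrows the candidates down
```

Even a partial recognition (only one or two components found) narrows the candidates a lot, so you would only have to pick from a short list by eye instead of searching the whole character set.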
Anyway, I'll regularly check this thread for updates, since the topic certainly has my interest - reading about OCR has been on my to-do list for quite some time now. There are just so many applications (e.g. recognizing handwritten equations and converting them into LaTeX code, recognizing sheet music and converting it to MIDI (check out the Audiveris project)...)! Oh, and let's also include Inkscape in the process.
[Edit]: Could you post a small example (perhaps a part of your main file)? Thanks!
Re: Converting paths to real text?
Sorry for not replying for a while, I am (and have been) working on some other things.
Thank you very much for your extensive post, Ailurus. I have read through it, and there are certainly some promising things that I can look at. I will post the results here when I have them.
I'm going on a four-week holiday in two weeks or so (to Japan, no less!), which means I might not get around to this until I'm back.