Unicode -- but non UTF-8

Dave Garbutt

27 Feb, 2014 06:33 PM

Hi,

I have an OCR program that claims to create 'Unicode' text, but its files look empty when previewed in MMC (and in Marked, as it happens). This was very confusing until I opened one in TextWrangler and saw all the nulls, which suggests it is UTF-16 of some kind.
Now that I know, I can convert to UTF-8 with TW, but could UTF-16 be allowed in MMC? Or flagged as invalid, or at least shown as corrupted text?
The attached screenshot shows how the file appears; the letters "UN" are two I retyped by hand.

Thanks for a great app, by the way.

(The OCR also swallows paragraph marks and turns LF into three blank lines. Some way to fix issues like that would be convenient :-)

  1. Support Staff 1 Posted by Fletcher on 27 Feb, 2014 07:25 PM

    Dave,

    Thanks for writing in.

    Encoding is tricky. There's apparently no fool-proof way to
    determine the encoding of an unknown source. The BOM that is
    (sometimes) included serves as a label, but not every app writes one.

    I still don't understand why UTF-16 exists (or why it's used), but
    that's another discussion, and likely more philosophical than factual.

    Be sure that the app you are using writes a BOM when it saves
    UTF-16. Otherwise, you'll likely always have trouble.

    That said, I did just revisit that code in Composer, and slightly
    changed the order of the file tests I run. This should allow it to be
    more accurate when trying to determine whether a file is UTF-16. But
    the BOM has to be there to get things working correctly. The next
    release will include this.

    Otherwise, you'll have to use programs like TextWrangler and experiment
    to determine which encoding was used so you can convert it to a proper
    encoding.

    And from what I understand, you'll always be better off using UTF-8. It
    sounds like there may be rare circumstances where UTF-16 is more
    efficient for a programmer, but for a user UTF-8 seems to be handled
    much more reliably across applications.

    Fletcher
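
For anyone who would rather script the conversion than do it by hand in TextWrangler, the BOM check described in the reply above can be sketched in Python. This is only an illustration under stated assumptions, not Composer's actual code: `sniff_bom` and `to_utf8` are hypothetical helper names, and a file without a BOM is simply assumed to be UTF-8 already, which matches the caveat that the BOM has to be there for detection to work.

```python
import codecs

# Known BOMs mapped to a Python codec that consumes the BOM while decoding.
# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM starts with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None


def to_utf8(data: bytes, fallback: str = "utf-8") -> bytes:
    """Decode using the BOM if present (assumption: otherwise use
    `fallback`), then re-encode as UTF-8 without a BOM."""
    encoding = sniff_bom(data) or fallback
    return data.decode(encoding).encode("utf-8")
```

To convert a file, read it in binary mode, for example `open(path, "rb").read()`, pass the bytes to `to_utf8`, and write the result back out in binary mode. A BOM-less UTF-16 file will either raise a decode error or come through with its null bytes intact, so it still needs the manual experimentation mentioned above.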
