How to detect if a text file is ISO8859-1, ISO8859-15, UTF-8 or Unicode encoded

**Karl Mondale**

Assume I have a text file. How can I detect if the text inside is encoded in

ISO8859-1
ISO8859-15
UTF-8
UniCode

Karl

**Pegasus [MVP]** · January 22nd 10, 10:51 AM posted to microsoft.public.windowsxp.help_and_support

"Karl Mondale" said this in news item
...
Assume I have a text file. How can I detect if the text inside is encoded
in

ISO8859-1
ISO8859-15
UTF-8
UniCode

Karl

I would check Google or Wikipedia, e.g. here:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1. It explains the whole code in
detail. To find out programmatically you can start by reading the first few
bytes, which may hold a byte order mark. If there is none, you have to
examine the content itself. The exact method depends on the tool you wish
to use.

**Tim Slattery** · January 22nd 10, 01:56 PM posted to microsoft.public.windowsxp.help_and_support

(Karl Mondale) wrote:

Assume I have a text file. How can I detect if the text inside is encoded in

ISO8859-1
ISO8859-15
UTF-8
UniCode

You can't really, except by minutely examining the contents and seeing
whether there's something that makes sense in one encoding but not in
another. Even then you may not be sure what the creator intended, and it
might not matter anyway.

In UTF-8, for example, characters in the 7-bit ASCII set are given in a
single byte. (The Unicode code points for those characters are the same
as the 7-bit ASCII codes, just with a bunch of zeroes in front.) Other
Unicode characters are expressed in two to four bytes. So if the
entire file consists of 7-bit ASCII characters, the file will be
exactly the same whether UTF-8 or ASCII was intended.
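
To make that point concrete, here is a minimal Python sketch. The sample string and the helper name `is_pure_ascii` are made up for illustration. It shows that 7-bit text encodes to the same bytes in ASCII, UTF-8 and both ISO encodings, so such a file gives no clue about which one was intended.

```python
# Pure 7-bit ASCII text produces identical bytes in all of these encodings.
text = "Plain English text, 7-bit only."
encodings = ["ascii", "utf-8", "iso-8859-1", "iso-8859-15"]
encoded = {name: text.encode(name) for name in encodings}

# True when every encoding produced exactly the same byte string.
all_same = len(set(encoded.values())) == 1


def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is below 0x80.

    For such data the four encodings above cannot be told apart.
    """
    return all(b < 0x80 for b in data)


print(all_same, is_pure_ascii(encoded["utf-8"]))
```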

--
Tim Slattery

http://members.cox.net/slatteryt

**Paul Randall** · January 22nd 10, 05:03 PM posted to microsoft.public.windowsxp.help_and_support

The short answer is that you can't always determine the encoding from the
content of a file.

To see why, you can use Notepad to experiment with creating and saving text
as ANSI, Unicode, Unicode Big Endian, and UTF-8. Try pasting in some
text from foreign web pages, as well as plain English text. Looking at the
files in a hex editor, like XVI32, you will see that for all but ANSI,
Notepad prepends a few bytes (called a Byte Order Mark, or BOM) to indicate
the type of text file. For Unicode, it is the two-byte sequence (hex) FF FE
for little endian or FE FF for big endian; for UTF-8 it is the three bytes
EF BB BF. Not all applications prepend a BOM. ANSI and your two ISO
encodings always use one byte per character. Unicode as Notepad saves it
(UTF-16) uses two bytes for most characters and four bytes (a surrogate
pair) for the rest, while UTF-32 always uses four bytes per character.
UTF-8 uses a variable number of bytes per character (one to four) and can
encode every Unicode character. When saving as ANSI, Notepad complains if
any character can't be saved as a one-byte character.
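
As a rough sketch of the BOM check in Python: the function name `detect_bom` is my own, and the BOM constants come from the standard `codecs` module. It only tells you something when the writing application actually prepended a BOM; `None` means "no BOM", not "not Unicode".

```python
import codecs

# Order matters: the UTF-32 LE mark (FF FE 00 00) begins with the same two
# bytes as the UTF-16 LE mark (FF FE), so the longer marks are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

To try it on a file, read the first four bytes in binary mode, e.g. `detect_bom(open(path, "rb").read(4))`, where `path` is whatever file you want to inspect. One caveat: a UTF-16 little endian file whose first character after the BOM is U+0000 would be reported as UTF-32, which should be rare in real text.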

-Paul Randall

"Karl Mondale" wrote in message
...
Assume I have a text file. How can I detect if the text inside is encoded
in

ISO8859-1
ISO8859-15
UTF-8
UniCode

Karl

**Carmel[_2_]** · January 22nd 10, 08:15 PM posted to microsoft.public.windowsxp.help_and_support

"Karl Mondale" wrote:

Assume I have a text file. How can I detect if the text inside is encoded in

ISO8859-1
ISO8859-15
UTF-8
UniCode

Karl

Microsoft doesn't distribute a utility that can accomplish that feat easily.
If you can get your file transferred to a FreeBSD or Linux system, you could
use either 'file' or 'enca' to determine its properties.

MAN pages:

http://unixhelp.ed.ac.uk/CGI/man-cgi?file
http://linux.die.net/man/1/enca

--
Carmel

Never forget: 2 + 2 = 5 for extremely large values of 2.

**John Wunderlich** · January 22nd 10, 08:56 PM posted to microsoft.public.windowsxp.help_and_support

(Karl Mondale) wrote in
:

Assume I have a text file. How can I detect if the text inside is
encoded in

ISO8859-1
ISO8859-15
UTF-8
UniCode

Karl

That's rather difficult.

ISO8859-1 is almost identical to -15; -15 replaces eight characters,
putting the Euro symbol at 0xA4 and adding a few letters used in French
and Finnish (Œ, œ, Ÿ, Š, š, Ž, ž). The only way to tell them apart would
be to look at those symbols in context.
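
If you want to see exactly which byte values differ, a small Python sketch like this lists them using the standard codecs (the variable names are just for illustration). Every other byte decodes to the same character in both encodings, which is why only context can decide.

```python
import unicodedata

# Collect every byte value that decodes differently in the two encodings.
differences = []
for value in range(0xA0, 0x100):
    raw = bytes([value])
    latin1 = raw.decode("iso-8859-1")
    latin9 = raw.decode("iso-8859-15")
    if latin1 != latin9:
        differences.append((value, latin1, latin9))

# Print the Unicode character names so the output stays plain ASCII.
for value, latin1, latin9 in differences:
    print(f"0x{value:02X}: {unicodedata.name(latin1)} "
          f"-> {unicodedata.name(latin9)}")
```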

UTF-8 is identical to ISO8859 for the first 128 ASCII characters, which
include all the standard keyboard characters. After that, characters
are encoded as a multi-byte sequence of two to four bytes, each with the
high bit set.
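
Because those multi-byte sequences follow strict rules, one practical test is simply to try a strict UTF-8 decode. Here is a minimal Python sketch; the function name `looks_like_utf8` is hypothetical. Text in ISO8859-1 or -15 that contains accented characters will almost always fail it, while pure ASCII passes trivially.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8.

    A True result for non-ASCII data is strong evidence of UTF-8, since
    ISO 8859 text rarely happens to form valid multi-byte sequences.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Pass the whole file rather than a fixed-size sample, since cutting a multi-byte character in half at the end of a sample would give a false negative.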

On Windows, "Unicode" usually means UTF-16. If you're lucky, there might
be a BOM (Byte Order Mark) of 0xFF 0xFE (little endian) or 0xFE 0xFF (big
endian) as the first two bytes in the file. Otherwise, look for a 0x00
(null) as every other byte if the text file contains basic 7-bit ASCII
characters.
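
The null-byte check could be sketched in Python roughly as follows. The function name, the 4096-byte sample size and the 0.8 / 0.1 thresholds are arbitrary assumptions of mine, not standard values, and the heuristic only works for text that is mostly 7-bit ASCII characters.

```python
def guess_utf16_without_bom(data: bytes, threshold: float = 0.8):
    """Guess UTF-16 byte order from the position of 0x00 bytes.

    Returns "utf-16-le", "utf-16-be", or None when there is no clear
    pattern. Thresholds are illustrative and may need tuning.
    """
    sample = data[:4096]  # even length keeps the 2-byte units aligned
    if len(sample) < 2:
        return None
    even = sample[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = sample[1::2]   # bytes at offsets 1, 3, 5, ...
    even_zero = even.count(0) / len(even)
    odd_zero = odd.count(0) / len(odd)
    # ASCII text in little endian: the high (zero) byte comes second.
    if odd_zero >= threshold and even_zero < 0.1:
        return "utf-16-le"
    # ASCII text in big endian: the high (zero) byte comes first.
    if even_zero >= threshold and odd_zero < 0.1:
        return "utf-16-be"
    return None
```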

HTH,
John
