#1
How to detect if a text file is ISO8859-1,ISO8859-15,UTF-8 or UniCode encoded
Assume I have a text file. How can I detect whether the text inside is encoded in ISO8859-1, ISO8859-15, UTF-8 or UniCode?

Karl
#2
"Karl Mondale" said this in news item ...
Assume I have a text file. How can I detect if the text inside is encoded in ISO8859-1 ISO8859-15 UTF-8 UniCode
Karl

I would check Google or Wikipedia, e.g. http://en.wikipedia.org/wiki/ISO/IEC_8859-1. It explains the whole encoding in detail. To find out programmatically you need to read the first few bytes of the file. The exact method depends on the tool you wish to use.
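
As a rough illustration of "read the first few bytes", here is a minimal Python sketch that looks for a byte order mark at the start of a file. The function names `sniff_bom` and `sniff_file` and the returned labels are made up for this example, and the approach only works when the writing program actually added a BOM; a result of `None` says nothing about the encoding.

```python
import codecs

# Known byte order marks. The UTF-32 entries come first because the
# UTF-32 LE mark starts with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Read just enough of the file to cover the longest BOM (4 bytes)."""
    with open(path, "rb") as f:
        return sniff_bom(f.read(4))
```

Calling `sniff_file("some.txt")` on a file saved by Notepad as "Unicode" would be expected to report `utf-16-le`; a plain ISO 8859 file has no such marker and yields `None`.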
#4
The short answer is that you can't always determine the encoding from the content of a file.

To see why, you can use Notepad to experiment with creating and saving text as ANSI, Unicode, Unicode Big Endian, and UTF-8. Try pasting in some text from foreign web pages, as well as plain English text. Looking at the files in a hex editor, like XVI32, you will see that for all but ANSI, Notepad prepends a few bytes (called a Byte Order Mark) to indicate the type of text file. For Unicode (UTF-16), it is the two-byte sequence (hex) FF FE for little endian or FE FF for big endian; for UTF-8 it is the three bytes EF BB BF. Not all applications prepend a BOM.

ANSI and your two ISO encodings always use one byte per character. Notepad's Unicode (UTF-16) uses two bytes per character for most characters and four bytes (a surrogate pair) for the rest, while UTF-32 always uses four bytes per character. UTF-8 uses a variable number of bytes per character (one to four) and can encode every Unicode character. When saving as ANSI, Notepad warns you if some characters can't be saved as one-byte characters.

-Paul Randall

"Karl Mondale" wrote in message ...
Assume I have a text file. How can I detect if the text inside is encoded in ISO8859-1 ISO8859-15 UTF-8 UniCode
Karl
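
To make the "you can't always determine it" point concrete, here is a small Python sketch. It assumes a strict UTF-8 decode is an acceptable validity test, and the helper name `looks_like_utf8` is invented for this example. The same bytes that fail as UTF-8 decode without error as both ISO 8859-1 and ISO 8859-15, so nothing in the file says which of the two was meant.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes are valid UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")  # decoding is strict by default
        return True
    except UnicodeDecodeError:
        return False


# The euro sign is the single byte 0xA4 in ISO 8859-15.
sample = "10 \u20ac".encode("iso-8859-15")

print(looks_like_utf8(sample))        # False: a lone 0xA4 is not valid UTF-8
print(sample.decode("iso-8859-1"))    # 10 ¤  (0xA4 is the currency sign here)
print(sample.decode("iso-8859-15"))   # 10 €
```

Both single-byte decodes succeed for any input, which is why telling the two ISO encodings apart needs outside knowledge or a guess based on which result looks more like real text.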
#5
"Karl Mondale" wrote:
Assume I have a text file. How can I detect if the text inside is encoded in ISO8859-1 ISO8859-15 UTF-8 UniCode
Karl

Microsoft doesn't distribute a utility that can accomplish that feat easily. If you can get your file transferred to a FreeBSD or Linux system, you could use either 'file' or 'enca' to determine its properties.

MAN pages:
http://unixhelp.ed.ac.uk/CGI/man-cgi?file
http://linux.die.net/man/1/enca

--
Carmel
Never forget: 2 + 2 = 5 for extremely large values of 2.
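
If the 'file' utility is available, a script can ask it for its guess. The following Python sketch assumes a reasonably recent 'file' that supports the `--mime-encoding` option; the wrapper name `file_mime_encoding` is made up for this example, and the result is only the tool's heuristic guess, not a guarantee.

```python
import shutil
import subprocess


def file_mime_encoding(path):
    """Return the encoding guessed by the Unix `file` tool, or None.

    None is returned when `file` is not installed or the call fails.
    """
    exe = shutil.which("file")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-b", "--mime-encoding", path],  # -b: omit the file name from the output
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

On a system without 'file' (a stock Windows install, for instance) the function simply returns `None`.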