#1
Strange characters
Look at this:

neat� nine lives minus 1.

I know the guy who replied to the email didn't do this, and it appears on all his replies regardless of the subject. What he wrote was "neat nine lives minus 1" in a reply to a link to a headline I sent about a cat going over a dam and being rescued. What's up?
#2
In message , swalker
writes:
> Look at this neat� nine lives minus 1. I know the guy who replied to the email didn't do this and it appears on all his replies regardless of the subject. What he wrote was "neat nine lives minus 1" in a reply to a link to a headline I sent about a cat going over a dam and being rescued. What's up?

Note the double space after "neat". I've noticed this a lot recently - presumably the text comes from some modern software. The first space is not really a space, but some non-ASCII character, which some part of the 'net is converting into two bytes. (It may even be starting out as two bytes.) There's NO need for it: even if he's convinced he should put two spaces after a full stop, there's no reason one of them has to be a funny-space (they look the same to him after all).

You'll have to live with it. At best, you _might_ convince the guy not to use the offending software, but he's unlikely to be bothered: your first hurdle will be to convince him there's any problem, as he won't be seeing it at his sending end, followed by - in most cases - strong reluctance on his part to do anything about it, for the same reason (and he'll add "nobody else is complaining").

--
J. P. Gilliver. UMRA: 1960/1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

A true-born Englishman does not know any language. He does not speak English too well either but, at least, he is not proud of this. He is, however, immensely proud of not knowing any foreign languages. (George Mikes, "How to be Inimitable" [1960].)
#3
On 1/23/2019 8:04 AM, swalker wrote:
> Look at this neat� nine lives minus 1. I know the guy who replied to the email didn't do this and it appears on all his replies regardless of the subject. What he wrote was "neat nine lives minus 1" in a reply to a link to a headline I sent about a cat going over a dam and being rescued. What's up?

Either your friend used a Microsoft E-mail application to reply to you, or else he composed his reply with Word and then copied the reply into his E-mail application.

The strange characters represent what happens to so-called smart quotes -- favored by Microsoft -- when used in plain-text compositions. This also happens with other non-plain-text characters. See my http://www.rossde.com/malaprops/writing_internet.html. The box on the right gives several examples.

--
David E. Ross

Trump again proves he is a major source of fake news. He wants to cut off disaster funds to repair the damage caused by the Woolsey Fire in southern California because he claims the state fails to manage its forests properly. The Woolsey Fire was NOT a forest fire. Starting in an industrial tract, it did not burn through any forests. See http://www.rossde.com/fire.html.
#4
On Wed, 23 Jan 2019 10:04:42 -0600, swalker wrote:
> Look at this neat� nine lives minus 1.

Which part? The "�" or the whole phrase?

> I know the guy who replied to the email didn't do this and it appears on all his replies regardless of the subject. What he wrote was "neat nine lives minus 1" in a reply to a link to a headline I sent about a cat going over a dam and being rescued. What's up?

Likely

(a) The guy who replied is using a program which does not support Unicode UTF-8 character encoding. The guy also sees the "�" before he sends the message.

or

(b) The guy's message is missing the "Content-Type:" header to say that the message is using UTF-8. The guy does not see the "�", but you do.

--
Kind regards
Ralph
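For anyone wondering what case (b) looks like in practice, here is a minimal sketch using Python's standard `email` package. It only shows a message that does declare its charset; the subject line and body text are made up for illustration, and nothing here is taken from the actual email in question.

```python
from email.message import EmailMessage

# Build a plain-text message whose body contains a non-breaking space (U+00A0).
msg = EmailMessage()
msg["Subject"] = "Re: cat rescued at dam"  # hypothetical subject
msg.set_content("neat\u00a0 nine lives minus 1", charset="utf-8")

# The charset parameter of the Content-Type header is what tells the
# receiving program how to turn the body bytes back into characters.
content_type = msg["Content-Type"]
declared = msg.get_content_charset()

print(content_type)
print(declared)
```

If that charset parameter is missing or wrong, the receiving program has to guess, and a wrong guess is one way to end up with stray characters on the reader's side only.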
#5
In message , Wolf K
writes:
[]
> In short: Email clients still do not all adhere to the same standards.

Or, alternatively: people are using strange characters when there's no need for them. (In this case, a "special" space [or possibly punctuation] character, where its use serves absolutely no purpose.)

> Why am I not surprised?

Because you have your hobby-horse, same as I do. (They happen to be galloping in opposite directions.)

--
J. P. Gilliver. UMRA: 1960/1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

A true-born Englishman does not know any language. He does not speak English too well either but, at least, he is not proud of this. He is, however, immensely proud of not knowing any foreign languages. (George Mikes, "How to be Inimitable" [1960].)
#6
Wolf K wrote:
> On 2019-01-23 13:06, Ralph Fox wrote:
>> On Wed, 23 Jan 2019 10:04:42 -0600, swalker wrote:
>>> Look at this neat� nine lives minus 1.
>>
>> Which part? The "�" or the whole phrase?
>>
>>> I know the guy who replied to the email didn't do this and it appears on all his replies regardless of the subject. What he wrote was "neat nine lives minus 1" in a reply to a link to a headline I sent about a cat going over a dam and being rescued. What's up?
>>
>> Likely
>> (a) The guy who replied is using a program which does not support Unicode UTF-8 character encoding. The guy also sees the "�" before he sends the message.
>> or
>> (b) The guy's message is missing the "Content-Type:" header to say that the message is using UTF-8. The guy does not see the "�", but you do.
>
> In short: Email clients still do not all adhere to the same standards.
>
> Why am I not surprised?

You mean an email client with some pretense at HTML support? Or the usage of an email client that simply doesn't have modern character support (better than ASCII support).

6E 65 61 74 EF BF BD 20 6E 69 6E 65
 n  e  a  t          sp  n  i  n  e

I'm not good at these alphabet puzzles, and this is what I could find.

https://stackoverflow.com/questions/...his-byte-array

"The original byte array is not encoded as UTF-8. The StreamReader therefore replaces each invalid byte with the replacement character U+FFFD. When that character gets encoded back to UTF-8, this results in the byte sequence EF BF BD. You cannot construct the original byte value from the string because the information is completely lost."

So it's some sort of transformation, somewhere along the way.

Paul
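The transformation described in that quote can be reproduced in a few lines of Python. This is only a sketch: it assumes, as suggested elsewhere in the thread, that the original text had a single byte 0xA0 after "neat"; the real original byte is not known.

```python
# Assumed original bytes: "neat", a lone 0xA0 byte, a space, "nine".
original = b"neat\xa0 nine"

# Decoding as UTF-8 with errors="replace" swaps each invalid byte for U+FFFD.
decoded = original.decode("utf-8", errors="replace")

# Encoding the result back to UTF-8 turns U+FFFD into the bytes EF BF BD.
reencoded = decoded.encode("utf-8")
hex_dump = reencoded.hex(" ").upper()

print(hex_dump)  # 6E 65 61 74 EF BF BD 20 6E 69 6E 65
```

The output matches the hex dump above, and as the quote says, nothing in the result reveals which byte was replaced.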
#7
On Wed, 23 Jan 2019 10:04:42 -0600, swalker wrote:
> Look at this neat� nine lives minus 1. I know the guy who replied to the email didn't do this and it appears on all his replies regardless of the subject. What he wrote was "neat nine lives minus 1" in a reply to a link to a headline I sent about a cat going over a dam and being rescued. What's up?

Thanks for the replies.
#8
"swalker" wrote
| neat� nine lives minus 1.
|
| What's up?

UTF-8 encoding. You can either switch to reading in UTF-8, if your email program has that option, ask them not to write in UTF-8, or ignore it.

Unfortunately, UTF-8 is a big fad these days. It's a handy way to render different languages in webpages, but for English speakers it's irrelevant. Yet it's used a lot for things like non-breaking spaces and curly quotes -- neither of which is necessary in email or webpages.

The way it works: Basic ASCII encoding represents each character as a byte, up to 127. That allows for the English alphabet and punctuation. When more languages were needed, ANSI came along. ANSI is the same as ASCII up to 127. Bytes 128 to 255 are represented depending on the local language settings. (Known as a codepage.) Where you see an N with a tilde, a Russian will see a Russian character.

That worked OK, but it's not good enough for international. Japanese, Chinese, Korean require far more characters. UTF-8 is one solution, which has now become default for webpages. UTF-8 is the same as ASCII, up to 127, but after that the upper bytes are used in a system of multi-byte designations, up to 4 bytes per character. That allows for extensive character representation while still using plain text to do it and not using null bytes, which would complicate things greatly.

That's the story in a nutshell. It's complicated. But the long and the short of it is that whenever you see text that's normal except for the odd capital A with a caret, upside down question mark, weird double brackets, etc, what you're seeing is UTF-8 text being rendered as ANSI using the English codepage. And most of what's messed up is probably just supposed to be quotes and spaces.

What you see is the characters in English for bytes 239-191-189 or hexadecimal EF BF BD. Ironically, they represent a single "replacement character" in UTF-8. In other words, if you rendered it properly you'd see a question mark in a diamond, or maybe just a square, indicating that the character can't be rendered.
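The codepage and multi-byte points above can be checked with a short Python sketch. The choice of codepages (Windows-1252 for "English", Windows-1251 for Russian) and the sample characters are my assumptions, picked only to illustrate the explanation.

```python
# One byte, two codepages: 0xD1 is "N with a tilde" in Windows-1252
# but a Cyrillic capital letter in Windows-1251.
byte = bytes([0xD1])
western = byte.decode("cp1252")
cyrillic = byte.decode("cp1251")

# UTF-8 uses 1 to 4 bytes per character; plain ASCII stays at one byte.
samples = ["a", "\u00d1", "\u65e5", "\U0001f98a"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# The replacement character U+FFFD is the three bytes 239-191-189 in UTF-8.
replacement_bytes = "\ufffd".encode("utf-8")

# Read through the English codepage, those three bytes become three characters.
mojibake = replacement_bytes.decode("cp1252")

print(ascii(western), ascii(cyrillic))
print(utf8_lengths)
print(list(replacement_bytes), ascii(mojibake))
```

The `ascii()` calls are there only so the script prints safely on consoles that cannot show the characters themselves.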
#9
"Paul" wrote
| https://stackoverflow.com/questions/...his-byte-array
|
| "The original byte array is not encoded as UTF-8.
|
| The StreamReader therefore replaces each invalid byte
| with the replacement character U+FFFD.
|

In that example, though, the original stream can't be rendered as UTF-8. It's got multiple null byte groupings, which can only be valid as non-text binary data. The function he was using did the best it could. But that's the first I've heard of using replacement characters. That probably only happened because the function was asked to render it. Usually UTF-8 will simply render with the local codepage where it's not supported, which is what happened.

The mystery is why/what replaced whatever was there originally. It looks like the original might have been character 160, a non-breaking space. But that's a continuation byte in UTF-8. Being alone, following an ASCII-range character, would be invalid. But why would the sender's software say, "This is messed up so to fix it we'll mess it up some more."? Strange. Maybe the sender pasted valid HTML into a fascist email program?
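The point about byte 160 being a continuation byte can be sketched in Python. The helper name `is_continuation_byte` is made up for this example, and the sample bytes assume the original text was "neat" followed by a lone 0xA0.

```python
def is_continuation_byte(b: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx (hypothetical helper)."""
    return (b & 0b1100_0000) == 0b1000_0000


lone_nbsp = b"neat\xa0 nine"  # assumed original: byte 160 on its own

# A strict UTF-8 decode refuses the lone continuation byte.
bad_offset = None
try:
    lone_nbsp.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    bad_offset = exc.start  # index of the offending byte

# Preceded by the lead byte 0xC2, the same 0xA0 is a valid non-breaking space.
valid = b"neat\xc2\xa0 nine".decode("utf-8")

print(is_continuation_byte(0xA0), strict_ok, bad_offset)
```

So a decoder that is asked to treat the text as UTF-8 has to either fail or substitute something at that position.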
#10
In message , Mayayana
writes:
> "Paul" wrote
> | https://stackoverflow.com/questions/...his-byte-array
> |
> | "The original byte array is not encoded as UTF-8.
> |
> | The StreamReader therefore replaces each invalid byte
> | with the replacement character U+FFFD.
> |
>
> In that example, though, the original stream can't be rendered as UTF-8. It's got multiple null byte groupings, which can only be valid as non-text binary data. The function he was using did the best it could. But that's the first I've heard of using replacement characters. That probably only happened because the function was asked to render it. Usually UTF-8 will simply render with the local codepage where it's not supported, which is what happened.
>
> The mystery is why/what replaced whatever was there originally. It looks like the original might have been character 160, a non-breaking space. But that's a continuation byte in UTF-8. Being alone, following an ASCII-range character, would be invalid. But why would the sender's software say, "This is messed up so to fix it we'll mess it up some more."? Strange. Maybe the sender pasted valid HTML into a fascist email program?

Leaving aside the coding complexities: a non-breaking space followed by a normal space, as this sort of thing often is, is rather pointless ... (-:

--
J. P. Gilliver. UMRA: 1960/1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

If you help someone when they're in trouble, they will remember you when they're in trouble again.
#11
"J. P. Gilliver (John)" wrote
| Leaving aside the coding complexities: a non-breaking space followed by
| a normal space, as this sort of thing often is, is rather pointless ...

Not in HTML. A browser will ignore multiple spaces. "a    b" will display the same way as "a b". Only one space is rendered because in general spaces are ignored as part of how HTML works. So if you want to show more space you use non-breaking space characters. The standard is &nbsp;. The raw character, byte 160, also does it if the page is written for the English codepage. In UTF-8 it's encoded with the bytes C2 A0, or 194-160. So if you open a UTF-8 webpage as non-UTF-8 you often see lots of Â. Microsoft pages are like that. To fix them you can just delete every Â because the next character is the non-breaking space character in English ANSI, 160. (I like to do that so that I can store files as ANSI, which is still the standard with most computer files. And some editors don't recognize UTF-8. Also, there's no absolute ID for UTF-8. So for most people it's just a hassle.)

You can also see a lot of non-breaking spaces in webpage code, usually in the form of &nbsp;. (There are a number of standards like that, which browsers know to render. Similarly, &bull; will render a bullet. As will the raw byte 149 [ANSI] or the character • [UTF-8].)

In the early days &nbsp; was used for layout. Programs like Front Page did it. That and many other programs were "WYSIWYG". They allowed creating a webpage visually, by typing, dragging images, etc. But to do that they had to pull off a lot of hacks behind the scenes, in the actual webpage code. If you typed "left" and then hit the space bar until you were on the other side of the page, then typed "right", Front Page would create code like so:

left&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;right

This all gets slightly more complicated because, believe it or not, the whole issue is politicized. UTF-8 is politically correct, to begin with, because it supports diversity. And everyone should love diversity, even if it means your toaster oven is labeled in Spanish. Then there are Linux nuts who rabidly assert that there's no such thing as ANSI because that was basically a Windows thing. (I'm not sure if Apple ever used codepages.)

Then there's unicode, which can be in several forms and is designed to support more characters. But that gets confusing because unicode is several things. UTF-8 is a type of unicode. The original Windows unicode, which is how Windows operates under the surface and what unicode means in a Windows context, is a system of 2 bytes per character. So "a" in ASCII is byte 97. "a" in ANSI is byte 97. "a" in UTF-8 is byte 97. But "a" in 2-byte unicode is byte 97 followed by a null: 61 00. You can see it if you open a Word doc in a hex editor.

But a null is traditionally used to indicate the end of a character string. So suddenly switching to using 2-byte or 4-byte unicode online would have been a disaster. Inventing UTF-8 allowed most webpages to stay just as they are, while accommodating non-English characters, including Chinese and Japanese.
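The byte values mentioned in this post can be confirmed with Python's standard library. This is a sketch for illustration; "cp1252" stands in for "English ANSI" and "utf-16-le" for the 2-byte Windows form, which are my assumptions about what the post means.

```python
import html

# The HTML entity &nbsp; is the non-breaking space, code point 160 (U+00A0).
nbsp = html.unescape("&nbsp;")

# In UTF-8 that one character is the two bytes C2 A0 (194, 160).
utf8_bytes = nbsp.encode("utf-8")

# Read as Windows-1252, C2 shows up as a capital A with a circumflex and
# A0 is still a non-breaking space, which is why deleting the extra
# character appears to "fix" such pages.
misread = utf8_bytes.decode("cp1252")

# Byte 149 in Windows-1252 and the entity &bull; are the same bullet character.
ansi_bullet = bytes([149]).decode("cp1252")
entity_bullet = html.unescape("&bull;")

# "a" is byte 97 in ASCII, Windows-1252 and UTF-8, but 61 00 in 2-byte form.
a_utf8 = "a".encode("utf-8")
a_utf16 = "a".encode("utf-16-le")

print(list(utf8_bytes), ascii(misread), a_utf16.hex(" "))
```

Re-decoding the file with the right encoding is a safer repair than deleting characters by hand, since it also fixes the curly quotes and other multi-byte characters.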
#12
On Wed, 23 Jan 2019 18:24:27 +0000, J. P. Gilliver (John) wrote:
> Or, alternatively: people are using strange characters when there's no need for them. (In this case, a "special" space [or possibly punctuation] character, where its use serves absolutely no purpose.)

I would doubt someone has _deliberately_ used a strange character here. More likely it is an accident.

The character is the Unicode "replacement character". It is usually seen when the data was invalid and did not match any character.

https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29#Replacement_character
https://www.fileformat.info/info/unicode/char/fffd/index.htm

The email has likely been forwarded one or more times since the accident, to make the character show up as three ISO-8859-1 characters.

--
Kind regards
Ralph 🦊
#13
"Ralph Fox" wrote
| | Or, alternatively: people are using strange characters when there's no
| | need for them. (In this case, a "special" space [or possibly
| | punctuation] character, where its use serves absolutely no purpose.)
|
| I would doubt someone has _deliberately_ used a strange character here.

One possibility -- the only realistic scenario I can think of: The sender copied text from a non-UTF-8 webpage. The email software is HTML-based and was able to paste the actual HTML rather than just the visible text. The HTML had a non-breaking space in it:

neat&nbsp; nine lives minus 1

(Using script in a webpage it's possible to access either selected text or selected HTML text.)

But the email software was not properly designed to deal with encoding and just used an arbitrary sledgehammer approach to convert the pasting to UTF-8. Presumably the sender had set it for UTF-8 composition. Or maybe it was something like gmail, composing in a webpage and designed for UTF-8 so that Google won't have to bother with languages. Since character 160 alone is invalid UTF-8 the software function substituted official nonsense bytes! ("You're not allowed to read this because it wasn't done right. My name is Edith Ann, and I'm 5 years old, and I don't have to respect ANSI text if I don't want to!")

In other words, they didn't actually convert the text paste to UTF-8. They just copied over ASCII characters and substituted official nonsense for ANSI characters. So it might be rendered unreadable, but at least it was official UTF-8. But once the programmer decides to convert they need to do the job right and be prepared for ANSI characters, rather than just designing their software to insert official nonsense for any character over 127.

In that scenario it's likely the email programmer didn't know what they were doing and just thought it was a nifty touch to offer translation to UTF-8 of pasted strings.
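To make the difference concrete, here is a rough Python sketch of the two approaches. Both function names are invented for this example, the pasted bytes are the guessed scenario from this post, and "cp1252" is assumed as the source codepage; none of it is known to be what the sender's software really did.

```python
def sledgehammer_to_utf8(raw: bytes) -> bytes:
    """Hypothetical 'sledgehammer': treat the bytes as UTF-8 and let every
    invalid byte become the replacement character U+FFFD."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


def transcode_to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Hypothetical proper conversion: decode with the real source codepage
    first, then encode the resulting text as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Assumed pasted text: byte 160 (non-breaking space) followed by a normal space.
pasted = b"neat\xa0 nine lives minus 1"

broken = sledgehammer_to_utf8(pasted)
correct = transcode_to_utf8(pasted)

print(broken)   # the non-breaking space has become EF BF BD
print(correct)  # the non-breaking space has become C2 A0
```

The proper conversion only works if the software knows, or correctly guesses, which codepage the pasted bytes came from.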
#14
In message , Ralph Fox
writes:
> On Wed, 23 Jan 2019 18:24:27 +0000, J. P. Gilliver (John) wrote:
>> Or, alternatively: people are using strange characters when there's no need for them. (In this case, a "special" space [or possibly punctuation] character, where its use serves absolutely no purpose.)
>
> I would doubt someone has _deliberately_ used a strange character here. More likely it is an accident.

The person taking the deliberate action is not the person sending the email, but the programmer of the software that generated the offending "text", and/or the software that encouraged - made possible without any warning - the moving of said "text" from one software to another. (If the offending character was a non-breaking space, then even its use is pointless, as it was immediately followed by an ordinary space - and I have frequently encountered that pairing; however, it's not clear from the context whether it might have been punctuation [though I see no reason why ASCII punctuation would not have sufficed - a colon in this case, for example].)

> The character is the Unicode "replacement character". It is usually seen when the data was invalid and did not match any character.
>
> https://en.wikipedia.org/wiki/Specia...9#Replacement_ character
> https://www.fileformat.info/info/unicode/char/fffd/index.htm
>
> The email has likely been forwarded one or more times since the accident, to make the character show up as three ISO-8859-1 characters.

Indeed.

--
J. P. Gilliver. UMRA: 1960/1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

"If god doesn't like the way I live, Let him tell me, not you." - unknown
#15
In message , Mayayana
writes:
> "J. P. Gilliver (John)" wrote
> | Leaving aside the coding complexities: a non-breaking space followed by
> | a normal space, as this sort of thing often is, is rather pointless ...
>
> Not in HTML. A browser will ignore multiple spaces. "a    b" will display the same way as "a b". Only one space is rendered because in general spaces are ignored as part of how HTML works. So if you want to show more space you use non-breaking space characters. The standard is &nbsp;. The raw character, byte 160, also does it if the page is written for the English codepage.

[lots deleted - not really a Mayayana rant, as it included lots of good information.]

I'm quite aware of what a non-breaking space does. However, I retain my assertion that an nbsp _followed by an ordinary space_ serves little purpose: it will not stop the line being broken at that point. I suppose if the gap so produced is _not_ at the end of a line, it _does_ force a double space which might otherwise have been rendered as a single one, but it does seem a backhanded way of achieving that. (A double nbsp - with no ordinary space - would have that effect and also prevent the line being broken at that point, if that's important.)

--
J. P. Gilliver. UMRA: 1960/1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

"If god doesn't like the way I live, Let him tell me, not you." - unknown
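The "forces a double space" effect can be imitated with a small Python sketch. The helper `collapse_ascii_whitespace` is a made-up, rough stand-in for what a browser does with runs of ordinary whitespace; it is not a real HTML renderer and says nothing about where lines actually break.

```python
import re


def collapse_ascii_whitespace(text: str) -> str:
    """Rough stand-in for HTML whitespace collapsing (hypothetical helper):
    runs of ordinary spaces, tabs and newlines become one space.
    The non-breaking space U+00A0 is deliberately not in the set."""
    return re.sub(r"[ \t\n\r\f]+", " ", text)


NBSP = "\u00a0"

# Two ordinary spaces collapse to one.
two_spaces = collapse_ascii_whitespace("neat  nine")

# A non-breaking space followed by an ordinary space survives as two characters.
nbsp_then_space = collapse_ascii_whitespace("neat" + NBSP + " nine")

# Two non-breaking spaces also survive, with no ordinary space to break at.
double_nbsp = collapse_ascii_whitespace("neat" + NBSP + NBSP + "nine")

print(ascii(two_spaces), ascii(nbsp_then_space), ascii(double_nbsp))
```

So the nbsp-plus-space pairing does keep the gap two characters wide, while still leaving an ordinary space where the line may be broken, which is the point made above.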