Strange characters

**swalker** · January 23rd 19, 04:04 PM posted to alt.windows7.general

Look at this

neatï¿½ nine lives minus 1.

I know the guy who replied to the email didn't do this and it appears
on all his replies regardless of the subject.

What he wrote was "neat nine lives minus 1" in a reply to a link to
a headline I sent about a cat going over a dam and being rescued.

What's up?

**J. P. Gilliver (John)[_4_]** · January 23rd 19, 05:55 PM posted to alt.windows7.general

swalker writes:
> Look at this
>
> neatï¿½ nine lives minus 1.
>
> I know the guy who replied to the email didn't do this and it appears
> on all his replies regardless of the subject.
>
> What he wrote was "neat nine lives minus 1" in a reply to a link to
> a headline I sent about a cat going over a dam and being rescued.
>
> What's up?

Note the double space after "neat". I've noticed this a lot recently -
presumably the text comes from some modern software. The first space is
not really a space, but some non-ASCII character, which some part of the
'net is converting into two bytes. (It may even be starting out as two
bytes.) There's NO need for it: even if he's convinced he should put two
spaces after a full stop, there's no reason one of them has to be a
funny-space (they look the same to him after all).

You'll have to live with it. At best, you _might_ convince the guy not
to use the offending software, but he's unlikely to be bothered: your
first hurdle will be to convince him there's any problem, as he won't be
seeing it at his sending end, followed by - in most cases - strong
reluctance on his part to do anything about it, for the same reason (and
he'll add "nobody else is complaining").
--
J. P. Gilliver. UMRA: 1960/1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

A true-born Englishman does not know any language. He does not speak English
too well either but, at least, he is not proud of this. He is, however,
immensely proud of not knowing any foreign languages. (George Mikes, "How to
be Inimitable" [1960].)

**David E. Ross[_2_]** · January 23rd 19, 05:56 PM posted to alt.windows7.general

On 1/23/2019 8:04 AM, swalker wrote:
> Look at this
>
> neatï¿½ nine lives minus 1.
>
> I know the guy who replied to the email didn't do this and it appears
> on all his replies regardless of the subject.
>
> What he wrote was "neat nine lives minus 1" in a reply to a link to
> a headline I sent about a cat going over a dam and being rescued.
>
> What's up?

Either your friend used a Microsoft E-mail application to reply to you,
or else he composed his reply with Word and then copied the reply into
his E-mail application. The strange characters represent what happens
to so-called smart quotes -- favored by Microsoft -- when used in
plain-text compositions. This also happens with other non-plain-text
characters.

See my http://www.rossde.com/malaprops/writing_internet.html. The box
on the right gives several examples.

--
David E. Ross

Trump again proves he is a major source of fake news. He wants
to cut off disaster funds to repair the damage caused by the
Woolsey Fire in southern California because he claims the state
fails to manage its forests properly. The Woolsey Fire was NOT
a forest fire. Starting in an industrial tract, it did not burn
through any forests.

See http://www.rossde.com/fire.html.

**Ralph Fox** · January 23rd 19, 06:06 PM posted to alt.windows7.general

On Wed, 23 Jan 2019 10:04:42 -0600, swalker wrote:

> Look at this
>
> neatï¿½ nine lives minus 1.

Which part? The "ï¿½" or the whole phrase?

> I know the guy who replied to the email didn't do this and it appears
> on all his replies regardless of the subject.
>
> What he wrote was "neat nine lives minus 1" in a reply to a link to
> a headline I sent about a cat going over a dam and being rescued.
>
> What's up?

Likely

(a) The guy who replied is using a program which does not
support Unicode UTF-8 character encoding.
The guy also sees the "ï¿½" before he sends the message.

or

(b) The guy's message is missing the "Content-Type:" header to
say that the message is using UTF-8.
The guy does not see the "ï¿½", but you do.
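
As a rough illustration of case (b), the sketch below uses Python's standard `email` library to build a message and then reads its body twice: once honouring the declared charset and once assuming Windows-1252, as a client with no header to go on might. The subject and body are made up for the example, and the no-break space (U+00A0) is only a stand-in for whatever non-ASCII character the sender's software really produced, which the thread does not establish.

```python
from email.message import EmailMessage

# Hypothetical message; U+00A0 stands in for the unknown non-ASCII character.
msg = EmailMessage()
msg["Subject"] = "Re: cat over the dam"
msg.set_content("neat\u00a0 nine lives minus 1.", charset="utf-8")

# The charset declared in the Content-Type header.
declared = msg.get_content_charset()

# Body bytes after undoing any content-transfer-encoding.
body_bytes = msg.get_payload(decode=True)

# A reader that honours the header recovers the intended text.
right = body_bytes.decode(declared)

# A reader that assumes Windows-1252 instead sees an extra stray character.
wrong = body_bytes.decode("cp1252")

print(msg["Content-Type"])
print(repr(right))
print(repr(wrong))
```

The point is only that the same bytes give different text depending on which charset the reader assumes.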

--
Kind regards
Ralph

**J. P. Gilliver (John)[_4_]** · January 23rd 19, 06:24 PM posted to alt.windows7.general

Wolf K writes:
[]
> In short: Email clients still do not all adhere to the same standards.

Or, alternatively: people are using strange characters when there's no
need for them. (In this case, a "special" space [or possibly
punctuation] character, where its use serves absolutely no purpose.)

> Why am I not surprised?

Because you have your hobby-horse, same as I do. (They happen to be
galloping in opposite directions.)
--
J. P. Gilliver. UMRA: 1960/1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

A true-born Englishman does not know any language. He does not speak English
too well either but, at least, he is not proud of this. He is, however,
immensely proud of not knowing any foreign languages. (George Mikes, "How to
be Inimitable" [1960].)

**Paul[_32_]** · January 23rd 19, 06:34 PM posted to alt.windows7.general

Wolf K wrote:
> On 2019-01-23 13:06, Ralph Fox wrote:
>> On Wed, 23 Jan 2019 10:04:42 -0600, swalker wrote:
>>
>>> Look at this
>>>
>>> neatï¿½ nine lives minus 1.
>>
>> Which part? The "ï¿½" or the whole phrase?
>>
>>> I know the guy who replied to the email didn't do this and it appears
>>> on all his replies regardless of the subject.
>>>
>>> What he wrote was "neat nine lives minus 1" in a reply to a link to
>>> a headline I sent about a cat going over a dam and being rescued.
>>>
>>> What's up?
>>
>> Likely
>>
>> (a) The guy who replied is using a program which does not
>> support Unicode UTF-8 character encoding.
>> The guy also sees the "ï¿½" before he sends the message.
>>
>> or
>>
>> (b) The guy's message is missing the "Content-Type:" header to
>> say that the message is using UTF-8.
>> The guy does not see the "ï¿½", but you do.
>
> In short: Email clients still do not all adhere to the same standards.
>
> Why am I not surprised?

You mean an email client with some pretense at HTML support?

Or the usage of an email client that simply doesn't have
modern character support (better than ASCII support).

6E 65 61 74 EF BF BD 20 6E 69 6E 65
n  e  a  t           sp n  i  n  e

I'm not good at these alphabet puzzles, and this
is what I could find.

https://stackoverflow.com/questions/...his-byte-array

"The original byte array is not encoded as UTF-8.

The StreamReader therefore replaces each invalid byte
with the replacement character U+FFFD.

When that character gets encoded back to UTF-8, this
results in the byte sequence EF BF BD.

You cannot construct the original byte value from the
string because the information is completely lost.
"

So it's some sort of transformation, somewhere along
the way.
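
A small sketch of the same check in Python: decode the hex dump above as UTF-8 and ask the Unicode database what the fifth character is. Only the byte values come from the post; the variable names are made up.

```python
import unicodedata

# The twelve bytes from the hex dump above.
raw = bytes.fromhex("6E 65 61 74 EF BF BD 20 6E 69 6E 65")

# These bytes are valid UTF-8, so a strict decode succeeds.
text = raw.decode("utf-8")

# The character sitting between "neat" and the space.
odd = text[4]

print(repr(text))
print(f"U+{ord(odd):04X}", unicodedata.name(odd))
```

So EF BF BD is itself well-formed UTF-8. It just encodes the placeholder that some earlier step substituted for a byte it could not interpret.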

Paul

**swalker** · January 23rd 19, 08:29 PM posted to alt.windows7.general

On Wed, 23 Jan 2019 10:04:42 -0600, swalker wrote:

> Look at this
>
> neatï¿½ nine lives minus 1.
>
> I know the guy who replied to the email didn't do this and it appears
> on all his replies regardless of the subject.
>
> What he wrote was "neat nine lives minus 1" in a reply to a link to
> a headline I sent about a cat going over a dam and being rescued.
>
> What's up?

Thanks for the replies.

**Mayayana** · January 23rd 19, 10:06 PM posted to alt.windows7.general

"swalker" wrote

| neatï¿½ nine lives minus 1.
|
| What's up?

UTF-8 encoding. You can either switch to reading in
UTF-8, if your email program has that option, ask them
not to write in UTF-8, or ignore it.

Unfortunately, UTF-8 is a big fad these days. It's a
handy way to render different languages in webpages,
but for English speakers it's irrelevant. Yet it's used
a lot for things like non-breaking spaces and curly
quotes -- neither of which is necessary in email or
webpages.

The way it works:

Basic ASCII encoding represents each character as
a byte, up to 127. That allows for the English alphabet
and punctuation. When more languages were needed,
ANSI came along. ANSI is the same as ASCII up to 127.
Bytes 128 to 255 are represented depending on the
local language settings. (Known as a codepage.)
Where you see an N with a tilde, a Russian will see a
Russian character.
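
A minimal sketch of the codepage idea, assuming Windows-1252 as the "English" codepage and Windows-1251 as the Russian one (the post names neither). The same byte is decoded both ways.

```python
# One byte above 127: decimal 209, hex D1.
byte = bytes([0xD1])

# Western European Windows codepage: capital N with tilde.
as_western = byte.decode("cp1252")

# Cyrillic Windows codepage: a Cyrillic capital letter.
as_cyrillic = byte.decode("cp1251")

# Bytes up to 127 read the same under both codepages.
ascii_same = bytes([0x41]).decode("cp1252") == bytes([0x41]).decode("cp1251")

print(as_western, as_cyrillic, ascii_same)
```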

That worked OK, but it's not good enough for international use.
Japanese, Chinese, Korean require far more characters. UTF-8
is one solution, which has now become default for webpages.
UTF-8 is the same as ASCII, up to 127, but after that the
upper bytes are used in a system of multi-byte designations,
up to 4 bytes per character. That allows for extensive
character representation while still using plain text to do it
and not using null bytes, which would complicate things
greatly.
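
To make the "up to 4 bytes per character" point concrete, here is a short sketch that prints the UTF-8 bytes of a few sample characters. The samples are arbitrary picks for illustration.

```python
# Arbitrary samples: ASCII letter, n-tilde, euro sign, a CJK character, an emoji.
samples = ["a", "\u00f1", "\u20ac", "\u65e5", "\U0001F98A"]

# Number of UTF-8 bytes needed for each sample.
lengths = [len(ch.encode("utf-8")) for ch in samples]

# In multi-byte sequences every byte is 128 or higher, so none can be
# mistaken for an ASCII character or a null.
all_high = all(b >= 0x80 for ch in samples[1:] for b in ch.encode("utf-8"))

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", " ".join(f"{b:02X}" for b in encoded))
```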

That's the story in a nutshell. It's complicated. But the
long and the short of it is that whenever you see text that's
normal except for the odd capital A with a caret, upside
down question mark, weird double brackets, etc, what you're
seeing is UTF-8 text being rendered as ANSI using the English
codepage. And most of what's messed up is probably just
supposed to be quotes and spaces.

What you see is the characters in English for bytes
239-191-189 or hexadecimal EF BF BD. Ironically, they
represent a single "replacement character" in UTF-8. In
other words, if you rendered it properly you'd see a question
mark in a diamond, or maybe just a square, indicating that
the character can't be rendered.
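
The byte values quoted above can be checked with a few lines of Python. This sketch assumes "English ANSI" means Windows-1252.

```python
# The Unicode replacement character, U+FFFD.
replacement = "\ufffd"

# Its UTF-8 encoding is three bytes.
utf8_bytes = replacement.encode("utf-8")
decimal = list(utf8_bytes)

# Reading those three bytes as Windows-1252 gives three separate characters.
seen = utf8_bytes.decode("cp1252")

print(decimal, " ".join(f"{b:02X}" for b in utf8_bytes), seen)
```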

**Mayayana** · January 23rd 19, 10:35 PM posted to alt.windows7.general

"Paul" wrote

|
| https://stackoverflow.com/questions/...his-byte-array
|
| "The original byte array is not encoded as UTF-8.
|
| The StreamReader therefore replaces each invalid byte
| with the replacement character U+FFFD.
|

In that example, though, the original stream can't be rendered
as UTF-8. It's got multiple null byte groupings, which can only
be valid as non-text binary data. The function he was using
did the best it could. But that's the first I've heard of using
replacement characters. That probably only happened because
the function was asked to render it.

Usually UTF-8 will simply render with the local codepage where
it's not supported, which is what happened. The mystery is
why/what replaced whatever was there originally. It looks
like the original might have been character 160, a non-breaking
space. But that's a continuation byte in UTF-8. Being alone,
following an ASCII-range character, would be invalid. But why
would the sender's software say, "This is messed up so to fix
it we'll mess it up some more."? Strange. Maybe the sender
pasted valid HTML into a fascist email program?
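
A sketch of that hypothesis in Python, assuming the original text really did contain a lone byte 160 (the thread never confirms this). A strict UTF-8 decoder rejects the byte, while a lenient one substitutes U+FFFD, which then becomes EF BF BD when the text is written back out.

```python
# Hypothetical original: "neat", byte 160 (0xA0), then an ordinary space.
sent = b"neat\xa0 nine lives minus 1."

# Strict decoding fails because 0xA0 cannot start a UTF-8 sequence.
try:
    sent.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding swaps the bad byte for the replacement character.
lenient = sent.decode("utf-8", errors="replace")

# Re-encoding as UTF-8 yields the three bytes discussed in this thread.
reencoded = lenient.encode("utf-8")

print(strict_ok, repr(lenient), reencoded[4:7].hex())
```

Python's `errors="replace"` is just one decoder that behaves this way; whichever program mangled the email may have worked differently.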

**J. P. Gilliver (John)[_4_]** · January 23rd 19, 11:16 PM posted to alt.windows7.general

Mayayana writes:
> "Paul" wrote
>
> |
> | https://stackoverflow.com/questions/...his-byte-array
> |
> | "The original byte array is not encoded as UTF-8.
> |
> | The StreamReader therefore replaces each invalid byte
> | with the replacement character U+FFFD.
> |
>
> In that example, though, the original stream can't be rendered
> as UTF-8. It's got multiple null byte groupings, which can only
> be valid as non-text binary data. The function he was using
> did the best it could. But that's the first I've heard of using
> replacement characters. That probably only happened because
> the function was asked to render it.
>
> Usually UTF-8 will simply render with the local codepage where
> it's not supported, which is what happened. The mystery is
> why/what replaced whatever was there originally. It looks
> like the original might have been character 160, a non-breaking
> space. But that's a continuation byte in UTF-8. Being alone,
> following an ASCII-range character, would be invalid. But why
> would the sender's software say, "This is messed up so to fix
> it we'll mess it up some more."? Strange. Maybe the sender
> pasted valid HTML into a fascist email program?

Leaving aside the coding complexities: a non-breaking space followed by
a normal space, as this sort of thing often is, is rather pointless ...
(-:

--
J. P. Gilliver. UMRA: 1960/1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

If you help someone when they're in trouble, they will remember you when
they're in trouble again.

**Mayayana** · January 24th 19, 02:30 AM posted to alt.windows7.general

"J. P. Gilliver (John)" wrote

| Leaving aside the coding complexities: a non-breaking space followed by
| a normal space, as this sort of thing often is, is rather pointless ...

Not in HTML. A browser will ignore multiple spaces.
"a    b" will display the same way as "a b". Only one
space is rendered because in general runs of spaces are
collapsed as part of how HTML works. So if you want to
show more space you use non-breaking space
characters. The standard is &nbsp;. &#160; also
does it if the page is written for the English codepage.
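
A rough Python sketch of the behaviour described: both entity forms decode to the same character (U+00A0), and collapsing runs of ordinary whitespace leaves that character alone. The regular expression is only a stand-in for what a browser does, and the sample string is invented.

```python
import html
import re

# Invented sample mixing both entity forms with runs of ordinary spaces.
source = "left&nbsp;&nbsp;&nbsp; right,   and&#160;more"

# Both entities decode to the no-break space, U+00A0.
decoded = html.unescape(source)

# Stand-in for HTML whitespace collapsing. The character class lists ASCII
# whitespace explicitly because Python's \s would also match U+00A0.
collapsed = re.sub(r"[ \t\n\r\f]+", " ", decoded)

print(repr(collapsed))
```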

In UTF-8 it's encoded with the bytes C2 A0, or
194-160. So if you open a UTF-8 webpage as
non-UTF-8 you often see lots of Â. Microsoft pages
are like that. To fix them you can just delete every
Â because the next character is the non-breaking
space character in English ANSI, 160. (I like to do
that so that I can store files as ANSI, which is still
the standard with most computer files. And some
editors don't recognize UTF-8. Also, there's no absolute
ID for UTF-8. So for most people it's just a hassle.)
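
The C2 A0 case can be reproduced in a few lines. This sketch assumes Windows-1252 for "English ANSI" and uses an invented sample string. It compares the delete-every-Â trick with decoding the bytes as UTF-8 in the first place.

```python
# Invented sample: a no-break space between a label and a number, in UTF-8.
page_bytes = "price:\u00a0100".encode("utf-8")

# Opened as Windows-1252, the C2 byte shows up as a capital A with circumflex.
misread = page_bytes.decode("cp1252")

# The manual fix from the post: delete every such character.
manual_fix = misread.replace("\u00c2", "")

# The alternative: decode as UTF-8, then store as Windows-1252 if desired.
proper = page_bytes.decode("utf-8")
ansi_bytes = proper.encode("cp1252")

print(repr(misread), repr(manual_fix), ansi_bytes)
```

Both routes agree here, but the manual deletion would also remove any genuine Â in the text and does nothing for characters such as curly quotes, which are encoded with a different lead byte.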

You can also see a lot of non-breaking spaces in
webpage code, usually in the form of &nbsp;. (There
are a number of standards like that, which browsers
know to render. Similarly, &bull; will render a bullet.
As will &#149; [ANSI] or &#8226; [UTF-8].)

In the early days &nbsp; was used for layout. Programs
like Front Page did it. That and many other programs were
"WYSIWYG". They allowed creating a webpage visually,
by typing, dragging images, etc. But to do that they
had to pull off a lot of hacks behind the scenes, in the
actual webpage code. If you typed "left" and then hit the
space bar until you were on the other side of the page,
then typed "right", Front Page would create code like so:

left                 right

This all gets slightly more complicated because, believe it
or not, the whole issue is politicized. UTF-8 is politically
correct, to begin with, because it supports diversity. And
everyone should love diversity, even if it means your toaster
oven is labeled in Spanish. Then there are Linux nuts who
rabidly assert that there's no such thing as ANSI because
that was basically a Windows thing. (I'm not sure if Apple
ever used codepages.)

Then there's unicode, which can be in several forms and
is designed to support more characters. But that gets confusing
because unicode is several things. UTF-8 is a type of unicode.
The original Windows unicode, which is how Windows operates
under the surface and what unicode means in a Windows context,
is a system of 2 bytes per character. So a in ascii is byte 97.
a in ANSI is byte 97. a in UTF-8 is byte 97. But a in 2-byte
unicode is byte 97 followed by a null: 61 00. You can see it
if you open a Word doc in a hex editor.

But a null is traditionally used to indicate the end of a character
string. So suddenly switching to using 2-byte or 4-byte unicode
online would have been a disaster. Inventing UTF-8 allowed
most webpages to stay just as they are, while accommodating
non-English characters, including Chinese and Japanese.
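
The "61 00" claim and the null-byte concern are easy to check. This sketch assumes the 2-byte Windows form corresponds to Python's `utf-16-le` codec; the helper name is made up.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

word = "neat"

# One byte per character, identical to ASCII.
utf8 = word.encode("utf-8")

# Two bytes per character, the second one a null for ASCII-range letters.
utf16 = word.encode("utf-16-le")

# Even non-ASCII text contains no null bytes when encoded as UTF-8.
cjk_utf8 = "\u65e5\u672c".encode("utf-8")

print(hex_bytes(utf8))
print(hex_bytes(utf16))
print(hex_bytes(cjk_utf8))
```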

**Ralph Fox** · January 24th 19, 06:08 AM posted to alt.windows7.general

On Wed, 23 Jan 2019 18:24:27 +0000, J. P. Gilliver (John) wrote:

> Or, alternatively: people are using strange characters when there's no
> need for them. (In this case, a "special" space [or possibly
> punctuation] character, where its use serves absolutely no purpose.)

I would doubt someone has _deliberately_ used a strange character here.

More likely it is an accident.

The character is the Unicode "replacement character". It is usually seen
when the data was invalid and did not match any character.

https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29#Replacement_character
https://www.fileformat.info/info/unicode/char/fffd/index.htm

The email has likely been forwarded one or more times since the
accident, to make the character show up as three ISO-8859-1 characters.
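
A sketch of how repeated mishandling compounds, assuming each hop takes UTF-8 bytes and reads them as ISO-8859-1 (one plausible mechanism, not something the thread confirms). The function name is made up.

```python
def misread_rounds(text: str, rounds: int) -> str:
    """Encode as UTF-8 and misread as ISO-8859-1, `rounds` times over."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("latin-1")
    return text

# One round turns the replacement character into three characters.
once = misread_rounds("\ufffd", 1)

# A second round turns those three into six.
twice = misread_rounds("\ufffd", 2)

print(len(once), once)
print(len(twice), twice)
```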

--
Kind regards
Ralph
🦊

**Mayayana** · January 24th 19, 02:02 PM posted to alt.windows7.general

"Ralph Fox" wrote
|
| Or, alternatively: people are using strange characters when there's no
| need for them. (In this case, a "special" space [or possibly
| punctuation] character, where its use serves absolutely no purpose.)
|
|
| I would doubt someone has _deliberately_ used a strange character here.

One possibility -- the only realistic scenario I can think of:

The sender copied text from a non-UTF-8 webpage.
The email software is HTML-based and was able to paste
the actual HTML rather than just the visible text. The
HTML had a non-breaking space in it:

neat&nbsp; nine lives minus 1

(Using script in a webpage it's possible to access either
selected text or selected HTML text.)

But the email software was not properly designed to deal
with encoding and just used an arbitrary sledgehammer
approach to convert the pasting to UTF-8. Presumably the
sender had set it for UTF-8 composition. Or maybe it
was something like gmail, composing in a webpage and
designed for UTF-8 so that Google won't have to bother
with languages. Since character 160 alone is invalid UTF-8
the software function substituted official nonsense
bytes!

("You're not allowed to read this because it wasn't
done right. My name is Edith Ann, and I'm 5 years old,
and I don't have to respect ANSI text if I don't want to!")

In other words, they didn't actually convert the text
paste to UTF-8. They just copied over ASCII characters
and substituted official nonsense for ANSI characters.
So it might be rendered unreadable, but at least it was
official UTF-8.

But once the programmer decides to convert they
need to do the job right and be prepared for ANSI
characters, rather than just designing their software to
insert official nonsense for any character over 127.

In that scenario it's likely the email programmer didn't
know what they were doing and just thought it was a
nifty touch to offer translation to UTF-8 of pasted strings.
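
To contrast the two approaches described in this scenario, here is a hedged sketch. Both function names are made up, the pasted bytes are the hypothetical ones from the scenario, and Windows-1252 is assumed as the ANSI codepage.

```python
def sledgehammer_to_utf8(ansi_bytes: bytes) -> bytes:
    """Keep ASCII bytes; swap every byte over 127 for U+FFFD."""
    text = "".join(chr(b) if b < 128 else "\ufffd" for b in ansi_bytes)
    return text.encode("utf-8")

def convert_to_utf8(ansi_bytes: bytes, codepage: str = "cp1252") -> bytes:
    """Decode with the source codepage first, then encode as UTF-8."""
    return ansi_bytes.decode(codepage).encode("utf-8")

# Hypothetical paste: "neat", byte 160, an ordinary space, then the rest.
pasted = b"neat\xa0 nine lives minus 1"

lossy = sledgehammer_to_utf8(pasted)
faithful = convert_to_utf8(pasted)

print(lossy)
print(faithful)
```

The first output carries EF BF BD and has lost the original character for good. The second carries C2 A0, the UTF-8 encoding of the non-breaking space.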

**J. P. Gilliver (John)[_4_]** · January 24th 19, 02:52 PM posted to alt.windows7.general

Ralph Fox writes:
> On Wed, 23 Jan 2019 18:24:27 +0000, J. P. Gilliver (John) wrote:
>
>> Or, alternatively: people are using strange characters when there's no
>> need for them. (In this case, a "special" space [or possibly
>> punctuation] character, where its use serves absolutely no purpose.)
>
> I would doubt someone has _deliberately_ used a strange character here.
>
> More likely it is an accident.

The person taking the deliberate action is not the person sending the
email, but the programmer of the software that generated the offending
"text", and/or the software that encouraged - made possible without any
warning - the moving of said "text" from one software to another. (If
the offending character was a non-breaking space, then even its use is
pointless, as it was immediately followed by an ordinary space - and I
have frequently encountered that pairing; however, it's not clear from
the context whether it might have been punctuation [though I see no
reason why ASCII punctuation would not have sufficed - a colon in this
case, for example].)

> The character is the Unicode "replacement character". It is usually seen
> when the data was invalid and did not match any character.
>
> https://en.wikipedia.org/wiki/Specia...9#Replacement_character
> https://www.fileformat.info/info/unicode/char/fffd/index.htm
>
> The email has likely been forwarded one or more times since the
> accident, to make the character show up as three ISO-8859-1 characters.

Indeed.

Indeed.

--
J. P. Gilliver. UMRA: 1960/1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

"If god doesn't like the way I live, Let him tell me, not you." - unknown

**J. P. Gilliver (John)[_4_]** · January 24th 19, 03:09 PM posted to alt.windows7.general

Mayayana writes:
> "J. P. Gilliver (John)" wrote
>
> | Leaving aside the coding complexities: a non-breaking space followed by
> | a normal space, as this sort of thing often is, is rather pointless ...
>
> Not in HTML. A browser will ignore multiple spaces.
> "a    b" will display the same way as "a b". Only one
> space is rendered because in general runs of spaces are
> collapsed as part of how HTML works. So if you want to
> show more space you use non-breaking space
> characters. The standard is &nbsp;. &#160; also
> does it if the page is written for the English codepage.

[lots deleted - not really a Mayayana rant, as it included lots of good
information.]

I'm quite aware of what a non-breaking space does. However, I retain my
assertion that an nbsp _followed by an ordinary space_ serves little
purpose: it will not stop the line being broken at that point. I suppose
if the gap so produced is _not_ at the end of a line, it _does_ force a
double space which might otherwise have been rendered as a single one,
but it does seem a backhanded way of achieving that. (A double nbsp -
with no ordinary space - would have that effect and also prevent the
line being broken at that point, if that's important.)
--
J. P. Gilliver. UMRA: 1960/1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

"If god doesn't like the way I live, Let him tell me, not you." - unknown
