Frequently Asked Questions
about Text Encoding in REALbasic 5
----------------------------------
0. What's all this about text encodings? Isn't a string just a string?
Not to a computer. A computer contains only numbers; in particular,
it contains bytes, each of which has a value from 0 to 255. A string
contains a bunch of bytes. But usually, you don't want to think of
them as bytes; you want to think of them as representing some text
which you can read, find words in, and so on. So early computer
developers had to answer the question: how are we going to map these
numbers into text? The mapping from numbers to text (and vice versa)
is known as a text encoding.
The earliest and most widely accepted standard encoding is known as
"ASCII" (American Standard Code for Information Interchange). ASCII
values range from 0 to 127, and represent all the uppercase and
lowercase English letters, the digits from 0-9, and a variety of
common punctuation marks. The text you're reading can be represented
entirely in ASCII.
But ASCII is insufficient for representing text in almost any
language other than English. For example, it has no accented
characters needed by many European languages. So a number of
extensions to ASCII were made that use byte values 128-255 to
represent additional characters. But different people need
different additional characters, so there are a wide variety of
single-byte encodings of this sort: MacRoman, MacIcelandic,
ISO-Latin-1, and so on.
And for many languages, 256 different characters just isn't enough.
Japanese and Chinese require tens of thousands of characters, for
example. Again there were various encodings developed to solve this
problem, such as MacJapanese and Shift-JIS. These use one byte for
each ASCII character, but two bytes for any non-ASCII character (and
are commonly referred to as "double-byte character systems").
But this proliferation of encodings presents a big problem: you can
no longer tell what text a bunch of bytes is meant to represent, just
by looking at the bytes. You need to also know the encoding
associated with them. So, in REALbasic, every string contains both
the bytes and the encoding (which may be nil if the encoding is
unknown or undefined). This is how REALbasic knows how to draw the
string as text, do text-based operations like InStr and Mid, etc.
In recent years, an industry group called the Unicode Consortium has
been developing a new standard designed to supersede all the others.
Unicode can represent every character in every writing system on the
planet, all in one encoding. This has been embraced by all major OS
vendors, and also by REALbasic. Unicode allows you to represent
different languages (e.g., a mixture of Greek and Japanese) in one
string, and eventually it will mean that you can safely assume that
any text you receive is Unicode, rather than having to handle
hundreds of possible encodings. (See http://www.unicode.org/ for
more info straight from the source.)
1. OK, so a Unicode string is a Unicode string?
Well, not quite. Unicode is a mapping between numbers and text. But
we still have to map the numbers into bytes, since a byte can contain
only 0-255 and Unicode values can be much larger. REALbasic supports
two different formats of Unicode: UTF-8 and UTF-16. See question 2.
2. What encoding are my string literals, constants, etc. in?
All strings in your REALbasic project are compiled as UTF-8. This is
a Unicode encoding that uses one byte for ASCII characters, and up to
four bytes for non-ASCII characters. It has a number of other handy
properties too, for example, an ASCII character will never appear as
part of a multi-byte character.
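For example (a rough sketch, assuming the literal below is stored as
UTF-8, where the accented character takes two bytes), Len counts
characters while LenB counts raw bytes:
dim s as string
s = "café"
MsgBox Str(Len(s)) + " characters, " + Str(LenB(s)) + " bytes" // shows "4 characters, 5 bytes"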
3. So when do I need to care about encodings?
Usually, not at all. However, when you receive text data from some
outside source, such as a database or file, you need to let REALbasic
know what encoding that text is in. You can use the DefineEncoding
method to do this, or in RB 5.1 or later, you can set the Encoding
property of the TextInputStream, or use the optional encoding
parameter of functions like Read, ReadLine, and ReadAll.
And sometimes, you'll need to provide text to another app which
requires it to be in a certain encoding. In that case, use
ConvertEncoding to change your text into that other encoding.
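A minimal sketch of both calls (assuming rawData holds bytes you read
from a MacRoman file, and the receiving app wants UTF-8):
dim s as string
s = DefineEncoding(rawData, Encodings.MacRoman) // tag the bytes with their real encoding
s = ConvertEncoding(s, Encodings.UTF8) // re-encode the text as UTF-8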
4. Which is faster, ConvertEncoding or TextConverter.Convert?
In most cases, ConvertEncoding is much faster than using
TextConverter.Convert. ConvertEncoding has a number of optimizations
for common cases, such as converting the same string multiple times,
or converting from one superset of ASCII to another. (All
WorldScript encodings, most Windows encodings, and UTF-8 are
supersets of ASCII.)
So, you should usually use ConvertEncoding, but if you really need
the speed then you should just measure it both ways and see which
performs better in your particular situation.
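A minimal timing sketch, assuming bigString holds a representative
sample of your data (Microseconds gives a simple elapsed-time figure):
dim t as double
dim result as string
t = Microseconds
result = ConvertEncoding(bigString, Encodings.UTF8)
MsgBox "ConvertEncoding took " + Str(Microseconds - t) + " microseconds"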
5. How do I get a specific byte into a string?
Use ChrB. ChrB takes a byte value (0-255) and returns a string with
undefined encoding, containing exactly that byte. You can build a
string containing multiple bytes by just adding these together.
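For example, this builds a three-byte string one byte at a time (its
encoding will be nil):
dim s as string
s = ChrB(200) + ChrB(65) + ChrB(13) // three arbitrary bytes, no defined encoding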
Of course, don't expect such a string to display as text in any
sensible way. If you want to make text, see the next question.
6. How do I get a specific character by its code point (or "ASCII value")?
Use TextEncoding.Chr. This returns a one-character string with the
character you specified by its code point within that encoding. For
example, a capital A in the ASCII character set would be:
s = Encodings.ASCII.Chr(65)
A copyright symbol represented in UTF-8 would be:
s = Encodings.UTF8.Chr(169)
7. How do I find the code point of a given character?
Use the Asc function. This returns the code point of the first
character of the given string, in the encoding of that string. So,
for example, if you have a string s in any variant of Unicode, then
Asc(s) is the Unicode code point of the first character of s.
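Combining this with the example from question 6, a small sketch:
dim s as string
s = Encodings.UTF8.Chr(169) // the copyright symbol, as in question 6
MsgBox Str(Asc(s)) // displays 169, its Unicode code point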
8. What encoding do I get when I add two strings together?
When you concatenate two strings (e.g. A + B), if the two have the
same encoding, then the result is in the same encoding. If one
encoding is a superset of the other -- e.g., as MacRoman is a
superset of ASCII -- then the result is that encoding (MacRoman in
our example). Note that most encodings, with the notable exception
of UTF-16, are supersets of ASCII, so in most cases adding an ASCII
string to some other string will result in the encoding of that other
string. Finally, if you add two strings of incompatible encodings --
say, MacRoman and UTF-8, or MacJapanese and MacIcelandic -- then both
strings will be converted internally to UTF-8, and the result will be
represented in UTF-8.
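A small sketch of the superset rule (ConvertEncoding is used here
just to force the encodings we want for the illustration):
dim a, b, c as string
a = ConvertEncoding("plain text", Encodings.ASCII)
b = ConvertEncoding("résumé", Encodings.MacRoman)
c = a + b // the result is in MacRoman, since MacRoman is a superset of ASCII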
9. How do I find out what encoding a string is in?
Use the Encoding function, which can be written in either of two ways. Like this:
enc = Encoding(s)
or like this:
enc = s.Encoding
This returns a TextEncoding object, or if the string's encoding is
undefined, it returns nil.
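For example, you might use this to tag data whose encoding you know
only from context (assuming here that s holds bytes from a MacRoman
source):
dim enc as TextEncoding
enc = s.Encoding
if enc = nil then
  s = DefineEncoding(s, Encodings.MacRoman)
end if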
10. When I write Unicode text to a text file, why do some other apps
fail to properly load and render the text?
The problem is that there is no standard file type or file-name
extension that distinguishes a UTF-8 text file from a file in some
legacy encoding, like MacRoman. For backwards compatibility, many
text file readers will assume that an unknown file is in some common
legacy encoding rather than UTF-8, unless you specifically tell them
otherwise (through some option in the Preferences or file-open
dialog). In addition, if you're using UTF-16, then endian issues
come into play: PCs usually write the low-order byte of each
character first, while other computers write the high-order byte
first. Getting the endianness wrong will turn a UTF-16 file into
gibberish.
However, there is a trick that may help in both cases. You can add a
special character known as a "Byte Order Mark" (or BOM for short) to
the beginning of the file. This is character U+FEFF, which normally
means "zero width non-breaking space". Many apps will interpret this
character at the start of a file as a signature indicating a Unicode
file with a particular encoding and endianness. And those that don't
should simply render it as an invisible character.
To use this to tag a UTF-8 file, just write
Encodings.UTF8.Chr(&hFEFF) as the first character in the file. The
file name should end in ".txt" in this case. For example:
dim f as FolderItem
dim outp as TextOutputStream
f = GetFolderItem("sample.txt")
outp = f.CreateTextFile
outp.Write Encodings.UTF8.Chr(&hFEFF) // the BOM signature
outp.Write ConvertEncoding(myData, Encodings.UTF8) // myData: the text to save
For a UTF-16 file, you would use Encodings.UTF16.Chr(&hFEFF) as the
first character in the file, and the name should end in ".utxt".
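The UTF-16 version would look much like the UTF-8 example above
(again, "sample.utxt" and myData are just placeholders):
dim f as FolderItem
dim outp as TextOutputStream
f = GetFolderItem("sample.utxt")
outp = f.CreateTextFile
outp.Write Encodings.UTF16.Chr(&hFEFF) // the BOM signature
outp.Write ConvertEncoding(myData, Encodings.UTF16)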
When reading the file back in, be sure to check whether the first
character of the first line equals the BOM character, and if so,
strip it off like so:
dim data as string
data = inp.ReadAll // inp: a TextInputStream opened on the file
if Left(data, 1) = Encodings.UTF8.Chr(&hFEFF) then
  data = Mid(data, 2) // strip the BOM signature
end if
The above would work for a UTF-8 file; for a UTF-16 file it would be similar.
For more information on the BOM, see:
<http://www.unicode.org/unicode/faq/utf_bom.html>
11. How do I assign the encoding of a string when I read it from a
file, socket, etc.?
The easiest way is to use REALbasic 5.2 or later, where all Read
methods take an optional "encoding" parameter. Simply pass in an
encoding (e.g. Encodings.UTF8), and the string read will be defined
as that encoding. In addition, a TextInputStream has an encoding
property, which defaults to UTF-8; any strings returned by the stream
will be defined as that encoding, unless you override it by passing
an encoding to the Read method.
In RB 5.0 or 5.1, these facilities are not available, so you must
instead use DefineEncoding after reading your string.
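A quick sketch of both approaches (assuming f is the FolderItem for
your file and that the file contains UTF-8 text):
dim inp as TextInputStream
dim s as string
inp = f.OpenAsTextFile
// RB 5.2 or later: pass the encoding straight to the read call
s = inp.ReadAll(Encodings.UTF8)
// RB 5.0 or 5.1: read first, then tag the bytes afterwards
// s = inp.ReadAll
// s = DefineEncoding(s, Encodings.UTF8)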
--
,------------------------------------------------------------------.
| Joseph J. Strout REAL Software, Inc. |
| joe-***@public.gmane.org http://www.realsoftware.com |
`------------------------------------------------------------------'