Frequently Asked Questions
about Text Encoding in REALbasic 5
----------------------------------
0. What's all this about text encodings? Isn't a string just a string?
Not to a computer. A computer contains only numbers; in particular,
it contains bytes, each of which has a value from 0 to 255. A string
contains a bunch of bytes. But usually, you don't want to think of
them as bytes; you want to think of them as representing some text
which you can read, find words in, and so on. So early computer
developers had to answer the question: how are we going to map these
numbers into text? The mapping from numbers to text (and vice versa)
is known as a text encoding.
The earliest and most widely accepted standard encoding is known as
"ASCII" (American Standard Code for Information Interchange). ASCII
values range from 0 to 127, and represent all the uppercase and
lowercase English letters, the digits from 0-9, and a variety of
common punctuation marks. The text you're reading can be represented
entirely in ASCII.
But ASCII is insufficient for representing text in almost any
language other than English. For example, it has no accented
characters needed by many European languages. So a number of
extensions to ASCII were made that use byte values 128-255 to
represent additional characters. But different people need
different additional characters, so there are a wide variety of
single-byte encodings of this sort: MacRoman, MacIcelandic,
ISO-Latin-1, and so on.
And for many languages, 256 different characters just isn't enough.
Japanese and Chinese require tens of thousands of characters, for
example. Again there were various encodings developed to solve this
problem, such as MacJapanese and Shift-JIS. These use one byte for
each ASCII character, but two bytes for any non-ASCII character (and
are commonly referred to as "double-byte character systems").
But this proliferation of encodings presents a big problem: you can
no longer tell what text a bunch of bytes is meant to represent, just
by looking at the bytes. You need to also know the encoding
associated with them. So, in REALbasic, every string contains both
the bytes and the encoding (which may be nil if the encoding is
unknown or undefined). This is how REALbasic knows how to draw the
string as text, do text-based operations like InStr and Mid, etc.
In recent years, an industry group called the Unicode Consortium has
been developing a new standard designed to supersede all the others.
Unicode can represent every character in every writing system on the
planet, all in one encoding. This has been embraced by all major OS
vendors, and also by REALbasic. Unicode allows you to represent
different languages (e.g., a mixture of Greek and Japanese) in one
string, and eventually it will mean that you can safely assume that
any text you receive is Unicode, rather than having to handle
hundreds of possible encodings. (See http://www.unicode.org/ for
more info straight from the source.)
1. OK, so a Unicode string is a Unicode string?
Well, not quite. Unicode is a mapping between numbers and text. But
we still have to map the numbers into bytes, since a byte can contain
only 0-255 and Unicode values can be much larger. REALbasic supports
two different formats of Unicode: UTF-8 and UTF-16. See question 2.
2. What encoding are my string literals, constants, etc. in?
All strings in your REALbasic project are compiled as UTF-8. This is
a Unicode encoding that uses one byte for ASCII characters, and up to
four bytes for non-ASCII characters. It has a number of other handy
properties too, for example, an ASCII character will never appear as
part of a multi-byte character.
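For example (a rough sketch, assuming the literal below is stored as
UTF-8, where the accented character takes two bytes), Len counts
characters while LenB counts raw bytes:
dim s as string
s = "café"
MsgBox Str(Len(s)) + " characters, " + Str(LenB(s)) + " bytes" // shows "4 characters, 5 bytes"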
3. So when do I need to care about encodings?
Usually, not at all. However, when you receive text data from some
outside source, such as a database or file, you need to let REALbasic
know what encoding that text is in. You can use the DefineEncoding
method to do this, or in RB 5.1 or later, you can set the Encoding
property of the TextInputStream, or use the optional encoding
parameter of functions like Read, ReadLine, and ReadAll.
And sometimes, you'll need to provide text to another app which
requires it to be in a certain encoding. In that case, use
ConvertEncoding to change your text into that other encoding.
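A minimal sketch of both calls (assuming rawData holds bytes you read
from a MacRoman file, and the receiving app wants UTF-8):
dim s as string
s = DefineEncoding(rawData, Encodings.MacRoman) // tag the bytes with their real encoding
s = ConvertEncoding(s, Encodings.UTF8) // re-encode the text as UTF-8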
4. Which is faster, ConvertEncoding or TextConverter.Convert?
In most cases, ConvertEncoding is much faster than using
TextConverter.Convert. ConvertEncoding has a number of optimizations
for common cases, such as converting the same string multiple times,
or converting from one superset of ASCII to another. (All
WorldScript encodings, most Windows encodings, and UTF-8 are
supersets of ASCII.)
So, you should usually use ConvertEncoding, but if you really need
the speed then you should just measure it both ways and see which
performs better in your particular situation.
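A minimal timing sketch, assuming bigString holds a representative
sample of your data (Microseconds gives a simple elapsed-time figure):
dim t as double
dim result as string
t = Microseconds
result = ConvertEncoding(bigString, Encodings.UTF8)
MsgBox "ConvertEncoding took " + Str(Microseconds - t) + " microseconds"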
5. How do I get a specific byte into a string?
Use ChrB. ChrB takes a byte value (0-255) and returns a string with
undefined encoding, containing exactly that byte. You can build a
string containing multiple bytes by just adding these together.
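For example, this builds a three-byte string one byte at a time (its
encoding will be nil):
dim s as string
s = ChrB(200) + ChrB(65) + ChrB(13) // three arbitrary bytes, no defined encoding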
Of course, don't expect such a string to display as text in any
sensible way. If you want to make text, see the next question.
6. How do I get a specific character by its code point (or "ASCII value")?
Use TextEncoding.Chr. This returns a one-character string with the
character you specified by its code point within that encoding. For
example, a capital A in the ASCII character set would be:
s = Encodings.ASCII.Chr(65)
A copyright symbol represented in UTF-8 would be:
s = Encodings.UTF8.Chr(169)
7. How do I find the code point of a given character?
Use the Asc function. This returns the code point of the first
character of the given string, in the encoding of that string. So,
for example, if you have a string s in any variant of Unicode, then
Asc(s) is the Unicode code point of the first character of s.
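Combining this with the example from question 6, a small sketch:
dim s as string
s = Encodings.UTF8.Chr(169) // the copyright symbol, as in question 6
MsgBox Str(Asc(s)) // displays 169, its Unicode code point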
8. What encoding do I get when I add two strings together?
When you concatenate two strings (e.g. A + B), if the two have the
same encoding, then the result is in the same encoding. If one
encoding is a superset of the other -- e.g., as MacRoman is a
superset of ASCII -- then the result is that encoding (MacRoman in
our example). Note that most encodings, with the notable exception
of UTF-16, are supersets of ASCII, so in most cases adding an ASCII
string to some other string will result in the encoding of that other
string. Finally, if you add two strings of incompatible encodings --
say, MacRoman and UTF-8, or MacJapanese and MacIcelandic -- then both
strings will be converted internally to UTF-8, and the result will be
represented in UTF-8.
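A small sketch of the superset rule (ConvertEncoding is used here
just to force the encodings we want for the illustration):
dim a, b, c as string
a = ConvertEncoding("plain text", Encodings.ASCII)
b = ConvertEncoding("résumé", Encodings.MacRoman)
c = a + b // the result is in MacRoman, since MacRoman is a superset of ASCII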
9. How do I find out what encoding a string is in?
Use the Encoding function, which can be written in either of two ways. Like this:
enc = Encoding(s)
or like this:
enc = s.Encoding
This returns a TextEncoding object, or if the string's encoding is
undefined, it returns nil.
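For example, you might use this to tag data whose encoding you know
only from context (assuming here that s holds bytes from a MacRoman
source):
dim enc as TextEncoding
enc = s.Encoding
if enc = nil then
  s = DefineEncoding(s, Encodings.MacRoman)
end if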
10. When I write Unicode text to a text file, why do some other apps
fail to properly load and render the text?
The problem is that there is no standard file type or file-name
extension that distinguishes a UTF-8 text file from a file in some
legacy encoding, like MacRoman. For backwards compatibility, many
text file readers will assume that an unknown file is in some common
legacy encoding rather than UTF-8, unless you specifically tell them
otherwise (through some option in the Preferences or file-open
dialog). In addition, if you're using UTF-16, then endian issues
come into play: PCs usually write the low-order byte of each
character first, while other computers write the high-order byte
first. Getting the endianness wrong will turn a UTF-16 file into
gibberish.
However, there is a trick that may help in both cases. You can add a
special character known as a "Byte Order Mark" (or BOM for short) to
the beginning of the file. This is character U+FEFF, which normally
means "zero width non-breaking space". Many apps will interpret this
character at the start of a file as a signature indicating a Unicode
file with a particular encoding and endianness. And those that don't
should simply render it as an invisible character.
To use this to tag a UTF-8 file, just write
Encodings.UTF8.Chr(&hFEFF) as the first character in the file. The
file name should end in ".txt" in this case. For example:
dim f as FolderItem
dim outp as TextOutputStream
f = GetFolderItem("sample.txt")
outp = f.CreateTextFile
outp.Write Encodings.UTF8.Chr(&hFEFF) // the BOM signature
outp.Write ConvertEncoding(myData, Encodings.UTF8) // myData: the text to save
For a UTF-16 file, you would use Encodings.UTF16.Chr(&hFEFF) as the
first character in the file, and the name should end in ".utxt".
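The UTF-16 version would look much like the UTF-8 example above
(again, "sample.utxt" and myData are just placeholders):
dim f as FolderItem
dim outp as TextOutputStream
f = GetFolderItem("sample.utxt")
outp = f.CreateTextFile
outp.Write Encodings.UTF16.Chr(&hFEFF) // the BOM signature
outp.Write ConvertEncoding(myData, Encodings.UTF16)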
When reading the file back in, be sure to check whether the first
character of the first line equals the BOM character, and if so,
strip it off like so:
dim data as string
data = inp.ReadAll // inp: a TextInputStream opened on the file
if Left(data, 1) = Encodings.UTF8.Chr(&hFEFF) then
  data = Mid(data, 2) // strip the BOM signature
end if
The above would work for a UTF-8 file; for a UTF-16 file it would be similar.
For more information on the BOM, see:
<http://www.unicode.org/unicode/faq/utf_bom.html>
11. How do I assign the encoding of a string when I read it from a
file, socket, etc.?
The easiest way is to use REALbasic 5.2 or later, where all Read
methods take an optional "encoding" parameter. Simply pass in an
encoding (e.g. Encodings.UTF8), and the string read will be defined
as that encoding. In addition, a TextInputStream has an encoding
property, which defaults to UTF-8; any strings returned by the stream
will be defined as that encoding, unless you override it by passing
an encoding to the Read method.
In RB 5.0 or 5.1, these facilities are not available, so you must
instead use DefineEncoding after reading your string.
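A quick sketch of both approaches (assuming f is the FolderItem for
your file and that the file contains UTF-8 text):
dim inp as TextInputStream
dim s as string
inp = f.OpenAsTextFile
// RB 5.2 or later: pass the encoding straight to the read call
s = inp.ReadAll(Encodings.UTF8)
// RB 5.0 or 5.1: read first, then tag the bytes afterwards
// s = inp.ReadAll
// s = DefineEncoding(s, Encodings.UTF8)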
--
,------------------------------------------------------------------.
| Joseph J. Strout REAL Software, Inc. |
| joe-***@public.gmane.org http://www.realsoftware.com |
`------------------------------------------------------------------'