14.2 Character encodings

Most of the string and text library functions accept an optional parameter specifying the character encoding to use. This parameter tells the function how the strings you pass to it are internally formatted, i.e. which character encoding they use.

Normally, you shouldn't have to use this parameter at all because, starting with Hollywood 7.0, all text should be stored as UTF-8. Under certain circumstances, however, the optional character encoding parameter is necessary. For example, Hollywood strings can also contain raw binary data. Such data is generally not valid UTF-8, so the string functions will reject it. The only way to operate on this data is to tell the respective functions that it isn't UTF-8 encoded text but just a raw sequence of bytes. This can be done by passing the #ENCODING_RAW constant in the character encoding parameter.
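Why binary data fails as UTF-8 can be illustrated outside Hollywood. The following is a minimal Python sketch, not Hollywood code: the helper is_valid_utf8() and the sample byte values are made up for this illustration. It shows that arbitrary bytes usually do not decode as UTF-8, and that a UTF-8 string's character count differs from its byte count, whereas raw data is simply counted byte by byte.

```python
# Illustration only: Python, not Hollywood. All names here are hypothetical.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte sequence decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Arbitrary binary data; 0x89 cannot start a UTF-8 sequence.
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00])

# Ordinary text stored as UTF-8; the two non-ASCII letters take two bytes each.
text = "Grüße".encode("utf-8")

print(is_valid_utf8(text))       # True
print(is_valid_utf8(raw))        # False
print(len("Grüße"), len(text))   # 5 7  (characters vs. bytes)
print(len(raw))                  # 6    (raw data: one unit per byte)
```

Treating the data as a raw byte sequence sidesteps the validity check entirely, which is the role #ENCODING_RAW plays for Hollywood's string functions.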

Here is an overview of the different encodings available in Hollywood:

#ENCODING_UTF8:
This is the default encoding since Hollywood 7.0 and should be used whenever you work with text.

#ENCODING_ISO8859_1:
This was the default encoding before Hollywood 7.0. It can be useful when you need to deal with binary data or strings that aren't formatted as UTF-8. Don't be confused by the name: even though the constant is called ISO 8859-1, it can be used with all kinds of non-UTF-8 8-bit encodings. For most string library functions it makes no difference whether the encoding is ISO 8859-1 or some other charset, as long as one character is one byte, which is true for all non-UTF-8 8-bit encodings. The only exceptions are commands like UpperStr() and LowerStr(): they do all upper and lower case mapping based on the ISO 8859-1 charmap, so they won't give correct results for text in other encodings, which would require different charmaps. Because #ENCODING_ISO8859_1 can also be used with other encodings, there is also the synonym constant #ENCODING_RAW, which is semantically less misleading because it doesn't suggest that strings are in ISO 8859-1 format (see below).

#ENCODING_RAW:
This is the same as #ENCODING_ISO8859_1, but using it instead of #ENCODING_ISO8859_1 might be preferable from a semantic point of view because it doesn't suggest that strings are or must be in ISO 8859-1 encoding. Instead, it simply says that strings are treated as a sequence of raw 8-bit characters.

#ENCODING_AMIGA:
This specifies the system's default character set on AmigaOS and compatible systems. This constant is only supported by ConvertStr(), and only on AmigaOS and compatible systems. #ENCODING_AMIGA allows you to convert between the default character set of AmigaOS and UTF-8 (both ways).
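The caveat about case mapping under #ENCODING_ISO8859_1 can be made concrete with a small sketch. This is Python, not Hollywood code, and upper_latin1() is a hypothetical helper that only mimics byte-wise upper-casing based on the ISO 8859-1 charmap; it is not how UpperStr() is implemented. ISO 8859-5 (Cyrillic) is chosen here purely as an example of another 8-bit charset.

```python
# Illustration only: Python, not Hollywood. upper_latin1() is a hypothetical helper.

def upper_latin1(data: bytes) -> bytes:
    """Upper-case each byte as if it were an ISO 8859-1 character."""
    out = bytearray()
    for b in data:
        up = chr(b).upper()  # code points 0..255 coincide with ISO 8859-1
        # Keep the byte unchanged if its upper-case form is not a single
        # ISO 8859-1 character (for example the sharp s, which maps to "SS").
        out.append(ord(up) if len(up) == 1 and ord(up) < 256 else b)
    return bytes(out)

# Text that really is ISO 8859-1: the mapping is correct.
latin = "zähler".encode("iso-8859-1")
print(upper_latin1(latin).decode("iso-8859-1"))   # ZÄHLER

# Text in another 8-bit charset: still one byte per character,
# but the ISO 8859-1 charmap does not know these letters.
cyr = "дом".encode("iso-8859-5")
print(len(cyr))                                    # 3
print(upper_latin1(cyr).decode("iso-8859-5"))      # дом (unchanged, not ДОМ)
```

Length-based operations behave the same for both strings because each character occupies exactly one byte; only the case mapping depends on which charset the bytes actually represent.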

You can use the SetDefaultEncoding() function to change the default character encoding for the string and text libraries. See SetDefaultEncoding for details.

