Using Unicode
When you adapt an application to a different locale, the character set often changes. One way to deal with multiple character sets is to use multiple code pages, one for each language or group of languages. However, this approach forces you to deal with code-page conversion and code-page incompatibility, explained in "Understanding Code Pages" and "Using Multi-byte Code Pages." Another approach is to use a single code page that includes all characters of all languages of the world. This is the idea behind Unicode, which is the focus of this chapter.

This chapter includes the following sections:

Unicode overview
Why use Unicode
Using Unicode with OpenEdge products
Using Unicode with OpenEdge databases
Using Unicode with OpenEdge applications
Guidelines for using Unicode
Unicode support for supplementary characters
Unicode overview
An evolving standard, Unicode defines a single code page that includes most symbols—letters, ideograms, syllabics (such as the Japanese Kana symbols), punctuation, diacritics, mathematical symbols, technical symbols, and so on—from most of the languages of the world, and assigns each symbol a numeric value—originally, a number between zero and 65,535, the range of an unsigned 16-bit integer.

As it turned out, Unicode’s original limit of 65,536 symbols proved too small, and the limit was extended to well over 1,000,000 symbols. Several ways of encoding each symbol were defined, and the encodings were designed so that you can convert from one to another any number of times without losing any information. For more information on the algorithms for converting between encodings, see the Unicode Web site, http://www.unicode.org. OpenEdge supports Unicode’s UTF-8 encoding. In addition, all varieties of UTF-16 and UTF-32 are supported for input and output and for LONGCHARs and CLOBs.
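
For example, the following minimal 4GL sketch (assuming -cpinternal is set to UTF-8) round-trips a string through UTF-16 without losing information; the FIX-CODEPAGE attribute tells OpenEdge to store the LONGCHAR in UTF-16:

DEFINE VARIABLE lcData AS LONGCHAR NO-UNDO.
DEFINE VARIABLE cBack  AS CHARACTER NO-UNDO.

FIX-CODEPAGE(lcData) = "UTF-16".   /* store the LONGCHAR as UTF-16          */
lcData = "Grüße".                  /* converted from UTF-8 on assignment    */
cBack  = STRING(lcData).           /* converted back to UTF-8, losslessly   */
MESSAGE cBack VIEW-AS ALERT-BOX.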

Why use Unicode
The Unicode approach, with its UTF-8 encoding and multi-byte characters, might seem complicated. But the other approach, using multiple code pages, can be even more complicated.

The limits of multiple code pages
Applications that use multiple code pages are often difficult to design, deploy, configure, and run, for the following reasons:

You must often convert data from one code page to another.
Before you perform a code page conversion, you must determine whether the source and target code pages are compatible.
You must get application components that use incompatible code pages to read, write, and display each other’s data, which can be difficult or impossible.
This applies especially to applications that read and write multi-lingual data to a database and then display the data on client monitors and on printed reports.
The advantages of Unicode
Unicode has many advantages, as listed in Table 9–1.


Table 9–1: Unicode’s advantages

Simplified application development: When an application component uses Unicode, all symbols needed by the application for reading and writing character data reside in a single code page. This simplifies application development enormously.

Ease of migration of existing code: UTF-8 includes the traditional ASCII characters in its first 128 positions and assigns each of these characters its traditional ASCII value. This simplifies adapting existing ASCII applications to Unicode.

Ease of providing shared access to data: OpenEdge clients that use incompatible code pages can easily read and write a single UTF-8 database. OpenEdge automatically converts the code page as data passes between the client and the database.

Ease of worldwide deployment: UTF-8 databases and r-code files are multi-lingual, so they can be deployed worldwide.

Interoperability: ActiveX and Java clients are Unicode based, so they can communicate with UTF-8 databases and AppServers.

Web compatibility: Unicode is becoming the universal code page of the Web; current Web standards require and rely on it.

Multi-lingual applications: Applications using Unicode can support multiple languages in the data, the user interface, and reports.
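
The ASCII compatibility noted in the migration entry above is easy to verify. A one-line sketch, assuming -cpinternal is set to UTF-8:

DISPLAY ASC("A").   /* 65, the same value as in ASCII */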



Using Unicode with OpenEdge products
Some OpenEdge products can run directly in Unicode (that is, when -cpinternal is set to UTF-8). Other OpenEdge products cannot, but can run in a code page that OpenEdge can convert to and from Unicode.

The following products can run in Unicode:

ActiveX client
AppServer
Batch client (GUI and character)
Database server
4GL compiler and r-code
4GL GUI client (in interactive and batch modes)
GUI Procedure Editor
Java client
SQL
Using Unicode with OpenEdge databases
Using UTF-8 with OpenEdge databases involves several techniques, covered in the following sections:

Converting an OpenEdge database to UTF-8 using the PROUTIL CONVCHAR utility
Converting an OpenEdge database to UTF-8 using dump and load
Compiling, storing, and applying the UTF-8 word-break rules to a database
Converting an OpenEdge database to UTF-8 using the PROUTIL CONVCHAR utility
To convert an OpenEdge database to UTF-8 using the PROUTIL CONVCHAR utility:
Caution: Before you begin, back up your database.
Convert the database to UTF-8 using the following syntax:

Syntax proutil database-name -C convchar convert utf-8


Load the collation data definition (.df) file using the syntax for your operating system:

Windows syntax %DLC%\prolang\utf\filename.df
UNIX syntax $DLC/prolang/utf/filename.df


For the UTF-8 BASIC collation, use the _tran.df collation data definition file. For an International Components for Unicode (ICU) collation, use one of the collation data definition files prefixed with “ICU” (such as ICU-cs.df used for Czech databases).
Compile, store, and apply the UTF-8 word-break rules to your database. For complete instructions, see the "Compiling, storing, and applying the UTF-8 word-break rules to a database" section.
Rebuild the indexes using the following syntax:

Syntax proutil database-name -C idxbuild
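
For example, converting a hypothetical UNIX database named sports with the BASIC collation follows the four steps above:

proutil sports -C convchar convert utf-8
(load $DLC/prolang/utf/_tran.df with the Data Administration utility)
(compile, store, and apply the UTF-8 word-break rules, as described below)
proutil sports -C idxbuild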


Converting an OpenEdge database to UTF-8 using dump and load
To convert an OpenEdge database to UTF-8 using the dump and load utilities:
Caution: Before beginning, back up your database.
Caution: Do not use binary dump and load.
Dump the schema and data of the existing database using the Data Administration utility.
(From the Procedure Editor main menu, select Tools→Data Administration; then, from the Data Administration menu, select Admin→Dump Data and Definitions.)
Create a new, empty UTF-8 database using the syntax for your operating system:

Windows syntax prodb database-name %DLC%\prolang\utf\empty.db
UNIX syntax prodb database-name $DLC/prolang/utf/empty.db


Note: By default, OpenEdge automatically loads the UTF-8 BASIC collation (the _tran.df collation data definition file) into the empty UTF-8 database.
If you want to use an International Components for Unicode (ICU) collation, load the collation data definition (.df) file using the syntax for your operating system:

Windows syntax %DLC%\prolang\utf\filename.df
UNIX syntax $DLC/prolang/utf/filename.df


ICU collation data definition files are prefixed with “ICU” (such as ICU-cs.df used for Czech databases).
Compile, store, and apply the UTF-8 word-break rules to the database. For complete instructions, see the "Compiling, storing, and applying the UTF-8 word-break rules to a database" section.
Load the schema and data to the database using the Data Administration utility.
(From the Procedure Editor main menu, select Tools→Data Administration; then, from the Data Administration menu, select Admin→Load Data and Definitions.)
Note: When the data are loaded, the indexes are automatically rebuilt.
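
For example, on UNIX, the conversion of a hypothetical database named sports might run as follows (the database name is illustrative; the dump and load steps run inside the Data Administration utility):

(dump the data and definitions from sports with the Data Administration utility)
prodb sports-utf $DLC/prolang/utf/empty.db
(apply the UTF-8 word-break rules to sports-utf, as described below)
(load the data and definitions into sports-utf with the Data Administration utility)
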
Compiling, storing, and applying the UTF-8 word-break rules to a database
When you convert an existing database to UTF-8, whether you use the PROUTIL CONVCHAR utility or the DUMP and LOAD utilities, you must compile, store, and apply the UTF-8 word-break rules to the database.

If you forget to apply the word-break rules to your database, you might get the following symptoms:

Queries with the CONTAINS operator return incorrect results.
QBW syntax errors stating that an asterisk (*) is allowed only at the end of a word.
To compile, store, and apply the UTF-8 word-break rules to a database:
Compile a new version of the word-break table for UTF-8 using the syntax for your operating system:

Windows syntax proutil -C wbreak-compiler %DLC%\prolang\convmap\utf8-bas.wbt number
UNIX syntax proutil -C wbreak-compiler $DLC/prolang/convmap/utf8-bas.wbt number


where number indicates an INTEGER between 1 and 255.
This produces a new word-break table, proword.number.
Store the new word-break table using one of the following techniques:
Store it in the $DLC directory on UNIX or in the %DLC% directory on Windows.
Store it in an arbitrary directory, then set the environment variable PROWDnumber to that directory's path (as in the example after this procedure).
Apply the new word-break rules to the database using the following syntax:

Syntax proutil database-name -C word-rules number
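
For example, on UNIX, with a hypothetical word-break table number of 120, an arbitrary storage directory, and a database named mydb:

proutil -C wbreak-compiler $DLC/prolang/convmap/utf8-bas.wbt 120
mv proword.120 /usr/wrk/wbreak
PROWD120=/usr/wrk/wbreak; export PROWD120
proutil mydb -C word-rules 120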


Using Unicode with OpenEdge applications
You can create and run a character or graphical client application that uses Unicode. Because the OpenEdge database server and the OpenEdge graphical client support Unicode’s UTF-8 encoding but the OpenEdge interactive character client does not, OpenEdge applies different sets of rules for the character client and for the graphical client.

Rules for using Unicode with the OpenEdge character client
The following rules apply when using Unicode with the OpenEdge character client:

An interactive character client must start up in a code page other than UTF-8.
You must ensure a character client accesses only records in a compatible code page.
You must follow guidelines for multi-byte programming, such as distinguishing characters, bytes, and columns (see the sketch below).
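
For example, the 4GL LENGTH function takes a type argument precisely so that you can distinguish the three measures. A minimal sketch, assuming -cpinternal is set to UTF-8:

DEFINE VARIABLE cText AS CHARACTER NO-UNDO INITIAL "日本語".

DISPLAY LENGTH(cText, "CHARACTER")  /* 3 characters       */
        LENGTH(cText, "RAW")        /* 9 bytes in UTF-8   */
        LENGTH(cText, "COLUMN").    /* 6 display columns  */
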
Rules for using Unicode with the OpenEdge graphical client
You can use Unicode throughout your graphical application. When -cpinternal is set to UTF-8, the development and runtime environments for the 4GL graphical client are fully Unicode-enabled.

The following rules apply when using Unicode with the OpenEdge graphical client:

A graphical client may start up in the UTF-8 code page. If not, you must ensure the graphical client accesses only records in a compatible code page.
You must follow guidelines for multi-byte programming, such as distinguishing characters, bytes, and columns.
Specify which Unicode-enabled editor you want to use by setting the UseSourceEditor option in progress.ini. Set this option to NO to use RichEdit (which is fully Unicode-enabled) or YES to use SlickEdit (the default). This setting applies to the Procedure Editor and the AppBuilder Section Editor; it does not apply to the Progress 4GL Editor widget.
Use a Unicode font to display and print Unicode data. Specify a Unicode font by setting font options (such as DefaultFont and PrinterFont) in progress.ini.
Note: On Windows, you might also need to specify Unicode fonts on the Appearance tab in the Display Properties dialog box (accessed through the Windows Control Panel).
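
For example, a progress.ini fragment such as the following selects the RichEdit editor and a Unicode default font (a hypothetical sketch; the section name and the font are illustrative, so verify them against your installation):

[Startup]
UseSourceEditor=no
DefaultFont=Arial Unicode MS, size=9
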
Unicode application example
Creating and running an application that uses Unicode is not difficult. The following is an example of creating and running an application consisting of a database server, a graphical client, and a UTF-8 database.

To create the application:
Convert the database to Unicode using one of the techniques in the "Using Unicode with OpenEdge databases" section.
Design queries that access only records that use the client’s code page.
One way to do this is for tables to have a field indicating the record’s code page. When records are added, the field is populated. When the database is queried, the query references the code page field to return only those records in the client’s code page.
Imagine that in the Sports database, the Customer table has a field, db-language, indicating the code page or language of the record. A client whose language corresponds to the value of the variable user-language might submit a query like the following:

FOR EACH customer WHERE SESSION:CPINTERNAL = "UTF-8" OR
  db-language = user-language:
  DISPLAY name address city country comments.
END.


Start an OpenEdge database server, setting the server’s code page to UTF-8. The following command fragment illustrates this:

proserve -cpinternal utf-8 -cpstream ...


Start a client in the native code page (perhaps ISO8859-15). Set -cpinternal, -cpstream, and the other code-page-related startup parameters to this code page. The following command illustrates this:

prowin32 -cpinternal iso8859-15 -cpstream iso8859-15


Your Unicode application is up and running.
Guidelines for using Unicode
When you use Unicode in OpenEdge applications, the following restrictions, cautions, and suggestions apply:

With the OpenEdge UTF-8 BASIC collation, composed and decomposed characters are treated as different characters. With the International Components for Unicode (ICU) collations, composed and decomposed characters are treated as the same character for comparisons and indexes.
The OpenEdge UTF-8 BASIC collation provides for sorting Unicode data in binary order. Alternatively, the ICU collations provide for sorting Unicode data based on the language-specific requirements for a locale.
Note: You can specify a Progress collation or an ICU collation for sorting data using either the Collation Table (-cpcoll) startup parameter, or the COLLATE option on the FOR statement, the OPEN QUERY statement, and the PRESELECT phrase. For more information on the -cpcoll startup parameter, see OpenEdge Deployment: Startup Command and Parameter Reference. For more information on the 4GL elements, see OpenEdge Development: Progress 4GL Reference.
For information about using ICU collations as database collations, see "Using Databases."
Before sorting Unicode data with the UTF-8 BASIC collation, normalize the data using the 4GL NORMALIZE function. Normalizing the data converts it into a standardized form that allows for more accurate and consistent sorting and indexing. This is important when working with characters or sequences of characters that have multiple representations (for example, base characters and combining characters) because it ensures that equivalent strings have a unique binary representation. For more information on the 4GL NORMALIZE function, see OpenEdge Development: Progress 4GL Reference. A short sketch appears after this list.
Note: When sorting Unicode data with an ICU collation, you do not need to normalize the data.
When UTF-8 data contains decomposed characters, you cannot convert it to a single-byte code page. You must first compose the data using the 4GL NORMALIZE function. When you convert data from a single-byte code page to Unicode, the result is always composed data.
OpenEdge supports code-page conversion to and from UTF-8 the same way it supports code-page conversion to and from other code pages. For more information on code-page conversion, see "Understanding Code Pages," and "Understanding Character Processing Tables."
When an existing database is converted to UTF-8, the amount of storage required by each non-ASCII character increases. Roughly, each non-ASCII Latin-alphabet character converted to UTF-8 tends to require two bytes, while each double-byte Chinese, Japanese, or Korean character converted to UTF-8 tends to require three bytes.
To display and print Unicode data, consider using a Unicode font; such fonts are available commercially.
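
The following sketch, referenced in the normalization guideline above, shows both normalization and the storage growth. It assumes -cpinternal is set to UTF-8 and that NORMALIZE accepts the standard Unicode form names (such as "NFC"); the CHR form used here is documented later in this chapter:

DEFINE VARIABLE cDecomposed AS CHARACTER NO-UNDO.
DEFINE VARIABLE cComposed   AS CHARACTER NO-UNDO.

cDecomposed = "e" + CHR(769, "utf-8", "utf-32").  /* "e" + U+0301 combining acute */
cComposed   = NORMALIZE(cDecomposed, "NFC").      /* the single composed "é"      */

DISPLAY LENGTH(cDecomposed, "RAW")   /* 3 bytes: 1 for "e", 2 for the accent */
        LENGTH(cComposed, "RAW").    /* 2 bytes for the composed character   */
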
Unicode support for supplementary characters
OpenEdge supports Unicode supplementary characters. These are Unicode characters whose codepoints are in the supplementary planes 1-16; that is, codepoints from U+10000 to U+10FFFF. In UTF-8 encoding, these are 4-byte UTF-8 values, with lead bytes ranging from 0xF0 to 0xF4.

OpenEdge supports the UTF-16 and UTF-32 transformation formats as OpenEdge code pages that can be used for conversions. For example, in the following ASC and CHR functions:


ASC( '~U0254AE', "utf-32", "utf-8" )
CHR( 152750, "utf-8", "utf-32")


Note that the decimal value of 0x254AE is 152750.
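
A minimal round-trip sketch, assuming -cpinternal is set to UTF-8, shows that the two calls invert each other:

DEFINE VARIABLE iScalar AS INTEGER NO-UNDO.

iScalar = ASC( '~U0254AE', "utf-32", "utf-8" ).              /* 152750                 */
MESSAGE CHR( iScalar, "utf-8", "utf-32" ) VIEW-AS ALERT-BOX. /* the original character */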

OpenEdge also supports UTF-16 for conversions. UTF-16 conversion handles supplementary characters, and conversions within ASC and CHR properly treat 2-byte UTF-16 code units as unsigned shorts.

Using UTF-16 in the ASC and CHR functions
To use UTF-16 in the ASC function, use the following syntax:


Syntax ASC( ch, "UTF-16" [, source-cp ] )


ch
One character, like 'A', "A", or '~U034e'.
source-cp
The name of the source code page.
This function returns one of the following platform-independent values:

An integer less than 65536, representing the Unicode scalar value in plane 0.
A long integer with the high surrogate in the high-order 16 bits and the low surrogate in the low-order 16 bits.
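
For reference, the combined value follows the standard Unicode surrogate arithmetic. The following sketch works that math out for U+254AE; it is plain arithmetic, not an OpenEdge API:

DEFINE VARIABLE iCp   AS INTEGER NO-UNDO INITIAL 152750.  /* U+254AE */
DEFINE VARIABLE iHigh AS INTEGER NO-UNDO.
DEFINE VARIABLE iLow  AS INTEGER NO-UNDO.

iHigh = 55296 + TRUNCATE((iCp - 65536) / 1024, 0).   /* 0xD800 + top 10 bits */
iLow  = 56320 + ((iCp - 65536) MODULO 1024).         /* 0xDC00 + low 10 bits */
DISPLAY iHigh iLow.   /* 55381 (0xD855) and 56494 (0xDCAE) */
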
To use UTF-16 in the CHR function, use the following syntax:


Syntax CHR( n, target-cp, "UTF-16" )


n
The numeric value of the character.
target-cp
The name of the target code page.
This function interprets its input as one of the following platform-independent values:

The Unicode scalar value for plane 0.
An integer composed of the high surrogate in the high-order 16 bits and the low surrogate in the low-order 16 bits.
When you use UTF-16 in CODEPAGE-CONVERT, the behavior of character strings treated as UTF-16 is sensitive to byte order and to the presence of null bytes. The 4GL can handle UTF-16 strings as RAW or MEMPTR data, and using PUT-UNSIGNED-SHORT and GET-UNSIGNED-SHORT resolves any byte-order machine dependencies.
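
For example, the following minimal sketch writes and reads one UTF-16 code unit through a MEMPTR, sidestepping byte-order dependence:

DEFINE VARIABLE mBuf AS MEMPTR NO-UNDO.

SET-SIZE(mBuf) = 2.                  /* room for one UTF-16 code unit */
PUT-UNSIGNED-SHORT(mBuf, 1) = 65.    /* the code unit for "A"         */
MESSAGE GET-UNSIGNED-SHORT(mBuf, 1) VIEW-AS ALERT-BOX.
SET-SIZE(mBuf) = 0.                  /* release the memory            */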

Behavior of ASC and CHR functions
In the following example, assume -cpinternal is UTF-8 and the GUI code page is 1252:


ASC( '~U0254AE', "utf-32", "utf-8" )


This displays -191,978,590.

CHR accepts DECIMAL or INTEGER input values, so supplementary characters can be entered as positive values greater than the maximum value a Progress INTEGER allows. The following examples display the multi-byte character consisting of the bytes 0xF4, 0x8E, 0xA3, and 0xA2:


DISPLAY CHR( 4102988706 ).


Or


DISPLAY CHR( -191978590 ).


New and modified keywords
This section describes how to enter Unicode codepoints in 4GL code, in support of supplementary characters.

To input Unicode scalar codepoints in plane 0 (U+0000 to U+FFFF), use this syntax:


Syntax ~uXXXX


XXXX
A four-digit, case-insensitive hexadecimal value.
To input Unicode scalar codepoints in planes 0 – 16 (U+0000 to U+10FFFF), use this syntax:


Syntax ~UXXXXXX


XXXXXX
A six-digit, case-insensitive hexadecimal value.
When the 4GL code is parsed, this character value is converted from Unicode to -cpinternal. If the character is not a valid character in -cpinternal, the entire escaped string is passed through. For example, if -cpinternal is 1252, ~u4E00 is passed to the 4GL as is.
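
For example, assuming -cpinternal is set to UTF-8 so that both escapes convert successfully:

DISPLAY "~u4E00"      /* a plane-0 CJK ideograph   */
        "~U0254AE".   /* a supplementary character */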

Limitations
Supplementary characters have the following restrictions:

Width — Assume a column width of 2 for each supplementary character. This works for the HKSCS code page, but not for other supplementary characters.
Word breaking — Follow the heuristic rules already in place for UTF-8. Supplementary characters are considered as separate words, like 3-byte UTF-8 characters.
Case rules — No case transformations are applied to supplementary characters. This works for HKSCS, but not for other supplementary characters. (In Unicode 3.1, Deseret characters are the only supplementary characters with case mappings.)