jBASE Configuration and Properties
This section provides the configuration details such as the environment variables, functional and JQL changes, sorting order, error message files and so on pertaining to jBASE internationalization.
jBASE 4.1 provides support for code page conversion, collation sequences, and international dates and times, along with number and currency formatting for internationalization. The internationalization configuration depends on the user ID and/or the following jBASE environment variables:
- JBASE_CODEPAGE
- JBASE_LOCALE
- JBASE_TIMEZONE
The user ID configuration or environment variables have no effect if:
- The account (in which the application executes) is not configured for international mode, or
- The environment variable JBASE_I18N is not set
Application providers are responsible for the handling of all directionality issues. In international mode, jBASE library functions such as length (LEN), string comparisons (LT, LE, GT, GE) and collation order statements (LOCATE, SORT) are modified to operate on characters rather than bytes, and to use the currently configured user locale.
Environment Variables
The environment variables involved in internationalization are as follows.
Setting the JBASE_I18N variable executes the application in international mode.
You can only set the JBASE_CODEPAGE environment variable to a valid code page available with the ICU package. The jcodepages command displays the list of currently available code pages. Conversion for input and output only takes place if the account is configured for international mode or the JBASE_I18N variable is set.
It is recommended to use UTF-8 for input and output, which eliminates code page conversion and reduces system resource requirements. Several commercially available telnet clients can communicate using UTF-8, in which case the telnet client performs the conversion from the configured code page to UTF-8. It is therefore important to configure the client properly, so that the input and output code page is correct for the keyboard mapping in use.
Code page conversion is only applicable when the JBASE_I18N environment variable is set. If this variable is not set, code page conversion does not occur, and all variables will be handled as bytes and not as characters. As configuration of the international mode is on an account basis, the state of international mode can change on execution of a LOGTO command.
You can only set the JBASE_LOCALE environment variable to a valid locale available with the ICU package. The jlocales command displays the list of currently available locales. The configured locale is used only if the account is configured for international mode or the JBASE_I18N variable is set.
If the JBASE_I18N environment variable is not set, the locale is based on the underlying OS locale configuration and configured locale for the user ID has no effect. As configuration of the international mode is on an account basis, the state of international mode can change on execution of a LOGTO command. If an account is not configured for international mode, the JBASE_I18N environment variable will be unset as the result of LOGTO.
You can only set the JBASE_TIMEZONE environment variable to a valid time zone available with the ICU package. The jtimezones command displays the list of currently available time zones. The configured time zone is used only if the account is configured for international mode or the JBASE_I18N variable is set.
For example, the following environment variable configuration sets up a French-language locale specific to France, with the code page set to Latin1 (ISO-8859-1).
JBASE_I18N=1
JBASE_CODEPAGE=iso-8859-1
JBASE_LOCALE=fr_FR
If the JBASE_I18N environment variable is not set, the timezone is based on the underlying OS timezone configuration and configured timezone for the user ID has no effect. As configuration of the international mode is on an account basis, the state of international mode can change on execution of a LOGTO command. If an account is not configured for international mode, the JBASE_I18N environment variable will be unset as the result of LOGTO.
Function Changes for International Mode
Certain jBASE library functions have been modified to process data as UTF-8 encoded multi-byte sequences and to base resultant values on characters rather than bytes. Some functions change their internal functionality based on the state of international mode or the JBASE_I18N variable.
LEN, SUBSTRINGS, X[n,m], INDEX
In international mode, length and sub-string extraction work in characters and not bytes, and resultant positions are character positions and not byte offsets.
BYTELEN
The BYTELEN function has been provided to obtain the actual number of bytes rather than characters.
The following source code example contains UTF-8 encoded characters representing the German u umlaut (0xC3 0xBC) and double s (0xC3 0x9F).
X = "Füßball";* String as UTF-8 sequence "F.C3.BC.C3.9Fball"
CRT X
CRT "Character Length of X is ":LEN(X)
CRT "Byte Length of X is ":BYTELEN(X)
CRT "Substring[1,3] of X is ": X[1,3]
If executed in international mode with the Input/Output Code Page configured to ISO-8859-1 (Latin1), this code will produce the following output.
Füßball
Character Length of X is 7
Byte Length of X is 9
Substring[1,3] of X is Füß
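The character and byte counts in this output can be checked independently of jBASE. The following is a minimal Python sketch (not jBC code) that reproduces the same arithmetic on the UTF-8 encoding of the string; the variable names are illustrative only.

```python
# Illustration only: a Python stand-in for the jBC example, not jBASE code.
x = "F\u00fc\u00dfball"      # "Füßball": u umlaut and sharp s are non-ASCII

utf8 = x.encode("utf-8")     # the UTF-8 byte sequence for the string
char_len = len(x)            # character count, as LEN reports in international mode
byte_len = len(utf8)         # byte count, as BYTELEN reports
first_three = x[0:3]         # equivalent of X[1,3] (jBC positions are 1-based)

print("Character length:", char_len)
print("Byte length:", byte_len)
print("Bytes:", " ".join(f"{b:02X}" for b in utf8))
```

The two non-ASCII characters each occupy two bytes in UTF-8, which accounts for the difference between the two lengths.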
This section provides the character, collation and conversion properties required for internationalization.
Character Properties
The following are the character properties involved.
UPCASE, DOWNCASE, ALPHA, MATCHES, MATCHFIELD
In international mode, functions use the configured locale to convert and/or test character properties.
The following source code example contains a UTF-8 encoded byte sequence representing the German ‘u’ umlaut (0xC3 0xBC).
X = "ü" ;* this string held in source as UTF-8 "C3.BC"
CRT X: " becomes ": UPCASE(X)
IF ALPHA(X) THEN CRT X: " is alphabetic "
IF X MATCHES "1A" THEN CRT X: " is alphabetic "
If executed in international mode with the Input/Output Code Page configured to ISO-8859-1 and the locale configured for German (de_DE), this code will produce the following output.
ü becomes Ü
ü is alphabetic
ü is alphabetic
The following table shows the functions in the above output and their corresponding descriptions.
| Function | Description |
|---|---|
| UPCASE | Converts the lower case u umlaut to the upper case equivalent, that is, the UTF-8 byte sequence 0xC3 0xBC becomes 0xC3 0x9C. |
| ALPHA | Tests the lower case u umlaut as an alphabetic character according to the configured locale (de_DE). |
| MATCHES | Tests the lower case u umlaut against the single alphabetic character pattern according to the configured locale (de_DE). |
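The byte sequences involved in the case conversion can be verified with a short Python sketch. This is not jBC code, and Python's `upper()` and `isalpha()` follow Unicode rules rather than a configured locale, so it only illustrates the encoding of the two characters.

```python
# Illustration only: Unicode case mapping in Python, not jBASE's locale-aware UPCASE.
lower = "\u00fc"                       # lower case u umlaut
upper = lower.upper()                  # upper case U umlaut

lower_bytes = lower.encode("utf-8")    # UTF-8 form of the lower case character
upper_bytes = upper.encode("utf-8")    # UTF-8 form of the upper case character
is_alpha = lower.isalpha()             # alphabetic test on the character

print(lower_bytes.hex(), "becomes", upper_bytes.hex())
print("alphabetic:", is_alpha)
```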
Collation Properties
The following are the collation properties involved.
SORT, LOCATE, COMPARE, LE, LT, GE, GT
In international mode, statements use the configured locale to determine sort order.
A sort of the following UTF-8 encoded byte sequences using the SORT function will generate a different sort order depending on the configured locale.
locale configured for ‘en_US’
cote stored as UTF-8 sequence ‘cote’
coté stored as UTF-8 sequence ‘cot.C3.A9’
côte stored as UTF-8 sequence ‘c.C3.B4te’
côté stored as UTF-8 sequence ‘c.C3.B4t.C3.A9’
locale configured for ‘fr_FR’ (reverse accented collation)
cote
côte
coté
côté
X = "côte" ;* Source contains UTF-8 sequence "c.C3.B4te"
Y = "coté" ;* Source contains UTF-8 sequence "cot.C3.A9"
The following table lists the statement and corresponding output generated in International mode when executed with the locale configured for French (fr_FR).
| Statement | Output |
|---|---|
| IF X LT Y THEN CRT X:" is lower in collation sequence than ":Y | côte is lower in collation sequence than coté |
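The difference between the two orderings can be modelled with a simplified two-level sort key. The following Python sketch is an assumption-laden illustration, not the ICU algorithm jBASE uses: it compares base letters first and then accent flags, read left to right for the en_US style order and right to left for the French reverse accented order. The function names are hypothetical.

```python
import unicodedata

def split_levels(word):
    """Return (base letters, per-letter accent flags) for a word."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] = 1      # mark the preceding base letter as accented
        else:
            base.append(ch)
            accents.append(0)
    return "".join(base), accents

def key_forward(word):
    """Accents break ties from the start of the word (en_US style)."""
    base, accents = split_levels(word)
    return (base, accents)

def key_french(word):
    """Accents break ties from the end of the word (reverse accented collation)."""
    base, accents = split_levels(word)
    return (base, accents[::-1])

words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
forward_order = sorted(words, key=key_forward)
french_order = sorted(words, key=key_french)
print(forward_order)
print(french_order)
```

A real collation library handles case, contractions and many more levels; this model only shows why reversing the accent comparison swaps the two middle entries.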
Conversion Properties
The following are the conversion properties involved.
ICONV, OCONV, FMT
Conversions are implemented by a set of jBASE library functions, which in turn invoke functions in the IBM Public License package (ICU). This package provides cross-platform open source libraries compliant with Unicode Standard 3.0 and currently supports over 170 locales independently of the system locales. Several input and output conversions depend on the configured locale.
For example, the following source code generates different date formats based upon the configured locale when executing in international mode.
CRT OCONV(0,"D2/")
CRT OCONV(0,"D")
This code produces the following output if executed in international mode with a configured German locale (de_DE).
31/12/67
31 DEZ 1967
However, some conversions can be used to force an expected format regardless of locale. For example, the DE date format will always produce a European date format. The DG format is a new Global date format for YYYYMMDD.
CRT OCONV(0,"D2/E") displays 31/12/67
CRT OCONV(0,"DG") displays 19671231
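The two locale-independent formats can be reproduced outside jBASE as a check on the values shown. This Python sketch assumes only what the example output implies, namely that internal date 0 corresponds to 31 December 1967; the helper name is hypothetical and the code does not call any jBASE conversion.

```python
from datetime import date, timedelta

# Assumption taken from the example output: internal date 0 is 31 December 1967.
DAY_ZERO = date(1967, 12, 31)

def internal_to_date(days):
    """Convert an internal day count to a calendar date."""
    return DAY_ZERO + timedelta(days=days)

d = internal_to_date(0)
european = d.strftime("%d/%m/%y")    # same layout as the "D2/E" output
global_fmt = d.strftime("%Y%m%d")    # same layout as the "DG" output (YYYYMMDD)

print(european)
print(global_fmt)
```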
This section provides the character, timestamp, byte count and conversion additional functions required for internationalization.
The following are the character functions involved.
CHAR, SEQ
In international mode, the CHAR function supports an extended numeric range to cover 32-bit Unicode code point values. The CHAR function returns a UTF-8 encoded byte sequence for the numeric range 128-247 (0x80-0xf7) and for the range 256 and beyond. However, numeric values in the system delimiter range 248-255 (0xf8-0xff) continue to return the normal single-byte system delimiter characters. The resultant characters for numeric values in the ASCII range 0-127 (0x00-0x7f) are unchanged.
In international mode, the SEQ function supports UTF-8 encoded byte sequences representing characters in the range 0-127 (0x00-0x7f), that is, single byte characters return the normal ASCII numeric values. UTF-8 encoded byte sequences representing characters in the range 128-255 (0x80-0xff) will return the ISO-8859-1 equivalent numeric values. System delimiter characters will return numeric values in the range 248-255 (0xf8-0xff). Other UTF-8 encoded byte sequences will return the equivalent numeric value as specified by the Unicode code point.
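The ranges described for CHAR and SEQ can be summarised as a small model. The following Python sketch is a reading of the rules above, not jBASE source; `char_model` and `seq_model` are hypothetical names, and surrogate code points are not handled.

```python
def char_model(n):
    """Model of CHAR in international mode: code point number -> byte sequence."""
    if 248 <= n <= 255:
        return bytes([n])                # system delimiters stay single bytes
    return chr(n).encode("utf-8")        # ASCII stays one byte, the rest become UTF-8

def seq_model(data):
    """Model of SEQ in international mode: byte sequence for one character -> number."""
    if len(data) == 1 and data[0] >= 248:
        return data[0]                   # single-byte system delimiter
    return ord(data.decode("utf-8"))     # ASCII value or Unicode code point

for n in (65, 228, 254, 0x20AC):
    encoded = char_model(n)
    print(n, encoded.hex(), seq_model(encoded))
```

In this model the two functions are inverses of each other for every value shown, which matches the intent of the descriptions above.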
The following are the timestamp functions involved.
TIMESTAMP, TIMEDIFF, CHANGETIMESTAMP, MAKETIMESTAMP, LOCALDATE, LOCALTIME
Additional functions assist with date and time internationalization; these functions enable applications to obtain, convert and process a timestamp. These functions are available regardless of the current state of international mode.
The following table shows the functions and their corresponding descriptions.
| Function | Description |
|---|---|
| TIMESTAMP | Returns a timestamp of Coordinated Universal Time (UTC) as decimal seconds |
| TIMEDIFF | Returns the interval between two timestamps |
| CHANGETIMESTAMP | Generates a new timestamp by adjusting the supplied timestamp by a dynamic array, which specifies the adjustment values |
| MAKETIMESTAMP | Generates a timestamp using a specified time zone |
| LOCALTIME | Generates an internal time value using a supplied timestamp and time zone |
| LOCALDATE | Generates an internal date value using a supplied timestamp and time zone |
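The relationship between a UTC timestamp and the local internal date and time values can be sketched as follows. This Python model rests on several assumptions that the text above does not state: that the timestamp counts seconds from the Unix epoch, that the internal date counts days from 31 December 1967, that the internal time counts seconds since local midnight, and that a fixed UTC offset is an adequate stand-in for a named time zone. The function name is hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

DAY_ZERO = date(1967, 12, 31)   # assumed internal date 0

def local_date_time(timestamp, utc_offset_hours):
    """Model of LOCALDATE and LOCALTIME for a UTC timestamp and a fixed offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.fromtimestamp(timestamp, tz)
    internal_date = (local.date() - DAY_ZERO).days
    internal_time = local.hour * 3600 + local.minute * 60 + local.second
    return internal_date, internal_time

# The same instant seen from three offsets.
for offset in (0, 1, -1):
    print(offset, local_date_time(0, offset))
```

The point of the sketch is that one stored UTC timestamp yields different local dates and times per time zone, so applications can store a single value and localise it on output.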
The following are the byte count functions involved.
READBLK, WRITEBLK, OSBREAD, OSBWRITE
The READBLK and WRITEBLK statements are primarily aimed at device access and use a block size or byte count. Generally, device formats use binary values to describe the contents of the data blocks regardless of the underlying structure. As such, these statements continue to work on a byte rather than character basis, regardless of international mode status.
If the requirement is to read and/or write large files, it is recommended to use the READSEQ and/or WRITESEQ statements. In the default configuration, the READSEQ and WRITESEQ statements read and write one line at a time, respectively, between the file and a variable, which in turn is used on a character basis rather than bytes. This assumes that the data in the file is UTF-8 encoded. If the data in the file is ISO-8859-1 (binary) and not UTF-8 encoded, the data needs to be converted to UTF-8 using the UTF8 function.
The following are the additional functions involved.
BYTELEN, LATIN1, LENDP, UTF8
Additional functions help programs that need to know the actual byte length of a variable, and provide conversion functions for handling binary values. The conversion functions should only be required when dealing with binary data, for example when handling data to or from tape devices.
The following table shows the functions and their corresponding descriptions.
| Function | Description |
|---|---|
| BYTELEN | Returns the number of actual bytes used for the string variable. You can use this function irrespective of the international mode status. |
| LATIN1 | Converts a string variable from a UTF-8 encoded byte sequence to the ISO-8859-1 (binary) equivalent. You can use this function irrespective of the international mode status. |
| LENDP | Returns the number of character display positions required in order to display the string variable. This function determines the display width of characters. For example, the null character has a display width of zero; some Japanese Kanji characters require more than one display position, and so on. This function changes behaviour if not used in international mode. |
| UTF8 | Converts a string variable from ISO-8859-1 (binary) to the equivalent UTF-8 encoded byte sequence. You can use this function irrespective of the international mode status. |
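The two conversion directions between ISO-8859-1 and UTF-8 can be illustrated at the byte level. The following Python sketch is not jBC and does not call the jBASE functions; the helper names are hypothetical and simply name the direction of each conversion.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 (binary) bytes as a UTF-8 byte sequence."""
    return data.decode("iso-8859-1").encode("utf-8")

def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode a UTF-8 byte sequence as ISO-8859-1 (binary) bytes."""
    return data.decode("utf-8").encode("iso-8859-1")

binary = b"F\xfc\xdfball"             # 7 bytes in ISO-8859-1
as_utf8 = latin1_to_utf8(binary)      # 9 bytes in UTF-8
round_trip = utf8_to_latin1(as_utf8)  # back to the original 7 bytes

print(len(binary), len(as_utf8), round_trip == binary)
```

Every ISO-8859-1 byte has a UTF-8 form, but the reverse conversion fails for characters above 0xFF, which is one reason to limit such conversions to genuinely binary data.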
JQL Changes for International Mode
The modifications to the jBASE jQL processor required for complete internationalization capabilities are as follows.
For dates and times, the standard conversions D and MTS use the configured locale through simple date format functions. Formatting numbers through MR/ML/MD uses the locale for the thousands separator, decimal point and currency notation.
TimeStamp "W{Dx}{Tx}"
In addition, a suite of conversions is provided, available to A, F and I-types, for timestamp functionality. These conversions display a generated timestamp as a date and/or time in short, medium, long and full formats, and also support non-Gregorian locales. The meaning of the components of the conversion is as follows:
W - A new conversion code, chosen so as not to clash with existing conversions
D - Date
T - Time
x - Format option: S = Short, M = Medium, L = Long, F = Full
"WDS" or "WTS": SHORT is completely numeric, for example 12/13/52 or 3:30pm
"WDM": MEDIUM is longer, for example Jan 12, 1952
"WDL" or "WTL": LONG is longer, for example January 12, 1952 or 3:30:32pm
"WDF" or "WTF": FULL is completely specified
As a part of jBASE internationalization, jQL will now use collation tables that are specific for the user’s locale, when enabled for international mode. The keys are first passed to a lookup algorithm that converts the key into a collation key, which is tailored specifically for the user’s language. Using the collation key, the sort processor produces output in the order expected in the user’s locale.
When international mode is not enabled, the keys are sorted by the binary value of the individual characters as in the prior releases.
The primary purpose of right justified attribute definition is to produce the correct sort sequence and display properties for numeric and alphanumeric values. The use of right justified fields with completely non-numeric data affects the display and not sort order.
As part of jBASE internationalization, jQL uses a new algorithm for the right justified fields to provide optimal sorting of mixed numeric and alphanumeric fields. The field width specified in the attribute definition no longer affects the behaviour of the sort.
Pure Numeric keys
Keys are sorted from the largest negative number to the largest positive number. A single leading minus (-) or plus (+) sign may be present. Leading zeros before a decimal point and trailing zeros after a decimal point are ignored for sorting purposes. Nulls sort either before all numeric keys or as zero, depending on the emulation option. If international mode is set, the characters defined as decimal digits in the Unicode 3.0 specification (section 4.6) are sorted as numbers.
Mixed Alpha Numeric Sorting
A field can contain alpha, alphanumeric, and pure numeric values, which demands a meaningful sort order; a field containing a supplier's part number is one example. In this case, each candidate key is split into parts, alternating between numeric and non-numeric parts. Sign (+ or -) characters are valid only as the first character of the key and are treated as non-numeric if they appear in other positions. If a part is numeric, the system processes that part in the same manner as a pure numeric key. For non-numeric parts, the system does the following, based on the status of international mode.
| International mode status | Action |
|---|---|
| True | Passes non-numeric parts through the collation algorithm to produce collation key parts |
| False | Sorts the non-numeric parts left to right |
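The splitting of keys into alternating numeric and non-numeric parts can be sketched as a sort key function. The following Python example is a simplified model, not the jQL algorithm: it ignores leading signs, compares non-numeric parts by code point (as when international mode is off), and sorts numeric parts before non-numeric ones, which is an assumption of this sketch. The names are hypothetical.

```python
import re
from decimal import Decimal

# A numeric part is a run of digits with an optional fraction; anything else is non-numeric.
_PARTS = re.compile(r"\d+(?:\.\d+)?|\D+")

def mixed_key(key):
    """Split a key into parts and return a list usable as a sort key."""
    parts = []
    for part in _PARTS.findall(key):
        if part[0].isdecimal():
            parts.append((0, Decimal(part), ""))   # numeric part, compared by value
        else:
            parts.append((1, Decimal(0), part))    # non-numeric part, compared as text
    return parts

part_numbers = ["A10", "A9", "A100", "B2"]
plain_order = sorted(part_numbers)
mixed_order = sorted(part_numbers, key=mixed_key)
print(plain_order)
print(mixed_order)
```

Because numeric parts are compared by value, leading zeros do not affect the result, and "A9" sorts before "A10" as a reader of part numbers would expect.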
Data Conversion
When executing programs in international mode, the system processes all variable contents as UTF-8 encoded sequences. By default, all data must be held as UTF-8 encoded byte sequences. This means that data imported into an account configured to operate in international mode must be converted from the data's current code page to UTF-8. If all the data consists of bytes in the range 0x00-0x7f (ASCII), conversion is not necessary, as these values are already UTF-8 encoded. However, values outside the 0x00-0x7f range must be converted into UTF-8 properly to avoid ambiguity between character set code page values.
For example, the character represented by the hex value 0xE0 in the Latin2 code page, (ISO-8859-2), is described as LATIN SMALL LETTER R WITH ACUTE. However the same hex value in the Latin1 code page, (ISO-8859-1), is used to represent the character LATIN SMALL LETTER A WITH GRAVE.
To avoid this clash of code pages, the Unicode specification provides unique hex value representations for both of these characters within the specification's 32-bit value sequence.
| Unicode Value | Represents |
|---|---|
| 0x00E0 | LATIN SMALL LETTER A WITH GRAVE |
| 0x0155 | LATIN SMALL LETTER R WITH ACUTE |
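The code page clash for the byte 0xE0 can be demonstrated with a short Python sketch, using Python's standard codecs rather than any jBASE facility. It decodes the same byte under both code pages and shows that each resulting character has its own Unicode value and its own UTF-8 sequence.

```python
import unicodedata

raw = bytes([0xE0])                      # one byte, meaning depends on the code page

as_latin1 = raw.decode("iso-8859-1")     # interpreted as Latin1
as_latin2 = raw.decode("iso-8859-2")     # interpreted as Latin2

for label, ch in (("Latin1", as_latin1), ("Latin2", as_latin2)):
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())
```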
Converting the data completely from the original code page to UTF-8 also eliminates the need for on-the-fly conversions when reading from or writing to files. Such conversions would add massive and unnecessary overhead to all application processing, whereas the conversion from the original code page to UTF-8 is a one-time investment.
File Conversion
The first requirement before configuring an account and application for international mode is to convert the file data from the original code page into UTF-8 encoded byte sequences.
You also need to convert all source files containing characters in the range 0x80 through 0xFF, so that these characters are represented in UTF-8 before compilation.
The jutf8 utility helps with the file conversion. The first step is to restore the data in the normal way using a restore process working in binary mode. After the files have been restored, use the following utility on the imported data files to convert the data. The syntax of the conversion utility is as follows:
jutf8 {-options} {filename {,...} }
The following table lists the utility options and their descriptions.
| Option | Description |
|---|---|
| -c | Indicates the code page for conversion. The default value is latin1. |
| -d | Processes directories |
| -f | Indicates the force mode to skip the prompt for confirmation |
| -m MapFilePath | Uses the specified map file for conversion |
| -s | Skips sample testing for a file already converted |
| -u | Enables reverse conversion, that is, converts from UTF-8 to the code page |
| -v | Indicates the verbose mode |
The conversion utility, by default, attempts to confirm that the data is not already converted into UTF-8. Directories are skipped by default unless the -d option is explicitly specified.
Use the -m MapFilePath option to specify a file that describes the mapping of certain characters (for example, system delimiters) from their current hex value to the required hex value.
The map file describes how characters in the original file should be mapped from their current hex value to the required hex value before UTF-8 conversion. The following example maps any characters in the range 0x01-0x08 into what would normally be system delimiters before conversion to UTF-8. Therefore, character 0x04 is mapped to 0xFC and then converted to the two-byte UTF-8 encoded sequence 0xC3 0xBC, which does not clash with the system delimiter. This sequence in turn represents the Unicode value 0x00FC.
MyMapFile
#From To
0x01 0xFF
0x02 0xFE
0x03 0xFD
0x04 0xFC
0x05 0xFB
0x06 0xFA
0x07 0xF9
0x08 0xF8
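The effect of such a map file can be modelled in two steps: remap the listed byte values, then convert the result to UTF-8. The following Python sketch is an illustration of that sequence under the assumption that the source data is ISO-8859-1; it is not the jutf8 implementation, and the function names are hypothetical.

```python
MAP_TEXT = """\
#From To
0x01 0xFF
0x02 0xFE
0x03 0xFD
0x04 0xFC
0x05 0xFB
0x06 0xFA
0x07 0xF9
0x08 0xF8
"""

def parse_map(text):
    """Parse 'from to' hex pairs, skipping blank lines and # comments."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()
        table[int(src, 16)] = int(dst, 16)
    return table

def remap_then_utf8(data, table):
    """Apply the byte mapping, then convert ISO-8859-1 bytes to UTF-8."""
    remapped = bytes(table.get(b, b) for b in data)
    return remapped.decode("iso-8859-1").encode("utf-8")

table = parse_map(MAP_TEXT)
converted = remap_then_utf8(b"A\x04B", table)
print(converted.hex())
```

In this model the byte 0x04 ends up as a two-byte UTF-8 sequence, so it can no longer be confused with a single-byte system delimiter.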
The jBASE directory and SEQ drivers have been modified to support an additional IOCTL command, which provides data conversion from a specified code page to UTF-8 when reading from the native operating system file. This command can also be used when writing to the native file for the data to be converted from UTF-8 to the configured code page. This IOCTL is developed specifically for import and/or export of data to external applications and is not recommended for usage as part of an application for on the fly conversion. You can also use this IOCTL with the READSEQ and WRITESEQ statements.
The following is an example of using the IOCTL to convert data in a UNIX directory file from shift-jis (Japanese) to UTF-8 while reading the record from the native file. The record is written to a jBASE Hash File, without conversion. This IOCTL command will also return the previously configured Code Page for the File Descriptor.
* Convert directory record from CodePage shift-jis to UTF-8 and place into Hash file
INCLUDE JBC.h
OPEN 'MYDIRECTORY.' TO FILE ELSE STOP
OPEN 'MYHASHFILE' TO HASHFILE ELSE STOP
* Setup Code Page for IOCTL command
CodePage ="shift-jis"
IF IOCTL(FILE,JIOCTL_COMMAND_SETCODEPAGE,CodePage) ELSE
CRT "Code page problem" ; STOP
END
IF CodePage NE "" THEN CRT "Previously configured Code Page : ":CodePage
* Read and convert record from code page shift-jis to UTF-8
READ Record FROM FILE,"MyCodePage" THEN
CRT "No Chars ":LEN(Record), "No Bytes ":BYTELEN(Record)
WRITE Record ON HASHFILE,"MyUTF8"
END
Error Message Files
In international mode, the configured locale is used to select a locale-specific error message file, which is used instead of the default error message file.
The correct error message file for the locale is detected from the full locale specification. The search starts with the full locale definition (LanguageCode_CountryCode_Variant), using all three arguments. If that fails, the search continues with the first two arguments (LanguageCode_CountryCode), and if it still fails, the search continues with only the first argument (LanguageCode).
For example, for the locale fr_FR_EURO, the following files are tried in order, each one only if the search for the previous one fails.
- jbcmessages_fr_FR_EURO
- jbcmessages_fr_FR
- jbcmessages_fr
- jbcmessages
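The fallback order can be generated mechanically from the locale string. The following Python sketch builds the candidate list for any locale; the function name and the default base name parameter are illustrative, and the sketch does not open or check any files.

```python
def message_file_candidates(locale, base="jbcmessages"):
    """Return error message file names from most to least specific."""
    parts = locale.split("_") if locale else []
    names = [base + "_" + "_".join(parts[:n]) for n in range(len(parts), 0, -1)]
    names.append(base)                     # the default file is always the last resort
    return names

for name in message_file_candidates("fr_FR_EURO"):
    print(name)
```

A caller would try each name in turn and stop at the first file that exists.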
Spooling
The jBASE spooler files hold the created spooler jobs as UTF-8 encoded byte sequences only if the jobs are generated by a program executing in international mode, that is, as per the account definition. Otherwise, spooler jobs are created in the normal Latin1 (ISO-8859-1) code page, as previously.
Printing
You can configure the CODEPAGE parameter, in the FORM TYPE configuration file in the jBASE release sub directory config (see jspform_deflt) to specify a code page to be used for conversion when de-spooling the print job. The syntax of the parameter is as follows:
CODEPAGE codepage
In the above syntax, codepage is the name of the code page to be used to convert the print job from the internal format of UTF-8 encoded byte sequences to the required code page for the printer device. For example:
CODEPAGE shift-jis
This code page parameter will convert the UTF-8 byte sequence in the print job to shift-jis for Japanese.
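The conversion applied when de-spooling can be pictured as a re-encoding of the stored bytes. The following Python sketch models that step with Python's own `shift_jis` codec; it is not the jBASE spooler, and the function name is hypothetical.

```python
def despool_model(job_utf8: bytes, codepage: str) -> bytes:
    """Re-encode a UTF-8 print job into the printer's code page."""
    return job_utf8.decode("utf-8").encode(codepage)

text = "\u65e5\u672c"                     # two Kanji characters
job = text.encode("utf-8")                # as held in the spooler: 3 bytes each
printer_bytes = despool_model(job, "shift_jis")

print(len(job), len(printer_bytes))
```

The byte count changes because the two encodings use different numbers of bytes per character, which is why the conversion must be done on whole characters rather than byte by byte.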
Whenever possible, printers should be configured to support UTF-8, thereby eliminating the code page conversion and reducing unnecessary conversion overheads on the system.