If you live in Eastern Europe, Japan or the Middle East, and you write computer programs, you are probably familiar with UNICODE. If you are writing programs in Visual C++/MFC, then you probably have experienced some of the problems with trying to write code that runs under UNICODE and ASCII. This article should help clear up some of the confusion. The principles here will work for anyone using Visual C++ and/or MFC.
What is UNICODE?
UNICODE is a widely adopted answer to the size limit of single-byte character sets. ASCII itself defines only 128 characters (codes 0-127); the 8-bit "extended ASCII" code pages used on Windows raise that to 256 (codes 0-255). For most Latin-based languages, this is fine: one set of upper-case characters, one set of lower-case characters, and a smattering of special characters (punctuation, currency symbols, and so on).
However, many Asian and Eastern languages contain many more than 256 characters, sometimes thousands more. With only one byte per character, there has been no simple way to write programs for those languages.
UNICODE replaces the single-byte scheme with a much larger range for mapping numeric codes to the characters of many languages. As implemented on Windows, it does this by doubling the number of bytes used to store each character in the program.
ASCII versus UNICODE
An ASCII character is stored in one byte of memory (one byte has a numeric range of 0-255). To create enough space for the larger character sets of many other languages, the size of one character was increased to two bytes. This provides a numeric range of 0-65535, or 65,536 possible codes.
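To make the storage difference concrete, here is a minimal sketch in portable C++. The helper names narrowStorageBytes and wideStorageBytes are invented for this illustration and are not part of any Microsoft library.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Bytes needed to store a narrow string, including the terminating '\0'.
std::size_t narrowStorageBytes(const char* s) {
    return (std::strlen(s) + 1) * sizeof(char);
}

// Bytes needed to store a wide string, including the terminating L'\0'.
std::size_t wideStorageBytes(const wchar_t* s) {
    return (std::wcslen(s) + 1) * sizeof(wchar_t);
}
```

With Visual C++, where wchar_t occupies two bytes, the five-letter string "Hello" needs 6 bytes as a char string and 12 bytes as a wchar_t string, counting the terminator in both cases. Other compilers may use a wider wchar_t, which is why the sketch multiplies by sizeof(wchar_t) rather than by 2.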
Unfortunately, doubling the byte size also causes compatibility problems. Most programs that are written on the ASCII standard simply assume that a character is one byte. Many methods and functions in Windows programs take advantage of this. When these programs are compiled for UNICODE, they break.
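A typical example of the one-byte assumption is code that treats sizeof(buffer) as the number of characters the buffer can hold. The sketch below, which uses the hypothetical helper names elementCount and byteCount, shows how the two numbers drift apart once the element type is wchar_t.

```cpp
#include <cstddef>

// Number of elements in a fixed-size array, independent of element width.
template <typename T, std::size_t N>
std::size_t elementCount(const T (&)[N]) {
    return N;
}

// Size of the same array in bytes; this is the value sizeof reports.
template <typename T, std::size_t N>
std::size_t byteCount(const T (&)[N]) {
    return N * sizeof(T);
}
```

For char buf[100], both helpers return 100. For wchar_t buf[100], elementCount still returns 100 while byteCount returns 100 * sizeof(wchar_t), so passing the byte count to a function that expects a character count would overrun the buffer.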
Visual C++ Solutions
What does all this mean to you as a software developer? If you are writing programs in Visual C++, it means you must now decide whether your program is to run internationally, or if it is to remain local to ASCII-compliant markets.
There is some good news here. Visual C++ provides some built-in support for UNICODE. The AppWizard in VC++ allows the developer to decide whether or not to support UNICODE before generating the application framework. The Win32 SDK contains several data types that allow UNICODE compliance, and MFC provides macros that convert generic text to UNICODE data types. The developer only needs to change a few coding habits to write UNICODE-friendly applications.
The Character String
C programmers declare a character array with the char keyword:
char str[100];
The standard library copy function is declared as:
char *strcpy( char *dest, const char *src );
To adapt these declarations to two-byte-per-character UNICODE, use the following:
wchar_t str[100];
and
wchar_t *wcscpy( wchar_t *dest, const wchar_t *src );
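As a short sketch of the two families side by side, the following wraps strcpy and wcscpy in a pair of helpers. The names copyNarrow, copyWide and kBufferChars are assumptions made for this example; only strcpy, strlen, wcscpy and wcslen come from the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

const std::size_t kBufferChars = 100;   // same capacity as str[100] in the text

// Copies src into a narrow buffer and returns the length in characters.
std::size_t copyNarrow(char (&dest)[kBufferChars], const char* src) {
    assert(std::strlen(src) < kBufferChars);   // leave room for '\0'
    std::strcpy(dest, src);
    return std::strlen(dest);
}

// Same operation for wide characters: wcscpy and wcslen replace strcpy and strlen.
std::size_t copyWide(wchar_t (&dest)[kBufferChars], const wchar_t* src) {
    assert(std::wcslen(src) < kBufferChars);   // leave room for L'\0'
    std::wcscpy(dest, src);
    return std::wcslen(dest);
}
```

Both helpers return a count of characters, not bytes, so copyNarrow(buf, "Hello") and copyWide(wbuf, L"Hello") each return 5 even though the wide buffer occupies more memory.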
Microsoft also provides a way to write code whose character type is chosen by the preprocessor. When you create a new project in Visual C++, and you decide that it should support another character set, AppWizard puts a pre-processor statement into one of the header files. This tells the compiler what character set you are intending to support. From there, you can use the Generic Data Types that are supplied with VC++, and the preprocessor will replace them with the correct data type depending on what character set you are supporting. This makes the code much easier to re-compile for other character sets.
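To activate the UNICODE standard in Visual C++, select Build | Settings from the menu bar (in Visual C++ 5, select Project | Settings). Select the C/C++ tab. Append _UNICODE to the "Preprocessor definitions" field, and remove _MBCS if it is listed there. The C run-time and MFC headers test _UNICODE, while the Win32 headers (and the TEXT macro) test UNICODE without the underscore, so defining both keeps the two sets of headers in agreement. An MFC application built for UNICODE also needs wWinMainCRTStartup as its entry-point symbol, which is set on the Link tab under the Output category.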
In your code, use TCHAR wherever the keyword char would normally be used, LPTSTR wherever char * would be used, and LPCTSTR wherever const char * would be used. String constants defined in quotes ("Hello, World") are rewritten using the TEXT macro:
TEXT("Hello, World")
Microsoft provides several data types, including generic types, that are compatible with both ASCII and UNICODE. All of these can be found in the Microsoft on-line documentation under Generic Data Types or Data Types.
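The mechanism behind the generic types can be sketched in a few lines of portable C++. In the example below, DEMO_UNICODE, demo_tchar, DEMO_TEXT and demo_tcslen are hypothetical stand-ins for _UNICODE, TCHAR, TEXT and _tcslen; they only imitate the idea and are not the definitions found in the Microsoft headers.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins; the real names live in <tchar.h> and the Win32 headers.
#ifdef DEMO_UNICODE
    typedef wchar_t demo_tchar;
    #define DEMO_TEXT(s) L##s          // "abc" becomes L"abc"
    #define demo_tcslen  std::wcslen
#else
    typedef char demo_tchar;
    #define DEMO_TEXT(s) s             // literal is left unchanged
    #define demo_tcslen  std::strlen
#endif

// Written once against the generic names; compiles for either character set.
std::size_t greetingLength() {
    const demo_tchar* greeting = DEMO_TEXT("Hello, World");
    return demo_tcslen(greeting);
}
```

greetingLength() returns 12 in both builds. What changes when DEMO_UNICODE is defined is the type of every character and literal, which switches from char to wchar_t without any edit to the function body.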
The Code
Here are some examples using one of the code fragments from the Developing Professional Applications For Windows 95 and NT Using MFC book by Marshall Brain and Lance Lovette.
This is the "Hello, World" application using the ASCII character set :
//************************************************************
// From the book "Visual C++ 2: Developing Professional
// Applications for Windows 95 and NT with MFC"
//
// by Marshall Brain and Lance Lovette
// Published by Prentice Hall
//
// Copyright 1995, by Prentice Hall.
// This code implements a simple "Hello World!" program in MFC
//************************************************************
//hello.cpp
#include <afxwin.h>

// Declare the application class
class CHelloApp : public CWinApp
{
public:
    virtual BOOL InitInstance();
};

// Create an instance of the application class
CHelloApp HelloApp;

// Declare the main window class
class CHelloWindow : public CFrameWnd
{
    CStatic* cs;
public:
    CHelloWindow();
};

// The InitInstance function is called each
// time the application first executes.
BOOL CHelloApp::InitInstance()
{
    m_pMainWnd = new CHelloWindow();
    m_pMainWnd->ShowWindow(m_nCmdShow);
    m_pMainWnd->UpdateWindow();
    return TRUE;
}

// The constructor for the window class
CHelloWindow::CHelloWindow()
{
    // Create the window itself
    Create(NULL,
        "Hello World!",
        WS_OVERLAPPEDWINDOW,
        CRect(0,0,200,200));
    // Create a static label
    cs = new CStatic();
    cs->Create("hello world",
        WS_CHILD|WS_VISIBLE|SS_CENTER,
        CRect(50,80,150,150),
        this);
}
The string constants must be changed to their generic-text counterparts. In the following code fragment, the string constants are wrapped in the TEXT macro. TEXT expands to a wide (L"...") literal when UNICODE is defined and to the unchanged narrow literal otherwise:
// The constructor for the window class
CHelloWindow::CHelloWindow()
{
    // Create the window itself
    Create(NULL,
        TEXT("Hello World!"),
        WS_OVERLAPPEDWINDOW,
        CRect(0,0,200,200));
    // Create a static label
    cs = new CStatic();
    cs->Create(TEXT("hello world!"),
        WS_CHILD|WS_VISIBLE|SS_CENTER,
        CRect(50,80,150,150),
        this);
}
When the preprocessor encounters a generic data type, it uses the definition supplied by the headers the program includes (AFXWIN.H pulls in the Win32 and generic-text headers). Those headers test whether _UNICODE and UNICODE are defined, and map the generic type to the wide or the narrow data type accordingly.
The following is an example of one of the Win32 API generic data types from the Win32 System Services book by Marshall Brain :
//**********************************************************
// From the book "Win32 System Services: The Heart of Windows NT"
// by Marshall Brain
// Published by Prentice Hall
//
// Copyright 1994, by Prentice Hall.
//
// This code sets the volume label for drive C.
//**********************************************************
// drvsvl.cpp
#include <windows.h>
#include <iostream.h>

void main()
{
    BOOL success;
    char volumeName[MAX_PATH];

    cout << "Enter new volume label for drive C: ";
    cin >> volumeName;
    success = SetVolumeLabel("c:\\", volumeName);
    if (success)
        cout << "success\n";
    else
        cout << "Error code: " << GetLastError() << endl;
}
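The character array declared at the top of this fragment is now declared with the TCHAR data type, which makes it a two-byte character array in a UNICODE build. The TEXT macro is again used for the string constants. The narrow cout and cin streams cannot read or print wide strings, so the console I/O moves to the generic-text routines _tprintf and _tscanf from <tchar.h>. Those routines follow _UNICODE while TEXT and SetVolumeLabel follow UNICODE, so both symbols must be defined together:
// drvsvl.cpp (generic-text version)
#include <windows.h>
#include <tchar.h>
#include <stdio.h>

void main()
{
    BOOL success;
    TCHAR volumeName[MAX_PATH];

    _tprintf(TEXT("Enter new volume label for drive C: "));
    // Limit the input to the buffer size, leaving room for the terminator
    _tscanf(TEXT("%259s"), volumeName);
    success = SetVolumeLabel(TEXT("c:\\"), volumeName);
    if (success)
        _tprintf(TEXT("success\n"));
    else
        _tprintf(TEXT("Error code: %lu\n"), GetLastError());
}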
Generic Data Types in Visual C++
Visual C++ provides several generic data types for making applications tolerant to international character sets. They come from the Win32 SDK and C run-time headers, and MFC uses them throughout. These are defined generically, to make the application portable to UNICODE, ASCII, and MBCS (multi-byte character sets, of which the double-byte character sets, DBCS, are the most common case). It is beyond the scope of this article to explain all of the differences between the character sets; MFC provides a transparent way of implementing them all. How these Generic Data Types are mapped depends on which character set symbol is defined for the project: none (single-byte ASCII), _MBCS, or _UNICODE. Since this article is primarily about UNICODE, the table that follows gives the mappings for the ASCII and UNICODE character types:
Generic Data Type | Map to ASCII | Map to UNICODE | Notes |
---|---|---|---|
_TCHAR | char | wchar_t | _TCHAR is a generic type that maps to the ASCII char data type if _UNICODE is not set, and to wchar_t when it is. |
_T or _TEXT | char constant strings | wchar_t constant strings | These are functionally identical macros. They are ignored (removed by the preprocessor) for ASCII; for UNICODE, these macros convert the string into the UNICODE equivalent. |
LPTSTR | char * | wchar_t * | A portable 32-bit pointer to a character string. It maps to the character type that is set for the project, as explained above. |
LPCTSTR | const char * | const wchar_t * | A portable 32-bit pointer to a constant character string. It maps to the character type that is set for the project, as explained above. |
With the Generic Data Types above, two things are all that is needed to compile an application that is friendly to both ASCII and UNICODE: one symbol set when the project is created, and the use of these Generic Data Types in place of the byte-specific ones they replace. However, it is important to note that these Generic Types are Microsoft extensions; they are not part of the ANSI C or C++ standards. For detailed descriptions of the Generic Data Types that Microsoft provides, read the Generic Data Types and Using Generic-Text Mappings topics in the Microsoft Help Files.
Some technical notes
To compile MFC programs for UNICODE, you must have access to the UNICODE versions of the MFC libraries. The library files are available through a custom installation of Visual C++.
It is important to note that leaving _UNICODE undefined may not have a visible effect on your program. For example, the MFC code above will build and execute without difficulty whether the _UNICODE symbol is set in the build settings or not. The differences show up when the developer begins using Win32 API functions that have more than one implementation.
When the developer uses an API function that has a dual implementation (as any Win32 API function that takes a character or string parameter does), the Win32 headers define the generic name as a macro that resolves to one of the two versions. SetVolumeLabel, for example, becomes SetVolumeLabelW when UNICODE is defined and SetVolumeLabelA when it is not. Without the symbol defined, the compiler therefore calls the ASCII ("A") version of the function, and the preprocessor reduces macros such as TEXT to the plain string literal.
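The dual-implementation pattern can be imitated without the Win32 headers. The following is only a sketch: DemoSetLabelA, DemoSetLabelW, DemoSetLabel, SKETCH_UNICODE and SKETCH_TEXT are made-up names that mirror the shape of an A/W function pair, and the functions do nothing except report which version was reached.

```cpp
#include <cstring>

// Hypothetical narrow and wide versions of one API entry point.
const char* DemoSetLabelA(const char* /*label*/)    { return "DemoSetLabelA"; }
const char* DemoSetLabelW(const wchar_t* /*label*/) { return "DemoSetLabelW"; }

// One symbol selects both the function version and the literal type.
#ifdef SKETCH_UNICODE
    #define DemoSetLabel   DemoSetLabelW
    #define SKETCH_TEXT(s) L##s
#else
    #define DemoSetLabel   DemoSetLabelA
    #define SKETCH_TEXT(s) s
#endif

// The call site names only the generic macro, never the A or W version.
const char* callGenericSetLabel() {
    return DemoSetLabel(SKETCH_TEXT("BACKUP"));
}
```

Compiled as shown, callGenericSetLabel() reaches DemoSetLabelA. Adding SKETCH_UNICODE to the preprocessor definitions redirects the same source line to DemoSetLabelW and turns the literal into a wide string, which is why the function version and the literal type have to be switched by the same symbol.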
Conclusion
As we have shown, compiling applications for the UNICODE standard is not difficult. As a mental shift, UNICODE is unique in that the changes it requires are simple and intuitive. The extensions provided by Microsoft serve to make the choice of character set even more transparent. These simple changes help simplify the process of writing new applications, and changing old ones, for international markets.
Special thanks to Yoshihiro Mori, a software developer in Japan, for pointing out to us the need for UNICODE clarification for our books, which are now starting to be published in the native languages of other countries.