String Data Types, Conversion Classes, and Helper Functions
A Review of Text Data Types
The text data type is somewhat of a pain to deal
with in C++ programming. The main problem is that there isn't just
one text data type; there are many of them. I use the term
text data type here in the general
sense of an array of characters. Often, different operating systems
and programming languages introduce additional semantics for an
array of characters (for example, NUL character
termination or a length prefix) before they consider an array of
characters a text string.
When you select a text data type, you must make
a number of decisions. First, you must decide what type of
characters constitute the array. Some operating systems require you
to use ANSI characters when you pass a string (such as a file name)
to the operating system. Some operating systems prefer that you use
Unicode characters but will accept ANSI characters. Other operating
systems require you to use EBCDIC characters. Stranger character
sets are in use as well, such as the Multi/Double Byte Character
Sets (MBCS/DBCS); this book largely doesn't discuss those
details.
Second, you must consider
what character set you want to use to manipulate text within your
program. No requirement states that your source code must use the
same character set that the operating system running your program
prefers. Clearly, it's more convenient when both use the same
character set, but a program and the operating system can use
different character sets. You "simply" must convert all text
strings going to and coming from the operating system.
Third, you must determine the length of a text
string. Some languages, such as C and C++, and some operating
systems, such as Windows 9x/NT/XP
and UNIX, use a terminating NUL character to delimit the
end of a text string. Other languages, such as the Microsoft Visual
Basic interpreter, Microsoft Java virtual machine, and Pascal,
prefer an explicit length prefix specifying the number of
characters in the text string.
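To make the two length conventions concrete, here is a minimal sketch in portable C++. The function names and the one-byte length prefix are assumptions chosen for illustration; real length-prefixed formats (a BSTR, for instance) use a wider prefix and carry extra rules.

```cpp
#include <cassert>
#include <cstddef>

// Length of a NUL-terminated string: scan until the terminator is found.
std::size_t TerminatedLength(const char* text) {
    assert(text != nullptr);
    std::size_t count = 0;
    while (text[count] != '\0') {
        ++count;
    }
    return count;
}

// Hypothetical length-prefixed layout: the first byte stores the count and
// the characters follow it, in the style of classic Pascal strings.
std::size_t PrefixedLength(const unsigned char* text) {
    assert(text != nullptr);
    return text[0];  // the length is stored, not discovered by scanning
}
```

The terminated form costs a scan every time you ask for the length; the prefixed form answers immediately but limits the length to what the prefix can hold.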
Finally, in practice, a text string presents a
resource-management issue. Text strings typically vary in length.
This makes it difficult to allocate memory for the string on the
stack, and the text string might not fit on the stack at all.
Therefore, text strings are often dynamically allocated. Of course,
this means that a text string must be freed eventually. Resource
management introduces the idea of an owner of a text string. Only
the owner frees the string, and frees it only once. Ownership becomes
quite important when you pass a text string between components.
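One way to make ownership explicit in standard C++ is to let a smart pointer carry it. The sketch below is only an illustration of the single-owner rule; the names DuplicateText and ConsumeText are hypothetical, and COM itself uses its own allocator rules rather than new and delete.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Returns a heap-allocated copy; the returned unique_ptr is the sole owner.
std::unique_ptr<char[]> DuplicateText(const char* source) {
    assert(source != nullptr);
    const std::size_t length = std::strlen(source);
    std::unique_ptr<char[]> copy(new char[length + 1]);
    std::memcpy(copy.get(), source, length + 1);  // include the terminator
    return copy;
}

// Takes ownership from the caller. The buffer is freed exactly once, when
// the parameter goes out of scope at the end of this function.
std::size_t ConsumeText(std::unique_ptr<char[]> text) {
    return text ? std::strlen(text.get()) : 0;
}
```

A caller hands the string over with std::move, after which its own pointer is empty, so a second free cannot happen by accident.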
To make matters worse, two COM objects can
reside on two different computers running two different operating
systems that prefer two different character sets for a text string.
For example, you can write one COM object in Visual Basic and run
it on the Windows XP operating system. You might pass a text string
to another COM object written in C++ running on an IBM mainframe.
Clearly, we need some standard text data type that all COM objects
in a heterogeneous environment can understand.
COM uses the OLECHAR character data
type. A COM text string is a NUL-character-terminated
array of OLECHAR characters; a pointer to such a string is
an LPOLESTR. As a rule, a text string parameter to
a COM interface method should be of type LPOLESTR. When a
method doesn't change the string, the parameter should be of type
LPCOLESTR, that is, a pointer to a constant array of
OLECHAR characters.
Frequently, though not always, the
OLECHAR type isn't the same as the characters you use when
writing your code. Sometimes, though not always, the
OLECHAR type isn't the same as the characters you must
provide when passing a text string to the operating system. This
means that, depending on context,
sometimes you need to convert a text string from one character
set to another, and sometimes you won't.
Unfortunately, a change in compiler options (for
example, a Windows XP Unicode build or a Windows CE build) can
change this context. As a result, code that previously didn't need
to convert a string might require conversion, or vice versa. You
don't want to rewrite all string-manipulation code each time you
change a compiler option. Therefore, ATL provides a number of
string-conversion macros that convert a text string from one
character set to another and are sensitive to the context in which
you invoke the conversion.
Windows Character Data Types
Now let's focus specifically on the Windows
platform. Windows-based COM components typically use a mix of four
text data types:
- Unicode. A
specification for representing a character as a "wide-character,"
16-bit multilingual character code. The Windows NT/XP operating
system uses the Unicode character set internally. All characters
used in modern computing worldwide, including technical symbols and
special publishing characters, can be represented uniquely in
Unicode. The fixed character size simplifies programming when using
international character sets. In C/C++, you represent a
wide-character string as a wchar_t array; a pointer to
such a string is a wchar_t*.
- MBCS/DBCS.
The Multi-Byte Character Set is a mixed-width character set in
which some characters consist of more than 1 byte. The Windows
9x operating systems, in general,
use the MBCS to represent characters. The Double-Byte Character Set
(DBCS) is a specific type of multibyte character set. It includes
some characters that consist of 1 byte and some characters that
consist of 2 bytes to represent the symbols for one specific
locale, such as the Japanese, Chinese, and Korean languages.
In C/C++, you represent an MBCS/DBCS string as
an unsigned char array; a pointer to such a string is an
unsigned char*. Sometimes a character is one unsigned
char in length; sometimes, it's more than one. This is loads
of fun to deal with, especially when you're trying to back up
through a string. In Visual C++, MBCS always means DBCS. Character
sets wider than 2 bytes are not supported.
- ANSI. You can
represent all characters in the English language, as well as many
Western European languages, using only 8 bits. Versions of Windows
that support such languages use a degenerate case of MBCS, called
the Microsoft Windows ANSI character set, in which no multibyte
characters are present. The
Microsoft Windows ANSI character set, which is essentially ISO
8859/x plus additional characters,
was originally based on an ANSI draft standard.
The ANSI character set maps the letters and
numerals in the same manner as ASCII. However, ANSI does not
support control characters and maps many symbols, including
accented letters, that are not mapped in standard ASCII. All
Windows fonts are defined in the ANSI character set. This is also
called the Single-Byte Character Set (SBCS), for symmetry.
In C/C++, you represent an ANSI string as a
char array; a pointer to such a string is a
char*. A character is always one char in length.
By default, a char is a signed char in Visual
C++. Because MBCS characters are unsigned and ANSI
characters are, by default, signed characters, expressions
can evaluate differently when using ANSI characters, compared to
using MBCS characters.
- TCHAR/_TCHAR. This is a
Microsoft-specific generic-text data type that you can map to a
Unicode character, an MBCS character, or an ANSI character using
compile-time options. You use this character type to write generic
code that can be compiled for any of the three character sets. This
simplifies code development for international markets. The C
runtime library defines the _TCHAR type, and the Windows
operating system defines the TCHAR type; they are
synonymous.
tchar.h, a Microsoft-specific C runtime
library header file, defines the generic-text data type
_TCHAR. ANSI C/C++ compiler compliance requires
implementer-defined names to be prefixed by an underscore. When you
do not define the __STDC__ preprocessor symbol (by
default, this macro is not defined in Visual C++), you indicate
that you don't require ANSI compliance. In this case, the
tchar.h header file also defines the symbol TCHAR
as another alias for the generic-text data type if it isn't already
defined. winnt.h, a Microsoft-specific Win32 operating
system header file, defines the generic-text data type
TCHAR. This header file is operating system specific, so
the symbol names don't need the underscore prefix.
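The signedness remark in the ANSI item above is easy to overlook, so here is a small portable sketch of it. The function names are hypothetical, and the sketch assumes a two's-complement platform where casting a byte above 0x7F to signed char yields a negative value.

```cpp
#include <cassert>
#include <cstddef>

// Counts characters that compare greater than 'z' when every byte is read
// as a signed char, which is how plain char behaves by default in Visual C++.
std::size_t CountAboveZSigned(const char* text) {
    assert(text != nullptr);
    std::size_t count = 0;
    for (; *text != '\0'; ++text) {
        if (static_cast<signed char>(*text) > 'z') ++count;
    }
    return count;
}

// Same test, but every byte is read as an unsigned char, as MBCS code does.
std::size_t CountAboveZUnsigned(const char* text) {
    assert(text != nullptr);
    std::size_t count = 0;
    for (; *text != '\0'; ++text) {
        if (static_cast<unsigned char>(*text) > 'z') ++count;
    }
    return count;
}
```

For a string containing the byte 0xE9 (an accented letter in the Windows ANSI character set), the signed version sees a negative number and does not count it, whereas the unsigned version sees 233 and does.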
Win32 APIs and Strings
Each Win32 API that requires a string has two
versions: one that requires a Unicode argument and another that
requires an MBCS argument. On a non-MBCS-enabled version of
Windows, the MBCS version of an API expects an ANSI argument. For
example, the SetWindowText API doesn't really exist. There
are actually two functions: SetWindowTextW, which expects
a Unicode string argument, and SetWindowTextA, which
expects an MBCS/ANSI string argument.
The Windows NT/2000/XP operating systems
internally use only Unicode strings. Therefore, when you call
SetWindowTextA on Windows NT/2000/XP, the function translates the
specified string to Unicode and then calls SetWindowTextW.
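The forwarding arrangement can be sketched in portable C++ as follows. SetTitleA, SetTitleW, and g_windowTitle are hypothetical stand-ins, not Win32 names, and the widening step assumes ASCII or Latin-1 input; the real operating system converts using the active code page.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for state that the system would keep internally.
static std::wstring g_windowTitle;

// "W" flavor: does the real work with wide characters.
bool SetTitleW(const wchar_t* title) {
    assert(title != nullptr);
    g_windowTitle = title;
    return true;
}

// "A" flavor: widens the narrow string and then forwards to the W flavor.
// Assumption: each byte maps to the code point with the same value.
bool SetTitleA(const char* title) {
    assert(title != nullptr);
    std::wstring wide;
    for (; *title != '\0'; ++title) {
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(*title)));
    }
    return SetTitleW(wide.c_str());
}
```

The extra copy in the A flavor is the cost that a Unicode build avoids by calling the W flavor directly.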
The Windows 9x operating systems
do not support Unicode directly. The SetWindowTextA
function on the Windows 9x
operating systems does the work, while SetWindowTextW
returns an error. The MSLU library from Microsoft
provides implementations of almost all the Unicode functions on
Win9x.
This gives you a difficult choice. You could
write a performance-optimized component using Unicode character
strings that runs on Windows 2000 but not on Windows 9x. You could use MSLU for Unicode strings on
both families and lose performance on Windows 9x. You could write a more general component
using MBCS/ANSI character strings that runs on both operating
systems but not optimally on
Windows 2000. Alternatively, you could hedge your bets by writing
source code that enables you to decide at compile time what
character set to support.
A little coding discipline and some preprocessor
magic let you code as if there were a single API called
SetWindowText that expects a TCHAR string
argument. You specify at compile time which kind of component you
want to build. For example, you write code that calls
SetWindowText and specifies a TCHAR buffer. When
compiling a component as Unicode, you call SetWindowTextW;
the argument is a wchar_t buffer. When compiling an
MBCS/ANSI component, you call SetWindowTextA; the argument
is a char buffer.
When you write a Windows-based COM component,
you should typically use the TCHAR character type to
represent characters used by the component internally.
Additionally, you should use it for all characters used in
interactions with the operating system. Similarly, you should use
the TEXT or __TEXT macro to surround every
literal character or string.
tchar.h defines the functionally
equivalent macros _T, __T, and _TEXT,
which all compile a character or string literal as a generic-text
character or literal. winnt.h also defines the
functionally equivalent macros TEXT and __TEXT,
which are yet more synonyms for _T, __T, and
_TEXT. (There's nothing like five ways to do exactly the
same thing.) The examples in this chapter use __TEXT
because it's defined in winnt.h. I actually prefer
_T because it's less clutter in my source code.
An operating-system-agnostic coding approach
favors including tchar.h and using the _TCHAR
generic-text data type because that's somewhat less tied to the
Windows operating systems. However, we're discussing building
components with text handling optimized at compile time for
specific versions of the Windows operating systems. This argues
that we should use TCHAR, the type defined in
winnt.h. Plus, TCHAR isn't as jarring to the eyes
as _TCHAR and it's easier to type. Most code already
implicitly includes the winnt.h header file via
windows.h, and you must explicitly include
tchar.h. All sorts of good reasons support using
TCHAR, so the examples in this book use this as the
generic-text data type.
This means that you can compile specialized
versions of the component for different markets or for performance
reasons. These types and macros are defined in the winnt.h
header file.
You also must use a different set of string
runtime library functions when manipulating strings of
TCHAR characters. The familiar functions strlen,
strcpy, and so on operate only on char
characters. The less familiar functions wcslen, wcscpy,
and so on work on wchar_t characters. Moreover, the
totally strange functions _mbslen, _mbscpy, and
so on work on multibyte characters. Because TCHAR
characters are sometimes wchar_t, sometimes
char-holding ANSI characters, and sometimes
char-holding (nominally unsigned) multibyte
characters, you need an equivalent set of runtime library functions
that work with TCHAR characters.
The tchar.h header file defines a
number of useful generic-text mappings for string-handling
functions. These functions expect TCHAR parameters, so all
their function names use the _tcs (the _t
character set) prefix. For example, _tcslen is equivalent
to the C runtime library strlen function. The
_tcslen function expects TCHAR characters,
whereas the strlen function expects char
characters.
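The mechanism behind these mappings is ordinary conditional compilation. The following sketch imitates it with hypothetical X_ names so that it compiles anywhere and cannot collide with tchar.h; it is a simplified illustration, not the contents of the real header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical names (X_ prefix). Define X_UNICODE before this point to
// build the wide-character flavor instead of the narrow one.
#ifdef X_UNICODE
typedef wchar_t X_TCHAR;
#define X_TEXT(literal) L##literal
#define x_tcslen std::wcslen
#else
typedef char X_TCHAR;
#define X_TEXT(literal) literal
#define x_tcslen std::strlen
#endif

// Generic-text code: the same source compiles for either character type.
std::size_t GreetingLength() {
    const X_TCHAR* greeting = X_TEXT("hello, world");
    assert(greeting != nullptr);
    return x_tcslen(greeting);
}
```

Nothing in GreetingLength names a concrete character type, a literal prefix, or a concrete library function, which is exactly the discipline generic-text code requires.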
Controlling Generic-Text Mapping Using the Preprocessor
Two preprocessor symbols and two macros control
the mapping of the TCHAR data type to the underlying
character type the application uses.
- UNICODE/_UNICODE. The header files
for the Windows operating system APIs use the UNICODE
preprocessor symbol. The C/C++ runtime library header files use the
_UNICODE preprocessor symbol. Typically, you define either
both symbols or neither of them. When you compile with the symbol
_UNICODE defined, tchar.h maps all TCHAR
characters to wchar_t characters. The
_T, __T, and _TEXT macros prefix each
character or string literal with a capital L (creating a
Unicode character or literal, respectively). When you compile with
the symbol UNICODE defined, winnt.h maps all
TCHAR characters to wchar_t characters. The
TEXT and __TEXT macros prefix each character or
string literal with a capital L (creating a Unicode
character or literal, respectively). The _tcsXXX functions
are mapped to the corresponding wcsXXX functions.
- _MBCS. When you compile with the
symbol _MBCS defined, all TCHAR characters map to
char characters, and the preprocessor removes all the
_T and __TEXT macro variations. It leaves the
character or literal unchanged (creating an MBCS character or
literal, respectively). The _tcsXXX functions are mapped
to the corresponding _mbsXXX versions.
- None of the above. When you compile with neither symbol defined, all
TCHAR characters map to char characters and the
preprocessor removes all the _T and __TEXT macro
variations, leaving the character or literal unchanged (creating an
ANSI character or literal, respectively). The _tcsXXX
functions are mapped to the corresponding strXXX
functions.
You write generic-text-compatible code by using
the generic-text data types and functions. An example of reversing
and concatenating to a generic-text string follows:
TCHAR *reversedString, *sourceString, *completeString;
reversedString = _tcsrev (sourceString);
completeString = _tcscat (reversedString, __TEXT("suffix"));
When you compile the code without defining any
preprocessor symbols, the preprocessor produces this output:
char *reversedString, *sourceString, *completeString;
reversedString = _strrev (sourceString);
completeString = strcat (reversedString, "suffix");
When you compile the code after defining the
_UNICODE preprocessor symbol, the preprocessor produces
this output:
wchar_t *reversedString, *sourceString, *completeString;
reversedString = _wcsrev (sourceString);
completeString = wcscat (reversedString, L"suffix");
When you compile the code after defining the
_MBCS preprocessor symbol, the preprocessor produces this
output:
char *reversedString, *sourceString, *completeString;
reversedString = _mbsrev (sourceString);
completeString = _mbscat (reversedString, "suffix");
COM Character Data Types
COM uses two character types:
- OLECHAR. The character type COM
uses on the operating system for which you compile your source
code. For Win32 operating systems, this is the wchar_t
character type. For
Win16 operating systems, this is the char character type.
For the Mac OS, this is the char character type. For the
Solaris OS, this is the wchar_t character type. For the as
yet unknown operating system, this is who knows what. Let's just
pretend there is an abstract data type called OLECHAR. COM
uses it. Don't rely on it mapping to any specific underlying data
type.
- BSTR. A specialized string type
some COM components use. A BSTR is a length-prefixed array
of OLECHAR characters with numerous special semantics.
Now let's complicate things a bit. You want to
write code for which you can select, at compile time, the type of
characters it uses. Therefore, you're manipulating strictly
TCHAR strings internally. You also want to call a COM
method and pass it the same strings. You must pass the method
either an OLECHAR string or a BSTR string,
depending on its signature. The strings your component uses might
or might not be in the correct character format, depending on your
compilation options. This is a job for Supermacro!
ATL String-Conversion Classes
ATL provides a number of string-conversion
classes that convert, when necessary, among the various character
types described previously. The classes perform no conversion and,
in fact, do nothing, when the compilation options make the source
and destination character types identical. Six different classes
in atlconv.h implement the real conversion logic, but this
header also uses a number of typedefs and preprocessor
#define statements to make using these converter classes
syntactically more convenient.
These class names use a number of abbreviations
for the various character data types:
- T represents a pointer to the Win32 TCHAR character type (an
LPTSTR parameter).
- W represents a pointer to the Unicode wchar_t character type (an
LPWSTR parameter).
- A represents a pointer to the MBCS/ANSI char character type (an
LPSTR parameter).
- OLE represents a pointer to the COM OLECHAR character type (an
LPOLESTR parameter).
- C represents the C/C++ const modifier.
All class names use the form
C<source-abbreviation>2<destination-abbreviation>.
For example, the CA2W class converts an LPSTR to
an LPWSTR. When there is a C in the name (not
including the first C, which stands for "class"), add a
const modification to the following abbreviation; for
example, the CT2CW class converts an LPTSTR to an
LPCWSTR.
The actual class behavior depends on which
preprocessor symbols you define (see Table 2.1). Note that the ATL conversion classes
and macros treat OLE and W as equivalent.
Table 2.1. Character Set Preprocessor Symbols
Preprocessor Symbol Defined | T Becomes . . . | OLE Becomes . . .
None | A | W
_UNICODE | W | W
Table 2.2 lists the ATL string-conversion classes.
Table 2.2. ATL String-Conversion Classes
CA2W | CA2WEX | CA2T | CA2TEX
CA2CT | CA2CTEX | COLE2T | COLE2TEX
COLE2CT | COLE2CTEX | CT2A | CT2AEX
CT2CA | CT2CAEX | CT2OLE | CT2OLEEX
CT2COLE | CT2COLEEX | CT2W | CT2WEX
CT2CW | CT2CWEX | CW2A | CW2AEX
CW2T | CW2TEX | CW2CT | CW2CTEX
As you can see, no BSTR conversion
classes are listed in Table
2.2. The next section of this chapter introduces the
CComBSTR class as the preferred mechanism for dealing with
BSTR-type conversions.
When you look inside the atlconv.h
header file, you'll see that many of the definitions distill down
to a fairly small set of six actual classes. For instance, when
_UNICODE is defined, CT2A becomes CW2A,
which is itself typedef'd to the CW2AEX template class.
The type definition merely applies the default template parameters
to CW2AEX. Additionally, all the previous class names
always map OLE to W, so COLE2T becomes CW2T, which is defined as
CW2W under Unicode builds. Because the source and
destination types for CW2W are the same, this class
performs no conversions. Ultimately, the only six classes defined
are the template classes CA2AEX, CA2CAEX,
CA2WEX, CW2AEX, CW2CWEX, and
CW2WEX. Only CA2WEX and CW2AEX have
different source and destination types, so these are the only two
classes doing any real work. Thus, our expansive list of conversion
classes in Table 2.2 has
distilled down to only two interesting ones. These two classes are
both defined and implemented similarly, so we look at only
CA2WEX to glean an understanding of how they both
work.
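template< int t_nBufferLength = 128 >
class CA2WEX {
public:
    CA2WEX( LPCSTR psz );                  // uses the default ANSI code page
    CA2WEX( LPCSTR psz, UINT nCodePage );  // uses the given code page
    ...
public:
    LPWSTR m_psz;                          // points at the converted string
    wchar_t m_szBuffer[t_nBufferLength];   // fixed buffer for short strings
    ...
};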
The class definition is actually pretty simple.
The template parameter specifies the size of a fixed buffer inside the
object that holds the string data. This means that most string-conversion
operations can be performed without allocating any dynamic storage.
If the requested string to convert exceeds the number of characters
passed as an argument to the template, CA2WEX uses
malloc to allocate additional storage.
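The buffer strategy just described can be sketched in portable C++. The class below is a hypothetical model, not the ATL source: the name NarrowToWide and the UsesHeap helper are invented for illustration, and the conversion simply widens each byte (an ASCII/Latin-1 assumption) where ATL would call MultiByteToWideChar.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

template <int t_nBufferLength = 128>
class NarrowToWide {
public:
    explicit NarrowToWide(const char* psz) : m_psz(m_szBuffer) {
        assert(psz != nullptr);
        const std::size_t needed = std::strlen(psz) + 1;  // with terminator
        if (needed > static_cast<std::size_t>(t_nBufferLength)) {
            // Too long for the member buffer: fall back to the heap.
            m_psz = static_cast<wchar_t*>(std::malloc(needed * sizeof(wchar_t)));
            if (m_psz == nullptr) throw std::bad_alloc();
        }
        for (std::size_t i = 0; i < needed; ++i) {
            m_psz[i] = static_cast<wchar_t>(static_cast<unsigned char>(psz[i]));
        }
    }
    ~NarrowToWide() {
        if (m_psz != m_szBuffer) std::free(m_psz);  // heap case only
    }
    NarrowToWide(const NarrowToWide&) = delete;
    NarrowToWide& operator=(const NarrowToWide&) = delete;

    operator const wchar_t*() const { return m_psz; }
    bool UsesHeap() const { return m_psz != m_szBuffer; }

private:
    wchar_t* m_psz;                       // member buffer or heap block
    wchar_t m_szBuffer[t_nBufferLength];  // storage for short strings
};
```

With a deliberately small buffer such as NarrowToWide<8>, a three-character string stays in the member buffer, whereas a twenty-character string forces the heap path; the destructor releases the heap block either way.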
Two constructors are provided for
CA2WEX. The first constructor accepts an LPCSTR
and uses the Win32 API function MultiByteToWideChar to
perform the conversion. By default, the class uses the ANSI code
page for the current thread's locale to perform the conversion. The
second constructor can be used to specify an alternate code page
that governs how the conversion is performed. This value is passed
directly to MultiByteToWideChar, so see the online
documentation for details on code pages accepted by the various
Win32 character conversion functions.
The simplest way to use this converter class is
to accept the default value for the buffer size parameter. Thus,
ATL provides a simple typedef to facilitate this:
typedef CA2WEX<> CA2W;
To use this converter class, you need to write
only simple code such as the following:
void PutName (LPCWSTR lpwszName);
void RegisterName (LPCSTR lpsz) {
    PutName (CA2W(lpsz));
}
Two other use cases are also common in
practice:
- Receiving a generic-text string and passing it to a
method that expects an OLESTR as input
- Receiving an OLESTR and passing it to a
method that expects a generic-text string
The conversion classes are easily employed to
deal with these cases:
void PutAddress(LPOLESTR lpszAddress);
void RegisterAddress(LPTSTR lpsz) {
    PutAddress(CT2OLE(lpsz));
}
void PutNickName(LPTSTR lpszName);
void RegisterNickName(LPOLESTR lpsz) {
    PutNickName(COLE2T(lpsz));
}
A Note on Memory Management
As convenient as the conversion classes are, you
can run into some nasty pitfalls if you use them incorrectly. The
conversion classes allocate the memory for the converted text
automatically and clean it up in the class destructor. This is
useful because you don't have to worry about buffer management.
However, it also means that code like this is a crash waiting to
happen:
LPOLESTR ConvertString(LPTSTR lpsz) {
    return CT2OLE(lpsz);
}
You've just returned either a pointer to the
stack of the called function (which is trashed when the function
returns) if the string was short, or a pointer to an array on the
heap that will be deallocated before the function returns.
The
worst part is that, depending on your macro selection, the code
might work just fine but will crash when you switch from ANSI to
Unicode for the first time (usually two days before ship). To avoid
this, make sure that you copy the converted string to a separate
buffer (or use a string class) first if you need it for more than a
single expression.
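A portable sketch of the safe pattern follows. WidenTemp is a hypothetical stand-in for a conversion temporary such as CT2OLE, and std::wstring plays the role of the string class; with ATL you would write the same shape using the real conversion class and whichever string class your project uses.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a conversion temporary: it owns the converted
// text and releases it in its destructor.
class WidenTemp {
public:
    explicit WidenTemp(const char* psz) {
        assert(psz != nullptr);
        for (; *psz != '\0'; ++psz) {
            // Assumption: ASCII/Latin-1 input, one byte per character.
            m_text.push_back(static_cast<wchar_t>(static_cast<unsigned char>(*psz)));
        }
    }
    operator const wchar_t*() const { return m_text.c_str(); }

private:
    std::wstring m_text;
};

// Unsafe shape (do not do this): the temporary dies at the end of the
// return statement, so the caller receives a dangling pointer.
//   const wchar_t* ConvertString(const char* psz) { return WidenTemp(psz); }

// Safe shape: copy into a string object that the caller owns.
std::wstring ConvertStringSafely(const char* psz) {
    WidenTemp converted(psz);
    const wchar_t* view = converted;  // valid only while 'converted' lives
    return std::wstring(view);        // the copy outlives the converter
}
```

The copy costs an allocation, but the returned object carries its own lifetime, so nothing depends on when the converter is destroyed.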
ATL String-Helper Functions
Sometimes you want to copy a string of
OLECHAR characters. You also happen to know that
OLECHAR characters are wide characters on the Win32
operating system. When writing a Win32 version of your component,
you might call the Win32 operating system function
lstrcpyW, which copies wide characters. Unfortunately,
Windows NT/2000, which supports Unicode, implements
lstrcpyW, but Windows 95 does not. A component that uses
the lstrcpyW API doesn't work correctly on Windows 95.
Instead of lstrcpyW, use the ATL
string-helper function ocscpy to copy an OLECHAR
character string. It works properly on both Windows NT/2000 and
Windows 95. The ATL string-helper function ocslen returns
the length of an OLECHAR string. This is nice for
symmetry, although the lstrlenW function it replaces does
work on both operating systems.
OLECHAR* ocscpy(LPOLESTR dest, LPCOLESTR src);
size_t ocslen(LPCOLESTR s);
Similarly, the Win32 CharNextW
operating system function doesn't work on Windows 95, so ATL
provides a CharNextO string-helper function that
increments an OLECHAR* by one character and returns the
next character pointer. It does not increment the pointer beyond a
NUL termination character.
LPOLESTR CharNextO(LPCOLESTR lp);
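To show the behavior these helpers promise, here is a portable sketch with hypothetical names (XOLECHAR, CountChars, CopyChars, NextChar). It assumes a wide OLECHAR and is not the ATL implementation; it only mirrors the contracts described above, including the rule that stepping forward never moves past the terminator.

```cpp
#include <cassert>
#include <cstddef>

typedef wchar_t XOLECHAR;  // hypothetical alias; assumes a wide OLECHAR

// Length in characters, not counting the terminator.
std::size_t CountChars(const XOLECHAR* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != L'\0') ++n;
    return n;
}

// Copies src, including its terminator, into dest and returns dest.
// The caller must supply a destination that is large enough.
XOLECHAR* CopyChars(XOLECHAR* dest, const XOLECHAR* src) {
    assert(dest != nullptr && src != nullptr);
    XOLECHAR* out = dest;
    while ((*out++ = *src++) != L'\0') {}
    return dest;
}

// Advances one character but never steps past the terminator.
const XOLECHAR* NextChar(const XOLECHAR* p) {
    assert(p != nullptr);
    return (*p != L'\0') ? p + 1 : p;
}
```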
ATL String-Conversion Macros
The string-conversion classes discussed
previously were introduced in ATL 7. ATL 3 (and code written with
ATL 3) used a set of macros instead. In fact, these macros are
still in use in the ATL code base. For example, this code is in the
atlctl.h header:
STDMETHOD(Help)(LPCOLESTR pszHelpDir) {
    T* pT = static_cast<T*>(this);
    USES_CONVERSION;
    ATLTRACE(atlTraceControls,2,
        _T("IPropertyPageImpl::Help\n"));
    CComBSTR szFullFileName(pszHelpDir);
    CComHeapPtr<OLECHAR>
        pszFileName(LoadStringHelper(pT->m_dwHelpFileID));
    if (pszFileName == NULL)
        return E_OUTOFMEMORY;
    szFullFileName.Append(OLESTR("\\"));
    szFullFileName.Append(pszFileName);
    WinHelp(pT->m_hWnd, OLE2CT(szFullFileName),
        HELP_CONTEXTPOPUP, NULL);
    return S_OK;
}
The macros behave much like the conversion
classes, minus the leading C in the macro name. So, to
convert from TCHAR to OLECHAR, you use
T2OLE(s).
Two major differences arise between the macros
and the conversion classes. First, the macros require some local
variables to work; the USES_CONVERSION macro is required
in any function that uses the conversion macros. (It declares these
local variables.) The second difference is the location of the
conversion buffer.
In the conversion classes, the buffer is stored
either as a member variable on the stack (if the buffer is small)
or on the heap (if the buffer is large). The conversion macros
always use the stack. They call the runtime function
_alloca, which allocates extra space on the local
stack.
Although it is fast, _alloca has some
serious downsides. The stack space isn't freed until the function
exits, which means that if you do conversion in a loop, you might
end up blowing out your stack space. Another nasty problem is that
if you use the conversion macros inside a C++ catch block,
the _alloca call messes up the exception-tracking
information on the stack and you crash.
The ATL team apparently took two swipes at
improving the conversion macros. The final solution is the
conversion classes. However, a second set of conversion macros
exists: the _EX flavor. These are used much like the
original conversion macros; you put USES_CONVERSION_EX at
the top of the function. The macros have an _EX suffix, as
in T2A_EX. The _EX macros are different, however:
They take two parameters, not one. The first parameter is the
buffer to convert from as usual. The second parameter is a
threshold value. If the converted buffer is smaller than this
threshold, the memory is allocated via _alloca. If the
buffer is larger, it is allocated on the heap instead. So, these
macros give you a chance to avoid the stack overflow. (They still won't help you
in a catch block.) The ATL code uses the _EX
macros extensively; the previous example is the only one left that
still uses the old macros.
We don't go into the details of either macro set
here; the conversion classes are much safer to use and are
preferred for new code. We mention them only so that you know what
you're looking at if you see them in older code or the ATL sources
themselves.