QStringLiteral explained
QStringLiteral is a new macro introduced in Qt 5 to create a QString from a string literal. (String literals are strings inside "" included in the source code.) In this blog post, I explain its inner workings and implementation.
Summary
Let me start by giving a guideline on when to use it. If you want to initialize a QString from a string literal in Qt5, you should use:
- Most of the cases: QStringLiteral("foo"), if it will actually be converted to QString.
- QLatin1String("foo"), if it is used with a function that has an overload for QLatin1String (such as operator==, operator+, startsWith, replace, ...).
I have put this summary at the beginning for those who don't want to read the technical details that follow.
Read on to understand how QStringLiteral works.
Reminder on how QString works
QString, like many classes in Qt, is an implicitly shared class. Its only member is a pointer to the 'private' data. The QStringData is allocated with malloc, and enough room is allocated after it to put the actual string data in the same memory block.
    // Simplified for the purpose of this blog
    struct QStringData {
        QtPrivate::RefCount ref;    // wrapper around a QAtomicInt
        int size;                   // size of the string
        uint alloc : 31;            // amount of memory reserved after this string data
        uint capacityReserved : 1;  // internal detail used for reserve()
        qptrdiff offset;            // offset to the data (usually sizeof(QStringData))

        inline ushort *data()
        { return reinterpret_cast<ushort *>(reinterpret_cast<char *>(this) + offset); }
    };

    // ...

    class QString {
        QStringData *d;
    public:
        // ... public API ...
    };
The offset locates the string data relative to the QStringData itself. In Qt4, it used to be an actual pointer. We'll see why it has been changed.
The actual data in the string is stored in UTF-16, which uses 2 bytes per character.
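To make the "one memory block" idea concrete, here is a minimal sketch in plain C++ without Qt. ToyStringData, toyAllocate and toyFree are made-up names for illustration, the reference count is a plain int rather than an atomic, and the real QStringData has more fields than this.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical stand-in for QStringData; not the real Qt type.
struct ToyStringData {
    int ref;                // reference count (plain int here, Qt wraps an atomic)
    int size;               // number of UTF-16 code units, terminator excluded
    std::ptrdiff_t offset;  // distance in bytes from this header to the characters

    char16_t *data() {
        return reinterpret_cast<char16_t *>(reinterpret_cast<char *>(this) + offset);
    }
};

// Allocate the header and the characters in a single malloc block,
// then copy the text right behind the header.
ToyStringData *toyAllocate(const char16_t *text, int size) {
    void *block = std::malloc(sizeof(ToyStringData) + (size + 1) * sizeof(char16_t));
    if (!block)
        return nullptr;
    ToyStringData *d = new (block) ToyStringData{
        1, size, static_cast<std::ptrdiff_t>(sizeof(ToyStringData))};
    std::memcpy(d->data(), text, size * sizeof(char16_t));
    d->data()[size] = u'\0';
    return d;
}

void toyFree(ToyStringData *d) { std::free(d); }
```

Calling toyAllocate(u"Hi", 2) yields one block in which data() points just past the header, and a single toyFree releases both the header and the characters.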
Literals and Conversion
String literals are the strings that appear directly in the source code, between quotes.
Here are some examples (suppose action, string, and filename are QStrings):
    o->setObjectName("MyObject");
    if (action == "rename")
        string.replace("%FileName%", filename);
In the first line, we call the function QObject::setObjectName(const QString&). There is an implicit conversion from const char* to QString, via its constructor. A new QStringData is allocated with enough room to hold "MyObject", and then the string is copied and converted from UTF-8 to UTF-16.
The same happens in the last line where the function QString::replace(const QString &, const QString &) is called. A new QStringData is allocated for "%FileName%".
Is there a way to prevent the allocation of QStringData and copy of the string?
Yes, one solution to avoid the costly creation of a temporary QString object is to provide overloads of common functions that take a const char* parameter.
So we have such overloads for operator==.
The overloads do not need to create a new QString object for our literal and can operate directly on the raw char*.
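The effect of such an overload can be sketched with a toy class in plain C++. ToyString and its conversions counter are hypothetical and exist only to count how often a char* is turned into a string object. For brevity the sketch treats each byte as one character, which only holds for ASCII and Latin-1 text, not for the UTF-8 decoding Qt5 uses.

```cpp
#include <cstddef>
#include <string>

// Hypothetical toy type; it counts how often a char* gets converted.
struct ToyString {
    std::u16string text;
    static int conversions;  // number of const char* -> ToyString conversions

    ToyString(const char *ascii) {  // implicit conversion: allocates and widens
        ++conversions;
        for (const char *p = ascii; *p; ++p)
            text.push_back(static_cast<char16_t>(static_cast<unsigned char>(*p)));
    }
};
int ToyString::conversions = 0;

bool operator==(const ToyString &a, const ToyString &b) { return a.text == b.text; }

// Overload that walks the raw characters; no temporary ToyString is built.
bool operator==(const ToyString &a, const char *b) {
    std::size_t i = 0;
    for (; i < a.text.size() && b[i] != '\0'; ++i) {
        if (a.text[i] != static_cast<char16_t>(static_cast<unsigned char>(b[i])))
            return false;
    }
    return i == a.text.size() && b[i] == '\0';
}
```

With both overloads visible, action == "rename" picks the const char* version because it needs no user-defined conversion, so the counter stays unchanged by the comparison.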
Encoding and QLatin1String
In Qt5, we changed the default decoding for the char* strings to UTF-8. But many algorithms are much slower with UTF-8 than with plain ASCII or latin1.
Hence you can use QLatin1String, which is just a thin wrapper around char * that specifies the encoding. There are overloads taking QLatin1String for functions that can operate on the raw latin1 data directly without conversion.
So our first example now looks like:
    o->setObjectName(QLatin1String("MyObject"));
    if (action == QLatin1String("rename"))
        string.replace(QLatin1String("%FileName%"), filename);
The good news is that QString::replace and operator== have overloads for QLatin1String. So that is much faster now.
In the call to setObjectName, we avoided the conversion from UTF-8, but we still have an (implicit) conversion from QLatin1String to QString which has to allocate the QStringData on the heap.
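Why can a latin1 overload skip the conversion? Latin-1 bytes map one-to-one onto the first 256 Unicode code points, so a byte can be compared with a UTF-16 code unit directly. The following is a rough sketch of that idea in plain C++. ToyLatin1View and equalsLatin1 are invented names, and this is not how Qt implements its overloads.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical thin wrapper, loosely modelled on the idea behind QLatin1String:
// it only remembers a pointer and a length, and copies nothing.
struct ToyLatin1View {
    const char *chars;
    std::size_t size;
    explicit ToyLatin1View(const char *s) : chars(s), size(std::strlen(s)) {}
};

// Compare UTF-16 text with Latin-1 bytes without building a temporary string.
// Each Latin-1 byte is widened to the code unit with the same numeric value.
bool equalsLatin1(const std::u16string &utf16, ToyLatin1View latin1) {
    if (utf16.size() != latin1.size)
        return false;
    for (std::size_t i = 0; i < latin1.size; ++i) {
        if (utf16[i] != static_cast<char16_t>(static_cast<unsigned char>(latin1.chars[i])))
            return false;
    }
    return true;
}
```

The explicit constructor mirrors the point of the wrapper: the caller states the encoding on purpose instead of relying on an implicit conversion.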
Introducing QStringLiteral
Is it possible to avoid the allocation and copy of the string literal even for cases like setObjectName? Yes, that is what QStringLiteral is doing.
This macro will try to generate the QStringData at compile time with all the fields initialized. It will even be located in the .rodata section, so it can be shared between processes.
We need two language features to do that:
- The possibility to generate UTF-16 at compile time: On Windows we can use the wide char L"String". On Unix we are using the new C++11 Unicode literal: u"String". (Supported by GCC 4.4 and clang.)
- The ability to create static data from expressions. We want to be able to put QStringLiteral everywhere in the code. One way to do that is to put a static QStringData inside a C++11 lambda expression. (Supported by MSVC 2010 and GCC 4.5.) (And we also make use of the GCC __extension__ ({ }) construct. Update: The support for the GCC extension was removed before the beta because it does not work in every context where lambdas work, such as in default function arguments.)
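Both features can be tried out in isolation, without Qt. The sketch below uses a made-up macro, TOY_UTF16_LITERAL, that returns a pointer to static UTF-16 storage. It shows only the lambda trick and is not the QStringLiteral macro itself.

```cpp
#include <cstddef>

// Feature 1: the u"" prefix produces UTF-16 at compile time. sizeof counts the
// terminator too, so divide by the code unit size and subtract one.
static_assert(sizeof(u"MyObject") / sizeof(char16_t) - 1 == 8,
              "MyObject has 8 UTF-16 code units");

// Feature 2: a lambda gives the static variable a scope of its own, so the
// macro can be used wherever an expression is allowed.
// TOY_UTF16_LITERAL is a hypothetical name used for this illustration only.
#define TOY_UTF16_LITERAL(str) ([]() -> const char16_t * { \
        static const char16_t storage[] = u"" str;         \
        return storage;                                    \
    }())

// The expression form even works as a default function argument.
inline char16_t toyFirstChar(const char16_t *s = TOY_UTF16_LITERAL("x")) {
    return s[0];
}
```

Writing u"" str relies on adjacent string literals being concatenated, with the prefix of the first applying to the whole literal.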
Implementation
We will need a POD structure that contains both the QStringData and the actual string. Its structure will depend on the method we use to generate UTF-16.
The code below was extracted from qstring.h, with added comments and edited for readability.
    /* We define QT_UNICODE_LITERAL_II and declare the qunicodechar
       depending on the compiler */
    #if defined(Q_COMPILER_UNICODE_STRINGS)
       // C++11 unicode string
       #define QT_UNICODE_LITERAL_II(str) u"" str
       typedef char16_t qunicodechar;
    #elif __SIZEOF_WCHAR_T__ == 2
       // wchar_t is 2 bytes (condition a bit simplified)
       #define QT_UNICODE_LITERAL_II(str) L##str
       typedef wchar_t qunicodechar;
    #else
       typedef ushort qunicodechar; // fallback
    #endif

    // The structure that will contain the string.
    // N is the string size
    template <int N>
    struct QStaticStringData
    {
        QStringData str;
        qunicodechar data[N + 1];
    };

    // Helper class wrapping a pointer that we can pass to the QString constructor
    struct QStringDataPtr
    { QStringData *ptr; };

    #if defined(QT_UNICODE_LITERAL_II)
    // QT_UNICODE_LITERAL needed because of macro expansion rules
    # define QT_UNICODE_LITERAL(str) QT_UNICODE_LITERAL_II(str)
    # if defined(Q_COMPILER_LAMBDA)
    #  define QStringLiteral(str) ([]() -> QString { \
            enum { Size = sizeof(QT_UNICODE_LITERAL(str))/2 - 1 }; \
            static const QStaticStringData<Size> qstring_literal = \
            { Q_STATIC_STRING_DATA_HEADER_INITIALIZER(Size), QT_UNICODE_LITERAL(str) }; \
            QStringDataPtr holder = { &qstring_literal.str }; \
            const QString s(holder); \
            return s; \
        }())
    # elif defined(Q_CC_GNU)
    // Use GCC's __extension__ ({ }) trick instead of lambda
    // ... <skipped> ...
    # endif
    #endif

    #ifndef QStringLiteral
    // no lambdas, not GCC, or GCC in C++98 mode with 4-byte wchar_t
    // fallback, return a temporary QString
    // source code is assumed to be encoded in UTF-8
    # define QStringLiteral(str) QString::fromUtf8(str, sizeof(str) - 1)
    #endif
Let us simplify this macro a bit and look at how it would expand:
    o->setObjectName(QStringLiteral("MyObject"));
    // would expand to:
    o->setObjectName(([]() {
            // We are in a lambda expression that returns a QString

            // Compute the size using sizeof, (minus the null terminator)
            enum { Size = sizeof(u"MyObject")/2 - 1 };

            // Initialize. (This is static data initialized at compile time.)
            static const QStaticStringData<Size> qstring_literal =
            { { /* ref = */ -1,
                /* size = */ Size,
                /* alloc = */ 0,
                /* capacityReserved = */ 0,
                /* offset = */ sizeof(QStringData) },
              u"MyObject" };

            QStringDataPtr holder = { &qstring_literal.str };
            QString s(holder); // call the QString(QStringDataPtr&) constructor
            return s;
        }()) // Call the lambda
      );
The reference count is initialized to -1. A negative value is never incremented or decremented because we are in read-only data.
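The rule can be written down in a few lines. This is a simplified, non-atomic sketch with a made-up ToyRefCount type. The real QtPrivate::RefCount wraps an atomic integer and handles more states than the two shown here.

```cpp
// Hypothetical, non-atomic stand-in for the reference counter.
struct ToyRefCount {
    int count;

    // A negative count marks static (read-only) data.
    bool isStatic() const { return count < 0; }

    // Take a reference; static data is left untouched.
    void ref() {
        if (!isStatic())
            ++count;
    }

    // Drop a reference; returns false when the owner should free the data.
    bool deref() {
        if (isStatic())
            return true;  // never freed, never written to
        return --count != 0;
    }
};
```

Since both functions return before touching count when it is negative, copying and destroying strings that share static data never writes to that memory.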
One can see why it is so important to have an offset (qptrdiff) rather than a pointer to the string (ushort*) as it was in Qt4. It is indeed impossible to put a pointer in the read-only section because pointers might need to be relocated at load time. That means that each time an application or library is loaded, the OS needs to rewrite all the pointer addresses using the relocation table.
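Here is a stripped-down version of the same layout that compiles without Qt. ToyHeader, ToyStaticString and helloLiteral are hypothetical names. Every field is an integer constant known at compile time, and no address is stored in the object. Whether the object ends up in .rodata is still up to the compiler and linker.

```cpp
#include <cstddef>

// Hypothetical header, reduced to the fields needed for the illustration.
struct ToyHeader {
    int ref;
    int size;
    std::ptrdiff_t offset;  // relative position of the characters, not an address

    const char16_t *data() const {
        return reinterpret_cast<const char16_t *>(
            reinterpret_cast<const char *>(this) + offset);
    }
};

// Header and characters side by side, like QStaticStringData in the article.
template <int N>
struct ToyStaticString {
    ToyHeader header;
    char16_t data[N + 1];
};

// Constant-initialized: nothing in here needs fixing up when the binary loads.
static const ToyStaticString<5> helloLiteral = {
    { -1, 5, sizeof(ToyHeader) },
    u"Hello"
};
```

Had the header stored a const char16_t * instead, its value would depend on where the binary is mapped, which is exactly the relocation problem described above.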
Results
For fun, we can look at the assembly generated for a very simple call to QStringLiteral. We can see that there is almost no code, and how the data is laid out in the .rodata section.
We notice the overhead in the binary. The string takes twice as much memory since it is encoded in UTF-16, and there is also a header of sizeof(QStringData) = 24. This memory overhead is the reason why it still makes sense to use QLatin1String when the function you are calling has an overload for it.
QString returnAString() { return QStringLiteral("Hello"); }
Compiled with g++ -O2 -S -std=c++0x (GCC 4.7) on x86_64:
        .text
        .globl  _Z13returnAStringv
        .type   _Z13returnAStringv, @function
    _Z13returnAStringv:
        ; load the address of the QStringData into %rdx
        leaq    _ZZZ13returnAStringvENKUlvE_clEvE15qstring_literal(%rip), %rdx
        movq    %rdi, %rax
        ; copy the QStringData pointer from %rdx to the QString return object
        ; allocated by the caller. (the QString constructor has been inlined)
        movq    %rdx, (%rdi)
        ret
        .size   _Z13returnAStringv, .-_Z13returnAStringv
        .section    .rodata
        .align 32
        .type   _ZZZ13returnAStringvENKUlvE_clEvE15qstring_literal, @object
        .size   _ZZZ13returnAStringvENKUlvE_clEvE15qstring_literal, 40
    _ZZZ13returnAStringvENKUlvE_clEvE15qstring_literal:
        .long   -1   ; ref
        .long   5    ; size
        .long   0    ; alloc + capacityReserved
        .zero   4    ; padding
        .quad   24   ; offset
        .string "H"  ; the data. Each .string adds a terminating '\0'
        .string "e"
        .string "l"
        .string "l"
        .string "o"
        .string ""
        .string ""
        .zero   4
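The 40 bytes reported by .size can be reproduced with a little arithmetic. The helpers below are hypothetical and hard-code two assumptions taken from the listing: a 24-byte header and rounding up to a multiple of 8 bytes. Other platforms may lay the structure out differently.

```cpp
#include <cstddef>

// Bytes taken by a QStringLiteral-style object for `length` characters:
// header + UTF-16 characters + terminator, rounded up to a multiple of 8.
// The constants 24 and 8 are assumptions read off the x86_64 listing.
constexpr std::size_t toyLiteralBytes(std::size_t length) {
    return (24 + 2 * (length + 1) + 7) / 8 * 8;
}

// Bytes taken by the plain char array behind a QLatin1String-style wrapper.
constexpr std::size_t toyLatin1Bytes(std::size_t length) {
    return length + 1;
}

static_assert(toyLiteralBytes(5) == 40, "matches the .size directive for Hello");
static_assert(toyLatin1Bytes(5) == 6, "Hello plus its terminator");
```

For "Hello" that is 40 bytes against 6, which is the overhead to weigh against the saved allocation and conversion at run time.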
Conclusion
I hope that now that you have read this you will have a better understanding of where to use and not to use QStringLiteral.
There is another macro, QByteArrayLiteral, which works on exactly the same principle but creates a QByteArray.
Update: See also the internals of QMutex and more C++11 features in Qt5.