C++ Experts Forum
The Standard Librarian: I/O and Function Objects: Containers of Pointers
Matthew Austern
Like most of the Standard C++ library, the standard container classes are parameterized by type: you can create an std::vector<int> to hold objects of type int, an std::vector<std::string> to hold string objects, and an std::vector<my_type> to hold objects of some user-defined type. It's also perfectly reasonable to create an std::vector<int*>, an std::vector<std::string*>, or an std::vector<my_type*>. Containers of pointers are important and common.
Unfortunately, in addition to being a common technique, containers of pointers are also among the most common sources of confusion for novices. Scarcely a week goes by on the C++ newsgroups without a post wondering why code like the following causes memory leaks:
{
    std::vector<my_type*> v;
    for (int i = 0; i < N; ++i)
        v.push_back(new my_type(i));
    ...
} // v is destroyed here
Is this memory leak a symptom of a compiler bug? Isn't std::vector's destructor supposed to destroy v's elements?
If you carefully think through how std::vector<T> works in general and you realize that there are no special rules for pointers — that, as far as vector is concerned, my_type* is just another T — then it's not too difficult to see why vector<my_type*> behaves as it does and why the code fragment in the last paragraph leaks memory. Nevertheless, the behavior of vector<my_type*> may seem surprising to people who are more familiar with older container libraries.
This column explains how containers of pointers behave, when containers of pointers are useful, and what to do when you need a container that performs more of the tasks of memory management for you than standard containers do by default.
Containers and Ownership
Standard containers use value semantics. For example, when you append a variable x to a vector v by writing:
v.push_back(x)
what you are actually doing is appending a copy of x. This expression stores (a copy of) x's value, as opposed to x's address. After you insert x into a vector you can do anything you like to x — such as giving it a new value or letting it go out of scope and be destroyed — without affecting the copy in the vector. An element of one container can't be an element of another (elements of two containers are necessarily distinct objects, even if the elements happen to compare equal), and removing an element from a container destroys that element (although other objects with the same value may exist elsewhere). Finally, a container "owns" its elements: when a container is destroyed, all of the elements in it are destroyed along with it.
These features are familiar from ordinary built-in arrays, and it may seem that they're too obvious to be worth mentioning. I'm listing them to make it clear how similar containers and arrays are. The most common conceptual mistake novices make when working with standard containers is thinking that the containers do more things "behind the scenes" than they really do.
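The value semantics described above can be checked with a few lines of code. This is only a minimal sketch; the function name copy_is_independent is made up for the illustration.

```cpp
#include <cassert>
#include <vector>

// Returns true if changing x after push_back leaves the stored copy untouched.
bool copy_is_independent()
{
    std::vector<int> v;
    int x = 1;
    v.push_back(x);        // stores a copy of x's value, not x's address
    assert(&v[0] != &x);   // the element and x are distinct objects
    x = 2;                 // giving x a new value...
    return v[0] == 1;      // ...does not affect the copy in the vector
}
```

The same function written with a built-in array instead of a vector would behave identically, which is the point of the comparison.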
Value semantics aren't always what you need: sometimes you need to store objects' addresses in a container, instead of just copying the objects' values. You can achieve the effect of reference semantics the same way with a container as with an array: by asking for it explicitly. You can put any kind of object in a container [1], and pointers are perfectly good objects themselves. Pointers take up memory; they can be assigned to; they have addresses; they have values that can be copied. If you need to store objects' addresses, you use a container of pointers. Instead of writing:
std::vector<my_type> v;
my_type x;
...
v.push_back(x);
you can write:
std::vector<my_type*> v;
my_type x;
...
v.push_back(&x);
In one sense, nothing has changed. You're still creating an std::vector<T>; it's just that in this case T happens to be a pointer type, my_type*. The vector still "owns" its elements, and they're still destroyed when the vector itself is, but you have to be clear on what those elements are: they're the pointers, not the things those pointers are pointing to.
This distinction between ownership of pointers and ownership of the pointed-to objects is just the same for a vector as for an array, or for a local variable. Suppose you write:
{
    my_type* p = new my_type;
}
The pointer p will disappear when it goes out of scope, but the object it points to, *p, will not. If you want to destroy that object and free its memory, you need to do it yourself, either by explicitly writing delete p or by some equivalent method. Similarly, there's no special code in std::vector<my_type*> that loops through the vector and calls delete on each element. The elements disappear when the vector does. If you want something else to happen to those elements before they're destroyed, you have to do it yourself.
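A small experiment makes the distinction concrete. The sketch below uses a hypothetical class, counted, that stands in for my_type and keeps a tally of live objects; the names counted and live_after_vector_destroyed are assumptions made for this example only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for my_type that counts live instances.
struct counted {
    static int live;
    counted()  { ++live; }
    ~counted() { --live; }
};
int counted::live = 0;

// Returns how many objects are still alive after the vector has been destroyed.
int live_after_vector_destroyed()
{
    counted* keep[3];                 // second copy of each pointer, for cleanup
    {
        std::vector<counted*> v;
        for (int i = 0; i < 3; ++i) {
            keep[i] = new counted;
            v.push_back(keep[i]);
        }
    }   // v and its three pointers are destroyed here
    int still_alive = counted::live;  // the pointed-to objects are untouched
    for (int i = 0; i < 3; ++i)
        delete keep[i];               // so they have to be deleted by hand
    assert(counted::live == 0);
    return still_alive;
}
```

The function returns 3: destroying the vector destroyed three pointers and nothing else. Without the keep array, those three objects would have been unreachable and would have leaked.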
You might wonder why std::vector, and the other standard containers, weren't designed to do anything special with pointers. First, of course, there is the simple argument of uniformity: it's easier to understand a library with uniform semantics than a library with lots of special cases. If there were special cases, it would be hard to know where to draw the line. Would you treat iterators, or user-defined handle types, the same as pointers? If there were an exception to the general rule for vector<my_type*>, would there be an exception to the exception for vector<const my_type*>? How would the container know when to do delete p and when to do delete[] p?
Second, and more important: if std::vector<my_type*> did automatically own the pointed-to objects, std::vector would be a lot less useful. After all, if you want a vector that owns a collection of my_type objects, you've already got vector<my_type>. A vector<my_type*> is for those times when you need something different, when value semantics and strict ownership aren't good enough. You can use a container of pointers when you have objects that are referred to by more than one container, or objects that can appear more than once in a single container, or even pointers that don't point to valid objects in the first place. (They might instead be null pointers, pointers to raw memory, or pointers to subobjects.)
Imagine a few specific examples:
· You're maintaining a list of tasks, some of which are currently active and some of which are suspended. You have an std::list<task> for the full set of tasks, and an std::vector<task*> for the active subset.
· Your program has a string table: an std::vector<const char*>, where each element p points to a null-terminated array of characters. Depending on how you design your string table, you might use string literals, or you might point into a single large array of char — but either way, you certainly wouldn't want to have a loop that went through the vector and invoked delete p on each element.
· You're doing I/O multiplexing, and you pass an std::vector<std::istream*> to a function. The input streams were opened elsewhere, they'll be closed elsewhere, and perhaps one of them is &std::cin.
None of these uses would be possible if containers of pointers tried to be helpful by deleting the pointed-to objects.
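The first of these examples might be sketched as follows. The task type shown here, with its id and active members, is a hypothetical placeholder; a real task class would look different.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical task type; only the fields needed for the illustration.
struct task {
    int id;
    bool active;
    task(int i, bool a) : id(i), active(a) {}
};

// Builds the owning list and a non-owning view of the active subset,
// then returns the sum of the active tasks' ids.
int sum_of_active_ids()
{
    std::list<task> all;                 // owns every task, by value
    all.push_back(task(1, true));
    all.push_back(task(2, false));
    all.push_back(task(3, true));

    std::vector<task*> active;           // refers to tasks owned by the list
    for (std::list<task>::iterator i = all.begin(); i != all.end(); ++i)
        if (i->active)
            active.push_back(&*i);       // list elements keep their addresses
    assert(active.size() == 2);

    int sum = 0;
    for (std::size_t k = 0; k < active.size(); ++k)
        sum += active[k]->id;
    return sum;                          // no delete anywhere: the list cleans up
}
```

The choice of std::list for the owning container matters in this sketch: inserting into a list never moves existing elements, so the pointers in the vector stay valid as long as the tasks themselves remain in the list.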
Owning the Pointed-To Objects
If you create a container of pointers, the reason should usually be that the pointed-to objects are created and destroyed somewhere else. Are there ever any situations where it would make sense to have a container that owns not only pointers but also the pointed-to objects? Yes. I know of only a single good reason for such an owning container, but it's an important one: polymorphism.
Polymorphism in C++ is tied to pointer/reference semantics. Suppose, for example, that task isn't just a class, but that it's the base of a class hierarchy. If p is a task*, then p might point to a task object or it might point to an object of some class derived from task. When you invoke one of task's virtual member functions through p, the appropriate function will be selected at run time depending on which derived class p is pointing to.
Unfortunately, making task the base of a polymorphic hierarchy means that you can't use a vector<task>. Objects in a container are stored by value; an element of a vector<task> is necessarily a task object, as opposed to a derived class object. (In fact, if you follow the common advice that the base of a hierarchy should be an abstract base class, then the compiler won't even let you create task objects or a vector<task>.)
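The effect of storing a derived object by value can be seen in a short sketch. The hierarchy below (task with a priority function, and urgent_task derived from it) is invented for the illustration, and task is deliberately not abstract so that vector<task> compiles at all.

```cpp
#include <cassert>
#include <vector>

// Hypothetical two-class hierarchy.
struct task {
    virtual ~task() {}
    virtual int priority() const { return 0; }
};
struct urgent_task : task {
    virtual int priority() const { return 10; }
};

// Storing by value keeps only the task part of an urgent_task.
int priority_stored_by_value()
{
    std::vector<task> v;
    urgent_task u;
    v.push_back(u);            // copies the task subobject of u; the rest is lost
    assert(v.size() == 1);
    return v[0].priority();    // calls task::priority
}

// Storing a pointer preserves the dynamic type.
int priority_stored_by_pointer()
{
    std::vector<task*> v;
    urgent_task u;
    v.push_back(&u);           // u is owned by this scope, not by v
    return v[0]->priority();   // calls urgent_task::priority
}
```

The first function returns 0 and the second returns 10, even though both vectors were handed the same kind of object.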
An object-oriented design usually means that you refer to an object through pointers or references, from the moment the object is created up to the moment it's destroyed. If you're going to own a collection of objects, you have very little choice but to own it through a container of pointers [2]. What's the best way to manage that container?
If you are using a container of pointers to own a set of objects, the main issue is making sure that all of the objects get destroyed when you're through with them. The most obvious solution, and probably the most common, is simply to loop through the container before you destroy it, calling delete for each element. If writing out the loop by hand is too much of a nuisance, it's easy enough to write a simple wrapper that will do it for you:
template <class T>
class my_vector : private std::vector<T*>
{
    typedef std::vector<T*> Base;
public:
    typedef typename Base::iterator iterator;
    using Base::begin;
    using Base::end;
    ...
public:
    ~my_vector() {
        // Delete every pointed-to object before the base vector is destroyed.
        for (iterator i = begin(); i != end(); ++i)
            delete *i;
    }
};
This technique can work, but it is more limited, and requires more discipline, than might be apparent at first sight.
The trouble is, fixing up the destructor isn't enough. If you have a container that lists all of the objects that are going to be destroyed, then you'd better make sure that an object gets destroyed whenever a pointer leaves the container, and also that a pointer never shows up in the container twice. You need to be careful when you remove a pointer with erase or clear, but you also need to be careful with container assignment and even with assignment through iterators: operations like v1 = v2, and v[n] = p, are dangerous. Standard algorithms, many of which perform assignments through iterators, are another danger. You obviously can't use algorithms like std::copy and std::replace; less obviously, you can't use std::remove, std::remove_if, or std::unique [3].
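The problem with std::remove is easy to reproduce. The sketch below uses pointers to automatic variables so that nothing really leaks; the function name is made up. The contents of the tail after std::remove are formally unspecified, so the check on the duplicate relies on an assumption: that the implementation simply assigns raw pointers over one another and leaves the last one in place, as the mainstream implementations do.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns true if, after std::remove, the vector no longer holds the
// removed pointer anywhere but does hold another pointer twice.
bool remove_loses_and_duplicates()
{
    int a = 1, b = 2, c = 3;     // automatic objects: no real leak in this demo
    std::vector<int*> v;
    v.push_back(&a);
    v.push_back(&b);
    v.push_back(&c);

    // std::remove works by assigning later elements over the removed one.
    std::vector<int*>::iterator new_end = std::remove(v.begin(), v.end(), &b);
    assert(new_end - v.begin() == 2);
    assert(v.size() == 3);       // remove never changes the container's size

    bool b_gone  = std::find(v.begin(), v.end(), &b) == v.end();
    bool c_twice = std::count(v.begin(), v.end(), &c) == 2;  // assumed tail behavior
    return b_gone && c_twice;
}
```

If these had been heap objects owned by the vector, a delete loop run afterwards would never see the pointer to b, and it would delete c twice.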
A wrapper class like my_vector can address some of these issues, but it can't address all of them. It's hard to see how you could prevent clients from using assignment in dangerous ways unless you prevented them from using it at all — and at that point, what you had wouldn't look much like a container.
The problem is that each element has to be tracked individually, so perhaps a solution is to wrap the pointers instead of wrapping the whole container.
The standard library defines a wrapper for pointers, std::auto_ptr<T>. An auto_ptr object holds a pointer p of type T*, and auto_ptr's destructor deletes the object that p points to. This looks just like what we're looking for: a wrapper class with a destructor that deletes a pointer. Using vector<auto_ptr<T> > as a replacement for vector<T*> is a natural idea.
It's a natural idea, but it's wrong. The reason, once again, is value semantics. Container classes assume that they can make copies of their elements. If you have a vector<T>, for example, then an object of type T has to behave like an ordinary value. If t1 is a value of type T, you had better be able to write:
T t2(t1)
and have t2 be a copy of t1.
Formally, in the language of the C++ Standard, T is required to be Assignable and CopyConstructible. Pointers satisfy these requirements — you can make a copy of a pointer — but auto_ptr does not. The whole point of auto_ptr is that it maintains strict ownership, so it does not permit copying. There's something that has the syntactic form of a copy constructor, but auto_ptr's "copy constructor" doesn't actually copy. If t1 is an std::auto_ptr<T> and you write:
std::auto_ptr<T> t2(t1)
then t2 will not be a copy of t1. Instead of copying, what happens is a transfer of ownership — t2 gets the value that t1 used to have, and t1 is changed to become a null pointer. An auto_ptr object is fragile: you can change its value just by looking at it.
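To see the transfer in isolation, here is a sketch built around a hypothetical class, transfer_ptr, that imitates only this one aspect of auto_ptr. It is not std::auto_ptr and leaves out almost all of its interface.

```cpp
#include <cassert>

// Hypothetical, stripped-down imitation of auto_ptr's copy behavior.
template <class T>
class transfer_ptr {
public:
    explicit transfer_ptr(T* p = 0) : p_(p) {}
    // Has the form of a copy constructor, but modifies its argument.
    transfer_ptr(transfer_ptr& other) : p_(other.p_) { other.p_ = 0; }
    ~transfer_ptr() { delete p_; }
    T* get() const { return p_; }
private:
    transfer_ptr& operator=(transfer_ptr&);  // left out of this sketch
    T* p_;
};

// Returns true if the source became null after being "copied".
bool copy_nulls_the_source()
{
    transfer_ptr<int> t1(new int(42));
    transfer_ptr<int> t2(t1);      // ownership moves from t1 to t2
    assert(t2.get() != 0 && *t2.get() == 42);
    return t1.get() == 0;          // t1 and t2 are not equal after the "copy"
}
```

Note that the "copy constructor" takes its argument by non-const reference. That is the hint that it does not behave like an ordinary copy.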
On some implementations, you'll find that you get a compile-time error when you try to create a vector<auto_ptr<T> >. That's if you're lucky; if you're unlucky, things will seem fine until you get unpredictable behavior at run time. Either way, the standard container classes can't cope with a type whose copy constructor doesn't copy. That's not what auto_ptr was designed for, and the Standard [4] even points out that "instantiating a standard library container with an auto_ptr results in undefined behavior." You should use auto_ptr when you need an exception-safe mechanism for deleting a pointer when you exit a scope; auto_ptr was named in analogy to automatic variables. You shouldn't try to use auto_ptr to manage pointers in container classes; it won't work.
Instead of auto_ptr, you should use a different kind of "smart pointer," a reference-counted pointer class. A reference-counted pointer keeps track of how many pointers are pointing to the same object. When you make a copy of a reference-counted pointer, the count is incremented; when you destroy a reference-counted pointer, the count is decremented. When the count goes to zero, the object the pointer was pointing to is automatically destroyed.
Writing a reference-counted pointer class isn't tremendously difficult, but it also isn't trivial; achieving thread safety requires special tricks. Fortunately, using reference counting doesn't mean that you have to write your own reference-counted pointer class; several such classes exist already and are freely available. You can, for example, use Boost's shared_ptr class [5]. I expect that shared_ptr, or something like it, will become part of the next revision of the C++ Standard.
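As an illustration of how such a pointer behaves inside a container, the following sketch assumes std::shared_ptr from <memory>, which arrived in C++11 and so postdates this column; the shape hierarchy and the function name are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical polymorphic hierarchy that counts live objects.
struct shape {
    static int live;
    shape() { ++live; }
    virtual ~shape() { --live; }
    virtual int sides() const = 0;
};
int shape::live = 0;
struct triangle : shape { int sides() const { return 3; } };
struct square   : shape { int sides() const { return 4; } };

// Returns the number of shapes alive after every owner has gone away.
int live_after_owners_destroyed()
{
    int total_sides = 0;
    {
        std::vector<std::shared_ptr<shape> > v;
        v.push_back(std::shared_ptr<shape>(new triangle));
        v.push_back(std::shared_ptr<shape>(new square));

        // Copying the vector copies the pointers and increments the counts.
        std::vector<std::shared_ptr<shape> > copy(v);
        for (std::size_t i = 0; i < copy.size(); ++i)
            total_sides += copy[i]->sides();
        assert(shape::live == 2);   // two objects, four pointers to them
    }   // the last owners are destroyed here, so the objects are deleted
    assert(total_sides == 7);
    return shape::live;
}
```

The function returns 0 without a single explicit delete. Because copying a reference-counted pointer really does produce an equivalent pointer, vector assignment and the standard algorithms are safe to use on such a container.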
Of course, reference counting is just a special kind of garbage collection. Like all forms of garbage collection, it's a system that automatically destroys objects once it determines that you don't need them anymore. The main advantage of reference-counted pointers is that they're easy to plug into an existing system: a mechanism like shared_ptr is a small stand-alone class, and you can use it in just a single part of a larger program. On the other hand, reference counting is one of the least efficient forms of garbage collection (every pointer assignment and copy requires some relatively complicated processing), and one of the least flexible (you have to be careful when two data structures include pointers to each other). Other forms of garbage collection work equally well in C++ programs. In particular, the Boehm conservative garbage collector [6] is free, portable, and well tested.
If you use a conservative garbage collector, you don't have to do very much more than link it into your program. You don't have to use any special pointer wrapper; you just allocate memory without worrying about deleting it. In particular, if you create a vector<T*>, you know that the pointed-to objects will never be deleted as long as the vector exists (the garbage collector will never destroy objects while there are still pointers to them), and you also know that they will be deleted some time after the vector is destroyed (unless, that is, some other part of the program continues to refer to them).
The advantage of garbage collection — whether with reference counting, or with a conservative garbage collector, or with some other method — is that it lets you treat objects' lifetimes as completely indeterminate: you don't have to keep track of which parts of the program refer to which objects at any particular time. On the other hand, the disadvantage of garbage collection is exactly the same! Sometimes you really do know objects' lifetimes, or at least you know that an object will never persist after a particular phase of your program has finished. You might be building a complicated parse tree, for example; maybe it's full of polymorphic objects, and maybe it's too complicated to keep track of each node individually, but you can be sure that you won't need any of them once you're done parsing.
From a manually managed vector<T*> through vector<shared_ptr<T> > to conservative garbage collection, we've steadily retreated from the notion of a vector that owns a collection of objects; garbage collection is based on the premise that "ownership" of objects is irrelevant. In a sense, it solves the problem of ownership through a container by defining the problem away.
If your program does have well-defined phases, then you might reasonably want to destroy a set of objects at the end of one phase. Instead of garbage collection, an alternative technique is allocating objects through an arena: maintaining a list of objects that you can destroy all at once.
You might wonder how this differs from the technique I mentioned earlier, looping through a container of pointers and invoking delete on each element. Is there any real difference between the two techniques? If not, what happened to all of the dangers and limitations that I spent so much time on?
The difference is small, but important: an arena stores a collection of objects for the purpose of deleting them later and for no other reason. It may be implemented in terms of a standard container, but it does not expose the container interface. You run into problems with a vector<T*> because you can remove elements, copy over elements, and apply algorithms. Arenas are a protocol of strict ownership. The arena contains one and only one pointer to each object it manages, and it owns all of those objects; it does not let you remove, duplicate, overwrite, or iterate through any of its pointers. To work with the objects in an arena, you need to store pointers to them somewhere else — in a container, in a tree, or in whatever data structure is appropriate. Manipulation and ownership have been completely separated.
Arenas are a general idea. An arena can be as simple as a container where you remember never to use unsafe member functions, or it can be a wrapper class that tries to enforce slightly more safety. Many such wrappers have been written [7]. Listing 1 is a simplified example of yet another arena class, which uses an implementation trick so that you don't need a different arena for each pointer type. You might, for example, write:
arena a;
...
std::vector<int*> v;
v.push_back(a.create<int>(3));
...
a.destroy_all();
We've come almost back to the beginning: maintaining a vector containing pointers to objects that are owned and managed elsewhere.
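For readers who want something to experiment with, here is one way an arena with that interface could be put together. It is a rough sketch written for this purpose and should not be taken as the column's own listing: the entry struct, the destroy helper, the size member, and the single-argument form of create are all assumptions. It records each object together with a function that knows how to delete it, so one arena can hold objects of any type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class arena {
    struct entry {
        void* object;
        void (*deleter)(void*);     // knows the object's real type
    };
    template <class T>
    static void destroy(void* p) { delete static_cast<T*>(p); }

    arena(const arena&);            // not copyable: strict ownership
    arena& operator=(const arena&);

    std::vector<entry> entries_;    // never exposed to clients

public:
    arena() {}
    ~arena() { destroy_all(); }

    template <class T, class Arg>
    T* create(const Arg& arg)
    {
        // Reserve the slot first, so a failed push_back cannot leak the object.
        entry e = { 0, &destroy<T> };
        entries_.push_back(e);
        T* p = new T(arg);
        entries_.back().object = p;
        return p;
    }

    void destroy_all()
    {
        // Destroy in reverse order of creation.
        while (!entries_.empty()) {
            entry e = entries_.back();
            entries_.pop_back();
            e.deleter(e.object);
        }
    }

    std::size_t size() const { return entries_.size(); }
};

// Usage mirroring the fragment above: the vector refers, the arena owns.
int sum_through_vector()
{
    arena a;
    std::vector<int*> v;
    v.push_back(a.create<int>(3));
    v.push_back(a.create<int>(4));
    int sum = *v[0] + *v[1];
    a.destroy_all();                // v's pointers must not be used after this
    assert(a.size() == 0);
    return sum;
}
```

The arena offers no way to remove, overwrite, or iterate over its pointers, so the vector v can be sorted, copied, or trimmed freely without any effect on when the objects are destroyed.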
Conclusion
Pointers are common in C++ programs, and so are standard containers; it's no surprise that the combination, containers of pointers, is also common.
Most of the difficulties that novices have with containers of pointers concern the issue of ownership: when do the pointed-to objects get deleted? Most of the techniques for managing containers of pointers come down to a single principle: if you have a container of pointers, the pointed-to objects should be owned elsewhere.
· If you are maintaining a collection of non-polymorphic objects of type my_type, you should store the objects by value in a container, such as a list<my_type> or a deque<my_type>. If need be, you can also have a secondary container that stores pointers to some or all of those objects.
· Don't try to put auto_ptrs into a standard container.
· If you have a collection of polymorphic objects, you need to manage it as a collection of pointers. (Although those pointers might be wrapped inside some kind of handle or smart pointer class.) When the lifetime of those objects isn't known in advance, or where lifetime issues aren't important, the easiest solution is to use garbage collection. The two simplest choices for garbage collection are a reference-counted pointer class and a conservative garbage collector. Which choice is best for you may depend on such issues as availability of tools.
· If you have a collection of polymorphic objects and you need control over their lifetime, the simplest solution is to use an arena. An example of a simple arena class is shown in Listing 1.
Notes
[1] Well, almost any: there are some restrictions — which will be discussed later — on the objects you put in a container. Most reasonable types conform to those restrictions; pointers certainly do.
[2] There is one sense in which you have a choice: you can simulate value semantics by hiding polymorphic pointers inside a non-polymorphic wrapper class. See, for example, James Coplien, Advanced C++ Programming Styles and Idioms (Addison-Wesley, 1991), for a discussion of this "envelope and letter" idiom. See also chapter 14 of Andrew Koenig and Barbara Moo's Accelerated C++ (Addison-Wesley, 2000), for a family of generic handle classes. However, while the envelope-and-letter idiom is useful, it is also fairly heavyweight and it will affect many aspects of your design. Unless you have other reasons for using an envelope-and-letter design, it would be silly to turn to it just for the sake of container classes.
[3] See Harald Nowak, "A remove_if for vector<T*>," C/C++ Users Journal, July 2001, for an explanation of the problems with remove and remove_if, and for a technique that avoids them. However, this technique does not generalize to unique.
[4] ISO/IEC 14882:1998, §20.4.5, paragraph 3.
[5] <www.boost.org/libs/smart_ptr/shared_ptr.htm>.
[6] Hans-J. Boehm, "A garbage collector for C and C++," <www.hpl.hp.com/personal/Hans_Boehm/gc>.
[7] See, for example, Andrew Koenig, "Allocating C++ objects in clusters," Journal of Object-Oriented Programming, 6(3), 1993.
Matt Austern is the author of Generic Programming and the STL and the chair of the C++ standardization committee’s library working group. He works at AT&T Labs - Research and can be contacted at austern@research.att.com.