Sockets
This is a really old draft from 1997.
Originally a Berkeley UNIX feature, sockets make it easy to communicate between two running programs. You can use sockets to talk to other processes running on the same machine about as easily as you can use them to communicate with a computer on the other side of the earth. You can create your own protocols, or use the socket interface to talk to existing servers using standard protocols like HTTP, FTP, Network News Transfer Protocol (NNTP), or Simple Mail Transfer Protocol (SMTP), all widely used on the Internet.
Python provides socket support on most platforms, including virtually all Unix systems, as well as on Windows and Macintosh.
Creating Sockets
To use a socket in Python, you start by creating a socket object. This is done by calling the socket factory function in the module with the same name, specifying what kind of socket you want. Here’s an example:
    >>> import socket
    >>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    >>> s
    <socket object, fd=3, family=2, type=1, protocol=0>

(On a Windows box, the socket printout may look something like “<_socketobject instance at 88ece0>”.)
The first argument to the socket function specifies the address family. Python supports two families: AF_INET for Internet addresses, and AF_UNIX for UNIX inter-process communication. The latter is obviously only available on UNIX platforms.
The second argument gives the socket type, which controls the socket’s communication behavior. Here, we used SOCK_STREAM, which gives a two-way reliable byte stream. Data you send is guaranteed to make it to the receiver, in the same order as you sent it. If the network isn’t able to guarantee this, we’ll get an error from the socket interface. Another, less commonly used type is SOCK_DGRAM, which allows you to send and receive datagrams. A datagram is simply a block of data (usually with a fixed maximum length), but the protocol doesn’t guarantee that the data really makes it to the other end (and if it does, it may not arrive in the same order as it was sent). Finally, the SOCK_RAW type can be used on some platforms to mess around with low-level protocol details. This is usually only available to programs running with super-user privileges.
In Internet protocol terms, the combination (AF_INET, SOCK_STREAM) uses the Internet Transmission Control Protocol (TCP), while (AF_INET, SOCK_DGRAM) uses the Internet User Datagram Protocol (UDP). The (AF_INET, SOCK_RAW) combination, finally, allows you to play with the core Internet Protocol (IP) itself. If none of this makes sense to you, don’t worry. It works just fine anyway.
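As a rough illustration of the family and type combinations described above, the following sketch creates a TCP and a UDP socket and inspects them without connecting anywhere. It assumes a current Python 3 interpreter (not the Python version this draft was written for), and the helper name describe is made up for the example. SOCK_RAW is left out because it usually needs super-user privileges.

```python
import socket

# Map each (family, type) pair from the text to the protocol it selects.
combinations = {
    "TCP": (socket.AF_INET, socket.SOCK_STREAM),
    "UDP": (socket.AF_INET, socket.SOCK_DGRAM),
}

def describe(name):
    """Create a socket of the named kind and report its family and type."""
    family, kind = combinations[name]
    with socket.socket(family, kind) as s:
        # No connection is made; we only inspect the freshly created object.
        return s.family, s.type

for name in combinations:
    print(name, describe(name))
```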
[FIXME: add sidebar on Internet protocols? the protocol hierarchy: IP/UDP/TCP/FTP/SMTP/POP3/NNTP/HTTP, RFC/STD standardization, request for comments, RFC proliferation, etc.]
Calling Other Sockets
Once you have created a socket, it works pretty much like a telephone. You can either use it to dial a number in order to call another machine or process (provided it has its own socket, of course), or just hang around, waiting for incoming calls. However, when using sockets, it is actually somewhat simpler to call someone else than to set things up so that others can call you. So let’s begin by “calling” a distant web-server:
    >>> s.connect(("www.python.org", 80))
    >>> s.send("GET / HTTP/1.0\r\n\r\n")
    18
    >>> s.recv(20)
    'HTTP/1.0 200 OK\015\012Dat'
    >>> s.close()

Dialing the number
We use the connect method to “call” a server. In this example, we used an Internet socket, for which the “phone number” consists of two parts: the server name and a port number. The server name can be given either as a host name (like “www.python.org” in this case) or as a numerical host address, also given as a string (for example, “194.109.137.226”).
[FIXME: mention DNS servers?]
The port number specifies which port on this server we wish to connect to. Like a company switchboard, a single Internet server can provide many different services on different ports simultaneously. In this case, we connect to an HTTP server, which is usually listening on port 80.
[FIXME: what about Unix sockets? They use special file names instead (socket=). Example?]
The call to the connect method returns as soon as the server accepts the connection. If this doesn’t happen, either because the server or port doesn’t exist, or the server couldn’t be reached, or any other problem arises, Python raises a socket.error exception (this exception object is provided by the socket module).
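The failure case can be tried without any network access. The sketch below assumes Python 3, where socket.error is an alias of the built-in OSError. The helper try_connect and the trick of finding a closed port by binding to port 0 and releasing it again are illustrative choices, not something the original text prescribes, and there is a small chance that another program grabs the port in between.

```python
import socket

def try_connect(host, port, timeout=2.0):
    """Return None on success, or the OSError raised by connect."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)  # do not wait forever for an unreachable host
    try:
        s.connect((host, port))
    except OSError as exc:  # socket.error is an alias of OSError in Python 3
        return exc
    finally:
        s.close()
    return None

# Find a local port that is (very probably) closed: bind to port 0 so the
# system picks a free one, note its number, then release it again.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()

print(try_connect("127.0.0.1", closed_port))
```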
<FIXME: should mention that if we use a DGRAM socket, the connect method will only locate the server, not actually connect to it>
Sending and receiving data
Since the call succeeded, we can use the send method to send some data to the server. In this example, we use an HTTP command called GET to tell the server to return its “root” document. Having sent the command (terminated by two CRLF pairs), we read a few bytes from the server using the recv method. The server responds with “HTTP/1.0” to indicate that it has understood our command, followed by “200 OK” to indicate that it was accepted. Directly following this, the server will send some additional information, followed by the document itself (in this case, this is typically an HTML document).
The send method works like the write method of an ordinary file object. The main difference is that it does not provide buffering, so there’s no need to flush the socket’s write buffer before reading from it. It returns the number of bytes sent (18 in the session above). The recv method is similar to read in that you can specify how much data to read in a single call. It also returns an empty string if the connection is closed. However, if the connection is not closed, recv returns whatever data is available (never more than you specified, of course), and it only blocks if the receive buffers are completely empty. That is, if you request 10,000 bytes and only two bytes are available, you’ll get two bytes. But if there’s nothing at all, the method will wait until at least something arrives.
Both methods can also take an optional flag argument, which allows you to “peek” at data instead of reading it from the socket, and to deal with so-called “out of band” data. We’ll describe these flags later.
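The recv behavior described here can be observed locally. This sketch assumes Python 3, where sockets carry bytes objects rather than strings, and uses socket.socketpair to get two already-connected sockets in one process instead of a real server.

```python
import socket

# socketpair() returns two already-connected sockets in the same process.
a, b = socket.socketpair()

a.sendall(b"hi")            # only two bytes are in flight
first = b.recv(10000)       # asks for up to 10,000 bytes, gets what is there

a.close()                   # the sender hangs up
last = b.recv(10000)        # an empty bytes object signals the closed connection
b.close()

print(first, last)
```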
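As a small preview of the “peek” flag, the sketch below (Python 3, again using a local socketpair as a stand-in for a real connection) reads the same bytes twice: once with socket.MSG_PEEK, which leaves them queued, and once normally.

```python
import socket

a, b = socket.socketpair()
a.sendall(b"HTTP/1.0 200 OK\r\n")

peeked = b.recv(8, socket.MSG_PEEK)  # look at the first bytes, leave them queued
actual = b.recv(8)                   # a normal read returns the same bytes again
rest = b.recv(100)                   # the remainder is still waiting

a.close()
b.close()

print(peeked, actual, rest)
```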
Hanging up
In this example, we don’t bother to read all data from the server. Instead, we simply close the connection to tell the server that we’re done. Unlike a telephone, sockets are disposable, and cannot be reused once they’ve been connected to something. To make another call, we have to create a new socket. On the other hand, creating a socket is not very expensive. And socket servers, unlike many answering machines, stop sending data as soon as we hang up, so there’s no risk that we’ll get rubbish the next time we connect.
Socket Protocols
Most standard Internet protocols are basically very simple, “chat-style” protocols. One side (usually the connecting party) sends a command, and the other side responds with status information or a set of data. A very simple example is HTTP, in which a typical session looks something like:

    Client: connects
    Client: GET / HTTP/1.0
    Client: sends empty line
    Server: HTTP/1.0 200 OK
    Server: sends additional response headers
    Server: sends empty line
    Server: sends document
    Server: disconnects

(The above example is obviously not very chatty, since the server hangs up as soon as it has responded.)
The following piece of code provides a minimal HTTP protocol implementation. Note that we use the “file copy” idiom to copy all data from the socket to sys.stdout.
Example: read a document via HTTP (File: httpget1.py)

    import socket, sys

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("www.python.org", 80))
    s.send("GET / HTTP/1.0\r\n\r\n")

    # "file copy" idiom: read until recv returns an empty string
    while 1:
        buf = s.recv(1000)
        if not buf:
            break
        sys.stdout.write(buf)

    s.close()

Running this script produces something like this:

    HTTP/1.0 200 OK
    Server: WN/1.15.1
    Date: Sun, 30 Mar 1997 10:28:07 GMT
    Last-modified: Thu, 13 Feb 1997 22:06:06 GMT
    Content-type: text/html

    <html>
    <head>
    <title>Python Home Page</title>
    ...rest of HTML document deleted...
    </html>

The response begins with a status line (“HTTP/1.0 200 OK”), followed by some additional information. This includes the kind of server used by the site (Server), when the document was last modified (Last-modified), and most importantly, what kind of document this is (Content-type). This information is terminated by an empty line, directly followed by the document data. In this case, the document is an HTML document (text/html), starting with an <html> document tag.
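The structure of the response (status line, header lines, empty line, document) can be taken apart with a few string operations. The following is only a sketch in Python 3; the function name split_response and the shortened sample response are made up for illustration, and real responses may need more careful handling (repeated headers, bare LF line endings, and so on).

```python
def split_response(raw):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # The header block ends at the first empty line, i.e. CRLF CRLF.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

# A shortened, hand-written sample in the shape shown above.
sample = (
    b"HTTP/1.0 200 OK\r\n"
    b"Server: WN/1.15.1\r\n"
    b"Content-type: text/html\r\n"
    b"\r\n"
    b"<html>\n</html>"
)
status_line, headers, body = split_response(sample)
print(status_line, headers["content-type"], len(body))
```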
Creating a Client Support Class
Our first attempt boldly ignored things like error handling, and made no distinction between the status code, the additional information provided by the server, and the document itself. Addressing this is of course pretty straightforward, but to make it easier to concentrate on the protocol itself, let’s start by creating a support class that provides some higher-level primitives. The following class hides the socket object, providing easy-to-use writeline, read, and readline methods.
Example: a support class for simple Internet protocols (File: SimpleClient.py)

    import socket, string

    CRLF = "\r\n"

    class SimpleClient:
        "Client support class for simple Internet protocols."

        def __init__(self, host, port):
            "Connect to an Internet server."
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            self.sock.connect((host, port))
            self.file = self.sock.makefile("rb") # buffered

        def writeline(self, line):
            "Send a line to the server."
            self.sock.send(line + CRLF) # unbuffered write

        def read(self, maxbytes = None):
            "Read data from server."
            if maxbytes is None:
                return self.file.read()
            else:
                return self.file.read(maxbytes)

        def readline(self):
            "Read a line from the server. Strip trailing CR and/or LF."
            s = self.file.readline()
            if not s:
                raise EOFError
            if s[-2:] == CRLF:
                s = s[:-2]
            elif s[-1:] in CRLF:
                s = s[:-1]
            return s

The writeline and readline methods handle newlines by themselves, using CRLF when writing lines, and CR and/or LF when reading lines. Note that in the __init__ method, this class calls the makefile method, which creates a buffered file object connected to the socket. This allows us to use standard read and readline semantics when reading data from the socket.
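To see what makefile buys us, the following sketch (Python 3, with a local socketpair standing in for a server) reads a small hand-written response line by line through the buffered file object. It is not a port of the class above, only a demonstration of the read and readline semantics it relies on.

```python
import socket

a, b = socket.socketpair()
a.sendall(b"HTTP/1.0 200 OK\r\nContent-type: text/html\r\n\r\nbody")
a.close()  # end of data, so read() below can finish

reader = b.makefile("rb")                   # buffered, read-only file object
status = reader.readline().rstrip(b"\r\n")  # strip trailing CR and/or LF
header = reader.readline().rstrip(b"\r\n")
blank = reader.readline().rstrip(b"\r\n")   # the empty line ending the headers
rest = reader.read()                        # everything up to end of stream

reader.close()
b.close()

print(status, header, blank, rest)
```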
<FIXME: move text from pop3 document on why using send is usually better than makefile(“w”)>.
Using this class, we can improve our HTTP scripts somewhat. The following version uses the SimpleClient class instead of directly dealing with the socket module. It also skips the header part of the server response, copying only the document to sys.stdout. Finally, the client code itself is moved into a function, allowing us to easily fetch other documents.
Example: read a document via HTTP, revisited (File: httpget2.py)

    import sys
    from SimpleClient import SimpleClient

    def get(host, port = 80, document = "/"):
        http = SimpleClient(host, port)
        http.writeline("GET %s HTTP/1.0" % document)
        http.writeline("")
        # skip the header lines; an empty line ends the header
        while 1:
            line = http.readline()
            if not line:
                break
        return http.read()

    data = get("www.python.org")
    sys.stdout.write(data)

Creating an HTTP Client Class
In the httpget2.py script, we used a SimpleClient instance mainly to avoid dealing directly with sockets in our program. The function is still rather inflexible: it doesn’t check the status code to see if we really got a correct response, and it only supports the GET command.
The following class extends the SimpleClient class with an HTTP-oriented interface through its get and getbody methods. The httpcmd method can be used to implement other HTTP commands, as illustrated by the head method (we’ll get back to this command later).
Example: an HTTP client support class (File: HTTPClient.py)

    import string, sys
    import SimpleClient

    class HTTPClient(SimpleClient.SimpleClient):
        """An HTTP client protocol support class"""

        def __init__(self, host):
            # extract port from hostname, if given
            try:
                i = string.index(host, ":")
                host, port = host[:i], string.atoi(host[i+1:])
            except ValueError: # catches both index and atoi errors
                port = 80
            SimpleClient.SimpleClient.__init__(self, host, port)

        def httpcmd(self, command, document):
            "Send command, and return status code and list of headers."
            self.writeline("%s %s HTTP/1.0" % (command, document or "/"))
            self.writeline("")
            self.sock.shutdown(1) # close client end
            status = string.split(self.readline())
            if status[0] != "HTTP/1.0":
                raise IOError, "unknown status response"
            try:
                status = string.atoi(status[1])
            except ValueError:
                raise IOError, "non-numeric status code"
            headers = []
            while 1:
                line = self.readline()
                if not line:
                    break
                headers.append(line)
            return status, headers

        def get(self, document):
            "Get a document from the server. Return status code and headers."
            return self.httpcmd("GET", document)

        def head(self, document):
            "Get headers from the server. Return status code and headers."
            return self.httpcmd("HEAD", document)

        def getbody(self):
            "Get document body"
            return self.read()

Things to notice
Calling the parent constructor. Note that this class also needs to call its parent constructor (since the parent creates the socket). In this case, the HTTPClient constructor simply converts the single host string to a host and port pair. The port is 80 by default, but the constructor allows you to use the “host:port” syntax to explicitly specify the port number. Note that both string.index and string.atoi raise ValueError exceptions, so a single try/except clause handles both a missing port and a badly formed port number.
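The same “one try/except covers both cases” idea can be sketched in Python 3, where the string module functions used above have become string methods and the int built-in. The function name split_host_port is hypothetical. Unlike the constructor above, this version drops a badly formed port from the host name instead of leaving it attached, and it makes no attempt to handle IPv6 literals.

```python
def split_host_port(host, default_port=80):
    """Split "host:port" into (host, port); fall back to the default port."""
    name, _, port_text = host.partition(":")
    try:
        # int("") also raises ValueError, so a missing port and a badly
        # formed port are handled by the same except clause.
        return name, int(port_text)
    except ValueError:
        return name, default_port

print(split_host_port("www.python.org"))
print(split_host_port("www.python.org:8080"))
```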
Shutting down the connection. The httpcmd method uses the shutdown socket method after sending the command. This method is similar to close, but it can be used to close communication in only one direction. In this case, we used the argument 1 which means that further send operations should be disabled. You can also use 0 to disable further receive operations, or 2 to disable both send and receive operations in one call.
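A half-closed connection is easy to observe with two local sockets. This sketch assumes Python 3, which provides the named constants socket.SHUT_RD, socket.SHUT_WR, and socket.SHUT_RDWR for the values 0, 1, and 2; the variable names client and server are only labels for the two ends of a socketpair.

```python
import socket

client, server = socket.socketpair()

client.sendall(b"request")
client.shutdown(socket.SHUT_WR)   # same as shutdown(1): no more sends from us

request = server.recv(100)        # the data sent before the shutdown
eof = server.recv(100)            # b"": the server sees end of input

server.sendall(b"reply")          # the other direction is still open
reply = client.recv(100)

client.close()
server.close()

print(request, eof, reply)
```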
Providing a default document. The httpcmd method uses a convenient trick to avoid sending badly formed commands to the server. If the document string is empty (false), the or operator returns its second operand instead. Python’s and operator works similarly, but it returns the second value only if the first one is true. To make this easier to remember, you can think of or as “or else” and and as “and then” (which, incidentally, is what the corresponding operators are called in the Eiffel language).
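The trick is easy to try on its own. In this small sketch (written for Python 3), request_line is a made-up helper that builds the same request line as httpcmd does.

```python
def request_line(command, document):
    """Build an HTTP/1.0 request line, defaulting to the root document."""
    # 'document or "/"' yields "/" when document is empty (false).
    return "%s %s HTTP/1.0" % (command, document or "/")

print(request_line("GET", ""))        # empty document: falls back to "/"
print(request_line("GET", "/doc/"))   # non-empty document: used as given

empty_first = "" and "unused"         # 'and' returns the first value if false
true_first = "x" and "second"         # and the second value if the first is true
print(repr(empty_first), repr(true_first))
```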
… And Improving It
Given the above class, we can quickly pull together an even better version of the httpget script. This one lets you give a single URL instead of giving the host name, port, and document name as three different parts. Here, we use a standard Python module called urlparse to extract the host name and document name from the URL string, but leave it to the HTTPClient class to extract the optional port number from the host name. We also make sure that the status code was 200 (indicating that the document follows); otherwise, we raise an IOError exception.
Example: read a document via HTTP, third attempt (File: httpget3.py)

    import sys, urlparse
    from HTTPClient import HTTPClient

    def get(url):
        # parse the URL (ignore most parts of it)
        spam, host, document, spam, spam, spam = urlparse.urlparse(url)
        # get the document
        http = HTTPClient(host)
        status, headers = http.get(document)
        if status != 200:
            raise IOError, "couldn't read document"
        return http.getbody()

    print get(sys.argv[1])

So, given a much improved support library (consisting of the SimpleClient and HTTPClient classes, as well as the standard urlparse module), we’ve managed to write a compact utility which can read an arbitrary document from a web-server. Just add some “Usage” stuff, and you have a useful little script. You can also remove the final get call and import this module in your own program.
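The URL-splitting step can be tried in isolation. In Python 3 the urlparse function lives in the urllib.parse module, so this sketch uses that; the URL itself is just an example value. Note that the host part still carries the port, which is why the HTTPClient constructor has to split it off.

```python
from urllib.parse import urlparse

parts = urlparse("http://www.python.org:8080/doc/index.html")

host = parts.netloc      # "host:port" string, port still attached
document = parts.path    # the document name

print(host, document)
print(parts.hostname, parts.port)   # the same information, already split
```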
Using the Standard HTTP Client Class (httplib)
But before you do that, it might be an appropriate time to look at what’s in Python’s standard library.
Example: read a document via HTTP using standard httplib (File: httpget4.py)

    import sys, urlparse
    from httplib import HTTP

    def get(url):
        spam, host, document, spam, spam, spam = urlparse.urlparse(url)
        http = HTTP(host)
        http.putrequest("GET", document)
        http.endheaders()
        status, message, headers = http.getreply()
        if status != 200:
            raise IOError, "couldn't read document"
        return http.getfile().read()

    print get(sys.argv[1])

Comparing the HTTPClient and httplib versions
If you compare the httpget3 and httpget4 scripts, you’ll see that they are very similar.
Exercise: take a look at the httplib sources (they are in the Lib directory provided with the standard distribution), and compare them to the SimpleClient/HTTPClient classes. What are the differences? What problems with HTTPClient does httplib address? How much work would it be to modify HTTPClient to address these problems? Would the httplib module benefit from using a support class like SimpleClient?
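For readers on Python 3, where httplib has been renamed http.client, a roughly equivalent get function might look like the sketch below. To keep it runnable without network access it fetches from a tiny throwaway server on the local machine; the HelloHandler class and its response body are invented for the demonstration and are not part of the original scripts.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse

def get(url):
    """Fetch a document with http.client; raise IOError unless status is 200."""
    parts = urlparse(url)
    conn = HTTPConnection(parts.netloc, timeout=5)  # accepts "host:port"
    try:
        conn.request("GET", parts.path or "/")
        response = conn.getresponse()
        if response.status != 200:
            raise IOError("couldn't read document")
        return response.read()
    finally:
        conn.close()

class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server so the sketch runs without a network."""

    def do_GET(self):
        if self.path == "/":
            body = b"<html>hello</html>"
            self.send_response(200)
            self.send_header("Content-type", "text/html")
            self.send_header("Content-length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

page = get("http://127.0.0.1:%d/" % server.server_address[1])
print(page)

server.shutdown()
server.server_close()
```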
Using the Standard URL Client Class (urllib)
But you don’t have to stop here. The urllib module provides a unified interface to HTTP, FTP, Gopher, and local file access. Basically, you give this module a URL, and get back a file object which you can use to read the given document. So here’s our script again, shorter than ever.
Example: read a document via HTTP using standard urllib (File: httpget5.py)

    import sys, urllib

    print urllib.urlopen(sys.argv[1]).read()

For the record, the file object returned by urlopen also contains an extra method, info, which is used to return header information [FIXME: This is a Message object! where are Message objects described?]. See the Library Reference manual for more information on using the urllib module.
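In Python 3 the urlopen function has moved to urllib.request. The sketch below uses the local file access mentioned above, so it runs without a network: it writes a small temporary file (the file name hello.txt is arbitrary), opens it through a file: URL, and looks at the header information returned by info.

```python
import pathlib
import tempfile
from urllib.request import urlopen

# Write a small local document so the sketch needs no network access.
with tempfile.TemporaryDirectory() as folder:
    path = pathlib.Path(folder) / "hello.txt"
    path.write_bytes(b"hello, world\n")

    with urlopen(path.as_uri()) as response:   # a file: URL
        data = response.read()
        info = response.info()                 # header information

content_length = info["Content-length"]
print(data, content_length)
```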