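The simplest way to access a substring is slicing:
afield = theline[3:8]
But a slice extracts only one substring at a time.
If you would rather think in terms of field lengths, struct.unpack may be a better fit: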
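import struct
# Get a 5-byte string, skip 3 bytes, get two 8-byte strings, then the rest
baseformat = "5s 3x 8s 8s"
# The number of bytes by which theline exceeds the length covered by baseformat
numremain = len(theline) - struct.calcsize(baseformat)
# Complete the format with the appropriate 's' or 'x' field, then unpack
format = "%s %ds" % (baseformat, numremain)
l, s1, s2, t = struct.unpack(format, theline)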
# Test
>>> theline = "numremain = len(theline) - struct.calcsize(baseformat)"
>>> numremain = len(theline) - struct.calcsize(baseformat)
>>> format = "%s %ds" % (baseformat, numremain)
>>> format
'5s 3x 8s 8s 30s'
>>> l, s1, s2, t = struct.unpack(format, theline)
>>> l
'numre'
>>> s1
'n = len('
>>> s2
'theline)'
>>> t
' - struct.calcsize(baseformat)'
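The session above passes a str to struct.unpack, which works in Python 2. As a minimal sketch, assuming Python 3 (where struct.unpack expects bytes), the same test could look like this; the name fmt is only an illustrative replacement for format:

```python
import struct

# Same sample line as in the test above, but as bytes:
# in Python 3, struct.unpack does not accept a str buffer.
theline = b"numremain = len(theline) - struct.calcsize(baseformat)"

baseformat = "5s 3x 8s 8s"
# Bytes left over after the fields described by baseformat
numremain = len(theline) - struct.calcsize(baseformat)
fmt = "%s %ds" % (baseformat, numremain)

l, s1, s2, t = struct.unpack(fmt, theline)
print(fmt)
print(l, s1, s2, t)
```

The unpacked fields are bytes objects; call .decode() on them if text is needed.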
To get pieces of a fixed length n, use slicing inside a list comprehension (LC):
pieces = [theline[k:k+n] for k in xrange(0,len(theline),n)]
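A small worked example of the fixed-length slicing, using a made-up sample string and Python 3's range in place of xrange (both are assumptions for illustration):

```python
# Hypothetical sample data: the 26 lowercase letters
theline = "abcdefghijklmnopqrstuvwxyz"
n = 8

# One slice per starting offset 0, 8, 16, 24
pieces = [theline[k:k + n] for k in range(0, len(theline), n)]
print(pieces)  # ['abcdefgh', 'ijklmnop', 'qrstuvwx', 'yz']
```

Note that the last piece is shorter than n whenever len(theline) is not a multiple of n.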
To cut the data into columns at specified positions, slicing inside an LC is also easy:
cuts = [8, 14, 20, 26, 30]
pieces = [theline[i:j] for i, j in zip([0] + cuts, cuts + [None])]
The zip call in the LC returns a list of pairs, each of the form (cuts[k], cuts[k+1]),
except the first and the last, which are (0, cuts[0]) and (cuts[len(cuts)-1], None).
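To make the pairing concrete, here is a short sketch that prints the (start, end) pairs for the cuts used above. The 40-character sample line is made up, and list() is needed only because zip returns an iterator in Python 3:

```python
cuts = [8, 14, 20, 26, 30]

# Pair each start index with the following end index
pairs = list(zip([0] + cuts, cuts + [None]))
print(pairs)
# [(0, 8), (8, 14), (14, 20), (20, 26), (26, 30), (30, None)]

# Hypothetical 40-character line to slice
theline = "0123456789" * 4
pieces = [theline[i:j] for i, j in pairs]
print([len(p) for p in pieces])  # [8, 6, 6, 6, 4, 10]
```

The final None makes the last slice run to the end of the line, since theline[30:None] is the same as theline[30:].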
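Wrap the snippets above in functions:
def fields(baseformat, theline, lastfield=False):
    # The number of bytes by which theline exceeds the length covered by
    # baseformat (struct.calcsize computes that length)
    numremain = len(theline) - struct.calcsize(baseformat)
    # Complete the format with the appropriate 's' or 'x' field, then unpack
    format = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)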
Below is a version that uses memoizing:
def fields(baseformat, theline, lastfield=False, _cache={}):
    # Build the key and try to get the cached format string
    key = baseformat, len(theline), lastfield
    format = _cache.get(key)
    if format is None:
        # No cached format string yet: create it and cache it
        numremain = len(theline) - struct.calcsize(baseformat)
        _cache[key] = format = "%s %d%s" % (
            baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)
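According to the Cookbook, this is 30% to 40% faster than the unoptimized version. If this code is not a bottleneck, though, the optimization is not worth the trouble.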
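The following sketch shows how the cache fills up, assuming Python 3 (bytes input, a conditional expression instead of the and/or trick, and fmt instead of format). Reading the cache through fields.__defaults__ is only for inspection here and is not part of the recipe:

```python
import struct

def fields(baseformat, theline, lastfield=False, _cache={}):
    # One cache entry per (format, line length, lastfield) combination
    key = baseformat, len(theline), lastfield
    fmt = _cache.get(key)
    if fmt is None:
        numremain = len(theline) - struct.calcsize(baseformat)
        _cache[key] = fmt = "%s %d%s" % (
            baseformat, numremain, "s" if lastfield else "x")
    return struct.unpack(fmt, theline)

line = b"numremain = len(theline) - struct.calcsize(baseformat)"
first = fields("5s 3x 8s 8s", line)
second = fields("5s 3x 8s 8s", line, lastfield=True)
third = fields("5s 3x 8s 8s", line)  # reuses the format cached by the first call

# The mutable default argument is the cache; peek at it for illustration
cache = fields.__defaults__[1]
print(first)
print(len(second), len(cache))
```

The speedup only materializes when many lines share the same length, because each distinct length adds a new cache entry.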
Functions that use LC slicing:
def split_by(theline, n, lastfield=False):
    # Cut out all the needed pieces
    pieces = [theline[k:k+n] for k in xrange(0, len(theline), n)]
    # Drop the last piece if it is too short and not wanted
    if not lastfield and len(pieces[-1]) < n:
        pieces.pop()
    return pieces
def split_at(theline, cuts, lastfield=False):
    # Cut out all the needed pieces
    pieces = [theline[i:j] for i, j in zip([0] + cuts, cuts + [None])]
    # Drop the last piece if it is not wanted
    if not lastfield:
        pieces.pop()
    return pieces
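A usage sketch for the two LC-based functions, rewritten for Python 3 (range instead of xrange) and run against a made-up sample line:

```python
def split_by(theline, n, lastfield=False):
    # Cut out all the needed pieces
    pieces = [theline[k:k + n] for k in range(0, len(theline), n)]
    # Drop the last piece if it is too short and not wanted
    if not lastfield and len(pieces[-1]) < n:
        pieces.pop()
    return pieces

def split_at(theline, cuts, lastfield=False):
    # Cut out all the needed pieces
    pieces = [theline[i:j] for i, j in zip([0] + cuts, cuts + [None])]
    # Drop the last piece if it is not wanted
    if not lastfield:
        pieces.pop()
    return pieces

line = "abcdefghijklmnopqrstuvwxyz"  # hypothetical sample data
print(split_by(line, 8))             # ['abcdefgh', 'ijklmnop', 'qrstuvwx']
print(split_by(line, 8, True))       # ['abcdefgh', 'ijklmnop', 'qrstuvwx', 'yz']
print(split_at(line, [3, 10]))       # ['abc', 'defghij']
print(split_at(line, [3, 10], True)) # ['abc', 'defghij', 'klmnopqrstuvwxyz']
```

As written, split_by raises IndexError on an empty line because pieces[-1] does not exist; guard for that case if empty input is possible.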
Versions that use generators:
def split_at(the_line, cuts, lastfield=False):
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        yield the_line[last:]
def split_by(the_line, n, lastfield=False):
    return split_at(the_line, xrange(n, len(the_line), n), lastfield)
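The generator versions produce pieces lazily, one per request. A minimal sketch of that behavior, assuming Python 3 (range instead of xrange) and a made-up sample line:

```python
def split_at(the_line, cuts, lastfield=False):
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        yield the_line[last:]

def split_by(the_line, n, lastfield=False):
    return split_at(the_line, range(n, len(the_line), n), lastfield)

line = "abcdefghijklmnopqrstuvwxyz"  # hypothetical sample data

gen = split_at(line, [3, 10], lastfield=True)
first_piece = next(gen)   # only the first slice has been computed so far
rest = list(gen)          # consume the remaining pieces

by_eight = list(split_by(line, 8, lastfield=True))
print(first_piece, rest)
print(by_eight)
```

Wrap the call in list() when an actual list is required; otherwise iterate over the generator directly with a for loop.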
Usage of zip()
zip([iterable, ...])
This function returns a list of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables. The returned list is truncated in length to the length of the shortest argument sequence. When there are multiple arguments which are all of the same length, zip() is similar to map() with an initial argument of None. With a single sequence argument, it returns a list of 1-tuples. With no arguments, it returns an empty list.
The left-to-right evaluation order of the iterables is guaranteed. This makes possible an idiom for clustering a data series into n-length groups using zip(*[iter(s)]*n).
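A short sketch of the clustering idiom mentioned above, with made-up data. The list() call assumes Python 3, where zip returns an iterator rather than the list described in the quoted Python 2 documentation:

```python
s = [1, 2, 3, 4, 5, 6, 7]  # hypothetical data series
n = 3

# [iter(s)] * n is a list holding the SAME iterator n times, so each
# tuple zip builds pulls n consecutive items from s.
groups = list(zip(*[iter(s)] * n))
print(groups)  # [(1, 2, 3), (4, 5, 6)]
```

Because zip stops at the shortest input, leftover items that do not fill a whole group (the 7 here) are silently dropped.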
zip() in conjunction with the * operator can be used to unzip a list:
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> zipped = zip(x, y)
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> x2, y2 = zip(*zipped)
>>> x == list(x2) and y == list(y2)
True
>>> x2
(1, 2, 3)
>>> y2
(4, 5, 6)
For how generators work, see this blog post: http://www.cnblogs.com/cacique/archive/2012/02/24/2367183.html