Python Text Processing

gxh

Python is often used for text processing. This post collects the methods that come up most often.

String Formatting

Basic formatting

>>> a = 'agan'
>>> b = 'hello'
>>> c = '%s says %s' % (a, b)
>>> c
'agan says hello'

Formatting with a dict

>>> a = {'msg': 'hello', 'name': 'agan'}
>>> b = '%(name)s says %(msg)s' % a
>>> b
'agan says hello'

format

>>> b = '{} says {}'.format('agan', 'hello')
>>> b
'agan says hello'
>>> b = '{name} says {msg}'.format(name='agan', msg='hello')
>>> b
'agan says hello'
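
The dict-based and format-based styles can be combined. The following is a minimal sketch, assuming the same sample values as above; the variable names `fields`, `greeting`, and `swapped` are only illustrative.

```python
# Hypothetical data, reusing the values from the examples above.
fields = {'name': 'agan', 'msg': 'hello'}

# Unpack the dict into keyword arguments for str.format.
greeting = '{name} says {msg}'.format(**fields)

# Positional fields can be reordered by index.
swapped = '{1} says {0}'.format('hello', 'agan')

print(greeting)
print(swapped)
```

Both lines print the same sentence; only the way the values are supplied differs.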

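ord and chr

ord returns the integer value of a single ASCII character, and chr does the reverse: it returns the character for an integer in the range 0 to 255 (this is the Python 2 limit; in Python 3, chr accepts any Unicode code point).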

>>> a = '1'
>>> ord(a)
49
>>> chr(ord(a))
'1'
# Non-printable characters are shown as escape sequences
>>> a = 1
>>> chr(a)
'\x01'
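
As a rough illustration of the round trip, the sketch below assumes Python 3, where ord and chr also handle characters outside the ASCII range. All variable names are illustrative.

```python
# Round trip between a character and its integer value.
digit = '1'
code = ord(digit)            # 49
back = chr(code)             # '1'

# Non-printable characters have an escape-sequence repr.
control_repr = repr(chr(1))  # "'\\x01'"

# Python 3 only: ord and chr also cover non-ASCII code points.
cjk_code = ord('测')          # 27979, i.e. 0x6d4b
cjk_back = chr(cjk_code)
```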

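hex and oct

hex returns the hexadecimal string representation of an integer, and oct returns the octal one. The session below is Python 2; in Python 3, oct(8) returns '0o10'.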

>>> hex(16)
'0x10'
>>> oct(8)
'010'

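Combining ord and hex prints the hexadecimal representation of a string. The session below is Python 2 on a UTF-8 terminal, where a str is a sequence of bytes.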

>>> a = '测试'
>>> a
'\xe6\xb5\x8b\xe8\xaf\x95'
>>> for x in a:
...     print hex(ord(x))
...
0xe6
0xb5
0x8b
0xe8
0xaf
0x95
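
In Python 3 a str holds code points rather than bytes, so the same dump needs an explicit encode step. The sketch below assumes UTF-8 as the target encoding; the variable names are illustrative.

```python
text = '测试'

# Encode first: iterating over bytes yields integers in Python 3.
data = text.encode('utf-8')
hex_bytes = [hex(b) for b in data]

# Without encoding, ord() gives the Unicode code points instead.
code_points = [hex(ord(ch)) for ch in text]

print(hex_bytes)
print(code_points)
```

The first list matches the six bytes printed above, while the second holds one value per character.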

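Encoding Issues

The vim introduction mentioned encoding issues, and Python runs into the same problems when processing text. vim actually has it easier, since a single file has one fixed encoding and nothing is mixed. A Python program, however, is not limited to being a text editor. It can be all sorts of things, such as a crawler that has to handle web pages in many different encodings at once.

The solution is simple: "decode early, unicode everywhere, encode late."

In practice it looks like this: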

# Decode input to unicode as soon as it arrives
input_1 = input_1.decode('gbk')
input_2 = input_2.decode('utf-8')
# Business logic handles unicode only
result = process(input_1, input_2)
# Encode to a specific encoding only when the result has to be saved
output = result.encode('utf-8')
file.write(output)
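
The same pattern can be sketched as a self-contained Python 3 script. Here `process` is a hypothetical stand-in for the business logic, and the two byte strings simulate inputs arriving in different encodings.

```python
def process(a, b):
    # Hypothetical business logic: operates on str (Unicode) only.
    return a + ' / ' + b

# Simulated raw inputs in two different encodings.
raw_gbk = '测试'.encode('gbk')
raw_utf8 = 'test'.encode('utf-8')

# Decode early.
input_1 = raw_gbk.decode('gbk')
input_2 = raw_utf8.decode('utf-8')

# Unicode everywhere.
result = process(input_1, input_2)

# Encode late, just before the bytes leave the program.
output = result.encode('utf-8')
```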

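This brings us to the difference between unicode and str. First of all, in Python 3 the two are the same thing: str is always Unicode text, and encoded data lives in a separate bytes type. In Python 2, you can think of unicode as a string that has not been encoded and str as a string in a specific encoding.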

>>> a = u'abc'
>>> isinstance(a, str)
False
>>> isinstance(a, basestring)
True

If you use Python's C API, also note that PyStringObject represents str only, and PyString_Check does not pass for a unicode object.

logging

logging is Python's standard logging module. Its documentation is thorough, so I will only mention two points here.

logging can be used in two ways:

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s - %(levelname)s - %(message)s")
logging.debug("test message")

and

logger = logging.getLogger('log_name')
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
logger.addHandler(handler)
logger.debug("test message")

In fact the first style is just a wrapper around the second: logging ships with a module-level root logger, and basicConfig configures it.
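
One way to check this relationship is to look at the root logger directly. This is a minimal sketch in Python 3; the variable names are illustrative.

```python
import logging

# Calling getLogger() without a name returns the root logger.
root = logging.getLogger()
same = root is logging.root

# basicConfig attaches a handler to the root logger if it has none yet.
logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s - %(levelname)s - %(message)s")
has_handler = len(root.handlers) > 0

# Module-level helpers such as logging.debug log through the root logger.
logging.debug("test message")
```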

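Judging from the source code, if you use logging's built-in logger in stream mode, writing a unicode string may produce garbled output, because the terminal's encoding may not be UTF-8.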

class StreamHandler(Handler):
    def emit(self, record):
        """
        Emit a record.

        If a formatter is specified, it is used to format the record.
        The record is then written to the stream with a trailing newline.  If
        exception information is present, it is formatted using
        traceback.print_exception and appended to the stream.  If the stream
        has an 'encoding' attribute, it is used to determine how to do the
        output to the stream.
        """
        try:
            msg = self.format(record)
            stream = self.stream
            fs = "%s\n"
            if not _unicode: #if no unicode support...
                stream.write(fs % msg)
            else:
                try:
                    if (isinstance(msg, unicode) and
                        getattr(stream, 'encoding', None)):
                        ufs = u'%s\n'
                        try:
                            stream.write(ufs % msg)
                        except UnicodeEncodeError:
                            #Printing to terminals sometimes fails. For example,
                            #with an encoding of 'cp1251', the above write will
                            #work if written to a stream opened or wrapped by
                            #the codecs module, but fail when writing to a
                            #terminal even when the codepage is set to cp1251.
                            #An extra encoding step seems to be needed.
                            stream.write((ufs % msg).encode(stream.encoding))
                    else:
                        stream.write(fs % msg)
                except UnicodeError:
                    stream.write(fs % msg.encode("UTF-8"))
            self.flush()
        except (KeyboardInterrupt, SystemExit):
            raise
        except:
            self.handleError(record)
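
One way to avoid depending on the terminal's encoding is to hand StreamHandler a stream whose encoding is set explicitly. The sketch below assumes Python 3 and uses an in-memory buffer in place of a real file or terminal. The logger name `encoding_demo` is hypothetical.

```python
import io
import logging

# In-memory byte buffer standing in for a file or terminal.
raw = io.BytesIO()
# Text layer with an explicit encoding; newline='\n' disables newline translation.
stream = io.TextIOWrapper(raw, encoding='utf-8', newline='\n')

logger = logging.getLogger('encoding_demo')  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the record away from the root logger
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s - %(message)s'))
logger.addHandler(handler)

logger.debug('测试')
stream.flush()
written = raw.getvalue()
```

The bytes in `written` are UTF-8 regardless of the locale the program runs under.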

re

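re is the regular expression module. Its documentation is also very thorough, so only non-greedy mode is covered here.

re is greedy by default, for example: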

>>> import re
>>> p = re.compile(r'(.*) .*')
>>> m = p.match('abc def ghi')
>>> m.groups()
('abc def',)

Append ? to a quantifier to switch it to non-greedy mode:

>>> import re
>>> p = re.compile(r'(.*?) .*')
>>> m = p.match('abc def ghi')
>>> m.groups()
('abc',)
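
The difference matters most when a pattern can match more than once in the same string. The sketch below uses a made-up HTML fragment to compare the two modes with re.findall.

```python
import re

# Hypothetical input with several tags on one line.
html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' in the string.
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? stops at the first '>' after each '<'.
lazy = re.findall(r'<.*?>', html)

print(greedy)
print(lazy)
```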

Buffering Mode

When Python is embedded in a C program and the output is redirected to a file, print appears to stop working:

./a > log.txt 2>&1 &

But when the output goes to the terminal as usual, there is no problem:

./a

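The conclusion is that in this case stdout's line_buffering is set to False, so it is no longer line-buffered and uses fixed-size full buffering instead. The output shows up once the data written exceeds the buffer size. Adding sys.stdout.flush() after print also produces the output immediately.

See the create_stdio function in pylifecycle.c, where line_buffering is decided by the following logic: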
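
The effect of an explicit flush can be sketched with an in-memory stream that stands in for the redirected stdout. This assumes Python 3, where print accepts a flush argument; the names `raw` and `out` are illustrative.

```python
import io

# A text stream without line buffering, standing in for a redirected stdout.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding='utf-8', newline='\n',
                       line_buffering=False)

# Without a flush, the text may sit in the wrapper's internal buffer.
print('first', file=out)
out.flush()
after_flush = raw.getvalue()

# flush=True makes print flush the stream right after writing.
print('second', file=out, flush=True)
after_print_flush = raw.getvalue()
```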

    if (buffered_stdio && (isatty || fd == fileno(stderr)))
        line_buffering = Py_True;
    else
        line_buffering = Py_False;

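Here buffered_stdio is a PyConfig option that defaults to 1, and isatty indicates whether the file refers to a terminal.

So when the output is redirected to a file, isatty is necessarily false, and even with buffered_stdio set to 1, only stderr gets line buffering.

The fix is simple: point stdout at stderr when the program starts. If it does not fit, swap it out. You can even create a new _io.TextIOWrapper yourself.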

sys.stdout = sys.stderr
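
Building a line-buffered wrapper by hand could look like the sketch below. The helper names `line_buffered` and `enable_line_buffering_on_stdout` are hypothetical, and the demonstration uses an in-memory stream rather than the real stdout. The reconfigure call assumes Python 3.7 or later.

```python
import io
import sys

def line_buffered(binary_stream, encoding='utf-8'):
    """Wrap a binary stream so that writing a newline triggers a flush."""
    return io.TextIOWrapper(binary_stream, encoding=encoding,
                            newline='\n', line_buffering=True)

# Demonstration on an in-memory stream.
raw = io.BytesIO()
out = line_buffered(raw)
out.write('hello\n')
seen_immediately = raw.getvalue()  # available without an explicit flush

def enable_line_buffering_on_stdout():
    # Python 3.7+: switch the existing stdout to line buffering in place.
    if hasattr(sys.stdout, 'reconfigure'):
        sys.stdout.reconfigure(line_buffering=True)
```

Calling `enable_line_buffering_on_stdout()` at startup keeps stdout and stderr as separate streams, which the plain reassignment above does not.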