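8.1. How a Python program is executed
Whenever the Python interpreter runs a Python program file, its first step is to compile the Python source code in that file. The main product of compilation is a sequence of Python byte code, which is then handed to the Python virtual machine (VM). The VM executes the byte code instructions one by one, in order, and that is how the program runs.
From the Python compiler's point of view, the PyCodeObject object is the real result of compilation; a pyc file is merely the on-disk representation of that object. The two are simply different forms of the same compilation result.
While the program is running, the compilation result lives in memory as a PyCodeObject object; for imported modules it is also saved to a pyc file. The next time the same module is loaded, Python rebuilds the in-memory PyCodeObject object directly from the compilation result recorded in the pyc file, so the source file does not need to be compiled again.
With the overall flow clear, it is entirely possible to write a tool that parses pyc files generated by Python 3.7. Organized as JSON, the content of a pyc file looks like the figure below:
The only purpose of writing the tool is to understand the whole flow better. In practice, Python's dis module produces more detailed and clearer output, as shown in the figure below: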
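As a small illustration of the dis module (this is not the output from the original figure), here is a minimal sketch. The function add is a hypothetical example, and the exact instruction names depend on the Python version.

```python
import dis

def add(a, b):
    # Hypothetical function used only to have something to disassemble.
    return a + b

# dis.get_instructions yields one Instruction per byte code instruction.
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)

# dis.dis prints the same information as a human-readable listing.
dis.dis(add)
```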
8.2. PyCodeObject source code
// code.h
typedef struct {
    PyObject_HEAD
    int co_argcount;
    int co_kwonlyargcount;
    int co_nlocals;
    int co_stacksize;
    int co_flags;
    int co_firstlineno;
    PyObject *co_code;
    PyObject *co_consts;
    PyObject *co_names;
    PyObject *co_varnames;
    PyObject *co_freevars;
    PyObject *co_cellvars;
    Py_ssize_t *co_cell2arg;
    PyObject *co_filename;
    PyObject *co_name;
    PyObject *co_lnotab;
    void *co_zombieframe;
    PyObject *co_weakreflist;
    void *co_extra;
} PyCodeObject;
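- Code Block:
When the Python compiler compiles Python source code, it creates one PyCodeObject object for each Code Block in the code. Entering a new namespace, or in other words a new scope, means entering a new Code Block. For example, the code below contains three code blocks: one for the whole test.py file, one for class A, and one for def Fun.
# test.py
class A:
    pass
def Fun():
    pass
a = A()
Fun()
- Namespace:
A namespace is the context of a symbol, and the meaning of a symbol depends on its namespace. More concretely, the value bound to a variable name is not fixed in Python; it is determined through the namespace. Each Code Block corresponds to one namespace and to one PyCodeObject object.
- The code object in Python:
Python exposes an object that corresponds to the C-level PyCodeObject object: the code object. It is a thin wrapper around the C-level PyCodeObject object, and through it we can access the individual fields of the PyCodeObject object.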
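The relationship between code blocks and code objects can be checked from Python itself. The following is a minimal sketch, assuming the test.py source from above is passed as a string to the built-in compile(); the helper name nested_code_names is hypothetical and only used for illustration.

```python
import types

SOURCE = """
class A:
    pass

def Fun():
    pass

a = A()
Fun()
"""

def nested_code_names(code):
    """Return the names of code objects stored directly in code.co_consts."""
    return [c.co_name for c in code.co_consts if isinstance(c, types.CodeType)]

# Compile in "exec" mode, as is done for a module file.
module_code = compile(SOURCE, "test.py", "exec")
nested = nested_code_names(module_code)

print(module_code.co_name)      # name of the module-level code block
print(module_code.co_filename)  # the filename passed to compile()
print(nested)                   # code blocks nested in the module
```

The module-level code object holds the code objects of class A and def Fun among its constants, which matches the "one object nesting two others" picture described in this chapter.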
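8.3. Generating a pyc file
# pyc_generator.py
import imp
import sys
def generate_pyc(name):
    # Locate the module by name, then load it; loading compiles the source
    # and, as a side effect, writes the pyc file.
    fp, pathname, description = imp.find_module(name)
    try:
        imp.load_module(name, fp, pathname, description)
    finally:
        if fp:
            fp.close()
if __name__ == '__main__':
    generate_pyc(sys.argv[1])
Running the following command on the command line generates the pyc file:
$ ./python3.7 pyc_generator.py test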
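For comparison, the standard library can also produce a pyc file without going through imp. The sketch below is only an illustration, not the approach analyzed in this chapter: it writes a hypothetical demo_mod.py into a temporary directory and compiles it with py_compile.compile, which returns the path of the generated pyc file.

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "demo_mod.py")  # hypothetical module
    with open(source_path, "w") as f:
        f.write("x = 1\n")

    # Compile the source file; by default the pyc lands in __pycache__.
    pyc_path = py_compile.compile(source_path, doraise=True)
    pyc_exists = os.path.exists(pyc_path)

    # cache_from_source computes where the import system expects the pyc.
    expected_path = importlib.util.cache_from_source(source_path)

print(pyc_path)
```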
8.3.1. Call flow for generating the PyCodeObject object and the pyc file
Starting from imp.load_module in the pyc_generator file above, the function call order is as follows:
// imp.py
load_module
=>load_source
// _bootstrap.py[1]
=>_load
=>_load_unlocked
// _bootstrap_external.py
=> exec_module
=> get_code
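The get_code method calls source_to_code to generate the PyCodeObject object, calls _code_to_timestamp_pyc to convert the PyCodeObject object into binary data, and calls _cache_bytecode to write the binary data to a file.
Note that the real Python does not call the _load method in _bootstrap.py (marked [1] in the call order above). In Lib/importlib/__init__.py:
# __init__.py
try:
    import _frozen_importlib as _bootstrap
except ImportError:
    from . import _bootstrap
    _bootstrap._setup(sys, _imp)
else:
    # do sth
try:
    import _frozen_importlib_external as _bootstrap_external
except ImportError:
    from . import _bootstrap_external
    _bootstrap_external._setup(_bootstrap)
    _bootstrap._bootstrap_external = _bootstrap_external
else:
    # do sth
As can be seen, what actually gets called is the _load method in _frozen_importlib, not the one in _bootstrap. The content of this lib is defined in Python/importlib.h.
It is not entirely clear to me why it is done this way; a plausible reason is bootstrapping, since the import system itself has to be available before any .py file can be imported. For the analysis of the overall flow, this was replaced with _bootstrap to make the source easier to read.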
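One way to observe this from a running interpreter is sketched below. It assumes CPython, where the frozen module is registered in sys.modules under the name _frozen_importlib; on other implementations the check may simply report False.

```python
import importlib
import sys

# In CPython the import machinery is frozen into the interpreter binary.
frozen = sys.modules.get("_frozen_importlib")

# importlib/__init__.py binds the frozen module to the name _bootstrap.
uses_frozen = frozen is not None and importlib._bootstrap is frozen
print(uses_frozen)
```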
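The following sections analyze in detail how the PyCodeObject object is generated, how it is converted into binary data, and how the binary data is written to a file.
8.3.2. Source code for generating the PyCodeObject object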
// _bootstrap_external.py
source_to_code
// _bootstrap.py
=>_call_with_frames_removed
// bltinmodule.c
=> builtin_compile_impl
The C source of builtin_compile_impl is as follows:
// bltinmodule.c
static PyObject *
builtin_compile_impl(PyObject *module, PyObject *source, PyObject *filename, const char *mode, int flags, int dont_inherit, int optimize)
{
    PyObject *source_copy;
    const char *str;
    int compile_mode = -1;
    int is_ast;
    PyCompilerFlags cf;
    int start[] = {Py_file_input, Py_eval_input, Py_single_input};
    PyObject *result;
    cf.cf_flags = flags | PyCF_SOURCE_IS_UTF8;
    if (flags &
        ~(PyCF_MASK | PyCF_MASK_OBSOLETE | PyCF_DONT_IMPLY_DEDENT | PyCF_ONLY_AST))
    {
        PyErr_SetString(PyExc_ValueError,
                        "compile(): unrecognised flags");
        goto error;
    }
    /* XXX Warn if (supplied_flags & PyCF_MASK_OBSOLETE) != 0? */
    if (optimize < -1 || optimize > 2) {
        PyErr_SetString(PyExc_ValueError,
                        "compile(): invalid optimize value");
        goto error;
    }
    if (!dont_inherit) {
        PyEval_MergeCompilerFlags(&cf);
    }
    if (strcmp(mode, "exec") == 0)
        compile_mode = 0;
    else if (strcmp(mode, "eval") == 0)
        compile_mode = 1;
    else if (strcmp(mode, "single") == 0)
        compile_mode = 2;
    else {
        PyErr_SetString(PyExc_ValueError,
                        "compile() mode must be 'exec', 'eval' or 'single'");
        goto error;
    }
    is_ast = PyAST_Check(source);
    if (is_ast == -1)
        goto error;
    if (is_ast) {
        // do sth.
    }
    str = source_as_string(source, "compile", "string, bytes or AST", &cf, &source_copy);
    if (str == NULL)
        goto error;
    result = Py_CompileStringObject(str, filename, start[compile_mode], &cf, optimize);
    Py_XDECREF(source_copy);
    goto finally;
error:
    result = NULL;
finally:
    Py_DECREF(filename);
    return result;
}
Here:
- source_as_string is called to load the test.py source shown above into memory.
- Py_CompileStringObject is called to generate the PyCodeObject object:
// pythonrun.c
PyObject *
Py_CompileStringObject(const char *str, PyObject *filename, int start,
                       PyCompilerFlags *flags, int optimize)
{
    PyCodeObject *co;
    mod_ty mod;
    PyArena *arena = PyArena_New();
    if (arena == NULL)
        return NULL;
    mod = PyParser_ASTFromStringObject(str, filename, start, flags, arena);
    if (mod == NULL) {
        PyArena_Free(arena);
        return NULL;
    }
    if (flags && (flags->cf_flags & PyCF_ONLY_AST)) {
        PyObject *result = PyAST_mod2obj(mod);
        PyArena_Free(arena);
        return result;
    }
    co = PyAST_CompileObject(mod, filename, flags, optimize, arena);
    PyArena_Free(arena);
    return (PyObject *)co;
}
PyParser_ASTFromStringObject is called to build the syntax tree, and PyAST_CompileObject is called to generate the PyCodeObject object. Parsing and compilation are not analyzed in depth here.
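The same two stages are visible from Python through the ast module and the built-in compile(). This is a rough sketch of the correspondence, not a description of the C internals; the source string and the file name demo.py are made up for the example.

```python
import ast

source = "x = 1 + 2\n"

# Step 1: source text -> AST (roughly the PyParser_ASTFromStringObject step).
tree = ast.parse(source, filename="demo.py", mode="exec")

# Passing PyCF_ONLY_AST to compile() stops after the AST stage as well,
# which mirrors the PyCF_ONLY_AST branch in Py_CompileStringObject.
tree_via_flag = compile(source, "demo.py", "exec", flags=ast.PyCF_ONLY_AST)

# Step 2: AST -> code object (roughly the PyAST_CompileObject step).
code = compile(tree, filename="demo.py", mode="exec")

namespace = {}
exec(code, namespace)
print(type(tree).__name__, type(code).__name__, namespace["x"])
```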
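8.3.3. Converting the PyCodeObject object into binary data
The _code_to_timestamp_pyc method is responsible for converting the PyCodeObject object into binary data. Its source is as follows:
# _bootstrap_external.py
def _code_to_timestamp_pyc(code, mtime=0, source_size=0):
    "Produce the data for a timestamp-based pyc."
    data = bytearray(MAGIC_NUMBER)
    data.extend(_w_long(0))
    data.extend(_w_long(mtime))
    data.extend(_w_long(source_size))
    data.extend(marshal.dumps(code))
    return data
As can be seen, a pyc file consists of several parts:
- MAGIC_NUMBER: each version of the Python implementation defines a different MAGIC_NUMBER (for example, Python 3.7a0 uses 3392 and Python 3.6a0 uses 3360), which prevents loading an incompatible pyc file;
- 0: a 32-bit flags field introduced in Python 3.7 by PEP 552; the value 0 marks a timestamp-based pyc rather than a hash-based one;
- mtime: the time the py file was created or last modified. If the modification time has not changed, the code does not need to be converted to binary and saved again, that is, the pyc file does not need to be rewritten;
- source_size: the size of the source code;
- marshal.dumps(code): the binary stream of the PyCodeObject object.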
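The layout listed above can be reproduced and read back with the standard library. The following is a minimal sketch under the assumption of the 16-byte header used since Python 3.7; build_timestamp_pyc and split_pyc are hypothetical helper names, not functions from importlib.

```python
import importlib.util
import marshal
import struct
import time

def build_timestamp_pyc(code, mtime=0, source_size=0):
    """Mimic the layout produced by _code_to_timestamp_pyc."""
    data = bytearray(importlib.util.MAGIC_NUMBER)
    data += struct.pack("<I", 0)  # flags: 0 = timestamp-based
    data += struct.pack("<I", mtime & 0xFFFFFFFF)
    data += struct.pack("<I", source_size & 0xFFFFFFFF)
    data += marshal.dumps(code)
    return bytes(data)

def split_pyc(data):
    """Split pyc bytes into (magic, flags, mtime, source_size, code)."""
    magic = data[:4]
    flags, mtime, source_size = struct.unpack("<III", data[4:16])
    code = marshal.loads(data[16:])
    return magic, flags, mtime, source_size, code

source = "answer = 6 * 7\n"
code = compile(source, "demo.py", "exec")
pyc = build_timestamp_pyc(code, mtime=int(time.time()), source_size=len(source))

magic, flags, mtime, source_size, loaded = split_pyc(pyc)
namespace = {}
exec(loaded, namespace)
print(flags, source_size, namespace["answer"])
```

Because the code object is serialized with marshal, the bytes are only meaningful to the same Python version that produced them, which is what the magic number guards against.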
marshal.dumps calls the marshal_dumps_impl method:
// marshal.c
static PyObject *
marshal_dumps_impl(PyObject *module, PyObject *value, int version)
/*[clinic end generated code: output=9c200f98d7256cad input=a2139ea8608e9b27]*/
{
    return PyMarshal_WriteObjectToString(value, version);
}
The source of PyMarshal_WriteObjectToString is:
// marshal.c
PyObject *
PyMarshal_WriteObjectToString(PyObject *x, int version)
{
    WFILE wf;
    memset(&wf, 0, sizeof(wf));
    wf.str = PyBytes_FromStringAndSize((char *)NULL, 50);
    if (wf.str == NULL)
        return NULL;
    wf.ptr = wf.buf = PyBytes_AS_STRING((PyBytesObject *)wf.str);
    wf.end = wf.ptr + PyBytes_Size(wf.str);
    wf.error = WFERR_OK;
    wf.version = version;
    if (w_init_refs(&wf, version)) {
        Py_DECREF(wf.str);
        return NULL;
    }
    w_object(x, &wf);
    w_clear_refs(&wf);
    if (wf.str != NULL) {
        char *base = PyBytes_AS_STRING((PyBytesObject *)wf.str);
        if (wf.ptr - base > PY_SSIZE_T_MAX) {
            Py_DECREF(wf.str);
            PyErr_SetString(PyExc_OverflowError,
                            "too much marshal data for a bytes object");
            return NULL;
        }
        if (_PyBytes_Resize(&wf.str, (Py_ssize_t)(wf.ptr - base)) < 0)
            return NULL;
    }
    if (wf.error != WFERR_OK) {
        Py_XDECREF(wf.str);
        if (wf.error == WFERR_NOMEMORY)
            PyErr_NoMemory();
        else
            PyErr_SetString(PyExc_ValueError,
              (wf.error==WFERR_UNMARSHALLABLE)?"unmarshallable object"
               :"object too deeply nested to marshal");
        return NULL;
    }
    return wf.str;
}
The key method here is w_object, which calls w_complex_object. The actual conversion of a PyCodeObject object into binary data happens in w_complex_object:
// marshal.c
static void
w_complex_object(PyObject *v, char flag, WFILE *p)
{
    // do sth.
    else if (PyCode_Check(v)) {
        PyCodeObject *co = (PyCodeObject *)v;
        W_TYPE(TYPE_CODE, p);
        w_long(co->co_argcount, p);
        w_long(co->co_kwonlyargcount, p);
        w_long(co->co_nlocals, p);
        w_long(co->co_stacksize, p);
        w_long(co->co_flags, p);
        w_object(co->co_code, p);
        w_object(co->co_consts, p);
        w_object(co->co_names, p);
        w_object(co->co_varnames, p);
        w_object(co->co_freevars, p);
        w_object(co->co_cellvars, p);
        w_object(co->co_filename, p);
        w_object(co->co_name, p);
        w_long(co->co_firstlineno, p);
        w_object(co->co_lnotab, p);
    }
    // do sth.
}
As can be seen:
- The type of a PyCodeObject object is TYPE_CODE. The test.py file from section 8.2 produces three PyCodeObject objects, related as one PyCodeObject object with two PyCodeObject objects nested inside it;
- Fields such as co_argcount and co_kwonlyargcount are written with w_long (which calls w_byte to write four bytes), while fields such as co_code and co_consts are written with w_object (which in practice calls w_long, w_string and similar methods); this is how they all end up as binary data. The exact meaning of these fields will be analyzed in depth later;
- Note that there is a special type, TYPE_REF, which can be used to save storage space. Take co_filename as an example: this field holds the full path of the py file. Below are the values of the co_filename field in the pyc file generated for test.py:
// class A
"co_filename": {
    "type": "unicode",
    "size": 49,
    "value": "/Users/l.wang/Documents/pythonindepth/bin/test.py"
}
// def Fun
"co_filename": {
    "type": "ref",
    "ref": 6
}
// test.py
"co_filename": {
    "type": "ref",
    "ref": 6
}
This is implemented by the w_ref method, whose source is shown below. It maintains a hash table whose key is the address of an object and whose value is an index. If an object with the same address is already in the table, TYPE_REF and the index are written instead, which saves space.
// marshal.c
static int
w_ref(PyObject *v, char *flag, WFILE *p)
{
    _Py_hashtable_entry_t *entry;
    int w;
    if (p->version < 3 || p->hashtable == NULL) {
        return 0; /* not writing object references */
    }
    /* if it has only one reference, it definitely isn't shared */
    if (Py_REFCNT(v) == 1) {
        return 0;
    }
    entry = _Py_HASHTABLE_GET_ENTRY(p->hashtable, v);
    if (entry != NULL) {
        /* write the reference index to the stream */
        _Py_HASHTABLE_ENTRY_READ_DATA(p->hashtable, entry, w);
        /* we don't store "long" indices in the dict */
        assert(0 <= w && w <= 0x7fffffff);
        w_byte(TYPE_REF, p);
        w_long(w, p);
        return 1;
    } else {
        size_t s = p->hashtable->entries;
        /* we don't support long indices */
        if (s >= 0x7fffffff) {
            PyErr_SetString(PyExc_ValueError, "too many objects");
            goto err;
        }
        w = (int)s;
        Py_INCREF(v);
        if (_Py_HASHTABLE_SET(p->hashtable, v, w) < 0) {
            Py_DECREF(v);
            goto err;
        }
        *flag |= FLAG_REF;
        return 0;
    }
err:
    p->error = WFERR_UNMARSHALLABLE;
    return 1;
}
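The space saving from TYPE_REF can be observed through marshal.dumps and its version argument. The sketch below relies on the version check visible in w_ref (no references are written when version < 3); the repeated path string is a hypothetical example value.

```python
import marshal

# A long string referenced twice by the same tuple (hypothetical example value).
shared = "/some/long/path/" * 50
value = (shared, shared)

# Version 2 predates object references, so the string is written twice.
without_refs = marshal.dumps(value, 2)
# From version 3 on, the second occurrence is written as a reference.
with_refs = marshal.dumps(value, 4)

print(len(without_refs), len(with_refs))
```

Both byte strings load back to an equal tuple with marshal.loads; only the encoding of the repeated object differs.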
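The reverse of this process is implemented as follows. If flag is non-zero, the actual value is appended to the refs list. If the type is TYPE_REF, the real value is fetched from the refs list using the index that was read.
static PyObject *
r_object(RFILE *p)
{
    PyObject *v, *v2;
    Py_ssize_t idx = 0;
    long i, n;
    int type, code = r_byte(p);
    int flag, is_interned = 0;
    PyObject *retval = NULL;
    if (code == EOF) {
        PyErr_SetString(PyExc_EOFError,
                        "EOF read where object expected");
        return NULL;
    }
    p->depth++;
    if (p->depth > MAX_MARSHAL_STACK_DEPTH) {
        p->depth--;
        PyErr_SetString(PyExc_ValueError, "recursion limit exceeded");
        return NULL;
    }
    flag = code & FLAG_REF;
    type = code & ~FLAG_REF;
#define R_REF(O) do{\
    if (flag) \
        O = r_ref(O, flag, p);\
} while (0)
    switch (type) {
    // do sth.
    case TYPE_REF:
        n = r_long(p);
        if (n < 0 || n >= PyList_GET_SIZE(p->refs)) {
            if (n == -1 && PyErr_Occurred())
                break;
            PyErr_SetString(PyExc_ValueError, "bad marshal data (invalid reference)");
            break;
        }
        v = PyList_GET_ITEM(p->refs, n);
        if (v == Py_None) {
            PyErr_SetString(PyExc_ValueError, "bad marshal data (invalid reference)");
            break;
        }
        Py_INCREF(v);
        retval = v;
        break;
    // do sth.
    }
}
One open question here: why does w_ref not, like r_object, use the value of flag to decide which objects go into the hash table? I have not fully worked this out. Judging from the code above, a plausible reading is that on the write side flag is an output rather than an input: w_ref itself decides which objects may be shared (reference count greater than 1), records them in the hash table and sets FLAG_REF, and r_object later relies on that flag.
8.3.4. Writing the binary data to a file
The _cache_bytecode method is responsible for writing the binary data to a file. Its source is as follows: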
# _bootstrap_external.py
def _cache_bytecode(self, source_path, bytecode_path, data):
    # Adapt between the two APIs
    mode = _calc_mode(source_path)
    return self.set_data(bytecode_path, data, _mode=mode)
The source of the set_data method is as follows:
def set_data(self, path, data, *, _mode=0o666):
    """Write bytes data to a file."""
    parent, filename = _path_split(path)
    path_parts = []
    # Figure out what directories are missing.
    while parent and not _path_isdir(parent):
        parent, part = _path_split(parent)
        path_parts.append(part)
    # Create needed directories.
    for part in reversed(path_parts):
        parent = _path_join(parent, part)
        try:
            _os.mkdir(parent)
        except FileExistsError:
            # Probably another Python process already created the dir.
            continue
        except OSError as exc:
            # Could be a permission error, read-only filesystem: just forget
            # about writing the data.
            _bootstrap._verbose_message('could not create {!r}: {!r}',
                                        parent, exc)
            return
    try:
        _write_atomic(path, data, _mode)
        _bootstrap._verbose_message('created {!r}', path)
    except OSError as exc:
        # Same as above: just don't write the bytecode.
        _bootstrap._verbose_message('could not create {!r}: {!r}', path,
                                    exc)
The key method for writing the file is _write_atomic, shown below. It writes to a temporary file and then renames it, which guarantees that either an exception occurs and no target file is produced, or no exception occurs and the file with the specified name is produced.
def _write_atomic(path, data, mode=0o666):
    """Best-effort function to write data to a path atomically.
    Be prepared to handle a FileExistsError if concurrent writing of the
    temporary file is attempted."""
    # id() is used to generate a pseudo-random filename.
    path_tmp = '{}.{}'.format(path, id(path))
    fd = _os.open(path_tmp,
                  _os.O_EXCL | _os.O_CREAT | _os.O_WRONLY, mode & 0o666)
    try:
        # We first write data to a temporary file, and then use os.replace() to
        # perform an atomic rename.
        with _io.FileIO(fd, 'wb') as file:
            file.write(data)
        _os.replace(path_tmp, path)
    except OSError:
        try:
            _os.unlink(path_tmp)
        except OSError:
            pass
        raise
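The write-then-rename pattern is useful outside of importlib as well. Below is a minimal standalone sketch of the same idea using only public modules; it swaps the id()-based temporary name for tempfile.mkstemp, and write_atomic and demo.bin are hypothetical names chosen for the example.

```python
import os
import tempfile

def write_atomic(path, data):
    """Write bytes to path via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file exclusively, similar in spirit to O_EXCL above.
    fd, tmp_path = tempfile.mkstemp(dir=directory,
                                    prefix=os.path.basename(path) + ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # os.replace renames over the target in a single step.
        os.replace(tmp_path, path)
    except OSError:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "demo.bin")  # hypothetical file name
    write_atomic(target, b"first")
    write_atomic(target, b"second")
    with open(target, "rb") as f:
        content = f.read()
    leftovers = [name for name in os.listdir(tmp) if name != "demo.bin"]

print(content, leftovers)
```

Keeping the temporary file in the same directory as the target matters, because a rename is only a single-step operation within one filesystem.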
8.4. References
- Python源码剖析
8.5. Appendix
With the pyc generation flow analyzed, the tool mentioned in section 8.1 can now be implemented. Its source is as follows:
# -*- coding:utf-8 -*-
import json
import datetime
import sys

FLAG_REF = ord('\x80')
TYPE_CODE = ord('c')
TYPE_STRING = ord('s')
TYPE_SMALL_TUPLE = ord(')')
TYPE_INT = ord('i')
TYPE_SHORT_ASCII = ord('z')
TYPE_SHORT_ASCII_INTERNED = ord('Z')
TYPE_REF = ord('r')
TYPE_NONE = ord('N')
# Maps reference index -> parsed value, mirroring the refs list in r_object.
REFS_HASH = {}

def parse_code(fp):
    # The first byte holds the type code, with FLAG_REF in the high bit.
    code = int.from_bytes(fp.read(1), 'little')
    code_type = code & ~FLAG_REF
    code_flag = code & FLAG_REF
    # Reserve the index before parsing nested objects so that indices
    # are assigned in the same order as when the file was written.
    idx = len(REFS_HASH)
    if code_flag:
        REFS_HASH[idx] = None
    code_dict = {}
    if code_type == TYPE_CODE:
        code_dict['type'] = 'code'
        code_dict['co_argcount'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_kwonlyargcount'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_nlocals'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_stacksize'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_flags'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_code'] = parse_code(fp)
        code_dict['co_consts'] = parse_code(fp)
        code_dict['co_names'] = parse_code(fp)
        code_dict['co_varnames'] = parse_code(fp)
        code_dict['co_freevars'] = parse_code(fp)
        code_dict['co_cellvars'] = parse_code(fp)
        code_dict['co_filename'] = parse_code(fp)
        code_dict['co_name'] = parse_code(fp)
        code_dict['co_firstlineno'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_lnotab'] = parse_code(fp)
    elif code_type == TYPE_STRING:
        code_dict['type'] = 'string'
        length = int.from_bytes(fp.read(4), 'little')
        code_dict['length'] = length
        # todo
        value = fp.read(length)
        code_dict['value'] = str(value)
        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_SMALL_TUPLE:
        code_dict['type'] = 'tuple'
        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size
        items = []
        for _ in range(size):
            items.append(parse_code(fp))
        code_dict['items'] = items
        if code_flag:
            REFS_HASH[idx] = code_dict['items']
    elif code_type == TYPE_INT:
        code_dict['type'] = 'long'
        value = int.from_bytes(fp.read(4), 'little')
        code_dict['value'] = value
        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_SHORT_ASCII:
        code_dict['type'] = 'unicode'
        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size
        code_dict['value'] = fp.read(size).decode()
        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_SHORT_ASCII_INTERNED:
        code_dict['type'] = 'unicode'
        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size
        code_dict['value'] = fp.read(size).decode()
        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_REF:
        code_dict['type'] = 'ref'
        code_dict['ref'] = int.from_bytes(fp.read(4), 'little')
        code_dict['value'] = REFS_HASH[code_dict['ref']]
    elif code_type == TYPE_NONE:
        code_dict['type'] = 'none'
    else:
        # Unsupported type code: print it so it can be added later.
        print(code_type)
    return code_dict

def parse_pyc(file_name):
    pyc_dict = {}
    with open(file_name, 'rb') as fp:
        # The first two bytes of the 4-byte magic hold the version number.
        magic_number = int.from_bytes(fp.read(2), 'little')
        if 3390 <= magic_number <= 3392:
            pyc_dict['version'] = 'Python 3.7'
        else:
            print('only support Python 3.7')
            sys.exit(1)
        _ = fp.read(2)  # remaining two bytes of the magic
        _ = fp.read(4)  # the field written as _w_long(0)
        timestamp = int.from_bytes(fp.read(4), 'little')
        pyc_dict['modified'] = str(datetime.datetime.fromtimestamp(timestamp))
        source_size = int.from_bytes(fp.read(4), 'little')
        pyc_dict['size'] = source_size
        pyc_dict['code'] = parse_code(fp)
    return pyc_dict

if __name__ == '__main__':
    file_name = sys.argv[1]
    print(json.dumps(parse_pyc(file_name), indent=2))
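The result of analyzing test.py is:
TYPE_REF is handled by the tool; the value field below is not actually contained in the real binary data:
"co_filename": {
    "type": "ref",
    "ref": 6,
    "value": "/Users/l.wang/Documents/pythonindepth/bin/test.py"
}
Decoding of the instruction set is not handled yet.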
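As a possible starting point for that missing piece, the raw bytes in co_code can be split into instructions with the opcode module. This is only a sketch and not part of the tool above: it assumes the 2-byte "wordcode" layout used since Python 3.6, does not treat EXTENDED_ARG or inline cache entries specially, and decode_wordcode is a hypothetical helper name.

```python
import opcode

def decode_wordcode(co_code):
    """Split raw byte code into (offset, opname, arg) triples.

    Assumes one byte for the opcode followed by one byte for its argument.
    """
    result = []
    for offset in range(0, len(co_code) - 1, 2):
        op, arg = co_code[offset], co_code[offset + 1]
        result.append((offset, opcode.opname[op], arg))
    return result

code = compile("x = 1\n", "demo.py", "exec")
instructions = decode_wordcode(code.co_code)
for offset, name, arg in instructions:
    print(offset, name, arg)
```

The opcode numbering changes between Python versions, so a pyc file has to be decoded with the opcode table of the version that wrote it.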