[Python] Does importing a module or forking inside a thread cause a deadlock?
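Recently our Python project ran into a very strange deadlock.
The short version:
- Thread 1 executes an import of module A.
- Module A creates a new Thread 2 at module level and waits for it (for example with join()), and Thread 2 then imports module B or calls fork.
Whenever both conditions hold, the program deadlocks.
The same problem is described in the Zhihu post "奇怪的 dead lock" ("A strange deadlock").
That post hit the case where the second thread does an import, and the problem in our project belongs to the same family.
When I reproduced it myself later, though, what I reproduced was the variant caused by fork.
Let's look at the example from the Zhihu post.
Below is bar.py. It starts a thread at module level,
and that thread calls encode(),
which implicitly imports the encodings.utf_8 module: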
    from threading import Thread

    class Bar(Thread):
        def run(self):
            u"hihi".encode("utf-8")

    bar = Bar()
    bar.start()
    bar.join()
The other file, foo.py, does nothing but import bar:
    import bar
Run foo.py and the program hangs immediately.
Attaching gdb shows that there are two threads at this point:
    (gdb) info thr
      Id   Target Id         Frame
      2    Thread 0x7fad7186b700 (LWP 46382) "python" 0x00007fad792a7afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
    * 1    Thread 0x7fad79a89740 (LWP 46381) "python" 0x00007fad792a7afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
Thread 1 (the main thread in this example) is stopped in thread.join(), waiting for Thread 2 to finish:
    (gdb) bt
    #0  0x00007fad792a7afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
    #1  0x00007fad792a7b8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
    #2  0x00007fad792a7c2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
    #3  0x00007fad795c7795 in PyThread_acquire_lock (lock=0x20cacd0, waitflag=1) at /usr/src/debug/Python-2.7.5/Python/thread_pthread.h:323
    #4  0x00007fad795cb482 in lock_PyThread_acquire_lock (self=0x7fad79a1c1b0, args=) at /usr/src/debug/Python-2.7.5/Modules/threadmodule.c:52
    #5  0x00007fad7959ad40 in call_function (oparg=, pp_stack=0x7ffef9feafa0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4408
    #6  PyEval_EvalFrameEx (f=f@entry=Frame 0x20ca700, for file /usr/lib64/python2.7/threading.py, line 339, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=, acquire=, _Condition__waiters=[], release=) at remote 0x7fad798fb390>, timeout=None, balancing=True, waiter=, saved_state=None), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3040
    #7  0x00007fad7959d08d in PyEval_EvalCodeEx (co=, globals=, locals=locals@entry=0x0, args=, argcount=1, kws=0x7fad799ded68, kwcount=0, defs=0x7fad7991eec0, defcount=2, closure=closure@entry=0x0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3640
    #8  0x00007fad7959a58c in fast_function (nk=, na=, n=, pp_stack=0x7ffef9feb1b0, func=) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4504
    #9  call_function (oparg=, pp_stack=0x7ffef9feb1b0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4429
    #10 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fad799debc0, for file /usr/lib64/python2.7/threading.py, line 951, in join (self=<Bar(_Thread__ident=140382910723840, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=, acquire=, _Condition__waiters=[], release=) at remote 0x7fad798fb390>, _Thread__name='Thread-1', _Thread__daemonic=False, _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=, acquire=, _Condition__waiters=[], release=) at remote 0x7fad798fb350>) at remote 0x7fad79944990>, _Thread__stderr=, _Thread__target=N...(truncated),
Thread 2, because it has to run u"hihi".encode(),
calls PyImport_ImportModuleLevel() to import the encodings.utf_8 module.
Inside PyImport_ImportModuleLevel(), however,
it tries to take a lock through _PyImport_AcquireLock().
Apparently it cannot get that lock, so it is stuck there:
    (gdb) thr 2
    [Switching to thread 2 (Thread 0x7fad7186b700 (LWP 46382))]
    #0  0x00007fad792a7afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
    (gdb) bt
    #0  0x00007fad792a7afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
    #1  0x00007fad792a7b8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
    #2  0x00007fad792a7c2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
    #3  0x00007fad795c7795 in PyThread_acquire_lock (lock=0x2038080, waitflag=waitflag@entry=1) at /usr/src/debug/Python-2.7.5/Python/thread_pthread.h:323
    #4  0x00007fad795ac33d in _PyImport_AcquireLock () at /usr/src/debug/Python-2.7.5/Python/import.c:309
    #5  0x00007fad795ae98f in PyImport_ImportModuleLevel (name=0x7fad79949e8c "encodings.utf_8", globals=0x0, locals=, fromlist=['*'], level=0) at /usr/src/debug/Python-2.7.5/Python/import.c:2287
    #6  0x00007fad79591d6f in builtin___import__ (self=, args=, kwds=) at /usr/src/debug/Python-2.7.5/Python/bltinmodule.c:49
    #7  0x00007fad7959a672 in do_call (nk=, na=, pp_stack=0x7fad7186a370, func=) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4623
    #8  call_function (oparg=, pp_stack=0x7fad7186a370) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4431
    #9  PyEval_EvalFrameEx (f=f@entry=Frame 0x2042c40, for file /usr/lib64/python2.7/encodings/__init__.py, line 100, in search_function (encoding='utf-8', entry='--unknown--', norm_encoding='utf_8', aliased_encoding=None, modnames=['utf_8'], modname='utf_8'), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3040
    #10 0x00007fad7959d08d in PyEval_EvalCodeEx (co=, globals=, locals=locals@entry=0x0, args=args@entry=0x7fad7994eee8, argcount=1, kws=kws@entry=0x0, kwcount=kwcount@entry=0, defs=defs@entry=0x0, defcount=defcount@entry=0, closure=0x0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3640
    #11 0x00007fad795269c8 in function_call (func=, arg=('utf-8',), kw=0x0) at /usr/src/debug/Python-2.7.5/Objects/funcobject.c:526
    #12 0x00007fad79501ab3 in PyObject_Call (func=func@entry=, arg=arg@entry=('utf-8',), kw=) at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
    #13 0x00007fad79593947 in PyEval_CallObjectWithKeywords (func=, arg=arg@entry=('utf-8',), kw=kw@entry=0x0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4277
    #14 0x00007fad795a4440 in _PyCodec_Lookup (encoding=0x7fad79943474 "utf-8") at /usr/src/debug/Python-2.7.5/Python/codecs.c:147
    #15 0x00007fad795a4589 in codec_getitem (encoding=, index=index@entry=0) at /usr/src/debug/Python-2.7.5/Python/codecs.c:211
    #16 0x00007fad795a45d4 in PyCodec_Encoder (encoding=) at /usr/src/debug/Python-2.7.5/Python/codecs.c:275
    #17 0x00007fad795a45f8 in PyCodec_Encode (object=u'hihi', encoding=, errors=0x0) at /usr/src/debug/Python-2.7.5/Python/codecs.c:322
So which lock is Thread 2 trying to take?
Looking at the CPython source,
before importing a module PyImport_ImportModuleLevel() always
calls _PyImport_AcquireLock() to take a single, shared module lock (the import lock):
    PyObject *
    PyImport_ImportModuleLevel(char *name, PyObject *globals, PyObject *locals,
                               PyObject *fromlist, int level)
    {
        PyObject *result;
        _PyImport_AcquireLock();
        result = import_module_level(name, globals, locals, fromlist, level);
        if (_PyImport_ReleaseLock() < 0) {
            Py_XDECREF(result);
            PyErr_SetString(PyExc_RuntimeError,
                            "not holding the import lock");
            return NULL;
        }
        return result;
    }
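While foo.py is executing import bar,
Thread 1 has in fact already taken the module lock through _PyImport_AcquireLock().
During the import of bar, the newly created Thread 2 triggers import encodings.utf_8,
so Thread 2 also has to take the module lock through _PyImport_AcquireLock().
Thread 2 waits for Thread 1 to release the module lock,
while Thread 1 waits in bar.join() inside module bar, that is, for Thread 2 to finish.
Each waits for the other, and the result is a deadlock.
If Thread 2 does not execute any import
and merely calls print(), the deadlock does not occur.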
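The mutual wait can be modelled in a few lines. This is only a sketch under stated assumptions: `import_lock` and `simulate_import` are hypothetical names, an ordinary `threading.RLock` stands in for CPython 2.7's module lock, and the worker gives up after a timeout so that the demo ends instead of hanging. It assumes Python 3, where `acquire()` accepts a `timeout`.

```python
import threading

# Assumption: a plain RLock stands in for CPython 2.7's global import lock.
import_lock = threading.RLock()


def simulate_import(worker_needs_lock, timeout=0.2):
    """Model `import bar`: hold the lock, start a worker, join it."""
    result = {}

    def worker():
        if not worker_needs_lock:
            # Models Thread 2 doing only print(): no lock involved.
            result["worker"] = "finished without the lock"
            return
        # Models Thread 2 importing or forking: it needs the lock.
        # The real code waits forever; the timeout lets this demo end.
        if import_lock.acquire(timeout=timeout):
            import_lock.release()
            result["worker"] = "got the lock"
        else:
            result["worker"] = "timed out waiting for the lock"

    with import_lock:          # Thread 1 is in the middle of the import
        t = threading.Thread(target=worker)
        t.start()
        t.join()               # Thread 1 waits for Thread 2
    return result["worker"]


if __name__ == "__main__":
    print(simulate_import(worker_needs_lock=True))
    print(simulate_import(worker_needs_lock=False))
```

In the first call the worker can never obtain the lock while the main thread sits in `join()`, which is the situation in the backtraces above; without the timeout both threads would wait forever.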
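When I reproduced the problem myself, I used subprocess.check_output() in place of the import,
and it deadlocked all the same. Why?
Under gdb, Thread 1 is again in thread.join(),
and Thread 2 again ends up stuck in _PyImport_AcquireLock(),
but it gets there by a different path, through posix_fork():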
    (gdb) bt
    #0  0x00007f9e76e60afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
    #1  0x00007f9e76e60b8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
    #2  0x00007f9e76e60c2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
    #3  0x00007f9e77180795 in PyThread_acquire_lock (lock=0x14cd080, waitflag=waitflag@entry=1) at /usr/src/debug/Python-2.7.5/Python/thread_pthread.h:323
    #4  0x00007f9e7716533d in _PyImport_AcquireLock () at /usr/src/debug/Python-2.7.5/Python/import.c:309
    #5  0x00007f9e7718b7f6 in posix_fork (self=, noargs=) at /usr/src/debug/Python-2.7.5/Modules/posixmodule.c:3846
    #6  0x00007f9e77153acc in call_function (oparg=, pp_stack=0x7f9e6e9c2d40) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4392
    #7  PyEval_EvalFrameEx (f=f@entry=Frame 0x7f9e680013b0, for file /usr/lib64/python2.7/subprocess.py, line 1224, in _execute_child (self=<Popen(_child_created=False, returncode=None, stdout=None, stdin=None, pid=None, stderr=None, universal_newlines=False) at remote 0x7f9e774b5450>, args=['/bin/sh', '-c', 'ls'], executable='/bin/sh', preexec_fn=None, close_fds=False, cwd=None, env=None, universal_newlines=False, startupinfo=None, creationflags=0, shell=True, to_close=set([4, 5]), p2cread=None, p2cwrite=None, c2pread=4, c2pwrite=5, errread=None, errwrite=None, _close_in_parent=, errpipe_read=6, errpipe_write=7, gc_was_enabled=True), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3040
    #8  0x00007f9e7715608d in PyEval_EvalCodeEx (co=, globals=, locals=locals@entry=0x0, args=, argcount=18, kws=0x7f9e68000e28, kwcount=0, defs=0x0, defcount=0, closure=closure@entry=0x0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3640
    #9  0x00007f9e7715358c in fast_function (nk=, na=, n=, pp_stack=0x7f9e6e9c2f50, func=) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4504
    #10 call_function (oparg=, pp_stack=0x7f9e6e9c2f50) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4429
    #11 PyEval_EvalFrameEx (f=f@entry=Frame 0x7f9e68000b50, for file /usr/lib64/python2.7/subprocess.py, line 711, in __init__ (self=<Popen(_child_created=False, returncode=None, stdout=None, stdin=None, pid=None, stderr=None, universal_newlines=False) at remote 0x7f9e774b5450>, args='ls', bufsize=0, executable=None, stdin=None, stdout=-1, stderr=None, preexec_fn=None, close_fds=False, shell=True, cwd=None, env=None, universal_newlines=False, startupinfo=None, creationflags=0, p2cread=None, p2cwrite=None, c2pread=4, c2pwrite=5, errread=None, errwrite=None, to_close=set([4, 5])), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3040
Checking posix_fork() in CPython,
it does indeed call _PyImport_AcquireLock() before fork(),
so Thread 2 is again waiting for Thread 1 to release the module lock:
    static PyObject *
    posix_fork(PyObject *self, PyObject *noargs)
    {
        pid_t pid;
        int result = 0;
        _PyImport_AcquireLock();
        pid = fork();
        if (pid == 0) {
            /* child: this clobbers and resets the import lock. */
            PyOS_AfterFork();
        } else {
            /* parent: release the import lock. */
            result = _PyImport_ReleaseLock();
        }
        ......
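Judging from the comments in CPython,
this module lock exists so that, when several threads import modules at the same time,
no thread ends up seeing a "partially imported" module.
As for why fork() also has to take the module lock first, I am not sure yet.
In short, creating threads, forking, and similar actions at module level
look quite dangerous, and they also imply wasted work
(every import of the module spawns a thread or forks).
Sometimes the thread or fork is hidden inside an object's __init__() (which is of course also poor style),
making this kind of problem even harder to catch in code review.
The best approach is to avoid, as far as possible, creating large objects,
starting threads, or forking at module level.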
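One way to follow that advice, sketched here for the bar.py example: keep the class definition at module level but move the start/join into a function that callers invoke explicitly once the import has finished. The function name `run_bar` and the `result` attribute are hypothetical additions for illustration, not part of the original code.

```python
from threading import Thread


class Bar(Thread):
    """Same worker as in bar.py, except that it keeps its result."""

    def __init__(self):
        Thread.__init__(self)
        self.result = None

    def run(self):
        self.result = u"hihi".encode("utf-8")


def run_bar():
    """Start the worker on demand instead of as a side effect of import."""
    bar = Bar()
    bar.start()
    bar.join()
    return bar.result


if __name__ == "__main__":
    # Runs only when the file is executed as a script, never on `import bar`.
    print(run_bar())
```

With this layout `import bar` only defines names, so the importing thread is no longer holding the module lock while it waits for the worker. The caller still has to avoid invoking `run_bar()` from another module's top level, which would recreate the original situation.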