[Python] Debug python 程式 hang 在 Queue.get() 的問題
今天專案的 python 程式遇到一個奇怪的問題,
執行之後似乎就卡住了,但並不是總是可以重現這個問題,
尤其是想用 python -m pdb 的方式去單步執行時,就完全不會出現問題…
先用 strace -f -p <pid> 來看看這個程式在作什麼…
發現 51609 這個 thread 一直重覆呼叫 select() 函式,
看來似乎是在等待什麼東西,每隔一個很短的時間就會多呼叫一次 select():
Process 51604 attached with 2 threads [pid 51609] select(0, NULL, NULL, NULL, {0, 38374} <unfinished ...> [pid 51604] futex(0x1b6b330, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> [pid 51609] <... select resumed> ) = 0 (Timeout) [pid 51609] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout) [pid 51609] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout) [pid 51609] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout) [pid 51609] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout) [pid 51609] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
用 gdb 來看看吧:
gdb -p 51604
下 info thr 看一下,跟 strace 結果一樣,有兩根 thread,
51609 那根 thread 還是停在 select():
(gdb) info thr Id Target Id Frame 2 Thread 0x7f546a55f700 (LWP 51609) "python" 0x00007f5476e2a8f3 in select () at ../sysdeps/unix/syscall-template.S:81 * 1 Thread 0x7f5477ff5740 (LWP 51604) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
下 thr 2 切到 51609 那根 thread:
(gdb) thr 2 [Switching to thread 2 (Thread 0x7f546a55f700 (LWP 51609))] #0 0x00007f5476e2a8f3 in select () at ../sysdeps/unix/syscall-template.S:81 81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
下 py-bt 看一下這時候對應的 python 是在跑什麼東西…
看到 frame#8 那邊是在跑 Queue.get(),這個函式會呼叫到 threading.Condition.wait():
(gdb) py-bt #2 <built-in function sleep> #4 Frame 0x7f54640035d0, for file /usr/lib64/python2.7/threading.py, line 361, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f5477f76b30>, acquire=<built-in method acquire of thread.lock object at remote 0x7f5477f76b30>, _Condition__waiters=[<thread.lock at remote 0x7f5477f76eb0>], release=<built-in method release of thread.lock object at remote 0x7f5477f76b30>) at remote 0x19441d0>, timeout=<float at remote 0x15ba498>, balancing=True, waiter=<thread.lock at remote 0x7f5477f76eb0>, saved_state=None, endtime=<float at remote 0x1960e38>, delay=<float at remote 0x1394f98>, gotit=False, remaining=<float at remote 0x15ba3d8>) _sleep(delay) #8 Frame 0x7f5464003200, for file /usr/lib64/python2.7/Queue.py, line 177, in get (self=<Queue(unfinished_tasks=0, queue=<collections.deque at remote 0x18d59f0>, maxsize=1000, all_tasks_done=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f5477f76b30>, acquire=<built-in method acquire of thread.lock object at remote 0x7f5477f76b30>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f5477f76b30>) at remote 0x19442d0>, mutex=<thread.lock at remote 0x7f5477f76b30>, not_full=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f5477f76b30>, acquire=<built-in method acquire of thread.lock object at remote 0x7f5477f76b30>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f5477f76b30>) at remote 0x1944290>, not_empty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f5477f76b30>, acquire=<built-in method acquire of thread.lock object at remote 0x7f...(truncated) self.not_empty.wait(remaining) #12 Frame 0x7f54640010b0, for file core/system.py, line 58, in _process_queue (self=<SystemManager(), _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f5477f76e10>, acquire=<built-in method acquire of thread.lock object at remote 0x7f5477f76e10>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f5477f76e10>) at remote 0x1944350>) at remote 0x1944310>, acquire=<built-in method acquire of thread.lock object at remote 0x7f5477f76e50>, _Condition__waiters=[<thread.lock at remote 0x1b181b0>], release=<built-in method release o...(truncated)
對照原始的 python 程式碼,可以知道究竟發生了什麼問題~
原來是我們為了避免 Queue.get() 會受到系統時間調整的影響,將它換成了使用 monotonic time,
但這個切換的同時,有可能有另外一個 thread (本例中的 51609) 已經在執行 Queue.get() 了,
也就是說,它可能一開始先拿到目前的系統時間 _time(),是一個比較大的整數,
因此設定的 endtime 會是一個大的數字~
但待會再拿系統時間 _time() 來比較時,拿到的是 monotonic time
(可以想成是系統開機後過了多久,但不完全是這樣),
因此這會是一個小很多的數字,因此 endtime – _time() 會一直都是一個很大的數字,
直到這台電腦開機夠久… (大概要開三十幾年吧)
endtime = _time() + timeout while True: # ...... remaining = endtime - _time() if remaining <= 0: break
了解了問題的原因之後,要解決就很簡單了,
只要確保我們先將 _time() 切換成 monotonic time,
再來操作 Queue,就沒有問題囉~