[Python] Debug python 程式 hang 在 Queue.get() 的問題

[Python] Debug python 程式 hang 在 Queue.get() 的問題

今天專案的 python 程式遇到一個奇怪的問題,

執行之後似乎就卡住了,但並不是總是可以重現這個問題,

尤其是想用 python -m pdb 的方式去單步執行時,就完全不會出現問題…

 

先用 strace -f -p <pid> 來看看這個程式在作什麼…

發現 51609 這個 thread 一直重覆呼叫 select() 函式,

看來似乎是在等待什麼東西,每隔一個很短的時間就會多呼叫一次 select():

Process 51604 attached with 2 threads
[pid 51609] select(0, NULL, NULL, NULL, {0, 38374} <unfinished ...>
[pid 51604] futex(0x1b6b330, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 51609] <... select resumed> )      = 0 (Timeout)
[pid 51609] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
[pid 51609] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
[pid 51609] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
[pid 51609] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
[pid 51609] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)

 

用 gdb 來看看吧:

gdb -p 51604

 

下 info thr 看一下,跟 strace 結果一樣,有兩根 thread,

51609 那根 thread 還是停在 select():

(gdb) info thr
Id   Target Id         Frame
2    Thread 0x7f546a55f700 (LWP 51609) "python" 0x00007f5476e2a8f3 in select () at ../sysdeps/unix/syscall-template.S:81
* 1    Thread 0x7f5477ff5740 (LWP 51604) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

 

下 thr 2 切到 51609 那根 thread:

(gdb) thr 2
[Switching to thread 2 (Thread 0x7f546a55f700 (LWP 51609))]
#0  0x00007f5476e2a8f3 in select () at ../sysdeps/unix/syscall-template.S:81
81  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

 

下 py-bt 看一下這時候對應的 python 是在跑什麼東西…

看到 frame#8 那邊是在跑 Queue.get(),這個函式會呼叫到 threading.Condition.wait():

(gdb) py-bt
#2 <built-in function sleep>
#4 Frame 0x7f54640035d0, for file /usr/lib64/python2.7/threading.py, line 361, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f5477f76b30>, acquire=<built-in method acquire of thread.lock object at remote 0x7f5477f76b30>, _Condition__waiters=[<thread.lock at remote 0x7f5477f76eb0>], release=<built-in method release of thread.lock object at remote 0x7f5477f76b30>) at remote 0x19441d0>, timeout=<float at remote 0x15ba498>, balancing=True, waiter=<thread.lock at remote 0x7f5477f76eb0>, saved_state=None, endtime=<float at remote 0x1960e38>, delay=<float at remote 0x1394f98>, gotit=False, remaining=<float at remote 0x15ba3d8>)
_sleep(delay)
#8 Frame 0x7f5464003200, for file /usr/lib64/python2.7/Queue.py, line 177, in get (self=<Queue(unfinished_tasks=0, queue=<collections.deque at remote 0x18d59f0>, maxsize=1000, all_tasks_done=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f5477f76b30>, acquire=<built-in method acquire of thread.lock object at remote 0x7f5477f76b30>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f5477f76b30>) at remote 0x19442d0>, mutex=<thread.lock at remote 0x7f5477f76b30>, not_full=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f5477f76b30>, acquire=<built-in method acquire of thread.lock object at remote 0x7f5477f76b30>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f5477f76b30>) at remote 0x1944290>, not_empty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f5477f76b30>, acquire=<built-in method acquire of thread.lock object at remote 0x7f...(truncated)
self.not_empty.wait(remaining)
#12 Frame 0x7f54640010b0, for file core/system.py, line 58, in _process_queue (self=<SystemManager(), _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f5477f76e10>, acquire=<built-in method acquire of thread.lock object at remote 0x7f5477f76e10>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f5477f76e10>) at remote 0x1944350>) at remote 0x1944310>, acquire=<built-in method acquire of thread.lock object at remote 0x7f5477f76e50>, _Condition__waiters=[<thread.lock at remote 0x1b181b0>], release=<built-in method release o...(truncated)

 

對照原始的 python 程式碼,可以知道究竟發生了什麼問題~

原來是我們為了避免 Queue.get() 會受到系統時間調整的影響,將它換成了使用 monotonic time,

但這個切換的同時,有可能有另外一個 thread (本例中的 51609) 已經在執行 Queue.get() 了,

也就是說,它可能一開始先拿到目前的系統時間 _time(),是一個比較大的整數,

因此設定的 endtime 會是一個大的數字~

但待會再拿系統時間 _time() 來比較時,拿到的是 monotonic time

(可以想成是系統開機後過了多久,但不完全是這樣),

因此這會是一個小很多的數字,因此 endtime – _time() 會一直都是一個很大的數字,

直到這台電腦開機夠久… (大概要開三十幾年吧)

endtime = _time() + timeout
while True:
# ......
remaining = endtime - _time()
if remaining <= 0:
break

 

了解了問題的原因之後,要解決就很簡單了,

只要確保我們先將 _time() 切換成 monotonic time,

再來操作 Queue,就沒有問題囉~

 

(本頁面已被瀏覽過 529 次)

發佈留言

發佈留言必須填寫的電子郵件地址不會公開。 必填欄位標示為 *

這個網站採用 Akismet 服務減少垃圾留言。進一步了解 Akismet 如何處理網站訪客的留言資料