[Python] Debugging a Python program that crashes when calling deepcopy
Today I'm tracking down a Python program that crashed and left a core dump…
I didn't manage to find the real cause, but I'm still writing down what I did~
1. Using gdb to debug the core dump
Running gdb /usr/bin/python coredump gives the following output:
root@localhost /tmp/ccpp # gdb /usr/bin/python coredump
Core was generated by `python -u -m /tmp/testd'.
Program terminated with signal 6, Aborted.
#0 0x00007f4ed2c195d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install python-2.7.5-16.el7.x86_64
So the process died from signal 6, i.e. SIGABRT,
and gdb suggests installing the debug symbols for Python~
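To double-check what signal 6 looks like from Python's side, here is a minimal sketch (assuming Linux or another Unix and Python 3.7+, not the Python 2.7 from the dump): a child interpreter calls os.abort(), which calls the C abort() just like glibc does after malloc_printerr in this crash, and subprocess reports the child as killed by signal 6. The helper names are made up for this demo, and the child's core file size limit is set to 0 so the demo normally leaves no core file behind.

```python
import resource
import signal
import subprocess
import sys


def disable_core_dumps():
    # Runs in the child before exec: core file size limit 0,
    # so this demo normally does not leave a core file.
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))


def run_aborting_child():
    """Start a Python child that calls os.abort() and return its exit code."""
    proc = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        preexec_fn=disable_core_dumps,
        capture_output=True,
    )
    return proc.returncode


if __name__ == "__main__":
    code = run_aborting_child()
    # subprocess reports "killed by signal N" as a negative return code.
    print("SIGABRT number:", int(signal.SIGABRT))
    print("child return code:", code)
    print("killed by SIGABRT:", code == -signal.SIGABRT)
```

A return code of -6 is the same "terminated with signal 6" that gdb reports for the real process.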
2. Installing Python's debug symbols
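Since gdb says the debug symbols are missing and even hands us the debuginfo-install command, let's just run debuginfo-install python-2.7.5-16.el7.x86_64 as suggested~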
If anything goes wrong during the installation, see the post "Installing debug symbols with yum"~
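gdb can print this hint more than once: even after the Python debuginfo was installed, the next run still asked for postgresql93-libs and python-crypto. A small hypothetical helper (my own sketch, not part of gdb or yum) can pull every suggested package out of saved gdb output so they can all go into one debuginfo-install call; it only relies on the "Missing separate debuginfos, use: debuginfo-install ..." wording seen in this post.

```python
import re

# gdb prints hints like:
#   Missing separate debuginfos, use: debuginfo-install pkg1 pkg2 ...
HINT_RE = re.compile(r"Missing separate debuginfos, use: debuginfo-install (.+)")


def debuginfo_packages(gdb_output):
    """Collect package names from every debuginfo-install hint, in order, without duplicates."""
    packages = []
    for match in HINT_RE.finditer(gdb_output):
        # "." does not match newlines, so each match stops at the end of its line.
        for pkg in match.group(1).split():
            if pkg not in packages:
                packages.append(pkg)
    return packages


if __name__ == "__main__":
    sample = (
        "Missing separate debuginfos, use: debuginfo-install python-2.7.5-16.el7.x86_64\n"
        "Missing separate debuginfos, use: debuginfo-install "
        "postgresql93-libs-9.3.6-2PGDG.rhel7.x86_64 python-crypto-2.6.1-1.el7.x86_64\n"
    )
    print("debuginfo-install " + " ".join(debuginfo_packages(sample)))
```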
3. Debugging the core dump with gdb again
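Running gdb /usr/bin/python coredump again,
oddly the py-bt command did almost nothing with this dump and couldn't show the relevant Python function calls,
but at least the call stack from bt now includes much more detailed argument information: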
Core was generated by `python -u -m /tmp/testd'.
Program terminated with signal 6, Aborted.
#0 0x00007f4ed2c195d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install postgresql93-libs-9.3.6-2PGDG.rhel7.x86_64 python-crypto-2.6.1-1.el7.x86_64
(gdb) bt
#0 0x00007f4ed2c195d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f4ed2c1acc8 in __GI_abort () at abort.c:90
#2 0x00007f4ed2c59e07 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f4ed2d628c8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3 0x00007f4ed2c5fc67 in malloc_printerr (action=<optimized out>, str=0x7f4ed2d5ffb7 "corrupted double-linked list", ptr=<optimized out>) at malloc.c:4972
#4 0x00007f4ed2c6330c in _int_malloc (av=av@entry=0x7f4ed2f9e760 <main_arena>, bytes=bytes@entry=529) at malloc.c:3667
#5 0x00007f4ed2c639bc in _int_realloc (av=av@entry=0x7f4ed2f9e760 <main_arena>, oldp=oldp@entry=0x25828f0, oldsize=oldsize@entry=464, nb=nb@entry=544) at malloc.c:4247
#6 0x00007f4ed2c64702 in __GI___libc_realloc (oldmem=0x2582900, bytes=536) at malloc.c:2998
#7 0x00007f4ed39da2e9 in _PyObject_GC_Resize (op=0xbf65, nitems=49041) at /usr/src/debug/Python-2.7.5/Modules/gcmodule.c:1614
#8 0x00007f4ed39389b5 in PyFrame_New (tstate=<optimized out>, code=0x7f4ed3ce8db0, globals={'_copy_with_copy_method': <s not handle Suites at remote 0x7f4ed3d02cf8>, '_deepcopy_atomic': <s not handle Suites at remote 0x7f4ed3d02e60>, '_reconstruct': <s not handle Suites at remote 0x7f4ed3cfc230>, '_deepcopy_tuple': <s not handle Suites at remote 0x7f4ed3d02f50>, '_deepcopy_dict': <s not handle Suites at remote 0x7f4ed3cfc050>, 'deepcopy': <s not handle Suites at remote 0x7f4ed3d02de8>, 'dispatch_table': {<GeneratedProtocolMessageType(__metaclass__=<me type at remote 0x226a590>, MergeFromString=<s not handle Suites at remote 0x231daa0>, ByteSize=<s not handle Suites at remote 0x231d8c0>, __str__=<s not handle Suites at remote 0x231d758>, SerializeToString=<s not handle Suites at remote 0x231d938>, _SetListener=<s not handle Suites at remote 0x231d848>, SetInParent=<s not handle Suites at remote 0x231dcf8>, _cached_byte_size_dirty=<er_descriptor at remote 0x231b1b8>, TYPE_FIELD_NUMBER=1, HasField=<s not handle Suites at remote 0x231d578>, _Modified=<s not hand...(truncated), locals=0x0) at /usr/src/debug/Python-2.7.5/Objects/frameobject.c:728
#9 0x00007f4ed39aba16 in PyEval_EvalCodeEx (co=0xbf65, globals=<unknown at remote 0xbf91>, locals=<unknown at remote 0x6>, args=0xffffffffffffffff, argcount=0, kws=0x27, kwcount=0, defs=0x7f4ed3d7af08, defcount=2, closure=0x0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3099
#10 0x00007f4ed39aa83f in fast_function (nk=0, na=2, n=2, pp_stack=0x7f4e93ffdef0, func=<s not handle Suites at remote 0x7f4ed3d02de8>) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4194
#11 call_function (oparg=<optimized out>, pp_stack=0x7f4e93ffdef0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4119
#12 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740
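Comparing against the Python source, I noticed a copy.deepcopy() call that lines up with frame #8:
the globals passed to PyFrame_New are the copy module's names (deepcopy, _deepcopy_dict, _reconstruct, ...),
and the func in frame #10 is at 0x7f4ed3d02de8, the same address as the 'deepcopy' entry in those globals.
From there, for some reason PyFrame_New -> _PyObject_GC_Resize went through realloc() all the way down into _int_malloc,
after which malloc_printerr in frame #3 tried to print the string "corrupted double-linked list",
and then abort() was called to end the program~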
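One reason deepcopy is a likely place to trip over a damaged heap: it recurses through the whole object graph, and in CPython 2.7 every Python-level call goes through PyFrame_New (as in frame #8) on top of the allocations for the copied objects themselves. The rough sketch below (Python 3, with made-up sample data and a hypothetical helper name) just counts how many calls into copy-module functions a single deepcopy() makes, using sys.setprofile. Note that CPython 3.11+ no longer allocates a full frame object for every call, so this counts calls, not frames.

```python
import copy
import sys


def count_copy_module_calls(obj):
    """Deep-copy obj and count Python-level calls into functions of the copy module."""
    calls = 0

    def profiler(frame, event, arg):
        nonlocal calls
        # "call" fires for Python functions; "c_call" events (builtins) are ignored.
        if event == "call" and frame.f_globals.get("__name__") == "copy":
            calls += 1

    previous = sys.getprofile()
    sys.setprofile(profiler)
    try:
        result = copy.deepcopy(obj)
    finally:
        # Restore whatever profiler was active before.
        sys.setprofile(previous)
    return result, calls


if __name__ == "__main__":
    data = [{"n": i, "tags": ["a", "b"]} for i in range(100)]
    clone, calls = count_copy_module_calls(data)
    print("equal:", clone == data, "shares inner dicts:", clone[0] is data[0])
    print("calls into copy module:", calls)
```

Even this small list of 100 dicts takes hundreds of calls to copy, each allocating new objects along the way.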
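My guess is that something the program did earlier corrupted the heap, and deepcopy was just the unlucky one:
while allocating memory for new objects it ran into the corrupted heap bookkeeping, so glibc's malloc detected the inconsistency and aborted…
But figuring out what corrupted the memory from this core dump alone looks pretty hard,
so for now all I can do is keep watching to see whether it happens again…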
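Since py-bt couldn't recover the Python call stack from this dump, one option for next time (an assumption on my part, not something tried here) is the standard library's faulthandler module, which on SIGABRT, SIGSEGV and similar fatal signals writes the Python traceback of every thread to stderr before the process dies. It is built in since Python 3.3 and can be turned on with -X faulthandler or PYTHONFAULTHANDLER=1; for Python 2.7 a faulthandler backport exists on PyPI. The sketch below simulates the crash with os.abort() in a child process; CHILD_CODE and its function names are hypothetical stand-ins for the real service.

```python
import resource
import subprocess
import sys

# Hypothetical stand-in for the real service: a few nested calls that end
# in os.abort(), like the malloc_printerr -> abort() path in the crash.
CHILD_CODE = """
import os

def copy_step():
    os.abort()

def handle_request():
    copy_step()

handle_request()
"""


def no_core_file():
    # Core file size limit 0 in the child, so the demo normally leaves no core.
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))


def run_with_faulthandler():
    """Run CHILD_CODE with faulthandler enabled; return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", CHILD_CODE],
        preexec_fn=no_core_file,
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    code, err = run_with_faulthandler()
    print("return code:", code)
    # stderr holds the fatal error message plus the Python stack,
    # most recent call first (copy_step, handle_request, <module>).
    print(err)
```

faulthandler re-raises the signal after dumping, so the process still dies with SIGABRT and a core dump can still be produced; the difference is that the Python-level stack also ends up in the logs.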
References:
Stack Overflow: What does 'corrupted double-linked list' mean