[Python] Differences in how Python 2.6 and 2.7 handle errors in ElementTree.write()
Today a Python program in our project ran into a strange problem:
writing an XML file with ElementTree raised an exception.
Here is a test program that reproduces it:
    import xml.etree.cElementTree as ET

    root = ET.fromstring('<?xml version="1.0" encoding="UTF-8" ?><REPORT></REPORT>')
    elem = ET.Element("S")
    # A plain Python 2 byte string (str) with non-ASCII bytes, not a unicode object
    elem.text = "aaa\xb3\\\xa5\\\xbb\\ccc"
    root.append(elem)
    ET.ElementTree(root).write("test.xml", "UTF-8")
Running the program above on CentOS 7 raises a UnicodeDecodeError:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 820, in write
        serialize(write, self._root, encoding, qnames, namespaces)
      File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
        _serialize_xml(write, e, encoding, qnames, None)
      File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 937, in _serialize_xml
        write(_escape_cdata(text, encoding))
      File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
        return text.encode(encoding, "xmlcharrefreplace")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xb3 in position 3: ordinal not in range(128)
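Oddly, the same program runs on CentOS 5.4 without any problem...
After some digging, it turned out to be a Python version issue:
we use Python 2.6 on CentOS 5.4 and Python 2.7 on CentOS 7,
and the two versions of ElementTree handle strings differently.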
Python 2.6
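ElementTree.write() -> _write() -> _escape_cdata().
When _escape_cdata() hits a UnicodeError, it calls _encode_entity().
That function writes every reserved or non-ASCII character to the XML as a numeric character reference ("&#%d;"),
so no exception is raised: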
    def _encode_entity(text, pattern=_escape):
        # map reserved and non-ascii characters to numerical entities
        def escape_entities(m, map=_escape_map):
            out = []
            append = out.append
            for char in m.group():
                text = map.get(char)
                if text is None:
                    text = "&#%d;" % ord(char)
                append(text)
            return string.join(out, "")
        try:
            return _encode(pattern.sub(escape_entities, text), "ascii")
        except TypeError:
            _raise_serialization_error(text)
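To make the fallback concrete, here is a minimal sketch of what it does to the byte string from the test program. The helper name bytes_to_char_refs is hypothetical and is not part of ElementTree; it only mimics the "&#%d;" substitution for bytes outside ASCII and skips the reserved-character map that the real function also applies. It is written to run on both Python 2 and Python 3.

```python
def bytes_to_char_refs(raw):
    """Keep ASCII bytes as-is and turn every other byte into "&#N;".

    Hypothetical helper for illustration only; not part of ElementTree.
    """
    out = []
    # bytearray yields integers on both Python 2 and Python 3
    for byte in bytearray(raw):
        if byte < 128:
            out.append(chr(byte))
        else:
            out.append("&#%d;" % byte)
    return "".join(out)


raw = b"aaa\xb3\\\xa5\\\xbb\\ccc"  # the byte string from the test program
escaped = bytes_to_char_refs(raw)
print(escaped)  # aaa&#179;\&#165;\&#187;\ccc
```

If this sketch matches what 2.6 does, each reference stands for a single byte value (&#179; is U+00B3, for example) rather than for the Big5 character the bytes were meant to spell, so the file gets written without an error but the original text is not preserved.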
Python 2.7
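ElementTree.write() -> _serialize_xml() -> _escape_cdata().
This function calls encode() to convert the text string to the requested encoding,
but it assumes that text is always a Unicode string.
When text is just a plain byte string in a DBCS encoding such as Big5, the call fails: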
    def _escape_cdata(text, encoding):
        # escape character data
        try:
            # it's worth avoiding do-nothing calls for strings that are
            # shorter than 500 character, or so.  assume that's, by far,
            # the most common case in most applications.
            if "&" in text:
                text = text.replace("&", "&amp;")
            if "<" in text:
                text = text.replace("<", "&lt;")
            if ">" in text:
                text = text.replace(">", "&gt;")
            return text.encode(encoding, "xmlcharrefreplace")
        except (TypeError, AttributeError):
            _raise_serialization_error(text)
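One way to see why the error is a UnicodeDecodeError even though the code calls encode(): in Python 2, calling encode() on a byte string first decodes it implicitly with the ASCII codec, and the except clause above only catches TypeError and AttributeError. The sketch below spells that hidden step out by hand, so it also runs on Python 3 and fails with the same message as the traceback above. The variable names are only for illustration.

```python
raw = b"aaa\xb3\\\xa5\\\xbb\\ccc"  # the byte string from the test program

error = None
try:
    # Explicit version of what Python 2 does implicitly for str.encode():
    # decode as ASCII first, then encode to the target encoding.
    raw.decode("ascii").encode("utf-8", "xmlcharrefreplace")
except UnicodeDecodeError as exc:
    error = exc

print(error)
# 'ascii' codec can't decode byte 0xb3 in position 3: ordinal not in range(128)
```

The "xmlcharrefreplace" handler never gets a chance to help here, because it only applies to the encode step and the failure happens during the decode step.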
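Python 2.6's handling looks a bit more forgiving, since an exception is less likely to occur,
but some people may well want an exception in this situation, and they would prefer what Python 2.7 does.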
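Whichever behavior you prefer, one way to sidestep the difference is to decode the bytes yourself before handing them to ElementTree. The sketch below assumes the test bytes are Big5, which is only a guess based on the example; the variable names, and the use of ET.tostring() instead of writing test.xml, are choices made here for illustration. It imports xml.etree.ElementTree because cElementTree no longer exists in recent Python 3 releases.

```python
import xml.etree.ElementTree as ET

raw = b"aaa\xb3\\\xa5\\\xbb\\ccc"  # the byte string from the test program
# Assumption: the bytes are Big5. Decode them to Unicode text first.
text = raw.decode("big5")

root = ET.fromstring("<REPORT></REPORT>")
elem = ET.SubElement(root, "S")
elem.text = text  # ElementTree now receives Unicode text, not raw bytes

# Serialize to UTF-8 bytes in memory instead of writing a file
xml_bytes = ET.tostring(root, encoding="UTF-8")
print(xml_bytes)
```

With Unicode text in the tree, the encode() call in _escape_cdata() has nothing to decode implicitly, so the same code should behave the same way on 2.6, 2.7, and Python 3.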
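This was my first time comparing the Python 2.6 and 2.7 source code,
and it was quite fun. Even a small bump in the version number can hide plenty of differences in behavior.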