[Python] Using Universal Newline Support to handle text files with different platform line endings

2016-04-08 ephrain

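Today I ran into a problem: a colleague created a text file on Windows, and after we fed it to our Linux program, every line turned out to have one extra byte…

The cause is simple. Windows, Unix/Linux, and classic Mac OS each use a different line-ending convention, so a file that crosses platforms can trip up code that expects only one of them.

For example, a text file created on Windows ends each line with CR LF (0d 0a):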

testuser@localhost ~ $ hexdump -C line_ending_windows.txt
00000000  6c 69 6e 65 31 0d 0a 6c  69 6e 65 20 32 0d 0a 0d  |line1..line 2...|
00000010  0a 6c 69 6e 65 34 0d 0a                           |.line4..|

On Unix/Linux the line ending is LF (0a):

testuser@localhost ~ $ hexdump -C line_ending_unix.txt
00000000  6c 69 6e 65 31 0a 6c 69  6e 65 20 32 0a 0a 6c 69  |line1.line 2..li|
00000010  6e 65 34 0a                                       |ne4.|

Classic Mac OS (before OS X) used CR (0d); modern macOS uses LF like other Unix systems:

testuser@localhost ~ $ hexdump -C line_ending_mac.txt
00000000  6c 69 6e 65 31 0d 6c 69  6e 65 20 32 0d 0d 6c 69  |line1.line 2..li|
00000010  6e 65 34 0d                                       |ne4.|
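The three conventions above can also be told apart programmatically by counting bytes. The following is a minimal sketch, not part of the original post: `detect_line_ending` is a hypothetical helper name, and the byte strings simply reproduce the contents shown in the hexdumps.

```python
def detect_line_ending(data: bytes) -> str:
    """Guess the dominant line-ending convention in raw bytes (hypothetical helper)."""
    crlf = data.count(b"\r\n")
    # Subtract CRLF pairs so their CR and LF bytes are not counted twice.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    counts = {"CRLF": crlf, "CR": cr, "LF": lf}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "none"


# The same bytes as in the three hexdumps.
windows = b"line1\r\nline 2\r\n\r\nline4\r\n"
unix = b"line1\nline 2\n\nline4\n"
mac = b"line1\rline 2\r\rline4\r"

print(detect_line_ending(windows))  # CRLF
print(detect_line_ending(unix))     # LF
print(detect_line_ending(mac))      # CR
```

For a file on disk, the bytes would need to be read in binary mode (`open(path, "rb")`) so that no newline translation happens before counting.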

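If you open these files in Python with a plain open() and read them, readlines() returns a different result for each file, depending on its line endings (the output below matches Python 2 on Linux, where only "\n" ends a line):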

>>> open("line_ending_windows.txt", "r").readlines()
['line1\r\n', 'line 2\r\n', '\r\n', 'line4\r\n']
>>> open("line_ending_unix.txt", "r").readlines()
['line1\n', 'line 2\n', '\n', 'line4\n']
>>> open("line_ending_mac.txt", "r").readlines()
['line1\rline 2\r\rline4\r']
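The transcript above depends on the Python version. In Python 3, text mode translates newlines by default, so the same calls would not show this difference. As a rough sketch under that assumption, passing `newline="\n"` to `open()` tells Python 3 to treat only LF as a line terminator and to leave the data untranslated, which gives the same lists as above. The temporary directory and the `raw_results` name are illustrative only.

```python
import os
import tempfile

# Same contents as the three sample files in this post.
samples = {
    "line_ending_windows.txt": b"line1\r\nline 2\r\n\r\nline4\r\n",
    "line_ending_unix.txt": b"line1\nline 2\n\nline4\n",
    "line_ending_mac.txt": b"line1\rline 2\r\rline4\r",
}

raw_results = {}
with tempfile.TemporaryDirectory() as tmpdir:
    for name, data in samples.items():
        path = os.path.join(tmpdir, name)
        with open(path, "wb") as f:
            f.write(data)
        # newline="\n": only LF ends a line, and nothing is translated.
        with open(path, "r", encoding="ascii", newline="\n") as f:
            raw_results[name] = f.readlines()

for name, lines in raw_results.items():
    print(name, lines)
```

The CR-only file comes back as a single element because it contains no LF at all, which is why the whole file looks like one long line.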

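To handle text files with any of these line endings, you can enable Python's Universal Newline Support.

To turn it on, add a "U" (for universal newline support) to the file mode. Python then recognizes CR LF, LF, and CR as line endings and translates each of them to "\n". This applies to Python 2; in Python 3, text mode already performs this translation by default, and the "U" flag was deprecated and then removed in Python 3.11.

With the feature enabled, readlines() returns the same data for all three files: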

>>> open("line_ending_windows.txt", "rU").readlines()
['line1\n', 'line 2\n', '\n', 'line4\n']
>>> open("line_ending_unix.txt", "rU").readlines()
['line1\n', 'line 2\n', '\n', 'line4\n']
>>> open("line_ending_mac.txt", "rU").readlines()
['line1\n', 'line 2\n', '\n', 'line4\n']
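For readers on Python 3, here is a minimal sketch of the equivalent behavior, assuming the same three sample files. The "rU" mode is not needed there: `newline=None`, which is the default for text mode, enables universal newlines. The temporary directory and the `results` and `expected` names are illustrative only.

```python
import os
import tempfile

# Same contents as the three sample files in this post.
samples = {
    "line_ending_windows.txt": b"line1\r\nline 2\r\n\r\nline4\r\n",
    "line_ending_unix.txt": b"line1\nline 2\n\nline4\n",
    "line_ending_mac.txt": b"line1\rline 2\r\rline4\r",
}
expected = ["line1\n", "line 2\n", "\n", "line4\n"]

results = {}
with tempfile.TemporaryDirectory() as tmpdir:
    for name, data in samples.items():
        path = os.path.join(tmpdir, name)
        with open(path, "wb") as f:
            f.write(data)
        # newline=None (the default) translates "\r\n" and "\r" to "\n" on read.
        with open(path, "r", encoding="ascii", newline=None) as f:
            results[name] = f.readlines()

for name, lines in results.items():
    print(name, lines == expected)
```

Writing `newline=None` explicitly is optional; it is spelled out here only to make the setting visible.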

Reference: Stack Overflow: Dealing with Windows line-endings in Python
