Python guide: changing file encoding

Question

How do I print UTF-8 encoded text to the console in Python?

print u"some unicode text \N{EURO SIGN}"
print b"some utf-8 encoded bytestring \xe2\x82\xac".decode('utf-8')

That is, if you have a Unicode string, print it directly. If you have a bytestring, convert it to Unicode first.

Your locale settings (LANG, LC_CTYPE) indicate a UTF-8 locale, so (in theory) you could print a UTF-8 bytestring directly and it would be displayed correctly in your terminal (if the terminal settings are consistent with the locale settings, and they should be). But you should avoid that: do not hardcode the character encoding of your environment inside your script; print Unicode directly instead.

There are many wrong assumptions in your question.

You do not need to set PYTHONIOENCODING alongside your locale settings to print Unicode to the terminal. A UTF-8 locale supports all Unicode characters, i.e., it works as is.

You do not need the workaround sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout). It may break if some code (that you do not control) needs to print bytes, and/or it may break while printing Unicode to the Windows console (wrong code page, cannot print undecodable characters). Correct locale settings and/or the PYTHONIOENCODING envvar are enough. Also, if you need to replace sys.stdout, then use io.TextIOWrapper() instead of the codecs module, as the win-unicode-console package does.
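As a rough illustration of wrapping a byte stream in a text layer, the sketch below uses io.TextIOWrapper around an in-memory io.BytesIO. The BytesIO object is only a stand-in for a real byte stream such as the binary buffer behind stdout, and the error handler and sample text are assumptions chosen for the demo. It is written for Python 3.

```python
import io

# Stand-in for a real binary stream (assumption for this demo).
buffer = io.BytesIO()

# Wrap the byte stream in a text layer with an explicit encoding.
# newline="\n" keeps line endings untouched on every platform.
writer = io.TextIOWrapper(
    buffer, encoding="utf-8", errors="backslashreplace", newline="\n"
)

writer.write(u"price: 5 \N{EURO SIGN}\n")
writer.flush()  # push the encoded bytes down into the underlying buffer

encoded = buffer.getvalue()
print(encoded)
```

The caller only ever writes Unicode text; the wrapper is the single place where an encoding is named.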

sys.getdefaultencoding() is unrelated to your locale settings and to PYTHONIOENCODING. Your assumption that setting PYTHONIOENCODING should change sys.getdefaultencoding() is incorrect. You should check sys.stdout.encoding instead.

sys.getdefaultencoding() is not used when you print to the console. It may be used as a fallback on Python 2 if stdout is redirected to a file/pipe, unless PYTHONIOENCODING is set:

$ python2 -c'import sys; print(sys.stdout.encoding)'
UTF-8
$ python2 -c'import sys; print(sys.stdout.encoding)' | cat
None
$ PYTHONIOENCODING=utf8 python2 -c'import sys; print(sys.stdout.encoding)' | cat
utf8
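The same check can be scripted. The following is a minimal sketch for Python 3 (not the Python 2 interpreter used in the shell session above); the helper name child_stdout_encoding is made up for this example. It starts a child interpreter with stdout connected to a pipe and reads back what the child reports as sys.stdout.encoding.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(extra_env=None):
    """Run a child interpreter with piped stdout; return its sys.stdout.encoding.

    Hypothetical helper written for this illustration.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    return out.decode("ascii").strip()


forced = child_stdout_encoding({"PYTHONIOENCODING": "utf8"})
# codecs.lookup() maps aliases such as "utf8" and "UTF-8" to one canonical name.
canonical = codecs.lookup(forced).name
print(canonical)
```

Comparing canonical codec names rather than raw strings avoids depending on how a particular interpreter version spells the encoding.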

Do not call sys.setdefaultencoding("UTF-8"); it may corrupt your data silently and/or break third-party modules that do not expect it. Remember that sys.getdefaultencoding() is used to convert bytestrings (str) to/from unicode in Python 2 implicitly, e.g., "a" + u"b". See also the quote in @mesilliac's answer.
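To make the implicit-conversion hazard concrete, here is a small sketch written for Python 3, where bytes and text are never mixed implicitly. The sample bytes are an illustrative value. Decoding them as ASCII (the default that Python 2 applies behind the scenes) fails, while decoding once with the real encoding works.

```python
# UTF-8 encoded bytes for "café" (illustrative value).
data = b"caf\xc3\xa9"

# Decoding as ASCII fails as soon as a non-ASCII byte appears.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decode explicitly with the actual encoding, then combine with other text.
text = data.decode("utf-8")
combined = text + u" \N{EURO SIGN}"
print(ascii_failed, len(combined))
```

Decoding at the boundary keeps the choice of encoding visible in the code instead of relying on a process-wide default.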


You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?

Answer: You can't unless the file format provides for this. XML, for example, begins with:

<?xml version="1.0" encoding="utf-8"?>

This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path, mode, encoding), which provides the missing bit in Python.
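A minimal sketch of reading and writing with an explicit encoding follows. It uses io.open rather than codecs.open, a deliberate swap: io.open accepts the same kind of encoding argument and exists on both Python 2.6+ and Python 3. The file name and temporary directory are made up for the demo.

```python
import io
import os
import tempfile

text = u"Capit\xe1n"
# Hypothetical file location, created only for this demo.
path = os.path.join(tempfile.mkdtemp(), "f1.txt")

# Write text: the encoding argument decides which bytes reach the disk.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the raw bytes back to see what was actually stored.
with io.open(path, "rb") as f:
    raw = f.read()

# Read as text again, naming the same encoding used when writing.
with io.open(path, "r", encoding="utf-8") as f:
    decoded = f.read()

print(raw)
print(decoded == text)
```

The file stores two bytes for the accented character; only the encoding named at read time turns them back into one character.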

As for your editor, you must check if it offers some way to set the encoding of a file.

The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.

The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that the Python 2 console shows the repr() of a string, which is restricted to ASCII. To display Unicode or anything with a character code >= 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the accented character á and save the file).

That said, you can use the Python function eval() to turn an escaped string into a string:

>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1

As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:

>>> x.decode('utf-8')
u'Capit\xe1n\n'
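If calling eval() on file contents feels too risky, a narrower option is ast.literal_eval, which only accepts literals. The sketch below is for Python 3 and assumes the escaped text contains no quote characters; the variable names are illustrative.

```python
import ast

# What a file like f2 holds: a literal backslash, then "xc3", and so on.
escaped = "Capit\\xc3\\xa1n\\n"

# Build a bytes literal around the text and let literal_eval parse it.
# Assumption: the text contains no single quotes that would end the literal.
raw = ast.literal_eval("b'" + escaped + "'")

# The result is UTF-8 encoded bytes; decode them to get text.
text = raw.decode("utf-8")

print(raw)
print(len(raw))
```

Unlike eval(), literal_eval refuses anything that is not a plain literal, so a hostile file cannot run code this way.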

Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:

0000000: 4361 7069 745c 7863 335c 7861 316e  Capit\xc3\xa1n

codecs.open('f2', 'rb', 'utf-8'), for example, reads them all as separate chars (expected). Is there any way to write to a file in ASCII that would work?

Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).

So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
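The two options can be compared side by side. This is a sketch for Python 3 using the standard unicode_escape codec, which escapes code points (\xe1, \u20ac) rather than the UTF-8 bytes shown in f2; the sample text is illustrative.

```python
text = u"Capit\xe1n \u20ac"

# Option 1: string escapes. The result contains only ASCII bytes.
escaped = text.encode("unicode_escape")

# Option 2: UTF-8. Compact, but needs an 8-bit safe stream.
utf8 = text.encode("utf-8")

# The escaped form round-trips back to the original text.
restored = escaped.decode("unicode_escape")

print(escaped)
print(utf8)
print(restored == text)
```

The escaped form is longer (17 bytes versus 12 here) but survives any channel that is limited to ASCII.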

Your solution using encode('string-escape') does work, but you must be aware how much memory you use: three times the amount of using codecs.open().

Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
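A short sketch for Python 3 shows how the same bytes yield different text depending on the encoding you name. Latin-1 is picked here only as a contrasting example.

```python
raw = b"\xc3\xa1"

# Read as UTF-8, the two bytes form one character (U+00E1).
as_utf8 = raw.decode("utf-8")

# Read as Latin-1, each byte is its own character (U+00C3, U+00A1).
as_latin1 = raw.decode("latin-1")

# Byte 65 only means "A" because the ASCII table says so.
letter = bytes([65]).decode("ascii")

print(len(as_utf8), len(as_latin1), letter)
```

Neither decoding is an error as far as the computer is concerned; only knowledge of how the file was written tells you which one is right.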
