Convert multiple text files to CSV in Python
When wrangling data with Pandas, you will eventually work with many types of data sources. We have already covered how Pandas interacts with Excel spreadsheets, SQL databases, and so on. In today's tutorial, we will learn how to use Python 3 to import text (.txt) files into Pandas DataFrames. As expected, the process is relatively simple to follow.
Example: read a text file into a DataFrame in Python

Suppose that you have a text file named interviews.txt, which contains tab-delimited data. We will go ahead and load the text file using pd.read_csv().
The result will look somewhat distorted because you have not specified the tab as your column delimiter. Specifying the \t escape sequence as your delimiter will fix the DataFrame.
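A minimal sketch of both calls, assuming a small tab-delimited file. The sample rows and column names are hypothetical stand-ins for interviews.txt and are written to a temporary folder so the example is self-contained:

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for interviews.txt (tab-delimited).
sample = "language\tmonth\tinterviews\nPython\tJanuary\t12\nR\tFebruary\t7\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "interviews.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)

    # Default comma separator: every line ends up in one single column.
    distorted = pd.read_csv(path)

    # Tab separator: the three columns are split as intended.
    interviews = pd.read_csv(path, sep="\t")
```

Comparing distorted.shape with interviews.shape is a quick way to confirm that the separator was picked up.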
Import multiple text files into Python Pandas DataFrames

This is a more interesting case, in which you need to import several text files located in one directory of your operating system into a Pandas DataFrame. Your text files could contain data extracted from a third-party system, a database, and so on. Before we go on, we need to import a couple of Python libraries.
Now use the following code:
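A minimal sketch of this step, assuming the files are tab-delimited and share the same columns. The folder is a temporary one and the file names (jan.txt, feb.txt) and column names are hypothetical:

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as folder:
    # Create two hypothetical tab-delimited files to import.
    samples = {"jan.txt": "Python\t12\n", "feb.txt": "R\t7\nJava\t3\n"}
    for name, rows in samples.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write("language\tinterviews\n" + rows)

    # Collect every .txt file in the folder, in a stable (sorted) order.
    txt_files = sorted(glob.glob(os.path.join(folder, "*.txt")))

    # Read each file into its own DataFrame, then stack them row-wise.
    frames = [pd.read_csv(path, sep="\t") for path in txt_files]
    combined = pd.concat(frames, ignore_index=True)
```

Passing ignore_index=True gives the combined DataFrame a fresh 0..n-1 index instead of repeating each file's own row numbers. Calling combined.to_csv("combined.csv", index=False) would then write a single CSV file, where the output name is again only an example.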
Once you have populated your DataFrame, you can analyze it further and visualize your data using Pandas.

Note: After you change the list separator character for your computer, all programs use the new character as the list separator. You can change the character back to the default by following the same procedure.

The pandas I/O API is a set of top level reader functions accessed like pandas.read_csv() that generally return a pandas object. The corresponding writer functions are object methods that are accessed like DataFrame.to_csv().
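The reader/writer split can be sketched with a small round trip. The DataFrame contents here are made up for illustration:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Writer: an object method on the DataFrame. Without a path it returns a string.
csv_text = df.to_csv(index=False)

# Reader: a top-level function that returns a pandas object.
round_trip = pd.read_csv(StringIO(csv_text))
```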
Below is a table containing available readers and writers.

Format Type   Data Description        Reader            Writer
text          CSV                     read_csv          to_csv
text          Fixed-Width Text File   read_fwf
text          JSON                    read_json         to_json
text          HTML                    read_html         to_html
text          LaTeX                                     Styler.to_latex
text          XML                     read_xml          to_xml
text          Local clipboard         read_clipboard    to_clipboard
binary        MS Excel                read_excel        to_excel
binary        OpenDocument            read_excel
binary        HDF5 Format             read_hdf          to_hdf
binary        Feather Format          read_feather      to_feather
binary        Parquet Format          read_parquet      to_parquet
binary        ORC Format              read_orc          to_orc
binary        Stata                   read_stata        to_stata
binary        SAS                     read_sas
binary        SPSS                    read_spss
binary        Python Pickle Format    read_pickle       to_pickle
SQL           SQL                     read_sql          to_sql
SQL           Google BigQuery         read_gbq          to_gbq

An informal performance comparison is available for some of these IO methods.

Note: For examples that use the StringIO class, make sure you import it with from io import StringIO for Python 3.

CSV & text files

The workhorse function for reading text files (a.k.a. flat files) is read_csv(). See the cookbook for some advanced strategies.

Parsing options

read_csv() accepts the following common arguments:

Basic

filepath_or_buffer : various
    Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), URL (including http, ftp, and S3 locations), or any object with a read() method (such as an open file or StringIO).

sep : str, defaults to ',' for read_csv(), '\t' for read_table()
    Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python's builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\\r\\t'.

delimiter : str, default None
    Alternative argument name for sep.

delim_whitespace : boolean, default False
    Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the delimiter. Equivalent to setting sep='\s+'.
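The separator options above can be sketched as follows. The sample strings are invented for illustration, and sep=r"\s+" is used rather than delim_whitespace since the two are described as equivalent:

```python
from io import StringIO

import pandas as pd

# Columns separated by runs of spaces of varying width.
spaced = "a  b   c\n1  2   3\n4  5   6\n"
by_whitespace = pd.read_csv(StringIO(spaced), sep=r"\s+")

# sep=None asks the Python engine to sniff the delimiter (here a semicolon).
semicolons = "a;b;c\n1;2;3\n4;5;6\n"
sniffed = pd.read_csv(StringIO(semicolons), sep=None, engine="python")
```

Passing engine="python" explicitly alongside sep=None avoids the warning pandas emits when it has to fall back from the C engine on its own.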
    If this option is set to True, nothing should be passed in for the delimiter parameter.

Column and index locations and names

header : int or list of ints, default 'infer'
    Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names.
    If no names are passed, the behavior is identical to header=0 and column names are inferred from the first line of the file; if column names are passed explicitly, the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names.

    The header can be a list of ints that specify row locations for a MultiIndex on the columns, e.g. [0, 1, 3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

names : array-like, default None
    List of column names to use.
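The interaction between header and names can be sketched with a small in-memory file. The data and the replacement names x, y, z are made up for illustration:

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6\n"

# Default: column names are inferred from the first line (same as header=0).
inferred = pd.read_csv(StringIO(data))

# header=0 plus names: the existing header row is replaced by the new names.
renamed = pd.read_csv(StringIO(data), header=0, names=["x", "y", "z"])

# names without header behaves like header=None, so the first line
# of the file is treated as data rather than as a header.
no_header = pd.read_csv(StringIO(data), names=["x", "y", "z"])
```

In the last call the original header line a,b,c ends up as the first data row, which is usually a sign that header=0 was intended.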
    If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed.

index_col : int, str, sequence of int / str, or False, optional, default None
    Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.

    Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.

    The default value of None instructs pandas to guess. If the number of fields in the column header row is equal to the number of fields in the body of the data file, then a default index is used. If it is larger, then the first columns are used as index so that the remaining number of fields in the body are equal to the number of fields in the header.

    The first row after the header is used to determine the number of columns, which will go into the index.
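The guessing behavior and the index_col=False override can be sketched with a deliberately malformed sample, where each data line carries a trailing delimiter. The values are invented for illustration:

```python
from io import StringIO

import pandas as pd

# Header has 3 fields, but every data line ends with an extra comma (4 fields).
data = "a,b,c\n4,apple,bat,\n8,orange,cow,"

# Default index_col=None: pandas guesses that the first column is the index.
guessed = pd.read_csv(StringIO(data))

# index_col=False: keep the first column as data and use a default index.
fixed = pd.read_csv(StringIO(data), index_col=False)
```

With the default setting, the values 4 and 8 become row labels and the remaining fields shift one column to the left. With index_col=False they stay in column a.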
    If the subsequent rows contain fewer columns than the first row, they are filled with NaN.

    This can be avoided through usecols. This ensures that the columns are taken as is and the trailing data are ignored.

usecols : list-like or callable, default None
    Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If names are given, the document header row(s) are not taken into account.
    For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz'].

    Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved, use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

    If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True:

        In [1]: import pandas as pd

        In [2]: from io import StringIO

        In [3]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

        In [4]: pd.read_csv(StringIO(data))
        Out[4]:
          col1 col2  col3
        0    a    b     1
        1    a    b     2
        2    c    d     3

        In [5]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"])
        Out[5]:
          col1  col3
        0    a     1
        1    a     2
        2    c     3

    Using this parameter results in much faster parsing time and lower memory usage when using the c engine.
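The point about element order can be sketched as follows, using a small made-up file with columns foo, bar and baz:

```python
from io import StringIO

import pandas as pd

data = "foo,bar,baz\n1,2,3\n4,5,6\n"

# usecols ignores element order: columns come back in file order.
subset = pd.read_csv(StringIO(data), usecols=["baz", "foo"])

# Select with the same list afterwards to get the requested order.
ordered = pd.read_csv(StringIO(data), usecols=["baz", "foo"])[["baz", "foo"]]
```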
    The Python engine loads the data first before deciding which columns to drop.

squeeze : boolean, default False
    If the parsed data only contains one column then return a Series.

    Deprecated since version 1.4.0: Append .squeeze("columns") to the call to read_csv to squeeze the data.

prefix : str, default None
    Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...

    Deprecated since version 1.4.0:
    Use a list comprehension on the DataFrame's columns after calling read_csv.

        In [6]: data = "col1,col2,col3\na,b,1"

        In [7]: df = pd.read_csv(StringIO(data))

        In [8]: df.columns = [f"pre_{col}" for col in df.columns]

        In [9]: df
        Out[9]:
          pre_col1 pre_col2  pre_col3
        0        a        b         1

mangle_dupe_cols : boolean, default True
    Duplicate columns will be specified as 'X', 'X.1' ... 'X.N', rather than 'X' ... 'X'. Passing in False will cause data to be overwritten if there are duplicate names in the columns.

    Deprecated since version 1.5.0: The argument was never implemented, and a new argument where the renaming pattern can be specified will be added instead.
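The replacements suggested in the deprecation notes above, along with the default handling of duplicate column names, can be sketched as follows. All sample strings are invented for illustration:

```python
from io import StringIO

import pandas as pd

# Single-column data: squeeze the resulting DataFrame into a Series.
single = pd.read_csv(StringIO("value\n1\n2\n3\n")).squeeze("columns")

# No header row: columns are numbered 0, 1, ...; add an "X" prefix by hand.
headerless = pd.read_csv(StringIO("1,2\n3,4\n"), header=None)
headerless.columns = [f"X{col}" for col in headerless.columns]

# Duplicate names in the header are disambiguated with numeric suffixes.
duplicated = pd.read_csv(StringIO("a,a,b\n1,2,3\n"))
```

None of these calls passes squeeze, prefix or mangle_dupe_cols, so the sketch does not depend on the deprecated arguments themselves.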
General parsing configuration

dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
New in version 1.5.0: Support for defaultdict was added. Specify a defaultdict as input where the default determines the dtype of the columns which are not explicitly listed.
engine : {'c', 'python', 'pyarrow'}
Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine.
New in version 1.4.0: The "pyarrow" engine was added as an experimental engine, and some features are unsupported, or may not work correctly, with this engine.
converters : dict, default None
Dict of functions for converting values in certain columns.
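A minimal sketch of how dtype and converters differ in practice, assuming a made-up column of codes with leading zeros. The column names and the lambda are hypothetical and only serve the illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: "zip" holds codes with leading zeros.
data = "zip,amount\n00123,1.5\n04567,2.0"

# Default inference parses "zip" as integers, which drops the zeros.
inferred = pd.read_csv(StringIO(data))

# dtype keeps the raw text of the column.
as_text = pd.read_csv(StringIO(data), dtype={"zip": str})

# converters calls a function on each raw string of the column instead.
converted = pd.read_csv(
    StringIO(data), converters={"zip": lambda s: s.lstrip("0")}
)

print(inferred["zip"].tolist())
print(as_text["zip"].tolist())
print(converted["zip"].tolist())
```

A dtype is usually enough when the goal is only to stop type inference, while a converter is the better fit when each value needs custom cleaning.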
Keys can either be integers or column labels.
true_values : list, default None
Values to consider as True.
false_values : list, default None
Values to consider as False.
skipinitialspace : boolean, default False
Skip spaces after delimiter.
skiprows : list-like or integer, default None
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise:

    In [10]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
    In [11]: pd.read_csv(StringIO(data))
    Out[11]:
      col1 col2  col3
    0    a    b     1
    1    a    b     2
    2    c    d     3
    In [12]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
    Out[12]:
      col1 col2  col3
    0    a    b     2

skipfooter : int, default 0
Number of lines at bottom of file to skip (unsupported with engine='c').
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files.
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference.
To ensure no mixed types, either set low_memory to False or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless; use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)
memory_map : boolean, default False
If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

NA and missing data handling

na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values.
See below for a list of the values interpreted as NaN by default.
keep_default_na : boolean, default True
Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:
- If keep_default_na is True and na_values are specified, na_values is appended to the default NaN values used for parsing.
- If keep_default_na is True and na_values are not specified, only the default NaN values are used for parsing.
- If keep_default_na is False and na_values are specified, only the NaN values specified in na_values are used for parsing.
- If keep_default_na is False and na_values are not specified, no strings will be parsed as NaN.
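The interaction between na_values and keep_default_na can be sketched as follows. The sample data is invented for illustration: "NA" is one of the default markers, while "missing" is a custom string.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: "NA" is a default marker, "missing" is not.
data = "name,score\nann,NA\nbob,missing\ncat,7"

# Default markers plus the extra string are both treated as NaN.
both = pd.read_csv(StringIO(data), na_values=["missing"])

# Only the listed string is treated as NaN; "NA" stays as text.
only_custom = pd.read_csv(
    StringIO(data), na_values=["missing"], keep_default_na=False
)

print(both["score"].isna().tolist())
print(only_custom["score"].isna().tolist())
```

Turning the defaults off is mainly useful when a legitimate value, such as the literal text "NA", would otherwise be read as missing.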
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
verbose : boolean, default False
Indicate number of NA values placed in non-numeric columns.
skip_blank_lines : boolean, default True
If True, skip over blank lines rather than interpreting as NaN values.

Datetime handling

parse_dates : boolean or list of ints or names or list of lists or dict, default False
- If True, try parsing the index.
- If [1, 2, 3], try parsing columns 1, 2, 3 each as a separate date column.
- If [[1, 3]], combine columns 1 and 3 and parse as a single date column.
- If {'foo': [1, 3]}, parse columns 1, 3 as date and call the result 'foo'.
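A minimal sketch of parse_dates using the list-of-names form. The column names and values are assumptions made for this example.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample with ISO 8601 dates.
data = "day,value\n2023-01-05,10\n2023-02-10,20"

# Without parse_dates the "day" column is kept as text.
raw = pd.read_csv(StringIO(data))

# A list of column names asks pandas to parse each one as a date column.
parsed = pd.read_csv(StringIO(data), parse_dates=["day"])

print(raw["day"].tolist())
print(parsed["day"].dt.month.tolist())
```

Once the column holds datetimes, the .dt accessor gives access to components such as month or weekday without further conversion.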
Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format : boolean, default False
If True and parse_dates is enabled for a column, attempt to infer the datetime format to speed up the processing.
keep_date_col : boolean, default False
If True and parse_dates specifies combining multiple columns, then keep the original columns.
date_parser : function, default None
Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs:
1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst : boolean, default False
DD/MM format dates, international and European format.
cache_dates : boolean, default True
If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets.
New in version 0.25.0.

Iteration

iterator : boolean, default False
Return TextFileReader object for iteration or getting chunks with get_chunk().
chunksize : int, default None
Return TextFileReader object for iteration.
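The two iteration options can be sketched as below. The sample data is invented, and using the reader in a with statement assumes a pandas version where the returned reader works as a context manager (1.2 or later).

```python
from io import StringIO

import pandas as pd

# Hypothetical sample with five data rows.
data = "x\n1\n2\n3\n4\n5"

# chunksize=2 yields DataFrames of at most two rows each.
chunk_sizes = []
total = 0
with pd.read_csv(StringIO(data), chunksize=2) as reader:
    for chunk in reader:
        chunk_sizes.append(len(chunk))
        total += int(chunk["x"].sum())

# iterator=True returns the same kind of reader; get_chunk pulls rows on demand.
reader = pd.read_csv(StringIO(data), iterator=True)
first_three = reader.get_chunk(3)
reader.close()

print(chunk_sizes, total, len(first_three))
```

Aggregating per chunk, as with the running total here, keeps memory use bounded by the chunk size rather than the file size.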
See below.

Quoting, compression, and file format

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', None, dict}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip, bz2, zip, xz, or zstandard if filepath_or_buffer is path-like ending in '.gz', '.bz2', '.zip', '.xz', '.zst', respectively, and no decompression otherwise. If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression.
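As a small sketch of compression inference, the example below writes a gzip file into a temporary directory and reads it back twice. The file name and contents are made up for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small gzip-compressed CSV (hypothetical sample data).
    path = os.path.join(tmp_dir, "sample.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("a,b\n1,2\n3,4\n")

    # The default compression="infer" picks gzip from the ".gz" suffix.
    inferred = pd.read_csv(path)

    # The method can also be named explicitly.
    explicit = pd.read_csv(path, compression="gzip")

print(inferred.equals(explicit))
```

Naming the method explicitly is mostly needed when the file name does not carry one of the recognized suffixes.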
Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd'}, and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, or zstandard.ZstdDecompressor. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.
Changed in version 1.1.0: dict option extended to support gzip and bz2.
Changed in version 1.2.0: Previous versions forwarded dict entries for 'gzip' to gzip.open.
thousands : str, default None
Thousands separator.
decimal : str, default '.'
Character to recognize as decimal point. E.g. use ',' for European data.
float_precision : string, default None
Specifies which converter the C engine should use for floating-point values. The options are None for the ordinary converter, high for the high-precision converter, and round_trip for the round-trip converter.
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
quotechar : str (length 1)
The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default 0
Control field quoting behavior per csv.QUOTE_* constants.
Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

doublequote boolean, default True
When quotechar is specified and quoting is not QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar elements inside a field as a single quotechar element.

escapechar str (length 1), default None
One-character string used to escape delimiter when quoting is QUOTE_NONE.

comment str, default None
Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether.
This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 will result in 'a,b,c' being treated as the header.

encoding str, default None
Encoding to use for UTF when reading/writing (e.g. 'utf-8').
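Two of the options above can be sketched as follows. This is a minimal illustration, not part of the original text: the sample strings and the variable names euro, prices and commented are made up for the example, and the first call assumes the default C parser.

```python
from io import StringIO

import pandas as pd

# European-style numbers: ';' separates fields, '.' groups thousands,
# ',' marks the decimal point. (Sample data is hypothetical.)
euro = "item;price\nwidget;1.234,5\ngadget;99,25"
prices = pd.read_csv(StringIO(euro), sep=";", thousands=".", decimal=",")
print(prices["price"].tolist())

# A fully commented first line is ignored, so header=0 picks up 'a,b,c'.
commented = "#empty\na,b,c\n1,2,3"
df = pd.read_csv(StringIO(commented), comment="#", header=0)
print(list(df.columns))
```

Passing thousands and decimal together lets the parser return floats directly, so no string cleanup is needed after loading.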
dialect str or csv.Dialect instance, default None
If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting. If it is necessary to override values, a ParserWarning will be issued. See the csv.Dialect documentation for more details.

Error handling

error_bad_lines boolean, optional, default None
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned.
If False, then these "bad lines" will be dropped from the DataFrame that is returned. See below.
Deprecated since version 1.3.0: The on_bad_lines parameter should be used instead to specify behavior upon encountering a bad line.

warn_bad_lines boolean, optional, default None
If error_bad_lines is False, and warn_bad_lines is True, a warning for each "bad line" will be output.
Deprecated since version 1.3.0: The on_bad_lines parameter should be used instead to specify behavior upon encountering a bad line.
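The deprecation notes above point to on_bad_lines as the replacement. A minimal sketch of the two most common settings, assuming pandas 1.3 or later; the sample string raw and the names raised and clean are hypothetical:

```python
from io import StringIO

import pandas as pd

# The third line has four fields, one more than the header declares.
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"

# Default behaviour (on_bad_lines='error'): parsing fails with a ParserError.
try:
    pd.read_csv(StringIO(raw))
    raised = False
except pd.errors.ParserError:
    raised = True

# on_bad_lines='skip' drops the offending line without raising or warning.
clean = pd.read_csv(StringIO(raw), on_bad_lines="skip")
print(raised, len(clean))
```

Skipping is convenient for quick exploration, but it discards data silently, so 'warn' may be the safer choice when the input quality is unknown.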
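
on_bad_lines ('error', 'warn', 'skip'), default 'error'
Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are:
'error', raise an exception when a bad line is encountered.
'warn', issue a warning when a bad line is encountered and skip that line.
'skip', skip bad lines without raising or warning when they are encountered.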
New in version 1.3.0.

Specifying column data types
You can indicate the data type for the whole DataFrame or individual columns:

    from io import StringIO
    import numpy as np
    import pandas as pd

    data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"
    print(data)
    # a,b,c,d
    # 1,2,3,4
    # 5,6,7,8
    # 9,10,11

    # Read every column as object (string) data.
    df = pd.read_csv(StringIO(data), dtype=object)
    df
    #    a   b   c    d
    # 0  1   2   3    4
    # 1  5   6   7    8
    # 2  9  10  11  NaN
    df["a"][0]
    # '1'

    # Set dtypes per column; columns not listed are inferred.
    df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})
    df.dtypes
    # a      int64
    # b     object
    # c    float64
    # d      Int64
    # dtype: object

Fortunately, pandas offers more than one way to ensure that your column(s) contain only one dtype.
If you are unfamiliar with these concepts, the pandas documentation covers dtypes and object conversion in more detail.

For instance, you can use the converters argument of read_csv():

    data = "col_1\n1\n2\n'A'\n4.22"
    # Pass every value in col_1 through str.
    df = pd.read_csv(StringIO(data), converters={"col_1": str})
    df
    #   col_1
    # 0     1
    # 1     2
    # 2   'A'
    # 3  4.22
    df["col_1"].apply(type).value_counts()

Or you can use the to_numeric() function to coerce the dtypes after reading in the data:

    df2 = pd.read_csv(StringIO(data))
    df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce")
    df2
    #    col_1
    # 0   1.00
    # 1   2.00
    # 2    NaN
    # 3   4.22
    df2["col_1"].apply(type).value_counts()

which converts all valid parsing to floats, leaving the invalid parsing as NaN.

Ultimately, how you deal with reading in columns containing mixed dtypes depends on your specific needs. In the case above, if you want to NaN out the data anomalies, then to_numeric() is probably your best option. However, if you want all the data to be coerced, no matter the type, then using the converters argument of read_csv() would certainly be worth trying.

Note
In some cases, reading in abnormal data with columns containing mixed dtypes will result in an inconsistent dataset. If you rely on pandas to infer the dtypes of your columns, the parsing engine will infer the dtypes for different chunks of the data, rather than for the whole dataset at once. Consequently, you can end up with column(s) with mixed dtypes. For example,

    col_1 = list(range(500000)) + ["a", "b"] + list(range(500000))
    df = pd.DataFrame({"col_1": col_1})
    df.to_csv("foo.csv")
    mixed_df = pd.read_csv("foo.csv")
    mixed_df["col_1"].apply(type).value_counts()

will result in mixed_df containing an int dtype for certain chunks of the column, and str for others, due to the mixed dtypes from the data that was read in. It is important to note that the overall column will be marked with a dtype of object, which is used for columns with mixed dtypes.

Specifying categorical dtype
Categorical columns can be parsed directly by specifying dtype='category' or dtype=CategoricalDtype(categories, ordered).

    data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
    pd.read_csv(StringIO(data))
    #   col1 col2  col3
    # 0    a    b     1
    # 1    a    b     2
    # 2    c    d     3
    pd.read_csv(StringIO(data)).dtypes
    # col1    object
    # col2    object
    # col3     int64
    # dtype: object
    pd.read_csv(StringIO(data), dtype="category").dtypes
    # col1    category
    # col2    category
    # col3    category
    # dtype: object

Individual columns can be parsed as a Categorical using a dict specification:

    pd.read_csv(StringIO(data), dtype={"col1": "category"}).dtypes
    # col1    category
    # col2      object
    # col3       int64
    # dtype: object

Specifying dtype='category' will result in an unordered Categorical whose categories are the unique values observed in the data.
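The claim about unordered categories built from the observed values can be checked directly. A small sketch, reusing the sample data from this section; the variable name col1 is only illustrative:

```python
from io import StringIO

import pandas as pd

data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
col1 = pd.read_csv(StringIO(data), dtype={"col1": "category"})["col1"]

# The categories are the unique values seen in the column.
print(list(col1.cat.categories))
# The resulting categorical carries no ordering.
print(col1.cat.ordered)
```

Inspecting .cat.categories after loading is a quick way to spot unexpected values, such as stray whitespace or typos, before fixing the categories explicitly.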
For more control on the categories and order, create a CategoricalDtype ahead of time, and pass that for that column's dtype.

    from pandas.api.types import CategoricalDtype

    dtype = CategoricalDtype(["d", "c", "b", "a"], ordered=True)
    pd.read_csv(StringIO(data), dtype={"col1": dtype}).dtypes
    # col1    category
    # col2      object
    # col3       int64
    # dtype: object

When using dtype=CategoricalDtype, "unexpected" values outside of dtype.categories are treated as missing values. This matches the behavior of Categorical.set_categories().

Note
With dtype='category', the resulting categories will always be parsed as strings (object dtype). If the categories are numeric they can be converted using the to_numeric() function, or as appropriate, another converter such as to_datetime().
When dtype is a CategoricalDtype with homogeneous categories (all numeric, all datetimes, etc.), the conversion is done automatically.

Naming and using columns
Handling column names
A file may or may not have a header row.
pandas assumes the first row should be used as the column names. By specifying the names argument in conjunction with header you can indicate other names to use and whether or not to throw away the header row (if any). If the header is in a row other than the first, pass the row number to header. This will skip the preceding rows.

Note
Default behavior is to infer the column names: if no names are passed, the behavior is identical to header=0 and column names are inferred from the first non-blank line of the file; if column names are passed explicitly, then the behavior is identical to header=None.

Duplicate names parsing
If the file or header contains duplicate names, pandas will by default distinguish between them so as to prevent overwriting data. There is no more duplicate data because mangle_dupe_cols=True by default, which modifies a series of duplicate columns 'X', ..., 'X' to become 'X', 'X.1', ..., 'X.N'.

Filtering columns ( |