Working With DataFrames in Python
This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: The Pandas DataFrame: Working With Data Efficiently
The Pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields. DataFrames are similar to SQL tables or the spreadsheets that you work with in Excel or Calc. In many cases, DataFrames are faster, easier to use, and more powerful than tables or spreadsheets because they’re an integral part of the Python and NumPy ecosystems. In this tutorial, you’ll learn:
It’s time to get started with Pandas DataFrames!

Introducing the Pandas DataFrame

Pandas DataFrames are data structures that contain:
You can start working with DataFrames by importing Pandas: >>>
Now that you have Pandas imported, you can work with DataFrames. Imagine you’re using Pandas to analyze data about job candidates for a position developing web applications with Python. Say you’re interested in the candidates’ names, cities, ages, and scores on a Python programming test.
In this table, the first row contains the column labels. Now you have everything you need to create a Pandas DataFrame. There are several ways to create one. In most cases, you’ll use the DataFrame constructor and provide the data, labels, and other information.
For this example, assume you’re using a dictionary to pass the data and a sequence to pass the row labels. Now you’re ready to create a Pandas DataFrame:
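As a minimal sketch of this step, the following builds a small DataFrame from a dictionary and a list of row labels. The candidate names and values are invented for illustration and are not the tutorial’s original data.

```python
import pandas as pd

# Hypothetical candidate data; every value here is made up for illustration.
data = {
    "name": ["Ana", "Ben", "Chen", "Dara"],
    "city": ["Lima", "Oslo", "Pune", "Cork"],
    "age": [41, 28, 33, 34],
    "py-score": [88.0, 79.0, 81.0, 80.0],
}
row_labels = [101, 102, 103, 104]

# The dictionary keys become column labels; index supplies the row labels.
df = pd.DataFrame(data=data, index=row_labels)
print(df)
```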
That’s it!
[Figure: the labels and data of the DataFrame. The row labels are outlined in blue, the column labels in red, and the data values in purple.]

Pandas DataFrames can sometimes be very large, making it impractical to look at all the rows at once. You can use .head() to show the first few rows and .tail() to show the last few.

That’s how you can show just the beginning or end of a Pandas DataFrame. The parameter n specifies the number of rows to show.
You can access a column in a Pandas DataFrame the same way you would get a value from a dictionary, by passing the column label in square brackets.

This is the most convenient way to get a column from a Pandas DataFrame. If the name of the column is a string that is a valid Python identifier, then you can use dot notation to access it. That is, you can access the column the same way you would get the attribute of a class instance.

That’s how you get a particular column. It’s important to notice that you’ve extracted both the data and the corresponding row labels. Each column of a Pandas DataFrame is an instance of pandas.Series, a structure that holds one-dimensional data and its labels.
You can also access a whole row with the .loc[] accessor:
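The column and row access described above can be sketched as follows. The data is invented for illustration, and the variable names are only placeholders.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"name": ["Ana", "Ben", "Chen"], "city": ["Lima", "Oslo", "Pune"]},
    index=[101, 102, 103],
)

# Dictionary-style and dot-style access return the same column.
cities = df["city"]
cities_via_dot = df.city

# .loc[] with a row label returns that row as a Series
# whose labels are the DataFrame's column labels.
row = df.loc[103]
print(type(cities).__name__, type(row).__name__)
```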
This time, you’ve extracted the row that corresponds to the given label. The returned row is also an instance of pandas.Series.

Creating a Pandas DataFrame

As already mentioned, there are several ways to create a Pandas DataFrame. In this section, you’ll learn to do this using the DataFrame constructor along with Python dictionaries, Python lists, two-dimensional NumPy arrays, and files. There are other methods as well, which you can learn about in the official documentation. You can start by importing Pandas along with NumPy, which you’ll use throughout the following examples.
That’s it. Now you’re ready to create some DataFrames.

Creating a Pandas DataFrame With Dictionaries

As you’ve already seen, you can create a Pandas DataFrame with a Python dictionary.
The keys of the dictionary are the DataFrame’s column labels, and the dictionary values are the data values in the corresponding DataFrame columns. The values can be contained in a tuple, list, one-dimensional NumPy array, or Pandas Series object. It’s possible to control the order of the columns with the columns parameter and the row labels with index:
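A small sketch of a dictionary-based constructor call, assuming illustrative column names x, y, and z that are not taken from the text above. It also shows that a single scalar value is repeated along a column when row labels are given.

```python
import numpy as np
import pandas as pd

# Values may be lists, NumPy arrays, or a single scalar (illustrative data).
d = {"x": [1, 2, 3], "y": np.array([2, 4, 8]), "z": 100}

# columns fixes the column order; index sets the row labels.
df = pd.DataFrame(d, index=[100, 200, 300], columns=["z", "y", "x"])
print(df)
```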
As you can see, you’ve specified the row labels explicitly.

Creating a Pandas DataFrame With Lists

Another way to create a Pandas DataFrame is to use a list of dictionaries:
Again, the dictionary keys are the column labels, and the dictionary values are the data values in the DataFrame. You can also use a nested list, or a list of lists, as the data values. If you do, then it’s wise to explicitly specify the labels of columns, rows, or both when you create the DataFrame: >>>
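Both list-based approaches can be sketched like this. The column names and numbers are illustrative assumptions.

```python
import pandas as pd

# A list of dictionaries: each dictionary is one row.
records = [
    {"x": 1, "y": 2, "z": 100},
    {"x": 2, "y": 4, "z": 100},
    {"x": 3, "y": 8, "z": 100},
]
df_from_dicts = pd.DataFrame(records)

# A nested list: each inner list is one row, so the column labels
# have to be passed explicitly.
nested = [[1, 2, 100], [2, 4, 100], [3, 8, 100]]
df_from_lists = pd.DataFrame(nested, columns=["x", "y", "z"])

print(df_from_lists)
```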
That’s how you can use a nested list to create a Pandas DataFrame. You can also use a list of tuples in the same way. To do so, just replace the nested lists in the example above with tuples.

Creating a Pandas DataFrame With NumPy Arrays

You can pass a two-dimensional NumPy array to the DataFrame constructor the same way you do with a list.
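Although this example looks almost the same as the nested list implementation above, it has one advantage: You can specify the optional parameter copy. When copy is False, the data from the NumPy array isn’t copied, so changes to the array can show up in the DataFrame: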
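Here is a hedged sketch of the copy parameter. Whether a DataFrame created with copy=False still shares memory with the array depends on the installed Pandas version and its copy-on-write settings, so only the copy=True result is stated with certainty. The array contents are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 100], [2, 4, 100], [3, 8, 100]])

# copy=False asks Pandas not to copy; whether the result shares memory
# with arr depends on the Pandas version and configuration.
df_shared = pd.DataFrame(arr, columns=["x", "y", "z"], copy=False)

# copy=True always gives the DataFrame its own independent data.
df_copy = pd.DataFrame(arr, columns=["x", "y", "z"], copy=True)

arr[0, 0] = 1000  # modify the original array

print(df_shared.loc[0, "x"])  # may or may not reflect the change
print(df_copy.loc[0, "x"])    # keeps the original value
```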
As you can see, when you change the first item of the array, the DataFrame can change as well. If this behavior isn’t what you want, then you should specify copy=True in the DataFrame constructor. That way, the DataFrame is created with a copy of the values from the array.

Creating a Pandas DataFrame From Files

You can save and load the data and labels from a Pandas DataFrame to and from a number of file types, including CSV, Excel, SQL, JSON, and more. This is a very powerful feature. You can save your job candidate DataFrame to a CSV file with .to_csv().
The statement above will produce a CSV file in your working directory. Now that you have a CSV file with data, you can load it with read_csv():
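The save-and-load round trip can be sketched as follows. The file name candidates.csv and the data are hypothetical, and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"name": ["Ana", "Ben"], "age": [41, 28], "py-score": [88.0, 79.0]},
    index=[101, 102],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "candidates.csv")  # hypothetical name

    # Write the data together with the row and column labels.
    df.to_csv(csv_path)

    # index_col=0 tells read_csv() that the first column holds row labels.
    loaded = pd.read_csv(csv_path, index_col=0)

print(loaded)
```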
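That’s how you get a Pandas DataFrame from a file.

Retrieving Labels and Data

Now that you’ve created your DataFrame, you can start retrieving information from it. With Pandas, you can retrieve and modify row and column labels as sequences, represent data as NumPy arrays, check and adjust the data types, and analyze the size of DataFrame objects.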
Pandas DataFrame Labels as Sequences

You can get the DataFrame’s row labels with .index and its column labels with .columns.
Now you have the row and column labels as special kinds of sequences. As you can with any other Python sequence, you can get a single item: >>>
In addition to extracting a particular item, you can apply other sequence operations, including iterating through the labels of rows or columns. However, this is rarely necessary since Pandas offers other ways to iterate over DataFrames, which you’ll see in a later section. You can also use this approach to modify the labels: >>>
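A brief sketch of reading and replacing labels, using invented data. The new row labels are produced with numpy.arange() purely as an example.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"name": ["Ana", "Ben", "Chen"], "city": ["Lima", "Oslo", "Pune"]},
    index=[101, 102, 103],
)

row_labels = df.index        # Index object with the row labels
column_labels = df.columns   # Index object with the column labels
second_column = df.columns[1]  # single items work as with other sequences

# Replace all row labels at once by assigning a new sequence.
df.index = np.arange(10, 13)
print(df)
```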
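In this example, you assign a whole new sequence of row labels to .index. Keep in mind that if you try to modify a particular item of .index or .columns, then you’ll get a TypeError, because these label sequences are immutable.

Data as NumPy Arrays

Sometimes you might want to extract data from a Pandas DataFrame without its labels. To get a NumPy array with the unlabeled data, you can use either .to_numpy() or .values.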
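Both .to_numpy() and .values return a NumPy array with the data from the Pandas DataFrame. The Pandas documentation suggests using .to_numpy() because of the flexibility offered by its optional parameters, such as dtype and copy.

Data Types

The types of the data values, also called data types or dtypes, are important because they determine the amount of memory your DataFrame uses, as well as its calculation speed and level of precision. Pandas relies heavily on NumPy data types. However, Pandas 1.0 introduced some additional types, such as BooleanDtype for Boolean data that can contain missing values and StringDtype for a dedicated string type.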
You can get the data types for each column of a Pandas DataFrame with .dtypes. If you want to modify the data type of one or more columns, then you can use .astype():
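A minimal sketch of inspecting and converting data types. The choice of 32-bit target types and the data itself are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"age": [41, 28, 33], "py-score": [88.0, 79.0, 81.0]})

# .dtypes returns a Series that maps column labels to data types.
print(df.dtypes)

# .astype() accepts a dictionary that maps column labels to new types
# and returns a new DataFrame; df itself is left unchanged.
df_small = df.astype({"age": np.int32, "py-score": np.float32})
print(df_small.dtypes)
```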
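The most important and only mandatory parameter of .astype() is dtype. It expects a data type or a dictionary that maps column labels to data types.

Pandas DataFrame Size

The attributes .ndim, .shape, and .size return the number of dimensions, the number of data values along each dimension, and the total number of data values, respectively.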
You can even check the amount of memory used by each column with .memory_usage():
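The size-related attributes and .memory_usage() can be sketched as follows, with invented data. The exact byte counts depend on the platform and Pandas version, so they are printed rather than stated.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"name": ["Ana", "Ben", "Chen"], "age": [41, 28, 33]},
    index=[101, 102, 103],
)

print(df.ndim)   # number of dimensions
print(df.shape)  # (number of rows, number of columns)
print(df.size)   # total number of data values

# Returns a Series of byte counts, one entry for the index and one per column.
memory = df.memory_usage()
print(memory)
```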
Accessing and Modifying Data

You’ve already learned how to get a particular row or column of a Pandas DataFrame as a Series object.
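In the first example, you access a column by its label, and in the second, a row.

Getting Data With Accessors

In addition to the .loc[] accessor, which you can use to get rows or columns by their labels, Pandas offers the .iloc[] accessor, which retrieves a row or column by its integer index.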
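Pandas has four accessors in total: .loc[] accepts the labels of rows and columns, .iloc[] accepts zero-based integer indices, .at[] accepts a pair of labels and returns a single value, and .iat[] accepts a pair of integer indices and returns a single value. Of these, .loc[] and .iloc[] are particularly powerful. They support slicing and NumPy-style indexing.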
Just as you can with NumPy, you can provide slices along with lists or arrays instead of indices to get multiple rows or columns: >>>
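A small sketch contrasting label-based and position-based slicing, with invented data. Note that the slice passed to .loc[] includes its stop label, while the slice passed to .iloc[] excludes its stop index.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chen", "Dara", "Eli"],
        "city": ["Lima", "Oslo", "Pune", "Cork", "Rome"],
        "age": [41, 28, 33, 34, 38],
    },
    index=[101, 102, 103, 104, 105],
)

# Label-based: rows 102 through 104 inclusive, two columns by label.
by_label = df.loc[102:104, ["name", "city"]]

# Position-based: rows at positions 1, 2, 3 (stop index 4 is excluded).
by_position = df.iloc[1:4, [0, 1]]

print(by_label)
```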
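Both statements return a Pandas DataFrame with the intersection of the desired five rows and two columns. This brings up a very important difference between .loc[] and .iloc[]. With .loc[], both the start and stop labels of a slice are included in the result. With .iloc[], the stop index is excluded, just as it is when you slice Python sequences and NumPy arrays. You can skip rows and columns with .iloc[] the same way you can with slicing tuples, lists, and NumPy arrays.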
In this example, you specify the desired row indices with a slice that includes a step. Instead of using the slicing construct, you could also use the built-in Python class slice().
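You might find one of these approaches more convenient than others depending on your situation. It’s possible to use .loc[] and .iloc[] to get particular data values. However, when you need only a single value, Pandas recommends using the specialized accessors .at[] and .iat[].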
Setting Data With Accessors

You can use accessors to modify parts of a Pandas DataFrame by passing a Python sequence, NumPy array, or single value:
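Setting values through accessors can be sketched like this. The data and the replacement values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"name": ["Ana", "Ben", "Chen", "Dara"], "py-score": [88.0, 79.0, 81.0, 80.0]},
    index=[101, 102, 103, 104],
)

# Assign a sequence to the rows up to and including label 102.
df.loc[:102, "py-score"] = [40.0, 50.0]

# Assign a single value to every row from label 103 onward.
df.loc[103:, "py-score"] = 0.0
after_loc = df["py-score"].tolist()

# Assign a NumPy array to the whole last column by position.
df.iloc[:, -1] = np.array([88.0, 79.0, 81.0, 80.0])
after_iloc = df["py-score"].tolist()

print(after_loc, after_iloc)
```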
The following example shows that you can use negative indices with .iloc[] to access or modify data.
In this example, you’ve accessed and modified the last column.

Inserting and Deleting Data

Pandas provides several convenient techniques for inserting and deleting rows or columns. You can choose among them based on your situation and needs.

Inserting and Deleting Rows

Imagine you want to add a new person to your list of job candidates. You can start by creating a new Series object that represents this new candidate.
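The new object has labels that correspond to the column labels of the DataFrame. You can add it as a new row at the end of the DataFrame. Note that DataFrame.append(), which older Pandas code uses for this, was removed in Pandas 2.0, so pd.concat() is the way to do it in current versions. You can delete the new row again with a single call to .drop():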
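A hedged sketch of adding and removing a row in current Pandas versions. It uses pd.concat() with a one-row DataFrame rather than a Series, which is a substitution made here to keep the column data types intact. The candidate data is invented.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"name": ["Ana", "Ben"], "age": [41, 28], "py-score": [88.0, 79.0]},
    index=[101, 102],
)

# The new candidate as a one-row DataFrame with the same column labels.
new_candidate = pd.DataFrame(
    {"name": ["Chen"], "age": [33], "py-score": [81.0]}, index=[103]
)

# pd.concat() returns a new DataFrame with the row added at the end.
df_added = pd.concat([df, new_candidate])

# .drop() with labels removes rows and also returns a new DataFrame.
df_restored = df_added.drop(labels=[103])

print(df_added)
```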
Here, .drop() removes the rows specified with the parameter labels.

Inserting and Deleting Columns

The most straightforward way to insert a column in a Pandas DataFrame is to follow the same procedure that you use when you add an item to a dictionary. Here’s how you can append a column containing your candidates’ scores on a JavaScript test.
Now the original DataFrame has one more column at its end. You don’t have to provide a full sequence of values. You can add a new column with a single value.
If you’ve used dictionaries in the past, then this way of inserting columns might be familiar to you. However, it doesn’t allow you to specify the location of the new column. If the location of the new column is important, then you can use .insert():
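The three ways of adding a column described above can be sketched as follows. The column labels js-score, total-score, and django-score, and all values, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"name": ["Ana", "Ben", "Chen"], "py-score": [88.0, 79.0, 81.0]},
    index=[101, 102, 103],
)

# Dictionary-style assignment appends a column at the end.
df["js-score"] = np.array([71.0, 95.0, 88.0])

# A single value is repeated for every row.
df["total-score"] = 0.0

# .insert() places the new column at a chosen zero-based position.
df.insert(loc=2, column="django-score", value=np.array([86.0, 81.0, 78.0]))

print(df)
```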
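You’ve just inserted another column with the score of the Django test. The parameter loc determines the location, or the zero-based index, of the new column in the Pandas DataFrame. The parameter column sets the label of the new column, and value specifies the data values to insert. You can delete one or more columns from a Pandas DataFrame just as you would with a regular Python dictionary, by using the del statement.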
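You can also remove one or more columns with .drop(), as you did previously with the rows. When you want to remove columns, you need to provide the argument axis=1.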
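By default, .drop() returns the DataFrame without the specified columns unless you pass inplace=True.

Applying Arithmetic Operations

You can apply basic arithmetic operations such as addition, subtraction, multiplication, and division to Pandas Series and DataFrame objects the same way you would with NumPy arrays.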
You can use this technique to insert a new column to a Pandas DataFrame. For example, try calculating a total score as a linear combination of your candidates’ individual test scores.
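Now your DataFrame has a column with a total score.

Applying NumPy and SciPy Functions

Most NumPy and SciPy routines can be applied to Pandas Series or DataFrame objects as arguments instead of NumPy arrays. Instead of passing a NumPy array to a NumPy function, you can pass a part of your Pandas DataFrame.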
But that’s not all! You can use the NumPy array returned by the function as a new column of the DataFrame:
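As an illustration, the following computes a weighted total score twice: once with plain arithmetic on columns and once with numpy.average(). The weights, column labels, and scores are hypothetical, and numpy.average() is simply one NumPy routine that fits this purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical scores and weights for illustration only.
df = pd.DataFrame(
    {
        "py-score": [88.0, 79.0],
        "django-score": [86.0, 81.0],
        "js-score": [71.0, 95.0],
    },
    index=[101, 102],
)
weights = [0.4, 0.3, 0.3]

# Column arithmetic works element-wise, as with NumPy arrays.
manual_total = (
    0.4 * df["py-score"] + 0.3 * df["django-score"] + 0.3 * df["js-score"]
)

# The same result from a NumPy routine, stored as a new column.
score_columns = ["py-score", "django-score", "js-score"]
df["total"] = np.average(df[score_columns].to_numpy(), axis=1, weights=weights)

print(df)
```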
The result is the same as in the previous example, but here you used the existing NumPy function instead of writing your own code.

Sorting a Pandas DataFrame

You can sort a Pandas DataFrame with .sort_values().
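This example sorts your DataFrame by the values in a single column, which you specify with the parameter by. The optional parameter ascending determines the sort direction. If you want to sort by multiple columns, then just pass lists as arguments for by and ascending: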
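Sorting by one column and by several columns can be sketched as follows, using invented data.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"py-score": [88.0, 79.0, 88.0], "js-score": [71.0, 95.0, 79.0]},
    index=[101, 102, 103],
)

# Sort by a single column in descending order.
by_js = df.sort_values(by="js-score", ascending=False)

# Sort by py-score descending, then break ties by js-score ascending.
by_both = df.sort_values(by=["py-score", "js-score"], ascending=[False, True])

print(by_js)
print(by_both)
```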
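In this case, the DataFrame is sorted by the first column you specify, and rows with equal values there are ordered by the next one. The optional parameter inplace can also be used with .sort_values(). It’s set to False by default, ensuring that .sort_values() returns a new Pandas DataFrame. If you’ve ever tried to sort values in Excel, then you might find the Pandas approach much more efficient and convenient. When you have large amounts of data, Pandas can significantly outperform Excel. For more information on sorting in Pandas, check out Pandas Sort: Your Guide to Sorting Data in Python.

Filtering Data

Data filtering is another powerful feature of Pandas. It works similarly to indexing with Boolean arrays in NumPy. If you apply some logical operation on a Series object, then you’ll get another Series with the Boolean values True and False.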
You now have a Series filled with Boolean data. Passing it to the DataFrame in square brackets returns the rows that correspond to True.
You can create very powerful and sophisticated expressions by combining logical operations with the operators NOT (~), AND (&), OR (|), and XOR (^).
For example, you can get a DataFrame with the candidates whose scores satisfy several conditions at once:
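Filtering with a Boolean Series and with combined conditions can be sketched like this. The threshold of 80 and the data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"py-score": [88.0, 79.0, 81.0, 80.0], "js-score": [71.0, 95.0, 88.0, 79.0]},
    index=[101, 102, 103, 104],
)

# A comparison on a column yields a Boolean Series.
mask = df["py-score"] >= 80

# Indexing with the mask keeps only the rows where it is True.
good_python = df[mask]

# Combine conditions with & (AND); each condition needs its own parentheses.
good_both = df[(df["py-score"] >= 80) & (df["js-score"] >= 80)]

print(good_both)
```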
You can also apply NumPy logical routines instead of operators. For some operations that require data filtering, it’s more convenient to use .where(). It replaces the values in the positions where the provided condition isn’t satisfied.
Determining Data Statistics

Pandas provides many statistical methods for DataFrames. You can get basic statistics for the numerical columns of a Pandas DataFrame with .describe().
If you want to get particular statistics for some or all of your columns, then you can call methods such as .mean() or .std():
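A short sketch of the statistical methods mentioned above. The data is invented, and the numeric columns are selected explicitly so the calls behave the same when the DataFrame also contains text columns.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chen", "Dara"],
        "age": [41, 28, 33, 34],
        "py-score": [88.0, 79.0, 81.0, 80.0],
    }
)
numeric = df[["age", "py-score"]]

# Summary table: count, mean, std, min, quartiles, and max per column.
summary = numeric.describe()

# On a DataFrame, these methods return a Series with one value per column.
means = numeric.mean()
stds = numeric.std()

# On a single column (a Series), they return a scalar.
py_mean = df["py-score"].mean()

print(summary)
```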
When applied to a Pandas DataFrame, these methods return Series with the results for each column. When applied to a Series object, or a single column of a DataFrame, the methods return scalars. To learn more about statistical calculations with Pandas, check out Descriptive Statistics With Python and NumPy, SciPy, and Pandas: Correlation With Python.

Handling Missing Data

Missing data is very common in data science and machine learning. But never fear! Pandas has very powerful features for working with missing data. In fact, its documentation has an entire section dedicated to working with missing data. Pandas usually represents missing data with NaN (not a number) values. In Python, you can get NaN with float("nan"), math.nan, or numpy.nan. Here’s an example of a Pandas DataFrame with a missing value.
Calculating With Missing Data

Many Pandas methods omit NaN values when performing calculations unless they are explicitly instructed not to.
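In the first example, the calculation ignores the missing value. However, if you instruct a method not to skip NaN values with skipna=False, then it returns NaN whenever there’s a missing value among the data.

Filling Missing Data

Pandas has several options for filling, or replacing, missing values with other values. One of the most convenient methods is .fillna(). You can replace missing values with a specified value, with the value above the missing value (forward fill), or with the value below it (backward fill). Here’s how you can apply the options mentioned above: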
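The three filling options can be sketched as follows. Recent Pandas versions deprecate passing method to .fillna(), so this sketch uses .ffill() and .bfill() for forward and backward filling instead. The data is invented.

```python
import numpy as np
import pandas as pd

# One column with a single missing value (illustrative data).
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0]})

# Replace missing values with a fixed value.
filled_zero = df.fillna(value=0)

# Forward fill: copy the value above the missing one.
filled_forward = df.ffill()

# Backward fill: copy the value below the missing one.
filled_backward = df.bfill()

print(filled_zero, filled_forward, filled_backward, sep="\n")
```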
Another popular option is to apply interpolation and replace missing values with interpolated values. You can do this with .interpolate().
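As you can see, .interpolate() replaces the missing value with an interpolated value. You can also use the optional parameter inplace with .fillna(). With inplace=False, the method creates and returns a new DataFrame. With inplace=True, it modifies the existing DataFrame and returns None. The default setting for inplace is False.

Deleting Rows and Columns With Missing Data

In certain situations, you might want to delete rows or even columns that have missing values. You can do this with .dropna():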
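Dropping rows or columns that contain missing values can be sketched like this, using invented data.

```python
import numpy as np
import pandas as pd

# Column x has one missing value; column y is complete (illustrative data).
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0], "y": [10, 20, 30, 40]})

# Default (axis=0): drop every row that contains a missing value.
rows_dropped = df.dropna()

# axis=1: drop every column that contains a missing value.
columns_dropped = df.dropna(axis=1)

print(rows_dropped)
print(columns_dropped)
```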
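Iterating Over a Pandas DataFrame

As you learned earlier, a DataFrame’s row and column labels can be retrieved as sequences with .index and .columns. You can use this feature to iterate over labels and get or set data values. However, Pandas provides several more convenient methods for iteration: .items() to iterate over columns, .iterrows() to iterate over rows, and .itertuples() to iterate over rows and get named tuples.

With .items(), you iterate over the columns of a Pandas DataFrame. Each iteration yields a tuple with the name of the column and the column data as a Series object.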
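That’s how you use .items(). With .iterrows(), you iterate over the rows of a Pandas DataFrame. Each iteration yields a tuple with the name of the row and the row data as a Series object.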
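That’s how you use .iterrows(). Similarly, .itertuples() iterates over the rows and in each iteration yields a named tuple with the row label and the row data: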
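The three iteration methods can be sketched as follows. The data and the tuple name Candidate are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"name": ["Ana", "Ben"], "age": [41, 28]}, index=[101, 102])

# .items() yields (column label, column as Series) pairs.
column_labels = [label for label, column in df.items()]

# .iterrows() yields (row label, row as Series) pairs.
ages_by_row = {row_label: row["age"] for row_label, row in df.iterrows()}

# .itertuples() yields one named tuple per row; Index holds the row label.
names = [row.name for row in df.itertuples(name="Candidate")]
row_labels = [row.Index for row in df.itertuples(name="Candidate")]

print(column_labels, ages_by_row, names, row_labels)
```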
You can specify the name of the named tuple with the parameter name, which is set to 'Pandas' by default. You can also specify whether to include row labels with index, which is set to True by default.

Working With Time Series

Pandas excels at handling time series. Although this functionality is partly based on NumPy datetimes and timedeltas, Pandas provides much more flexibility.

Creating DataFrames With Time-Series Labels

In this section, you’ll create a Pandas DataFrame using the hourly temperature data from a single day. You can start by creating a list (or tuple, NumPy array, or other data type) with the data values, which will be hourly temperatures given in degrees Celsius.
Now you have a variable that refers to the list of temperature values. The next step is to create a sequence of dates and times. Pandas provides a very convenient function, date_range(), for this purpose.
Now that you have the temperature values and the corresponding dates and times, you can create the DataFrame. In many cases, it’s convenient to use date-time values as the row labels: >>>
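A minimal sketch of building such a DataFrame. The temperature values, the date, and the column label temp_c are all invented for illustration. The hourly frequency is passed as a Timedelta object, which avoids relying on frequency strings whose spelling has changed between Pandas versions.

```python
import pandas as pd

# 24 hypothetical hourly temperatures in degrees Celsius.
temperatures = [
    8.0, 7.1, 6.8, 6.4, 6.0, 5.4, 4.8, 5.0,
    9.1, 12.8, 15.3, 19.1, 21.2, 22.1, 22.4, 23.1,
    21.0, 17.9, 15.5, 14.4, 11.9, 11.0, 10.2, 9.1,
]

# 24 timestamps, one per hour, starting at midnight on an arbitrary date.
timestamps = pd.date_range(
    start="2024-01-15 00:00:00", periods=24, freq=pd.Timedelta(hours=1)
)

# Use the timestamps as row labels.
temp = pd.DataFrame(data={"temp_c": temperatures}, index=timestamps)
print(temp.head())
```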
That’s it! You’ve created a DataFrame with time-series data and date-time row indices.

Indexing and Slicing

Once you have a Pandas DataFrame with time-series data, you can conveniently apply slicing to get just a part of the information:
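Slicing by date-time strings can be sketched as follows. The data is synthetic (the hour number stands in for a temperature), and the date and column label are assumptions.

```python
import pandas as pd

# Synthetic hourly data for one arbitrary day; values are placeholders.
timestamps = pd.date_range(
    start="2024-01-15 00:00:00", periods=24, freq=pd.Timedelta(hours=1)
)
temp = pd.DataFrame({"temp_c": [float(h) for h in range(24)]}, index=timestamps)

# Strings are interpreted as dates and times; both ends are included.
morning_to_afternoon = temp.loc["2024-01-15 05":"2024-01-15 14"]
print(morning_to_afternoon)
```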
This example shows how to extract the temperatures between 05:00 and 14:00 (5 a.m. and 2 p.m.). Although you’ve provided strings, Pandas knows that your row labels are date-time values and interprets the strings as dates and times.

Resampling and Rolling

You’ve just seen how to combine date-time row labels and use slicing to get the information you need from the time-series data. This is just the beginning. It gets better! If you want to split a day into four six-hour intervals and get the mean temperature for each interval, then you’re just one statement away from doing so. Pandas provides the method .resample(), which you can combine with other methods such as .mean().
You now have a new Pandas DataFrame with four rows. Each row corresponds to a single six-hour interval. Instead of .mean(), you can apply .min() or .max() to get the minimum and maximum temperatures for each interval. You might also need to do some rolling-window analysis. This involves calculating a statistic for a specified number of adjacent rows, which make up your window of data. You can “roll” the window by selecting a different set of adjacent rows to perform your calculations on. Your first window starts with the first row in your DataFrame and includes as many adjacent rows as you specify. You then move your window down one row, dropping the first row and adding the row that comes immediately after the last row, and calculate the same statistic again. You repeat this process until you reach the last row of the DataFrame. Pandas provides the method .rolling() for this purpose:
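Resampling into six-hour intervals and a three-row rolling mean can be sketched as follows. The data is synthetic (the hour number stands in for a temperature) so the results are easy to check by hand. The date and column label are assumptions.

```python
import pandas as pd

# Synthetic hourly data for one arbitrary day; values are placeholders.
timestamps = pd.date_range(
    start="2024-01-15 00:00:00", periods=24, freq=pd.Timedelta(hours=1)
)
temp = pd.DataFrame({"temp_c": [float(h) for h in range(24)]}, index=timestamps)

# Mean over each six-hour interval: four rows for one day.
six_hour_means = temp.resample(rule=pd.Timedelta(hours=6)).mean()

# Mean over a window of three adjacent rows; the first two rows have
# no complete window, so their result is NaN.
rolling_means = temp.rolling(window=3).mean()

print(six_hour_means)
print(rolling_means.head())
```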
Now you have a DataFrame with mean temperatures calculated for several three-hour windows. The parameter window specifies the size of the moving time window.

Plotting With Pandas DataFrames

Pandas allows you to visualize data or create plots based on DataFrames. It uses Matplotlib in the background, so exploiting Pandas’ plotting capabilities is very similar to working with Matplotlib. If you want to display the plots, then you first need to import matplotlib.pyplot.

Now you can use .plot() to create the plot and plt.show() to display it. You can also apply .plot.line() and get the same result. You can save your figure by chaining the methods .get_figure() and .savefig().

This statement creates the plot and saves it as a file in your working directory. You can get other types of plots with a Pandas DataFrame. For example, you can visualize your job candidate data from before as a histogram with .plot.hist().
In this example, you extract the Python test score and total score data and visualize it with a histogram. This is just the basic look. You can adjust details with optional parameters.

Further Reading

Pandas DataFrames are very comprehensive objects that support many operations not mentioned in this tutorial. Some of these include hierarchical (multi-level) indexing, grouping, merging, joining, and concatenating, and working with categorical data.
The official Pandas tutorial summarizes some of the available options nicely. If you want to learn more about Pandas and DataFrames, then you can check out these tutorials:
You’ve learned that Pandas DataFrames handle two-dimensional data. If you need to work with labeled data in more than two dimensions, you can check out xarray, another powerful Python library for data science with very similar features to Pandas. If you work with big data and want a DataFrame-like experience, then you might give Dask a chance and use its DataFrame API. A Dask DataFrame contains many Pandas DataFrames and performs computations in a lazy manner.

Conclusion

You now know what a Pandas DataFrame is, what some of its features are, and how you can use it to work with data efficiently. Pandas DataFrames are powerful, user-friendly data structures that you can use to gain deeper insight into your datasets!
You’ve learned enough to cover the fundamentals of DataFrames. If you want to dig deeper into working with data in Python, then check out the entire range of Pandas tutorials.
What are DataFrames in Python?

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used Pandas object.

What is a DataFrame in Python, with an example?

A DataFrame is a two-dimensional data structure, that is, data is aligned in a tabular fashion in rows and columns.

Why are DataFrames used in Python?

The Pandas DataFrame is a widely used data structure that works like a two-dimensional array with labeled axes (rows and columns). It is a standard way to store data that has two different indexes: a row index and a column index.