Can you analyse data with Python?
A step by step guide to get started with data analysis in Python

Photo by Chris Liverani on Unsplash

The Role of a Data Analyst

A data analyst uses programming tools to mine large amounts of complex data, and find relevant information from this data. In short, an analyst is someone who derives meaning from messy data. A data analyst needs to have skills in the following areas, in order to be useful in the workplace:
In this article, I am going to walk you through the end-to-end data analysis process with Python. If you follow along with this tutorial and write the code the way I did, you can reuse the code and tools in future data analysis projects.

We will start by downloading and cleaning the dataset, and then move on to the analysis and visualization. Finally, we will tell a story around our findings.

I will be using a dataset from Kaggle called Pima Indian Diabetes Database, which you can download to perform the analysis.

Pre-Requisites

For this entire analysis, I will be using a Jupyter Notebook. You can use any Python IDE you like. You will need to install libraries along the way, and I will provide links that will walk you through the installation process.

The Analysis

Photo by Luke Chesser on Unsplash

After downloading the dataset, you will need to read the .csv file as a data frame in Python. You can do this using the Pandas library. If you do not have it installed, you can install it by running “pip install pandas” in your terminal. If you face any difficulty with the installation or simply want to learn more about the Pandas library, you can check out the Pandas documentation.

Read the Data

To read the data frame into Python, you will need to import Pandas first. Then, you can read the file and create a data frame with the following lines of code:

import pandas as pd
df = pd.read_csv('diabetes.csv')  # file name assumed; use the path of your downloaded file

To check the head of the data frame, run:

df.head()

Image by Author

From the screenshot above, you can see 9 different variables related to a patient’s health. As an analyst, you will need to have a basic understanding of these variables:
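To get a first look at the variables, the reading step can be sketched as follows. This is a minimal sketch, not the author's exact code: the helper name `load_dataset` is hypothetical, the column names are assumed to match the Kaggle file, and the three rows are invented so the example runs without the download.

```python
import io

import pandas as pd

# Toy stand-in for the downloaded CSV. The column names are assumed to match
# the Kaggle file; the three rows are invented purely for illustration.
TOY_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,25,90,28.5,0.35,31,0
5,160,82,33,180,34.1,0.62,48,1
1,95,64,20,60,23.9,0.21,24,0
"""


def load_dataset(source):
    """Read a CSV (file path or file-like object) into a data frame."""
    return pd.read_csv(source)


# With the real download you would pass its path instead of the toy text.
df = load_dataset(io.StringIO(TOY_CSV))

print(df.head())         # first rows of the data frame
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the variable names
print(df.dtypes)         # the type pandas inferred for each column
```

Printing the column names and types next to `df.head()` is a quick way to confirm that the file was parsed as expected before going further.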
As an analyst, you will need to know the difference between two variable types: numeric and categorical.

Numeric variables are measurements and have a numeric meaning. All the variables in this dataset except for “Outcome” are numeric.

Categorical variables take one of two or more categories; when the categories have no natural order, they are also called nominal variables. The variable “Outcome” is categorical: 0 represents the absence of diabetes, and 1 represents the presence of diabetes.

A Quick Note

Before continuing with the analysis, I would like to make a quick note: analysts are humans, and we often come with preconceived notions of what we expect to see in the data. For example, you would expect an older person to be more likely to have diabetes. You would want to see this correlation in the data, but it might not always be there. Keep an open mind during the analysis process, and do not let your bias affect your decision making.

Pandas Profiling

This is a very useful tool for analysts. It generates an analysis report on the data frame, and helps you better understand the correlation between variables. To generate a Pandas Profiling report, run the following lines of code:

import pandas_profiling as pp
pp.ProfileReport(df)  # assumed completion; the rest of the original snippet is missing

This report will give you some overall statistical information on the dataset, which looks like this:

Image by Author

By just glancing at the dataset statistics, we can see that there are no missing cells or duplicate rows in our data frame. Finding this information usually requires running a few lines of code, but Pandas Profiling generates it with much less effort.

Pandas Profiling also provides more information on each variable. I will show you an example:

Image by Author

This is the information generated for the variable called “Pregnancies.” As an analyst, this report saves a lot of time, as we don’t have to go through each individual variable and run many lines of code. From here, we can see that:
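For readers without the report at hand, similar per-variable figures can be approximated with plain pandas. The sketch below is an assumption about which statistics are of interest, and the values are invented for illustration rather than taken from the real “Pregnancies” column.

```python
import pandas as pd

# Invented values for illustration only; not taken from the real dataset.
df = pd.DataFrame({"Pregnancies": [0, 1, 1, 2, 3, 5, 8]})

col = df["Pregnancies"]

# A hand-rolled per-variable summary using standard pandas calls.
summary = {
    "distinct": col.nunique(),          # number of different values
    "missing": int(col.isna().sum()),   # number of NaN entries
    "zeros": int((col == 0).sum()),     # number of entries equal to 0
    "mean": col.mean(),
    "minimum": col.min(),
    "maximum": col.max(),
}

for name, value in summary.items():
    print(f"{name:>8}: {value}")

# describe() adds the count, standard deviation, and quartiles in one call.
print(col.describe())
```

Running the same lines with another column name gives the equivalent summary for that variable.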
The report provides information like this for each variable, which helps a lot in understanding the dataset and all the columns in it.

Image by Author

The plot above is a correlation matrix. It helps us gain a better understanding of the correlation between the variables in the dataset. There is a slight positive correlation between the variables “Age” and “Skin Thickness”, which can be looked into further in the visualization section of the analysis.

Since there are no missing values or duplicate rows in the data frame, as seen above, we don’t need to do any additional data cleaning.

Data Visualization

Now that we have a basic understanding of each variable, we can try to find the relationships between them. The simplest and fastest way to do this is by generating visualizations.

In this tutorial, we will be using three libraries to get the job done: Matplotlib, Seaborn, and Plotly.

If you are a complete beginner to Python, I suggest starting out by getting familiar with Matplotlib and Seaborn. Both libraries have documentation, and I strongly suggest spending some time reading it and doing tutorials with these two libraries to improve your visualization skills.

Plotly is a library that allows you to create interactive charts, and requires slightly more familiarity with Python to master. Its documentation includes an installation guide and requirements.

If you follow along with this tutorial exactly, you will be able to make beautiful charts with these three libraries. You can then use my code as a template for future analysis or visualization tasks.
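Before plotting, the claim that no cleaning is needed can also be checked directly in pandas, without the profiling report. The following is a minimal sketch on invented rows that deliberately contain one missing value and one repeated row, so that both checks have something to find.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration: one missing value and one repeated row.
df = pd.DataFrame(
    {
        "Glucose": [120.0, 160.0, np.nan, 95.0, 95.0],
        "BMI": [28.5, 34.1, 30.0, 23.9, 23.9],
        "Outcome": [0, 1, 1, 0, 0],
    }
)

missing_per_column = df.isna().sum()         # NaN cells in each column
duplicate_rows = int(df.duplicated().sum())  # rows identical to an earlier row

print(missing_per_column)
print("duplicate rows:", duplicate_rows)

# Cleaning is only needed when the checks above report something.
cleaned = df.dropna().drop_duplicates().reset_index(drop=True)
print(cleaned)
```

If both counts come back as zero on the real data frame, the cleaning step can be skipped, as described above.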
Visualizing the Outcome Variable

First, run the following lines of code to import Matplotlib, Seaborn, Numpy, and Plotly after installation:

# Visualization Imports
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px

Next, run the following lines of code to create a pie chart visualizing the outcome variable:

dist = df['Outcome'].value_counts()
px.pie(values=dist.values, names=dist.index).show()  # assumed completion; the rest of the original snippet is missing

This is done with the Plotly library, and you will get an interactive chart that looks like this:

Image by Author

You can play around with the chart and change the colors, labels, and legend. From the chart above, we can see that most patients in the dataset are not diabetic. Less than half of them have an outcome of 1 (have diabetes).

Correlation Matrix with Plotly

Similar to the correlation matrix generated in Pandas Profiling, we can create one using Plotly:

def df_to_plotly(df):
    return {'z': df.values.tolist(), 'x': df.columns.tolist(), 'y': df.index.tolist()}  # assumed body; the rest of the original snippet is missing

The code above will generate a correlation matrix that is similar to the one shown earlier:

Image by Author

Again, similar to the matrix generated above, a positive correlation can be observed between the variables:
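To list correlated pairs without reading them off the heatmap, the correlation matrix can be flattened and sorted. This is a hedged sketch on invented values, so the coefficients it prints say nothing about the real dataset; only the technique carries over.

```python
import pandas as pd

# Invented values for illustration only; the real dataset gives different numbers.
df = pd.DataFrame(
    {
        "Glucose": [85, 90, 110, 130, 150, 170],
        "Insulin": [40, 55, 80, 120, 160, 200],
        "Age": [50, 22, 41, 29, 35, 60],
    }
)

corr = df.corr()  # Pearson correlation is the pandas default

# Flatten the matrix into (variable_a, variable_b) -> coefficient,
# keeping each unordered pair once and skipping the diagonal.
pairs = {
    (a, b): corr.loc[a, b]
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
}

# Print the strongest positive correlations first.
for (a, b), r in sorted(pairs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{a} vs {b}: {r:.2f}")
```

On the real data frame, the pairs at the top of this listing are the ones worth plotting next.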
To further understand the correlations between variables, we will create some plots:

Visualize Glucose Levels and Insulin

fig = px.scatter(df, x='Glucose', y='Insulin')
fig.show()

Running the code above should give you a plot that looks like this:

Image by Author

There is a positive correlation between the variables glucose and insulin. This makes sense: the Insulin column in this dataset is a measured serum insulin level, and the body normally releases more insulin as blood glucose rises.

Visualize Outcome and Age

Now, we will visualize the variables outcome and age. We will create a boxplot to do so, using the code below:

fig = px.box(df, x='Outcome', y='Age')
fig.show()

The resulting plot will look somewhat like this:

Image by Author

From the plot above, you can see that older people are more likely to have diabetes. The median age for adults with diabetes is around 35, while it is much lower for people without diabetes. However, there are a lot of outliers. The boxplot shows a few elderly people without diabetes, one of them over 80 years old.

Visualizing BMI and Outcome

Finally, we will visualize the variables “BMI” and “Outcome”, to see if there is any correlation between the two. To do this, we will use the Seaborn library:

plot = sns.boxplot(x='Outcome', y='BMI', data=df)
Image by Author

The boxplot created here is similar to the one created above using Plotly. However, Plotly is better at creating interactive visualizations, and its charts look prettier than the ones made in Seaborn.

From the box plot above, we can see that higher BMI correlates with a positive outcome (an outcome of 1). People with diabetes tend to have higher BMIs than people without diabetes.

You can make more visualizations like the ones above by simply changing the variable names and running the same lines of code. I will leave that as an exercise for you, to get a better grasp on your visualization skills with Python.

Data Storytelling

Photo by Blaz Photo on Unsplash

Finally, we can tell a story around the data we have analyzed and visualized. Our findings can be broken down as follows:

People with diabetes tend to be older than people without it. They are also more likely to have higher BMIs, or suffer from obesity. They are also more likely to have higher glucose levels in their blood. People with higher glucose levels also tend to have higher insulin levels, and this positive correlation indicates that patients with diabetes could also have higher insulin levels (this can be checked by creating a scatter plot).

That’s all for this article! I hope you found this tutorial helpful, and can use it as a future reference for projects you need to create. Good luck in your data science journey, and happy learning!
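As a closing reference, the findings in the story above can be checked numerically by comparing group medians. This is only a sketch: the rows are invented for illustration, and the choice of the median rather than the mean is an assumption made to match what a boxplot displays.

```python
import pandas as pd

# Invented rows for illustration only; substitute the real data frame.
df = pd.DataFrame(
    {
        "Age": [22, 25, 31, 45, 52, 38],
        "BMI": [24.0, 27.5, 26.0, 35.0, 33.5, 36.0],
        "Glucose": [90, 100, 105, 160, 150, 170],
        "Outcome": [0, 0, 0, 1, 1, 1],
    }
)

# Median of each variable within each outcome group
# (0 = no diabetes, 1 = diabetes).
medians = df.groupby("Outcome")[["Age", "BMI", "Glucose"]].median()
print(medians)

# Difference between the groups: positive means higher for Outcome == 1.
difference = medians.loc[1] - medians.loc[0]
print(difference)
```

A positive difference for a variable on the real data would support the corresponding statement in the story; a value near zero would be a reason to revisit it.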
Can you analyze data with Python?
Skill Path: Analyze data with Python
Data is everywhere. That means more companies are tracking, analyzing, and using the insights they find to make better decisions. In this Skill Path, you'll learn the fundamentals of data analysis while building Python skills.
Is Python good for data analysts?
Python is a popular multi-purpose programming language widely used for its flexibility, as well as its extensive collection of libraries, which are valuable for analytics and complex calculations.
Which Python library is best for data analysis?
Pandas (Python data analysis) is a must in the data science life cycle. It is the most popular and widely used Python library for data science, along with NumPy and Matplotlib.