Pandas is one of the tools you can use in Python to analyze data. It is a popular library because of its flexibility and the range of functionality it offers.
Pandas is a data analysis library for Python. It provides several data structures that make analysis easier once the data is represented in the appropriate one. It also works well with other well-known Python packages for scientific computing and numerical work.
What is data analysis?
Yes, we know that Pandas is used for data analysis, but what is data analysis?
“Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data”. Source document here.
Sometimes people start learning or applying data science as part of building a career as a programmer. However, in my experience, they usually don’t get far, because when writing Python code they don’t know which technique to use.
My recommendation is simple: learn some statistical methods first, and then you will be able to apply them to solve a problem.
Statistics is not needed to understand this article, but as you go deeper into data analysis you will need it to a certain extent, at least to know what the techniques are used for and how to interpret the results.
What can the Pandas library do for us?
As mentioned before, Pandas is a library for data analysis in Python.
According to this article, we can divide the process of data analysis as follows:
- Data requirements: This step determines what data is required as input to produce the expected output. In other words, what data do we need to solve the specific problem?
- Data collection: Usually, we have to collect data from several sources. You should use as many as you need.
- Data processing: Most of the time you will find unstructured data or data in different formats. Before starting the analysis, you should structure all the data according to the same format.
- Data cleaning: After the data is processed, it can still contain errors, missing values, or duplicates. These need to be fixed or removed; otherwise, the results might not be accurate.
- Exploratory data analysis: “In statistics, exploratory data analysis is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.” Source: Wikipedia.
- Modelling and algorithms: In this step, we can apply certain algorithms to identify relations among variables.
- Data product: This is an application that produces an output after running one or more data analysis algorithms.
- Communication: This is the way to give feedback about the information that was found in the data after the analysis.
Pandas supports the data analysis process from the data processing step onward.
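As a rough illustration of the data processing step, here is a minimal sketch that brings two sources into the same format before analysis. The two sources, their column names, and their values are hypothetical and exist only for this example.

```python
import pandas as pd

# Two hypothetical sources holding the same kind of records in different shapes.
source_a = pd.DataFrame({"ticker": ["X", "Y"], "price": ["42.0", "17.5"]})  # prices stored as text
source_b = pd.DataFrame({"Symbol": ["Z"], "Price": [88.0]})                  # different column names

# Bring both sources to one format: same column names, same types.
a = source_a.assign(price=pd.to_numeric(source_a["price"]))
b = source_b.rename(columns={"Symbol": "ticker", "Price": "price"})

# Stack the rows into a single table with a fresh 0..n-1 index.
combined = pd.concat([a, b], ignore_index=True)
print(combined)
```

Once everything shares one structure, the later steps (cleaning, exploring, modelling) can treat the data as a single table.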
Data structures in Pandas
Informally, a data structure is a way to store data in a computer. It defines how the data is stored and which operations you can perform on it.
Pandas has two main data structures:
- Series: One-dimensional labelled array.
- DataFrame: Two-dimensional labelled structure in which the columns can hold different types (numbers, strings, etc.)
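The two structures can be sketched as follows. The tickers and prices are made-up sample values, used only to show the shape of each structure.

```python
import pandas as pd

# A Series: one-dimensional values, each with a label (the index).
prices = pd.Series([10.5, 20.0, 15.25], index=["A", "B", "C"], name="price")

# A DataFrame: two-dimensional, and each column may hold a different type.
stocks = pd.DataFrame(
    {
        "ticker": ["A", "B", "C"],
        "price": [10.5, 20.0, 15.25],
        "listed": [True, False, True],
    }
)

print(prices["B"])    # look up a value by its label
print(stocks.dtypes)  # one dtype per column
print(type(stocks["price"]).__name__)  # a single column of a DataFrame is a Series
```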
Loading data with Pandas
Pandas can load data in many formats. Some of them are:
- CSV
- JSON
- Excel
- HTML
- HDF5
- SQL
- SPSS
- Google BigQuery
- Among others.
As you can see, you have tools to load data in several formats. This will facilitate data collection and processing steps.
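For instance, here is a minimal sketch of loading CSV and JSON with `pd.read_csv` and `pd.read_json`. To keep the example self-contained, the data comes from in-memory strings rather than real files; the contents are made up, and in practice you would usually pass a file path or URL instead.

```python
import io

import pandas as pd

# Made-up sample data; normally this would live in files.
csv_text = "ticker,price\nA,10.5\nB,20.0\n"
json_text = '[{"ticker": "C", "price": 15.25}]'

# Both readers accept a path or a file-like object.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_csv)
print(from_json)
```

Both calls return a DataFrame, so data from different formats ends up in the same structure.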
Cleaning data
Pandas provides a flexible and easy way to find missing data.
You can then choose one of the many approaches to data cleaning, such as replacing missing values with the mean or simply removing or ignoring the affected records.
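A minimal sketch of these approaches, using a small made-up table with one missing price and one duplicated row:

```python
import numpy as np
import pandas as pd

# Made-up data: row "B" has no price, and row "C" appears twice.
raw = pd.DataFrame(
    {"ticker": ["A", "B", "C", "C"], "price": [10.0, np.nan, 20.0, 20.0]}
)

# Find missing data: count the missing values in each column.
missing_per_column = raw.isna().sum()

# Remove exact duplicate rows.
deduplicated = raw.drop_duplicates()

# Option 1: replace missing prices with the mean of the known prices.
filled = deduplicated.fillna({"price": deduplicated["price"].mean()})

# Option 2: drop every record that has a missing value.
dropped = deduplicated.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```

Which option is appropriate depends on the data and the question being asked; neither is correct in every case.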
Analyzing data
There are many ways of analyzing data. Explaining all the techniques would probably take several books.
Here I just want to give a flavor of some of the statistical functions that you can find in Pandas:
- Percentage change
- Covariance
- Correlation
- Data ranking
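The functions above can be sketched on a small made-up price table. The column names `X` and `Y` and all values are illustrative only.

```python
import pandas as pd

# Made-up prices for two hypothetical stocks over four periods.
prices = pd.DataFrame(
    {"X": [100.0, 110.0, 99.0, 120.0], "Y": [50.0, 55.0, 49.5, 60.0]}
)

# Percentage change from one row to the next (the first row has no predecessor).
returns = prices.pct_change()

# Covariance and correlation between the two columns.
covariance = prices["X"].cov(prices["Y"])
correlation = prices["X"].corr(prices["Y"])

# Rank the values of X, with rank 1 for the highest price.
ranks = prices["X"].rank(ascending=False)

print(returns)
print(covariance, correlation)
print(ranks)
```

Because `Y` is exactly half of `X` in this sample, the two columns are perfectly correlated, which makes the result easy to check by hand.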
Some tasks can also be accomplished with the basic algorithms described here (note that the Pandas library already includes these algorithms). Some examples are as follows:
- Retrieve a value: Find a specific value, for instance: what is the cost of the stock X?
- Find an extremum: Find the minimum or maximum value, for instance: which stock has the lowest price?
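Both tasks can be sketched in a few lines. The tickers and prices below are made up for illustration.

```python
import pandas as pd

# Made-up prices, indexed by ticker so rows can be looked up by name.
stocks = pd.DataFrame(
    {"ticker": ["X", "Y", "Z"], "price": [42.0, 17.5, 88.0]}
).set_index("ticker")

# Retrieve a value: what is the price of stock X?
price_of_x = stocks.loc["X", "price"]

# Find an extremum: which stock has the lowest (or highest) price?
cheapest = stocks["price"].idxmin()
most_expensive = stocks["price"].idxmax()

print(price_of_x, cheapest, most_expensive)
```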
Creating graphs from the data
With Pandas you can create several types of graphs, including the following:
- Bar plots
- Histograms
- Box plots
- Area plots
- Scatter plots
- Pie plots
You can create these types of graphs (and more) using the functions provided by Pandas, or by using matplotlib, a Python library for creating graphs from data.
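As a minimal sketch, the following draws a bar plot and a histogram side by side through the Pandas plotting functions. It assumes matplotlib is installed, since Pandas uses it for drawing by default. The data is made up, and the `Agg` backend is selected only so the example also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs

import matplotlib.pyplot as plt
import pandas as pd

# Made-up prices for three hypothetical stocks.
prices = pd.DataFrame({"price": [42.0, 17.5, 88.0]}, index=["X", "Y", "Z"])

# One figure with two axes, so each plot has its own panel.
fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Bar plot: one bar per stock.
prices.plot.bar(ax=ax_bar, legend=False)
ax_bar.set_ylabel("price")

# Histogram: distribution of the prices across three bins.
prices["price"].plot.hist(ax=ax_hist, bins=3)
ax_hist.set_xlabel("price")

fig.tight_layout()
```

From here, `fig.savefig(...)` writes the figure to a file, and in an interactive session `plt.show()` displays it.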
Summary
The Pandas library gives you a comprehensive environment for data analysis.
It is a flexible library that, together with other Python libraries, creates a very powerful environment for extracting information from data that comes from different sources.
In my next article, I’ll give you examples that you can use to start working with this great Python library.
So, stay tuned!