Analyzing your Data¶
DataFrameProfiler class¶
This class makes a profile for a given DataFrame and its general features. It is based on spark-df-profiling by Julio Soto.
It is initially a good idea to get a general view of the DataFrame to be analyzed.
Let's assume you have the following dataset, called foo.csv, in your current directory:
id | firstName | lastName | billingId | product | price | birth | dummyCol |
1 | Luis | Alvarez$$%! | 123 | Cake | 10 | 1980/07/07 | never |
2 | André | Ampère | 423 | piza | 8 | 1950/07/08 | gonna |
3 | NiELS | Böhr//((%% | 551 | pizza | 8 | 1990/07/09 | give |
4 | PAUL | dirac$ | 521 | pizza | 8 | 1954/07/10 | you |
5 | Albert | Einstein | 634 | pizza | 8 | 1990/07/11 | up |
6 | Galileo | GALiLEI | 672 | arepa | 5 | 1930/08/12 | never |
7 | CaRL | Ga%%%uss | 323 | taco | 3 | 1970/07/13 | gonna |
8 | David | H$$$ilbert | 624 | taaaccoo | 3 | 1950/07/14 | let |
9 | Johannes | KEPLER | 735 | taco | 3 | 1920/04/22 | you |
10 | JaMES | M$$ax%%well | 875 | taco | 3 | 1923/03/12 | down |
11 | Isaac | Newton | 992 | pasta | 9 | 1999/02/15 | never |
12 | Emmy%% | Nöether$ | 234 | pasta | 9 | 1993/12/08 | gonna |
13 | Max!!! | Planck!!! | 111 | hamburguer | 4 | 1994/01/04 | run |
14 | Fred | Hoy&&&le | 553 | pizzza | 8 | 1997/06/27 | around |
15 | ((( Heinrich ))))) | Hertz | 116 | pizza | 8 | 1956/11/30 | and |
16 | William | Gilbert### | 886 | BEER | 2 | 1958/03/26 | desert |
17 | Marie | CURIE | 912 | Rice | 1 | 2000/03/22 | you |
18 | Arthur | COM%%%pton | 812 | 110790 | 5 | 1899/01/01 | # |
19 | JAMES | Chadwick | 467 | null | 10 | 1921/05/03 | # |
# Import optimus
import optimus as op
# Import os for files
import os
# Instance of Utilities class
tools = op.Utilities()
# Reading the dataframe. os.getcwd() returns the current directory of the notebook.
# 'file:///' is a prefix that specifies the file system used; in this
# case the local file system (the hard drive of the PC) is used.
file_path = "file:///" + os.getcwd() + "/foo.csv"
df = tools.read_csv(path=file_path, sep=',')
# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()
This overview presents basic information about the DataFrame: the number of variables it has, how many values are missing and in which columns, the type of each variable, and some descriptive statistics for each variable plus a frequency plot. It also includes a table that lists the data types present in each column of the DataFrame, along with other features. For this particular case, the data type table is shown in order to visualize a sample of the column contents.
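The kind of per-column summary the profiler reports can be sketched in plain Python for a list of rows. This is an illustrative stand-in, not the DataFrameProfiler implementation; the `overview` helper is hypothetical:

```python
from collections import Counter

def overview(rows):
    """Summarize missing counts and observed value types per column
    for a list of dict rows (a toy stand-in for a profiling pass)."""
    summary = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        missing = sum(1 for v in values if v is None)
        # Count the Python type of each non-missing value.
        types = Counter(type(v).__name__ for v in values if v is not None)
        summary[col] = {"missing": missing, "types": dict(types)}
    return summary

rows = [
    {"id": 1, "product": "Cake", "price": 10},
    {"id": 2, "product": None, "price": 8},
]
print(overview(rows))
```

A real profiler runs the same kind of aggregation distributed over Spark partitions rather than over an in-memory list.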
DataFrameAnalyzer class¶
The DataFrameAnalyzer class analyzes the data types of the values in each column of a DataFrame.
DataFrameAnalyzer methods
- DataFrameAnalyzer.column_analyze(column_list, plots=True, values_bar=True, print_type=False, num_bars=10)
- DataFrameAnalyzer.plot_hist(df_one_col, hist_dict, type_hist, num_bars=20, values_bar=True)
- DataFrameAnalyzer.get_categorical_hist(df_one_col, num_bars)
- DataFrameAnalyzer.get_numerical_hist(df_one_col, num_bars)
- DataFrameAnalyzer.unique_values_col(column)
- DataFrameAnalyzer.write_json(json_cols, path_to_json_file)
- DataFrameAnalyzer.get_frequency(columns, sort_by_count=True)
Let's assume you have the same foo.csv dataset shown above in your current directory:
The following code shows how to instantiate the class to analyze a DataFrame:
# Import optimus
import optimus as op
# Import os for files
import os
# Instance of Utilities class
tools = op.Utilities()
# Reading the dataframe. os.getcwd() returns the current directory of the notebook.
# 'file:///' is a prefix that specifies the file system used; in this
# case the local file system (the hard drive of the PC) is used.
file_path = "file:///" + os.getcwd() + "/foo.csv"
df = tools.read_csv(path=file_path, sep=',')
analyzer = op.DataFrameAnalyzer(df=df, pathFile=file_path)
Analyzer.column_analyze(column_list, plots=True, values_bar=True, print_type=False, num_bars=10)¶
This function counts the number of values in a column that are numbers (integers, floats) and the number of values that are strings.
Input:
column_list
: A list or a string column name.
plots
: Can be True or False. If true it will output the predefined plots.
values_bar (optional)
: Can be True or False. If it is True, frequency values are placed over each bar.
print_type (optional)
: Can be one of the following strings: ‘integer’, ‘string’, ‘float’. Depending on which
string is provided, a list of distinct values of that type is printed.
num_bars
: Number of bars printed in the histogram.
The method outputs a list containing the number of the different datatypes [nulls, strings, integers, floats].
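The per-type counting behind this output can be sketched in plain Python. The `classify_value` helper below is hypothetical and only illustrates the idea of bucketing raw cell values by inferred type; it is not part of Optimus:

```python
from collections import Counter

def classify_value(v):
    """Classify a raw cell as 'null', 'empty', 'integer', 'float' or 'string'."""
    if v is None:
        return "null"
    s = str(v).strip()
    if s == "":
        return "empty"
    try:
        int(s)
        return "integer"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        return "string"

# Tally the inferred type of each value in a column.
counts = Counter(classify_value(v) for v in ["8", "piza", None, "3.5", ""])
print(counts)
```

On Spark, the same classification would typically run inside a map over the column followed by an aggregation, rather than over a Python list.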
Example:
analyzer.column_analyze("*", plots=False, values_bar=True, print_type=False, num_bars=10)
Column name: id | |||
Column datatype: int | |||
Datatype | Quantity | Percentage | |
None | 0 | 0.00 % | |
Empty str | 0 | 0.00 % | |
String | 0 | 0.00 % | |
Integer | 19 | 100.00 % | |
Float | 0 | 0.00 % |
Min value: 1
Max value: 19
end of __analyze 4.059180021286011
Column name: firstName | |||
Column datatype: string | |||
Datatype | Quantity | Percentage | |
None | 0 | 0.00 % | |
Empty str | 0 | 0.00 % | |
String | 19 | 100.00 % | |
Integer | 0 | 0.00 % | |
Float | 0 | 0.00 % |
end of __analyze 1.1431787014007568
Column name: lastName | |||
Column datatype: string | |||
Datatype | Quantity | Percentage | |
None | 0 | 0.00 % | |
Empty str | 0 | 0.00 % | |
String | 19 | 100.00 % | |
Integer | 0 | 0.00 % | |
Float | 0 | 0.00 % |
end of __analyze 0.9663524627685547
Column name: billingId | |||
Column datatype: int | |||
Datatype | Quantity | Percentage | |
None | 0 | 0.00 % | |
Empty str | 0 | 0.00 % | |
String | 0 | 0.00 % | |
Integer | 19 | 100.00 % | |
Float | 0 | 0.00 % |
Min value: 111
Max value: 992
end of __analyze 4.292513847351074
Column name: product | |||
Column datatype: string | |||
Datatype | Quantity | Percentage | |
None | 0 | 0.00 % | |
Empty str | 0 | 0.00 % | |
String | 18 | 94.74 % | |
Integer | 1 | 5.26 % | |
Float | 0 | 0.00 % |
end of __analyze 1.180891990661621
Column name: price | |||
Column datatype: int | |||
Datatype | Quantity | Percentage | |
None | 0 | 0.00 % | |
Empty str | 0 | 0.00 % | |
String | 0 | 0.00 % | |
Integer | 19 | 100.00 % | |
Float | 0 | 0.00 % |
Min value: 1
Max value: 10
end of __analyze 4.364053964614868
Column name: birth | |||
Column datatype: string | |||
Datatype | Quantity | Percentage | |
None | 0 | 0.00 % | |
Empty str | 0 | 0.00 % | |
String | 19 | 100.00 % | |
Integer | 0 | 0.00 % | |
Float | 0 | 0.00 % |
end of __analyze 0.9144570827484131
Column name: dummyCol | |||
Column datatype: string | |||
Datatype | Quantity | Percentage | |
None | 0 | 0.00 % | |
Empty str | 0 | 0.00 % | |
String | 19 | 100.00 % | |
Integer | 0 | 0.00 % | |
Float | 0 | 0.00 % |
end of __analyze 0.9651758670806885
Total execution time: 17.98968768119812
General Description | ||
Features | Name or Quantity | |
File Name | foo.csv | |
Columns | 8 | |
Rows | 19 |
Analyzer.get_categorical_hist(df_one_col, num_bars)¶
This function analyzes a dataframe of a single column (only string type columns) and returns a dictionary with bins and values of frequency.
Input:
df_one_col
: A one-column DataFrame.
num_bars
: Number of bars or histogram bins.
The method outputs a dictionary with bins and frequency values for string-type columns only.
Example:
Let's say we want to plot a histogram of frequencies for the product column. We first need to obtain the dictionary of frequencies for each value. This is what this function does for categorical data. Remember that if you run the column_analyze() method with plots = True this is done for you.
productDf = analyzer.df.select("product") #or df.select("product")
hist_dictPro = analyzer.get_categorical_hist(df_one_col=productDf, num_bars=10)
print(hist_dictPro)
#Output
"""[{'cont': 4, 'value': 'pizza'}, {'cont': 3, 'value': 'taco'}, {'cont': 2, 'value': 'pasta'}, {'cont': 1, 'value': 'hamburguer'}, {'cont': 1, 'value': 'BEER'}, {'cont': 1, 'value': 'Rice'}, {'cont': 1, 'value': 'piza'}, {'cont': 1, 'value': 'Cake'}, {'cont': 1, 'value': 'arepa'}, {'cont': 1, 'value': '110790'}]"""
Now that we have the dictionary we just need to call plot_hist().
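The same kind of frequency dictionary can be reproduced for a plain Python list with collections.Counter. This is a sketch of the idea, not the Optimus implementation:

```python
from collections import Counter

def categorical_hist(values, num_bars):
    """Return the num_bars most frequent values in the same
    [{'cont': n, 'value': v}, ...] shape used by the analyzer output."""
    return [{"cont": n, "value": v}
            for v, n in Counter(values).most_common(num_bars)]

products = ["pizza", "taco", "pizza", "pasta", "taco", "pizza"]
print(categorical_hist(products, num_bars=2))
# [{'cont': 3, 'value': 'pizza'}, {'cont': 2, 'value': 'taco'}]
```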
Analyzer.get_numerical_hist(df_one_col, num_bars)¶
This function analyzes a dataframe of a single column (only numerical columns) and returns a dictionary with bins and values of frequency.
Input:
df_one_col
: A one-column DataFrame.
num_bars
: Number of bars or histogram bins.
The method outputs a dictionary with bins and frequency values for numerical columns only.
Example:
Let's say we want to plot a histogram of frequencies for the price column. We first need to obtain the dictionary of frequencies. This is what this function does for numerical data. Remember that if you run the column_analyze() method with plots = True this is done for you.
priceDf = analyzer.df.select("price") #or df.select("price")
hist_dictPri = analyzer.get_numerical_hist(df_one_col=priceDf, num_bars=10)
print(hist_dictPri)
#Output
"""[{'cont': 2, 'value': 9.55}, {'cont': 2, 'value': 8.649999999999999}, {'cont': 6, 'value': 7.749999999999999}, {'cont': 2, 'value': 5.05}, {'cont': 1, 'value': 4.1499999999999995}, {'cont': 4, 'value': 3.25}, {'cont': 1, 'value': 2.3499999999999996}, {'cont': 1, 'value': 1.45}]"""
Analyzer.plot_hist(df_one_col, hist_dict, type_hist, num_bars=20, values_bar=True)¶
This function builds the histogram (bins) of a categorical or numerical column dataframe.
Input:
df_one_col
: A dataFrame of one column.
hist_dict
: Python dictionary with histogram values.
type_hist
: Type of histogram to be generated: 'numerical' or 'categorical'.
num_bars
: Number of bars in histogram.
values_bar
: If values_bar is True, values of frequency are plotted over bars.
The method outputs a plot of the histogram for a categorical or numerical column.
Example:
# For a categorical DF
analyzer.plot_hist(df_one_col=productDf, hist_dict=hist_dictPro, type_hist='categorical')
# For a numerical DF
analyzer.plot_hist(df_one_col=priceDf, hist_dict=hist_dictPri, type_hist='numerical')
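When a plotting backend is not available, the same dictionaries can be inspected with a dependency-free text bar chart. This is purely illustrative (plot_hist itself draws a graphical plot); `ascii_hist` is a hypothetical helper:

```python
def ascii_hist(hist_dict, width=40):
    """Render a [{'cont': n, 'value': v}, ...] dictionary as text bars,
    scaling the longest bar to `width` characters."""
    max_cont = max(d["cont"] for d in hist_dict)
    lines = [f"{str(d['value']):>12} | " + "#" * int(d["cont"] / max_cont * width)
             for d in hist_dict]
    return "\n".join(lines)

print(ascii_hist([{"cont": 4, "value": "pizza"}, {"cont": 2, "value": "taco"}]))
```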
Analyzer.unique_values_col(column)¶
This function counts the number of unique values in a column as well as the total number of values, and returns both counts.
Input:
column
: Name of the DataFrame column; this argument must be a string.
The method outputs a dictionary of values counted, as an example: {'unique': 10, 'total': 15}
.
Example:
print(analyzer.unique_values_col("product"))
print(analyzer.unique_values_col("price"))
#Output
{'unique': 13, 'total': 19}
{'unique': 8, 'total': 19}
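For a plain Python column the two counts are simply a set size and a list length. A minimal sketch of the idea (not the Spark implementation, which uses distributed distinct counting):

```python
def unique_values_col(values):
    """Count distinct and total values, mirroring the
    {'unique': ..., 'total': ...} output shape."""
    return {"unique": len(set(values)), "total": len(values)}

# The price column from foo.csv above.
prices = [10, 8, 8, 8, 8, 5, 3, 3, 3, 3, 9, 9, 4, 8, 8, 2, 1, 5, 10]
print(unique_values_col(prices))  # {'unique': 8, 'total': 19}
```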
Analyzer.write_json(json_cols, path_to_json_file)¶
This function writes a JSON file for the DataFrame at the specified path.
Input:
json_cols
: Dictionary that represents the dataframe.
path_to_json_file
: Specified path to write the returned JSON.
The method outputs the DataFrame as a JSON file. To use it in a simple way, first run
json_cols = analyzer.column_analyze(column_list="*", print_type=False, plots=False)
And you will have the desired dictionary to pass to the write_json function.
Example:
analyzer.write_json(json_cols=json_cols, path_to_json_file= os.getcwd() + "/foo.json")
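Assuming json_cols is a plain dictionary or list (as returned above), the write step itself is standard library json. A minimal sketch with a stand-in dictionary, written to a temporary directory:

```python
import json
import os
import tempfile

# Stand-in for the dictionary returned by column_analyze.
json_cols = [{"cont": 4, "value": "pizza"}, {"cont": 3, "value": "taco"}]

path = os.path.join(tempfile.gettempdir(), "foo.json")
with open(path, "w") as f:
    json.dump(json_cols, f)

# Read it back to confirm a faithful round trip.
with open(path) as f:
    print(json.load(f) == json_cols)  # True
```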
Analyzer.get_frequency(columns, sort_by_count=True)¶
This function gets the frequencies for values inside the specified columns.
Input:
columns
: String or list of columns to analyze.
sort_by_count
: Boolean; if True the counts are sorted in descending order.
The method outputs a Spark Dataframe with counts per existing values in each column.
To use it, let's first create a sample DataFrame:
import random
import optimus as op
from pyspark.sql.types import StringType, StructType, IntegerType, FloatType, DoubleType, StructField
schema = StructType(
[
StructField("strings", StringType(), True),
StructField("integers", IntegerType(), True),
StructField("integers2", IntegerType(), True),
StructField("floats", FloatType(), True),
StructField("double", DoubleType(), True)
]
)
size = 200
# Generating strings column:
foods = [' pizza! ', 'pizza', 'PIZZA;', 'pizza', 'pízza¡', 'Pizza', 'Piz;za']
foods = [foods[random.randint(0,6)] for count in range(size)]
# Generating integer column:
num_col_1 = [random.randint(0,9) for number in range(size)]
# Generating a second integer column:
num_col_2 = [random.randint(0,9) for number in range(size)]
# Generating a float column:
num_col_3 = [random.random() for number in range(size)]
# Generating a double column:
num_col_4 = [random.random() for number in range(size)]
# Building DataFrame
df = op.spark.createDataFrame(list(zip(foods, num_col_1, num_col_2, num_col_3, num_col_4)),schema=schema)
# Instantiate Analyzer
analyzer = op.DataFrameAnalyzer(df)
# Get frequency DataFrame
df_counts = analyzer.get_frequency(["strings", "integers"], True)
And you will get (note that these are randomly generated values):
strings | count |
pizza | 48 |
Piz;za | 38 |
Pizza | 37 |
pízza¡ | 29 |
pizza! | 25 |
PIZZA; | 23 |
integers | count |
8 | 31 |
5 | 24 |
1 | 24 |
9 | 20 |
6 | 20 |
2 | 19 |
3 | 19 |
0 | 17 |
4 | 14 |
7 | 12 |
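The per-column frequency tables above can be reproduced for plain Python lists with collections.Counter. This sketches the groupBy/count idea, not the Spark implementation; `get_frequency` here is a hypothetical stand-in operating on an in-memory dict of columns:

```python
from collections import Counter

def get_frequency(columns, sort_by_count=True):
    """columns: dict mapping column name -> list of values.
    Returns name -> [(value, count), ...], optionally sorted by count desc."""
    result = {}
    for name, values in columns.items():
        counts = list(Counter(values).items())
        if sort_by_count:
            counts.sort(key=lambda vc: vc[1], reverse=True)
        result[name] = counts
    return result

data = {"strings": ["pizza", "Pizza", "pizza", "PIZZA;"],
        "integers": [8, 5, 8, 8]}
print(get_frequency(data))
```

The Spark version does the equivalent of df.groupBy(col).count() per column, so it scales to data that does not fit in memory.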