Analyzing your Data

DataFrameProfiler class

This class makes a profile for a given dataframe and its different general features. Based on spark-df-profiling by Julio Soto.

Initially it is a good idea to see a general view of the DataFrame to be analyzed.

Lets assume you have the following dataset, called foo.csv, in your current directory:

id firstName lastName billingId product price birth dummyCol
1 Luis Alvarez$$%! 123 Cake 10 1980/07/07 never
2 André Ampère 423 piza 8 1950/07/08 gonna
3 NiELS Böhr//((%% 551 pizza 8 1990/07/09 give
4 PAUL dirac$ 521 pizza 8 1954/07/10 you
5 Albert Einstein 634 pizza 8 1990/07/11 up
6 Galileo GALiLEI 672 arepa 5 1930/08/12 never
7 CaRL Ga%%%uss 323 taco 3 1970/07/13 gonna
8 David H$$$ilbert 624 taaaccoo 3 1950/07/14 let
9 Johannes KEPLER 735 taco 3 1920/04/22 you
10 JaMES M$$ax%%well 875 taco 3 1923/03/12 down
11 Isaac Newton 992 pasta 9 1999/02/15 never
12 Emmy%% Nöether$ 234 pasta 9 1993/12/08 gonna
13 Max!!! Planck!!! 111 hamburguer 4 1994/01/04 run
14 Fred Hoy&&&le 553 pizzza 8 1997/06/27 around
15 ((( Heinrich ))))) Hertz 116 pizza 8 1956/11/30 and
16 William Gilbert### 886 BEER 2 1958/03/26 desert
17 Marie CURIE 912 Rice 1 2000/03/22 you
18 Arthur COM%%%pton 812 110790 5 1899/01/01 #
19 JAMES Chadwick 467 null 10 1921/05/03 #
# Import optimus
import optimus as op
# Import os for files
import os

# Instance of Utilities class
tools = op.Utilities()

# Reading dataframe. os.getcwd() returns de current directory of the notebook
# 'file:///' is a prefix that specifies the type of file system used, in this
# case, local file system (hard drive of the pc) is used.
file_path = "file:///" + os.getcwd() + "/foo.csv"

df = tools.read_csv(path=filePath, sep=',')

# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()

This overview presents basic information about the DataFrame, like number of variable it has, how many are missing values and in which column, the types of each variable, also some statistical information that describes the variable plus a frequency plot. table that specifies the existing datatypes in each column dataFrame and other features. Also, for this particular case, the table of dataType is shown in order to visualize a sample of column content.

DataFrameAnalyzer class

DataFrameAnalyzer class analyze dataType of rows in each columns of dataFrames.

DataFrameAnalyzer methods

  • DataFrameAnalyzer.column_analyze(column_list, plots=True, values_bar=True, print_type=False, num_bars=10)
  • DataFrameAnalyzer.plot_hist(df_one_col, hist_dict, type_hist, num_bars=20, values_bar=True)
  • DataFrameAnalyzer.get_categorical_hist(df_one_col, num_bars)
  • DataFrameAnalyzer.get_numerical_hist(df_one_col, num_bars)
  • DataFrameAnalyzer.unique_values_col(column)
  • DataFrameAnalyzer.write_json(json_cols, path_to_json_file)
  • DataFrameAnalyzer.get_frequency(columns, sort_by_count=True)

Lets assume you have the following dataset, called foo.csv, in your current directory:

id firstName lastName billingId product price birth dummyCol
1 Luis Alvarez$$%! 123 Cake 10 1980/07/07 never
2 André Ampère 423 piza 8 1950/07/08 gonna
3 NiELS Böhr//((%% 551 pizza 8 1990/07/09 give
4 PAUL dirac$ 521 pizza 8 1954/07/10 you
5 Albert Einstein 634 pizza 8 1990/07/11 up
6 Galileo GALiLEI 672 arepa 5 1930/08/12 never
7 CaRL Ga%%%uss 323 taco 3 1970/07/13 gonna
8 David H$$$ilbert 624 taaaccoo 3 1950/07/14 let
9 Johannes KEPLER 735 taco 3 1920/04/22 you
10 JaMES M$$ax%%well 875 taco 3 1923/03/12 down
11 Isaac Newton 992 pasta 9 1999/02/15 never
12 Emmy%% Nöether$ 234 pasta 9 1993/12/08 gonna
13 Max!!! Planck!!! 111 hamburguer 4 1994/01/04 run
14 Fred Hoy&&&le 553 pizzza 8 1997/06/27 around
15 ((( Heinrich ))))) Hertz 116 pizza 8 1956/11/30 and
16 William Gilbert### 886 BEER 2 1958/03/26 desert
17 Marie CURIE 912 Rice 1 2000/03/22 you
18 Arthur COM%%%pton 812 110790 5 1899/01/01 #
19 JAMES Chadwick 467 null 10 1921/05/03 #

The following code shows how to instantiate the class to analyze a dataFrame:

# Import optimus
import optimus as op
# Import os for files
import os

# Instance of Utilities class
tools = op.Utilities()


# Reading dataframe. os.getcwd() returns de current directory of the notebook
# 'file:///' is a prefix that specifies the type of file system used, in this
# case, local file system (hard drive of the pc) is used.
file_path = "file:///" + os.getcwd() + "/foo.csv"

df = tools.read_csv(path=file_path, sep=',')

analyzer = op.DataFrameAnalyzer(df=df,pathFile=filePath)

Analyzer.column_analyze(column_list, plots=True, values_bar=True, print_type=False, num_bars=10)

This function counts the number of registers in a column that are numbers (integers, floats) and the number of string registers.

Input:

column_list: A list or a string column name.

plots: Can be True or False. If true it will output the predefined plots.

values_bar (optional): Can be True or False. If it is True, frequency values are placed over each bar.

print_type (optional): Can be one of the following strings: ‘integer’, ‘string’, ‘float’. Depending of what string is provided, a list of distinct values of that type is printed.

num_bars: number of bars printed in histogram

The method outputs a list containing the number of the different datatypes [nulls, strings, integers, floats].

Example:

analyzer.column_analyze("*", plots=False, values_bar=True, print_type=False, num_bars=10)
      Column name: id
      Column datatype: int
Datatype Quantity Percentage  
None 0 0.00 %  
Empty str 0 0.00 %  
String 0 0.00 %  
Integer 19 100.00 %  
Float 0 0.00 %  

Min value: 1

Max value: 19

end of __analyze 4.059180021286011

      Column name: firstName
      Column datatype: string
Datatype Quantity Percentage  
None 0 0.00 %  
Empty str 0 0.00 %  
String 19 100.00 %  
Integer 0 0.00 %  
Float 0 0.00 %  

end of __analyze 1.1431787014007568

      Column name: lastName
      Column datatype: string
Datatype Quantity Percentage  
None 0 0.00 %  
Empty str 0 0.00 %  
String 19 100.00 %  
Integer 0 0.00 %  
Float 0 0.00 %  

end of __analyze 0.9663524627685547

      Column name: billingId
      Column datatype: int
Datatype Quantity Percentage  
None 0 0.00 %  
Empty str 0 0.00 %  
String 0 0.00 %  
Integer 19 100.00 %  
Float 0 0.00 %  

Min value: 111

Max value: 992

end of __analyze 4.292513847351074

      Column name: product
      Column datatype: string
Datatype Quantity Percentage  
None 0 0.00 %  
Empty str 0 0.00 %  
String 18 94.74 %  
Integer 1 5.26 %  
Float 0 0.00 %  

end of __analyze 1.180891990661621

      Column name: price
      Column datatype: int
Datatype Quantity Percentage  
None 0 0.00 %  
Empty str 0 0.00 %  
String 0 0.00 %  
Integer 19 100.00 %  
Float 0 0.00 %  

Min value: 1

Max value: 10

end of __analyze 4.364053964614868

      Column name: birth
      Column datatype: string
Datatype Quantity Percentage  
None 0 0.00 %  
Empty str 0 0.00 %  
String 19 100.00 %  
Integer 0 0.00 %  
Float 0 0.00 %  

end of __analyze 0.9144570827484131

      Column name: dummyCol
      Column datatype: string
Datatype Quantity Percentage  
None 0 0.00 %  
Empty str 0 0.00 %  
String 19 100.00 %  
Integer 0 0.00 %  
Float 0 0.00 %  

end of __analyze 0.9651758670806885

Total execution time: 17.98968768119812

    General Description
Features Name or Quantity  
File Name foo.csv  
Columns 8  
Rows 19  

Analyzer.get_categorical_hist(df_one_col, num_bars)

This function analyzes a dataframe of a single column (only string type columns) and returns a dictionary with bins and values of frequency.

Input:

df_one_col:One column dataFrame.

num_bars: Number of bars or histogram bins.

The method outputs a dictionary with bins and values of frequency for only type strings colmuns.

Example:

Lets say we want to plot a histogram of frecuencies for the product column. We first need to obtain the dictionary of the frecuencies for each one. This is what this function does for categorical data. Remember that if you run the columnAnalyze() method with plots = True this is done for you.

productDf = analyzer.df.select("product") #or df.select("product")
hist_dictPro = analyzer.get_categorical_hist(df_one_col=productDf, num_bars=10)
print(hist_dictPro)
#Output
"""[{'cont': 4, 'value': 'pizza'}, {'cont': 3, 'value': 'taco'}, {'cont': 2, 'value': 'pasta'}, {'cont': 1, 'value':         'hamburguer'}, {'cont': 1, 'value': 'BEER'}, {'cont': 1, 'value': 'Rice'}, {'cont': 1, 'value': 'piza'}, {'cont': 1,         'value': 'Cake'}, {'cont': 1, 'value': 'arepa'}, {'cont': 1, 'value': '110790'}]"""

Now that we have the dictionary we just need to call plot_hist().

Analyzer.get_numerical_hist(df_one_col, num_bars)

This function analyzes a dataframe of a single column (only numerical columns) and returns a dictionary with bins and values of frequency.

Input:

df_one_col:One column dataFrame.

num_bars: Number of bars or histogram bins.

The method outputs a dictionary with bins and values of frequency for only numerical colmuns.

Example:

Lets say we want to plot a histogram of frequencies for the price column. We first need to obtain the dictionary of the frequencies for each one. This is what this function does for numerical data. Remember that if you run the columnAnalyze() method with plots = True this is done for you.

priceDf = analyzer.df.select("price") #or df.select("price")
hist_dictPri = analyzer.get_numerical_hist(df_one_col=priceDf, num_bars=10)
print(hist_dictPri)
#Output
"""[{'cont': 2, 'value': 9.55}, {'cont': 2, 'value': 8.649999999999999}, {'cont': 6, 'value': 7.749999999999999}, {'cont':   2, 'value': 5.05}, {'cont': 1, 'value': 4.1499999999999995}, {'cont': 4, 'value': 3.25}, {'cont': 1, 'value':               2.3499999999999996}, {'cont': 1, 'value': 1.45}]"""

Analyzer.plot_hist(df_one_col, hist_dict, type_hist, num_bars=20, values_bar=True)

This function builds the histogram (bins) of a categorical or numerical column dataframe.

Input:

df_one_col: A dataFrame of one column.

hist_dict: Python dictionary with histogram values.

type_hist: type of histogram to be generated, numerical or categorical.

num_bars: Number of bars in histogram.

values_bar: If values_bar is True, values of frequency are plotted over bars.

The method outputs a plot of the histogram for a categorical or numerical column.

Example:

# For a categorical DF
analyzer.plot_hist(df_one_col=productDf,hist_dict= hist_dictPro, type_hist='categorical')
../_images/productHist.png
# For a numerical DF
analyzer.plot_hist(df_one_col=priceDf,hist_dict= hist_dictPri, type_hist='categorical')
../_images/priceHist.png

Analyzer.unique_values_col(column)

This function counts the number of values that are unique and also the total number of values. Then, returns the values obtained.

Input:

column: Name of column dataFrame, this argument must be string type.

The method outputs a dictionary of values counted, as an example: {'unique': 10, 'total': 15}.

Example:

print(analyzer.unique_values_col("product"))
print(analyzer.unique_values_col("price"))
#Output
{'unique': 13, 'total': 19}
{'unique': 8, 'total': 19}

Analyzer.write_json(json_cols, path_to_json_file)

This functions outputs a JSON for the DataFrame in the specified path.

Input:

json_cols: Dictionary that represents the dataframe.

path_to_json_file: Specified path to write the returned JSON.

The method outputs the dataFrame as a JSON. To use it in a simple way first run

json_cols = analyzer.column_analyze(column_list="*", print_type=False, plots=False)

And you will have the desired dictionary to pass to the write_json function.

Example:

analyzer.write_json(json_cols=json_cols, path_to_json_file= os.getcwd() + "/foo.json")

Analyzer.get_frequency(self, columns, sort_by_count=True)

This function gets the frequencies for values inside the specified columns.

Input:

columns: String or List of columns to analyze

sort_by_count: Boolean if true the counts will be sort desc.

The method outputs a Spark Dataframe with counts per existing values in each column.

Tu use it, first lets create a sample DataFrame:

import random
import optimus as op
from pyspark.sql.types import StringType, StructType, IntegerType, FloatType, DoubleType, StructField

schema = StructType(
        [
        StructField("strings", StringType(), True),
        StructField("integers", IntegerType(), True),
        StructField("integers2", IntegerType(), True),
        StructField("floats",  FloatType(), True),
        StructField("double",  DoubleType(), True)
        ]
)

size = 200
# Generating strings column:
foods = ['    pizza!       ', 'pizza', 'PIZZA;', 'pizza', 'pízza¡', 'Pizza', 'Piz;za']
foods = [foods[random.randint(0,6)] for count in range(size)]
# Generating integer column:
num_col_1 = [random.randint(0,9) for number in range(size)]
# Generating integer column:
num_col_2 = [random.randint(0,9) for number in range(size)]
# Generating integer column:
num_col_3 = [random.random() for number in range(size)]
# Generating integer column:
num_col_4 = [random.random() for number in range(size)]

# Building DataFrame
df = op.spark.createDataFrame(list(zip(foods, num_col_1, num_col_2, num_col_3, num_col_4)),schema=schema)

# Instantiate Analyzer
analyzer = op.DataFrameAnalyzer(df)

# Get frequency DataFrame
df_counts = analyzer.get_frequency(["strings", "integers"], True)

And you will get (note that these are random generated values):

strings count
pizza 48
Piz;za 38
Pizza 37
pízza¡ 29
pizza! 25
PIZZA; 23
integers count
8 31
5 24
1 24
9 20
6 20
2 19
3 19
0 17
4 14
7 12