Feature Engineering with Optimus¶
Optimus makes the process of Feature Engineering easy.
Feature Engineering means creating new features from your existing ones to improve model performance. Sometimes you do this purely to boost accuracy; other times a particular model simply doesn't accept the data in its current shape, and these transformations let you run most Machine Learning and Deep Learning algorithms.
These methods are part of the DataFrameTransformer and provide a high-level abstraction over Spark's Feature Engineering methods. You'll see how easy it is to prepare your data with Optimus for Machine Learning.
Methods for Feature Engineering¶
fe.string_to_index(input_cols)¶
This method maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values.
df
Data frame to transform.
input_cols
List of columns to be indexed.
Let’s start by creating a DataFrame with Optimus.
from pyspark.sql import Row, types
from pyspark.ml import feature, classification
from optimus import Optimus
from optimus.ml.models import ML
import optimus.ml.feature as fe
op = Optimus()
ml = ML()
spark = op.spark
sc = op.sc
# Creating sample DF
data = [('Japan', 'Tokyo', 37800000),('USA', 'New York', 19795791),('France', 'Paris', 12341418),
('Spain','Madrid',6489162)]
df = op.spark.createDataFrame(data, ["country", "city", "population"])
df.table()
country | city | population |
Japan | Tokyo | 37800000 |
USA | New York | 19795791 |
France | Paris | 12341418 |
Spain | Madrid | 6489162 |
# Indexing columns 'city' and 'country'
df_sti = fe.string_to_index(df, input_cols=["city", "country"])
# Show indexed DF
df_sti.table()
country | city | population | city_index | country_index |
Japan | Tokyo | 37800000 | 1.0 | 1.0 |
USA | New York | 19795791 | 2.0 | 3.0 |
France | Paris | 12341418 | 3.0 | 2.0 |
Spain | Madrid | 6489162 | 0.0 | 0.0 |
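Under the hood this wraps Spark's StringIndexer, which assigns indices by descending label frequency: the most frequent label gets index 0.0. A minimal pure-Python sketch of that ordering logic (the `string_to_index` helper below is illustrative only, not the Optimus implementation; Spark breaks frequency ties in its own order):

```python
from collections import Counter

def string_to_index(values):
    """Map labels to float indices ordered by descending frequency
    (ties broken alphabetically here), mirroring StringIndexer's idea."""
    order = sorted(Counter(values).items(), key=lambda kv: (-kv[1], kv[0]))
    mapping = {label: float(i) for i, (label, _) in enumerate(order)}
    return [mapping[v] for v in values]

string_to_index(["a", "b", "c", "a", "a", "c"])
# 'a' appears 3 times -> 0.0, 'c' twice -> 1.0, 'b' once -> 2.0
```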
fe.index_to_string(input_cols)¶
This method maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML (Spark) attributes of the input column, or from user-supplied labels (which take precedence over ML attributes).
df
Data frame to transform.
input_cols
List of index columns to map back to strings.
Let’s go back to strings with the DataFrame we created in the last step.
# Indexing columns 'city' and 'country'
df_sti = fe.string_to_index(df, input_cols=["city", "country"])
# Show indexed DF
df_sti.table()
country | city | population | city_index | country_index |
Japan | Tokyo | 37800000 | 1.0 | 1.0 |
USA | New York | 19795791 | 2.0 | 3.0 |
France | Paris | 12341418 | 3.0 | 2.0 |
Spain | Madrid | 6489162 | 0.0 | 0.0 |
# Going back to strings from index
df_its = fe.index_to_string(df_sti, input_cols=["country_index"])
# Show DF with column "country_index" back to string
df_its.table()
country | city | population | country_index | city_index | country_index_string |
Japan | Tokyo | 37800000 | 1.0 | 1.0 | Japan |
USA | New York | 19795791 | 3.0 | 2.0 | USA |
France | Paris | 12341418 | 2.0 | 3.0 | France |
Spain | Madrid | 6489162 | 0.0 | 0.0 | Spain |
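The reverse mapping is just a lookup from each index back to its label, using the index-ordered label list stored in the column's ML attributes. A pure-Python sketch of the idea (illustrative helper, not the Optimus implementation; the label list below is taken from the indexed table above):

```python
def index_to_string(indices, labels):
    """Map float indices back to their original string labels.
    `labels` is the index-ordered label list recovered during indexing."""
    return [labels[int(i)] for i in indices]

# country labels in index order, as produced by the earlier string_to_index step
country_labels = ["Spain", "Japan", "France", "USA"]
index_to_string([1.0, 3.0, 2.0, 0.0], country_labels)
# -> ['Japan', 'USA', 'France', 'Spain']
```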
fe.one_hot_encoder(input_cols)¶
This method maps a column of label indices to a column of binary vectors, with at most a single one-value.
df
Data frame to transform.
input_cols
List of columns to be encoded.
Let’s create a sample dataframe to see what OHE does:
# Creating DataFrame
data = [
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
]
df = op.spark.createDataFrame(data,["id", "category"])
# One Hot Encoding
df_ohe = fe.one_hot_encoder(df, input_cols=["id"])
# Show encoded dataframe
df_ohe.table()
id | category | id_encoded |
0 | a | (5,[0],[1.0]) |
1 | b | (5,[1],[1.0]) |
2 | c | (5,[2],[1.0]) |
3 | a | (5,[3],[1.0]) |
4 | a | (5,[4],[1.0]) |
5 | c | (5,[],[]) |
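The `(5,[0],[1.0])` notation is Spark's sparse vector format: (size, indices, values). Notice that with 6 distinct index values the vectors have size 5 — Spark's OneHotEncoder drops the last category by default, which is why id 5 encodes to the empty vector `(5,[],[])`. A dense pure-Python sketch of that logic (illustrative only, not the Optimus API):

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a label index as a binary vector with at most one 1.0.
    The last category is dropped by default, matching Spark's behavior."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

one_hot(0, 6)  # -> [1.0, 0.0, 0.0, 0.0, 0.0]   (sparse: (5,[0],[1.0]))
one_hot(5, 6)  # -> [0.0, 0.0, 0.0, 0.0, 0.0]   (sparse: (5,[],[]))
```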
fe.vector_assembler(input_cols)¶
This method combines a given list of columns into a single vector column.
df
Data frame to transform.
input_cols
List of columns to be assembled into a single vector column.
This is very important because lots of Machine Learning algorithms in Spark need this format to work.
Let’s create a sample dataframe to see what vector assembler does:
# Import Vectors
from pyspark.ml.linalg import Vectors
# Creating DataFrame
data = [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)]
df = op.spark.createDataFrame(data,["id", "hour", "mobile", "user_features", "clicked"])
# Assemble features
df_va = fe.vector_assembler(df, input_cols=["hour", "mobile", "user_features"])
# Show assembled df
print("Assembled columns 'hour', 'mobile', 'user_features' to vector column 'features'")
df_va.select("features", "clicked").table()
features | clicked |
[18.0,1.0,0.0,10.0,0.5] | 1.0 |
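Conceptually, the assembler just concatenates scalar columns and vector columns, in order, into one flat feature vector. A pure-Python sketch (illustrative only, not the Optimus implementation):

```python
def assemble(*columns):
    """Flatten a mix of scalar and vector column values into one feature vector."""
    features = []
    for col in columns:
        if isinstance(col, (list, tuple)):
            features.extend(float(x) for x in col)  # vector column: splice in its elements
        else:
            features.append(float(col))             # scalar column: append as one element
    return features

assemble(18, 1.0, [0.0, 10.0, 0.5])
# -> [18.0, 1.0, 0.0, 10.0, 0.5], matching the 'features' column above
```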
fe.normalizer(input_cols, p=2.0)¶
This method transforms a dataset of Vector rows, normalizing each Vector to unit norm. The parameter p specifies the p-norm used for normalization (p=2.0 by default).
df
Data frame to transform.
input_cols
List of Vector columns to be normalized.
p
The p-norm used for normalization.
Let’s create a sample dataframe to see what normalizer does:
id | features | features_normalized |
0 | [1.0,0.5,-1.0] | [0.6666666666666666,0.3333333333333333,-0.6666666666666666] |
1 | [2.0,1.0,1.0] | [0.8164965809277261,0.4082482904638631,0.4082482904638631] |
2 | [4.0,10.0,2.0] | [0.3651483716701107,0.9128709291752769,0.18257418583505536] |
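For reference, the p-norm computation itself is simple: each vector is divided by (sum(|v_i|^p))^(1/p). A plain-Python sketch (illustrative only):

```python
import math

def normalize(vec, p=2.0):
    """Scale a vector to unit p-norm: v / (sum(|v_i|^p))^(1/p)."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    return [x / norm for x in vec]

normalize([1.0, 0.5, -1.0])
# matches the first row above: norm = sqrt(2.25) = 1.5, so 1.0/1.5 = 0.666...
```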