Data Pre-processing using Scikit-learn

Pooja Lo
3 min read · Aug 9, 2021

Data pre-processing is a data mining technique for converting raw data into an understandable format. In this practical, we will take one dataset and perform the following tasks:

  1. Standardization
  2. Normalization
  3. Encoding
  4. Discretization
  5. Imputation of missing values

We will use the Delhi House Price Prediction dataset.

The dataset is available here.
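Throughout the article, data refers to the DataFrame holding this dataset. A minimal loading sketch (the file name below is a placeholder; use the name of the CSV you downloaded):

import pandas as pd

# Placeholder file name; point this at the downloaded Delhi house price CSV
data = pd.read_csv('delhi_house_prices.csv')
data.head()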

Standardization

Data standardization is the process of bringing data into a uniform format: values are centered around the mean and scaled to unit standard deviation.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Select only the numeric columns (non-object dtypes)
numeric_columns = [c for c in data.columns if data[c].dtype != np.dtype('O')]
temp_data = data[numeric_columns]
standard_scaler = StandardScaler()
standardized_data = standard_scaler.fit_transform(temp_data)
pd.DataFrame(standardized_data, columns=temp_data.columns)

Here we collect the numeric columns from the dataset and standardize them using StandardScaler().
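As a quick sanity check (a small sketch reusing standardized_data from above), every column should now have a mean close to 0 and a standard deviation close to 1:

standardized_df = pd.DataFrame(standardized_data, columns=temp_data.columns)
# Each column is now centered near 0 with roughly unit spread
print(standardized_df.mean().round(3))
print(standardized_df.std().round(3))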

Normalization

Normalization is another pre-processing technique. It rescales numeric features to a common range, typically [0, 1], so that columns measured on large scales do not dominate columns measured on small ones. Here we use min-max normalization with MinMaxScaler.

from sklearn.preprocessing import MinMaxScaler

# Rescale each numeric column to the [0, 1] range
normalizer = MinMaxScaler()
normalized_data = normalizer.fit_transform(temp_data)
pd.DataFrame(normalized_data, columns=temp_data.columns)
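To confirm the rescaling, each column should now span the [0, 1] range; a quick check reusing normalized_data from above:

normalized_df = pd.DataFrame(normalized_data, columns=temp_data.columns)
# After min-max scaling every column's minimum is 0 and maximum is 1
print(normalized_df.min())
print(normalized_df.max())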

Encoding

Encoding converts categorical variables into numerical or binary counterparts.

Common encoding techniques:

  • Label Encoding
  • One-Hot Encoding
  • Dummy Encoding
  • Effect Encoding
  • Binary Encoding
  • BaseN Encoding
  • Hash Encoding
  • Target Encoding

In this practical, we will cover Label Encoding and One-Hot Encoding.

Label Encoding

Label Encoding converts categorical labels into ordinal values: each unique label is mapped to an integer.

from sklearn.preprocessing import LabelEncoder

# Replace the categorical Status column with integer codes
le = LabelEncoder()
data['Status'] = le.fit_transform(data['Status'])
data['Status'].value_counts()
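To see which integer was assigned to which label, the fitted encoder exposes its classes; a small sketch using the le object above:

# classes_ is ordered so that a label's position is its encoded integer
print(dict(enumerate(le.classes_)))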

One-Hot Encoding

One-Hot Encoding creates a new binary variable for each level of a categorical feature; the variable is 1 for rows that belong to that category and 0 otherwise.

from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder()
transformed_data = one_hot.fit_transform(data['Transaction'].values.reshape(-1, 1)).toarray()
one_hot.categories_
# Name the output columns after the categories found in 'Transaction'
transformed_data = pd.DataFrame(transformed_data, columns=one_hot.categories_[0])
transformed_data.head()
# Compare one encoded row with the original value
transformed_data.iloc[90]
data['Transaction'][90]

Here we take the Transaction column of the dataset and encode it with OneHotEncoder(); each category becomes its own 0/1 column.
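As an aside, pandas offers a more concise route to the same result; a sketch of the alternative (not part of the original code):

# One 0/1 indicator column per category of 'Transaction'
dummies = pd.get_dummies(data['Transaction'], prefix='Transaction')
dummies.head()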

Discretization

Discretization is a technique for converting continuous variables, models or functions into a discrete form.

For discretization we take four columns, Area, BHK, Price and Status, and store them in a DataFrame called temp.

Uniform Discretization Transform

from sklearn.preprocessing import KBinsDiscretizer

temp = data[['Area', 'BHK', 'Price', 'Status']]  # the four columns to discretize
trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
new_data = trans.fit_transform(temp)
pd.DataFrame(new_data, columns=temp.columns)

A uniform discretization transform will preserve the probability distribution of each input variable but will make it discrete with the specified number of ordinal groups or labels. We can apply the uniform discretization transform using the KBinsDiscretizer class.
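After fitting, the discretizer stores the bin boundaries it computed for each column in its bin_edges_ attribute; printing them (a sketch reusing the trans object above) shows that the uniform strategy produces equally wide bins:

# One array of edges per input column; with strategy='uniform' the edges are evenly spaced
for col, edges in zip(temp.columns, trans.bin_edges_):
    print(col, edges)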

KMeans Discretization Transform

trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='kmeans')
new_data = trans.fit_transform(temp)
pd.DataFrame(new_data, columns=temp.columns)

A K-means discretization transform will attempt to fit k clusters for each input variable and then assign each observation to a cluster.

Quantile Discretization Transform

trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
new_data = trans.fit_transform(temp)
pd.DataFrame(new_data, columns=temp.columns)

A quantile discretization transform will attempt to split the observations for each input variable into k groups, where the number of observations assigned to each group is approximately equal.
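You can verify this balance by counting how many rows land in each bin of a column; a quick sketch on the first column of new_data:

import numpy as np

# With strategy='quantile' each bin should hold roughly the same number of rows
bins, counts = np.unique(new_data[:, 0], return_counts=True)
print(dict(zip(bins, counts)))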

Imputation of Missing Values

Missing values affect data analysis, so we have to handle them. There are two main ways to deal with a missing value: delete the row that contains it, or fill it in with a predicted or estimated value (imputation).

from sklearn.impute import SimpleImputer

# Replace NaN values in Per_Sqft with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
Per_Sqft = imputer.fit_transform(data['Per_Sqft'].values.reshape(-1, 1))
Per_Sqft

SimpleImputer is a scikit-learn class for handling missing values; here it replaces the NaN entries in the Per_Sqft column with the column mean.
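The row-deletion option mentioned earlier is a one-liner in pandas; a minimal sketch that drops every row with a missing Per_Sqft instead of imputing it:

# Alternative to imputation: drop the rows where Per_Sqft is missing
data_without_missing = data.dropna(subset=['Per_Sqft'])
print(len(data), len(data_without_missing))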

GitHub
