Data Analysis / ML / Normalization

< Normalization >

※ The process of rescaling a dataset's features when their scales differ severely from one another

ex) Suppose we have data about houses:

  • Number of rooms: {1, 2, 3, ..., 20} - the values span a small numeric range
  • Age of the house (months): {12, 24, ..., 240} - the values span a much larger range
  • Each feature needs to be brought onto the same scale

1. Min-Max Normalization

x' = (x - min(x)) / (max(x) - min(x))

  • Advantage: every feature is scaled to the same range, [0, 1] (see the sketch after this list)
  • Disadvantage: quite sensitive to outliers, since the min and max alone define the range
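
A minimal sketch of the formula applied to the hypothetical house features above (the numbers are illustrative, not real data); note how a single extreme value compresses all the normal values:

import numpy as np

rooms = np.array([1, 5, 10, 20], dtype=float)
age_months = np.array([12, 60, 120, 240], dtype=float)

def min_max(x):
    # x' = (x - min(x)) / (max(x) - min(x))
    return (x - x.min()) / (x.max() - x.min())

print(min_max(rooms))       # both features now lie in [0, 1]
print(min_max(age_months))

# a single outlier stretches the denominator and squeezes everything else
with_outlier = np.array([1, 5, 10, 20, 1000], dtype=float)
print(min_max(with_outlier))  # the normal values crowd near 0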

2. Z-Score Normalization

z = (x - mean(x)) / std(x)

  • In the formula, the numerator is the deviation from the mean and the denominator is the standard deviation
  • Advantage: relatively less sensitive to outliers (see the sketch after this list)
  • Disadvantage: features are not scaled into one identical, bounded range
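
A minimal sketch of z-score normalization on the same illustrative values, computed once by hand and once with scipy.stats.zscore (the same function the examples below use for outlier detection):

import numpy as np
from scipy import stats

x = np.array([1, 5, 10, 20, 1000], dtype=float)

# z = (x - mean(x)) / std(x)
z_manual = (x - x.mean()) / x.std()
z_scipy = stats.zscore(x)

print(z_manual)
print(z_scipy)                          # matches the manual version
print(z_manual.min(), z_manual.max())   # no fixed range such as [0, 1]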

< Min-Max Normalization Example - One Independent Variable >

※ Uses MinMaxScaler from the sklearn library
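
Before the full example, a tiny demonstration of the MinMaxScaler fit / transform / inverse_transform round trip that the prediction step below relies on (the numbers are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data = np.array([[12], [60], [120], [240]], dtype=float)

scaler.fit(data)                         # learns the column's min and max
scaled = scaler.transform(data)          # maps values into [0, 1]
print(scaled)
print(scaler.inverse_transform(scaled))  # recovers the original values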

import numpy as np
import pandas as pd
from my_library.machine_learning_library import numerical_derivative
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import MinMaxScaler
# Raw Data Loading
df = pd.read_csv('./data/ozone.csv')
training_data = df[['Temp','Ozone']]
training_data = training_data.dropna(how='any')
# Outlier handling
zscore_threshold = 2.0
# Remove outliers in Temp
outliers = training_data['Temp'][np.abs(stats.zscore(training_data['Temp'])) > zscore_threshold]
training_data = training_data.loc[~training_data['Temp'].isin(outliers)]
# Remove outliers in Ozone
outliers = training_data['Ozone'][np.abs(stats.zscore(training_data['Ozone'])) > zscore_threshold]
training_data = training_data.loc[~training_data['Ozone'].isin(outliers)]
# Normalization with sklearn
# using the Min-Max Normalization technique
# create separate scaler objects for the independent and dependent variables
scaler_x = MinMaxScaler()
scaler_t = MinMaxScaler()
scaler_x.fit(training_data['Temp'].values.reshape(-1,1))
scaler_t.fit(training_data['Ozone'].values.reshape(-1,1))
training_data['Temp'] = scaler_x.transform(training_data['Temp'].values.reshape(-1,1))
training_data['Ozone'] = scaler_t.transform(training_data['Ozone'].values.reshape(-1,1))
# Training Data Set
x_data = training_data['Temp'].values.reshape(-1,1)
t_data = training_data['Ozone'].values.reshape(-1,1)
# Weight & bias
W = np.random.rand(1,1)
b = np.random.rand(1)
# loss function (mean squared error)
def loss_func(x, t):
    y = np.dot(x, W) + b
    return np.mean(np.power((t - y), 2))

# predict
def predict(x):
    return np.dot(x, W) + b
# learning_rate
learning_rate = 1e-4
f = lambda x : loss_func(x_data, t_data)  # f ignores x; numerical_derivative perturbs W and b in place
# Training loop
for step in range(300000):
    W -= learning_rate * numerical_derivative(f, W)
    b -= learning_rate * numerical_derivative(f, b)

    if step % 30000 == 0:
        print('W : {}, b : {}, loss : {}'.format(W, b, loss_func(x_data, t_data)))
# Prediction: the model was trained on scaled data, so scale the input first
# and map the predicted value back to the original Ozone scale
scaled_input = scaler_x.transform(np.array([[62]]))
print(scaler_t.inverse_transform(predict(scaled_input)))
# Plot the scaled data and the fitted line
plt.scatter(x_data, t_data)
plt.plot(x_data, np.dot(x_data, W) + b, color='r')
plt.show()
### Results ###
# W : [[0.18840095]], b : [0.92278829], loss : 0.5016993013567107
# W : [[0.1231764]], b : [0.29590244], loss : 0.04989706221494551
# W : [[0.28037741]], b : [0.21268213], loss : 0.03928407322680011
# W : [[0.39961755]], b : [0.14981081], loss : 0.03318859804228349
# W : [[0.48998454]], b : [0.10216346], loss : 0.029687680741445198
# W : [[0.55846978]], b : [0.06605359], loss : 0.02767693972868662
# W : [[0.61037179]], b : [0.03868747], loss : 0.026522076853342427
# W : [[0.64970607]], b : [0.01794788], loss : 0.0258587849363896
# W : [[0.67951583]], b : [0.00223024], loss : 0.02547782527746574
# W : [[0.70210736]], b : [-0.00968148], loss : 0.02525902227682894
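
The examples import numerical_derivative from a personal module (my_library.machine_learning_library) whose source is not shown. A minimal sketch of what such a helper typically looks like, assuming a central-difference scheme applied to each element of the parameter array (the signature is inferred from how it is called above):

import numpy as np

def numerical_derivative(f, x):
    # central difference per element: df/dx ~ (f(x + h) - f(x - h)) / (2h)
    delta_x = 1e-4
    derivative = np.zeros_like(x)

    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        tmp = x[idx]

        x[idx] = tmp + delta_x
        fx_plus = f(x)

        x[idx] = tmp - delta_x
        fx_minus = f(x)

        derivative[idx] = (fx_plus - fx_minus) / (2 * delta_x)
        x[idx] = tmp  # restore the original value
        it.iternext()

    return derivative

Note that the lambda f in the training loop ignores its argument; differentiation still works because a helper like this perturbs the entries of W and b in place before each call to f.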

(Figure: scatter plot of the scaled data with the fitted Min-Max Normalization regression line)

< Min-Max Normalization Example - Multiple Independent Variables >

import numpy as np
import pandas as pd
from my_library.machine_learning_library import numerical_derivative
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import MinMaxScaler
# Raw Data Loading
df = pd.read_csv('./data/ozone.csv')
# Remove missing values
training_data = df.dropna(how='any')
# Outlier handling
zscore_threshold = 2.0
# Detect and remove outliers in each column
for col in training_data.columns:
    outliers = training_data[col][np.abs(stats.zscore(training_data[col])) > zscore_threshold]
    training_data = training_data.loc[~training_data[col].isin(outliers)]
# Min-Max Normalization with sklearn
# create separate scaler objects for the independent and dependent variables
scaler_x = MinMaxScaler()
scaler_t = MinMaxScaler()
training_data_x = training_data.iloc[:,1:4]
training_data_t = training_data['Ozone'].values.reshape(-1,1)
scaler_x.fit(training_data_x)
scaler_t.fit(training_data_t)
training_data_x = scaler_x.transform(training_data_x)
training_data_t = scaler_t.transform(training_data_t)
# Training Data Set
x_data = training_data_x
t_data = training_data_t
# Weight & bias
W = np.random.rand(3,1)
b = np.random.rand(1)
# loss function (mean squared error)
def loss_func(x, t):
    y = np.dot(x, W) + b
    return np.mean(np.power((t - y), 2))

# predict
def predict(x):
    return np.dot(x, W) + b
# learning_rate
learning_rate = 1e-4
f = lambda x : loss_func(x_data, t_data)  # f ignores x; numerical_derivative perturbs W and b in place
# Training loop
for step in range(300000):
    W -= learning_rate * numerical_derivative(f, W)
    b -= learning_rate * numerical_derivative(f, b)

    if step % 30000 == 0:
        print('W : {}, b : {}, loss : {}'.format(W, b, loss_func(x_data, t_data)))
# Prediction: scale the raw input ([Solar.R, Wind, Temp] = [200, 10, 70]) first,
# then map the predicted value back to the original Ozone scale
scaled_input = scaler_x.transform(np.array([[200, 10, 70]]))
print(scaler_t.inverse_transform(predict(scaled_input)))
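
For reference, the same scaled data can be fit with sklearn's LinearRegression instead of hand-written gradient descent; a minimal sketch reusing x_data, t_data, and the scalers from the example above:

from sklearn.linear_model import LinearRegression

# fit on the already-scaled training data
model = LinearRegression()
model.fit(x_data, t_data)
print('W : {}, b : {}'.format(model.coef_, model.intercept_))

# predict the same point: scale the input, inverse-transform the output
scaled_input = scaler_x.transform(np.array([[200, 10, 70]]))
print(scaler_t.inverse_transform(model.predict(scaled_input)))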