< Normalization >
※ Rescaling features whose scales differ drastically from one another
ex) Suppose we have data about houses:
- Number of rooms : {1, 2, 3, ... , 20} - small differences between values
- Age of the house (months) : {12, 24, ... , 240} - large differences between values
- The features need to be put on a common scale
1. Min-Max Normalization
- Maps each feature into [0, 1]: x_scaled = (x - x_min) / (x_max - x_min)
- Pros : every feature is scaled to the same range
- Cons : quite sensitive to outliers
2. Z-Score Normalization
- Rescales each feature to mean 0 and standard deviation 1: z = (x - mean) / std
- In the formula, the numerator subtracts the mean from each value, and the denominator is the standard deviation
- Pros : relatively less sensitive to outliers
- Cons : features are not mapped to an identical range (the scaled values are unbounded)
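
As a quick illustration of the difference, here is a minimal sketch using sklearn's MinMaxScaler and StandardScaler (the toy numbers are made up from the house example above):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# toy house data: [number of rooms, age in months]
house = np.array([[ 1,  12],
                  [ 3,  60],
                  [ 5, 120],
                  [20, 240]], dtype=float)

# Min-Max: every feature is mapped into the same [0, 1] range
print(MinMaxScaler().fit_transform(house))

# Z-Score: each feature gets mean 0 / std 1, but the resulting
# value ranges still differ between features and are not bounded
print(StandardScaler().fit_transform(house))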
< Min-Max Normalization Example - One Independent Variable >
※ Using MinMaxScaler from the sklearn library
import numpy as np
import pandas as pd
from my_library.machine_learning_library import numerical_derivative
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

# Raw Data Loading
df = pd.read_csv('./data/ozone.csv')
training_data = df[['Temp','Ozone']]
training_data = training_data.dropna(how='any')

# Outlier handling
zscore_threshold = 2.0

# Remove outliers in Temp
outliers = training_data['Temp'][np.abs(stats.zscore(training_data['Temp'])) > zscore_threshold]
training_data = training_data.loc[~training_data['Temp'].isin(outliers)]

# Remove outliers in Ozone
outliers = training_data['Ozone'][np.abs(stats.zscore(training_data['Ozone'])) > zscore_threshold]
training_data = training_data.loc[~training_data['Ozone'].isin(outliers)]

# Normalization with sklearn
# using the Min-Max Normalization technique
# Create separate scaler objects for the independent and dependent variables
scaler_x = MinMaxScaler()
scaler_t = MinMaxScaler()
scaler_x.fit(training_data['Temp'].values.reshape(-1,1))
scaler_t.fit(training_data['Ozone'].values.reshape(-1,1))
training_data['Temp'] = scaler_x.transform(training_data['Temp'].values.reshape(-1,1))
training_data['Ozone'] = scaler_t.transform(training_data['Ozone'].values.reshape(-1,1))

# Training Data Set
x_data = training_data['Temp'].values.reshape(-1,1)
t_data = training_data['Ozone'].values.reshape(-1,1)

# Weight & bias
W = np.random.rand(1,1)
b = np.random.rand(1)

# loss function (mean squared error)
def loss_func(x,t):
    y = np.dot(x,W) + b
    return np.mean(np.power((t-y),2))

# predict
def predict(x):
    return np.dot(x,W) + b

# learning rate
learning_rate = 1e-4

# the lambda ignores its argument; loss_func reads W and b from the
# enclosing scope, which numerical_derivative perturbs in place
f = lambda x : loss_func(x_data,t_data)

# Training
for step in range(300000):
    W -= learning_rate * numerical_derivative(f,W)
    b -= learning_rate * numerical_derivative(f,b)
    if step % 30000 == 0:
        print('W : {}, b : {}, loss : {}'.format(W,b,loss_func(x_data,t_data)))

# Prediction - the model was trained on scaled values, so the input must be
# scaled with scaler_x and the output converted back with scaler_t
scaled_input = scaler_x.transform(np.array([[62]]))
print(scaler_t.inverse_transform(predict(scaled_input)))

# Graph
plt.scatter(x_data, t_data)
plt.plot(x_data, np.dot(x_data, W) + b, color='r')
plt.show()

### Results ###
# W : [[0.18840095]], b : [0.92278829], loss : 0.5016993013567107
# W : [[0.1231764]], b : [0.29590244], loss : 0.04989706221494551
# W : [[0.28037741]], b : [0.21268213], loss : 0.03928407322680011
# W : [[0.39961755]], b : [0.14981081], loss : 0.03318859804228349
# W : [[0.48998454]], b : [0.10216346], loss : 0.029687680741445198
# W : [[0.55846978]], b : [0.06605359], loss : 0.02767693972868662
# W : [[0.61037179]], b : [0.03868747], loss : 0.026522076853342427
# W : [[0.64970607]], b : [0.01794788], loss : 0.0258587849363896
# W : [[0.67951583]], b : [0.00223024], loss : 0.02547782527746574
# W : [[0.70210736]], b : [-0.00968148], loss : 0.02525902227682894
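
Note that numerical_derivative is imported from a personal library and is not shown in this post. A minimal central-difference sketch that matches how it is called above might look like the following (an assumption about the helper, not the actual library code):

import numpy as np

def numerical_derivative(f, x):
    # approximate the partial derivative of f with respect to each
    # element of the ndarray x using the central difference
    delta_x = 1e-4
    derivative = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        tmp = x[idx]
        # the lambda f ignores its argument and reads W/b from the
        # outer scope, so perturbing x in place is what moves the loss
        x[idx] = tmp + delta_x
        fx_plus = f(x)
        x[idx] = tmp - delta_x
        fx_minus = f(x)
        derivative[idx] = (fx_plus - fx_minus) / (2 * delta_x)
        x[idx] = tmp   # restore the original value
        it.iternext()
    return derivative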

< Min-Max Normalization Example - Multiple Independent Variables >
import numpy as np
import pandas as pd
from my_library.machine_learning_library import numerical_derivative
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

# Raw Data Loading
df = pd.read_csv('./data/ozone.csv')

# Drop missing values
training_data = df.dropna(how='any')

# Outlier handling
zscore_threshold = 2.0

# Detect and remove the outliers in each column
for col in training_data.columns:
    outliers = training_data[col][np.abs(stats.zscore(training_data[col])) > zscore_threshold]
    training_data = training_data.loc[~training_data[col].isin(outliers)]

# Min-Max Normalization using sklearn
# Create separate scaler objects for the independent and dependent variables
scaler_x = MinMaxScaler()
scaler_t = MinMaxScaler()
training_data_x = training_data.iloc[:,1:4]   # Solar.R, Wind, Temp
training_data_t = training_data['Ozone'].values.reshape(-1,1)
scaler_x.fit(training_data_x)
scaler_t.fit(training_data_t)
training_data_x = scaler_x.transform(training_data_x)
training_data_t = scaler_t.transform(training_data_t)

# Training Data Set
x_data = training_data_x
t_data = training_data_t

# Weight & bias
W = np.random.rand(3,1)
b = np.random.rand(1)

# loss function (mean squared error)
def loss_func(x,t):
    y = np.dot(x,W) + b
    return np.mean(np.power((t-y),2))

# predict
def predict(x):
    return np.dot(x,W) + b

# learning rate
learning_rate = 1e-4

f = lambda x : loss_func(x_data,t_data)

# Training
for step in range(300000):
    W -= learning_rate * numerical_derivative(f,W)
    b -= learning_rate * numerical_derivative(f,b)
    if step % 30000 == 0:
        print('W : {}, b : {}, loss = {}'.format(W,b,loss_func(x_data,t_data)))

# Prediction - scale the raw input (Solar.R=200, Wind=10, Temp=70) with
# scaler_x, then convert the scaled output back to the Ozone scale
predict_val = scaler_x.transform(np.array([[200, 10, 70]]))
print(scaler_t.inverse_transform(predict(predict_val)))
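
As a sanity check (not part of the original code), the same scaled data can also be fitted with sklearn's LinearRegression, which solves the problem in closed form; its prediction should land close to the gradient-descent result above:

from sklearn.linear_model import LinearRegression

# fit on the already-scaled training data
model = LinearRegression()
model.fit(training_data_x, training_data_t)

# scale the same raw input and map the prediction back to the Ozone scale
scaled_input = scaler_x.transform(np.array([[200, 10, 70]]))
print(scaler_t.inverse_transform(model.predict(scaled_input)))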