Data Analysis / ML / Normalization

< Normalization >

※ The process of rescaling a dataset's features when their scales differ severely from one another

ex) Suppose we have data about houses:

  • Number of rooms: {1, 2, 3, ..., 20} - the values span a small numeric range
  • Age of the house (months): {12, 24, ..., 240} - the values span a much larger range
  • Each feature needs to be brought onto the same scale

1. Min-Max Normalization

x' = (x - min(x)) / (max(x) - min(x))

  • Advantage: every feature is scaled to the same range, [0, 1] (see the sketch after this list)
  • Disadvantage: quite sensitive to outliers, since the min and max alone define the range
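
A minimal sketch of the formula applied to the hypothetical house features above (the numbers are illustrative, not real data); note how a single extreme value compresses all the normal values:

import numpy as np

rooms = np.array([1, 5, 10, 20], dtype=float)
age_months = np.array([12, 60, 120, 240], dtype=float)

def min_max(x):
    # x' = (x - min(x)) / (max(x) - min(x))
    return (x - x.min()) / (x.max() - x.min())

print(min_max(rooms))       # both features now lie in [0, 1]
print(min_max(age_months))

# a single outlier stretches the denominator and squeezes everything else
with_outlier = np.array([1, 5, 10, 20, 1000], dtype=float)
print(min_max(with_outlier))  # the normal values crowd near 0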

2. Z-Score Normalization

z = (x - mean(x)) / std(x)

  • In the formula, the numerator is the deviation from the mean and the denominator is the standard deviation
  • Advantage: relatively less sensitive to outliers (see the sketch after this list)
  • Disadvantage: features are not scaled into one identical, bounded range
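
A minimal sketch of z-score normalization on the same illustrative values, computed once by hand and once with scipy.stats.zscore (the same function the examples below use for outlier detection):

import numpy as np
from scipy import stats

x = np.array([1, 5, 10, 20, 1000], dtype=float)

# z = (x - mean(x)) / std(x)
z_manual = (x - x.mean()) / x.std()
z_scipy = stats.zscore(x)

print(z_manual)
print(z_scipy)                          # matches the manual version
print(z_manual.min(), z_manual.max())   # no fixed range such as [0, 1]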

< Min-Max Normalization Example - One Independent Variable >

※ Uses MinMaxScaler from the sklearn library
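
Before the full example, a tiny demonstration of the MinMaxScaler fit / transform / inverse_transform round trip that the prediction step below relies on (the numbers are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data = np.array([[12], [60], [120], [240]], dtype=float)

scaler.fit(data)                         # learns the column's min and max
scaled = scaler.transform(data)          # maps values into [0, 1]
print(scaled)
print(scaler.inverse_transform(scaled))  # recovers the original values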

import numpy as np
import pandas as pd
from my_library.machine_learning_library import numerical_derivative
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import MinMaxScaler
# Raw Data Loading
df = pd.read_csv('./data/ozone.csv')
training_data = df[['Temp','Ozone']]
training_data = training_data.dropna(how='any')
# Outlier handling
zscore_threshold = 2.0
# Remove outliers in Temp
outliers = training_data['Temp'][np.abs(stats.zscore(training_data['Temp'])) > zscore_threshold]
training_data = training_data.loc[~training_data['Temp'].isin(outliers)]
# Remove outliers in Ozone
outliers = training_data['Ozone'][np.abs(stats.zscore(training_data['Ozone'])) > zscore_threshold]
training_data = training_data.loc[~training_data['Ozone'].isin(outliers)]
# Normalization with sklearn
# using the Min-Max Normalization technique
# create separate scaler objects for the independent and dependent variables
scaler_x = MinMaxScaler()
scaler_t = MinMaxScaler()
scaler_x.fit(training_data['Temp'].values.reshape(-1,1))
scaler_t.fit(training_data['Ozone'].values.reshape(-1,1))
training_data['Temp'] = scaler_x.transform(training_data['Temp'].values.reshape(-1,1))
training_data['Ozone'] = scaler_t.transform(training_data['Ozone'].values.reshape(-1,1))
# Training Data Set
x_data = training_data['Temp'].values.reshape(-1,1)
t_data = training_data['Ozone'].values.reshape(-1,1)
# Weight & bias
W = np.random.rand(1,1)
b = np.random.rand(1)
# loss function (mean squared error)
def loss_func(x, t):
    y = np.dot(x, W) + b
    return np.mean(np.power((t - y), 2))

# predict
def predict(x):
    return np.dot(x, W) + b
# learning_rate
learning_rate = 1e-4
f = lambda x : loss_func(x_data, t_data)  # f ignores x; numerical_derivative perturbs W and b in place
# Training loop
for step in range(300000):
    W -= learning_rate * numerical_derivative(f, W)
    b -= learning_rate * numerical_derivative(f, b)

    if step % 30000 == 0:
        print('W : {}, b : {}, loss : {}'.format(W, b, loss_func(x_data, t_data)))
# Prediction: the model was trained on scaled data, so scale the input first
# and map the predicted value back to the original Ozone scale
scaled_input = scaler_x.transform(np.array([[62]]))
print(scaler_t.inverse_transform(predict(scaled_input)))
# Plot the scaled data and the fitted line
plt.scatter(x_data, t_data)
plt.plot(x_data, np.dot(x_data, W) + b, color='r')
plt.show()
### Results ###
# W : [[0.18840095]], b : [0.92278829], loss : 0.5016993013567107
# W : [[0.1231764]], b : [0.29590244], loss : 0.04989706221494551
# W : [[0.28037741]], b : [0.21268213], loss : 0.03928407322680011
# W : [[0.39961755]], b : [0.14981081], loss : 0.03318859804228349
# W : [[0.48998454]], b : [0.10216346], loss : 0.029687680741445198
# W : [[0.55846978]], b : [0.06605359], loss : 0.02767693972868662
# W : [[0.61037179]], b : [0.03868747], loss : 0.026522076853342427
# W : [[0.64970607]], b : [0.01794788], loss : 0.0258587849363896
# W : [[0.67951583]], b : [0.00223024], loss : 0.02547782527746574
# W : [[0.70210736]], b : [-0.00968148], loss : 0.02525902227682894
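
The examples import numerical_derivative from a personal module (my_library.machine_learning_library) whose source is not shown. A minimal sketch of what such a helper typically looks like, assuming a central-difference scheme applied to each element of the parameter array (the signature is inferred from how it is called above):

import numpy as np

def numerical_derivative(f, x):
    # central difference per element: df/dx ~ (f(x + h) - f(x - h)) / (2h)
    delta_x = 1e-4
    derivative = np.zeros_like(x)

    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        tmp = x[idx]

        x[idx] = tmp + delta_x
        fx_plus = f(x)

        x[idx] = tmp - delta_x
        fx_minus = f(x)

        derivative[idx] = (fx_plus - fx_minus) / (2 * delta_x)
        x[idx] = tmp  # restore the original value
        it.iternext()

    return derivative

Note that the lambda f in the training loop ignores its argument; differentiation still works because a helper like this perturbs the entries of W and b in place before each call to f.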

(Figure: scatter plot of the scaled data with the fitted Min-Max Normalization regression line)

< Min-Max Normalization Example - Multiple Independent Variables >

import numpy as np
import pandas as pd
from my_library.machine_learning_library import numerical_derivative
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import MinMaxScaler
# Raw Data Loading
df = pd.read_csv('./data/ozone.csv')
# Remove missing values
training_data = df.dropna(how='any')
# Outlier handling
zscore_threshold = 2.0
# Detect and remove outliers in each column
for col in training_data.columns:
    outliers = training_data[col][np.abs(stats.zscore(training_data[col])) > zscore_threshold]
    training_data = training_data.loc[~training_data[col].isin(outliers)]
# Min-Max Normalization with sklearn
# create separate scaler objects for the independent and dependent variables
scaler_x = MinMaxScaler()
scaler_t = MinMaxScaler()
training_data_x = training_data.iloc[:,1:4]
training_data_t = training_data['Ozone'].values.reshape(-1,1)
scaler_x.fit(training_data_x)
scaler_t.fit(training_data_t)
training_data_x = scaler_x.transform(training_data_x)
training_data_t = scaler_t.transform(training_data_t)
# Training Data Set
x_data = training_data_x
t_data = training_data_t
# Weight & bias
W = np.random.rand(3,1)
b = np.random.rand(1)
# loss function (mean squared error)
def loss_func(x, t):
    y = np.dot(x, W) + b
    return np.mean(np.power((t - y), 2))

# predict
def predict(x):
    return np.dot(x, W) + b
# learning_rate
learning_rate = 1e-4
f = lambda x : loss_func(x_data, t_data)  # f ignores x; numerical_derivative perturbs W and b in place
# Training loop
for step in range(300000):
    W -= learning_rate * numerical_derivative(f, W)
    b -= learning_rate * numerical_derivative(f, b)

    if step % 30000 == 0:
        print('W : {}, b : {}, loss : {}'.format(W, b, loss_func(x_data, t_data)))
# Prediction: scale the raw input ([Solar.R, Wind, Temp] = [200, 10, 70]) first,
# then map the predicted value back to the original Ozone scale
scaled_input = scaler_x.transform(np.array([[200, 10, 70]]))
print(scaler_t.inverse_transform(predict(scaled_input)))
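
For reference, the same scaled data can be fit with sklearn's LinearRegression instead of hand-written gradient descent; a minimal sketch reusing x_data, t_data, and the scalers from the example above:

from sklearn.linear_model import LinearRegression

# fit on the already-scaled training data
model = LinearRegression()
model.fit(x_data, t_data)
print('W : {}, b : {}'.format(model.coef_, model.intercept_))

# predict the same point: scale the input, inverse-transform the output
scaled_input = scaler_x.transform(np.array([[200, 10, 70]]))
print(scaler_t.inverse_transform(model.predict(scaled_input)))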