Exploration 9: Following Along with Kaggle - House Price Prediction

AIFFEL

볼링치는 AI개발자 2021. 2. 2. 17:44

Following Along with Kaggle

conda install -c conda-forge xgboost
conda install -c conda-forge lightgbm
conda install -c conda-forge missingno

In [1]:

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [3]:

# Import the required libraries
import warnings
warnings.filterwarnings("ignore")

import os
from os.path import join

import pandas as pd
import numpy as np

import missingno as msno

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
import xgboost as xgb
import lightgbm as lgb

import matplotlib.pyplot as plt
import seaborn as sns

print('얍💢')

얍💢

In [4]:

# Set the data paths
data_dir = os.getenv('HOME')+'/aiffel/kaggle_kakr_housing/data'

train_data_path = join(data_dir, 'train.csv')
sub_data_path = join(data_dir, 'test.csv')      # Path to the test data, i.e. the data used for the submission

print(train_data_path)
print(sub_data_path)

/home/ssac24/aiffel/kaggle_kakr_housing/data/train.csv
/home/ssac24/aiffel/kaggle_kakr_housing/data/test.csv

In [5]:

# Load the data
data = pd.read_csv(train_data_path)
sub = pd.read_csv(sub_data_path)
print('train data dim : {}'.format(data.shape))
print('sub data dim : {}'.format(sub.shape))

train data dim : (15035, 21)
sub data dim : (6468, 20)

In [6]:

y = data['price']
del data['price']

print(data.columns)

Index(['id', 'date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [7]:

# Concatenate the training and test data
train_len = len(data)
data = pd.concat((data, sub), axis=0)

print(len(data))

In [8]:

data.head()

Out[8]:

	id	date	bedrooms	bathrooms	sqft_living	sqft_lot	floors	condition	grade	sqft_above	yr_built	zipcode	lat	long	sqft_living15	sqft_lot15
0	0	20141013T000000	3	1.00	1180	5650	1.0	3	7	1180	1955	98178	47.5112	-122.257	1340	5650
1	1	20150225T000000	2	1.00	770	10000	1.0	3	6	770	1933	98028	47.7379	-122.233	2720	8062
2	2	20150218T000000	3	2.00	1680	8080	1.0	3	8	1680	1987	98074	47.6168	-122.045	1800	7503
3	3	20140627T000000	3	2.25	1715	6819	2.0	3	7	1715	1995	98003	47.3097	-122.327	2238	6819
4	4	20150115T000000	3	1.50	1060	9711	1.0	3	7	1060	1963	98198	47.4095	-122.315	1650	9711

Preprocessing

In [9]:

# Visualize where values are missing
msno.matrix(data)

Out[9]:

<AxesSubplot:>

In [11]:

# Count the missing values in each column
for c in data.columns:
    print('{} : {}'.format(c, len(data.loc[pd.isnull(data[c]), c].values)))

id : 0
date : 0
bedrooms : 0
bathrooms : 0
sqft_living : 0
sqft_lot : 0
floors : 0
waterfront : 0
view : 0
condition : 0
grade : 0
sqft_above : 0
sqft_basement : 0
yr_built : 0
yr_renovated : 0
zipcode : 0
lat : 0
long : 0
sqft_living15 : 0
sqft_lot15 : 0
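The per-column count above can also be written with pandas' built-in `isnull().sum()`. The following is a minimal sketch on a made-up frame (the `toy` data is hypothetical, not the competition data), showing what the count looks like when values really are missing.

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing entries (hypothetical data).
toy = pd.DataFrame({
    "bedrooms": [3, 2, np.nan, 4],
    "sqft_living": [1180, np.nan, np.nan, 1715],
    "grade": [7, 6, 8, 7],
})

# isnull() marks each missing cell; sum() counts them per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

On the housing data every count is 0, so no imputation step is needed here.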

In [12]:

# Set aside the test ids for the submission, then drop the id column
sub_id = data['id'][train_len:]
del data['id']

print(data.columns)

Index(['date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [13]:

# Clean up date: keep only the first six characters (YYYYMM), i.e. year and month
data['date'] = data['date'].apply(lambda x : str(x[:6]))

data.head()

Out[13]:

	date	bedrooms	bathrooms	sqft_living	sqft_lot	floors	condition	grade	sqft_above	yr_built	zipcode	lat	long	sqft_living15	sqft_lot15
0	201410	3	1.00	1180	5650	1.0	3	7	1180	1955	98178	47.5112	-122.257	1340	5650
1	201502	2	1.00	770	10000	1.0	3	6	770	1933	98028	47.7379	-122.233	2720	8062
2	201502	3	2.00	1680	8080	1.0	3	8	1680	1987	98074	47.6168	-122.045	1800	7503
3	201406	3	2.25	1715	6819	2.0	3	7	1715	1995	98003	47.3097	-122.327	2238	6819
4	201501	3	1.50	1060	9711	1.0	3	7	1060	1963	98198	47.4095	-122.315	1650	9711
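Slicing the string keeps year and month together as one value. As an alternative sketch, assuming the `YYYYMMDDTHHMMSS` layout seen in the table above, the strings can be parsed with `pd.to_datetime` so that year and month become separate numeric features. The variable names below are illustrative only.

```python
import pandas as pd

# Two date strings in the same layout as the 'date' column.
dates = pd.Series(["20141013T000000", "20150225T000000"])

# Parse with an explicit format, then pull out the parts.
parsed = pd.to_datetime(dates, format="%Y%m%dT%H%M%S")
year = parsed.dt.year
month = parsed.dt.month

# The slicing approach from the post, converted to an integer.
yyyymm = dates.str[:6].astype(int)

print(list(year), list(month), list(yyyymm))
```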

In [14]:

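fig, ax = plt.subplots(9, 2, figsize=(12, 50))   # If horizontal scrolling makes the plots hard to read, adjust the x value of figsize.

# id has already been dropped, so index 0 is date; skip it (start at count = 1) and plot the other 18 columns.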
count = 1
columns = data.columns
for row in range(9):
    for col in range(2):
        sns.kdeplot(data[columns[count]], ax=ax[row][col])
        ax[row][col].set_title(columns[count], fontsize=15)
        count += 1
        if count == 19 :
            break

In [15]:

skew_columns = ['bedrooms', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement']

for c in skew_columns:
    data[c] = np.log1p(data[c].values)

print('얍💢')

얍💢
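To see why `np.log1p` helps, here is a small sketch on synthetic data. The log-normal samples are an assumption standing in for a right-skewed column such as `sqft_lot`; they are not taken from the dataset. The skewness drops sharply after the transform, and `np.expm1` undoes it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Log-normal samples stand in for a right-skewed feature (hypothetical data).
skewed = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=5000))

# log1p(x) = log(1 + x); it is safe at x = 0, unlike a plain log.
transformed = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = transformed.skew()

# expm1 is the exact inverse of log1p.
restored = np.expm1(transformed)

print(skew_before, skew_after)
```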

In [16]:

fig, ax = plt.subplots(3, 2, figsize=(12, 15))

count = 0
for row in range(3):
    for col in range(2):
        if count == 5:
            break
        sns.kdeplot(data[skew_columns[count]], ax=ax[row][col])
        ax[row][col].set_title(skew_columns[count], fontsize=15)
        count += 1

In [17]:

sns.kdeplot(y)
plt.show()

In [18]:

# Log-transform the target so that it is no longer piled up near 0
y_log_transformation = np.log1p(y)

sns.kdeplot(y_log_transformation)
plt.show()

In [19]:

# Split the combined data back into training and test sets
sub = data.iloc[train_len:, :]
x = data.iloc[:train_len, :]

print(x.shape)
print(sub.shape)

(15035, 19)
(6468, 19)
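The concatenate, preprocess, split pattern used above can be sketched on toy frames. The frames and the `sqft_living_x2` column are hypothetical. Note that `pd.concat` keeps the original index labels, so the labels repeat and the split has to be positional (`iloc`), as in the post.

```python
import pandas as pd

train_toy = pd.DataFrame({"sqft_living": [1180, 770, 1680]})
test_toy = pd.DataFrame({"sqft_living": [1715, 1060]})

# Remember where the training rows end before stacking the frames.
train_len = len(train_toy)
combined = pd.concat((train_toy, test_toy), axis=0)

# Any preprocessing applied here affects both parts identically.
combined["sqft_living_x2"] = combined["sqft_living"] * 2

# Positional slicing recovers the two parts.
x_part = combined.iloc[:train_len, :]
sub_part = combined.iloc[train_len:, :]

print(x_part.shape, sub_part.shape, list(sub_part.index))
```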

In [21]:

# Build the models for average blending
gboost = GradientBoostingRegressor(random_state=2019)
xgboost = xgb.XGBRegressor(random_state=2019)
lightgbm = lgb.LGBMRegressor(random_state=2019)

models = [{'model':gboost, 'name':'GradientBoosting'}, {'model':xgboost, 'name':'XGBoost'},
          {'model':lightgbm, 'name':'LightGBM'}]

print('얍💢')

얍💢

In [22]:

# Cross-validation
def get_cv_score(models):
    # Unshuffled 5-fold split; random_state only applies when shuffle=True
    kfold = KFold(n_splits=5)
    for m in models:
        # cross_val_score uses the regressor's default metric (R^2)
        scores = cross_val_score(m['model'], x.values, y, cv=kfold)
        print("Model {} CV score : {:.4f}".format(m['name'], np.mean(scores)))
print('얍💢')

얍💢

In [23]:

get_cv_score(models)

Model GradientBoosting CV score : 0.8597
Model XGBoost CV score : 0.8861
Model LightGBM CV score : 0.8819
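The scores above are R^2 values, because `cross_val_score` falls back to the estimator's own `score` method when no `scoring` is given. The sketch below uses synthetic data from `make_regression` and a single scikit-learn model (both are assumptions for illustration) to show how to pass a shuffled `KFold` and how to request a different metric.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical, not the housing data).
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = GradientBoostingRegressor(n_estimators=50, random_state=0)

# random_state is only meaningful together with shuffle=True.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Default scoring for a regressor: R^2, one value per fold.
r2_scores = cross_val_score(model, X_toy, y_toy, cv=kfold)

# Error metrics are exposed as negated scores so that higher is still better.
neg_mse = cross_val_score(model, X_toy, y_toy, cv=kfold, scoring="neg_mean_squared_error")

print(r2_scores.mean(), neg_mse.mean())
```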

The baseline defines an AveragingBlending() function that takes several models and averages their predictions. It fits every model in the models list on x and y, collects each model's predictions for the submission data in predictions, and returns their mean.

In [29]:

def AveragingBlending(models, x, y, sub_x):
    for m in models : 
        m['model'].fit(x.values, y)
    
    predictions = np.column_stack([
        m['model'].predict(sub_x.values) for m in models
    ])
    return np.mean(predictions, axis=1)

print('얍💢')

얍💢

In [30]:

y_pred = AveragingBlending(models, x, y, sub)
print(len(y_pred))
y_pred

Out[30]:

array([ 529966.66304912,  430726.21272617, 1361676.91242777, ...,
        452081.69137012,  341572.97685942,  421725.1231835 ])

In [31]:

data_dir = os.getenv('HOME')+'/aiffel/kaggle_kakr_housing/data'

submission_path = join(data_dir, 'sample_submission.csv')
submission = pd.read_csv(submission_path)
submission.head()

Out[31]:

	id	price
0	15035	100000
1	15036	100000
2	15037	100000
3	15038	100000
4	15039	100000

In [32]:

result = pd.DataFrame({
    'id' : sub_id, 
    'price' : y_pred
})

result.head()

Out[32]:

	id	price
0	15035	5.299667e+05
1	15036	4.307262e+05
2	15037	1.361677e+06
3	15038	3.338036e+05
4	15039	3.089006e+05

In [33]:

my_submission_path = join(data_dir, 'submission.csv')
result.to_csv(my_submission_path, index=False)

print(my_submission_path)

/home/ssac24/aiffel/kaggle_kakr_housing/data/submission.csv

Improving the Ranking

In [136]:

data_dir = os.getenv('HOME')+'/aiffel/kaggle_kakr_housing/data'

train_data_path = join(data_dir, 'train.csv')
test_data_path = join(data_dir, 'test.csv') 

train = pd.read_csv(train_data_path)
test = pd.read_csv(test_data_path)

print('얍💢')

얍💢

In [137]:

train.head()

Out[137]:

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	...	grade	sqft_above	yr_built	zipcode	lat	long	sqft_living15	sqft_lot15
0	0	20141013T000000	221900.0	3	1.00	1180	5650	1.0	...	7	1180	1955	98178	47.5112	-122.257	1340	5650
1	1	20150225T000000	180000.0	2	1.00	770	10000	1.0	...	6	770	1933	98028	47.7379	-122.233	2720	8062
2	2	20150218T000000	510000.0	3	2.00	1680	8080	1.0	...	8	1680	1987	98074	47.6168	-122.045	1800	7503
3	3	20140627T000000	257500.0	3	2.25	1715	6819	2.0	...	7	1715	1995	98003	47.3097	-122.327	2238	6819
4	4	20150115T000000	291850.0	3	1.50	1060	9711	1.0	...	7	1060	1963	98198	47.4095	-122.315	1650	9711

5 rows × 21 columns

In [138]:

# Convert date to an integer (YYYYMM)
train['date'] = train['date'].apply(lambda i: i[:6]).astype(int)
train.head()

Out[138]:

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	...	grade	sqft_above	yr_built	zipcode	lat	long	sqft_living15	sqft_lot15
0	0	201410	221900.0	3	1.00	1180	5650	1.0	...	7	1180	1955	98178	47.5112	-122.257	1340	5650
1	1	201502	180000.0	2	1.00	770	10000	1.0	...	6	770	1933	98028	47.7379	-122.233	2720	8062
2	2	201502	510000.0	3	2.00	1680	8080	1.0	...	8	1680	1987	98074	47.6168	-122.045	1800	7503
3	3	201406	257500.0	3	2.25	1715	6819	2.0	...	7	1715	1995	98003	47.3097	-122.327	2238	6819
4	4	201501	291850.0	3	1.50	1060	9711	1.0	...	7	1060	1963	98198	47.4095	-122.315	1650	9711

5 rows × 21 columns

In [139]:

y = train['price']
del train['price']

print(train.columns)

Index(['id', 'date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [140]:

del train['id']

print(train.columns)

Index(['date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [141]:

test['date'] = test['date'].apply(lambda i: i[:6]).astype(int)

del test['id']

print(test.columns)

Index(['date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [142]:

# Check the distribution of y
sns.kdeplot(y)
plt.show()

In [143]:

y = np.log1p(y)
y

Out[143]:

0        12.309987
1        12.100718
2        13.142168
3        12.458779
4        12.583999
           ...    
15030    13.322338
15031    13.822984
15032    12.793862
15033    12.899222
15034    12.691584
Name: price, Length: 15035, dtype: float64

In [144]:

sns.kdeplot(y)
plt.show()

In [145]:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15035 entries, 0 to 15034
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           15035 non-null  int64  
 1   bedrooms       15035 non-null  int64  
 2   bathrooms      15035 non-null  float64
 3   sqft_living    15035 non-null  int64  
 4   sqft_lot       15035 non-null  int64  
 5   floors         15035 non-null  float64
 6   waterfront     15035 non-null  int64  
 7   view           15035 non-null  int64  
 8   condition      15035 non-null  int64  
 9   grade          15035 non-null  int64  
 10  sqft_above     15035 non-null  int64  
 11  sqft_basement  15035 non-null  int64  
 12  yr_built       15035 non-null  int64  
 13  yr_renovated   15035 non-null  int64  
 14  zipcode        15035 non-null  int64  
 15  lat            15035 non-null  float64
 16  long           15035 non-null  float64
 17  sqft_living15  15035 non-null  int64  
 18  sqft_lot15     15035 non-null  int64  
dtypes: float64(4), int64(15)
memory usage: 2.2 MB

In [146]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

print('얍💢')

얍💢

In [147]:

# RMSE calculation
# y_test and y_pred were transformed with np.log1p() above, so np.expm1() is applied to bring them back to the original price units
def rmse(y_test, y_pred):
    return np.sqrt(mean_squared_error(np.expm1(y_test), np.expm1(y_pred)))

print('얍💢')

얍💢
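The difference between an error on the log scale and an error in price units can be checked on a few numbers. In this sketch the true prices are the first three from the training table and the predictions are made up. RMSE computed directly on the `log1p` values is the RMSLE of the prices, while undoing the transform first, as `rmse()` does, gives an error in price units.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

prices_true = np.array([221900.0, 180000.0, 510000.0])
prices_pred = np.array([250000.0, 170000.0, 480000.0])  # hypothetical predictions

# What the model sees after the target transform.
log_true = np.log1p(prices_true)
log_pred = np.log1p(prices_pred)

# RMSE on the log1p scale is the RMSLE of the original prices.
rmsle = np.sqrt(mean_squared_error(log_true, log_pred))

# Undoing the transform first gives RMSE in price units.
rmse_price = np.sqrt(mean_squared_error(np.expm1(log_true), np.expm1(log_pred)))

print(rmsle, rmse_price)
```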

In [148]:

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

print('얍💢')

얍💢

In [149]:

# random_state is the random seed used for model initialization and dataset splitting.
#random_state=None    # This is the default. If nothing is specified and None is passed, the model picks a seed internally.
random_state=2020        # Here we fix the value instead.

gboost = GradientBoostingRegressor(random_state=random_state)
xgboost = XGBRegressor(random_state=random_state)
lightgbm = LGBMRegressor(random_state=random_state)
rdforest = RandomForestRegressor(random_state=random_state)

models = [gboost, xgboost, lightgbm, rdforest]

print('얍💢')

얍💢

In [150]:

gboost.__class__.__name__

Out[150]:

'GradientBoostingRegressor'

In [151]:

df = {}

for model in models:
    # Get the model name
    model_name = model.__class__.__name__

    # Split into train and test sets; random_state is fixed here as well
    X_train, X_test, y_train, y_test = train_test_split(train, y, random_state=random_state, test_size=0.2)

    # Train the model
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

    # Store the RMSE of the predictions
    df[model_name] = rmse(y_test, y_pred)

    # Store the scores in a DataFrame
    score_df = pd.DataFrame(df, index=['RMSE']).T.sort_values('RMSE', ascending=False)
    
df

Out[151]:

{'GradientBoostingRegressor': 128360.19649691365,
 'XGBRegressor': 110318.66956616656,
 'LGBMRegressor': 111920.36735892233,
 'RandomForestRegressor': 125487.07102453562}

In [152]:

def get_scores(models, train, y):
    df = {}

    # The split does not depend on the model, so compute it once
    X_train, X_test, y_train, y_test = train_test_split(train, y, random_state=random_state, test_size=0.2)

    for model in models:
        model_name = model.__class__.__name__

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        df[model_name] = rmse(y_test, y_pred)

    # Build the score table once all models have been evaluated
    score_df = pd.DataFrame(df, index=['RMSE']).T.sort_values('RMSE', ascending=False)

    return score_df

In [153]:

# Grid search
from sklearn.model_selection import GridSearchCV

print('얍💢')

얍💢

In [157]:

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [1, 10],
}

In [158]:

model = LGBMRegressor(random_state=random_state)

print('얍💢')

얍💢

In [160]:

grid_model = GridSearchCV(model, param_grid=param_grid, \
                        scoring='neg_mean_squared_error', \
                        cv=5, verbose=1, n_jobs=5)

grid_model.fit(train, y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits

[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done  20 out of  20 | elapsed:    0.6s finished

Out[160]:

GridSearchCV(cv=5, estimator=LGBMRegressor(random_state=2020), n_jobs=5,
             param_grid={'max_depth': [1, 10], 'n_estimators': [50, 100]},
             scoring='neg_mean_squared_error', verbose=1)

In [162]:

grid_model.cv_results_

Out[162]:

{'mean_fit_time': array([0.05505657, 0.06682   , 0.13093429, 0.19804134]),
 'std_fit_time': array([0.00156011, 0.00166798, 0.00277595, 0.01625213]),
 'mean_score_time': array([0.00485344, 0.00585899, 0.00943289, 0.01416516]),
 'std_score_time': array([0.00025748, 0.00040021, 0.00084021, 0.00215965]),
 'param_max_depth': masked_array(data=[1, 1, 10, 10],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_n_estimators': masked_array(data=[50, 100, 50, 100],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'max_depth': 1, 'n_estimators': 50},
  {'max_depth': 1, 'n_estimators': 100},
  {'max_depth': 10, 'n_estimators': 50},
  {'max_depth': 10, 'n_estimators': 100}],
 'split0_test_score': array([-0.0756974 , -0.05555652, -0.02885847, -0.02665428]),
 'split1_test_score': array([-0.07666447, -0.057876  , -0.03041465, -0.02795896]),
 'split2_test_score': array([-0.07354904, -0.05546079, -0.03068533, -0.02834112]),
 'split3_test_score': array([-0.07510863, -0.05582109, -0.02987609, -0.02774809]),
 'split4_test_score': array([-0.06595281, -0.05038773, -0.02605217, -0.02443328]),
 'mean_test_score': array([-0.07339447, -0.05502043, -0.02917734, -0.02702714]),
 'std_test_score': array([0.00385583, 0.00247946, 0.00168295, 0.00141292]),
 'rank_test_score': array([4, 3, 2, 1], dtype=int32)}

In [163]:

params = grid_model.cv_results_['params']
params

Out[163]:

[{'max_depth': 1, 'n_estimators': 50},
 {'max_depth': 1, 'n_estimators': 100},
 {'max_depth': 10, 'n_estimators': 50},
 {'max_depth': 10, 'n_estimators': 100}]

In [164]:

score = grid_model.cv_results_['mean_test_score']
score

Out[164]:

array([-0.07339447, -0.05502043, -0.02917734, -0.02702714])

In [165]:

results = pd.DataFrame(params)
results['score'] = score

results

Out[165]:

	max_depth	n_estimators	score
0	1	50	-0.073394
1	1	100	-0.055020
2	10	50	-0.029177
3	10	100	-0.027027

In [166]:

results['RMSE'] = np.sqrt(-1 * results['score'])
results

Out[166]:

	max_depth	n_estimators	score	RMSE
0	1	50	-0.073394	0.270914
1	1	100	-0.055020	0.234564
2	10	50	-0.029177	0.170814
3	10	100	-0.027027	0.164399

In [167]:

results = results.rename(columns={'RMSE': 'RMSLE'})
results

Out[167]:

	max_depth	n_estimators	score	RMSLE
0	1	50	-0.073394	0.270914
1	1	100	-0.055020	0.234564
2	10	50	-0.029177	0.170814
3	10	100	-0.027027	0.164399

In [168]:

results = results.sort_values('RMSLE')
results

Out[168]:

	max_depth	n_estimators	score	RMSLE
3	10	100	-0.027027	0.164399
2	10	50	-0.029177	0.170814
1	1	100	-0.055020	0.234564
0	1	50	-0.073394	0.270914
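The RMSLE column follows from the scoring choice: `neg_mean_squared_error` is the negated MSE, and because `y` was already passed through `np.log1p`, the square root of the MSE is an RMSLE. The sketch below reproduces the column from the `mean_test_score` values printed above. It also shows, for the best candidate, that the square root of the mean fold MSE is slightly larger than the mean of the per-fold RMSLE values.

```python
import numpy as np

# mean_test_score values copied from grid_model.cv_results_ above.
mean_test_score = np.array([-0.07339447, -0.05502043, -0.02917734, -0.02702714])

# Negate to get MSE on the log scale, then take the square root.
rmsle = np.sqrt(-mean_test_score)
best = int(np.argmin(rmsle))

# Per-fold scores of the best candidate (max_depth=10, n_estimators=100).
fold_scores = np.array([-0.02665428, -0.02795896, -0.02834112, -0.02774809, -0.02443328])
per_fold_rmsle = np.sqrt(-fold_scores)

print(rmsle, best, per_fold_rmsle.mean())
```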

In [169]:

"""
1. Initialize a GridSearchCV object with `model`.
2. Fit it.
3. Store the result for each combination in params and score.
4. Build a DataFrame, add the RMSLE values, and return `results` sorted so that the best combination (lowest RMSLE) comes first.
"""

def my_GridSearch(model, train, y, param_grid, verbose=3, n_jobs=5):
    # Initialize GridSearchCV with the model
    grid_model = GridSearchCV(model, param_grid=param_grid, scoring='neg_mean_squared_error', \
                              cv=5, verbose=verbose, n_jobs=n_jobs)

    # Fit the model
    grid_model.fit(train, y)

    # Collect the results
    params = grid_model.cv_results_['params']
    score = grid_model.cv_results_['mean_test_score']

    # Build the DataFrame
    results = pd.DataFrame(params)
    results['score'] = score

    # Compute RMSLE and sort by it
    results['RMSLE'] = np.sqrt(-1 * results['score'])
    results = results.sort_values('RMSLE')

    return results

In [170]:

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [1, 10],
}

model = LGBMRegressor(random_state=random_state)
my_GridSearch(model, train, y, param_grid, verbose=2, n_jobs=5)

Fitting 5 folds for each of 4 candidates, totalling 20 fits

[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done  20 out of  20 | elapsed:    0.5s finished

Out[170]:

	max_depth	n_estimators	score	RMSLE
3	10	100	-0.027027	0.164399
2	10	50	-0.029177	0.170814
1	1	100	-0.055020	0.234564
0	1	50	-0.073394	0.270914

In [171]:

model = LGBMRegressor(max_depth=10, n_estimators=100, random_state=random_state)
model.fit(train, y)
prediction = model.predict(test)
prediction

Out[171]:

array([13.13580793, 13.08051399, 14.11202067, ..., 13.01592878,
       12.69894979, 12.96297768])

In [172]:

# Convert back to the original scale
prediction = np.expm1(prediction)
prediction

Out[172]:

array([ 506766.66784595,  479506.10405112, 1345155.15609376, ...,
        449515.92243642,  327402.87855805,  426332.71354302])

In [173]:

data_dir = os.getenv('HOME')+'/aiffel/kaggle_kakr_housing/data'

submission_path = join(data_dir, 'sample_submission.csv')
submission = pd.read_csv(submission_path)
submission.head()

Out[173]:

	id	price
0	15035	100000
1	15036	100000
2	15037	100000
3	15038	100000
4	15039	100000

In [174]:

submission['price'] = prediction
submission.head()

Out[174]:

	id	price
0	15035	5.067667e+05
1	15036	4.795061e+05
2	15037	1.345155e+06
3	15038	3.122579e+05
4	15039	3.338645e+05

In [175]:

submission_csv_path = '{}/submission_{}_RMSLE_{}.csv'.format(data_dir, 'lgbm', '0.164399')
submission.to_csv(submission_csv_path, index=False)
print(submission_csv_path)

/home/ssac24/aiffel/kaggle_kakr_housing/data/submission_lgbm_RMSLE_0.164399.csv

In [176]:

# Train, predict, and save the result
"""
Implement a `save_submission(model, train, y, test, model_name, rmsle)` function that performs the steps below.
1. Train the model on `train` and `y`.
2. Predict on `test`.
3. Convert the predictions with `np.expm1` and save a `csv` file named like `submission_model_name_RMSLE_100000.csv`.
"""

def save_submission(model, train, y, test, model_name, rmsle=None):
    model.fit(train, y)
    prediction = model.predict(test)
    prediction = np.expm1(prediction)
    data_dir = os.getenv('HOME')+'/aiffel/kaggle_kakr_housing/data'
    submission_path = join(data_dir, 'sample_submission.csv')
    submission = pd.read_csv(submission_path)
    submission['price'] = prediction
    submission_csv_path = '{}/submission_{}_RMSLE_{}.csv'.format(data_dir, model_name, rmsle)
    submission.to_csv(submission_csv_path, index=False)
    print('{} saved!'.format(submission_csv_path))

In [177]:

save_submission(model, train, y, test, 'lgbm', rmsle='0.0168')

[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
/home/ssac24/aiffel/kaggle_kakr_housing/data/submission_lgbm_RMSLE_0.0168.csv saved!

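Grid Search fixes the hyperparameter values to explore in advance, like points on a grid. Once the candidate values are chosen by hand, it evaluates every combination that can be formed from them. It is useful when you want to cover all combinations of specific values.

Random Search only fixes the hyperparameter space and samples combinations from it at random. Because Grid Search evaluates only combinations of the hand-picked values, it can miss the best combination, whereas Random Search can land on values outside any fixed grid and so always has a chance of finding it. That chance depends on randomness, however, so there is no guarantee that it finds the optimum.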
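The post only runs Grid Search. For comparison, here is a minimal Random Search sketch with scikit-learn's `RandomizedSearchCV`. The synthetic data, the choice of `GradientBoostingRegressor`, and the parameter ranges are all assumptions made for illustration, not settings from the post.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (hypothetical).
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Distributions define the search space instead of fixed value lists.
param_distributions = {
    "n_estimators": randint(20, 101),  # integers from 20 to 100
    "max_depth": randint(1, 6),        # integers from 1 to 5
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                          # number of sampled combinations
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=0,                    # makes the sampling reproducible
)
search.fit(X_toy, y_toy)
best_params = search.best_params_

print(best_params)
```

The cost is set by `n_iter` rather than by the size of a grid, which is what makes the approach practical for large search spaces.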

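GridSearchCV takes the following arguments.

- param_grid : the parameters to search and their candidate values (passed as a dictionary)
- scoring : the metric used to evaluate model performance
- cv : the number of folds the training set is split into for cross-validation
- verbose : how much progress output to print during the search (larger numbers print more messages)
- n_jobs : the number of CPU cores to use during the search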
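The "Fitting 5 folds for each of 4 candidates, totalling 20 fits" message can be reproduced by enumerating the grid. This sketch uses scikit-learn's `ParameterGrid` with the same `param_grid` as the post; the variable names `candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [1, 10],
}

# Every combination of the listed values: 2 x 2 = 4 candidates.
candidates = list(ParameterGrid(param_grid))

# Each candidate is fitted once per fold (cv=5).
n_fits = len(candidates) * 5

for params in candidates:
    print(params)
print(n_fits)
```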

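The main LightGBM parameters are as follows.

- max_depth : depth of each decision tree, an integer
- learning_rate : how far each step moves, typically a real number between 0.0001 and 0.1
- n_estimators : number of individual models (trees), typically an integer of 50 to 100 or more
- num_leaves : maximum number of leaves a single LightGBM tree can have
- boosting_type : boosting method, given as a string such as gbdt or rf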
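The LightGBM warning printed earlier relates `max_depth` to `num_leaves`. As a rough sketch of the arithmetic, assuming binary trees, a tree of depth d has at most 2^d leaves, so whichever of the two limits is smaller decides the tree size. The helper `max_leaves_for_depth` is hypothetical and not part of LightGBM.

```python
def max_leaves_for_depth(max_depth: int) -> int:
    """Upper bound on leaves in a binary tree of the given depth."""
    return 2 ** max_depth


default_num_leaves = 31  # the value reported in the LightGBM warning

for depth in (1, 5, 10):
    cap = max_leaves_for_depth(depth)
    # The smaller of the two limits is the one that actually binds.
    effective = min(cap, default_num_leaves)
    print(depth, cap, effective)
```

With `max_depth=10` the depth allows up to 1024 leaves, so the default `num_leaves=31` is the binding limit, which is why tuning `num_leaves` together with `max_depth` may be worth trying.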

Summary

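Applying np.log1p() to a skewed variable spreads its values out and brings the distribution closer to normal.
msno.matrix(data) shows where values are missing.
Ensembling mixes several models to produce more accurate results.
- Voting picks the final result by majority vote over the classes predicted by several models; it is used for classification problems.
- Averaging takes the mean or weighted mean of the real values computed by each model; it is used for regression problems.
Columns can be renamed with pandas.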
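The two ensembling styles in the summary can be compared on small arrays. All numbers and the weights below are made up for illustration.

```python
import numpy as np

# Hypothetical class labels from three classifiers for four samples.
class_votes = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
])

# Voting: take the most frequent label for each sample (column).
voted = np.array([np.bincount(col).argmax() for col in class_votes.T])

# Hypothetical price predictions from three regressors for two samples.
reg_preds = np.array([
    [500000.0, 300000.0],
    [520000.0, 310000.0],
    [480000.0, 320000.0],
])

# Averaging: plain mean across models, as in AveragingBlending().
simple_avg = reg_preds.mean(axis=0)

# Weighted mean: models trusted more get larger weights (assumed values).
weights = [0.2, 0.5, 0.3]
weighted_avg = np.average(reg_preds, axis=0, weights=weights)

print(voted, simple_avg, weighted_avg)
```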
