AI Exploration 17: Predictive Recommendation System
Predictive Recommendation System¶
- mkdir -p ~/aiffel/yoochoose-data
- cd ~/aiffel
- wget https://aiffelstaticprd.blob.core.windows.net/media/documents/yoochoose-data.7z
- sudo apt install p7zip-full
- 7z x yoochoose-data.7z -oyoochoose-data
1. Session-Based Recommendation¶
In [1]:
# Read the dataset description (README).
import os
f = open(os.getenv('HOME')+'/aiffel/yoochoose-data/dataset-README.txt', 'r')
while True:
    line = f.readline()
    if not line: break
    print(line)
f.close()
SUMMARY
================================================================================
This dataset was constructed by YOOCHOOSE GmbH to support participants in the RecSys Challenge 2015. See http://recsys.yoochoose.net for details about the challenge.
The YOOCHOOSE dataset contain a collection of sessions from a retailer, where each session is encapsulating the click events that the user performed in the session. For some of the sessions, there are also buy events; means that the session ended with the user bought something from the web shop. The data was collected during several months in the year of 2014, reflecting the clicks and purchases performed by the users of an on-line retailer in Europe. To protect end users privacy, as well as the retailer, all numbers have been modified. Do not try to reveal the identity of the retailer.

LICENSE
================================================================================
This dataset is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
YOOCHOOSE cannot guarantee the completeness and correctness of the data or the validity of results based on the use of the dataset as it was collected by implicit tracking of a website. If you have any further questions or comments, please contact YooChoose <support@YooChoose.com>. The data is provided "as it is" and there is no obligation of YOOCHOOSE to correct it, improve it or to provide additional information about it.

CLICKS DATASET FILE DESCRIPTION
================================================================================
The file yoochoose-clicks.dat comprising the clicks of the users over the items.
Each record/line in the file has the following fields/format: Session ID, Timestamp, Item ID, Category
-Session ID – the id of the session. In one session there are one or many clicks. Could be represented as an integer number.
-Timestamp – the time when the click occurred. Format of YYYY-MM-DDThh:mm:ss.SSSZ
-Item ID – the unique identifier of the item that has been clicked. Could be represented as an integer number.
-Category – the context of the click. The value "S" indicates a special offer, "0" indicates a missing value, a number between 1 to 12 indicates a real category identifier, any other number indicates a brand. E.g. if an item has been clicked in the context of a promotion or special offer then the value will be "S", if the context was a brand i.e BOSCH, then the value will be an 8-10 digits number. If the item has been clicked under regular category, i.e. sport, then the value will be a number between 1 to 12.

BUYS DATSET FILE DESCRIPTION
================================================================================
The file yoochoose-buys.dat comprising the buy events of the users over the items.
Each record/line in the file has the following fields: Session ID, Timestamp, Item ID, Price, Quantity
-Session ID - the id of the session. In one session there are one or many buying events. Could be represented as an integer number.
-Timestamp - the time when the buy occurred. Format of YYYY-MM-DDThh:mm:ss.SSSZ
-Item ID – the unique identifier of item that has been bought. Could be represented as an integer number.
-Price – the price of the item. Could be represented as an integer number.
-Quantity – the quantity in this buying. Could be represented as an integer number.

TEST DATASET FILE DESCRIPTION
================================================================================
The file yoochoose-test.dat comprising only clicks of users over items. This file served as a test file in the RecSys challenge 2015. The structure is identical to the file yoochoose-clicks.dat but you will not find the corresponding buying events to these sessions in the yoochoose-buys.dat file.
2. Data Preprocessing¶
- pathlib is part of the Python standard library (3.4+), so no separate pip install is needed on modern Python.
Data Load¶
In [2]:
import datetime as dt
from pathlib import Path
import os
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
In [3]:
data_path = Path(os.getenv('HOME')+'/aiffel/yoochoose-data')
train_path = data_path / 'yoochoose-clicks.dat'
train_path
Out[3]:
PosixPath('/home/ssac24/aiffel/yoochoose-data/yoochoose-clicks.dat')
In [4]:
def load_data(data_path: Path, nrows=None):
    data = pd.read_csv(data_path, sep=',', header=None, usecols=[0, 1, 2],
                       parse_dates=[1], dtype={0: np.int32, 2: np.int32}, nrows=nrows)
    data.columns = ['SessionId', 'Time', 'ItemId']
    return data
In [5]:
# This can take a while and may use close to 10 GB of memory, so keep an eye on your memory usage.
data = load_data(train_path, None)
data.sort_values(['SessionId', 'Time'], inplace=True)  # sort the data by session id and time
data
Out[5]:
| | SessionId | Time | ItemId |
|---|---|---|---|
| 0 | 1 | 2014-04-07 10:51:09.277000+00:00 | 214536502 |
| 1 | 1 | 2014-04-07 10:54:09.868000+00:00 | 214536500 |
| 2 | 1 | 2014-04-07 10:54:46.998000+00:00 | 214536506 |
| 3 | 1 | 2014-04-07 10:57:00.306000+00:00 | 214577561 |
| 4 | 2 | 2014-04-07 13:56:37.614000+00:00 | 214662742 |
| ... | ... | ... | ... |
| 32230487 | 11562158 | 2014-09-26 04:50:29.172000+00:00 | 214849132 |
| 32230488 | 11562158 | 2014-09-26 04:52:21.900000+00:00 | 214854774 |
| 32230489 | 11562158 | 2014-09-26 05:16:32.904000+00:00 | 214849132 |
| 32230490 | 11562159 | 2014-09-26 19:16:28.897000+00:00 | 214849132 |
| 32230477 | 11562161 | 2014-09-26 20:45:42.791000+00:00 | 214546022 |
33003944 rows × 3 columns
In [6]:
data['SessionId'].nunique(), data['ItemId'].nunique()
Out[6]:
(9249729, 52739)
Session Length¶
- The number of data rows sharing a SessionId
- i.e., how many actions the user took (how many item pages they clicked) during that session
In [7]:
session_length = data.groupby('SessionId').size()
session_length
Out[7]:
SessionId
1           4
2           6
3           3
4           2
6           2
           ..
11562156    2
11562157    2
11562158    3
11562159    1
11562161    1
Length: 9249729, dtype: int64
In [8]:
session_length.median(), session_length.mean()
Out[8]:
(2.0, 3.568098481587947)
In [9]:
session_length.min(), session_length.max()
Out[9]:
(1, 200)
In [10]:
session_length.quantile(0.999)
Out[10]:
41.0
In [11]:
long_session = session_length[session_length==200].index[0]
data[data['SessionId']==long_session]
Out[11]:
| | SessionId | Time | ItemId |
|---|---|---|---|
| 580293 | 189448 | 2014-04-01 08:56:28.983000+00:00 | 214830392 |
| 580294 | 189448 | 2014-04-01 08:56:31.815000+00:00 | 214830392 |
| 580295 | 189448 | 2014-04-01 08:57:08.301000+00:00 | 214830392 |
| 580296 | 189448 | 2014-04-01 08:57:10.338000+00:00 | 214830392 |
| 580297 | 189448 | 2014-04-01 08:58:01.728000+00:00 | 214830390 |
| ... | ... | ... | ... |
| 580488 | 189448 | 2014-04-01 10:35:52.400000+00:00 | 214830137 |
| 580489 | 189448 | 2014-04-01 10:37:15.094000+00:00 | 214830118 |
| 580490 | 189448 | 2014-04-01 10:37:35.955000+00:00 | 214830118 |
| 580491 | 189448 | 2014-04-01 10:37:37.098000+00:00 | 214830118 |
| 580492 | 189448 | 2014-04-01 10:37:46.557000+00:00 | 214830116 |
200 rows × 3 columns
In [12]:
# Cumulative share of sessions, by session length, up to the bottom 99.9%
length_count = session_length.groupby(session_length).size()
length_percent_cumsum = length_count.cumsum() / length_count.sum()
length_percent_cumsum_999 = length_percent_cumsum[length_percent_cumsum < 0.999]
length_percent_cumsum_999
Out[12]:
1     0.136189
2     0.520858
3     0.695280
4     0.796461
5     0.855125
6     0.894389
7     0.920036
8     0.938321
9     0.951293
10    0.961084
11    0.968267
12    0.973959
13    0.978320
14    0.981815
15    0.984587
16    0.986837
17    0.988673
18    0.990201
19    0.991460
20    0.992520
21    0.993436
22    0.994207
23    0.994871
24    0.995444
25    0.995920
26    0.996342
27    0.996714
28    0.997042
29    0.997330
30    0.997577
31    0.997796
32    0.998001
33    0.998177
34    0.998327
35    0.998461
36    0.998590
37    0.998706
38    0.998805
39    0.998896
40    0.998981
dtype: float64
In [13]:
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 10))
plt.bar(x=length_percent_cumsum_999.index,
height=length_percent_cumsum_999, color='red')
plt.xticks(length_percent_cumsum_999.index)
plt.yticks(np.arange(0, 1.01, 0.05))
plt.title('Cumsum Percentage Until 0.999', size=20)
plt.show()
Session Time¶
In [14]:
oldest, latest = data['Time'].min(), data['Time'].max()
print(oldest)
print(latest)
2014-04-01 03:00:00.124000+00:00
2014-09-30 02:59:59.430000+00:00
In [15]:
# latest is a Timestamp object, which does not support arithmetic with plain ints,
# so to compute the difference between dates, use datetime's timedelta object.
type(latest)
Out[15]:
pandas._libs.tslibs.timestamps.Timestamp
In [16]:
month_ago = latest - dt.timedelta(30)  # the date 30 days before the latest timestamp
data = data[data['Time'] > month_ago]  # keep only the data after that date
data
Out[16]:
| | SessionId | Time | ItemId |
|---|---|---|---|
| 26837834 | 9194111 | 2014-08-31 17:40:46.805000+00:00 | 214853420 |
| 26837835 | 9194111 | 2014-08-31 17:42:26.089000+00:00 | 214850942 |
| 26837836 | 9194111 | 2014-08-31 17:44:06.583000+00:00 | 214829878 |
| 26837837 | 9194111 | 2014-08-31 17:48:49.873000+00:00 | 214853420 |
| 26838214 | 9194112 | 2014-09-01 13:26:36.292000+00:00 | 214853422 |
| ... | ... | ... | ... |
| 32230487 | 11562158 | 2014-09-26 04:50:29.172000+00:00 | 214849132 |
| 32230488 | 11562158 | 2014-09-26 04:52:21.900000+00:00 | 214854774 |
| 32230489 | 11562158 | 2014-09-26 05:16:32.904000+00:00 | 214849132 |
| 32230490 | 11562159 | 2014-09-26 19:16:28.897000+00:00 | 214849132 |
| 32230477 | 11562161 | 2014-09-26 20:45:42.791000+00:00 | 214546022 |
5641401 rows × 3 columns
Data Cleansing¶
In [17]:
# Removing short sessions and then removing unpopular items can create new length-1 sessions.
# So we remove both repeatedly in a loop until nothing changes.
def cleanse_recursive(data: pd.DataFrame, shortest, least_click) -> pd.DataFrame:
    while True:
        before_len = len(data)
        data = cleanse_short_session(data, shortest)
        data = cleanse_unpopular_item(data, least_click)
        after_len = len(data)
        if before_len == after_len:
            break
    return data

def cleanse_short_session(data: pd.DataFrame, shortest):
    session_len = data.groupby('SessionId').size()
    session_use = session_len[session_len >= shortest].index
    data = data[data['SessionId'].isin(session_use)]
    return data

def cleanse_unpopular_item(data: pd.DataFrame, least_click):
    item_popular = data.groupby('ItemId').size()
    item_use = item_popular[item_popular >= least_click].index
    data = data[data['ItemId'].isin(item_use)]
    return data
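To see why the loop matters, here is a minimal sketch on hypothetical toy data (the SessionId/ItemId values are made up): dropping a rare item can shorten a session, and removing that session can in turn make another item rare.

# Hypothetical toy data: three sessions, where item 3 is clicked only once.
toy = pd.DataFrame({'SessionId': [10, 10, 20, 20, 30, 30],
                    'ItemId':    [1, 1, 1, 2, 2, 3]})
# With shortest=2 and least_click=2, dropping one-click item 3 shrinks session 30
# to length 1; removing session 30 then makes item 2 unpopular, and so on,
# until only session 10 (two clicks on item 1) survives.
print(cleanse_recursive(toy, shortest=2, least_click=2))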
In [18]:
data = cleanse_recursive(data, shortest=2, least_click=5)
data
Out[18]:
| | SessionId | Time | ItemId |
|---|---|---|---|
| 26837834 | 9194111 | 2014-08-31 17:40:46.805000+00:00 | 214853420 |
| 26837835 | 9194111 | 2014-08-31 17:42:26.089000+00:00 | 214850942 |
| 26837836 | 9194111 | 2014-08-31 17:44:06.583000+00:00 | 214829878 |
| 26837837 | 9194111 | 2014-08-31 17:48:49.873000+00:00 | 214853420 |
| 26838202 | 9194123 | 2014-08-31 19:26:57.386000+00:00 | 214601207 |
| ... | ... | ... | ... |
| 32230485 | 11562157 | 2014-09-25 12:31:10.391000+00:00 | 214580372 |
| 32230486 | 11562157 | 2014-09-25 12:31:29.679000+00:00 | 214516012 |
| 32230487 | 11562158 | 2014-09-26 04:50:29.172000+00:00 | 214849132 |
| 32230488 | 11562158 | 2014-09-26 04:52:21.900000+00:00 | 214854774 |
| 32230489 | 11562158 | 2014-09-26 05:16:32.904000+00:00 | 214849132 |
5254242 rows × 3 columns
Train / Valid / Test Split¶
- Predicting current behavior is what matters, so a model that performed well in the past may no longer fit today, because users' consumption patterns change over time.
- For this reason, Session-Based Recommendation often splits the Train / Valid / Test sets by time period.
- Here we will use the final 1-day window as the Test set and the day before that (from 2 days ago to 1 day ago) as the Valid set.
In [19]:
test_path = data_path / 'yoochoose-test.dat'
test = load_data(test_path)
test['Time'].min(), test['Time'].max()
Out[19]:
(Timestamp('2014-04-01 03:00:08.250000+0000', tz='UTC'), Timestamp('2014-09-30 02:59:23.866000+0000', tz='UTC'))
In [20]:
def split_by_date(data: pd.DataFrame, n_days: int):
    final_time = data['Time'].max()
    session_last_time = data.groupby('SessionId')['Time'].max()
    session_in_train = session_last_time[session_last_time < final_time - dt.timedelta(n_days)].index
    session_in_test = session_last_time[session_last_time >= final_time - dt.timedelta(n_days)].index

    before_date = data[data['SessionId'].isin(session_in_train)]
    after_date = data[data['SessionId'].isin(session_in_test)]
    after_date = after_date[after_date['ItemId'].isin(before_date['ItemId'])]
    return before_date, after_date
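As a quick check, here is a minimal sketch of split_by_date on hypothetical toy data: sessions whose last event falls within the final n_days go to the second split, and items unseen in the first split are dropped from the second.

toy = pd.DataFrame({'SessionId': [1, 1, 2, 2],
                    'Time': pd.to_datetime(['2014-09-20', '2014-09-21',
                                            '2014-09-27', '2014-09-28'], utc=True),
                    'ItemId': [10, 11, 10, 12]})
before, after = split_by_date(toy, n_days=1)
print(before['ItemId'].tolist())  # [10, 11] -> session 1 ended before the final day
print(after['ItemId'].tolist())   # [10]     -> session 2; item 12 is unseen in `before`, so it is dropped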
In [21]:
tr, test = split_by_date(data, n_days=1)
tr, val = split_by_date(tr, n_days=1)
In [22]:
# Take a look at basic statistics of the data.
def stats_info(data: pd.DataFrame, status: str):
    print(f'* {status} Set Stats Info\n'
          f'\t Events: {len(data)}\n'
          f'\t Sessions: {data["SessionId"].nunique()}\n'
          f'\t Items: {data["ItemId"].nunique()}\n'
          f'\t First Time : {data["Time"].min()}\n'
          f'\t Last Time : {data["Time"].max()}\n')
In [23]:
stats_info(tr, 'train')
stats_info(val, 'valid')
stats_info(test, 'test')
* train Set Stats Info
	 Events: 5125100
	 Sessions: 1243431
	 Items: 20153
	 First Time : 2014-08-31 03:00:01.111000+00:00
	 Last Time : 2014-09-28 02:57:34.348000+00:00

* valid Set Stats Info
	 Events: 58074
	 Sessions: 12350
	 Items: 6232
	 First Time : 2014-09-28 03:00:25.298000+00:00
	 Last Time : 2014-09-29 02:58:27.660000+00:00

* test Set Stats Info
	 Events: 71009
	 Sessions: 15289
	 Items: 6580
	 First Time : 2014-09-29 02:37:20.695000+00:00
	 Last Time : 2014-09-30 02:59:59.430000+00:00
In [24]:
# Items unseen in the train set can appear in the val/test periods, so index items based on the train data.
id2idx = {item_id: index for index, item_id in enumerate(tr['ItemId'].unique())}

def indexing(df, id2idx):
    df['item_idx'] = df['ItemId'].map(lambda x: id2idx.get(x, -1))  # map items missing from id2idx to the unknown value (-1)
    return df

tr = indexing(tr, id2idx)
val = indexing(val, id2idx)
test = indexing(test, id2idx)
In [25]:
save_path = data_path / 'processed'
save_path.mkdir(parents=True, exist_ok=True)
tr.to_pickle(save_path / 'train.pkl')
val.to_pickle(save_path / 'valid.pkl')
test.to_pickle(save_path / 'test.pkl')
Session-Parallel Mini-Batches¶
Instead of short sessions waiting until the longest session in the batch finishes its computation, sessions are processed in parallel, without waiting for any single session to end.
When session 2 ends, session 4 takes over its slot; the mini-batch then has shape (3, 1, 1), and the RNN cell carries only a single step of state.
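The following is a minimal, self-contained sketch (hypothetical sessions, batch_size=2) of the session-parallel idea implemented by the loader below: each batch slot walks through one session, and when a slot's session runs out of targets, the next unused session takes over that slot.

sessions = [[0, 1, 2], [3, 4], [5, 6, 7]]    # item indices per session (toy data)
slots, pos, nxt = [0, 1], [0, 0], 2          # two parallel slots; next unused session
finished = False
while not finished:
    # one (input, target) pair per slot and step
    print([sessions[s][p] for s, p in zip(slots, pos)],      # inputs
          [sessions[s][p + 1] for s, p in zip(slots, pos)])  # targets
    pos = [p + 1 for p in pos]
    for i in range(len(slots)):
        if pos[i] + 1 >= len(sessions[slots[i]]):  # slot i has no target left
            if nxt >= len(sessions):               # no session left to refill it
                finished = True
                break
            slots[i], pos[i] = nxt, 0              # refill the slot; in the real loader,
            nxt += 1                               # the slot's RNN state is also reset via the mask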
In [26]:
# Build a class that, given the data, stores the index where each session starts and a re-indexed array of sessions.
class SessionDataset:
    """Credit to yhs-968/pyGRU4REC."""

    def __init__(self, data):
        self.df = data
        self.click_offsets = self.get_click_offsets()  # index where each session starts
        self.session_idx = np.arange(self.df['SessionId'].nunique())  # indexing to SessionId

    def get_click_offsets(self):
        """
        Return the indexes of the first click of each session ID.
        """
        offsets = np.zeros(self.df['SessionId'].nunique() + 1, dtype=np.int32)
        offsets[1:] = self.df.groupby('SessionId').size().cumsum()
        return offsets
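A quick check of what click_offsets holds, on hypothetical toy data: session i occupies rows offsets[i]:offsets[i+1] of the session-sorted DataFrame.

toy = pd.DataFrame({'SessionId': [1, 1, 1, 1, 2, 2, 3, 3],
                    'item_idx':  [0, 1, 2, 0, 3, 4, 5, 6]})
ds = SessionDataset(toy)
print(ds.click_offsets)  # [0 4 6 8]: three sessions of lengths 4, 2, 2
print(ds.session_idx)    # [0 1 2]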
In [27]:
tr_dataset = SessionDataset(tr)
tr_dataset.df.head(10)
Out[27]:
| | SessionId | Time | ItemId | item_idx |
|---|---|---|---|---|
| 26837834 | 9194111 | 2014-08-31 17:40:46.805000+00:00 | 214853420 | 0 |
| 26837835 | 9194111 | 2014-08-31 17:42:26.089000+00:00 | 214850942 | 1 |
| 26837836 | 9194111 | 2014-08-31 17:44:06.583000+00:00 | 214829878 | 2 |
| 26837837 | 9194111 | 2014-08-31 17:48:49.873000+00:00 | 214853420 | 0 |
| 26838202 | 9194123 | 2014-08-31 19:26:57.386000+00:00 | 214601207 | 3 |
| 26838203 | 9194123 | 2014-08-31 19:34:37.068000+00:00 | 214510689 | 4 |
| 26838193 | 9194124 | 2014-08-31 19:14:28.308000+00:00 | 214849327 | 5 |
| 26838194 | 9194124 | 2014-08-31 19:16:31.114000+00:00 | 214828970 | 6 |
| 26838196 | 9194127 | 2014-09-01 15:36:11.651000+00:00 | 214845997 | 7 |
| 26838197 | 9194127 | 2014-09-01 15:38:00.222000+00:00 | 214845997 | 7 |
In [28]:
tr_dataset.click_offsets
Out[28]:
array([ 0, 4, 6, ..., 5125095, 5125097, 5125100], dtype=int32)
In [29]:
tr_dataset.session_idx
Out[29]:
array([ 0, 1, 2, ..., 1243428, 1243429, 1243430])
SessionDataLoader¶
In [30]:
class SessionDataLoader:
    """Credit to yhs-968/pyGRU4REC."""

    def __init__(self, dataset: SessionDataset, batch_size=50):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        """ Returns the iterator for producing session-parallel training mini-batches.
        Yields:
            input (B,): Item indices that will be encoded as one-hot vectors later.
            target (B,): a Variable that stores the target item indices
            masks: Numpy array indicating the positions of the sessions to be terminated
        """
        start, end, mask, last_session, finished = self.initialize()  # see the initialize method below
        """
        start : index where each session starts
        end : index where each session ends
        mask : indicator for the sessions to be terminated
        """
        while not finished:
            min_len = (end - start).min() - 1  # shortest length among the current sessions
            for i in range(min_len):
                # Build inputs & targets
                inp = self.dataset.df['item_idx'].values[start + i]
                target = self.dataset.df['item_idx'].values[start + i + 1]
                yield inp, target, mask
            start, end, mask, last_session, finished = self.update_status(start, end, min_len, last_session, finished)

    def initialize(self):
        first_iters = np.arange(self.batch_size)  # session indices used for the first batch
        last_session = self.batch_size - 1  # index of the last session currently in use
        start = self.dataset.click_offsets[self.dataset.session_idx[first_iters]]  # position in the data where each session starts
        end = self.dataset.click_offsets[self.dataset.session_idx[first_iters] + 1]  # position right after each session ends
        mask = np.array([])  # slots whose session has been fully consumed will be recorded here
        finished = False  # whether we have gone through all the data
        return start, end, mask, last_session, finished

    def update_status(self, start: np.ndarray, end: np.ndarray, min_len: int, last_session: int, finished: bool):
        # Update the state to produce the next batch of data.
        start += min_len  # __iter__ looped min_len times, so advance start by min_len
        mask = np.arange(self.batch_size)[(end - start) == 1]
        # end is where the next session starts; a gap of one between start and end means the session is over. Record it in mask.
        for i, idx in enumerate(mask, start=1):  # start as many new sessions as there are entries in mask
            new_session = last_session + i
            if new_session > self.dataset.session_idx[-1]:  # if the new session index exceeds the last one, we have seen all the training data
                finished = True
                break
            # update the next starting/ending point
            start[idx] = self.dataset.click_offsets[self.dataset.session_idx[new_session]]  # record the start of the new session in place of the finished one
            end[idx] = self.dataset.click_offsets[self.dataset.session_idx[new_session] + 1]
        last_session += len(mask)  # remember the position of the last session used
        return start, end, mask, last_session, finished
In [31]:
tr_data_loader = SessionDataLoader(tr_dataset, batch_size=4)
tr_dataset.df.head(15)
Out[31]:
| | SessionId | Time | ItemId | item_idx |
|---|---|---|---|---|
| 26837834 | 9194111 | 2014-08-31 17:40:46.805000+00:00 | 214853420 | 0 |
| 26837835 | 9194111 | 2014-08-31 17:42:26.089000+00:00 | 214850942 | 1 |
| 26837836 | 9194111 | 2014-08-31 17:44:06.583000+00:00 | 214829878 | 2 |
| 26837837 | 9194111 | 2014-08-31 17:48:49.873000+00:00 | 214853420 | 0 |
| 26838202 | 9194123 | 2014-08-31 19:26:57.386000+00:00 | 214601207 | 3 |
| 26838203 | 9194123 | 2014-08-31 19:34:37.068000+00:00 | 214510689 | 4 |
| 26838193 | 9194124 | 2014-08-31 19:14:28.308000+00:00 | 214849327 | 5 |
| 26838194 | 9194124 | 2014-08-31 19:16:31.114000+00:00 | 214828970 | 6 |
| 26838196 | 9194127 | 2014-09-01 15:36:11.651000+00:00 | 214845997 | 7 |
| 26838197 | 9194127 | 2014-09-01 15:38:00.222000+00:00 | 214845997 | 7 |
| 26838198 | 9194127 | 2014-09-01 15:38:56.867000+00:00 | 214845997 | 7 |
| 26838259 | 9194128 | 2014-08-31 19:09:27.360000+00:00 | 214581830 | 8 |
| 26838260 | 9194128 | 2014-08-31 19:10:04.641000+00:00 | 214574135 | 9 |
| 26838261 | 9194128 | 2014-08-31 19:10:57+00:00 | 214857795 | 10 |
| 26838262 | 9194128 | 2014-08-31 19:11:31.797000+00:00 | 214574139 | 11 |
In [32]:
iter_ex = iter(tr_data_loader)
In [33]:
inputs, labels, mask = next(iter_ex)
print(f'Model Input Item Idx are : {inputs}')
print(f'Label Item Idx are : {"":5} {labels}')
print(f'Previous Masked Input Idx are {mask}')
Model Input Item Idx are : [0 3 5 7]
Label Item Idx are :       [1 4 6 7]
Previous Masked Input Idx are []
In [34]:
def mrr_k(pred, truth: int, k: int):
    indexing = np.where(pred[:k] == truth)[0]
    if len(indexing) > 0:
        return 1 / (indexing[0] + 1)
    else:
        return 0

def recall_k(pred, truth: int, k: int) -> int:
    answer = truth in pred[:k]
    return int(answer)
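A quick sanity check of the two metrics on a hypothetical ranked prediction (item indices sorted by predicted score):

pred = np.array([7, 3, 9, 1])        # hypothetical top-ranked item indices
print(recall_k(pred, truth=9, k=3))  # 1    -> item 9 appears in the top 3
print(mrr_k(pred, truth=9, k=3))     # 0.33 -> item 9 is at rank 3, so 1/3
print(recall_k(pred, truth=1, k=3))  # 0    -> item 1 is outside the top 3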
Model Architecture¶
- pip install tqdm
In [35]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, GRU
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tqdm import tqdm
In [36]:
def create_model(args):
    inputs = Input(batch_shape=(args.batch_size, 1, args.num_items))
    gru, _ = GRU(args.hsz, stateful=True, return_state=True, name='GRU')(inputs)
    dropout = Dropout(args.drop_rate)(gru)
    predictions = Dense(args.num_items, activation='softmax')(dropout)
    model = Model(inputs=inputs, outputs=[predictions])
    model.compile(loss=categorical_crossentropy, optimizer=Adam(args.lr), metrics=['accuracy'])
    model.summary()
    return model
In [37]:
class Args:
    def __init__(self, tr, val, test, batch_size, hsz, drop_rate, lr, epochs, k):
        self.tr = tr
        self.val = val
        self.test = test
        self.num_items = tr['ItemId'].nunique()
        self.num_sessions = tr['SessionId'].nunique()
        self.batch_size = batch_size
        self.hsz = hsz
        self.drop_rate = drop_rate
        self.lr = lr
        self.epochs = epochs
        self.k = k

args = Args(tr, val, test, batch_size=2048, hsz=50, drop_rate=0.1, lr=0.001, epochs=3, k=20)
In [38]:
model = create_model(args)
Model: "model" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_1 (InputLayer) [(2048, 1, 20153)] 0 _________________________________________________________________ GRU (GRU) [(2048, 50), (2048, 50)] 3030750 _________________________________________________________________ dropout (Dropout) (2048, 50) 0 _________________________________________________________________ dense (Dense) (2048, 20153) 1027803 ================================================================= Total params: 4,058,553 Trainable params: 4,058,553 Non-trainable params: 0 _________________________________________________________________
Model Training¶
In [39]:
# Train on the train set while validating on the valid set.
def train_model(model, args):
    train_dataset = SessionDataset(args.tr)
    train_loader = SessionDataLoader(train_dataset, batch_size=args.batch_size)

    for epoch in range(1, args.epochs + 1):
        total_step = len(args.tr) - args.tr['SessionId'].nunique()
        tr_loader = tqdm(train_loader, total=total_step // args.batch_size, desc='Train', mininterval=1)
        for feat, target, mask in tr_loader:
            reset_hidden_states(model, mask)  # reset the hidden state of finished sessions; see the method below
            input_ohe = to_categorical(feat, num_classes=args.num_items)
            input_ohe = np.expand_dims(input_ohe, axis=1)
            target_ohe = to_categorical(target, num_classes=args.num_items)
            result = model.train_on_batch(input_ohe, target_ohe)
            tr_loader.set_postfix(train_loss=result[0], accuracy=result[1])
        val_recall, val_mrr = get_metrics(args.val, model, args, args.k)  # evaluate on the valid set
        print(f"\t - Recall@{args.k} epoch {epoch}: {val_recall:3f}")
        print(f"\t - MRR@{args.k} epoch {epoch}: {val_mrr:3f}\n")

def reset_hidden_states(model, mask):
    gru_layer = model.get_layer(name='GRU')  # get the GRU layer from the model
    hidden_states = gru_layer.states[0].numpy()  # get the layer's hidden states
    for elt in mask:  # for each masked index, i.e., each finished session,
        hidden_states[elt, :] = 0  # zero out its hidden state
    gru_layer.reset_states(states=hidden_states)

def get_metrics(data, model, args, k: int):  # evaluate on the valid or test set
    # Almost the same as training, but with extra lines computing MRR and Recall.
    dataset = SessionDataset(data)
    loader = SessionDataLoader(dataset, batch_size=args.batch_size)
    recall_list, mrr_list = [], []

    total_step = len(data) - data['SessionId'].nunique()
    for inputs, label, mask in tqdm(loader, total=total_step // args.batch_size, desc='Evaluation', mininterval=1):
        reset_hidden_states(model, mask)
        input_ohe = to_categorical(inputs, num_classes=args.num_items)
        input_ohe = np.expand_dims(input_ohe, axis=1)

        pred = model.predict(input_ohe, batch_size=args.batch_size)
        pred_arg = tf.argsort(pred, direction='DESCENDING')  # sort item indices by descending softmax score

        length = len(inputs)
        recall_list.extend([recall_k(pred_arg[i], label[i], k) for i in range(length)])
        mrr_list.extend([mrr_k(pred_arg[i], label[i], k) for i in range(length)])

    recall, mrr = np.mean(recall_list), np.mean(mrr_list)
    return recall, mrr
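As a shape sanity check, here is what the one-hot encoding step above produces on a hypothetical mini-batch of 4 with 6 items: the stateful GRU expects (batch, time=1, num_items) inputs, while targets stay (batch, num_items).

feat = np.array([0, 3, 5, 1])                    # hypothetical item indices
input_ohe = to_categorical(feat, num_classes=6)  # shape (4, 6)
input_ohe = np.expand_dims(input_ohe, axis=1)    # shape (4, 1, 6)
print(input_ohe.shape)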
In [40]:
# Training takes quite a while (roughly 1 hour).
train_model(model, args)
Train: 100%|█████████▉| 1891/1895 [21:38<00:02, 1.46it/s, accuracy=0.113, train_loss=5.8]
Evaluation: 77%|███████▋| 17/22 [01:48<00:31, 6.38s/it]
	 - Recall@20 epoch 1: 0.449161
	 - MRR@20 epoch 1: 0.166672

Train: 100%|█████████▉| 1891/1895 [21:46<00:02, 1.45it/s, accuracy=0.168, train_loss=4.8]
Evaluation: 77%|███████▋| 17/22 [01:34<00:27, 5.55s/it]
	 - Recall@20 epoch 2: 0.596536
	 - MRR@20 epoch 2: 0.243300

Train: 100%|█████████▉| 1891/1895 [21:53<00:02, 1.44it/s, accuracy=0.189, train_loss=4.37]
Evaluation: 77%|███████▋| 17/22 [01:27<00:25, 5.18s/it]
	 - Recall@20 epoch 3: 0.658433
	 - MRR@20 epoch 3: 0.283255
5. Inference¶
In [41]:
def test_model(model, args, test):
    test_recall, test_mrr = get_metrics(test, model, args, args.k)  # use args.k so the printed k matches
    print(f"\t - Recall@{args.k}: {test_recall:3f}")
    print(f"\t - MRR@{args.k}: {test_mrr:3f}\n")

test_model(model, args, test)
Evaluation: 81%|████████▏| 22/27 [01:53<00:25, 5.17s/it]
	 - Recall@20: 0.656672
	 - MRR@20: 0.273394
Project: MovieLens Movie SBR¶
1) Download the data with wget: $ wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
2) Move the downloaded file to the working directory: $ mv ml-1m.zip ~/aiffel/yoochoose-data
3) Unzip it: $ cd ~/aiffel/yoochoose-data && unzip ml-1m.zip
In [461]:
data_path = Path(os.getenv('HOME')+'/aiffel/yoochoose-data/ml-1m')
train_path = data_path / 'ratings.dat'

def load_data(data_path: Path, nrows=None):
    # sep='::' is a multi-character separator, which requires the python parsing engine
    data = pd.read_csv(data_path, sep='::', header=None, usecols=[0, 1, 2, 3],
                       dtype={0: np.int32, 1: np.int32, 2: np.int32}, nrows=nrows, engine='python')
    data.columns = ['UserId', 'ItemId', 'Rating', 'Time']
    return data

data = load_data(train_path, None)
data.sort_values(['UserId', 'Time'], inplace=True)  # sort the data by user id and time
data
Out[461]:
| | UserId | ItemId | Rating | Time |
|---|---|---|---|---|
| 31 | 1 | 3186 | 4 | 978300019 |
| 22 | 1 | 1270 | 5 | 978300055 |
| 27 | 1 | 1721 | 4 | 978300055 |
| 37 | 1 | 1022 | 5 | 978300055 |
| 24 | 1 | 2340 | 3 | 978300103 |
| ... | ... | ... | ... | ... |
| 1000019 | 6040 | 2917 | 4 | 997454429 |
| 999988 | 6040 | 1921 | 4 | 997454464 |
| 1000172 | 6040 | 1784 | 3 | 997454464 |
| 1000167 | 6040 | 161 | 3 | 997454486 |
| 1000042 | 6040 | 1221 | 4 | 998315055 |
1000209 rows × 4 columns
In [462]:
# Number of users and number of movies
data['UserId'].nunique(), data['ItemId'].nunique()
Out[462]:
(6040, 3706)
In [463]:
# Number of ratings per user (here, each user's whole history plays the role of a session)
session_length = data.groupby('UserId').size()
session_length
Out[463]:
UserId
1        53
2       129
3        51
4        21
5       198
       ...
6036    888
6037    202
6038     20
6039    123
6040    341
Length: 6040, dtype: int64
In [464]:
# Median and mean number of ratings per user
session_length.median(), session_length.mean()
Out[464]:
(96.0, 165.5975165562914)
In [465]:
session_length.min(), session_length.max()  # ml-1m only includes users with at least 20 ratings, hence the minimum of 20
Out[465]:
(20, 2314)
In [466]:
# 99.9% of users have about 1,343 or fewer ratings
session_length.quantile(0.999)
Out[466]:
1343.181000000005
In [467]:
long_session = session_length[session_length>=1343].index[0]
data[data['UserId']==long_session]
Out[467]:
| | UserId | ItemId | Rating | Time |
|---|---|---|---|---|
| 137631 | 889 | 1266 | 3 | 975247862 |
| 137864 | 889 | 2430 | 3 | 975247862 |
| 137889 | 889 | 1643 | 3 | 975247862 |
| 138039 | 889 | 3461 | 3 | 975247862 |
| 138316 | 889 | 1193 | 1 | 975247862 |
| ... | ... | ... | ... | ... |
| 138285 | 889 | 1322 | 1 | 975364486 |
| 138292 | 889 | 1328 | 1 | 975364486 |
| 139033 | 889 | 2974 | 2 | 975364486 |
| 137720 | 889 | 3047 | 5 | 975364518 |
| 138794 | 889 | 3713 | 4 | 975364518 |
1518 rows × 4 columns
In [468]:
# Visualize the cumulative session-length distribution
length_count = session_length.groupby(session_length).size()
length_percent_cumsum = length_count.cumsum() / length_count.sum()
length_percent_cumsum_999 = length_percent_cumsum[length_percent_cumsum < 0.999]
plt.figure(figsize=(20, 10))
plt.bar(x=length_percent_cumsum_999.index,
height=length_percent_cumsum_999, color='red')
plt.xticks(length_percent_cumsum_999.index)
plt.yticks(np.arange(0, 1.01, 0.05))
plt.title('Cumsum Percentage Until 0.999', size=20)
plt.show()
Handling Datetime¶
In [469]:
# Check the time-related information.
# Convert the Time column, a unix timestamp, into datetime objects.
# https://stackoverflow.com/questions/3682748/converting-unix-timestamp-string-to-readable-date
# https://www.kite.com/python/answers/how-to-modify-all-the-values-in-a-pandas-dataframe-column-in-python
data['Time'] = data['Time'].apply(lambda x: dt.datetime.utcfromtimestamp(int(x)))
oldest, latest = data['Time'].min(), data['Time'].max()
print(oldest)
print(latest)
2000-04-25 23:05:32
2003-02-28 17:49:50
In [470]:
type(latest)
Out[470]:
pandas._libs.tslibs.timestamps.Timestamp
In [471]:
pd.to_datetime(latest, unit='s')
Out[471]:
Timestamp('2003-02-28 17:49:50')
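As a side note, the .apply conversion above can also be done vectorized with pd.to_datetime, which gives the same result and is typically much faster (shown here on hypothetical timestamp values, since data['Time'] has already been converted):

ts = pd.Series([978300019, 997454429])  # hypothetical unix timestamps
pd.to_datetime(ts, unit='s')            # -> 2000-12-31 22:00:19, 2001-08-10 14:40:29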
In [472]:
# Removing short sessions and then removing unpopular items can create new length-1 sessions.
# So we remove both repeatedly in a loop until nothing changes.
def cleanse_recursive(data: pd.DataFrame, shortest, least_click) -> pd.DataFrame:
    while True:
        before_len = len(data)
        data = cleanse_short_session(data, shortest)
        data = cleanse_unpopular_item(data, least_click)
        after_len = len(data)
        if before_len == after_len:
            break
    return data

# SessionId -> UserId
def cleanse_short_session(data: pd.DataFrame, shortest):
    session_len = data.groupby('UserId').size()
    session_use = session_len[session_len >= shortest].index
    data = data[data['UserId'].isin(session_use)]
    return data

def cleanse_unpopular_item(data: pd.DataFrame, least_click):
    item_popular = data.groupby('ItemId').size()
    item_use = item_popular[item_popular >= least_click].index
    data = data[data['ItemId'].isin(item_use)]
    return data
In [473]:
# Remove short sessions and unpopular items, as before.
data = cleanse_recursive(data, shortest=2, least_click=5)
data
Out[473]:
| | UserId | ItemId | Rating | Time |
|---|---|---|---|---|
| 31 | 1 | 3186 | 4 | 2000-12-31 22:00:19 |
| 22 | 1 | 1270 | 5 | 2000-12-31 22:00:55 |
| 27 | 1 | 1721 | 4 | 2000-12-31 22:00:55 |
| 37 | 1 | 1022 | 5 | 2000-12-31 22:00:55 |
| 24 | 1 | 2340 | 3 | 2000-12-31 22:01:43 |
| ... | ... | ... | ... | ... |
| 1000019 | 6040 | 2917 | 4 | 2001-08-10 14:40:29 |
| 999988 | 6040 | 1921 | 4 | 2001-08-10 14:41:04 |
| 1000172 | 6040 | 1784 | 3 | 2001-08-10 14:41:04 |
| 1000167 | 6040 | 161 | 3 | 2001-08-10 14:41:26 |
| 1000042 | 6040 | 1221 | 4 | 2001-08-20 13:44:15 |
999611 rows × 4 columns
In [474]:
data.sort_values(by=['Time'], inplace=True) # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
data.shape
Out[474]:
(999611, 4)
In [475]:
# Split the time-sorted rows into train/valid/test by position
data_train = data[:700000]
data_val = data[700000:850000]
data_test = data[850000:]
In [476]:
def stats_info(data: pd.DataFrame, status: str):
    print(f'* {status} Set Stats Info\n'
          f'\t Events: {len(data)}\n'
          f'\t Users (Sessions): {data["UserId"].nunique()}\n'
          f'\t Items: {data["ItemId"].nunique()}\n'
          f'\t First Time : {data["Time"].min()}\n'
          f'\t Last Time : {data["Time"].max()}\n')
In [477]:
stats_info(data_train, 'train')
stats_info(data_val, 'valid')
stats_info(data_test, 'test')
* train Set Stats Info
	 Events: 700000
	 Users (Sessions): 4870
	 Items: 3408
	 First Time : 2000-04-25 23:05:32
	 Last Time : 2000-11-22 03:15:20

* valid Set Stats Info
	 Events: 150000
	 Users (Sessions): 1348
	 Items: 3351
	 First Time : 2000-11-22 03:15:23
	 Last Time : 2000-12-10 04:52:34

* test Set Stats Info
	 Events: 149611
	 Users (Sessions): 1511
	 Items: 3362
	 First Time : 2000-12-10 04:52:34
	 Last Time : 2003-02-28 17:49:50
In [478]:
# Items unseen in the train set can appear in the val/test periods, so index items based on the train data.
id2idx = {item_id: index for index, item_id in enumerate(data_train['ItemId'].unique())}

def indexing(df, id2idx):
    df['item_idx'] = df['ItemId'].map(lambda x: id2idx.get(x, -1))  # map items missing from id2idx to the unknown value (-1)
    return df

data_train = indexing(data_train, id2idx)
data_val = indexing(data_val, id2idx)
data_test = indexing(data_test, id2idx)
In [479]:
save_path = data_path / 'processed'
save_path.mkdir(parents=True, exist_ok=True)
data_train.to_pickle(save_path / 'train.pkl')
data_val.to_pickle(save_path / 'valid.pkl')
data_test.to_pickle(save_path / 'test.pkl')
Model Construction¶
In [480]:
# Build a class that, given the data, stores the index where each session starts and a re-indexed array of sessions.
class SessionDataset:
    """Credit to yhs-968/pyGRU4REC."""

    def __init__(self, data):
        self.df = data
        self.click_offsets = self.get_click_offsets()  # index where each session starts
        self.session_idx = np.arange(self.df['UserId'].nunique())  # indexing to UserId

    def get_click_offsets(self):
        """
        Return the indexes of the first click of each session ID.
        """
        offsets = np.zeros(self.df['UserId'].nunique() + 1, dtype=np.int32)
        offsets[1:] = self.df.groupby('UserId').size().cumsum()
        return offsets
In [481]:
tr_dataset = SessionDataset(data_train)
tr_dataset.df.head(10)
Out[481]:
| | UserId | ItemId | Rating | Time | item_idx |
|---|---|---|---|---|---|
| 1000138 | 6040 | 858 | 4 | 2000-04-25 23:05:32 | 0 |
| 999873 | 6040 | 593 | 5 | 2000-04-25 23:05:54 | 1 |
| 1000153 | 6040 | 2384 | 4 | 2000-04-25 23:05:54 | 2 |
| 1000007 | 6040 | 1961 | 4 | 2000-04-25 23:06:17 | 3 |
| 1000192 | 6040 | 2019 | 5 | 2000-04-25 23:06:17 | 4 |
| 999868 | 6040 | 573 | 4 | 2000-04-25 23:07:36 | 5 |
| 999877 | 6040 | 1419 | 3 | 2000-04-25 23:07:36 | 6 |
| 999920 | 6040 | 213 | 5 | 2000-04-25 23:07:36 | 7 |
| 999967 | 6040 | 3111 | 5 | 2000-04-25 23:07:36 | 8 |
| 999980 | 6040 | 3505 | 4 | 2000-04-25 23:07:36 | 9 |
In [482]:
tr_dataset.click_offsets
Out[482]:
array([ 0, 76, 174, ..., 699559, 699682, 700000], dtype=int32)
In [483]:
tr_dataset.session_idx
Out[483]:
array([ 0, 1, 2, ..., 4867, 4868, 4869])
In [484]:
class SessionDataLoader:
    """Credit to yhs-968/pyGRU4REC."""

    def __init__(self, dataset: SessionDataset, batch_size=50):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        """ Returns the iterator for producing session-parallel training mini-batches.
        Yields:
            input (B,): Item indices that will be encoded as one-hot vectors later.
            target (B,): a Variable that stores the target item indices
            masks: Numpy array indicating the positions of the sessions to be terminated
        """
        start, end, mask, last_session, finished = self.initialize()  # see the initialize method below
        """
        start : index where each session starts
        end : index where each session ends
        mask : indicator for the sessions to be terminated
        """
        while not finished:
            min_len = (end - start).min() - 1  # shortest length among the current sessions
            for i in range(min_len):
                # Build inputs & targets
                inp = self.dataset.df['item_idx'].values[start + i]
                target = self.dataset.df['item_idx'].values[start + i + 1]
                yield inp, target, mask
            start, end, mask, last_session, finished = self.update_status(start, end, min_len, last_session, finished)

    def initialize(self):
        first_iters = np.arange(self.batch_size)  # session indices used for the first batch
        last_session = self.batch_size - 1  # index of the last session currently in use
        start = self.dataset.click_offsets[self.dataset.session_idx[first_iters]]  # position in the data where each session starts
        end = self.dataset.click_offsets[self.dataset.session_idx[first_iters] + 1]  # position right after each session ends
        mask = np.array([])  # slots whose session has been fully consumed will be recorded here
        finished = False  # whether we have gone through all the data
        return start, end, mask, last_session, finished

    def update_status(self, start: np.ndarray, end: np.ndarray, min_len: int, last_session: int, finished: bool):
        # Update the state to produce the next batch of data.
        start += min_len  # __iter__ looped min_len times, so advance start by min_len
        mask = np.arange(self.batch_size)[(end - start) == 1]
        # end is where the next session starts; a gap of one between start and end means the session is over. Record it in mask.
        for i, idx in enumerate(mask, start=1):  # start as many new sessions as there are entries in mask
            new_session = last_session + i
            if new_session > self.dataset.session_idx[-1]:  # if the new session index exceeds the last one, we have seen all the training data
                finished = True
                break
            # update the next starting/ending point
            start[idx] = self.dataset.click_offsets[self.dataset.session_idx[new_session]]  # record the start of the new session in place of the finished one
            end[idx] = self.dataset.click_offsets[self.dataset.session_idx[new_session] + 1]
        last_session += len(mask)  # remember the position of the last session used
        return start, end, mask, last_session, finished
In [485]:
tr_data_loader = SessionDataLoader(tr_dataset, batch_size=4)
tr_dataset.df.head(15)
Out[485]:
| | UserId | ItemId | Rating | Time | item_idx |
|---|---|---|---|---|---|
| 1000138 | 6040 | 858 | 4 | 2000-04-25 23:05:32 | 0 |
| 999873 | 6040 | 593 | 5 | 2000-04-25 23:05:54 | 1 |
| 1000153 | 6040 | 2384 | 4 | 2000-04-25 23:05:54 | 2 |
| 1000007 | 6040 | 1961 | 4 | 2000-04-25 23:06:17 | 3 |
| 1000192 | 6040 | 2019 | 5 | 2000-04-25 23:06:17 | 4 |
| 999868 | 6040 | 573 | 4 | 2000-04-25 23:07:36 | 5 |
| 999877 | 6040 | 1419 | 3 | 2000-04-25 23:07:36 | 6 |
| 999920 | 6040 | 213 | 5 | 2000-04-25 23:07:36 | 7 |
| 999967 | 6040 | 3111 | 5 | 2000-04-25 23:07:36 | 8 |
| 999980 | 6040 | 3505 | 4 | 2000-04-25 23:07:36 | 9 |
| 1000155 | 6040 | 1734 | 2 | 2000-04-25 23:08:01 | 10 |
| 999971 | 6040 | 2503 | 5 | 2000-04-25 23:09:51 | 11 |
| 999888 | 6040 | 919 | 5 | 2000-04-25 23:09:51 | 12 |
| 999884 | 6040 | 912 | 5 | 2000-04-25 23:09:51 | 13 |
| 1000186 | 6040 | 527 | 5 | 2000-04-25 23:10:19 | 14 |
In [486]:
iter_ex = iter(tr_data_loader)
In [487]:
inputs, labels, mask = next(iter_ex)
print(f'Model Input Item Idx are : {inputs}')
print(f'Label Item Idx are : {"":5} {labels}')
print(f'Previous Masked Input Idx are {mask}')
Model Input Item Idx are : [  0  76 171 204]
Label Item Idx are :       [  1  77 172 205]
Previous Masked Input Idx are []
In [488]:
def mrr_k(pred, truth: int, k: int):
    indexing = np.where(pred[:k] == truth)[0]
    if len(indexing) > 0:
        return 1 / (indexing[0] + 1)
    else:
        return 0

def recall_k(pred, truth: int, k: int) -> int:
    answer = truth in pred[:k]
    return int(answer)
In [489]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, GRU
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tqdm import tqdm
In [490]:
def create_model(args):
    inputs = Input(batch_shape=(args.batch_size, 1, args.num_items))
    gru, _ = GRU(args.hsz, stateful=True, return_state=True, name='GRU')(inputs)
    dropout = Dropout(args.drop_rate)(gru)
    predictions = Dense(args.num_items, activation='softmax')(dropout)
    model = Model(inputs=inputs, outputs=[predictions])
    model.compile(loss=categorical_crossentropy, optimizer=Adam(args.lr), metrics=['accuracy'])
    model.summary()
    return model
In [491]:
class Args:
    def __init__(self, tr, val, test, batch_size, hsz, drop_rate, lr, epochs, k):
        self.tr = tr
        self.val = val
        self.test = test
        self.num_items = tr['ItemId'].nunique()
        self.num_sessions = tr['UserId'].nunique()
        self.batch_size = batch_size
        self.hsz = hsz
        self.drop_rate = drop_rate
        self.lr = lr
        self.epochs = epochs
        self.k = k

args = Args(data_train, data_val, data_test, batch_size=256, hsz=50, drop_rate=0.1, lr=0.001, epochs=5, k=20)
In [492]:
model = create_model(args)
Model: "model_10" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_11 (InputLayer) [(256, 1, 3408)] 0 _________________________________________________________________ GRU (GRU) [(256, 50), (256, 50)] 519000 _________________________________________________________________ dropout_10 (Dropout) (256, 50) 0 _________________________________________________________________ dense_10 (Dense) (256, 3408) 173808 ================================================================= Total params: 692,808 Trainable params: 692,808 Non-trainable params: 0 _________________________________________________________________
In [493]:
# SessionId -> UserId
def train_model(model, args):
    train_dataset = SessionDataset(args.tr)
    train_loader = SessionDataLoader(train_dataset, batch_size=args.batch_size)

    for epoch in range(1, args.epochs + 1):
        total_step = len(args.tr) - args.tr['UserId'].nunique()
        tr_loader = tqdm(train_loader, total=total_step // args.batch_size, desc='Train', mininterval=1)
        for feat, target, mask in tr_loader:
            reset_hidden_states(model, mask)  # reset the hidden state of finished sessions; see the method below
            input_ohe = to_categorical(feat, num_classes=args.num_items)
            input_ohe = np.expand_dims(input_ohe, axis=1)
            target_ohe = to_categorical(target, num_classes=args.num_items)
            result = model.train_on_batch(input_ohe, target_ohe)
            tr_loader.set_postfix(train_loss=result[0], accuracy=result[1])
        val_recall, val_mrr = get_metrics(args.val, model, args, args.k)  # evaluate on the valid set
        print(f"\t - Recall@{args.k} epoch {epoch}: {val_recall:3f}")
        print(f"\t - MRR@{args.k} epoch {epoch}: {val_mrr:3f}\n")

def reset_hidden_states(model, mask):
    gru_layer = model.get_layer(name='GRU')  # get the GRU layer from the model
    hidden_states = gru_layer.states[0].numpy()  # get the layer's hidden states
    for elt in mask:  # for each masked index, i.e., each finished session,
        hidden_states[elt, :] = 0  # zero out its hidden state
    gru_layer.reset_states(states=hidden_states)

def get_metrics(data, model, args, k: int):  # evaluate on the valid or test set
    # Almost the same as training, but with extra lines computing MRR and Recall.
    dataset = SessionDataset(data)
    loader = SessionDataLoader(dataset, batch_size=args.batch_size)
    recall_list, mrr_list = [], []

    total_step = len(data) - data['UserId'].nunique()
    for inputs, label, mask in tqdm(loader, total=total_step // args.batch_size, desc='Evaluation', mininterval=1):
        reset_hidden_states(model, mask)
        input_ohe = to_categorical(inputs, num_classes=args.num_items)
        input_ohe = np.expand_dims(input_ohe, axis=1)

        pred = model.predict(input_ohe, batch_size=args.batch_size)
        pred_arg = tf.argsort(pred, direction='DESCENDING')  # sort item indices by descending softmax score

        length = len(inputs)
        recall_list.extend([recall_k(pred_arg[i], label[i], k) for i in range(length)])
        mrr_list.extend([mrr_k(pred_arg[i], label[i], k) for i in range(length)])

    recall, mrr = np.mean(recall_list), np.mean(mrr_list)
    return recall, mrr
In [494]:
train_model(model, args)
Train: 94%|█████████▎| 2544/2715 [00:45<00:03, 55.46it/s, accuracy=0.00391, train_loss=7.28]
Evaluation: 74%|███████▍| 432/580 [06:56<02:22, 1.04it/s]
	 - Recall@20 epoch 1: 0.081543
	 - MRR@20 epoch 1: 0.018304

Train: 94%|█████████▎| 2544/2715 [00:44<00:02, 57.04it/s, accuracy=0, train_loss=6.99]
Evaluation: 74%|███████▍| 432/580 [06:40<02:17, 1.08it/s]
	 - Recall@20 epoch 2: 0.141249
	 - MRR@20 epoch 2: 0.033689

Train: 94%|█████████▎| 2544/2715 [00:43<00:02, 58.60it/s, accuracy=0.0156, train_loss=6.86]
Evaluation: 74%|███████▍| 432/580 [06:32<02:14, 1.10it/s]
	 - Recall@20 epoch 3: 0.170103
	 - MRR@20 epoch 3: 0.040294

Train: 94%|█████████▎| 2544/2715 [00:43<00:02, 58.66it/s, accuracy=0.0117, train_loss=6.79]
Evaluation: 74%|███████▍| 432/580 [06:31<02:14, 1.10it/s]
	 - Recall@20 epoch 4: 0.181849
	 - MRR@20 epoch 4: 0.042909

Train: 94%|█████████▎| 2544/2715 [00:42<00:02, 59.48it/s, accuracy=0.0156, train_loss=6.73]
Evaluation: 74%|███████▍| 432/580 [06:29<02:13, 1.11it/s]
	 - Recall@20 epoch 5: 0.187391
	 - MRR@20 epoch 5: 0.044757
In [495]:
def test_model(model, args, test):
    test_recall, test_mrr = get_metrics(test, model, args, args.k)  # use args.k so the printed k matches
    print(f"\t - Recall@{args.k}: {test_recall:3f}")
    print(f"\t - MRR@{args.k}: {test_mrr:3f}\n")

test_model(model, args, data_test)
Evaluation: 80%|████████| 464/578 [07:01<01:43, 1.10it/s]
	 - Recall@20: 0.186667
	 - MRR@20: 0.043227