[EDA] Feature Visualization

선뭉 2023. 9. 12. 16:03

Checking the data distribution by target value (0: normal / 1: anomalous)

- Categorical features

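import matplotlib.pyplot as plt
import seaborn as sns

# Split the data by target value to compare normal and anomalous samples
# (train is assumed to be a DataFrame loaded earlier)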
train_0 = train[train['Y_LABEL']==0]
train_1 = train[train['Y_LABEL']==1]

# 'COMPONENT_ARBITRARY' (a test feature)
fig, ax = plt.subplots(1, 2, figsize=(16, 6))

# Bars are ordered by frequency within each group
sns.countplot(x = 'COMPONENT_ARBITRARY',
              data = train_0,
              ax = ax[0],
              order = train_0['COMPONENT_ARBITRARY'].value_counts().index)
ax[0].tick_params(labelsize=12)
ax[0].set_title('anomaly = 0')
ax[0].set_ylabel('count')
ax[0].tick_params(axis='x', rotation=50)  # rotate only the category labels


sns.countplot(x = 'COMPONENT_ARBITRARY',
              data = train_1,
              ax = ax[1],
              order = train_1['COMPONENT_ARBITRARY'].value_counts().index)
ax[1].tick_params(labelsize=12)
ax[1].set_title('anomaly = 1')
ax[1].set_ylabel('count')
ax[1].tick_params(axis='x', rotation=50)  # rotate only the category labels


plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.show()
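
Because the two groups can differ a lot in size, raw counts in the two panels are hard to compare directly. As a rough complement to the count plots, the share of each category within each target class can be tabulated. The sketch below uses a small made-up DataFrame named toy (not the real data); only the column names COMPONENT_ARBITRARY and Y_LABEL come from the post.

```python
import pandas as pd

# Toy stand-in for the real training data (values are made up for illustration)
toy = pd.DataFrame({
    'COMPONENT_ARBITRARY': ['COMPONENT1', 'COMPONENT1', 'COMPONENT2', 'COMPONENT3',
                            'COMPONENT1', 'COMPONENT2', 'COMPONENT3', 'COMPONENT3'],
    'Y_LABEL': [0, 0, 0, 0, 1, 1, 1, 1],
})

# Share of each category within each target class (each column sums to 1)
share = pd.crosstab(toy['COMPONENT_ARBITRARY'], toy['Y_LABEL'], normalize='columns')
print(share)
```

With the real data, replacing toy with train should give the same kind of table, where a category whose share differs clearly between column 0 and column 1 is worth a closer look.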

- Numerical features

Define a function called num_plot, then pass each numerical variable to it.

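# Plotting function for numerical features
def num_plot(train, train_0, train_1, column):
    # test_features is assumed to be a list of column names defined earlier
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Red if the column is in test_features, blue otherwise
    color = 'red' if column in test_features else 'blue'

    # sns.distplot is deprecated; histplot with kde=True draws counts
    # plus a smoothed curve, which also matches the 'count' y-label
    sns.histplot(train_0[column], color = color, kde = True,
                 ax = axes[0])
    axes[0].tick_params(labelsize=12)
    axes[0].set_title('anomaly = 0')
    axes[0].set_ylabel('count')
    axes[0].set_xlim(None, train[column].max())  # shared upper x limit

    sns.histplot(train_1[column], color = color, kde = True,
                 ax = axes[1])
    axes[1].tick_params(labelsize=12)
    axes[1].set_title('anomaly = 1')
    axes[1].set_ylabel('count')
    axes[1].set_xlim(None, train[column].max())  # shared upper x limit

    plt.subplots_adjust(wspace=0.3, hspace=0.3)
    plt.show()

num_plot(train, train_0, train_1, 'SAMPLE_TRANSFER_DAY')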
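
To apply the same check to every numerical variable rather than one at a time, the numerical columns can be collected programmatically. The sketch below is only an illustration on a made-up DataFrame named toy (its values are not from the real data); it also prints per-class summary statistics as a numeric cross-check for what the plots show.

```python
import pandas as pd

# Toy stand-in for the real training data (values are made up for illustration)
toy = pd.DataFrame({
    'SAMPLE_TRANSFER_DAY': [3, 5, 7, 9, 10, 20, 30, 40],
    'COMPONENT_ARBITRARY': ['COMPONENT1', 'COMPONENT2', 'COMPONENT1', 'COMPONENT3',
                            'COMPONENT1', 'COMPONENT2', 'COMPONENT3', 'COMPONENT3'],
    'Y_LABEL': [0, 0, 0, 0, 1, 1, 1, 1],
})

# Numerical columns, excluding the target itself
num_cols = toy.select_dtypes(include='number').columns.drop('Y_LABEL')

# Per-class summary statistics for each numerical column
summary = toy.groupby('Y_LABEL')[list(num_cols)].agg(['mean', 'median', 'max'])
print(summary)

# With the real data, the same list could drive the plots:
# for col in num_cols:
#     num_plot(train, train_0, train_1, col)
```

If the mean or median differs noticeably between the two classes, the corresponding pair of histograms should show the shift as well.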