[Pandas] Shape of passed values is (598, 2795), indices imply (598, 2877) error

This error came up while imputing missing values with SimpleImputer and then turning the result back into a DataFrame.

train_x has shape (598, 2877), but the data returned by SimpleImputer has shape (598, 2795).

Where did the 82 columns go?

https://stackoverflow.com/questions/62198172/im-getting-this-error-shape-of-passed-values-is-55-93315-indices-imply-6


The answer at the link above points to the solution.

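With a strategy such as 'mean', SimpleImputer's fit_transform() discards any column of train_x whose values are all NaN, since there is nothing to compute a fill value from, and imputes only the remaining columns.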
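The behavior can be reproduced on a tiny DataFrame. This is a minimal sketch: the `toy` frame and its column names are made up for illustration, and it assumes SimpleImputer's default settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column "b" contains only NaN.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [4.0, 5.0, np.nan],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(toy)

print(toy.shape)      # 3 rows, 3 columns going in
print(imputed.shape)  # 3 rows, 2 columns coming out: "b" was discarded
```

The returned array has one column fewer than the input, which is the same mismatch as (598, 2877) versus (598, 2795), only smaller.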

Let's check.

# Count the columns whose values are all NaN
cnt = 0

for i in train_x:
    if train_x[i].isnull().sum() == train_x.shape[0]:
        cnt += 1
print(cnt)

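This counts the columns of train_x in which every value is NaN.

The result is exactly 82.

So SimpleImputer dropped those 82 columns and returned imputed data of shape (598, 2795), and the error occurred when that array was wrapped in a DataFrame using all 2877 column names of train_x.

The fix is to drop the all-NaN columns before imputing.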
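To see the mismatch directly, the sketch below triggers the same ValueError on made-up data and then recovers the surviving column names from the fitted imputer's `statistics_` attribute, which is NaN for a column that had no observed values. The `toy` frame is hypothetical, and the mask approach assumes the imputer's default settings; it is an alternative to the fix used in this post, not a replacement for it.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column "b" contains only NaN.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [4.0, 5.0, np.nan],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(toy)  # 2 columns come back, not 3

# Passing all 3 original column names for a 2-column array fails.
try:
    pd.DataFrame(imputed, columns=toy.columns)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# statistics_ holds one fill value per input column; NaN marks a column
# that was discarded, so the mask selects the columns that survived.
kept = toy.columns[~np.isnan(imputer.statistics_)]
fixed = pd.DataFrame(imputed, columns=kept, index=toy.index)
print(list(fixed.columns))
```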


# Drop every column whose values are all NaN
drop_3 = []

for i in train_x:
    if train_x[i].isnull().sum() == train_x.shape[0]:  # every value in the column is NaN
        drop_3.append(i)  # collect the column name in drop_3
train_x = train_x.drop(drop_3, axis=1)  # drop all columns listed in drop_3 from train_x
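The loop above can also be written with pandas built-ins. The following is a sketch on a made-up `toy` frame: `isnull().all()` flags the all-NaN columns, and `dropna(axis=1, how="all")` removes them in one call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" contains only NaN.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [4.0, 5.0, np.nan],
})

# Number of columns in which every value is NaN.
n_all_nan = int(toy.isnull().all().sum())

# Drop only the columns that are entirely NaN; partially missing ones stay.
cleaned = toy.dropna(axis=1, how="all")

print(n_all_nan)
print(cleaned.shape)
```

Columns "a" and "c" still contain some NaN values after this step, which is expected: they are left for the imputer to fill.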

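The shape of train_x changed from (598, 2877) to (598, 2795)!

Imputing the missing values with SimpleImputer after this step no longer raises the error.

It ran without errors.

Let's check that the missing values were actually filled in.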



print('train : ',train_x.isnull().sum())
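With thousands of columns, pandas truncates the printed per-column counts, so a single total can be easier to read. This is a small sketch in which `imputed_df` is a hypothetical stand-in for the imputed train_x:

```python
import pandas as pd

# Hypothetical stand-in for train_x after imputation.
imputed_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "c": [4.0, 5.0, 4.5],
})

# Sum the per-column NaN counts into one number; 0 means nothing is missing.
total_missing = int(imputed_df.isnull().sum().sum())
print('train : ', total_missing)
```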

Full code



import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# train_x is assumed to be an already loaded DataFrame.

# Count the columns whose values are all NaN
cnt = 0
for i in train_x:
    if train_x[i].isnull().sum() == train_x.shape[0]:
        cnt += 1
print(cnt)


# Drop every column whose values are all NaN
drop_3 = []

for i in train_x:
    if train_x[i].isnull().sum() == train_x.shape[0]:  # every value in the column is NaN
        drop_3.append(i)  # collect the column name in drop_3
train_x = train_x.drop(drop_3, axis=1)  # drop all columns listed in drop_3 from train_x


# Impute the remaining missing values with the column mean using SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
train_x = pd.DataFrame(imp_mean.fit_transform(train_x), columns=train_x.columns)


# Check that no missing values remain
print('train : ', train_x.isnull().sum())
