Pandas groupby 및 롤링 창

Aug 21 2020

그룹화 기능이 적용된 후 특정 기간 동안 한 필드의 합계를 계산하려고합니다.

내 데이터 세트는 다음과 같습니다.

Date          Company   Country    Sold
01.01.2020       A          BE       1
02.01.2020       A          BE       0
03.01.2020       A          BE       1
03.01.2020       A          BE       1
04.01.2020       A          BE       1
05.01.2020       B          DE       1
06.01.2020       B          DE       0

현재 날짜를 제외하고 지난 7 일 동안 "회사, 국가"그룹별로 판매 합계를 계산하는 각 행마다 새 열을 추가하고 싶습니다.

Date          Company   Country    Sold      LastWeek_Count
01.01.2020       A          BE       1           0
02.01.2020       A          BE       0           1
03.01.2020       A          BE       1           1
03.01.2020       A          BE       1           1
04.01.2020       A          BE       1           3
05.01.2020       B          DE       1           0
06.01.2020       B          DE       0           1

나는 다음을 시도했지만 현재 날짜도 포함되어 있으며 동일한 날짜, 즉 03.01.2020에 대해 다른 값을 제공합니다.

df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(7, on ='Date')['Sold'].sum().reset_index()

이러한 계산을 수행하는 데 사용할 수있는 pandas에 내장 함수가 있습니까?

답변

Terry Aug 21 2020 at 09:27

One way would be to first consolidate the Sold value of each group (['Date', 'Company', 'Country']) on a single line using a temporary DF.
After that, apply your .groupby with .rolling with an interval of 8 rows.
After calculating the sum, subtract the value of each line with the value in Sold column and add that column in the original DF with .merge

#convert Date column to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
#create a temporary DataFrame
df2 = df.groupby(['Date', 'Company', 'Country'])['Sold'].sum().reset_index()
#calc the lastweek
df2['LastWeek_Count'] = (df2.groupby(['Company', 'Country'])
                            .rolling(8, min_periods=1, on = 'Date')['Sold']
                            .sum().reset_index(drop=True)
                        ) 
#subtract the value of 'lastweek' from the current 'Sold'
df2['LastWeek_Count'] = df2['LastWeek_Count'] - df2['Sold']
#add th2 new column in the original DF
df.merge(df2.drop(columns=['Sold']), on = ['Date', 'Company', 'Country'])
#output:
    Date        Company Country Sold    LastWeek_Count
0   2020-01-01  A       BE      1       0.0
1   2020-01-02  A       BE      0       1.0
2   2020-01-03  A       BE      1       1.0
3   2020-01-03  A       BE      1       1.0
4   2020-01-04  A       BE      1       3.0
5   2020-01-05  B       DE      1       0.0
6   2020-01-06  B       DE      0       1.0
1 DavidErickson Aug 21 2020 at 07:10

You can use a .rolling window of 8 and then subtract the sum of the Date (for each grouped row) to effectively get the previous 7 days. For this sample data, we should also pass min_periods=1 (otherwise you will get NaN values, but for your actual dataset, you will need to decide what you want to do with windows that are < 8).

Then from the .rolling window of 8, simply do another .groupby of the relevant columns but also include Date this time, and take the max value of the newly created LastWeek_Count column. You need to take the max, because you have multiple records per day, so by taking the max, you are taking the total aggregated amount per Date.

Then, create a series that takes the grouped by sum per Date. In the final step subtract the sum by date from the rolling 8-day max, which is a workaround to how you can get the sum of the previous 7 days, as there is not a parameter for an offset with .rolling:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(8, min_periods=1, on='Date')['Sold'].sum().reset_index()['Sold']
df['LastWeek_Count'] = df.groupby(['Company', 'Country', 'Date'])['LastWeek_Count'].transform('max')
s = df.groupby(['Company', 'Country', 'Date'])['Sold'].transform('sum')
df['LastWeek_Count'] = (df['LastWeek_Count']-s).astype(int)

Out[17]: 
        Date Company Country  Sold  LastWeek_Count
0 2020-01-01       A      BE     1               0
1 2020-01-02       A      BE     0               1
2 2020-01-03       A      BE     1               1
3 2020-01-03       A      BE     1               1
4 2020-01-04       A      BE     1               3
5 2020-01-05       B      DE     1               0
6 2020-01-06       B      DE     0               1