Pandas groupby 및 롤링 창
그룹화 기능이 적용된 후 특정 기간 동안 한 필드의 합계를 계산하려고합니다.
내 데이터 세트는 다음과 같습니다.
Date Company Country Sold
01.01.2020 A BE 1
02.01.2020 A BE 0
03.01.2020 A BE 1
03.01.2020 A BE 1
04.01.2020 A BE 1
05.01.2020 B DE 1
06.01.2020 B DE 0
현재 날짜를 제외하고 지난 7 일 동안 "회사, 국가"그룹별로 판매 합계를 계산하는 각 행마다 새 열을 추가하고 싶습니다.
Date Company Country Sold LastWeek_Count
01.01.2020 A BE 1 0
02.01.2020 A BE 0 1
03.01.2020 A BE 1 1
03.01.2020 A BE 1 1
04.01.2020 A BE 1 3
05.01.2020 B DE 1 0
06.01.2020 B DE 0 1
나는 다음을 시도했지만 현재 날짜도 포함되어 있으며 동일한 날짜, 즉 03.01.2020에 대해 다른 값을 제공합니다.
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(7, on ='Date')['Sold'].sum().reset_index()
이러한 계산을 수행하는 데 사용할 수있는 pandas에 내장 함수가 있습니까?
답변
One way would be to first consolidate the Sold value of each group (['Date', 'Company', 'Country']) on a single line using a temporary DF.
After that, apply your .groupby
with .rolling
with an interval of 8 rows.
After calculating the sum, subtract the value of each line with the value in Sold column and add that column in the original DF with .merge
#convert Date column to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
#create a temporary DataFrame
df2 = df.groupby(['Date', 'Company', 'Country'])['Sold'].sum().reset_index()
#calc the lastweek
df2['LastWeek_Count'] = (df2.groupby(['Company', 'Country'])
.rolling(8, min_periods=1, on = 'Date')['Sold']
.sum().reset_index(drop=True)
)
#subtract the value of 'lastweek' from the current 'Sold'
df2['LastWeek_Count'] = df2['LastWeek_Count'] - df2['Sold']
#add th2 new column in the original DF
df.merge(df2.drop(columns=['Sold']), on = ['Date', 'Company', 'Country'])
#output:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0.0
1 2020-01-02 A BE 0 1.0
2 2020-01-03 A BE 1 1.0
3 2020-01-03 A BE 1 1.0
4 2020-01-04 A BE 1 3.0
5 2020-01-05 B DE 1 0.0
6 2020-01-06 B DE 0 1.0
You can use a .rolling
window of 8
and then subtract the sum of the Date (for each grouped row) to effectively get the previous 7 days. For this sample data, we should also pass min_periods=1
(otherwise you will get NaN
values, but for your actual dataset, you will need to decide what you want to do with windows that are < 8
).
Then from the .rolling
window of 8
, simply do another .groupby
of the relevant columns but also include Date
this time, and take the max
value of the newly created LastWeek_Count
column. You need to take the max
, because you have multiple records per day, so by taking the max, you are taking the total aggregated amount per Date
.
Then, create a series that takes the grouped by sum
per Date
. In the final step subtract the sum by date from the rolling 8-day max, which is a workaround to how you can get the sum of the previous 7 days, as there is not a parameter for an offset with .rolling
:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(8, min_periods=1, on='Date')['Sold'].sum().reset_index()['Sold']
df['LastWeek_Count'] = df.groupby(['Company', 'Country', 'Date'])['LastWeek_Count'].transform('max')
s = df.groupby(['Company', 'Country', 'Date'])['Sold'].transform('sum')
df['LastWeek_Count'] = (df['LastWeek_Count']-s).astype(int)
Out[17]:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0
1 2020-01-02 A BE 0 1
2 2020-01-03 A BE 1 1
3 2020-01-03 A BE 1 1
4 2020-01-04 A BE 1 3
5 2020-01-05 B DE 1 0
6 2020-01-06 B DE 0 1