Merge two dataframes based on multiple conditions

Aug 23 2020

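I am trying to compare two dataframes (df-a and df-b) and find the rows where the ID and date in one dataframe (df-b) fall within a date range in the other dataframe (df-a) that has the same ID. I then want to take all the columns from df-a and concatenate them onto the rows of df-b where they match. For example: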

If I have a dataframe df-a in the following format:

    ID       Start_Date    End_Date     A   B   C   D   E 
0   cd2      2020-06-01    2020-06-24   'a' 'b' 'c' 10  20
1   cd2      2020-06-24    2020-07-21
2   cd56     2020-06-10    2020-07-03
3   cd915    2020-04-28    2020-07-21
4   cd103    2020-04-13    2020-04-24

and df-b in the following format:

    ID      Date
0   cd2     2020-05-12
1   cd2     2020-04-12
2   cd2     2020-06-10
3   cd15    2020-04-28
4   cd193   2020-04-13
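For anyone who wants to reproduce this, the two sample frames above could be built roughly as follows. This is only a sketch: the values are copied from the tables, the empty cells in columns A to E are assumed to be missing values, and parsing the date columns with `pd.to_datetime` is an added assumption so that comparisons do not depend on how the dates are formatted as strings.

```python
import numpy as np
import pandas as pd

# df_a: one row per ID and date range; only the first row has values in A..E
# (the blank cells in the table are assumed to be missing values)
df_a = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd56", "cd915", "cd103"],
    "Start_Date": ["2020-06-01", "2020-06-24", "2020-06-10", "2020-04-28", "2020-04-13"],
    "End_Date": ["2020-06-24", "2020-07-21", "2020-07-03", "2020-07-21", "2020-04-24"],
    "A": ["a", None, None, None, None],
    "B": ["b", None, None, None, None],
    "C": ["c", None, None, None, None],
    "D": [10, np.nan, np.nan, np.nan, np.nan],
    "E": [20, np.nan, np.nan, np.nan, np.nan],
})

# df_b: the rows to be matched against the ranges in df_a
df_b = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd2", "cd15", "cd193"],
    "Date": ["2020-05-12", "2020-04-12", "2020-06-10", "2020-04-28", "2020-04-13"],
})

# Parse the date columns (an assumption; the tables do not show the dtypes)
for col in ["Start_Date", "End_Date"]:
    df_a[col] = pd.to_datetime(df_a[col])
df_b["Date"] = pd.to_datetime(df_b["Date"])

print(df_a.dtypes)
print(df_b.dtypes)
```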

I would like an output dataframe df-c that looks like this:

    ID      Date        Start_Date  End_Date    A   B   C   D   E 
0   cd2     2020-05-12      -           -       -   -   -   -   -
1   cd2     2020-04-12      -           -       -   -   -   -   -
2   cd2     2020-06-10  2020-06-01  2020-06-24  'a' 'b' 'c' 10  20
3   cd15    2020-04-28      -           -       -   -   -   -   -
4   cd193   2020-04-13      -           -       -   -   -   -   -

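In a previous post I got a great answer that let me compare the dataframes and drop the rows where this condition was met, but I am struggling to work out how to extract the matching information from df-a properly. My current attempt is below.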

df_c=df_b.copy()

ar=[]
for i in range(df_c.shape[0]):
    currentID = df_c.stafnum[i]
    currentDate = df_c.Date[i]
    df_a_entriesForCurrentID = df_a.loc[df_a.stafnum == currentID]

    for j in range(df_a_entriesForCurrentID.shape[0]):
        startDate = df_a_entriesForCurrentID.iloc[j,:].Leave_Start_Date
        endDate = df_a_entriesForCurrentID.iloc[j,:].Leave_End_Date

        if (startDate <= currentDate <= endDate):
            print(df_c.loc[i])
            print(df_a_entriesForCurrentID.iloc[j,:])
            
            #df_d=pd.concat([df_c.loc[i], df_a_entriesForCurrentID.iloc[j,:]], axis=0)
            
            #df_fin_2=df_fin.append(df_d, ignore_index=True)
            #ar.append(df_d)

Answers

1 RichieV Aug 24 2020 at 03:41

So you need to create a kind of "soft" match. Here is a solution that tries to vectorize the date range matching.

import numpy as np
import pandas as pd

# note: if the dates are strings, the inequalities only work when they are in y-m-d format
# otherwise it is safer to parse each date column, e.g. `df_b.Date = pd.to_datetime(df_b.Date)`

# naming follows the question: df_a holds the ranges, df_b holds only `ID` and `Date`
# create a groupby object once so we can efficiently filter df_a inside the loop
# good idea if df_a is considerably large and has many different IDs
gdf_a = df_a.groupby('ID')
a_IDs = gdf_a.indices # returns a dictionary with grouped rows {ID: arr(integer-indices)}

matched = [] # collect one matched row from df_a (or NaNs) per row of df_b
# iterate over rows with `.itertuples()`, more efficient than iterating range(len(df_b))
for i, ID, date in df_b.itertuples():
    if ID in a_IDs:
        gID = gdf_a.get_group(ID) # rows of df_a with this ID
        inrange = gID.Start_Date.le(date) & gID.End_Date.ge(date)
        if inrange.any():
            matched.append(
                gID.loc[inrange.idxmax()] # get the first row with date inrange
                .values[1:] # use the array without column labels and slice `ID` out
            )
        else:
            matched.append([np.nan] * (df_a.shape[1] - 1)) # no date inrange, fill with NaNs
    else:
        matched.append([np.nan] * (df_a.shape[1] - 1)) # no ID match, fill with NaNs
# build the matched columns on df_b's index so the join lines up row by row
df_c = df_b.join(pd.DataFrame(matched, columns=df_a.columns[1:], index=df_b.index))
print(df_c)

Output

      ID        Date  Start_Date    End_Date    A    B    C     D     E
0    cd2  2020-05-12         NaN         NaN  NaN  NaN  NaN   NaN   NaN
1    cd2  2020-04-12         NaN         NaN  NaN  NaN  NaN   NaN   NaN
2    cd2  2020-06-10  2020-06-01  2020-06-24    a    b    c  10.0  20.0
3   cd15  2020-04-28         NaN         NaN  NaN  NaN  NaN   NaN   NaN
4  cd193  2020-04-13         NaN         NaN  NaN  NaN  NaN   NaN   NaN
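If a version without an explicit loop is preferred, the same kind of result can probably be reached with an inner `merge` on `ID`, a boolean filter on the date range, and a left merge back onto `df_b`. The sketch below is not part of the answer above. It assumes the question's naming (`df_a` holds the ranges, `df_b` holds `ID` and `Date`), rebuilds a trimmed-down copy of the sample data so that it runs on its own, and uses a hypothetical helper column named `_row`.

```python
import pandas as pd

# Trimmed copy of the sample data (only columns A and D are kept, for brevity)
df_a = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd56"],
    "Start_Date": pd.to_datetime(["2020-06-01", "2020-06-24", "2020-06-10"]),
    "End_Date": pd.to_datetime(["2020-06-24", "2020-07-21", "2020-07-03"]),
    "A": ["a", None, None],
    "D": [10.0, None, None],
})
df_b = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd2", "cd15", "cd193"],
    "Date": pd.to_datetime(
        ["2020-05-12", "2020-04-12", "2020-06-10", "2020-04-28", "2020-04-13"]
    ),
})

# Remember each df_b row's position so matches can be mapped back
# (`_row` is a hypothetical helper column, dropped again at the end)
left = df_b.reset_index(drop=True).rename_axis("_row").reset_index()

# Pair every df_b row with every df_a range that shares its ID
pairs = left.merge(df_a, on="ID", how="inner")

# Keep only the pairs whose Date lies inside [Start_Date, End_Date] (inclusive)
in_range = pairs["Date"].between(pairs["Start_Date"], pairs["End_Date"])
hits = pairs.loc[in_range].drop_duplicates("_row")  # first matching range per row

# Attach the matched columns back onto df_b; unmatched rows get NaN/NaT
df_c = (
    left.merge(hits.drop(columns=["ID", "Date"]), on="_row", how="left")
    .drop(columns="_row")
)
print(df_c)
```

One trade-off to keep in mind: the intermediate `pairs` frame has one row per combination of a `df_b` row and a `df_a` range with the same ID, so it can get large when an ID has many ranges. In that case the grouped loop above may be the more memory-friendly choice.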