Question about the vector-by-matrix derivative in backpropagation
Say I have a matrix and a vector as follows:
$$ W = \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \end{bmatrix} $$ $$ \vec{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} $$ $$ \vec{y} = W\vec{x} = \begin{bmatrix} w_{1,1}x_1 + w_{1,2}x_2 \\ w_{2,1}x_1 + w_{2,2}x_2 \end{bmatrix} $$
In backpropagation, I need to compute $\frac{\partial \vec{y}}{\partial W}$ in order to update $W$.
But according to the Wiki, there is no consensus on the definition of the derivative of a vector with respect to a matrix.
So how can I get the value of $\frac{\partial \vec{y}}{\partial W}$?
Answers
Whatever your notion of $\frac{\partial y}{\partial W}$ is, part of the data carried by this object is the set of all partial derivatives $\frac{\partial y}{\partial W_{ij}}$, and these derivatives should make up all of the "entries" of $\frac{\partial y}{\partial W}$. On this wiki page, the authors use only these partial derivatives and make no reference to a "total" derivative $\frac{\partial y}{\partial W}$.
Let $e_1,e_2$ denote the canonical basis of $\Bbb R^2$, i.e. the columns of the $2 \times 2$ identity matrix. We can see that these partial derivatives are given by $$ \frac{\partial y}{\partial W_{ij}} = x_j e_i. $$ To put things in terms of scalar entries, we would say that $ \frac{\partial y_k}{\partial W_{ij}} = \delta_{ik} x_j, $ where $y_k$ denotes the $k$th entry of $y$ and $\delta_{ik}$ denotes a "Kronecker delta".
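As a quick sanity check of the formula $\frac{\partial y}{\partial W_{ij}} = x_j e_i$, here is a minimal sketch in plain Python that compares it against a central finite difference. The helper names (`matvec`, `partial_formula`, `partial_numeric`) and the sample values of `W` and `x` are made up for illustration, and indices are 0-based in the code.

```python
def matvec(W, x):
    """Return y = W x for a matrix W (list of rows) and a vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def partial_formula(x, i, j, n_rows):
    """Closed form dy/dW_ij = x_j * e_i (0-based indices)."""
    return [x[j] if k == i else 0.0 for k in range(n_rows)]


def partial_numeric(W, x, i, j, eps=1e-6):
    """Central finite difference of y = W x with respect to the entry W_ij."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    y_plus, y_minus = matvec(W_plus, x), matvec(W_minus, x)
    return [(a - b) / (2 * eps) for a, b in zip(y_plus, y_minus)]


# Arbitrary example values (assumptions, not from the question).
W = [[1.0, 2.0], [3.0, 4.0]]
x = [5.0, -7.0]

# Every one of the four partial derivatives should match the closed form.
for i in range(2):
    for j in range(2):
        exact = partial_formula(x, i, j, n_rows=2)
        approx = partial_numeric(W, x, i, j)
        assert all(abs(a - b) < 1e-6 for a, b in zip(exact, approx))
```

Because $y$ is linear in $W$, the finite difference agrees with the closed form up to rounding error, whatever step size is used.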
Now, in terms of the total/Fréchet derivative, we could say the following. $y(W)$ defines a function from $\Bbb R^{2 \times 2}$ to $\Bbb R^2$, so for any $X \in \Bbb R^{2 \times 2}$, $Dy(X)$ defines a linear map from $\Bbb R^{2 \times 2}$ to $\Bbb R^2$. Because $y$ is itself linear in $W$, its derivative is the same at every point $X$: for any $H \in \Bbb R^{2 \times 2}$, we have $$ Dy(X)(H) = y(H) = Hx. $$ Although it is not an array of entries, this function $Dy$ is the operator that the array/tensor $\frac{\partial y}{\partial W}$ would represent.
We can recover the partial derivatives by evaluating the "directional derivatives" $Dy(X)(E_{ij})$, where $E_{ij} = e_ie_j^T$ is the matrix with a $1$ in the $i,j$ entry and zeros elsewhere. Indeed, we have $$ Dy(X)(E_{ij}) = E_{ij} x = e_i (e_j^Tx) = x_j e_i. $$
The chain rule tells us the following: for any function $g:\mathcal Z \to \Bbb R^{2 \times 2}$, we may compute the total derivative of $y \circ g$ as follows. For any $z \in \mathcal Z$, the derivative (a linear map from $\mathcal Z$ to $\Bbb R^{2}$) is given by $$ D(y \circ g)(z) = Dy(g(z)) \circ Dg(z), $$ where $Dy(g(z))$ is a linear map from $\Bbb R^{2 \times 2}$ to $\Bbb R^2$ and $Dg(z)$ is a linear map from $\mathcal Z$ to $\Bbb R^{2 \times 2}$. More concretely, if $h \in \mathcal Z$, then the directional derivative "along" $h$ should be given by $$ D(y \circ g)(z)(h) = [Dy(g(z)) \circ Dg(z)](h) = [Dg(z)(h)] x. $$
Similarly, for any function $p: \Bbb R^2 \to \mathcal Z$, we may compute the total derivative of $p \circ y$ as follows. For any $X \in \Bbb R^{2 \times 2}$, the derivative (a linear map from $\Bbb R^{2 \times 2}$ to $\mathcal Z$) is given by $$ D(p \circ y)(X) = Dp(y(X)) \circ Dy(X), $$ where $Dp(y(X))$ is a linear map from $\Bbb R^2$ to $\mathcal Z$ and $Dy(X)$ is a linear map from $\Bbb R^{2 \times 2}$ to $\Bbb R^2$. More concretely, if $H \in \Bbb R^{2 \times 2}$, then the directional derivative "along" $H$ should be given by $$ D(p \circ y)(X)(H) = [Dp(y(X)) \circ Dy(X)](H) = Dp(y(X))(Hx). $$
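To connect the last formula to the update of $W$ in backpropagation, here is a minimal sketch in plain Python for the special case of a scalar-valued $p$. The squared-error loss $p(y) = \tfrac12\|y - t\|^2$, the target `t`, and all function names are assumptions chosen for illustration; they do not appear in the question. With $g = \nabla p(y)$, the directional derivative $Dp(y(X))(Hx) = g^T H x$ equals $\sum_{ij} (g x^T)_{ij} H_{ij}$, so the gradient with respect to $W$ is the outer product $g x^T$.

```python
def matvec(W, x):
    """Return y = W x for a matrix W (list of rows) and a vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def loss(y, t):
    """Hypothetical scalar loss p(y) = 0.5 * ||y - t||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y, t))


def loss_grad_y(y, t):
    """Gradient of the hypothetical loss with respect to y: g = y - t."""
    return [a - b for a, b in zip(y, t)]


def directional_derivative(W, x, t, H):
    """D(p o y)(W)(H) = Dp(y(W))(H x) = g . (H x)."""
    g = loss_grad_y(matvec(W, x), t)
    return sum(g_i * v_i for g_i, v_i in zip(g, matvec(H, x)))


def grad_W(W, x, t):
    """Gradient with respect to W as the outer product g x^T."""
    g = loss_grad_y(matvec(W, x), t)
    return [[g_i * x_j for x_j in x] for g_i in g]


# Arbitrary example values (assumptions, not from the question).
W = [[1.0, 2.0], [3.0, 4.0]]
x = [5.0, -7.0]
t = [0.0, 1.0]
H = [[0.5, -1.0], [2.0, 0.25]]

G = grad_W(W, x, t)

# The directional derivative along H is the entrywise inner product <G, H>.
inner = sum(G[i][j] * H[i][j] for i in range(2) for j in range(2))
assert abs(inner - directional_derivative(W, x, t, H)) < 1e-9

# Finite-difference check of the same directional derivative.
eps = 1e-6
W_plus = [[W[i][j] + eps * H[i][j] for j in range(2)] for i in range(2)]
W_minus = [[W[i][j] - eps * H[i][j] for j in range(2)] for i in range(2)]
numeric = (loss(matvec(W_plus, x), t) - loss(matvec(W_minus, x), t)) / (2 * eps)
assert abs(numeric - inner) < 1e-5
```

Under these assumptions, a gradient-descent step would subtract a multiple of `G` from `W`, and no four-index array $\frac{\partial y}{\partial W}$ ever has to be stored.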