Why doesn't vanishing gradient matter in Transformers?
When we learn about activation functions, the sigmoid is usually not the go-to choice because it causes the vanishing gradient problem. But as I understand it, a Transformer applies softmax, a generalization of the sigmoid, in every block to compute attention scores, yet I haven't seen any articles saying that Transformer networks are vulnerable to vanishing gradients. I understand that residual connections and the scaled softmax (dividing the attention logits by sqrt(d_k)) were introduced to help gradients flow, but I was wondering: is there any other reason why softmax is not a big deal in terms of vanishing gradients?
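To make my question concrete, here's a minimal sketch (assuming PyTorch; the single-head `AttnBlock` module and all variable names are just made up for illustration) that stacks plain softmax-attention blocks and prints the gradient norm reaching the input, with and without residual connections. The residual version keeps an identity path for the gradient, so the input gradient shouldn't collapse with depth the way the plain stack's can:

    import torch
    import torch.nn as nn

    d_model, n_layers, seq_len = 64, 24, 16

    class AttnBlock(nn.Module):
        """One single-head softmax-attention block (no norm/MLP, for simplicity)."""
        def __init__(self, use_residual):
            super().__init__()
            self.qkv = nn.Linear(d_model, 3 * d_model)
            self.use_residual = use_residual

        def forward(self, x):
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
            scores = q @ k.transpose(-2, -1) / d_model ** 0.5
            out = scores.softmax(dim=-1) @ v
            return x + out if self.use_residual else out

    for use_residual in (False, True):
        torch.manual_seed(0)
        blocks = nn.Sequential(*[AttnBlock(use_residual) for _ in range(n_layers)])
        x = torch.randn(seq_len, d_model, requires_grad=True)
        blocks(x).sum().backward()
        print(f"residual={use_residual}: input grad norm = {x.grad.norm().item():.3e}")

So the residual path clearly helps, but that doesn't feel specific to softmax. Is there something about softmax itself (beyond the sqrt(d_k) scaling) that makes it behave differently from a sigmoid used as a pointwise activation?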