On Fine-grained Temporal Emotion Recognition in Video