This paper investigates the problem of text-video retrieval. Many existing works encode video and text separately, yet textual cues often describe only specific subregions of a video. This mismatch is particularly pronounced for long videos and videos in the wild. The proposed cross-modal attention model, X-Pool, addresses this problem by generating video representations conditioned on the input text.
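The core idea of text-conditioned video pooling can be illustrated with a minimal sketch: the text embedding acts as a query over per-frame video embeddings, and the pooled video representation is a softmax-weighted sum of frames. The function name, embedding shapes, and temperature value below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def text_conditioned_pool(text_emb, frame_embs, temperature=0.1):
    """Hypothetical sketch of text-conditioned attention pooling.

    text_emb:   (dim,) embedding of the query text
    frame_embs: (num_frames, dim) per-frame video embeddings
    Returns a (dim,) video representation weighted toward frames
    most relevant to the text.
    """
    # Dot-product relevance of each frame to the text query.
    scores = frame_embs @ text_emb / temperature      # (num_frames,)
    # Numerically stable softmax over frames.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Text-conditioned video representation: weighted frame aggregate.
    return weights @ frame_embs                       # (dim,)

rng = np.random.default_rng(0)
text = rng.standard_normal(4)
frames = rng.standard_normal((8, 4))
pooled = text_conditioned_pool(text, frames)
```

Frames irrelevant to the text receive low attention weight, so noise in long or unconstrained videos is suppressed in the pooled representation.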
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval