Efficient Temporal Action Localization via Vision-Language Modelling