The strange case of reproducibility versus representativeness in contextual suggestion test collections