Abstract
As the fluency of ad texts automatically generated by natural language generation technologies continues to improve, there is an increasing demand to assess the quality of these creatives in real-world settings. We propose AdTEC, the first public benchmark for evaluating ad texts from multiple perspectives within practical advertising operations. Our contributions are as follows: (i) defining five tasks for evaluating the quality of ad texts and constructing a Japanese dataset based on the practical operational experience of advertising agencies, which is typically kept in-house; (ii) validating the performance of existing pre-trained language models (PLMs) and human evaluators on this dataset; and (iii) analyzing the characteristics of the benchmark and identifying its remaining challenges. Our results show that while PLMs reach a practical level of performance on several tasks, humans still outperform them in certain domains, indicating that there remains significant room for improvement in this area.
Overview
Key Contributions
First Public AdOps Dataset
Constructing the first public benchmark for ad text evaluation based on practical, real-world advertising operations (AdOps).
Comprehensive Benchmarking
Validating the performance of various PLMs and human evaluators, establishing strong state-of-the-art baselines.
In-depth Analysis & Challenges
Analyzing the dataset's characteristics and identifying key challenges to guide future research in the advertising NLP domain.
Tasks
Task Description
The goal of the Ad Acceptability task is to predict the overall quality of an ad text with binary labels: `acceptable` / `unacceptable`.
Background
Because most ad delivery platforms impose text length restrictions, minor grammatical errors are tolerated so that ads remain readable within the limited space. However, excessive compression can mislead customers, and such poor-quality ads should be detected before delivery to avoid negative impacts on the advertiser.
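As a rough illustration of how this task can be approached, the sketch below frames acceptability prediction as binary sequence classification with a pre-trained language model via the Hugging Face `transformers` library. The model name, label mapping, and example texts are assumptions for illustration only, not the official AdTEC baseline.

```python
# Minimal sketch of the Ad Acceptability task as binary sequence classification.
# The model name, label mapping, and example texts below are illustrative
# assumptions and are not part of the official AdTEC release.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-multilingual-cased"  # assumed; a Japanese PLM would typically be used

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Tiny in-memory stand-in for the real train split: 1 = acceptable, 0 = unacceptable.
train_data = Dataset.from_dict({
    "text": ["今なら送料無料でお届け", "最安!!今すぐ!!激安!!最安!!"],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="adtec-acceptability",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        report_to=[],
    ),
    train_dataset=train_data,
)
trainer.train()
```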

Dataset Statistics
Statistics of the data included in the AdTEC benchmark
| Task | Train | Dev | Test |
| --- | ---: | ---: | ---: |
| Ad Acceptability | 13,265 | 970 | 980 |
| Ad Consistency | 10,635 | 945 | 970 |
| Ad Performance Estimation | 125,087 | 965 | 965 |
| A3 Recognition | 1,856 | 465 | 410 |
| Ad Similarity | 4,980 | 623 | 629 |
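Since the dataset is distributed via the Hugging Face Hub (see the FAQ below), the split sizes above can be cross-checked after loading. The short sketch below shows one way to do this; the repository id and configuration name are hypothetical placeholders, as the exact identifiers are not given in this section.

```python
# Sketch: load one AdTEC task from the Hugging Face Hub and print split sizes
# to compare against the table above. The repository id and configuration name
# below are hypothetical placeholders; check the Hub page for the actual ones.
from datasets import load_dataset

dataset = load_dataset("example-org/AdTEC", "ad-acceptability")  # hypothetical identifiers

for split in ("train", "validation", "test"):
    if split in dataset:
        print(f"{split}: {len(dataset[split]):,} examples")
```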
Citation
Please cite our paper when you use this dataset.
@inproceedings{zhang2025adtec,
  title={{AdTEC}: A Unified Benchmark for Evaluating Text Quality in Search Engine Advertising},
  author={Peinan Zhang and Yusuke Sakai and Masato Mita and Hiroki Ouchi and Taro Watanabe},
  booktitle={Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)},
  year={2025},
  publisher={Association for Computational Linguistics},
  eprint={2408.05906},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2408.05906},
}
Frequently Asked Questions
What is the AdTEC dataset?
The AdTEC dataset is the first public benchmark for evaluating the quality of ad text in search engine advertising from multiple, practical perspectives. It aims to provide evaluation criteria based on real-world advertising operations, which have been lacking in previous research.
Is the dataset available in languages other than Japanese?
Currently, the dataset consists of Japanese text only. However, the task design and evaluation framework proposed in this research are language-independent and can be applied to other languages.
How can I access the dataset and code?
The dataset is available on the Hugging Face Hub, and the associated code can be obtained from the GitHub repository.
Under what license is the dataset released?
The AdTEC dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license, which requires attribution, restricts use to non-commercial purposes, and requires that any derivatives be shared under the same license.