2016->2017: this year can be seen as the "eve" of the research boom in summarization, reflected in three ways:
a) Training: reinforcement learning and adversarial learning entered NLP, making it possible for summarization models to optimize a wider range of loss functions;
b) Data: the CNN/DM dataset was introduced (together with some fairly weak baselines), which greatly pushed the field forward;
c) Model architecture: new mechanisms such as copy [11] and coverage [12] were proposed.
Gigaword:[paper][data]
The dataset was organized by Rush et al. for abstractive summarization, although it contains many spurious headline-article pairs. There are about 3.8M training, 189k development, and 1,951 test samples.
LCSTS:[paper][data]
A large-scale Chinese short-text summarization dataset constructed from the Chinese microblogging website Sina Weibo.
CNN/DM:[paper][data]
The dataset was re-organized by Nallapati et al. for summarization, covering 286,817 training pairs, 13,368 validation pairs, and 11,487 test pairs.
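As a quick sanity check of the split sizes above, here is a minimal sketch using the HuggingFace `datasets` library, assuming the hub copy under `cnn_dailymail` (note the non-anonymized `3.0.0` config has slightly different counts from the anonymized Nallapati et al. version listed here):

```python
# Minimal sketch: loading CNN/DM via HuggingFace `datasets` and inspecting one pair.
# The dataset name and config below are assumptions about the hub-hosted copy.
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")

for split in ("train", "validation", "test"):
    print(split, len(cnn_dm[split]))          # compare against the split sizes above

sample = cnn_dm["train"][0]
print(sample["article"][:300])                # source news article
print(sample["highlights"])                   # reference summary (bullet-style highlights)
```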
Newsroom:[paper][data]
A summarization dataset of 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major news publications.
Newsroom-processed (ACL2019):[paper][data]
The original Newsroom dataset is pre-processed and repurposed for cross-domain evaluation.
Xsum:[paper][data]
A real-world, large-scale dataset collected by harvesting online articles from the British Broadcasting Corporation (BBC); it does not favor extractive strategies and calls for an abstractive modeling approach.
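The claim that XSum "does not favor extractive strategies" is commonly quantified by the proportion of novel n-grams in the reference summaries, i.e. summary n-grams that never appear in the source document. Below is a minimal, dataset-agnostic sketch of that measure; the function name and example strings are illustrative, not taken from the XSum paper.

```python
def novel_ngram_ratio(source: str, summary: str, n: int = 2) -> float:
    """Fraction of summary n-grams that do not appear in the source document."""
    def ngrams(text: str, k: int):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

    src, summ = ngrams(source, n), ngrams(summary, n)
    if not summ:
        return 0.0
    return len(summ - src) / len(summ)

# Higher values indicate a more abstractive dataset; XSum-style one-sentence
# summaries tend to score much higher than lead-biased corpora like CNN/DM.
print(novel_ngram_ratio("the cat sat on the mat today", "a cat rested on a mat"))
```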
arXiv:[paper][data]
Scientific papers, which are an example of long-form structured documents.
PubMed:[paper][data]
Scientific papers, which are an example of long-form structured documents.
RottenTomatoes:[paper][data]
A dataset built from Rotten Tomatoes, a movie review website that aggregates both professional critic and user-generated reviews.
Reddit TIFU:[paper][data]
Compared with news datasets, it suffers less from the biases that key sentences are usually located at the beginning of the text and that favorable summary candidates already appear in the text in similar form.
BIGPATENT (ACL2019):[paper][data]
It consists of 1.3 million records of U.S. patent documents along with human-written abstractive summaries.
Multi-Document Summarization
DUC:[paper][data]
A text summarization evaluation effort launched by NIST, the Document Understanding Conference (DUC).
WikiSum:[paper][data]
The input comprises a Wikipedia topic (the article title) and a collection of non-Wikipedia reference documents; the target is the Wikipedia article text.
ScisummNet (Yale University; AAAI19):[paper][data]
It contains 1,000 examples of papers with citation information and human-written summaries, and is orders of magnitude larger than prior datasets.
Multi-News (Yale University; ACL2019):[paper][data]
A multi-document summarization dataset in the news domain.
Multi-Modal Summarization
MMS:[paper][data]
The dataset provides an asynchronous collection of multi-modal information about a specific news topic (i.e., there are no given descriptions for the images and no subtitles for the videos), including multiple documents, images, and videos, with the goal of generating a fixed-length textual summary.
MSMO:[paper][data]
The dataset provides a testbed for Multimodal Summarization with Multimodal Output (MSMO), constructed from the Daily Mail website.
How2:[NeurIPS18][ACL19][data]
The corpus consists of around 80,000 instructional videos (about 2,000 hours) with associated English subtitles and summaries.
New Summarization Tasks
Idebate:[paper][data]
It is a Wikipedia-style website for gathering pro and con arguments on controversial issues.
Debatepedia:[paper][data]
The dataset is created from Debatepedia, an encyclopedia of pro and con arguments and quotes on critical debate topics.
Funcom:[paper][data]
It is a collection of ~2.1 million Java methods and their associated Javadoc comments.
TALKSUMM (ACL2019):[paper][data]
It contains 1,716 summaries of papers from several computer science conferences, generated based on the videos of the conference talks.
Multi-Aspect CNN/DM (ACL2019):[paper][data]
An aspect-based summarization dataset.