Rashkin et al. 2020, "PlotMachines: Outline-Conditioned Generation of Multi-Paragraph Narratives"
https://github.com/hrashkin/plotmachines
Introduction
- Task: outline-conditioned story generation
- Outline: a set of phrases describing the key characters and events in a story
- To generate a coherent narrative consistent with the given outline
- Challenge: the input provides only the rough elements of the plot
- How the plot elements intertwine with each other across different parts of the story
- This must be determined dynamically, based on what has already been composed, while remaining consistent with the outline and the overall narrative structure
- Model: PlotMachines
- Transforms an outline into a multi-paragraph story
- Uses dynamic memory blocks
- To keep track of the implicit plot states (outline + story so far)
- Informed with high-level narrative structures using discourse labels
- To learn different styles of writing in different parts of the narrative (beginning, middle, end)
Outline-conditioned generation
- How story generation models can plan a long (multi-paragraph) narrative according to controllable story elements (the outline)
- To be flexible: outlines are loosely defined as lists of an arbitrary number of un-ordered multi-word plot points that guide the story to be generated
- In this paper, the scope of plot points is limited to events and phrases to be loosely integrated into the output story
- Because they can be automatically extracted
- To be natural: stories have appropriate discourse and narrative flow
Dataset
- Wikiplots corpus
- WritingPrompts (Fan et al., 2018)
- NYTimes (Sandhaus, 2008) — for generalization
- RAKE algorithm (Rose et al., 2010)
- To determine key phrases based on word frequency and co-occurrence
- In this paper, key points with overlapping n-grams are filtered out
- Extracts longer outline points (3 to 8 words) with no particular order
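A minimal sketch of this extraction step using the rake_nltk package; the exact RAKE configuration and overlap filter used by the authors are not reproduced here, so the length bounds and the bigram-based diversity filter below are illustrative assumptions.

```python
# Illustrative outline extraction with RAKE (rake_nltk); settings and the
# n-gram-overlap filter are assumptions, not the authors' released code.
from rake_nltk import Rake

def extract_outline(story_text, max_points=5):
    # Keep phrases of 3 to 8 words, as described in the notes above
    rake = Rake(min_length=3, max_length=8)
    rake.extract_keywords_from_text(story_text)

    outline, seen_ngrams = [], set()
    for phrase in rake.get_ranked_phrases():
        tokens = phrase.split()
        bigrams = {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}
        # Simple diversity filter: drop phrases sharing a bigram with an earlier point
        if bigrams & seen_ngrams:
            continue
        seen_ngrams |= bigrams
        outline.append(phrase)
        if len(outline) == max_points:
            break
    return outline
```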
Model
- PlotMachines
- End-to-end trainable transformer on top of the GPT model + memory mechanisms + special discourse features
- Motivated by human writing styles where each paragraph is a distinct section of related sentences
\begin{align*} (P^i, h^i, M^i) = \text{PM}(o, d^i, h^{i-1}, M^{i-1}) \end{align*}
- At time step i: new paragraph P^i, outline representation o, discourse representation d^i, preceding-context vector h, memory M
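A sketch of this recurrence as a Python interface; the function and state names are hypothetical (the actual implementation is in the linked repository).

```python
# Hypothetical interface for one PlotMachines step; names and shapes are assumptions.
from dataclasses import dataclass
import torch

@dataclass
class PMState:
    h: torch.Tensor  # preceding-context vector h^{i-1}, shape (d,)
    M: torch.Tensor  # memory M^{i-1} = [K; D], shape (d, 2n)

def pm_step(model, outline_tokens, discourse_token, state):
    """Generate paragraph P^i and return it with the updated (h^i, M^i) state."""
    paragraph, h_new, M_new = model(outline_tokens, discourse_token, state.h, state.M)
    return paragraph, PMState(h=h_new, M=M_new)
```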
Outline representation
- The plot outline o as a sequence of tokens
- _kw_ to delimit each plot point
- _endkw_ to end the sequence
- The entire outline is truncated to a maximum of n tokens (input construction sketched after the discourse representation below)
Discourse representation
- Posited: there are stylistic differences between the beginning, middle, and the end of a story
- _i_, _b_, _c_ for the introduction, body, and conclusion paragraphs
- Appended to the outline representation
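A small sketch of assembling the model input from these pieces; the token strings follow the notes above, while the helper name, tokenizer interface, and truncation detail are assumptions.

```python
# Illustrative input construction; helper name and truncation point are assumptions.
def build_input(outline_points, discourse_token, tokenizer, max_outline_tokens):
    # "_kw_" delimits plot points, "_endkw_" closes the outline (see above)
    outline_str = " _kw_ ".join(outline_points) + " _endkw_"
    outline_ids = tokenizer.encode(outline_str)[:max_outline_tokens]
    # Discourse token (_i_, _b_, or _c_) is appended to the outline representation
    return outline_ids + tokenizer.encode(discourse_token)
```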
Preceding context representation
- To incorporate previous story context
- An embedded representation h^{i-1} is added to the model input
- Computed as the average embedding of GPT (not fine-tuned) output representations of words from the previous paragraph
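A minimal sketch of this averaging step with a frozen Hugging Face GPT model; the checkpoint name and mean pooling over the final hidden layer are assumptions (the paper states only that GPT output representations are averaged without fine-tuning).

```python
# Sketch: h^{i-1} as the mean of frozen GPT hidden states over the previous paragraph.
# Checkpoint choice ("openai-gpt") and last-layer mean pooling are assumptions.
import torch
from transformers import OpenAIGPTModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
gpt = OpenAIGPTModel.from_pretrained("openai-gpt").eval()

@torch.no_grad()
def preceding_context(prev_paragraph: str) -> torch.Tensor:
    ids = tokenizer(prev_paragraph, return_tensors="pt").input_ids
    hidden = gpt(ids).last_hidden_state          # (1, seq_len, d)
    return hidden.mean(dim=1).squeeze(0)         # (d,)
```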
Memory representation
- To keep track of the parts of the outline that have been mentioned
- To maintain semantic coherence throughout the entire story
- K, a set of vectors to keep track of outline points + D, a matrix to store a latent topic distribution of the generated story so far
Notation
\begin{align*} & M= [K;D] \text{ where} \\ & M: \mathbb R^{d\times2n}\\ & K: \mathbb R^{d\times n}\text{ representation of outline points} \\ & D: \mathbb R^{d\times n}\text{ representation of latent document state} \end{align*}
- The embedding size of the transformer model d, the maximum number of tokens in the outline n
- K initialized with embeddings of tokens in the outline
- D initialized randomly
- M^i_j: the j-th column of memory at the time step for paragraph i
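A sketch of this initialization under the notation above; the tensor layout and the use of the GPT token-embedding table for K are assumptions.

```python
# Sketch: initialize M = [K; D] with the (d x 2n) layout from the notation above.
# Using the GPT embedding table for K and random vectors for D are assumptions.
import torch

def init_memory(outline_token_ids, embedding, d, n):
    K = torch.zeros(d, n)
    for j, tok in enumerate(outline_token_ids[:n]):
        K[:, j] = embedding.weight[tok]     # K: embeddings of outline tokens
    D = torch.randn(d, n)                   # D: latent document state, random init
    return torch.cat([K, D], dim=1)         # M: (d, 2n)
```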
Updating memory
- Based on the update equations in entity-based models such as Henaff et al. (2017)
- Gating mechanism g
- To learn to flexibly control the update of each cell in memory
\begin{align*} & \hat M^i_j=\tanh (W_1M^{i-1}_j+W_2h^{i-1}) \\ & g^i_j=\sigma (W_3M^{i-1}_j+W_4h^{i-1}) \\ & M^i_j=(1-g^i_j) \odot M^{i-1}_j+ g^i_j \odot \hat M^i_j\\ & \text{where } W: \mathbb R^{d\times d} \text{ matrix} \end{align*}
- Attention within the transformer blocks is modified to contain two parallel attention modules (sketched below, together with the gated memory update)
- One performs the standard GPT self-attention
- The other uses the transformer input to attend over the memory vectors
- The outputs of the two modules are averaged
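A compact PyTorch sketch of the gated memory update and the averaged dual attention described above; the module structure, head count, and tensor shapes are illustrative assumptions, not the released implementation.

```python
# Sketch of the gated (Henaff-style) memory update and dual-attention averaging.
# All module and shape choices here are illustrative assumptions.
import torch
import torch.nn as nn

class GatedMemoryUpdate(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W1, self.W2 = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
        self.W3, self.W4 = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)

    def forward(self, M_prev, h_prev):
        # M_prev: (2n, d) memory cells as rows; h_prev: (d,) preceding-context vector
        M_hat = torch.tanh(self.W1(M_prev) + self.W2(h_prev))   # candidate cells
        g = torch.sigmoid(self.W3(M_prev) + self.W4(h_prev))    # per-cell gates
        return (1 - g) * M_prev + g * M_hat

class DualAttention(nn.Module):
    """Average standard self-attention with attention over the memory cells."""
    def __init__(self, d, n_heads=8):  # assumes d is divisible by n_heads
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mem_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, x, memory, causal_mask=None):
        # x: (batch, seq, d) transformer input; memory: (batch, 2n, d)
        self_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        mem_out, _ = self.mem_attn(x, memory, memory)
        return 0.5 * (self_out + mem_out)
```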
Training and decoding
- Training: predicts each paragraph (cross-entropy loss)
- Previous paragraphs’ gold representations are used to update the memory and compute h^{i-1}
- Decoding
- Starts from the first paragraph
- Uses its own predictions to compute h^{i-1} and update the memory
- 5-paragraph structure assumed
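A sketch of the decoding loop under the 5-paragraph assumption; pm_step is the hypothetical helper from the earlier sketch (it folds the h^{i-1} computation and memory update into the returned state), not a function from the released code.

```python
# Sketch of decoding: generate 5 paragraphs, feeding back the model's own
# predictions via the hypothetical pm_step helper sketched above.
def decode_story(model, outline_tokens, init_state):
    discourse = ["_i_", "_b_", "_b_", "_b_", "_c_"]   # intro, body x3, conclusion
    state, paragraphs = init_state, []
    for d_tok in discourse:
        paragraph, state = pm_step(model, outline_tokens, d_tok, state)
        paragraphs.append(paragraph)
    return paragraphs
```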
Experiments
Experimental setup
Baselines
- Fan et al. (2018): Fusion model
- Yao et al. (2019): Plan-and-Write
- Zellers et al. (2019): Grover (large-scale)
- These generate the entire document from the outline in one pass, unlike PlotMachines (which generates recurrently, paragraph by paragraph)
Ablated PlotMachines models
- Mem: memory blocks
- Disc: discourse tokens
- Base GPT and GPT2 (fine-tuned) — only with outline inputs
- PM-NoMem-NoDisc — + preceding context representations
- PM-NoMem
Automatic metrics
- ROUGE + self-BLEU — how realistic-looking as well as diverse
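Self-BLEU scores each generated story against the other generated stories as references (lower means more diverse). A minimal NLTK sketch, where whitespace tokenization and the smoothing choice are assumptions:

```python
# Sketch of self-BLEU: average BLEU of each generated text against all others.
# Whitespace tokenization and NLTK smoothing method are assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generated_texts, weights=(0.25, 0.25, 0.25, 0.25)):
    smooth = SmoothingFunction().method1
    tokenized = [t.split() for t in generated_texts]  # needs at least two texts
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]      # all other texts as references
        scores.append(sentence_bleu(refs, hyp, weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)
```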
Coverage
- PM is the best on all three datasets
Ablations
- GPT-2 < PM-NoMem-NoDisc < PM-NoMem << PM
Diversity
- PM is the highest on ROUGE and lowest on self-BLEU
- Self-BLEU scores similar to those of the gold stories → its diversity is similar to that of human writing
Human evaluations
- Outline utilization, narrative flow, ordering
- Small-scale study evaluating full-length stories
- PM as the best except for outline utilization (Grover the highest)
- Large-scale study evaluating single-paragraph excerpts
- With random paragraphs from each story
- With the paragraph with the most n-gram overlap with the outline (i.e. closest)
- PM was found to use outlines more naturally, especially for the “closest” paragraphs
- Repetitiveness, natural transition, relevant and on-topic
- PM as the best
- Proxy task: humans asked to decipher the order of the generated paragraphs
- It would be easier to decipher the order if the model output is very well-structured
- Humans are more accurate with Grover and Fusion
N-gram-based outline usage analysis
- An outline point counts as used if > 20% of its n-grams also appear in the paragraph (see the sketch after this list)
- Grover tends to over-repeat outline points (repetitive)
- Fusion leaves out portions of the outline
- PM is more inclusive and similar to the gold reference
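A small sketch of this usage criterion; the choice of bigrams and whitespace tokenization are assumptions, and the paper's exact n-gram settings may differ.

```python
# Sketch: decide whether an outline point is "used" in a paragraph, i.e.
# > 20% of its n-grams appear there. Bigrams and whitespace tokenization
# are illustrative assumptions.
def outline_point_used(point, paragraph, n=2, threshold=0.2):
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    point_ngrams = ngrams(point)
    if not point_ngrams:
        return False
    overlap = len(point_ngrams & ngrams(paragraph)) / len(point_ngrams)
    return overlap > threshold
```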
Qualitative examples
- Grover often finishes the story in the middle and starts a new story
- PM adheres more to the beginning-middle-ending structure
- Often starts by setting the scene
- Writes conclusions with a definitive closing action