Understanding Revision Behavior in Adaptive Writing Support Systems for Education

Revision behavior in adaptive writing support systems is an important and relatively new area of research that can improve the design and effectiveness of these tools and promote students’ self-regulated learning (SRL). Understanding how these tools are used is key to improving them to better support learners in their writing and learning processes. In this paper, we present a novel pipeline that provides insights into the revision behavior of students at scale. We leverage a data set of two groups using an adaptive writing support tool in an educational setting. With our novel pipeline, we show that the tool was effective in promoting revision among the learners. Depending on the writing feedback, we were able to analyze different strategies of learners when revising their texts. We found that users of the exemplary case improved over time and that females tend to be more efficient. Our research contributes a pipeline for measuring SRL behaviors at scale in writing tasks (i.e., engagement or revision behavior) and informs the design of future adaptive writing support systems for education, with the goal of enhancing their effectiveness in supporting student writing. The source code is available at https://github.com/lucamouchel/Understanding-Revision-Behavior.


INTRODUCTION
Intelligent writing support tools (e.g., Grammarly, Word-Tune or Quilbot) offer new ways for learners to receive feedback and thus revise their texts [19]. These and other writing support systems bear the potential to provide learners with needed adaptive feedback on their writing exercises when educators are not present (e.g., on grammatical mistakes [32], argumentation [26], empathy [29], or general persuasive writing [27]). They can help students in their self-regulated learning (SRL) process [9,36], to organize their thoughts and ideas, reflect on their learnings, or simply receive feedback on frequently occurring grammar or argumentation mistakes. From an educational perspective, it is important to understand how these tools are used by learners in educational settings and how they improve the effectiveness of educational scenarios [7,18].
Present research is largely focused on designing and building writing support systems [8,20]. However, there are few insights into the effects of using these tools and their impact on students' SRL processes [4,25]. We therefore contribute a novel pipeline that analyzes and visualizes revision behavior, to better understand how we can design, develop and improve systems that support students. Techniques from the field of data mining offer a way to understand revision behavior and explain SRL. One such technique is Keystroke Logging (KL). KL allows us to use educational data mining to analyze user behavior in writing tasks [16,23,35].
In this study, we model, inspect, and analyze quantitative data on learners' writing interactions through KL by developing a novel pipeline. We use a keystroke log from an experiment in which users were divided into two groups. The first group was given adaptive feedback and the second was not. A detailed description of our dataset and of the experiment demographics and procedure is available in Section 3.
To the best of our knowledge, no publicly available pipeline exists that focuses on processing the keystroke behavior of learners and helps analyze SRL characteristics such as engagement and revision, or visualize the learning path. We intend to first identify and visualize the differences between these two groups in their revision process, and then to compare different user profiles and measure their engagement over time. We use an exemplary data set to build this pipeline and apply data mining in order to gain insights into the underlying process of this writing activity.
been focusing theoretically on behavioral and cognitive processes of writing [14,17]. In fact, Flower and Hayes [11] laid the groundwork for research on the psychology of writing. They propose that the act of writing is propelled by goals, which are created by the writer and grow in number as the writing progresses. Today, writing support tools need to support this cognitive process, as it emphasizes writers' intentions rather than their actions [12]. It is important to understand what these tools help with, and how we may design new ones [12]. While prior works on text revision [8,15,20,28] have proposed machine collaborative writing interfaces, they focus on collecting human-machine interaction data to better train neural models, rather than on understanding the underlying processes of text revision.
Several studies have used KL as a technique to study revision [16,23,35] in different settings, and some of their aims were to understand and evaluate keystroke log features in a writing task context. However, until now, KL has been scarcely used in the classroom [25]. One issue with keystroke loggers is their invasive nature. KL raises several ethical issues, most notably privacy violation [25]; in this study, participants consented to the collection of their data, and their privacy was preserved. Previous research has suggested that writing time and number of keystrokes, which are indicative of general writing fluency and effort, are related to writing quality [1,33]. Another feature of interest is pause time: [34] found that, under a certain timed-writing test condition, shorter pauses are preferable because they indicate an adequate understanding of the task requirements, more familiarity with the writing topic, and better task planning [23].

Self-Regulated Learning
To analyze revision behavior, we rely on the lens of self-regulated learning (SRL). SRL refers to the pro-active process that learners engage in to optimize their learning outcome [36]. According to Zimmerman's model of SRL [36], there are three major phases: forethought, performance and self-reflection. The forethought phase includes task analysis, such as goal setting and strategic planning, and self-motivational beliefs. The performance phase includes self-control processes, such as task and attention-focusing strategies. The self-reflection phase includes processes involving self-judgment and self-reaction [31]. SRL is essential in the context of studying revision behavior in writing support systems, as it allows writers to take an active role in identifying and addressing their own writing weaknesses, rather than simply relying on the writing support system to automatically detect and correct errors. This can lead to a deeper understanding of the writing process.

METHOD
To investigate revision behavior in the writing process, we propose a pipeline for the automatic analysis of the SRL behavior of users during a writing task. Our work follows the Knowledge Discovery in Databases (KDD) process, using the methodology shown in Fig. 1.

Demographics, Procedure & Dataset Description
With approval of the ethical board of our university, we collected data from a writing experiment involving 73 users divided into two groups, as illustrated in Table 1. The two groups of users were tasked with writing three cooking recipes. Both groups were given a sample recipe as reference. The first group (G1) received adaptive feedback from the platform when they submitted their texts. The second group (G2) did not receive any feedback. Once they submitted their recipes to the system, users in G1 had the option to reset and start a new recipe or revise their texts based on the feedback. The same protocol was followed for G2, but they did not receive feedback. Here are several examples of the feedback the platform provided to users: 'List each ingredient separately.', 'Enumerate the steps.', 'How can your recipe be more specific?', 'Use stir, mix, or beat instead of "add" to be more specific.' or 'Indicate whether the meat, poultry, or seafood is boned, skinned, or otherwise prepared.'
With regard to the dataset, the entries of the log data we collected consisted of user ids, event dates, the keystroke logs as a JSON file and the final version of the text submitted at that particular date. An example of an entry is as follows: 2023-01-01, 12:00:00, user1, [{'time': 1, 'character': 'a'}, ...], "a) Cook ...".
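The entry layout described above can be sketched as a small record type. This is only an illustration under stated assumptions: the class name `LogEntry`, the helper `make_entry`, the field names, and the idea that the keystroke list arrives as a JSON string with `time` and `character` keys are all hypothetical and not taken from the released pipeline.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogEntry:
    """One row of the log: who submitted what, when, and how it was typed."""
    timestamp: datetime   # event date and time of the submission
    user_id: str          # anonymous user identifier
    keystrokes: list      # list of {"time": ..., "character": ...} dicts
    final_text: str       # text as submitted at that moment


def make_entry(date, time, user_id, keystrokes_json, final_text):
    """Build a LogEntry from raw string fields (hypothetical input format)."""
    return LogEntry(
        timestamp=datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"),
        user_id=user_id,
        keystrokes=json.loads(keystrokes_json),  # assumes valid JSON
        final_text=final_text,
    )
```

Keeping the keystroke list next to the submitted text is convenient because later steps need both: the text to detect where a new recipe starts, and the keystrokes to compute the revision features.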

Qualitative Perception
Following the experiment, users were tasked with answering follow-up questions, and we identified eight different topics regarding the reported revisions, including adding missing ingredients, improving the clarity, not making any changes and others. To do this, we used BERTopic [13], a topic modeling technique that clusters sentence embeddings generated by Sentence-BERT [22], to perform a qualitative analysis of participants' open responses about recipe revisions to the question 'What did you edit (add, remove or change) from the original text (the recipe you wrote)?'. We split the sentences into clusters based on their relevance, assigned names to each cluster, and computed the probability of each sentence belonging to a cluster. We grouped the sentences by participant to obtain the set of topics associated with their entire text answer. For example, if a participant's answer consisted of sentences with assigned topics A, B, and C, the set of topics associated with their answer would be Z = {A, B, C}.
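The grouping step at the end of this paragraph can be illustrated with a short sketch. It assumes the topic model has already assigned one topic label to each sentence; the function name `topics_per_participant` and the (participant, topic) pair format are hypothetical.

```python
from collections import defaultdict


def topics_per_participant(sentence_topics):
    """Collect, for each participant, the set of topics of their sentences.

    sentence_topics: iterable of (participant_id, topic_label) pairs,
    one pair per sentence of an open response (assumed input format).
    """
    grouped = defaultdict(set)
    for participant, topic in sentence_topics:
        grouped[participant].add(topic)  # a set removes repeated topics
    return dict(grouped)
```

For a participant whose sentences were labeled A, B and C, the result is the set {A, B, C}, matching the set Z in the example above.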

Data Processing
Given that the logs consist of the users' first attempts at writing one of the three recipes and their respective revision phases, it is important to separate them in order to focus only on the revision steps. We define a session for a user as all the data collected from them for one recipe. To separate sessions, we use cosine distance to detect where one session ends and the next one starts. One advantage of using cosine distance for text comparison is that it is relatively insensitive to the length of the strings. In contrast, other measures of distance such as Euclidean distance are sensitive to the length of the vectors and can be affected by the presence of common words that do not contribute significantly to the meaning of the strings. To map texts to 50-dimensional vectors, we use a GloVe model [21] pre-trained on Wikipedia. First, we map each word to its embedding and then compute the component-wise sum of the vectors. Formally, each submitted text $t$ has a set of words $W = \{w_i \mid 1 \le i \le N_t\}$, where $N_t$ is the number of words in $t$. We map each $w_i$ to its embedding $\mathbf{w}_i$, a 50-dimensional vector. Let $\mathbf{t}$ be the embedding of the text $t$; then $\mathbf{t} = \sum_{i=1}^{N_t} \mathbf{w}_i$. This captures each word of the text, and in this way we can collect the set of all text embeddings in the dataset, $T = \{\mathbf{t}_0, \mathbf{t}_1, \ldots\}$, where each element is a sum of word embeddings as above. We use $T$ to run the recursive algorithm described in Appendix C on the submitted recipes: starting at $k = 0$, we compute the cosine distance between $\mathbf{t}_k$ and the subsequent text embeddings $\mathbf{t}_{k+1}, \mathbf{t}_{k+2}, \ldots$ until we find an index $n > k$ whose text is distant enough from $\mathbf{t}_k$ to indicate a new recipe. When we do, we define $n$ as the index of a new recipe in our data. Then, we repeat the process by starting at $\mathbf{t}_n$ and comparing $\mathbf{t}_n$ with the text embeddings $\mathbf{t}_{n+1}, \mathbf{t}_{n+2}, \ldots$ to find the next index. This way, we collect the indices of new recipes in our dataset so that we can focus on the revisions between these indices.
Moreover, to apply process mining techniques, we built event logs from the writing task. For each group, we collect the activities for each user by looking at when they submit the first, second and third recipes and all the revision steps in between.
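The session-separation idea can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm from Appendix C: it is written iteratively rather than recursively, the word vectors are passed in as a plain dictionary instead of being loaded from GloVe, and the distance `threshold` is an assumed parameter whose value is not given in the text. The names `embed`, `cosine_distance` and `new_recipe_indices` are hypothetical.

```python
import math


def embed(text, vectors, dim):
    """Sum the embeddings of all known words in a text (component-wise)."""
    total = [0.0] * dim
    for word in text.lower().split():
        vec = vectors.get(word)  # unknown words are skipped (assumption)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def new_recipe_indices(texts, vectors, dim, threshold):
    """Indices where a new recipe starts, index 0 included.

    Each submission is compared with the first submission of the current
    recipe (the anchor); a distance above `threshold` starts a new recipe.
    """
    embeddings = [embed(t, vectors, dim) for t in texts]
    starts = [0] if texts else []
    anchor = 0
    for n in range(1, len(embeddings)):
        if cosine_distance(embeddings[anchor], embeddings[n]) > threshold:
            starts.append(n)
            anchor = n  # later submissions are compared with the new recipe
    return starts
```

Submissions between two consecutive start indices are then treated as revisions of the same recipe. In practice the threshold would need to be tuned on the data, since a heavy revision and a new recipe on a similar dish can lie close together.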

Feature Extraction
Different aspects of SRL have been researched extensively [18]. In a meta-analysis on online education, [6] found significant associations with academic achievement for five subscales of SRL: effort regulation (persistence in learning), time management (ability to plan study time), metacognition (awareness and control of thoughts), critical thinking (ability to carefully examine material), and help-seeking (obtaining assistance if needed). Based on these findings, we use the following dimensions to represent student behavior: effort regulation (Number of Revisions, Number of Edits, Time Spent Revising), time management (Time Spent Revising, Pause Time), metacognition (Efficiency, Pause Time), and critical thinking (DIRatio). A detailed description of these feature variables can be found in Appendix A, Table 3.

Building the Learning Path
Understanding revision behavior implies understanding the underlying process in the writing task (e.g., how long users in a group take to revise on average, or how many users revise). To understand this better, process mining, especially process discovery [5], can help us model and visualize the writing process for users in a group and design a learning path when using adaptive writing support systems [30]. In this study, we use Directly-Follows Graphs (DFGs) [3], which represent activities and their relationships. This is useful for the field of SRL as it provides a way to visualize and analyze the steps involved in a process, especially revision. A formal definition of DFGs can be found in Appendix B.
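To make the notion of a directly-follows graph concrete, the sketch below counts how often one activity is immediately followed by another across all users. It is a simplified illustration: the activity names in the usage note are hypothetical, the function `directly_follows` is not part of the released pipeline, and only edge frequencies are computed, whereas the graphs in this paper also report average durations.

```python
from collections import Counter


def directly_follows(traces):
    """Count directly-follows pairs over a collection of traces.

    traces: iterable of activity sequences, one per user, ordered in time.
    Returns a Counter mapping (activity_a, activity_b) to the number of
    times activity_b occurs immediately after activity_a.
    """
    dfg = Counter()
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            dfg[(current, following)] += 1
    return dfg
```

For example, a user with the trace `["write_1", "revise", "revise", "write_2"]` contributes one edge from `write_1` to `revise`, one self-loop on `revise`, and one edge from `revise` to `write_2`. Self-loops on the revision activity correspond to consecutive revision sessions.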

Revision Strategies
With this study, we find that users in different groups revise their texts differently. Recall that G1 is given adaptive feedback and G2 is not. When given insightful feedback on what they can change in their writing, users tend to have more revision steps with fewer edits at each step. Users not receiving feedback follow the opposite trend: they have fewer revision steps, with a larger number of edits at each step. This phenomenon is visible in Fig. 2.
In fact, for the first and second texts (Appendix D, Tables 4 and 5), we find p-values < 0.05 for Number of Revisions using t-tests, which indicates a significant difference between the groups in the number of revisions. This is also underlined by the mean number of revisions and edits. On average, users in G1 tend to revise their texts more often, with fewer edits at each step (Appendix D, Tables 4 to 6). From the directly-follows graphs (Fig. 3), we see that users spend approximately the same amount of time writing recipes and the same amount of time revising at the first revision step. However, we see that users in G2 revise much longer when having consecutive revision sessions (6 min on average) compared to G1 (56 s) (Fig. 3). This confirms that users in G1 have shorter revision sessions, whereas users in G2 have longer revision steps.
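As an illustration of the kind of group comparison reported here, the sketch below computes a two-sample t statistic with the standard library only. The text does not state which t-test variant was used, so Welch's unequal-variance form is an assumption, and the function name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's t-test.

    a, b: per-user values of one feature (e.g. number of revisions) in
    each group. Each sample needs at least two values, and the two
    samples must not both be constant.
    """
    var_a = variance(a) / len(a)  # squared standard error of mean(a)
    var_b = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df
```

The p-value follows from the t distribution with `df` degrees of freedom. A statistics library can do the whole computation in one call, for instance `scipy.stats.ttest_ind(a, b, equal_var=False)`.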
Figure 2: Bubble plot for the first recipe sorted by the number of revisions. The bubbles correspond to the total number of edits (insertions and deletions) for a user at each revision step
Figure 3: Overview of the SRL behavior of students revising their texts as directly-follows graphs for G1 (left) and G2 (right) automatically calculated and drawn by our pipeline

Engagement
From the second recipe onwards, we find that users revise less often, perform fewer edits, spend less time revising and type faster (Fig. 4). One interpretation is that users become less engaged in the task at hand. In fact, users in G2 spend 67% less time revising (Fig. 4), from 264 seconds on average for the first recipe to 86.5 seconds for the third one (Tables 4 and 6), and users in G1 spend 64% less (from 224 seconds to 81.2 seconds). In G2, users perform 74% fewer edits between the first and last recipe (Fig. 4): on average 222 edits when revising the first recipe and only 57 for the third one (Tables 4 and 6). Pause time also declines over time for both groups (from 0.822 to 0.553 seconds on average for G1 and from 0.646 to 0.525 seconds for G2), a decrease of 32.7% for G1 and 18.7% for G2 (Fig. 4), even though participants in G2 consistently maintain a smaller average pause time when revising. Another interpretation of these results and of Fig. 4 is that users are improving in this task: shorter pause times indicate better understanding of the task requirements and better task planning [34]. This is consistent with the participants' reported changes. As seen in Fig. 5, we found that participants from G2 increasingly reported making no changes to their recipes (36% for the third recipe). In contrast, participants in G1 continued reporting making changes based on the received adaptive feedback. Nevertheless, there was also an increase in the number of participants in G1 who did not edit the recipe; one participant noted: 'I didn't edit as much this time as I remembered to add them the first time around.'
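The percentages in this paragraph are relative decreases between the first and the third recipe. As a worked check using only the averages quoted above (the symbol Δ is introduced here for illustration):

```latex
\[
\Delta = \frac{x_{\text{first}} - x_{\text{third}}}{x_{\text{first}}}, \qquad
\Delta_{\text{time, G2}} = \frac{264 - 86.5}{264} \approx 0.67, \qquad
\Delta_{\text{time, G1}} = \frac{224 - 81.2}{224} \approx 0.64
\]
\[
\Delta_{\text{edits, G2}} = \frac{222 - 57}{222} \approx 0.74, \qquad
\Delta_{\text{pause, G1}} = \frac{0.822 - 0.553}{0.822} \approx 0.327, \qquad
\Delta_{\text{pause, G2}} = \frac{0.646 - 0.525}{0.646} \approx 0.187
\]
```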

Gender Comparison
Research has often found that males tend to have more difficulty composing text than females. The study in [35] found that female students performed better than male students on a number of levels. Females had higher scores, revised more, and were more efficient: they revised more per unit of time, exhibiting greater writing fluency. In this study, we found a clear distinction in writing capabilities between males and females. Like [35], we find females are more efficient in this writing task. They tend to have higher efficiency scores (Fig. 6), and we find p = 0.0038 when comparing efficiency scores in G2 (Table 2), which indicates a disparity in the efficiency distributions of males and females. Curiously, males revise less often when receiving feedback (as seen on the x-axis of Fig. 6, which shows the number of times revised). In contrast, when users do not receive feedback, females revise once at most (index 0 in Fig. 6 is not a revision phase). This is also consistent with females' writing abilities, suggesting they feel less need for revision if they do not receive feedback on what they can improve. Regarding the Delete-Insert ratio (DIRatio), although we find no statistically significant difference (Table 2), males generally have higher scores, especially in G2. Higher DIRatio scores mean users delete a larger portion of their texts (over 15% for several male users in G2, Fig. 6). Looking back at SRL, and in particular at the critical thinking aspect [6], defined as the ability to carefully examine material, males appear to be more self-critical and delete a larger portion of their texts than females when they do not receive feedback.

DISCUSSION & CONCLUSION
With this research, we contribute to the understanding of how learners use intelligent writing systems. We do this by gaining insights into their SRL through the inspection of revision behavior. From the log data we collected, we built a pipeline to analyze and visualize user behavior in the revision phases of the writing task, by observing different features extracted from the revisions of G1 (with adaptive feedback) and G2 (without adaptive feedback) (Figs. 2, 4 and 6). Our analysis revealed that learners in different groups revise using different strategies. Learners who were equipped with adaptive feedback revised more often, with fewer edits at each revision step, and users without adaptive feedback followed the opposite trend. This suggests that the support provided by the system may influence revision behavior and how the system is used. Additionally, we found that users seemed to be improving in the writing task, as demonstrated by the post-survey and the data, even though the evolution of the feature variables in Fig. 4 suggests they were less engaged. Finally, we concluded that females were more efficient than males in this experiment, as they had higher efficiency scores. While there has been research on the effectiveness of such systems in improving writing skills, there is a limited understanding of how users revise their writing when using these tools. To evaluate users' SRL, it is crucial to have a better understanding of how they self-regulate, especially in writing activities, in order to provide them with the correct tools to improve their writing skills and understand the underlying writing process [4,31].
Regarding future directions, one can focus on clustering revision data in order to gain further insights into the revision behavior in a writing task. We have already taken a first step by identifying eight topics of reported revisions, including adding more details, changing the structure, improving the clarity or not making any changes. In this work, we focused on the reports of not making any changes, but analyzing the other revision reports could help shed light on more differences between the groups. As such, clustering could be applied to each group to identify the differences between the two groups or between learners in the same group, and to see how users revise when receiving feedback or not.
In conclusion, our research on revision behavior in adaptive writing support systems has shed light on how users in different groups approach revision. The development of a pipeline to study this topic has allowed us to collect and analyze data on user writing and revision activity, leading to the discovery of important patterns and trends. Overall, our study has made a significant contribution to the field by providing a deeper understanding of revision behavior.
Table 3: Overview of feature variables automatically calculated through our pipeline to measure SRL behavior of users in their writing exercises based on keystroke logs

Number of Revisions
For each user, we count the number of times they revise each time they write a recipe (i.e., when they submitted and then re-edited their texts). This gives a sense of the effort put into the revision phase of the writing task.

Number of Edits
The total number of insertions and deletions during a revision step. Insertions are counted as any characters that are typed, including whitespace, and deletions are counted as the number of times the user presses the Backspace or Delete key.

Time Spent Revising in seconds
We compute the average time users spend revising for each group, for each recipe. This allows us to compare the two groups and to estimate the effort put in by both groups.

Delete-Insert Ratio (DI-Ratio)
The average ratio of deletions to insertions, which approximately captures the extent of editing and revision of any kind [35].

Efficiency
Estimated by the number of insertions per second, which indicates a general writing speed. This feature is arguably an indicator of writing fluency [35].

Pause Time during Revision in seconds
For each user, we collect the inter-key time intervals and compute their mean. This captures the average lag time between two adjacent keystroke actions [35]. This feature captures the effort and persistence level of users.
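The per-step features in Table 3 can be sketched from a single keystroke log as follows. This is an illustration under explicit assumptions: each event is a dictionary with `time` and `character` keys as in the example log entry, times are taken to be in seconds, deletions are assumed to be recorded with the key names `Backspace` and `Delete`, and the duration of a revision step is taken as the time between its first and last keystroke. The function name `revision_features` is hypothetical.

```python
from statistics import mean

# Assumed encoding of deletion keys in the keystroke log.
DELETE_KEYS = {"Backspace", "Delete"}


def revision_features(events):
    """Compute edit count, DI-Ratio, efficiency and pause time for one step.

    events: list of {"time": float, "character": str}, ordered by time.
    Features that are undefined for the given log are returned as None.
    """
    times = [e["time"] for e in events]
    deletions = sum(1 for e in events if e["character"] in DELETE_KEYS)
    insertions = len(events) - deletions
    duration = times[-1] - times[0] if len(times) > 1 else 0.0
    pauses = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "n_edits": insertions + deletions,
        "di_ratio": deletions / insertions if insertions else None,
        "efficiency": insertions / duration if duration > 0 else None,
        "pause_time": mean(pauses) if pauses else None,
    }
```

Number of Revisions and Time Spent Revising are computed at the session level, by counting and timing the revision steps between two new-recipe indices, so they are not part of this per-step function.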

Figure 1: Overview of our pipeline and methodology, following the KDD process

Figure 4: Visualizing user engagement and feature evolution on 4 feature variables over the entirety of the writing experiment

Figure 6: Overview of 2 SRL features from our pipeline comparing males and females

Table 1: Demographics of the participants per group from the exemplary data set

Table 4: Overview of SRL features from our pipeline for the first written text between students receiving adaptive feedback (G1) and no feedback (G2) based on our data set

Table 5: Overview of SRL features from our pipeline for the second written text between students receiving adaptive feedback (G1) and no feedback (G2) based on our data set

Table 6: Overview of SRL features from our pipeline for the third written text between students receiving adaptive feedback (G1) and no feedback (G2) based on our data set