Professor Justin Zobel, Pro-Vice Chancellor (Graduate & International Research), University of Melbourne.
The emergence of tools such as ChatGPT and Stable Diffusion has led to widespread debate in the global academic community: on how they can be used constructively, on when they are problematic, and even on the implications for the meaning of ‘plagiarism’.
Much of the discussion around conduct, or rather misconduct, has focused on the challenges arising from students using these tools to complete their coursework. However, there are also profound challenges for research integrity.
We need to consider research integrity for higher-degree-by-research candidates (HDRs) in relation to these tools. To begin, I think it is helpful to consider the tools themselves – what they are being used for and how they work.
Digital assistance tools
The broad category of digital assistance tools (DATs) has been developing for decades. We are routinely and uncontroversially using a DAT when it corrects our grammar in a document we’re writing or recommends alternative queries for a search engine. For many years, students (and academics) have been using similar tools, based on similar technology, as they write essays and papers and summarise collections of papers. Other established applications include translation between languages and correction, clarification, rewording, and reformulation of written expression.
In some respects, then, the most recent tools are not a fresh challenge – but they are far richer than their predecessors. As a computer scientist who has worked with language technologies throughout my career, I am astonished by the rapid development in the last few months of tools based on large language models. They truly are remarkable, providing a mechanism for interacting with a vast collection of human discourse that – at this early stage! – seems to be as dramatic an innovation as the emergence of the internet and the smartphone.
Those innovations led to profound social and behavioural change; these rich DATs seem poised to do the same. They can suggest how to organise an argument or structure a chapter. They can rewrite provided text into a new wording. They can code, develop websites, and generate images. And, critically, they can generate text that is a good facsimile of academic writing.
Above all, the text and images they generate are new, and their controlled use of randomness to select amongst choices means that repetition of a prompt never leads to identical output, confounding current approaches to management of research integrity.
Generated by DALL-E on 2023-03-18 05.23.12 in response to the prompt ‘robot writing at a desk in the style of Edward Ardizzone’
Key to the effectiveness of DATs is the use of large repositories of text. These repositories are many thousands of times larger than the biggest physical libraries; they’re a city of libraries. They contain text from every kind of source – books, articles, social media, transcripts, and so on – across languages and cultures.
Analysis of text collections at this scale allows inference of a huge range of characteristics. Which words tend to occur together? Or occur in place of each other? Or occur in the same textual contexts? Where should commas and full stops be placed? What clause and sentence structures are common, and how do these, and word usages, vary with topic or audience? How do clause structures aggregate into larger units, such as arguments or poems?
With millions of examples of every kind of usage, such statistics can be very precise. Such a system does not need to be told that the word ‘grade’ can be associated with words related to scores, steepness, or smoothing; it can infer the connection by observation. Words that pertain to forests, for example, will tend to be automatically linked together by this process. Linkage means that when one forest-related word is used, the likelihood of another being used later on is increased.
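This kind of inference by observation can be illustrated with a minimal sketch. The toy corpus and the sentence-level co-occurrence window below are purely illustrative – real language models are trained on vastly larger collections with far more sophisticated statistics – but the principle is the same: association is counted, not taught.

```python
from collections import Counter
from itertools import combinations

# A toy corpus; real repositories contain billions of documents.
corpus = [
    "the forest canopy shaded the trail",
    "tall trees form the forest canopy",
    "the steep grade slowed the climb",
    "her exam grade was a high score",
]

# Count how often each pair of words occurs in the same sentence.
pair_counts = Counter()
for sentence in corpus:
    words = set(sentence.split())
    for a, b in combinations(sorted(words), 2):
        pair_counts[(a, b)] += 1

# No one told the system that 'forest' and 'canopy' are related;
# the link emerges from co-occurrence alone.
print(pair_counts[("canopy", "forest")])
```

Nothing here involves meaning: ‘forest’ and ‘canopy’ become linked simply because they appear together, which is exactly the sense in which linkage increases the likelihood of one word following another.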
The biggest language models can have hundreds of billions of parameters, and in effect these models consist entirely of probabilistic associations between words, phrases, sentence types, and so on. They can be thought of as a kind of supercharged averaging of all of human discourse, without reference to anything beyond the text that we might regard as being in the realm of the ‘real world’.
DATs like ChatGPT based on large language models are designed to generate text (or other material) in response to prompts – such as questions, requests, or examples to be modified. So the core of it is this: when ChatGPT is given a prompt it generates an initial word – a choice from amongst millions of combinations of word, sentence, dialect, and so on, biased by linkage to the words in the prompt, and by the likely rhetorical structures suggested by the prompt. And then it generates the next word, and the next, until it decides to finish.
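The word-by-word generation loop can be sketched with a deliberately tiny bigram model. This is a drastic simplification – ChatGPT conditions on far more than the previous word, and its corpus and parameters are invented here for illustration – but it shows the essential mechanism: each word is sampled from statistics of what followed it before, with controlled randomness, until generation stops.

```python
import random
from collections import defaultdict

# A toy corpus standing in for the vast text repositories described above.
corpus = "the forest is dense and the forest is dark and the trail is long".split()

# Record, for each word, the words observed to follow it.
followers = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current].append(nxt)

def generate(start, length, seed=None):
    """Generate text one word at a time, each choice biased by the previous word."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length):
        options = followers.get(words[-1])
        if not options:       # no observed continuation: the model 'decides to finish'
            break
        words.append(rng.choice(options))  # controlled randomness in word choice
    return " ".join(words)

print(generate("the", 6, seed=1))
print(generate("the", 6, seed=2))  # a different seed can yield a different continuation
```

Because the choice at each step is random, repeating the same prompt need not produce the same output – the property noted above that confounds detection tools.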
And that’s it.
The vastness of the models, and the strength of the linkages between parameters, structures, and words, create the tendency to generate coherent responses. The size of the system and the volume of material that has been ingested are the sole determinants of apparent reasonableness and knowledgeability. Humans consider what knowledge they want to impart, and then form, revise, and review statements to describe that knowledge. DATs, in contrast, are concerned only with generating sequences of words that are suggested by the words they have already seen or generated.
That’s why DATs confabulate, or appear to invent: there is no external validation or underlying database of truth. Even if all of the input was ‘correct’ the output would still be fallible, because there is no awareness of meaning and no sense in which semantic coherence is part of the generation process.
My view is that DATs are not even a step towards a sentient AI; they are at best a tool that such an AI might use for communication. Yet the output they produce seems so very human. There is a disturbing dissonance between, on the one hand, their lack of reasoning and of grounding in factual information and, on the other, the appearance they create of confident, slick knowledgeability. The superficial plausibility of the text they generate – and the impression that one is interacting with a cognitive entity – is psychologically persuasive but deeply misleading.
Generated by DALL-E on 2023-03-18 05.47.08 in response to the prompt ‘a painting of a lecturer teaching robots in a classroom’
Academic publishers have responded to the advent of DATs with, for example, policy statements by Nature and Elsevier that restrict the use of AI-generated text in publications, prohibit the inclusion of AIs as authors, and set out guidelines for acknowledging where the text has come from.
My institution, the University of Melbourne, has likewise made a statement on use of these tools in research writing. It explains how use of DATs can be a breach of our policies on research integrity. Succinctly, we require that material that has been generated or substantially altered by a DAT must be acknowledged as such, and that AIs cannot be listed as authors.
However, in one respect it goes further than the publishers’ statements, by noting an aspect that is specific to HDRs: the use of DATs for editing. Our HDR policy states that assistance can be sought in accordance with the Australian Standards for Editing Practice – but this assistance is limited to elements such as written expression (clarity, grammar, spelling, and so on), completeness, and consistency. Assistance with elements such as content and structure can only be provided by thesis supervisors. The kinds of assistance provided by some DATs are far more extensive than this standard allows, and their use in this way is potentially a breach of the Australian Code for the Responsible Conduct of Research, 2018.
With DATs, existing approaches to detection of unoriginality do not apply; currently, tools can suggest whether DATs have been used in thesis writing and other HDR material, but do not provide the rigorous evidence that would be required to prove that a tool was used. The difficulty of automatically identifying breaches does not mean that we should weaken our stance on what constitutes ethical practice, but there is no question that these technologies are a profound challenge to our ability to ensure that work has been undertaken appropriately.
As a general principle, our policies and practices should, as far as possible, be enduring and not just a reaction to events of the moment. However, at Melbourne a review of our policies found that they were already appropriate, and, for now, should remain appropriate with regard to DATs even as the technology changes.
Generated by DALL-E on 2023-03-18 07.07.18 in response to the prompt ‘a photograph of a computer disguised as a person’
Does it matter?
The advent of these new technologies has led to questioning of assumptions about how we teach, the purposes of teaching, and whether and when uses of DATs are genuinely of concern. Such questioning of assumptions is understandable and we need to have considered responses if our policies are to be defensible.
However, my view is that some use of DATs by HDRs is indeed problematic, for a range of reasons. The obvious one is the same as for coursework students: examination of a thesis is intended in part to assess the candidate’s ability to understand and communicate, and that’s why we expect them to provide their own text. If they provide text from some other source, that assessment is undermined.
It is plausible that text that is generated today might be detected in the future. An HDR who incorporates DAT-generated text into a thesis in 2023 might well find themselves exposed in 2033 as having committed misconduct – our practices for imposing consequences for ethical breaches have no statute of limitations.
That said, the unreliability of DAT-generated text means that today it would be risky to include more than small fragments in a thesis. My experience is that for rich topics it is extremely difficult to prompt generation of text that is correct but not trite. There are also other concerns.
- HDRs can be misled by confabulated, incomplete, or absurd summaries of topics of interest.
- They can conceal an inability to communicate clearly, or a lack of knowledge of basics, not just in the thesis but also in emails, proposals, and progress reports.
- Some HDRs already use DATs for machine translation to understand material in other languages (including English, if that isn’t their first language); tidying the translated output with another DAT creates further opportunities for garbling of content.
- Machine translation is sometimes used by HDRs to write in another language and then translate to English, thus disguising lack of capability in English expression.
- One legal issue is that of ownership; unacknowledged use of DAT-generated text sits uncomfortably with current copyright law.
- Another legal issue is that of disclosure of IP; if the prompts entered into a DAT concern an innovation, the retention of the prompt by the DAT will mean that the IP is lost to the author.
There are valid uses of DATs in writing, such as assistance with grammar, advice on organisation of text, and help with overcoming writer’s block. What is critical is that these kinds of legitimate use don’t blur into less appropriate activities.
Some people have speculated that these AIs herald a future in which training of humans in writing is no longer required. To me, such speculation rests on a narrow view of the roles that writing skills play. They are far more than the capacity to describe ideas in text. Writing is intimately linked with cognition and the ability to organise concepts into a coherent form, and the act of writing enhances memory during performance of complex tasks. This includes not just authoring of individual sentences and paragraphs but structuring of arguments, development of complex descriptions, and so on.
The slogan ‘writing is thinking’ has a great deal of truth to it and is of particular relevance to HDRs because so much of research is a fumbling towards ideas and thoughts that have not previously been articulated. The process of grappling with how to precisely express concepts in written form is critical to development of them into research contributions.
The visibility of this struggle is critical to good supervision. Assessment and critique of writing is a key tool through which a supervisor can mentor an HDR’s intellectual development; concealment by the HDR of an inability to undertake such writing can mean that they do not progress towards success as an independent researcher. This is a factor in our expectation that HDRs be honest and open with their supervisors. If an HDR does use a DAT, they should tell their supervisor that they have done so.
A future in which strength as a writer is not of value for researchers is not yet with us, and indeed in my view remains remote. Until it arrives, we will continue to expect our HDRs to speak in their own voices, and to be concerned when they cannot do so without assistance, digital or otherwise.
A revised and condensed version of this article has appeared in the University of Melbourne’s Pursuit newsletter. With thanks to Karin Verspoor for her comments and clarifications on the explanation of large language models and to the attendees at the ACGR national meeting in April 2023 who provided feedback on the presentation on which this article is based.