
Blueprints for Evaluating AI in Journalism

By Zara


News organizations today rely on and experiment with AI tools to increase efficiency and productivity across various tasks, which has led to structural changes within the news sector.


As these tools evolve, practitioners find themselves lacking comprehensive strategies for evaluating AI technologies for journalism-specific uses and norms. Readers express skepticism around journalistic uses of AI due to the potential for biases and inaccuracies. Are new generative AI models really fit for purpose when it comes to news production?


Do they really lead to performance gains across a wide array of tasks encountered in news production?

We propose a framework to evaluate generative AI models for journalistic use-cases, based on prior research on the topic.


Such a framework can be useful for both designers and engineers when they build and test systems, and for practitioners as they select and incorporate systems into their practice.


This framework and suggested evaluation metrics can also provide much-needed transparency for readers.

Complex, interlocking, and often unintuitive, AI systems can be difficult to evaluate. Illustration: Relativity, by M. C. Escher, 1953.


How Journalism & AI Systems Are Typically Evaluated

Currently, there are several ways researchers evaluate an AI system. The most well-known strategies produce quantitative metrics that capture a general sense of “quality” about the AI system (e.g., HELM).


These strategies tend to use a single human-validated “gold standard” dataset for a specific task and rely on automated metrics for evaluating the model on the dataset.


These metrics are favored by AI researchers for their efficiency and scalability, but they often fail to transfer to real-world scenarios because they only capture a fixed and decomposed notion of “quality” represented by the test dataset.
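To make this concrete, here is a minimal sketch of the “gold standard plus automated metric” strategy in Python. The dataset and the generate_headline() function are hypothetical placeholders, and the simple token-level F1 stands in for the automated metrics (e.g., ROUGE) that such benchmarks typically rely on.

```python
# A minimal sketch of the "gold standard dataset + automated metric" strategy.
# The dataset and generate_headline() are hypothetical placeholders; token-level F1
# stands in for metrics like ROUGE used by real benchmarks.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated text and a gold reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold-standard dataset: article leads paired with human-written headlines.
gold_dataset = [
    {"article": "City council votes to expand bike lanes ...",
     "headline": "Council approves major bike lane expansion"},
    {"article": "New study links air quality to school absences ...",
     "headline": "Poor air quality tied to rise in school absences"},
]

def generate_headline(article: str) -> str:
    # Placeholder for a call to whatever model is under evaluation.
    return "Council approves bike lane expansion"

scores = [token_f1(generate_headline(ex["article"]), ex["headline"]) for ex in gold_dataset]
print(f"Mean headline F1: {sum(scores) / len(scores):.2f}")
```

The appeal of this approach is obvious: once the dataset exists, every new model release can be scored in minutes. Its limitation, as noted above, is that the score only reflects whatever notion of quality the dataset happens to encode.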


Another set of strategies for evaluating AI tools is rooted in the discipline of HCI (human-computer interaction), and focuses more on the specific interactions and situated context in which an AI tool is used.


To conduct these evaluations, researchers engage with a small set of users of AI tools and study how people perceive, use, and adapt new tools over a certain period of time.


These studies are helpful for understanding how an AI tool performs in particular situations, but they take considerable time and resources, making it difficult to conduct evaluations of frequent AI model releases iteratively and at scale.


To empower journalists and editors to evaluate and select tools for their practice efficiently and effectively, we must develop AI evaluation strategies that are both relevant to the journalism context and adaptable enough to support evaluation across different types of newsrooms and practices.


Laying Out the Framework

Here we propose a framework to help guide evaluations of the uses of AI tools in journalism. Our framework suggests that tools be evaluated along three axes: (1) the quality of AI model outputs, based on editorial interests and goals, (2) the quality of interaction with AI applications, based on the needs and work processes of users, and (3) ethical alignment, based on professional values and newsroom standards.
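As a rough illustration, the three axes could be encoded as a simple evaluation rubric that a newsroom fills in for each use-case. The criteria listed in the sketch below are examples, not prescriptions; each newsroom would substitute its own values, tasks, and standards.

```python
# One way a newsroom might encode the three axes of the framework as a rubric.
# The criteria listed here are illustrative, not prescriptive.
from dataclasses import dataclass, field

@dataclass
class EvaluationRubric:
    use_case: str
    output_quality: list[str] = field(default_factory=list)       # editorial interests and goals
    interaction_quality: list[str] = field(default_factory=list)  # user needs and work processes
    ethical_alignment: list[str] = field(default_factory=list)    # professional values and standards

headline_rubric = EvaluationRubric(
    use_case="headline generation",
    output_quality=["clarity", "accuracy", "novelty", "social impact"],
    interaction_quality=["ease of use", "editorial control", "feeling of ownership"],
    ethical_alignment=["truthfulness", "independence", "accountability"],
)
```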


We also propose that practitioners and researchers collaborate on the development of standards to evaluate these aspects of AI in the newsroom.


Output Quality

First, what do we mean by the quality of AI model outputs? This is an inherently complex question, because no single notion of quality exists. In evaluating text generation models, for example, researchers have used metrics such as clarity, fluency, coherence, and so on.


However, text produced for journalistic use-cases (e.g., generating headlines, or producing summaries) must be evaluated on domain-specific criteria as well.


For instance, potential headlines generated by an AI system might be evaluated on the specific news values that they exhibit, such as novelty, controversy, social impact, and so on. The news values of interest could vary by newsroom, and even topic area — for instance, science journalism and political reporting can have distinct news values.


Another set of domain-specific criteria draws from the goals of users themselves: writers would prefer tools that support their creativity. Thus, the range, variety, and biases of the creative ideas that a model’s outputs exhibit are further potential evaluation criteria. Whether these ideas align with the news values preferred by writers themselves could also be useful to evaluate.
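As a rough sketch of what this could look like in practice, assuming editors supply 1-5 ratings on newsroom-chosen news values and the model supplies a batch of candidate headlines, the ratings could be averaged per value and idea variety approximated with something as crude as a distinct-bigram ratio. All names and numbers below are illustrative.

```python
# Two illustrative domain-specific checks, with made-up ratings:
# (1) averaging editor ratings of a candidate headline on chosen news values,
# (2) a crude distinct-bigram ratio as a proxy for the variety of generated ideas.
from statistics import mean

# Hypothetical 1-5 ratings from three editors for one candidate headline.
ratings = {"novelty": [4, 3, 4], "controversy": [2, 2, 3], "social_impact": [5, 4, 4]}
news_value_scores = {value: mean(scores) for value, scores in ratings.items()}

def distinct_bigram_ratio(candidates: list[str]) -> float:
    """Share of unique bigrams across candidates (near 0 = repetitive, near 1 = varied)."""
    bigrams = []
    for text in candidates:
        tokens = text.lower().split()
        bigrams.extend(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0

candidate_headlines = [
    "Council approves major bike lane expansion",
    "Bike lanes to double under new council plan",
    "Council approves major bike lane expansion downtown",
]
print(news_value_scores)
print(f"Idea variety (distinct bigrams): {distinct_bigram_ratio(candidate_headlines):.2f}")
```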


As generative AI is scaled up to produce more and more content online, news organizations will need to confront the quality question to both evaluate their use of models, and potentially also to differentiate their content in the broader information ecosystem.


News stakeholders should come together to define quality across different journalistic use-cases and contexts, in ways that matter to audiences, and then develop systematic and repeatable ways to measure that quality.


Interaction Quality

Beyond sophisticated AI models, modern AI systems are complex, layered pieces of software. And so, while many of the existing metrics evaluate AI model outputs, we must also consider that a large part of what constitutes the experience of using AI is the design of the user interface itself.


From the chat interface of GPTs, to the Slack app for Claude, to the command line experience of using Llama, every kind of interface presents distinct interaction affordances for users. What kind of interaction affordances might journalists benefit from? What are the domain-specific criteria that we must evaluate these interaction affordances for?


In open-ended tasks (i.e., those with no single correct answer), where people collaborate with AI to solve problems and brainstorm ideas, researchers evaluate criteria such as ease of use, enjoyment of use, and users’ feelings of ownership over the outputs.


Given the range of open-ended tasks in journalism (e.g., story discovery, brainstorming), these can be important criteria for reporters and other creatives engaged in news production as well. Based on the specifics of the task, other criteria may also emerge, e.g., AI systems that provide writing feedback to reporters may be evaluated based on the new perspectives or news angles they add to a reporter’s pieces.


Over the longer term, systems that foster personal growth and flexible use may also be more desirable. A finer understanding of the short- and long-term goals of different stakeholders can support the design of such interaction metrics. And of course there are also more closed-ended tasks, like classifying a document or copyediting a text, for which the interaction model should support efficient supervision and quality control.
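Below is a minimal sketch of the bookkeeping such an evaluation might involve, assuming Likert-style (1-5) responses from a small group of reporters across open-ended and closed-ended tasks. The tasks, criteria, and numbers are made up, and a real study would pair this kind of tally with qualitative feedback.

```python
# Tracking interaction-quality criteria across task types, with made-up Likert (1-5)
# responses from a small group of reporters. Real studies would add qualitative data.
from statistics import mean, stdev

responses = {
    "story brainstorming (open-ended)": {
        "ease of use": [4, 5, 3, 4],
        "enjoyment": [4, 4, 5, 3],
        "ownership of outputs": [3, 2, 4, 3],
    },
    "document classification (closed-ended)": {
        "ease of use": [5, 4, 4, 5],
        "efficiency of supervision": [3, 4, 3, 3],
    },
}

for task, criteria in responses.items():
    print(task)
    for criterion, scores in criteria.items():
        spread = stdev(scores) if len(scores) > 1 else 0.0
        print(f"  {criterion}: mean={mean(scores):.1f}, spread={spread:.1f}")
```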


Designing metrics such as these is a non-trivial challenge, one that we reiterate would be well served by drawing on reporters’ expertise (e.g., of what makes a novel or appropriate angle) and researchers’ experience (of how to capture this in a measurable way), without impinging on users’ autonomy. And just as with the output quality dimensions above, understanding interaction quality should be context-specific.


Ethics

Finally, on ethical alignment: there is no shortage of arguments for its importance to useful AI systems, nor for how complex it can be to actually achieve. We suggest that definitions of ethics for AI evaluation should draw from subjective and multivalent principles of journalistic practice, such as truth, independence, and accountability.


Evaluation practices can also be guided by the codes of conduct and style guides of different newsrooms.


Once again, this is difficult for a number of reasons. AI models, especially generative models, can produce varying and inconsistent outputs for similar prompts. How do you measure ethical alignment to any chosen value? Fine-tuned or updated versions further complicate the picture. This kind of non-determinism makes the case for iterative evaluations of AI models and applications that incorporate best practices from AI auditing.
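One way to operationalize such iterative evaluation is sketched below, under heavy assumptions: sample the same prompt repeatedly and track how often outputs pass a newsroom-defined ethics check. Both the model call and the check here are hypothetical placeholders, not real tooling.

```python
# An audit-style sketch: repeatedly sample a non-deterministic model and estimate
# how often its outputs pass a newsroom-defined ethics check.
# model_generate() and passes_ethics_check() are hypothetical placeholders.
import random

def model_generate(prompt: str) -> str:
    # Placeholder for a non-deterministic generative model call.
    return random.choice([
        "Sources confirm the policy change takes effect in June.",
        "The policy change, which insiders call a disaster, takes effect in June.",
    ])

def passes_ethics_check(output: str) -> bool:
    # Placeholder check: e.g., flag unattributed loaded language per a style guide.
    return "disaster" not in output

def alignment_rate(prompt: str, n_samples: int = 50) -> float:
    results = [passes_ethics_check(model_generate(prompt)) for _ in range(n_samples)]
    return sum(results) / n_samples

print(f"Estimated alignment rate: {alignment_rate('Summarize the policy memo'):.0%}")
```

Rerunning such a check on every model update, and tracking how the rate drifts, is closer in spirit to an audit than to a one-off benchmark score.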

Closing Notes

We started this blog post by talking about the rapid changes occurring in news production due to AI, and the reservations that exist around these technologies. We believe that developing sound evaluation frameworks can help temper hype and support well-informed reasoning about these tools, to ensure that their use really does help to fulfill the goals of journalism’s stakeholders. Who these stakeholders are and what their goals are will vary, but we hope that the framework we have proposed here can help guide such evaluation. Actualising such a framework will also necessitate that researchers and practitioners design evaluation metrics together, because AI tools need to support human communication, while being grounded in and responsive to the needs of the people they support. Easy!

In a sense then, this is also a call for practitioners and researchers in the field to come together and devise evaluation strategies in this framework, or even push the limits of such a framework itself. We are also open to collaborating and building on these ideas further. Please reach out to us to share your feedback, ideas, or disgruntlement. We’d love to hear what you think about this framework, or about what it spurs you on to do.
