Responsible Recommendation in the Age of Generative AI
This is an edited transcript of my ROEGEN keynote talk for easier reading.
Thank you very much for having me. Thank you for being here on the last day of the conference, after all of the partying and enjoying each other’s company last night.
I’ve been thinking for the last several years about what it means to do recommendation in a socially responsible way, and what it means to do recommendation in a way that really advances the goals of the field and of the technology, particularly as they were imparted to me when I joined the community back in 2009 and in the subsequent years.
Today I want to talk about how those goals translate and affect how we think about what we’re doing with recommendation, now that generative AI and generative models are in the picture.
So yes, I’ve been doing a lot of fairness work, and this isn’t complete — I took these screenshots before a couple more papers were added, I think.
But I’ve been doing a lot of things around this question of “what does it mean for a recommender system to be fair?” as a subset of “what does it mean for a recommender system to be good for society?”.
My early interest in that came about on the tail end of the Coursera course that Yashar mentioned, and other things like helping teach the course at Minnesota and starting to develop the course at my first institution as a faculty member.
And I had the question, the nagging question: this technology that I’ve spent all of this time teaching people to build, teaching people to evaluate, building tools to help them build —
Is it actually helping?
Is it actually good for society?
Is it achieving the promise that I believed, and that I still believe, it holds as a tool and an engine for helping provide economic opportunity, for helping provide cultural awareness?
There are a lot of good and pro-social things that it can achieve. Are we actually achieving them?
So I set about to try to figure out how to formalize that into a problem that we can try to measure. I’ve done a bunch of work on different definitions and different measurements, and I’m not going to be talking about the details of that today.
But fairness is just one piece of the broader agenda of recommendation that is socially responsible, that advances pro-social goals, however we define those. We kind of inherit a set of them from Western democracy that I personally think are pretty good. But there are also things like what Tobias and Lucas shared earlier this week about the environmental impact of the work that we are doing in recommendation. That’s an element of social responsibility, as are questions of accountability and respect for people and their data and their autonomy.
There are questions of safety as it relates to harmful content or extremist content or radicalization pathways: the ways that the things being recommended, and in some cases perhaps the recommender system itself, are encouraging people to go down a path that is not good for them and not good for society. So there are a lot of different questions that we can think about here.
And this is the non-exhaustive list of topics as we frame it for the FAccTRec workshop that we had at the beginning of the week, around this broad set of questions of responsibility. There are a lot of different questions and problems that we talk about when we’re thinking about responsibility.
But on the fairness side, I want to walk you through a few changes. I’ll talk maybe 60 or 70% about fairness, as that’s the area I’ve focused on the most, but also as a synecdoche for this broader set of responsibility concerns.
So there’s been a lot of work over the last 10 to 15 years on this question of algorithmic fairness, AI fairness. And it’s produced a lot of different definitions. We can talk about individual fairness, group fairness, demographic parity: a lot of different constructs that provide us with very interesting ways to analyze the outputs of classifiers and regression systems.
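To make one of those constructs concrete: here is a toy sketch of demographic parity, measured as the gap in positive-decision rates between two groups. The data and numbers are hypothetical; it just shows the kind of one-shot, non-personalized measurement this literature typically works with.

```python
# Toy illustration of demographic parity for a one-shot binary classifier.
# The decisions below are hypothetical, just to show the shape of the measure.
decisions = [  # (group, positive decision?) for an imagined loan-approval model
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 1), ("B", 0), ("B", 0), ("B", 0),
]

def positive_rate(group: str) -> float:
    outcomes = [y for g, y in decisions if g == group]
    return sum(outcomes) / len(outcomes)

gap = positive_rate("A") - positive_rate("B")
print(f"P(positive | A) = {positive_rate('A'):.2f}, "
      f"P(positive | B) = {positive_rate('B'):.2f}, parity gap = {gap:.2f}")
```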
There are a number of assumptions that a lot of this work makes. We’ve got a problem that’s typically a classification or maybe a scoring problem that’s going to feed into a classification, occasionally a ranking. But at least for the first 10 years, most fairness work ignored rankings.
It’s often in this one-shot static setting. You have a data set, train test split. We want to test fairness on the test set. There’s been a growing body of work that’s dealing with that and looking at feedback loops over time as the system is retrained, what happens when you have a reinforcement learning system, et cetera?
But if you look at the set of findings that make up kind of the core of algorithmic fairness, you find these kinds of assumptions.
There’s another interesting assumption in a lot of it, which is that the system is not personalized. If we have a system that’s doing loan risk scoring, you don’t want to personalize that based on which banker is sitting there reviewing someone’s application. Given the same applicant, the system should be scoring them the same way, regardless of which banker in your organization is reviewing their application.
One of the things that we found as we started to figure out this problem of fairness in recommendation and fairness in search is that every one of those assumptions is violated almost inherently in the problem in a way that we can’t just take these findings from algorithmic fairness and apply them directly to recommendation or search in the general case.
We can apply them to subclasses of the problem. For example, if you have a toxicity classifier whose output is then going to be a ranking signal (that was one of the things that showed up when Twitter open-sourced their algorithm: they have that kind of classifier in there), you could look at fairness of the toxicity classifier from the traditional perspective. But that doesn’t get you fairness on the final rankings. There are a few reasons for this.
One is that the decisions are not independent. If you refresh your Facebook timeline, your home screen, and I stick your cousin’s picture in the first slot, then by the fact that I did that, I cannot stick anybody else’s picture in the first slot. So the decision to put them in the first slot is also making the decision, for every other candidate item in the set, not to put them in the first slot. A decision on one piece of content affects the others.
They’re also repeated. You refresh that page multiple times a week. And so while we didn’t give something the first slot today, we might be able to give it the first slot tomorrow. This gives us a mechanism to deal with the first point, but we have to deal with that explicitly.
And also, there are multiple stakeholders. In a lot of classical fairness literature, there are a number of stakeholders, but there’s really one set of stakeholders that we care about fairness towards. If we’re looking at hiring AI, we care about fairness to the job applicants. We usually aren’t thinking about fairness to the company or fairness within the company in some way. We want to treat the job applicants fairly. The problem comes to us with one ready-made set of people that we want to try to treat fairly.
But in recommendation, in search, in discovery and information access, we don’t have that.
We have the users or consumers of the system.
We have the providers: the artists, the authors, the vendors, et cetera, that are providing the products or the items or the songs or whatever.
In some domains, we have the subjects. For example, in news recommendation, you have the readers of the news; you have the authors of the news articles and the outlets that publish them; but you also have the people and the communities and the locations that the news articles are about. So even if you have, say, one journalistic organization, maybe with a few journalists (maybe just one), you might want to ask: are different communities in our area getting equitable representation of their needs and concerns and the things that affect them in the news, or is all of the news reporting on what happens downtown? You can ask that.
You could go a step further and look at, say, the positives and negatives: is all of the news reporting about the good things that are happening in this community and the crime that is happening in that community? That might be an unfair representation. Maybe we want to report about the crime that’s happening in the other community too. The key point is that there’s this third group.
There are a lot of different people who have an interest in being treated fairly. There are also a lot of different people who have an interest in the responsibility of the system in other ways.
To whom is the news recommender accountable?
Is it accountable primarily to the shareholders of the aggregator platform?
Is it accountable primarily to the users and the readers?
To the communities that it’s reporting on?
Who is it accountable to?
Who is it transparent to?
All of these things that we think about in responsibility have multiple different people with different and sometimes conflicting and competing interests in the problem.
We published one version of this in our Foundations and Trends paper; we talked about it from a different perspective in our ECIR paper earlier this year.
When we’re thinking about a responsibility problem, my colleagues and I, and many others in the community, have started framing it around identifying a harm that we want to avoid, to make it concrete.
There are a lot of different dimensions along which we want to think about this.
One is who is being harmed, which is what I just talked about: the users of the system; the providers, whoever those are (authors, publishers, et cetera; there can be multiple levels); the people who are being reported on; and so on. Then: on what basis are they being harmed?
Are they being harmed on an individual basis, what we’d often call individual fairness? In recommendation that might say: if two artists have songs of comparable quality and comparable interest to this group of listeners, make sure they show up in the recommendation lists about the same amount of time. Especially with ranking systems and their feedback loops, we can wind up amplifying very small differences. Maybe this artist is 2% less relevant, so we rank them in the second slot, but people pay a lot less attention to that second slot. So that 2% difference in relevance becomes a 50% difference in exposure, a 50% difference in clicks, and then the system learns from that. If we aren’t careful in our debiasing, it might learn that there’s actually a 50, 60, 70% difference in relevance.
So if we’re not careful, the system’s design can amplify very small differences into very large differences.
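Here’s a minimal sketch of that amplification, using a standard logarithmic position discount as a stand-in for attention. The two artists and the 2% relevance gap are made-up numbers, and real systems use richer exposure and click models.

```python
# Minimal sketch: a tiny relevance gap becomes a large exposure gap under
# position bias. Numbers are illustrative, not from any real system.
import math

def position_weight(rank: int) -> float:
    """Logarithmic position-bias model (nDCG-style attention discount)."""
    return 1.0 / math.log2(rank + 1)

relevance = {"artist_a": 1.00, "artist_b": 0.98}  # artist_b is 2% less relevant

# A deterministic ranker always puts the more relevant artist first.
ranking = sorted(relevance, key=relevance.get, reverse=True)
exposure = {item: position_weight(rank) for rank, item in enumerate(ranking, start=1)}

total = sum(exposure.values())
for item in ranking:
    print(f"{item}: relevance={relevance[item]:.2f}, "
          f"exposure share={exposure[item] / total:.0%}")
# Output: roughly a 61% vs. 39% split in exposure from a 2% relevance gap.
# If clicks follow exposure and the model retrains on clicks, the gap compounds.
```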
Or is the harm on a group-wise basis? Like: our system is systematically less likely to recommend Latin American music, even to listeners who really want to listen to it.
I’ve been loving listening to a Mexican rock band over the last year, but my recommenders haven’t really been recommending very much other Latin American rock, even if I might very much enjoy it.
Then, how are they harmed? Are they being denied some kind of resource, like the opportunity to have their music listened to? Are they being misrepresented? Are they being associated with harmful stereotypes, either in how artists or authors are being presented, or in the way that your recommendations reflect on you: like, oh, given your gender and your age, we’re going to recommend things that are stereotypical for a 35 to 40 year old white man. Is that what’s happening in the recommendations?
The key thing, though, is to be specific, because there are a lot of conflicting and competing harms. And the way that we’ve found most effective to navigate those is to not try to solve all the problems at once. For a variety of reasons (and I’ve got a citation for this), it’s literally impossible for me to go to the blackboard and write down a formula and say: this is fairness, optimize this, and your system will be fair.
The same applies to all of our other responsibility concerns, transparency and the rest, because the problems behind all of these harms are deeply contextual, domain specific, stakeholder specific.
What I find very helpful, what many others I’ve talked to both in academia and in industry find very helpful, is to be specific: identify a specific problem. Locate it on this map. Work on it. Then work on the next one.
And as you’ve worked on a few, you can start to see inductively, from the ground up, how they relate to each other. You have actual data to determine, first, whether there is even a trade-off; there’s not always a trade-off. And if there is, you can start to reason about it.
Work on specific problems; that also helps you get unstuck. Often I’ll be talking with students and they’ll be asking, “Where do I get started? There are so many things.” Pick one. Pick one that matters to your organization. Pick one that you care about. Pick one that you’ve experienced. Work on it; locate it on the map. That also helps us understand how it relates to other problems.
And then move on to the next one.
So this is a lot about fairness and responsibility in general.
How does this relate to what’s happening as generative AI and generative models get introduced into recommendation?
The injection of generative models into the AI landscape, and into recommendation, brings a number of changes and effects.
In classic recommendation, we might be generating a ranked list or a slate. Maybe we’re doing one at a time, placing an ad.
But it’s focused on the individual items. In the display, even our explanations are usually individual bits of information, like a tag or little templated things. We might do some light NLP to extract a snippet, those kinds of things.
And with generative AI, we’re doing a lot of things. We might be generating those explanations: rather than using a template, use Llama, Mistral, whatever, to generate an explanation for the item.
We might generate the whole recommendation list itself. I see a number of works attempting this.
In some cases, we’re looking at generating the content itself. So rather than recommending existing content from a corpus, we’re actually generating content for the user’s need.
We might use just part of the model. We might just use its embedder and then feed its embeddings into BPR or LambdaMART or whatever our favorite ranker is. I like that design; current language models have powerful content-understanding capabilities, and we can use their embeddings in traditional techniques, I think, to great effect. (There’s a small sketch of that design below.)
We might generate descriptions.
We’ve seen some work this week about generating queries to go to a search engine to enrich the result set. There are all kinds of things we can generate.
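Coming back to that embedder-plus-traditional-ranker idea from a moment ago, here is a minimal, hypothetical sketch of the pattern. The encoder choice, the item descriptions, and the simple profile scorer are all illustrative assumptions; in practice the embeddings would more likely become features for BPR, LambdaMART, or another learned ranker.

```python
# Sketch: use a language model only as an embedder, keep the ranking traditional.
# Encoder choice and item texts are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

encoder = SentenceTransformer("all-MiniLM-L6-v2")

items = {
    "item1": "Gritty Latin American rock with heavy guitars and live energy",
    "item2": "Soft acoustic folk ballads about small-town life",
    "item3": "Spanish-language rock anthems recorded in concert",
}

# Embed item descriptions once; these vectors become content features.
item_vecs = dict(zip(items, encoder.encode(list(items.values()))))

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_candidates(liked: list[str]) -> dict[str, float]:
    """Score unseen items against the mean embedding of the user's liked items.
    (A stand-in for feeding the same vectors into BPR, LambdaMART, etc.)"""
    profile = np.mean([item_vecs[i] for i in liked], axis=0)
    return {i: cosine(profile, v) for i, v in item_vecs.items() if i not in liked}

print(score_candidates(["item1"]))  # item3 should outscore item2 for this listener
```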
[skipped]
This brings some new harms. If we’re generating the list of items, what’s to say that all the items in that list actually exist? It could fabricate new items that don’t exist. It can fabricate content.
It also has a lot of ability to reproduce stereotypes; poke most LLMs and stereotypes fall out all over the place.
But also, in some cases, it is replacing content creators, replacing artists, replacing authors, or at least trying to.
All of the things that I talked about, in terms of the different stakeholders and the need to identify and be specific about locating your particular harms (the kinds of harms of resource allocation, representation, stereotyping), are still in play. The exact way we measure them might change because the output format changes: it’s text or an image instead of just a ranked list.
But the fundamental questions remain. We need to figure out who’s being harmed, how, on what basis, and then figure out how to measure it.
So a lot of things change, but the core still holds with some extensions.
We don’t have to worry about a traditional ranker making up an item that doesn’t exist. There has been some interesting work suggesting where in the latent space we seem to be missing items; that’s fascinating.
So there are some new harms, but the core doesn’t change. This is something I’ve found useful throughout my career: as new technologies come, identify the stable core, because that helps you navigate and figure out how, when, and whether to engage with the new technology, and how it fits into the problems that you’re solving.
The core is figuring out how people are being affected in specific ways and then addressing that, dealing with that.
Especially on the fairness side, but also for accountability and other aspects of responsibility, a lot of these questions, particularly on the resource allocation side, are at their core about power and about resources. Who holds it? How is it being allocated? Who makes the decisions about that allocation?
For example, I’ve done some work on author gender fairness in book recommendation. What we’re thinking about there is how exposure to potential readers is allocated among authors.
And with that exposure, influence and power in the landscape of reading and literature.
But also inherent in that is the fact that the recommender system (or rather, its operators) has a lot of power to make those decisions about how they’re going to tune their platform and its allocations.
How does that power get used responsibly?
One of the things that I’ve also been thinking on, and starting to work towards, is participatory approaches that give the people who use the system, or the people whose work is recommended by it, meaningful power in the process of designing and evaluating the systems that are going to affect them.
But where is the power? How is it being distributed, divested, or concentrated? That doesn’t change. That hasn’t changed in all of human history: we’ve been struggling with who has power and how they use it.
We’re still doing that here.
So to try to navigate that, I find it useful to think about why we’re doing recommendation.
If we think about our goals with recommendation, that, I think, helps provide guidance for thinking about which problems we want to try to navigate, whether they’re with classical recommendation or with incorporating generative models into recommendation.
And there are a bunch of different reasons why we might try to recommend. We often talk about satisfying users’ needs for information, users’ preferences, etc.
Needs for information are very, very broad. We have the kind of decontextualized top-N recommendation, and I think of that as the “Nirvana information need”: “Here we are now. Entertain us.”
We might want to provide people with products.
We might want to be more paternalistic and promote healthy living, promote energy savings.
We might want to help users identify and set a goal and then find the products or the songs or the information that is going to help them meet their goal. The generative models might be very interesting there in terms of a better conversational agent experience to elicit that goal.
One of the challenges we’ve had is thinking about how, if we want to do goal-directed recommendation, we actually do that. Rather than us saying, oh, our goal is to make you healthier (which I think is a good goal), how do we let people set their own goals and have the system help them?
How do we set the goal? Maybe an LLM chatbot can help with that.
I don’t know. Somebody go write that paper.
But we might want to help broaden cultural horizons.
Another thing, though, that’s important here is thinking about the allocation of opportunity, the allocation of resources. When I started in recommendation, so 2009, 2010, 2011, a lot of people in the community were reading and talking about Chris Anderson’s book, The Long Tail, in the hallways and mentioning it in talks. It’s about how the Internet and personalization allow for a much larger set of diverse products and opportunities, with personalization in particular being the key to matching those products to the customers who are going to buy them, making a much broader, more diverse, and more interesting set of products, cultural content, et cetera, viable.
That’s a vision I still believe is possible.
But if that’s our vision, we need to think about what that means for the kinds of systems that we’re building and how we’re building and evaluating them.
For a long time, recommendation has been — sometimes explicitly — about people and connections.
Mounia mentioned this yesterday in her keynote about connecting people to people. When you recommend the song to the listener, when you recommend the video, you aren’t just connecting the listener, the viewer, to that item.
You are, in a meaningful sense, connecting them to the author, its creator.
And that is really, really powerful.
This Mexican band I’ve been listening to, I found through YouTube’s recommender system. I like watching concert videos in the background while I work. The recommender suggested this band; I’d never heard of them, so I gave them a listen.
And they were fantastic.
And then I went to their concert a month ago or so.
But we’re connecting people to people.
This was also fairly explicit in an early paper that I love from CHI 1995, one of the earliest recommender systems papers, like the second year of what we’re now calling RecSys.
In this paper, the authors were explicit that the goal of recommendation, as they saw it, was to support social processes, not to replace them, and about the role of people as the sources of recommendations.
They were experimenting with user-to-user recommendation. They did a user study where people were asking, hey, could you put me in touch with the person who generated my recommendations?
I’d like to ask them on a date.
So it’s like: movie recommender turned dating system.
But it was about connecting people to people.
And as I said, we need to think about how we’re allocating opportunity and resources, and how we improve the long tail.
This is from a paper we published in TORS earlier this year, looking at the distribution of exposure across the breadth of movies (we were using the individual item as a proxy for the creator) under different algorithms. We saw that one of our matrix factorizers was much better at distributing the exposure from the recommendation lists across a broad set of items than the nearest-neighbor collaborative filter.
Looking at these kinds of things, how are we connecting people to people, and how are we distributing that connection?
This also interacts with fairness. I could improve, say, the raw exposure of female artists by recommending that everybody listen to Taylor Swift and Beyoncé. But that’s not going to do very much for promoting the visibility of a broader set of female artists, and most people have probably already heard of Taylor Swift and Beyoncé and decided whether they would like to listen to them or not.
And so we need to think about it from multiple perspectives, but we need to think here about how it’s relating to the individual people.
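As a rough illustration of that point, here is a small sketch with made-up exposure numbers. It computes the share of exposure that reaches a provider group alongside how concentrated that exposure is within the group, since a healthy-looking aggregate share can hide a long tail that gets almost nothing.

```python
# Sketch with hypothetical numbers: aggregate group exposure vs. how that
# exposure is spread within the group (Gini: 0 = perfectly even, ~1 = one winner).
import numpy as np

def gini(values) -> float:
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

# Exposure accumulated over many recommendation lists, per artist (made up).
exposure = {
    "superstar_a": 900.0, "superstar_b": 850.0,      # in the group of interest
    "indie_c": 5.0, "indie_d": 3.0, "indie_e": 2.0,  # same group, long tail
    "other_1": 600.0, "other_2": 640.0,              # everyone else
}
group = {"superstar_a", "superstar_b", "indie_c", "indie_d", "indie_e"}

group_exposure = [v for k, v in exposure.items() if k in group]
share = sum(group_exposure) / sum(exposure.values())
print(f"group exposure share: {share:.0%}")                 # ~59%: looks healthy in aggregate
print(f"within-group Gini:    {gini(group_exposure):.2f}")  # ~0.60: concentrated on two stars
```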
One of the things that has concerned me as a risk (or more than a risk) about some of the generative model work is that when we get to the point of trying to replace the content itself, we’re no longer using the recommender to connect people to people.
We’re not connecting people to authors.
We’re not connecting people to the people being reported on in an accurate and humanly accountable way.
We’re now connecting people just to the system, which is kind of the antithesis of what Hill et al. wanted to do with recommendation.
But it also significantly changes what the recommender is doing as a piece of the social aspect of our sociotechnical fabric.
And we need to think about, is that really what we want to do?
If we’re thinking about the distribution of opportunity and exposure, then if we’re just generating the content, or even just generating summaries without clearly pointing people back to the sources of that content in a way that they will actually click through, what we’re effectively doing is solving the equitable distribution of exposure by taking all the exposure for ourselves.
We’re effectively distributing the zero remaining exposure.
This is where the purpose of recommendation feeds into how we think about it.
If our goal is to connect people to people, if our goal is to promote economic opportunity and influence opportunity by making people’s products and content and information visible, then we don’t want to deploy the technology in a way that is going to interfere with that.
And we want to look at the roles this new technology can play, maybe in other parts of the process, while we keep the core of what we’re trying to do the same.
If we have other goals, then maybe we come to other conclusions.
Those are my goals.
But think about what your goal is and let that influence how you think about first whether you incorporate the generative model.
And then once you’ve decided to do it, which harms, which social impacts you’re going to try to measure and think about first.
It all stems from our goals and what we’re trying to accomplish with the system.
Another thing I want to talk about very briefly, before leaving some time for questions, is goals of manipulation.
This is a criminally under-cited paper from 1976 by Belkin and Robertson. As they looked at all of this work on information, both human systems and technical systems for helping people access information, what bothered them was that all of it was focused on empowering the user to find the information that they want.
The question that they wanted to ask in this paper is: to what extent could all of these findings be repurposed for systems that are oriented towards the sender of the information getting out what they want, regardless of what the user wants?
And effectively, propaganda machines. I think we have a lot of systems that are very, very much tuned for that.
That might be okay in some cases. I don’t know that the answer to their questions and provocations is never build such systems.
But they make a number of arguments that are worth thinking about in that context.
Particularly, they were concerned about the ability to repurpose information science at the time to be able to better engineer a propaganda machine to interact with people to be able to get their message out and persuade people in the direction that they wanted to go, perhaps in harmful and antisocial ways.
Don’t see that happening at all today, do we?
We’ve also got a lot of questions. I’m going to skip over some of these.
We have a lot of questions in thinking about demographics and stereotypes.
I’ve been trying to convince people for a little while, even without LLMs, to stop doing demographic recommendation.
Because one of the core principles of personalization and recommendation, going way, way back, is: find out what this person wants and give it to them, not find out what box this person is in and then give them the things for their box.
I like how John Riedl and Joe Konstan put it in their book 15 years ago, Word of Mouse: “box products, not people”.
And I think we can do some of that, say, with generative models for preference elicitation, of allowing people to express their preferences in rich ways.
Then let’s just find the things they want and skip the box.
However, we do need to understand the box. We need to understand what happens as different people come, perhaps with explicit cues about their identity, but also as they come to the system with all of their lived experience, with all of the cultural things that inform how they talk, how they write.
Is the system providing them with equitable responses?
Is it providing them with fair results?
Is it doing stereotype reinforcement? There’s been some pre-LLM work finding things like this: if you have an NLP tool to detect whether a piece of text is or is not in English, someone writing in African American Vernacular English, which is the linguistic term for the variant of English commonly spoken in Black communities in the United States, was far more likely to have their text wrongly classified as not English.
I assume that those kinds of biases are in LLMs and generative models. If someone comes and writes the way they would write with their friends, is the system going to give them good results, or is it going to force them to do what’s called code switching? (This is when someone who speaks one vernacular switches to the upper-class, educated vernacular to participate in other parts of society.)
There’s a lot of challenges in measuring this.
We have unstructured output from these things instead of rankings.
The output is not reliable in the sense of getting consistent results.
There’s a lot of computational expense.
There’s a lot of difficulty defining the actual measurement.
Lots of really hard problems that were already hard before we were doing generative models and are even harder now.
How do we actually mitigate?
We can’t just go retrain all of Mistral on a lark.
So how do we actually inject the appropriate mitigations into it? Many, many challenging problems there.
Also, there’s the problem of generality. Fairness and responsibility concerns are deeply contextual, but we’re building products on top of models that are intended to be general. So how do we navigate this?
We have a paper that we presented at HEAL 24 (at CHI), where we argue that the idea of a “fair LLM” can’t really exist because fairness is contextual and depends on all of the downstream applications in which you embed it. You might be able to do things in the LLM that reduce some of those risks, but you still have to retest fairness in every individual application because it’s contextual.
It also doesn’t compose. If you have a fair upstream model, the downstream product can put the unfairness back into it.
So what I want to leave you with, where I wanted to get to with all of this, is that there are significant changes in the interaction modalities, in the user experiences, in the structure of system inputs and outputs, as generative AI is being incorporated into recommendation.
That complexifies a number of challenges in measuring its fairness and measuring its social responsibility.
But the core questions: where is the power? who is being harmed? how? on what basis? what can we do about it?
Those questions remain and are things we still need to think about, still need to navigate.
We should focus on the fundamental questions: should we do this? And then if so, how do we do it in a way that’s socially responsible, that respects users and their autonomy?
And with that, I’ll be happy to take whatever questions we have time for.