By Tom Fleischman
Not all online reviews are created equal.
Someone who’s posted thousands of times on sites such as Yelp or the Internet Movie Database (IMDb) might be seen as more believable than a relative novice reviewer who’s submitted just a handful. However, that veteran also might be harder to please than the average person, and give a more stringent rating to a product others view as high-quality. New research explores this conundrum.
“The idea we wanted to convey is a very simple one,” said Tommaso Bondi, assistant professor of marketing at Cornell Tech and the Samuel Curtis Johnson Graduate School of Management. “Everyone is using online reviews – we’re constantly learning from them. But to the extent that online reviews are subjective opinions, how much can we really learn from them?”
Bondi and colleagues contend that experts’ more stringent reviews compress aggregate ratings, penalizing higher-quality products relative to their lower-quality alternatives. To address this problem, the team developed a method for de-biasing ratings, which reversed the positions of numerous movies between the original and corrected rankings.
Their paper, “The Good, the Bad and the Picky: Consumer Heterogeneity and the Reversal of Product Ratings,” published Dec. 4 in Management Science. Co-authors are Michelangelo Rossi, assistant professor in digital economics at Institut Polytechnique de Paris; and Ryan Stevens, director of applied science at financial services firm Ramp.
Online ratings have been around nearly as long as the internet itself and, over time, a handful of people have become “super-experts,” Bondi said.
“We’re talking about consumers who end up leaving 10,000 reviews – you see a few of them on IMDb,” he said. “And these people co-exist with a large group of consumers who leave maybe five reviews. Reviews are aggregated using simple rules (such as their average), but the problem is that the more experienced consumers are, the better they are at choosing.”
That, Bondi said, is precisely why they are pickier in their numerical ratings. “As a result,” he said, “high-quality products are held to a higher standard of proof, and their average ratings suffer.”
For this research, the team built a theoretical model with a first generation of users (reviewers, both expert and novice) and a second generation who read those reviews and make choices based on them. The main result of this two-period model is that a rankings reversal can occur when experienced users are much more stringent than novices, and especially when experienced reviewers’ opinions are overweighted relative to novices’ (as is often the case on prominent online platforms such as Amazon and Yelp): Higher-quality products obtain lower average ratings than their lower-quality alternatives.
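To see how such a reversal can arise, consider a small hypothetical example in Python. The numbers, the two-to-one expert weighting and the assumption that experts mostly skip the weaker movie are invented for illustration; they are not taken from the paper.

```python
# Toy illustration (made-up numbers, not the authors' model) of how stringent experts
# who gravitate toward better products can flip the ranking implied by average ratings.

true_quality = {"A": 9.0, "B": 7.5}   # unobserved by readers of the reviews

# Hypothetical ratings on a 10-point scale. Experts seek out the better movie A but
# rate it stringently; the weaker movie B is rated almost entirely by lenient novices.
ratings = {
    "A": {"expert": [7.0, 7.0, 7.0, 7.0], "novice": [9.0]},
    "B": {"expert": [],                   "novice": [7.5, 7.5, 7.5, 7.5, 7.5]},
}

def simple_average(movie):
    scores = ratings[movie]["expert"] + ratings[movie]["novice"]
    return sum(scores) / len(scores)

def expert_weighted_average(movie, weight=2.0):
    """Count each expert rating `weight` times, mimicking platforms that
    overweight experienced reviewers."""
    experts, novices = ratings[movie]["expert"], ratings[movie]["novice"]
    total = weight * sum(experts) + sum(novices)
    count = weight * len(experts) + len(novices)
    return total / count

for movie in ("A", "B"):
    print(movie,
          "true quality:", true_quality[movie],
          "simple average:", round(simple_average(movie), 2),
          "expert-weighted:", round(expert_weighted_average(movie), 2))
# A is the better movie, yet B outscores it under both aggregation rules,
# and overweighting experts widens the gap.
```

In this toy setup, the better movie is rated mostly by stringent experts, so both the plain average and the expert-weighted average rank the weaker movie above it.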
The researchers applied their model to ratings for more than 9,000 movies on IMDb, the world’s most popular movie rating platform, as well as University of Minnesota-based MovieLens, which hosts more than 25 million reviews from 32,000 individual users. Importantly, IMDb divides its users by experience: Top 1,000 (the elite among its 200 million registered users) and Non-Top 1,000. The site displays the number and average of ratings from each group.
While IMDb does not display individual users’ histories, MovieLens does, which allowed the researchers to track its thousands of users over time and apply their de-biasing algorithm, which combines users’ ratings and stringency levels to normalize movies’ ratings without the bias of experts’ more exacting rankings.
On both platforms, they found that experienced users watch and rate, on average, better movies. For example, more experienced IMDb reviewers were more stringent for a striking 98% of the movies in their sample, across all genres. These reviewers also rate movies much more stringently on average: The difference is about 0.68 points on a 10-point scale. As a result, aggregate ratings fail to properly reward high-quality movies.
To solve this problem, the team adjusted each movie’s average rating on both sites for the stringency of the users who rated it, using award nominations and wins as a proxy for quality. They then computed a new rating mechanically, with a user-stringency equation and a movie-rating equation that affect each other until they settle on a fixed point.
“These two things feed into each other,” Bondi said. “It’s like a ping pong ball that keeps going back and forth between aggregate ratings and individual stringencies until the process converges.”
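In code, that back-and-forth might look something like the following Python sketch. The toy ratings, the update rules, the re-centering of stringencies and the convergence tolerance are illustrative assumptions, not the authors’ exact equations.

```python
# Minimal sketch of the alternating "ping pong" fixed point described above.
# The data and update rules are invented for illustration.

# Hypothetical raw ratings on a 10-point scale: ratings[user][movie].
ratings = {
    "expert_1": {"A": 7.0, "B": 6.0, "C": 8.0},
    "expert_2": {"A": 7.5, "B": 5.5},
    "novice_1": {"A": 9.0, "C": 9.0},
    "novice_2": {"B": 8.0, "C": 8.5},
}
movies = sorted({m for user_ratings in ratings.values() for m in user_ratings})
raters = {m: [u for u in ratings if m in ratings[u]] for m in movies}

stringency = {u: 0.0 for u in ratings}   # start: nobody is stringent
score = {m: 0.0 for m in movies}

for _ in range(200):
    # Movie step: de-biased score = average of stringency-corrected ratings.
    new_score = {m: sum(ratings[u][m] + stringency[u] for u in raters[m]) / len(raters[m])
                 for m in movies}
    # User step: stringency = average amount by which the user rates below current scores.
    new_stringency = {u: sum(new_score[m] - ratings[u][m] for m in ratings[u]) / len(ratings[u])
                      for u in ratings}
    # Re-center stringencies to mean zero so scores stay on the original scale.
    mean_s = sum(new_stringency.values()) / len(new_stringency)
    new_stringency = {u: s - mean_s for u, s in new_stringency.items()}

    converged = max(abs(new_stringency[u] - stringency[u]) for u in ratings) < 1e-9
    score, stringency = new_score, new_stringency
    if converged:
        break

print("de-biased scores:", {m: round(s, 2) for m, s in score.items()})
print("stringencies:    ", {u: round(s, 2) for u, s in stringency.items()})
```

Each pass recomputes movie scores from stringency-corrected ratings, then recomputes each user’s stringency against those scores; the loop stops once the numbers stop moving, which is the fixed point Bondi describes.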
The team’s de-biased ratings are less compressed than the original ones and change the rankings of 8% of the 9,426 movies in their sample.
“Everyone liked ‘Oppenheimer,’” he said. “You’d think, ‘Oh, this is the type of movie that experts love.’ And it’s true it was one of the experts’ favorite movies. But they still liked it less than non-experts. That was interesting to us – the story really seems to be one of stringency much more than of relative preferences. Experts are always tough in their ratings, no matter the genre, style or director.”
Bondi and his team think the bias they identify will be even stronger in other product categories – restaurants, hotels or electronics, for example – where the discrepancies in choices, prices and ratings figure to be more pronounced.
Tom Fleischman is a writer for the Cornell Chronicle.