Michael Eriksson's Blog

A Swede in Germany

Posts Tagged ‘Technology’

Giving the 5-star system a 1-star rating and how to do it better

with 2 comments

Star* ratings and similar schemes are increasingly popular, in contexts like eBay, Uber, product evaluations, human interactions, “social credit” systems, …

*Or whatever symbol happens to be used. Below I will silently assume the common-but-arbitrary 1–5 integer scale, and will simply speak of e.g. “3” and “5” to indicate ratings. Minor modifications might be needed if another scale is used, especially one with an even number of steps. I originally set out to distinguish between “rating”, to indicate an individual rating of some entity A by some entity B, and “ratings”, to indicate the sum/average of all ratings received; however, I let a half-finished text rest for too long and dropped the ball on this when I resumed writing. I have tried to clean up through a move towards using just “rating”, reserving “ratings” for regular plural cases, but I might not be entirely consistent.

These should sensibly be centered around 3 and have 3 as a default expectation, with 1 and 2 implying a correspondingly worse-than-expected effort (or whatever might be rated), and 4 and 5 a correspondingly better-than-expected effort. In reality, there is typically an enormous inflation, to the point that any rating lower than 5 is seen as an affront in some systems, and e.g. a (received) 4.5 average is seen as poor, rather than excellent. Side-effects include that the value of the rating systems is severely reduced, that individual average ratings are hard to judge without having a separate standard of comparison,* and that many are obsessed with having a perfect rating.

*In particular, a given value, e.g. 4.5, can have very different implications in different systems, or even the same system at different times, depending on how bad the inflation is.

This is a particular problem with systems where a high average rating is crucial for customer choice and/or a service provider might be booted for having a too low average; where similar applies to customers, who might e.g. be refused service for having a too poor rating; and/or where there is a mutual rating system, where a poor or “poor” rating can result in a retaliatory poor rating and good ratings can be traded for the purpose of a mutual ratings increase.

In some other areas, it is common practice to ignore the top value(s) and the bottom value(s) and to calculate an average based on the remaining values. For instance, in some sports* with points given by judges, we might have five judges, each giving a number of points—but the top and bottom rating is ignored and the three remaining ratings are used to form an average, which, in turn, becomes the final points used for inter-athlete comparisons. The series 1, 3, 3, 4, 4 then becomes 3, 3, 4, with a resulting rating of 3.3, 3.33, 3.333, or whatnot, depending on how many decimal places are used for display and comparison.**

*Think the likes of gymnastics, ice dancing, synchronized swimming. I have not checked whether any of these use the exact system described. (In particular, I use integers 1–5 solely for compatibility with the star ratings, while some sports have scales that go to 10 and/or allow a decimal position.) A reason for such systems is fear of deliberate cheating to favor one athlete over another based on e.g. nationality.

**Note that only one of the 4s is removed (and which one does not matter). The exact rating is, of course, 3 1/3.
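
To make the judge-style trimming concrete, here is a minimal Python sketch (the function name is my own; the example scores are the series from above):

    def judge_average(scores):
        """Drop the single highest and the single lowest score; average the rest."""
        if len(scores) < 3:
            raise ValueError("need at least three scores to trim both ends")
        trimmed = sorted(scores)[1:-1]  # removes exactly one minimum and one maximum
        return sum(trimmed) / len(trimmed)

    print(judge_average([1, 3, 3, 4, 4]))  # 3.333..., i.e. exactly 3 1/3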

The same approach might be a partial remedy to the problems with star ratings: Say that the top/bottom 10 percent of ratings are removed from the calculation. (The 10 is highly negotiable. Some modification might be needed for those with very few individual ratings.) In this scenario, the short-term* effect of a single major deviation from the typical rating will often be close to nothing, because the deviation is within the 10 percent.** Alternatively, it is not within the 10 percent, but then the cause of any changes to the average rating should rightfully and rationally*** be seen as a joint effect of the currently given rating and the previous 10 percent of good/bad ratings. Moreover, chances are that any individual “disaster rating”, e.g. a 1 given to someone with an average in the high 4s, will not differ that much from e.g. a 3 in terms of negative effect. (See the excursion on marginal effects for some more detail and my reasons for not being more explicit.)

*A mid- or long-term effect is still possible, as later ratings can be in or out of the 10 percent depending on older ratings, and as later ratings can ultimately push an older rating in or out of the 10 percent.

**And, most of the time, the marginal effect through a movement of any other rating out of the ten percent will have little or no impact.

***The risk that someone has an irrational negative reaction remains, of course.
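
To illustrate the suggested remedy, a small Python sketch of the “remove the top/bottom 10 percent” calculation, using simple rounding for fractional counts (a fractional-weight alternative is discussed in a later footnote); the function and the example values are mine and not taken from any real system:

    def trimmed_average(ratings, trim_fraction=0.1):
        """Average the ratings after dropping the top/bottom trim_fraction."""
        k = round(len(ratings) * trim_fraction)  # entries to drop at each end
        if 2 * k >= len(ratings):
            raise ValueError("too few ratings to trim at this fraction")
        kept = sorted(ratings)[k:len(ratings) - k]
        return sum(kept) / len(kept)

    # A single 1 given to someone with twenty 5s falls within the trimmed
    # 10 percent and leaves the average untouched:
    print(trimmed_average([5] * 20 + [1]))  # 5.0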

Now, a single bad rating from A to B is less likely to lead to a negative reaction from B, any retaliatory rating from B to A is less likely to affect A’s average, and a trade of good ratings is less likely to be successful. Correspondingly, a shift towards more sensible ratings is likely, because the pressure to give artificially good ratings, e.g. a 5 when a 3 was deserved, is reduced. (Other complications might remain, e.g. that someone might give a too good rating out of kindness or to avoid affront within an independent personal relationship. However, the effect of these will still usually be diminished.)

A potential downside of this scheme is that the quality of e.g. a service is best judged when things go wrong or otherwise something out of the ordinary occurs—there are far more pilots who can safely land an airplane with functioning engines on a regular airstrip than can pull a “Sully”. Particularly important is how a provider’s own errors are handled, whether the customer is compensated, whether extra steps are taken to remedy the problems caused, etc.—or whether the customer is dismissed with the equivalent of a “Tough luck. Next!”. If* things go wrong less than 10 percent of the time, this scheme could filter out those ratings that are the most valuable.

*Based on my own experiences with various German businesses, an error rate of 10 percent would be optimistic, but other fields or other countries might have a better record. (And, yes, usually the customer is stuck with all the damage and all the problems, while the business shrugs its metaphorical shoulders.)

Another approach to achieve a similar effect could be to give some form of penalty to the rater when he is too generous (maybe, the same when he is too keen on giving low ratings) relative to some intended distribution of ratings. If we say, e.g., that 5s are intended to be given 10 percent of the time, over all raters and ratees, then we might apply a temporary malus of -0.1 to the rating of the rater for every full percentage point that he currently exceeds those intended 10 percent in his generosity. (Again, likely with some special regulations for those with few given ratings.) Trading 5s could, depending on how many 5s have been given, be a particularly bad idea. For instance, if someone consistently trades 5s with everyone, his base rating would be 5—but the malus would be -0.1 for each of the 90 percentage points that he exceeds 10 percent, or -9, giving him a horrifying overall of -4.
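
As a sketch of how such a malus might be computed, under my reading of the above (the constants and names are assumptions for the purpose of illustration):

    INTENDED_SHARE = 0.10  # 5s are intended to be 10 percent of all given ratings
    MALUS_PER_POINT = 0.1  # penalty per full percentage point of excess generosity

    def rating_with_malus(base_rating, given_ratings):
        """Apply the generosity malus to the rater's own base rating."""
        share_of_fives = given_ratings.count(5) / len(given_ratings)
        excess = max(0.0, (share_of_fives - INTENDED_SHARE) * 100)
        return base_rating - MALUS_PER_POINT * int(excess)  # full points only

    # Someone who consistently trades 5s: a base rating of 5, but 100 percent
    # given 5s, exceeding the intended 10 percent by 90 percentage points:
    print(rating_with_malus(5.0, [5] * 50))  # 5 - 0.1 * 90 = -4.0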

Excursion on ill-advised approaches:
A harder variation of the second approach has been used* on some occasions: Put in a restriction that no-one, over a given period of time, is allowed to give more than a fixed number or a fixed percentage of certain ratings. This can then have the effect that someone more deserving of a 5 receives a 4, while someone less deserving receives the 5, merely through reasons of timing. Ditto those more or less deserving of a 1. Another ill-advised approach is to limit certain ratings to certain standings, e.g. that only those with an own rating above 4 are allowed to give out 5s. This might to some degree curb abuse, but comes with the downside that someone deserving of a 5 might not receive it, merely because of the rating-of-the-rater. (A potentially less harmful version is that everyone is allowed to give any rating, but that the weight of the rating when calculating the average rating depends on the rating-of-the-rater. However, this could have negative side-effects in other areas, e.g. that big players can form clubs of mutual admiration to remain big players and/or to keep others down. It could also be computationally too expensive if the number of raters and/or ratings is large.)

*I am uncertain whether this has been used for this specific type of rating; however, a similar idea is used on e.g. the Unz Review to limit certain actions by commenters.

Excursion on re-centering, the impact of extremes, and the ability to regain lost ground:
An additional benefit of a re-centering to an average of 3 is that the impact of an extreme value grows smaller (even absent other mechanisms, like the “10 percent” idea above) and that losses can actually be neutralized. In one alternate reality, someone expects a 5 and receives a 1, for a negative surprise of 4; in another, he expects a 3 and receives the same 1, for a negative surprise of 2. In a next step, in both realities, a new rating comes in at 5. In the second reality, the 1 and the 5 will average out at 3, with no harm done. In the first? There will typically be a much more marginal change. To illustrate the principle, consider an average rating of 5 over 19 ratings. The new average after the 1 is (5 * 19 + 1) / 20 = 4.8. After the 5, we move to (5 * 19 + 1 + 5) / 21 or just shy of 4.81. Throw in another twenty 5s, with no smaller values in between, and he will almost be back at 5 again, but to have a proud 5 on display, he will have to rely on rounding. Go back to the 3 scenario: after receiving the 1, the average sinks to (3 * 19 + 1) / 20 = 2.9; after receiving the 5, it bounces back to (3 * 19 + 1 + 5) / 21 = 3.
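
The arithmetic of this excursion is easily verified, e.g. with a few lines of Python over hypothetical rating lists:

    def average(ratings):
        return sum(ratings) / len(ratings)

    five_world = [5] * 19   # nineteen 5s; an expectation of 5
    three_world = [3] * 19  # nineteen 3s; an expectation of 3

    print(average(five_world + [1]))                # 4.8
    print(average(five_world + [1, 5]))             # ~4.8095, just shy of 4.81
    print(average(five_world + [1, 5] + [5] * 20))  # ~4.90, a displayed 5 only by rounding

    print(average(three_world + [1]))     # 2.9
    print(average(three_world + [1, 5]))  # 3.0; the loss is fully neutralized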

Excursion on school grades:
A very similar problem is common with school grades and similar ideas might be usable there, but care should be taken. For instance, if some variation of the second scheme was implemented, a teacher who just happens to have several genuine A-students (as opposed to “got an A, because everyone gets an A”-students) could effectively be punished for giving the A-students the grade that they truly deserve. (During my own school years, there was a stretch with four parallel classes, of which the top-3 or top-4 students were all in my class. The distribution of students is not necessarily even approximately uniform.)

Excursion on marginal effects of new ratings and changing borders:
A problem with discussing the above system (of ignoring the top/bottom 10 percent) is the subtlety of behavior when the borders change. I largely gloss over the issue, but to give some idea (and some idea of why I have glossed over it): If we originally have N ratings, the 10 percent amount to N/10,* and we simply adjust for the N/10 top/bottom entries. Add a new rating and the 10 percent become (N + 1)/10, shifting** both the upper and the lower border. No matter what numerical value the new rating has, we will then see a combination of at least four effects: (a) The denominator will increase, thereby slightly reducing the overall effect of all previous (non-ignored) ratings. (b) The shift of the lower border will cause some old entry to be removed/reduced. (c) Ditto the upper border. (d) Depending on what numerical value the new rating has relative to the old ratings, it will either go among the non-ignored ratings and count in its own right or be ignored but shift one or two*** of the old ratings from the top (if it was high) or bottom (if it was low) towards counting. Note that the effect of (d), in the second case, will counteract the effect of either (b) or (c), which could give the affected old rating a net shove in the opposite direction relative to (b) or (c), respectively.

*If this is an integer value, everything is fine. If not, a workaround is needed, e.g. by rounding or by virtually removing a fractional rating by reducing its weight. Regarding the latter, which I would likely prefer: if we have 101 ratings, 10 percent is 10.1, and we could remove the 10 top/bottom entries as usual, and count the average by giving the next highest/lowest remaining rating a weight of 0.9, the other 79 the usual weight of 1, and then dividing the whole by 79 + 0.9 + 0.9 = 80.8. Regarding the former, adding an individual new rating will have no effect on the number of ratings removed in nine cases out of ten, but will increase it by 1 in the tenth case. (Looking at the above, the (b) and (c) effects would then only manifest in the tenth case, while the (a) and (d) effects would always be present. For the other scheme, all effects are present all the time, but might move in fractions.)

**With reservations for the approach chosen per the previous footnote. (If these approaches seem hard to understand, feel free to instead add ten new ratings. I stick with a single rating because I am interested in the marginal effect.) However, note that both borders would shift simultaneously in both the mentioned approaches, and would likely do so in all or almost all, in some sense, reasonable approaches.

***The “two” arises when an old rating is counted partially and only a part of the shift is needed to make it count fully, while the remainder of the shift moves another old rating from not counting to counting partially. (Does not apply with the rounding scheme.)
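
For the fractional-weight variant of the first footnote, a Python sketch of my own formulation (with 101 ratings, 10 entries per end are ignored outright and the next entry per end only counts with weight 0.9):

    def weighted_trimmed_average(ratings, trim_fraction=0.1):
        """Trimmed mean where the fractional boundary entries count partially."""
        s = sorted(ratings)
        cut = len(s) * trim_fraction           # e.g. 10.1 for 101 ratings
        whole = int(cut)                       # entries fully ignored per end
        boundary_weight = 1.0 - (cut - whole)  # e.g. 0.9 for the next entry per end
        kept = s[whole + 1 : len(s) - whole - 1]  # fully counted middle entries
        total = sum(kept) + boundary_weight * (s[whole] + s[-whole - 1])
        return total / (len(kept) + 2 * boundary_weight)

    # 101 ratings: 79 full entries, two boundary entries at weight 0.9,
    # and a divisor of 79 + 0.9 + 0.9 = 80.8, as in the footnote.
    print(weighted_trimmed_average([1] * 10 + [3] * 81 + [5] * 10))  # 3.0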

Excursion on cheating:
In addition, there is a great risk of cheating of various kinds, especially when someone has a one-sided control over ratings. A particularly bad example is a contact form* that I encountered many years ago (I do not remember where): The form was constructed to automatically provide a star rating with every message—and the rating was preset to 5 stars. Worse, the form was constructed so that this aspect was not obvious. Not only does this result in a potentially highly undeserved rating claim (“We have an average 4.9 rating from our customers!!!!!!!!!”), but it could lead to paradoxes like a customer writing an angry complaint, amounting to a virtual no-star-at-all,** while the actual rating registered in the system would be a 5…

*Of course, contact forms are evil, per se. Non-evil entities publish email addresses and either do not use contact forms at all or only as a secondary and entirely optional means in addition to proper email.

**More generally, a cap at 1 star for poor ratings, while standard, is misleading: even one star should imply that some positive minimum standard has been reached. At the extreme, even a single Michelin star is a ringing endorsement. (Much unlike a single star in most rating schemes for hotels.) If the intended semantic of 1 is “complete failure”, it should not be associated with a star, but just be given as a number. (As a comparison, Swedish grading, during my own school years, was 1–5, with “1” equivalent to a U.S. “F”. This was fine, as no stars were involved.)

Written by michaeleriksson

February 11, 2023 at 1:23 pm

The trial of the year—Victory! (Follow up)

leave a comment »

As I wrote in March, a jury ruled in favour of Novell in the fight against SCO, whose widely-considered-faulty claims had caused great costs and uncertainty for a number of other parties (including, obviously, Novell).

There was still some remaining uncertainty in theory (considering the overall situation and previous judgements, a practical problem was unlikely), because there were further “findings of fact” and various motions to be decided by the judge. As Groklaw now reports:

Judge Ted Stewart has ruled for Novell and against SCO. Novell’s claim for declaratory judgment is granted; SCO’s claims for specific performance and breach of the implied covenant of good faith and fair dealing are denied. Also SCO’s motion for judgment as a matter of law or for a new trial: denied. Novell is entitled to waive, at its sole discretion, claims against IBM, Sequent and other SVRX licensees.

CASE CLOSED!

Maybe I should say cases closed. The door has slammed shut on the SCO litigation machine.

Written by michaeleriksson

June 11, 2010 at 6:09 pm

Posted in Uncategorized

The trial of the year—Victory!

leave a comment »

I recently wrote about the SCO vs. Novell trial, the verdict of which is now, with some delay, in:

A unanimous jury rejected SCO’s copyright claims, which likely means the end to this threat once and for all. Virtual champagne all around!

Of course, looking at the preceding decade, SCO has been harder to get rid of than Jason Voorhees; however, unlike Jason, it is not actually supernatural.

Written by michaeleriksson

March 31, 2010 at 4:01 am

Posted in Uncategorized

The trial of the year

with 2 comments

Right now, a trial of great importance is underway: The battle between Novell (the good guys) and SCO (the bad guys) concerning the rights to Unix. Unfortunately, most people seem to be entirely unaware of it.

Why is this battle so important?

In order to understand this, a brief overview is needed, and will be given below. By necessity, it will be an over-simplification: The story is extremely convoluted, involves many parties, and is stretched over a very long time. For those interested in more details, I recommend Wikipedia; for those truly interested, there are enormous amounts of material present at Groklaw or, in German, Heise.

Some forty years ago, the operating system Unix takes its first steps at AT&T. This little toddler is to grow into one of the dominating server and workstation operating systems for several decades—and to be the progenitor of both Linux and Mac OS X.

In the early nineties, AT&T sells the rights to Novell (the first of the combatants). In 1995, some of these rights are sold to SCO (confusingly, not the second combatant). Here, however, we encounter the point of contention: Which rights, exactly?

Only in 2000 does the second combatant, then called Caldera, enter the arena by buying the Unix business of the original SCO. Not long thereafter, Caldera changes its name to SCO Group, in an effort to capitalize on the strong brand-name of the original SCO, which it has also bought. Meanwhile the original SCO departs from our tale.

Having had a few less than successful years, SCO looks for a solution to its money problems, and in 2002 it begins the dangerous gamble of claiming more extensive rights to Unix than it was acknowledged to have—and of claiming that Linux contains significant portions of unlicensed Unix code. Calls for proof are raised; none is given.

In 2003, all hell breaks loose. A slew of lawsuits are started: SCO v. IBM, Red Hat v. SCO, SCO v. Novell, SCO v. AutoZone, SCO v. DaimlerChrysler. Claims and counter-claims are made, and litigation that lasts until at least 2010 ensues. SCO’s most noteworthy claim: IBM owes it one billion dollars (yes, billion) relating to IBM’s alleged and allegedly illicit use of intellectual property allegedly belonging to SCO. This amount was later increased to five billion… To make matters worse, this has the appearance of a pilot case, with more to follow upon success.


Side-note:

The above paragraph has been revised for two errors since the original publication:

  1. When checking the numbers, I overlooked the increase to five billion dollars.

  2. I claimed that even one billion was far more than SCO was ever worth. While I still hold this statement to be true, it is technically wrong, seeing that Caldera had a market capitalization of more than that shortly after its IPO. That number, however and IMO, was severely hyped, did not reflect actual sales and prospects, and dwindled soon afterwards. (See also CNET on the IPO or historical share-price information of SCO.)

Generally, I gathered most facts from a few timelines on the given links, without revisiting the case in greater depth. (I followed the case with great interest in the early years, but with the passage of time…) Correspondingly, there may be other errors in detail—not, however, in the big picture.


In parallel, SCO tries to leverage its claims in other ways, e.g. by trying to bluff companies merely using Linux into purchasing “antidote” licenses as protection against potential lawsuits for larger amounts.

As time goes by, SCO becomes more and more focused on these lawsuits, seeing the rest of its business disappear. It is now in a do-or-die situation—win the jackpot in court or end up in bankruptcy. It has become a company effectively geared towards just one thing—litigation.

Because SCO is never able to produce evidence, it has little success, often sees its claims struck down by summary judgments, and only manages to stay above the waterline through injections of additional capital, including from Linux’s, Unix’s, and Apple’s archenemy—Microsoft. Those claims that are not struck down are often stayed awaiting one of the other cases, either SCO v. IBM or SCO v. Novell.

In the autumn of 2007, the issue seems to be concluded: A summary judgment falls, stating that Novell is the rightful owner of the relevant Unix rights, which pulls the rug out from under all other cases; and SCO is effectively bankrupt.

However, hanging by a thread and protected by Chapter 11, SCO manages to remain in the fight—and in August 2009, an appeals court finds that parts of the summary judgment were premature and must be treated in a full trial. This trial is now underway, expected to be concluded in the coming week (knock on wood).

As should be clear even from this greatly simplified overview, the situation has been highly chaotic, and great stakes are involved. Those who dig into the sources given above will find more chaos yet, including many other examples of highly disputable behaviour on the part of SCO—and many cases of infighting and internal intrigues.

Now, why is it important that SCO lose this trial? Mainly, were SCO to win, it would set a dangerous precedent with regard to making legal claims bordering on the frivolous, extorting money by means of legal threats, and making grossly misleading accusations against other organisations: The justice system is abused often enough as it is—with a SCO victory, we could see a flood of lawsuits where failing companies try to ensure their survival by suing wealthier companies, possibly causing immense damage to third parties along the way. In addition, it is still conceivable that a SCO victory could do great damage to the companies and communities involved in developing Linux, and similar lawsuits against other members of the extended Unix family would not be inconceivable—and consider if Linux takes a severe hit at the same time as Apple is locked up in ten years of costly litigation: All of Gaul could well be conquered by the Redmonds this time.

Notably, while the probability that SCO will win sufficiently many battles is small, the stakes are sufficiently high that there is still reason to be nervous. In football terms: We may be a few minutes away from the end of the fourth quarter and have a two-touchdown lead—but the game is the Super Bowl.

The issue of ObamaCare may be more important, but neither the OJ trial(s) nor the actual Super Bowl holds a candle to it.

Written by michaeleriksson

March 21, 2010 at 4:20 pm

Posted in Uncategorized

Further notes on WordPress

leave a comment »

As hinted at in my last post, I have been fairly active in exploring WordPress recently. Until recently, my excursions into the blogosphere mostly consisted of stumbling onto various blogs during research, often followed by just reading that blog from beginning to end (skipping entries that turned out to be uninteresting, obviously). This way, I have built up a great mass of read blog entries, but without any continuity, little “compare and contrast”, and no view of the writer’s side (apart from the very different platform of OpenDiary)—and my recent activities here have given me much deeper insights into WordPress, how different blogs come across, how the writer’s side works, etc.

A few observations (with a tendency towards griping) on the more technical side:

  1. The whole “theme” thing is done the wrong way around: The themes should not be applied by the authors to their own blogs, but by the readers. This would make for greater consistency, make life easier for the readers, and avoid many annoyances. An article on my website on Separation of content and layout can provide a bit more information about what I mean.

    As an aside, OpenDiary has the same problem—and there I usually used Opera’s UserCSS functionality to just override anything the diarists had concocted. (Note that the themes there are not professionally ready-made, as here, but entirely the work of the individual diarists. The result is a high frequency of truly abhorrent designs, with extremely bright and contrasting colors, red text on black backgrounds, and other variations that make the reader’s eyes hurt.)

  2. The administrative area is abysmally slow—a price to be paid for the extensive functionality. In the weighing of costs and benefits, I am of the opinion that WordPress should have been content with less. (Reservation: My time here is sufficiently short that this could conceivably be a temporary shortage in bandwidth or server capacity. If so, I may have to revise this statement. Under no circumstances, however, would I like to deal with WordPress over a cell phone or a dial-up connection.)

  3. For some reason, HTML text entered with line-breaks is distorted by the artificial addition of paragraphs according to these line-breaks. Really unprofessional: The point of HTML (as opposed to Rich-Text or WYSIWYG editors) is that the actual HTML code can be entered (typically pasted from elsewhere) and be interpreted in the same manner as if it had been written in a plain HTML document.

  4. The Snap previews of links are evil. Compare a discussion on another blog. I urge my fellow bloggers to follow the advice of that post and turn Snap off. Further, I reiterate my comment on that post that this is a functionality that should be provided and configurable on the browser level, not on the blog/website level (similar to themes above).

    For users, I have not found any foolproof way to counter this. I tried a few alleged solutions using user-side JavaScript/CSS, but they proved ineffectual for some reason; the same was true for the alleged solution in the Snap FAQ. (And, upon inspection, the source code was sufficiently convoluted that it would have taken me more time than I intended to waste to reliably find the right counter-measure.) Currently, I simply have JavaScript turned off by default. This fixes the problem, but can have negative side-effects elsewhere. It may, in particular, be necessary to re-activate it when doing something in the administrative area.

  5. I am puzzled as to why the statistics in the administrative area have a piece of Flash where a conventional image would be expected. There may be some additional functionality present that is not possible with an image, but hardly any that would justify the use of Flash (evil!); in particular, when considering that normal links, CSS, and JavaScript can do most (all?) things that could reasonably be wished for in this context. (Because I have Flash turned off in a very categorical manner, I cannot say what this hypothetical additional functionality would be—or if there is any at all: It could well be that the contents are static, and that the developers simply find generation of Flash easier than of an image.)

Written by michaeleriksson

March 6, 2010 at 3:08 pm

Posted in Uncategorized
