NOTE: The Sanbai webhook is currently broken. After pushing scores to Sanbai, go to Unlocks and Preferences and click on Fetch scores.

tl;dr: Higher spice = harder. Higher quality = you have a better score.

This project exists as a way to introduce as much objectivity as is possible into the process of determining how "hard" a chart is to score on, and by extension how "good" a given score is (normalized with respect to how hard the chart is). I'm purposefully avoiding saying that it's completely objective, of course, because that's impossible for many reasons. I'll get into a few over the course of this writeup. But first I want to thank a few people, without whom this would not have been possible:

The spice ratings in PPR X might remind you of the DDRCommunity tier lists, or the 3icecream tier lists. In terms of methodology, it's closer to the 3icecream list: rather than looking at raw counts of individual songs, I am comparing the scores of players that have played pairs of songs. However, there are some key differences between PPR X spice ratings and 3icecream's. The biggest one is that PPR X spice ratings are irrespective of the chart's level: a DDR 14 with a spice rating of 9.5 should be exactly as hard to score as a 15 with a spice rating of 9.5. Practically, PPR X spice ratings have an accuracy bias toward the AAA threshold. Sunny also uses some data I do not have access to (such as play count) when computing the 3icecream tier lists.

The quality ratings PPR X presents for your scores serve as a rough measure of how good each score is, relative to your other scores, ideally irrespective of the songs' difficulties. This means that, for example, you can identify your "best" scores by looking at your highest-quality scores, and you can identify your biggest improvement opportunities by looking at your lowest-quality ones. The goal-setting feature lets you choose a quality-equivalent target score on every chart PPR X tracks!

I update the list of charts and unlock criteria manually, so there will be a delay of between a few minutes and a few days when new charts or events are released. Similarly, I perform a full spice rating update less than monthly (maybe quarterly-ish). So the newest charts will be present in the list but will be "auto-spiced": PPR X will assign them the lowest spice rating for their level, will not assign a quality rating to your score, and will set your target score to the lowest for that rating.

From this point I'm going to get into the gory details... if you're not interested in that, go check out your scores, enjoy PPR X, and play some DDR!

My approach

So how does all this work? I based my approach on Telperion's scobility project for the ITL final rankings. The high-level summary is:

  1. Scrape every score from 3icecream for the charts I'm interested in
  2. Perform comparisons between pairs of charts in which I look at the scores of players that have played both charts
  3. Use the results of those comparisons to assign a spice rating to each of the charts
  4. Use the spice ratings to assign a quality rating to a player's scores on each of the charts

For a variety of reasons, it was infeasible to perform this on the easiest charts. There are anomalies in the way players approach those charts that make the data unreliable (and in fact, that even applies to some small degree to some songs on the easier end of what I did consider). I struck a balance, choosing to include (1) all expert charts, (2) for each song, any chart that has a DDR rating the same as or higher than the expert chart (usually its challenge chart, but there is one special child), and (3) all charts rated 14 or harder.
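As a rough illustration of that filter, here's how the inclusion rule might look in code. The record shape and field names are hypothetical; this just restates the three criteria above.

```python
# Hypothetical chart record; field names are illustrative only.
def include_chart(chart: dict, expert_level: int) -> bool:
    # (1) every expert chart
    if chart["difficulty"] == "EXPERT":
        return True
    # (2) any chart rated the same as or higher than the song's expert chart
    if chart["level"] >= expert_level:
        return True
    # (3) anything rated 14 or harder
    return chart["level"] >= 14
```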

To perform a score comparison, I put one song on the X axis and another on the Y axis, and put a point on the plot for each player who has played both charts. (Well, I don't do this, the computer does.) I subtract the scores from 1,000,000, so that the data point represents how many points are "missing" from each score. I then run a linear orthogonal distance regression against these points, forcing the line through the origin. The slope of the line is, in theory, a direct measure of how much harder one song is than the other. For example, if there's only one player who's played both, and that player has 900k on Y and 950k on X (100k missing versus 50k missing), Y is twice as hard as X.
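For the curious, here is a minimal sketch of one such comparison using SciPy's ODR module. The score arrays are made up, and the real pipeline also applies the cutoffs and weights described later, but the shape of the fit is the same: a one-parameter line through the origin in "missing points" space.

```python
import numpy as np
from scipy import odr

# Made-up scores for players who have played both charts
scores_x = np.array([950_000, 985_000, 920_000, 999_000])
scores_y = np.array([900_000, 972_000, 861_000, 997_500])

# Work in "missing points" space: 1,000,000 minus the score
miss_x = 1_000_000 - scores_x
miss_y = 1_000_000 - scores_y

# One-parameter model forced through the origin: y = slope * x
model = odr.Model(lambda beta, x: beta[0] * x)
fit = odr.ODR(odr.Data(miss_x, miss_y), model, beta0=[1.0]).run()

slope = fit.beta[0]  # slope > 1 means Y is harder to score on than X
print(f"Y is roughly {slope:.2f}x as hard as X")
```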

These comparisons take a bit of time (a few seconds each), so it's impractical to compare every set of charts I care about. To this end I needed to reduce the number of comparisons I was making. For every rating 8 through 19, I determined the five most-played charts of that rating. These charts were each compared against all charts that had ratings one lower, the same, and one higher than that chart. For example, POSSESSION DSP is the most popular 14, and it was compared against every 13, 14, and 15. This resulted in about 25,000 comparisons, which takes about a day.
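A sketch of how that pruning might look. The chart records and field names here are hypothetical, but the logic follows the description above: five most-played anchors per level, each compared against everything within one level of it.

```python
from collections import defaultdict

def build_comparison_pairs(charts, anchors_per_level=5):
    # charts: list of dicts with hypothetical "id", "level", "play_count" fields
    by_level = defaultdict(list)
    for chart in charts:
        by_level[chart["level"]].append(chart)

    pairs = set()
    for level in range(8, 20):
        # The most-played charts at this level serve as anchors
        anchors = sorted(by_level[level], key=lambda c: c["play_count"], reverse=True)
        for anchor in anchors[:anchors_per_level]:
            # Compare each anchor against every chart rated one lower, equal, or one higher
            for other in by_level[level - 1] + by_level[level] + by_level[level + 1]:
                if other["id"] != anchor["id"]:
                    pairs.add((anchor["id"], other["id"]))
    return pairs
```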

With the comparisons in hand, I assigned multiplicative ("raw") spice ratings. I did this by starting with an arbitrary set of ratings (I happened to choose each song's DDR rating). Then I iterate 500 times, and in each iteration I do the following for every chart (a rough sketch follows the list):

  1. Get the previous iteration's spice rating for every chart this chart was compared to
  2. Multiply (if harder than that chart) or divide (if easier) each of those comparative spice ratings by the slope of the comparison
  3. Take a weighted average of the results, where the weight is influenced by how many points were on the plot, and how much error there was in the regression
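Here is a minimal sketch of that iteration. I'm assuming the slope stored for each comparison already expresses "how much harder this chart is than the other one", so the multiply-or-divide step collapses into a single multiplication; the weighting term is also simplified.

```python
import numpy as np

def iterate_raw_spice(comparisons, initial_spice, n_iters=500):
    # comparisons[chart] -> list of (other_chart, slope, weight), where
    # slope expresses how much harder `chart` is than `other_chart`
    spice = dict(initial_spice)  # e.g. seeded with each chart's DDR level
    for _ in range(n_iters):
        new_spice = dict(spice)  # charts with no comparisons keep their rating
        for chart, comps in comparisons.items():
            # This chart's spice as implied by each comparison partner
            estimates = [spice[other] * slope for other, slope, _ in comps]
            weights = [weight for _, _, weight in comps]
            new_spice[chart] = float(np.average(estimates, weights=weights))
        spice = new_spice
    return spice
```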

Now I have a "raw" spice rating for every chart. Since this raw rating was effectively determined by progressive multiplication, I use logarithms to bring the ratings into a sane scale range. I use log-base-2, so that "1 harder" means "twice as hard". I also shift the ratings to anchor them at MAX 300 expert = 10.0 (because I am nostalgic), which provides an anchor to prevent shifting over time as these ratings are updated. (MAX 300 will always be a 10, and will effectively be the point that all other charts are compared against.)
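In other words, the published spice is just a log2 of the raw value, shifted so MAX 300 lands on exactly 10.0; something like:

```python
import math

def normalized_spice(raw_spice, raw_spice_max300):
    # log base 2: "+1 spice" means "twice as hard";
    # the shift anchors MAX 300 expert at exactly 10.0
    return math.log2(raw_spice) + (10.0 - math.log2(raw_spice_max300))
```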

To determine the quality rating for a score, I simply take the log base 2 of the fraction of missing points (which will be a negative number, since that fraction is less than 1), and subtract it from the spice rating (which results in a number higher than the spice rating).
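As a worked sketch, assuming the fraction is taken out of the full 1,000,000:

```python
import math

def quality(spice, score):
    # Fraction of the 1,000,000 points missing from the score;
    # a perfect 1,000,000 would need special handling (log of zero)
    missing_fraction = (1_000_000 - score) / 1_000_000
    # log2 of a number below 1 is negative, so subtracting it
    # lands above the spice rating
    return spice - math.log2(missing_fraction)

# e.g. 990,000 on a 10.0-spice chart: 10 - log2(0.01) ≈ 16.6
```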

There is one main difference from Telperion's approach here. Telperion "lined up" all the ITL charts in an array and ordered them roughly monotonically according to their comparisons with other charts, assigned a 1 to the lowest, and used successive multiplication to get raw spice. Then, in order to increase accuracy, he used a sliding window instead of only comparing with the neighboring charts. (This is an oversimplification; read his excellent post for the full details.) I originally approached the spice assignment this way as well, but eventually realized I could skip the lineup step entirely and instead, for each chart, take a weighted average against every other chart it was compared with.

Flaws in applying this methodology to DDR

The elephant in the room is money score calculation... I really, really wish EX score were exposed to us. I initially dismissed this as not much of a concern -- if the system gets excited about every PFC, that's fine; so does DDR itself, right? But what happens in practice is that (for example) songs with very few steps end up with higher spice ratings. A single great costs a much larger share of the money score on a short chart, so the folks who get 1 great on it and 1 great on some other, longer song (scores which, really, are about the same in "quality" for similarly-patterned charts) will push up the slope of the regression. So short songs get higher spice ratings, even though they're arguably easier to PFC. They *are* harder to AAA, and the system reflects that.

Compared to ITL, where the context provides a massive incentive to grind, DDR players (even the ones who use 3icecream) have a wide variety of goals and play patterns. Harder charts will get more play, driving up the scores for those charts and making them appear easier than they are, which compresses the spice scale. Additionally, players who just barely pass two songs will probably have similar scores on both. A mash-pass on Endymion 18 will be something like a 680k, and a mash-pass on the 19 will be something like a 620k. That data point would suggest the 19 is only about 1.2x as hard as the 18! This also compresses the spice scale.

Songs that people play once for completion and walk away from will come up as hard. Songs with limited access are likely only getting played by grinders, and will come up as easy. Et cetera et cetera et cetera. There's a lot about DDR that makes this much more finicky than ITL was.

I counteracted some of the above by using a low cutoff of 800k and a high cutoff of 999k. (Hey, those numbers look familiar...) This helped a little. I then added a weighting to the regression that biases toward pairs of scores where the higher of the two is 990k, trailing off in either direction. (This produces the AAA threshold bias I talked about in the intro.) This helped a little more, especially in the mid-low difficulties. But the high end still needed some serious help -- PPR X was treating 990k on New York A as the equivalent of over 900k on Endymion challenge!
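The exact shape of that weighting isn't documented here, so the falloff below is my own guess; it just illustrates the idea of peaking when the higher of the two scores sits at the AAA threshold.

```python
import math

def pair_weight(score_a, score_b, center=990_000, width=40_000):
    # Hypothetical bell-shaped falloff around 990k; the real curve may differ
    higher = max(score_a, score_b)
    return math.exp(-((higher - center) / width) ** 2)
```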

... So I fudged it. After I compute the normalized spice ratings, the high end gets some help. I manually fudge the ratings upward along a set of linear transformations. I make all of the following scores quality-equivalent: 990k New York A (hardest 16), 975k Pluto the First (hardest 17), 950k Possession 20th (hardest 18), 825k OTP (hardest 19 except Endy and Lach), and 760k Endy. These exactly correspond to score floor requirements for LIFE4 5.0 emerald V.
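To make concrete what "quality-equivalent" means for those anchors: under the quality formula above, each anchor score pins down the spice its chart must carry relative to the others. The target quality below is an arbitrary example, and the resulting spice values are illustrative, not the published ones.

```python
import math

def spice_for_quality(score, target_quality):
    # Invert quality = spice - log2(missing fraction)
    missing_fraction = (1_000_000 - score) / 1_000_000
    return target_quality + math.log2(missing_fraction)

anchor_scores = {
    "New York A (16)": 990_000,
    "Pluto the First (17)": 975_000,
    "Possession 20th (18)": 950_000,
    "OTP (19)": 825_000,
    "Endy (19)": 760_000,
}

# Print the spice each anchor chart would need for its anchor score
# to land at the same (arbitrary) quality level
for chart, score in anchor_scores.items():
    print(chart, round(spice_for_quality(score, target_quality=20.0), 2))
```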

All in all I had a lot of fun with this. I thought a lot about DDR and learned a bunch of tech in the process. I plan to keep it running for the foreseeable future, and I hope you find it useful. If you've read this far, thank you! Now go smash some arrows :)