Wouldn’t it be great to have a tool that runs our hard searches for correlations and comes up with the brightest insights available? Well, SAP states that the Smart Discovery add-on of SAP Analytics Cloud (SAC) is such an innovation. But is it really like that? Does it really come up with analyses nobody would think of? And are those analyses actually of high quality? I was really happy to put that to the test. I have invested a lot of my time in analyzing data in R, and this is what I will use to “unravel” Smart Discovery (well, to the best of my abilities and in my opinion, that is).
The analysis I set up looks at what actually influences the Value of a soccer player. My main reason for starting it was the urge to predict a player’s value (so, a player’s worth). To do this, I used the FIFA 2019 dataset and predicted values with a model trained on 60% of the full set. Predicting the correct value turned out to be way more difficult than I thought: only 1286 of a total of 7372 predictions were correct, which is only about 17% of all cases (of course this is kind of logical, as there are so many factors influencing a player’s value that are not included in this fairly limited dataset…).
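To make the split and the hit rate concrete, here is a minimal sketch in Python (the original analysis was done in R, so this is an illustration and not my actual script). The helper `split_rows`, the seed and the 100 dummy rows are assumptions made up for the example; only the two counts come from the text above.

```python
import random


def split_rows(rows, train_fraction=0.6, seed=42):
    """Shuffle the rows reproducibly and cut them into a training and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the real dataset: 100 dummy row ids (hypothetical data).
train, test = split_rows(range(100))
print(len(train), len(test))  # 60 40

# Share of correct predictions, using the two counts quoted in the text.
correct, total = 1286, 7372
hit_rate = correct / total
print(f"{hit_rate:.1%}")  # 17.4%
```

What counts as a “correct” prediction depends on the tolerance you allow around the real value, so the hit rate is only meaningful together with that definition.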
Nevertheless, it made me curious: I wanted to find out WHAT influences a player’s value and find the relations. It is an analysis for which I hoped SAC Smart Discovery would give me some outcomes beyond the ones I had already created over time…
In order to run Smart Discovery in SAP Analytics Cloud (SAC), you select a measure or dimension you want to know “more about”, as SAP puts it. Using a player’s Value as the “I want to know more about” setting, just 1 page is created (in SAP’s “selling story” it is actually 4…), and it looks like the image below. My remarks are already included in orange.
So there are actually some really interesting and fair aspects in the Smart Discovery outcome above:
But as my short analysis in the picture above already shows, there are 2 charts that really raised my eyebrows: “Value by Preferred Foot” and “Association between Jersey Number and Value by Club”. Let’s dive into both of them and see if those charts actually present the correct data…
Apparently, whether you have a high value as a player depends on your preferred foot (so whether you are a lefty or a righty). I would think “both” would be the most valuable, though SAC states that a player is more valuable when he is right-footed. But is this actually true? Or is this chart nothing more than a misleading visual?
The chart insinuates that when you are a “righty”, your value as a player is way higher. Looking at the chart logically, I concluded that the value presented is the SUM of values. That is not the correct way of presenting an analysis of Value by Preferred Foot, if you ask me. It is generally known that there are more right-footed players than left-footed ones. Let’s use the analysis in RStudio using R (which is my backup for this whole document) to show whether this is correct:
So in this case it would be logical to calculate the average player value per preferred foot. This results in the following:
|Preferred Foot||Average Value||Difference L vs R|
So it can actually be said that, on average, you are more valuable as a left-footed player, as the numbers above don’t lie…
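The difference between the SUM and the average is easy to reproduce. The sketch below is in Python rather than the R I used, and the six players and their values are made up purely for illustration (they are not rows from the FIFA dataset).

```python
from collections import defaultdict

# (preferred foot, value in millions): made-up numbers, not FIFA data.
players = [
    ("Right", 10.0), ("Right", 8.0), ("Right", 6.0), ("Right", 4.0),
    ("Left", 12.0), ("Left", 9.0),
]

totals = defaultdict(float)
counts = defaultdict(int)
for foot, value in players:
    totals[foot] += value
    counts[foot] += 1

# The average corrects for the fact that one group simply has more players.
averages = {foot: totals[foot] / counts[foot] for foot in totals}

for foot in totals:
    print(foot, counts[foot], totals[foot], averages[foot])
```

In this toy set the right-footed group wins on the SUM only because it has more players, while the left-footed group has the higher average. That is the same effect as in the chart.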
Note that the chart in SAC emphasizes that “Position ST” influences the chart the most. Just for your information: ST stands for striker. So let’s see whether this is merely a logical consequence of the data or whether Smart Discovery actually came up with an interesting point.
In total there are 27 positions, ranging from goalkeeper to striker and covering all possible positions on the field. Of all those positions, ST has the most players (namely 2145). And when looking at the top 10 most valuable players on the ST position, the names of those players make it even clearer why they have such an influence on the chart:
So SAC insinuates that ST has a high influence on the chart, and that is not incorrect. The “only” part it missed is that there are more Strikers than, for example, Right Attacking Midfielders (RAM). That head count, together with the valuable players in the list of strikers, is why they have the highest influence on the Value. The position with the highest average value is not the Striker, as can be seen in the chart below created in RStudio using ggplot, but the LF position (Left Forward), which includes only 15 players, among them Hazard, Dybala and Iniesta.
Due to the low number of players, LF is not seen as a large influence on the value, but looking at the average value, this is a position that needs to be taken into account. And also in this case: the total list of LF players contains only 3 left-footed players…
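One way to see how much of a position’s “influence” is just head count is to put its share of all players next to its share of the total value. The sketch below does that in Python; the function name `position_summary` and all the numbers are assumptions for illustration, not the real FIFA figures.

```python
from collections import defaultdict

# (position, value in millions): made-up numbers, not FIFA data.
players = [
    ("ST", 9.0), ("ST", 7.0), ("ST", 5.0), ("ST", 4.0), ("ST", 3.0), ("ST", 2.0),
    ("LF", 14.0), ("LF", 10.0),
    ("RAM", 5.0), ("RAM", 3.0),
]


def position_summary(rows):
    """Return count, total, mean and both shares per position."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for position, value in rows:
        totals[position] += value
        counts[position] += 1
    all_players = sum(counts.values())
    all_value = sum(totals.values())
    return {
        position: {
            "n": counts[position],
            "total": totals[position],
            "mean": totals[position] / counts[position],
            "player_share": counts[position] / all_players,
            "value_share": totals[position] / all_value,
        }
        for position in totals
    }


summary = position_summary(players)
biggest_total = max(summary, key=lambda p: summary[p]["total"])
highest_mean = max(summary, key=lambda p: summary[p]["mean"])
print(biggest_total, highest_mean)
```

A position whose value share is well above its player share (LF in this toy set) is the interesting one, even when a bigger group dominates the total.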
Though it must be said: SAP’s chart is correct. The total SUM of value for right-footed players is way bigger, but this is obvious, as there are more right-footed players. The influence of Strikers is also the largest, but this is logical too, as this is the position that includes the most players. But the question is: does the chart bring you any insights? Or is it the analysis we just did that does?
Let’s dive into the more complex chart to analyze: the correlation between jersey number and value…
The analysis of a so-called “relation” (or association, as SAP refers to it) between Jersey Number, Club and Value is a little bit trickier. When I first saw this chart I was like: “hell no, no way you are more valuable when playing with number 10”.
The first thing that really surprises me is that Smart Discovery actually summed up the Jersey Numbers. This means that when a Club has all its players playing with Jersey Numbers over 30 instead of low numbers like 2, it moves further along the x-axis. So we are not even looking at an association between Jersey Number and Value, if I may say so. But leaving this first strange aspect aside, let’s conduct an analysis to see whether there is some sort of truth in this chart. For that I go by the title of the chart rather than by the data points. So I will be looking for the association between Jersey Number and Value by Club.
Apparently Real Madrid is the most valuable club, as it is the highest dot on the y-axis. To check this, I lined up the top 10 most valuable clubs in RStudio using the ggplot R package, generating the following chart:
A club’s value is calculated as the sum of all its players’ values. So the chart in R corresponds to the y-axis of the SAC chart. The higher a club is on the y-axis, the more valuable it is.
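To show what summing per club does to both axes, here is a small Python sketch. It assumes, as I read the chart, that the x-axis is the sum of Jersey Numbers per club and the y-axis the sum of player values per club. The two clubs, the helper `club_axes` and all numbers are invented for the example.

```python
from collections import defaultdict

# (club, jersey number, value in millions): made-up numbers, not FIFA data.
players = [
    ("Club A", 1, 10.0), ("Club A", 2, 20.0), ("Club A", 3, 30.0),
    ("Club B", 31, 10.0), ("Club B", 32, 20.0), ("Club B", 33, 30.0),
]


def club_axes(rows):
    """Return {club: (sum of jersey numbers, sum of values)}."""
    number_sum = defaultdict(int)
    value_sum = defaultdict(float)
    for club, number, value in rows:
        number_sum[club] += number
        value_sum[club] += value
    return {club: (number_sum[club], value_sum[club]) for club in number_sum}


axes = club_axes(players)
for club, (x, y) in axes.items():
    print(club, x, y)
```

Both clubs have exactly the same total value, yet Club B lands far to the right only because its squad happens to use high numbers. The x-axis position says something about numbering habits and squad size, not about individual players.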
In order to know if a player is more valuable when wearing a certain number, it is good to look at the top 4 players (Neymar Jr, De Bruyne, Messi and Hazard), who are highlighted in the chart below. These are the players with a value over 90.000.000, and with that the 4 most valuable players.
Alright, knowing the names of the top 4 players, we can have a look at the Jersey Numbers.
|Player||Jersey Number||Value||Overall Rating||Club|
|Neymar Jr||10||118.500.000||92||Paris Saint Germain|
|L. Messi||10||110.500.000||94||FC Barcelona|
|K. De Bruyne||7||102.000.000||91||Manchester City|
When you see the list like this, it seems that when you play with Jersey Number 10, you are very valuable. Or rather: the most valuable players appear to mostly play with number 10. How you see the relation depends on how you formulate it… But does that mean that all players who play with number 10 are as valuable? No, of course not. This list of the bottom 10 players with Jersey Number 10 confirms it:
|Player||Jersey Number||Value||Overall Rating||Club|
|M. Etxeberria||10||0||74||No Club|
|I. Kovacs||10||0||73||No Club|
|S. Nakamura||10||0||72||Jubilo Iwata|
|J. Campos||10||0||71||No Club|
|B. Nivet||10||0||71||ESTAC Troyes|
|A. De Jong||10||0||59||No Club|
|B. Singh||10||0||58||No Club|
|R. Cretaro||10||40.000||57||Sligo Rovers|
|Ryan Yong Gi||10||50.000||58||Vegalta Sendai|
|K. Brennan||10||60.000||60||St. Patrick’s Athletic|
From this list, the names do not ring a bell (at least not to me), but they all play with Jersey Number 10. So it is obvious that the Jersey Number by itself does not determine your value as a player. Though when you play for an important club from the Top 10, it is possible that the Jersey Number influences the value. Or is it the other way around? Does the player with the high value choose his Jersey Number? In that case it is the player’s personal influence, rather than the Jersey Number, that goes with the high value. Playing with Jersey Number 10 for a club like VVV Venlo doesn’t equal a value of 118.500.000 like Neymar Jr has.
And a list of correlations calculated with the Spearman method (the dataset has large outliers, hence Spearman) shows there is close to no correlation (SAP’s association) between a player’s value and his Jersey Number:
|Jersey Number||-0.1779670|
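For readers who want to see what the Spearman method does: it replaces the values by their ranks (ties get the average rank) and then computes the ordinary Pearson correlation on those ranks, which is why large outliers matter so little. Below is a minimal Python sketch of that idea using only the standard library. In R this is a one-liner with `cor(x, y, method = "spearman")`; the function names and the sample numbers here are my own illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson on the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: one huge outlier in y does not disturb the rank correlation.
jersey_numbers = [1, 2, 3, 4]
values = [10, 20, 30, 1000]
print(round(spearman(jersey_numbers, values), 6))
```

A result close to 0, like the one in the table above, means that knowing the rank of a player’s Jersey Number tells you very little about the rank of his value.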
On top of that, Jersey Numbers are often linked to the position on the field, so suggesting there is a relation between Jersey Number and value is rather strange. Lately in the football industry, as players have more and more to do with branding and merchandise, they prefer to keep their Jersey Number the same, even when they switch clubs. So the personal influence of a player on his Jersey Number is only growing, BUT only to the point where the CLUB actually thinks the player is valuable enough to get his preferred Jersey Number.
Based on the chart provided by SAC Smart Discovery, even more analyses are possible, and there are way more things I could analyze to show the “accuracy” of the chart provided. Though with this start, I think I made a fair point about not always believing what you see at first sight…
SAC Smart Discovery can be helpful for getting some basic insights, though I would recommend not following it without a second thought. Smart Discovery is not a human: whatever data you drag into this part of the tool, it treats it the way it always does, and what it actually does in the background remains a black box. A measure is a measure, a dimension a dimension, and that’s it. Make sure you know your data before just “accepting” what SAC Smart Discovery brings back to you. As you can see, it is not always what it looks like!
With that, I need to point out that the charts created in SAC are not incorrect, though they do not add a lot of value to the analysis. The charts are basic, though the titles can be very misleading (look at the one with the association between Jersey Number and Value…). In order to create the “perfect SAP selling speech Smart Discovery outcome”, data has to be set up in a very specific way (like SAP did for their code jams in order to show the usefulness of Smart Discovery). In many cases, though, the data YOU use differs, which makes Smart Discovery less useful than suggested.
But please, if we differ in opinion, I would like to invite you to exchange thoughts and dig deeper into it together. Also, if you would like to receive the R analysis backing up this document, don’t hesitate to contact me at firstname.lastname@example.org.