This blog, along with all the necessary data, is available in different formats (PDF, HTML, Rmd, and MD) on DataSharp’s GitHub.
1. Tour de France Legends: A Data Dive into Records
Our Tour de France blog series returns for a third and final episode, in which we will learn how to combine tables, manipulate strings, and capitalise on all acquired knowledge to uncover new Tour de France trivia facts.
While this post can be read on its own, many juicy details are fully detailed in our first and second posts using Tour de France data. We recommend checking these first.
Today, we’re taking things further.
What We Learned In The Previous Posts
In our explorations of Tour de France trivia facts, we learned about a few practical coding techniques that we will practice here.
- Cleaning datasets to extract usable numbers from messy strings.
- Aggregating information using
tapply()to infer group statistics. - Evaluating results mindfully to uncover richer and more accurate insights.
What We’re Tackling Next
In this final post on the Tour de France theme, we’re levelling up.
We will explore more complex questions such as:
- How many riders won the Tour de France without winning a stage in that year?
- Of the riders who won the most Tours, who rode the most kilometres?
These questions will stretch our data skills further, as we will:
❶ Combine information across multiple columns and tables.
❷ Further practice data cleaning to prepare our data for robust analysis.
Ready? Let’s dive in.
2. Setting The Scene
2.1 Installing / Loading Your Packages
As we saw in previous posts, R’s default base package contains
numerous functionalities. However, it is common to need additional
functionalities that can be loaded with packages. This analysis is no
different. Today, we will use three packages: readr,
stringr, and stringi.
Installing packages in R is straightforward:
if(!require('readr')) install.packages('readr')
if(!require('stringr')) install.packages('stringr')
if(!require('stringi')) install.packages('stringi')
Loading packages is commonly done with the
library(package_name) function, specifying the package
name. However, it only works if the package is already installed on your
machine. The code above is slightly longer, but it attempts to load the
package using require(). If the package is not found, it
will first attempt to install it.
2.2 Loading The Data
If you’ve been following along, you’re already familiar with the structure of our Tour de France dataset and how to load it into R. If not, check out the previous posts for a step-by-step walkthrough.
For this analysis, we will once again load our three tables:
finishers <- read.csv('./data/tdf_finishers.csv')
stages <- read.csv('./data/tdf_stages.csv')
tours <- readr::read_csv('./data/tdf_tours.csv', locale=readr::locale(encoding="UTF-8"))
## Rows: 109 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Dates, Stages, Distance
## dbl (3): Year, Starters, Finishers
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2.3 Cleaning The Data
We also identified specific issues with the riders’ names in the table. We fixed them as follows. Check our previous post for a detailed explanation of these concepts.
finishers$Rider <- stringi::stri_trans_general(finishers$Rider, "latin-ascii")
stages$Winner <- stringi::stri_trans_general(stages$Winner, "latin-ascii")
finishers[finishers$Rider == "Chris Froome (UK)", 'Rider'] <- "Chris Froome (GBR)"
stages[stages$Winner == 'Maurice Dewaele (BEL)', 'Winner'] <- 'Maurice De Waele (BEL)'
3. Tour de France Trivia
3.1 How Many Riders Won the Tour de France Without Winning Any Stage That Year?
This question is similar to a question we explored in the first post (How Many Riders Won the Tour de France Without Ever Winning a Stage?), adding a new twist: Did the winner also win a stage during that specific Tour? Our previous analysese could only detect the Tour winners who never won a stage, but not those who won a stage one year and their Tour another.
It’s a subtle but important distinction that requires us to check, for each edition, whether the overall winner claimed any stage victories that year.
Why does this matter? In real-world analysis, refining your question often changes your approach entirely. Here, we move from a simple list comparison to a task that requires merging datasets to align detailed records across tables. That’s why understanding and framing the problem should always come before coding.
Theoretical Detour: Merging Datasets
To address this question, we’ll merge our finishers and
stages tables using Year as the key:
finisherscontains one row per rider per Tour (with their final rank).stagescontains one row per stage per Tour (with the stage winner).
Since both share Year, we can merge them using
merge(). In our case, since the Year columns
in both tables have the exact same set of values, we can perform a
simple join without overthinking it. However, if this weren’t the case,
we would need to decide between:
left_join(keep all rows from the left table),right_join(keep all rows from the right table),inner_join(keep only rows present in both),- and a few other options.
I won’t cover these advanced joins here, but stay tuned for a future post where we’ll tackle them in a clear and practical way. In our case, all these joins would produce the same result.
Step 1: Merging Tables
Merged datasets can become very large. It is always wise to
consider whether you can filter your tables before
merging. In our case, we don’t need the entire
finishers table but only the rows corresponding to the
winners of each edition.
We thus create a finishers_onlywinners table containing
only the riders who finished ranked 1 in each Tour de
France:
finishers_onlywinners <- finishers[finishers$Rank ==1, ]
stage_finish <- merge(stages, finishers_onlywinners, by='Year')
tail(stage_finish, n=10)
## Year Date Stage Course
## 2227 2022 2022-07-14 12 Briançon to Alpe d'Huez
## 2228 2022 2022-07-15 13 Le Bourg-d'Oisans to Saint-Étienne
## 2229 2022 2022-07-16 14 Saint-Étienne to Mende
## 2230 2022 2022-07-17 15 Rodez to Carcassonne
## 2231 2022 2022-07-19 16 Carcassonne to Foix
## 2232 2022 2022-07-20 17 Saint-Gaudens to Peyragudes
## 2233 2022 2022-07-21 18 Lourdes to Hautacam
## 2234 2022 2022-07-22 19 Castelnau-Magnoac to Cahors
## 2235 2022 2022-07-23 20 Lacapelle-Marival to Rocamadour
## 2236 2022 2022-07-24 21 Paris La Défense Arena to Paris (Champs-Élysées)
## Distance Type Winner Rank
## 2227 165.5 km (102.8 mi) Mountain stage Tom Pidcock (GBR) 1
## 2228 193 km (120 mi) Flat stage Mads Pedersen (DEN) 1
## 2229 192.5 km (119.6 mi) Medium-mountain stage Michael Matthews (AUS) 1
## 2230 202.5 km (125.8 mi) Flat stage Jasper Philipsen (BEL) 1
## 2231 178.5 km (110.9 mi) Mountain stage Hugo Houle (CAN) 1
## 2232 130 km (81 mi) Mountain stage Tadej Pogacar (SLO) 1
## 2233 143.5 km (89.2 mi) Mountain stage Jonas Vingegaard (DEN) 1
## 2234 188.5 km (117.1 mi) Flat stage Christophe Laporte (FRA) 1
## 2235 40.7 km (25.3 mi) Individual time trial Wout van Aert (BEL) 1
## 2236 116 km (72 mi) Flat stage Jasper Philipsen (BEL) 1
## Rider Time Team
## 2227 Jonas Vingegaard (DEN) 79h 33' 20" Team Jumbo–Visma
## 2228 Jonas Vingegaard (DEN) 79h 33' 20" Team Jumbo–Visma
## 2229 Jonas Vingegaard (DEN) 79h 33' 20" Team Jumbo–Visma
## 2230 Jonas Vingegaard (DEN) 79h 33' 20" Team Jumbo–Visma
## 2231 Jonas Vingegaard (DEN) 79h 33' 20" Team Jumbo–Visma
## 2232 Jonas Vingegaard (DEN) 79h 33' 20" Team Jumbo–Visma
## 2233 Jonas Vingegaard (DEN) 79h 33' 20" Team Jumbo–Visma
## 2234 Jonas Vingegaard (DEN) 79h 33' 20" Team Jumbo–Visma
## 2235 Jonas Vingegaard (DEN) 79h 33' 20" Team Jumbo–Visma
## 2236 Jonas Vingegaard (DEN) 79h 33' 20" Team Jumbo–Visma
dim(stage_finish)
## [1] 2236 11
Since there are many stages in each Tour de France, the
stages table contained multiple rows per year, while
finishers_onlywinners had only one row per
year. During the merge, the information from
finishers_onlywinners was duplicated as many times as
necessary to align with each corresponding stage in
stages.
Thanks to this merge, each row now displays both the stage winner and the corresponding Tour de France winner for that year. Now, to address our question, we need to identify all the editions where there is no match between these two columns.
Step 2: Identify If The Winner Won A Stage
With the merge, we were able to align Tour winners with stage winners. If we directly compare these two columns, we will obtain a boolean vector that tells us whether the two names on each line are identical:
stage_finish$stage_and_tour <- stage_finish$Rider == stage_finish$Winner
head(stage_finish)
## Year Date Stage Course Distance
## 1 1903 1903-07-08 3 Marseille to Toulouse 423 km (263 mi)
## 2 1903 1903-07-12 4 Toulouse to Bordeaux 268 km (167 mi)
## 3 1903 1903-07-13 5 Bordeaux to Nantes 425 km (264 mi)
## 4 1903 1903-07-01 1 Paris to Lyon 467 km (290 mi)
## 5 1903 1903-07-05 2 Lyon to Marseille 374 km (232 mi)
## 6 1903 1903-07-18 6 Nantes to Paris 471 km (293 mi)
## Type Winner Rank Rider
## 1 Plain stage Hippolyte Aucouturier (FRA) 1 Maurice Garin (FRA)
## 2 Plain stage Charles Laeser (SUI) 1 Maurice Garin (FRA)
## 3 Plain stage Maurice Garin (FRA) 1 Maurice Garin (FRA)
## 4 Plain stage Maurice Garin (FRA) 1 Maurice Garin (FRA)
## 5 Stage with mountain(s) Hippolyte Aucouturier (FRA) 1 Maurice Garin (FRA)
## 6 Plain stage Maurice Garin (FRA) 1 Maurice Garin (FRA)
## Time Team stage_and_tour
## 1 94h 33' 14" La Française FALSE
## 2 94h 33' 14" La Française FALSE
## 3 94h 33' 14" La Française TRUE
## 4 94h 33' 14" La Française TRUE
## 5 94h 33' 14" La Française FALSE
## 6 94h 33' 14" La Française TRUE
We can see that the new column contains TRUE when Maurice Garin, who won the first 1903 Tour de France, also won a stage that year, and FALSE when he didn’t.
Our objective has thus now become: Can we find at least one TRUE in that column for each specific year?
As explained in the last Tour de France post, this type of problem calls for
tapply().
In this case, however, we will not use an existing function, but we will provide our own.
We are focused on identifying the years when the winner did not win
any stage, i.e. identifying the year when the
stage_and_tour column only contains FALSE values.
There are several ways to test that, but we will take advantage of
one property of the boolean FALSE, which is also equal to 0 in R. Only
FALSE values thus mean that the sum of the vector is 0 (0 + 0 + 0 + 0 +
0 ..). We will hence test for which years the sum of all boolean values
from stage_and_tour is equal to 0.
TRUE means that the Tour winner did not win any stage (sum = 0), while FALSE means that he won at least one stage.
tour_but_no_stage <- tapply(stage_finish$stage_and_tour, stage_finish$Year, function(x) return(sum(x) == 0))
## Since tour_but_no_stage is a boolean vector, I can use it to subset itself
## to get the years when the overall winner did not win a single stage
tour_but_no_stage[tour_but_no_stage]
## 1904 1922 1956 1960 1966 1990 2006 2017 2019
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
🚴♂️ We can conclude that 9 Tour de France winners won the overall Tour without winning a single stage in the year they won.
Bonus Step: Can We Identify The Name of These 9 Riders?
Since we identified the years when these peculiar wins occurred, we
can subset one of our tables that contains both Year and
Rider (finishers_onlywinner or
stage_finish).
tour_but_no_stage <- names(tour_but_no_stage[tour_but_no_stage])
tour_but_no_stage
## [1] "1904" "1922" "1956" "1960" "1966" "1990" "2006" "2017" "2019"
winners_no_stage <- unique(stage_finish[stage_finish$Year %in% tour_but_no_stage, c('Year', "Rider")])
winners_no_stage
## Year Rider
## 7 1904 Henri Cornet (FRA)
## 199 1922 Firmin Lambot (BEL)
## 804 1956 Roger Walkowiak (FRA)
## 897 1960 Gastone Nencini (ITA)
## 1037 1966 Lucien Aimar (FRA)
## 1641 1990 Greg LeMond (USA)
## 1880 2006 Oscar Pereiro (ESP)
## 2111 2017 Chris Froome (GBR)
## 2153 2019 Egan Bernal (COL)
The final step consists of checking that no typos persist in the Tour or stage winner names. Since we do not have many names to check, we can do it manually.
In this list of 9 winners, we recognised the names of the three riders we previously identified as Tour winners without ever winning stages (Henri Cornet, Roger Walkowiak, and Egan Bernal). We can therefore visually inspect our data for the remaining six riders:
for(y in 1:nrow(winners_no_stage)) {
if(! winners_no_stage[y, 'Rider'] %in% c("Henri Cornet (FRA)", 'Roger Walkowiak (FRA)', 'Egan Bernal (COL)')){
print(paste("Winner name", winners_no_stage[y, 'Rider'], sep=": "))
print(unique(stage_finish[stage_finish$Year == winners_no_stage[y, 'Year'], ]$Winner))
print(noquote(""))
}
}
## [1] "Winner name: Firmin Lambot (BEL)"
## [1] "Philippe Thys (BEL)" "Romain Bellenger (FRA)" "Robert Jacquinot (FRA)"
## [4] "Jean Alavoine (FRA)" "Federico Gay (ITA)" "Felix Sellier (BEL)"
## [7] "Emile Masson (BEL)}" "Emile Masson (BEL)"
## [1]
## [1] "Winner name: Gastone Nencini (ITA)"
## [1] "Jean Graczyk (FRA)" "Julien Schepens (BEL)"
## [3] "Roger Riviere (FRA)" "Rene Privat (FRA)"
## [5] "Nino Defilippis (ITA)" "Andre Darrigade (FRA)"
## [7] "Graziano Battistini (ITA)" "Martin van Geneugden (BEL)"
## [9] "Kurt Gimmi (SUI)" "Michel Van Aerde (BEL)"
## [11] "Louis Proost (BEL)" "Fernando Manzaneque (ESP)"
## [13] "Rolf Graf (SUI)" "Pierre Beuffeuil (FRA)"
## [1]
## [1] "Winner name: Lucien Aimar (FRA)"
## [1] "Edy Schutz (LUX)" "Ferdinand Bracke (BEL)"
## [3] "Franco Bitossi (ITA)" "Rudi Altig (FRG)"
## [5] "Guido Reybrouck (BEL)" "Televizier-Batavus"
## [7] "Gerben Karstens (NED)" "Willy Planckaert (BEL)"
## [9] "Edward Sels (BEL)" "Henk Nijdam (NED)"
## [11] "Pierre Beuffeuil (FRA)" "Guido Marcello Mugnaini (ITA)"
## [13] "Georges Vandenberghe (BEL)" "Jo de Roo (NED)"
## [15] "Raymond Poulidor (FRA)" "Luis Otano (ESP)"
## [17] "Julio Jimenez (ESP)" "Tommaso de Pra (ITA)"
## [19] "Albert Van Vlierberghe (BEL)"
## [1]
## [1] "Winner name: Greg LeMond (USA)"
## [1] "Massimo Ghirotto (ITA)" "Johan Museeuw (BEL)"
## [3] "Olaf Ludwig (GDR)" "Thierry Marie (FRA)"
## [5] "Frans Maassen (NED)" "Panasonic-Sportlife"
## [7] "Moreno Argentin (ITA)" "Gerrit Solleveld (NED)"
## [9] "Jelle Nijdam (NED)" "Raul Alcala (MEX)"
## [11] "Erik Breukink (NED)" "Eduardo Chozas (ESP)"
## [13] "Thierry Claveyrolat (FRA)" "Gianni Bugno (ITA)"
## [15] "Miguel Indurain (ESP)" "Dimitri Konychev (URS)"
## [17] "Marino Lejarreta (ESP)" "Charly Mottet (FRA)"
## [19] "Guido Bontempi (ITA)"
## [1]
## [1] "Winner name: Oscar Pereiro (ESP)"
## [1] "Thor Hushovd (NOR)" "Jimmy Casper (FRA)"
## [3] "Robbie McEwen (AUS)" "Matthias Kessler (GER)"
## [5] "Oscar Freire (ESP)" "Serhiy Honchar (UKR)"
## [7] "Sylvain Calzati (FRA)" "Juan Miguel Mercado (ESP)"
## [9] "Denis Menchov (RUS)" "Yaroslav Popovych (UKR)"
## [11] "Jens Voigt (GER)" "Pierrick Fedrigo (FRA)"
## [13] "Frank Schleck (LUX)" "Michael Rasmussen (DEN)"
## [15] "Carlos Sastre (ESP)" "Matteo Tosatto (ITA)"
## [1]
## [1] "Winner name: Chris Froome (GBR)"
## [1] "Geraint Thomas (GBR)" "Marcel Kittel (GER)"
## [3] "Peter Sagan (SVK)" "Arnaud Demare (FRA)"
## [5] "Fabio Aru (ITA)" "Lilian Calmejane (FRA)"
## [7] "Rigoberto Uran (COL)" "Romain Bardet (FRA)"
## [9] "Warren Barguil (FRA)" "Michael Matthews (AUS)"
## [11] "Bauke Mollema (NED)" "Primoz Roglic (SLO)"
## [13] "Edvald Boasson Hagen (NOR)" "Maciej Bodnar (POL)"
## [15] "Dylan Groenewegen (NED)"
## [1]
These six riders conclusively did not win any stage in the year of their win.
📝 Note: There are prettier ways to do such checks. However, we used quick and dirty code to demonstrate that it is not always necessary to seek a pretty solution, especially when performing internal checks. If you’re asked for some insights from datasets, nobody will care how you arrived at them. The results you found and the confidence you have in them are what matters. So when in doubt, write a quick and dirty check to ensure the data supports your results. A few manual checks are perfectly fine.
3.2 Of the Riders Who Have Won the Most Tour de France, Who Rode the Most Kilometres?
A handful of cyclists share the record for the most Tour de France wins. But among them, who had to ride the most kilometres to achieve those victories?
Let’s say the record was 8 wins (it’s not — but imagine it was). We’d like to know which rider covered the most kilometres across their winning editions. That’s our challenge.
To tackle this, we can break the problem into manageable parts:
- Identify the current maximum number of wins.
- Retrieve the list of riders with that number of wins.
- Extract the distance of each Tour de France from the
tourstable. - Sum the distances for each winning rider.
Step 1: Find the Max Number of Tour Wins
We start by identifying how many times the most decorated riders have won the Tour.
We first subset the finishers table to those riders who
ranked first. Then we count how many times each name appears in the list
of winners, and we conclude by extracting the names of those who won the
most.
max_wins <- max(table(finishers[finishers$Rank ==1, "Rider"]))
paste("The maximum number of Tour victories is", max_wins)
## [1] "The maximum number of Tour victories is 5"
max_winners <- table(finishers[finishers$Rank == 1, "Rider"])
max_winners <- max_winners[max_winners == max_wins]
names(max_winners)
## [1] "Bernard Hinault (FRA)" "Eddy Merckx (BEL)" "Jacques Anquetil (FRA)"
## [4] "Miguel Indurain (ESP)"
➡️ 4 riders share the record for most Tour de France wins (5). But who among them had to ride the most kilometres?
Step 2: Clean the Distance Column
The tours table contains the total distance for each
edition, but the formatting isn’t quite ready for analysis.
Let’s peek at the data:
head(tours)
## # A tibble: 6 × 6
## Year Dates Stages Distance Starters Finishers
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 1903 "1\x9619 July 1903" 6 "2,428\xa0km (1,… 60 21
## 2 1904 "2\x9624 July 1904" 6 "2,428\xa0km (1,… 88 15
## 3 1905 "9\x9630 July 1905" 11 "2,994\xa0km (1,… 60 24
## 4 1906 "4\x9629 July 1906" 13 "4,637\xa0km (2,… 82 14
## 5 1907 "8 July \x96 4 August 1907" 14 "4,488\xa0km (2,… 93 33
## 6 1908 "13 July \x96 9 August 1908" 14 "4,497\xa0km (2,… 112 36
You’ll notice unusual characters in the Distance column.
We’ve seen this before.
It’s time to standardise data again.
tours$Distance <- stringi::stri_trans_general(tours$Distance, "latin-ascii")
head(tours)
## # A tibble: 6 × 6
## Year Dates Stages Distance Starters Finishers
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 1903 "1\x9619 July 1903" 6 2,428�km (1,509�… 60 21
## 2 1904 "2\x9624 July 1904" 6 2,428�km (1,509�… 88 15
## 3 1905 "9\x9630 July 1905" 11 2,994�km (1,860�… 60 24
## 4 1906 "4\x9629 July 1906" 13 4,637�km (2,881�… 82 14
## 5 1907 "8 July \x96 4 August 1907" 14 4,488�km (2,789�… 93 33
## 6 1908 "13 July \x96 9 August 1908" 14 4,497�km (2,794�… 112 36
The Distance column contains entries like
"3,414.4 km (1,121.6 mi)". We want to extract only the
first number (in km).
We can thus split the distance by a shared pattern among all the distances, which is ” km”. However, that space-like character is not really a space.
At this point, I am not sure which ASCII character
stringi::stri_trans_general() chose to replace the
non-ASCII one. But it doesn’t matter.
We do not need to know what it is to use it.
All we need to do is use one of the distance strings as a template for that pattern by cutting it where necessary. Let’s use the distance of the Tour 2022, which does not contain a decimal value.
tours$Distance[109]
## [1] "3,328�km (2,068�mi)"
str_pattern <- substr(tours$Distance[109], 6, 8)
str_pattern
## [1] "�km"
We cut the string betweeen the 6th and 8th positions, which correspond to the pattern we seek.
We can then inject this pattern in stringr::str_split()
to extract the distance in km.
tours$Distance_clean <- stringr::str_split(tours$Distance, str_pattern, simplify=TRUE)[, 1]
tail(tours$Distance_clean)
## [1] "3,540" "3,349" "3,366" "3,484" "3,414.4" "3,328"
We’re almost there!
You can see that the Distance_clean field is still a
character. We need to remove the commas and force the strings to
numerical values.
tours$Distance_clean <- as.numeric(stringr::str_replace(tours$Distance_clean, ",", ""))
tail(tours$Distance_clean)
## [1] 3540.0 3349.0 3366.0 3484.0 3414.4 3328.0
Step 3: Merge datasets
Then, we only have to merge the tours and
stage_finish tables, the latter being restricted to the 4
riders who won 5 Tours. For reading purposes, we also sort the table
using the order(, decreasing=TRUE) function to get the
longest Tours at the top.
all_dtst <- merge(stage_finish[stage_finish$Rider %in% names(max_winners), ], tours, by='Year')
all_dtst <- unique(all_dtst[, c('Year', 'Rider', 'Distance_clean')])
all_dtst[order(all_dtst[, 'Distance_clean'], decreasing=TRUE), ]
## Year Rider Distance_clean
## 1 1957 Jacques Anquetil (FRA) 4669
## 94 1964 Jacques Anquetil (FRA) 4504
## 25 1961 Jacques Anquetil (FRA) 4397
## 47 1962 Jacques Anquetil (FRA) 4274
## 145 1970 Eddy Merckx (BEL) 4254
## 71 1963 Jacques Anquetil (FRA) 4138
## 119 1969 Eddy Merckx (BEL) 4117
## 349 1985 Bernard Hinault (FRA) 4109
## 224 1974 Eddy Merckx (BEL) 4098
## 396 1992 Miguel Indurain (ESP) 3983
## 439 1994 Miguel Indurain (ESP) 3978
## 373 1991 Miguel Indurain (ESP) 3914
## 251 1978 Bernard Hinault (FRA) 3908
## 199 1972 Eddy Merckx (BEL) 3846
## 276 1979 Bernard Hinault (FRA) 3765
## 301 1981 Bernard Hinault (FRA) 3753
## 418 1993 Miguel Indurain (ESP) 3714
## 461 1995 Miguel Indurain (ESP) 3635
## 174 1971 Eddy Merckx (BEL) 3608
## 326 1982 Bernard Hinault (FRA) 3507
Of the four Tour winners with the most victories, Jacques Anquetil, known as Maître Jacques, has won the longest. If we look beyond the first place, we also notice that he actually won 5 of the 6 longest Tours listed here. This is not surprising for those familiar with cycling, as there has been a decreasing trend over the past 100 years.
We can make a simple plot to illustrate this trend:
plot(tours$Year, tours$Distance_clean)
i=2
for(rider in names(max_winners)) {
points(all_dtst[all_dtst$Rider == rider,
c('Year', 'Distance_clean')],
col=i, pch=i+13, cex=1.5)
i <- i+1
}
legend("topright", col=2:5, pch=15:18, names(max_winners))
Note: The i variable used in the for
loop is only here to select colours and symbols. The loop variable is
rider.
Step 4: Calculate the total distance across all wins
Finally, we can calculate the total distance associated with the 5
winning Tours of each of these champions, using the
tapply() function we discovered in the
previous blog post.
sort(tapply(all_dtst$Distance_clean, all_dtst$Rider, sum))
## Bernard Hinault (FRA) Miguel Indurain (ESP) Eddy Merckx (BEL)
## 19042 19224 19923
## Jacques Anquetil (FRA)
## 21982
Maitre Jacques rode almost 22,000 km to win his five Tours, while Bernard Hinault rode “only” 19,000 km.
4. Food for thought
- R base is more powerful than it looks. With a bit
of care, basic functions like
substr(),tapply(), andwhich.min()can take you surprisingly far, even with messy data and vague questions. - Break big problems into smaller ones. Identifying smaller tasks makes analysis easier. Don’t aim to “answer the question” in one go. List out the steps and take them one at a time.
- All roads lead to Rome. There’s no single “correct” way to solve a data problem. Write the code that makes sense to you. It doesn’t need to be fast, beautiful, or perfect. But it must be clear and correct.
- Know what you’re doing and do it with confidence. Take your time, double-check assumptions, and move forward when things make sense. You don’t need to rush to look competent; clarity is competence.
💛 Key takeaway: Good data analysis is not just about getting the numbers, but about understanding why they are correct, how you arrived at them, and being mindful of the hidden details in your data along the way.
5. What’s Next?
This was the last post on this Tour de France theme. We hope you enjoyed discovering a few data analysis tricks, expanding your data scientist mindset, and learning some handy facts for your next trial pursuit game night with friends.
In the next blog post series, we will continue to use sports data but combine it with the video game universe.
HINT: ⚽️ x 💧🔥🌱
👉 Stay tuned, and if you have questions or examples you’d like us to cover in the next post, drop them in the comments below!
📝 Your learning challenge until then: Can you identify the maximum number of stage wins for any given Tour win?
This blog, along with all the necessary data, is available in different formats (PDF, HTML, Rmd, and MD) on DataSharp’s GitHub.


