---
title: "French given names per year per department"
author: "Jolahn VAUDEY"
date: "October, 2021"
output:
  pdf_document: default
  html_document:
    df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
# The environment
library(ggplot2)
library(readr)
library(dplyr)
version
```
## Download Raw Data from the website
```{r}
# Download the archive only once; the local name matches the remote file
file = "dpt2020_csv.zip"
if (!file.exists(file)) {
  download.file("https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2020_csv.zip",
                destfile = file)
}
unzip(file)
```
## Build the Dataframe from file
```{r}
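# Read the birth year as text, so that the "XXXX" marker is kept as-is
FirstNames <- read_delim("dpt2020.csv", delim = ";",
                         col_types = cols(annais = col_character()))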
```
```{r}
FirstNames
```
## Specificity of the data
As shown above, the data contains rows whose name is `_PRENOMS_RARES`, which do not correspond to any single given name.
There are also rows with `XXXX` as the birth year, and I cannot say from the file alone what this value corresponds to.
We will have to keep these particularities in mind during the analysis.
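To gauge how much these special rows weigh before filtering them out, one can tally them by category. The following is only a minimal sketch: `toy` is a made-up table with invented counts that mimics the columns used in this report, and the `kind` labels are hypothetical names chosen for illustration.

```r
library(dplyr)

# Hypothetical rows mimicking the columns of the file (all values are invented).
toy <- tibble(
  sexe     = c(1, 1, 2, 2, 1),
  preusuel = c("MICHEL", "_PRENOMS_RARES", "MARIE", "MARIE", "KEVIN"),
  annais   = c("1950", "1950", "XXXX", "1900", "XXXX"),
  nombre   = c(120, 30, 4, 200, 6)
)

# Label each row, then count rows and births per label.
special <- toy %>%
  mutate(kind = case_when(
    preusuel == "_PRENOMS_RARES" ~ "rare names",
    annais == "XXXX"             ~ "year XXXX",
    TRUE                         ~ "regular"
  )) %>%
  group_by(kind) %>%
  summarise(rows = n(), births = sum(nombre))

special
```

Applied to `FirstNames` instead of `toy`, the same pipeline would show what share of the recorded births is set aside by the filters used in the rest of the analysis.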
## Frequency analysis
In this first part, we will observe how the frequency of some first names evolved from 1900 to 2020, starting with Michel.
To do so, we need to reorganize the data: we only keep the rows for the name "Michel" and then, using a group by, sum the number of births for each year, as the department does not matter for this analysis.
Implicitly, this means that a year in which nobody named Michel was born will not appear in the resulting data.
```{r}
MichelData = FirstNames[FirstNames$preusuel=="MICHEL",]
MichelData = MichelData[MichelData$annais!="XXXX",]
MichelData = MichelData %>% group_by(annais) %>% summarise(nombre = sum(nombre))
MichelData
```
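Because the grouping above silently drops years without any birth, a year with zero births and a year missing from the table look the same. If explicit zeros are wanted, one possible approach is to join the counts onto a reference table listing every year. This is a sketch on invented numbers (`counts`, `all_years` and `filled` are hypothetical names), with the period shortened to keep the output small.

```r
library(dplyr)

# Hypothetical yearly counts for one name; 1902 and 1904 are deliberately absent.
counts <- tibble(annais = c("1900", "1901", "1903"), nombre = c(3, 5, 2))

# Reference table with every year of the period of interest (shortened here).
all_years <- tibble(annais = as.character(1900:1904))

# Keep every reference year and replace the missing counts with zero.
filled <- all_years %>%
  left_join(counts, by = "annais") %>%
  mutate(nombre = replace(nombre, is.na(nombre), 0))

filled
```

The years are kept as text here on the assumption that `annais` was read as a character column; with a numeric column, `1900:1904` could be used directly.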
Now that the data of interest is collected in this structure, we can plot its content.
We will use a simple scatter plot: the data is discrete, and joining the points with lines would suggest values between the years that we do not have.
We will also force the x-axis to cover the whole period from 1900 to 2020, even if the name is only present for a shorter period of time.
```{r}
#library(ggplot2)
# Multiple versions of the graphic have been tried:
# Not good, the x axis is too crowded
#ggplot(MichelData) + geom_point(aes(x=annais, y=nombre), color="blue")
# This option is preferred, as all the data is presented.
plot(MichelData, main="Use of the first name Michel in France from 1900 to 2020",
     xlim=c(1900, 2020), xlab="Year of birth", ylab="Number of births",
     pch=19, col="blue", cex.main=0.8)
# Lacks information about the data (no points)
#plot(MichelData, main="Use of the first name Michel in France from 1900 to 2020",
#     xlim=c(1900, 2020), xlab="Year of birth", ylab="Number of births",
#     type="l", pch=19, col="blue")
# The line between the points is badly rendered here.
#plot(MichelData, main="Use of the first name Michel in France from 1900 to 2020",
#     xlim=c(1900, 2020), xlab="Year of birth", ylab="Number of births",
#     type="b", pch=19, col="blue")
```
Here, we can observe that Michel grew in usage from 1920 to around 1950, when it reached its peak in popularity, then plummeted until approximately 1975, and has kept declining at a slower pace since.
We can now compare it to some other first names.
We use the exact same method to plot the corresponding data, so we define a function to do it.
```{r}
PlotSpecificFirstName <- function(name){
  # Keep the rows for this name, drop the "XXXX" years, then sum over departments
  nameData = FirstNames[FirstNames$preusuel==name,]
  nameData = nameData[nameData$annais!="XXXX",]
  nameData = nameData %>% group_by(annais) %>% summarise(nombre = sum(nombre))
  plot(nameData, main=paste("Use of the first name", name, "in France from 1900 to 2020"),
       xlim=c(1900, 2020), xlab="Year of birth", ylab="Number of births",
       pch=19, col="blue", cex.main=0.8)
}
PlotSpecificFirstName("KEVIN")
PlotSpecificFirstName("ETHAN")
PlotSpecificFirstName("MICHEL")
PlotSpecificFirstName("NICOLAS")
PlotSpecificFirstName("PAMELA")
PlotSpecificFirstName("MARINE")
PlotSpecificFirstName("MARIE")
PlotSpecificFirstName("EMMA")
```
Contrary to Michel, which already existed in France in 1900 (even if it was not that popular), Kevin and Ethan are a lot newer: the former appeared during the 50s, while the latter only started in the 90s. Nicolas' curve looks a lot more like Michel's, only shifted to the right. In general, it seems that first names go through a period during which they grow in popularity until they reach a peak, and then begin to decline. How steep this curve is seems to vary quite a bit, however.
For feminine names, like Marine, Emma and Pamela, the same rule seems to apply, even if the curve for Pamela specifically has quite a funny shape (it coincides with the popularity of the TV show "Dallas"). Marie's case is more of an outlier, but that is to be expected, as the name has a specific significance for Christians. Its usage mostly decreases over the whole period, but if we look at the scale of the y-axis, we can observe that it in fact remained really popular until very recently.
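The "grow, peak, decline" reading above is made by eye. A rough way to back it with numbers would be to extract, for each name, the year in which it peaked. The sketch below uses invented yearly totals for two placeholder names; `yearly`, `peaks`, `peak_year` and `peak_births` are hypothetical names, not objects from this report.

```r
library(dplyr)

# Hypothetical yearly totals for two placeholder names (invented values).
yearly <- tibble(
  preusuel = c("A", "A", "A", "B", "B", "B"),
  annais   = c(1950, 1960, 1970, 1990, 2000, 2010),
  nombre   = c(10, 40, 20, 5, 15, 30)
)

# For each name, keep the year with the highest number of births.
peaks <- yearly %>%
  group_by(preusuel) %>%
  filter(nombre == max(nombre)) %>%
  ungroup() %>%
  select(preusuel, peak_year = annais, peak_births = nombre)

peaks
```

Comparing `peak_year` across names would make statements such as "Nicolas looks like Michel shifted to the right" easier to check.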
## Most given name per year per gender
To establish the list of the most given names per year and per gender, we first group the data by year of birth, gender and first name. This allows us to compute the total number of births for each year, first name and gender across all departments. Then, we group the data by year of birth and gender, in order to select the maximum value in each of these groups, which answers the query.
In this case, we remove the rows containing `_PRENOMS_RARES` and those whose year is `XXXX`, as they are irrelevant here (they do not correspond to a first name or a year, respectively).
```{r}
MostGiven = FirstNames[FirstNames$preusuel!="_PRENOMS_RARES",]
MostGiven = MostGiven[MostGiven$annais!="XXXX",]
MostGiven = MostGiven %>%
  group_by(annais,sexe,preusuel) %>%
  summarize(nombre = sum(nombre)) %>%
  ungroup() %>%
  group_by(annais,sexe) %>%
  filter(nombre == max(nombre)) %>%
  ungroup() %>%
  arrange(annais)
MostGiven
```
Thanks to these computations, we now have a table containing the most given name for each year, separated by gender.
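One detail of the approach above is that `filter(nombre == max(nombre))` keeps every name that reaches the maximum, so a year with tied names yields several rows. The following sketch, on invented totals, shows how such ties could be detected; `totals`, `top`, `ties` and `n_top` are hypothetical names.

```r
library(dplyr)

# Hypothetical per-year totals; JEAN and LOUIS are tied in 1900 (invented values).
totals <- tibble(
  annais   = c("1900", "1900", "1900", "1901", "1901"),
  sexe     = c(1, 1, 1, 1, 1),
  preusuel = c("JEAN", "LOUIS", "PIERRE", "JEAN", "LOUIS"),
  nombre   = c(50, 50, 20, 60, 40)
)

# Same selection as in the report: every name reaching the group maximum is kept.
top <- totals %>%
  group_by(annais, sexe) %>%
  filter(nombre == max(nombre)) %>%
  ungroup()

# Year/gender pairs that ended up with more than one "most given" name.
ties <- top %>%
  count(annais, sexe, name = "n_top") %>%
  filter(n_top > 1)

ties
```

Whether ties should be kept or broken arbitrarily is a design choice; keeping them, as the report does, avoids hiding information.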
## Synthesis
From the first question, we can formulate a hypothesis:
It seems that the popularity of most first names follows a similar curve, which grows for a period, reaches a peak, then decreases to the point that the name is hardly given anymore. I think the curve may be steeper for more recent first names, but verifying this would require further computations.
On the other hand, having computed the list of the most used first names for each year, we can observe that the frequency of those most given names seems to decline over time. We will draw a separate graph for male and female names to show this.
These graphs show how the number of births for each year's most popular name evolves over time.
```{r}
MostGivenM = MostGiven[MostGiven$sexe==1,] %>% select(-sexe, -preusuel)
MostGivenF = MostGiven[MostGiven$sexe==2,] %>% select(-sexe, -preusuel)
plot(MostGivenM, main="Popularity of the most given male first names from 1900 to 2020",
     xlim=c(1900, 2020), xlab="Year of birth", ylab="Number of births for the most popular name",
     pch=19, col="blue", cex.main=0.8)
plot(MostGivenF, main="Popularity of the most given female first names from 1900 to 2020",
     xlim=c(1900, 2020), xlab="Year of birth", ylab="Number of births for the most popular name",
     pch=19, col="blue", cex.main=0.8)
```
After observing these graphs, the question seems a bit more complicated than I initially thought. However, they still show that, recently, the most popular names have had quite a low frequency. Without more information (like the evolution of the number of births), we cannot be certain, but my hypothesis is that the variety of given first names has increased dramatically compared to the start of the 20th century.
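One way to partly account for the number of births, assuming the counts in the file itself are an acceptable proxy for it, would be to divide the count of each year's top name by the total recorded that year for the same gender. This is a sketch on invented totals, not a result from the real data; `totals`, `top_share`, `total` and `share` are hypothetical names.

```r
library(dplyr)

# Hypothetical per-year totals, rare names included (invented values).
totals <- tibble(
  annais   = c(1900, 1900, 1900, 2000, 2000, 2000),
  sexe     = c(1, 1, 1, 1, 1, 1),
  preusuel = c("JEAN", "LOUIS", "_PRENOMS_RARES",
               "THOMAS", "LUCAS", "_PRENOMS_RARES"),
  nombre   = c(60, 30, 10, 20, 15, 165)
)

top_share <- totals %>%
  group_by(annais, sexe) %>%
  mutate(total = sum(nombre)) %>%            # all recorded births, rare names included
  filter(preusuel != "_PRENOMS_RARES") %>%   # rare names cannot be the "top name"
  filter(nombre == max(nombre)) %>%
  ungroup() %>%
  mutate(share = nombre / total)

top_share
```

A falling `share` over time would support the hypothesis more directly than the raw counts, since it no longer depends on how many births were recorded each year.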
To try to visualize this increase in variety, we can plot, for each year, the number of first names that reach a certain threshold, for instance five hundred births (this threshold is arbitrary).
```{r}
nbPopular = FirstNames[FirstNames$preusuel!="_PRENOMS_RARES",]
nbPopular = nbPopular[nbPopular$annais!="XXXX",]
nbPopular = nbPopular %>%
  group_by(annais,sexe,preusuel) %>%
  summarize(nombre = sum(nombre)) %>%
  ungroup() %>%
  group_by(annais,sexe) %>%
  filter(nombre >= 500) %>%
  summarize(nombre=n()) %>%
  ungroup() %>%
  arrange(annais)
nbPopularM = nbPopular[nbPopular$sexe==1,] %>% select(-sexe)
nbPopularF = nbPopular[nbPopular$sexe==2,] %>% select(-sexe)
plot(nbPopularM, main="Number of male first names given to at least 500 people, 1900 to 2020",
     xlim=c(1900, 2020), xlab="Year of birth", ylab="Number of first names",
     pch=19, col="blue", cex.main=0.8)
plot(nbPopularF, main="Number of female first names given to at least 500 people, 1900 to 2020",
     xlim=c(1900, 2020), xlab="Year of birth", ylab="Number of first names",
     pch=19, col="blue", cex.main=0.8)
```
These graphs seem to show an increase in the variety of at least mildly popular names across the century. It would, however, be better to use a percentage of the total births for each year instead of a fixed threshold.
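A relative threshold of this kind could look like the sketch below, which counts, per year and gender, the names whose share of that year's births reaches a chosen minimum. The data is invented, and `totals`, `min_share`, `popular` and `n_names` are hypothetical names; the 10% value is only suited to this tiny example, and a much smaller percentage would be needed on the real file.

```r
library(dplyr)

# Hypothetical per-year totals for a single gender (invented values).
totals <- tibble(
  annais   = c(1900, 1900, 1900, 2000, 2000, 2000, 2000),
  sexe     = 1,
  preusuel = c("JEAN", "LOUIS", "PAUL", "THOMAS", "LUCAS", "HUGO", "LEO"),
  nombre   = c(70, 25, 5, 30, 30, 30, 5)
)

min_share <- 0.10  # hypothetical threshold: 10% of the births of the year

# Count the names whose share of the year's births reaches the threshold.
popular <- totals %>%
  group_by(annais, sexe) %>%
  mutate(share = nombre / sum(nombre)) %>%
  filter(share >= min_share) %>%
  summarise(n_names = n(), .groups = "drop")

popular
```

Unlike the fixed cut-off of 500 births, this count is not inflated or deflated by changes in the yearly number of births.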