---
title: "Data visualization"
author: "VAUDEY Jolahn"
date: "15/10/2021"
output:
  pdf_document: default
  html_document:
    df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Graph Representation
We start by creating our dataset.
```{r, echo=FALSE}
feetSize = c(17.5, 17.5, 17.5, 17.5, 18, 18, 18, 18, 18.5, 18.5, 18.5, 19, 19, 20, 20, 20, 20.5, 20.5, 20.5, 20.5, 21, 21, 21, 21, 21.5, 21.5, 21.5, 22, 22, 22, 22, 23, 23, 23, 23.5, 23.5, 23.5, 23.5, 24, 24, 24, 24.5, 24.5, 24.5, 24.5, 25, 25, 25, 25, 25.5, 25.5, 26, 26, 26, 26.5, 26.5, 26.5, 27, 27, 27, 27, 27.5, 27.5, 28, 28, 28, 28, 28.5, 28.5, 29, 29, 29)
mistakes = c( 15, 18, 19, 20, 16, 17, 18, 19, 14, 16, 17, 15, 16, 13, 14, 15, 12, 13, 14, 15, 10, 11, 13, 15, 10, 12, 13, 8, 10, 11, 12, 8, 9, 10, 7, 8, 9, 11, 6, 8, 9, 6, 7, 8, 10, 4, 6, 7, 8, 5, 6, 4, 5, 7, 3, 4, 5, 2, 3, 4, 7, 2, 3, 0, 1, 2, 4, 0, 2, 0, 1, 2)
data_used = data.frame(feetSize,mistakes)
```
We can see here that the variables feetSize and mistakes are both quantitative, and we will treat them as continuous variables.
The data will be illustrated using a histogram with 12 bins of width 1 cm, corresponding to the intervals ]17, 18], ..., ]28, 29]. The height of each bar is the average number of mistakes made by students whose foot size lies in the corresponding interval.
```{r, echo=FALSE}
library(ggplot2)
# Bar chart of the mean number of mistakes per 1 cm foot-size bin
ggplot(data = data_used, aes(x = feetSize, y = mistakes)) +
  stat_summary_bin(fun = "mean", geom = "bar", binwidth = 1, fill = "blue", colour = "black") +
  labs(title = "Average number of mistakes in a dictation by foot size\n", x = "foot size (in cm)", y = "Average number of mistakes") +
  theme(axis.text.x = element_text(size = 14), axis.title.x = element_text(size = 16),
        axis.text.y = element_text(size = 14), axis.title.y = element_text(size = 16),
        plot.title = element_text(size = 16, face = "bold"))
```
This graph shows a clear negative association between the students' foot size and their results in the dictation: students with larger feet appear to make fewer mistakes.
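The bar heights can be checked by hand. The following is a minimal sketch on an excerpt (the first 11 observations of the dataset), assuming right-closed 1 cm intervals as described above; the exact bin edges chosen by `stat_summary_bin` may differ. The names `excerpt` and `bin_means` are introduced here for illustration only.

```{r}
# Excerpt: the first 11 observations of the dataset defined above
excerpt <- data.frame(
  feetSize = c(17.5, 17.5, 17.5, 17.5, 18, 18, 18, 18, 18.5, 18.5, 18.5),
  mistakes = c(15, 18, 19, 20, 16, 17, 18, 19, 14, 16, 17)
)

# Assign each foot size to a right-closed interval: (17,18] or (18,19]
excerpt$bin <- cut(excerpt$feetSize, breaks = 17:19)

# Average number of mistakes within each interval
bin_means <- tapply(excerpt$mistakes, excerpt$bin, mean)
bin_means  # (17,18]: 17.75, (18,19]: about 15.67
```

Even on this small excerpt, the average number of mistakes is lower in the interval with the larger foot sizes.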
## Summary of the variables
To summarize the relationship between these two variables, we fit a linear regression that models the number of mistakes as a linear function of the foot size.
```{r, echo=FALSE}
# Model the number of mistakes as a linear function of foot size
reg <- lm(mistakes ~ feetSize, data = data_used)
summary(reg)
```
The result is again very clear: the coefficient of feetSize is negative and its p-value is extremely low, so the number of mistakes appears to be negatively correlated with the foot size.
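For reference, the model fitted by `lm` can be written as follows. The symbols $\beta_0$, $\beta_1$ and $\varepsilon_i$ are notation introduced here and do not appear in the R output:

$$
\text{mistakes}_i = \beta_0 + \beta_1 \, \text{feetSize}_i + \varepsilon_i
$$

The p-value reported on the `feetSize` row of the summary tests the null hypothesis $H_0 : \beta_1 = 0$. A very small p-value together with a negative estimate of $\beta_1$ is what indicates a decreasing linear relationship.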
## Taking a step back
Of course, it seems utterly absurd that having huge feet would prevent a student from making mistakes in a dictation. Yet the graph we obtained from the collected data suggests exactly that.
As noted before, there appears to be a negative correlation between the two variables. However, **this does not mean that a causal link exists**.
Since both the graph and the regression lead to the same absurd conclusion, the issue must stem from the data itself: either it is incorrect, or it does not capture enough factors (it contains only two variables, after all) to allow us to reach correct conclusions.
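To illustrate how a missing factor can produce such a result, here is a small simulation sketch. Everything in it is hypothetical: the lurking variable `age`, the sample size and all coefficients are assumptions made up for the example, and none of them comes from the collected data.

```{r}
# Hypothetical simulation: `age` is an ASSUMED lurking variable,
# not a variable present in the collected dataset.
set.seed(42)
n <- 500
age <- runif(n, min = 6, max = 12)                    # assumed pupil ages, in years
foot <- 12 + 1.2 * age + rnorm(n, sd = 0.8)           # foot size grows with age (made-up coefficients)
mistakes_sim <- 30 - 2.5 * age + rnorm(n, sd = 2)     # mistakes decrease with age (made-up coefficients)
sim <- data.frame(age, foot, mistakes_sim)

# Ignoring age: foot size and mistakes look strongly related
cor_marginal <- cor(sim$foot, sim$mistakes_sim)
coef_foot_naive <- coef(lm(mistakes_sim ~ foot, data = sim))["foot"]

# Adjusting for age: the foot size coefficient shrinks towards zero
coef_foot_adjusted <- coef(lm(mistakes_sim ~ foot + age, data = sim))["foot"]

c(naive = unname(coef_foot_naive), adjusted = unname(coef_foot_adjusted))
```

In this simulated setting, foot size has no direct effect on the number of mistakes, yet the two are strongly correlated because both depend on age. Whether something similar explains the real data cannot be decided from two variables alone.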