In this project, we will explore yearly crime statistics from Hungary between 2009 and 2022.
# Loading the necessary packages
library(tidyverse)
library(rvest) # for webscraping
library(magrittr) # for getting objects out of lists
library(broom) # tidy model outputs
library(performance) # for model evaluation
library(tidytext) # for ordering factors within larger categories (only needed for pretty plotting)
# Setting the theme for plots
theme_set(theme_light())
We use the official statistics of the Hungarian Central Statistical Office (KSH). The data is published as an HTML table, which we scrape using the {rvest} package.
crime_url <- "https://www.ksh.hu/stadat_files/iga/hu/iga0003.html"
# Get the raw data from the web
crime_raw <-
read_html(crime_url) |>
# The table has a double header, which causes trouble with variable names.
# It is better to read everything, including both header rows, as data values
html_table(header = FALSE) |>
# extract2() from {magrittr} pulls the data frame out of the list that rvest returns
extract2(1)
# These are the original categories in Hungarian
# I used the {clipr} RStudio addin to copy the output of the following chunk to the clipboard
crime_raw |>
slice(2) |>
unlist() |>
unname()
## [1] "Év"
## [2] "Összesen"
## [3] "emberölésa"
## [4] "testi sértés"
## [5] "kábítószeres bűncselekményekb"
## [6] "közúti baleset okozása"
## [7] "ittas/bódult járművezetés"
## [8] "hivatalos személy és közfeladatot ellátó személy elleni erőszakc"
## [9] "garázdaság"
## [10] "lopás"
## [11] "rongálás"
## [12] "sikkasztás"
## [13] "csalás"
## [14] "Bűncselekmény százezer 14 éves és annál idősebb lakosra"
I pasted the output into DeepL to machine-translate the category names, then created and saved a text file that contains the English category names.
crime_header <- read_lines(here::here("eng_header.txt"))
# Now I can combine the raw data with the English header names
crime_wide <-
crime_raw |>
set_names(crime_header) |>
# Remove the first 2 rows (they are the headers)
slice(-(1:2)) |>
# Convert every variable to numeric. Note that they were characters previously.
# The thousands-separator spaces must be stripped before conversion.
mutate(across(everything(), ~str_remove_all(., " ") |> as.numeric()))
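The space-stripping step above can be sketched on toy values (base-R gsub() shown here as a package-free equivalent of the stringr call): KSH formats numbers with spaces as thousands separators, so they must be removed before as.numeric() can parse them.

```r
# Toy values mimicking the KSH number formatting (space as thousands separator)
vals <- c("1 234", "56 789", "400 000")

# gsub() removes every space; as.numeric() then parses cleanly
as.numeric(gsub(" ", "", vals))
```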
# Create a long format dataset as well
crime_long <-
crime_wide |>
pivot_longer(-Year) |>
# Convert categories to sentence case for pretty plots
mutate(name = str_to_sentence(name))
Plot all categories to see trends.
crime_long |>
ggplot() +
aes(x = Year, y = value, color = name) +
geom_line(show.legend = FALSE, linewidth = 1.1) +
# Show linear trends
geom_smooth(method = "lm", show.legend = FALSE) +
facet_wrap(~name, scales = "free_y") +
scale_y_continuous(labels = scales::comma_format()) +
labs(y = NULL, title = "Yearly counts by crime category, Hungary 2009–2022")
## `geom_smooth()` using formula = 'y ~ x'
# Create specific datasets for individual regression analyses
dui <-
crime_long |>
filter(name == "Drink/driving")
total <-
crime_long |>
filter(name == "Total")
drug <-
crime_long |>
filter(name == "Drug offencesb")
We fit linear regressions for each of the three selected categories. Standardizing both variables with the scale() function yields standardized coefficients; in a simple regression, the standardized slope equals the Pearson correlation.
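This equivalence is easy to verify in base R on toy numbers (hypothetical data, not the crime series): the slope from regressing scale(y) on scale(x) reproduces cor(x, y).

```r
# Toy data to check that the standardized slope equals Pearson's r
x <- 1:10
y <- c(2, 4, 5, 4, 6, 7, 8, 9, 9, 11)

# Slope of the standardized regression
b_std <- coef(lm(scale(y) ~ scale(x)))[["scale(x)"]]

isTRUE(all.equal(b_std, cor(x, y)))  # TRUE
```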
lm(scale(value) ~ scale(Year), data = dui) |>
summary()
##
## Call:
## lm(formula = scale(value) ~ scale(Year), data = dui)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2060 -0.7781 0.1376 0.5335 1.0691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.766e-16 2.144e-01 0.000 1.0000
## scale(Year) 6.370e-01 2.225e-01 2.863 0.0143 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8023 on 12 degrees of freedom
## Multiple R-squared: 0.4058, Adjusted R-squared: 0.3562
## F-statistic: 8.194 on 1 and 12 DF, p-value: 0.01429
lm(scale(value) ~ scale(Year), data = total) |>
summary()
##
## Call:
## lm(formula = scale(value) ~ scale(Year), data = total)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6189 -0.2352 0.0064 0.1295 0.7191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.282e-17 9.313e-02 0.00 1
## scale(Year) -9.423e-01 9.664e-02 -9.75 4.7e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3484 on 12 degrees of freedom
## Multiple R-squared: 0.8879, Adjusted R-squared: 0.8786
## F-statistic: 95.07 on 1 and 12 DF, p-value: 4.703e-07
lm(scale(value) ~ scale(Year), data = drug) |>
summary()
##
## Call:
## lm(formula = scale(value) ~ scale(Year), data = drug)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5758 -0.3100 -0.1457 0.2127 1.4309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.613e-16 1.426e-01 0.000 1
## scale(Year) 8.586e-01 1.480e-01 5.803 8.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5335 on 12 degrees of freedom
## Multiple R-squared: 0.7373, Adjusted R-squared: 0.7154
## F-statistic: 33.67 on 1 and 12 DF, p-value: 8.438e-05
Fitting a polynomial regression on the DUI data. We use the poly() function to include second- and third-order polynomial terms as predictors. See details on how the function works here:
https://stackoverflow.com/questions/19484053/what-does-the-r-function-poly-really-do
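A quick base-R sketch of what poly() produces: by default it builds an orthogonal polynomial basis, whose columns are uncorrelated, unlike raw powers of the year, which are nearly collinear. This is why the three coefficients can be interpreted separately.

```r
yr <- 2009:2022

# Orthogonal polynomial basis: columns are uncorrelated by construction
p_orth <- poly(yr, 3)
max(abs(cor(p_orth)[upper.tri(cor(p_orth))]))  # essentially 0

# Raw powers (yr, yr^2, yr^3) are almost perfectly correlated instead
p_raw <- poly(yr, 3, raw = TRUE)
min(cor(p_raw))  # close to 1
```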
dui_model <-
dui |>
lm(value ~ poly(Year, 3), data = _)
summary(dui_model)
##
## Call:
## lm(formula = value ~ poly(Year, 3), data = dui)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1888.7 -901.5 448.2 882.5 1638.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12315.9 363.4 33.892 1.18e-11 ***
## poly(Year, 3)1 6902.4 1359.7 5.076 0.000480 ***
## poly(Year, 3)2 -3372.1 1359.7 -2.480 0.032542 *
## poly(Year, 3)3 -6317.7 1359.7 -4.646 0.000913 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1360 on 10 degrees of freedom
## Multiple R-squared: 0.8425, Adjusted R-squared: 0.7953
## F-statistic: 17.84 on 3 and 10 DF, p-value: 0.0002442
# The tidy function returns the coefficients as a data frame.
tidy(dui_model, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12316. 363. 33.9 1.18e-11 11506. 13126.
## 2 poly(Year, 3)1 6902. 1360. 5.08 4.80e- 4 3873. 9932.
## 3 poly(Year, 3)2 -3372. 1360. -2.48 3.25e- 2 -6402. -342.
## 4 poly(Year, 3)3 -6318. 1360. -4.65 9.13e- 4 -9347. -3288.
You can create a pretty model summary in HTML format using the following function.
sjPlot::tab_model(dui_model)
Outcome: value

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 12315.93 | 11506.24 – 13125.62 | <0.001 |
| Year [1st degree] | 6902.39 | 3872.81 – 9931.97 | <0.001 |
| Year [2nd degree] | -3372.05 | -6401.64 – -342.47 | 0.033 |
| Year [3rd degree] | -6317.71 | -9347.29 – -3288.12 | 0.001 |
| Observations | 14 | | |
| R² / R² adjusted | 0.843 / 0.795 | | |
Investigate the assumptions of the linear model using the {performance} package.
check_model(dui_model)
We can create predictions from the model using the augment() function from the {broom} package. If we pass the original data as newdata, the output also keeps the original predictors, so the model predictions are easy to visualize alongside them.
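Under the hood, augment() with interval = "confidence" wraps base predict(); a minimal sketch on a toy model (hypothetical data) shows the correspondence between predict()'s fit/lwr/upr columns and augment()'s .fitted/.lower/.upper.

```r
set.seed(1)
d <- data.frame(x = 1:10)
d$y <- 2 * d$x + rnorm(10)

m <- lm(y ~ x, data = d)

# Base-R counterpart of augment(m, newdata = d, interval = "confidence")
pred <- predict(m, newdata = d, interval = "confidence")
colnames(pred)  # "fit" "lwr" "upr"
```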
augment(dui_model,
newdata = dui,
interval = "confidence") |>
ggplot() +
aes(x = Year, y = .fitted, ymin = .lower, ymax = .upper) +
geom_ribbon(alpha = .5, fill = "cyan") +
geom_line()
We can also fit models to all the separate outcomes in a few lines of code. R's functional-programming tools make this easy through the map() functions from {purrr} (part of the {tidyverse}).
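The nest-and-map pattern used below can also be sketched in base R (toy data, not the crime categories): split the data by group, then fit one model per group and collect the slopes.

```r
# Toy data: two groups with opposite linear trends
df <- data.frame(g = rep(c("a", "b"), each = 5),
                 x = rep(1:5, 2),
                 y = c(1, 2, 3, 5, 4, 10, 8, 6, 5, 3))

# One lm() per group; sapply() then collects the slope of each fit
models <- lapply(split(df, df$g), function(d) lm(y ~ x, data = d))
sapply(models, function(m) coef(m)[["x"]])  # slopes: a = 0.9, b = -1.7
```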
crime_nested <-
crime_long |>
group_by(name) |>
# Nesting the dataframe by name creates separate small dataframes inside the dataframe
nest() |>
# We use map() to iterate through the nested tibbles and fit a linear regression to each outcome. The results are stored as new list columns in the large tibble.
mutate(model = map(data, ~lm(scale(value) ~ poly(Year, 3), data = .x)),
tidy_model = map(model, tidy, conf.int = TRUE)
) |>
ungroup()
# This is how you can access the nested data
crime_nested |>
filter(name == "Criminal mischief") |>
unnest(data)
## # A tibble: 14 × 5
## name Year value model tidy_model
## <chr> <dbl> <dbl> <list> <list>
## 1 Criminal mischief 2009 10384 <lm> <tibble [4 × 7]>
## 2 Criminal mischief 2010 13279 <lm> <tibble [4 × 7]>
## 3 Criminal mischief 2011 12724 <lm> <tibble [4 × 7]>
## 4 Criminal mischief 2012 12868 <lm> <tibble [4 × 7]>
## 5 Criminal mischief 2013 13092 <lm> <tibble [4 × 7]>
## 6 Criminal mischief 2014 13647 <lm> <tibble [4 × 7]>
## 7 Criminal mischief 2015 12689 <lm> <tibble [4 × 7]>
## 8 Criminal mischief 2016 11509 <lm> <tibble [4 × 7]>
## 9 Criminal mischief 2017 10547 <lm> <tibble [4 × 7]>
## 10 Criminal mischief 2018 9386 <lm> <tibble [4 × 7]>
## 11 Criminal mischief 2019 8587 <lm> <tibble [4 × 7]>
## 12 Criminal mischief 2020 8570 <lm> <tibble [4 × 7]>
## 13 Criminal mischief 2021 8313 <lm> <tibble [4 × 7]>
## 14 Criminal mischief 2022 8812 <lm> <tibble [4 × 7]>
Which categories have the strongest linear, quadratic, and cubic trends?
We arrange the data by the size of the coefficient for the appropriate polynomial term to see which trends are the strongest for each category.
crime_nested |>
unnest(tidy_model) |>
filter(term == "poly(Year, 3)1") |>
arrange(estimate)
## # A tibble: 13 × 10
## name data model term estimate std.error statistic p.value conf.low
## <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Theft <tibble> <lm> poly… -3.44 0.129 -26.6 1.29e-10 -3.72
## 2 Total <tibble> <lm> poly… -3.40 0.210 -16.2 1.68e- 8 -3.87
## 3 Crime pe… <tibble> <lm> poly… -3.39 0.212 -16.0 1.89e- 8 -3.86
## 4 Assault <tibble> <lm> poly… -3.35 0.173 -19.4 2.90e- 9 -3.74
## 5 Embezzle… <tibble> <lm> poly… -3.26 0.355 -9.18 3.47e- 6 -4.05
## 6 Homicide <tibble> <lm> poly… -3.21 0.389 -8.24 9.06e- 6 -4.07
## 7 Criminal… <tibble> <lm> poly… -3.20 0.261 -12.2 2.44e- 7 -3.78
## 8 Assault … <tibble> <lm> poly… -2.95 0.431 -6.84 4.54e- 5 -3.91
## 9 Criminal… <tibble> <lm> poly… -2.81 0.258 -10.9 7.18e- 7 -3.39
## 10 Fraud <tibble> <lm> poly… -2.59 0.568 -4.56 1.05e- 3 -3.85
## 11 Causing … <tibble> <lm> poly… -1.18 0.937 -1.26 2.37e- 1 -3.27
## 12 Drink/dr… <tibble> <lm> poly… 2.30 0.452 5.08 4.80e- 4 1.29
## 13 Drug off… <tibble> <lm> poly… 3.10 0.563 5.50 2.62e- 4 1.84
## # ℹ 1 more variable: conf.high <dbl>
crime_nested |>
unnest(tidy_model) |>
filter(term == "poly(Year, 3)2") |>
arrange(estimate)
## # A tibble: 13 × 10
## name data model term estimate std.error statistic p.value conf.low
## <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Fraud <tibble> <lm> poly… -1.39 0.568 -2.45 3.45e-2 -2.65
## 2 Criminal … <tibble> <lm> poly… -1.38 0.258 -5.33 3.34e-4 -1.95
## 3 Drink/dri… <tibble> <lm> poly… -1.12 0.452 -2.48 3.25e-2 -2.13
## 4 Assault <tibble> <lm> poly… -0.808 0.173 -4.67 8.75e-4 -1.19
## 5 Embezzlem… <tibble> <lm> poly… -0.435 0.355 -1.22 2.49e-1 -1.23
## 6 Drug offe… <tibble> <lm> poly… -0.343 0.563 -0.610 5.55e-1 -1.60
## 7 Theft <tibble> <lm> poly… -0.0179 0.129 -0.139 8.93e-1 -0.305
## 8 Crime per… <tibble> <lm> poly… 0.00908 0.212 0.0428 9.67e-1 -0.463
## 9 Assault o… <tibble> <lm> poly… 0.0262 0.431 0.0607 9.53e-1 -0.934
## 10 Total <tibble> <lm> poly… 0.0364 0.210 0.174 8.66e-1 -0.431
## 11 Homicide <tibble> <lm> poly… 0.396 0.389 1.02 3.33e-1 -0.471
## 12 Causing a… <tibble> <lm> poly… 1.40 0.937 1.50 1.65e-1 -0.684
## 13 Criminal … <tibble> <lm> poly… 1.43 0.261 5.47 2.73e-4 0.847
## # ℹ 1 more variable: conf.high <dbl>
crime_nested |>
unnest(tidy_model) |>
filter(term == "poly(Year, 3)3") |>
arrange(estimate)
## # A tibble: 13 × 10
## name data model term estimate std.error statistic p.value conf.low
## <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Drink/dri… <tibble> <lm> poly… -2.10 0.452 -4.65 9.13e-4 -3.11
## 2 Causing a… <tibble> <lm> poly… -0.932 0.937 -0.995 3.43e-1 -3.02
## 3 Drug offe… <tibble> <lm> poly… -0.360 0.563 -0.639 5.37e-1 -1.61
## 4 Criminal … <tibble> <lm> poly… 0.217 0.261 0.831 4.25e-1 -0.365
## 5 Assault <tibble> <lm> poly… 0.906 0.173 5.24 3.77e-4 0.521
## 6 Embezzlem… <tibble> <lm> poly… 0.957 0.355 2.69 2.25e-2 0.166
## 7 Total <tibble> <lm> poly… 1.01 0.210 4.80 7.23e-4 0.540
## 8 Theft <tibble> <lm> poly… 1.01 0.129 7.83 1.43e-5 0.722
## 9 Homicide <tibble> <lm> poly… 1.02 0.389 2.63 2.50e-2 0.157
## 10 Crime per… <tibble> <lm> poly… 1.03 0.212 4.86 6.57e-4 0.559
## 11 Fraud <tibble> <lm> poly… 1.08 0.568 1.90 8.71e-2 -0.188
## 12 Assault o… <tibble> <lm> poly… 1.57 0.431 3.64 4.51e-3 0.610
## 13 Criminal … <tibble> <lm> poly… 1.59 0.258 6.17 1.06e-4 1.02
## # ℹ 1 more variable: conf.high <dbl>
Plot the results for each polynomial.
crime_nested |>
unnest(tidy_model) |>
filter(term != "(Intercept)") |>
select(name, term, estimate, conf.low, conf.high) |>
# Using regex to strip the complicated parts of the polynomial terms
mutate(term = str_remove_all(term, "poly\\(Year, 3|\\)"),
# Make the name variable a reordered factor, so it is arranged on the plot
name = reorder_within(name, by = estimate, within = term)) |>
ggplot() +
aes(x = estimate, y = name, xmin = conf.low, xmax = conf.high) +
# Add a reference line for no effect
geom_vline(xintercept = 0, lty = "dashed", color = "red") +
geom_pointrange() +
facet_wrap(~term, nrow = 3, scales = "free_y") +
# This is needed for the faceted rearrangement
scale_y_reordered() +
labs(y = NULL, title = "Which polynomial of year (1st, 2nd, or 3rd degree) fits each category best")