In this project, we will explore yearly crime statistics from Hungary between 2009 and 2022.

# Loading the necessary packages
library(tidyverse)
library(rvest) # for webscraping
library(magrittr) # for getting objects out of lists
library(broom) # tidy model outputs
library(performance) # for model evaluation
library(tidytext) # for ordering factors within larger categories (only needed for pretty plotting)

# Setting the theme for plots
theme_set(theme_light())

Getting the data

We are using the official statistics from the Hungarian Statistics Office. It is on a homepage, that we scrape using the {rvest} package.

crime_url <- "https://www.ksh.hu/stadat_files/iga/hu/iga0003.html"

# Get the raw data from the web
crime_raw <- 
    read_html(crime_url) |>
    # The table has a double header. This causes some trouble with variable names.
    # It is better to read all data with both headers, but not as a header but values
    html_table(header = FALSE) |> 
    # the magrittr package has this function to extract the dataframe from the list that rvest returned
    extract2(1)

# These are the original caterories in Hungarian
# I used the clipr extension of RStudio to copy the results of this following part to the clipboard
crime_raw |> 
    slice(2) |> 
    unlist() |> 
    unname()
##  [1] "Év"                                                              
##  [2] "Összesen"                                                        
##  [3] "emberölésa"                                                      
##  [4] "testi sértés"                                                    
##  [5] "kábítószeres bűncselekményekb"                                   
##  [6] "közúti baleset okozása"                                          
##  [7] "ittas/bódult járművezetés"                                       
##  [8] "hivatalos személy és közfeladatot ellátó személy elleni erőszakc"
##  [9] "garázdaság"                                                      
## [10] "lopás"                                                           
## [11] "rongálás"                                                        
## [12] "sikkasztás"                                                      
## [13] "csalás"                                                          
## [14] "Bűncselekmény százezer 14 éves és annál idősebb lakosra"

I pasted it to deepl to machine translate the category names. I created and saved a text file that contains the English category names.

crime_header <- read_lines(here::here("eng_header.txt"))

# Now I can combine the raw data with the English header names
crime_wide <- 
    crime_raw |> 
    set_names(crime_header) |> 
    # Remove the first 2 rows (they are the headers)
    slice(-(1:2)) |> 
    # Change every variable to numeric. Note that they where characters previously.
    # I also have to remove the space first before converting to numeric.
    mutate(across(everything(), ~str_remove(., " ") |> as.numeric()))

# Create a long format dataset as well
crime_long <- 
    crime_wide |> 
    pivot_longer(-Year) |> 
    # Convert categories to sentence case for pretty plots
    mutate(name = str_to_sentence(name))

Plot all categories to see trends.

crime_long |> 
    ggplot() +
    aes(x = Year, y = value, color = name) +
    geom_line(show.legend = FALSE, linewidth = 1.1) +
    # Show linear trends
    geom_smooth(method = "lm", show.legend = FALSE) +
    facet_wrap(~name, scales = "free_y") +
    scale_y_continuous(labels = scales::comma_format()) +
    labs(y = NULL, title = "Changes")
## `geom_smooth()` using formula = 'y ~ x'

Modeling with polynomials

Fitting polynomial regression on the DUI data. We use the poly() function to include second and third order polynomials as predictors. See details on how the funciton works here: https://stackoverflow.com/questions/19484053/what-does-the-r-function-poly-really-do

dui_model <- 
    dui |> 
    lm(value ~ poly(Year, 3), data = _)

summary(dui_model)
## 
## Call:
## lm(formula = value ~ poly(Year, 3), data = dui)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1888.7  -901.5   448.2   882.5  1638.4 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     12315.9      363.4  33.892 1.18e-11 ***
## poly(Year, 3)1   6902.4     1359.7   5.076 0.000480 ***
## poly(Year, 3)2  -3372.1     1359.7  -2.480 0.032542 *  
## poly(Year, 3)3  -6317.7     1359.7  -4.646 0.000913 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1360 on 10 degrees of freedom
## Multiple R-squared:  0.8425, Adjusted R-squared:  0.7953 
## F-statistic: 17.84 on 3 and 10 DF,  p-value: 0.0002442
# The tidy function returns the coefficients as a data frame.
tidy(dui_model, conf.int = TRUE)
## # A tibble: 4 × 7
##   term           estimate std.error statistic  p.value conf.low conf.high
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)      12316.      363.     33.9  1.18e-11   11506.    13126.
## 2 poly(Year, 3)1    6902.     1360.      5.08 4.80e- 4    3873.     9932.
## 3 poly(Year, 3)2   -3372.     1360.     -2.48 3.25e- 2   -6402.     -342.
## 4 poly(Year, 3)3   -6318.     1360.     -4.65 9.13e- 4   -9347.    -3288.

You can create pretty model summary in html format using the following function.

sjPlot::tab_model(dui_model)
  value
Predictors Estimates CI p
(Intercept) 12315.93 11506.24 – 13125.62 <0.001
Year [1st degree] 6902.39 3872.81 – 9931.97 <0.001
Year [2nd degree] -3372.05 -6401.64 – -342.47 0.033
Year [3rd degree] -6317.71 -9347.29 – -3288.12 0.001
Observations 14
R2 / R2 adjusted 0.843 / 0.795

Investigate the assumptions for linear model using the {performance} package.

check_model(dui_model)

We can create predictions for the model using the augment() function from the {broom} package. if we use original data as new data, we can also include the original preditcors. We can visualize the model predictions based on the model predictions.

augment(dui_model, 
        newdata = dui,
        interval = "confidence") |> 
    ggplot() +
    aes(x = Year, y = .fitted, ymin = .lower, ymax = .upper) +
    geom_ribbon(alpha = .5, fill = "cyan") +
    geom_line()

Automate modeling

We can also fit models to all separate outcomes in a few lines of code. R is a functional programming language that makes this easy, through the map() functions (that is part of {tidyverse}).

crime_nested <-
    crime_long |> 
    group_by(name) |> 
    # Nesting the dataframe by names create separate small dataframes inside the dataframe
    nest() |> 
    # We use the map function to interate through our large tibble and reach into the small tibbles inside, to fit a linear regression on each outcome. The result can be stored in a new variable inside the large tibble.
    mutate(model = map(data, ~lm(scale(value) ~ poly(Year, 3), data = .x)),
           tidy_model = map(model, tidy, conf.int = TRUE)
           ) |> 
    ungroup()

# This is how you can reach data data
crime_nested |> 
    filter(name == "Criminal mischief") |> 
    unnest(data)
## # A tibble: 14 × 5
##    name               Year value model  tidy_model      
##    <chr>             <dbl> <dbl> <list> <list>          
##  1 Criminal mischief  2009 10384 <lm>   <tibble [4 × 7]>
##  2 Criminal mischief  2010 13279 <lm>   <tibble [4 × 7]>
##  3 Criminal mischief  2011 12724 <lm>   <tibble [4 × 7]>
##  4 Criminal mischief  2012 12868 <lm>   <tibble [4 × 7]>
##  5 Criminal mischief  2013 13092 <lm>   <tibble [4 × 7]>
##  6 Criminal mischief  2014 13647 <lm>   <tibble [4 × 7]>
##  7 Criminal mischief  2015 12689 <lm>   <tibble [4 × 7]>
##  8 Criminal mischief  2016 11509 <lm>   <tibble [4 × 7]>
##  9 Criminal mischief  2017 10547 <lm>   <tibble [4 × 7]>
## 10 Criminal mischief  2018  9386 <lm>   <tibble [4 × 7]>
## 11 Criminal mischief  2019  8587 <lm>   <tibble [4 × 7]>
## 12 Criminal mischief  2020  8570 <lm>   <tibble [4 × 7]>
## 13 Criminal mischief  2021  8313 <lm>   <tibble [4 × 7]>
## 14 Criminal mischief  2022  8812 <lm>   <tibble [4 × 7]>

Which categories have the strongest linear, quadratic, and cubic trends?

We arrange the data by the size of the coefficient for the appropriate polynomial term to see which trends are the strongest for each category.

crime_nested |> 
    unnest(tidy_model) |> 
    filter(term =="poly(Year, 3)1") |> 
    arrange(estimate)
## # A tibble: 13 × 10
##    name      data     model term  estimate std.error statistic  p.value conf.low
##    <chr>     <list>   <lis> <chr>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
##  1 Theft     <tibble> <lm>  poly…    -3.44     0.129    -26.6  1.29e-10    -3.72
##  2 Total     <tibble> <lm>  poly…    -3.40     0.210    -16.2  1.68e- 8    -3.87
##  3 Crime pe… <tibble> <lm>  poly…    -3.39     0.212    -16.0  1.89e- 8    -3.86
##  4 Assault   <tibble> <lm>  poly…    -3.35     0.173    -19.4  2.90e- 9    -3.74
##  5 Embezzle… <tibble> <lm>  poly…    -3.26     0.355     -9.18 3.47e- 6    -4.05
##  6 Homicide  <tibble> <lm>  poly…    -3.21     0.389     -8.24 9.06e- 6    -4.07
##  7 Criminal… <tibble> <lm>  poly…    -3.20     0.261    -12.2  2.44e- 7    -3.78
##  8 Assault … <tibble> <lm>  poly…    -2.95     0.431     -6.84 4.54e- 5    -3.91
##  9 Criminal… <tibble> <lm>  poly…    -2.81     0.258    -10.9  7.18e- 7    -3.39
## 10 Fraud     <tibble> <lm>  poly…    -2.59     0.568     -4.56 1.05e- 3    -3.85
## 11 Causing … <tibble> <lm>  poly…    -1.18     0.937     -1.26 2.37e- 1    -3.27
## 12 Drink/dr… <tibble> <lm>  poly…     2.30     0.452      5.08 4.80e- 4     1.29
## 13 Drug off… <tibble> <lm>  poly…     3.10     0.563      5.50 2.62e- 4     1.84
## # ℹ 1 more variable: conf.high <dbl>
crime_nested |> 
    unnest(tidy_model) |> 
    filter(term =="poly(Year, 3)2") |> 
    arrange(estimate)
## # A tibble: 13 × 10
##    name       data     model term  estimate std.error statistic p.value conf.low
##    <chr>      <list>   <lis> <chr>    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>
##  1 Fraud      <tibble> <lm>  poly… -1.39        0.568   -2.45   3.45e-2   -2.65 
##  2 Criminal … <tibble> <lm>  poly… -1.38        0.258   -5.33   3.34e-4   -1.95 
##  3 Drink/dri… <tibble> <lm>  poly… -1.12        0.452   -2.48   3.25e-2   -2.13 
##  4 Assault    <tibble> <lm>  poly… -0.808       0.173   -4.67   8.75e-4   -1.19 
##  5 Embezzlem… <tibble> <lm>  poly… -0.435       0.355   -1.22   2.49e-1   -1.23 
##  6 Drug offe… <tibble> <lm>  poly… -0.343       0.563   -0.610  5.55e-1   -1.60 
##  7 Theft      <tibble> <lm>  poly… -0.0179      0.129   -0.139  8.93e-1   -0.305
##  8 Crime per… <tibble> <lm>  poly…  0.00908     0.212    0.0428 9.67e-1   -0.463
##  9 Assault o… <tibble> <lm>  poly…  0.0262      0.431    0.0607 9.53e-1   -0.934
## 10 Total      <tibble> <lm>  poly…  0.0364      0.210    0.174  8.66e-1   -0.431
## 11 Homicide   <tibble> <lm>  poly…  0.396       0.389    1.02   3.33e-1   -0.471
## 12 Causing a… <tibble> <lm>  poly…  1.40        0.937    1.50   1.65e-1   -0.684
## 13 Criminal … <tibble> <lm>  poly…  1.43        0.261    5.47   2.73e-4    0.847
## # ℹ 1 more variable: conf.high <dbl>
crime_nested |> 
    unnest(tidy_model) |> 
    filter(term =="poly(Year, 3)3") |> 
    arrange(estimate)
## # A tibble: 13 × 10
##    name       data     model term  estimate std.error statistic p.value conf.low
##    <chr>      <list>   <lis> <chr>    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>
##  1 Drink/dri… <tibble> <lm>  poly…   -2.10      0.452    -4.65  9.13e-4   -3.11 
##  2 Causing a… <tibble> <lm>  poly…   -0.932     0.937    -0.995 3.43e-1   -3.02 
##  3 Drug offe… <tibble> <lm>  poly…   -0.360     0.563    -0.639 5.37e-1   -1.61 
##  4 Criminal … <tibble> <lm>  poly…    0.217     0.261     0.831 4.25e-1   -0.365
##  5 Assault    <tibble> <lm>  poly…    0.906     0.173     5.24  3.77e-4    0.521
##  6 Embezzlem… <tibble> <lm>  poly…    0.957     0.355     2.69  2.25e-2    0.166
##  7 Total      <tibble> <lm>  poly…    1.01      0.210     4.80  7.23e-4    0.540
##  8 Theft      <tibble> <lm>  poly…    1.01      0.129     7.83  1.43e-5    0.722
##  9 Homicide   <tibble> <lm>  poly…    1.02      0.389     2.63  2.50e-2    0.157
## 10 Crime per… <tibble> <lm>  poly…    1.03      0.212     4.86  6.57e-4    0.559
## 11 Fraud      <tibble> <lm>  poly…    1.08      0.568     1.90  8.71e-2   -0.188
## 12 Assault o… <tibble> <lm>  poly…    1.57      0.431     3.64  4.51e-3    0.610
## 13 Criminal … <tibble> <lm>  poly…    1.59      0.258     6.17  1.06e-4    1.02 
## # ℹ 1 more variable: conf.high <dbl>

Plot the results for each polynomial.

crime_nested |> 
    unnest(tidy_model) |> 
    filter(term !="(Intercept)") |> 
    select(name, term, estimate, conf.low, conf.high) |> 
    # Using regex to get rid of the complicated parts of the polyonomial terms
    mutate(term = str_remove_all(term, "poly\\(Year, 3|\\)"),
    # Make the name variable a reordered factor, so it is arranged on the plot
           name = reorder_within(name, by = estimate, within = term)) |> 
    ggplot() +
    aes(x = estimate, y = name, xmin = conf.low, xmax = conf.high) +
    # Add a reference line for no effect
    geom_vline(xintercept = 0, lty = "dashed", color = "red") +
    geom_pointrange() +
    facet_wrap(~term, nrow = 3, scales = "free_y") +
    # This is needed for the faceted rearrangement
    scale_y_reordered() +
    labs(y = NULL, title = "Show wihch polinomial (1, 2, or 3) of year fits the best to the data")