
Introduction

The vazul package provides functions for data blinding in research contexts. Data blinding helps prevent researcher bias by anonymizing data while preserving analytical validity. This vignette introduces the main functions and demonstrates their usage with practical examples.

There are two primary approaches to data blinding:

  1. Masking: Replaces original values with anonymous labels, completely hiding the original information.
  2. Scrambling: Randomizes the order of existing values while preserving all original data content.

Each approach is available at three levels: vector, data frame, and row-wise.

Masking Functions

Masking functions replace categorical values with anonymous labels. This is useful when you want to completely hide the original information, such as treatment conditions or group assignments.

mask_labels() - Mask Vector Values

The mask_labels() function takes a character or factor vector and replaces each unique value with a randomly assigned masked label.

Parameters

  • x: A character or factor vector to mask
  • prefix: Character string to use as prefix for masked labels (default: "masked_group_")

Basic Usage

library(vazul)

# Create a simple treatment vector
treatment <- c("control", "treatment", "control", "treatment", "control")

# Mask the labels
set.seed(123)
masked_treatment <- mask_labels(treatment)
masked_treatment
#> [1] "masked_group_01" "masked_group_02" "masked_group_01" "masked_group_02"
#> [5] "masked_group_01"

Notice that:

  • Each unique value receives a unique masked label
  • The same original value always maps to the same masked label
  • The assignment of masked labels to original values is randomized
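
The three properties above can be reproduced with a few lines of base R. The following is a minimal sketch of one way such a masking step could work for character vectors; it is not vazul's actual implementation, and `mask_sketch()` is a hypothetical helper name used only for illustration.

```r
# Hypothetical helper (not part of vazul): mask a character vector in base R.
mask_sketch <- function(x, prefix = "masked_group_") {
  values <- unique(x[!is.na(x)])                  # distinct non-missing values
  labels <- sprintf("%s%02d", prefix, seq_along(values))
  shuffled <- labels[sample.int(length(labels))]  # randomize label assignment
  shuffled[match(x, values)]                      # NA inputs stay NA
}

treatment <- c("control", "treatment", "control", "treatment", "control")
set.seed(1)
masked <- mask_sketch(treatment)

# Each original value lines up with exactly one masked label
mapping_table <- table(treatment, masked)
mapping_table
```

In the cross-tabulation, every row has a single non-zero cell, which is what a one-to-one mapping looks like.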

Custom Prefix

You can customize the prefix used for masked labels:

set.seed(456)
mask_labels(treatment, prefix = "group_")
#> [1] "group_01" "group_02" "group_01" "group_02" "group_01"

set.seed(789)
mask_labels(treatment, prefix = "condition_")
#> [1] "condition_01" "condition_02" "condition_01" "condition_02" "condition_01"

Working with Factors

The function preserves factor structure when the input is a factor:

# Create a factor vector
ecology <- factor(c("Desperate", "Hopeful", "Desperate", "Hopeful"))

set.seed(123)
masked_ecology <- mask_labels(ecology)
masked_ecology
#> [1] masked_group_01 masked_group_02 masked_group_01 masked_group_02
#> Levels: masked_group_01 masked_group_02

class(masked_ecology)
#> [1] "factor"

Practical Example with Dataset

Let’s use the williams dataset to mask the ecology condition:

data(williams)

set.seed(42)
williams$ecology_masked <- mask_labels(williams$ecology)

# Compare original and masked values
head(williams[c("subject", "ecology", "ecology_masked")], 10)
#> # A tibble: 10 × 3
#>    subject        ecology   ecology_masked 
#>    <chr>          <chr>     <chr>          
#>  1 A30MP4LXV4MIFD Hopeful   masked_group_01
#>  2 A16X5FB3HAFCKN Desperate masked_group_02
#>  3 A1E9D1OT9VJYDZ Desperate masked_group_02
#>  4 A16FPOYD7566WI Hopeful   masked_group_01
#>  5 A11NOTVHWST7Y3 Desperate masked_group_02
#>  6 A3TDR6MXS6UO5Z Desperate masked_group_02
#>  7 A3OD4F0SA7EBCL Desperate masked_group_02
#>  8 A123PBQDU71I5O Hopeful   masked_group_01
#>  9 A25NGIY591U3DK Hopeful   masked_group_01
#> 10 A11WCFPJSR5VZP Desperate masked_group_02

Now researchers can analyze the data without knowing which condition is “Desperate” vs “Hopeful”.

mask_variables() - Mask Data Frame Columns

The mask_variables() function applies masking to multiple columns in a data frame simultaneously.

Parameters

  • data: A data frame
  • ...: Columns to mask (supports tidyselect helpers)
  • across_variables: If TRUE, all selected variables share the same masked labels; if FALSE (default), each variable gets independent masked labels

Independent Masking (Default)

By default, each column gets its own set of masked labels with the column name as prefix:

df <- data.frame(
  treatment = c("control", "intervention", "control", "intervention"),
  outcome = c("success", "failure", "success", "failure"),
  score = c(85, 92, 78, 88)
)

set.seed(123)
result <- mask_variables(df, c("treatment", "outcome"))
result
#>            treatment          outcome score
#> 1 treatment_group_01 outcome_group_01    85
#> 2 treatment_group_02 outcome_group_02    92
#> 3 treatment_group_01 outcome_group_01    78
#> 4 treatment_group_02 outcome_group_02    88

Notice that each column now has its own prefix (treatment_group_, outcome_group_).

Shared Masking Across Variables

When across_variables = TRUE, all selected columns share the same mapping:

df2 <- data.frame(
  pre_condition = c("A", "B", "C", "A"),
  post_condition = c("B", "A", "A", "C"),
  score = c(1, 2, 3, 4)
)

set.seed(456)
result_shared <- mask_variables(df2, c("pre_condition", "post_condition"),
                                across_variables = TRUE)
result_shared
#>     pre_condition  post_condition score
#> 1 masked_group_01 masked_group_03     1
#> 2 masked_group_03 masked_group_01     2
#> 3 masked_group_02 masked_group_01     3
#> 4 masked_group_01 masked_group_02     4

With shared masking, value “A” maps to the same label in both columns.
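
To make the idea of a shared mapping concrete, here is a base R sketch under the assumption that the labels are built from the pooled set of values in all selected columns. It illustrates the concept only and is not vazul's own code; the variable names `values`, `shuffled`, and `shared` are introduced for this example.

```r
# Base R sketch (not vazul's code): one lookup shared by several columns.
df2 <- data.frame(
  pre_condition = c("A", "B", "C", "A"),
  post_condition = c("B", "A", "A", "C"),
  score = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
cols <- c("pre_condition", "post_condition")

set.seed(1)
# Pool the distinct values from all selected columns
values <- unique(unlist(df2[cols], use.names = FALSE))
labels <- sprintf("masked_group_%02d", seq_along(values))
shuffled <- labels[sample.int(length(labels))]

# Apply the same value-to-label mapping to every selected column
shared <- df2
shared[cols] <- lapply(df2[cols], function(col) shuffled[match(col, values)])
shared
```

Because both columns are looked up in the same table, a value such as "A" receives the same label wherever it occurs.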

Using tidyselect Helpers

You can use tidyselect helpers to select columns:

set.seed(789)
mask_variables(df, where(is.character))
#>            treatment          outcome score
#> 1 treatment_group_01 outcome_group_02    85
#> 2 treatment_group_02 outcome_group_01    92
#> 3 treatment_group_01 outcome_group_02    78
#> 4 treatment_group_02 outcome_group_01    88

mask_variables_rowwise() - Row-Level Masking

The mask_variables_rowwise() function applies consistent masking within each row across multiple columns. This is useful when you have repeated measures or matched conditions.

Parameters

  • data: A data frame
  • ...: Column sets to mask (supports tidyselect helpers)
  • prefix: Character string to use as prefix for masked labels (default: "masked_group_")

Example: Masking Repeated Conditions

df <- data.frame(
  treat_1 = c("control", "treatment", "placebo"),
  treat_2 = c("treatment", "placebo", "control"),
  treat_3 = c("placebo", "control", "treatment"),
  id = 1:3
)

set.seed(123)
result <- mask_variables_rowwise(df, starts_with("treat_"))
result
#>           treat_1         treat_2         treat_3 id
#> 1 masked_group_03 masked_group_01 masked_group_02  1
#> 2 masked_group_01 masked_group_02 masked_group_03  2
#> 3 masked_group_02 masked_group_03 masked_group_01  3

Within each row, the original values are consistently mapped to masked labels, but the mapping is independent across rows.

Scrambling Functions

Scrambling functions randomize the order of values while preserving all original data content. This approach maintains the data distribution while breaking the connection between observations and their original values.

scramble_values() - Scramble Vector Order

The scramble_values() function randomly reorders the elements of a vector.

Parameters

  • x: A vector to scramble

Basic Usage with Different Data Types

# Numeric data
set.seed(123)
numbers <- 1:10
scramble_values(numbers)
#>  [1]  3 10  2  8  6  9  1  7  5  4

# Character data
set.seed(456)
letters_vec <- letters[1:5]
scramble_values(letters_vec)
#> [1] "e" "a" "c" "b" "d"

# Factor data
set.seed(789)
conditions <- factor(c("A", "B", "C", "A", "B"))
scramble_values(conditions)
#> [1] B A B C A
#> Levels: A B C

Key Properties

Scrambling preserves:

  • All original values (nothing is lost or changed)
  • The data type
  • The distribution of values

set.seed(100)
original <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
scrambled <- scramble_values(original)

# Same values, different order
sort(original) == sort(scrambled)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

# Same frequency distribution
table(original)
#> original
#> 1 2 3 4 
#> 1 2 3 4

table(scrambled)
#> scrambled
#> 1 2 3 4 
#> 1 2 3 4

Practical Example with Dataset

data(williams)

set.seed(42)
williams$age_scrambled <- scramble_values(williams$age)

# The values are the same, just reordered
summary(williams$age)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   21.00   26.00   32.00   34.04   38.00   71.00

summary(williams$age_scrambled)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   21.00   26.00   32.00   34.04   38.00   71.00

# But individual correspondences are broken
head(williams[c("subject", "age", "age_scrambled")], 10)
#> # A tibble: 10 × 3
#>    subject          age age_scrambled
#>    <chr>          <dbl>         <dbl>
#>  1 A30MP4LXV4MIFD    34            25
#>  2 A16X5FB3HAFCKN    30            26
#>  3 A1E9D1OT9VJYDZ    40            25
#>  4 A16FPOYD7566WI    35            38
#>  5 A11NOTVHWST7Y3    26            25
#>  6 A3TDR6MXS6UO5Z    33            28
#>  7 A3OD4F0SA7EBCL    33            57
#>  8 A123PBQDU71I5O    30            32
#>  9 A25NGIY591U3DK    48            25
#> 10 A11WCFPJSR5VZP    33            43

scramble_variables() - Scramble Data Frame Columns

The scramble_variables() function scrambles the values of specified columns in a data frame.

Parameters

  • data: A data frame
  • ...: Columns to scramble (supports tidyselect helpers)
  • together: If TRUE, variables are scrambled together as a unit per row; if FALSE (default), each variable is scrambled independently
  • .groups: Optional grouping columns for within-group scrambling. Grouping columns must not overlap with the columns selected in .... If data is already grouped (a dplyr grouped data frame), existing grouping is ignored unless .groups is explicitly provided.

Independent Scrambling (Default)

Each column is scrambled independently:

df <- data.frame(
  x = 1:6,
  y = letters[1:6],
  group = c("A", "A", "A", "B", "B", "B")
)

set.seed(123)
scramble_variables(df, c("x", "y"))
#>   x y group
#> 1 3 e     A
#> 2 6 d     A
#> 3 2 b     A
#> 4 4 f     B
#> 5 5 a     B
#> 6 1 c     B

Notice that x and y are scrambled independently of each other.

Scrambling Together

When together = TRUE, the selected columns are scrambled as a unit, preserving row-level relationships:

set.seed(456)
scramble_variables(df, c("x", "y"), together = TRUE)
#>   x y group
#> 1 5 e     A
#> 2 6 f     A
#> 3 3 c     A
#> 4 2 b     B
#> 5 1 a     B
#> 6 4 d     B

Notice that the pairs (1, “a”), (2, “b”), etc., remain intact but are assigned to different rows.
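
Scrambling columns as a unit amounts to applying one permutation of the row indices to all selected columns. The base R sketch below illustrates that idea; it is an assumption about the general technique, not a copy of what scramble_variables() does internally.

```r
# Base R sketch (not vazul's code): scramble two columns as one unit.
df <- data.frame(
  x = 1:6,
  y = letters[1:6],
  group = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

set.seed(1)
idx <- sample.int(nrow(df))                    # one permutation of the rows
together <- df
together[c("x", "y")] <- df[idx, c("x", "y")]  # same permutation for both

# Every (x, y) pair still matches the original pairing (1-a, 2-b, ...)
pairs_intact <- all(letters[together$x] == together$y)
pairs_intact
#> [1] TRUE
```

Using separate permutations for x and y would instead give the independent scrambling shown earlier.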

Within-Group Scrambling

Use the .groups parameter to scramble within groups:

set.seed(2)
scramble_variables(df, "x", .groups = "group")
#> # A tibble: 6 × 3
#>       x y     group
#>   <int> <chr> <chr>
#> 1     1 a     A    
#> 2     3 b     A    
#> 3     2 c     A    
#> 4     5 d     B    
#> 5     6 e     B    
#> 6     4 f     B

Values of x are only swapped within their original group (A or B).
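
Within-group scrambling can be sketched in base R with ave(), which applies a function to each group's values and puts the results back in their original positions. This is an illustrative sketch only, not vazul's implementation; `shuffle()` is a hypothetical helper.

```r
# Base R sketch (not vazul's code): shuffle x separately inside each group.
df <- data.frame(
  x = 1:6,
  group = c("A", "A", "A", "B", "B", "B")
)

# Index-based shuffle; avoids the sample(x) surprise when a group has one value
shuffle <- function(v) v[sample.int(length(v))]

set.seed(1)
within_group <- df
within_group$x <- ave(df$x, df$group, FUN = shuffle)

# Each group still holds exactly its original set of values
split(within_group$x, within_group$group)
```

Group A keeps the values 1 to 3 and group B keeps 4 to 6; only their order inside each group changes.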

Combining Grouping and Together

You can combine both parameters:

set.seed(100)
scramble_variables(df, c("x", "y"), .groups = "group", together = TRUE)
#> # A tibble: 6 × 3
#>       x y     group
#>   <int> <chr> <chr>
#> 1     2 b     A    
#> 2     1 a     A    
#> 3     3 c     A    
#> 4     6 f     B    
#> 5     4 d     B    
#> 6     5 e     B

Practical Example with Dataset

library(dplyr)  # provides group_by() and summarise()
data(williams)

# Scramble age and ecology within gender groups
set.seed(42)
williams_scrambled <- williams |>
  scramble_variables(c("age", "ecology"), .groups = "gender")

# Check that values are preserved within groups
williams |>
  group_by(gender) |>
  summarise(mean_age = mean(age, na.rm = TRUE))
#> # A tibble: 2 × 2
#>   gender mean_age
#>    <dbl>    <dbl>
#> 1      1     33.8
#> 2      2     34.6

williams_scrambled |>
  group_by(gender) |>
  summarise(mean_age = mean(age, na.rm = TRUE))
#> # A tibble: 2 × 2
#>   gender mean_age
#>    <dbl>    <dbl>
#> 1      1     33.8
#> 2      2     34.6

scramble_variables_rowwise() - Row-Level Scrambling

The scramble_variables_rowwise() function scrambles values within each row across specified columns. This is useful for scrambling repeated measures or item responses.

Parameters

  • data: A data frame
  • ...: Columns to scramble (supports tidyselect helpers). All selections are combined into a single set and scrambled together. If you want to scramble separate groups of columns independently, call the function multiple times.

Rowwise scrambling moves values between columns, so selected columns must be type-compatible. This function requires all selected columns to have the same class (or be an integer/double mix). For factors, the selected columns must also have identical levels.

Example: Scrambling Item Responses

df <- data.frame(
  item1 = c(1, 4, 7),
  item2 = c(2, 5, 8),
  item3 = c(3, 6, 9),
  id = 1:3
)

set.seed(123)
result <- scramble_variables_rowwise(df, c("item1", "item2", "item3"))
result
#>   item1 item2 item3 id
#> 1     3     1     2  1
#> 2     5     4     6  2
#> 3     8     9     7  3

Within each row, the values are shuffled among the item columns.
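
The same row-level shuffle can be sketched in base R by permuting each row of a matrix. This is a minimal illustration of the technique under the assumption that all selected columns are numeric; it is not the package's implementation.

```r
# Base R sketch (not vazul's code): shuffle values within each row.
df <- data.frame(
  item1 = c(1, 4, 7),
  item2 = c(2, 5, 8),
  item3 = c(3, 6, 9),
  id = 1:3
)
items <- c("item1", "item2", "item3")

set.seed(1)
m <- as.matrix(df[items])
# apply() returns one column per input row, so transpose back afterwards
shuffled <- t(apply(m, 1, function(r) unname(r)[sample.int(length(r))]))

rowwise <- df
rowwise[items] <- as.data.frame(shuffled)  # assigned by position

# Row totals are unchanged because values never leave their row
row_totals_equal <- all(rowSums(rowwise[items]) == rowSums(df[items]))
row_totals_equal
#> [1] TRUE
```

The id column is left out of the selection, so it stays aligned with its row.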

Combining Multiple Selectors (Single Combined Set)

Multiple selectors are combined into one set, so values can move between all selected columns:

df2 <- data.frame(
  day_1 = c(1, 4, 7),
  day_2 = c(2, 5, 8),
  day_3 = c(3, 6, 9),
  score_a = c(10, 40, 70),
  score_b = c(20, 50, 80),
  id = 1:3
)

set.seed(2)
result2 <- scramble_variables_rowwise(df2, starts_with("day_"), starts_with("score_"))
result2
#>   day_1 day_2 day_3 score_a score_b id
#> 1    20     3     2      10       1  1
#> 2     4    50    40       5       6  2
#> 3     7     8     9      80      70  3

Scrambling Separate Groups Independently (Call Multiple Times)

To scramble different groups of columns independently, call the function multiple times:

set.seed(42)
result3 <- df2 |>
  scramble_variables_rowwise(starts_with("day_")) |>
  scramble_variables_rowwise(starts_with("score_"))
result3
#>   day_1 day_2 day_3 score_a score_b id
#> 1     1     3     2      20      10  1
#> 2     4     5     6      50      40  2
#> 3     8     9     7      80      70  3

Handling Special Values

Missing Values (NA)

All masking functions preserve NA values in their original positions:

# Vector with NA values
x <- c("A", "B", NA, "A", NA, "C")

set.seed(123)
masked_x <- mask_labels(x)
masked_x
#> [1] "masked_group_03" "masked_group_01" NA                "masked_group_03"
#> [5] NA                "masked_group_02"

# NA positions are preserved
which(is.na(masked_x))
#> [1] 3 5

If all values in a vector are NA, the function will issue a warning and return the vector unchanged:

x_all_na <- c(NA_character_, NA_character_, NA_character_)
mask_labels(x_all_na)
#> Warning: All values in input are NA. Returning unchanged.
#> [1] NA NA NA

Empty Strings

Empty strings ("") are treated as valid categorical values and will be masked like any other value:

x_with_empty <- c("A", "", "B", "", "C")

set.seed(456)
masked_with_empty <- mask_labels(x_with_empty)
masked_with_empty
#> [1] "masked_group_01" "masked_group_04" "masked_group_03" "masked_group_04"
#> [5] "masked_group_02"

# Empty strings get their own masked label
unique(masked_with_empty)
#> [1] "masked_group_01" "masked_group_04" "masked_group_03" "masked_group_02"

This differs from how NA values are handled: empty strings are actual data values, not missing data.

Choosing Between Masking and Scrambling

Aspect           Masking                    Scrambling
Original values  Hidden (replaced)          Preserved (reordered)
Distribution     Changed (new labels)       Unchanged
Best for         Categorical variables      Numeric or categorical
Use case         Hide treatment conditions  Break individual links
Reversibility    Requires mapping key       Irreversible
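
The "mapping key" in the Reversibility row can be pictured as a small lookup table that someone other than the analyst keeps until unblinding. The base R sketch below shows the idea; the `key` data frame and the workflow around it are assumptions made for illustration and do not describe a vazul feature.

```r
# Base R sketch (not vazul's code): keep a key so masking can be undone later.
condition <- c("control", "treatment", "control", "treatment")

set.seed(1)
values <- unique(condition)
labels <- sprintf("masked_group_%02d", seq_along(values))
key <- data.frame(
  original = values,
  masked = labels[sample.int(length(labels))],
  stringsAsFactors = FALSE
)

# Blind: look up each original value's masked label
masked <- key$masked[match(condition, key$original)]

# Unblind: reverse the lookup using the stored key
restored <- key$original[match(masked, key$masked)]
identical(restored, condition)
#> [1] TRUE
```

Without the key, the masked labels alone do not reveal which condition is which.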

When to Use Masking

  • When you need to hide categorical labels (e.g., treatment conditions, group names)
  • When analysts should not know the meaning of categories
  • When you want different prefixes for different variables

When to Use Scrambling

  • When you want to preserve the original data distribution
  • When you need to break the link between observations and values
  • When working with numeric data that shouldn’t be categorically relabeled

Working with Included Datasets

The vazul package includes two research datasets for demonstration and practice.

MARP Dataset

The Many Analysts Religion Project (MARP) dataset contains 10,535 participants from 24 countries:

data(marp)
dim(marp)
#> [1] 10535    46

# Example: Scramble religiosity scores within countries
set.seed(42)
marp_blinded <- marp |>
  scramble_variables(starts_with("rel_"), .groups = "country")

# Original and scrambled have same country-level means
original_means <- marp |>
  group_by(country) |>
  summarise(rel_1_mean = mean(rel_1, na.rm = TRUE), .groups = "drop")

scrambled_means <- marp_blinded |>
  group_by(country) |>
  summarise(rel_1_mean = mean(rel_1, na.rm = TRUE), .groups = "drop")

all.equal(original_means$rel_1_mean, scrambled_means$rel_1_mean)
#> [1] TRUE

Williams Dataset

The Williams study dataset contains 112 participants from a stereotyping study:

data(williams)
dim(williams)
#> [1] 112  25

# Example: Mask the ecology condition for blind analysis
set.seed(42)
williams_blinded <- williams |>
  mask_variables("ecology")

# Analysts can work with masked conditions
williams_blinded |>
  group_by(ecology) |>
  summarise(
    n = n(),
    mean_impulsivity = mean(Impuls_1, na.rm = TRUE),
    .groups = "drop"
  )
#> # A tibble: 2 × 3
#>   ecology              n mean_impulsivity
#>   <chr>            <int>            <dbl>
#> 1 ecology_group_01    56             4.32
#> 2 ecology_group_02    56             4.61

Summary

The vazul package provides a comprehensive toolkit for data blinding:

Function                      Level       Purpose
mask_labels()                 Vector      Replace categorical values with anonymous labels
mask_variables()              Data frame  Mask multiple columns
mask_variables_rowwise()      Row-wise    Consistent masking within rows
scramble_values()             Vector      Randomize value order
scramble_variables()          Data frame  Scramble multiple columns
scramble_variables_rowwise()  Row-wise    Scramble values within rows

These functions help researchers conduct unbiased analyses by separating the analyst from knowledge about treatment conditions, group assignments, or individual data points.