Introduction
The vazul package provides functions for data blinding in research contexts. Data blinding helps prevent researcher bias by anonymizing data while preserving analytical validity. This vignette introduces the main functions and demonstrates their usage with practical examples.
There are two primary approaches to data blinding:
- Masking: Replaces original values with anonymous labels, completely hiding the original information.
- Scrambling: Randomizes the order of existing values while preserving all original data content.
Each approach is available at three levels:
- Vector level: mask_labels() and scramble_values() operate on single vectors
- Data frame level: mask_variables() and scramble_variables() operate on columns in a data frame
- Row-wise level: mask_variables_rowwise() and scramble_variables_rowwise() operate within rows across columns
Masking Functions
Masking functions replace categorical values with anonymous labels. This is useful when you want to completely hide the original information, such as treatment conditions or group assignments.
mask_labels() - Mask Vector Values
The mask_labels() function takes a character or factor vector and replaces each unique value with a randomly assigned masked label.
Parameters
- x: A character or factor vector to mask
- prefix: Character string to use as prefix for masked labels (default: "masked_group_")
Basic Usage
# Load the package (needed once per session for all examples in this vignette)
library(vazul)
# Create a simple treatment vector
treatment <- c("control", "treatment", "control", "treatment", "control")
# Mask the labels
set.seed(123)
masked_treatment <- mask_labels(treatment)
masked_treatment
#> [1] "masked_group_01" "masked_group_02" "masked_group_01" "masked_group_02"
#> [5] "masked_group_01"
Notice that:
- Each unique value receives a unique masked label
- The same original value always maps to the same masked label
- The assignment of masked labels to original values is randomized
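The three properties above can be reproduced with a few lines of base R. The helper below, mask_sketch(), is a hypothetical stand-in written for this illustration; it is not part of vazul and makes no claim about how mask_labels() is implemented.

```r
# Hypothetical helper (not part of vazul): a base R sketch of label masking.
mask_sketch <- function(x, prefix = "masked_group_") {
  values <- unique(x[!is.na(x)])
  # One label per distinct value, handed out in random order
  labels <- sample(paste0(prefix, sprintf("%02d", seq_along(values))))
  # match() returns NA for missing input, so NA positions stay NA
  labels[match(x, values)]
}

set.seed(123)
treatment <- c("control", "treatment", "control", "treatment", "control")
masked <- mask_sketch(treatment)

# Cross-tabulate original against masked values:
# each original value should line up with exactly one masked label
mapping <- table(original = treatment, masked = masked)
rowSums(mapping > 0)
colSums(mapping > 0)
```

A one-to-one mapping shows up as a single non-zero cell in every row and every column of the cross-tabulation.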
Custom Prefix
You can customize the prefix used for masked labels:
set.seed(456)
mask_labels(treatment, prefix = "group_")
#> [1] "group_01" "group_02" "group_01" "group_02" "group_01"
set.seed(789)
mask_labels(treatment, prefix = "condition_")
#> [1] "condition_01" "condition_02" "condition_01" "condition_02" "condition_01"
Working with Factors
The function preserves factor structure when the input is a factor:
# Create a factor vector
ecology <- factor(c("Desperate", "Hopeful", "Desperate", "Hopeful"))
set.seed(123)
masked_ecology <- mask_labels(ecology)
masked_ecology
#> [1] masked_group_01 masked_group_02 masked_group_01 masked_group_02
#> Levels: masked_group_01 masked_group_02
class(masked_ecology)
#> [1] "factor"
Practical Example with Dataset
Let’s use the williams dataset to mask the ecology condition:
data(williams)
set.seed(42)
williams$ecology_masked <- mask_labels(williams$ecology)
# Compare original and masked values
head(williams[c("subject", "ecology", "ecology_masked")], 10)
#> # A tibble: 10 × 3
#> subject ecology ecology_masked
#> <chr> <chr> <chr>
#> 1 A30MP4LXV4MIFD Hopeful masked_group_01
#> 2 A16X5FB3HAFCKN Desperate masked_group_02
#> 3 A1E9D1OT9VJYDZ Desperate masked_group_02
#> 4 A16FPOYD7566WI Hopeful masked_group_01
#> 5 A11NOTVHWST7Y3 Desperate masked_group_02
#> 6 A3TDR6MXS6UO5Z Desperate masked_group_02
#> 7 A3OD4F0SA7EBCL Desperate masked_group_02
#> 8 A123PBQDU71I5O Hopeful masked_group_01
#> 9 A25NGIY591U3DK Hopeful masked_group_01
#> 10 A11WCFPJSR5VZP Desperate masked_group_02
Now researchers can analyze the data without knowing which condition is “Desperate” vs “Hopeful”.
mask_variables() - Mask Data Frame Columns
The mask_variables() function applies masking to multiple columns in a data frame simultaneously.
Parameters
- data: A data frame
- ...: Columns to mask (supports tidyselect helpers)
- across_variables: If TRUE, all selected variables share the same masked labels; if FALSE (default), each variable gets independent masked labels
Independent Masking (Default)
By default, each column gets its own set of masked labels with the column name as prefix:
df <- data.frame(
treatment = c("control", "intervention", "control", "intervention"),
outcome = c("success", "failure", "success", "failure"),
score = c(85, 92, 78, 88)
)
set.seed(123)
result <- mask_variables(df, c("treatment", "outcome"))
result
#> treatment outcome score
#> 1 treatment_group_01 outcome_group_01 85
#> 2 treatment_group_02 outcome_group_02 92
#> 3 treatment_group_01 outcome_group_01 78
#> 4 treatment_group_02 outcome_group_02 88
Notice that each column now has its own prefix (treatment_group_, outcome_group_).
Shared Masking Across Variables
When across_variables = TRUE, all selected columns share the same mapping:
df2 <- data.frame(
pre_condition = c("A", "B", "C", "A"),
post_condition = c("B", "A", "A", "C"),
score = c(1, 2, 3, 4)
)
set.seed(456)
result_shared <- mask_variables(df2, c("pre_condition", "post_condition"),
across_variables = TRUE)
result_shared
#> pre_condition post_condition score
#> 1 masked_group_01 masked_group_03 1
#> 2 masked_group_03 masked_group_01 2
#> 3 masked_group_02 masked_group_01 3
#> 4 masked_group_01 masked_group_02 4
With shared masking, value “A” maps to the same label in both columns.
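The idea of a shared mapping can be sketched in base R by building one lookup from the values of all selected columns and applying it to each of them. This is an illustration under that assumption only; the names key and shared are made up here, and the labels it produces need not match the package output.

```r
# Illustration only: one shared lookup built from the values of both columns.
df2 <- data.frame(
  pre_condition  = c("A", "B", "C", "A"),
  post_condition = c("B", "A", "A", "C"),
  score = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
cols <- c("pre_condition", "post_condition")

set.seed(456)
values <- unique(unlist(df2[cols], use.names = FALSE))
# Named vector: names are original values, elements are masked labels
key <- setNames(
  sample(sprintf("masked_group_%02d", seq_along(values))),
  values
)

# Apply the same lookup to every selected column
shared <- df2
shared[cols] <- lapply(df2[cols], function(col) unname(key[col]))

# Collect every label that "A" received, in either column
a_labels <- unique(c(
  shared$pre_condition[df2$pre_condition == "A"],
  shared$post_condition[df2$post_condition == "A"]
))
length(a_labels)
```

Because both columns are translated through the same lookup, a_labels contains a single label.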
Using tidyselect Helpers
You can use tidyselect helpers to select columns:
set.seed(789)
mask_variables(df, where(is.character))
#> treatment outcome score
#> 1 treatment_group_01 outcome_group_02 85
#> 2 treatment_group_02 outcome_group_01 92
#> 3 treatment_group_01 outcome_group_02 78
#> 4 treatment_group_02 outcome_group_01 88
mask_variables_rowwise() - Row-Level Masking
The mask_variables_rowwise() function applies consistent masking within each row across multiple columns. This is useful when you have repeated measures or matched conditions.
Parameters
- data: A data frame
- ...: Column sets to mask (supports tidyselect helpers)
- prefix: Character string to use as prefix for masked labels (default: "masked_group_")
Example: Masking Repeated Conditions
df <- data.frame(
treat_1 = c("control", "treatment", "placebo"),
treat_2 = c("treatment", "placebo", "control"),
treat_3 = c("placebo", "control", "treatment"),
id = 1:3
)
set.seed(123)
result <- mask_variables_rowwise(df, starts_with("treat_"))
result
#> treat_1 treat_2 treat_3 id
#> 1 masked_group_03 masked_group_01 masked_group_02 1
#> 2 masked_group_01 masked_group_02 masked_group_03 2
#> 3 masked_group_02 masked_group_03 masked_group_01 3
Within each row, the original values are consistently mapped to masked labels, but the mapping is independent across rows.
Scrambling Functions
Scrambling functions randomize the order of values while preserving all original data content. This approach maintains the data distribution while breaking the connection between observations and their original values.
scramble_values() - Scramble Vector Order
The scramble_values() function randomly reorders the elements of a vector.
Parameters
- x: A vector to scramble
Basic Usage with Different Data Types
# Numeric data
set.seed(123)
numbers <- 1:10
scramble_values(numbers)
#> [1] 3 10 2 8 6 9 1 7 5 4
# Character data
set.seed(456)
letters_vec <- letters[1:5]
scramble_values(letters_vec)
#> [1] "e" "a" "c" "b" "d"
# Factor data
set.seed(789)
conditions <- factor(c("A", "B", "C", "A", "B"))
scramble_values(conditions)
#> [1] B A B C A
#> Levels: A B C
Key Properties
Scrambling preserves:
- All original values (nothing is lost or changed)
- The data type
- The distribution of values
set.seed(100)
original <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
scrambled <- scramble_values(original)
# Same values, different order
sort(original) == sort(scrambled)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Same frequency distribution
table(original)
#> original
#> 1 2 3 4
#> 1 2 3 4
table(scrambled)
#> scrambled
#> 1 2 3 4
#> 1 2 3 4
Practical Example with Dataset
data(williams)
set.seed(42)
williams$age_scrambled <- scramble_values(williams$age)
# The values are the same, just reordered
summary(williams$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 21.00 26.00 32.00 34.04 38.00 71.00
summary(williams$age_scrambled)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 21.00 26.00 32.00 34.04 38.00 71.00
# But individual correspondences are broken
head(williams[c("subject", "age", "age_scrambled")], 10)
#> # A tibble: 10 × 3
#> subject age age_scrambled
#> <chr> <dbl> <dbl>
#> 1 A30MP4LXV4MIFD 34 25
#> 2 A16X5FB3HAFCKN 30 26
#> 3 A1E9D1OT9VJYDZ 40 25
#> 4 A16FPOYD7566WI 35 38
#> 5 A11NOTVHWST7Y3 26 25
#> 6 A3TDR6MXS6UO5Z 33 28
#> 7 A3OD4F0SA7EBCL 33 57
#> 8 A123PBQDU71I5O 30 32
#> 9 A25NGIY591U3DK 48 25
#> 10 A11WCFPJSR5VZP 33 43
scramble_variables() - Scramble Data Frame Columns
The scramble_variables() function scrambles the values of specified columns in a data frame.
Parameters
- data: A data frame
- ...: Columns to scramble (supports tidyselect helpers)
- together: If TRUE, variables are scrambled together as a unit per row; if FALSE (default), each variable is scrambled independently
- .groups: Optional grouping columns for within-group scrambling. Grouping columns must not overlap with the columns selected in .... If data is already grouped (a dplyr grouped data frame), existing grouping is ignored unless .groups is explicitly provided.
Independent Scrambling (Default)
Each column is scrambled independently:
df <- data.frame(
x = 1:6,
y = letters[1:6],
group = c("A", "A", "A", "B", "B", "B")
)
set.seed(123)
scramble_variables(df, c("x", "y"))
#> x y group
#> 1 3 e A
#> 2 6 d A
#> 3 2 b A
#> 4 4 f B
#> 5 5 a B
#> 6 1 c B
Notice that x and y are scrambled independently of each other.
Scrambling Together
When together = TRUE, the selected columns are scrambled as a unit, preserving row-level relationships:
set.seed(456)
scramble_variables(df, c("x", "y"), together = TRUE)
#> x y group
#> 1 5 e A
#> 2 6 f A
#> 3 3 c A
#> 4 2 b B
#> 5 1 a B
#> 6 4 d B
Notice that the pairs (1, “a”), (2, “b”), etc., remain intact but are assigned to different rows.
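One way to picture scrambling together is a single permutation of row positions that is reused for every selected column. The base R sketch below illustrates that idea only; it is not taken from the vazul source, and perm and together are names chosen for the example.

```r
# Illustration only: draw one permutation and reuse it for both columns.
df <- data.frame(
  x = 1:6,
  y = letters[1:6],
  group = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

set.seed(456)
perm <- sample.int(nrow(df))

together <- df
together[c("x", "y")] <- lapply(df[c("x", "y")], function(col) col[perm])

# x and y still belong together: y is the x-th letter in every row
all(together$y == letters[together$x])

# The unselected column is untouched
identical(together$group, df$group)
```

Drawing a separate permutation for each column instead would give the independent behaviour shown in the previous section.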
Within-Group Scrambling
Use the .groups parameter to scramble within groups:
set.seed(2)
scramble_variables(df, "x", .groups = "group")
#> # A tibble: 6 × 3
#> x y group
#> <int> <chr> <chr>
#> 1 1 a A
#> 2 3 b A
#> 3 2 c A
#> 4 5 d B
#> 5 6 e B
#> 6 4 f B
Values of x are only swapped within their original group (A or B).
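Within-group scrambling can be approximated in base R with ave(), which applies a function separately to each group and puts the results back in their original positions. This is a sketch of the concept, not the package implementation; shuffle() is a hypothetical helper defined here.

```r
# Illustration only: shuffle a column separately inside each group.
df <- data.frame(
  x = 1:6,
  y = letters[1:6],
  group = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Index-based shuffle; avoids the sample() pitfall where a single number n
# is interpreted as "sample from 1:n"
shuffle <- function(v) v[sample.int(length(v))]

set.seed(2)
within_groups <- df
within_groups$x <- ave(df$x, df$group, FUN = shuffle)

# Every group still holds exactly its own values
tapply(within_groups$x, within_groups$group, sort)
```

The sorted values per group are 1 to 3 for group A and 4 to 6 for group B, whatever order the shuffle produced.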
Combining Grouping and Together
You can combine both parameters:
set.seed(100)
scramble_variables(df, c("x", "y"), .groups = "group", together = TRUE)
#> # A tibble: 6 × 3
#> x y group
#> <int> <chr> <chr>
#> 1 2 b A
#> 2 1 a A
#> 3 3 c A
#> 4 6 f B
#> 5 4 d B
#> 6 5 e B
Practical Example with Dataset
data(williams)
# Scramble age and ecology within gender groups
set.seed(42)
williams_scrambled <- williams |>
scramble_variables(c("age", "ecology"), .groups = "gender")
# Check that values are preserved within groups
# (group_by() and summarise() come from dplyr)
library(dplyr)
williams |>
group_by(gender) |>
summarise(mean_age = mean(age, na.rm = TRUE))
#> # A tibble: 2 × 2
#> gender mean_age
#> <dbl> <dbl>
#> 1 1 33.8
#> 2 2 34.6
williams_scrambled |>
group_by(gender) |>
summarise(mean_age = mean(age, na.rm = TRUE))
#> # A tibble: 2 × 2
#> gender mean_age
#> <dbl> <dbl>
#> 1 1 33.8
#> 2 2 34.6
scramble_variables_rowwise() - Row-Level Scrambling
The scramble_variables_rowwise() function scrambles values within each row across specified columns. This is useful for scrambling repeated measures or item responses.
Parameters
- data: A data frame
- ...: Columns to scramble (supports tidyselect helpers). All selections are combined into a single set and scrambled together. If you want to scramble separate groups of columns independently, call the function multiple times.
Rowwise scrambling moves values between columns, so selected columns must be type-compatible. This function requires all selected columns to have the same class (or be an integer/double mix). For factors, the selected columns must also have identical levels.
Example: Scrambling Item Responses
df <- data.frame(
item1 = c(1, 4, 7),
item2 = c(2, 5, 8),
item3 = c(3, 6, 9),
id = 1:3
)
set.seed(123)
result <- scramble_variables_rowwise(df, c("item1", "item2", "item3"))
result
#> item1 item2 item3 id
#> 1 3 1 2 1
#> 2 5 4 6 2
#> 3 8 9 7 3
Within each row, the values are shuffled among the item columns.
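The row-wise idea can be sketched in base R with an explicit loop over rows, which also shows why the selected columns need compatible types: values are moved from one column into another. The loop below is an illustration only and is not how vazul is implemented.

```r
# Illustration only: shuffle the item values inside each row.
df <- data.frame(
  item1 = c(1, 4, 7),
  item2 = c(2, 5, 8),
  item3 = c(3, 6, 9),
  id = 1:3
)
items <- c("item1", "item2", "item3")

set.seed(123)
rowwise <- df
for (i in seq_len(nrow(df))) {
  row_values <- unlist(df[i, items], use.names = FALSE)
  # A value from one column may land in any other selected column
  rowwise[i, items] <- row_values[sample.int(length(row_values))]
}

# Each row keeps its own set of values; only their column positions change
all(apply(rowwise[items], 1, sort) == apply(df[items], 1, sort))
```

The id column is not part of the selection, so it stays exactly as it was.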
Combining Multiple Selectors (Single Combined Set)
Multiple selectors are combined into one set, so values can move between all selected columns:
df2 <- data.frame(
day_1 = c(1, 4, 7),
day_2 = c(2, 5, 8),
day_3 = c(3, 6, 9),
score_a = c(10, 40, 70),
score_b = c(20, 50, 80),
id = 1:3
)
set.seed(2)
result2 <- scramble_variables_rowwise(df2, starts_with("day_"), starts_with("score_"))
result2
#> day_1 day_2 day_3 score_a score_b id
#> 1 20 3 2 10 1 1
#> 2 4 50 40 5 6 2
#> 3 7 8 9 80 70 3
Scrambling Separate Groups Independently (Call Multiple Times)
To scramble different groups of columns independently, call the function multiple times:
set.seed(42)
result3 <- df2 |>
scramble_variables_rowwise(starts_with("day_")) |>
scramble_variables_rowwise(starts_with("score_"))
result3
#> day_1 day_2 day_3 score_a score_b id
#> 1 1 3 2 20 10 1
#> 2 4 5 6 50 40 2
#> 3 8 9 7 80 70 3
Handling Special Values
Missing Values (NA)
All masking functions preserve NA values in their original positions:
# Vector with NA values
x <- c("A", "B", NA, "A", NA, "C")
set.seed(123)
masked_x <- mask_labels(x)
masked_x
#> [1] "masked_group_03" "masked_group_01" NA "masked_group_03"
#> [5] NA "masked_group_02"
# NA positions are preserved
which(is.na(masked_x))
#> [1] 3 5
If all values in a vector are NA, the function will issue a warning and return the vector unchanged:
x_all_na <- c(NA_character_, NA_character_, NA_character_)
mask_labels(x_all_na)
#> Warning: All values in input are NA. Returning unchanged.
#> [1] NA NA NA
Empty Strings
Empty strings ("") are treated as valid categorical values and will be masked like any other value:
x_with_empty <- c("A", "", "B", "", "C")
set.seed(456)
masked_with_empty <- mask_labels(x_with_empty)
masked_with_empty
#> [1] "masked_group_01" "masked_group_04" "masked_group_03" "masked_group_04"
#> [5] "masked_group_02"
# Empty strings get their own masked label
unique(masked_with_empty)
#> [1] "masked_group_01" "masked_group_04" "masked_group_03" "masked_group_02"
This is different from NA values - empty strings are actual data values, not missing data.
Choosing Between Masking and Scrambling
| Aspect | Masking | Scrambling |
|---|---|---|
| Original values | Hidden (replaced) | Preserved (reordered) |
| Distribution | Changed (new labels) | Unchanged |
| Best for | Categorical variables | Numeric or categorical |
| Use case | Hide treatment conditions | Break individual links |
| Reversibility | Requires mapping key | Irreversible |
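The reversibility row can be made concrete with a small base R sketch. It assumes the person doing the blinding builds and stores the mapping key themselves; whether and how vazul exposes such a key is not covered in this vignette, so the key data frame below is purely hypothetical.

```r
# Illustration only: masking with a stored key, then unblinding with it.
ecology <- c("Hopeful", "Desperate", "Desperate", "Hopeful")

set.seed(42)
values <- unique(ecology)
key <- data.frame(
  original = values,
  masked = sample(sprintf("masked_group_%02d", seq_along(values))),
  stringsAsFactors = FALSE
)

# The analyst only ever sees the masked vector
masked <- key$masked[match(ecology, key$original)]

# Whoever holds the key can translate the labels back later
restored <- key$original[match(masked, key$masked)]
identical(restored, ecology)
```

A scrambled vector has no comparable key unless the permutation itself is saved, which is why the table lists scrambling as irreversible.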
When to Use Masking
- When you need to hide categorical labels (e.g., treatment conditions, group names)
- When analysts should not know the meaning of categories
- When you want different prefixes for different variables
When to Use Scrambling
- When you want to preserve the original data distribution
- When you need to break the link between observations and values
- When working with numeric data that shouldn’t be categorically relabeled
Working with Included Datasets
The vazul package includes two research datasets for demonstration and practice.
MARP Dataset
The Many Analysts Religion Project (MARP) dataset contains 10,535 participants from 24 countries:
data(marp)
dim(marp)
#> [1] 10535 46
# Example: Scramble religiosity scores within countries
set.seed(42)
marp_blinded <- marp |>
scramble_variables(starts_with("rel_"), .groups = "country")
# Original and scrambled have same country-level means
original_means <- marp |>
group_by(country) |>
summarise(rel_1_mean = mean(rel_1, na.rm = TRUE), .groups = "drop")
scrambled_means <- marp_blinded |>
group_by(country) |>
summarise(rel_1_mean = mean(rel_1, na.rm = TRUE), .groups = "drop")
all.equal(original_means$rel_1_mean, scrambled_means$rel_1_mean)
#> [1] TRUE
Williams Dataset
The Williams study dataset contains 112 participants from a stereotyping study:
data(williams)
dim(williams)
#> [1] 112 25
# Example: Mask the ecology condition for blind analysis
set.seed(42)
williams_blinded <- williams |>
mask_variables("ecology")
# Analysts can work with masked conditions
williams_blinded |>
group_by(ecology) |>
summarise(
n = n(),
mean_impulsivity = mean(Impuls_1, na.rm = TRUE),
.groups = "drop"
)
#> # A tibble: 2 × 3
#> ecology n mean_impulsivity
#> <chr> <int> <dbl>
#> 1 ecology_group_01 56 4.32
#> 2 ecology_group_02 56 4.61
Summary
The vazul package provides a comprehensive toolkit for data blinding:
| Function | Level | Purpose |
|---|---|---|
| mask_labels() | Vector | Replace categorical values with anonymous labels |
| mask_variables() | Data frame | Mask multiple columns |
| mask_variables_rowwise() | Row-wise | Consistent masking within rows |
| scramble_values() | Vector | Randomize value order |
| scramble_variables() | Data frame | Scramble multiple columns |
| scramble_variables_rowwise() | Row-wise | Scramble values within rows |
These functions help researchers conduct unbiased analyses by separating the analyst from knowledge about treatment conditions, group assignments, or individual data points.
