This vignette provides a step-by-step guide on how to create and apply custom validation rules in tenzing. The validation framework in tenzing is based on R6 classes that allow for flexible and configurable validation of contributor tables.
Overview
The validation system in tenzing consists of three main components:
- ColumnValidator: Ensures that required columns exist in the contributors table.
- Validator: Runs logical checks on the contents of the table (e.g., missing values, duplicate names).
- ValidateOutput: Combines column and data validation, allowing for customized validation pipelines using YAML configuration files.
By leveraging these components, you can create your own validation rules to check for specific issues in your data.
1. Defining Custom Validation Rules
Validation rules are written as functions in tenzing. These functions should take the contributors_table as input and return a list with two elements:
- type: Can be “success”, “warning”, or “error”, indicating the result of the check.
- message: A user-friendly explanation of the check result.
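As a minimal sketch of this contract, the rule below returns the expected two-element list. Both `check_not_empty` and `is_valid_result` are hypothetical names introduced for illustration; neither is part of tenzing.

```r
# Hypothetical rule, not part of tenzing: flags a table without rows.
check_not_empty <- function(contributors_table) {
  if (nrow(contributors_table) == 0) {
    return(list(type = "error", message = "The contributors table has no rows."))
  }
  list(type = "success", message = "The contributors table has at least one row.")
}

# Hypothetical helper: TRUE if a result follows the type/message contract.
is_valid_result <- function(result) {
  is.list(result) &&
    setequal(names(result), c("type", "message")) &&
    result$type %in% c("success", "warning", "error")
}

result <- check_not_empty(data.frame(Surname = character(0)))
result$type
#> [1] "error"
is_valid_result(result)
#> [1] TRUE
```

A shape check like this can be handy in unit tests for new rules, since a rule that returns anything else may not be reported correctly.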
Example: Custom Validation Function
Let’s say you want to create a check that ensures every contributor has a valid ORCID ID.
#' Check for valid ORCID iDs
#'
#' This function checks if the ORCID iDs in the `contributors_table` are formatted correctly.
#'
#' @param contributors_table A dataframe containing the contributors' information.
#'
#' @return A list containing:
#' \item{type}{Type of validation result: "success", "warning", or "error".}
#' \item{message}{An informative message indicating if any ORCID iDs are invalid.}
check_orcid <- function(contributors_table) {
  if (!"ORCID" %in% colnames(contributors_table)) {
    return(list(
      type = "warning",
      message = "No ORCID column found. ORCID validation skipped."
    ))
  }

  # Four hyphen-separated groups of four characters; the final character is a
  # check digit and may be "X".
  orcid_pattern <- "^\\d{4}-\\d{4}-\\d{4}-\\d{3}[0-9X]$"
  orcid <- contributors_table$ORCID

  # Missing ORCID iDs are allowed; only malformed ones are reported.
  # Row positions are used because the table has no `rowname` column by default.
  invalid_rows <- which(!is.na(orcid) & !grepl(orcid_pattern, orcid))

  if (length(invalid_rows) > 0) {
    return(list(
      type = "warning",
      message = glue::glue("Invalid ORCID format for the following rows: {paste(invalid_rows, collapse = ', ')}")
    ))
  }

  list(type = "success", message = "All ORCID iDs are correctly formatted.")
}
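The format check at the heart of this rule can be tried in isolation before the function is wired into a configuration. The sketch below uses a toy table with made-up surnames and a pattern that accepts a trailing `X` check character, as ORCID iDs permit. The names `toy_table`, `orcid_pattern`, and `invalid_rows` are illustrative only.

```r
# Toy data for illustration only; rows 1 and 2 hold well-formed iDs,
# row 3 is malformed, and row 4 is missing.
toy_table <- data.frame(
  Surname = c("A", "B", "C", "D"),
  ORCID = c("0000-0002-1825-0097", "0000-0002-1694-233X", "1234-5678", NA),
  stringsAsFactors = FALSE
)

# Four groups of four characters; the last character may be the check digit "X".
orcid_pattern <- "^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$"

# Missing values are skipped, so only malformed entries are flagged.
invalid_rows <- which(!is.na(toy_table$ORCID) & !grepl(orcid_pattern, toy_table$ORCID))
invalid_rows
#> [1] 3
```

This only checks the shape of the identifier. It does not verify the checksum or that the iD exists.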
2. Configuring Validation with YAML
Once you define your validation function, you need to tell tenzing to use it. This is done by specifying the validation in a YAML configuration file.
tenzing includes a set of predefined validation helpers in R/validation_helpers.R. These include:
- check_missing_order: Ensures that all contributors have an order in the publication.
- check_duplicate_order: Ensures no duplicate order numbers unless multiple first authors exist.
- check_missing_surname: Ensures all contributors have a surname.
- check_duplicate_names: Ensures no duplicate contributor names.
- check_affiliation_consistency: Ensures only one affiliation format is used.
- Many more…
These functions are automatically available when setting up validation in tenzing.
Writing Custom Dependencies
Some validations should only run if other conditions are met. You can define dependencies in the YAML configuration file.
Example: Adding ORCID Validation in config/validator_vignette_example.yaml
validation_config:
  validations:
    - name: check_missing_order
    - name: check_duplicate_order
    - name: check_missing_surname
    - name: check_missing_firstname
    - name: check_duplicate_initials
    - name: check_missing_corresponding
      dependencies:
        - '"Corresponding author?" %in% colnames(contributors_table)'
    - name: check_missing_email
      dependencies:
        - '"Corresponding author?" %in% colnames(contributors_table)'
        - 'self$results[["check_missing_corresponding"]]$type == "success"'
        - '"Email address" %in% colnames(contributors_table)'
    - name: check_orcid
      dependencies:
        - '"ORCID" %in% colnames(contributors_table)'
In this configuration:
- The check_orcid validation will only run if the column "ORCID" exists.
- The check_missing_email validation will only run if all three of its dependencies hold: the "Corresponding author?" column exists, check_missing_corresponding returned "success", and the "Email address" column exists.
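Each dependency in the configuration is a string containing an R expression. One way to picture how such strings could gate a validation is sketched below. This is an assumption about the mechanism, not tenzing's actual implementation: the hypothetical helper `dependencies_met` evaluates every expression with `contributors_table` and a stand-in `self` in scope, and reports whether all of them are TRUE.

```r
# Hypothetical helper: TRUE only if every dependency expression evaluates to TRUE.
dependencies_met <- function(dependencies, contributors_table, results = list()) {
  # Expose the names that the YAML expressions refer to.
  env <- new.env(parent = baseenv())
  env$contributors_table <- contributors_table
  env$self <- list(results = results)  # stand-in for the R6 object's `self`

  all(vapply(
    dependencies,
    function(expr) isTRUE(eval(parse(text = expr), envir = env)),
    logical(1)
  ))
}

toy_table <- data.frame(ORCID = "0000-0002-1825-0097", stringsAsFactors = FALSE)

dependencies_met('"ORCID" %in% colnames(contributors_table)', toy_table)
#> [1] TRUE

email_deps <- c(
  '"Email address" %in% colnames(contributors_table)',
  'self$results[["check_missing_corresponding"]]$type == "success"'
)
dependencies_met(email_deps, toy_table)
#> [1] FALSE
```

Because a dependency can refer to the result of an earlier check, the order of entries in the configuration matters under this reading: a check should be listed after the checks it depends on.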
3. Running Custom Validations
Once you have added the validation rule, you can run it using tenzing’s validation pipeline.
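The steps in this section refer to `my_contributors_table`, which is not defined in this vignette. A minimal stand-in might look like the sketch below. The values are made up and the column names are taken from the configurations shown in this document. A real contributors table typically has more columns, and whether every built-in check passes on this toy table has not been verified.

```r
# Illustrative stand-in for a contributors table; all values are made up.
# check.names = FALSE keeps column names containing spaces and "?" unchanged.
my_contributors_table <- data.frame(
  "Order in publication" = c(1, 2),
  "Firstname" = c("Jane", "John"),
  "Middle name" = c(NA, "Q."),
  "Surname" = c("Doe", "Roe"),
  "Primary affiliation" = c("University A", "University B"),
  "Secondary affiliation" = c(NA, NA),
  "Email address" = c("jane.doe@example.org", NA),
  "Corresponding author?" = c(TRUE, FALSE),
  "ORCID" = c("0000-0002-1825-0097", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

colnames(my_contributors_table)
```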
Step 1: Load the Configuration
config_path <- system.file("config/validator_vignette_example.yaml", package = "tenzing")
config_file <- yaml::read_yaml(config_path)
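To see what the parsed configuration looks like, the sketch below loads a shortened copy of the YAML from an inline string with `yaml::yaml.load()` instead of reading the packaged file. The variable names are illustrative. The result is a nested list in which each validation is a list with a `name` and, optionally, `dependencies`.

```r
library(yaml)

# Shortened configuration, written inline so the example does not need the package file.
config_text <- paste(
  "validation_config:",
  "  validations:",
  "    - name: check_missing_order",
  "    - name: check_orcid",
  "      dependencies:",
  "        - '\"ORCID\" %in% colnames(contributors_table)'",
  sep = "\n"
)

config <- yaml::yaml.load(config_text)
validations <- config$validation_config$validations

# Names of the configured checks, in the order they appear in the file.
vapply(validations, function(v) v$name, character(1))
#> [1] "check_missing_order" "check_orcid"

# Dependency expressions of the second check, as plain strings.
validations[[2]]$dependencies
```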
Step 2: Initialize the Validator class
validator <- Validator$new()
validator$setup_validator(config_file$validation_config)
Step 3: Run Validations on Your Data
validate_results <- validator$run_validations(contributors_table = my_contributors_table)
Step 4: Inspect the Validation Results
purrr::map(validate_results, "type")
#> $check_missing_corresponding
#> [1] "success"
#>
#> $check_missing_email
#> [1] "success"
purrr::map(validate_results, "message")
#> $check_missing_corresponding
#> [1] "There is at least one author indicated as corresponding author."
#>
#> $check_missing_email
#> [1] "There are email addresses provided for all corresponding authors."
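When many checks are configured, it can help to flatten the results into a table and look only at the checks that did not succeed. The sketch below does this in base R on a toy list that mimics the type/message structure shown above. The list `toy_results` and its messages are made up for illustration.

```r
# Toy results mimicking the structure of the validation output; messages are made up.
toy_results <- list(
  check_missing_surname = list(type = "success", message = "No missing surnames."),
  check_orcid = list(type = "warning", message = "Invalid ORCID format in row 3.")
)

# One row per check, convenient for printing or filtering.
summary_table <- data.frame(
  check = names(toy_results),
  type = vapply(toy_results, function(x) x$type, character(1)),
  message = vapply(toy_results, function(x) as.character(x$message), character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Keep only the checks that need attention.
needs_attention <- summary_table[summary_table$type != "success", ]
needs_attention$check
#> [1] "check_orcid"
```

The `as.character()` call is there because messages built with glue carry an extra class that `vapply()` would otherwise pass through unchanged.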
4. Understanding the Validator Class
The Validator class in tenzing is responsible for running all validation checks.
Key Features of Validator
- It dynamically loads validation functions from validation_helpers.R.
- It allows adding dependencies between validation rules.
- It executes only the specified validations from the configuration.
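The idea of running only the checks named in a configuration can be sketched with a lookup by function name. This is a simplified illustration, not the Validator class itself: `check_has_surname_column` and `run_named_checks` are hypothetical, and the real class additionally handles dependencies.

```r
# Hypothetical rule standing in for a function from validation_helpers.R.
check_has_surname_column <- function(contributors_table) {
  if ("Surname" %in% colnames(contributors_table)) {
    list(type = "success", message = "Surname column found.")
  } else {
    list(type = "error", message = "Surname column missing.")
  }
}

# Hypothetical runner: looks up each configured name and calls the function,
# storing the result under the name of the check.
run_named_checks <- function(check_names, contributors_table, envir = parent.frame()) {
  results <- list()
  for (check_name in check_names) {
    check_fun <- get(check_name, envir = envir, mode = "function")
    results[[check_name]] <- check_fun(contributors_table)
  }
  results
}

named_results <- run_named_checks(
  "check_has_surname_column",
  data.frame(Surname = "Doe", stringsAsFactors = FALSE)
)
named_results$check_has_surname_column$type
#> [1] "success"
```

Looking functions up by name is what lets a YAML file, which can only hold strings, select which R functions run.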
5. Ensuring Required Columns Exist with ColumnValidator
The ColumnValidator class ensures that all necessary columns are present before running validations.
Example: Configuring Required Columns
In config/columnvalidator_example.yaml:
column_config:
  rules:
    minimal:
      operator: "AND"
      columns:
        - Firstname
        - Middle name
        - Surname
        - Order in publication
      severity: "error"
    affiliation:
      operator: "OR"  # Either legacy OR regex-based affiliation columns must be present
      columns:
        - Primary affiliation
        - Secondary affiliation  # Legacy columns
      regex: "^Affiliation [0-9]+$"  # Regex-based columns
      severity: "error"  # Make sure it's required for validation to pass
    title:
      operator: "AND"
      columns:
        - Corresponding author?
        - Email address
      severity: "warning"
Using Regex to Validate Column Names
You can also define regex-based column validation to match dynamically named columns.
Example: Using Regex for Affiliation Columns
column_config:
  rules:
    affiliation:
      operator: "OR"
      columns:
        - Primary affiliation
        - Secondary affiliation
      regex: "^Affiliation [0-9]+$"
      severity: "error"
This configuration:
- Requires at least one affiliation column.
- Allows dynamic affiliation columns (e.g., Affiliation 1, Affiliation 2, etc.).
- Fails validation if no affiliation column exists.
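The way a single rule could be evaluated against a table's column names is sketched below. The helper `rule_satisfied` is hypothetical, and its reading of the operators is an assumption: "AND" requires every listed column, while "OR" is satisfied by any listed column or any column name matching the regex. The actual ColumnValidator may combine these differently.

```r
# Hypothetical helper: evaluates one rule from column_config against column names.
rule_satisfied <- function(rule, column_names) {
  present <- rule$columns %in% column_names
  if (identical(rule$operator, "AND")) {
    return(all(present))
  }
  # "OR": any listed column, or any column matching the regex, is enough.
  regex_hit <- !is.null(rule$regex) && any(grepl(rule$regex, column_names))
  any(present) || regex_hit
}

affiliation_rule <- list(
  operator = "OR",
  columns = c("Primary affiliation", "Secondary affiliation"),
  regex = "^Affiliation [0-9]+$"
)

rule_satisfied(affiliation_rule, c("Surname", "Affiliation 1"))
#> [1] TRUE
rule_satisfied(affiliation_rule, c("Surname", "Email address"))
#> [1] FALSE
```

Note that the anchors `^` and `$` in the regex keep a column such as "Affiliation notes" from counting as an affiliation column.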
Running the ColumnValidator
config_path <- system.file("config/columnvalidator_example.yaml", package = "tenzing")
config_file <- yaml::read_yaml(config_path)
column_validator <- ColumnValidator$new(config_input = config_file$column_config)
column_results <- column_validator$validate_columns(my_contributors_table)
column_results
#> $minimal
#> $minimal$type
#> [1] "success"
#>
#> $minimal$message
#> All column requirements satisfied.
#>
#>
#> $affiliation
#> $affiliation$type
#> [1] "success"
#>
#> $affiliation$message
#> All column requirements satisfied.
#>
#>
#> $title
#> $title$type
#> [1] "success"
#>
#> $title$message
#> All column requirements satisfied.
6. Bringing It All Together with ValidateOutput
The ValidateOutput class integrates both Validator and ColumnValidator.
How It Works
1. Reads the configuration file.
2. Runs ColumnValidator to check required columns.
3. Runs Validator to check data integrity.
4. Returns the combined results.
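The last step can be pictured as concatenating two named lists and, if needed, reducing them to a single status. The sketch below is an illustration with made-up results, not the ValidateOutput implementation. The severity ordering (success, then warning, then error) is an assumption.

```r
# Toy results in the type/message shape used throughout; contents are made up.
toy_column_results <- list(
  minimal = list(type = "success", message = "All column requirements satisfied.")
)
toy_data_results <- list(
  check_orcid = list(type = "warning", message = "Invalid ORCID format in row 3.")
)

# Column checks first, then data checks, in one named list.
combined <- c(toy_column_results, toy_data_results)
names(combined)
#> [1] "minimal"     "check_orcid"

# Overall status: the most severe type found among all checks.
severity <- c(success = 1, warning = 2, error = 3)
types <- vapply(combined, function(x) x$type, character(1))
overall <- names(which.max(severity[types]))
overall
#> [1] "warning"
```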
Example Usage
validate_output_instance <- ValidateOutput$new(config_path = config_path)
validate_results <- validate_output_instance$run_validations(my_contributors_table)
validate_results
#> $minimal
#> $minimal$type
#> [1] "success"
#>
#> $minimal$message
#> All column requirements satisfied.
#>
#>
#> $affiliation
#> $affiliation$type
#> [1] "success"
#>
#> $affiliation$message
#> All column requirements satisfied.
#>
#>
#> $title
#> $title$type
#> [1] "success"
#>
#> $title$message
#> All column requirements satisfied.