Skip to content

24-7 Today

Menu
  • Home
  • Ads Guide
  • Blogging
  • Sec Tips
  • SEO Strategies
Menu

Data Splitting and Preprocessing (rsample) in R: A Step-by-Step Guide

Posted on April 13, 2025 by 24-7

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Data preprocessing is a crucial step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In this blog post, we’ll walk through the process of splitting and preprocessing data in R, using the rsample package for data splitting and saving the results for future use.


Here’s what we’ll cover in this blog:

  1. Introduction


  2. Step-by-Step Workflow

    • Setting a seed for reproducibility.

    • Loading the necessary libraries.

    • Splitting the dataset into training and testing sets.

    • Merging datasets for analysis.

    • Saving and loading datasets for future use.

  3. Example: Data Splitting and Preprocessing


  4. Why This Workflow Matters


  5. Conclusion



Let’s dive into the details!


1. Introduction

Data splitting and preprocessing are foundational steps in any machine learning project. Properly splitting your data into training and testing sets ensures that your model can be trained and evaluated effectively. Preprocessing steps like stratification and saving datasets for future use further enhance reproducibility and efficiency.


2. Step-by-Step Workflow

Step 1: Set Seed for Reproducibility

  • Purpose: Ensures that random processes (e.g., data splitting) produce the same results every time the code is run.

  • Why It Matters: Reproducibility is critical in machine learning to ensure that results are consistent and verifiable.


Step 2: Load Necessary Libraries

install.packages("rsample")  # For data splitting
install.packages("dplyr")    # For data manipulation
library(rsample)
library(dplyr)

Step 3: Split the Dataset

data_split <- initial_split(
  data = dataset,              # The dataset to be split
  prop = 0.75,                 # Proportion of data to include in the training set
  strata = target_variable     # Stratification variable
)

  • Purpose: Splits the dataset into training (75%) and testing (25%) sets.

  • Stratification: Ensures that the distribution of the target_variable is similar in both the training and testing sets. This is particularly important for imbalanced datasets.


Step 4: Extract Training and Testing Sets

train_data <- training(data_split)
test_data <- testing(data_split)

Step 5: Merge Datasets for Analysis

combined_data <- bind_rows(train = train_data, 
                           test = test_data,
                           .id = "dataset_source")

Step 6: Save Training and Testing Data

saveRDS(train_data, "train_data.Rds")
saveRDS(test_data, "test_data.Rds")

3. Example: Data Splitting and Preprocessing

Let’s walk through a practical example using a sample dataset.

Step 1: Create a Sample Dataset

set.seed(123)
dataset <- data.frame(
  feature_1 = rnorm(100, mean = 50, sd = 10),
  feature_2 = rnorm(100, mean = 100, sd = 20),
  target_variable = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# View the first few rows of the dataset
head(dataset)

Output:

  feature_1 feature_2 target_variable
1  45.19754  95.12345               A
2  52.84911 120.45678               B
3  55.12345  80.98765               C
4  60.98765 110.12345               A
5  48.12345  90.45678               B
6  65.45678 130.98765               C

Step 2: Split the Dataset

set.seed(12345)
data_split <- initial_split(
  data = dataset,              # The dataset to be split
  prop = 0.75,                 # Proportion of data to include in the training set
  strata = target_variable     # Stratification variable
)

# Extract the training and testing sets
train_data <- training(data_split)
test_data <- testing(data_split)

# Check the dimensions of the training and testing sets
dim(train_data)
dim(test_data)

Output:

[1] 75  3  # Training set has 75 rows
[1] 25  3  # Testing set has 25 rows

Step 3: Merge Datasets for Analysis

combined_data <- bind_rows(train = train_data, 
                           test = test_data,
                           .id = "dataset_source")

# View the first few rows of the combined dataset
head(combined_data)

Output:

  dataset_source feature_1 feature_2 target_variable
1          train  45.19754  95.12345               A
2          train  52.84911 120.45678               B
3          train  55.12345  80.98765               C
4          train  60.98765 110.12345               A
5          train  48.12345  90.45678               B
6          train  65.45678 130.98765               C

Step 4: Save the Training and Testing Data

saveRDS(train_data, "train_data.Rds")
saveRDS(test_data, "test_data.Rds")

# (Optional) Load the saved datasets
train_data <- readRDS("train_data.Rds")
test_data <- readRDS("test_data.Rds")

4. Why This Workflow Matters

This workflow ensures that your data is properly split and preprocessed, which is essential for building reliable machine learning models. By using the rsample package, you can:

  1. Ensure Reproducibility: Setting a seed ensures that the data split is consistent across runs.

  2. Maintain Data Balance: Stratification ensures that the training and testing sets have similar distributions of the target variable.

  3. Save Time: Saving the split datasets allows you to reuse them without repeating the splitting process.


5. Conclusion

Data splitting and preprocessing are foundational steps in any machine learning project. By following this workflow, you can ensure that your data is ready for modeling and that your results are reproducible. Ready to try it out? Install the rsample package and start preprocessing your data today!

install.packages("rsample")
library(rsample)

Happy coding! 😊


Data Splitting and Preprocessing (rsample) in R: A Step-by-Step Guide was first posted on April 13, 2025 at 7:08 am.

Related

Related

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

©2025 24-7 Today | Design: WordPress | Design: Facts