STEP-CM Analytics
  • About
  • Services
  • Background
  • Contact

← Back to Services

Data cleaning, merging and formatting

Introduction

In this example, there are two synthetic datasets: patient information dataset and visit information dataset. The patient information datasetincludes information like name, species, breed, date of birth, sex, weight and vaccination status. The visit information dataset provides information about the patient’s visit to the veterinarian. Data collected in this dataset include date of visit, presenting complaint, the patient’s veterinarian for that visit, fees and outcome.

Suppose we wanted to get an idea of the presenting complaints that the clinic encountered. This would effect personnel training, infection control policies and inventory management. We could analyze just the visit information dataset but, at best, the analysis would be incomplete. Afterall, veterinary clinics are multispecies in nature. Determining the most common presentation is often not enough information. We need to understand these presentations across species, sex, age groups and breed. This requires that visit information data be merged with the patient information dataset.

Patient and visit information datasets are provided below. Upon visual inspection of patient information data, we notice that there are heterogeneous formats within each column. For example, under the species column, a dog is designated as “dog”, “DOG” or “Dog”. dob (date of birth) has dates in the format “YYYY-MM-DD”, “DD/MM/YYYY”, “MMMM DDDD YYYY”, “MM-YYYY”. There are also multiple ways of designating sex. Females are designated by “F” or “Female.” Males are designated by “M”, “m”, or “Male.” And in the vaccinated column, a vaccinated animal can be designated by “Yes”, “yes”, “YES”. Further, both weight and vaccinated have missing data.

For the visit information dataset, we can notice similar heterogeneities in data formats for visit_date and vet. There is also a likely error under fee_charged: $275,00 for Visit_ID = V004 as well as missing data (Visit_ID = V009).

  • Patients
  • Visits
Patient information data
patient_id animal_name species breed dob sex owner_id weight_kg vaccinated
P001 Biscuit dog Labrador Retriever 2019-04-12 M O011 28.4 Yes
P002 Luna Cat domestic shorthair 14/07/2020 F O002 4.1 yes
P002 Luna Cat domestic shorthair 14/07/2020 F O002 4.1 yes
P003 Max dog German Shepherd 2018-11-03 Male O003 34.7 No
P004 Whiskers cat Persian March 5 2021 F O004 NA YES
P005 Rocky Dog Bulldog 2020-06-22 M O005 24.9 No
P006 Cleo cat Siamese 2017-09-15 Female O006 3.8 yes
P007 Buddy dog Golden Retriever 07-2019 M O007 32.1 Yes
P008 Mochi cat Maine Coon 2022-01-30 F O008 5.6
P009 Zeus Dog Rottweiler 2016-03-18 M O009 47.2 No
P010 Nala dog Beagle 2021-08-09 F O010 10.3 Yes
P011 Shadow DOG Border Collie 2019-12-01 m O011 19.8 yes
Visit information data
visit_id pt_id visit_date reason vet fee_charged outcome follow_up_required
V001 P001 2024-01-15 Annual wellness Dr. Patel $185.00 Healthy No
V002 P002 February 3 2024 Vomiting dr. chen 200 Gastroenteritis - treated Yes
V003 P003 2024-01-22 Lameness Dr. Patel $340.50 Referred to ortho Yes
V004 P004 03/10/2024 Dental cleaning Dr. Osei $275,00 Completed No
V005 P005 2024-02-18 Annual wellness DR. CHEN $185.00 Healthy No
V006 P006 2024-01-30 Weight loss Dr. Patel $210.00 Hyperthyroidism suspected Yes
V007 P007 March 12 2024 Skin irritation Dr. Osei $155.00 Allergies - treatment started Yes
V008 P002 2024-03-01 Follow-up dr. chen $95.00 Resolved No
V009 P008 2024-02-25 First visit / neuter consult Dr. Patel Scheduled for surgery Yes
V010 P010 2024-03-08 Limping Dr. Osei $220.00 Soft tissue injury Yes
V011 P011 15/03/2024 Annual wellness Dr. Patel $185.00 Healthy No
V012 P099 2024-03-20 Emergency - trauma Dr. Chen $520.00 Stabilized Yes
V013 P003 2024-03-25 Ortho follow-up Dr. Patel $310.00 Surgery recommended Yes

Cleaning the patient information dataset

The species, breed, sex and vaccinated columns would all benefit by being in a single case. In this example, I convert all to lower case.

patient_info_data <- 
  patient_info_data %>%
  mutate(
    species = tolower(species), 
    breed = tolower(breed), 
    sex = tolower(sex),
    vaccinated = tolower(vaccinated)
  )

The dob column suffers from multiple formats. In the script below, I use the function parse_date_time from the lubridate package to convert them all to a single date format. Notice that for patient_id = P007, the original dob 07-2019 was converted to 2019-07-20. The original date of birth provided did not indicate which day the birthday was on. In cases like these, I would flag this and bring it to the attention of the subject-matter experts, who may go into their records to find the exact birthday. If unavailable, I would seek a consensus among the clinical experts that assuming a birthday of 2019-07-20 does not create a significant clinical difference when compared to another birthday in the month of July.

patient_info_data <- 
  patient_info_data %>%
  mutate(
    dob = parse_date_time(dob, orders = c("ymd", "dmy", "mdy", "my"))
  )

The sex column was partially addressed earlier by converting all uppercase letters to lowercase. However, while most sex designations are single letter, there are two that spell out male and female. This is addressed below.

patient_info_data <- 
  patient_info_data %>%
  mutate(
    sex = 
      case_when(
        sex == "male" ~ "m", 
        sex == "female" ~ "f", 
        .default = sex
      )
  )

For the weight_kg and vaccinated columns, there are missing values. In both cases, I would bring this to the attention of the subject-matter experts before attempting imputation. The patient with the missing weight (P004) received a dental cleaning. From my own clinical experience, I know dental cleanings in dogs and cats are always performed under general anesthesia (in contrast to horses, which are done in standing). It is certainty that a patient undergoing general anesthesia had a weight attained by the veterinarian. It is likely very possible to attain the patient’s weight by exploring the medical record further or contacting the primary veterinarian. If that is not possible, I would ask the subject-matter experts to provide a weight they felt was appropriate for this breed, sex, and age of cat along with any other comorbidities we may know about this cat. Imputation would be my last resort.

Similarly, I would bring the patient with the missing vaccination status to the attention of the subject-matter experts. Again, vaccination records may exist in the medical record and could be found with further investigation. Sometimes vaccination status is left blank because records are still being transferred from the previous veterinarian or a shelter. If the cat was from a shelter, it most likely has been vaccinated. According to the visit information dataset, the cat’s visit occurred on 2024-02-25. It was the cat’s first visit and had a neuter consult and was subsequently scheduled for surgery, most likely to get neutered. It was born on 2022-01-30. The cat’s first visit occurred more than two years after its date of birth. Again, leaning on my own clinical experience, this is not a shelter cat as most shelters will neuter (one main reason is population control as cats can reproduce rapidly) and vaccinate cats. Another option is to contact the primary veterinarian’s office. We may find that empty fields may have a meaning to that specific clinic. In the absence of any other information, I would err on the side of this cat’s vaccination status being unvaccinated, or simply unknown.

For this example, for both cases of missing data above, I will leave as is.

Another thing to note in the patient information dataset is that there is a repeat row of patient P002. This is something I would discuss with the subject-matter experts. But the two rows are completely identical, and likely shows up twice because that patient had two visits (as logged in the visit information dataset). For my purposes, this extra row is not needed.

patient_info_data = unique(patient_info_data)

Cleaning the visit information dataset

The visit_date column contains dates with various formats. Again, the parse_date_time function in the lubridate package handles this very well.

visit_info_data <- 
  visit_info_data %>%
  mutate(
    visit_date = parse_date_time(visit_date, orders = c("ymd", "B d Y", "dmy"))
  )

The vet column contains veterinarian names. I will make sure that they all have similar formats. Further, I have found it useful when working with names to make sure they are formatted like actual names. It can help prevent the extra step of re-formatting names when reporting out.

visit_info_data <- 
  visit_info_data %>%
  mutate(
    vet = str_to_title(vet)
  )

The missing value in the fee_charged column could be determined by contacting the primary veterinarian. If that is not possible, this value could be estimated by consulting with the appropriate subject-matter experts or researching various clinic websites that are similar to this clinic.

There is very little need for cleaning for the outcome and follow_up_required columns. In this case, I may just convert all values to lowercase letters in anticipation of having to append new data to the current dataset.

visit_info_data <-
  visit_info_data %>%
  mutate(
    outcome = str_to_lower(outcome), 
    follow_up_required = str_to_lower(follow_up_required)
  )

One final thing to investigate regarding the visit information dataset is that there is a patient with the ID P099, non-existent on the patient information dataset. The closest patient ID in the patient information dataset is P009. This is something to bring to the attention of the subject-matter experts and investigate whether this is a type or indeed a different patient. In this case, this is a typo. The correction is made below.

visit_info_data = 
  visit_info_data %>%
  mutate(
    pt_id = 
      case_when(
        pt_id == "P099" ~ "P009", 
        .default = pt_id
      )
  )

The cleaned datasets are shown below.

  • Patients
  • Visits
Patient information data
patient_id animal_name species breed dob sex owner_id weight_kg vaccinated
P001 Biscuit dog labrador retriever 2019-04-12 m O011 28.4 yes
P002 Luna cat domestic shorthair 2020-07-14 f O002 4.1 yes
P003 Max dog german shepherd 2018-11-03 m O003 34.7 no
P004 Whiskers cat persian 2021-03-05 f O004 NA yes
P005 Rocky dog bulldog 2020-06-22 m O005 24.9 no
P006 Cleo cat siamese 2017-09-15 f O006 3.8 yes
P007 Buddy dog golden retriever 2019-07-20 m O007 32.1 yes
P008 Mochi cat maine coon 2022-01-30 f O008 5.6
P009 Zeus dog rottweiler 2016-03-18 m O009 47.2 no
P010 Nala dog beagle 2021-08-09 f O010 10.3 yes
P011 Shadow dog border collie 2019-12-01 m O011 19.8 yes
Visit information data
visit_id pt_id visit_date reason vet fee_charged outcome follow_up_required
V001 P001 2024-01-15 Annual wellness Dr. Patel $185.00 healthy no
V002 P002 2024-02-03 Vomiting Dr. Chen 200 gastroenteritis - treated yes
V003 P003 2024-01-22 Lameness Dr. Patel $340.50 referred to ortho yes
V004 P004 2024-10-03 Dental cleaning Dr. Osei $275,00 completed no
V005 P005 2024-02-18 Annual wellness Dr. Chen $185.00 healthy no
V006 P006 2024-01-30 Weight loss Dr. Patel $210.00 hyperthyroidism suspected yes
V007 P007 2024-03-12 Skin irritation Dr. Osei $155.00 allergies - treatment started yes
V008 P002 2024-03-01 Follow-up Dr. Chen $95.00 resolved no
V009 P008 2024-02-25 First visit / neuter consult Dr. Patel scheduled for surgery yes
V010 P010 2024-03-08 Limping Dr. Osei $220.00 soft tissue injury yes
V011 P011 2024-03-15 Annual wellness Dr. Patel $185.00 healthy no
V012 P009 2024-03-20 Emergency - trauma Dr. Chen $520.00 stabilized yes
V013 P003 2024-03-25 Ortho follow-up Dr. Patel $310.00 surgery recommended yes

Merging the datasets

The operations to merge the two datasets are shown below.

combined = left_join(patient_info_data, visit_info_data, by = c("patient_id" = "pt_id"))

The final merged dataset is provided below:

Combined dataset
patient_id animal_name species breed dob sex owner_id weight_kg vaccinated visit_id visit_date reason vet fee_charged outcome follow_up_required
P001 Biscuit dog labrador retriever 2019-04-12 m O011 28.4 yes V001 2024-01-15 Annual wellness Dr. Patel $185.00 healthy no
P002 Luna cat domestic shorthair 2020-07-14 f O002 4.1 yes V002 2024-02-03 Vomiting Dr. Chen 200 gastroenteritis - treated yes
P002 Luna cat domestic shorthair 2020-07-14 f O002 4.1 yes V008 2024-03-01 Follow-up Dr. Chen $95.00 resolved no
P003 Max dog german shepherd 2018-11-03 m O003 34.7 no V003 2024-01-22 Lameness Dr. Patel $340.50 referred to ortho yes
P003 Max dog german shepherd 2018-11-03 m O003 34.7 no V013 2024-03-25 Ortho follow-up Dr. Patel $310.00 surgery recommended yes
P004 Whiskers cat persian 2021-03-05 f O004 NA yes V004 2024-10-03 Dental cleaning Dr. Osei $275,00 completed no
P005 Rocky dog bulldog 2020-06-22 m O005 24.9 no V005 2024-02-18 Annual wellness Dr. Chen $185.00 healthy no
P006 Cleo cat siamese 2017-09-15 f O006 3.8 yes V006 2024-01-30 Weight loss Dr. Patel $210.00 hyperthyroidism suspected yes
P007 Buddy dog golden retriever 2019-07-20 m O007 32.1 yes V007 2024-03-12 Skin irritation Dr. Osei $155.00 allergies - treatment started yes
P008 Mochi cat maine coon 2022-01-30 f O008 5.6 V009 2024-02-25 First visit / neuter consult Dr. Patel scheduled for surgery yes
P009 Zeus dog rottweiler 2016-03-18 m O009 47.2 no V012 2024-03-20 Emergency - trauma Dr. Chen $520.00 stabilized yes
P010 Nala dog beagle 2021-08-09 f O010 10.3 yes V010 2024-03-08 Limping Dr. Osei $220.00 soft tissue injury yes
P011 Shadow dog border collie 2019-12-01 m O011 19.8 yes V011 2024-03-15 Annual wellness Dr. Patel $185.00 healthy no

← Back to Services