| Patient information data | ||||||||
| patient_id | animal_name | species | breed | dob | sex | owner_id | weight_kg | vaccinated |
|---|---|---|---|---|---|---|---|---|
| P001 | Biscuit | dog | Labrador Retriever | 2019-04-12 | M | O011 | 28.4 | Yes |
| P002 | Luna | Cat | domestic shorthair | 14/07/2020 | F | O002 | 4.1 | yes |
| P002 | Luna | Cat | domestic shorthair | 14/07/2020 | F | O002 | 4.1 | yes |
| P003 | Max | dog | German Shepherd | 2018-11-03 | Male | O003 | 34.7 | No |
| P004 | Whiskers | cat | Persian | March 5 2021 | F | O004 | NA | YES |
| P005 | Rocky | Dog | Bulldog | 2020-06-22 | M | O005 | 24.9 | No |
| P006 | Cleo | cat | Siamese | 2017-09-15 | Female | O006 | 3.8 | yes |
| P007 | Buddy | dog | Golden Retriever | 07-2019 | M | O007 | 32.1 | Yes |
| P008 | Mochi | cat | Maine Coon | 2022-01-30 | F | O008 | 5.6 | |
| P009 | Zeus | Dog | Rottweiler | 2016-03-18 | M | O009 | 47.2 | No |
| P010 | Nala | dog | Beagle | 2021-08-09 | F | O010 | 10.3 | Yes |
| P011 | Shadow | DOG | Border Collie | 2019-12-01 | m | O011 | 19.8 | yes |
Data cleaning, merging and formatting
Introduction
In this example, there are two synthetic datasets: patient information dataset and visit information dataset. The patient information datasetincludes information like name, species, breed, date of birth, sex, weight and vaccination status. The visit information dataset provides information about the patient’s visit to the veterinarian. Data collected in this dataset include date of visit, presenting complaint, the patient’s veterinarian for that visit, fees and outcome.
Suppose we wanted to get an idea of the presenting complaints that the clinic encountered. This would effect personnel training, infection control policies and inventory management. We could analyze just the visit information dataset but, at best, the analysis would be incomplete. Afterall, veterinary clinics are multispecies in nature. Determining the most common presentation is often not enough information. We need to understand these presentations across species, sex, age groups and breed. This requires that visit information data be merged with the patient information dataset.
Patient and visit information datasets are provided below. Upon visual inspection of patient information data, we notice that there are heterogeneous formats within each column. For example, under the species column, a dog is designated as “dog”, “DOG” or “Dog”. dob (date of birth) has dates in the format “YYYY-MM-DD”, “DD/MM/YYYY”, “MMMM DDDD YYYY”, “MM-YYYY”. There are also multiple ways of designating sex. Females are designated by “F” or “Female.” Males are designated by “M”, “m”, or “Male.” And in the vaccinated column, a vaccinated animal can be designated by “Yes”, “yes”, “YES”. Further, both weight and vaccinated have missing data.
For the visit information dataset, we can notice similar heterogeneities in data formats for visit_date and vet. There is also a likely error under fee_charged: $275,00 for Visit_ID = V004 as well as missing data (Visit_ID = V009).
| Visit information data | |||||||
| visit_id | pt_id | visit_date | reason | vet | fee_charged | outcome | follow_up_required |
|---|---|---|---|---|---|---|---|
| V001 | P001 | 2024-01-15 | Annual wellness | Dr. Patel | $185.00 | Healthy | No |
| V002 | P002 | February 3 2024 | Vomiting | dr. chen | 200 | Gastroenteritis - treated | Yes |
| V003 | P003 | 2024-01-22 | Lameness | Dr. Patel | $340.50 | Referred to ortho | Yes |
| V004 | P004 | 03/10/2024 | Dental cleaning | Dr. Osei | $275,00 | Completed | No |
| V005 | P005 | 2024-02-18 | Annual wellness | DR. CHEN | $185.00 | Healthy | No |
| V006 | P006 | 2024-01-30 | Weight loss | Dr. Patel | $210.00 | Hyperthyroidism suspected | Yes |
| V007 | P007 | March 12 2024 | Skin irritation | Dr. Osei | $155.00 | Allergies - treatment started | Yes |
| V008 | P002 | 2024-03-01 | Follow-up | dr. chen | $95.00 | Resolved | No |
| V009 | P008 | 2024-02-25 | First visit / neuter consult | Dr. Patel | Scheduled for surgery | Yes | |
| V010 | P010 | 2024-03-08 | Limping | Dr. Osei | $220.00 | Soft tissue injury | Yes |
| V011 | P011 | 15/03/2024 | Annual wellness | Dr. Patel | $185.00 | Healthy | No |
| V012 | P099 | 2024-03-20 | Emergency - trauma | Dr. Chen | $520.00 | Stabilized | Yes |
| V013 | P003 | 2024-03-25 | Ortho follow-up | Dr. Patel | $310.00 | Surgery recommended | Yes |
Cleaning the patient information dataset
The species, breed, sex and vaccinated columns would all benefit by being in a single case. In this example, I convert all to lower case.
patient_info_data <-
patient_info_data %>%
mutate(
species = tolower(species),
breed = tolower(breed),
sex = tolower(sex),
vaccinated = tolower(vaccinated)
)The dob column suffers from multiple formats. In the script below, I use the function parse_date_time from the lubridate package to convert them all to a single date format. Notice that for patient_id = P007, the original dob 07-2019 was converted to 2019-07-20. The original date of birth provided did not indicate which day the birthday was on. In cases like these, I would flag this and bring it to the attention of the subject-matter experts, who may go into their records to find the exact birthday. If unavailable, I would seek a consensus among the clinical experts that assuming a birthday of 2019-07-20 does not create a significant clinical difference when compared to another birthday in the month of July.
patient_info_data <-
patient_info_data %>%
mutate(
dob = parse_date_time(dob, orders = c("ymd", "dmy", "mdy", "my"))
)The sex column was partially addressed earlier by converting all uppercase letters to lowercase. However, while most sex designations are single letter, there are two that spell out male and female. This is addressed below.
patient_info_data <-
patient_info_data %>%
mutate(
sex =
case_when(
sex == "male" ~ "m",
sex == "female" ~ "f",
.default = sex
)
)For the weight_kg and vaccinated columns, there are missing values. In both cases, I would bring this to the attention of the subject-matter experts before attempting imputation. The patient with the missing weight (P004) received a dental cleaning. From my own clinical experience, I know dental cleanings in dogs and cats are always performed under general anesthesia (in contrast to horses, which are done in standing). It is certainty that a patient undergoing general anesthesia had a weight attained by the veterinarian. It is likely very possible to attain the patient’s weight by exploring the medical record further or contacting the primary veterinarian. If that is not possible, I would ask the subject-matter experts to provide a weight they felt was appropriate for this breed, sex, and age of cat along with any other comorbidities we may know about this cat. Imputation would be my last resort.
Similarly, I would bring the patient with the missing vaccination status to the attention of the subject-matter experts. Again, vaccination records may exist in the medical record and could be found with further investigation. Sometimes vaccination status is left blank because records are still being transferred from the previous veterinarian or a shelter. If the cat was from a shelter, it most likely has been vaccinated. According to the visit information dataset, the cat’s visit occurred on 2024-02-25. It was the cat’s first visit and had a neuter consult and was subsequently scheduled for surgery, most likely to get neutered. It was born on 2022-01-30. The cat’s first visit occurred more than two years after its date of birth. Again, leaning on my own clinical experience, this is not a shelter cat as most shelters will neuter (one main reason is population control as cats can reproduce rapidly) and vaccinate cats. Another option is to contact the primary veterinarian’s office. We may find that empty fields may have a meaning to that specific clinic. In the absence of any other information, I would err on the side of this cat’s vaccination status being unvaccinated, or simply unknown.
For this example, for both cases of missing data above, I will leave as is.
Another thing to note in the patient information dataset is that there is a repeat row of patient P002. This is something I would discuss with the subject-matter experts. But the two rows are completely identical, and likely shows up twice because that patient had two visits (as logged in the visit information dataset). For my purposes, this extra row is not needed.
patient_info_data = unique(patient_info_data)Cleaning the visit information dataset
The visit_date column contains dates with various formats. Again, the parse_date_time function in the lubridate package handles this very well.
visit_info_data <-
visit_info_data %>%
mutate(
visit_date = parse_date_time(visit_date, orders = c("ymd", "B d Y", "dmy"))
)The vet column contains veterinarian names. I will make sure that they all have similar formats. Further, I have found it useful when working with names to make sure they are formatted like actual names. It can help prevent the extra step of re-formatting names when reporting out.
visit_info_data <-
visit_info_data %>%
mutate(
vet = str_to_title(vet)
)The missing value in the fee_charged column could be determined by contacting the primary veterinarian. If that is not possible, this value could be estimated by consulting with the appropriate subject-matter experts or researching various clinic websites that are similar to this clinic.
There is very little need for cleaning for the outcome and follow_up_required columns. In this case, I may just convert all values to lowercase letters in anticipation of having to append new data to the current dataset.
visit_info_data <-
visit_info_data %>%
mutate(
outcome = str_to_lower(outcome),
follow_up_required = str_to_lower(follow_up_required)
)One final thing to investigate regarding the visit information dataset is that there is a patient with the ID P099, non-existent on the patient information dataset. The closest patient ID in the patient information dataset is P009. This is something to bring to the attention of the subject-matter experts and investigate whether this is a type or indeed a different patient. In this case, this is a typo. The correction is made below.
visit_info_data =
visit_info_data %>%
mutate(
pt_id =
case_when(
pt_id == "P099" ~ "P009",
.default = pt_id
)
)The cleaned datasets are shown below.
| Patient information data | ||||||||
| patient_id | animal_name | species | breed | dob | sex | owner_id | weight_kg | vaccinated |
|---|---|---|---|---|---|---|---|---|
| P001 | Biscuit | dog | labrador retriever | 2019-04-12 | m | O011 | 28.4 | yes |
| P002 | Luna | cat | domestic shorthair | 2020-07-14 | f | O002 | 4.1 | yes |
| P003 | Max | dog | german shepherd | 2018-11-03 | m | O003 | 34.7 | no |
| P004 | Whiskers | cat | persian | 2021-03-05 | f | O004 | NA | yes |
| P005 | Rocky | dog | bulldog | 2020-06-22 | m | O005 | 24.9 | no |
| P006 | Cleo | cat | siamese | 2017-09-15 | f | O006 | 3.8 | yes |
| P007 | Buddy | dog | golden retriever | 2019-07-20 | m | O007 | 32.1 | yes |
| P008 | Mochi | cat | maine coon | 2022-01-30 | f | O008 | 5.6 | |
| P009 | Zeus | dog | rottweiler | 2016-03-18 | m | O009 | 47.2 | no |
| P010 | Nala | dog | beagle | 2021-08-09 | f | O010 | 10.3 | yes |
| P011 | Shadow | dog | border collie | 2019-12-01 | m | O011 | 19.8 | yes |
| Visit information data | |||||||
| visit_id | pt_id | visit_date | reason | vet | fee_charged | outcome | follow_up_required |
|---|---|---|---|---|---|---|---|
| V001 | P001 | 2024-01-15 | Annual wellness | Dr. Patel | $185.00 | healthy | no |
| V002 | P002 | 2024-02-03 | Vomiting | Dr. Chen | 200 | gastroenteritis - treated | yes |
| V003 | P003 | 2024-01-22 | Lameness | Dr. Patel | $340.50 | referred to ortho | yes |
| V004 | P004 | 2024-10-03 | Dental cleaning | Dr. Osei | $275,00 | completed | no |
| V005 | P005 | 2024-02-18 | Annual wellness | Dr. Chen | $185.00 | healthy | no |
| V006 | P006 | 2024-01-30 | Weight loss | Dr. Patel | $210.00 | hyperthyroidism suspected | yes |
| V007 | P007 | 2024-03-12 | Skin irritation | Dr. Osei | $155.00 | allergies - treatment started | yes |
| V008 | P002 | 2024-03-01 | Follow-up | Dr. Chen | $95.00 | resolved | no |
| V009 | P008 | 2024-02-25 | First visit / neuter consult | Dr. Patel | scheduled for surgery | yes | |
| V010 | P010 | 2024-03-08 | Limping | Dr. Osei | $220.00 | soft tissue injury | yes |
| V011 | P011 | 2024-03-15 | Annual wellness | Dr. Patel | $185.00 | healthy | no |
| V012 | P009 | 2024-03-20 | Emergency - trauma | Dr. Chen | $520.00 | stabilized | yes |
| V013 | P003 | 2024-03-25 | Ortho follow-up | Dr. Patel | $310.00 | surgery recommended | yes |
Merging the datasets
The operations to merge the two datasets are shown below.
combined = left_join(patient_info_data, visit_info_data, by = c("patient_id" = "pt_id"))The final merged dataset is provided below:
| Combined dataset | |||||||||||||||
| patient_id | animal_name | species | breed | dob | sex | owner_id | weight_kg | vaccinated | visit_id | visit_date | reason | vet | fee_charged | outcome | follow_up_required |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P001 | Biscuit | dog | labrador retriever | 2019-04-12 | m | O011 | 28.4 | yes | V001 | 2024-01-15 | Annual wellness | Dr. Patel | $185.00 | healthy | no |
| P002 | Luna | cat | domestic shorthair | 2020-07-14 | f | O002 | 4.1 | yes | V002 | 2024-02-03 | Vomiting | Dr. Chen | 200 | gastroenteritis - treated | yes |
| P002 | Luna | cat | domestic shorthair | 2020-07-14 | f | O002 | 4.1 | yes | V008 | 2024-03-01 | Follow-up | Dr. Chen | $95.00 | resolved | no |
| P003 | Max | dog | german shepherd | 2018-11-03 | m | O003 | 34.7 | no | V003 | 2024-01-22 | Lameness | Dr. Patel | $340.50 | referred to ortho | yes |
| P003 | Max | dog | german shepherd | 2018-11-03 | m | O003 | 34.7 | no | V013 | 2024-03-25 | Ortho follow-up | Dr. Patel | $310.00 | surgery recommended | yes |
| P004 | Whiskers | cat | persian | 2021-03-05 | f | O004 | NA | yes | V004 | 2024-10-03 | Dental cleaning | Dr. Osei | $275,00 | completed | no |
| P005 | Rocky | dog | bulldog | 2020-06-22 | m | O005 | 24.9 | no | V005 | 2024-02-18 | Annual wellness | Dr. Chen | $185.00 | healthy | no |
| P006 | Cleo | cat | siamese | 2017-09-15 | f | O006 | 3.8 | yes | V006 | 2024-01-30 | Weight loss | Dr. Patel | $210.00 | hyperthyroidism suspected | yes |
| P007 | Buddy | dog | golden retriever | 2019-07-20 | m | O007 | 32.1 | yes | V007 | 2024-03-12 | Skin irritation | Dr. Osei | $155.00 | allergies - treatment started | yes |
| P008 | Mochi | cat | maine coon | 2022-01-30 | f | O008 | 5.6 | V009 | 2024-02-25 | First visit / neuter consult | Dr. Patel | scheduled for surgery | yes | ||
| P009 | Zeus | dog | rottweiler | 2016-03-18 | m | O009 | 47.2 | no | V012 | 2024-03-20 | Emergency - trauma | Dr. Chen | $520.00 | stabilized | yes |
| P010 | Nala | dog | beagle | 2021-08-09 | f | O010 | 10.3 | yes | V010 | 2024-03-08 | Limping | Dr. Osei | $220.00 | soft tissue injury | yes |
| P011 | Shadow | dog | border collie | 2019-12-01 | m | O011 | 19.8 | yes | V011 | 2024-03-15 | Annual wellness | Dr. Patel | $185.00 | healthy | no |