Motivation
I often have raw data where a variable has been provided as dummy variables
as opposed to a character/factor variable. For example job_employed = 1
,
job_unemployed = 0
instead of job = employed
. While some modeling
algorithms require the use of dummy variables, this can make it challenging
to detect missing/multiple data and to create tables. For example in the German
Credit Data1, if you want a count of the purpose of the loan field,
you first
have to sum
and then pivot_longer
and it doesn’t tell you if any
observation
has no values true or multiple values true.
library(tidyverse)
data(GermanCredit, package = "caret")
GermanCredit %>% summarise_at(vars(starts_with('Purpose')), sum)
## Purpose.NewCar Purpose.UsedCar Purpose.Furniture.Equipment
## 1 234 103 181
## Purpose.Radio.Television Purpose.DomesticAppliance Purpose.Repairs
## 1 280 12 22
## Purpose.Education Purpose.Vacation Purpose.Retraining Purpose.Business
## 1 50 0 9 97
## Purpose.Other
## 1 12
GermanCredit %>%
summarise_at(vars(starts_with('Purpose')), sum) %>%
pivot_longer(cols = everything(),
names_to = 'Purpose',
values_to = 'count',
names_pattern = '^[^.]+[.](.*)')
## # A tibble: 11 x 2
## Purpose count
## <chr> <dbl>
## 1 NewCar 234
## 2 UsedCar 103
## 3 Furniture.Equipment 181
## 4 Radio.Television 280
## 5 DomesticAppliance 12
## 6 Repairs 22
## 7 Education 50
## 8 Vacation 0
## 9 Retraining 9
## 10 Business 97
## 11 Other 12
Now, what if you want to look at checking account status vs housing status. This is very difficult/tedious to do without converting to factors.
GermanCredit %>%
summarise(Lt0_and_Rent = sum(CheckingAccountStatus.lt.0 & Housing.Rent),
`0to200_and_Rent` = sum(CheckingAccountStatus.0.to.200 &
Housing.Rent))
## Lt0_and_Rent 0to200_and_Rent
## 1 65 48
# et cetera.
To that end I have created a function to convert dummy variables to factors. It requires that your variables have a seperator so that it can detect the main variable name from the level. My original function was quite slow and I appreciated the extremely valuable help provided by Christophe Dervieux via this Rstudio community post .
fct_dummy <- function(data,
variables = tidyselect::everything(),
sep = '.') {
variables <- rlang::enquo(variables)
# transform to long format the dummy columns
tmp <-
tidyr::pivot_longer(data,
cols = intersect(tidyselect::contains(sep),
!!variables),
names_to = c("groups", "levels"),
names_pattern = paste0("^([^'", sep, "]*)[",
sep, "](.*)"))
# get the groups name for column selection after
groups <- unique(tmp$groups)
# keep only non dummy value and do not keep temp value col
tmp <- dplyr::select(
dplyr::filter(tmp, value == 1),
-value)
# function to return 'multiple' if more than 1 value is present
ret_multiple <- function(x){
if(length(x) > 1) return('multiple')
return(x)
}
# tranform to wide format
tmp <- tidyr::pivot_wider(
tmp,
names_from = groups,
values_from = levels,
values_fn = list(levels = ret_multiple))
# convert to factors the groups column
dplyr::mutate_at(
tmp,
groups,
~ forcats::as_factor(.)
)
}
Here is the original German Credit data and the converted tibble.
glimpse(GermanCredit)
## Observations: 1,000
## Variables: 62
## $ Duration <int> 6, 48, 12, 42, 24, 36, 24, 36,…
## $ Amount <int> 1169, 5951, 2096, 7882, 4870, …
## $ InstallmentRatePercentage <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, …
## $ ResidenceDuration <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, …
## $ Age <int> 67, 22, 49, 45, 53, 35, 53, 35…
## $ NumberExistingCredits <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, …
## $ NumberPeopleMaintenance <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, …
## $ Telephone <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, …
## $ ForeignWorker <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Class <fct> Good, Bad, Good, Good, Bad, Go…
## $ CheckingAccountStatus.lt.0 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.0.to.200 <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ CheckingAccountStatus.gt.200 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.none <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, …
## $ CreditHistory.NoCredit.AllPaid <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.ThisBank.AllPaid <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.PaidDuly <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, …
## $ CreditHistory.Delay <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ CreditHistory.Critical <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
## $ Purpose.NewCar <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, …
## $ Purpose.UsedCar <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Purpose.Furniture.Equipment <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
## $ Purpose.Radio.Television <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Purpose.DomesticAppliance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Repairs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Education <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, …
## $ Purpose.Vacation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Retraining <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Business <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Other <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.lt.100 <dbl> 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, …
## $ SavingsAccountBonds.100.to.500 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.500.to.1000 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ SavingsAccountBonds.gt.1000 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ SavingsAccountBonds.Unknown <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ EmploymentDuration.lt.1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EmploymentDuration.1.to.4 <dbl> 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, …
## $ EmploymentDuration.4.to.7 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ EmploymentDuration.gt.7 <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ EmploymentDuration.Unemployed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ Personal.Male.Divorced.Seperated <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Personal.Female.NotSingle <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Personal.Male.Single <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, …
## $ Personal.Male.Married.Widowed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ Personal.Female.Single <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.None <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ OtherDebtorsGuarantors.CoApplicant <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.Guarantor <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ Property.RealEstate <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, …
## $ Property.Insurance <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
## $ Property.CarOther <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ Property.Unknown <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Bank <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Stores <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.None <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Housing.Rent <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Housing.Own <dbl> 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, …
## $ Housing.ForFree <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …
## $ Job.UnemployedUnskilled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Job.UnskilledResident <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, …
## $ Job.SkilledEmployee <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, …
## $ Job.Management.SelfEmp.HighlyQualified <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
new_dat <- fct_dummy(GermanCredit)
glimpse(new_dat)
## Observations: 1,000
## Variables: 21
## $ Duration <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, …
## $ Amount <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6…
## $ InstallmentRatePercentage <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2…
## $ ResidenceDuration <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4…
## $ Age <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25,…
## $ NumberExistingCredits <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1…
## $ NumberPeopleMaintenance <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Telephone <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1…
## $ ForeignWorker <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Class <fct> Good, Bad, Good, Good, Bad, Good, Good, Goo…
## $ CheckingAccountStatus <fct> lt.0, 0.to.200, none, lt.0, lt.0, none, non…
## $ CreditHistory <fct> Critical, PaidDuly, Critical, PaidDuly, Del…
## $ Purpose <fct> Radio.Television, Radio.Television, Educati…
## $ SavingsAccountBonds <fct> Unknown, lt.100, lt.100, lt.100, lt.100, Un…
## $ EmploymentDuration <fct> gt.7, 1.to.4, 4.to.7, 4.to.7, 1.to.4, 1.to.…
## $ Personal <fct> Male.Single, Female.NotSingle, Male.Single,…
## $ OtherDebtorsGuarantors <fct> None, None, None, Guarantor, None, None, No…
## $ Property <fct> RealEstate, RealEstate, RealEstate, Insuran…
## $ OtherInstallmentPlans <fct> None, None, None, None, None, None, None, N…
## $ Housing <fct> Own, Own, Own, ForFree, ForFree, ForFree, O…
## $ Job <fct> SkilledEmployee, SkilledEmployee, Unskilled…
It is now straightforward to look at groupings of factors.
new_dat %>%
count(Housing, CheckingAccountStatus)
## # A tibble: 12 x 3
## Housing CheckingAccountStatus n
## <fct> <fct> <int>
## 1 Own lt.0 170
## 2 Own 0.to.200 192
## 3 Own none 304
## 4 Own gt.200 47
## 5 ForFree lt.0 39
## 6 ForFree 0.to.200 29
## 7 ForFree none 32
## 8 ForFree gt.200 8
## 9 Rent lt.0 65
## 10 Rent 0.to.200 48
## 11 Rent none 58
## 12 Rent gt.200 8
We can also see if we have multiples or missing data points
GermanCredit[1, 13]
## [1] 0
GermanCredit[1, 13] <- 1
GermanCredit[1, 57]
## [1] 1
GermanCredit[1, 57] <- 0
new_dat <- fct_dummy(GermanCredit)
new_dat %>%
count(Housing, CheckingAccountStatus)
## Warning: Factor `Housing` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 13 x 3
## Housing CheckingAccountStatus n
## <fct> <fct> <int>
## 1 Own 0.to.200 192
## 2 Own none 304
## 3 Own lt.0 169
## 4 Own gt.200 47
## 5 ForFree 0.to.200 29
## 6 ForFree none 32
## 7 ForFree lt.0 39
## 8 ForFree gt.200 8
## 9 Rent 0.to.200 48
## 10 Rent none 58
## 11 Rent lt.0 65
## 12 Rent gt.200 8
## 13 <NA> multiple 1
You can also use the variables argument to only convert certain variables.
dat <- rename_all(GermanCredit, str_replace,
pattern = '[.]', replacement = '_')
glimpse(fct_dummy(dat, variables = starts_with('P'), sep = '_'))
## Observations: 1,000
## Variables: 45
## $ Duration <int> 6, 48, 12, 42, 24, 36, 24, 36,…
## $ Amount <int> 1169, 5951, 2096, 7882, 4870, …
## $ InstallmentRatePercentage <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, …
## $ ResidenceDuration <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, …
## $ Age <int> 67, 22, 49, 45, 53, 35, 53, 35…
## $ NumberExistingCredits <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, …
## $ NumberPeopleMaintenance <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, …
## $ Telephone <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, …
## $ ForeignWorker <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Class <fct> Good, Bad, Good, Good, Bad, Go…
## $ CheckingAccountStatus_lt.0 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus_0.to.200 <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ CheckingAccountStatus_gt.200 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus_none <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, …
## $ CreditHistory_NoCredit.AllPaid <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory_ThisBank.AllPaid <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory_PaidDuly <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, …
## $ CreditHistory_Delay <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ CreditHistory_Critical <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
## $ SavingsAccountBonds_lt.100 <dbl> 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, …
## $ SavingsAccountBonds_100.to.500 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds_500.to.1000 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ SavingsAccountBonds_gt.1000 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ SavingsAccountBonds_Unknown <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ EmploymentDuration_lt.1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EmploymentDuration_1.to.4 <dbl> 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, …
## $ EmploymentDuration_4.to.7 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ EmploymentDuration_gt.7 <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ EmploymentDuration_Unemployed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ OtherDebtorsGuarantors_None <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ OtherDebtorsGuarantors_CoApplicant <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors_Guarantor <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans_Bank <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans_Stores <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans_None <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Housing_Rent <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Housing_Own <dbl> 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, …
## $ Housing_ForFree <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …
## $ Job_UnemployedUnskilled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Job_UnskilledResident <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, …
## $ Job_SkilledEmployee <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, …
## $ Job_Management.SelfEmp.HighlyQualified <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ Purpose <fct> Radio.Television, Radio.Televi…
## $ Personal <fct> Male.Single, Female.NotSingle,…
## $ Property <fct> RealEstate, RealEstate, RealEs…
I have not yet decided what to do with this function, if I want to put it in one of my own packages or try and add it to another package. Please feel free to use it you find it useful.
This post uses the German Credit data from the University of California Irving Machine Learning Repository. If you have the
caret
package installed, it is included.↩