Dummy to Factor

Motivation
I often have raw data where a variable has been provided as dummy variables as opposed to a character/factor variable. For example job_employed = 1, job_unemployed = 0 instead of job = employed. While some modeling algorithms require the use of dummy variables, this can make it challenging to detect missing/multiple data and to create tables. For example in the German Credit Data¹, if you want a count of the purpose of the loan field, you first have to sum and then pivot_longer and it doesn’t tell you if any observation has no values true or multiple values true.

library(tidyverse)
data(GermanCredit, package = "caret")
GermanCredit %>% summarise_at(vars(starts_with('Purpose')), sum)

##   Purpose.NewCar Purpose.UsedCar Purpose.Furniture.Equipment
## 1            234             103                         181
##   Purpose.Radio.Television Purpose.DomesticAppliance Purpose.Repairs
## 1                      280                        12              22
##   Purpose.Education Purpose.Vacation Purpose.Retraining Purpose.Business
## 1                50                0                  9               97
##   Purpose.Other
## 1            12

GermanCredit %>% 
  summarise_at(vars(starts_with('Purpose')), sum) %>% 
  pivot_longer(cols = everything(), 
               names_to = 'Purpose', 
               values_to = 'count', 
               names_pattern = '^[^.]+[.](.*)')

## # A tibble: 11 x 2
##    Purpose             count
##    <chr>               <dbl>
##  1 NewCar                234
##  2 UsedCar               103
##  3 Furniture.Equipment   181
##  4 Radio.Television      280
##  5 DomesticAppliance      12
##  6 Repairs                22
##  7 Education              50
##  8 Vacation                0
##  9 Retraining              9
## 10 Business               97
## 11 Other                  12

Now, what if you want to look at checking account status vs housing status. This is very difficult/tedious to do without converting to factors.

GermanCredit %>% 
  summarise(Lt0_and_Rent = sum(CheckingAccountStatus.lt.0 & Housing.Rent), 
            `0to200_and_Rent` = sum(CheckingAccountStatus.0.to.200 & 
                                      Housing.Rent))

##   Lt0_and_Rent 0to200_and_Rent
## 1           65              48

# et cetera.

To that end I have created a function to convert dummy variables to factors. It requires that your variables have a seperator so that it can detect the main variable name from the level. My original function was quite slow and I appreciated the extremely valuable help provided by Christophe Dervieux via this Rstudio community post .

fct_dummy <- function(data, 
                      variables = tidyselect::everything(), 
                      sep = '.') {
  variables <- rlang::enquo(variables)
  # transform to long format the dummy columns
  tmp <- 
    tidyr::pivot_longer(data, 
                        cols = intersect(tidyselect::contains(sep),  
                                         !!variables),
                        names_to = c("groups", "levels"),
                        names_pattern = paste0("^([^'", sep, "]*)[", 
                                               sep, "](.*)"))
  
  # get the groups name for column selection after
  groups <- unique(tmp$groups)
  
  
  # keep only non dummy value and do not keep temp value col
  tmp <- dplyr::select(
    dplyr::filter(tmp, value == 1),
    -value)
  
  # function to return 'multiple' if more than 1 value is present
  ret_multiple <- function(x){
    if(length(x) > 1) return('multiple')
    return(x)
  }
  
  
  # tranform to wide format   
  tmp <- tidyr::pivot_wider(
    tmp,
    names_from = groups, 
    values_from = levels, 
    values_fn = list(levels = ret_multiple))
  
  
  # convert to factors the groups column
  dplyr::mutate_at(
    tmp,
    groups,
    ~ forcats::as_factor(.)
  )
}

Here is the original German Credit data and the converted tibble.

glimpse(GermanCredit)

## Observations: 1,000
## Variables: 62
## $ Duration                               <int> 6, 48, 12, 42, 24, 36, 24, 36,…
## $ Amount                                 <int> 1169, 5951, 2096, 7882, 4870, …
## $ InstallmentRatePercentage              <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, …
## $ ResidenceDuration                      <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, …
## $ Age                                    <int> 67, 22, 49, 45, 53, 35, 53, 35…
## $ NumberExistingCredits                  <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, …
## $ NumberPeopleMaintenance                <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, …
## $ Telephone                              <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, …
## $ ForeignWorker                          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Class                                  <fct> Good, Bad, Good, Good, Bad, Go…
## $ CheckingAccountStatus.lt.0             <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.0.to.200         <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ CheckingAccountStatus.gt.200           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.none             <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, …
## $ CreditHistory.NoCredit.AllPaid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.ThisBank.AllPaid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.PaidDuly                 <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, …
## $ CreditHistory.Delay                    <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ CreditHistory.Critical                 <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
## $ Purpose.NewCar                         <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, …
## $ Purpose.UsedCar                        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Purpose.Furniture.Equipment            <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
## $ Purpose.Radio.Television               <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Purpose.DomesticAppliance              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Repairs                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Education                      <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, …
## $ Purpose.Vacation                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Retraining                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Business                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Other                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.lt.100             <dbl> 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, …
## $ SavingsAccountBonds.100.to.500         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.500.to.1000        <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ SavingsAccountBonds.gt.1000            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ SavingsAccountBonds.Unknown            <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ EmploymentDuration.lt.1                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EmploymentDuration.1.to.4              <dbl> 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, …
## $ EmploymentDuration.4.to.7              <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ EmploymentDuration.gt.7                <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ EmploymentDuration.Unemployed          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ Personal.Male.Divorced.Seperated       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Personal.Female.NotSingle              <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Personal.Male.Single                   <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, …
## $ Personal.Male.Married.Widowed          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ Personal.Female.Single                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.None            <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ OtherDebtorsGuarantors.CoApplicant     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.Guarantor       <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ Property.RealEstate                    <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, …
## $ Property.Insurance                     <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
## $ Property.CarOther                      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ Property.Unknown                       <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Bank             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Stores           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.None             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Housing.Rent                           <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Housing.Own                            <dbl> 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, …
## $ Housing.ForFree                        <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …
## $ Job.UnemployedUnskilled                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Job.UnskilledResident                  <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, …
## $ Job.SkilledEmployee                    <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, …
## $ Job.Management.SelfEmp.HighlyQualified <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …

new_dat <- fct_dummy(GermanCredit)
glimpse(new_dat)

## Observations: 1,000
## Variables: 21
## $ Duration                  <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, …
## $ Amount                    <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6…
## $ InstallmentRatePercentage <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2…
## $ ResidenceDuration         <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4…
## $ Age                       <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25,…
## $ NumberExistingCredits     <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1…
## $ NumberPeopleMaintenance   <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Telephone                 <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1…
## $ ForeignWorker             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Class                     <fct> Good, Bad, Good, Good, Bad, Good, Good, Goo…
## $ CheckingAccountStatus     <fct> lt.0, 0.to.200, none, lt.0, lt.0, none, non…
## $ CreditHistory             <fct> Critical, PaidDuly, Critical, PaidDuly, Del…
## $ Purpose                   <fct> Radio.Television, Radio.Television, Educati…
## $ SavingsAccountBonds       <fct> Unknown, lt.100, lt.100, lt.100, lt.100, Un…
## $ EmploymentDuration        <fct> gt.7, 1.to.4, 4.to.7, 4.to.7, 1.to.4, 1.to.…
## $ Personal                  <fct> Male.Single, Female.NotSingle, Male.Single,…
## $ OtherDebtorsGuarantors    <fct> None, None, None, Guarantor, None, None, No…
## $ Property                  <fct> RealEstate, RealEstate, RealEstate, Insuran…
## $ OtherInstallmentPlans     <fct> None, None, None, None, None, None, None, N…
## $ Housing                   <fct> Own, Own, Own, ForFree, ForFree, ForFree, O…
## $ Job                       <fct> SkilledEmployee, SkilledEmployee, Unskilled…

It is now straightforward to look at groupings of factors.

new_dat %>% 
  count(Housing, CheckingAccountStatus)

## # A tibble: 12 x 3
##    Housing CheckingAccountStatus     n
##    <fct>   <fct>                 <int>
##  1 Own     lt.0                    170
##  2 Own     0.to.200                192
##  3 Own     none                    304
##  4 Own     gt.200                   47
##  5 ForFree lt.0                     39
##  6 ForFree 0.to.200                 29
##  7 ForFree none                     32
##  8 ForFree gt.200                    8
##  9 Rent    lt.0                     65
## 10 Rent    0.to.200                 48
## 11 Rent    none                     58
## 12 Rent    gt.200                    8

We can also see if we have multiples or missing data points

GermanCredit[1, 13]

## [1] 0

GermanCredit[1, 13] <- 1

GermanCredit[1, 57]

## [1] 1

GermanCredit[1, 57] <- 0

new_dat <- fct_dummy(GermanCredit)
new_dat %>% 
  count(Housing, CheckingAccountStatus)

## Warning: Factor `Housing` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## # A tibble: 13 x 3
##    Housing CheckingAccountStatus     n
##    <fct>   <fct>                 <int>
##  1 Own     0.to.200                192
##  2 Own     none                    304
##  3 Own     lt.0                    169
##  4 Own     gt.200                   47
##  5 ForFree 0.to.200                 29
##  6 ForFree none                     32
##  7 ForFree lt.0                     39
##  8 ForFree gt.200                    8
##  9 Rent    0.to.200                 48
## 10 Rent    none                     58
## 11 Rent    lt.0                     65
## 12 Rent    gt.200                    8
## 13 <NA>    multiple                  1

You can also use the variables argument to only convert certain variables.

dat <- rename_all(GermanCredit, str_replace, 
                  pattern = '[.]', replacement = '_')
glimpse(fct_dummy(dat, variables = starts_with('P'), sep = '_'))

## Observations: 1,000
## Variables: 45
## $ Duration                               <int> 6, 48, 12, 42, 24, 36, 24, 36,…
## $ Amount                                 <int> 1169, 5951, 2096, 7882, 4870, …
## $ InstallmentRatePercentage              <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, …
## $ ResidenceDuration                      <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, …
## $ Age                                    <int> 67, 22, 49, 45, 53, 35, 53, 35…
## $ NumberExistingCredits                  <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, …
## $ NumberPeopleMaintenance                <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, …
## $ Telephone                              <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, …
## $ ForeignWorker                          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Class                                  <fct> Good, Bad, Good, Good, Bad, Go…
## $ CheckingAccountStatus_lt.0             <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus_0.to.200         <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ CheckingAccountStatus_gt.200           <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus_none             <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, …
## $ CreditHistory_NoCredit.AllPaid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory_ThisBank.AllPaid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory_PaidDuly                 <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, …
## $ CreditHistory_Delay                    <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ CreditHistory_Critical                 <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
## $ SavingsAccountBonds_lt.100             <dbl> 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, …
## $ SavingsAccountBonds_100.to.500         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds_500.to.1000        <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ SavingsAccountBonds_gt.1000            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ SavingsAccountBonds_Unknown            <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ EmploymentDuration_lt.1                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EmploymentDuration_1.to.4              <dbl> 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, …
## $ EmploymentDuration_4.to.7              <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ EmploymentDuration_gt.7                <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ EmploymentDuration_Unemployed          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ OtherDebtorsGuarantors_None            <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ OtherDebtorsGuarantors_CoApplicant     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors_Guarantor       <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans_Bank             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans_Stores           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans_None             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Housing_Rent                           <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Housing_Own                            <dbl> 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, …
## $ Housing_ForFree                        <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …
## $ Job_UnemployedUnskilled                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Job_UnskilledResident                  <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, …
## $ Job_SkilledEmployee                    <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, …
## $ Job_Management.SelfEmp.HighlyQualified <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ Purpose                                <fct> Radio.Television, Radio.Televi…
## $ Personal                               <fct> Male.Single, Female.NotSingle,…
## $ Property                               <fct> RealEstate, RealEstate, RealEs…

I have not yet decided what to do with this function, if I want to put it in one of my own packages or try and add it to another package. Please feel free to use it you find it useful.

This post uses the German Credit data from the University of California Irving Machine Learning Repository. If you have the caret package installed, it is included.↩