Why Factor is one of the most amazing things in R & forcats helps you realize it (2024)

Published in

learn data science

Mar 20, 2017

How many categories and how are they all doing?

A simple example would be US State names like ‘California’, ‘New York’, ‘Texas’, etc. We know that there are always 50 of them (or maybe more when including special districts). Even when we happen to have no data for some of the states, we sometimes still want to see all 50 states listed so that we can tell which states have data and which don’t. In R, by making this ‘US State’ column a Factor, we can keep all 50 states as 50 levels regardless of whether each state has data in a given data frame.
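Here is a minimal sketch of that idea in base R. The ‘observed’ vector is made-up sample data; ‘state.name’ is a vector of the 50 US state names that ships with R.

```r
# Hypothetical sample: observations exist for only three states.
observed <- c("California", "Texas", "California", "New York")

# state.name is a built-in character vector with the 50 US state names.
state <- factor(observed, levels = state.name)

# All 50 states are registered as levels, even the ones with no rows.
nlevels(state)

# table() reports every level, so states without data show up as 0.
counts <- table(state)
counts[c("California", "Texas", "Alaska")]
```

A plain character column would only know about the three states that actually appear, so the states with no data would silently vanish from any count or chart.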

How are categories sorted (ordered)?

Another example would be the day of week names, such as ‘Sunday’, ‘Monday’, ‘Tuesday’, ‘Wednesday’, etc. There are 7 of them (or 7 levels). In this case, though, we care not only about the number of levels but also about the order of the values. For example, when we visualize these, we would expect them to be sorted as ‘Sunday’, ‘Monday’, ‘Tuesday’, ‘Wednesday’, and so on, instead of alphabetically, such as ‘Friday’, ‘Monday’, ‘Saturday’, ‘Sunday’, etc.

Again, by converting this ‘Day of Week’ column to Factor data type, we can not only register the 7 days of the week as 7 levels but also define the order of how they should be sorted appropriately.
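As a rough illustration in base R, with a made-up ‘days’ vector, the difference between sorting text and sorting a Factor looks like this:

```r
# Hypothetical sample of day names in no particular order.
days <- c("Monday", "Friday", "Sunday", "Wednesday", "Monday")

# The order we want, registered once as the factor levels.
week <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")

# Sorting plain text is alphabetical, so Friday comes first.
sorted_chr <- sort(days)

# Sorting the factor follows the level order instead.
day_fct <- factor(days, levels = week)
sorted_days <- as.character(sort(day_fct))
sorted_days
```

Anything that respects factor levels (sorting, tables, chart axes) picks up the same order without further configuration.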

Factor data type alone separates R from other BI tools

So basically, with Factor data type, we can register the levels (the set of categories) and their order as part of the columns (or variables) natively, so that the columns themselves dictate how such level and sorting information is handled. This is a huge advantage, especially compared to other tools like Excel or typical BI tools.

A gift from God — forcats package

But the only problem is, it was not so straightforward to assemble and manage such levels and order with Factor for many of us. The typical syntax for defining the levels and order would look something like this.

factor(criteria, levels = c(1,2,3), labels = c("low", "medium", "high"), ordered = TRUE)

There’s a lot going on in this one function call, and it is a bit intimidating.
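To unpack that call, here is a small sketch. The ‘criteria’ values are hypothetical numeric codes that I made up for illustration.

```r
# Hypothetical numeric codes, e.g. from a survey export.
criteria <- c(2, 3, 1, 2)

rating <- factor(
  criteria,
  levels  = c(1, 2, 3),                 # the values expected in the data
  labels  = c("low", "medium", "high"), # what each value is displayed as
  ordered = TRUE                        # levels have a rank: low < medium < high
)

as.character(rating)  # the codes are now shown by their labels
levels(rating)        # level order follows the 'levels' argument

# Because the factor is ordered, comparisons against a level work.
below_high <- rating < "high"
below_high
```

So ‘levels’ says which raw values to expect and in what order, ‘labels’ renames them, and ‘ordered’ makes the order meaningful for comparisons.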

But then, in the middle of last summer, a package called ‘forcats’ was delivered by a god of ‘tidyverse’, Hadley Wickham. And just like anything else he touches, all of a sudden it not only became much easier to work with Factor, but Factor also became an essential part of my data analysis flow, ranging from visualizing data to building machine learning models.

In this post, I’m going to demonstrate why Factor should be your best friend and how ‘forcats’ package makes the journey of knowing Factor data type super easy and fun. I’m going to use Exploratory as a front end (UI), but obviously, you can do the same things in RStudio or other tools as well.

Take a look at the chart below. It shows the similarities among countries in their United Nations General Assembly voting history, using data I downloaded from here. Each Scatter chart covers the years one of the past US Presidents served.

Now, you would notice that those Scatter charts are actually sorted by the US President names alphabetically, not by the years they served. It starts with ‘Barack Obama’ and ends with ‘Ronald Reagan’. But obviously, it would be much easier to see them being sorted by the years so that we can see the trend by time.

The US President names were originally from another data frame called ‘presidents’ like below, and it was later joined to the main data frame with ‘left_join’ command from ‘dplyr’ package.

Luckily, there is a column called ‘year’ so we can use this column to define the order of the US President names for ‘PRESIDENT’ column with Factor.

There are two ways to do this.

One is to sort the data by ‘year’ column first. Then use ‘fct_inorder’ function from ‘forcats’ package.

fct_inorder(PRESIDENT)

This function sets the order based on the original order in the data set.
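A minimal sketch of this first approach follows. The small ‘presidents’ data frame is made up for illustration and holds only each president’s first year in office.

```r
library(forcats)

# Hypothetical lookup table; rows are deliberately out of order.
presidents <- data.frame(
  year = c(2009, 1981, 1993),
  PRESIDENT = c("Barack Obama", "Ronald Reagan", "Bill Clinton"),
  stringsAsFactors = FALSE
)

# Step 1: sort the rows by year.
presidents <- presidents[order(presidents$year), ]

# Step 2: levels follow the order in which values first appear.
presidents$PRESIDENT <- fct_inorder(presidents$PRESIDENT)

levels(presidents$PRESIDENT)
```

Note that the result depends on the row order at the moment ‘fct_inorder’ runs, so the sort has to come first.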

Another way is to use ‘fct_reorder’ function, which can take another column as a reference for the order, so the original data doesn’t need to be sorted beforehand.

fct_reorder(PRESIDENT, YEAR, .fun = first)

I’m setting an aggregate function called ‘first’ (from ‘dplyr’) to pick the first value of ‘year’ for each president so that the first year of each president is used to define the order of the President names. Recent forcats releases name this argument ‘.fun’; when this post was first written it was spelled ‘fun’.
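Here is a small sketch of the second approach, assuming a recent forcats release where the argument is spelled ‘.fun’. The ‘votes’ data frame is made up, and I use ‘min’ in place of ‘first’ so that the example needs only forcats and picks the earliest year even when the rows are not sorted.

```r
library(forcats)

# Hypothetical rows in no particular order.
votes <- data.frame(
  PRESIDENT = c("Barack Obama", "Ronald Reagan", "Bill Clinton",
                "Ronald Reagan", "Barack Obama"),
  YEAR = c(2010, 1985, 1994, 1982, 2009),
  stringsAsFactors = FALSE
)

# Summarise YEAR per president with min(), then order levels by that value.
votes$PRESIDENT <- fct_reorder(votes$PRESIDENT, votes$YEAR, .fun = min)

levels(votes$PRESIDENT)
```

The rows themselves stay where they are; only the level order of the column changes.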

You can go to Summary view and confirm the new order as well.

Either way, by going back to Small Multiple, we can now see the scatter charts being sorted based on the order that was set at the US President column level.

As you have seen, once you set the order rule to the column, which is ‘PRESIDENT’ in this case, everything else including the charts will start respecting the order. This means you don’t need to configure such order rules separately for each chart.

Reverse the order

If we want to show the US Presidents from the latest to the oldest, then we can simply use ‘fct_rev’ from ‘forcats’ package to reverse the original order we have set above.

fct_rev(PRESIDENT)
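As a quick sketch with a made-up three-president factor that is already in chronological order:

```r
library(forcats)

# Hypothetical factor whose levels run from oldest to latest.
chronological <- c("Ronald Reagan", "Bill Clinton", "Barack Obama")
president <- factor(chronological, levels = chronological)

# fct_rev() flips the level order; the values themselves are untouched.
latest_first <- fct_rev(president)

levels(latest_first)
```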

When you have a column with many unique categorical values and assign it to Color, you will end up with a chart like below.

Here, I have assigned US State Code column to Color to show the ratio of each airline carrier’s flights by US state for this particular time period. But obviously, it’s hard to compare the states inside each of the bars. So, typically what we would end up doing is moving the states with small ratios into an ‘Other’ bucket so that we can compare the major states.

This is when the ‘fct_lump’ function from ‘forcats’ package comes to the rescue. It keeps only the top N categories, which in this case are US states, and moves everything else into an ‘Other’ bucket.

fct_lump(state, n=5)

Now you can see only CA (California), FL (Florida), GA (Georgia), IL (Illinois), TX (Texas), and Other in colors like below, and it’s much easier to compare those top 5 states in each carrier.
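To see what that does outside of a chart, here is a minimal sketch. The state counts are invented for illustration and are not the flight data used in this post.

```r
library(forcats)

# Hypothetical flight counts per state: five large states, three small ones.
state <- rep(
  c("CA", "TX", "FL", "GA", "IL", "HI", "AK", "VT"),
  times = c(10, 9, 8, 7, 6, 2, 1, 1)
)

# Keep the 5 most frequent states; everything else becomes "Other".
state_top5 <- fct_lump(state, n = 5)

levels(state_top5)
table(state_top5)
```

The three small states are merged into one ‘Other’ level, which is added at the end of the level list.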

By the way, as I discussed in the following blog post about Anomaly Detection, being able to create an ‘Other’ bucket is useful when building machine learning models as well, because some categorical values with small ratios might not have enough data to build the models.

Introduction to Anomaly Detection in R with Exploratory (blog.exploratory.io): One of the latest and exciting additions to Exploratory is Anomaly Detection support, which is literally to detect…

Keep Top 5 for Each Group with Group By

We can switch the Y-Axis calculation to ‘% of Total’ for the above chart. This will make it easier to see the ratio of the states like below.

Now when you look at ‘Hawaiian’ airline, though, you will notice that most of its flights fall into the ‘Other’ group (Blue color). This is because ‘fct_lump’ picked the top 5 states against the entire data set, and those 5 states happen to account for less than 10% of the flights for Hawaiian airline. This means that we have just lost valuable information, especially for this carrier. 😱

Not to worry. If you are a ‘frequent’ reader of this blog, you know we can simply add ‘group_by’ step to group the data frame before the ‘top 5’ frequent calculation.

Before adding the ‘group_by’ step, click ‘Pin’ button first to pin the chart to the final result step.

Now, go to the step before the step where we used ‘fct_lump’ function, that is ‘Separate’ step in this example. Then, select ‘Group By’ from Add button menu and select ‘CARRIER’ column to group the data by the carrier.

Once the command is run, the ‘Other’ calculation with ‘fct_lump’ function will be done automatically, and you will see the top 5 states and ‘Other’ in each of the carrier bars. 🎉

We can see that ‘HI (Hawaii)’ is the most frequent state for Hawaiian airline, which is kind of expected.
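In plain R, the same before and after comparison can be sketched roughly like this. The ‘flights’ data frame and its counts are made up, and I keep the top 2 states instead of the top 5 only to keep the sample small. I also convert the result back to text inside ‘mutate’ so that the groups combine cleanly even though each one ends up with different levels.

```r
library(dplyr)
library(forcats)

# Hypothetical flights: Hawaiian flies mostly from HI, Delta mostly from GA.
flights <- data.frame(
  CARRIER = rep(c("Hawaiian", "Delta"), times = c(10, 18)),
  state = c(
    rep(c("HI", "CA", "NV", "WA"), times = c(5, 3, 1, 1)),
    rep(c("GA", "FL", "CA", "TX", "NY"), times = c(8, 6, 2, 1, 1))
  ),
  stringsAsFactors = FALSE
)

# Without grouping: the top states are picked across all carriers.
global <- flights %>%
  mutate(state_top = as.character(fct_lump(state, n = 2)))

# With group_by: the top states are picked inside each carrier.
grouped <- flights %>%
  group_by(CARRIER) %>%
  mutate(state_top = as.character(fct_lump(state, n = 2))) %>%
  ungroup()

hawaiian_global  <- unique(global$state_top[global$CARRIER == "Hawaiian"])
hawaiian_grouped <- unique(grouped$state_top[grouped$CARRIER == "Hawaiian"])

hawaiian_global   # every Hawaiian flight was lumped into "Other"
hawaiian_grouped  # HI survives once the lumping is done per carrier
```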

Now you might want to control the way the airline carrier names are sorted on the X-Axis.

Let’s say we want to show ‘JetBlue’, ‘Southwest’, and ‘Virgin’ first then show the rest as is. For this, we can use the ‘fct_relevel’ function from ‘forcats’ package like below.

fct_relevel(name, "JetBlue", "Southwest", "Virgin")
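A minimal sketch with a made-up list of carrier names shows how the named levels move to the front while the rest keep their existing order:

```r
library(forcats)

# Hypothetical carrier names; factor() sorts the levels alphabetically.
name <- factor(c("American", "Delta", "JetBlue",
                 "Southwest", "United", "Virgin"))

# Move the three named levels to the front, in the order given.
name_reordered <- fct_relevel(name, "JetBlue", "Southwest", "Virgin")

levels(name_reordered)
```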

If you have a ‘Group By’ step in your data wrangling pipeline, make sure that you do this operation before the step because you don’t want to set a different sorting order in each group.

Setting the base level of the categorical data is critical for some of the machine learning algorithms. For example, when you run ‘Survival Analysis — Cox Regression Model’ for your customer retention analysis, you will see the result like below.

The ‘Hazard Ratio’ column under Parameter Estimate shows how much more (or less) likely customers are to quit your service depending on the country where they live or the operating system they use. But a ratio against what, right? Well, these values are relative to the ‘base’ levels, which are shown under the ‘Summary of Fit’ table above.

For example, the first line, ‘India’, shows 1.2534, which means that the users from India are about 25% more likely to quit than the ones from the United States, which is the base level for ‘country’.

1.2534 - 1 = 0.2534

So it’s important to know what the base levels are for each of the predictor columns when working with this type of machine learning model.

Now, what if you want to set the base level to something else? For example, let’s say your primary customers are in Japan and on Mac OS, and you want to understand other customers in comparison to that primary customer type. The base level is essentially the first level in the factor column, and you can easily set it manually with ‘fct_relevel’ function from ‘forcats’ package.

fct_relevel(country, "Japan")
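Here is a small sketch of what changes. The ‘country’ values are made up; the column name ‘country_japan_as_baseline’ is the one used in this post.

```r
library(forcats)

# Hypothetical customer countries with United States as the first level.
country <- factor(
  c("United States", "India", "Japan", "India", "United States"),
  levels = c("United States", "India", "Japan")
)

# The base level is simply the first level.
base_before <- levels(country)[1]

# Move Japan to the front so it becomes the new base level.
country_japan_as_baseline <- fct_relevel(country, "Japan")
base_after <- levels(country_japan_as_baseline)[1]

base_before
base_after
levels(country_japan_as_baseline)
```

With R’s default treatment contrasts, modeling functions compare every other level against the first one, which is why moving ‘Japan’ to the front is enough to change what the estimates are relative to.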

In Summary view, you can see that a new column called ‘country_japan_as_baseline’ has been created with ‘Japan’ as the ‘Base Level’. Compare it to the original column ‘country’, which shows ‘United States’ as the ‘Base Level’.

And when you use this ‘country_japan_as_baseline’ column for building the Cox Regression model again, you will see the Hazard Ratio being re-calculated based on this new base level.

And now we can understand that the users from India are 44% more likely to quit compared to the users from Japan. 😱
