dplyr::filter() - Conditional Extraction from Data Frames
Published Apr 24, 2022 ⋅ Updated Nov 3, 2025 ⋅ 8 minutes read
Note
This old post is translated by AI.
##Introduction
This article explains the filter() function from dplyr! filter() is simple yet deep: depending on how you use it, it can apply quite complex conditions.
This time, rather than the usual "dictionary to look things up in when you're stuck" style, I hope you'll read this as a way to understand the underlying principle of filter() ♪
##Checking Usage
###Basic Usage
Rows that satisfy the condition written inside filter() are extracted.
    r$> starwars %>% dplyr::filter(height > 100)
    # A tibble: 74 × 14
       name      height  mass hair_color skin_color eye_color birth_year sex
       <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
     1 Luke Sky…    172    77 blond      fair       blue            19   male
     2 C-3PO        167    75 NA         gold       yellow         112   none
     3 Darth Va…    202   136 none       white      yellow          41.9 male
     4 Leia Org…    150    49 brown      light      brown           19   fema…
     5 Owen Lars    178   120 brown, gr… light      blue            52   male
     6 Beru Whi…    165    75 brown      light      blue            47   fema…
     7 Biggs Da…    183    84 black      light      brown           24   male
     8 Obi-Wan …    182    77 auburn, w… fair       blue-gray       57   male
     9 Anakin S…    188    84 blond      fair       blue            41.9 male
    10 Wilhuff …    180    NA auburn, g… fair       blue            64   male
    # … with 64 more rows, and 6 more variables: gender <chr>,
    #   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
    #   starships <list>
Info
From here on I write filter() as dplyr::filter(). This avoids unintended collisions with filter() functions from other packages, such as stats::filter(). You usually don't need to worry about this, but functions named filter() and select() exist in many packages, so I spell out the namespace to avoid bugs.
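To make the collision concrete, here is a minimal sketch, assuming only that dplyr is installed. The stats package, which R attaches by default, exports its own filter() for time series, and it has nothing to do with extracting rows. The variable names are made up for this example.

```r
library(dplyr)

# stats::filter() applies a linear filter (here a 3-point moving average)
# to a series. It does not extract rows.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
class(moving_avg)

# Spelling out the namespace always picks the dplyr version.
tall <- starwars %>% dplyr::filter(height > 100)
```

Which filter() you get without the prefix depends on the order in which packages were attached, which is exactly the kind of thing that is easy to get wrong in a long script.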
###Common Usage 1: Comparison Expressions on Numeric Columns
There are endless ways to use it, but here are a few common patterns. The first is filtering with a comparison expression.
For example, to keep rows where mass is greater than 50 in the starwars dataset, do as follows:
starwars %>% dplyr::filter(mass > 50)
The expression can also be more complex, for example with arithmetic on one or both sides, and filter() still works without problems.
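As one possible illustration of a more complex expression, the sketch below computes a body mass index inside the condition. The threshold of 30 and the variable name `heavy` are arbitrary choices for this example, not something from the original post.

```r
library(dplyr)

# Arithmetic inside the condition: body mass index above 30.
# In starwars, height is in centimetres and mass in kilograms.
heavy <- starwars %>%
  dplyr::filter(mass / (height / 100)^2 > 30)
```

Rows where mass or height is NA produce an NA condition and are dropped, just like rows where the condition is FALSE.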
One thing to be careful about: if you apply the >, >=, <=, or < operators to a non-numeric column, you will probably get behavior you did not intend.
starwars %>% mutate(name > 1)
The name column is character type, so character > 1 would clearly be an error in most programming languages. In R, however, it doesn't raise an error: the number is coerced to a string and the two values are compared as text.

    "mojiretsu" > 1
    # TRUE

Division and multiplication properly give errors.

    "mojiretsu" / 1
    # Error in "mojiretsu"/1 : non-numeric argument to binary operator
###Common Usage 2: Match/Mismatch
Judging by whether there's an exact match is also a common pattern.
starwars %>% filter(species == "Droid")
Conversely, when you want to evaluate non-match, use the != operator.
starwars %>% filter(species != "Droid")
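One detail worth knowing with != is how missing values behave. The sketch below is a minimal illustration with made-up variable names: when species is NA, the comparison evaluates to NA, and filter() drops that row along with the FALSE ones.

```r
library(dplyr)

not_droid <- starwars %>% dplyr::filter(species != "Droid")

# Rows where species is NA evaluate to NA, so filter() drops them.
# Add is.na() explicitly if those rows should be kept.
not_droid_or_unknown <- starwars %>%
  dplyr::filter(species != "Droid" | is.na(species))
```

The second result is longer than the first by exactly the number of rows whose species is missing.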
Info
In logical operations (processing to judge TRUE or FALSE), "!" represents negation 🖐️
###Common Usage 3: Extracting Rows Containing a String
For example, suppose you want to extract rows whose name contains "Skywalker" (a partial match) from the starwars dataset. For column names, partial-match selection such as contains("Skywalker") is possible, but that works through a special mechanism called tidyselect. starts_with() and contains() select columns and cannot be applied to rows.
If you want to filter rows by string content, use the stringr package.
Setting aside the logic of why we write it this way, for now, remember filter(str_detect(column_name, "string")) as an idiom.
Also, since str_detect() supports regular expressions, complex filters are possible!
starwars %>% filter(str_detect(name, "^S")) # Pattern matching "starts with S"
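As a concrete instance of the idiom, here is a minimal sketch for the "Skywalker" case mentioned above, assuming stringr is installed. The second half uses fixed(), which asks stringr to treat the pattern as literal text instead of a regular expression; the variable names are made up for this example.

```r
library(dplyr)
library(stringr)

# Partial match: keep rows whose name contains "Skywalker".
skywalkers <- starwars %>%
  dplyr::filter(str_detect(name, "Skywalker"))

# fixed() turns off regex interpretation when the text should match literally.
hyphenated <- starwars %>%
  dplyr::filter(str_detect(name, fixed("-")))
```

str_detect() returns one TRUE or FALSE per row, which is all filter() needs.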
###Common Usage 4: NA Judgment
A common situation in data analysis is data that contains NA (missing values). You can test a specific column for NA with the built-in function is.na().
For example, suppose you have this data frame:
    tibble(var1 = LETTERS[1:10], var2 = c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10))
    # # A tibble: 10 × 2
    #    var1   var2
    #    <chr> <dbl>
    #  1 A         1
    #  2 B         2
    #  3 C         3
    #  4 D        NA
    #  5 E         5
    #  6 F         6
    #  7 G        NA
    #  8 H        NA
    #  9 I         9
    # 10 J        10
If you want to keep only the rows where the var2 column is not NA, negate is.na() inside filter(), that is, filter(!is.na(var2)).
For the simple purpose of "removing rows containing NA", just using na.omit() is easier, but be careful: na.omit() removes every row that contains even one NA in any column ⚠️
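The difference between the two approaches is easier to see with a second column that also has missing values. This is a minimal sketch on a small made-up tibble (`df`, `var3`, and the result names are assumptions for this example):

```r
library(dplyr)

df <- tibble(
  var1 = LETTERS[1:5],
  var2 = c(1, NA, 3, 4, 5),
  var3 = c("x", "y", NA, "z", NA)
)

# Only var2 is checked, so rows with NA in var3 survive.
by_column <- df %>% dplyr::filter(!is.na(var2))

# na.omit() drops every row that has an NA in any column.
all_columns <- na.omit(df)
```

Here by_column keeps four rows, while all_columns keeps only the two rows that are complete in every column.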
I'm adding this part after publishing the article, because I realized "come to think of it, I use this a lot too 💡".
When you extract rows based on a string column that holds many different values, writing the condition gets tedious. If you want "rows matching any of these values", the %in% operator works well.
    read_csv("https://github.com/eggplants/nijisanji-v23d-status/raw/master/result.csv") %>%
      dplyr::filter(name %in% c("葛葉", "社築", "剣持刀也", "でびでび・でびる"))
    # ℹ Use `spec()` to retrieve the full column specification for this data.
    # ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # # A tibble: 4 × 5
    #   name             popularity `2dv2` `2dv3` `3d`
    #   <chr>                 <dbl> <chr>  <chr>  <chr>
    # 1 葛葉                    125 o      o      o
    # 2 剣持刀也                 60 o      x      o
    # 3 社築                     60 o      x      o
    # 4 でびでび・でびる         45 o      o      o
Sometimes you write the list of values by hand like this, but it is smarter to build it semi-automatically, for example with colnames() on another data frame or with the paste() function 🤖
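A minimal sketch of building the match list semi-automatically, using starwars so that it runs offline. The `watchlist` table and the variable names are made up for this example, and it pulls values from a column of another data frame instead of colnames(); the idea is the same.

```r
library(dplyr)

# Build the names with paste0() instead of typing each one.
droid_names <- paste0("R", c(2, 5), "-D", c(2, 4))

# Or take them from a column of another data frame
# (`watchlist` is a hypothetical table for this example).
watchlist <- tibble(name = c("Yoda", "Chewbacca"))

picked <- starwars %>%
  dplyr::filter(name %in% c(droid_names, watchlist$name))
```

Because %in% never returns NA, rows with a missing name are simply treated as non-matches.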
###Convenient Usage: Combining with between()
While researching dplyr::filter() thoroughly, I came across a usage I didn't know 😏 It was in the official documentation all along, though... 💧
Combining it with between() lets you filter a numeric column by "X or more and Y or less" (both ends inclusive) 💡
    starwars %>% dplyr::filter(between(height, 100, 130))
    # # A tibble: 2 × 14
    #   name    height  mass hair_color skin_color eye_color birth_year
    #   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl>
    # 1 Sebulba    112    40 none       grey, red  orange            NA
    # 2 Gasgano    122    NA none       white, bl… black             NA
    # # … with 7 more variables: sex <chr>, gender <chr>,
    # #   homeworld <chr>, species <chr>, films <list>,
    # #   vehicles <list>, starships <list>
I probably won't use it much...
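For comparison, here is a minimal sketch showing that between() is shorthand for two comparisons joined with &. The variable names are made up for this example.

```r
library(dplyr)

with_between <- starwars %>%
  dplyr::filter(between(height, 100, 130))

# The same condition spelled out; both ends are inclusive.
with_operators <- starwars %>%
  dplyr::filter(height >= 100 & height <= 130)
```

Both versions select the same rows, so the choice is mostly about readability.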
###Difficult but Worth Remembering: Combining with group_by()
group_by() itself might be difficult for beginners, but for now just remember that group_by() makes the following steps run separately for each group.
For example, to extract the characters who are taller than the average of their own species in the starwars data, do as follows:
    starwars %>%
      group_by(species) %>%
      dplyr::filter(height > mean(height)) %>%
      # Move species to front for clarity
      relocate(species, .before = name)
    # # A tibble: 6 × 14
    # # Groups:   species [6]
    #   species  name      height  mass hair_color skin_color eye_color
    #   <chr>    <chr>      <int> <dbl> <chr>      <chr>      <chr>
    # 1 Gungan   Roos Tar…    224  82   none       grey       orange
    # 2 Zabrak   Darth Ma…    175  80   none       red        yellow
    # 3 Twi'lek  Bib Fort…    180  NA   none       pale       pink
    # 4 Mirialan Luminara…    170  56.2 black      yellow     blue
    # 5 Kaminoan Lama Su      229  88   none       grey       black
    # 6 Wookiee  Tarfful      234 136   brown      brown      blue
    # # … with 7 more variables: birth_year <dbl>, sex <chr>,
    # #   gender <chr>, homeworld <chr>, films <list>,
    # #   vehicles <list>, starships <list>
Combinations with summary functions such as mean() and max() are probably the most common.
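One caveat when combining filter() with mean(): if a group contains a missing height, mean() returns NA for that group, the comparison becomes NA, and every row of the group is dropped. That is likely why a large species such as Human does not appear in the output above. A minimal sketch of the difference, with made-up variable names:

```r
library(dplyr)

strict <- starwars %>%
  group_by(species) %>%
  dplyr::filter(height > mean(height))

# mean() returns NA when a group contains a missing height,
# which silently drops the whole group. na.rm = TRUE avoids that.
lenient <- starwars %>%
  group_by(species) %>%
  dplyr::filter(height > mean(height, na.rm = TRUE))
```

Whether to use na.rm = TRUE depends on the analysis, so treat this as something to decide deliberately and not as a default.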
Another pattern is using n() to cut groups by their size. n() is a simple function that returns the number of rows in the current group (or in the whole data frame when it is not grouped), but it is handy in subtle ways.
The following example uses species as the grouping variable and keeps only the groups that have two or more rows. This is useful when a categorical (grouping) variable has a large number of miscellaneous values but you still want a rough per-group average.
Info
In bioinformatics, I use this for genus-level analysis of gut microbiome composition. When genus is the grouping variable, many bacteria have only a single data point, and the results get swamped by that miscellaneous information. Filtering with n() > 1 excludes it.
    starwars %>%
      group_by(species) %>%
      dplyr::filter(n() > 1)
    # # A tibble: 58 × 14
    # # Groups:   species [9]
    #    name   height  mass hair_color skin_color eye_color birth_year
    #    <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl>
    #  1 Luke …    172    77 blond      fair       blue            19
    #  2 C-3PO     167    75 NA         gold       yellow         112
    #  3 R2-D2      96    32 NA         white, bl… red             33
    #  4 Darth…    202   136 none       white      yellow          41.9
    #  5 Leia …    150    49 brown      light      brown           19
    #  6 Owen …    178   120 brown, gr… light      blue            52
    #  7 Beru …    165    75 brown      light      blue            47
    #  8 R5-D4      97    32 NA         white, red red             NA
    #  9 Biggs…    183    84 black      light      brown           24
    # 10 Obi-W…    182    77 auburn, w… fair       blue-gray       57
    # # … with 48 more rows, and 7 more variables: sex <chr>,
    # #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
    # #   vehicles <list>, starships <list>
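To connect this with the "rough average" use case described above, here is a minimal sketch that drops single-member groups and then summarises. The column names `members` and `mean_height` are made up for this example.

```r
library(dplyr)

species_means <- starwars %>%
  group_by(species) %>%
  # Keep only species that have at least two characters.
  dplyr::filter(n() > 1) %>%
  summarise(
    members = n(),
    mean_height = mean(height, na.rm = TRUE)
  )
```

Every row of the result describes a species with more than one member, so the averages are no longer dominated by one-off entries.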
At this point it's getting pretty niche, so just come back to this article when you forget.
##Properly Understanding the dplyr::filter() Function
So far I've introduced typical uses of the filter() function. They may look diverse at first glance, but filter() works on one consistent principle: it takes a logical vector as its argument and keeps only the rows where that vector is TRUE.
For example, let's think about the very first example.
starwars %>% dplyr::filter(height > 100)
This example kept the rows where height is greater than 100. In fact, the expression height > 100 on its own already returns a logical vector.
To make this visible, let's store the result of height > 100 in a new column using mutate().
    starwars %>%
      mutate(height_greater_than_100 = height > 100, .before = name)
    # # A tibble: 87 × 15
    #    height_greater_th… name  height  mass hair_color skin_color
    #    <lgl>              <chr>  <int> <dbl> <chr>      <chr>
    #  1 TRUE               Luke…    172    77 blond      fair
    #  2 TRUE               C-3PO    167    75 NA         gold
    #  3 FALSE              R2-D2     96    32 NA         white, bl…
    #  4 TRUE               Dart…    202   136 none       white
    #  5 TRUE               Leia…    150    49 brown      light
    #  6 TRUE               Owen…    178   120 brown, gr… light
    #  7 TRUE               Beru…    165    75 brown      light
    #  8 FALSE              R5-D4     97    32 NA         white, red
    #  9 TRUE               Bigg…    183    84 black      light
    # 10 TRUE               Obi-…    182    77 auburn, w… fair
    # # … with 77 more rows, and 9 more variables: eye_color <chr>,
    # #   birth_year <dbl>, sex <chr>, gender <chr>,
    # #   homeworld <chr>, species <chr>, films <list>,
    # #   vehicles <list>, starships <list>
The column we just added is a logical vector (TRUE, TRUE, FALSE, ...), right? The trick is simply that the rows where this column is TRUE were extracted. Since any logical vector will do, you can even pass one directly as the argument. Let's try giving filter() a logical vector that is TRUE for the first two rows and FALSE for the remaining 85.
    starwars %>% dplyr::filter(c(TRUE, TRUE, rep(FALSE, 85)))
    # # A tibble: 2 × 14
    #   name           height  mass hair_color skin_color eye_color
    #   <chr>           <int> <dbl> <chr>      <chr>      <chr>
    # 1 Luke Skywalker    172    77 blond      fair       blue
    # 2 C-3PO             167    75 NA         gold       yellow
    # # … with 8 more variables: birth_year <dbl>, sex <chr>,
    # #   gender <chr>, homeworld <chr>, species <chr>,
    # #   films <list>, vehicles <list>, starships <list>
As shown, only the first two rows were extracted. So although the conditions you pass to filter() look diverse, anything that produces a logical vector of TRUE/FALSE values with one entry per row is fine. This is why filter() can be used so flexibly.
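The same principle can be checked from the other direction. In this minimal sketch (variable names are made up), the logical vector is computed outside filter() and passed in as-is, and the result matches the version with the expression written inline.

```r
library(dplyr)

# Compute the logical vector outside of filter() ...
keep <- starwars$height > 100

# ... and hand it over as-is. NA entries are treated like FALSE.
from_vector <- starwars %>% dplyr::filter(keep)
from_expression <- starwars %>% dplyr::filter(height > 100)
```

This works because keep has one entry per row of starwars; a vector of another length (other than 1) would be rejected.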
###Applied Example: Using a Logical Vector Column for Filter Conditions
Next, I'll introduce a technique I use to prevent bugs 👮‍♂️
You can filter on complex conditions by combining logical operations (and / or / not), but honestly I'm not great at it 😓 When several logical operators are combined, the expression becomes complex like the one below, and when you look back at it later it is hard to tell what is being filtered.
starwars %>% dplyr::filter(height > 100 & mass < 60 & species == "Human")
Code like this is a likely source of bugs. When you want to filter on complex conditions, first preparing each condition as its own logical column with mutate(), and then filtering on those columns, makes the code easier to understand 😮
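A minimal sketch of that idea, applied to the three conditions from the example above. The column names `is_tall`, `is_light`, and `is_human` are made up for this example; the original post's own code for this step is not available here.

```r
library(dplyr)

flagged <- starwars %>%
  # Give every condition a name by storing it as a logical column.
  mutate(
    is_tall  = height > 100,
    is_light = mass < 60,
    is_human = species == "Human"
  ) %>%
  # Comma-separated conditions in filter() are combined with AND.
  dplyr::filter(is_tall, is_light, is_human)

# The one-line version, for comparison.
direct <- starwars %>%
  dplyr::filter(height > 100 & mass < 60 & species == "Human")
```

A side benefit is that you can stop before the filter() step and inspect the flag columns to confirm each condition does what you expect.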
##Latest Information: Combining with if_any() if_all()
dplyr::if_any() and dplyr::if_all() are relatively new functions, added in 2021 🌞 Honestly I'm not using them in practice yet, so here I'll only briefly introduce how they combine with dplyr::filter().
Using dplyr::if_all(), you can apply the same filter condition to multiple columns at once. For example, to apply the condition "greater than 50" to all numeric columns, do as follows:
    starwars %>%
      dplyr::filter(
        if_all(where(is.numeric), function(x) x > 50)
      )
Info
We're using an anonymous function for the x > 50 part.
The if_all() used here keeps a row only when the condition is TRUE for all of the selected columns. if_any(), on the other hand, keeps a row when the condition is TRUE for at least one of them. The naming is easy to understand, which is nice 😄
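To see the two side by side, here is a minimal sketch. The choice of the three colour columns and the value "brown" is an arbitrary example, and the variable names are made up.

```r
library(dplyr)

# if_any(): keep rows where at least one colour column is "brown".
any_brown <- starwars %>%
  dplyr::filter(
    if_any(c(hair_color, skin_color, eye_color), function(x) x == "brown")
  )

# if_all(): keep rows where every numeric column is greater than 50.
all_large <- starwars %>%
  dplyr::filter(
    if_all(where(is.numeric), function(x) x > 50)
  )
```

Missing values follow the usual logical rules: with if_all(), a single NA among the selected columns is enough to drop the row, while with if_any() one TRUE is enough to keep it.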
Caution
Before if_any() and if_all() existed, filter_at() and filter_all() were used for these operations. In dplyr's lifecycle terminology they are now superseded: they still work, but they are no longer actively developed, so let's use if_any() and if_all() in new code.
##Summary
So, that was an introduction to the dplyr::filter() function! In data transformation with the tidyverse, filter() is one of the three most frequently used functions. I hope this article helps you use filter() even more comfortably than before ♪