dplyr::filter() - Conditional Extraction from Data Frames

Published Apr 24, 2022
Updated Nov 3, 2025
8 minutes read
Note

This old post is translated by AI.

##Introduction

This article explains the filter() function from dplyr! filter() looks simple, but it goes deep: depending on how you write the condition, it can express quite complex filters.

This time, rather than the usual "look it up when you're stuck" reference style, I hope you'll read this with the goal of understanding the principle behind filter() ♪

##Checking Usage

###Basic Usage

Rows that satisfy the condition written inside filter() are extracted.

pacman::p_load(tidyverse)
starwars %>%
    dplyr::filter(height > 100)
# A tibble: 74 × 14
   name      height  mass hair_color skin_color eye_color birth_year sex
   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke Sky…    172    77 blond      fair       blue            19   male
 2 C-3PO        167    75 NA         gold       yellow         112   none
 3 Darth Va…    202   136 none       white      yellow          41.9 male
 4 Leia Org…    150    49 brown      light      brown           19   fema…
 5 Owen Lars    178   120 brown, gr… light      blue            52   male
 6 Beru Whi…    165    75 brown      light      blue            47   fema…
 7 Biggs Da…    183    84 black      light      brown           24   male
 8 Obi-Wan …    182    77 auburn, w… fair       blue-gray       57   male
 9 Anakin S…    188    84 blond      fair       blue            41.9 male
10 Wilhuff …    180    NA auburn, g… fair       blue            64   male
# … with 64 more rows, and 6 more variables: gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

Info

From here on I write filter() as dplyr::filter(). This avoids accidental collisions with filter() functions from other packages. You usually don't need to worry about this, but filter() and select() are names that many packages reuse, so I add the prefix to avoid bugs.
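
To make the collision concrete, here is a small sketch. The tibble `small` is made up for illustration. stats::filter() is base R's linear filter for time series, a completely different function that happens to share the name.

```r
library(dplyr)

# Hypothetical toy data, not from the article.
small <- tibble(x = 1:5)

# dplyr's filter(): keeps the rows where the condition is TRUE.
kept <- dplyr::filter(small, x > 3)

# stats' filter(): applies a linear (here moving-average) filter to a series.
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

kept
smoothed
```

Which one a bare filter() resolves to depends on which packages are attached and in what order, so the prefix removes the guesswork.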

###Common Usage 1: Comparison Expressions on Numeric Columns

There are endless ways to use it, but here are a few common patterns. The first is a comparison written as an expression.

For example, to keep rows where mass is greater than 50 in the starwars dataset, do as follows:

starwars %>%
    dplyr::filter(mass > 50)

The expression can be as complex as the following and still filter without problems.

starwars %>%
    filter(log10(mass * height / 100) * pi > 7)

One thing to be careful about: if you apply the >, >=, <=, or < operators to a non-numeric column, you will get behavior you probably didn't intend.

starwars %>%
    mutate(name > 1)

The name column is character type, so character > 1 would clearly be an error in many programming languages. In R, however, it doesn't error: the number is converted to a string and the two are compared as text.

"mojiretsu" > 1
# TRUE

Arithmetic operators such as division and multiplication, on the other hand, do raise an error as expected.

"mojiretsu" / 1
#  Error in "mojiretsu"/1 : non-numeric argument to binary operator
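
A minimal sketch of why this silent comparison is dangerous. The values are made up. The point is that the number is turned into a string first, so the result follows text ordering rather than numeric ordering.

```r
# The number on the right is coerced to the string "9" before comparing.
chr_vs_num <- "10" > 9

# Converting to numeric first gives the comparison you probably meant.
num_vs_num <- as.numeric("10") > 9

chr_vs_num  # FALSE
num_vs_num  # TRUE
```

If a column that should be numeric was read in as character, checking it with is.numeric() before filtering is a cheap safeguard.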

###Common Usage 2: Match/Mismatch

Testing for an exact match is another common pattern.

starwars %>%
    filter(species == "Droid")

Conversely, to keep the rows that do not match, use the != operator.

starwars %>%
    filter(species != "Droid")

Info

In logical operations (operations that evaluate to TRUE or FALSE), "!" means negation 🖐️
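
One detail that is easy to miss with != is how missing values behave. The sketch below uses a made-up three-row tibble: a comparison with NA evaluates to NA, and filter() only keeps rows where the condition is TRUE.

```r
library(dplyr)

# Hypothetical toy data with a missing species.
df <- tibble(
    name    = c("unit-1", "person-1", "unknown-1"),
    species = c("Droid", "Human", NA)
)

# NA != "Droid" evaluates to NA, so the row with a missing species is dropped too.
not_droid <- df %>%
    dplyr::filter(species != "Droid")

# Keep the missing rows explicitly when that is what you want.
not_droid_or_na <- df %>%
    dplyr::filter(species != "Droid" | is.na(species))

not_droid$name        # "person-1"
not_droid_or_na$name  # "person-1" "unknown-1"
```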

###Common Usage 3: Extracting Rows Containing a String

For example, suppose you want to extract rows whose name contains "Skywalker" (a partial match) from the starwars dataset. For column names, partial-match selection such as contains("Skywalker") is possible, but that works through a special mechanism called tidyselect. Helpers like starts_with() and contains() select columns; they cannot be used to pick rows.

To filter rows by string content, use the stringr package.

starwars %>%
    filter(str_detect(name, "Skywalker"))

Setting aside why it's written this way for now, just remember filter(str_detect(column_name, "string")) as an idiom.

Also, since str_detect() supports regular expressions, complex filters are possible!

starwars %>%
    filter(str_detect(name, "^S")) # Pattern matching "starts with S"
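
To see what str_detect() hands to filter(), it can help to run it on a plain character vector. This is a small sketch with made-up names. regex() with ignore_case = TRUE is stringr's way to make a pattern case-insensitive.

```r
library(stringr)

# Hypothetical character vector for illustration.
names_vec <- c("Luke Skywalker", "Shmi Skywalker", "Sebulba", "Han Solo")

has_skywalker <- str_detect(names_vec, "Skywalker")
starts_with_s <- str_detect(names_vec, "^S")
# Case-insensitive match: "solo" also matches "Solo".
has_solo <- str_detect(names_vec, regex("solo", ignore_case = TRUE))

has_skywalker  # TRUE TRUE FALSE FALSE
starts_with_s  # FALSE TRUE TRUE FALSE
has_solo       # FALSE FALSE FALSE TRUE
```

Each result is just a TRUE/FALSE vector with one entry per element, which is exactly the shape filter() needs.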

###Common Usage 4: Checking for NA

A common situation in data analysis is data that contains NA (missing values). You can test a specific column for NA with the built-in function is.na().

For example, suppose you have this data frame:

tibble(var1 = LETTERS[1:10], var2 = c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10))
# # A tibble: 10 × 2
# #    var1   var2
# #    <chr> <dbl>
# #  1 A         1
# #  2 B         2
# #  3 C         3
# #  4 D        NA
# #  5 E         5
# #  6 F         6
# #  7 G        NA
# #  8 H        NA
# #  9 I         9
# # 10 J        10

If you want to keep only the rows where the var2 column is not NA, use filter() like this:

tibble(var1 = LETTERS[1:10], var2 = c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10)) %>%
    dplyr::filter(!is.na(var2))
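
You might wonder why is.na() is needed instead of writing var2 == NA. A minimal sketch on a made-up vector shows the reason: any comparison with NA yields NA, never TRUE or FALSE.

```r
x <- c(1, NA, 3)

# Comparing with NA yields NA for every element, so it is useless as a filter condition.
eq_na <- x == NA

# is.na() returns a proper TRUE/FALSE vector.
is_missing <- is.na(x)

eq_na       # NA NA NA
is_missing  # FALSE TRUE FALSE
```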

If all you want is to remove rows containing NA, na.omit() is easier, but be careful: na.omit() removes every row that has an NA in any column, not just the column you care about ⚠️

tibble(var1 = LETTERS[1:10], var2 = c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10)) %>%
   mutate(var3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1))  %>%
   na.omit()
 
# # A tibble: 1 × 3
#   var1   var2  var3
#   <chr> <dbl> <dbl>
# 1 J        10     1
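
If you want something in between, one option (not covered in the original article, so treat it as a side note) is tidyr::drop_na(), which accepts column names. The sketch below uses a smaller made-up tibble.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: var3 is almost entirely NA.
df <- tibble(
    var1 = LETTERS[1:4],
    var2 = c(1, NA, 3, 4),
    var3 = c(NA, NA, NA, 1)
)

# Drops rows only when var2 is NA; NAs in var3 are ignored.
by_column <- df %>% drop_na(var2)

# With no columns given, drops rows with an NA in any column, like na.omit().
any_column <- df %>% drop_na()

by_column$var1   # "A" "C" "D"
any_column$var1  # "D"
```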

###Common Usage 5: Matching Any of Several Values

I'm adding this after publishing the article because, come to think of it, I use this a lot too 💡

When you extract rows based on a character column that holds many different values, writing the conditions one by one gets tedious. If you want "rows that match any of these values", the %in% operator is a good fit.

read_csv("https://github.com/eggplants/nijisanji-v23d-status/raw/master/result.csv") %>%
    dplyr::filter(name %in% c("Kuzuha", "Yashiro", "Kenmochi", "Debi-debi Devil"))
 
# ℹ Use `spec()` to retrieve the full column specification for this data.
# ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 4 × 5
#   name             popularity `2dv2` `2dv3` `3d`
#   <chr>                 <dbl> <chr>  <chr>  <chr>
# 1 Kuzuha                  125 o      o      o
# 2 Kenmochi                 60 o      x      o
# 3 Yashiro                  60 o      x      o
# 4 Debi-debi Devil          45 o      o      o

Data source: nijisanji-v23d-status https://github.com/eggplants/nijisanji-v23d-status/blob/master/result.csv

Here the match list is written by hand, but it's often smarter to build it semi-automatically, for example from the colnames() of another data frame or with the paste() function 🤖
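
As a rough sketch of the semi-automatic idea, the match list can come straight from a column of another data frame. Both tibbles here (`members`, `wanted`) are made up for illustration.

```r
library(dplyr)

# Hypothetical data: a full table and a separate table of names we care about.
members <- tibble(name = c("A", "B", "C", "D"), score = c(10, 20, 30, 40))
wanted  <- tibble(name = c("B", "D", "Z"))

# Keep rows whose name appears in the other table.
matched <- members %>%
    dplyr::filter(name %in% wanted$name)

# Negate with ! to keep the rows that do not appear.
unmatched <- members %>%
    dplyr::filter(!name %in% wanted$name)

matched$name    # "B" "D"
unmatched$name  # "A" "C"
```

Names in the match list that don't exist in the data ("Z" here) are simply ignored.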

###Convenient Usage: Combining with between()

When I researched dplyr::filter() usage thoroughly, I found something I didn't know 😏 It was right there in the official documentation, though... 💧

https://dplyr.tidyverse.org/reference/between.html

Combining with between() lets you filter numeric columns for "at least X and at most Y" (both ends are included) 💡

starwars %>%
    dplyr::filter(between(height, 100, 130))
 
# # A tibble: 2 × 14
#   name    height  mass hair_color skin_color eye_color birth_year
#   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl>
# 1 Sebulba    112    40 none       grey, red  orange            NA
# 2 Gasgano    122    NA none       white, bl… black             NA
# # … with 7 more variables: sex <chr>, gender <chr>,
# #   homeworld <chr>, species <chr>, films <list>,
# #   vehicles <list>, starships <list>

I probably won't use it much...
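
For reference, here is a small sketch on a made-up vector showing that between() gives the same result as writing the two comparisons by hand, including at the boundaries.

```r
library(dplyr)

# Hypothetical heights, including the boundary values and a missing value.
h <- c(99, 100, 115, 130, 131, NA)

with_between   <- between(h, 100, 130)
with_operators <- h >= 100 & h <= 130

with_between  # FALSE TRUE TRUE TRUE FALSE NA
identical(with_between, with_operators)  # TRUE
```

So it's mostly a readability choice; the NA entry stays NA either way and would be dropped by filter().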

###Difficult but Worth Remembering: Combining with group_by()

group_by() itself might be difficult for beginners, but for now just remember that group_by() lets you process data group by group.

For example, to extract "characters taller than the average of their own species" from the starwars data, do as follows:

starwars %>%
    group_by(species) %>%
    dplyr::filter(height > mean(height)) %>%
    # Move species to front for clarity
    relocate(species, .before = name)
 
# # A tibble: 6 × 14
# # Groups:   species [6]
#   species  name      height  mass hair_color skin_color eye_color
#   <chr>    <chr>      <int> <dbl> <chr>      <chr>      <chr>
# 1 Gungan   Roos Tar…    224  82   none       grey       orange
# 2 Zabrak   Darth Ma…    175  80   none       red        yellow
# 3 Twi'lek  Bib Fort…    180  NA   none       pale       pink
# 4 Mirialan Luminara…    170  56.2 black      yellow     blue
# 5 Kaminoan Lama Su      229  88   none       grey       black
# 6 Wookiee  Tarfful      234 136   brown      brown      blue
# # … with 7 more variables: birth_year <dbl>, sex <chr>,
# #   gender <chr>, homeworld <chr>, films <list>,
# #   vehicles <list>, starships <list>

Combinations with summary functions such as mean() and max() are probably the most common.
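
One caveat with the grouped mean() pattern: if a group contains an NA, mean() returns NA for that group, every comparison becomes NA, and the whole group disappears from the result. A minimal sketch on made-up data, using the standard na.rm = TRUE argument of mean():

```r
library(dplyr)

# Hypothetical toy data: group "a" has one missing height.
df <- tibble(
    group  = c("a", "a", "a", "b", "b"),
    height = c(150, 170, NA, 100, 120)
)

# mean(height) is NA for group "a", so none of its rows survive.
without_na_rm <- df %>%
    group_by(group) %>%
    dplyr::filter(height > mean(height)) %>%
    ungroup()

# Ignoring NAs when averaging keeps the tall member of group "a" as well.
with_na_rm <- df %>%
    group_by(group) %>%
    dplyr::filter(height > mean(height, na.rm = TRUE)) %>%
    ungroup()

without_na_rm$height  # 120
with_na_rm$height     # 170 120
```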

Another is using n() to filter by the number of rows in each group. n() is a simple function that returns the row count of the current group (or of the whole data frame when it isn't grouped), but it comes in handy in subtle ways.

The following example groups by species and keeps only the groups that have two or more rows. This is useful when a categorical (grouping) variable has many miscellaneous levels and you want a rough average without the singletons.

Info

In bioinformatics, this is used for genus-level analysis of gut microbiome composition. When genus is the grouping variable, many bacteria have only one data point, and this miscellaneous information dominates the data. Filtering with n() > 1 excludes it.

starwars %>%
    group_by(species) %>%
    dplyr::filter(n() > 1)
 
# # A tibble: 58 × 14
# # Groups:   species [9]
#    name   height  mass hair_color skin_color eye_color birth_year
#    <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl>
#  1 Luke …    172    77 blond      fair       blue            19
#  2 C-3PO     167    75 NA         gold       yellow         112
#  3 R2-D2      96    32 NA         white, bl… red             33
#  4 Darth…    202   136 none       white      yellow          41.9
#  5 Leia …    150    49 brown      light      brown           19
#  6 Owen …    178   120 brown, gr… light      blue            52
#  7 Beru …    165    75 brown      light      blue            47
#  8 R5-D4      97    32 NA         white, red red             NA
#  9 Biggs…    183    84 black      light      brown           24
# 10 Obi-W…    182    77 auburn, w… fair       blue-gray       57
# # … with 48 more rows, and 7 more variables: sex <chr>,
# #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
# #   vehicles <list>, starships <list>
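
To check what n() > 1 is doing, it can help to look at the group sizes first. The sketch below uses a made-up tibble with a `genus` column. count() and ungroup() are ordinary dplyr functions.

```r
library(dplyr)

# Hypothetical toy data: genus "b" appears only once.
df <- tibble(
    genus = c("a", "a", "b", "c", "c", "c"),
    value = c(1, 2, 3, 4, 5, 6)
)

# How many rows does each group have?
group_sizes <- df %>% count(genus)

# Keep only groups with more than one row, then drop the grouping.
multi <- df %>%
    group_by(genus) %>%
    dplyr::filter(n() > 1) %>%
    ungroup()

group_sizes$n       # 2 1 3
unique(multi$genus) # "a" "c"
```

Adding ungroup() at the end is optional, but it avoids surprises when later steps are not meant to run per group.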

At this point it's getting pretty niche, so just come back to this article when you forget.

##Properly Understanding the dplyr::filter() Function

Up to now, I've introduced typical uses of the filter() function. They may look diverse at first glance, but filter() operates on one consistent principle: it takes a vector of logical values as its argument and keeps only the rows where the value is TRUE.

For example, let's think about the very first example.

starwars %>%
    dplyr::filter(height > 100)

This example kept the rows where height is greater than 100. In fact, the expression height > 100 by itself returns a vector of logical values.

To make this clear, let's put the result of height > 100 into a new column using mutate().

starwars %>%
    mutate(height_greater_than_100 = height > 100, .before = name)
 
# # A tibble: 87 × 15
#    height_greater_th… name  height  mass hair_color skin_color
#    <lgl>              <chr>  <int> <dbl> <chr>      <chr>
#  1 TRUE               Luke…    172    77 blond      fair
#  2 TRUE               C-3PO    167    75 NA         gold
#  3 FALSE              R2-D2     96    32 NA         white, bl…
#  4 TRUE               Dart…    202   136 none       white
#  5 TRUE               Leia…    150    49 brown      light
#  6 TRUE               Owen…    178   120 brown, gr… light
#  7 TRUE               Beru…    165    75 brown      light
#  8 FALSE              R5-D4     97    32 NA         white, red
#  9 TRUE               Bigg…    183    84 black      light
# 10 TRUE               Obi-…    182    77 auburn, w… fair
# # … with 77 more rows, and 9 more variables: eye_color <chr>,
# #   birth_year <dbl>, sex <chr>, gender <chr>,
# #   homeworld <chr>, species <chr>, films <list>,
# #   vehicles <list>, starships <list>

The column we just added is a vector of logical values (TRUE, TRUE, FALSE, ...), right? The trick is simply that the rows where this column is TRUE were extracted. Since any logical vector will do, you can even pass one directly as the argument. Let's try passing filter() a logical vector that is TRUE for the first two rows and FALSE for the remaining 85.

starwars %>%
    dplyr::filter(c(TRUE, TRUE, rep(FALSE, 85)))
 
# # A tibble: 2 × 14
#   name           height  mass hair_color skin_color eye_color
#   <chr>           <int> <dbl> <chr>      <chr>      <chr>
# 1 Luke Skywalker    172    77 blond      fair       blue
# 2 C-3PO             167    75 NA         gold       yellow
# # … with 8 more variables: birth_year <dbl>, sex <chr>,
# #   gender <chr>, homeworld <chr>, species <chr>,
# #   films <list>, vehicles <list>, starships <list>

As shown, only the first two rows were extracted. So although the conditions for filter() look diverse, anything that yields a vector of TRUE/FALSE values works. This is why filter() can be used so flexibly.
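
One addition to this principle: the logical vector can also contain NA, and filter() treats NA like FALSE and drops those rows. A minimal sketch on a made-up tibble, where `keep` is an ordinary vector built outside the pipeline:

```r
library(dplyr)

# Hypothetical toy data with one missing value.
df <- tibble(id = c("p", "q", "r"), x = c(1, NA, 3))

# The condition evaluated on its own: FALSE NA TRUE
keep <- df$x > 2

# filter() keeps only the TRUE entry; the NA row is silently dropped.
by_filter <- dplyr::filter(df, keep)

# Base R bracket subsetting handles NA differently, so which() is used here
# to pick exactly the TRUE positions and get the same rows as filter().
by_bracket <- df[which(keep), ]

by_filter$id   # "r"
by_bracket$id  # "r"
```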

###Applied Example: Using a Logical Vector Column for Filter Conditions

Next, I'll introduce a technique I use to prevent bugs 👮‍♂️

You can filter on complex conditions using logical operations (and/or/not), but honestly I'm not great at them 😓 When several logical operators are combined, the expression gets complicated like the one below, and when you look back later it's hard to tell what is being filtered.

starwars %>%
    dplyr::filter(height > 100 & mass < 60 & species == "Human")

Code like this looks like a likely source of bugs. When you want to filter on complex conditions, preparing each condition as its own column, as below, makes the code easier to understand 😮

starwars %>%
    mutate(
        is_tall = height > 100,
        is_light = mass < 60,
        is_human = species == "Human"
        ) %>%
    dplyr::filter(is_tall & is_light & is_human)
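
Two small variations on this, as a sketch rather than a recommendation. First, filter() combines comma-separated conditions with AND, so the & chain can also be written as separate arguments. Second, the helper columns can be dropped afterwards with select(); the `is_` prefix is just the naming used in the example above.

```r
library(dplyr)

with_amp <- starwars %>%
    dplyr::filter(height > 100 & mass < 60 & species == "Human")

# Comma-separated conditions are combined with AND.
with_commas <- starwars %>%
    dplyr::filter(height > 100, mass < 60, species == "Human")

identical(with_amp$name, with_commas$name)  # TRUE

# Same filter via helper columns, removing them once they have done their job.
flagged <- starwars %>%
    mutate(
        is_tall  = height > 100,
        is_light = mass < 60,
        is_human = species == "Human"
    ) %>%
    dplyr::filter(is_tall & is_light & is_human) %>%
    select(-starts_with("is_"))

ncol(flagged) == ncol(starwars)  # TRUE
```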

##Latest Information: Combining with if_any() if_all()

dplyr::if_any() and dplyr::if_all() are new functions added in 2021 (dplyr 1.0.4) 🌞 Honestly I'm not using them in practice yet, so here I'll only briefly introduce how to combine them with dplyr::filter().

With dplyr::if_all(), you can apply the same filter condition to several columns at once. For example, to apply the condition "greater than 50" to all numeric columns, do as follows:

starwars %>%
    dplyr::filter(
        if_all(where(is.numeric), function(x) x > 50)
    )

Info

The x > 50 part is written as an anonymous function.

if_all() keeps a row when the condition is TRUE for all of the selected columns. if_any(), on the other hand, keeps a row when it is TRUE for at least one of them. The naming is easy to understand, which is nice 😄
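
To compare the two side by side, here is a minimal sketch on a made-up tibble with two numeric columns. The data and the threshold are arbitrary.

```r
library(dplyr)

# Hypothetical toy data.
df <- tibble(
    id = c("a", "b", "c", "d"),
    x  = c(10, 60, 70, 5),
    y  = c(80, 20, 90, 5)
)

# Rows where every numeric column is greater than 50.
all_over_50 <- df %>%
    dplyr::filter(if_all(where(is.numeric), function(v) v > 50))

# Rows where at least one numeric column is greater than 50.
any_over_50 <- df %>%
    dplyr::filter(if_any(where(is.numeric), function(v) v > 50))

all_over_50$id  # "c"
any_over_50$id  # "a" "b" "c"
```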

Caution

In the past (more than two years ago), filter_at() and filter_all() were used for the above operations, but they have since been superseded by if_any() and if_all(). Since they are no longer the recommended approach, let's stop using them in new code.

##Summary

So, that was an introduction to the dplyr::filter() function! In data transformation with the tidyverse, filter() is one of the top three most-used functions. I hope this article helps you use filter() even more conveniently than before ♪

See you next time! ⛄