DS4N6 Knowledge >> Tips and Tricks >> Searching a full DataFrame for a Regular Expression

Searching a full DataFrame for a Regular Expression

If you come from a UNIX/Linux background, I'm sure one of your favorite commands is grep. It is so easy to use and so powerful!

grep allows you do regular expression searches in text data (typically files).

The good news is that pandas includes a function (str.contains()) that allows you to do regex directly (the python equivalent is “re”, but you will need to import it).

The bad news though is that the pandas world does not have something as easy and intuitive as grep. In fact, doing a search for a regular expression in a Series (or DataFrame column) is quite ugly:

    df[df['column'].str.contains('myregex')]

And it gets even uglier if you want to search in every column of your dataframe:

    df[df.apply(lambda row: row.astype(str).str.contains(regex).any(),axis=1)

There are a couple of options for str.contains (case, regex, etc.) that you can find useful to modify, so take a look at the official str.contains function page

In order to make things easier, we will be introducing a helper function in the next version of the ds4n6.py library: search_regex_df.

It is pretty simple, so you can define it in your notebook and start using it right now.

The usage is simple: search_regex_df(mydf,“myregex”[,reverse=True])

As you can see, if you want to reverse the results (i.e. get the lines that do not match the regex, the equivalent of our beloved “grep -v”), you can set the optional parameter “reverse” to True (reverse=True).

The best part is that, since the result is also a DataFrame, you can continue to “pipe” actions/functions/filters on the results!

search_regex_df:

def search_regex_df(df,regex,reverse=False):
    if reverse == False:
        results=df[df.apply(lambda row: row.astype(str).str.contains(regex).any(),axis=1)]
    else:
        results=df[~df.apply(lambda row: row.astype(str).str.contains(regex).any(),axis=1)]

    return results