VKI Studios is now Cardinal Path! www.CardinalPath.com

GA Basics: Regex and Google Analytics

After my last GA beginners post, WendyW asked about the regex portion of the GAIQ study guide. This prompted me to go back and look at it again, only to find that Google's presentation could certainly use some embellishment.

For those who aren't familiar with regex (short for "regular expressions"), it's a system for matching strings of text, numbers, symbols, etc. You write a pattern in regex syntax, and a regex processor finds the text that matches it.

Regex is incredibly useful when working with, well, almost anything. Google allows (very) limited regex in search and more robust regex in Analytics, many text editors have built-in regex processors for their find/replace systems, and htaccess makes great use of it.

Today I'm going to take you all through the basics of Regex and show you how you can use it with Google Analytics to enhance your reports, filters, and more.

Contents

  1. The Language of Regex
  2. What you can do with Regex and Google Analytics
  3. Tools for Testing your Regex

Part 1: The Language of Regex

Regex commands can range from simple to incredibly full-featured depending on how much "language" you use with them. The following is a hefty (but not complete) selection of regex terms:

. : The period is a wild card. It can represent any single character whatsoever.

+ : repeats the previous character 1 or more times.

* : repeats the previous character 0 or more times.

() : Parentheses represent a set of "tokens" or rule elements. For instance, (.+) would match any set of characters. This allows you to apply an operator to an entire group. So for instance, if you wanted to match the word "what" you would type "what", but if you wanted it to also catch "whatwhat" then you could use "(what)+".

Parentheses also create a "back reference", which can be recalled later in many regex engines (for example with $1 in htaccess or \1 in Notepad++).

[] : Square brackets define a set of characters, any one of which can match. For instance [a-t] would match any single lowercase letter between a and t. You can also have multiple items within a bracket, such as [a-zA-Z0-9\s#"=-] which would match any single letter, number, whitespace character, number sign, quotation mark, equals sign, or hyphen (the hyphen goes last so it isn't read as part of a range). (Yes, this could be shortened to [\w\s#"=-], which also allows underscores, but I was making a point about ranges.)

{} : Curly brackets are odd. They define repetition. So (what){2} would only match two repetitions of what (whatwhat). Alternatively (what){2,7} would match between two and seven repetitions of what (including 3, 4, 5, or 6 repetitions).

\d : Represents any digit

\s : Represents any whitespace element (space, tab, etc.)

\w : Represents any alphanumeric character or underscore

$ : The dollar sign matches the end of a string. In htaccess it can also be used to recall sets that have been previously defined by parentheses.

^ : The caret has two purposes. It can match the start of a string, but it can also negate the characters in a character set. So ^[a-z]$ will only match a string that consists of exactly one lowercase letter, while [^a-z] will match any string that contains a character other than a lowercase letter. So aaa will not match, but aAa and AAA will.

- : A hyphen creates a range inside square brackets. For instance, [a-z] would match any character from a to z (though not any uppercase characters).

| : The bar stands for "or". So a|b will match a or b.

\ : The backslash means "literally". So while "." would match any character, "\." would only match periods. Similarly, while "?" makes the previous character optional, "\?" would match a literal question mark. In certain implementations of regex (e.g. Notepad++) the backslash can also be used with numbers to recall areas that have previously been defined by parentheses (same as $1, $2, etc. in htaccess).

? : Makes the previous character or group optional. So for example "^(1080 )?Howe st" would match "1080 Howe st." or "Howe st." but not "64 Howe st.", while "64?" would match "6" or "64". The question mark also has the dual purpose of making an expression "lazy" (normally regex is greedy). Greed and laziness make my head hurt (both in real life and in regex) so I'll just leave this one to LunaMetrics (good greed and bad greed).
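For anyone who wants to see the two jobs of the question mark side by side, here is a minimal sketch in Python. It assumes Python's built-in re module, which is not the engine Google Analytics uses, though these basics behave the same way in most engines. The HTML-ish sample text is made up for illustration.

```python
import re

# "64?" makes the "4" optional, so "6" and "64" both match in full.
optional = re.compile(r"64?")
six = bool(optional.fullmatch("6"))
sixty_four = bool(optional.fullmatch("64"))
six_four_four = bool(optional.fullmatch("644"))  # second "4" is not allowed
print(six, sixty_four, six_four_four)

# Greedy vs lazy: same text, two different matches (sample text is hypothetical).
text = "<b>bold</b> and <i>italic</i>"
greedy = re.search(r"<.+>", text).group()   # ".+" grabs as much as it can
lazy = re.search(r"<.+?>", text).group()    # ".+?" stops at the first ">"
print(greedy)
print(lazy)
```

The greedy version swallows the whole line because the line happens to end in ">", while the lazy version stops at the first tag.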

Got all of that remembered? No? Don't worry.
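If it helps, the terms above can be tried out in a few lines of code. This is only a sketch, assuming Python's built-in re module rather than the engine behind Google Analytics; the sample strings are invented, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# (pattern, sample text, should the whole text match?) - samples are hypothetical
checks = [
    (r"(what)+", "whatwhat", True),          # group repeated one or more times
    (r"(what){2}", "whatwhat", True),        # exactly two repetitions
    (r"(what){2}", "what", False),           # one repetition is not enough
    (r"(what){2,7}", "whatwhatwhat", True),  # between two and seven repetitions
    (r"[a-t]", "k", True),                   # k falls inside the a-t range
    (r"[a-t]", "z", False),                  # z falls outside it
    (r"\d+", "2011", True),                  # one or more digits
    (r"a|b", "b", True),                     # a or b
    (r".", "x", True),                       # period as a wild card
    (r"\.", "x", False),                     # escaped period: literal "." only
    (r"(\w+) \1", "hey hey", True),          # \1 recalls the first group
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in checks]

for (pattern, text, expected), got in zip(checks, results):
    print(f"{pattern:14} {text:14} matched={got} expected={expected}")
```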

Part 2: What you can do with Regex and Google Analytics

So now you know the syntax, what can you do with it?

Image Search

Ani Lopez, over at his blog, has a great example of how you can use regex to create a filter/custom segment that tracks image search. To test this open up GA, open traffic sources, then go to the source tab and select referral path (instead of source).

At the bottom of this report open your filters and enter the following string:

^(/imgres|/images/view|/images/search)

Remembering your regex language, what does the above string mean? ^ = line begins with; | = or; () = a set; so:

line begins with '/imgres' or '/images/view' or '/images/search'.
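To convince yourself of that reading before touching a real report, you could run the pattern over a few sample paths. The sketch below assumes Python's re module as a stand-in for the GA filter box, and the referral paths are made up.

```python
import re

# The image search filter from above, unchanged.
image_search = re.compile(r"^(/imgres|/images/view|/images/search)")

# Hypothetical referral paths, not taken from a real report.
paths = [
    "/imgres",
    "/images/search",
    "/images/view",
    "/search",
    "/url/imgres",   # contains "/imgres" but does not begin with it
]

matched = [p for p in paths if image_search.search(p)]
print(matched)
```

Only the first three paths come through; the last one is rejected because of the leading ^.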

Keywords

Thanks to a post by Darcy, we get a lot of people looking for how to find out who is looking at their Facebook pics/profile. Maybe I want to filter them out, or even better I want to know if they convert (or if they're just garbage traffic) so I can sort them with the following keyword filter:

(viewed|looking|looked|looks|copied|viewing)[\w\s']+face\s?book

How do we read this?

Match any line that contains 'viewed' or 'looking' or 'looked' or 'looks' or 'copied' or 'viewing', then has any alphanumeric characters, whitespace, or apostrophes repeated one or more times, followed by 'face', an optional whitespace, then 'book'.
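Here is a quick way to check that reading, again as a sketch in Python's re module (an assumption, since GA runs its own engine). The keywords are invented examples, all lowercase so that case sensitivity doesn't come into it.

```python
import re

# The keyword filter from above, unchanged.
facebook_stalkers = re.compile(
    r"(viewed|looking|looked|looks|copied|viewing)[\w\s']+face\s?book"
)

# Hypothetical search keywords.
keywords = [
    "who viewed my facebook profile",
    "see who's looking at my face book pics",
    "facebook login",              # no trigger word before "facebook"
    "who looked at my twitter",    # trigger word, but no "facebook"
]

hits = [k for k in keywords if facebook_stalkers.search(k)]
print(hits)
```

The first two keywords match; the last two each miss one half of the pattern.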

IP Filters

Let's say your network has an IP range of 123.456.789.1 to 123.456.789.999, and you don't want GA to count visits and interactions to your site from within your office. You could write a filter blocking

123\.456\.789\..*

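or if you wanted to be more exact...

123\.456\.789\.[0-9]+

In the former you're stating that anything beginning with 123.456.789. should be blocked, from 123.456.789.1 to 123.456.789.[an infinite number of 9's]. The latter requires at least one digit after the final period.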
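The backslashes in front of the periods matter more than they might seem. The sketch below (Python's re module, used here as an assumed stand-in for GA's filter) compares the wild card filter, a digits-only variation, and a version with the periods left unescaped. The addresses are the made-up ones from this example, plus an invented hyphenated string.

```python
import re

loose = re.compile(r"123\.456\.789\..*")        # anything after the last period
strict = re.compile(r"123\.456\.789\.[0-9]+")   # at least one digit after it
unescaped = re.compile(r"123.456.789..*")       # periods left as wild cards

office_ip = "123.456.789.42"
other_ip = "123.456.780.42"
odd_string = "123-456-789-42"   # hypothetical, not an IP at all

loose_office = bool(loose.search(office_ip))
strict_office = bool(strict.search(office_ip))
loose_other = bool(loose.search(other_ip))
loose_odd = bool(loose.search(odd_string))
unescaped_odd = bool(unescaped.search(odd_string))

print(loose_office, strict_office, loose_other, loose_odd, unescaped_odd)
```

With the periods escaped, only the office address matches. Leave them unescaped and the hyphenated string sneaks through, because each "." happily stands in for a "-".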

But let's make it more interesting. Let's say you have the IP range 64.281.31.1 to 64.281.31.999. How would you write a rule for that?

If you looked at my example above and wrote

64\.281\.31\..*
or
64\.281\.31\.\d{1,3}

then you are wrong.

Why?

Because that also matches 164.281.31.1 through 164.281.31.999. Without an anchor the pattern can match anywhere in the string, so numbers before your string will count! So you have to write it as:

^64\.281\.31\..*
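The difference the caret makes is easy to see on a handful of addresses. This is a sketch using Python's re module (an assumption; GA uses its own engine) and invented addresses. The last pattern, with a closing $ as well, is my own stricter variation rather than part of the rule above.

```python
import re

unanchored = re.compile(r"64\.281\.31\.\d{1,3}")
anchored_start = re.compile(r"^64\.281\.31\..*")         # the rule from above
anchored_both = re.compile(r"^64\.281\.31\.\d{1,3}$")    # stricter variation

# Hypothetical addresses.
ips = ["64.281.31.7", "164.281.31.7", "64.281.31.7000"]

unanchored_hits = [ip for ip in ips if unanchored.search(ip)]
start_hits = [ip for ip in ips if anchored_start.search(ip)]
both_hits = [ip for ip in ips if anchored_both.search(ip)]

print(unanchored_hits)
print(start_hits)
print(both_hits)
```

The unanchored pattern catches all three, the caret drops the 164 address, and adding $ with a digit limit also drops the address with too many trailing digits.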

Part 3: Tools for Testing your Regex

Regex can be complex, and before you start applying it you will want to test it. There are two easy solutions. 

  • The first is RegexPal, which allows you to input text and test regex rules on it.

  • The second is RegExr which provides a more complex interface than Regexpal, but also gives you a reference and the option to save regex commands. It also has an Adobe AIR downloadable version located here.

I am sure you can all make some imaginative use of Regex in GA, and I do hope you post your own interesting regex expressions.
