Text processing, n-grams and Paid Search


One of the coolest things about paid search advertising is matching ads to what someone is searching for. Very few channels let you target intent with such precision. There are some caveats, though. Depending on the match type, the keywords being bid on won't always be the same as the search terms that triggered the ad. AdWords provides tools like negative keywords to help manage this and ensure that the right ads appear on the right search results; the trick is to identify exactly what needs to be removed and what is working.

Example distribution of relative clicks to individual search terms

Most reasonably large AdWords accounts with a good range of match types should see clicks, spend and impressions by search term follow an exponential distribution very similar to the example above. A small subset of search terms will individually account for a large amount of activity, followed by a long tail of terms that individually do not, but that, given a large enough account and a solid long tail strategy, can still provide a good return. Or not. It all comes down to how well the data can be leveraged. The challenge with larger accounts is how to analyse this information effectively and at scale.

N-Grams and Aggregating the Tail

Text analysis is an interesting field, and one with a lot of relevance to search. Tools like Google's Ngram Viewer fall into this area. Regardless of the keyword they were matched to, search terms can share common phrase structures across the entire account, including long tail search terms. While individually many of the search terms in an account won't have enough volume for analysis, the phrase parts they include can when aggregated.
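As a minimal sketch of the idea, the 'ngram' R package can break a handful of search terms into their two-word phrase parts and count how often each appears. The terms here are made up for illustration:

```r
library(ngram)  # CRAN package for basic n-gram extraction

terms <- c("cheap flights to brisbane",
           "flights to brisbane from sydney",
           "last minute flights to brisbane")

# ngram() works on a single string, so extract the 2-grams per term
# and pool the results before counting.
bigrams <- unlist(lapply(terms, function(x) get.ngrams(ngram(x, n = 2))))
sort(table(bigrams), decreasing = TRUE)
```

Even in this tiny example, "to brisbane" turns up more often than any single search term does, which is exactly the aggregation effect the long tail needs.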

There are a number of different tools available for breaking a corpus down into its n-grams, including a few R packages like the descriptively named 'ngram'. I used it in my phrasePartAnalysis code on GitHub, which takes an AdWords search term report, extracts both two and three word grams, links performance metrics to each n-gram and prepares the data for analysis. The code was written for R, an open source statistical software package. The script itself can be slow with larger data sets, as text processing is fairly computationally intensive. What is done with the data next is probably the most interesting part, and I have included a number of different graphs and produced output to help explore the information.
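The core of the linking step can be sketched like this. The column names follow the standard search term report, but the data frame and helper logic here are illustrative, not the script itself:

```r
library(ngram)

# Made-up stand-in for an AdWords search term report.
report <- data.frame(
  SearchTerm = c("cheap flights to brisbane",
                 "hotels near brisbane airport"),
  Clicks     = c(120, 40),
  Cost       = c(85.50, 30.20),
  stringsAsFactors = FALSE
)

# Build one row per (search term, 2-gram) pair, carrying the term's
# metrics onto each of its phrase parts.
pairs <- do.call(rbind, lapply(seq_len(nrow(report)), function(i) {
  data.frame(ngram  = get.ngrams(ngram(report$SearchTerm[i], n = 2)),
             Clicks = report$Clicks[i],
             Cost   = report$Cost[i],
             stringsAsFactors = FALSE)
}))

# Aggregate the metrics by phrase part.
aggregate(cbind(Clicks, Cost) ~ ngram, data = pairs, FUN = sum)
```

Assigning a term's full clicks and cost to every n-gram it contains means the phrase part totals will overlap rather than sum to the account total, which is fine for ranking phrase parts but worth remembering when reading the numbers.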

Where the Value is

At some point in the data analysis process, actually looking at the data can be very useful, and most stats packages, R included, make this easy. Base R has a lot of graphing tools, and the ggplot2 package is almost essential for most projects. In most cases, one of the best ways to start is to look at the shape, or distribution, of the values of interest.

Most of the following plots were created with a different data set to the one provided in the GitHub repository. The output from that example data is available through GitHub Pages here: http://anthonypc.github.io/phrasePartAnalysis/. The data I used to produce the plot below follows a similar exponential curve in volume to the raw search terms shown earlier. In this case it is more an indication of how homogeneous the account's traffic is. In the example below, it is pretty clear that a very small number of phrase parts cover a lot of the activity.

Ngrams and Clicks

For this kind of search term analysis, one of the most useful things to look for is how certain phrase parts perform across the account. Some combinations of words will be triggered by multiple keywords and, sometimes, across different campaigns. A phrase that works well in one area of the account might not perform in another. Products such as domestic and international travel provide a good example: the phrase "to brisbane" is fine for a domestic product but useless in an international campaign, though it can appear in both due to keyword or match type strategies.

There are a number of things you can do with the data set produced in this initial process, and looking for phrase parts that appear across different parts of the account with different performance characteristics is one of the most valuable; in practice, this is where I find most of the value. Most of the work needed to identify these cases can be done in Excel using a file exported from the script, or by reviewing the tables generated as per the example R Markdown output.
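A hedged sketch of that comparison, using made-up campaign data rather than the script's actual output, might look like this:

```r
# Per-ngram, per-campaign performance; all names and numbers invented.
perf <- data.frame(
  ngram       = c("to brisbane", "to brisbane",
                  "cheap flights", "cheap flights"),
  Campaign    = c("Domestic", "International",
                  "Domestic", "International"),
  Clicks      = c(500, 20, 300, 250),
  Conversions = c(40, 0, 20, 18),
  stringsAsFactors = FALSE
)
perf$cvr <- perf$Conversions / perf$Clicks

# Keep only phrase parts that appear in two or more campaigns, then
# measure the spread in conversion rate for each across campaigns.
shared <- perf[ave(perf$cvr, perf$ngram, FUN = length) > 1, ]
aggregate(cvr ~ ngram, data = shared, FUN = function(x) diff(range(x)))
```

Phrase parts with a large spread, like "to brisbane" here, are the candidates for negative keywords or restructuring in the campaigns where they underperform.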

Graph Time

In addition to the raw numbers produced in the tables and exported CSV files, graphs are very useful for getting an understanding of the data. The R script linked above includes a number of simple graphs for exploring how key values are distributed by n-grams, campaigns and labels in the data set. One important thing to keep in mind when graphing this kind of information is that not all kinds of graphs are appropriate. For example, the distribution of activity by most dimensions is not normal. Visualisations like box plots, which are most useful when the data falls into something approaching a normal distribution, can be a little misleading.

Clicks (log) per group

The same applies when plotting a heavily skewed sample, as in the box plot above. A histogram or a kernel density plot would probably have been more useful.
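For example, a kernel density plot of log-transformed clicks makes the shape of a skewed sample much easier to read than a box plot. The data here is simulated purely for illustration, standing in for the script's output:

```r
library(ggplot2)

# Simulated long-tailed click data with two illustrative label groups.
set.seed(1)
demo <- data.frame(
  Clicks = rexp(500, rate = 0.02) + 1,
  Labels = sample(c("Brand", "Generic"), 500, replace = TRUE)
)

ggplot(demo, aes(x = log(Clicks))) +
  geom_density() +
  facet_wrap(~ Labels, ncol = 2) +
  ggtitle("Example density of log(Clicks) by label")
```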

logClicks to logCost

Bivariate graphs like scatter plots are very useful for paid search data, where some of the most interesting points are bivariate or multivariate outliers. In the data used for the graph above, there are a number of points with both extreme click and cost values.

Extreme Ngrams by cost

In the example code, extreme values are labelled. This is done with an ifelse rule flagging anything with a cost z-score greater than 2. A more sophisticated approach may be to use a statistic like Cook's Distance.

## Quick scatter plot for visualising extreme values where only the highest cost examples are labelled.
ggplot(graphSet01.S2[Clicks > 10], aes(x = Clicks, y = cvr, size = sqrtCost)) +
  geom_point() +
  facet_wrap(~ Labels, ncol = 2) +
  geom_text(aes(label = ifelse((Cost - mean(Cost)) / sdCost > 2, ngram, "")),
            hjust = 1, vjust = 1) +
  ggtitle("Example CVR scatterplot by Labels [unfiltered]")


Outliers and Influential Points

Influential Points

There are a few techniques used for testing the assumptions of multivariate regression, among them Mahalanobis Distance and Cook's Distance. While the data used here is certainly not appropriate for regression, those two tests can still be used to identify points that do not exhibit the same relationship between Clicks and Conversions as the rest.
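Both statistics are easy to compute in base R. This is a hedged sketch on simulated per-ngram data, not the script's own implementation; in practice the input would be the aggregate the script produces:

```r
# Simulated per-ngram metrics with a roughly linear Clicks/Conversions
# relationship, standing in for the script's aggregate output.
set.seed(42)
d <- data.frame(Clicks = rexp(300, rate = 0.05))
d$Conversions <- 0.05 * d$Clicks + rnorm(300, sd = 1)

# Mahalanobis Distance: multivariate distance from the centroid,
# compared against a chi-squared cutoff with 2 degrees of freedom.
md <- mahalanobis(d, colMeans(d), cov(d))
d$mahalanobisFlag <- md > qchisq(0.975, df = 2)

# Cook's Distance: influence of each point on a simple linear fit of
# Conversions on Clicks. A common rule of thumb flags points > 4/n.
fit <- lm(Conversions ~ Clicks, data = d)
d$cooksFlag <- cooks.distance(fit) > 4 / nrow(d)

# How often the two flags agree.
table(d$mahalanobisFlag, d$cooksFlag)
```

Mahalanobis flags points far from the joint centre of the data, while Cook's flags points that pull the fitted line around, so the two lists will overlap but not match exactly.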

A number of examples of this are included at http://anthonypc.github.io/phrasePartAnalysis/ as well as in the analysis script file in the GitHub repository.


This is really just the start of what can be done with search term analysis. An n-gram model like the one presented above has a number of weaknesses. As coded, it will not catch minor misspellings or typos and group them together. The code also does not include any multivariate analysis, nor does it use a model to detect influential points within a campaign or label group, though both are possible and easily supported by R.
