Click-fraud is a
major challenge for online advertisers.
Here's how Google
battles the scourge.
Click-fraud is
a central issue for advertisers, for search engines,
and for Google in particular. NYU Stern Professor of
Information Systems Alexander Tuzhilin recently investigated
how Google manages its system to detect fraud. He visited
Google’s campus,
interviewed Google employees, and learned about the
company’s inspection and detection systems. His
results were reassuring, but also showed some potential
room for improvement. The following article is a synopsis
of the report that Tuzhilin filed with the court in
Texarkana, Arkansas, on July 21, 2006, in the Lane’s
Gifts v. Google settlement.
The online advertising and
e-commerce industries are growing rapidly. No company has done
more to spur that growth, or benefited more from it, than
Google. The search engine, launched in 1999, is now the basis
of a $140 billion company. By aggregating content, allowing
Internet surfers to search effectively, and providing a platform
for advertisers, Google has emerged as a dominant player
in the industry.
The virtue of online advertising is
that users can easily track metrics in a way they can’t in other media. Technology
allows advertisers to learn precisely how many people view
an ad, how many click on it, and how many wind up purchasing
the product or service advertised. At the same time, online
advertisers – many of whom pay on the basis of the number
of clicks on their ads – must grapple with another issue
created by technology: the potential for what’s known
as click-fraud. In cyberspace, after all, numerous parties
have incentives to generate traffic on advertisements or websites
that may not be legitimate.
In Internet advertising, the predominant
model is CPC – Cost
Per Click, also known as Pay Per Click (PPC), under which an
advertiser pays only when a visitor clicks on the ad. (A second
model called Cost per Mille (CPM), also known as CPI (Cost
Per Impression), under which an advertiser pays per one thousand
impressions of the ad, is also used by Google but is not subject
to click-fraud and, therefore, was not a concern in Tuzhilin’s
study.) The CPC/PPC model has two fundamental problems. First,
good click-through rates are not necessarily indicative of
good conversion rates; just because someone clicks on an ad
doesn’t mean she’ll buy the product. Second, it
does not offer any built-in fundamental protection mechanisms
against click-fraud.
Google has two main advertising programs.
AdWords, launched in 2002, allows advertisers to purchase
CPC-based advertising that displays ads based on the keywords
specified in users’ search
queries. When a user executes a Google search, ads for relevant
words are shown alongside search results. An advertiser has
a certain budget associated with a keyword, which is allocated
for a specified time period. Each click decreases the budget
by the amount paid for the ad, until it reaches zero during
that time period. If the balance reaches zero, the ad stops
showing. Therefore, an advertiser or its partner can deplete
the budget of a competitor by repeatedly clicking on the ad.
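The budget mechanics described above can be sketched in a few lines of Python. This is only an illustration: the class name, the field names, and the figures (a $50 budget and a $2.50 cost per click) are hypothetical and do not come from Google or from Tuzhilin's report. The sketch also assumes the ad stops showing once the balance can no longer cover one more click.

```python
from dataclasses import dataclass


@dataclass
class KeywordCampaign:
    """Hypothetical model of one advertiser's budget for one keyword."""
    budget_cents: int          # budget allocated for the current time period
    cost_per_click_cents: int  # amount charged for each click

    def is_showing(self) -> bool:
        # Assumption: the ad is shown only while the balance covers a click.
        return self.budget_cents >= self.cost_per_click_cents

    def register_click(self) -> bool:
        """Charge one click; return False if the ad was no longer showing."""
        if not self.is_showing():
            return False
        self.budget_cents -= self.cost_per_click_cents
        return True


def clicks_to_deplete(campaign: KeywordCampaign) -> int:
    """Count how many clicks it takes before the ad stops showing."""
    clicks = 0
    while campaign.register_click():
        clicks += 1
    return clicks


# Illustrative figures only: a $50.00 budget at $2.50 per click.
campaign = KeywordCampaign(budget_cents=5000, cost_per_click_cents=250)
exhausting_clicks = clicks_to_deplete(campaign)
print(exhausting_clicks, campaign.is_showing())
```

With these made-up numbers, a competitor would need only 20 repeated clicks to take the ad off the page for the rest of the period, which is why the budget model alone offers no protection.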
The AdSense program, launched in March
2003, is a way for website owners (publishers), ranging from
The New York Times to small blogs, to display Google’s ads on their sites. AdSense
for Search (AFS) lets Google place ads on publishers’ websites
when users make keyword-based searches on their sites. AdSense
for Content (AFC) automatically delivers targeted ads to the
publisher’s web pages that the user is visiting. In both
cases, the publishers and Google are being paid by advertisers
on the PPC basis. All the partner sites in the constantly evolving
network are periodically reviewed and monitored to detect possible
problems.
Under this model, publishers have a
direct incentive to attract traffic to their websites and
encourage visitors to click on Google’s ads on the site. But overzealous and unethical
users can “stretch” or directly abuse this system
by generating invalid clicks, particularly in the AdSense for
Content category.
Defining Invalid Clicks
To manage the AdSense and AdWords programs,
Google collects information about querying and clicking activities.
This “raw” clicking
data is cleaned, preprocessed, and stored in various internal
logs. When advertisers are billed, they receive customized
reports describing the clicking and billing activities. But
since the smallest unit of analysis is one day, advertisers
cannot know if a particular click on a particular ad was
marked as valid or invalid by Google, and Google refuses
to provide this information to advertisers. Google defines
invalid clicks as “Clicks … generated through
prohibited means, and intended to artificially increase click … counts
on a publisher account.” Advertisers are not charged
for what Google deems to be invalid clicks.
Invalid clicks can come from a range of sources: individuals
deploying automated clicking programs or software applications
(called bots), low-cost workers paid to click on links, or
technical glitches. Some of these invalid clicks are clearly
fraudulent, while others are just invalid. Some are easy to
detect, while others are very hard.
Tuzhilin notes that evaluating the validity of a click requires
understanding the intent of the user. Unfortunately,
in several cases it is hard or even impossible to determine
the true intent of a click using technological means. For example,
a person might have clicked on an ad, looked at it, gone somewhere
else, but then decided to have another look shortly thereafter
to make sure she got all the necessary information. Is this
second click invalid? The intent cannot be operationalized
and detected by technological means with any reasonable measure
of certainty.
Nevertheless, Google’s
Click Quality Team works to identify all invalid clicks
regardless of their nature and origin. To do so, it employs
a two-fold strategy of prevention and detection. First, Google
discourages invalid clicking activities on its network by making
the lives of unethical users more difficult and less rewarding.
For example, it installs measures to make it difficult
to register using false identities.
Beyond prevention, Google
has built four lines of defense against invalid clicks: pre-filtering,
online filtering, automated offline detection, and manual
offline detection. It employs two main methodologies in seeking
out invalid clicks. In the Anomaly-based (or Deviation-from-the-norm-based)
approach, one may not know what invalid clicks are. But one
can know what constitutes “normal” clicking activities,
assuming that abnormal activities are relatively infrequent
and do not distort the statistics of the normal activities.
Invalid clicks are those that significantly deviate (mainly
in the statistical sense) from established norms. In the Rules-based
approach, one specifies a set of rules identifying invalid
clicking activities (alternatively, one can also identify a
set of other rules identifying valid clicking activities).
An example of such a rule is: “IF a doubleclick occurred,
THEN the second click is invalid.” The operational definitions
Google uses cannot be released to the general public because
unethical users will immediately take advantage. However, if
it is not known to the public what valid and invalid clicks
are, how would the advertisers know for what exactly they are
being charged? This is the essence of the fundamental problem
of the PPC model.
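The two methodologies described above can be made concrete with a small Python sketch. Google's actual operational definitions are not public, so everything here is an assumption chosen for illustration: the one-second doubleclick window, the z-score test, the threshold of three standard deviations, and the sample numbers.

```python
from statistics import mean, stdev


def doubleclick_invalid(timestamps, window_seconds=1.0):
    """Rules-based: flag any click that follows the previous one too closely.

    Returns one boolean per click (in time order), True where invalid.
    The one-second window is a hypothetical value.
    """
    flags = []
    previous = None
    for t in sorted(timestamps):
        flags.append(previous is not None and t - previous <= window_seconds)
        previous = t
    return flags


def anomalous_counts(baseline_counts, new_counts, threshold=3.0):
    """Anomaly-based: flag counts that deviate strongly from the norm.

    The norm is estimated from baseline_counts, assumed to be mostly
    legitimate activity; deviation is measured as a z-score.
    """
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)
    if sigma == 0:
        # A perfectly flat baseline: any different count is a deviation.
        return [c != mu for c in new_counts]
    return [abs(c - mu) / sigma > threshold for c in new_counts]


# Click times in seconds for one ad and one visitor (made-up data).
doubleclick_flags = doubleclick_invalid([10.0, 10.4, 25.0, 60.0])

# Daily click counts: a quiet baseline week, then three new days.
baseline = [100, 104, 98, 101, 97, 102, 99]
anomaly_flags = anomalous_counts(baseline, [103, 180, 99])

print(doubleclick_flags)
print(anomaly_flags)
```

The rule needs no history but catches only the pattern it names, while the anomaly test needs a trustworthy baseline but can catch patterns nobody has written a rule for.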
Filtering Efforts
Most of Google’s efforts are
focused on the second approach: detection. Google employs
a series of filters. Tuzhilin found that certain clicks
are removed immediately from the logs before they are even
seen by the online filters. After this preliminary stage,
the next three lines of defense against invalid clicks
include online filtering, automated offline detection,
and manual offline detection.
Online Filtering. Several rules-based online filters monitor
various logs for certain conditions and detect the clicks satisfying
these conditions, then mark them as invalid and remove them.
The invalid clicks are removed only at the end of the filtering
process; therefore, each filter sees every click. However,
each invalid click is associated with the first filter in the
pecking order that detected it. It turns out that the vast
majority of invalid clicks are detected by the first few most
powerful filters, and the last few filters in the pecking order
detect only a small portion of invalid clicks that have not
been detected yet by the previously applied filters. The Click
Quality Team constantly works on the development of new filters
and the improvement of the current set of filters.
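The filtering process just described (every filter sees every click, each invalid click is credited to the first filter in the pecking order that caught it, and removal happens only at the end) can be sketched as follows. The filter names, the click fields, and the conditions are hypothetical; the report does not disclose Google's real filters.

```python
from collections import Counter


def run_filters(clicks, filters):
    """Apply every filter to every click; credit the first filter that fires.

    `filters` is an ordered list of (name, predicate) pairs, and the list
    order is the pecking order. Returns (valid_clicks, credit), where credit
    counts the invalid clicks attributed to each filter.
    """
    first_hit = {}  # click index -> name of the first filter that flagged it
    for name, predicate in filters:
        for index, click in enumerate(clicks):  # each filter sees every click
            if predicate(click) and index not in first_hit:
                first_hit[index] = name
    # Invalid clicks are removed only once, after all filters have run.
    valid = [c for i, c in enumerate(clicks) if i not in first_hit]
    return valid, Counter(first_hit.values())


# Hypothetical filters, most powerful first.
filters = [
    ("known_bot", lambda c: c["agent"] == "bot"),
    ("too_fast", lambda c: c["gap_seconds"] < 1.0),
]

# Made-up click log: user agent and seconds since the previous click.
clicks = [
    {"agent": "browser", "gap_seconds": 30.0},
    {"agent": "bot", "gap_seconds": 0.2},
    {"agent": "browser", "gap_seconds": 0.5},
    {"agent": "bot", "gap_seconds": 12.0},
]

valid_clicks, credit = run_filters(clicks, filters)
print(len(valid_clicks), dict(credit))
```

In this toy log the second click trips both filters but is credited only to the first one, which is how later filters in the order end up with small counts even when their conditions are broad.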
The Click Quality Team provides only
indirect evidence that Google filters perform reasonably
well. For example, newly introduced and revised filters
detect only a few additional invalid clicks. Of the invalid clicks
flagged by one recently introduced filter, only 2 to 3 percent
had not already been detected by other filters.
And the offline invalid click detection methods (to be
discussed below) detect relatively few invalid clicks in
comparison to the filters. This observation does not provide
irrefutable evidence that the filters work well – it
could be that the offline methods work poorly. But the low
ratio of the offline to the online detections provides some
evidence that the online filters perform reasonably well.
Automated Offline Detection. The next stage is the offline
detection and removal of invalid clicks that managed to pass
the online filtering stage. First, Google deploys alerts, which
are used for detecting more complex and more subtle patterns
of invalid clicking activities. Since these clicks cannot be
safely removed by filters, the filters pass them as valid,
and alerts identify them in the offline analysis stage and
pass these suspicious clicks to human experts for manual investigations.
Alerts can also check for various conditions more complex than
those used in filters. These alerts take into consideration
a broader set of deciding factors and can monitor these factors
over longer time periods. When alerts are issued, they are
manually investigated by the Click Quality Team, based on their
priority.
Manual Offline Detection. The Operations group of the Click
Quality Team conducts manual reviews of potentially invalid
clicking activities. Investigation requests are generated from
various sources: from advertisers noticing unusual clicking
activities, from alerts, from customer service representatives
who might notice something questionable, from publishers, and
from an automated system that examines publishers and determines
whether they are spammers.
Google’s goal is to conduct as many of these investigations
as possible proactively. Another goal is to investigate
the suspicious publishers in the early stages of their inappropriate
activities before they receive payments. The basic idea behind
most of these investigations is to discover unexpected behavior
of the entities being investigated. Drawing on their experience
of what “normal” behavior looks like, the investigators look for
deviations from it using the inspection tools. Once such deviations are discovered,
the investigator drills down into the problem and uncovers
the reasons causing these deviations and, most likely, the
source and reasons for the inappropriate activity or a set
of activities. Tuzhilin has personally observed several such
inspections. Thanks to the quality of the inspection tools,
the high levels of experience and professionalism of the Click
Quality inspectors, and the existence of certain investigation
processes, guidelines, and procedures, these inspections are
generally successful.
Evolving Eco-System
The offline invalid click detection methods detect relatively
few invalid clicks. Again, this could be because the offline
methods perform poorly. However, the Click Quality Team puts
much thought into developing reasonable offline methods.
Therefore, even if they did not perform that well, the low
ratio of the offline to the online detections of invalid
clicks would still provide some evidence that the online
filters perform reasonably well.
The Click Quality Team provided additional indicators that
led Tuzhilin to conclude that the click detection system performs
reasonably well. Since late 2004, the number of inquiries about
invalid clicks for the Click Quality Team has increased drastically,
but the number of refunds for invalid clicks provided by Google
did not change significantly. Since each inquiry about invalid
clicks leads to an investigation, this means that a significantly
smaller share of investigations results in refunds. The total amount of
reactive refunds that Google provides to advertisers as a result
of their inquiries is minuscule in comparison to the potential
revenues that Google forgoes due to the removal of invalid
clicks (and not charging advertisers for them).
This evidence doesn’t provide
proof beyond a reasonable doubt. And there is room for improvement.
For example, Google could make greater use of classifier-based
filters based on well-known data-mining methods. Data-mining
methods allow for the construction of statistical models
based on past data that can classify new clicks as either
valid or invalid and also assign some degree of certainty
to this classification.
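To show what such a classifier-based filter might look like, here is a minimal sketch using naive Bayes, one well-known data-mining method. The choice of method, the two binary features (whether the click came from a repeat IP address and whether it followed the previous click within a short gap), and the tiny training set are all assumptions made for illustration, not anything Tuzhilin or Google specified.

```python
from collections import defaultdict


def train_naive_bayes(examples):
    """Fit a Bernoulli naive Bayes model on past, already-labeled clicks.

    `examples` is a list of (features, label) pairs: features is a tuple of
    0/1 values and label is "valid" or "invalid".
    """
    class_counts = defaultdict(int)
    feature_on = defaultdict(lambda: defaultdict(int))
    for features, label in examples:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_on[label][i] += value
    return class_counts, feature_on


def probability_invalid(model, features):
    """Return the model's degree of certainty that a new click is invalid."""
    class_counts, feature_on = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # class prior
        for i, value in enumerate(features):
            # Laplace smoothing keeps unseen combinations from scoring zero.
            p_on = (feature_on[label][i] + 1) / (count + 2)
            score *= p_on if value else 1 - p_on
        scores[label] = score
    return scores.get("invalid", 0.0) / sum(scores.values())


# Made-up history; features are (repeat_ip, short_gap).
history = [
    ((1, 1), "invalid"), ((1, 0), "invalid"), ((1, 1), "invalid"),
    ((0, 0), "valid"), ((0, 1), "valid"), ((0, 0), "valid"),
    ((0, 0), "valid"), ((1, 0), "valid"),
]
model = train_naive_bayes(history)

p_suspicious = probability_invalid(model, (1, 1))
p_ordinary = probability_invalid(model, (0, 0))
print(round(p_suspicious, 2), round(p_ordinary, 2))
```

Unlike a fixed rule, the output is a probability, so a filter built this way could remove only the clicks it is most certain about and pass borderline ones to offline review.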
Still, Tuzhilin believes that the indirect
evidence provides a sufficient degree of comfort to conclude
that these filters work reasonably well. This does not mean,
however, that any particular advertiser cannot be hurt badly
by fraudulent attacks. One simply should not generalize such
incidents to other cases and draw premature conclusions.
Also, the Click Quality Team realizes that battling click-fraud
is an arms race. The Internet is a constantly evolving eco-system,
and Google is making efforts to stay “ahead of the curve” and
to get ready for more advanced forms of click-fraud by developing
the next generation of online filters.