Hadoop the Cookie Cutter

 

A cookie is data that a Web site stores on your computer to record something about its interaction with you. The cookie might contain data such as the date you last visited, whether you are currently signed in, or something else about your interaction with that site. Cookies can also contain a key value to one or more tables in a database that the server company maintains about your past interactions. In that case, when you access a site, the server uses the value of the cookie to look up your history. Such data could include your past purchases, portions of incomplete transactions, or the data and appearance you want for your Web page. Most of the time, cookies ease your interaction with Web sites.
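The key-value lookup described above can be sketched in a few lines of Python. This is only an illustration: the cookie name `visitor_key`, the key `k-1001`, and the contents of the `VISITOR_HISTORY` table are all hypothetical, and a real site would query a database rather than an in-memory dictionary.

```python
from http.cookies import SimpleCookie

# Hypothetical server-side table, keyed by the value stored in the cookie.
VISITOR_HISTORY = {
    "k-1001": {"last_visit": "2024-01-15", "past_purchases": ["book", "lamp"]},
}

def lookup_history(cookie_header: str) -> dict:
    """Parse a Cookie request header and return the stored history, if any."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("visitor_key")  # hypothetical cookie name
    if morsel is None:
        return {}  # no key in the cookie, so nothing to look up
    return VISITOR_HISTORY.get(morsel.value, {})

# The browser sends the cookie back; the server uses its value as the key.
history = lookup_history("visitor_key=k-1001; signed_in=no")
```

Note that the cookie itself holds only the key. The history stays on the server.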

Cookie data includes the URL of the Web site of the cookie's owner. Thus, for example, when you go to Amazon, it asks your browser to place a cookie on your computer that includes its name, www.amazon.com. Your browser will do so unless you have turned cookies off.

A third-party cookie is a cookie created by a site other than the one you visited. Such cookies are generated in several ways, but the most common occurs when a Web page includes content from multiple sources. For example, Amazon designs its pages so that one or more sections contain ads provided by the ad-serving company DoubleClick. When the browser constructs your Amazon page, it contacts DoubleClick to obtain the content for such sections (in this case, ads). When it responds with the content, DoubleClick instructs your browser to store a DoubleClick cookie. That cookie is a third-party cookie. In general, third-party cookies do not contain a name or any value that identifies a particular user. Instead, they include the IP address to which the content was delivered.
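As a rough sketch of what makes a cookie "third-party", the Python below builds a cookie the way an ad server's response might and compares its domain with the page you actually visited. The cookie name `ad_id`, its value, the one-year lifetime, and the helper `is_third_party` are assumptions made for illustration; they are not DoubleClick's actual cookie format.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlparse

def is_third_party(page_url: str, cookie_domain: str) -> bool:
    """True when the cookie's domain is neither the visited host nor a parent of it."""
    host = urlparse(page_url).hostname or ""
    domain = cookie_domain.lstrip(".")
    return not (host == domain or host.endswith("." + domain))

# The ad server's response to the embedded request carries its own cookie.
ad_cookie = SimpleCookie()
ad_cookie["ad_id"] = "abc123"                        # hypothetical name and value
ad_cookie["ad_id"]["domain"] = "doubleclick.net"
ad_cookie["ad_id"]["max-age"] = 60 * 60 * 24 * 365   # lifetime is chosen by the cookie's creator

header = ad_cookie.output()  # the Set-Cookie header line sent to the browser
third_party = is_third_party("https://www.amazon.com/some-page", "doubleclick.net")
```

Here the page host is www.amazon.com but the cookie belongs to doubleclick.net, so the check reports a third-party cookie.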

On its own servers, when it creates the cookie, DoubleClick records that data in a log, and if you click on the ad, it will add the fact of that click to the log. This logging is repeated every time DoubleClick shows an ad. Cookies have an expiration date, but that date is set by the cookie creator, and they can last many years. So, over time, DoubleClick and any other third-party cookie owner will have a history of what they've shown, what ads have been clicked, and the intervals between interactions.

But the opportunity is even greater. DoubleClick has agreements not only with Amazon but also with many others, such as Facebook. If Facebook includes any DoubleClick content on its site, DoubleClick will place another cookie on your computer. This cookie is different from the one that it placed via Amazon, but both cookies have your IP address and other data sufficient to identify the second cookie as originating from the same source as the first. So, DoubleClick now has a record of your ad response data on two sites. Over time, the cookie log will contain data to show not only how you respond to ads but also your pattern of visits across all the Web sites on which it places ads.
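To make the cross-site linking concrete, here is a minimal Python sketch that groups log entries by IP address, following the description above. The log layout (IP address, site, ad, clicked) and every value in `ad_log` are invented for illustration; the source does not describe the ad server's real log format.

```python
from collections import defaultdict

# Hypothetical log entries: (ip_address, site_showing_ad, ad_id, clicked)
ad_log = [
    ("203.0.113.7", "amazon.com", "ad-1", False),
    ("203.0.113.7", "facebook.com", "ad-2", True),
    ("198.51.100.4", "amazon.com", "ad-1", False),
]

def build_profiles(log):
    """Combine entries from every site into one profile per IP address."""
    profiles = defaultdict(lambda: {"sites": set(), "shown": 0, "clicked": 0})
    for ip, site, _ad_id, clicked in log:
        profile = profiles[ip]
        profile["sites"].add(site)        # which sites this computer visited
        profile["shown"] += 1             # how many ads it was shown
        profile["clicked"] += int(clicked)  # how many of those it clicked
    return dict(profiles)

profiles = build_profiles(ad_log)
```

In this sample, the entries from Amazon and Facebook for the first address end up in a single profile, which is the kind of record the paragraph above describes.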

You might be surprised to learn how many third-party cookies you have. The browser Firefox has an optional feature called Lightbeam that tracks and graphs all the cookies on your computer. Figure 3-32 shows the cookies that were placed on my computer as I visited various Web sites. As you can see, in Figure 3-32a, when I started my computer and browser, there were no cookies. The cookies on my computer after I visited www.msn.com are shown in Figure 3-32b. At this point, there are already eight third-party cookies tracking. After I visited five sites, I had 27 third-party cookies, and after I visited seven sites I had 69, as shown in Figures 3-32c and d.

Figure 3-32 Third-Party Cookie Growth

Source: © Mozilla Corporation.

Who are these companies that are gathering my browser behavior data? If you hold your mouse over one of the cookies, Lightbeam will highlight it in the data column on the right. As you can see in Figure 3-32d, after visiting seven sites, DoubleClick was connected to a total of 16 other sites, only seven of which are sites I visited. So, from my own computer, DoubleClick is connecting to sites I don't even know about. Examine the connection column on the right. I visited MSN, Amazon, MyNorthwest, and WSJ, but who are Bluekai and Rubiconproject? I had never heard of them until I saw this display. They, apparently, have heard of me, however!

Third-party cookies generate incredible volumes of log data. For example, suppose a company, such as DoubleClick, shows 100 ads to a given computer in a day. If it is showing ads to 10 million computers (possible), that is a total of 1 billion log entries per day, or 365 billion a year. Truly this is Big Data.
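The arithmetic in this example is easy to check. The short Python sketch below uses only the figures given in the text (100 ads per computer per day, 10 million computers); the variable names are ours.

```python
ADS_PER_COMPUTER_PER_DAY = 100
COMPUTERS = 10_000_000

# One log entry per ad shown.
entries_per_day = ADS_PER_COMPUTER_PER_DAY * COMPUTERS   # 1 billion
entries_per_year = entries_per_day * 365                  # 365 billion

print(f"{entries_per_day:,} entries per day")
print(f"{entries_per_year:,} entries per year")
```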

Storage is essentially free, but how can they possibly process all that data? How do they parse the log to find entries just for your computer? How do they integrate data from different cookies on the same IP address? How do they analyze those entries to determine which ads you clicked on? How do they then characterize differences in ads to determine which characteristics matter most to you? The answer, as you learned in Q3-6, is to use parallel processing. Using a MapReduce algorithm, they distribute the work to thousands of processors that work in parallel. They aggregate the results of these independent processors and then, possibly, move to a second phase of analysis where they do it again. Hadoop, the open source program that you learned about in Q3-6, is a favorite for this process.
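The map, aggregate, and repeat pattern described above can be imitated in plain Python. The sketch below is not Hadoop's API and runs on one processor; it only mirrors the shape of a MapReduce job, with each list in `LOG_CHUNKS` standing in for a piece of the log that would be handed to a separate processor. The log line format and all sample values are hypothetical.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log lines: "ip_address,ad_id,clicked" (clicked is 0 or 1).
# Each inner list stands in for a chunk sent to a different processor.
LOG_CHUNKS = [
    ["203.0.113.7,ad-1,0", "198.51.100.4,ad-1,1"],
    ["203.0.113.7,ad-2,1", "203.0.113.7,ad-1,0"],
]

def map_chunk(lines):
    """Map phase: emit (ip, (shown, clicked)) for each log line in one chunk."""
    pairs = []
    for line in lines:
        ip, _ad_id, clicked = line.split(",")
        pairs.append((ip, (1, int(clicked))))
    return pairs

def shuffle(pairs):
    """Group emitted values by key so each reducer sees a single IP address."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_ip(values):
    """Reduce phase: total ads shown and ads clicked for one IP address."""
    shown = sum(v[0] for v in values)
    clicked = sum(v[1] for v in values)
    return shown, clicked

# Chunks are mapped independently, then the results are aggregated per key.
mapped = chain.from_iterable(map_chunk(chunk) for chunk in LOG_CHUNKS)
totals = {ip: reduce_ip(values) for ip, values in shuffle(mapped).items()}
```

Because `map_chunk` looks only at its own chunk, the chunks could be processed in any order or at the same time, which is what lets the work be spread across thousands of processors. The `totals` result could then feed a second phase of analysis.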

(See the collaboration exercise on page 107 for a continuation of the discussion: third-party cookies—problem? Or opportunity?)

Questions

3-17. Suppose you are an ad-serving company, and you maintain a log of cookie data for ads you serve to Web pages for a particular vendor (say, Amazon).

1) How can you use this data to determine which are the best ads?

2) How can you use this data to determine which are the best ad formats?

3) How could you use records of past ads and ad clicks to determine which ads to send to a given IP address?

4) How could you use this data to determine how well the technique you used in your answer to question 3 was working?
