Large Ad Data: How Accurate Does It Need to Be?

I got into a discussion with @durana today about doing some simple calculations on real-time display ad data. I pointed out to him that if we are looking at a real-time flow of incoming data (say, display ad impressions, clicks, and page views) and we calculate a click-through rate or conversion rate from it, the numbers would look really messy and lack integrity.

My fear was that a media buyer is used to looking at daily conversion rates and click-through rates, whereas we would be watching a 24- or 48-hour build of data flow in before our eyes, so the rates could look inflated or deflated at any given moment. He then pointed out that this would in fact be more accurate.
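To make the concern concrete, here is a minimal sketch of rates computed over a window that is still filling up. The class name `RunningRates`, the event labels, and the choice to define conversion rate as conversions per click are all assumptions made for illustration, not something a particular ad server provides.

```python
from dataclasses import dataclass


@dataclass
class RunningRates:
    """Running counters for one campaign (hypothetical structure)."""
    impressions: int = 0
    clicks: int = 0
    conversions: int = 0

    def record(self, event_type: str) -> None:
        # Event labels are assumed; a real log would have its own schema.
        if event_type == "impression":
            self.impressions += 1
        elif event_type == "click":
            self.clicks += 1
        elif event_type == "conversion":
            self.conversions += 1
        else:
            raise ValueError(f"unknown event type: {event_type}")

    def click_through_rate(self) -> float:
        return self.clicks / self.impressions if self.impressions else 0.0

    def conversion_rate(self) -> float:
        # Assumption: conversions per click.
        return self.conversions / self.clicks if self.clicks else 0.0


rates = RunningRates()
for event in ["impression"] * 1000 + ["click"] * 3:
    rates.record(event)
early_ctr = rates.click_through_rate()  # 3 clicks / 1,000 impressions

# Two more clicks on those same impressions arrive later in the window.
for event in ["click"] * 2:
    rates.record(event)
later_ctr = rates.click_through_rate()  # 5 clicks / 1,000 impressions
```

In this made-up stream the same 1,000 impressions read as a 0.3% click-through rate early on and 0.5% once the late clicks land, which is the kind of drift a buyer used to daily numbers would notice.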

Now, I know we need to consider reset cookies and new cookies and all of that, but... do you really? If you are truly watching real-time data flow in the door, you have a real-time bidding engine to take advantage of it, and you can watch users flow through specific sites and audiences from impression to click and/or page view and then to sale, cascading from audience to audience, then you don’t need to do as much sampling and data validation. You just need to know the cost you are incurring in real time, and estimate how much it would go up or down based on the click-through rate and conversion rate you are seeing.

Also, if you are watching the effect of your change in real time, you can make up for any inaccuracies so quickly that they don’t matter in the first place. Accuracy is only valuable when time passes between taking an action and receiving the data on its outcome. If you are acting in real time and getting the results in real time, why does the data need to be so accurate?

So, where is the data and how fast can it go?

Working at a big company, you get very used to how things are done. Data has a certain availability, and you need to work around it and play within the confines you are given, since any major infrastructure change could take months or years, and there is always the fear of killing the golden goose and of the switching costs of going from one system to another. The basic rule of thumb is to use what currently works and innovate within its confines unless there is a major loss or a pressing need to change things.

I’m specifically talking about ad-serving technology, the databases that store the information gleaned from ad serving, and the way that data is structured for querying and reporting.

I recently got into a discussion with the CEO of an emerging data company, someone I have a ton of respect for and who also has an extensive technology background. Our discussion was about real-time data reporting and its feasibility. Typically, most ad servers dump logs into massive long-term storage systems, using Hadoop, Netezza, or even Oracle. There is definitely a maximum number of records you can insert per second, there are limitations on the structure of the data that can complicate pulling it out later for reporting, and the more data you store in one place, the harder it is to pull out a tiny piece of it quickly.

When talking about advertising, an individual advertiser or marketer generates between 10 and 100 million display ad impressions per day, along with tens or hundreds of thousands of clicks and tens or hundreds of thousands of page views. To break that down, let’s say you need to store:

100 MM impressions

300k clicks

20k page view events

Per day, that totals 100,320,000 event records

so that’s 4,180,000 events per hour

or 69,667 per minute

or  1,161 per second
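The breakdown is easy to recompute with a few lines of Python. This is just the arithmetic on the three example volumes, assuming traffic is spread evenly across the day; the variable names are mine.

```python
# Example daily volumes for a single advertiser, taken from the list above.
impressions = 100_000_000
clicks = 300_000
page_views = 20_000

events_per_day = impressions + clicks + page_views
events_per_hour = events_per_day / 24
events_per_minute = events_per_hour / 60
events_per_second = events_per_minute / 60

print(f"{events_per_day:,} events per day")
print(f"{events_per_hour:,.0f} events per hour")
print(f"{events_per_minute:,.0f} events per minute")
print(f"{events_per_second:,.0f} events per second")
```

That works out to roughly 1,161 inserts per second for one advertiser, before accounting for any time-of-day spikes.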

Then you have to think about how that number will spike during certain hours of the day. You also definitely want to design the system to handle a lot of advertisers, so let’s say 1,000 advertisers: that is about 1.16 MM events per second on average, so allowing for spikes you are talking about being able to handle between 1 MM and 20 MM events inserted into the database per second.
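A rough way to play with these sizing numbers is sketched below. The function name `required_inserts_per_second` and the `peak_factor` multiplier are hypothetical, and the 3x peak used in the example is an arbitrary illustration rather than a measured traffic pattern.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_inserts_per_second(
    events_per_day_per_advertiser: int,
    advertisers: int,
    peak_factor: float = 1.0,
) -> float:
    """Average insert rate across all advertisers, scaled by an assumed peak multiplier."""
    average = events_per_day_per_advertiser * advertisers / SECONDS_PER_DAY
    return average * peak_factor


# 100,000,000 impressions + 300,000 clicks + 20,000 page views per advertiser per day.
daily_events = 100_000_000 + 300_000 + 20_000

average_load = required_inserts_per_second(daily_events, advertisers=1_000)
# peak_factor=3.0 is a made-up value to show how spikes move the target.
peak_load = required_inserts_per_second(daily_events, advertisers=1_000, peak_factor=3.0)

print(f"average: {average_load:,.0f} inserts/sec")
print(f"assumed 3x peak: {peak_load:,.0f} inserts/sec")
```

With these inputs the average lands at about 1.16 MM inserts per second; the peak figure depends entirely on the multiplier you choose, which is why the real traffic curve matters so much for the design.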

So how do you do this while managing costs? And is it even doable?

One thing is for sure: you need to own the data source. Importing data from third-party ad servers or publishers is off the table, because the server-to-server transfer alone will add valuable seconds to your process. And if you ever plan to do calculations on that data, like conversion rates, match processes, or click-through rates, you are adding processing time as well. We haven’t even gotten to querying the data out of the database yet; we’re just inserting it. That said, querying should be a lot quicker and easier if the data is structured properly, because you only have one or two users querying each advertiser’s data set at a time.

Anyway, I’m just thinking out loud about the problem at hand: the gap between real-time bidding and actually pulling the data generated by real-time bidding into a reporting interface, so that a human being can look at a marketing problem and examine or address it.

As a business-side person myself, I would love to see anyone else’s commentary on what database structures they have used and on the feasibility of a project like this.
