So, where is the data and how fast can it go?
Working at a big company, you get very used to how things are done. Data has a certain availability, and you need to work around this and play within the confines you are given. Any major infrastructure change could take months or years, and there is always the fear of killing the golden goose, plus the switching costs of going from one system to another. The basic rule of thumb is to use what is currently working and innovate within its confines unless there is a major loss or a real need to change things.
I’m specifically talking about adserving technology: the databases that store the information gleaned from adserving, and how that data is structured for querying and reporting.
I recently got into a discussion with the CEO of an emerging data company, someone I have a ton of respect for and who also has an extensive technology background. Our discussion was around real-time data reporting and its feasibility. Typically, most adservers dump logs into massive long-term storage built on Hadoop, Netezza, or even Oracle. There is definitely a maximum number of records you can insert per second, and there are limitations on the structure of this data that can complicate how you pull it out later for reporting. Furthermore, the more data you store in one place, the harder it is to pull out a tiny piece of it quickly.
When talking about advertising, for an individual advertiser or marketer you are talking about between 10 and 100 million display ad impressions per day, along with tens or hundreds of thousands of clicks and tens or hundreds of thousands of page views. To break that down, let’s say you need to store:
100 MM impressions
300k clicks
20k page view events
Per day totaling 100,320,000 event records
so that’s 4,180,000 events per hour
or 69,667 per minute
or 1,161 per second
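To double-check the breakdown, here is a minimal Python sketch of the arithmetic. Summing the three daily figures listed above gives 100,320,000 events, and the hourly, per-minute, and per-second numbers are plain averages that assume traffic is spread evenly across the day. The names DAILY_EVENTS and average_rates are just illustrative.

```python
# Daily volumes for one hypothetical advertiser, taken from the figures above.
DAILY_EVENTS = {
    "impressions": 100_000_000,
    "clicks": 300_000,
    "page_views": 20_000,
}


def average_rates(daily_events):
    """Return average event rates, assuming traffic is spread evenly over 24 hours."""
    total = sum(daily_events.values())
    return {
        "per_day": total,
        "per_hour": total / 24,
        "per_minute": total / (24 * 60),
        "per_second": total / (24 * 60 * 60),
    }


rates = average_rates(DAILY_EVENTS)
for unit, value in rates.items():
    # Print each rate rounded to the nearest whole event.
    print(f"{unit}: {value:,.0f}")
```

These are averages only; real traffic is lumpier than this.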
Then you have to think about how that number will spike during certain hours of the day. And let’s say you want to design the system to handle a lot of advertisers, say 1,000. At roughly 1,161 events per second each, that is about 1.16 MM events per second on average, and once you allow for peak-hour spikes you are talking about being able to handle somewhere between 1 MM and 20 MM events inserted into the database per second.
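A rough way to see where a 1 MM to 20 MM range could come from is to multiply the per-advertiser average by the number of advertisers and by a peak multiplier. The sketch below does that; the peak factors (1, 5, 17) are my own illustrative assumptions, not measured numbers, and required_insert_rate is a made-up helper name.

```python
# Average events per second for one advertiser, from the breakdown above.
PER_ADVERTISER_AVG = 1_161


def required_insert_rate(advertisers, avg_per_advertiser, peak_factor):
    """Events per second the database must absorb at a given peak multiplier."""
    return advertisers * avg_per_advertiser * peak_factor


# Peak factors are assumed for illustration: 1 = flat traffic, 17 = a heavy spike.
for peak_factor in (1, 5, 17):
    rate = required_insert_rate(1_000, PER_ADVERTISER_AVG, peak_factor)
    print(f"peak x{peak_factor}: {rate:,} inserts per second")
```

With flat traffic this lands near the bottom of the range, and it takes a spike of roughly 17 times the average to approach the top of it.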
So how do you do this while managing costs? And is it even doable?
One thing is for sure: you need to own the data source. Importing data from third-party adservers or publishers is off the table, because the server-to-server transfer alone will add valuable seconds to your process. If you ever plan to do calculations on that data, like conversion rates, match processes, or click-through rates, you are adding even more processing time. And we haven’t even gotten to querying that data out of the database yet; we’re just inserting it. That said, querying it out should be a lot quicker and easier if it is structured properly, because you only have one or two users at a time querying each advertiser’s data set.
Anyway, I’m just thinking out loud about the problem at hand: the gap between real-time bidding and actually pulling the data it generates into a reporting interface, so a human being can look at a marketing problem and examine or address it.
As a business-side person myself, I would love to see anyone else’s commentary on what database structures they have used and on the feasibility of a project like this.
