Wednesday, August 11, 2010

Clickstreams, footstreams, sensorstreams, tweetstreams, and otherstreams

Came across an interesting term today -- “footstream”. It’s used by Jeff Holden at Whrrl to describe a geolocated data event that has a particular meaning or importance to it. Here’s his description from a talk he gave at Where 2.0 in 2009.

People vote with their feet. An individual person visits places that are in some way important to that person... Location-based services [can now] provide us with the ability to capture, in digital form, the places people go. And “places” does not mean just the lat/longs, the cities or zip codes or neighborhoods. ... We can capture which businesses or other points of interest individual people visit. This data set is the real-world analog of a clickstream in the Web domain; in fact, we might call it a “footstream.”
This comparison of footstreams to clickstreams is interesting and apt. The pervasiveness of capturing and analyzing clickstream data was explored in depth in a recent WSJ series. To anyone in analytics, interactive advertising, ecommerce, and other consumer tech industries, the practice of capturing clickstream data is not necessarily new.
What is new is the proliferation of companies getting this data over the past 2-3 years. Several years ago, a website would typically have just one embed to capture clickstream data -- its analytics program. And if the data was sent to the service provider, it wasn’t used beyond providing the analytics service and meeting the provider’s internal needs. Now sites have upwards of 60 services included within their pages that capture clickstream data and metadata around it. The ad widgets, recommendation widgets, and other services that appear on a page all use the data from the click to provide the appropriate response. They also store and use this data across their networks and for secondary and tertiary purposes (market research, subsequent service requests, selling it to third parties, etc.).

When the comparison is made between footstreams and clickstreams, you can see where geolocation is going. You can see that the data being captured will be used to provide benefit now, and will be stored and processed in the future both for individual users and for the benefit of third parties. You can also see the issues and the magnitude involved in dealing with this data. Capturing and processing clickstream data is not a simple matter. When servicing a number of high-traffic sites, it quickly becomes overwhelming -- to the point that it’s difficult to make use of the data because of its sheer amount and the complexity of the variables (sites, pages, and query string parameters are just the tip).
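To make the "complexity of the variables" point concrete, here is a minimal Python sketch that splits a single click into site, page, and query parameters. The URL and the explode_click helper are made up for illustration; a real collector would also carry timestamps, referrers, cookies, and more.

```python
from urllib.parse import urlsplit, parse_qs


def explode_click(url):
    """Split one clicked URL into a few of the dimensions a collector might store."""
    parts = urlsplit(url)
    return {
        "site": parts.netloc,
        "page": parts.path,
        # parse_qs returns a list per key because a key can repeat in a query string
        "params": parse_qs(parts.query),
    }


# Hypothetical click event; this URL is an assumption, not from any real site.
event = explode_click("http://shop.example.com/products/shoes?color=red&size=9&ref=ad42")
print(event["site"], event["page"], sorted(event["params"]))
```

Even this toy event yields five separate values to index. Multiply that by every page view across every site a provider serves and the scale of the problem starts to show.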

As industries grow around the use of more and more realtime atomic streams of data -- tweets, smart meter data, sensor data -- we’re increasingly seeing patterns in dealing with this deluge of streamed data. Capturing, processing, storing, analyzing, and archiving these streams takes thinking. It also takes horizontal scaling of both servers and data storage, and stateless approaches to web application design and development, something that developing in the cloud helps with immensely. (In a subsequent post, we’ll explore these patterns.)
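As a rough illustration of how stateless handling and horizontal scaling fit together, here is a small Python sketch. It assumes each event carries a stream key (a made-up meter id here) and that there is a fixed pool of workers. pick_worker is a hypothetical helper, not part of any particular product.

```python
import hashlib


def pick_worker(stream_key, worker_count):
    """Map a stream key (user id, meter id, ...) to a worker index.

    A stable hash means any front-end server computes the same answer
    for the same key without sharing state with the others.
    """
    digest = hashlib.sha256(stream_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % worker_count


# Hypothetical smart meter readings: (meter id, kilowatts).
events = [("meter-17", 4.2), ("meter-03", 1.1), ("meter-17", 4.0)]

buckets = {}
for key, reading in events:
    buckets.setdefault(pick_worker(key, 4), []).append((key, reading))

print(buckets)
```

Because the routing depends only on the key, more front-end servers can be added without any coordination. One caveat: with plain modulo hashing, changing the number of workers remaps most keys, which is why larger systems often use consistent hashing instead.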

Products and services built to process email, securities trading, ecommerce transactions, and even user-generated video are used to these issues. But a number of industries are just beginning to see what they’re in for. The Smart Grid and the Internet of Things have been getting a lot of attention recently, but only this summer has there been much mention of data handling issues for these areas. That will change, especially as the streams grow in volume and complexity and as the derivative uses become more apparent.

Every web application is now an event transaction processing application. It's just a matter of what type of datastream you're working with.


  1. Great post. It made me think of what Eric Schmidt recently said:

    "Every two days now we create as much information as we did from the dawn of civilization up until 2003, according to Schmidt. That’s something like five exabytes of data, he says.

    Let me repeat that: we create as much information in two days now as we did from the dawn of man through 2003."

    Imagine how much bigger that will be when billions of devices are generating data 24/7.

  2. And data isn't stored just one time. It's replicated across data stores multiple times for persistence and recovery. This is done at several layers of the service stack, so one piece of data could get replicated 3 or more times. (App developers might store data at multiple locations in the cloud. Cloud providers in turn will likely replicate each store to meet their SLAs.)