Transparency: Livestreaming 1,600 API Requests per Second
One of Tapstream’s most brag-worthy features is our Live Tab. It’s where our customers can see a live feed of their marketing and app data sent to Tapstream from all over the world. We created it so customers could easily debug their SDK implementations and peek at their campaigns as they go live. Often, the first sign of traction app publishers see is not the download numbers in the App Store, but the volume of clicks seen inside this Live Tab, and then a flood of in-app events that follow shortly after.
With the live tab, our customers can view all the hits (clicks on our URL shortener, taps.io) and events (app activity tracked by Tapstream’s SDK) coming into Tapstream, and inspect individual users’ timelines. At the time of this writing, our servers see over 10,000 API requests per second, and all of those events are available for viewing via the live tab.
These requests are not just dumb ad redirects: each hit has to support deferred deeplinking (i.e., a very smart deeplink redirection), and then shortly thereafter has to be tied to any app installs to create a conversion timeline – all in real time. These are then consumed by apps as they run for the first time to customize what new users see: a specific product inside an ecommerce app, a friend’s profile in a social app, or a specific hotel inside a travel app, for example.
The basics: Hits and Events
Every hit, event, and page impression on the Live tab begins its life as an HTTP request to one of our geo-distributed web nodes. These nodes take care of gathering as much information as they can from the request, from the HTTP headers for things like referrer and IP, to form parameters set by our SDK and custom query string parameters set by our customers. Tapstream uses protobufs as a data interchange format, so all this information is used to populate a protobuf-generated data structure representing the message, which is serialized and sent on its way to our processing nodes.
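In rough Python, the collection step looks something like the sketch below. The `Hit` dataclass, field names, and the `custom_` parameter prefix are all illustrative stand-ins, not Tapstream’s actual protobuf schema; JSON stands in for protobuf serialization to keep the example self-contained.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class Hit:
    # Illustrative stand-in for the protobuf-generated message class.
    referrer: str = ""
    ip: str = ""
    custom_params: dict = field(default_factory=dict)

def build_hit(headers: dict, query_params: dict) -> Hit:
    """Populate a Hit from an incoming HTTP request's headers and query string."""
    return Hit(
        referrer=headers.get("Referer", ""),
        ip=headers.get("X-Forwarded-For", ""),
        # Customer-defined query string parameters; the prefix convention
        # shown here is hypothetical.
        custom_params={k: v for k, v in query_params.items()
                       if k.startswith("custom_")},
    )

def serialize(hit: Hit) -> bytes:
    # In production this would be the protobuf message's SerializeToString();
    # JSON keeps the sketch dependency-free.
    return json.dumps(asdict(hit)).encode("utf-8")
```

The serialized bytes are what travel over the wire to the processing nodes.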
Everyone has a different definition of “big data”, but we certainly get enough traffic to keep us on our toes when it comes to carefully managing throughput. We aim to deliver sub-second delays between data collection and reporting, and one of the key open-source projects that makes this possible is Kafka. Kafka is a messaging system designed for high throughput applications like ours. It can buffer a huge amount of sequential data with fast reads and writes, by carefully managing data locality on-disk.
Kafka sits between our web nodes, which collect the data, and our Storm cluster, which produces the conversion and reporting data that a customer sees when they look at the dashboard. One of the nice things about Kafka is that multiple services can consume messages from its buffer independently, which is how our live tab can exist without negatively affecting our core product.
The machinery for consuming hits and events from Kafka for display in the live tab runs on our frontend API servers, which back Tapstream’s dashboard. There, we spin up a Python process with two jobs: 1) to consume messages from Kafka, and 2) to accept and retain WebSocket connections opened against the Live Tab API endpoint. All messages must be deserialized, but luckily protobuf makes this fast. The listener checks the account ID of each message against the set of account IDs presently listening to the live tab, and discards the message if nobody cares about it. Messages that are relevant to someone’s immediate interests are formatted as JSON, then published to the WebSocket channel.
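A stripped-down sketch of that filtering step follows; the message shape and the listener registry are simplified stand-ins for the real deserialized protobufs and WebSocket machinery.

```python
import json

def route_message(message: dict, listeners: dict) -> list:
    """Drop messages no one is watching; format the rest as JSON.

    `message` is a deserialized hit/event; `listeners` maps account IDs to
    the open WebSocket channels for that account (stand-ins here).
    Returns one JSON payload per channel that would receive the message.
    """
    channels = listeners.get(message["account_id"])
    if not channels:
        # Nobody is looking at this account's live tab; discard cheaply.
        return []
    payload = json.dumps(message)
    return [payload for _ in channels]
```

The important property is the cheap early discard: the vast majority of traffic belongs to accounts with no live tab open, so most messages never reach the JSON formatting step.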
On the dashboard side, when a customer navigates to the live tab, a WebSocket connection is opened up to the live API endpoint. The connection is held open (as WebSocket connections are wont to be) so that messages can be delivered with no polling overhead. Here, the messages are nicely rendered into the Live tab’s Ember template, ready to be analyzed.
To avoid blowing up our users’ browsers, messages are discarded after 1,000 have been delivered, and a maximum of 100 are shown at one time. With no filters applied, the incoming events are adaptively sampled so that only about five per second are rendered on average; when a filter is applied, the sampling is dropped and up to 100 matching messages are displayed.
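One simple way to implement that behavior – the post doesn’t spell out the exact sampling scheme, so this is a guess at its shape – is to scale a render probability by the incoming rate and back it with bounded buffers:

```python
import random
from collections import deque

MAX_RETAINED = 1000   # messages are discarded after this many delivered
MAX_DISPLAYED = 100   # at most this many shown at one time
TARGET_RATE = 5.0     # average messages rendered per second, unfiltered

def sample_probability(incoming_rate: float) -> float:
    """Probability of rendering a message so renders average TARGET_RATE/sec."""
    if incoming_rate <= TARGET_RATE:
        return 1.0
    return TARGET_RATE / incoming_rate

class LiveBuffer:
    """Bounded buffers backing the live tab display."""

    def __init__(self):
        self.retained = deque(maxlen=MAX_RETAINED)  # oldest fall off the back
        self.shown = deque(maxlen=MAX_DISPLAYED)

    def offer(self, message, incoming_rate: float, filtered: bool = False):
        self.retained.append(message)
        # Sampling is bypassed when a filter is active: every match is shown.
        if filtered or random.random() < sample_probability(incoming_rate):
            self.shown.append(message)
```

At 1,000 incoming messages per second this renders each message with probability 0.005, keeping the browser’s work roughly constant regardless of traffic volume.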
When a session ID filter is applied, something different happens. We still display messages with that session ID, but if a conversion – that is, a pairing between a hit session and an event session – is present for that session, we also display hits with the paired session ID. We also pull up all the older hits and events on that timeline (i.e., hits and events on those session IDs) and display them. This surfaces all of that user’s activity, going back to our current data retention interval (presently 60 days). This information is retrieved from another of our API endpoints, the timeline lookup API.
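The session-filter expansion can be sketched like this; the data shapes (conversion pairs as tuples, the timeline lookup as a plain mapping) are invented for illustration:

```python
def sessions_for_filter(session_id, conversions):
    """Expand a filtered session ID to every session on its timeline.

    `conversions` is a list of (hit_session, event_session) pairs; any pair
    containing the filtered session pulls its partner into the display set.
    """
    sessions = {session_id}
    for hit_session, event_session in conversions:
        if session_id in (hit_session, event_session):
            sessions.update((hit_session, event_session))
    return sessions

def timeline(session_id, conversions, history):
    """Gather older hits and events for every session on the timeline.

    `history` stands in for the timeline lookup API: a mapping from
    session ID to that session's stored messages.
    """
    wanted = sessions_for_filter(session_id, conversions)
    return [msg for sid in sorted(wanted) for msg in history.get(sid, [])]
```

Filtering on an event session thus transparently pulls in the hit session it converted from, and everything stored against either one.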
The live tab server machinery also listens on another Kafka channel. When a conversion is made in our processing cluster, the session pair that just converted is published to Kafka as an attribution. These attribution events are also collected by the live tab driver and sent to relevant clients. When this happens, a special “Converted with” bit is applied to the event.
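The attribution hook might look roughly like this – again, the message shapes and the `converted_with` field name are illustrative, not the actual wire format:

```python
def apply_attributions(events, attributions):
    """Tag events whose session appears in a conversion pair.

    `attributions` is an iterable of (hit_session, event_session) pairs,
    standing in for the conversions published to Kafka by the processing
    cluster. Tagged events carry the hit session they converted with.
    """
    converted = {event_session: hit_session
                 for hit_session, event_session in attributions}
    for event in events:
        hit_session = converted.get(event["session_id"])
        if hit_session is not None:
            event["converted_with"] = hit_session  # the "Converted with" bit
    return events
```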
These can be hard to find in organic traffic, since anyone with enough traffic to get frequent conversions will be subject to enough sampling to make them unlikely to appear. If you see one of these in the wild, count yourself lucky! More often, the conversion is seen when the customer themselves is testing out their implementation. The green circle means you’re good to go!
Want to try it out? Why not sign up for a free Tapstream account today?