Rethinking Data Collection To Avoid Heartbreak

This is very much a post for myself to log my thoughts and learnings as I continue to take on new projects and opportunities that require me to work new mental muscles.

tl;dr – We need to be very aware and intentional about what data we collect about our users. Not just for the sake of meeting data regulation requirements, but for the sake of maintaining clean, relevant, and actionable data.

When you’re working with something like Google Tag Manager, you have a lot of options for how to extract data. You can use CSS classes to target specific elements, you can use built-in browser events, and/or (my personal favorite) you can push information to the data layer and capture it there.
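To make the data layer option concrete, here is a minimal sketch of a push. The event name and fields (`add_to_cart`, `productId`, `productCategory`) are made up for illustration, and a plain array stands in for the `window.dataLayer` that GTM reads in the browser.

```typescript
// Hypothetical event shape: GTM only expects objects pushed onto the array;
// the specific field names below are illustrative assumptions.
type DataLayerEvent = {
  event: string;
  [key: string]: unknown;
};

// In a browser this would be window.dataLayer. A local array is used here
// so the sketch runs anywhere.
const dataLayer: DataLayerEvent[] = [];

// Push only what the tag actually needs: a product ID and a category,
// nothing that identifies the person.
function trackAddToCart(productId: number, category: string): void {
  dataLayer.push({
    event: "add_to_cart",
    productId,
    productCategory: category,
  });
}

trackAddToCart(1042, "outerwear");
```

The appeal of this approach is that you decide exactly which fields leave the page, rather than scraping whatever happens to be in the markup.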

GTM is, of course, only one tool, and it is still primarily used for websites and web platforms. At this point in time, most businesses have multiple digital tools or platforms beyond standard websites that will or should collect data. Separately, all of those data sets can be helpful and serve very specific purposes. What a lot of teams miss is the magic that happens when you’re able to merge insights or data from those different sources together.

The world of data regulation is changing (if not mutating) rapidly, making it nearly impossible to collect data at the scale and level of precision we’ve been working towards. We’re now dealing with TCF v2.0 consent requirements being enforced for EU users, which will have a fairly direct impact on any online publication that relies on programmatic ad revenue. If governments start enforcing that level of consent for all analytics tracking, we’ll quickly arrive at a place where we simply can’t understand our top-of-the-funnel users in an actionable way.

To avoid some of this nerdy data heartbreak, we need to shift how we’re thinking about collecting, storing, and even associating data.

Until now, the “gold standard” has been to associate all of our insights with a specific user. We’d start with a user ID, push that to our other platforms, and then merge the collected insights to achieve some level of identity resolution and understand what the user did to progress from discovery to conversion.

For high-value, low-volume conversion models (mostly B2B), there’s still a lot of value in identifying and understanding individuals, and that isn’t likely to change. Those final conversions usually require one or more human interactions and happen over an extended period of time.

In this situation, you’re likely storing the user’s (lead’s) information in something like a CRM and capturing it through an action like a form completion, an account connection, or another action that requires a manual submission from the user. At the point of submission, you simply need to include a terms and conditions agreement that covers your policy for data collection, cookies, and privacy/sharing, and require the user to manually acknowledge it before the action is completed and the information is passed to your system.
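As a rough sketch of that gate, assuming a form with an `acceptedTerms` checkbox: the types and the `toCrmLead` function are hypothetical names for illustration, not part of any particular CRM's API.

```typescript
// Hypothetical form payload. acceptedTerms is the box the user ticks themselves.
type LeadForm = {
  email: string;
  company: string;
  acceptedTerms: boolean;
};

// Hypothetical record handed to the CRM, including when consent was given.
type CrmLead = {
  email: string;
  company: string;
  consentedAt: string;
};

// Returns the record to store, or null when the terms were not acknowledged,
// so nothing reaches the CRM without explicit consent.
function toCrmLead(form: LeadForm, now: Date = new Date()): CrmLead | null {
  if (!form.acceptedTerms) {
    return null;
  }
  return {
    email: form.email,
    company: form.company,
    consentedAt: now.toISOString(),
  };
}
```

Storing the consent timestamp alongside the lead is a design choice of this sketch; it makes it easier to show later when the acknowledgement happened.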

Note: I’m intentionally leaving identity resolution out of this post as I haven’t been able to find enough information to say whether or not merging historical anonymized data with recently consented data is against any standing regulations or at risk of becoming outlawed.

For higher-volume, lower-value conversion models, we need to think about how much we need to know about the user in order to be effective. We know that personalizing messages has a positive impact on user engagement, but when it comes to real-time features, recommendations, and data sharing – do you really need to be using that personalized data?

Let’s look at a feature like a recommendation engine for products:

Your system likely has (or can have) a randomized numeric ID for each user. It’s easier and more reliable to use a number than a name because it avoids issues with similar names, updated name entries, or misspellings. We see WordPress take the same approach with categories and GA take the same approach with custom dimensions. It’s as simple as acknowledging that numbers make more sense to machines than words.

All of the history for that user is likely associated with that number. This includes purchases, product views, items added to or removed from the cart, etc.

If the system is asked to make product recommendations to a user, you really only need to look at the actions associated with the user ID and then compare those to product information to identify the query you want to run. For example:

“I want to recommend similar products to users who have had an item in their cart for more than 3 days.”

For this you essentially “ask” your system to look at the product in the cart, match other products based on how you defined “similar”, and then tell it where/how to display or present those options. This logic can then be applied to all users who had an item in their cart for more than 3 days.

This specific feature or strategy requires absolutely no use of PII. There’s no need to present their name, address, or any other insights you may have, which means you’re never exposing PII in the browser.
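The cart query above can be sketched in a few lines. This is only an illustration under stated assumptions: “similar” is defined as “same category”, and the types, field names, and sample data are all hypothetical. Notice that nothing in it touches a name, email, or address.

```typescript
// Hypothetical records: users and products are known only by numeric IDs.
type CartItem = { userId: number; productId: number; addedAt: Date };
type Product = { id: number; category: string };

const DAY_MS = 24 * 60 * 60 * 1000;

// For each user with an item in the cart for more than minDays, collect the
// IDs of other products in the same category (this sketch's definition of "similar").
function recommendForStaleCarts(
  cart: CartItem[],
  catalog: Product[],
  now: Date,
  minDays: number = 3,
): Map<number, number[]> {
  const recommendations = new Map<number, number[]>();
  for (const item of cart) {
    const ageMs = now.getTime() - item.addedAt.getTime();
    if (ageMs <= minDays * DAY_MS) continue; // not in the cart long enough
    const inCart = catalog.find((p) => p.id === item.productId);
    if (!inCart) continue; // product no longer in the catalog
    const existing = recommendations.get(item.userId) ?? [];
    for (const candidate of catalog) {
      const isSimilar = candidate.category === inCart.category && candidate.id !== inCart.id;
      if (isSimilar && existing.indexOf(candidate.id) === -1) {
        existing.push(candidate.id);
      }
    }
    recommendations.set(item.userId, existing);
  }
  return recommendations;
}

// Sample data: user 1's item is five days old, user 2's is one day old.
const catalog: Product[] = [
  { id: 10, category: "jackets" },
  { id: 11, category: "jackets" },
  { id: 20, category: "shoes" },
  { id: 21, category: "shoes" },
];
const cart: CartItem[] = [
  { userId: 1, productId: 10, addedAt: new Date("2024-01-05T00:00:00Z") },
  { userId: 2, productId: 20, addedAt: new Date("2024-01-09T00:00:00Z") },
];
const recs = recommendForStaleCarts(cart, catalog, new Date("2024-01-10T00:00:00Z"));
```

With this sample data, only user 1 gets a recommendation (product 11), because user 2's item has not been in the cart for more than 3 days.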

Moral of The Story

  • Think through your features, tools, channels, and use cases to evaluate what data you actually need to be collecting or leveraging across your tools.
  • Think about the difference between features meant to make a user’s interaction with you easier and more enjoyable (e.g. remembering their shipping address) and features meant to persuade or push them into doing what you want them to do (e.g. using their name or personal information to make it seem like you know them better than you do).
  • Gather real feedback from your users to better understand the line between offering convenience or ease of use and being creepy and unnerving with the data you have available.
