Data Science-ing for Insight — Embrace Your Domain Expert Partners

Randy Au
6 min read · Feb 28, 2019


When it comes to finding data-driven insights for use in business, working with domain experts is key.

I briefly touched on this in my previous article about succeeding as a data scientist in a startup, where I said people should leverage domain experts (also called subject matter experts). I’m expanding on it here because of how important it is.

TL;DR: People who work with data? It’s great that you’ve got all this data and math and all, but life is so much better, and you’ll be much more impactful, if you’ve got input from the people who work closest to where the data comes from.

For the purposes of this article, I’m going to take a very broad definition of domain knowledge: information that is unique to a domain or system and wouldn’t be self-evident in the data. It’s the kind of stuff someone who works closely with the system would find obvious, but that can seem obscure and layered in technical details to everyone else.

For example, it could be something obvious like “most people leave work in the evening” or technical like “packages won’t ship today unless payment is completed before 2pm, due to warehouse processing”. It’s knowledge and logic that is likely to affect any conclusions you draw from your data, but isn’t explicitly encoded within the data itself.

From the perspective of a data scientist who only sees data entries in tables and logs, you’d have no idea why something is happening.
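To make the shipping example concrete, here’s a minimal sketch of what that rule looks like once someone writes it down. Everything here is hypothetical (the function name, the cutoff, the simplified next-day logic); the point is that without it, a sharp 2pm cliff in shipping-lag data would just look like noise.

```python
from datetime import datetime, time, timedelta

# Hypothetical domain rule: payment must complete before the 2pm
# warehouse cutoff for an order to ship the same day.
WAREHOUSE_CUTOFF = time(14, 0)

def expected_ship_date(payment_completed_at: datetime):
    """Earliest date a shipment can leave the warehouse."""
    if payment_completed_at.time() < WAREHOUSE_CUTOFF:
        return payment_completed_at.date()
    # Missed the cutoff: slips to the next day (ignoring weekends and
    # holidays for simplicity; the real rule would be messier).
    return payment_completed_at.date() + timedelta(days=1)

print(expected_ship_date(datetime(2019, 2, 28, 13, 30)))  # same-day ship
print(expected_ship_date(datetime(2019, 2, 28, 15, 45)))  # next-day ship
```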

Every time I fire a linguist, the performance of our speech recognition system goes up. — Fred Jelinek

In machine learning, there is always this tension between using the knowledge of experts in the problem being studied and using completely different approaches. The way humans make sense of the world is very different from how we teach machines to make sense of the world, and there’s no guarantee that the human way is the optimal way either.

So, keeping in mind that there’s a balance to be struck somewhere, let’s dive into how domain knowledge has many uses in a data science application.

Insight vs Optimization

I’m going to make a big distinction between insight and optimization here. Whether your goal is one or the other (which comes down to what questions you’re asking) shapes how valuable the knowledge and opinions of experts will be.

Insight is a deep understanding about something. Ideally it’s something that allows you to make one or more decisions or changes that bring a dramatic change to the business. Sometimes it’s just an idea that gets used to make something better; sometimes it’s as drastic as “we shouldn’t even be in this business to begin with.”

In searching for insight, having domain knowledge can be very powerful. A single dataset rarely holds all the quirks and behaviors of a system within it, so a correlation or behavior pattern found within the data often doesn’t have the context needed to connect it to the business. But that same seemingly trivial piece of information pulled from the data can connect to domain knowledge in an expert’s head and bring about useful insight.

Meanwhile, searching for optimization is a different activity. It tends to be results-driven (“raise sales by 3%”) compared to broader insight questions (“what are people doing?”). This is the kind of situation that brought A/B testing to the forefront of modern tech development processes, by showing that it is very difficult for anyone, even so-called experts, to predict which widget design will do better in the wild.

At the same time, many A/B tests don’t yield much in the way of insight by themselves. Why is the blue button doing better than the red one? It’s hard to say with just a dataset. You could always hypothesize further and run more tests to chase down the deep reason why, but the act of optimization itself is happy to stop before that point.
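To show how mechanical the optimization step can be, here’s a minimal sketch of declaring an A/B winner with statsmodels. The click counts are made up; notice how the result says which button won and nothing about why.

```python
from statsmodels.stats.proportion import proportions_ztest

# Made-up conversion counts for the red vs blue button experiment.
clicks = [410, 468]            # conversions for variants A and B
impressions = [10000, 10000]   # traffic shown to each variant

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value tells you B beat A, and that's all it tells you.
```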

I’m aware I’m drawing a dichotomy here that’s not exactly true. With good scientific inquiry, you can take a bunch of small optimization tests and build a testable theory for why something happens; once you establish that, you’ve developed an insight. Similarly, experts who have tuned a system like FICO credit scores over a long time, using all sorts of methods, can leave very little room for machine learning to outperform humans.

Domain Knowledge Partners

The position of data scientist, and to a lesser degree advanced analyst, is one of the most cross-functional positions around. You’re required to work closely with backend and frontend engineering, execs, sales and marketing, finance, customer support, and just about anyone else who gathers, processes, or uses data up and down the corporate ladder.

There’s no way you can know everything they do, so make friends and make use of their knowledge. They’ve seen more “training examples” than you have, just by virtue of doing their jobs. The comic below is fair warning.

Yes, I like this one comic a lot. https://xkcd.com/1831/

Luckily, data skills are some of the most generalizable skills out there. Just about any aspect of work can be approached with a researcher’s lens and potentially improved in some way. It’s often easy to make new friends when your value proposition is helping them, so long as you actually and honestly listen to them.

So here’s a non-exhaustive list of things these wonderful people can help you with.

Explaining special cases

Probably the most important reason of all.

The world is full of things that seem illogical at first glance but have some kind of deeper reasoning behind them. Seemingly arcane and inefficient safety regulations exist because they’re written in blood; there’s a giant gap in your high-value customer sales funnel data because those accounts go through a special-purpose CRM that’s poorly integrated with your back end; the company stopped making revenue during one specific week before you started working here because we were under a DDoS attack.

Before you go insane trying to figure out all these details from data tables, just ask someone.
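Once someone does tell you, folding that knowledge into the analysis is usually the easy part. Here’s a minimal sketch, with hypothetical dates and column names, of annotating and excluding a known incident like that DDoS week:

```python
import pandas as pd

# Hypothetical: daily revenue with one known-bad week an expert told us about.
DDOS_WEEK = (pd.Timestamp("2018-06-04"), pd.Timestamp("2018-06-10"))

revenue = pd.DataFrame({
    "date": pd.date_range("2018-05-28", periods=21, freq="D"),
    "revenue": 1000.0,  # placeholder values
})

mask = revenue["date"].between(*DDOS_WEEK)
revenue.loc[mask, "known_incident"] = "ddos_attack"  # annotate for posterity
clean = revenue[~mask]  # or keep the rows and model the incident explicitly
print(f"Dropped {mask.sum()} incident days from the analysis")
```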

A source of shortcuts

While it is scientifically super interesting to teach a model to complete a task without giving it a priori knowledge about the world, it’s really hard! Also, very few people in industry have time for that; most of us work in a “we’d like to make a decision yesterday” scenario.

Just like injecting an informed prior into your Bayesian model instead of defaulting to a naive uniform one, having access to domain knowledge lets you make simplifying assumptions that make things easier to develop. For example, if you know that some cases are pathological, like a shipment going out before its order was made, you can safely ignore those cases when they pop up (because there’s always pathological data in a db).
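As a minimal sketch (with hypothetical column names), that kind of domain-informed filter is often just a line or two:

```python
import pandas as pd

# Toy orders table; the middle row claims it shipped before it was ordered.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "ordered_at": pd.to_datetime(["2019-01-05", "2019-01-06", "2019-01-07"]),
    "shipped_at": pd.to_datetime(["2019-01-06", "2019-01-04", "2019-01-09"]),
})

# Domain rule: shipments can't precede orders, so such rows are data
# pathologies we can drop (or log for the data-quality backlog).
pathological = orders["shipped_at"] < orders["ordered_at"]
print(f"Dropping {pathological.sum()} impossible rows")
orders = orders[~pathological]
```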

Sure, it’s always important to check assumptions, even expert opinions. But if I have to choose between spending time checking an expert’s assumption and checking something that even the expert can’t explain, I’d focus on the unexplained one first before circling back.

Sanity checking and hypothesis generation

There’s that old saying that “all models are wrong, but some are useful”; the trick is knowing how wrong you are.

In predictive analytics/ML settings, you have standard techniques for measuring this that involve holding data out of your training sets (sketched below), but such methods aren’t as common in more generalized analysis situations.
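For reference, that holdout ritual is only a few lines in scikit-learn. This is a generic sketch on synthetic data, not anything specific to a real problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Fit on one slice of the data, measure wrongness on a slice the
# model never saw during training.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Holdout accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

There’s no equivalently standard scorecard for a one-off exploratory analysis, which is where the experts come in.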

One way to work around this is to ask an expert: “I found this conclusion in the data; how does it line up with your view of the world?” Their reactions range from useful (“Oh, that looks about right/wrong”) to interesting (“Wait, really? How come?”).

The best times are when they come up with interesting hypotheses about why some data looks peculiar, because guess what: you, as the data person, are best equipped to experiment and test whether that hypothesis is true!

Triangulation of results

Similar to sanity checking, you can use experts to help triangulate results. If we assume there’s a base reality that all our models and experiences are referencing, then different methods and systems should yield similar results.

I once created a revenue model based solely on customer renewal metrics, while the finance team had a completely independent model they were using to predict future cash flows. When we compared the outputs of the two models, they agreed to within roughly 5–10% of each other.

The fact that both models came so close despite having very different methodologies gave both teams greater confidence that we hadn’t overlooked something. When later compared to real revenue, both models came within about 5% of actual.
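The comparison itself is trivial once you have both sets of numbers. Here’s a sketch with made-up figures (the real forecasts belonged to two separate teams):

```python
# Two independently built forecasts for the same periods, in dollars.
renewal_model = [1.02e6, 1.10e6, 1.15e6]   # renewal-metrics forecast
finance_model = [0.97e6, 1.04e6, 1.12e6]   # finance's cash-flow forecast

for quarter, (a, b) in enumerate(zip(renewal_model, finance_model), start=1):
    rel_diff = abs(a - b) / ((a + b) / 2)
    print(f"Q{quarter}: models differ by {rel_diff:.1%}")
# Agreement within ~5-10% from very different methodologies is what
# gave both teams confidence nothing big had been overlooked.
```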

Go out there, make some friends

I’m super introverted, so none of this is easy for me. But if you’re open to having these interactions, it usually works out. Keep at it.
