The Art of Approximating Reality with Data

Data science should never be performed in a vacuum, so data scientists must become comfortable with asking for what they need and listening to what their nontechnical colleagues need.

At first glance, asking whether data science is an art or a science seems unnecessary: the answer is right there in the name. Probing the question further does not change that answer (data science is, indeed, a science), but it does expose strikingly similar imperatives at the heart of both art and data science.

Insofar as data science is an attempt to arrange facts and figures in a way that reflects the empirical world, it, like (representational) art, produces approximations of reality — “approximations” being the operative and, arguably, the most interesting word here.

There is an inherent tension between uniqueness and reusability in data science’s approximations. On the one hand, a model designed to solve one problem and one problem only — i.e. a model that hews as closely as possible to reality — may be incredibly useful in a specific set of circumstances, but will be otherwise useless. On the other hand, a model designed to be highly transposable — i.e. a model that captures only the broad strokes of reality — may be applicable in many sets of circumstances, but will deliver only shallow insights in each.
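
To make the tension concrete, here is a deliberately tiny sketch in Python. The sales figures, the day-of-week framing, and both "models" are hypothetical, invented purely for illustration; real models sit somewhere between these two extremes rather than at either end.

```python
from statistics import mean

# Hypothetical toy data: (day_of_week, units_sold) pairs, invented for illustration.
train = [(0, 10), (1, 12), (2, 11), (3, 13), (4, 20)]

def memorizer(history):
    """A model tailored to exactly the cases it has seen (maximally specific)."""
    table = dict(history)
    # Returns the recorded value for a seen day, and None for any unseen day.
    return lambda day: table.get(day)

def global_average(history):
    """A model that keeps only the broadest stroke: one number for every case."""
    avg = mean(units for _, units in history)
    # Ignores the day entirely and always returns the overall average.
    return lambda day: avg

specific = memorizer(train)
general = global_average(train)

print(specific(4), general(4))  # a day the models were built on
print(specific(5), general(5))  # a day neither model has seen
```

The memorizing model is exact on the days it was built from and has nothing to say about any other day; the averaging model always has an answer, but it is the same answer every time.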

Just as pinpointing and producing the desired degree of abstraction from reality — the “perfect approximation,” as it were — is the crux of the painter’s or sculptor’s artistry, so, too, is it the crux of the data scientist’s practice. In short, effective data science involves finding ways to resolve this tension between uniqueness and reusability in a productive way. To clear the path to such a resolution in a real-world business environment, two things are absolutely essential: good data infrastructure and active, trusting listening and communication.

Having the Right Data Is More Important Than Having the Most Data

Data science is not alchemy. No matter how sophisticated, no algorithm is capable of papering over sub-par data infrastructure or ill-conceived data collection processes, at least not over the long term. In other words, when it comes to driving meaningful improvements to an organization’s business, the “data” of “data science” is just as important as the “science.” If an organization wants to ask — and answer — penetrating key business questions (KBQs), it must take pains to collect, purchase, or otherwise acquire not only extensive, accurate data, but the right data.

But while many organizations are already capturing some data that pertains to their customers, only a fraction of them are capturing data that aligns with the depth (or precision) of insight they wish to achieve. This is where a data scientist’s input becomes invaluable. From Internet of Things devices to ecommerce platforms to point-of-sale systems in brick-and-mortar stores, the myriad mechanisms of data collection at organizations’ disposal mean that the size of an organization’s storage infrastructure is practically the only limit on the volume of data it gathers. That said, depending on the nature of an organization’s key business questions, more data may not necessarily lead to more accurate answers.

Fortunately, an organization’s data science partner will be able to guide the organization through setting up the data infrastructure it needs to capture the right data — i.e. the data required for the analyses that will generate answers to its KBQs.

The Importance of Listening

If the right data amounts to a data scientist’s paint, brush, and canvas, the ways in which the data scientist combines tools and techniques to produce a model that reflects a “perfect approximation” of the conditions at the heart of an organization’s KBQs amount to their artistry. Creating the space for a data scientist to deploy this artistry — and do so strategically — is where listening and communication come into play.

To develop a model that produces high-impact, highly actionable insights, a data scientist must attend to a great many details. It is not unusual for a team of data scientists to dedicate the vast majority of the hours they spend on a project to essential but fairly routine tasks like data cleansing. For instance, especially when working with second- or third-party data, data scientists must begin a project by performing various entity resolution and deduplication tasks. Then, the data scientists need ample time and space to explore their data, discover interesting relationships within it, model those relationships, and, ultimately, use them to make predictions.
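
As a rough illustration of what a deduplication pass can involve, the Python sketch below groups records whose names match after simple normalization. The records, the field names, the suffix list, and the helper functions are all hypothetical; production entity resolution usually relies on fuzzier matching and more attributes than a single name field.

```python
import re
from collections import defaultdict

# Hypothetical list of company suffixes to ignore when comparing names.
SUFFIXES = {"inc", "llc", "ltd", "co", "corp"}

def normalize(name):
    """Lowercase the name, strip punctuation, and drop common company suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    tokens = [token for token in cleaned.split() if token not in SUFFIXES]
    return " ".join(tokens)

def group_duplicates(records):
    """Group records that share the same normalized name."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize(record["name"])].append(record)
    return dict(groups)

# Hypothetical records, as they might arrive from two different sources.
records = [
    {"name": "Acme, Inc.", "source": "crm"},
    {"name": "ACME Inc", "source": "vendor_feed"},
    {"name": "Globex Corp.", "source": "crm"},
]

groups = group_duplicates(records)
for key, members in groups.items():
    print(key, [member["source"] for member in members])
```

Even this small example hints at why the step is time-consuming: every normalization rule is a judgment call about which differences between records actually matter.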

To a nontechnical observer, the underlying mechanics of each of these steps typically register somewhere between confusing and totally unintelligible. A business stakeholder may understand that these steps are important, or even why they are important, but they are unlikely to understand how a team of data scientists needs to structure and approach its work in order to complete these steps efficiently and effectively. As such, nontechnical stakeholders need to trust their technical colleagues when the latter say that a question is simply too specific to be answered with the data infrastructure that is currently in place or the data science that currently exists. Again, data science is an approximative (or, more precisely, probabilistic) science, and it cannot be expected to answer a question like “How will Customer X behave on Day Y at Time Z?” with absolute certainty.

However, this active, trusting listening must run both ways. A team of data scientists needs to trust a team of business stakeholders when the latter tells the former that the organization is only interested in pursuing a certain set of business goals. It can be tempting for a data scientist to advocate for the development or implementation of a model or solution solely because it is interesting from a data science perspective, but without taking an organization’s business interests into account, this will seldom amount to anything more than data science for data science’s sake — which is to say, some interesting analytics that have little to say about the circumstances at hand. (This is not to suggest that data scientists have no place in helping organizations develop their KBQs — quite the opposite, in fact.)

Approximating Reality with Purpose

Conveniently, the natural give-and-take between data scientists and business stakeholders often ends up setting a course toward a data-driven solution that is a “perfect approximation” of reality. Stakeholders with their eyes on the books tend to steer a model away from being too abstracted from the conditions on the ground, whereas stakeholders with their eyes on the database tend to steer the model away from being tailored too closely to a set of highly specific (i.e. unrepeatable) conditions.

Of course, this give-and-take notwithstanding, the development, implementation, and fine-tuning of data-driven models ultimately fall under the purview of data scientists. It is data scientists, not business stakeholders, who understand how to add or subtract degrees of abstraction from a model in pursuit of various aims, as well as what kind of data is needed to maximize a model’s value.

But, unlike an artist’s end product, a data scientist’s end product is a means to an end, not an end unto itself. A data scientist’s artistry — their development of a model that approximates reality — is always shaped by an outside force, by the objective(s) it is expected to achieve. And, to be clear, this should not be understood as an imposition or restriction, but as a necessary condition of effective data science.

While artistic approximations of reality may be considered a success if they produce aesthetic pleasure, moral edification, or intellectual stimulation, the success of (data) scientific approximations of reality hinges on their production of actionable insights that drive genuine business value. Such production requires the right materials (data infrastructure) and the right muse (business goals), neither of which a data scientist can determine without collaborating with and listening to their nontechnical colleagues.
