
Does Unstructured Data Really Exist?

Image Credit: TechnoLight/BigStockPhoto.com

According to Data Management Solutions Review, 80% of all data will be unstructured within four years. Not surprisingly, the data industry is rushing either to provide solutions for this problem or to claim it has already solved it. But what vendors are missing is that unstructured data doesn’t really exist in the first place. Rather, it’s “covertly” structured, and the sooner enterprises understand (and accept) this, the sooner they will be able to actually deal with this massive influx of data by fully contextualizing and acting on it.

Let’s start with the basic definition of data.

What is data?

Per the Cambridge Dictionary, data is:

Information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer.

The problem is: How can a single piece of data exist all by itself, without any context? Sure, there are applications where the entire data set consists of a trivial stream of IP addresses and timestamps representing user activity. But even then, the data has structure: each record is a pair consisting of an IP address and a timestamp. Structure isn’t optional; it’s what gives data value. If you distilled that data set down to an unordered list of IPs and a separate unordered list of timestamps, you would lose most of the value. Even when working with streams of theoretically unstructured audio or email text, you still have metadata, which is critical for providing context and underpins most of the value. To give an example: There is limited value in using fancy machine learning to understand gigabytes of text, audio, and video if you can’t use metadata to fit individual findings into a broader strategic picture.
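The IP-and-timestamp argument can be made concrete in a few lines. This is a minimal sketch with made-up values; the point is that the *pairing* is the structure, and once it is discarded the data can no longer answer even basic questions:

```python
# Hypothetical activity log: each event is an (IP, timestamp) pair.
events = [
    ("10.0.0.1", 1700000000),
    ("10.0.0.2", 1700000005),
    ("10.0.0.1", 1700000009),
]

# With the pairing intact, "when was 10.0.0.1 active?" is trivial.
activity = [ts for ip, ts in events if ip == "10.0.0.1"]
assert activity == [1700000000, 1700000009]

# Distilled into two unordered collections, as described above,
# nothing connects any IP to any timestamp, and the question
# becomes unanswerable. Most of the value is gone.
ips = {ip for ip, _ in events}
timestamps = {ts for _, ts in events}
```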

Why “unstructured” really means “covertly structured”

If data could really be completely unstructured, you’d only need to store a single record with a gigantic, undefined blob attached to it. But in reality there are multiple record entries, each with its own key to uniquely identify it, which means, at a minimum, the people storing the data implicitly accept that it does indeed have structure. Over time, more and more of this covert structure comes into view and we eventually find ourselves working with child records and foreign keys.
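A hypothetical key-value store makes the covert structure visible. The record keys, the parseable blob format, and the embedded references below are all illustrative assumptions, but they mirror what real applications do: the “unstructured” value is parsed into fields, and an embedded key is followed exactly like a foreign-key lookup:

```python
import json

# Hypothetical key-value store: the values are nominally opaque blobs...
store = {
    "order:1001": json.dumps({
        "customer_id": "cust:42",        # a foreign key in disguise
        "lines": [                       # child records in disguise
            {"sku": "WIDGET-9", "qty": 3},
        ],
    }),
    "cust:42": json.dumps({"name": "Acme Corp"}),
}

# ...but the application plainly depends on their structure: it parses
# the blob and chases the embedded key to another record.
order = json.loads(store["order:1001"])
customer = json.loads(store[order["customer_id"]])
assert customer["name"] == "Acme Corp"
```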

So, what we have in fact been dealing with all along is not really unstructured data but covertly structured data. It has a structure, but that structure is not visible outside of the application.

And in fact, covertly structured data has made individual developers’ lives much simpler: they no longer have to deal with corporate data models or with other teams that have different goals and timescales, and they are free to use whatever data structures make sense for their immediate purposes.

Why most data platforms can’t support covertly structured data

Allowing developers to own the data structures means we get to market much faster. But as applications grow and get connected to other applications, challenges emerge. These challenges will be eerily familiar to anyone old enough to remember life before the RDBMS.

Not synchronizing the structure with other applications before deployment is arguably a form of technical debt. Minor issues, such as the format of address fields, become serious when you are trying to verify that two records are semantically the same but syntactically different because the developers never spoke to each other or to a DBA. More serious issues, such as storing data for the same thing in multiple locations, can become really problematic.
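The address-field problem looks harmless until you write the reconciliation code. The records and the normalization rules below are entirely hypothetical, but they sketch the kind of ad-hoc glue that accumulates when two teams never agree on a schema:

```python
# Hypothetical records: two teams stored "the same" address differently.
app_a = {"street": "12 High St.", "city": "Austin", "state": "TX"}
app_b = {"addr": "12 High Street, Austin, Texas"}

def normalize(street, city, state):
    """Ad-hoc reconciliation that a shared schema would have made unnecessary."""
    street = street.lower().rstrip(".").replace("street", "st")
    state = {"texas": "tx"}.get(state.lower(), state.lower())
    return (street, city.lower(), state)

a = normalize(app_a["street"], app_a["city"], app_a["state"])
b = normalize(*(part.strip() for part in app_b["addr"].split(",")))
assert a == b == ("12 high st", "austin", "tx")
```

Every such function is fragile (note the hand-maintained state-name table), and each pair of applications needs its own.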

Over time, though, even more fundamental problems will emerge as newly required features such as transactions, foreign key lookups, and running totals put data platforms “designed” for unstructured data to the test for various reasons:

  1. If every application invents its own data structures from scratch, data sharing between applications will be a challenge.
  2. If the client is the only part of the application that understands the data format, then it’s not possible to send only the needed data back and forth across the network. Changing a single bit in a 20KB record will use just as much network bandwidth as creating the same 20KB record from scratch, because everything goes across the wire, every time.
  3. Sooner or later, a requirement for transactions, in the form of simultaneous coordinated changes to multiple data items, will appear. While people can, and do, retrofit ACID, it goes against the grain of unstructured data and rarely performs well.
  4. There are practical and technical limits to how much unstructured data you can store for a single key, and once you exceed those, you need to break up the data into multiple objects. Understanding and maintaining the connections between these objects becomes a new chore for developers.
  5. What happens if your business needs rapid answers to questions like “How many widgets were sold in Texas last week?” if your data lacks structure until a piece of code re-instantiates each object? What if you have 47 million of them?
  6. How do you implement the General Data Protection Regulation’s “Right To Be Forgotten” if you lack the internal ability to remember customers?
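Point 5 above is the easiest to demonstrate. The sketch below uses hypothetical sales data in an in-memory SQLite database: once the structure is declared up front, the question becomes a single query the database can plan and index, instead of code that re-instantiates millions of opaque objects just to inspect them:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Declared structure: product, state, week number, quantity.
db.execute("CREATE TABLE sales (product TEXT, state TEXT, week INTEGER, qty INTEGER)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("widget", "TX", 47, 120),
        ("widget", "TX", 47, 80),
        ("widget", "CA", 47, 500),   # wrong state, excluded
        ("widget", "TX", 46, 999),   # previous week, excluded
    ],
)

# "How many widgets were sold in Texas last week?" is one statement.
(total,) = db.execute(
    "SELECT SUM(qty) FROM sales "
    "WHERE product = 'widget' AND state = 'TX' AND week = 47"
).fetchone()
assert total == 200
```

With a covertly structured blob per sale, the same answer requires fetching, parsing, and filtering every record in application code.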

These are real issues. Many vendors that were pure key-value stores are now adding SQL layers and support for transactions retroactively. We’re also seeing schema registries appear, but that just brings us to Andrew Tanenbaum’s quip about “The nice thing about standards is that you have so many to choose from”.

Unstructured data: the bottom line

There is no doubt that unstructured data, especially in the context of streaming, has rapidly become a major topic of discussion. But like a lot of things in the software industry, it is misunderstood and, per the point of this whole article, mislabeled. To use data, you need to be able to understand it; to understand it, you need to give it structure; and all data, in the end, has some kind of structure, even the data that people are now calling “unstructured.”

So, when we talk about “unstructured” data, we really mean “covertly structured” data, as somebody, somewhere, knows what the data means.

The fundamental change here is that we have broken with the classic RDBMS-era ideal of the “One Big Schema,” also known as the “Enterprise Data Model,” which aspired to a world where everyone in an organization understood all the data in the system, at all times. Arguably we have now gone back to where we started before the invention of the RDBMS, with a separate data model optimized for each application. This is a double-edged sword, and technology leaders need to be aware of the downside. Leaders need to ask the awkward questions and deploy solutions to mitigate the challenges created by not having clear, common schemas.

Author

David Rolfe is VoltActiveData’s Director of Product Marketing. He has 30 years of data industry experience, half of which has been in a telco context. In prior roles he designed and built charging, policy and mediation systems.
