Data is the Blood of an Organization
By Robin Bloor
You can think of data as the lifeblood of an organization. For most organizations, stop the flow of data and they are instantly insensible – probably incapable of action and in mortal peril.
It was not always so. But as decade gave way to decade, more and more of an organization’s data found its way into computer systems and began to live inside a digital universe – one that now stretches from data center to data center, into the cloud and onto desktops and mobile devices everywhere. This digital universe will soon expand into devices of every kind, from fire hydrants to irrigation systems. In fact, it is already doing so.
If we pursue the analogy of “data as blood,” we need to think of raw data as “food” that is digested in a series of “processes” until it is pure enough to “cross the intestinal wall” and enter the bloodstream. Once in the bloodstream, depending on its nature, it can be stored in different places. It may be dispatched to a data warehouse, which is a little like a human liver. It serves the organization with data in many contexts, just as the liver provides a whole host of useful things for the rest of the body: it stores glycogen, takes part in protein synthesis, produces hormones, cleans the blood, aids in the digestion of lipids and stores more vitamins than you can shake a stick at.
Just like the food, vital materials (vitamins) and hormones (messages) transported around by the bloodstream, data fulfills a wide variety of roles in the operation of any organization. But no matter what systems it supports, if data is not pure and not managed well in its use, “illnesses and infirmities” develop.
The bloodstream analogy suggests that data ingestion (digestion) is a far more complex process than we generally think it is. Human digestion of food is certainly a very complex process, and that is worth taking note of.
In IT terms, we used to live in a world where data was captured directly by the organization, by data entry clerks whose input was cleansed by a whole series of validation rules that ensured the data was fairly clean, if not sparkling clean, when it entered the organization. We built systems that also captured the data structure perfectly. In summary, when data entered the organization, we knew where it came from and exactly what it was, and we could manage its life cycle – even though, sad to say, few organizations put much effort into managing the data life cycle. They preferred instead to “let there be entropy.”
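The kind of entry-time cleansing those validation rules performed can be sketched in miniature. The record fields and the rules themselves are invented here purely for illustration:

```python
import re
from datetime import datetime

def validate_customer_record(record):
    """Return a list of validation errors; an empty list means the record is clean.

    These rules are hypothetical examples of the checks data entry
    systems applied before a record was allowed into the organization.
    """
    errors = []
    if not record.get("customer_id", "").isdigit():
        errors.append("customer_id must be numeric")
    if not re.fullmatch(r"[A-Za-z][A-Za-z '\-]*", record.get("name", "")):
        errors.append("name contains invalid characters")
    try:
        datetime.strptime(record.get("order_date", ""), "%Y-%m-%d")
    except ValueError:
        errors.append("order_date must be YYYY-MM-DD")
    return errors

clean = {"customer_id": "1042", "name": "Ada Lovelace", "order_date": "2024-05-01"}
dirty = {"customer_id": "10X2", "name": "???", "order_date": "1st May"}
```

A record that fails any rule would be rejected at the point of entry, which is why data captured this way arrived fairly clean.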
That cosy, cloistered world has long since vanished, and while there may still be applications where such excellent care is taken of data, they are not the norm; they are in fact rare. We have entered the era of Big Data, and thanks to the processing power now available at relatively low cost, an organization can ingest data from almost any source. That includes internal log files that faithfully record the activities of network devices, operating systems, databases, websites and many other things. It also includes external log files over which the organization has very little control, which may emanate from mobile devices or embedded processors or RFID tags or, to be honest, any processor anywhere doing something “interesting” to the organization.
The Digestion of Unstructured Data
We also have the phenomenon of so-called “unstructured data.” As the term “unstructured” has never been well-defined in IT dictionaries, we have to use the word with some care. We personally think of data as being unstructured if we (the organization) did not determine or fully record its structure. All digital data actually has some structure, otherwise it would never have been produced. But very little data declares its structure overtly – as it could by using, for example, XML tags or JSON definitions or by being stored formally in a database that has a catalog that defines every item of data coherently.
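To make the distinction concrete, the same fact can arrive with its structure declared overtly or not. The record below is invented for illustration:

```python
import json

# The same hypothetical fact, once with its structure declared
# overtly (JSON) and once as a raw string whose structure a
# program must guess at.
declared = '{"sensor": "hydrant-17", "pressure_psi": 58.2, "ts": "2024-05-01T10:00:00Z"}'
raw = "hydrant-17 58.2 2024-05-01T10:00:00Z"

record = json.loads(declared)   # structure travels with the data
fields = raw.split()            # structure must be inferred by the reader
```

With `declared`, any consumer knows that 58.2 is a pressure in psi; with `raw`, that knowledge lives only in someone's head or in undocumented code.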
So all the social media data (tweets, Facebook postings, blogs, comments in forums, etc.) is unstructured and needs to have structure added to it in order to be processed. Similarly, log files are unstructured. Data freely available or even bought from public sources is sometimes only semi-structured. And a good deal of internal data (emails, text messages, etc.) is poorly structured.
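Adding structure to such data typically means parsing it against an assumed pattern. Here is a minimal sketch for one invented web-server log line; the log format and field names are assumptions:

```python
import re

# A hypothetical web-server log line: unstructured until we
# impose a pattern on it.
line = '192.168.0.7 - - [01/May/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

# The named groups turn one raw string into a structured record.
structured = pattern.match(line).groupdict()
```

Only after this step can the data be queried, aggregated or joined like any other structured record.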
This is where Hadoop has found a life for itself. You simply cannot take all that external and even internal-but-never-formally-mined data and drop it into a data warehouse. So the concept of a data lake, or better, data reservoir, or even better, data refinery, has been invented. And the IT industry seems to believe that the perfect vehicle for this is a scale-out file system called Hadoop. In our opinion, Hadoop really works in this role, mainly because it is a scale-out key-value store that can hold as much data as you please, and it does not demand that you hold the metadata of any of the data you drop into it.
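That “hold the data, defer the metadata” idea is often called schema-on-read, and it can be sketched in miniature. A plain dictionary stands in for the scale-out store here; this is an illustration of the principle, not of Hadoop's actual API:

```python
# Miniature sketch of schema-on-read: the store keeps raw bytes
# under keys and knows nothing about their structure; a schema is
# imposed only at read time, by whichever process needs it.
lake = {}

def ingest(key, raw_bytes):
    lake[key] = raw_bytes        # no metadata demanded at write time

def read_with_schema(key, parser):
    return parser(lake[key])     # structure applied when it is needed

ingest("logs/2024-05-01", b"hydrant-17,58.2\nhydrant-18,61.0")

rows = read_with_schema(
    "logs/2024-05-01",
    lambda b: [ln.split(",") for ln in b.decode().splitlines()],
)
```

The write path accepts anything; the cost of discovering structure is paid later, by the reader, which is exactly what makes such a store a good landing zone for data of unknown shape.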
So let’s get back to our analogy and declare Hadoop to be the stomach and small intestine. It is awash with enzymes (processes) that clean the data, secure it, determine its metadata and generally ensure that foreign data – food for the organization – is properly digested before it enters the bloodstream. And what is not required can be thrown away.
The point is this: if data constitutes the blood of an organization, the new data needs to be properly “digested” before we let it into the bloodstream.
About Robin Bloor
Robin is co-founder and Chief Analyst of The Bloor Group. He has more than 30 years of experience in the world of data and information management. He is the creator of the Information-Oriented Architecture, which is to data what the SOA is to services. He is the author of several books, including The Electronic B@zaar: From the Silk Road to the eRoad, a book on e-commerce, and three IT books in the Dummies series on SOA, Service Management and The Cloud. He is an international speaker on information management topics. As an analyst for Bloor Research and The Bloor Group, Robin has written scores of white papers, research reports and columns on a wide range of topics, from database evaluation to networking options and comparisons to the enterprise in transition.