Fortune, which has a great deal of power in other matters but especially in war, can bring about great changes in a situation through very slight forces. – Julius Caesar
Whatever conventional wisdom says, history does repeat itself. A study of early American explorers has lessons to offer that are relevant to high-technology companies hoping to conquer the bold, new territory of big data.
On April 30, 1803, American diplomats in Paris signed a treaty to purchase 828,000 square miles of French colonial territory in the center of the North American continent for $15M (roughly $250M in current terms), doubling the land area of the United States with a pen stroke. This huge swath of territory, today known as the Louisiana Purchase, encompassed nearly all of the Great Plains, a vast collection of grasslands and forest beginning on the banks of the Mississippi River that gradually faded into steppe country moving west to the Rocky Mountains.
President Thomas Jefferson commissioned an expedition to explore the newly acquired domain. Led by Army officers Meriwether Lewis and William Clark, the survey party departed St. Louis, Missouri in May 1804, following the Missouri River by boat to its Montana headwaters and then, leaving the boundaries of the Louisiana Purchase, crossing the Rockies overland and descending the Columbia River to the Pacific coast of Oregon, which they reached in late November 1805.
The primary objectives of the mission were to establish an American presence across the western reaches of the continent to the Pacific Ocean and to find a viable water route for transcontinental commerce. Besides doing their best to establish trading relations with the native tribes they encountered, the expedition reported on the geography, botany, and wildlife of the journey, including the vast numbers of buffalo, elk, deer, and other animals they saw. Though no navigable waterway across the continent was found, in every other respect the exploration of that vast western wilderness was a tremendous success, yielding many new discoveries.
The Lewis and Clark reconnaissance triggered other voyages into the West that grew into an avalanche of adventurers, pathfinders and pioneers. Yet its beginnings were modest – a small party of men with wilderness skills, armed with fragmentary information gathered from trappers and with basic tools – compasses, muskets, boats, knives, and such – which they wielded with seasoned expertise.
The explorers of the past were great men and we should honor them. But let us not forget that their spirit lives on. It is still not hard to find a man who will adventure for the sake of a dream or one who will search, for the pleasure of searching, not for what he may find. – Sir Edmund Hillary
Thomas Moran, “Grand Canyon of the Yellowstone”, 1872 (source: commons.wikimedia.org)
At the beginning of the new millennium, we stand at a strikingly similar juncture in the field of high technology. The near-worldwide ubiquity of wired and wireless communications has led to the growth of staggering amounts of data, fueled by computing power centralized in a plethora of server farms and datacenters, each with accompanying storage servers holding massive amounts of information. The demands on these platforms for gathering, processing and holding information are destined to grow geometrically as the rise of the Internet of Things (IoT) and the deployment of increasingly advanced robotics spread electronic sensors, controllers and transmitters through every facet of modern life. Stated differently – we stand on the borders of a vast, nearly unexplored frontier of big data, with this data wilderness poised to expand by orders of magnitude.
Just as the American West drew in hunters, trappers and prospectors seeking buffalo herds, beaver ponds and gold veins, the big data frontier is attracting new kinds of explorers – data scientists and digital marketers mining data for hidden patterns of meaning in seemingly boundless expanses of information. There can be no question about it: the potential of big data to provide insights into consumer sentiment, social body language, aggregated financial transactions, traffic patterns, industrial process variations, product usage and lifetimes, medical issues and myriad other fields of inquiry will have long-term transformational effects on what we buy, where we live and work, and even the political, social and economic fabric of society. This is why the monumental breadth of the big data wilderness is attracting so much interest: the data itself is a tremendous store of untapped value awaiting discovery.
Charles Marion Russell, “Whose Meat”, 1914 (source: wikiart)
Exploring this growing wilderness of information, however, will not be without its trials and hazards.
Repeating the experiences of those trailblazers who found no two mountain ranges, valleys, forests, meadows, grasslands, deserts or canyons that were alike, digital pioneers are encountering data repositories and processing centers that all differ from each other in size, location, configuration, content, trustworthiness (in terms of security as well as data integrity) and means of access.
The very nature of storage systems, and how they affect data, shapes our ability to use and interact with it. Some data suffers losses when digitized. The formats in which data are stored are varied and often in conflict, as data can reside on CDs, DVDs, magnetic tape, hard disk drives, floppy disks, punch cards, and so forth. The very structure of the data can vary as well, depending on factors such as the original source and the storage mechanism.
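To make the format problem concrete, here is a minimal sketch of how a data explorer might cope with heterogeneous sources by dispatching to a format-specific reader. The file layout and loader choices are illustrative assumptions, not a prescribed design:

```python
# A minimal sketch of handling heterogeneous storage formats by
# dispatching on file extension. The formats covered here are
# illustrative; real repositories hold many more.
import csv
import json
from pathlib import Path

def load_records(path):
    """Read a file into a list/structure of records, by format."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        with open(path, newline="") as f:
            return list(csv.DictReader(f))
    if suffix == ".json":
        with open(path) as f:
            return json.load(f)
    raise ValueError(f"unsupported format: {suffix}")
```

A production system would add many more readers (and handle structural variation within each format), but the principle is the same: normalize everything into one in-memory representation before analysis begins.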
Data retrieval is the flip side of data storage. It is already common for data explorers to have difficulty determining what they have and how best to use it. This can lead to inefficiencies and extra costs, as data may be stored redundantly, with copies isolated in silos and never fully reconciled with each other as the mass of data grows and changes. These problems generally result from a lack of data awareness.
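One small, concrete piece of data awareness is knowing which holdings are redundant copies of one another. The sketch below, under the simplifying assumption that silos are directories of files, groups files by a content hash to surface duplicates:

```python
# A hedged sketch of one "data awareness" task: finding redundant
# copies of the same content scattered across silos, modeled here
# as files under a directory tree.
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group files under `root` by SHA-256 of their contents;
    return only the groups with more than one copy."""
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(str(path))
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Real silos are databases and object stores rather than flat files, but the same hashing idea underlies most deduplication and catalog-reconciliation tooling.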
Quite a few organizations believe they can protect themselves from these data awareness problems by engaging in uber data collection, following the flawed principle of “more is better.” Yet this can easily exacerbate the difficulty of extracting value from data. Unguided data acquisition and storage and unbridled data capture cannot lead to meaningful usage, intelligence, or analytics.
However much information a given enterprise has accumulated, the utility of that information depends on where the organization is on the data science learning curve. A multitude of clients with tactical, operational and strategic requests confronts the typical IT department daily, even hourly, all in the face of changing technology and business exigencies. Both data and analysis results need to be distributed to different clients and across different hardware platforms. The variety of uses, purposes and channels for the same data presents such a load of requirements that supporting it becomes something of a Gordian knot.
The parallels between the big data frontier and the exploration of the American West run deep. Certain basic principles that were instrumental to success in opening the frontier west of the Mississippi will be equally vital to the exploration and utilization of big data. These maxims are:
- Successful explorers don't simply march into a wilderness in a random direction and wander capriciously. They have objectives that define the scope of the exploration and help determine a course for navigation. To put it more plainly, the journey must be initiated within a specific context.
- The exploration must be methodical and scientific, with good planning and deliberate actions. Such a systematic approach greatly assists in discovering 'nuggets' of knowledge which were not originally sought or even anticipated.
- Imagining the largest possible frontier, and deciding how to draw a boundary around it, both matter less than what that wilderness actually contains and what you plan to do with it.
The key to embracing the exploration of big data successfully is being able to capture context – in terms of the available data, the needs of clients and the existing hardware and software IT environment. Context can be elusive – it changes frequently and depends on the data available. Nevertheless, no data analysis will have real meaning or value without it.
Context provides a needed extra dimension to a territorial representation. In the same way that a United States Geological Survey (USGS) map includes the vertical dimension of terrain with a far greater amount of detail than a simple 2D highway map, context in big data provides deep insight into the sort of data required and the layers of meaning in that dataset.
There is a methodology that top tier professionals use to establish proper context for big data issues. As a discipline, data science is methodical and, of course, scientific. First and foremost, experienced data scientists determine whether the data at hand is trustworthy. Without this, all subsequent work with such a dataset is essentially devoid of substance.
The dataset must also suit the meaning and purpose of the client. In other words, having big data is less important than having the right data. Having the ability to ascertain this accurately depends heavily on the experience of the data scientist.
Operational issues also need resolution for establishing a proper context. If the data is available “as is”, without it being necessary to solve format transformation riddles, the dataset becomes much more useful. Finally, the data must be available “on demand”, which depends primarily on the hardware and software architecture supporting the client.
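The checks described above – trustworthiness, fit to the client's purpose, and usability “as is” – can be sketched as a simple pre-analysis gate. The field names and the specific tests are illustrative assumptions, not a standard:

```python
# A hedged sketch of pre-analysis context checks: is the dataset
# trustworthy (non-empty), does it fit the client's purpose
# (required fields present), and is it usable "as is" (no blank
# values forcing cleanup first)? Field names are hypothetical.

REQUIRED_FIELDS = {"customer_id", "timestamp", "amount"}

def assess_context(records):
    """Run simple go/no-go checks on a list of dict records."""
    report = {}
    # Trustworthiness proxy: the dataset actually contains records.
    report["non_empty"] = len(records) > 0
    # Fit for purpose: every record carries the fields the client needs.
    report["has_required_fields"] = all(
        REQUIRED_FIELDS <= record.keys() for record in records
    )
    # Usable "as is": no null/blank values in the required fields.
    report["no_missing_values"] = all(
        record.get(field) not in (None, "")
        for record in records
        for field in REQUIRED_FIELDS
    )
    report["ok"] = all(report.values())
    return report
```

In practice, trustworthiness also covers provenance, security and data integrity, which no automated check fully captures – this is exactly where the experienced data scientist earns their keep.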
The best-architected systems contain data that can serve the needs of multiple contexts. Such data should not depend on a single storage source but should draw on a deliberately orchestrated, coordinated and maintained redundancy of storage sources, systems and databases.
“In omnibus autem negotiis priusquam adgrediare, adhibenda est praeparatio diligens.”
In all matters, before beginning, a diligent preparation should be made. – Cicero
Before setting off to explore big data's new frontier, explorers must be as carefully prepared as Lewis and Clark, Kit Carson, Davy Crockett, Daniel Boone, Buffalo Bill Cody, and all the other famous frontiersmen were. By scraping together every tidbit of information they could from trappers and earlier explorers who had made more limited excursions into the West, Lewis and Clark did exactly what seasoned data scientists do today: find whatever data is already at hand, no matter its form or condition, and match it as best as possible to the needs and interests of those asking for it. It takes professionals with skills honed by long experience to place all of this in its proper context. In the end, successfully establishing context for every user, system and dataset depends on having subject matter experts in data science – a commodity in particularly short supply nowadays.
The tools and technology for this work – whether the compass, maps, flints, musket and other gear of Jim Bridger or the Apache Hadoop distribution and associated utilities on a data scientist's workstation – all have their strengths and limitations. Knowing how best to use every item in the kit, avoiding the pitfalls of improper use and compensating effectively when some desirable implements are unavailable is another trait that separates the neophyte from the professional and spells the difference between success and failure.
This editorial series will explore all of the above factors in depth, filtering out hyperbole, misinformation and hearsay along the way. We will pick up this discussion again in the next installment in our exploration of big data.