An oft-repeated anecdote is that which relates of an army officer, somewhat of a hero-worshiper, who, upon meeting Carson, exclaimed, effusively: “So this is the great Kit Carson, who has made so many Indians run!” “Yes,” drawled Carson, “sometimes I run after them but most times they war runnin' after me.” – as quoted by Edwin Legrand Sabin
The philosopher and poet George Santayana is credited with saying “Those who cannot remember the past are condemned to repeat it.” As we explore the newest wilderness of big data, there are lessons to be learned from the pioneers of America's early frontier history.
Let's look at one famous explorer. John Colter was a frontiersman from Kentucky and a member of the Lewis and Clark expedition of 1804–1806. A superlative hunter and scout, Colter found the mountain passes that allowed the team to cross the Rockies and established peaceful relations with Indian tribes encountered along the way to the expedition's final destination of the Columbia River estuary on the Pacific.
M.C. Poulsen, “John Colter Meeting Shoshones at Castle Rock” (Source: theautry.org)
Deeply entranced with the wilderness he had just reconnoitered, Colter left the expedition on its return leg and spent several years guiding other groups of trappers and adventurers. He also became the first person of European descent known to have explored the Yellowstone country of northwest Wyoming.
The most famous incident in John Colter's career happened in southwest Montana in 1809. Colter and John Potts, also an ex-member of the Lewis & Clark party, had the singular misfortune of encountering a very large group of Blackfoot Indians. The three tribes of the Blackfoot nation dominated the harsh steppe country of central and eastern Montana and were at war with all the tribes near them – the Cree, Sioux, Shoshone, Arapaho, Crow, almost a dozen tribes in all. Unfortunately for Colter and Potts, the innate xenophobia of the Blackfoot extended to European explorers.
Potts was killed and Colter taken captive. The Blackfoot tribesmen stripped Colter naked and decided to introduce him to a little tradition of theirs called 'running the gauntlet.' This event involved having the 'guest of honor' either walk or run through a corridor of Blackfoot braves, each of whom was armed with a stick or club with which they would strike the captive. Depending on the mood of the braves, surviving the gauntlet was a dicey proposition.
The Blackfoot outnumbered Colter by several hundred to one and quite naturally felt supremely confident about their total power over his fate. The warriors explained their desires to him and then eagerly lined up along the borders of the gauntlet, ready to have some fun at John Colter's expense.
Colter stood roughly 10 yards before the start of the corridor wearing only his birthday suit and considered the parameters of the situation from his own perspective. In an epiphany, Colter came up with a plan. He leaned forward as if preparing to traverse the gauntlet, then suddenly turned around and took off like a jackrabbit in the opposite direction.
The tribesmen stood stunned for a moment, astonished by this turn of events. A group then broke from the line to pursue Colter. What the Blackfoot braves didn't know, however, was that John Colter was renowned in Kentucky for his fleetness of foot. He managed to outrun and elude his pursuers, then made his way across country to a trader's outpost in a week and a half, hungry but with skin and scalp intact.
The incident provides a vital lesson for those of us who are journeying into the immense wilderness of the big data frontier. John Colter survived this event because he viewed the circumstances of the situation more completely than the Blackfoot tribesmen. Stated differently: Colter understood his problem in the proper context.
The machine does not isolate man from the great problems of nature but plunges him more deeply into them. – Saint-Exupéry
Gustave Courbet, “The Forest in Autumn”, 1842 (Source: wikiart.org)
Let's examine the issue of context in the Blackfeet-John Colter case in greater detail. The issue at hand was whether Colter would be alive or dead in the next several minutes. Let's consider the context as perceived by each of the players in this story:
| The Blackfoot | John Colter |
| --- | --- |
| “This is fun!” | “Not my best day…” |
| “We've done this before and know how to handle it.” | “They've done this before and know how to handle it.” |
| “The trespasser must die!” | “I really, really wanna get outta here!” |
| “The victim has no choice in the matter.” | “There are all these people lined up in front of me! Wait… There's nobody behind me…” |
| “There is no escape.” | “Nobody has ever escaped before… but I'm a very fast runner.” |
Each 'client' in this event had a different perspective and set of goals. The Blackfeet failed to achieve their desired outcome because they did not fully appreciate the context of the event: Colter was not completely in their thrall but had a realistic chance of escaping. To put it in data science terms, each party had access to the same data set, but they did not assess equally the relevance of each piece of data in the context of the problem at hand. This difference in context directly led to the failure of the Blackfeet and the success of Colter in realizing their individual objectives.
The problem of lost context is being repeated many times over in the big data era. The typical business response to big data has been to accumulate great volumes of it. To that end, firms have created large repositories and purchased associated tools to facilitate user analysis. Yet the biggest complaint to date in executive boardrooms is a classic “can't see the forest for the trees” quandary – that the considerable IT investments in setting up big data infrastructure and making it available for further study have not produced commensurate results in terms of market insights, competitive gains, or revenue growth. To put it more plainly, most companies have yet to discover a compelling return on investment (ROI) for their big data expenditures.
What many companies are beginning to realize is that technology alone cannot lead to data science-driven insights. In terms of the perspective of IT infrastructure, big data appears to be a problem that can be reduced to four variables that are often referred to as the “4 V's”:
- Volume – the gross quantity of data being managed
- Velocity – the speed and efficiency with which data can be collected
- Variety – the many different forms & formats of data
- Veracity – the fundamental integrity of the data
Yet this definition is actually quite shallow. In order to discover true value from Big Data using data science best practices, a more complete and accurate definition of the 4 V's is needed. While these attributes represent the key characteristics of 'big' data, each poses a problem as one attempts to overcome their magnitude. Many businesses are, in fact, mired in the vastness of the problems presented by these 4 V's. A superior elaboration of each of the V's would be as follows:
- Volume – the gross quantity of data being managed, as well as identifying the “particular” or “right” data which is relevant to a given task. An abundance of data does not lead to answers unless there is a “right” focus. Knowing what data to use is critical.
- Velocity – the speed and efficiency with which “right” data can both be collected and made available for consumption to a client. Uninhibited and continuous capture of untethered data cannot exhibit meaningful correlation unless the 'right' relationships are established.
- Variety – the many different forms & formats of data as compared to the consumption needs of the client. Variability of format, structure and media may convey dissimilarities resulting in misinterpretation. Knowing the 'right' congruent meaning of dissimilar data forms is important.
- Veracity – the fundamental integrity of the data and its relevance to the problem at hand.
Putting it another way: in order to explore big data effectively, collecting a mountain of data doesn't solve your problems per se. It is essential to understand the background and meaning of the data available in order to put it to the correct purpose. In other words, effective data science requires that the data have context.
Let's use an illustrative example. In southeastern Utah is a range of peaks called the Henry Mountains. This spectacular range, sprouting from a desert that makes Mars look like Florida and climbing to over 11,500 feet, is the most isolated mountain range in the lower 48. Below are two maps of the region.
Which of these maps contains more useful data? Obviously, it depends on the needs of the client – a tourist will have different objectives and information needs than, say, a geologist or prospector. In other words, the deciding criterion is the context of use.
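The same point can be sketched in a few lines of code. What follows is only an illustration under stated assumptions, not a real mapping or analytics API: the `Record` and `Context` types, the topic labels, and the sample records are all hypothetical, invented here to show how a single data set yields different “right” data depending on who is asking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    topic: str       # what the record is about (hypothetical label)
    verified: bool   # whether its integrity has been checked (veracity)
    payload: str     # the data itself


@dataclass(frozen=True)
class Context:
    relevant_topics: frozenset     # topics that matter for the task at hand
    require_verified: bool = True  # whether unverified data is acceptable


def select_right_data(records, context):
    """Keep only records that are relevant to the context and trusted enough for it."""
    return [
        r for r in records
        if r.topic in context.relevant_topics
        and (r.verified or not context.require_verified)
    ]


# One shared data set (invented sample values).
records = [
    Record("trail_conditions", True, "Pass snowed in until June"),
    Record("mineral_survey", True, "Copper traces on the east slope"),
    Record("trail_conditions", False, "Rumor: bridge washed out"),
]

# Two clients with different contexts of use.
tourist = Context(frozenset({"trail_conditions"}))
prospector = Context(frozenset({"mineral_survey"}))

tourist_view = select_right_data(records, tourist)
prospector_view = select_right_data(records, prospector)

for label, view in (("tourist", tourist_view), ("prospector", prospector_view)):
    print(label, [r.payload for r in view])
```

Neither view is “more correct” than the other. The filter simply makes the context explicit, so that relevance and veracity are judged against the client's needs rather than against the sheer volume of data on hand.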
There will come a time when you believe everything is finished; that will be the beginning. – Louis L'Amour, “Lonely on the Mountain”
Experienced woodsmen and pathfinders made detailed preparations before an expedition. Motives for the journey determined major and secondary goals, defined the composition of the team and its equipment, helped in planning for contingencies, focused the gathering of any available scraps of information on their relevance to the specific region to be explored and assisted in formulating a projected route for the party, including the point and moment of departure, a schedule of milestones, and the expected return date.
The planning for big data analysis by veteran data scientists is virtually identical. At every stage, it is context-driven and is viewed holistically from beginning to end. An adroit approach can be broken down into the following process:
- Start with context. Describe the problem in finest detail, capture historical events, activities of interest, sought after outcomes and the roles of stakeholders. This part of the process tends to be collaborative and interactive, like all planning normally is.
- Determine what data is needed (type, nature, quantity, and quality) and prepare it if needed.
- Create an analysis algorithm true to the nature of the inquiry, its objectives and the available data. This is an iterative task, as this is a journey of discovery.
- Compile the results and present them in terms that the target audience can understand.
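The four steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions rather than a prescribed method: the function names, the sales question, the field names, and the sample rows are all hypothetical, and the “analysis algorithm” is just a grouped average standing in for whatever technique the inquiry actually calls for. In practice the third step would be revisited several times as the results raise new questions.

```python
from statistics import mean


def frame_context():
    # Step 1: state the problem, the data it requires, and the audience up front.
    return {
        "question": "Which region has the highest average order value?",
        "fields_needed": {"region", "order_value"},
        "audience": "sales managers",
    }


def prepare_data(raw_rows, context):
    # Step 2: keep only rows that carry every field the context requires.
    needed = context["fields_needed"]
    return [
        row for row in raw_rows
        if needed.issubset(row) and row["order_value"] is not None
    ]


def analyze(rows):
    # Step 3: a stand-in analysis, the average order value per region.
    by_region = {}
    for row in rows:
        by_region.setdefault(row["region"], []).append(row["order_value"])
    return {region: mean(values) for region, values in by_region.items()}


def present(results, context):
    # Step 4: report the finding in terms the target audience can use.
    best = max(results, key=results.get)
    return (
        f"For {context['audience']}: {best} leads with an average "
        f"order value of {results[best]:.2f}."
    )


# Invented sample data, including rows that the context should exclude.
raw_rows = [
    {"region": "West", "order_value": 120.0},
    {"region": "West", "order_value": 80.0},
    {"region": "East", "order_value": 90.0},
    {"region": "East", "order_value": None},
    {"customer": "unknown"},
]

context = frame_context()
clean_rows = prepare_data(raw_rows, context)
results = analyze(clean_rows)
summary = present(results, context)
print(summary)
```

Note that the context object is created first and then passed to both the data preparation and the presentation steps. That ordering is the design choice the list describes: context drives what data is kept and how the result is reported.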
John Colter surveyed the Yellowstone and Grand Teton regions in the winter of 1807. Winter in Yellowstone is forbidding in the extreme. Yet Colter managed to do what even the local Indians thought was completely crazy to attempt. The context presented by the terrain, its wildlife and weather conditions molded the planning for Colter's objective to explore this wilderness.
When exploring a new territory, understanding context and planning with that context in mind spells the difference between success and failure. In the American West, this determined whether a frontiersman would return from an expedition with reports of stunning natural wonders or, years later, a wandering trapper, a party of Indians on the warpath or a hopeful prospector would come across bleached bones half covered in desert sand or emerging from a melting snow bank.
Possible outcomes for data scientists are usually not quite as stark, but the importance of context holds just as true. An invalid context can be terribly consequential, as a resultant prediction or insight could misguide you, resulting in a potentially disastrous decision. Success or failure for a data scientist depends on whether a facet of the context has been missed (through incorrect or incomplete context framing), how well the true nature of the problem under examination is understood, whether all the stakeholders have been identified and how well their needs and concerns have been addressed in the data science endeavor, and many other factors as well. The significance of context is so central to effective data science that we will return to examining it time and again in this editorial series on big data.