Tactics must evolve but the mission hasn’t changed
As ‘Big Data’ continues to dominate discussions in the analytics space, along comes the notion of ‘Big Data Analytics’ to add confusion in the marketplace. If big data analytics warrants its own discipline, then its methodologies and approaches should be significantly different from what has been used in traditional analytics. On closer examination, I would argue that the analytics at the heart of big data analytics remain fundamentally the same, but require even greater focus on the core business problem to be solved.
Big data has always been with us; it just wasn’t discussed as widely as it is today. The traditional users of big data were direct marketing firms and credit card companies, but the growth of digital technology and new devices has altered the paradigm so that many organizations now have easy access to large volumes of information. Technologies like Hadoop have facilitated the processing and consumption of ever-increasing volumes of data.
Indeed, data scientists—or ‘data miners’ in the last-century vernacular—have always contended with volume. And they have always earned their salaries through their ability to transform raw source data into meaningful insights. In any exercise, creating a meaningful analytical file is still the most important first step but now data scientists must also be able to both identify the business problem and create a data environment that provides the information foundation to develop a business solution.
The new reality
Before the digital explosion of the Internet and social media, a typical project would involve the data miner asking for as much data as possible. The rationale was to let the data miner filter out all the noise in what was then structured data. But in our big data world, massive volumes of semi-structured and unstructured data no longer lend themselves to this approach. The initial ‘ask’ of the data needs to be filtered.
Historically, raw data consisted of transaction records, customer files, campaign data and, perhaps, geodemographic data. All this data was structured, but the information was meaningless in its raw state. Data miners had to ‘work the data,’ applying extensive variable derivation to turn it into meaningful variables or fields. It wasn’t unusual for this type of data transformation process to generate several hundred variables.
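To make the idea of variable derivation concrete, here is a minimal sketch in Python. The transaction records, field layout and the four derived variables are illustrative assumptions, not taken from any particular project; a real exercise would derive many more.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transaction records: (customer_id, purchase_date, amount).
transactions = [
    ("C001", date(2013, 1, 5), 40.0),
    ("C001", date(2013, 3, 20), 60.0),
    ("C002", date(2013, 2, 11), 25.0),
]


def derive_variables(records, as_of):
    """Roll raw transactions up into one row of derived variables per customer."""
    grouped = defaultdict(list)
    for customer_id, purchase_date, amount in records:
        grouped[customer_id].append((purchase_date, amount))

    derived = {}
    for customer_id, rows in grouped.items():
        amounts = [amount for _, amount in rows]
        last_purchase = max(purchase_date for purchase_date, _ in rows)
        derived[customer_id] = {
            "purchase_count": len(rows),
            "total_spend": sum(amounts),
            "average_spend": sum(amounts) / len(amounts),
            "days_since_last_purchase": (as_of - last_purchase).days,
        }
    return derived


# The as_of date is an assumed reference point for the recency calculation.
customer_variables = derive_variables(transactions, as_of=date(2013, 4, 1))
```

Each raw record says little on its own; the derived row per customer (frequency, spend, recency) is what an analytical file is built from.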
By contrast, in much of today’s exploding digital environment, the data arrive in either semi-structured or unstructured format. The newer challenge for data scientists is to first convert this raw data into meaningful variables. Extraction tools now allow the data scientist to identify key fields and information without knowing the data structure or location of the information. The use of NoSQL databases and programming languages such as Python, R and Java provides one approach to transforming semi-structured and unstructured data into some meaningful format.
But this extraction is meaningless unless a further transformation occurs. Data scientists need to remember the business problem they are trying to solve.
For example, if I am trying to understand how engagement with Coca-Cola in social media has changed both prior to and after a marketing promotion, I might do the following:
- Extract all tweets with keywords related to Coca-Cola that occurred two months prior to the promotion date and two months after the promotion date.
- Convert that data to JSON objects and extract the date field using a programming language such as Java or an API.
- Create an analytical file of a structured table with only one date field.
- Create a graphical trend report, using a tool such as Tableau, that depicts tweet counts before and after the promotion.
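The first three steps above might be sketched as follows in Python. The sample tweets, the `created_at` and `text` field names, the keyword list, the promotion date and the 60-day window are all assumptions made for illustration; a real feed or API will have its own structure. The resulting rows are the kind of single-date-field table a charting tool could then consume.

```python
import json
from collections import Counter
from datetime import date, datetime, timedelta

# Hypothetical raw tweets as JSON strings; field names are assumed.
raw_tweets = [
    '{"created_at": "2013-05-20T14:03:00", "text": "Love my Coca-Cola"}',
    '{"created_at": "2013-06-02T09:15:00", "text": "coca-cola promo is on!"}',
    '{"created_at": "2013-06-02T18:40:00", "text": "Grabbing a Coke"}',
    '{"created_at": "2013-06-03T11:00:00", "text": "Nice weather today"}',
]
KEYWORDS = ("coca-cola", "coke")   # assumed keyword list
PROMOTION_DATE = date(2013, 6, 1)  # assumed promotion date
WINDOW = timedelta(days=60)        # roughly two months on either side


def build_analytical_file(tweets_json, keywords, promotion_date, window):
    """Return sorted (date, period, tweet_count) rows for matching tweets."""
    counts = Counter()
    for line in tweets_json:
        tweet = json.loads(line)
        text = tweet["text"].lower()
        # Keep only tweets that mention one of the keywords.
        if not any(keyword in text for keyword in keywords):
            continue
        # Extract the single date field the analysis needs.
        tweet_date = datetime.fromisoformat(tweet["created_at"]).date()
        # Drop tweets outside the window around the promotion.
        if abs(tweet_date - promotion_date) > window:
            continue
        counts[tweet_date] += 1
    return [
        (day, "before" if day < promotion_date else "after", n)
        for day, n in sorted(counts.items())
    ]


analytical_file = build_analytical_file(
    raw_tweets, KEYWORDS, PROMOTION_DATE, WINDOW
)
```

Note that the filtering happens before anything is stored: only keyword matches inside the window ever reach the analytical file.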
Further, if I want to learn whether a tweet refers to Coca-Cola in a positive or negative manner, I could turn to sentiment analysis tools and create a graphical trend report—again using a tool like Tableau—to graph the different sentiments over time.
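As a rough illustration of what a sentiment trend table could look like, the sketch below uses a toy word-list scorer in Python. It is not how commercial sentiment tools work; the word lists, sample tweets and labels are assumptions chosen only to show the shape of the output.

```python
import re
from collections import Counter

# Tiny, hypothetical sentiment lexicons; real tools use far richer models.
POSITIVE = {"love", "great", "refreshing"}
NEGATIVE = {"hate", "flat", "awful"}


def classify(text):
    """Label a tweet by counting positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def sentiment_by_day(tweets):
    """Count tweets per (day, sentiment) pair, ready for a trend chart."""
    return dict(Counter((day, classify(text)) for day, text in tweets))


# Hypothetical (date, text) pairs already extracted from the raw feed.
dated_tweets = [
    ("2013-05-20", "Love an ice-cold Coca-Cola"),
    ("2013-05-20", "My Coke was flat and awful"),
    ("2013-06-02", "Great promo, refreshing as ever"),
    ("2013-06-02", "Coca-Cola truck just drove past"),
]

trend = sentiment_by_day(dated_tweets)
```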
But this general reporting of tweet behavior over a period of time is insufficient to effectively determine how a promotion has altered social media engagement. The extraction process needs to be much more focused to address the specific business question. Especially when it comes to social media, the old “give me everything” approach simply consumes too many resources in the attempt to make sense of the data. Identifying and understanding a business problem has traditionally been one of the four key steps in the data mining process, but it is even more critical when dealing with social media data today.
Content adds context
Besides identifying simple engagement and sentiment, the analysis should also probe more deeply into the content. Are certain themes or topics emerging in the social media conversations? The use of text mining and text analytics tools allows this type of more exhaustive probing. But again, what is the business problem we are trying to solve? If the challenge is creating more customer engagement, text mining may reveal that certain themes or topics are more relevant in driving this engagement to higher levels as a result of the marketing campaign.
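A very simple stand-in for this kind of probing is to compare term counts before and after the campaign. The Python sketch below does only that; the stopword list and sample tweets are assumptions, the counts are raw rather than normalized for volume, and dedicated text mining tools go well beyond it.

```python
import re
from collections import Counter

# Small, assumed stopword list; real text mining tools ship fuller ones.
STOPWORDS = {"the", "a", "is", "my", "and", "to", "of", "for", "this", "with"}


def term_counts(texts):
    """Count non-stopword terms of three or more letters across texts."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS and len(word) > 2:
                counts[word] += 1
    return counts


def emerging_terms(before_texts, after_texts, n=3):
    """Terms mentioned more often after the campaign than before it."""
    # Counter subtraction keeps only terms with a positive difference.
    gained = term_counts(after_texts) - term_counts(before_texts)
    return gained.most_common(n)


# Hypothetical tweets collected before and after the campaign.
before = ["Coke with lunch", "Coke is my favourite"]
after = [
    "Share a Coke contest entry",
    "Entered the contest for a Coke",
    "Contest prizes look great",
]

top_theme = emerging_terms(before, after, n=1)
```

In this toy data the brand name drops out because it is mentioned equally often in both periods, leaving the campaign-related term as the one that gained the most.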
Clearly, the business problem must dictate how data scientists use social media. Suppose we want to build a customer retention model that uses social media such as tweets to determine customer satisfaction. The first issue concerns the ability to match customer records from the company’s database against the individuals who are engaging in social media. The second issue is one of reliability: some current research questions whether the comments of people on social media truly represent the opinions of the “silent majority.” Furthermore, there may be privacy issues raised in using this type of information. If our intention is to build better retention models, we might seriously question the usefulness of appending social media to customer records given these issues.
Big data, and especially social media data, will continue to grow. As analytics practitioners, we can no longer respond by ‘extracting everything.’ Today more than ever, we truly need to understand the business problem so that we can effectively extract the right information when building the solution. While we continue to see great developments in software and technology, the real challenge for analytics and data science is human-related: having the right analysts who are trained and educated in the principles of data mining as well as business analysis. This ability to understand the domain knowledge of a given business, grasp its major issues and dissect its challenges will become even more important in any data scientist’s skill set. In that sense, big data analytics may differ from traditional analytics, but I regard it more as an enlargement of the discipline that makes data scientists even more valuable to the 21st century organization.