By Richard Boire
In my last article, I discussed the evolution of advanced analytics and its use within an increasingly digital environment. One conclusion from that discussion was the need for a new or emerging role for the data scientist. With the digital revolution, data is at our fingertips. But data is like oil: it is useless unless it can be transformed into meaningful information.
Transformational advances in technology have provided better tools that enable data scientists to derive the meaningful information that can be applied in the next business initiative. This could be the development of a predictive model that uses deep learning technology (the real math behind artificial intelligence (AI)), or the simple development of an analytical file that allows a variety of business users to conduct exploratory data analysis.
In both cases, the choice between advanced analytics (the deep learning predictive model) and non-advanced analytics (exploratory data analysis) is determined by the business problem. Yet the development of these tools, alongside more powerful data processing tools, provides the capacity to solve more data-driven business problems. The bottleneck is no longer technology or data but the human being, the data scientist. What do I mean by this?
In the past, the tools used to create an analytical file and to apply advanced analytics techniques required a great deal of processing time to complete the task. This limitation restricted the use of the data that is now readily available from social media and smart devices such as phones, sensors, Fitbits, etc. Advancements such as in-memory processing, along with parallel data processing, the core of Hadoop (Big Data) technology, have essentially eliminated processing time as a barrier.
Advancements in software development have also enhanced the toolkit of the data scientist. Software providers have created tools that have now automated the development of model algorithms, in effect creating a “model factory” type environment.
These tools can create hundreds of models “on the fly” and present the best one according to the user’s evaluation criteria. Even once the final model from a given technique has been determined, ensemble modelling is another tool that combines different algorithms from different modelling techniques into one overall “best” model.
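The “model factory” idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic data, the three toy modelling techniques, the mean-absolute-error criterion and the simple averaging used for the ensemble. Commercial tools generate far more candidates and use more sophisticated ways of combining them.

```python
import random
from statistics import mean


def make_data(n, seed):
    """Synthetic data (hypothetical): y is roughly 2x + 1 plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0, 1.0)) for x in xs]


def mean_model(train):
    """Baseline technique: always predict the average outcome."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def linear_model(train):
    """Ordinary least squares with a single input."""
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def knn_model(train, k):
    """k-nearest-neighbour regression: average the k closest outcomes."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict


def mae(model, data):
    """Evaluation criterion chosen by the user: mean absolute error."""
    return mean(abs(model(x) - y) for x, y in data)


def model_factory(train, valid, criterion=mae):
    """Build every candidate, score each on held-out data, rank them."""
    candidates = {"mean": mean_model(train), "linear": linear_model(train)}
    for k in (1, 3, 5, 10):
        candidates[f"knn_k{k}"] = knn_model(train, k)
    scores = {name: criterion(m, valid) for name, m in candidates.items()}
    ranked = sorted(scores, key=scores.get)  # lowest error first
    return candidates, scores, ranked


def ensemble(models):
    """Simplest possible ensemble: average the members' predictions."""
    return lambda x: mean(m(x) for m in models)


train = make_data(200, seed=1)
valid = make_data(100, seed=2)

candidates, scores, ranked = model_factory(train, valid)
best_name = ranked[0]
blend = ensemble([candidates[name] for name in ranked[:3]])
blend_score = mae(blend, valid)

print("best single model:", best_name, round(scores[best_name], 3))
print("ensemble of top 3:", round(blend_score, 3))
```

The tool does the looping and ranking, but the choice of evaluation criterion, and the judgment on whether the winning model makes business sense, remain with the data scientist.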
There is no question that these developments in data processing, coupled with developments in facilitating the use of different modelling techniques, have greatly reduced the time to develop a data science solution.
The evolution of programming
But there is still a need for “coding” or programming: 85%-90% of a data scientist’s work is the creation of an analytical file, which continues to consume much of the data scientist’s time.
In articles and posts as well as in my book, Data Mining for Managers: How to Use Data (Big and Small) to Solve Business Problems, I discuss the arduous process of working with raw data and transforming it into meaningful analytical files or environments for a variety of analytics exercises.
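To make the notion of an analytical file concrete, here is a small sketch that rolls raw transaction records up into one row per customer. The field names, the sample records and the rule for handling a missing amount are all hypothetical; real projects involve many more sources, data quality checks and derived variables.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw data: one record per transaction.
raw_transactions = [
    {"customer_id": "C1", "date": date(2023, 1, 5), "amount": 40.0},
    {"customer_id": "C1", "date": date(2023, 3, 20), "amount": 60.0},
    {"customer_id": "C2", "date": date(2023, 2, 11), "amount": 15.0},
    {"customer_id": "C2", "date": date(2023, 2, 11), "amount": None},  # missing value
    {"customer_id": "C3", "date": date(2022, 12, 30), "amount": 200.0},
]


def build_analytical_file(transactions, as_of):
    """Summarize raw transactions into one row of features per customer."""
    grouped = defaultdict(list)
    for row in transactions:
        if row["amount"] is None:
            continue  # assumed cleaning rule: drop records with no amount
        grouped[row["customer_id"]].append(row)

    analytical_file = []
    for customer_id, rows in sorted(grouped.items()):
        amounts = [r["amount"] for r in rows]
        last_purchase = max(r["date"] for r in rows)
        analytical_file.append({
            "customer_id": customer_id,
            "n_transactions": len(rows),
            "total_spend": sum(amounts),
            "avg_spend": sum(amounts) / len(amounts),
            "days_since_last": (as_of - last_purchase).days,
        })
    return analytical_file


analytical_file = build_analytical_file(raw_transactions, as_of=date(2023, 4, 1))
for record in analytical_file:
    print(record)
```

Even in this toy case, most of the lines deal with cleaning and summarizing rather than modelling, which mirrors where the data scientist’s time goes.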
Historically, the early practitioners of data science were primarily SAS programmers, with SPSS programmers being a distant second. R programming, an evolution of the S+ software that had been available since the 1980s, gained prominence in the late 1990s. All of these packages were developed out of the need to do statistical analysis.
However, the advent of the Internet introduced languages that were more focussed on the discipline of computer science or computer engineering. Languages such as C++, Java and Python gained in popularity due to the more object-oriented nature of the data on the Internet. Another advantage of these programming languages is their open source nature, which allows much of this technical knowledge and learning to be shared more easily. Technical platforms such as GitHub are a testament to this.
The need for data science to make sense of all this data further accelerated the use of these languages, with Python emerging as the key programming tool for many data scientists today. Python’s usefulness in making sense of all this data is further amplified by the fact that it allows coders to develop deep learning (AI) models.
The syntax of the Python language gives the user tremendous flexibility to try many different options when developing these AI models. Working through these options is often referred to as hyperparameter tuning, which is the real underpinning of building optimal AI models or algorithms. Of course, this presumes that the 85% of the work that goes into creating the optimum analytical file has been done and that we are now just feeding this analytical file into the deep learning routine with all its various user options and parameters.
Python’s popularity has now extended to the more traditional structured areas, such as the relational database and data mart systems that house much of a company’s non-online data. The popularity of Python amongst data scientists has also led to integration products that are now features of most of the leading commercial analytics software providers. For example, SAS programmers can now integrate Python code directly into their SAS script.
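As a rough illustration of hyperparameter tuning, the sketch below runs a small grid search. It is a simplification under stated assumptions: a one-input linear model trained by gradient descent stands in for a deep learning network, the data is synthetic, and the grid of learning rates and epoch counts is arbitrary. The pattern of trying each combination and keeping the one that performs best on held-out data is the same at larger scale.

```python
import itertools
import random


def make_data(n, seed):
    """Synthetic data (hypothetical): y is roughly 3x - 0.5 plus noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 3.0 * x - 0.5 + rng.gauss(0, 0.1)))
    return data


def train(data, learning_rate, epochs):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(params, data):
    """Mean squared error of the fitted line on a data set."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# The hyperparameters: options set by the user, not learned from the data.
grid = {
    "learning_rate": [0.001, 0.01, 0.1, 0.5],
    "epochs": [10, 100, 500],
}

train_set = make_data(200, seed=0)
valid_set = make_data(100, seed=1)

results = []
for lr, ep in itertools.product(grid["learning_rate"], grid["epochs"]):
    params = train(train_set, lr, ep)
    results.append(
        {"learning_rate": lr, "epochs": ep, "valid_mse": mse(params, valid_set)}
    )

best = min(results, key=lambda r: r["valid_mse"])
print("best setting:", best)
```

With only two hyperparameters the grid has 12 combinations; a deep learning model has many more options, which is why the number of trials grows so quickly.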
Demand for less-technical tools
Yet all the languages that I have discussed so far require a level of technical expertise in programming, that is, in using the right logic and syntax to generate a given outcome.
But the need and demand for data scientists continues to grow. Accordingly, companies have emerged to try to fill this void by opening the discipline of data science to individuals who are less technical from a programming standpoint.
Instead of writing actual syntax or script, users now generate a flowchart of the sequential tasks that are necessary to build the solution. Rather than being coded or programmed by the data scientist, each task is represented as a module within the graphical interface, and the user simply drops this module onto the flowchart. Shown below is one schematic flowchart example from Alteryx, a leading provider of this type of software: (See chart)
There are other providers that offer this kind of graphical interface, with SAS Enterprise Miner representing another option. Vastly improved visualization tools, such as Tableau, can then take the output file from such a platform and allow the user to present the solution in a variety of ways. The key objective in the visual presentation of this solution rests on the narrative that represents the desired communication between the data scientist and the business stakeholder.
The data scientist of tomorrow
The emergence of these types of tools has opened the data science field to individuals with minimal programming experience. But the discipline of data science relies on a foundation of knowledge that is independent of programming and software.
For example, the rigour and discipline used in transforming raw data into meaningful data inputs or an analytical file represent a deep base of knowledge that comprises a significant component of any data science program. As mentioned earlier in the article, 85%-90% of data work is in this area according to leading practitioners. Another critical requirement is deep knowledge of statistics. Knowledge of calculus and linear algebra is also a growing requirement due to the increased practical use of AI and deep learning.
From a practical perspective, though, it is about the ability to understand the mathematical and/or statistical output and, more importantly, what it means to the business. This requires extensive knowledge of how to properly measure the performance of solutions and an understanding of such technical concepts as the overfitting of solutions.
The underlying theme of the preceding discussion is the need to gain data science knowledge that will not be replaced by existing or future software. This implies that training should be geared less toward the technical, such as programming, and more toward the knowledge that is required to be a successful practitioner. Tools and technology will continue to improve, thereby reducing demand for specialists and their technical prowess.
Instead, the need will be for the data science generalist who has the required depth of knowledge when it comes to data and mathematics/statistics but who also has the breadth of knowledge to apply it to an infinite array of business problems. The demand for these generalists will focus on their ability to think through a business problem. These generalists will be the “chess masters” as they align the right tools, the right data and the right mathematics in solving the right business problem.
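Overfitting can be detected by comparing performance on the data used to build a solution with performance on data held back from it. The sketch below assumes synthetic data and uses a simple nearest-neighbour model as a stand-in for any flexible technique. With k = 1 the model memorizes its training data, so its training error is essentially zero while its error on new data is much larger.

```python
import random
from statistics import mean


def make_data(n, seed):
    """Synthetic data (hypothetical): a linear signal with heavy noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0, 2.0)) for x in xs]


def knn_model(train, k):
    """k-nearest-neighbour regression: average the k closest outcomes."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict


def mae(model, data):
    """Mean absolute error of a model on a data set."""
    return mean(abs(model(x) - y) for x, y in data)


train_set = make_data(200, seed=10)
valid_set = make_data(100, seed=11)  # held back, never used for fitting

report = {}
for k in (1, 15):
    model = knn_model(train_set, k)
    train_err = mae(model, train_set)
    valid_err = mae(model, valid_set)
    report[k] = {
        "train": train_err,
        "valid": valid_err,
        "gap": valid_err - train_err,  # a large gap signals overfitting
    }

for k, row in report.items():
    print(f"k={k}: train={row['train']:.2f} "
          f"valid={row['valid']:.2f} gap={row['gap']:.2f}")
```

A solution judged only on its training data would make the k = 1 model look perfect, which is why performance should be measured on data the model has not seen.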
Richard Boire is president of Boire Analytics. He can be reached at firstname.lastname@example.org.