Introduction
Statistical analysis of data is used when seeking to understand or explain observed phenomena. Over time, tens of thousands of analytical metrics have been introduced that have specific uses in targeted domains, some broad, some narrow. One would be hard pressed to identify an analytical scenario across any domain of study and practice – the physical sciences, engineering, computer science, biology, medicine, business management, music, arts – where tools based on statistics and probability do not play an integral role.
Interestingly, the implements humans use to build and fix physical structures have evolved similarly. The sheer number of such tools has resulted in user-friendly curations, such as the auto mechanic's toolbox, the plumber's toolbox, the gardener's toolbox, etc. The same is true of analytical tools: there are so many of them, most targeted at narrow problem areas, that it is impossible for a "lay" analyst to use them without appropriate grouping and focus. This has led to the introduction of practitioner "toolboxes", curated collections of analytical tools for a problem domain: the biologist's analytical toolkit, the physicist's toolbox, etc.
Toolboxes for data engineers and data scientists
Those of us who work with data to impact the operations of business enterprises have also come to rely on statistical/probabilistic analysis to understand and explain various facets of the enormous amounts of data produced at modern companies. Data engineering, a subfield of software engineering, has evolved to create and maintain data-driven applications. Because effective data engineering requires a thorough understanding of the data that anchor these applications, many analytical toolkits have evolved for this purpose: pandas, Apache Spark/PySpark, scikit-learn, and RapidMiner, to name a few.
While data engineers need to understand data to build applications, data scientists work with data to predict outcomes. As data scientists and ML engineers ourselves, we pay particular attention to toolkits that enable data analysis that leads to accurate and resilient predictive models.
Because data science and data engineering use the same base organizational data as a starting point, data scientists have come to rely on data engineering tools for their analytical needs. Having used these toolkits for EDA/DDA for several years, we've come to realize they are not optimal for data science work, and that the data science practice has matured to a point where the industry needs analytical toolkits specifically crafted for data scientists. There are two primary reasons for this:
• Thorough EDA/DDA involve a common set of tasks that are seldom found in a single existing data analysis toolkit. We end up using several open-source libraries for analysis tasks.
• The field of model building using ML has evolved to a point where certain differentiated requirements call for metrics that are not available in existing toolkits.
We provide three examples below of everyday, high-value EDA tasks that data scientists perform regularly. None of them is possible with a single analytical toolkit: the first two require assembling a suite of libraries, and the third is not supported at all.
Example 1 (Monotonic Binning)
Binning is a frequently used transformation technique for converting numerical attributes to discrete (integer or categorical) values. In monotonic binning, instead of mapping to a pre-fixed count of bins, the number of bins is computed dynamically so that the event rate moves monotonically with the bin index. This computation is driven by the Spearman rank correlation between bin index and event rate, which should be either +1 or -1. While performing such transformations is a common task for data scientists, we haven't found tools or libraries that directly support monotonic binning; we end up combining multiple libraries, such as PySpark, pandas, and SciPy, to pull it off.
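For concreteness, here is a minimal sketch of monotonic binning using pandas and SciPy. The function name and the `max_bins` starting point are our own illustrative choices, not a standard API:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def monotonic_bin(df, attribute, target, max_bins=20):
    """Decrease the bin count until the event rate is perfectly monotonic
    across bins, i.e., the Spearman rank correlation is +1 or -1."""
    bins, n_bins = None, None
    for n_bins in range(max_bins, 1, -1):
        # Quantile-based bins; "drop" merges duplicate bin edges
        bins = pd.qcut(df[attribute], q=n_bins, duplicates="drop")
        event_rate = df.groupby(bins, observed=True)[target].mean()
        if len(event_rate) < 2:
            continue
        rho, _ = spearmanr(range(len(event_rate)), event_rate)
        if np.isclose(abs(rho), 1.0):
            break  # largest bin count with monotonic event rates
    return bins, n_bins
```

A production version would also need to handle ties and sparse bins, but even this simple sketch already spans two libraries beyond the data frame engine itself.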
Example 2 (Association Evaluation)
Measuring the association, or the strength of the relationship between the relative movements of two attributes, is one of the most common EDA tasks performed by data scientists. Traditionally, data analysts have employed Pearson's Correlation Coefficient (PCC) as the standard statistic for measuring pairwise correlation between variables, and the PCC is offered widely in data toolkits. Yet the PCC has a number of drawbacks for data scientists: a) it works only with continuous variables, b) it accounts only for linear relationships between variables, and c) it is very sensitive to outliers. This last property is a big problem for predictive modelers. To avoid these issues, there are newer correlation statistics like Phik (𝜙k) that work consistently across categorical, ordinal, and interval variables, capture non-linear dependencies, and reduce to the PCC in the case of a bivariate normal input distribution. We, and many other data scientists we know, use the Phik correlation coefficient a lot in data analysis, but computing it requires several packages: Phi_K, popmon, and PySpark.
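As an illustration, once the Phi_K package is installed, the pairwise matrix is a single call on a pandas DataFrame (the dataset and column names below are hypothetical):

```python
import pandas as pd
import phik  # noqa: F401 -- importing phik registers the .phik_matrix() accessor

df = pd.read_csv("customers.csv")  # hypothetical mixed-type dataset

# Pairwise Phik correlation across categorical, ordinal and interval columns;
# interval_cols tells Phi_K which columns to treat as continuous.
correlation_matrix = df.phik_matrix(interval_cols=["age", "income"])
print(correlation_matrix)
```

Running this at the scale of enterprise datasets is where the additional packages, popmon and PySpark, come into play.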
Example 3 (Data Stability)
One of the most difficult problems for ML modelers is building resilient models, particularly models that are resistant to data drift. What makes this problem more challenging is the lack of metrics that indicate how likely data items are to drift over time. If a modeler knew that certain attributes were likely to drift, she could exclude them from feature engineering (or manage their use conservatively), thereby rendering the features, and therefore the eventual models, drift-resistant. We call this property stability and use it widely to build models that are resilient from the start. Estimating drift before building a model is difficult, and stability metrics are not available in existing analytical systems. For a more extensive discussion of drift and stability, check out our recent article on the topic.
While these are only three examples of common EDA activities not supported by any single toolkit, we run into the same problem with several other tasks. It is not that the current widely used data analysis toolkits are not good; they are excellent. It's just that data scientists and ML engineers have refined predictive model building into a science with its own unique, native, and repeatable analytical requirements, and one of the main reasons data analysis and preparation consume nearly 50% of model-building time is the lack of focused tool support.
Like most data scientists, we have been impacted by this lack of tool support. And, like others, we were getting by with a mish-mash of toolkits and heavy doses of custom coding. The result was long EDA and data prep times that eventually produced models that performed well in training and testing but lacked resiliency when operationalized. Frustrated by this, our team thought through the analytical requirements of a Data Scientist's Toolbox and created an open source project, Anovos, to meet our needs and the needs of other data scientists. We have used Anovos internally in our data prep and feature engineering workflow. Not only do we spend far less time on this step than we used to, with a significant gain in productivity, but we are also producing drift-resistant features that anchor more resilient operationalized models.
The Data Scientist’s Toolbox
A data scientist does data analysis for two reasons:
• Exploratory analysis of data for model building purposes, typically to analyze, clean and prepare data as part of the feature engineering workflow
• Supporting MLOps by addressing unexpected behavior of operationalized models and the underlying data quality issues
To support these, the toolbox needs to support the following five broad capability classes:
• Comprehensive descriptive statistics: To summarize attributes through measures of counts, central tendency, cardinality, percentiles, dispersion, and shape.
• Data quality checking and treatment: To assess data quality at both row-level and column-level and provide appropriate treatment options to fix detected issues.
• Association evaluation: To reveal the interaction between different attributes and/or the relationship between an attribute and the (binary) target variable.
• Data transformation: To support selected pre-processing and transformation functions (e.g., binning, encoding, etc.) that are required for statistics generation and quality checks.
• Drift and stability evaluation: To compute data drift after it has occurred, and to estimate the stability of data attributes, i.e., how likely a specific attribute is to drift over time. Stability identification can have a profound impact on model resiliency and is not currently supported in existing offerings.
While the above classes capture the vast majority of the broad tasks a data scientist performs, the devil is in the details: certain specific analytics are essential for the ML modeler to do effective feature engineering. Below, we provide a list of the analytics the data scientist's toolkit must support, grouped under each of the capabilities above.
Descriptive Statistics
• Measures of counts
• Measures of central tendency
• Measures of cardinality: Statistics related to unique values seen in an attribute
• Measures of percentiles
• Measures of dispersion
• Measures of shape: Statistics related to the shape of an attribute's distribution (e.g., skewness and kurtosis), of significant use to data scientists and model builders. These are also termed "measures of moments" when computed for numerical attributes. (A pandas sketch assembling these measures follows this list.)
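As a point of reference, most of these measures can be assembled per attribute in pandas, though not through any single call; the dataset and column names here are hypothetical:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical dataset
col = df["income"]                 # a numerical attribute

stats = {
    # Measures of counts and cardinality
    "count": col.count(),
    "missing": col.isna().sum(),
    "distinct": col.nunique(),
    # Measures of central tendency
    "mean": col.mean(),
    "median": col.median(),
    # Measures of percentiles and dispersion
    "p25": col.quantile(0.25),
    "p75": col.quantile(0.75),
    "std": col.std(),
    # Measures of shape (moment-based)
    "skewness": col.skew(),
    "kurtosis": col.kurt(),
}
print(stats)
```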
Data Quality Checking and Treatment
• Null detection (Row Level & Column Level)
o Treatment options include removing rows with a high proportion of null values. For columns with many NULL entries, a common treatment is to replace the NULLs with specified "central" values (mean, median, mode, etc.)
• IDness detection: To assess the ratio of the number of unique values in an attribute to the number of non-null rows (see the sketch following this list)
o Treatment option is the removal of columns with high IDness
• Biasedness detection: To flag a column/attribute that is biased or skewed towards one specific value.
o Typical treatment option is the removal of columns above a defined bias threshold
• Invalid entries detection: To check for suspicious patterns in an attribute's values
o Typical treatment is the replacement of invalid entries with null values.
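Here is a minimal sketch of the column-level checks described above; the threshold defaults are illustrative assumptions, not industry standards:

```python
import pandas as pd

def column_quality_report(df, idness_threshold=0.99, bias_threshold=0.98):
    """Flag columns with high IDness or biasedness and report null rates."""
    report = {}
    for col in df.columns:
        non_null = int(df[col].count())
        shares = df[col].value_counts(normalize=True)  # excludes nulls
        report[col] = {
            "null_rate": 1 - non_null / len(df),
            # IDness: unique values divided by non-null rows
            "high_idness": non_null > 0
            and df[col].nunique() / non_null >= idness_threshold,
            # Biasedness: share of the single most frequent value
            "biased": len(shares) > 0 and shares.iloc[0] >= bias_threshold,
        }
    return report

# Typical treatments: drop flagged columns, then impute remaining nulls
# with a "central" value such as the median for numerical attributes.
```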
Association Evaluation
• Association Matrix: To assess the relationship between attributes, both among independent variables and between an independent and a dependent variable/attribute. We list below the methods most used by the data science community to understand such associations (a sketch computing one of them, IV, follows the list):
o Correlation Analysis
o Information Value (IV)
o Information Gain (IG)
o Variable Clustering
• Attribute to target association: For both numerical and categorical columns against the event rate
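As one example, Information Value for a binned or categorical attribute against a binary target can be computed with a short helper; the `eps` smoothing constant is an illustrative guard against empty bins, not part of the standard definition:

```python
import numpy as np
import pandas as pd

def information_value(df, attribute, target, eps=1e-6):
    """IV of a categorical/binned attribute against a binary target:
    IV = sum over bins of (%events - %non-events) * WOE,
    where WOE = ln(%events / %non-events)."""
    grouped = df.groupby(attribute, observed=True)[target].agg(["sum", "count"])
    events = grouped["sum"]
    non_events = grouped["count"] - grouped["sum"]
    # Per-bin shares of total events and non-events, floored to avoid log(0)
    pct_events = (events / events.sum()).clip(lower=eps)
    pct_non_events = (non_events / non_events.sum()).clip(lower=eps)
    woe = np.log(pct_events / pct_non_events)
    return float(((pct_events - pct_non_events) * woe).sum())
```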
Data Transformation
This is required for pre-processing and cleaning to prepare the data in a form suitable for downstream modeling and prediction.
• Data cleaning (some of the data cleaning options are detailed under the Data Quality Checking and Treatment list)
• Targeted transformations: An attribute can be subjected to different transformations to make the feature meaningful and predictive. Some examples are attribute binning, monotonic binning, standardization, normalization, imputation, dimension reduction, etc.
Drift and Stability Measurement
• Data Drift Statistic: Metric that measures drift with respect to a benchmark (see the sketch following this list)
• Data Stability Index: Metric indicating how likely an attribute is to drift over time
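As one concrete, widely used drift statistic, the Population Stability Index (PSI) compares an attribute's current distribution to a benchmark. The sketch below is illustrative and not necessarily the exact metric any given toolkit implements:

```python
import numpy as np

def population_stability_index(benchmark, current, n_bins=10, eps=1e-6):
    """PSI between a benchmark sample and a current sample of the same
    numerical attribute. A common rule of thumb reads values below 0.1
    as stable, 0.1-0.2 as moderate shift, and above 0.2 as significant
    drift (a heuristic, not a hard threshold)."""
    # Derive bin edges from the benchmark so both samples share the grid
    edges = np.histogram_bin_edges(benchmark, bins=n_bins)
    bench_pct = np.histogram(benchmark, bins=edges)[0] / len(benchmark)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the shares to avoid division by zero and log(0)
    bench_pct = np.clip(bench_pct, eps, None)
    curr_pct = np.clip(curr_pct, eps, None)
    return float(np.sum((curr_pct - bench_pct) * np.log(curr_pct / bench_pct)))
```

A stability index, by contrast, must be estimated before drift occurs, which is exactly the gap described in Example 3 above.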
We believe the above capabilities, bundled in one toolkit, would satisfy the majority of the needs of a majority of data scientists.
Anovos: An analytic toolkit for the data science and ML community
Today, Anovos offers a comprehensive analytical toolkit for data scientists, enabling deep data analysis and stable feature creation, via better data health checks, comprehensive metric tools and drift and stability analysis. By rethinking ingestion and transformation, and including deeper analytics, drift identification, and stability analysis, Anovos improves productivity and helps data scientists build more resilient, higher performing models.
Our goal is to continue to enhance Anovos to create a near-automated data pipeline from ingestion through feature engineering (see the roadmap). Today, Anovos includes all the analytics identified above in a single package. We use it every day in our data science and modeling tasks, and it has improved not only our productivity but also the resilience of the eventual production models.