Tracking Data Operations#
Throughout the machine learning development pipeline, we move data from one component to another. The data collection journey itself unfolds through multiple stages: data gathering, consolidation, annotation (particularly relevant in supervised training scenarios), and inter-annotator agreement. In complex model development, such as LLMs, human-in-the-loop approaches can improve the reliability and cohesion of model outputs. However, designing these systems is challenging; the difficulties range from inherent biases and copyright concerns to the inadvertent inclusion of potentially harmful content.
Possible sources and formats of data#
For example, in LLM development:
Most text datasets are generated by web scraping, which involves crawling websites to extract text data.
Alternatively, data can be collected through APIs (Application Programming Interfaces) from platforms such as social media, news websites, and other online sources.
Researchers also produce ready-to-use text corpora, which include more guided and structured data such as books, articles, and research papers.
Collecting conversational data, such as chat logs, can help language models understand and generate human-like responses.
Code repositories and programming-related content, such as GitHub and Stack Overflow, can improve the model’s understanding of code.
There are also task- or industry-specific datasets for fine-tuning the model on particular use cases.
We can extend this list further. However, our focus in this chapter is data quality evaluation techniques, which have a significant impact on achieving overall safety and fairness. We can categorise the evaluation approaches into three groups: (1) supervised (manual) evaluation, (2) semi-supervised (semi-automated) evaluation, and (3) unsupervised (automated) evaluation.
Supervised (manual) evaluation#
Self Evaluation#
Creating data cards: Creating data cards in a standardised format (example) can help identify possible issues beforehand. For example, The Data Cards Playbook asks for the distribution of data sources in relation to bias (e.g. geographic and gender distributions) and for details of the sampling tasks.
Using surveys and frameworks:
For example, the UK Statistics Authority’s Guidelines can help researchers evaluate data collection practices and evaluation approaches through a structured ethical assessment.
Pipino et al.’s data quality assessment framework offers a standardised and comprehensive approach to evaluating data quality, making it applicable to various industries [Pipino et al., 2002]. It has been widely adopted by businesses over the last two decades to ensure the reliability and usability of their data. By utilising this framework, we can systematically evaluate data quality and identify areas for improvement.
FAIR principles are the most popular assessment criteria among the reproducible research community [Wilkinson et al., 2016]. The principles “put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.”
Expert Evaluation#
One of the most reliable, but expensive, solutions is annotating bias in the dataset with expert human annotators. We can use the guidelines and frameworks mentioned in the “self-evaluation” section. Spinde et al. created a 34-item questionnaire to accurately map bias perception (https://media-bias-research.org/publications/). The BABE (Bias Annotations By Experts) dataset was created using this approach [Spinde et al., 2021].
However, relying on the perception of human annotators is also quite challenging, as bias can be rather subjective.
Unsupervised (automated) evaluation#
Training a classifier to detect issues: One example of this approach is the Dbias bias detection package, which uses a language model to detect issues in text across ten dimensions (gender, race, social status, age, disability, religion, profession, nationality, education, and body size) [Raza et al., 2024]. The authors adopted the Inside-Outside-Beginning (IOB) annotation scheme [Ramshaw and Marcus, 1995] to classify and annotate ‘BIAS’ entities, and then trained a BERT model to predict BIAS entities in a dataset.
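Dbias’s own API is not reproduced here; as a rough illustration of the same idea (token-level BIAS tagging with a fine-tuned BERT model), a minimal sketch using the Hugging Face transformers token-classification pipeline could look like this. The checkpoint name is hypothetical, not the actual Dbias model.

```python
# A minimal sketch of IOB-style bias-entity tagging with a fine-tuned BERT model.
# "my-org/bert-bias-tagger" is a hypothetical checkpoint; substitute a model
# trained with B-BIAS / I-BIAS / O token labels.
from transformers import pipeline

bias_tagger = pipeline(
    "token-classification",
    model="my-org/bert-bias-tagger",      # hypothetical fine-tuned checkpoint
    aggregation_strategy="simple",        # merge B-/I- word pieces into entity spans
)

text = "Lazy teenagers are the main reason the project failed."
for span in bias_tagger(text):
    # each span reports the predicted entity group (e.g. 'BIAS'), confidence and offsets
    print(span["entity_group"], span["word"], round(span["score"], 3))
```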
Semi-supervised (semi-automated) evaluation#
Annotating with active learning: e.g. https://proceedings.scipy.org/articles/majora-212e5952-001 (a minimal uncertainty-sampling sketch is given after this list).
Comparing the longitudinal representations: If data augmentation techniques change the fairness scores beyond a defined threshold, we should consider re-evaluating the dataset or choosing a fairer data transformation methodology.
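For the active-learning item above, a minimal pool-based uncertainty-sampling loop is sketched below. It uses only scikit-learn on synthetic data, and the human annotation step is simulated by revealing held-back labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for an annotation task: a small labelled seed set plus a
# large unlabelled pool whose true labels are revealed only when "annotated".
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_lab, y_lab = X[:20], y[:20]
X_pool, y_pool = X[20:], y[20:]          # y_pool plays the role of the human annotator

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    proba = model.predict_proba(X_pool)
    margin = np.abs(proba[:, 1] - proba[:, 0])     # small margin = model is uncertain
    query = np.argsort(margin)[:10]                # send the 10 hardest cases to annotators
    X_lab = np.vstack([X_lab, X_pool[query]])
    y_lab = np.concatenate([y_lab, y_pool[query]])
    X_pool, y_pool = np.delete(X_pool, query, axis=0), np.delete(y_pool, query)
```

For comparing fairness scores before and after a data transformation, one hedged option is to compute the same group-fairness metric on both versions and flag changes above a threshold. The sketch below assumes the fairlearn package; the labels, predictions, and sensitive attribute are randomly generated placeholders.

```python
import numpy as np
from fairlearn.metrics import demographic_parity_difference

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)             # illustrative labels
sex = rng.integers(0, 2, size=500)                # illustrative binary sensitive attribute
y_pred_original = rng.integers(0, 2, size=500)    # predictions from a model on the original data
y_pred_augmented = rng.integers(0, 2, size=500)   # predictions from a model on the augmented data

dpd_original = demographic_parity_difference(y_true, y_pred_original, sensitive_features=sex)
dpd_augmented = demographic_parity_difference(y_true, y_pred_augmented, sensitive_features=sex)

THRESHOLD = 0.05                                  # illustrative tolerance, not a standard value
if abs(dpd_original - dpd_augmented) > THRESHOLD:
    print("Augmentation changed group fairness; re-evaluate the transformation.")
```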
Improving Data Quality#
Traditionally, practitioners improve data quality during the “data preprocessing” stage, so most data quality improvement techniques are also referred to as “data pre-processing techniques.” These can help improve the security, privacy, and fairness of machine learning models. For example, data sanitization and outlier detection can help resolve privacy issues. We can improve model fairness with augmentation techniques that introduce out-of-distribution samples and with balancing techniques that ensure equal representation of different classes. We selected some widely used techniques below and explain them in more detail:
Data Sanitization: Cleaning and purifying the dataset by identifying and removing errors, inconsistencies, or irrelevant information. This method aims to improve data quality by detecting and correcting inaccuracies, inconsistencies, or missing values in the dataset. It ensures that the data is reliable and suitable for training machine learning models, reducing the likelihood of introducing noise or bias during the learning process.
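As a rough illustration, a sanitization pass over a small tabular dataset might look like the sketch below; the column names and cleaning rules are hypothetical.

```python
import pandas as pd

# Hypothetical raw table with duplicates, impossible values and missing entries.
df = pd.DataFrame({
    "age":    [34, 34, -5, 41, None, 29],
    "income": [52000, 52000, 61000, None, 45000, 38000],
    "city":   ["Leeds", "Leeds", "York", "york", "Leeds", None],
})

df = df.drop_duplicates()                                   # remove exact duplicate records
df = df[(df["age"].isna()) | (df["age"].between(0, 120))]   # drop impossible ages
df["city"] = df["city"].str.strip().str.title()             # harmonise inconsistent categories
df["income"] = df["income"].fillna(df["income"].median())   # impute missing numeric values
df = df.dropna(subset=["age"])                              # drop rows still missing key fields
```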
Outlier detection and removal: Identifying and eliminating data points that deviate significantly from the majority of the dataset. By employing statistical or machine learning techniques, outliers—data points that are unusually distant from the rest—are identified and either corrected or removed. This process helps in enhancing the robustness and generalization capability of machine learning models by preventing them from being overly influenced by anomalous data.
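One common automated option is an isolation forest; the sketch below flags and drops the most anomalous rows of a synthetic numeric table. The contamination value is an illustrative assumption.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 4))
X[:5] += 8                                   # inject a few obvious outliers

iso = IsolationForest(contamination=0.01, random_state=0)
labels = iso.fit_predict(X)                  # -1 = outlier, 1 = inlier
X_clean = X[labels == 1]                     # keep only the inlier rows
print(f"Removed {np.sum(labels == -1)} suspected outliers")
```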
Data Augmentation/Balancing: These are techniques to increase the diversity and balance of the training dataset, particularly in situations where certain classes are underrepresented. These methods include various techniques such as duplicating minority class samples, generating synthetic data, or applying transformations to existing data to create a more balanced representation of different classes. This helps prevent the model from being biased towards the majority class and improves its ability to generalize across diverse instances. One specific approach is Counterfactual Data Substitution (GitHub Repo Example), which involves replacing instances in the dataset with counterfactual instances. Counterfactual instances are generated by modifying the features of existing instances while preserving their class labels. This technique aims to create diverse yet plausible examples to enhance the robustness and generalization of machine learning models.
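A widely used balancing option is SMOTE oversampling. The sketch below assumes the imbalanced-learn package and a synthetic imbalanced dataset; it is an illustration of the balancing idea, not of counterfactual data substitution itself.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic dataset where the positive class is heavily under-represented.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # synthesise new minority samples
print("after: ", Counter(y_res))
```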
Data Calibration: Adjusting the output probabilities of a model to align with the actual likelihood of the predicted outcomes. Calibration techniques are applied to ensure that predicted probabilities accurately reflect the true likelihood of a given prediction. This is particularly important for models used in probabilistic decision-making, such as those involving confidence scores.
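In scikit-learn, post-hoc calibration can be sketched with CalibratedClassifierCV, which wraps an estimator and rescales its predicted probabilities; the dataset and estimator choice below are illustrative.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Naive Bayes tends to produce over-confident probabilities; wrap it with a
# sigmoid (Platt) calibrator fitted via internal cross-validation.
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)     # calibrated probability estimates
```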
Instance Weighting: Assigning different weights to individual instances in the training dataset to influence the model’s learning process. By assigning higher weights to certain instances, the learning algorithm gives more importance to those instances during training. This can be useful in scenarios where certain instances are more critical or representative than others, helping to address imbalances or prioritize specific data points in the learning process.
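Most scikit-learn estimators accept per-instance weights at fit time; the sketch below up-weights a hypothetical “critical” subset of the training data, with the weighting scheme chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

# Hypothetical importance scheme: the last 100 instances are deemed critical
# and receive five times the weight of ordinary instances.
weights = np.ones(len(y))
weights[-100:] = 5.0

model = LogisticRegression(max_iter=1000)
model.fit(X, y, sample_weight=weights)       # weighted instances count more in the loss
```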
Disparate-Impact Remover: (AIF360 Demo) The technique aims to mitigate or eliminate disparate impact in a dataset. Disparate impact refers to situations where a model’s predictions may disproportionately favour one group over another, leading to unfair or biased outcomes. This method typically involves modifying the features or samples in the dataset to ensure a more equitable representation of different groups. It aims to equalize the impact of the model’s predictions across various demographic or categorical subgroups, promoting fairness in the model’s performance.
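The remover itself is implemented in the linked AIF360 demo and is not reproduced here; as a quick diagnostic before applying it, the disparate impact ratio (the rate of favourable outcomes for the unprivileged group divided by that of the privileged group) can be computed directly, as in the hedged sketch below with a hypothetical outcome table.

```python
import pandas as pd

# Hypothetical outcomes: `hired` is the favourable outcome, `sex` the protected attribute.
df = pd.DataFrame({
    "sex":   ["F", "F", "F", "F", "M", "M", "M", "M", "M", "M"],
    "hired": [0,   1,   0,   0,   1,   1,   0,   1,   1,   0],
})

rates = df.groupby("sex")["hired"].mean()
disparate_impact = rates["F"] / rates["M"]    # unprivileged rate / privileged rate
print(f"Disparate impact ratio: {disparate_impact:.2f}")  # values below ~0.8 suggest adverse impact
```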
Learning Fair Representations: (AIF360 Demo) This approach seeks to train models in such a way that the learned representations are inherently fair, meaning they do not encode discriminatory information or biases. This method involves incorporating fairness constraints during the model training process. By explicitly considering fairness metrics or constraints, the algorithm learns to generate representations that are less influenced by sensitive attributes, such as race or gender. This helps in building models that are more equitable and less prone to producing biased predictions.
Optimised Pre-Processing: (AIF360 Demo) This approach involves systematically preparing and transforming the data to enhance the performance of machine learning models. The optimization process may include various steps such as feature scaling, dimensionality reduction, and handling missing data. This method aims to improve the efficiency and effectiveness of the subsequent machine learning model by addressing specific challenges present in the raw data. It might involve techniques like normalization, imputation of missing values, or feature engineering to ensure that the input data is well-suited for the chosen learning algorithm.
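The linked AIF360 demo implements a specific fairness-aware optimisation that is not reproduced here; a hedged sketch of the generic pre-processing steps described above (imputation, scaling, then a model) in scikit-learn could look like this, on a synthetic feature matrix with missing values.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with randomly scattered missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.integers(0, 200, 30), rng.integers(0, 5, 30)] = np.nan
y = rng.integers(0, 2, size=200)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill missing values
    ("scale", StandardScaler()),                    # normalise feature ranges
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
```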
Reweighing: (AIF360 Demo) This technique involves assigning different weights to different samples in the dataset to address class imbalance or bias issues. In situations where certain classes are underrepresented or overrepresented in the training data, reweighing assigns different weights to instances of each class. This helps the model to give more importance to the minority class, ensuring that it is not overshadowed by the majority class. This method is particularly useful in scenarios where there is an imbalance in the distribution of classes, preventing the model from being skewed towards the majority class.
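In AIF360, the reweighing pre-processor attaches instance weights that balance favourable outcomes across protected groups; the sketch below assumes AIF360 is installed and uses a small hypothetical table with sex as the protected attribute.

```python
import pandas as pd
from aif360.algorithms.preprocessing import Reweighing
from aif360.datasets import BinaryLabelDataset

# Hypothetical data: label=1 is the favourable outcome, sex=1 the privileged group.
df = pd.DataFrame({
    "sex":   [1, 1, 1, 0, 0, 0, 1, 0],
    "score": [0.9, 0.4, 0.7, 0.8, 0.3, 0.6, 0.2, 0.5],
    "label": [1, 1, 0, 0, 0, 1, 1, 0],
})
dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["sex"])

rw = Reweighing(unprivileged_groups=[{"sex": 0}], privileged_groups=[{"sex": 1}])
reweighed = rw.fit_transform(dataset)
print(reweighed.instance_weights)    # per-instance weights to use during training
```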
Differential privacy: A privacy-preserving concept in data analysis that aims to provide strong guarantees about the protection of individual privacy while still allowing meaningful analysis of the aggregate data. It can be integrated into machine learning algorithms to train models on sensitive data without revealing details about any particular individual in the training set. It is especially a good defence mechanism for model inversion attacks.
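As a bare-bones illustration (not a production differential-privacy implementation), the Laplace mechanism adds noise calibrated to a query’s sensitivity and a privacy budget epsilon; the query and epsilon below are illustrative.

```python
import numpy as np

def laplace_count(true_count: float, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count query with epsilon-differential privacy via the Laplace mechanism."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# A counting query has sensitivity 1: adding or removing one person changes it by at most 1.
ages = np.array([23, 37, 45, 52, 61, 29, 33])
noisy = laplace_count(float(np.sum(ages > 40)), epsilon=0.5)
print(f"Noisy count of people over 40: {noisy:.1f}")
```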
Data perturbation: It involves intentionally introducing small random variations or alterations to the values in a dataset to protect individual privacy or enhance the robustness of machine learning models. Perturbing data often involves adding random noise to the original values. This can be done by introducing a small amount of randomness to numerical values, categorical variables, or even by perturbing the structure of the dataset. In addition to noise, more sophisticated techniques involve masking or obfuscating certain features to prevent the identification of specific individuals while still allowing for meaningful analysis at an aggregate level.
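A minimal perturbation sketch: adding zero-mean Gaussian noise to the numeric columns of a hypothetical table before sharing it, with the noise scale chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": [23, 37, 45, 52], "income": [31000, 54000, 61000, 48000]})

# Noise scale is a fraction of each column's standard deviation; larger scales
# give stronger privacy protection at the cost of utility.
for col in ["age", "income"]:
    noise = rng.normal(0, 0.1 * df[col].std(), size=len(df))
    df[col] = df[col] + noise
print(df.round(1))
```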
Practitioners report these steps in their pipelines and reports to create more auditable ML models.