The ODI’s Data Taxonomy#
#TODO: Add “equity” and “fairness” considerations for each data item.
See the original document: https://theodi.org/news-and-events/blog/a-data-for-ai-taxonomy/
Category |
Type of Data |
Description |
---|---|---|
Developing AI systems |
Existing data |
Data not directly used for model training but as the basis for creating training datasets. |
Developing AI systems |
Training data |
Data processed to train AI models by helping them recognize patterns and improve accuracy. |
Developing AI systems |
Reference data |
Data used to enrich training datasets with context, such as knowledge graphs or linguistic resources. |
Developing AI systems |
Fine-tuning data |
Smaller datasets used to adapt pre-trained models for specialized tasks while preserving their capabilities. |
Developing AI systems |
Testing and validation data |
Data used to test models during development to ensure accuracy and representativeness. |
Developing AI systems |
Benchmarks |
Datasets used to evaluate a model’s performance and accuracy against unseen data. |
Developing AI systems |
Synthetic data |
Algorithmically generated data used for training, fine-tuning, or benchmarking models. |
Developing AI systems |
Data about the data |
Information about the datasets used to develop AI models, such as their size, source, and composition. |
Deploying AI systems |
Model weights |
Numerical values representing the relationships learned by a model during training. |
Deploying AI systems |
Local data |
Data an AI model processes in a specific deployment context, depending on its purpose and architecture. |
Deploying AI systems |
Prompts |
Instructions or queries given to AI systems to generate responses, commonly in generative models. |
Deploying AI systems |
Outputs from models |
Generated data from AI systems, such as text, audio, video, or structured outputs. |
Monitoring AI systems |
Data about models |
Information disclosed about AI models, including version, performance, and ethical considerations. |
Monitoring AI systems |
Data about model usage and performance in context |
Data collected during model use, such as query logs and performance metrics, used for improvements. |
Monitoring AI systems |
Registers of model deployments |
Authoritative lists of AI models deployed in specific contexts or sectors, maintained by governments or organizations. |
Monitoring AI systems |
Data about the AI ecosystem |
Data about the broader AI ecosystem, including models, incidents, policies, and workforce statistics. |