The ODI’s Data Taxonomy

#TODO: Add “equity” and “fairness” considerations for each data item.

See the original document: https://theodi.org/news-and-events/blog/a-data-for-ai-taxonomy/

| Category | Type of Data | Description |
| --- | --- | --- |
| Developing AI systems | Existing data | Data not directly used for model training but as the basis for creating training datasets. |
| Developing AI systems | Training data | Data processed to train AI models by helping them recognize patterns and improve accuracy. |
| Developing AI systems | Reference data | Data used to enrich training datasets with context, such as knowledge graphs or linguistic resources. |
| Developing AI systems | Fine-tuning data | Smaller datasets used to adapt pre-trained models for specialized tasks while preserving their capabilities. |
| Developing AI systems | Testing and validation data | Data used to test models during development to ensure accuracy and representativeness. |
| Developing AI systems | Benchmarks | Datasets used to evaluate a model’s performance and accuracy against unseen data. |
| Developing AI systems | Synthetic data | Algorithmically generated data used for training, fine-tuning, or benchmarking models. |
| Developing AI systems | Data about the data | Information about the datasets used to develop AI models, such as their size, source, and composition. |
| Deploying AI systems | Model weights | Numerical values representing the relationships learned by a model during training. |
| Deploying AI systems | Local data | Data an AI model processes in a specific deployment context, depending on its purpose and architecture. |
| Deploying AI systems | Prompts | Instructions or queries given to AI systems to generate responses, commonly in generative models. |
| Deploying AI systems | Outputs from models | Generated data from AI systems, such as text, audio, video, or structured outputs. |
| Monitoring AI systems | Data about models | Information disclosed about AI models, including version, performance, and ethical considerations. |
| Monitoring AI systems | Data about model usage and performance in context | Data collected during model use, such as query logs and performance metrics, used for improvements. |
| Monitoring AI systems | Registers of model deployments | Authoritative lists of AI models deployed in specific contexts or sectors, maintained by governments or organizations. |
| Monitoring AI systems | Data about the AI ecosystem | Data about the broader AI ecosystem, including models, incidents, policies, and workforce statistics. |
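
If the taxonomy above is used to tag datasets in an inventory, it can help to hold it as a small lookup structure. The following is a minimal sketch, not part of the ODI's document: the names `Category`, `TAXONOMY`, `category_of`, and `summarize`, and the example inventory entries, are all hypothetical and chosen only for illustration. The category and type names are copied from the taxonomy; descriptions are omitted for brevity.

```python
from collections import Counter
from enum import Enum


class Category(Enum):
    """The three lifecycle categories used in the taxonomy."""
    DEVELOPING = "Developing AI systems"
    DEPLOYING = "Deploying AI systems"
    MONITORING = "Monitoring AI systems"


# Hypothetical encoding: each category maps to the data types listed under it.
TAXONOMY = {
    Category.DEVELOPING: [
        "Existing data",
        "Training data",
        "Reference data",
        "Fine-tuning data",
        "Testing and validation data",
        "Benchmarks",
        "Synthetic data",
        "Data about the data",
    ],
    Category.DEPLOYING: [
        "Model weights",
        "Local data",
        "Prompts",
        "Outputs from models",
    ],
    Category.MONITORING: [
        "Data about models",
        "Data about model usage and performance in context",
        "Registers of model deployments",
        "Data about the AI ecosystem",
    ],
}


def category_of(type_name: str) -> Category:
    """Return the category a data type belongs to, or raise KeyError."""
    for category, type_names in TAXONOMY.items():
        if type_name in type_names:
            return category
    raise KeyError(f"Unknown data type: {type_name!r}")


def summarize(inventory: dict[str, str]) -> Counter:
    """Count inventory items per category, given a name -> data type mapping."""
    return Counter(category_of(data_type) for data_type in inventory.values())


# Illustrative inventory; the dataset names are made up.
inventory = {
    "support-chat-logs": "Prompts",
    "eval-suite": "Benchmarks",
    "dataset-datasheet": "Data about the data",
}
summary = summarize(inventory)
print(category_of("Prompts").value)  # Deploying AI systems
```

Keeping the type names as plain strings means an unrecognized label fails loudly with a `KeyError` instead of being silently miscategorized, which is one reasonable design choice among several.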