Datasets

A dataset is a structured collection of data assembled for a specific purpose, such as analysis, reporting, or application development. Datasets are typically published in catalogs, and each catalog is maintained by its owner.

How Datasets Are Organized

Datasets are grouped into catalogs based on their type, purpose, and ownership.

Each catalog belongs to a specific catalog type. Common catalog types include:

  • Geoportals for geographical data
  • Indicators catalogs for tabulated values
  • Open data portals for diverse datasets

Other catalog types are also possible.

Catalog types help users understand the focus and scope of the datasets within them.
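The grouping described above can be sketched as a small data model. This is only an illustration: the class names, the `name` and `owner` fields, and the sample catalogs are hypothetical and are not taken from any real catalog API.

```python
from dataclasses import dataclass
from enum import Enum


class CatalogType(Enum):
    # The three types listed above; OTHER stands in for any further types.
    GEOPORTAL = "Geoportal"
    INDICATORS_CATALOG = "Indicators catalog"
    OPEN_DATA_PORTAL = "Open data portal"
    OTHER = "Other"


@dataclass(frozen=True)
class Catalog:
    name: str                  # hypothetical field
    owner: str                 # hypothetical field
    catalog_type: CatalogType  # each catalog belongs to one catalog type


def group_by_type(catalogs):
    """Return a dict mapping each catalog type to the names of its catalogs."""
    groups = {}
    for catalog in catalogs:
        groups.setdefault(catalog.catalog_type, []).append(catalog.name)
    return groups


# Made-up sample catalogs, for illustration only.
catalogs = [
    Catalog("Example Geoportal", "Example Mapping Agency", CatalogType.GEOPORTAL),
    Catalog("Example Open Data", "Example City", CatalogType.OPEN_DATA_PORTAL),
    Catalog("Example Cadastre Maps", "Example Land Registry", CatalogType.GEOPORTAL),
]
groups = group_by_type(catalogs)
```

Modeling the catalog type as an enumeration with an `OTHER` member keeps the three named types explicit while leaving room for the additional types the text allows.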

Attributes of Datasets

Datasets possess various attributes that can be grouped into the following categories:

Nature of Content Attributes

  • Sources: Identify where a dataset originates, such as an API, a repository, or a portal. Sources differ in accessibility, reliability, and use cases.
  • Data Themes: Classify datasets by theme, such as the spatial data themes defined by the European Union's INSPIRE Directive. These themes align datasets with EU policy areas, including agriculture, transportation, and environmental monitoring.
  • Topic Categories: Categorize datasets using the topic categories of ISO 19115, the metadata standard for geographic information, to support interoperability and consistency.

Geographical and Linguistic Attributes

  • Country: Indicates the primary country related to the dataset, helping users filter datasets by national scope or relevance.

  • Region and Subregion: Provides more specific geographic context, such as states, provinces, or local administrative areas.

  • Language: Specifies the language of the dataset's content, so users can find datasets they are able to read and process.

    IMPORTANT
    The data, title, and description of a dataset are usually written in the same language. For example, a query written in Hungarian returns datasets that are described in Hungarian and contain Hungarian-language data.

Technical Attributes

  • Dataset Type: Defines the nature of the data within a dataset, such as geospatial data, tabular data, or text documents. Each dataset belongs to exactly one dataset type, which keeps classification unambiguous for filtering and analysis.
  • Data Formats: Describe the formats used to represent dataset resources, such as CSV, JSON, or XML. A single dataset may include files in multiple formats.
  • Maintenance Software: Identifies the software used to manage the catalog containing the dataset, which can indicate how the dataset is organized and maintained.

Usage Limitation Attributes

  • License: Specifies the dataset's usage rights, such as public domain, Creative Commons, or proprietary licenses. This helps users comply with the license terms and supports filtering datasets by usage restrictions.
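As a rough sketch of how the attribute categories above might be represented and used for filtering, consider the following record type. The field names, the `filter_datasets` helper, and the sample records are all assumptions made for illustration; they do not describe an actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    title: str
    # Nature of content attributes
    source: str
    data_themes: List[str] = field(default_factory=list)
    topic_categories: List[str] = field(default_factory=list)
    # Geographical and linguistic attributes
    country: Optional[str] = None
    region: Optional[str] = None
    subregion: Optional[str] = None
    language: Optional[str] = None
    # Technical attributes
    dataset_type: str = "tabular"  # exactly one type per dataset
    data_formats: List[str] = field(default_factory=list)  # may hold several
    maintenance_software: Optional[str] = None
    # Usage limitation attributes
    license: Optional[str] = None


def filter_datasets(datasets, *, country=None, language=None,
                    data_format=None, license=None):
    """Keep datasets that match every criterion that is not None."""
    result = []
    for ds in datasets:
        if country is not None and ds.country != country:
            continue
        if language is not None and ds.language != language:
            continue
        if data_format is not None and data_format not in ds.data_formats:
            continue
        if license is not None and ds.license != license:
            continue
        result.append(ds)
    return result


# Made-up sample records, for illustration only.
datasets = [
    Dataset("Example bus stops", source="portal",
            data_themes=["Transport networks"],
            topic_categories=["transportation"],
            country="Hungary", language="hu", dataset_type="geospatial",
            data_formats=["GeoJSON", "CSV"], license="CC-BY-4.0"),
    Dataset("Example air quality readings", source="API",
            topic_categories=["environment"],
            country="Hungary", language="en",
            data_formats=["JSON"], license="CC0-1.0"),
    Dataset("Example regional population", source="repository",
            country="Austria", language="de",
            data_formats=["CSV", "XML"], license="CC-BY-4.0"),
]

hungarian = filter_datasets(datasets, language="hu")
csv_sets = filter_datasets(datasets, data_format="CSV")
```

Here `dataset_type` is a single value while `data_formats` is a list, mirroring the rule that a dataset has one type but may include files in multiple formats.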

Structure of a Dataset

Each dataset consists of multiple resources that contain its actual content. These resources include:

  • Data Files: Files that directly contain the dataset payload, such as tabular data or geospatial maps.
  • Metadata: Descriptive information about the dataset, including its purpose, origin, and attributes.
  • Links to Data Sources: References to external systems or APIs for accessing additional information.

NOTE
The structure and organization of resources within a dataset depend on the dataset publisher and may vary based on the dataset's purpose and type.
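The three kinds of resources could be modeled along the following lines. This is a minimal sketch under assumed names: `ResourceKind`, `Resource`, `DatasetWithResources`, and the `example.org` locations are hypothetical, and real publishers may structure their resources differently.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ResourceKind(Enum):
    DATA_FILE = "data file"
    METADATA = "metadata"
    LINK = "link to data source"


@dataclass
class Resource:
    kind: ResourceKind
    location: str          # file path or URL (hypothetical field)
    data_format: str = ""  # e.g. "CSV"; left empty when not applicable


@dataclass
class DatasetWithResources:
    title: str
    resources: List[Resource] = field(default_factory=list)

    def by_kind(self, kind):
        """Return the resources of one kind, in their original order."""
        return [r for r in self.resources if r.kind is kind]

    def formats(self):
        """Return the distinct formats of the data files, sorted by name."""
        return sorted({r.data_format
                       for r in self.by_kind(ResourceKind.DATA_FILE)
                       if r.data_format})


# Made-up sample dataset, for illustration only.
dataset = DatasetWithResources("Example bus stops", [
    Resource(ResourceKind.DATA_FILE, "https://example.org/stops.csv", "CSV"),
    Resource(ResourceKind.DATA_FILE, "https://example.org/stops.geojson", "GeoJSON"),
    Resource(ResourceKind.METADATA, "https://example.org/stops-metadata.xml", "XML"),
    Resource(ResourceKind.LINK, "https://example.org/api/stops"),
])
data_file_formats = dataset.formats()
```

Keeping the resource kind separate from the file format lets a metadata record in XML stay distinct from a data file in XML, which a format-only model would conflate.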