Fetching a Dataset from a Python Script

Objective

The goal of this procedure is to represent a dataset retrieved from the Dateno API as a Pandas dataframe in Python. This enables further analysis, filtering, or visualization using familiar data science tools.

Plan

The process consists of four main steps:

  1. Fetching a dataset card from Dateno.
  2. Extracting the URL of an associated CSV file.
  3. Downloading the CSV file.
  4. Loading data from the CSV file into a Pandas dataframe.

Source

Check out this Google Colab notebook for a working example.

TIP
Save a copy of the Google Colab notebook to your Google Drive if you want to run or modify the script. Otherwise, these operations are unavailable because of access restrictions.

Comments

get_dateno_search_api_url() -> str

Purpose
Returns the base URL of the Dateno search API.

Arguments
None.

Returns
A string with the base API URL.

get_my_dateno_api_key() -> str

Purpose
Returns the user's personal Dateno API key.

Arguments
None.

Returns
A string containing the API key.

assemble_dataset_card_url(dataset_id: str) -> str

Purpose
Constructs the full API URL for fetching a dataset card.

Arguments

  • dataset_id: Identifier of the dataset whose card is requested.

Returns
A string containing the API URL.

fetch_dataset_card(dataset_id: str) -> dict

Purpose
Fetches a dataset card from the Dateno API using the dataset identifier.

Arguments

  • dataset_id: Identifier of the dataset whose card is fetched.

Returns
A dictionary representing the dataset card in JSON format, or None if the request fails.

get_dataset_csv_table_url(dataset_card: dict) -> str

Purpose
Extracts the URL of the CSV resource from the dataset card.

Arguments

  • dataset_card: The dataset card as a dictionary, such as the one returned by fetch_dataset_card.

Returns
A string with the CSV resource URL, or an empty string if not found.

Operational Principle
The function accesses the _source field in the dataset card, then looks for a resources list within it. It iterates through the resources and checks the format property of each item. If it finds a resource with format equal to CSV, it returns the value of its url property. If no such resource is found, it returns an empty string.
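The lookup described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the field names (_source, resources, format, url) come from the description, while the case-insensitive comparison and the tolerance for missing keys are assumptions added for robustness.

```python
def get_dataset_csv_table_url(dataset_card: dict) -> str:
    """Return the URL of the first CSV resource in a dataset card, or ""."""
    # Missing or empty fields are treated as "no resources" (assumption).
    source = dataset_card.get("_source") or {}
    resources = source.get("resources") or []

    for resource in resources:
        # Case-insensitive match is an assumption; the description only
        # says the format must be equal to CSV.
        if str(resource.get("format", "")).upper() == "CSV":
            return resource.get("url") or ""

    # No CSV resource was found.
    return ""
```

Returning an empty string rather than raising an exception lets the caller test the result with a plain truthiness check before attempting the download.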

fetch_dataset_csv_table(dataset_csv_table_url: str) -> str

Purpose
Fetches the raw CSV data from a direct download URL.

Arguments

  • dataset_csv_table_url: Direct URL to the CSV file.

Returns
CSV content as a string, or an error message if the request fails.
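One possible shape for this function is sketched below. It uses the standard-library urllib module instead of whatever HTTP client the notebook relies on. The 30-second timeout, the UTF-8 decoding, and the wording of the error message are all assumptions.

```python
import urllib.error
import urllib.request

# Hypothetical timeout value; not taken from the original notebook.
REQUEST_TIMEOUT_SECONDS = 30


def fetch_dataset_csv_table(dataset_csv_table_url: str) -> str:
    """Download a CSV file and return its content as text.

    On failure, an error message is returned instead of raising.
    """
    try:
        with urllib.request.urlopen(
            dataset_csv_table_url, timeout=REQUEST_TIMEOUT_SECONDS
        ) as response:
            raw = response.read()
    except (urllib.error.URLError, ValueError, OSError) as exc:
        # URLError covers network and HTTP failures; ValueError covers
        # strings that are not valid URLs.
        return f"Error fetching CSV: {exc}"

    # Assume UTF-8 and replace undecodable bytes rather than failing.
    return raw.decode("utf-8", errors="replace")
```

Because the error case also returns a string, the caller cannot tell a failure from valid CSV by type alone. It needs to inspect the content, for example by checking for the error prefix.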

parse_csv_table(csv_string: str) -> pd.DataFrame

Purpose
Parses a CSV-formatted string into a Pandas DataFrame. Detects the column separator automatically.

Arguments

  • csv_string: CSV data as a plain string.

Returns
A Pandas DataFrame containing the parsed table.
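Automatic separator detection can be achieved in several ways. The sketch below assumes the simplest one: passing sep=None to pandas.read_csv with the Python parsing engine, which makes Pandas infer the delimiter from the data. The notebook may use a different technique.

```python
import io

import pandas as pd


def parse_csv_table(csv_string: str) -> pd.DataFrame:
    """Parse CSV text into a DataFrame, inferring the column separator."""
    # sep=None asks pandas to detect the delimiter; this feature is only
    # available with the slower but more flexible Python engine.
    return pd.read_csv(io.StringIO(csv_string), sep=None, engine="python")
```

For example, calling parse_csv_table("country;population\nFrance;68\nSpain;48\n") yields a two-column table even though the separator is a semicolon rather than a comma.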

show_dataframe(df: pd.DataFrame, dataset_name: str, count_lines: int = 10)

Purpose
Displays dataset metadata and a preview of its contents.

Arguments

  • df: A Pandas DataFrame representing the dataset.
  • dataset_name: Display name for the dataset.
  • count_lines: Number of rows to preview (default: 10).

Returns
None. Outputs are printed to the console.
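A preview function along these lines might look like the sketch below. Which metadata is displayed is an assumption (here: the name, the row and column counts, and the column names); the original notebook may print different details.

```python
import pandas as pd


def show_dataframe(df: pd.DataFrame, dataset_name: str, count_lines: int = 10) -> None:
    """Print basic information about a DataFrame and its first rows."""
    # The choice of metadata below is illustrative.
    print(f"Dataset: {dataset_name}")
    print(f"Rows: {len(df)}, columns: {len(df.columns)}")
    print(f"Column names: {list(df.columns)}")

    # head() returns at most count_lines rows; to_string() avoids
    # truncation of wide tables in the console output.
    print(df.head(count_lines).to_string())
```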

load_dataset_csv_table(dataset_id: str, dataset_name: str)

Purpose
Coordinates the full workflow: retrieves a dataset card, extracts the CSV URL, downloads the CSV data, parses it into a DataFrame, and displays a preview.

Arguments

  • dataset_id: Identifier of the dataset to load.
  • dataset_name: Display name for the dataset.

Returns
None. Outputs are printed to the console.