Fetching a Dataset from a Python Script
Objective
The goal of this procedure is to represent a dataset retrieved from the Dateno API as a Pandas dataframe in Python. This enables further analysis, filtering, or visualization using familiar data science tools.
Plan
The process consists of four main steps:
- Fetching a dataset card from Dateno.
- Extracting the URL of an associated CSV file.
- Downloading the CSV file.
- Loading data from the CSV file into a Pandas dataframe.
Source
Check out this Google Colab notebook for a working example.
TIP
Save a copy of the Google Colab notebook to your Google Drive if you want to run or modify the script. Without your own copy, these operations are unavailable because of access restrictions.
Comments
get_dateno_search_api_url() -> str
Purpose
Returns the base URL of the Dateno search API.
Arguments
None.
Returns
A string with the base API URL.
get_my_dateno_api_key() -> str
Purpose
Returns the user's personal Dateno API key.
Arguments
None.
Returns
A string containing the API key.
assemble_dataset_card_url(dataset_id: str) -> str
Purpose
Constructs the full API URL for fetching a dataset card.
Arguments
dataset_id
: Unique identifier of the dataset.
Returns
A string containing the API URL.
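As an illustration of how the three helpers above might fit together, here is a minimal sketch. The base URL, the `/entry/<dataset_id>` path, the `apikey` query parameter, and the `DATENO_API_KEY` environment variable are all placeholders chosen for this example, not values taken from the notebook; use the values from the Dateno API documentation in real code.

```python
import os
from urllib.parse import quote, urlencode


def get_dateno_search_api_url() -> str:
    # Placeholder base URL (assumption); replace with the real Dateno endpoint.
    return "https://api.example.com/search"


def get_my_dateno_api_key() -> str:
    # Assumption: the key is stored in a DATENO_API_KEY environment variable
    # rather than hard-coded in the script.
    return os.environ.get("DATENO_API_KEY", "")


def assemble_dataset_card_url(dataset_id: str) -> str:
    # Assumed URL layout: <base>/entry/<dataset_id>?apikey=<key>
    base = get_dateno_search_api_url().rstrip("/")
    query = urlencode({"apikey": get_my_dateno_api_key()})
    # Percent-encode the identifier so special characters cannot break the path.
    return f"{base}/entry/{quote(dataset_id, safe='')}?{query}"
```

Encoding both the identifier and the key keeps the URL valid even when they contain slashes or spaces.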
fetch_dataset_card(dataset_id: str) -> dict
Purpose
Fetches a dataset card from the Dateno API using the dataset identifier.
Arguments
dataset_id
: Unique identifier of the dataset.
Returns
A dictionary representing the dataset card in JSON format, or None if the request fails.
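The "dictionary or None" contract can be sketched with the standard library alone. The helper name `fetch_dataset_card_by_url` is hypothetical: it takes a ready-made card URL (such as one built by assemble_dataset_card_url) instead of a dataset identifier, and the notebook may well use a different HTTP client.

```python
import json
import urllib.request
from typing import Optional


def fetch_dataset_card_by_url(card_url: str, timeout: float = 30.0) -> Optional[dict]:
    """Download a dataset card and decode it; return None on any failure."""
    try:
        with urllib.request.urlopen(card_url, timeout=timeout) as response:
            card = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # Network and HTTP errors are OSError subclasses; malformed URLs,
        # undecodable bytes, and invalid JSON raise ValueError subclasses.
        return None
    # A dataset card is expected to be a JSON object, not a list or scalar.
    return card if isinstance(card, dict) else None
```

Because failures collapse to None, callers should check the result before reading any fields from it.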
get_dataset_csv_table_url(dataset_card: dict) -> str
Purpose
Extracts the URL of the CSV resource from the dataset card.
Arguments
dataset_card
: Dictionary representing the dataset card.
Returns
A string with the CSV resource URL, or an empty string if not found.
Operational Principle
The function accesses the _source field in the dataset card, then looks for a resources list within it. It iterates through the resources and checks the format property of each item. If it finds a resource with format equal to CSV, it returns the value of its url property. If no such resource is found, it returns an empty string.
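The lookup described above might be sketched as follows. The field names (`_source`, `resources`, `format`, `url`) come from the description; the tolerance for missing or empty fields is an assumption of this sketch, and the sample card is made-up data.

```python
def get_dataset_csv_table_url(dataset_card: dict) -> str:
    """Return the URL of the first CSV resource in a dataset card, or ""."""
    # Treat a missing or null "_source" / "resources" as empty (assumption).
    source = dataset_card.get("_source") or {}
    for resource in source.get("resources") or []:
        if resource.get("format") == "CSV":
            return resource.get("url") or ""
    return ""


# Made-up card with the structure described above.
sample_card = {
    "_source": {
        "resources": [
            {"format": "JSON", "url": "https://example.com/data.json"},
            {"format": "CSV", "url": "https://example.com/data.csv"},
        ]
    }
}
csv_url = get_dataset_csv_table_url(sample_card)
```

In this sketch the comparison is an exact match on "CSV", so a resource labeled "csv" would be skipped; normalizing the case first is an easy extension if the registry is inconsistent.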
fetch_dataset_csv_table(dataset_csv_table_url: str) -> str
Purpose
Fetches the raw CSV data from a direct download URL.
Arguments
dataset_csv_table_url
: Direct URL to the CSV file.
Returns
CSV content as a string, or an error message if the request fails.
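A minimal sketch of this download step, using only the standard library. The timeout value, the UTF-8 decoding, and the exact wording of the error message are assumptions; the notebook may use a different HTTP client.

```python
import urllib.request


def fetch_dataset_csv_table(dataset_csv_table_url: str, timeout: float = 30.0) -> str:
    """Download a CSV file and return its text, or an error message."""
    try:
        with urllib.request.urlopen(dataset_csv_table_url, timeout=timeout) as response:
            # Assumption: the file is UTF-8; undecodable bytes are replaced.
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError) as exc:
        # Network and HTTP errors are OSError subclasses; a malformed URL
        # raises ValueError.
        return f"Error: could not fetch {dataset_csv_table_url}: {exc}"
```

Since data and error messages share the same return type, a caller has to inspect the string (here, the "Error:" prefix) before passing it on to the parser.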
parse_csv_table(csv_string: str) -> pd.DataFrame
Purpose
Parses a CSV-formatted string into a Pandas DataFrame. Detects the column separator automatically.
Arguments
csv_string
: CSV data as a plain string.
Returns
A Pandas DataFrame containing the parsed table.
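One way to get automatic separator detection is to let pandas sniff it: with `sep=None` and the Python parsing engine, `read_csv` infers the delimiter from the first valid row. The sketch below assumes this approach; the notebook's own detection logic may differ.

```python
import io

import pandas as pd


def parse_csv_table(csv_string: str) -> pd.DataFrame:
    """Parse CSV text into a DataFrame, inferring the column separator."""
    # sep=None is only supported by the Python engine, which delegates
    # delimiter detection to the standard library's csv.Sniffer.
    return pd.read_csv(io.StringIO(csv_string), sep=None, engine="python")


# Made-up sample: semicolon-separated text parsed without naming the separator.
df = parse_csv_table("name;value\nalpha;1\nbeta;2\n")
```

The Python engine is slower than the default C engine, so for very large files it may be preferable to detect the separator once and pass it explicitly.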
show_dataframe(df: pd.DataFrame, dataset_name: str, count_lines: int = 10)
Purpose
Displays dataset metadata and a preview of its contents.
Arguments
df
: A Pandas DataFrame representing the dataset.
dataset_name
: Display name for the dataset.
count_lines
: Number of rows to preview (default: 10).
Returns
None. Outputs are printed to the console.
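A possible shape for such a preview function is sketched below. Which metadata the notebook actually prints is not specified here, so the name, row and column counts, and column names shown are assumptions.

```python
import pandas as pd


def show_dataframe(df: pd.DataFrame, dataset_name: str, count_lines: int = 10) -> None:
    """Print basic facts about a DataFrame followed by its first rows."""
    print(f"Dataset: {dataset_name}")
    print(f"Rows: {len(df)}, columns: {len(df.columns)}")
    print(f"Column names: {', '.join(map(str, df.columns))}")
    # to_string() avoids the column truncation of the default repr.
    print(df.head(count_lines).to_string())
```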
load_dataset_csv_table(dataset_id: str, dataset_name: str)
Purpose
Coordinates the full workflow: retrieves a dataset card, extracts the CSV URL, downloads the CSV data, parses it into a DataFrame, and displays a preview.
Arguments
dataset_id
: Unique identifier of the dataset in the Dateno registry.
dataset_name
: Descriptive label used in the output.
Returns
None. Outputs are printed to the console.