bblocks.cleaning_tools.clean

Functions

clean_number(→ float | int)

Clean a string and return it as a float or an integer.

clean_numeric_series(→ pandas.DataFrame | pandas.Series)

Clean a numeric column in a Pandas DataFrame or a Pandas Series which is meant to be numeric.

to_date_column(→ pandas.Series)

Converts a Pandas series into a date series.

convert_id(→ pandas.Series)

Takes a Pandas series with country IDs and converts them into the desired type.

date_to_str(→ pandas.Series)

Converts a Pandas series into a string series.

format_number(→ pandas.Series)

Formats a Pandas numeric series into a formatted string series.

convert_to_datetime(→ pandas.Series | pandas.Timestamp)

Custom function to convert values to datetime.

Module Contents

bblocks.cleaning_tools.clean.clean_number(number: str | pandas.Series, to: Type = float) → float | int

Clean a string and return it as a float or an integer. When to=int is selected, Python's default rounding behaviour is used.

Parameters:
  • number – the string to clean

  • to – the type to convert to (int or float)
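The rounding note matters because halves do not always round up in Python. The sketch below uses only the built-in round(), on the assumption that this is what "default rounding behaviour" refers to; it does not call clean_number itself.

```python
# Python's built-in round() sends exact halves to the nearest even integer
# ("round half to even"), so 0.5 and 2.5 round down while 1.5 and 3.5 round up.
halves = [0.5, 1.5, 2.5, 3.5]
rounded = [round(x) for x in halves]
print(rounded)  # [0, 2, 2, 4]
```

If round-half-up results are needed, the values would have to be adjusted before or after cleaning.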

bblocks.cleaning_tools.clean.clean_numeric_series(data: pandas.Series | pandas.DataFrame, series_columns: str | list | None = None, to: Type = float) → pandas.DataFrame | pandas.Series

Clean a numeric column in a Pandas DataFrame or a Pandas Series which is meant to be numeric. When to=int is selected, Python's default rounding behaviour is used.

Parameters:
  • data – a series or a dataframe. If a dataframe is passed, the column(s) to clean must be specified.

  • series_columns – the column or columns to clean. Only needed when data is a dataframe.

  • to – the type to convert to (int or float)

bblocks.cleaning_tools.clean.to_date_column(series: pandas.Series, date_format: str | None = None) → pandas.Series

Converts a Pandas series into a date series. The series must contain integers or strings that can be converted into datetime objects.

bblocks.cleaning_tools.clean.convert_id(series: pandas.Series, from_type: str = 'regex', to_type: str = 'ISO3', not_found: str | None = None, *, additional_mapping: dict = None) → pandas.Series

Takes a Pandas series with country IDs and converts them into the desired type.

Parameters:
  • series – the Pandas series to convert

  • from_type – the classification type according to which the series is encoded. Available types come from the country_converter package (https://github.com/konstantinstadler/country_converter#classification-schemes), for example ISO3, ISO2, name_short and DACcode.

  • to_type – the target classification type. Same options as from_type.

  • not_found – what to do if the value is not found. Can pass a string or None. If None, the original value is passed through.

  • additional_mapping – Optionally, a dictionary with additional mappings can be used. The keys are the values to be converted and the values are the converted values. The keys follow the same datatype as the original values. The values must follow the same datatype as the target type.
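The interplay of not_found and additional_mapping can be sketched in plain Python. Everything below is hypothetical: convert_codes and the toy base_mapping stand in for the real lookup done through country_converter, and letting additional_mapping override the base lookup is an assumption, not documented behaviour.

```python
def convert_codes(values, mapping, not_found=None, additional_mapping=None):
    """Hypothetical stand-in for a country ID conversion (not the bblocks code)."""
    # Assumption: extra mappings take precedence over the base lookup.
    lookup = {**mapping, **(additional_mapping or {})}
    result = []
    for value in values:
        if value in lookup:
            result.append(lookup[value])
        elif not_found is None:
            # Documented rule: with not_found=None the original value passes through.
            result.append(value)
        else:
            result.append(not_found)
    return result


# Toy lookup table for illustration only.
base_mapping = {"France": "FRA", "Kenya": "KEN"}
names = ["France", "Kenya", "Atlantis"]

passed_through = convert_codes(names, base_mapping)
flagged = convert_codes(names, base_mapping, not_found="unknown")
extended = convert_codes(names, base_mapping, additional_mapping={"Atlantis": "ATL"})

print(passed_through)  # ['FRA', 'KEN', 'Atlantis']
print(flagged)         # ['FRA', 'KEN', 'unknown']
print(extended)        # ['FRA', 'KEN', 'ATL']
```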

bblocks.cleaning_tools.clean.date_to_str(series: pandas.Series, date_format: str = '%d %B %Y') → pandas.Series

Converts a Pandas series into a string series.

Parameters:
  • series – the Pandas series to convert to a formatted date string

  • date_format – the format to use for the date string. The default is '%d %B %Y'.
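To see what the default format string produces, here is a minimal sketch with the standard library's strftime on a single date. It assumes date_format follows the usual strftime directives and does not call date_to_str itself.

```python
from datetime import date

# The documented default: zero-padded day, full month name, four-digit year.
date_format = "%d %B %Y"

formatted = date(2021, 3, 5).strftime(date_format)
print(formatted)  # 05 March 2021 (month name depends on the active locale)
```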

bblocks.cleaning_tools.clean.format_number(series: pandas.Series, as_units: bool = False, as_percentage: bool = False, as_millions: bool = False, as_billions: bool = False, decimals: int = 2, add_sign: bool = False, other_format: str = '{:,.2f}') → pandas.Series

Formats a Pandas numeric series into a formatted string series.

Parameters:
  • series – the series to convert to a formatted string

  • as_units – formatted with commas to separate thousands and the specified decimals

  • as_percentage – formatted as a percentage with the specified decimals. This assumes that the series contains numbers where 1 would equal 100%.

  • as_millions – divided by 1 million, formatted with commas and the specified decimals

  • as_billions – divided by 1 billion, formatted with commas and the specified decimals

  • decimals – the number of decimals to use

  • add_sign – add a plus sign to positive numbers

  • other_format – Other formats to use. This option can only be used if all others are false. Examples are available at: https://mkaz.blog/code/python-string-format-cookbook/
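The options above correspond to familiar Python format specifications. The sketch below applies such specifications to single values with str.format; whether format_number builds its output in exactly this way is an assumption, and the sample values are illustrative.

```python
value = 1234567.891

# Thousands separators with two decimals (the documented other_format default).
as_units = "{:,.2f}".format(value)

# Percentage: 1 equals 100%, so 0.256 becomes 25.60%.
as_percentage = "{:.2%}".format(0.256)

# Divide by one million before formatting.
as_millions = "{:,.2f}".format(value / 1e6)

# A leading plus sign on positive numbers.
with_sign = "{:+,.2f}".format(value)

print(as_units)       # 1,234,567.89
print(as_percentage)  # 25.60%
print(as_millions)    # 1.23
print(with_sign)      # +1,234,567.89
```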

bblocks.cleaning_tools.clean.convert_to_datetime(date: str | int | pandas.Series) → pandas.Series | pandas.Timestamp

Custom function to convert values to datetime. It handles integers or strings that represent only a year.
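Year-only input is the case that generic parsers often mishandle, since a bare integer such as 2015 can be read as an offset rather than a year. The sketch below shows one way to parse such values with the standard library. parse_year is a hypothetical helper, not the library's implementation, and how convert_to_datetime resolves the month and day is not stated here.

```python
from datetime import datetime


def parse_year(value):
    """Hypothetical helper: parse a year given as an int or a str."""
    # strptime with "%Y" fills the missing month and day with 1 January.
    return datetime.strptime(str(value), "%Y")


parsed = [parse_year(2015), parse_year("2020")]
print(parsed[0].isoformat())  # 2015-01-01T00:00:00
print(parsed[1].isoformat())  # 2020-01-01T00:00:00
```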