Module preprocessing (2.29.0)

Transformers that prepare data for other estimators. This module is styled after scikit-learn's preprocessing module: https://scikit-learn.org/stable/modules/preprocessing.html.

Classes

KBinsDiscretizer

KBinsDiscretizer(
    n_bins: int = 5,
    strategy: typing.Literal["uniform", "quantile"] = "quantile"
)

Bin continuous data into intervals.
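The sketch below is indicative rather than authoritative: it assumes KBinsDiscretizer exposes the same fit/transform interface demonstrated for OneHotEncoder and StandardScaler later on this page, and the data is made up.

.. code-block::

    from bigframes.ml.preprocessing import KBinsDiscretizer
    import bigframes.pandas as bpd

    # Four evenly sized bins per column ("uniform"); the default strategy
    # "quantile" would instead put roughly equal row counts in each bin.
    discretizer = KBinsDiscretizer(n_bins=4, strategy="uniform")
    data = bpd.DataFrame({"a": [1.0, 2.5, 7.0, 9.5], "b": [10, 20, 30, 40]})
    discretizer.fit(data)

    # One discretized output column per input column.
    print(discretizer.transform(data))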

LabelEncoder

LabelEncoder(
    min_frequency: typing.Optional[int] = None,
    max_categories: typing.Optional[int] = None,
)

Encode target labels with values between 0 and n_classes - 1.

This transformer should be used to encode target values, i.e. y, and not the input X.
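A rough sketch of encoding a label column, again assuming the fit/transform pattern used elsewhere in this module; the column name and data are arbitrary.

.. code-block::

    from bigframes.ml.preprocessing import LabelEncoder
    import bigframes.pandas as bpd

    encoder = LabelEncoder()
    y = bpd.DataFrame({"label": ["paris", "tokyo", "paris", "amsterdam"]})
    encoder.fit(y)

    # Each distinct label is mapped to an integer code between 0 and n_classes - 1.
    print(encoder.transform(y))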

MaxAbsScaler

MaxAbsScaler()

Scale each feature by its maximum absolute value.

This estimator scales each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
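A minimal sketch of this behaviour, under the same fit/transform assumption as the other examples on this page:

.. code-block::

    from bigframes.ml.preprocessing import MaxAbsScaler
    import bigframes.pandas as bpd

    scaler = MaxAbsScaler()
    data = bpd.DataFrame({"a": [-4.0, 2.0, 1.0], "b": [0.5, -1.0, 2.0]})
    scaler.fit(data)

    # Each column is divided by the largest absolute value seen during fit,
    # so the transformed training data lies in [-1.0, 1.0].
    print(scaler.transform(data))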

MinMaxScaler

MinMaxScaler()

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
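A minimal sketch, assuming the same fit/transform interface as the other transformers in this module; the constructor here takes no arguments, so the default range of zero to one applies.

.. code-block::

    from bigframes.ml.preprocessing import MinMaxScaler
    import bigframes.pandas as bpd

    scaler = MinMaxScaler()
    data = bpd.DataFrame({"a": [1.0, 3.0, 5.0], "b": [10, 20, 40]})
    scaler.fit(data)

    # Each column is rescaled so its training minimum maps to 0 and its maximum to 1.
    print(scaler.transform(data))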

OneHotEncoder

OneHotEncoder(
    drop: typing.Optional[typing.Literal["most_frequent"]] = None,
    min_frequency: typing.Optional[int] = None,
    max_categories: typing.Optional[int] = None,
)

Encode categorical features in a one-hot format.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka 'one-of-K' or 'dummy') encoding scheme.

Note that this method deviates from Scikit-Learn; instead of producing sparse binary columns, the encoding is a single column of STRUCT<index INT64, value DOUBLE>.

Examples:

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.

>>> from bigframes.ml.preprocessing import OneHotEncoder
>>> import bigframes.pandas as bpd

>>> enc = OneHotEncoder()
>>> X = bpd.DataFrame({"a": ["Male", "Female", "Female"], "b": ["1", "3", "2"]})
>>> enc.fit(X)
OneHotEncoder()

>>> print(enc.transform(bpd.DataFrame({"a": ["Female", "Male"], "b": ["1", "4"]})))
                onehotencoded_a               onehotencoded_b
0  [{'index': 1, 'value': 1.0}]  [{'index': 1, 'value': 1.0}]
1  [{'index': 2, 'value': 1.0}]  [{'index': 0, 'value': 1.0}]
<BLANKLINE>
[2 rows x 2 columns]

PolynomialFeatures

PolynomialFeatures(degree: int = 2)

Generate polynomial and interaction features.
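A minimal sketch under the same fit/transform assumption as the other examples on this page; with degree=2 the output includes the original columns, their squares, and the pairwise interaction term (column naming may differ).

.. code-block::

    from bigframes.ml.preprocessing import PolynomialFeatures
    import bigframes.pandas as bpd

    poly = PolynomialFeatures(degree=2)
    data = bpd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})
    poly.fit(data)

    # Expanded degree-2 features: a, b, a^2, a*b, b^2.
    print(poly.transform(data))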

StandardScaler

StandardScaler()

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as z = (x - u) / s, where u is the mean of the training samples and s is the standard deviation of the training samples.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

Examples:

.. code-block::

    from bigframes.ml.preprocessing import StandardScaler
    import bigframes.pandas as bpd

    scaler = StandardScaler()
    data = bpd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 0, 1, 1]})
    scaler.fit(data)
    print(scaler.transform(data))
    print(scaler.transform(bpd.DataFrame({"a": [2], "b": [2]})))