Transformers that prepare data for other estimators. This module is styled after scikit-learn's preprocessing module: https://scikit-learn.org/stable/modules/preprocessing.html.
Classes
KBinsDiscretizer
KBinsDiscretizer(n_bins: int = 5, strategy: typing.Literal["uniform", "quantile"] = "quantile")
Bin continuous data into intervals.
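To make the idea of binning concrete, here is a minimal plain-Python sketch of equal-width ("uniform") binning. It is not the bigframes implementation: the helper names `uniform_bin_edges` and `assign_bin` are hypothetical, the integer bin indices are an assumption about output format, and the "quantile" strategy (which places edges so that bins hold roughly equal counts) is not modeled.

```python
def uniform_bin_edges(values, n_bins=5):
    """Return n_bins + 1 equally spaced edges spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def assign_bin(x, edges):
    """Return the 0-based index of the interval containing x.

    Values outside the fitted range are clamped to the first or last bin
    (an assumption made for this sketch).
    """
    n_bins = len(edges) - 1
    for i in range(n_bins):
        if x < edges[i + 1]:
            return i
    return n_bins - 1


train = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
edges = uniform_bin_edges(train, n_bins=5)
bins = [assign_bin(x, edges) for x in train]
print(edges)
print(bins)
```

The edges are computed once from the training data and then reused for any later value, which mirrors the fit/transform split used by the classes on this page.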
LabelEncoder
LabelEncoder(min_frequency: typing.Optional[int] = None, max_categories: typing.Optional[int] = None)
Encode target labels with value between 0 and n_classes-1.
This transformer should be used to encode target values, i.e. y, and not the input X.
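As a rough illustration of label encoding, the sketch below maps each distinct label to an integer in 0..n_classes-1. It follows the scikit-learn convention of numbering labels in sorted order, which is an assumption here; the bigframes class may number labels differently (for instance by reserving an index for unseen values), and `min_frequency` and `max_categories` are not modeled. The helper name `fit_label_mapping` is hypothetical.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order (assumed convention)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


y_train = ["cat", "dog", "bird", "dog"]
mapping = fit_label_mapping(y_train)
encoded = [mapping[label] for label in y_train]
print(mapping)
print(encoded)
```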
MaxAbsScaler
MaxAbsScaler()
Scale each feature by its maximum absolute value.
This estimator scales each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
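The arithmetic behind this scaling can be sketched in a few lines of plain Python. This is an illustration of the formula only, not the bigframes implementation; the helper names and the guard for an all-zero column are assumptions made for the sketch.

```python
def fit_max_abs(column):
    """Return the largest absolute value seen in the training column."""
    return max(abs(v) for v in column)


def max_abs_scale(column, max_abs):
    """Divide every value by the fitted maximum absolute value."""
    divisor = max_abs or 1.0  # assumed guard: leave an all-zero column unchanged
    return [v / divisor for v in column]


train = [-4.0, 0.0, 2.0]
m = fit_max_abs(train)
scaled_train = max_abs_scale(train, m)
scaled_new = max_abs_scale([8.0], m)
print(scaled_train)
print(scaled_new)
```

Because the data is only divided and never shifted, zeros stay zero, which is why sparsity is preserved. Values outside the training range can land outside [-1, 1].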
MinMaxScaler
MinMaxScaler()
Transform features by scaling each feature to a given range.
This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
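A minimal sketch of min-max scaling follows, assuming the target range is [0, 1] (the constructor shown above takes no range argument, so that range is an assumption). The helper names and the constant-column guard are hypothetical, and this is not the bigframes implementation.

```python
def fit_min_max(column):
    """Return the minimum and maximum of the training column."""
    return min(column), max(column)


def min_max_scale(column, lo, hi):
    """Map lo to 0.0 and hi to 1.0, linearly in between."""
    span = (hi - lo) or 1.0  # assumed guard for a constant column
    return [(v - lo) / span for v in column]


train = [10.0, 15.0, 20.0]
lo, hi = fit_min_max(train)
scaled_train = min_max_scale(train, lo, hi)
scaled_new = min_max_scale([25.0], lo, hi)
print(scaled_train)
print(scaled_new)
```

The range guarantee holds only on the training set: a later value above the fitted maximum, such as 25.0 here, maps above 1.0.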
OneHotEncoder
OneHotEncoder(drop: typing.Optional[typing.Literal["most_frequent"]] = None, min_frequency: typing.Optional[int] = None, max_categories: typing.Optional[int] = None)
Encode categorical features as a one-hot format.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka 'one-of-K' or 'dummy') encoding scheme.
Note that this encoder deviates from scikit-learn: instead of producing sparse binary columns, the encoding is a single column of STRUCT<index INT64, value DOUBLE>.
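The index/value representation can be mimicked in plain Python. The sketch below assumes that categories are numbered from 1 in sorted order and that index 0 stands for a value not seen during fitting; both rules are inferred from the documented example output for this class rather than confirmed, and the helper names are hypothetical.

```python
def fit_category_index(column):
    """Number the distinct categories from 1, in sorted order (inferred rule)."""
    return {cat: i for i, cat in enumerate(sorted(set(column)), start=1)}


def encode(column, index):
    """Encode each value as a one-element list of {index, value} records.

    Unseen values get index 0 (inferred rule).
    """
    return [[{"index": index.get(v, 0), "value": 1.0}] for v in column]


index_a = fit_category_index(["Male", "Female", "Female"])
index_b = fit_category_index(["1", "3", "2"])
encoded_a = encode(["Female", "Male"], index_a)
encoded_b = encode(["1", "4"], index_b)
print(encoded_a)
print(encoded_b)
```

Each row carries only the position of its active category, so the width of the output does not grow with the number of categories the way a column-per-category layout would.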
Examples:
Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.

>>> from bigframes.ml.preprocessing import OneHotEncoder
>>> import bigframes.pandas as bpd
>>> enc = OneHotEncoder()
>>> X = bpd.DataFrame({"a": ["Male", "Female", "Female"], "b": ["1", "3", "2"]})
>>> enc.fit(X)
OneHotEncoder()
>>> print(enc.transform(bpd.DataFrame({"a": ["Female", "Male"], "b": ["1", "4"]})))
                onehotencoded_a               onehotencoded_b
0  [{'index': 1, 'value': 1.0}]  [{'index': 1, 'value': 1.0}]
1  [{'index': 2, 'value': 1.0}]  [{'index': 0, 'value': 1.0}]
<BLANKLINE>
[2 rows x 2 columns]

PolynomialFeatures
PolynomialFeatures(degree: int = 2)
Generate polynomial and interaction features.
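To show what "polynomial and interaction features" means, the sketch below expands one row into all products of its features up to a given degree. It is a plain-Python illustration, not the bigframes implementation; the helper name, the output ordering, and the omission of a constant bias term are assumptions.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row, degree=2):
    """Return every product of 1..degree features from row (repeats allowed)."""
    features = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(row)), d):
            features.append(prod(row[i] for i in combo))
    return features


expanded = polynomial_features([2.0, 3.0], degree=2)
print(expanded)
```

For a row (a, b) and degree 2, the terms are a, b, a*a, a*b and b*b; the a*b term is the interaction feature.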
StandardScaler
StandardScaler()
Standardize features by removing the mean and scaling to unit variance.
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.
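The formula z = (x - u) / s and the fit-then-transform flow can be sketched with the standard library. This assumes the population standard deviation, as scikit-learn uses; whether bigframes uses the population or sample form is not stated here, so treat that choice, the helper names, and the zero-variance guard as assumptions.

```python
from statistics import fmean, pstdev


def fit_standard(column):
    """Return the mean and population standard deviation of the training column."""
    return fmean(column), pstdev(column)


def standardize(column, mean, std):
    """Apply z = (x - mean) / std using the stored training statistics."""
    divisor = std or 1.0  # assumed guard for a constant column
    return [(v - mean) / divisor for v in column]


train = [0, 0, 1, 1]
mean, std = fit_standard(train)
z_train = standardize(train, mean, std)
z_new = standardize([2], mean, std)
print(mean, std)
print(z_train)
print(z_new)
```

The new value 2 is scored against the statistics learned from the training column, not against its own mean, which is the point of storing them for later use with transform.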
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
Examples:
.. code-block::

    from bigframes.ml.preprocessing import StandardScaler
    import bigframes.pandas as bpd

    scaler = StandardScaler()
    data = bpd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 0, 1, 1]})
    scaler.fit(data)
    print(scaler.transform(data))
    print(scaler.transform(bpd.DataFrame({"a": [2], "b": [2]})))