mlflow.pyspark.ml

mlflow.pyspark.ml.autolog(log_models=True, log_datasets=True, disable=False, exclusive=False, disable_for_unsupported_versions=False, silent=False, log_post_training_metrics=True, registered_model_name=None, log_input_examples=False, log_model_signatures=True, log_model_allowlist=None, extra_tags=None)[source]

Note

Autologging is known to be compatible with the following package versions: 3.1.2 <= pyspark <= 3.5.3. Autologging may not succeed when used with package versions outside of this range.

Enables (or disables) and configures autologging for pyspark ml estimators. This method is not thread-safe. This API requires Spark 3.0 or above.

When is autologging performed?

Autologging is performed when you call Estimator.fit, except for estimators (featurizers) under pyspark.ml.feature.
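A minimal sketch of the flow, assuming a local Spark session (the toy data and column names below are illustrative):

    import mlflow.pyspark.ml
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    mlflow.pyspark.ml.autolog()

    df = spark.createDataFrame(
        [(1.0, 2.0, 5.0), (2.0, 3.0, 8.0), (3.0, 4.0, 11.0)], ["x1", "x2", "y"]
    )
    # Assemble features; fit() calls on estimators under pyspark.ml.feature
    # would not be autologged.
    train_df = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)

    lr = LinearRegression(featuresCol="features", labelCol="y")
    # Autologging happens here: params and tags are recorded on a run, and the
    # fitted model is logged because LinearRegressionModel is in the default allowlist.
    model = lr.fit(train_df)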

Logged information
Parameters
  • Parameters obtained by estimator.params. If a param value is itself an Estimator, the params of the wrapped estimator are also logged; the nested param key is {estimator_uid}.{param_name}

Tags
  • An estimator class name (e.g. “LinearRegression”).

  • A fully qualified estimator class name (e.g. “pyspark.ml.regression.LinearRegression”).

Post training metrics

When users call evaluator APIs after model training, MLflow tries to capture the Evaluator.evaluate results and log them as MLflow metrics to the Run associated with the model. All pyspark ML evaluators are supported.

For post training metrics autologging, the metric key format is: “{metric_name}[-{call_index}]_{dataset_name}”

  • The metric name is the name returned by Evaluator.getMetricName()

  • If multiple calls are made to the same pyspark ML evaluator metric, each subsequent call adds a “call_index” (starting from 2) to the metric key.

  • MLflow uses the prediction input dataset variable name as the “dataset_name” in the metric key. The “prediction input dataset variable” refers to the variable that was passed as the dataset argument to the model.transform call. Note: MLflow inspects the outermost call frame to capture the “prediction input dataset” instance and fetch its variable name. If the “prediction input dataset” instance is an intermediate expression without a defined variable name, the dataset name is set to “unknown_dataset”. If multiple “prediction input dataset” instances have the same variable name, subsequent ones append an index (starting from 2) to the inspected dataset name.
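A minimal sketch of how a metric key is assembled, continuing from the fit sketch above (the variable names eval_df and predictions are illustrative):

    from pyspark.ml.evaluation import RegressionEvaluator

    eval_df = train_df  # this variable name becomes the "dataset_name" in the metric key
    predictions = model.transform(eval_df)

    evaluator = RegressionEvaluator(labelCol="y", predictionCol="prediction", metricName="rmse")
    evaluator.evaluate(predictions)  # logged to the training run as "rmse_eval_df"
    evaluator.evaluate(predictions)  # a repeated call on the same metric: "rmse-2_eval_df"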

Limitations
  • MLflow cannot find run information for other objects derived from a given prediction result (e.g. by doing some transformation on the prediction result dataset).

Artifacts
  • An MLflow Model with the mlflow.spark flavor containing a fitted estimator (logged by mlflow.spark.log_model()). Note that large models may not be autologged for performance and storage space considerations, and autologging for Pipelines and hyperparameter tuning meta-estimators (e.g. CrossValidator) is not yet supported. See log_models param below for details.

  • For post training metrics API calls, a “metric_info.json” artifact is logged. This is a JSON object whose keys are MLflow post training metric names (see “Post training metrics” section for the key format) and whose values are the corresponding evaluator information, including evaluator class name and evaluator params.

How does autologging work for meta estimators?

When a meta estimator (e.g. Pipeline, CrossValidator, TrainValidationSplit, OneVsRest) calls fit(), it internally calls fit() on its child estimators. Autologging does NOT perform logging on these constituent fit() calls.

An “estimator_info.json” artifact is logged; it contains a hierarchy entry describing the meta estimator’s hierarchy, with expanded entries for all nested stages (e.g. nested pipeline stages).
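A minimal sketch of a Pipeline fit under autologging (toy data and column names are illustrative; it assumes autologging has been enabled as in the first sketch):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0.0, 1.0, 0.0), (1.0, 2.0, 1.0), (2.0, 1.0, 0.0), (3.0, 3.0, 1.0)],
        ["x1", "x2", "label"],
    )

    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),
        LogisticRegression(featuresCol="features", labelCol="label"),
    ])
    # One run is created for this outer fit() call; the stages' inner fit() calls
    # are not logged separately. The run carries an "estimator_info.json" artifact
    # describing the Pipeline hierarchy, including the nested stages.
    pipeline_model = pipeline.fit(df)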

Parameter search

In addition to recording the information discussed above, autologging for parameter search meta estimators (CrossValidator and TrainValidationSplit) records child runs with metrics for each set of explored parameters, as well as artifacts and parameters for the best model and the best parameters (if available). For better readability, the “estimatorParamMaps” param of a parameter search estimator is recorded inside the “estimator_info.json” artifact, as described below:

  • The “estimator_info.json” artifact records, in addition to the “hierarchy”, two more items: “tuning_parameter_map_list” (a list of all parameter maps used in tuning) and “tuned_estimator_parameter_map” (the parameter map of the tuned estimator).

  • A “best_parameters.json” artifact is logged, containing the best parameters found by the search.

  • A “search_results.csv” artifact is logged, containing the search results as a table with two columns: “params” and “metric”.
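A minimal sketch of a parameter search under autologging (assuming autologging is enabled and the spark session from the first sketch; the data and parameter grid are illustrative):

    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    data = spark.createDataFrame(
        [(Vectors.dense([float(i)]), 2.0 * i + 1.0) for i in range(8)],
        ["features", "label"],
    )

    lr = LinearRegression(featuresCol="features", labelCol="label")
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=RegressionEvaluator(labelCol="label"),
        numFolds=2,
    )
    # The parent run records "estimator_info.json", "best_parameters.json", and
    # "search_results.csv"; each of the two parameter maps gets a child run with
    # its averaged cross-validation metric.
    cv_model = cv.fit(data)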

Parameters
  • log_models – If True, trained models that are in the allowlist are logged as MLflow model artifacts. If False, trained models are not logged. Note: the built-in allowlist excludes some models (e.g. ALS models) that can be large. To specify a custom allowlist, create a file containing a newline-delimited list of fully-qualified estimator classnames, and set the “spark.mlflow.pysparkml.autolog.logModelAllowlistFile” Spark config to the path of your allowlist file (see the sketch after this parameter list).

  • log_datasets – If True, dataset information is logged to MLflow Tracking. If False, dataset information is not logged.

  • disable – If True, disables the pyspark ML autologging integration. If False, enables the pyspark ML autologging integration.

  • exclusive – If True, autologged content is not logged to user-created fluent runs. If False, autologged content is logged to the active fluent run, which may be user-created.

  • disable_for_unsupported_versions – If True, disable autologging for versions of pyspark that have not been tested against this version of the MLflow client or are incompatible.

  • silent – If True, suppress all event logs and warnings from MLflow during pyspark ML autologging. If False, show all events and warnings during pyspark ML autologging.

  • log_post_training_metrics – If True, post training metrics are logged. Defaults to True. See the post training metrics section for more details.

  • registered_model_name – If given, each time a model is trained, it is registered as a new model version of the registered model with this name. The registered model is created if it does not already exist.

  • log_input_examples – If True, input examples from training datasets are collected and logged along with pyspark ml model artifacts during training. If False, input examples are not logged.

  • log_model_signatures

    If True, ModelSignatures describing model inputs and outputs are collected and logged along with spark ml pipeline/estimator artifacts during training. If False, signatures are not logged.

    Warning

    Currently, only scalar Spark data types are supported. If model inputs/outputs contain non-scalar Spark data types such as pyspark.ml.linalg.Vector, signatures are not logged.

  • log_model_allowlist

    If given, it overrides the default log model allowlist in mlflow. This takes precedence over the “spark.mlflow.pysparkml.autolog.logModelAllowlistFile” Spark configuration (see the sketch after this parameter list).

    The default log model allowlist in mlflow
    # classification
    pyspark.ml.classification.LinearSVCModel
    pyspark.ml.classification.DecisionTreeClassificationModel
    pyspark.ml.classification.GBTClassificationModel
    pyspark.ml.classification.LogisticRegressionModel
    pyspark.ml.classification.RandomForestClassificationModel
    pyspark.ml.classification.NaiveBayesModel
    
    # clustering
    pyspark.ml.clustering.BisectingKMeansModel
    pyspark.ml.clustering.KMeansModel
    pyspark.ml.clustering.GaussianMixtureModel
    
    # Regression
    pyspark.ml.regression.AFTSurvivalRegressionModel
    pyspark.ml.regression.DecisionTreeRegressionModel
    pyspark.ml.regression.GBTRegressionModel
    pyspark.ml.regression.GeneralizedLinearRegressionModel
    pyspark.ml.regression.LinearRegressionModel
    pyspark.ml.regression.RandomForestRegressionModel
    
    # Featurizer model
    pyspark.ml.feature.BucketedRandomProjectionLSHModel
    pyspark.ml.feature.ChiSqSelectorModel
    pyspark.ml.feature.CountVectorizerModel
    pyspark.ml.feature.IDFModel
    pyspark.ml.feature.ImputerModel
    pyspark.ml.feature.MaxAbsScalerModel
    pyspark.ml.feature.MinHashLSHModel
    pyspark.ml.feature.MinMaxScalerModel
    pyspark.ml.feature.OneHotEncoderModel
    pyspark.ml.feature.RobustScalerModel
    pyspark.ml.feature.RFormulaModel
    pyspark.ml.feature.StandardScalerModel
    pyspark.ml.feature.StringIndexerModel
    pyspark.ml.feature.VarianceThresholdSelectorModel
    pyspark.ml.feature.VectorIndexerModel
    pyspark.ml.feature.UnivariateFeatureSelectorModel
    
    # composite model
    pyspark.ml.classification.OneVsRestModel
    
    # pipeline model
    pyspark.ml.pipeline.PipelineModel
    
    # Hyper-parameter tuning
    pyspark.ml.tuning.CrossValidatorModel
    pyspark.ml.tuning.TrainValidationSplitModel
    
    # SynapseML models
    synapse.ml.cognitive.*
    synapse.ml.exploratory.*
    synapse.ml.featurize.*
    synapse.ml.geospatial.*
    synapse.ml.image.*
    synapse.ml.io.*
    synapse.ml.isolationforest.*
    synapse.ml.lightgbm.*
    synapse.ml.nn.*
    synapse.ml.opencv.*
    synapse.ml.stages.*
    synapse.ml.vw.*
    

  • extra_tags – A dictionary of extra tags to set on each managed run created by autologging.
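A minimal sketch of both allowlist customization paths mentioned above (the file path and class name are illustrative; log_model_allowlist is assumed here to accept an iterable of fully-qualified class names):

    import mlflow.pyspark.ml
    from pyspark.sql import SparkSession

    # Option 1: point the Spark config at a newline-delimited allowlist file.
    with open("/tmp/mlflow_allowlist.txt", "w") as f:
        f.write("pyspark.ml.regression.LinearRegressionModel\n")

    spark = (
        SparkSession.builder
        .config(
            "spark.mlflow.pysparkml.autolog.logModelAllowlistFile",
            "/tmp/mlflow_allowlist.txt",
        )
        .getOrCreate()
    )
    mlflow.pyspark.ml.autolog(log_models=True)

    # Option 2: pass the allowlist directly; this takes precedence over the Spark
    # config above. (Assumed: an iterable of fully-qualified class name strings.)
    mlflow.pyspark.ml.autolog(
        log_models=True,
        log_model_allowlist=["pyspark.ml.regression.LinearRegressionModel"],
    )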