Assisted Machine Learning Analysis

The goal of this notebook is to build models that quantify the influence of static variables on a dynamic variable. In this example we predict the EUR (estimated ultimate recovery, taken here as the maximum cumulative oil production per well), but any other dynamic variable can be predicted.

For more detail, the documentation contains a section dedicated to explaining the different concepts used here.

Imports

[1]:
import frflib
import sklearn
from frflib.utils.ml_analysis.optimizer import MLDataset

from frflib.utils.ml_analysis.processing import (
    make_static_list,
    make_pipeline,
    make_dict_preprocessing,
    ONE_HOT_ENCODER,
    TARGET_ENCODER,
    REGROUP_ENCODER,
    XYSCALER,
    MINMAX,
    STANDARD_SCALER,
    UNCHANGED,
)
from frflib.utils.ml_analysis.analysis import (
    find_best_model_param,
    model_analysis,
    fit_model,
    build_model_summary,
)
from frflib.utils.ml_analysis.simulation import (
    get_truncated_normal,
    get_percentile,
    discretize,
    check_enough_data,
    generate_df_simulation,
)
from sklearn.model_selection import train_test_split

from frflib.data_class.input_data import InputData
from frflib.forecast_methods import forecast_wf
from frflib.plots.plotly_ import plot_scatter
import plotly.express as px


ForecastWF = forecast_wf.ForecastWF
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.figure_factory as ff
import os
import inspect
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
/env/lib/python3.8/site-packages/category_encoders/target_encoder.py:92: FutureWarning: Default parameter min_samples_leaf will change in version 2.6.See https://github.com/scikit-learn-contrib/category_encoders/issues/327
  warnings.warn("Default parameter min_samples_leaf will change in version 2.6."
/env/lib/python3.8/site-packages/category_encoders/target_encoder.py:97: FutureWarning: Default parameter smoothing will change in version 2.6.See https://github.com/scikit-learn-contrib/category_encoders/issues/327
  warnings.warn("Default parameter smoothing will change in version 2.6."

Load data

[2]:
PATH_FRFLIB = os.path.dirname(inspect.getsourcefile(frflib))
PATH_TO_DATA = os.path.join(PATH_FRFLIB, "sample_data")

project_id = "01"
wf_name = "dataset_01_simple_wf"

WF_PATH = os.path.join(PATH_TO_DATA, f"{wf_name}.hdf")
DATA_PATH = os.path.join(PATH_TO_DATA, f"dataset_{project_id}.hdf")
[3]:
field_data = InputData.load_hdf(DATA_PATH)
df_st = field_data.df_static
wf = ForecastWF.load_from_hdf(WF_PATH, field_data)
field_data._compute_indicators()
wf.enrich_data()

df_static = field_data.df_static
df_static
[3]:
production_name wellbore reservoir sub_group ptr_main_layer_total ptr_main _layer ptr_secondary_layer cluster avg_poro_vv avg_vcl_vv ... z_bh_tvdss_m.1 top_res_sstvd_m top perfo tvd_ss_ft z_wh_tvdss_m.1 sensor_depth_tvdss init_res_pr_bar init_depletion_pr_bar start_up_date year_start first_prod_days
wellname
well_0 WA12 WA12 Deltaic B B22 B2 None WA 0.54 0.39 ... 1077.26 1036.50 1063.72 497.48 962.016 519.86 4.04 2006-12-01 2006 2191
well_1 JD28 JD28 Fluvial D D2 D2 None JD 0.77 0.23 ... 1115.73 1003.26 1087.60 332.30 982.500 550.00 1.01 2014-05-01 2014 4899
well_10 SA29 SA29_st1 Fluvial D D3 D1 D2 SA 0.64 0.39 ... 1295.13 1231.86 1308.72 626.28 1209.727 592.00 4.63 2011-08-01 2011 3895
well_100 RA08 RA08 Fluvial D D3 D3 None RA 0.82 0.26 ... 1511.14 1507.18 1444.31 556.99 1329.310 680.00 6.41 2009-03-01 2009 3012
well_101 JB32 JB32_st1 Fluvial D D2 D1 None JB 0.70 0.32 ... 1171.23 1105.51 NaN 375.00 NaN 490.00 8.07 2017-01-01 2017 5875
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
well_93 YC13 YC13 Deltaic B B22 B2 None YC 0.68 0.30 ... 1166.93 1093.93 1131.87 505.13 1021.972 540.00 4.29 2010-09-01 2010 3561
well_94 VD17 VD17 Deltaic B B22 B2 None VD 0.64 0.40 ... 1121.12 1049.24 1101.96 511.04 1039.875 570.00 0.94 2003-03-01 2003 820
well_95 YA18 YA18_st2 Fluvial E E1 E1 None YA 0.51 0.41 ... 1452.83 1463.29 1480.16 479.44 1352.500 670.00 5.85 2007-08-01 2007 2434
well_97 P136 P136 Fluvial D D3 D3 None P1 0.74 0.30 ... 1301.07 1309.64 1329.15 561.55 1249.500 657.00 2.37 2013-08-01 2013 4626
well_99 P302 P302_st1 Fluvial D D3 D3 None P3 0.73 0.22 ... 1320.42 1311.99 1334.13 535.87 1246.300 610.00 5.68 2016-08-01 2016 5722

176 rows × 38 columns
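The dynamic data (the per-well time series we aggregate in the next part) can be inspected in the same way; for example:

field_data.df_dynamic.head()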

I. Create Model

The first step is to create a model. We choose an estimator, build a pipeline with some preprocessing transformations, and then train it.

Parameters

The goal is to predict a dynamic variable. You can select here the variable to predict and the models to use. You can add multiple models, but each extra model increases the run time. The available models are:

- GradientBoosting
- RandomForest
- Lasso
- Ridge

The helper below gathers the per-well predictions on the train and validation sets into a single dataframe; it is used later to score and plot the best model.

DATASETS = ["train", "validation"]


def create_df_result(best_model_dict):
    model_df = pd.DataFrame()
    for data_set in DATASETS:
        data = {
            key: values
            for key, values in best_model_dict[data_set].items()
            if type(values) == list
        }
        temp = pd.DataFrame(data=data).set_index("wellname")
        temp["dataset"] = data_set
        model_df = pd.concat([model_df, temp], axis=0)
    return model_df

[4]:
target = "oil_cum_mmbbl"
models = ["GradientBoosting", "Lasso", "RandomForest"]
metric = "neg_mean_squared_error"
available_metrics = sorted(sklearn.metrics.SCORERS.keys())
assert (
    metric in available_metrics
), f"metric not in sklearn metrics, please use one of {available_metrics}"
[5]:
# take the per-well maximum of each dynamic variable
# (for a cumulative variable this is its value at the last date)
df_predict = field_data.df_dynamic.groupby("wellname").max()
df = pd.concat([df_predict[target], df_static], axis=1)

Cleaning

In this part we drop the wells that have not started producing, i.e. those with a null target.

[6]:
df = df.loc[df[target].notnull(), :]
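Optionally, you can report how many wells were removed; a small bookkeeping sketch (assuming, as above, that df was built on the index of df_static):

# number of wells excluded because they have no recorded production
n_dropped = len(df_static) - len(df)
print(f"{len(df)} producing wells kept, {n_dropped} wells dropped")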

Preprocessing

We have implemented some transformers for basic preprocessing. By default we use the targetEncoder for categorical variables and minmax scaling for numerical columns.

All available transformers (see the sketch below for what target encoding does):

Categorical
- targetEncoder : replaces each category of a categorical column by the mean of the target
- OneHotEncoder

Numerical
- XYscaler : scaler to be used with geographical columns (x, y)
- StandardScaler
- minmax : rescales the data between 0 and 1
- unchanged : applies no rescaling to the data
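For intuition, here is a minimal sketch of what target encoding does, written with plain pandas on hypothetical toy data (the real targetEncoder from category_encoders additionally applies smoothing):

import pandas as pd

# hypothetical toy data: "reservoir" is categorical, "oil_cum_mmbbl" is the target
toy = pd.DataFrame(
    {
        "reservoir": ["Deltaic", "Fluvial", "Fluvial", "Deltaic"],
        "oil_cum_mmbbl": [2.0, 6.0, 4.0, 3.0],
    }
)
# replace each category by the mean of the target over that category
means = toy.groupby("reservoir")["oil_cum_mmbbl"].mean()
toy["reservoir_encoded"] = toy["reservoir"].map(means)
print(toy)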

Below, make_static_list builds, for each column, an entry with the default transformer that will be applied.

If you want to modify the preprocessing, you can follow the example below. You can do two things:

- drop a column from training (set Keep to False, as done for the columns listed in col_drop)
- choose another transformer for a column (see reservoir, switched to one-hot encoding)

[7]:
static_col = make_static_list(df.drop([target], axis=1))

col_drop = [
    "gross_lentgh_m",
    "x_wh_m",
    "y_wh_m",
    "z_wh_tvdss_m",
    "z_bh_tvdss_m",
    "z_bh_tvd_m",
    "azim_deg",
    "virgin_res_pr_bar",
    "last_res_pr_bar",
    "z_bh_tvdss_m.1",
    "top perfo tvd_ss_ft",
    "z_wh_tvdss_m.1",
    "sensor_depth_tvdss",
    "init_res_pr_bar",
    "init_depletion_pr_bar",
    "first_prod_days",
    "reservoir.1",
    "ptr_main_layer_total",
    "multi_lateral_flag",
    "ptr_main _layer",
    "ptr_secondary_layer",
    "well_Type",
    "production_name",
    "top_res_sstvd_m",
    "year_start",
    "ptr_main _layer",
    "net_length_m",
    "drill_length_md_m",
    #'reservoir',
    #'wellbore',
    # "avg_vcl_vv",
    # "avg_ntg_vv",
    # "sub_group",
    # "start_up_date",
    # "cluster",
    #'incl_max_deg',
    #'top_res_sstvd_m'
    # "x_bh_m",
    # "y_bh_m"
]
for col in static_col:
    if col["Name"] == "x_bh_m" or col["Name"] == "y_bh_m":
        col["Transform"] = XYSCALER
    if col["Name"] == "reservoir":
        col["Transform"] = ONE_HOT_ENCODER
    if col["Name"] == "top_res_sstvd_m":
        col["Transform"] = MINMAX
    # To choose to delete column from training
    if col["Name"] in col_drop:
        col["Keep"] = False

print("Use in model")
set(df.columns) - set(col_drop)
Use in model
[7]:
{'avg_ntg_vv',
 'avg_poro_vv',
 'avg_vcl_vv',
 'cluster',
 'incl_max_deg',
 'oil_cum_mmbbl',
 'reservoir',
 'start_up_date',
 'sub_group',
 'wellbore',
 'x_bh_m',
 'y_bh_m'}

Pipeline

We then create a sklearndf pipeline with the preprocessing options and the selected models.
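For readers unfamiliar with frflib's make_pipeline, a rough plain scikit-learn analogue of the numerical/categorical part of this preprocessing is a ColumnTransformer. This is a minimal sketch only (the frflib pipeline also handles the target and date-time encoders):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# rough scikit-learn equivalent of two entries of the dict_preprocessing
# built in the next cell
sketch_preprocessor = ColumnTransformer(
    transformers=[
        ("minmax", MinMaxScaler(), ["avg_poro_vv", "avg_vcl_vv", "avg_ntg_vv", "incl_max_deg"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["reservoir"]),
    ],
    remainder="drop",
)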

[8]:
dict_preprocessing = make_dict_preprocessing(static_col)
preprocessor = make_pipeline(dict_preprocessing)
dict_preprocessing
[8]:
{'numerical': {'minmax': ['avg_poro_vv',
   'avg_vcl_vv',
   'avg_ntg_vv',
   'incl_max_deg'],
  'normalize_standard': [],
  'XYscaler': ['x_bh_m', 'y_bh_m'],
  'unchanged': []},
 'categorical': {'OneHotEncoder': ['reservoir'],
  'targetEncoder': ['wellbore', 'sub_group', 'cluster']},
 'date_time': {'date_time_encoder': ['start_up_date']}}

Fit the models

Once the preprocessing is done you can launch the training of the models.

[9]:
# fit on models
# metrics available
# sorted(sklearn.metrics.SCORERS.keys())
# columns = [x["Name"] for x in static_col if x["Keep"] == True]
df_train, df_test = train_test_split(df, test_size=0.3, random_state=35)

data = MLDataset(
    df_train.drop([target], axis=1),
    df_test.drop([target], axis=1),
    df_train[target],
    df_test[target],
)
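A quick sanity check on the split sizes (plain pandas bookkeeping; the 70/30 proportion comes from test_size=0.3 above):

print(f"train wells: {len(df_train)}, test wells: {len(df_test)}")
print(f"train fraction: {len(df_train) / len(df):.0%}")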
[10]:
best_model_name, grid, pipe_processor, optimisation_per_model = find_best_model_param(
    data,
    preprocessor,
    metric=metric,
    budget=5,
    models=models,
    feature_selection=False,
    feature_penalisation=False,
)
data_out, model = fit_model(best_model_name, grid, data, pipe_processor)
model_summary = build_model_summary(data_out)
model_summary.update(
    {"optimization_metric": metric, "model": best_model_name, "params": grid}
)
(dict_dependence_plot, dict_summary_plot, shap_values, df_shap) = model_analysis(
    data_out, model, best_model_name
)
Default parameter min_samples_leaf will change in version 2.6.See https://github.com/scikit-learn-contrib/category_encoders/issues/327
Default parameter smoothing will change in version 2.6.See https://github.com/scikit-learn-contrib/category_encoders/issues/327
Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
[11]:
plt.figure(figsize=(12, 8))
best_score = np.inf
for model_name in models:
    score_list = optimisation_per_model[model_name]
    score = [x[0] for x in score_list]
    penalisation = [x[1] for x in score_list]
    best_score = min(best_score, min(score))
    plt.plot(score, label=f"score {model_name}")
    if np.sum(penalisation) > 0:
        plt.plot(penalisation, label=f"penalisation {model_name}")
plt.ylim(bottom=0)
plt.legend()
# report the minimum score across all models, not just the last one plotted
plt.title(f"Optimizer min score = {best_score:.1f} metric : {metric}")
plt.show()
../_images/notebooks_assisted_ml_analysis_21_0.png

II. Model Analysis

Result

Once the preprocessing is done you can launch the training of the models and then visualise the results of the best model. A good indicator is the R² score:

- r2 > 0.6 : really good, you have a strong model
- 0.4 < r2 < 0.6 : acceptable
- r2 < 0.4 : low, the model is not accurate enough

print(f"Best model {model_summary['model']} :\nParams = {model_summary['params']}")
df_result = create_df_result(model_summary)

metric_func = {"r2": r2_score, "mae": mean_absolute_error, "rmse": mean_squared_error}
result_score = {"train": {}, "validation": {}}

for dataset in result_score.keys():
    original, pred = (
        df_result[df_result.dataset == dataset].target,
        df_result[df_result.dataset == dataset].predict,
    )
    print(f"{dataset}")
    for key, func in metric_func.items():
        if key == "rmse":
            # squared=False makes mean_squared_error return the root mean
            # squared error, in the same units as the target
            score = func(original, pred, squared=False)
        else:
            score = func(original, pred)
        result_score[dataset].update({key: score})
        print(f"\t{key} = {score:.2f}")

fig_model = plot_scatter(
    df_result,
    "target",
    "predict",
    color="dataset",
    hover_name="wellname",
    same_scale=True,
    x_range=[0, 70],
    y_range=[0, 70],
    title="Model with all wells",
    idendity=True,
)
fig_model.show()

[12]:
# inspection of the best model (use best_model_name rather than the loop
# leftover model_name from the plotting cell above)
(
    dict_dependence_plots,
    dict_summary_plot,
    shap_values,
    df_shap,
) = model_analysis(data_out, model, best_model_name)

Model analysis

Now that we have a model, we are going to run some analyses to understand the influence of the static variables on the target.

Summary Plot

This plot gives a summary of the influence of, and the correlation between, the variables and the target. The x-axis is the mean absolute impact across all wells, in units of the target. The variables are sorted by impact, and the colour encodes the correlation with the target (positive or negative).
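As a cross-check, assuming df_shap returned by model_analysis holds one SHAP value per well and per feature (a wells × features dataframe), the x-axis quantity can be recomputed directly:

# mean absolute SHAP value per feature: the average magnitude of the
# feature's contribution to the prediction, in units of the target
mean_abs_impact = df_shap.abs().mean(axis=0).sort_values(ascending=False)
print(mean_abs_impact.head(10))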

[13]:
def make_summary_fig(dict_summary_plot, number_features_plot):
    summary = pd.DataFrame(data=dict_summary_plot["data"])
    # keep the most impactful features and reverse so the largest is plotted on top
    summary = summary[:number_features_plot][::-1]

    fig = px.bar(
        summary,
        x="mean_abs",
        y="features",
        color="correlation",
        title=f"Impact on {target}",
        orientation="h",
        color_continuous_scale="Bluered",
        range_color=(-1, 1),
    )
    return fig


number_features_plot = 15
fig = make_summary_fig(dict_summary_plot, number_features_plot)
fig.show()
[14]:
# This plot shows the impact of a variable on the target. Compared to the
# first plot, we can explore one variable at a time. On the x-axis you have
# the values of the selected variable (normalised between [0, 1]). On the
# y-axis you have the impact of the feature. This plot helps us understand
# the relationship between the target and the variable (linear, monotonic, etc.).
df_impact = pd.DataFrame(dict_dependence_plots["data"]).set_index("wellname")
# df_impact = pd.concat([df_impact, df], axis=1)

# column names are "impact_<variable>"; strip the "impact_" prefix (7 characters)
variable = [x[7:] for x in df_impact.columns if x[:6] == "impact"]
# plot the first two variables; drop the [0:2] slice to plot them all
for var in variable[0:2]:
    if var in df_impact.columns:
        f = px.scatter(df_impact, var, f"impact_{var}")
        f.show()
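To look at one specific variable instead of the first two, you can filter directly; a hypothetical example with avg_poro_vv, which is among the kept columns:

# dependence plot for a single chosen variable
var = "avg_poro_vv"
if var in df_impact.columns and f"impact_{var}" in df_impact.columns:
    px.scatter(df_impact, var, f"impact_{var}", title=f"Impact of {var} on {target}").show()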