Abstract
Despite the pipeline network being the safest mode of oil and gas transportation systems, the pipeline failure rate has increased significantly over the last decade, particularly for aging pipelines. Predicting failure risk and prioritizing the riskiest asset from a large set of pipelines is one of the demanding tasks for the utilities. Machine learning (ML) application in pipeline failure risk prediction has recently shown promising results. However, due to safety and security concerns, obtaining sufficient operation and failure data to train ML models accurately is a significant challenge. This study employed a Generative Adversarial Network (GAN) based framework to generate synthetic pipeline data (DSyn) using a subset (70%) of experimental burst test results data (DExp) compiled from the literature to overcome the limitation of accessing operational data. The proposed framework was tested on (1) real data, and (2) combined real and generated synthetic data. The burst failure risk of corroded oil and gas pipelines was determined using probabilistic approaches, and pipelines were classified into two classes depending on their probability of failure: (1) low failure risk (Pf: 0–0.5) and (2) high failure risk (Pf: >0.5). Two random forest (RF) models (MExp and MComb) were trained using a subset of 70% of actual experimental pipeline data, (DExp) and a combination of 70% of actual experimental and 100% of synthetic data, respectively. These models were validated on the remaining subset (30%) of experimental test data. The validation results reveal that adding synthetic data can further improve the performance of the ML models. The area under the ROC Curve was found to be 0.96 and 0.99 for real model (MExp) and combined model (MComb) data, respectively. The combined model with improved performance can be used in strategic oil and gas pipeline resilience improvement planning, which sets long-term critical decisions regarding maintenance and potential replacement of pipes.