Archive for the ‘Machine Learning’ Category

Optimization of wear parameters for ECAP-processed ZK30 alloy using response surface and machine learning … – Nature.com

Experimental results: Microstructure evolution

The inverse pole figure (IPF) coloring maps and associated band contrast (BC) maps of the ZK30 alloy in its AA and ECAPed conditions are shown in Fig. 2. High-angle grain boundaries (HAGBs) are colored black, while low-angle grain boundaries (LAGBs) are colored white for the AA condition and red for the 1P and 4Bc conditions, as shown in Fig. 2. The grain size distribution and misorientation angle distribution of the AA and ECAPed ZK30 samples are shown in Fig. 3. From Fig. 2a it is clear that the AA condition revealed a bimodal structure in which almost equiaxed refined grains coexist with coarse grains; the grain size ranged from 3.4 up to 76.7 µm (Fig. 3a) with an average grain size of 26.69 µm. In addition, a low fraction of LAGBs is depicted in Fig. 3b. Accordingly, the GB map (Fig. 2b) showed minimal LAGBs due to the recrystallization resulting from the annealing process. ECAP processing through 1P exhibited elongated grains alongside refined grains, with grain sizes ranging from 1.13 to 38.1 µm and an average grain size of 3.24 µm, which indicates that 1P resulted in partial recrystallization, as shown in Fig. 2c,d. As indicated in Fig. 2b, 1P processing produced a refinement of the average grain size of 87.8% compared with the AA condition. In addition, Fig. 2b makes clear that ECAP processing via 1P resulted in a significant increase in the grain aspect ratio due to the incomplete recrystallization. In terms of the LAGB distribution, the GB maps of the 1P condition revealed a significant increase in the LAGB fraction (Fig. 2d): an increase in LAGB density of 225% after processing via 1P compared to the AA sample (Fig. 2c). Accordingly, the UFG structure resulting from ECAP processing through 1P increased the fraction of LAGBs, which agrees with previous studies35,36. Shana et al.35 reported that during the early passes of ECAP, generation and multiplication of dislocations occurs, followed by entanglement of the dislocations forming LAGBs; hence, the density of LAGBs increased after processing through 1P. Accumulating plastic strain up to 4Bc revealed an almost fully UFG structure, indicating that 4Bc led to a complete dynamic recrystallization (DRX) process (Fig. 2e). The grain size ranged from 0.23 up to 11.7 µm with an average grain size of 1.94 µm (a decrease of 92.7% compared to the AA condition). On the other hand, 4Bc revealed a decrease in LAGB density of 25.4% compared to the 1P condition due to dynamic recovery. The decrease in LAGB density after processing through 4Bc was coupled with an increase in HAGBs of 4.4% compared to the 1P condition (Figs. 2f, 3b). Accordingly, the rise in HAGBs after multiple passes can be attributed to the transformation of LAGBs into HAGBs during the DRX process.

IPF coloring maps and their corresponding BC maps, superimposed, for the ZK30 billets in the AA condition (a,b) and after ECAP processing through (c,d) 1P and (e,f) 4Bc (HAGBs shown as black lines; LAGBs shown as white lines for AA and red lines for 1P and 4Bc).

Relative frequency of (a) grain size and (b) misorientation angle of all ZK30 samples.

Similar findings were reported in previous studies. Dumitru et al.36 reported that ECAP processing resulted in the accumulation and rearrangement of dislocations, forming subgrains and equiaxed grains with a UFG structure, and that a fully homogeneous, equiaxed grain structure for the ZK30 alloy was attained after the third pass. Furthermore, they reported that LAGBs are transformed into HAGBs during the multiple passes, which leads to the decrease in LAGB density. Figueiredo et al.37 reported that the grains evolve during the early passes of ECAP into a bimodal structure, while further processing passes result in a homogeneous UFG structure. Zhou et al.38 reported that increasing the number of processing passes generates new grain boundaries, which increases the misorientation to accommodate the deformation; the geometrically necessary dislocations (GNDs) constitute a part of the total dislocations associated with HAGBs and thus develop misorientations between neighboring grains. Tong et al.39 reported that the fraction of LAGBs decreases during multiple passes for a Mg–Zn–Ca alloy.

Figure 4a displays the X-ray diffraction (XRD) patterns of the AA ZK30 alloy and of the 1P and 4Bc ECAP-processed samples, revealing peaks corresponding to the primary α-Mg phase, Mg7Zn3, and MgZn2 phases in all processed alloys, with an absence of diffraction peaks corresponding to oxide inclusions. Following 1P ECAP, the α-Mg peak intensity exhibits an initial increase, succeeded by a decrease and fluctuations, signaling texture alterations along the Bc route. The identification of the MgZn2 phase is supported by the equilibrium Mg–Zn binary phase diagram40. However, the weakened peak intensity detected for the MgZn2 phase after the 4Bc ECAP process indicates that a significant portion of the MgZn2 dissolved into the Mg matrix, attributed to its poor thermal stability. Furthermore, the atomic ratio of Mg/Zn for the second phase is approximately 2.33, leading to the deduction that it is the Mg7Zn3 compound. This finding aligns with recent research on Mg–Zn alloys41. Additionally, the diffraction patterns of the ECAP-processed samples exhibit peak broadening and shifting, indicative of microstructural adjustments during plastic deformation. These alterations were analyzed for crystallite size and micro-strain using the modified Williamson–Hall (W–H) method42, as illustrated in Fig. 4b. After a single pass of ECAP, there is a reduction in crystallite size and an increase in induced micro-strain. After four passes of route Bc, further reductions in crystallite size and a higher micro-strain (36 nm and 1.94 × 10⁻³, respectively) are observed. Divergent shearing patterns among the processing routes, stemming from differences in sample rotation, result in distinct evolutions of the subgrain boundaries. Route Bc, characterized by the most extensive angular range of slip, generates subgrain bands along two shearing directions, expediting the transition of subgrain boundaries into high-angle grain boundaries43,44. Consequently, the dislocation density and induced micro-strains reach their maximum in route Bc, potentially influenced by texture modifications linked to orientation differences between the processing routes. Hence, as the number of ECAP passes increases, a higher level of deformation is observed, leading to dynamic recrystallization and grain refinement, particularly after four passes. This enhanced deformation effectively impedes grain growth. Consequently, the number of passes in the ECAP process is intricately linked to the equivalent strain, inducing grain-boundary pinning and resulting in the formation of finer grains. The grain refinement process can be conceptualized as a repetitive sequence of dynamic recovery and recrystallization in each pass. In the case of the 4Bc ECAP process, dynamic recrystallization dominates, leading to a highly uniform grain reduction and causing the grain boundaries to become less distinct45. Figure 4b indicates that the microstructural features vary with the ECAP processing route, aligning well with the grain size and mechanical properties.

(a) XRD patterns for the AA ZK30 alloy and after 1P and 4Bc ECAP processing; (b) variations of crystallite size and lattice strain as a function of processing condition using the Williamson–Hall method.

Figure 5 shows the volume loss (VL) and average coefficient of friction (COF) of the AA and ECAPed ZK30 alloy. The AA billets exhibited the highest VL at all wear parameters compared to the ECAPed billets, as shown in Fig. 5. Figure 5a reveals that performing the wear test at an applied load of 1 N produced the highest VL compared to the other applied loads. Increasing the applied load to 3 N gave a lower VL than the 1 N counterpart at all wear speeds, and a further increase of the applied load to 5 N produced a notable decrease in the VL. Similar behavior was obtained for the ECAP-processed billets through 1P (Fig. 5c) and 4Bc (Fig. 5e). The VL decreased with increasing applied load for all samples, as shown in Fig. 5, indicating an enhancement in wear resistance. Increasing the applied load increases the strain hardening of the ZK30 surfaces in contact, as reported by Yasmin et al.46 and Kori et al.47. Accordingly, increasing the applied load increases the friction force, which in turn hinders dislocation motion and results in higher deformation, so that ZK30 experiences strain hardening; hence, the resistance to abrasion increases, improving the wear resistance48. Furthermore, increasing the applied load increases the surface area in contact with the wear ball and hence increases the gripping action of asperities, which helps to reduce the wear rate of the ZK30 alloy, as reported by Thuong et al.48. On the contrary, increasing the wear speed increased the VL of the AA billets at all wear loads. For the ECAPed billet processed through 1P, the wear speed of 125 mm/s revealed the lowest VL while the wear speed of 250 mm/s showed the highest VL (Fig. 5c). Similar behaviour was recorded for the 4Bc condition. In addition, Fig. 5c shows that the 1P condition exhibited a higher VL compared to 4Bc (Fig. 5e) at all wear parameters, indicating that processing via multiple passes resulted in significant grain refinement (Fig. 2); hence, higher hardness and better wear behavior were attained, which agrees with a previous study7. In addition, Fig. 5 shows that increasing the wear speed generally increased the VL. For the AA billets tested at a 1 N load, the VL was 1.52 × 10⁻⁶ m³. ECAP processing via 1P significantly improved the wear behavior, as the VL was reduced by 85% compared to the AA condition, while straining through 4Bc improved the VL by 99.8% compared to the AA condition, which is accounted for by the considerable refinement that 4Bc provides. A similar trend was observed for the ECAPed ZK30 samples tested at loads of 3 and 5 N (Fig. 5). Accordingly, the significant grain refinement after ECAP processing (Fig. 2) increased the grain-boundary area; hence, a thicker protective oxide layer can be formed, improving the wear resistance of the ECAPed samples. It is worth mentioning here that the grain refinement, coupled with the refinement and redistribution of the secondary-phase particles resulting from ECAP processing through multiple passes, improves the hardness, wear behavior, and mechanical properties according to the Hall–Petch equation7,13,49. Similar findings were noted for the ZK30 billets tested at a 3 N load: processing through 1P and 4Bc decreased the VL by 85% and 99.85%, respectively, compared to the AA counterpart. A similar finding was recorded for the ZK30 billets tested at a 5 N load.

Volume loss of ZK30 alloy (a,c,e) and the average coefficient of friction (b,d,f) in its (a,b) AA, (c,d) 1P and (e,f) 4Bc conditions as a function of different wear parameters.

From Fig. 5, it can be noticed that the COF curves show notable fluctuations even after applying the least-squares method to smooth the data, confirming that the friction during testing of the ECAPed ZK30 alloy was not steady over time. The remarkable change in the COF can be attributed to the small applied loads on the surface of the ZK30 samples. Furthermore, the results of Fig. 5 reveal that ECAP processing reduced the COF, and hence better wear behavior was attained. For all ZK30 samples, it was observed that the highest applied load (5 N) coupled with the lowest wear time (110 s) exhibited the best COF and, consequently, the best wear behavior. These findings agree with Farhat et al.50, who reported that decreasing the grain size improves the COF and hence the wear behavior. Furthermore, they reported that plastic deformation occurs due to the friction between the contacting surfaces, which is resisted by the grain boundaries and fine secondary phases. In addition, the strain hardening resulting from ECAP processing decreases the COF and improves the VL50. Sankuru et al.43 reported that ECAP processing of pure Mg resulted in substantial grain refinement, which was reflected in improvements of both the microhardness and the wear rate of the ECAPed billets. Furthermore, they found that increasing the number of passes up to 4Bc reduced the wear rate by 50% compared to the AA condition. Based on the applied load, wear velocity, and distance, the wear mechanism can be classified into mild and severe wear regimes49. The wear test parameters in the present study (loads up to 5 N and speeds up to 250 mm/s) fall in the mild wear regime, where delamination wear and oxidation wear mechanisms predominantly take place43,51.

The worn surface morphologies of the ZK30 AA billet and the ECAPed billet processed through 4Bc are shown in Fig. 6. Figure 6 reveals that scores of wear grooves aligned parallel to the wear direction have developed on the worn surface in both the AA (Fig. 6a) and 4Bc (Fig. 6b) conditions. Accordingly, the worn surface comprises a combination of adhesion regions and plastic deformation bands along the wear direction. Furthermore, wear debris can be observed adhering to the ZK30 worn surface, indicating that an abrasion wear mechanism had occurred52. Lim et al.53 reported that hard particles between the contacting surfaces scratch the samples and remove small fragments, and hence the wear process occurs. In addition, Fig. 6a,b shows that the wear grooves on the AA billet were much wider than those of the 4Bc sample, which confirms the effectiveness of ECAP processing in improving the wear behavior of the ZK30 alloy. Based on the aforementioned findings, it can be concluded that the ECAP-processed billets exhibited enhanced wear behavior, which can be attributed to the obtained UFG structure52.

SEM micrographs of the worn surfaces after the wear test: (a) AA alloy; (b) ECAP-processed through 4Bc.

Several regression transformation approaches and associations among the independent variables were investigated in order to model the wear output responses. The association between the input parameters and the resulting responses was modeled using quadratic regression. The models created in the course of the experiment are considered statistically significant and can be used to forecast the response parameters in relation to the input control parameters when the coefficient of determination of prediction (R2) is as close to 1 as possible. The regression Eqs. (9)–(14) represent the predicted non-linear models of volume loss (VL) and coefficient of friction (COF) at different passes as a function of velocity (V) and applied load (P), with their associated determination and adjusted coefficients. The adjusted R2 and correlation coefficient R2 values of the current study ranged between 95.67 and 99.97%, which is extremely close to unity.

$$\text{AA}\left\{\begin{array}{ll} VL = 1.52067\times 10^{-6} - 1.89340\times 10^{-9}\,P - 4.81212\times 10^{-11}\,V + 8.37361\times 10^{-12}\,PV - 2.91667\times 10^{-10}\,P^{2} - 2.39989\times 10^{-14}\,V^{2} & (9)\\[4pt] \dfrac{1}{\text{COF}} = 2.72098 + 0.278289\,P - 0.029873\,V - 0.000208\,PV + 0.047980\,P^{2} + 0.000111\,V^{2} - 0.000622\,P^{2}V + 6.39031\times 10^{-6}\,PV^{2} & (10) \end{array}\right.$$

$$1\ \text{Pass}\left\{\begin{array}{ll} VL = 2.27635\times 10^{-7} + 7.22884\times 10^{-10}\,P - 2.46145\times 10^{-11}\,V - 1.03868\times 10^{-11}\,PV - 1.82621\times 10^{-10}\,P^{2} + 6.10694\times 10^{-14}\,V^{2} + 8.76819\times 10^{-13}\,P^{2}V + 2.48691\times 10^{-14}\,PV^{2} & (11)\\[4pt] \dfrac{1}{\text{COF}} = -0.383965 + 1.53600\,P + 0.013973\,V - 0.002899\,PV - 0.104246\,P^{2} - 0.000028\,V^{2} & (12) \end{array}\right.$$

$$4\ \text{Pass}\left\{\begin{array}{ll} VL = 2.29909\times 10^{-8} - 2.29012\times 10^{-10}\,P + 2.46146\times 10^{-11}\,V - 6.98269\times 10^{-12}\,PV - 1.98249\times 10^{-11}\,P^{2} - 7.08320\times 10^{-14}\,V^{2} + 3.23037\times 10^{-13}\,P^{2}V + 1.70252\times 10^{-14}\,PV^{2} & (13)\\[4pt] \dfrac{1}{\text{COF}} = 2.77408 - 0.010065\,P - 0.020097\,V - 0.003659\,PV + 0.146561\,P^{2} + 0.000099\,V^{2} & (14) \end{array}\right.$$
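For readers who want to evaluate these response-surface models directly, the short sketch below codes Eqs. (9) and (10) for the AA condition as plain Python functions. The coefficients are copied from the equations above; the function names and the example wear condition are illustrative only.

```python
# Sketch: evaluating the AA-condition response-surface models (Eqs. 9 and 10).
# Coefficients are transcribed from the equations above; names are illustrative.

def vl_aa(P, V):
    """Predicted volume loss (m^3) for the AA condition, Eq. (9)."""
    return (1.52067e-6
            - 1.89340e-9 * P
            - 4.81212e-11 * V
            + 8.37361e-12 * P * V
            - 2.91667e-10 * P**2
            - 2.39989e-14 * V**2)


def cof_aa(P, V):
    """Predicted coefficient of friction for the AA condition, Eq. (10)."""
    inv_cof = (2.72098
               + 0.278289 * P
               - 0.029873 * V
               - 0.000208 * P * V
               + 0.047980 * P**2
               + 0.000111 * V**2
               - 0.000622 * P**2 * V
               + 6.39031e-6 * P * V**2)
    return 1.0 / inv_cof


# Example wear condition: P = 5 N, V = 250 mm/s.
print(vl_aa(5, 250), cof_aa(5, 250))
```

Evaluating the sketch at P = 5 N and V = 250 mm/s returns a volume loss close to the 1.50085 × 10⁻⁶ m³ optimum reported later for the AA condition, which is a quick consistency check on the transcribed coefficients.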

The experimental data are plotted in Fig.7 as a function of the corresponding predicted values for VL and COF for zero pass, one pass, and four passes. The minimal output value is indicated by blue dots, which gradually change to the maximum output value indicated by red points. The effectiveness of the produced regression models was supported by the analysis of these maps, which showed that the practical and projected values matched remarkably well and that the majority of their intersection locations were rather close to the median line.

Comparison between VL and COF of experimental and predicted values of ZK30 at AA, 1P, and 4Bc.

As a consequence of the wear parameters (P and V), Fig. 8 displays 3D response plots created using the regression models to assess the changes in VL and COF at the various ECAP passes. For VL, the volume loss and the applied load exhibit an inverse proportionality at the various ECAP passes, as is apparent in Fig. 8a–c. It was observed that increasing the applied load in the wear process minimizes the VL, so the optimal VL was obtained at an applied load of 5 N. There is also an inverse relation between the wear speed V and the VL at the different ECAP passes. The billets with different numbers of passes clearly require different wear speeds: a higher number of passes requires a lower wear speed to minimize the VL. The minimal VL at zero passes is 1.50085 × 10⁻⁶ m³, obtained at 5 N and 250 mm/s. At a single pass, the optimal VL is 2.2266028 × 10⁻⁷ m³, obtained at 5 N and 148 mm/s. Finally, the minimum VL at four passes is 2.07783 × 10⁻⁸ m³ at 5 N and 64.5 mm/s.

Three-dimensional plot of VL (ac) and COF (df) of ZK30 at AA, 1P, and 4Bc.

Figure 8d–f presents the effect of the wear parameters P and V on the COF for the ECAPed ZK30 billets at zero, one, and four passes. There is an inverse proportionality between the applied load in the wear process and the coefficient of friction; as a result, the minimum optimum COF value of the ZK30 billet at the different numbers of passes was obtained at 5 N. On the other hand, the optimum wear speed decreased with the number of billet passes: the optimal wear test speeds for the billets at zero, one, and four passes are 250, 64.5, and 64.5 mm/s, respectively. The minimum COF at zero passes is 0.380134639, obtained at 5 N and 250 mm/s. At 5 N and 64.5 mm/s, the lowest COF at one pass is 0.220277466. Finally, the minimum COF at four passes is 0.23130154 at 5 N and 64.5 mm/s.

The previously mentioned modern ML algorithms have been used here to provide a solid foundation for analyzing the obtained data and gaining significant insights. The following section will give the results acquired by employing these approaches and thoroughly discuss the findings.

The correlation plots and correlation coefficients (Fig. 9) between the input variables (force and speed) and the six output variables (VL_P0, VL_P1, VL_P4, COF_P0, COF_P1, and COF_P4), used for the data preprocessing of the ML models, give valuable insights into the interactions between these variables. Correlation charts help to investigate the strength and direction of a linear relationship between model input and output variables. By inspecting the scatterplots, we can initially observe whether there is a positive, negative, or no correlation between each pair of variables. This knowledge aids in comprehending how changes in one variable affect changes in the other. In contrast, the correlation coefficient offers a numerical assessment of the strength and direction of the linear relationship. It ranges from −1 to 1, with values near −1 indicating a strong negative correlation, values close to 1 indicating a strong positive correlation, and values close to 0 indicating no or weak association. It is critical to examine the size and significance of the correlation coefficients when examining the correlation between the force and speed input variables and the six output variables. A high positive correlation coefficient implies that a rise in one variable is connected with an increase in the other, whereas a high negative correlation coefficient indicates that an increase in one variable is associated with a decrease in the other. Figure 9 shows that, for all ZK30 billets, both the VL and the COF were inversely proportional to the applied load (in the range of 1 up to 5 N). Regarding the wear speed, the VL of both the AA and 1P conditions was inversely proportional to the wear speed, while the 4Bc condition exhibited a direct proportionality to the wear speed (in the range of 64.5 up to 250 mm/s); the COF of all samples was inversely proportional to the wear speed. The VL of the AA condition (P0) revealed a strong negative correlation coefficient of 0.82 with the applied load, while it displayed an intermediate negative coefficient of 0.49 with the wear speed. For the 1P condition, the VL showed a strong negative correlation of 0.74 with the applied load, whereas it showed a very weak negative correlation coefficient of 0.13 with the speed. Furthermore, the VL of the 4Bc condition displayed a strong negative correlation of 0.99 with the applied load, while it displayed a weak positive correlation coefficient of 0.08 with the speed. A similar trend was observed for the COF: the AA, 1P, and 4Bc samples displayed intermediate negative coefficients of 0.047, 0.65, and 0.61, respectively, with the applied load, and weak negative coefficients of 0.4, 0.05, and 0.22, respectively, with the wear speed.

Correlation plots of input and output variables showcasing the strength and direction of relationships between each inputoutput variable using correlation coefficients.
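A minimal sketch of how such a correlation analysis can be reproduced with pandas is given below. The file name and column names are assumptions for illustration; the study's actual preprocessing scripts are not shown here.

```python
# Sketch: correlation analysis between wear inputs and responses.
# File and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("zk30_wear_data.csv")
inputs = ["force", "speed"]
outputs = ["VL_P0", "VL_P1", "VL_P4", "COF_P0", "COF_P1", "COF_P4"]

# Pearson correlation coefficients between each input and each output,
# analogous to the coefficients discussed for Fig. 9.
corr = df[inputs + outputs].corr().loc[inputs, outputs]
print(corr.round(2))

# Scatter-matrix overview of the pairwise relationships.
pd.plotting.scatter_matrix(df[inputs + outputs], figsize=(10, 10))
```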

Figure 10 shows the predicted train and test VL values compared to the original data, indicating that the VL prediction model performed well using the LR (Linear Regression) technique. The R2-score is a popular statistic for assessing the goodness of fit of a regression model. It runs from 0 to 1, with higher values indicating better performance. In this case, the R2-scores for both the training and test datasets range from 0.55 to 0.99, indicating that the ML model has established a significant correlation between the predicted VL values and the actual data. This shows that the model can account for a considerable percentage of the variability in the VL values.

Predicted train and predicted test VL versus actual data computed for different applied loads and number of passes of (a) 0P (AA), (b) 1P, and (c) 4Bc: evaluating the performance of the VL prediction best model achieved using LR algorithm.

The R2-scores for training and testing of the three distinct ML models for the output variables VL_P0, VL_P1, and VL_P4 are summarized in Fig. 11. The R2-score, also known as the coefficient of determination, is a number ranging from 0 to 1 that indicates how well the model fits the data. For VL_P0, the R2 for testing is 0.69 and that for training is 0.96, indicating that the ML model predicts the VL_P0 variable with reasonable accuracy on unseen data, while the R2 value of 0.96 for training suggests that the model fits the training data rather well. In summary, the performance of the ML models changes depending on the output variable. With R2 values of 0.98 for both training and testing, the model predicts 'VL_P4' with great accuracy. However, the model's performance for 'VL_P0' is reasonable, with an R2 score of 0.69 for testing and a high R2 score of 0.96 for training. The model's performance for 'VL_P1' is relatively poor, with R2 values of 0.55 for testing and 0.57 for training. Additional assessment measures must be considered to understand the models' prediction capabilities well. Therefore, as presented in the following section, we performed non-linear polynomial fitting and extracted equations that accurately link the output and input variables.

Result summary of ML train and test sets displaying R2-score for each model.

Furthermore, the data were subjected to polynomial fitting with first- and second-degree models (Fig. 12). The fitting accuracy was assessed using the R2-score, which ranged from 0.92 to 0.98, indicating a good fit. The following equations (Eqs. 15 to 17) were extracted from fitting the experimental dataset of the volume loss under different conditions of applied load (P) and speed (V):

$$VL_{\text{P0}} = 1.519\times 10^{-6} - 2.417\times 10^{-9}\,P - 3.077\times 10^{-11}\,V$$

(15)

$$VL_{\text{P1}} = 2.299\times 10^{-7} - 5.446\times 10^{-10}\,P - 5.431\times 10^{-11}\,V - 5.417\times 10^{-11}\,P^{2} + 2.921\times 10^{-12}\,PV + 1.357\times 10^{-13}\,V^{2}$$

(16)

$$VL_{\text{P4}} = 2.433\times 10^{-8} - 6.200\times 10^{-10}\,P + 1.042\times 10^{-12}\,V$$

(17)

Predicted versus actual (a) VL_P0 fitted to Eq. 15 with an R2-score of 0.92, (b) VL_P1 fitted to Eq. 16 with an R2-score of 0.96, (c) VL_P4 fitted to Eq. 17 with an R2-score of 0.98.
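The polynomial fits behind Eqs. (15)–(17) can be reproduced, in principle, with a standard pipeline of polynomial feature expansion and least-squares regression. The sketch below shows this for a single response; the data file and column names are assumptions, and the degree per response (1 or 2) follows the text above.

```python
# Sketch: first- and second-degree polynomial fits of volume loss against load and speed.
# File and column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("zk30_wear_data.csv")
X, y = df[["force", "speed"]], df["VL_P1"]

for degree in (1, 2):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
    model.fit(X, y)
    print(f"degree {degree}: R2 = {r2_score(y, model.predict(X)):.3f}")

# The intercept and coefficients of the degree-2 fit correspond to the terms of Eq. (16).
print(model[-1].intercept_, model[-1].coef_)
```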

Figure 13 depicts the predicted train and test coefficient of friction (COF) values plotted against the actual data. The figure assesses the performance of the best models obtained using the SVM (Support Vector Machine) and GPR (Gaussian Process Regression) algorithms for the various applied loads and numbers of passes (0, 1P, and 4Bc). By showing the predicted train and test COF values alongside the actual data and comparing predicted and actual points, we can see how closely the models match the true values. The ML models trained and evaluated on the output variables 'COF_P0', 'COF_P1', and 'COF_P4' using the SVM and GPR algorithms show high accuracy and performance, as summarized in Fig. 13. The R2 scores for testing vary from 0.97 to 0.99, showing that the models efficiently capture the variability of the predicted variables. Furthermore, the training R2 scores are consistently high at 0.99, demonstrating a solid fit to the training data. These findings imply that the ML models can accurately predict the values of 'COF_P0', 'COF_P1', and 'COF_P4' and generalize well to new unseen data.

Predicted train and predicted test COF versus actual data computed for different applied loads and number of passes of (a) 0P (AA), (b) 1P, and (c) 4Bc: evaluating the performance of the COF prediction best model achieved using SVM and GPR algorithms.
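As a rough illustration of the SVM- and GPR-based COF models described above, the following sketch fits both regressors with scikit-learn. The kernels, regularization settings, scaling, and column names are assumptions and not the authors' exact configuration.

```python
# Sketch: COF prediction with SVR and Gaussian-process regression.
# File, column names, kernels and hyperparameters are hypothetical placeholders.
import pandas as pd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

df = pd.read_csv("zk30_wear_data.csv")
X, y = df[["force", "speed"]], df["COF_P0"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
gpr = make_pipeline(StandardScaler(), GaussianProcessRegressor(kernel=ConstantKernel() * RBF()))

for name, model in [("SVR", svr), ("GPR", gpr)]:
    model.fit(X_train, y_train)
    # score() returns the R2 value on the given data.
    print(name, "train R2:", model.score(X_train, y_train), "test R2:", model.score(X_test, y_test))
```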

Figure14 presents a summary of the results obtained through machine learning modeling. The R2 values achieved for COF modeling using SVM and GPR are 0.99 for the training set and range from 0.97 to 0.99 for the testing dataset. These values indicate that the models have successfully captured and accurately represented the trends in the dataset.

Result summary of ML train and test sets displaying R2-score for each model.

The results of the RSM optimization carried out on the volume loss and coefficient of friction at zero passes (AA), along with the relevant variables, are shown in Appendix A-1. The red and blue dots represent the wear conditions (P and V) and the responses (VL and COF) for each of the ensuing optimization findings. The volume loss and coefficient of friction optimization objectives were set to "in range", using "minimize" as the solution target, and the expected result of the desirability function was of the smaller-is-better type. The values of (A) P = 5 N and (B) V = 250 mm/s were the optimal conditions for volume loss; Appendix A-1(a) shows that this resulted in the lowest attainable volume loss of 1.50127 × 10⁻⁶ m³. The optimal friction-coefficient conditions were (A) P = 2.911 N and (B) V = 250 mm/s, which led to the lowest possible coefficient of friction of 0.324575, as shown in Appendix A-1(b).

Appendix A-2 displays the outcomes of the RSM optimization performed on the volume loss and coefficient of friction at one pass, together with the corresponding variables. The volume loss and coefficient of friction optimization objectives were designed to be "in range", with "minimize" as the solution objective, and the desirability function was expected to provide smaller-is-better characteristics. The ideal conditions for volume loss were (A) P = 4.95 N and (B) V = 136.381 mm/s, yielding the lowest feasible volume loss of 2.22725 × 10⁻⁷ m³, as seen in Appendix A-2(a). The optimal P and V values for the coefficient of friction were found to be (A) P = 5 N and (B) V = 64.5 mm/s; as demonstrated in Appendix A-2(b), this resulted in the lowest achievable coefficient of friction of 0.220198.

Similarly, Appendix A-3 displays the outcomes of the RSM optimization performed on the volume loss and coefficient of friction at four passes, together with the corresponding variables. The volume loss and coefficient of friction optimization objectives were designed to be "in range", with "minimize" as the solution objective, and the desirability function was expected to provide smaller-is-better characteristics. The optimal conditions for volume loss were (A) P = 5 N and (B) V = 77.6915 mm/s, yielding the lowest feasible volume loss of 2.12638 × 10⁻⁸ m³, as seen in Appendix A-3(a). The optimal P and V values for the coefficient of friction were found to be (A) P = 4.95612 N and (B) V = 64.9861 mm/s; as seen in Appendix A-3(b), this resulted in the lowest achievable coefficient of friction of 0.235109.

The most appropriate combination of wear-independent factors that yields the minimum feasible volume loss and coefficient of friction was determined using a genetic algorithm (GA). Based on the genetic algorithm technique, the objective function for each response was obtained by taking Eqs. (9)–(14) and subjecting them to the wear boundary conditions on P and V. The recommended objective functions can be expressed as: minimize (VL, COF), subject to the ranges of wear conditions 1 ≤ P ≤ 5 (N) and 64.5 ≤ V ≤ 250 (mm/s).
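The study performs this minimization with MATLAB's GA solver. As a hedged, self-contained illustration of the same idea, the sketch below minimizes the AA-condition volume loss of Eq. (9) within the stated bounds using SciPy's differential evolution, which is a readily available evolutionary optimizer rather than the authors' tool.

```python
# Sketch: minimizing the AA-condition volume loss (Eq. 9) within the stated wear bounds.
# Differential evolution is used here only as an evolutionary stand-in for the MATLAB GA.
from scipy.optimize import differential_evolution

def vl_aa(x):
    P, V = x
    return (1.52067e-6 - 1.89340e-9 * P - 4.81212e-11 * V + 8.37361e-12 * P * V
            - 2.91667e-10 * P**2 - 2.39989e-14 * V**2)

bounds = [(1.0, 5.0), (64.5, 250.0)]   # 1 <= P <= 5 N, 64.5 <= V <= 250 mm/s
result = differential_evolution(vl_aa, bounds, seed=0)
print(result.x, result.fun)            # optimal (P, V) and the corresponding volume loss
```

Under these bounds the optimizer converges to P = 5 N and V = 250 mm/s, matching the GA optimum of about 1.50085 × 10⁻⁶ m³ reported below.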

Figures 15 and 16 show the performance of the GA optimization technique in terms of fitness value and the running solver view, derived from MATLAB, together with the corresponding wear conditions for the lowest VL and COF at zero passes. VL and COF were minimized using Eqs. (9) and (10), which were used as the fitness functions and subjected to the wear boundary limits. According to Fig. 15a, the lowest value of VL that the GA could find was 1.50085 × 10⁻⁶ m³ at P = 5 N and V = 249.993 mm/s. Furthermore, the GA yielded a minimum COF value of 0.322531 at P = 2.91 N and V = 250 mm/s (Fig. 15b).

Optimum VL (a) and COF (b) by GA at AA condition.

Optimum VL (a) and COF (b) by hybrid DOE-GA at AA condition.

The DOE–GA hybrid analysis was carried out to enhance the GA outcomes. The optimal wear conditions of VL and COF at zero passes were used to determine the initial populations of the hybrid DOE–GA. The hybrid DOE–GA yielded a minimum VL value of 1.50085 × 10⁻⁶ m³ at a speed of 249.993 mm/s and a load of 5 N (Fig. 16a). Similarly, at a load of 2.91 N and a speed of 250 mm/s, the hybrid DOE–GA yielded a minimum COF of 0.322531 (Fig. 16b).

The fitness function, as defined by Eqs. (11) and (12), was the minimization of VL and COF at 1P, subject to the wear boundary conditions. Figure 17a,b displays the optimal values of VL and COF found by the GA, which were 2.2266 × 10⁻⁷ m³ and 0.220278, respectively. The lowest VL was measured at 147.313 mm/s and 5 N, whereas 5 N and 64.5 mm/s were the optimum wear conditions for COF as determined by the GA. The hybrid DOE–GA results for the minimum VL and COF at a single pass were 2.2266 × 10⁻⁷ m³ and 0.220278, respectively, obtained at 147.313 mm/s and 5 N for VL, as shown in Fig. 18a, and at 5 N and 64.5 mm/s for COF, as shown in Fig. 18b.

Optimum VL (a) and COF (b) by GA at 1P condition.

Optimum VL (a) and COF (b) by hybrid DOE-GA at 1P condition.

Subject to the wear boundary conditions, the fitness function was the minimization of VL and COF at four passes, as defined by Eqs. (13) and (14). The optimum values of VL and COF found by the GA, shown in Fig. 19a,b, were 2.12638 × 10⁻⁸ m³ and 0.231302, respectively. The lowest VL was obtained at 5 N and 77.762 mm/s, whereas the GA found that the optimal wear conditions for COF were 5 N and 64.5 mm/s. In Fig. 20a,b, the hybrid DOE–GA findings for the minimum VL and COF at four passes were 2.12638 × 10⁻⁸ m³ and 0.231302, respectively. These results were achieved at 77.762 mm/s and 5 N for VL, and at 5 N and 64.5 mm/s for COF.

Optimum VL (a) and COF (b) by GA at 4Bc condition.

Optimum VL (a) and COF (b) by hybrid DOE-GA at 4Bc condition.

A mathematical model whose input process parameters influence the quality of the output responses was solved using the multi-objective genetic algorithm (MOGA) technique54. In the current study, the multi-objective optimization using a genetic algorithm (MOGA), with the regression models as objective functions, was implemented using the GA Toolbox in MATLAB 2020; the P and V input wear parameter values served as the upper and lower bounds, and the number of parameters was set to three. The following MOGA parameters were then selected: an initial population of fifty individuals, 300 generations, a migration interval of 20, a migration fraction of 0.2, and a Pareto fraction of 0.35. Constraint-dependent mutation and intermediate crossover with a crossover probability of 0.8 were used for the optimization. The outcome of MOGA is the Pareto optimum, also known as a non-dominated solution set: a group of solutions that considers all of the objectives without sacrificing any of them55.

MOGA, addressing both responses as a multi-objective function, was utilized to identify the lowest possible values of the volume loss and coefficient of friction at zero passes. Equations (9) and (10) were the fitness functions for the volume loss and coefficient of friction at zero passes for ZK30. The Pareto front values for the volume loss and coefficient of friction at zero passes, as determined by MOGA, are listed in Table 2. The Pareto chart points of the volume loss (Objective 1) and coefficient of friction (Objective 2) at zero passes are shown in Fig. 21. A reduction in the friction coefficient at the cost of excessive volume loss was observed; in other words, a decrease in the coefficient of friction can be traded for an increase in volume loss. For zero passes, the best volume loss was 1.50096 × 10⁻⁶ m³ with a sacrificed coefficient of friction of 0.402941, whereas the worst volume loss was 1.50541 × 10⁻⁶ m³ with the best coefficient of friction of 0.341073.
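For intuition about this trade-off, the sketch below evaluates Eqs. (9) and (10) on a grid of wear conditions and extracts the non-dominated (VL, COF) pairs. It only illustrates the Pareto-front concept and is not the MOGA implementation used in the study.

```python
# Sketch: grid-based Pareto front for (VL, COF) at zero passes using Eqs. (9) and (10).
import numpy as np

P = np.linspace(1.0, 5.0, 41)
V = np.linspace(64.5, 250.0, 41)
PP, VV = np.meshgrid(P, V)

VL = (1.52067e-6 - 1.89340e-9 * PP - 4.81212e-11 * VV + 8.37361e-12 * PP * VV
      - 2.91667e-10 * PP**2 - 2.39989e-14 * VV**2)
inv_cof = (2.72098 + 0.278289 * PP - 0.029873 * VV - 0.000208 * PP * VV
           + 0.047980 * PP**2 + 0.000111 * VV**2
           - 0.000622 * PP**2 * VV + 6.39031e-6 * PP * VV**2)
COF = 1.0 / inv_cof

points = np.column_stack([VL.ravel(), COF.ravel()])
# A point is non-dominated if no other point is at least as good in both objectives
# and strictly better in one of them.
pareto = [p for p in points
          if not np.any((points[:, 0] <= p[0]) & (points[:, 1] <= p[1]) &
                        ((points[:, 0] < p[0]) | (points[:, 1] < p[1])))]
print(len(pareto), "non-dominated (VL, COF) pairs")
```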

The genetic algorithm was also used for the multi-objective functions of minimal volume loss and coefficient of friction at one pass, with Eqs. (11) and (12) as the respective fitness functions. Table 3 displays the Pareto front points of the volume loss and coefficient of friction at one pass, and Fig. 22 presents the corresponding Pareto chart points of the volume loss (Objective 1) and coefficient of friction (Objective 2). It was found that the coefficient of friction decreases as the volume loss increases; as a result, the volume loss can be reduced at the expense of a higher coefficient of friction. The best volume loss for a single pass was 2.22699 × 10⁻⁷ m³ with a worst (maximum) coefficient of friction of 0.242371, while the best (minimum) coefficient of friction was 0.224776 at a volume loss of 2.23405 × 10⁻⁷ m³.

The multi-objective functions of minimal volume loss and coefficient of friction at four passes were handled with Eqs. (13) and (14), respectively, serving as the fitness functions. The Pareto front points of the volume loss and coefficient of friction at four passes are shown in Table 4, and the corresponding Pareto chart points of the volume loss (Objective 1) and coefficient of friction (Objective 2) are shown in Fig. 23. It was shown that as the volume loss increases, the coefficient of friction decreases; the volume loss can therefore be decreased, however, at the expense of an increased coefficient of friction. The best minimum coefficient of friction was 0.2313046 at a volume loss of 2.12663 × 10⁻⁸ m³, and the best minimum volume loss was 2.126397 × 10⁻⁸ m³ at a coefficient of friction of 0.245145 for four passes. In addition, Table 5 compares the wear response values obtained by DOE, RSM, GA, hybrid RSM-GA, and MOGA.

This section proposed the optimal wear parameters for the different responses, namely the VL and COF of ZK30. The presented optimal wear parameters, P and V, are based on previous studies of ZK30 that recommended applied loads from 1 to 30 N and speeds from 64.5 to 1000 mm/s. Table 6 presents the optimal conditions of the wear process for the different responses obtained by the genetic algorithm (GA).

Table 7 displays the validity of the wear regression models for VL under several circumstances. The validation of the wear models was achieved under various load and speed conditions. Based on the validation data, the volume loss response models had the lowest error percentage between the experimental and regression results and were the most accurate. Table 7 indicates that the data unambiguously confirm the predictive modeling performance, as shown by the reasonably high accuracy obtained, ranging from 69.7 to 99.9%.

Equations (15) to (17) provide insights into the relationship linking the volume loss with the applied load and speed, allowing us to understand how changes in these factors affect the volume loss in the given system. The validity of this modeling was further examined using a new, unseen dataset from which the prediction error and accuracy were calculated, as shown in Table 8. Table 8 demonstrates that the predictive modeling performance has been validated, as evidenced by the obtained accuracy ranging from 69.7 to 99.9%, which is reasonably high.

Original post:
Optimization of wear parameters for ECAP-processed ZK30 alloy using response surface and machine learning ... - Nature.com


Machine learning approach predicts heart failure outcome risk – HealthITAnalytics.com

April 22, 2024 - Researchers from the University of Virginia (UVA) have developed a machine learning tool designed to assess and predict adverse outcome risks for patients with advanced heart failure with reduced ejection fraction (HFrEF), according to a recent study published in the American Heart Journal.

The research team indicated that risk models for HFrEF exist, but few are capable of addressing the challenge of missing data or incorporating invasive hemodynamic data, limiting their ability to provide personalized risk assessments for heart failure patients.

"Heart failure is a progressive condition that affects not only quality of life but quantity as well," explained Sula Mazimba, MD, an associate professor of medicine at UVA and a cardiologist at UVA Health, in the news release. "All heart failure patients are not the same. Each patient is on a spectrum along the continuum of risk of suffering adverse outcomes. Identifying the degree of risk for each patient promises to help clinicians tailor therapies to improve outcomes."

Outcomes like weakness, fatigue, swollen extremities and death are of particular concern for heart failure patients, and the risk model is designed to stratify the risk of these events.

The tool was built using anonymized data pulled from thousands of patients enrolled in heart failure clinical trials funded by the National Institutes of Health (NIH) National Heart, Lung and Blood Institute (NHLBI).

Patients in the training and validation cohorts were categorized into five risk groups based on left ventricular assist device (LVAD) implantation or transplantation, rehospitalization within six months of follow-up and death, if applicable.

To make the model robust in the presence of missing data, the researchers trained it to predict patients' risk categories using either invasive hemodynamics alone or a feature set incorporating noninvasive hemodynamics data.

Prediction accuracy for each category was determined separately using area under the curve (AUC).

Overall, the model achieved high performance across all five categories. The AUCs ranged from 0.896 ± 0.074 to 0.969 ± 0.081 for the invasive hemodynamics feature set and from 0.858 ± 0.067 to 0.997 ± 0.070 for the set incorporating all features.

The research team underscored that the inclusion of hemodynamic data significantly aided the model's performance.

"This model presents a breakthrough because it ingests complex sets of data and can make decisions even among missing and conflicting factors," said Josephine Lamp, a doctoral researcher in the UVA School of Engineering's Department of Computer Science. "It is really exciting because the model intelligently presents and summarizes risk factors, reducing the decision burden so clinicians can quickly make treatment decisions."

The researchers have made their tool freely available online for researchers and clinicians in the hopes of driving personalized heart failure care.

In pursuit of personalized and precision medicine, other institutions are also turning to machine learning.

Last week, a research team from Clemson University shared how a deep learning tool can help researchers better understand how gene-regulatory network (GRN) interactions impact individual drug response.

GRNs map the interactions between genes, proteins and other elements. These insights are crucial for exploring how genetic variations influence a patient's phenotypes, such as drug response. However, many genetic variants linked to disease are in areas of DNA that don't directly code for proteins, creating a challenge for those investigating the role of these variants in individual health.

The deep learning-based Lifelong Neural Network for Gene Regulation (LINGER) tool helps address this by using single-cell multiome data to predict how GRNs work, which can shed light on disease drivers and drug efficacy.

View original post here:
Machine learning approach predicts heart failure outcome risk - HealthITAnalytics.com


Practical approaches in evaluating validation and biases of machine learning applied to mobile health studies … – Nature.com

In this section, we first describe how Ecological Momentary Assessments work and how they differentiate from assessments that are collected within a clinical environment. Second, we present the studies and ML use cases for each dataset. Next, we introduce the non-ML baseline heuristics and explain the ML preprocessing steps. Finally, we describe existing train-test-split approaches (cross-validation) and the splitting approaches at the user- and assessment levels.

Within this context, ecological means "within the subject's natural environment" and momentary means "within this moment", ideally in real time16. Assessments collected in research or clinical environments may cause recall bias in the subjects' answers and are not primarily designed to track changes in mood or behavior longitudinally. Ecological Momentary Assessments (EMA) thus increase validity and decrease recall bias. They are suitable for asking users in their daily environment about their state of being, which can change over time, by random or interval time sampling. Combining EMAs and mobile crowdsensing sensor measurements allows for multimodal analyses, which can gain new insights into, e.g., chronic diseases8,15. The datasets used within this work have EMA in common and are described in the following subsection.

From ongoing projects of our team, we are constantly collecting mHealth data as well as Ecological Momentary Assessments6,17,18,19. To investigate how the machine learning performance varies based on the splits, we wanted different datasets with different use cases. However, to increase comparability between the use cases, we created multi-class classification tasks.

We train each model using historical assessments: the oldest assessment was collected at time t_start, the latest historical assessment at time t_last. A current assessment is created and collected at time t_now, and a future assessment at time t_next. Depending on the study design, the actual point of time t_next may be some hours or a few weeks after t_now. For each dataset and for each user, we want to predict a feature (synonymously, a question of an assessment) at time t_next using the features at time t_now. This feature at time t_next is then called the target. For each use case, a model is trained using data between t_start and t_last and, given the input data from t_now, it predicts the target at t_next. Figure 1 gives a schematic representation of the relevant points of time t_start, t_last, t_now, and t_next.

At time t_start, the first assessment is given; t_last is the last known assessment used for training, whereas t_now is the currently available assessment used as input for the classifier, and the target is predicted at time t_next.
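A minimal sketch of how such (t_now, t_next) pairs can be constructed per user with pandas is shown below; the file and column names are illustrative and not the projects' actual schema.

```python
# Sketch: building (t_now features -> t_next target) pairs per user.
# File and column names are hypothetical placeholders.
import pandas as pd

ema = pd.read_csv("assessments.csv")
ema = ema.sort_values(["user_id", "timestamp"])

# The target at t_next is the same user's answer in the following assessment.
ema["target_next"] = ema.groupby("user_id")["target"].shift(-1)

# The last assessment of each user has no successor and cannot be used for training.
pairs = ema.dropna(subset=["target_next"])
```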

To increase comparability between the approaches, we used the same model architecture with the same pseudo-random initialisation. The model is a Random Forest classifier with 100 trees and the Gini impurity as the splitting criterion. The whole coding was done in Python 3.9, using mostly scikit-learn, pandas, and Jupyter Notebooks. Details can be found on GitHub in the supplementary material.
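A minimal sketch of this configuration in scikit-learn is shown below; the seed value is an assumption, since the exact initialisation is documented in the study's GitHub material.

```python
# Sketch: the shared model configuration described above.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,   # 100 trees
    criterion="gini",   # Gini impurity as splitting criterion
    random_state=42,    # fixed pseudo-random initialisation (seed value assumed here)
)
```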

For all datasets that we used in this study, we have ethical approvals (UNITI No. 20-1936-101, TYT No. 15-101-0204, Corona Check No. 71/20-me, and Corona Health No. 130/20-me). The following section provides an overview of the studies and the available datasets with their characteristics, and then describes each use case in more detail. A brief overview is given in Table 1, with baseline statistics for each dataset in Table 2.

To provide some more background about the studies: the analyses for all apps are based on the so-called EMA questionnaires (synonym: assessments), i.e., the questionnaires that are filled out multiple times in all apps and the respective studies. This can happen several times a day (e.g., for the tinnitus study TrackYourTinnitus (TYT)) or at weekly intervals (e.g., the studies in the Corona Health (CH) app). In all cases, the analysis is based on the recurring questionnaires, which collect symptoms over time and in the real environment through unforeseen (i.e., random) notifications.

The TrackYourTinnitus (TYT) dataset has the most filled-out assessments, with more than 110,000 questionnaires as of 2022-10-24. The Corona Check (CC) study has the most users; this is because each time an assessment is filled out, a new user can optionally be created. Notably, this app has the largest ratio of non-German users and the youngest user group with the largest standard deviation. The Corona Health (CH) app, with its studies on mental health for adults and adolescents and physical health for adults, has the highest proportion of German users because it was developed in collaboration with the Robert Koch Institute and was primarily promoted in Germany. Unification of Treatments and Interventions for Tinnitus Patients (UNITI) is a European Union-wide project whose overall aim is to deliver a predictive computational model based on existing and longitudinal data19. The dataset from the UNITI randomized controlled trial is described by Simoes et al.20.

With this app, it is possible to record the individual fluctuations in tinnitus perception. With the help of a mobile device, users can systematically measure the fluctuations of their tinnitus. Via the TYT website or the app, users can also view the progress of their own data and, if necessary, discuss it with their physician.

The ML task at hand is a classification task with the target variable tinnitus distress at time t_now and the questions from the daily questionnaire as the features of the problem. The target's values range in [0, 1] on a continuous scale. To make it a classification task, we created bins with a step size of 0.2, resulting in 5 classes. The features are the perception, loudness, and stressfulness of tinnitus, as well as the current mood, arousal, and stress level of a user, the concentration level while filling out the questionnaire, and the perception of the worst tinnitus symptom. A detailed description of the features was already given in previous work21. Of note, the time delta between two assessments of one user at t_next and t_now varies between users; its median value is 11 hours.
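The binning of the continuous target into five classes can be done, for example, with pandas as sketched below; the file and column names are illustrative.

```python
# Sketch: binning a continuous target in [0, 1] into 5 ordinal classes with step size 0.2.
import pandas as pd

ema = pd.read_csv("assessments.csv")   # hypothetical EMA export
bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ema["distress_class"] = pd.cut(
    ema["tinnitus_distress"], bins=bins, labels=[0, 1, 2, 3, 4], include_lowest=True
)
```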

The overall goal of UNITI is to treat the heterogeneity of tinnitus patients on an individual basis. This requires understanding more about the patient-specific symptoms that are captured by EMA in real time.

The use case we created for UNITI is similar to that of TYT. The target variable encumbrance, coded as cumberness, which was also continuously recorded, was divided into an ordinal scale from 0 to 1 in 5 steps. The features also include momentary assessments of the user during completion, such as jawbone, loudness, movement, stress, emotion, and questions about momentary tinnitus. The data was collected using our mobile apps7. Of note, the median time gap between two assessments is on average 24 hours for each user.

At the beginning of the COVID-19 pandemic, it was not easy to get initial feedback about an infection, given the lack of knowledge about the novel virus and the absence of widely available tests. To assist all citizens in this regard, we launched the mobile health app Corona Check together with the Bavarian State Office for Health and Food Safety22.

The Corona Check dataset predicts whether a user has a Covid infection based on a list of given symptoms23. It was developed in the early pandemic back in 2020 and helped people to get a quick estimate for an infection without having an antigen test. The target variable has four classes: first, "suspected coronavirus (COVID-19) case"; second, "symptoms, but no known contact with a confirmed corona case"; third, "contact with a confirmed corona case, but currently no symptoms"; and last, "neither symptoms nor contact".

The features are a list of Boolean variables, which were known at this time to be typically related to a Covid infection, such as fever, a sore throat, a runny nose, cough, loss of smell, loss of taste, shortness of breath, headache, muscle pain, diarrhea, and general weakness. Depending on the answers given by a user, the application programming interface returned one of the classes. The median time gap between two assessments of the same user is 8 hours on average, with a much larger standard deviation of 24.6 days.

The last four use cases are all derived from a bigger Covid-related mHealth project called Corona Health6,24. The app was developed in collaboration with the Robert Koch Institute and was primarily promoted in Germany; it includes several studies about the mental or physical health or the stress level of a user. A user can download the app and then sign up for a study. He or she will then receive a one-time baseline questionnaire, followed by recurring follow-ups with time gaps that vary between studies. The follow-up assessment of CHA has a total of 159 questions, including a full PHQ9 questionnaire25. We then used the nine questions of PHQ9 as features at t_now to predict the level of depression for this user at t_next. Depression levels are ordinally scaled from "None" to "Severe" in a total of 5 classes. The median time gap between two assessments of the same user is 7.5 days; that is, the models predict the future over this time interval.

Similar to the adult cohort, the mental health of adolescents during the pandemic and its lock-downs is also captured by our app using EMA.

A lightweight version of the mental-health questionnaire for adults was also offered to adolescents. However, this did not include a full PHQ9 questionnaire, so we created a different use case. The target variable to be classified on a 4-level ordinal scale is perceived dejection derived from the PHQ instruments; the features are a subset of quality-of-life assessments and PHQ questions, such as concernment, tremor, comfort, leisure quality, lethargy, prostration, and irregular sleep. For this study, the median time gap between two follow-up assessments is 7.3 days.

Analogous to the mental health of adults, this study aims to track how the physical health of adults changes during the pandemic period.

Adults had the option to sign up for a study with recurring assessments asking about their physical health. The target variable to be classified asks about the constraints in everyday life that arise due to physical pain at t_next. The features for this use case include aspects like sport, nutrition, and pain at t_now. The median time gap between two assessments of the same user is 14.0 days.

This additional study within the Corona Health app asks users about their stress level on a weekly basis. Both the features and the target are assessed on a five-level ordinal scale from "never" to "very often". The target asks about the ability to manage stress; the features include the first nine questions of the Perceived Stress Scale instrument26. The median time gap between two assessments of the same user is on average 7.0 days.

We also want to compare the ML approaches with baseline heuristics (synonym: baseline models). A baseline heuristic can be a simple ML model like a linear regression or a small decision tree, or alternatively, depending on the use case, it could also be a simple statement like "the next value equals the last one". The typical approach for improving ML models is to estimate the generalization error of the model on a benchmark dataset when compared to a baseline heuristic. However, it is often not clear which baseline heuristic to consider: the same model architecture as the benchmark model, but without tuned hyperparameters? A simple, intrinsically explainable model with or without hyperparameter tuning? A random guess? A naive guess, in which the majority class is predicted? Since we have approaches on the user level (i.e., we consider users when splitting) and on the assessment level (i.e., we ignore users when splitting), we should also create baseline heuristics on both levels. We additionally account for within-user variance in Ecological Momentary Assessments by averaging a user's previously known assessments. Previously known here means that we calculate the mode or median of all assessments of a user that are older than the given timestamp. In total, this leads to four baseline heuristics (user-level latest, user-level average, assessment-level latest, assessment-level average) that do not use any machine learning but simple heuristics, as sketched below. On the assessment level, the latest known target or the mean of all targets known so far is taken to predict the next target, regardless of the user id of the assessment. On the user level, either the last known value or the median or mode value of this user is taken to predict the target. This, in turn, leads to a cold-start problem for users that appear for the first time in a dataset; in this case, either the last known value or the mode or median of all assessments known so far is taken to predict the target.
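The four heuristics can be sketched with pandas as follows; the dataframe layout and column names are assumptions, and the mean/median choices follow the description above.

```python
# Sketch: the four non-ML baseline heuristics (assessment-level vs. user-level, latest vs. average).
# File and column names are hypothetical placeholders.
import pandas as pd

ema = pd.read_csv("assessments.csv").sort_values("timestamp")

# Assessment level: ignore the user id entirely.
assessment_latest = ema["target"].shift(1)                       # latest known target overall
assessment_average = ema["target"].expanding().mean().shift(1)   # mean of all targets known so far

# User level: only use the history of the same user (median of older assessments here).
user_latest = ema.groupby("user_id")["target"].shift(1)
user_average = ema.groupby("user_id")["target"].transform(
    lambda s: s.expanding().median().shift(1)
)

# Cold start: the first assessment of a user falls back to the assessment-level heuristic.
user_latest = user_latest.fillna(assessment_latest)
user_average = user_average.fillna(assessment_average)
```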

Before the data and approaches could be compared, it was necessary to homogenize them. In order for all approaches to work on all datasets, at least the following information is necessary: assessment_id, user_id, timestamp, features, and the target. Any other information, such as GPS data or additional answers to questions of the assessment, was not included in the ML pipeline. Additionally, targets that were collected on a continuous scale had to be binned into an ordinal scale of five classes. For easier interpretation and readability of the outputs, we also created label encodings for each target. To ensure consistency of the pre-processing, we created helper utilities within Python so that the same function was applied to each dataset. For missing values, we created a user-wise missing-value treatment. More precisely, if a user skipped a question in an assessment, we filled the missing value with the mean or mode (mode = most common value) of this user's other answers to this question. If a user had only one assessment, we filled it with the overall mean for this question.
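A minimal sketch of this user-wise imputation with pandas, under assumed column names, could look as follows.

```python
# Sketch: user-wise missing-value treatment.
# A skipped answer is filled with the mean of the same user's other answers to that question;
# if the user has no other answer, the overall mean of the question is used instead.
import pandas as pd

ema = pd.read_csv("assessments.csv")              # hypothetical EMA export
feature_cols = ["mood", "stress", "loudness"]     # hypothetical feature columns

for col in feature_cols:
    ema[col] = ema.groupby("user_id")[col].transform(lambda s: s.fillna(s.mean()))
    ema[col] = ema[col].fillna(ema[col].mean())   # fallback for users without any other answer
```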

For each dataset and for each script, we set random states and seeds to enhance reproducibility. For the outer validation split, we assigned the first 80% of all users that signed up for a study to the train set and the last 20% to the test set. To ensure comparability, the test users were the same for all approaches. We did not shuffle the users, in order to simulate a deployment scenario in which new users join the study. This also introduces potential concept drift from the train to the test set and thus makes the simulation more realistic.
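A sketch of such a time-ordered 80/20 user split, assuming users are identified by user_id and ordered by the time of their first assessment (names are illustrative, not the authors' code):

    import pandas as pd

    def outer_split(df: pd.DataFrame, train_frac: float = 0.8):
        """Split by users, ordered by their first assessment (sign-up) time."""
        first_seen = df.groupby("user_id")["timestamp"].min().sort_values()
        n_train = int(len(first_seen) * train_frac)
        train_users = first_seen.index[:n_train]
        test_users = first_seen.index[n_train:]
        train_set = df[df["user_id"].isin(train_users)]
        test_set = df[df["user_id"].isin(test_users)]
        return train_set, test_set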

For the cross-validation within the training set, which we call internal validation, we chose a total of 5 folds with 1 validation fold. We then applied the four baseline heuristics (on user level and assessment level, with either the latest target or the average target as prediction) to calculate the mean and the within-train-set standard deviation of the weighted F1 scores across the training folds. The mean and standard deviation of the weighted F1 score then serve as the estimator of the performance of our model on the test set.

We call one approach superior to another if the final score is higher. The final score to evaluate an approach is calculated as:

$$f_{1}^{final} = f_{1}^{test} - \alpha\, \sigma\!\left(f_{1}^{train}\right)$$

(1)

If the standard deviation between the folds during training is large, the final score is lower. The test set must not contain any selection bias against the underlying population. The pre-factor \(\alpha\) of the standard deviation is another hyperparameter: the more important model robustness is for the use case, the higher \(\alpha\) should be set.
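As a worked example of Eq. (1), assuming weighted F1 scores from five training folds and one test score (all numbers are made up for illustration):

    import numpy as np

    f1_train_folds = np.array([0.71, 0.68, 0.74, 0.70, 0.69])  # hypothetical fold scores
    f1_test = 0.66                                              # hypothetical test score
    alpha = 1.0                                                 # robustness weight (hyperparameter)

    f1_final = f1_test - alpha * f1_train_folds.std()
    print(round(f1_final, 3))  # test score penalized by the spread across folds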

Within cross-validation, there exist several approaches for splitting the data into folds and validating them, such as the k-fold approach, with k being the number of folds in the training set. Here, k-1 folds form the training folds and one fold is the validation fold27. One can then calculate k performance scores and their standard deviation to get an estimator for the performance of the model on the test set, which itself is an estimator for the model's performance after deployment (see also Fig.2).

Schematic visualisation of the steps required to perform a k-fold cross-validation, here with k=5.

In addition, there exist the following strategies: First, (repeated) stratified k-fold, in which the target distribution is retained in each fold, which can also be seen in Fig.3; after shuffling the samples, the stratified split can be repeated3. Second, leave-one-out cross-validation28, in which the validation fold contains only one sample while the model is trained on all other samples. And third, leave-p-out cross-validation, in which \(\binom{n}{p}\) train-test pairs are created, with n equal to the number of assessments (synonym: samples)29.

While this approach retains the class distribution in each fold, it still ignores user groups. Each color represents a different class or user id.

These approaches, however, do not account for the peculiarities of mHealth data. To be more specific, they do not account for users (syn. groups, subjects) that generate daily assessments (syn. samples) with high within-user variance.

To explain the splitting approaches precisely, we would like to differentiate between the terms fold and set. We call a chunk of samples (synonym: assessments, filled-out questionnaires) a set in the outer split of the data, from which we cut off the final test set. Within the training set, we then split further to create training and validation folds. That is, when using the term fold, we are in the context of cross-validation; when we use the term set, we are in the outer split of the ML pipeline. Figure 4 visualizes this approach. Following this, we define four different approaches to split the data. For one of them we ignore the fact that there are users; for the other three we do not. We call these approaches user-cut, average-user, user-wise, and time-cut. All approaches have in common that the first 80% of all users are always in the training set and the remaining 20% are in the test set. A schematic visualization of the splitting approaches is shown in Fig.5. Within the training set, we then split on user-level for the approaches user-cut, average-user, and user-wise, and on assessment-level for the approach time-cut.

In the second step, users are ordered by their study registration time, with the first 80% designated as training users and the remaining 20% as test users. Subsequently, assessments by training users are allocated to the training set, and those by test users to the test set. Within the training set, user grouping dictates the validation approach: group cross-validation is applied if users are declared as a group; otherwise, standard cross-validation is used. We compute the average F1 score across the training folds, \(f_{1}^{train}\), and the F1 score on the test set, \(f_{1}^{test}\). The standard deviation of \(f_{1}^{train}\), \(\sigma(f_{1}^{train})\), indicates model robustness. The hyperparameter \(\alpha\) adjusts the emphasis on robustness, with higher values prioritizing it. Ultimately, \(f_{1}^{final}\), which is a more precise estimate if group cross-validation is applied, offers a refined measure of model performance in real-world scenarios.

Yellow means that this sample is part of the validation fold, green means it is part of a training fold. Crossed out means that the sample has been dropped in that approach because it does not meet the requirements. Users can be sorted by time to accommodate any concept drift.

In the following section, we explain the splitting approaches in more detail. The time-cut approach ignores the given groups in the dataset and simply creates validation folds based on the time the assessments arrive in the database. In this example, the month in which a sample was collected is known; more precisely, all samples from January until April are in the training set, while May is in the test set. The user-cut approach shuffles all user ids and creates five data folds with distinct user groups. It ignores the time dimension of the data but provides user-distinct training and validation folds, similar to the GroupKFold cross-validation approach implemented in scikit-learn30. The average-user approach is very similar to the user-cut approach; however, each answer of a user is replaced by the median or mode answer of this user up to the point in question, to reduce within-user variance. While all the above-mentioned approaches require only one single model to be trained, the user-wise approach requires as many models as there are distinct users in the dataset. For each user, 80% of his or her assessments are used to train a user-specific model, and the remaining 20% of the time-sorted assessments are used to test the model. This means that for this approach we can evaluate directly on the test set, as each model is user-specific and the cold-start problem is solved by training the model on the first assessments of this user. If a user has fewer than 10 assessments, he or she is not evaluated with that approach.
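A minimal sketch contrasting an assessment-level split (plain KFold, as in time-cut) with a user-level split (GroupKFold, as in user-cut) using scikit-learn; the data and variable names are illustrative:

    import numpy as np
    from sklearn.model_selection import KFold, GroupKFold

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 5))          # feature matrix (illustrative)
    y = rng.integers(0, 5, size=100)       # ordinal target with five classes
    users = rng.integers(0, 20, size=100)  # user id per assessment

    # Assessment-level: the same user may appear in training and validation folds.
    for train_idx, val_idx in KFold(n_splits=5).split(X):
        pass  # fit/evaluate a model here

    # User-level: each user's assessments stay in exactly one fold.
    for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=users):
        assert set(users[train_idx]).isdisjoint(users[val_idx])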

Approval for the UNITI randomized controlled trial and the UNITI app was obtained by the Ethics Committee of the University Clinic of Regensburg (ethical approval No. 20-1936-101). All users read and approved the informed consent before participating in the study. The study was carried out in accordance with relevant guidelines and regulations. The procedures used in this study adhere to the tenets of the Declaration of Helsinki. The Track Your Tinnitus (TYT) study was approved by the Ethics Committee of the University Clinic of Regensburg (ethical approval No. 15-101-0204). The Corona Check (CH) study was approved by the Ethics Committee of the University of Würzburg (ethical approval no. 71/20-me) and the university's data protection officer and was carried out in accordance with the General Data Protection Regulations of the European Union. The procedures used in the Corona Health (CH) study were in accordance with the 1964 Helsinki declaration and its later amendments and were approved by the ethics committee of the University of Würzburg, Germany (No. 130/20-me). Ethical approvals include secondary use. The data from this study are available on request from the corresponding author. The data are not publicly available, as the informed consent of the participants did not provide for public publication of the data.

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Go here to read the rest:
Practical approaches in evaluating validation and biases of machine learning applied to mobile health studies ... - Nature.com

Tags:

Application of power-law committee machine to combine five machine learning algorithms for enhanced oil recovery … – Nature.com

This study combines the predictions of five machine learning models by means of the PLCM method to increase the generalization of the model in the context of EOR screening. It not only assesses the individual machine learning methods in predicting the most suitable EOR techniques, but also takes advantage of the PLCM method optimized by the PSO to increase the prediction accuracy, for the first time in the context of EOR screening. In this manner, the predictive tool is not limited to only one data-driven model, but benefits from the strengths of different types of machine learning algorithms. Figure 1 shows the flowchart of this study. First, the dataset required to build and evaluate the utilized models is collected. Then, the data is preprocessed, which includes encoding the textual data into numeric values and normalizing the variables into [0, 1]. Next, the individual machine learning models are trained, with their hyperparameters tuned using a grid search with fivefold cross-validation. After training the individual models, their outputs are combined using the PLCM method optimized by the PSO algorithm. Then, the performance of the utilized methods is compared in terms of quantitative and visual evaluation metrics, including the accuracy, precision, recall, F1-score, confusion matrix, precision-recall curve, and Receiver Operating Characteristic (ROC) curve, to analyze their ability to handle the class-imbalance issue. In the end, a feature importance analysis is conducted to find the most influential input variables for predicting suitable EOR techniques. Another distinctive aspect of this study is that it uses a more comprehensive dataset than previous studies, which increases the generalization of the developed model.

General flowchart of the study.

In this study, a dataset including 2563 EOR projects (available in Supplementary Information) from 23 different countries, applied to sandstone, carbonate, limestone, dolomite, unconsolidated sandstone, and conglomerate reservoirs, was collected from the literature5,20,21,22,23,24,25,26,27 to develop the screening methods. The utilized variables include the formation type, porosity (%), permeability (mD), depth (ft), viscosity (cP), oil gravity (API), temperature (°F), and the production mechanism before conducting EOR. The EOR techniques include injection of steam, hydrocarbon miscible, hydrocarbon immiscible, CO2 miscible, CO2 immiscible, carbonated water, low-salinity water, CO2 foam, nitrogen miscible, nitrogen immiscible, micellar polymer, surfactant/polymer, surfactant, cyclic steam drive, steam-assisted gas drive (SAGD), liquefied petroleum gas (LPG) miscible, in-situ combustion, polymer, alkaline/surfactant/polymer (ASP), hot water, microbial, air-foam, hydrocarbon miscible-WAG, and hydrocarbon immiscible-WAG. Table 2 reports the statistical analysis of the variables. Since formation is a categorical feature, it was converted to numerical values. Among fifteen different formation types, sandstone, carbonate, and dolomite are the most prevalent with 45%, 10%, and 10% of the total data, respectively. To assess the accuracy of the developed models on unseen data, 85% of the data was used for training and the remaining 15% was used as blind test cases, with fivefold cross-validation used for hyperparameter tuning. It is common to divide the dataset with a ratio of 70:15:15 into training, validation, and testing subsets, where the validation subset is used for tuning the hyperparameters of the models. Nonetheless, in the current study, fivefold cross-validation was used to tune the hyperparameters, which does not require putting aside a portion of the data for validation. In this technique, the training subset is divided into K (5 in this study) non-overlapping folds. Then, the model is trained and validated K times with fixed hyperparameters: one of the folds is used for validation and the others for training. Finally, the validation score is calculated as the average of the scores over the K repetitions. This is repeated for all configurations of the hyperparameters, and the set of hyperparameters with the highest cross-validation score is selected. Thereby, as a separate validation subset was not needed, all samples except for the testing subset were used for training (85%).
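A sketch of hyperparameter tuning with a grid search and fivefold cross-validation in scikit-learn; the estimator, grid, and synthetic data shown here are placeholders, not the exact search spaces or dataset of the study:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Stand-in for the preprocessed EOR dataset (features X, encoded labels y).
    X, y = make_classification(n_samples=500, n_features=8, n_informative=6,
                               n_classes=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.15, random_state=0, stratify=y)

    param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)  # 5-fold CV over every grid configuration
    print(search.best_params_, search.best_score_)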

One of the crucial steps before moving to model development is data preprocessing. One type of preprocessing is to encode textual values as numerical values, which is called label encoding. For example, the formation type, previous production mechanism, and EOR technique are textual features, which were encoded as numbers. Another preprocessing step is scaling the data into similar intervals, since the scales of the features differ significantly; for example, viscosity is on the order of \(10^{6}\), while porosity is on the order of tens. In this study, the features were normalized into the [0, 1] interval using \((X - X_{min})/(X_{max} - X_{min})\), where \(X_{min}\) and \(X_{max}\) are the minimum and maximum of the feature in the training subset.
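A minimal sketch of these two preprocessing steps with scikit-learn; the split into one categorical and two numeric features is assumed purely for illustration:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder, MinMaxScaler

    formation = np.array(["sandstone", "carbonate", "dolomite", "sandstone"])
    numeric = np.array([[12.0, 150.0], [18.5, 40.0], [9.3, 700.0], [22.1, 5.0]])  # e.g. porosity, permeability

    formation_encoded = LabelEncoder().fit_transform(formation)  # text -> integer codes

    scaler = MinMaxScaler()  # (X - X_min) / (X_max - X_min), fitted on the training data only
    numeric_scaled = scaler.fit_transform(numeric)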

ANN is a learning algorithm inspired by the human brain. An ANN can figure out the relationship between the inputs and outputs without the need for complex mathematical or computational methods. Among the various types of ANN, the Multilayer Perceptron (MLP-ANN) stands out as the most commonly used28,29,30. The MLP includes three layers, namely the input, hidden, and output layers31,32, as illustrated in Fig.2. As shown, each layer consists of computational units known as neurons. The number of neurons in the input and output layers equals the dimension of the input and output variables, respectively, while the number of hidden layers and their size should be determined by trial and error. Each neuron is connected to all neurons of the previous layer, which represents a unique linear combination of the data coming in from the previous layer. The linear combination takes place using a set of weights; for example, \(W_{xh}\) represents the set of weights mapping the inputs to the hidden layers, and \(W_{ho}\) represents the set of weights mapping the hidden neurons to the output layer. Another critical aspect of an ANN model is the activation function, which receives the results of the linear combination, known as activations, and determines the activation of each neuron. Including hidden layers with non-linear activation functions in an ANN empowers it to capture non-linear dependencies. The weights are learned during the training phase of the model, which is the ultimate goal of the training process. Using these weights, the outputs, represented by \(\hat{y}\), are calculated by the feed-forward process as below.

$$\hat{y} = f\left( \sum_{i = 1} W_{ij} x_{i} + b_{j} \right),$$

(1)

where f is the activation function; \(b_{j}\) is the hidden layer bias; \(x_{i}\) is the input for the ith variable; and \(W_{ij}\) is the connection weight between the ith input and the jth neuron.

Schematic structure of an ANN.

The learning process in an ANN is actually adjusting the weights and biases in the hidden layers using the backpropagation algorithm to minimize the loss function between the predicted and actual values28,33. In a multiclass classification problem, the outputs are converted to one-hot encoded vectors, where all elements of the vectors are zeros except for the element corresponding to that specific sample class. To handle multiclass classification, the categorical cross entropy is used as the loss function, which is defined as follows.

$$CCE(W) = -\sum_{i = 1}^{C} y_{i} \log\left(\hat{y}_{i}\right),$$

(2)

where y denotes the vector of actual outputs and C is the number of classes. Each output in a multiclass problem is a vector of probabilities for each class. The probabilities are calculated using the Softmax activation function. To minimize the loss function, the gradient of the loss with respect to the weights and biases must be calculated and backpropagated to all layers to update the weights. Given the gradient of the loss function, the weights are updated as follows.

$$W^{t + 1} = W^{t} - \eta \nabla_{W} CCE,$$

(3)

where \(W^{t + 1}\) and \(W^{t}\) are the new and current weights, \(\eta\) is the learning rate, and \(\nabla_{W} CCE\) is the gradient of the loss function, calculated by an optimization algorithm such as Adam, Stochastic Gradient Descent (SGD), RMSprop, Adagrad, Momentum, or Nesterov Accelerated Gradient34,35.
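To make the update rule concrete, here is a minimal numpy sketch of one gradient-descent step for a single softmax layer; it is a deliberately simplified stand-in for the full MLP, and all shapes and values are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 8))                 # batch of 32 samples, 8 features
    y = np.eye(5)[rng.integers(0, 5, size=32)]   # one-hot targets, C = 5 classes
    W = rng.normal(scale=0.1, size=(8, 5))       # weights of a single softmax layer
    b = np.zeros(5)
    eta = 0.01                                   # learning rate

    logits = X @ W + b
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    cce = -np.mean(np.sum(y * np.log(probs), axis=1))                   # categorical cross-entropy, Eq. (2)

    grad_logits = (probs - y) / len(X)           # d(CCE)/d(logits) for softmax + cross-entropy
    W -= eta * (X.T @ grad_logits)               # gradient-descent update, Eq. (3)
    b -= eta * grad_logits.sum(axis=0)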

ANNs offer a variety of hyperparameters that can be tuned to optimize the model's performance, including options for controlling the model structure, learning rate, and regularization. Furthermore, ANNs can incorporate class weights into the loss function, addressing the class-imbalance problem, which is useful for the problem under study. They also support multiclass classification. Accordingly, one of the methods utilized in this study is the ANN.

According to the explanations above, the control parameters of the ANN are the number of hidden layers, the number of neurons in the hidden layers, the activation functions, the optimizer, and the learning rate, which should be fine-tuned to achieve satisfactory performance.

CatBoost is a gradient-boosting tree construction method36, which makes use of both symmetric and non-symmetric construction methods. In CatBoost, a tree is learned at each iteration with the aim of reducing the error made by previous trees. Figure3 shows the process of CatBoost tree building. In this figure, the orange and blue circles represent a dataset with two classes. The process starts with a simple initial model, assigning the average of the entire dataset to a single leaf node. Then, the misclassified samples (enlarged circles in Fig.3) are identified and new trees are added based on the gradient boosting approach. Afterward, the predictions are updated to the combination of the predictions made by all trees. By adding new trees at each iteration, the number of misclassified samples decreases. Adding the trees continues until either the minimum number of samples required for splits or the maximum depth of the trees is reached. For categorical features, the CatBoost algorithm employs a symmetric splitting method for each feature. Then, based on the type of the feature, it chooses one of the split methods for each feature to create a new branch for each category37.

Schematic of the CatBoost tree construction.

Considering a training dataset with \(N\) samples, where \(X\) is the matrix of inputs \((x_{1}, \ldots, x_{N})\) and \(y\) is the vector of outputs \((y_{1}, \ldots, y_{N})\), the goal is to find a mapping function, \(f(X)\), from the inputs to the outputs. Here, \(f(X)\) is the boosted trees. Just like the ANN, CatBoost needs a loss function \(L(f)\) to be minimized to perform the optimal tree-building strategy.

Now, the learning process entails minimizing \(L(f)\).

$$f^{*}(X) = \arg\min_{f} L(f) = \arg\min_{f} \sum_{i = 1}^{N} L(y_{i}, \hat{y}_{i}),$$

(4)

If the algorithm entails M gradient boosting steps, a new estimator \(h_{m}\) can be added to the model.

$$f_{m + 1}(x_{i}) = f_{m}(x_{i}) + h_{m}(x_{i}),$$

(5)

where \(f_{m + 1}(x_{i})\) is the new model and \(h_{m}(x_{i})\) is the newly added estimator. The new estimator is determined by employing the gradient boosting algorithm, where the steepest descent gives \(h_{m} = -\alpha_{m} g_{m}\), with \(\alpha_{m}\) being the step length and \(g_{m}\) the gradient of the loss function.

Now, the addition of a new tree/estimator can be accomplished by

$$f_{m + 1}(x) = f_{m}(x) + \left( \arg\min_{h_{m} \in H} \left[ \sum_{i = 1}^{N} L\left( y_{i}, f_{m}(x_{i}) + h_{m}(x_{i}) \right) \right] \right)(x),$$

(6)

$$f_{m + 1}(x) = f_{m}(x) - \alpha_{m} g_{m}.$$

(7)

By taking advantage of the gradient boosting approach, the ensemble of decision trees built by the CatBoost algorithm often leads to high prediction accuracy. CatBoost also uses a strategy known as ordered boosting to improve the efficacy of its gradient-boosting process. In this type of boosting, a specific order is used to train the trees, determined by their feature importance. This prioritizes the most informative features, resulting in more accurate models38. The algorithm offers a wide range of regularization methods, such as depth regularization and feature combinations, which helps prevent overfitting. This is especially useful when dealing with complex datasets.

CatBoost offers a range of control parameters to optimize the structure of the model, including the number of estimators, the maximum depth of the trees, the maximum number of leaves, and the regularization coefficients. These control parameters are optimized in this study to obtain the best performance from the model.
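As an illustration, a CatBoost classifier could be configured with these kinds of control parameters as sketched below; the particular values are placeholders, not the tuned settings of the study:

    from catboost import CatBoostClassifier

    # Hypothetical configuration; each argument corresponds to one of the
    # control parameters discussed above.
    model = CatBoostClassifier(
        n_estimators=500,          # number of boosting iterations / trees
        max_depth=6,               # maximum depth of each tree
        l2_leaf_reg=3.0,           # L2 regularization coefficient
        learning_rate=0.1,
        loss_function="MultiClass",
        verbose=False,
    )
    # model.fit(X_train, y_train) would then grow the ordered-boosting ensemble.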

KNN is a non-parametric learning algorithm proposed by Fix and Hodges39. This algorithm does not have a training step and determines the output of a sample based on the outputs of the neighboring samples10. The number of neighbors is denoted by K. With K=1, the label of the sample is that of the nearest sample. As the name of this algorithm implies, the K nearest neighbors are found based on the distance between the query sample and all samples in the dataset. Euclidean, Minkowski, Chebyshev, and Manhattan distances are some common distance measures. The Minkowski distance is a generalization of the Euclidean and Manhattan distances, with \(p = 2\) and \(p = 1\), respectively; p is the order of the Lp norm, which can be a positive integer. The distance between samples greatly depends on the scale of the features; therefore, feature scaling is of great importance40. After finding the K nearest samples to the new sample (query), its label is determined using Eq.(8).

$$\hat{f}(x_{q}) \leftarrow \arg\max_{c \in C} \sum_{i = 1}^{K} \delta(c, f(x_{i})), \quad \delta(a, b) = 1 \quad \text{if } a = b.$$

(8)

where \(x_{q}\) is the new sample, \(f(x_{i})\) is the label of the ith neighboring sample, C denotes the number of classes, and \(\delta(a, b)\) is the Kronecker delta, which is 1 if \(a = b\) and 0 otherwise. An extension of KNN is the distance-weighted KNN, where the inverses of the distances between the samples are used as the weights. In this manner, the prediction for the query sample becomes

$$\hat{f}(x_{q}) \leftarrow \arg\max_{c \in C} \sum_{i = 1}^{K} w_{i}\, \delta(c, f(x_{i})), \quad \delta(a, b) = 1 \quad \text{if } a = b,$$

(9)

where \(w_{i}\) is the inverse of the distance between the query sample and sample i, \(w_{i} = 1/D(x_{q}, x_{i})\). Consequently, the closer neighbors have a higher impact on the predicted label.

One distinctive feature of KNN that sets it apart from other machine learning methods is its ability to handle incomplete observations and noisy data41. This technique enables the identification of significant patterns within noisy data records. Another advantage of KNN is that it does not require any training, so building and optimizing the model can be done quite quickly. According to the above explanations, the control parameters of KNN are the number of neighbors (K), using or not using distance weighting, the Minkowski exponent p, and the algorithm used to compute the nearest neighbors.
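A sketch of a distance-weighted KNN classifier with these control parameters in scikit-learn; the synthetic data and parameter values are illustrative, not the tuned ones:

    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import MinMaxScaler

    X, y = make_classification(n_samples=300, n_features=8, n_informative=6,
                               n_classes=4, random_state=0)
    X = MinMaxScaler().fit_transform(X)   # feature scaling matters for distance-based models

    knn = KNeighborsClassifier(
        n_neighbors=5,        # K
        weights="distance",   # inverse-distance weighting, Eq. (9)
        p=2,                  # Minkowski exponent (p=2 -> Euclidean)
        algorithm="auto",     # neighbor-search algorithm
    )
    knn.fit(X, y)
    print(knn.predict(X[:3]))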

SVM is a binary classification algorithm introduced by Cortes and Vapnik42. SVM can be implemented to solve problems with linear or non-linear behavior43,44; however, non-linear data should be mapped into a higher-dimensional space to make it linearly separable. This technique is called the kernel trick. The classification is done by a decision boundary that has the maximum margin from both classes. Figure 4 shows the schematic of an SVM classifier for a binary classification task. The margins are constructed by finding the support vectors in each class and drawing the hyperplanes through the support vectors45. The hyperplanes are shown by dashed lines and the decision boundary is drawn between them. In this figure, the green circles represent the positive (+1) class and the blue circles represent the negative (-1) class. The circles on the hyperplanes are the support vectors. The decision boundary with the maximum margin from the classes results in the highest generalization.

Schematic of a binary SVM.

By considering the mapping function \(\phi(X)\), inputs \(X\), and outputs \(y\), the equation of the decision boundary can be written as follows46:

$$W^{T} \phi(X) + b = 0,$$

(10)

where W denotes the weight parameters and b is the bias term. The smallest perpendicular distance between the hyperplanes is known as the margin, which is double the distance between the support vectors and the decision boundary. Assuming that the data is separated by two hyperplanes with margin \(\beta\), after rescaling W and b by \(\beta/2\) in the equality, for each training example we have

$$y_{i}\left[ W^{T} \phi(x_{i}) + b \right] \ge 1, \quad i \in \{1, 2, \ldots, M\}.$$

(11)

For every support vector \((X_{s}, y_{s})\) the above inequality is an equality. Thereby, the distance between each support vector and the decision boundary, r, is as follows

$$r = \frac{y_{s}\left(W^{T} X_{s} + b\right)}{\| W \|} = \frac{1}{\| W \|},$$

(12)

where \(\| W \|\) is the L2 norm of the weights. Therefore, the margin between the two hyperplanes becomes \(\frac{2}{\| W \|}\). The goal is to maximize \(\frac{2}{\| W \|}\), which is equivalent to minimizing \(\frac{1}{2} W^{T} W\). Consequently, the optimization problem of the SVM is:

$$\begin{gathered} \arg\min_{W,b} \frac{1}{2} W^{T} W, \\ \text{subject to } y_{i}\left[ W^{T} \phi(x_{i}) + b \right] \ge 1, \quad \text{for } i = 1, \ldots, M. \end{gathered}$$

(13)

Nonetheless, to increase the generalization of the model and avoid overfitting, slack variables \(\xi\)46,47 are used (see Fig.3), which allow the model to have some misclassified samples during training. This approach is known as the soft-margin approach. Now, the optimization problem becomes

$$\begin{gathered} \arg\min_{W,b} \left( \frac{1}{2} W^{T} W + c \sum_{i} \xi_{i} \right), \\ \text{subject to } y_{i}\left[ W^{T} \phi(x_{i}) + b \right] \ge 1 - \xi_{i}, \quad \text{for } i = 1, \ldots, M, \end{gathered}$$

(14)

where c is a regularization factor that controls the weight of the slack variables in the loss function. Equation (14) is a constrained optimization problem, which is solved using the Lagrange approach. The Lagrange approach converts a constrained optimization problem to an unconstrained one by incorporating the equality and inequality constraints into the loss function. Thereby, Eq.(14) becomes

$$\begin{gathered} L(W, b, \alpha) = \frac{1}{2} W^{T} W - \sum_{i = 1}^{M} \alpha_{i} \left[ y_{i} \left( W^{T} \phi(X_{i}) + b \right) - 1 \right], \\ \text{subject to } 0 \le \alpha_{i} \le c, \quad i = 1, \ldots, M, \end{gathered}$$

(15)

where the \(\alpha_{i}\) are the Lagrange multipliers. To minimize the above loss function, its derivatives with respect to W and b are set equal to zero. By doing this, we obtain \(W = \sum_{i = 1}^{M} \alpha_{i} y_{i} \phi(X_{i})\) and \(\sum_{i = 1}^{M} \alpha_{i} y_{i} = 0\). Plugging these back into the Lagrangian gives the dual formulation.

$$\begin{gathered} \arg\max_{\alpha} \; -\frac{1}{2} \sum_{i,j = 1}^{M} \alpha_{i} \alpha_{j} y_{i} y_{j} \phi(X_{i}) \phi(X_{j}) + \sum_{i = 1}^{M} \alpha_{i}, \\ \text{subject to } \sum_{i = 1}^{M} \alpha_{i} y_{i} = 0, \quad 0 \le \alpha_{i} \le c, \quad i = 1, \ldots, M. \end{gathered}$$

(16)

Equation (16) is solved using a Quadratic Programming solver to obtain the Lagrange multipliers \(\alpha_{i}\); \(\alpha_{i}\) is non-zero only for the support vectors. The parameter b does not appear in the dual formulation, so it is determined separately from the initial constraints. Calculating \(\phi(X_{i})\phi(X_{j})\) is computationally expensive, since it requires two mapping operations and one multiplication, especially if the data is high-dimensional. To tackle this problem, the kernel trick is introduced, where \(\phi(X_{i})\phi(X_{j})\) is represented as a kernel function \(K(X_{i}, X_{j})\) based on Mercer's theorem48. Finally, after determining the Lagrange multipliers, the prediction for a new sample z is calculated as follows

$$y = \operatorname{sign}\left( \sum_{i = 1}^{n} \alpha_{i} y_{i} K(X_{i}, z) + b \right).$$

(17)

The kernel function should be determined by trial and error. Some of the commonly used kernels are the linear, polynomial, and radial basis function (RBF) kernels.

SVM is one of the most successful machine learning algorithms in hand-written digit recognition49,50. SVMs can handle high-dimensional data, making them suitable for tasks with a large number of features. Because they take advantage of the maximum-margin principle and slack variables, SVMs are resistant to overfitting. One special feature of SVMs, setting them apart from other artificial intelligence tools, is the kernel trick, which enables them to solve different kinds of non-linear classification problems. The convex nature of the SVM loss function leads to a convex optimization problem, which ensures convergence to a global optimum. Finally, memory efficiency, due to using only the support vectors to construct the model, and the ability to handle class imbalance, by incorporating class weights into the loss function, are two other advantages of SVMs that make them suitable for the EOR screening problem in this study.

According to the above explanations, some of the most important control parameters of the SVM are the kernel function, the regularization factor (c), the degree of polynomial kernels, the intercept of polynomial kernels (coef0), and the class weights. Class weights are used to tackle the class-imbalance issue by giving larger weights to rare classes when calculating the loss function.

Since the SVM is a binary classifier, one-vs-rest or one-vs-one approaches are used to perform multi-class classification. In this study, the one-vs-rest approach is used, where \(C\) SVM models are trained; each SVM model predicts membership of the samples in one of the C classes.
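A sketch of a one-vs-rest SVM with an RBF kernel and class weighting in scikit-learn; the synthetic data and parameter values are placeholders, not the study's tuned configuration:

    from sklearn.datasets import make_classification
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, n_features=8, n_informative=6,
                               n_classes=5, weights=[0.4, 0.3, 0.15, 0.1, 0.05],
                               random_state=0)

    # One binary SVM per class; class_weight="balanced" counters class imbalance.
    ovr_svm = OneVsRestClassifier(
        SVC(kernel="rbf", C=10.0, gamma="scale", class_weight="balanced"))
    ovr_svm.fit(X, y)
    print(ovr_svm.predict(X[:5]))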

In the context of machine learning, Random Forest (RF) is an ensemble learning technique that builds a multitude of decision trees during training and combines their outputs to make more accurate and robust predictions51. RF is a supervised learning method, suitable for classification and regression tasks. Each tree in the forest is constructed independently, using a random subset of the features and samples drawn with replacement from the training data52. This randomness adds diversity to the decision-making process, preventing the model from focusing too much on idiosyncrasies in the data. An RF takes a random approach to selecting a subset of input variables/features (controlled by the maximum number of features) and performs the optimal split to divide a node based on a split criterion. Avoiding tree pruning ensures maximal tree growth. As a result, a multitude of trees are constructed, and the model employs a voting mechanism to determine the most prevalent class in a classification task.

Each tree makes its own prediction, and the final decision is determined by majority voting. This approach not only enhances the prediction accuracy of the model but also makes it more robust against overfitting. Figure 5 shows the schematic of a random forest where n trees are used to make a prediction. Each subset is randomly selected from the dataset and divided into two parts, the bag and the out-of-bag (OOB) part. The data in each bag is used to build a tree, and the data in the OOB part is used to test that tree. The OOB subset serves as an ongoing, unbiased estimate of the general prediction error, before the prediction accuracy of the aggregated results is verified on the independent testing subset. When \(X\) is inputted to the ensemble, each tree provides a separate output \((o_{1}, \ldots, o_{n})\). In the end, the ultimate class of the inputs is determined by the same approach given in Eq.(8).

Schematic of the random forest tree construction.

The RF produces results competitive with boosting and bagging, without any alteration to the training set. It reduces bias by incorporating a random sample of predictors before each node split. The RF model can handle high-dimensional data without the need for feature selection. Its implementation in Python is relatively straightforward, with fast training and easy parallelization. Given these advantages, it is becoming increasingly popular among data scientists52,53.

According to the above explanations, the control parameters of a random forest are the split criterion, the maximum depth of the trees, the number of estimators, and the maximum number of features; these are fine-tuned to achieve the best performance. Another control parameter, the minimum number of samples required to split a node, is not investigated in this study.
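A sketch of a random forest with these control parameters in scikit-learn; the data and values are illustrative only:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=400, n_features=8, n_informative=6,
                               n_classes=5, random_state=0)

    rf = RandomForestClassifier(
        criterion="gini",      # split criterion
        max_depth=10,          # maximum depth of the trees
        n_estimators=300,      # number of trees
        max_features="sqrt",   # random subset of features considered at each split
        oob_score=True,        # out-of-bag estimate of the generalization error
        random_state=0,
    )
    rf.fit(X, y)
    print(rf.oob_score_)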

A committee machine is a technique for merging the outputs of a multitude of predictive models to come up with a single prediction33. The benefit of this technique is that it takes advantage of the results of different alternatives for modeling a particular problem, instead of using only one model. The individual models are selected in such a way that at least one model from each type of machine learning model is included; thereby, we can benefit from the strengths of different types of learning algorithms. By using the PLCM technique, the chance of overfitting can be lowered33. There are two main approaches to combining the outputs of individual models, namely the static and dynamic approaches. In the static method, a linear combination of the individual outputs is used to obtain the ultimate output, while the dynamic approach uses a non-linear combination of the outputs. In this study, the dynamic approach with a power-law model is used to accomplish the integration task. Equation (18) shows the power-law model.

$$y = \sum_{i = 1}^{5} \alpha_{i}\, y_{i}^{\beta_{i}},$$

(18)

where \(y\) is the ultimate output, \(\alpha_{i}\) and \(\beta_{i}\) are the coefficients that must be optimized to achieve the goal of the power-law committee machine, and \(y_{i}\) is the output of the \(i\)-th individual predictive model. In this study, the coefficients of the power-law model (\(\alpha_{i}\) and \(\beta_{i}\)) are optimized by the PSO algorithm to achieve a satisfactory integration of the outputs. The PSO is described in the following subsection.
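A minimal sketch of the power-law combination of Eq. (18), assuming each base model outputs per-class probabilities and the coefficients have already been found by the optimizer; all values are illustrative:

    import numpy as np

    # Per-class probabilities from five hypothetical base models for one sample.
    y_models = np.array([
        [0.6, 0.2, 0.2],
        [0.5, 0.3, 0.2],
        [0.7, 0.1, 0.2],
        [0.4, 0.4, 0.2],
        [0.6, 0.3, 0.1],
    ])
    alpha = np.array([0.3, 0.1, 0.4, 0.1, 0.1])   # combination weights (optimized by PSO)
    beta = np.array([1.0, 0.8, 1.2, 1.0, 0.9])    # power-law exponents (optimized by PSO)

    # y = sum_i alpha_i * y_i ** beta_i, applied element-wise per class.
    combined = (alpha[:, None] * y_models ** beta[:, None]).sum(axis=0)
    predicted_class = int(np.argmax(combined))
    print(combined, predicted_class)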

Kennedy and Eberhart54 introduced the PSO as a population-based optimization algorithm. This algorithm starts solving the problem with random solutions65. Each solution in this algorithm is known as a particle, and a swarm is composed of a multitude of particles. The particles change their positions in the solution space with a specified velocity that is updated at each iteration. A particle's position determines the solution found by that particle; when the position of the particle changes, a new solution is obtained. The following equations give the update formulae for the velocity and position of a particle

$$v_{i}(t + 1) = \omega v_{i}(t) + c_{1} r_{1} \left(x_{best,i}(t) - x_{i}(t)\right) + c_{2} r_{2} \left(x_{best,g}(t) - x_{i}(t)\right),$$

(19)

$$x_{i}(t + 1) = x_{i}(t) + v_{i}(t + 1),$$

(20)

where \(x_{i}\) and \(v_{i}\) are the position and velocity of particle \(i\), respectively, \(t\) is the iteration number, \(\omega\) is the inertia coefficient, \(c_{1}\) and \(c_{2}\) are the self-learning and social-learning coefficients, respectively, \(r_{1}\) and \(r_{2}\) are two random numbers, \(x_{best,i}\) is the best solution found by the particle, and \(x_{best,g}\) is the global best solution. The values of \(x_{best,i}\) and \(x_{best,g}\) are obtained by evaluating the objective function; in this study, the objective function is the negative of the prediction accuracy of the PLCM method. The velocity and position of the particles are updated until the algorithm reaches the stopping criterion. The parameters used in Eq.(19) are set based on the work by Poli et al.56, where \(\omega\), \(c_{1}\), and \(c_{2}\) are set at 0.7298, 1.49618, and 1.49618, respectively.

The PSO is one of the most commonly used optimization algorithms in petroleum engineering57,58,59,60. Among different metaheuristic optimization algorithms, the PSO has shown better performance than most of its rivals, such as the genetic algorithm and simulated annealing, reaching better optimal solutions and converging faster to similar results in many applications61. Thereby, this algorithm is used in this study to optimize the coefficients of the PLCM method.
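A compact sketch of the PSO update rules of Eqs. (19) and (20), minimizing a placeholder objective; the negative accuracy of the committee machine would take its place, and the swarm size and iteration count are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)

    def objective(x):
        """Placeholder objective; the study minimizes the negative PLCM accuracy."""
        return np.sum((x - 0.5) ** 2)

    n_particles, n_dims, n_iters = 20, 10, 100
    omega, c1, c2 = 0.7298, 1.49618, 1.49618   # parameters from Poli et al.

    x = rng.uniform(0, 1, size=(n_particles, n_dims))   # positions (candidate coefficients)
    v = np.zeros_like(x)                                # velocities
    p_best = x.copy()                                   # personal best positions
    p_best_val = np.array([objective(p) for p in p_best])
    g_best = p_best[np.argmin(p_best_val)]              # global best position

    for _ in range(n_iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = omega * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # Eq. (19)
        x = x + v                                                        # Eq. (20)
        vals = np.array([objective(p) for p in x])
        improved = vals < p_best_val
        p_best[improved], p_best_val[improved] = x[improved], vals[improved]
        g_best = p_best[np.argmin(p_best_val)]

    print(objective(g_best))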

After describing the tools used in this study, it is necessary to define the evaluation metrics, which are required to evaluate the performance of the proposed method. These metrics include the quantitative and visual indicators that are described in the following subsection.

In this study, quantitative and visual evaluation metrics are used to assess the performance of the proposed method. These metrics include the accuracy, precision, recall, F1-score, confusion matrix, Receiver Operating Characteristic (ROC) curve, and precision-recall curve.

Accuracy is the total number of correct predictions divided by the total number of data points. In binary classification, accuracy is defined as \(accuracy = \frac{TP + TN}{N}\), where TP and TN are the numbers of true positives and true negatives, respectively, and N is the total number of data points/samples.

Precision is the portion of positive predictions that are actual positives. Precision focuses on the accuracy of positive predictions. For a binary classification, precision is defined as \(Precision = \frac{TP}{TP + FP}\), where FP is the number of false positives, i.e., samples for which the model predicts positive whereas the actual label is negative.

Recall gives the portion of the positive samples that are identified as positives. Recall focuses on how well the model captures positive instances. In other words, it is the ratio of true positives to all positive samples in the dataset, defined as \(Recall = \frac{TP}{TP + FN}\), where FN is the number of false negative predictions, i.e., samples that are incorrectly classified as negative.

The harmonic mean of the precision and recall is known as the F1-Score, defined in Eq.(21).

$$F1\text{-}Score = 2\,\frac{PR}{P + R},$$

(21)

where P and R are the precision and recall, respectively. A good classifier should have high values of precision and recall, which indicates a high F1-Score.

In multi-class classification, as in the problem of this study, each metric is calculated for individual classes and averaged across all classes to obtain a single value. In this manner, each class is in turn considered positive, while the other classes are considered negative.

In a multiclass problem, the confusion matrix is a \(C \times C\) matrix, where the rows represent the actual classes and the columns represent the predicted classes of the samples. The values on the main diagonal of the matrix show the numbers of correct predictions (true positives), and the off-diagonal values show the numbers of incorrect predictions. The sum of the values on the main diagonal divided by the total number of samples gives the accuracy, as described above. Also, the diagonal value for each class divided by the sum of all values in its column gives the class-specific precision, and divided by the sum of all values in its row gives the class-specific recall.
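A short sketch of computing these metrics with scikit-learn for a multiclass problem; the label arrays are made up for illustration:

    import numpy as np
    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score)

    y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1, 0, 2])   # hypothetical actual classes
    y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 0, 0, 2])   # hypothetical predictions

    print(accuracy_score(y_true, y_pred))
    print(precision_score(y_true, y_pred, average="macro"))  # averaged over classes
    print(recall_score(y_true, y_pred, average="macro"))
    print(f1_score(y_true, y_pred, average="macro"))
    print(confusion_matrix(y_true, y_pred))  # rows: actual class, columns: predicted class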

Original post:
Application of power-law committee machine to combine five machine learning algorithms for enhanced oil recovery ... - Nature.com

Tags:

Free tool uses machine learning to pick better molecules for testing new reactions – Chemical & Engineering News


Visit link:
Free tool uses machine learning to pick better molecules for testing new reactions - Chemical & Engineering News

Tags: