Machine Learning, Deep Learning and Data Preprocessing Techniques for Detection, Prediction, and Monitoring of Stress and Stress-related Mental Disorders: A Scoping Review

Background: Mental stress and its consequent mental disorders (MDs) are significant public health issues. With the advent of machine learning (ML), there's potential to harness computational techniques for better understanding and addressing these problems. This review seeks to elucidate the current ML methodologies employed in this domain to enhance the detection, prediction, and analysis of mental stress and MDs. Objective: This review aims to investigate the scope of ML methodologies used in the detection, prediction, and analysis of mental stress and MDs. Methods: Utilizing a rigorous scoping review process with PRISMA-ScR guidelines, this investigation delves into the latest ML algorithms, preprocessing techniques, and data types used in the context of stress and stress-related MDs. Results and Discussion: A total of 98 peer-reviewed publications were examined. The findings highlight that Support Vector Machine (SVM), Neural Network (NN), and Random Forest (RF) models consistently exhibit superior accuracy and robustness among ML algorithms. Physiological parameters such as heart rate measurements and skin response are prevalently used as stress predictors due to their rich explanatory information and ease of data acquisition. Dimensionality reduction techniques, including mappings, feature selection, filtering, and noise reduction, are frequently observed as crucial steps preceding the training of ML algorithms. Conclusion: This review identifies significant research gaps and outlines future directions for the field. These include model interpretability, model personalization, the incorporation of naturalistic settings, and real-time processing capabilities for the detection and prediction of stress and stress-related MDs. Keywords: Machine Learning; Deep Learning; Data Preprocessing; Stress Detection; Stress Prediction; Stress Monitoring; Mental Disorders


Introduction
Mental health has become a public health concern. According to the Institute for Health Metrics and Evaluation (IHME), in 2019 about 53 million people in the United States and about one in eight individuals worldwide (about 1 billion people) suffered from at least one mental health disorder (MD) [1]. An MD is defined as an impairment in a person's cognition, emotional control, or behavior pattern that has clinical significance and is often linked to distress or functional impairment [2]. MDs severely limit people's daily functioning and can be fatal [3], [4]. In 2019, mental health (MH) problems accounted for 6.6% of all disability-adjusted life years in the US, making them the fifth most significant cause of disability overall [5], [6]. Some of the more prevalent MDs are anxiety disorders, depression or mood disorders, bipolar disorders, psychotic disorders (including schizophrenia), eating disorders, social disorders, disruptive behavior, and addictive behaviors [2]. In 2019, anxiety and depression were the most prevalent forms of MDs (301 and 280 million people affected worldwide, respectively). Anxiety disorder encompasses emotions of concern, anxiety, and excessive fear, or associated behavioral problems, that are severe enough to affect everyday activities [2]. Symptoms include a level of stress that is disproportionate to the significance of the triggering event, difficulty in putting worries out of one's mind, and nervousness [7], [8]. Generalized anxiety disorder, panic attacks, social anxiety disorder, and post-traumatic stress disorder are all examples of different types of anxiety disorders [2], [9]. Depression is characterized by long-lasting sadness and a lack of desire to be active. One of the main symptoms of depression is the inability to enjoy or find pleasure in most of one's daily activities, as well as feeling sadness, anger, or emptiness [2], [10]. A depressive episode typically lasts for at least two weeks. Additionally, a loss of self-worth, feelings of hopelessness for the future, and suicidal thoughts are indicators and symptoms of depression. People who are depressed are more prone to commit suicide [2], [10], [11].
Stress is categorized into distress, which typically has chronic negative effects on health, and eustress, which is short-term and positively influences motivation and development [12]. Throughout this paper, the term stress is specifically used to denote distress, rather than eustress. Mental stress has been shown to contribute significantly to developing and worsening anxiety and depression disorders [13], [14], [15]. Mental stress is the body's natural response to events in which a person feels that the demands of their external environment exceed their psychological and physiological resources for dealing with those demands [16]. Mental stress leads to an asynchrony between the sympathetic and parasympathetic nervous systems (SNS and PNS), which are the main divisions of the autonomic nervous system (ANS) [17] and serve an important role in regulating vital biological activities [18], [19]. The SNS is an integrative system that responds to potentially dangerous circumstances. Its activation is part of the system responsible for controlling "fight-or-flight" responses. The PNS is responsible for the body's "rest and digest" processes.
Given the important role and impact of stress in MDs, previous research has investigated various qualitative and quantitative methods to measure and monitor stress in order to inform effective stress mitigation approaches. While the majority of the stress literature relies on self-reported measures, recent literature has used physiological variables such as heart rate and heart rate variability [20], [21], [22], [23], [24], and behavioral data (e.g., speech, movement, facial expressions) [25], to understand changes to the SNS and PNS associated with stress. Recent advances in sensor and mobile health technologies have resulted in the emergence of "big data" related to mental health, as well as advanced bioinformatics methods, tools, and techniques for using such data for modeling or inference. One such tool that has recently emerged as a robust, rapid, objective, reliable, and cost-efficient technique for studying chronic illnesses and MDs is Machine Learning (ML). ML uses advanced statistical and probabilistic techniques to construct systems that can automatically learn from data. Several characteristics of ML make it suitable for applications in MH monitoring, including significant pattern recognition and forecasting capabilities [26], the capacity to extract crucial information from various data resources and the opportunity to create personalized experiences [26], and the ability to analyze large amounts of data in a short time [27]. As such, ML has gained popularity and has been applied to MH data to enable detection, monitoring, and treatment [28]. The objective of this research is to review the literature in order to summarize and synthesize the application of ML in the detection, monitoring, or prediction of stress and stress-related MDs, in particular anxiety and depression. This paper documents methods-specific findings such as data types, preprocessing methods, and the different algorithms used, as well as the type and characteristics of studies that used ML.
Traditional statistical methods, such as linear regression, logistic regression, t-tests, and ANOVA [29], have been widely employed in the past to detect and analyze stress and stress-related MDs. These methods have proven useful in specific contexts, such as comparing the means of different groups or modeling linear relationships between variables. As demonstrated by [22], [23], [24], [25] and [26], these methods have provided valuable insights in situations where the data is relatively simple and adheres to the underlying assumptions of the statistical techniques. However, when faced with complex, high-dimensional mental health data, which has become increasingly available thanks to advancements in technology and data collection techniques, these traditional statistical methods might not be sufficient. The limitations of these methods stem from their inherent simplicity and the assumptions they rely on, which might not hold true in the context of MH data. For example, linear and logistic regression assume linear relationships between variables, while t-tests and ANOVA require specific assumptions about the data distribution. These assumptions may not be applicable in the case of intricate and heterogeneous MH data, potentially leading to inaccurate or incomplete conclusions.
Advanced data analytics methods, such as machine learning (ML), offer a more powerful and flexible alternative to traditional statistical methods. ML algorithms, with their significant pattern recognition and forecasting capabilities [26], are capable of capturing complex, nonlinear relationships between variables and can adapt to various data distributions. These capabilities enable ML techniques to provide more accurate and insightful predictions, classifications, and associations in the context of MH data [34]. Additionally, ML algorithms can handle large-scale, high-dimensional data more efficiently than traditional methods, allowing researchers to analyze vast amounts of information from diverse sources, such as electronic health records, wearable

Information Sources
The literature search involved databases such as EI Engineering Village, Web of Science, ACM Digital Library, and IEEE Xplore. Additional sources were identified through contact with experts and review of references in relevant articles.

Search Strategy
A comprehensive search was conducted using a combination of keywords related to ML and mental health disorders (Table 1). The search strategy was designed to capture a broad spectrum of ML applications within this field. The full search list from all databases is available in the Multimedia Appendix.

Study Selection, Inclusion, and Exclusion Criteria
Articles that did not fully use ML for the evaluation of stress or stress-related MDs were excluded from the review. Studies published in languages other than English were also excluded. The initial search yielded 1241 results. After duplicate articles were removed and eligibility was confirmed using Rayyan QCRI [36], 1204 articles remained. After applying the exclusion criteria, 98 papers were selected for full review (Figure 1).

Data Charting Process
Data charting was conducted by two reviewers independently using a standardized form, which had been pretested on a subset of included studies. Discrepancies were resolved through discussion or consultation with a third reviewer. Study authors were contacted for clarification or additional data where necessary.

Data Items
Data extracted included publication year, study design, population characteristics, ML techniques used, outcomes measured, and key findings. Other variables sought included data preprocessing methods and performance metrics of the ML models. Simplifying assumptions, such as considering different ML algorithms within the same family as a single technique, were made to facilitate synthesis.

Synthesis of Results
Data were synthesized descriptively, grouping findings by ML technique, data type, and preprocessing technique. Where possible, quantitative performance metrics were extracted or derived. Results were analyzed in the context of the overall study designs and populations to highlight trends and identify gaps in the current research landscape. No formal critical appraisal or quantitative meta-analysis was conducted due to the diversity of the included studies and the scoping nature of this review.

Results and Discussion
In this section, the types of data, preprocessing techniques, and ML techniques used in the literature are reviewed and compared with the existing literature.
Hz), reflect the autonomic nervous system's dynamics during beat-to-beat measurements of the heart rate (Figure 3) [38], [39]. These HRV measures, both in the time and frequency domains, provide a nuanced view of the physiological underpinnings associated with various mental health conditions.
• Heart Rate (HR) (n=17): One of the most important indicators of stress is an abrupt increase in HR. Among the physiological signals, HR is one of the top measures that explain stress in ML models, and it has been used in different studies with almost all ML algorithms [40], [41], [42].
• Blood Pressure (BP) (n=1): BP can be obtained from pulse transit time (PTT) or with pressure cuffs [43]. Stressful conditions create an influx of hormones that increase HR and constrict blood vessels, leading to a temporary BP elevation [44]. In most cases, BP recovers to its pre-stress level after the stress response diminishes [45]. Schultebraucks et al. used systolic BP as one of the measures for predicting a person's level of susceptibility to Post-Traumatic Stress Disorder (PTSD) [46].
Electroencephalogram (EEG) (n=9): EEG detects the electrical activity of the brain. Compared to other brain mapping techniques, it is more practical for stress detection due to several factors, including affordability, non-invasiveness, non-intrusiveness, and, most importantly, its high temporal resolution [47]. The high temporal resolution of EEG makes it appropriate for real-time stress detection, as well as for DL approaches, which require large datasets for training [47], [48], [49], [50], [51].
The most commonly used EEG features for the detection of stress are the power of different frequency bands, namely Alpha (8-13 Hz), Beta (12.5-30 Hz), Theta (4-7.5 Hz), and Gamma (30-40 Hz), the average and standard deviation of a specific time window of the EEG signal, and time-frequency features obtained by the Discrete Wavelet Transform (DWT) algorithm [51], [52], [53]. It has also been shown that statistical features of the EEG signal, such as kurtosis and entropy, are useful features in stress prediction using ML algorithms [50]. Moreover, Power Spectral Density (PSD), correlation (C), divisional asymmetry (DASM), rational asymmetry (RASM), and power spectrum (PS) are other EEG features that have been used in different studies for stress detection [54].
Since EEG signals are collected from the scalp, they include excessive noise and therefore have high uncertainty. Signal processing and feature selection/extraction are thus very important steps when dealing with EEG data. Several well-developed methods are available for treating EEG data. Among them, latent spaces derived from autoencoders and signal reconstruction techniques such as Artifact Subspace Reconstruction (ASR) are well-known methods that can be applied to EEG data to significantly reduce artifacts [49]. These methods are also fast enough to make online detection feasible.
The amygdala and hippocampus are the parts of the brain that have the major responsibility for human reactions to stress [55]. Brain activity caused by stress in those regions affects the prefrontal cortex. Studies collecting data from the prefrontal cortex have also verified that EEG data from this brain region can be used for stress detection [56]. EEG can be collected from the prefrontal cortex using off-the-shelf EEG recording products such as MUSE and Neurosky Mindwave [50], [53], [54], [56].
Eye Tracking (n=3): Eye-tracking features can be indicators of stress. For example, to diagnose the level of stress, the changes in the striations of muscle material in the iris in response to stress can be used as features for ML algorithms. In other words, pupil diameter, which is controlled by the iris sphincter muscles, can be used as a feature [57]. Other eye-tracking features that have been used for stress detection are visual fixations, saccade movements, pupil size, microsaccades, and the number of eye blinks in a specific time window during a certain task [58], [59], [60].

Skin Response (n=24):
A skin response can be defined as a stimulus-regulated electrodermal response and is typically measured using electrodes placed on the fingertips or hands. Skin response is usually associated with an increase in sympathetic activity upon stress-inducing events [61]. The skin becomes a better conductor of electricity when it is stimulated either externally or internally by physiologically stimulating factors, including stressful conditions [62].
Respiratory Signals (n=7): Mental stress can affect different respiratory cycle phases and breathing patterns [63], [64]. For example, one study found that stress had no impact on overall breath duration (respiration rate), but that exhalation periods were longer and pause periods were shorter in the stress experiment compared to the neutral condition [65].
Based on the findings of several studies, the respiratory signal is one of the top contributing factors in explaining stress in ML models. The most common time-domain respiratory signal features extracted for stress detection are the Root Mean Square (RMS), Interquartile Range (IQR), and Mean of squared Differences between Adjacent elements (MDA) of breathing rate and blood oxygenation levels. The most commonly used frequency-domain features of the respiratory signal are the power of low frequencies (LF, under 2 Hz), the power of high frequencies (HF, above 2 Hz), and the ratio of the power of low frequencies to the power of high frequencies (LF/HF) [42], [46], [47], [66], [67], [68].

Electromyogram (EMG) (n=3):
EMG detects the electrical activity of muscles at rest, during a modest contraction, and during a strong contraction [69]. Similar to acceleration data, several studies have shown that using EMG data can help increase the performance of ML models trained on ECG data. The action potential triggered in the EMG during stress can reduce the decision variance of classification models that use ECG [42], [70], [71].
Hormones (n=1): It has been shown that stress can alter the levels of glucocorticoids, catecholamines, growth hormones, and prolactin in the bloodstream. Therefore, in ML models, the levels of hormones such as cortisol, dehydroepiandrosterone sulfate (DHEAS), thyroid-stimulating hormone (TSH), free triiodothyronine (FT3), and free thyroxine (FT4) can be used as predictors for the detection of stress-related disorders [46].
Acceleration/Body Movement (n=8): Mental stress may cause a broad variety of behavioral and body movement symptoms, such as shaking hands and feet, which can be measured with acceleration data [72]. Moreover, research has shown that people with a greater stress score had less variance in their activity level and body movements [73], [74], [75]. For example, in the elderly, stressful life events can be related to a reduced rate of regular physical exercise [76]. Time and frequency features such as the mean absolute deviation from the mean (MAD), total power of acceleration, standard deviation, mean norm of acceleration, absolute integral, and peak frequency of each axis are the features of hand/body acceleration used for stress detection [41], [77], [78]. One practical characteristic of motion/acceleration data is that it can be used to identify sources of noise in other signals. For example, motion data can help distinguish stress from physical activity (e.g., exercise) when other physiological measures such as ECG have uncertainty in prediction [79], [80].

Audio and Speech Signals
• Speech Signals (n=3): Using speech signals, it is feasible to diagnose and assess neurological disorders and MDs [81]. Moreover, studies have shown that, like body acceleration and EMG, features of the speech signal can make stress predictions from heart measurements more robust. The best explanatory parameters of the speech signal are frequency-domain parameters (e.g., PSD, strongest frequency from the FFT) and time-frequency features such as Mel-Frequency Cepstral Coefficients (MFCC) [40], [82], [83]. Since time-frequency measures are two-dimensional measurements with a high number of samples, they make this signal suitable for use in convolutional neural network (CNN) models for stress and depression detection [84].

Preprocessing Techniques
This section reviews important preprocessing techniques that have yielded significant findings and how they are used to support the detection of stress and its related MDs.

Synthetic Minority Oversampling Technique (SMOTE) (n=3):
In the detection of stress and its related MDs, the number of samples for the stress or MD class is usually significantly lower than for the non-stress or non-MD class. This imbalance in the number of samples for each class leads to a bias in prediction (towards the majority class). To correct for this bias, it is possible to oversample the underrepresented group. In stress detection studies using ML models, SMOTE is one of the most common approaches to boost the minority class; it creates new samples by synthesizing those already available in the data (by combining their features) [77], [95], [112].
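The idea of synthesizing minority samples by combining the features of existing ones can be sketched as follows. This is a minimal, standard-library illustration of SMOTE-style interpolation and not the implementation used in the reviewed studies; the function name `smote`, the neighbour count `k`, and the toy stress samples (heart rate, normalized skin response) are assumptions made for the example.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to `base`
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority (stress) class: [heart rate, normalized skin response]
stress = [[82.0, 0.61], [88.0, 0.70], [91.0, 0.66], [85.0, 0.74]]
new_samples = smote(stress, n_new=6)
```

Because each synthetic point lies on a line segment between two real minority samples, the new samples stay inside the region already occupied by the minority class.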
Early Modality Fusion (n=1): In ML models used for the prediction of stress with a multimodal approach, it has been shown that early fusion of multimodal data before feature extraction is more effective and achieves better performance. This is because early modality fusion better captures the important characteristics that are coherent with each other. For example, one study showed that combining different measures, including skin response, skin temperature, and body acceleration, before feature extraction outperforms the approach that extracts the features for each measure separately and combines them afterwards (Figure 4) [113].
Welch's method is one of the most common approaches to calculate PSD [49]. PSD is often used in studies that include frequency-domain HRV features for stress detection, such as total HF or LF power [66], [67], [68], [86], [114], [115], [116], [117], [118], [119], [120], [121].

ILIOU (n=1):
In the detection of MDs such as depression and anxiety using machine learning techniques, achieving the lowest possible error rate is important so that the person can take further action appropriately. In this respect, the data preprocessing step plays an important role in minimizing noise and bias towards false predictions [42], [108], [122].
Independent component analysis (ICA) (n=4): Independent component analysis (ICA) is a computational and statistical method for uncovering hidden elements underlying random variables, observations, or signals. This method is mostly used for removing stationary noise artifacts from multi-channel data. ICA optimizes higher-order statistics such as kurtosis, while PCA optimizes the covariance matrix of the data, which reflects second-order statistics. In stress detection using physiological signals that contain stationary noise (e.g., eye-blink noise in EEG), it is recommended to remove the noise using ICA [47], [48], [49], [51].
Artifact subspace reconstruction (ASR) (n=1): ASR is an adaptive approach for removing artifacts, mostly non-stationary signal noise, from signal recordings online or offline. To identify artifacts based on their statistical qualities in the component subspace, it repeatedly computes a PCA on covariance matrices [123]. Since EEG data usually contains a lot of non-stationary noise, using ASR before classification is highly recommended when classifying stress at multiple levels from EEG data [49].
Latent Growth Mixture Modeling (LGMM) (n=1): Growth mixture modeling (GMM) aims to discover hidden subpopulations, describe longitudinal development within each hidden subpopulation, and investigate variation in the hidden subpopulations' rates of change. Latent growth mixture models are gaining popularity as a statistical tool for estimating individual development over time and for probing the presence of latent trajectories, in which people belong to trajectories that are not directly observable [46], [124], [125].

Dynamic Time Warping (DTW) (n=1):
It is common practice to transform the data from two time series into vectors and then compute the Euclidean distance between the resulting points in vector space, which compares the series index by index. DTW instead determines the degree of similarity or dissimilarity between two series regardless of whether they vary in timing or speed. The DTW method can be applied to find such similarities that may exist between people in terms of their mood series. As an example, one may compare time series to find whether they match for stress, depression, or anxiety. Moreover, it can be utilized to forecast the mental condition of persons with substantially comparable series patterns [115], [126]. The difference between DTW and Euclidean matching is that, unlike Euclidean matching, DTW considers the distance of each point in one sequence to every point in the other sequence to determine the similarity between them (Figure 5).
Kalman Filter (n=2): The Kalman filter is a technique for making predictions about unknown variables (e.g., missing data) based on observable data. Kalman filters include two iterative steps, predict and update, that are used to estimate states using linear dynamical systems in state-space format. Iterative cycles of predict and update are performed until convergence is achieved [128]. The Kalman filter has been used to handle missing data for stress detection in some studies [129], [130].
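The predict and update cycle of the Kalman filter, applied to missing data, can be illustrated with a scalar filter. The sketch below assumes a random-walk state model and hand-picked noise variances (`process_var`, `obs_var`); these choices, the function name `kalman_fill`, and the toy heart-rate series are assumptions for illustration and are not taken from the cited studies.

```python
def kalman_fill(observations, process_var=1.0, obs_var=4.0):
    """Scalar Kalman filter with a random-walk state model.
    Missing observations (None) skip the update step, so the
    predicted state is reported as the estimate for that time step."""
    estimate, variance = None, None
    filled = []
    for z in observations:
        if estimate is None:
            if z is None:
                filled.append(None)  # nothing observed yet
                continue
            estimate, variance = z, obs_var  # initialise from the first reading
            filled.append(estimate)
            continue
        # Predict: the state carries over and its uncertainty grows
        variance += process_var
        if z is not None:
            # Update: blend prediction and measurement using the Kalman gain
            gain = variance / (variance + obs_var)
            estimate += gain * (z - estimate)
            variance *= (1.0 - gain)
        filled.append(estimate)
    return filled

# Hypothetical heart-rate series with two missing readings
heart_rate = [72.0, 74.0, None, None, 80.0, 79.0]
filled = kalman_fill(heart_rate)
```

With a random-walk model the filter simply carries the last estimate across a gap while its uncertainty grows, so the first reading after the gap is weighted more heavily than it would be otherwise.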

Autoencoders (n=3):
Autoencoders are a type of neural network that learns a representation of the data in lower dimensions than the original data (encoding) by regenerating the input from the encodings (decoding). For data with very high dimensionality, clustering is usually not optimal because of the noise present in the original data. Hence, it is appropriate practice to use the encoded representation of the data, obtained by autoencoders, to obtain lower and more optimized dimensions for clustering [49], [93], [131].

Self-Organizing Map (SOM) (n=3):
In ML, a self-organizing map (SOM) produces a low-dimensional (typically two-dimensional) representation of a high-dimensional dataset while preserving its topology by creating clusters. It is therefore possible to visualize and analyze high-dimensional data more easily (Figure 6) [92], [118], [132].
Wrapper Feature Selection Methods: Wrapper methods evaluate a subset of features by training a model on it. Changes are then made to the feature subset based on the performance of the prior model (Figure 7). Finding the best features using a wrapper method is therefore a search problem. These methods often have high computing costs [133]. Some of the most common wrapper methods are Naïve search, Sequential Forward Feature Selection (SFFS), Sequential Backward Feature Selection (SBFS), and Generalized Sequential Search (GSS) [134]. Some studies used this approach as their feature selection technique [56], [59].
• Chi-square test (n=3): This test checks for independence between categorical features and the target variable. Features with high Chi-square scores are selected, implying a strong association with the target variable, which may be valuable for the model [40], [120], [135].
• Pearson Correlation (n=2): The Pearson linear correlation coefficient quantifies how closely two sets of data are linearly correlated. It indicates how different measures are related to each other with a number between -1 and 1. Therefore, among highly correlated variables, some of them can be removed, as they do not add useful information to ML models [98], [136].
• Minimum Redundancy Maximum Relevance (mRMR) (n=2): The mRMR technique chooses features that have a high correlation with the output (relevance) and a low correlation with one another (redundancy). The F-statistic is used to determine the correlation between features and the output, whilst the Pearson correlation coefficient (for non-time-series features) and Dynamic Time Warping (DTW, for time-series features) may be used to calculate the correlation between features (Figure 9). The objective function, which is a function of relevance and redundancy, is then maximized by selecting features one at a time using a greedy search. The Mutual Information Difference (MID) and Mutual Information Quotient (MIQ) criteria are both frequently employed objective functions that express the difference or quotient between relevance and redundancy [137], [138]. Using this feature selection method, Giannakakis et al. ranked ECG measurements in order of importance as mean HR, LF, NN50, standard deviation of HR, pNN50, LF/HF, RMSSD, HF, and total power [115].
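The greedy relevance-minus-redundancy search described for mRMR can be sketched in a few lines. As a simplifying assumption, the sketch below uses the absolute Pearson correlation for both relevance and redundancy (the text above describes the F-statistic for relevance) and a difference-style objective; the feature names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mrmr(features, target, n_select):
    """Greedy selection maximizing relevance minus mean redundancy."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    while len(selected) < n_select:
        best_name, best_score = None, -math.inf
        for name, col in features.items():
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                # Mean absolute correlation with the features already chosen
                redundancy = sum(
                    abs(pearson(col, features[s])) for s in selected
                ) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected

# Hypothetical features; mean_hr_copy is a shifted duplicate of mean_hr
features = {
    "mean_hr": [60, 62, 65, 80, 85, 90],
    "mean_hr_copy": [61, 63, 66, 81, 86, 91],
    "skin_response": [3, 2, 1, 6, 5, 4],
}
stress_label = [0, 0, 0, 1, 1, 1]
chosen = mrmr(features, stress_label, n_select=2)
```

In this toy data the duplicated heart-rate feature is highly relevant but fully redundant once one copy is selected, so the second pick goes to the skin response feature instead.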
Naïve Bayes (NB) (n=22): The Naïve Bayes algorithm is a supervised, generally parametric, classification method that uses Bayes' theorem as its foundation and makes the naïve assumption of predictor independence. In other words, the Naïve Bayes classifier assumes that, given the class, the presence of a given independent variable used to predict the dependent variable is independent of the presence of any other independent variable.

Decision Tree (DT) (n=23):
Decision Tree is a supervised non-parametric ML algorithm used in classification and regression applications. It comprises a root node, branches, internal nodes, and leaf nodes in a hierarchical, tree-like structure (Figure 11).
Random Forest (RF) (n=36): Random Forest is a supervised non-parametric ensemble learning algorithm that uses many Decision Trees built during the training process. The Random Forest algorithm is used for both classification and regression problems. For classification, the Random Forest's output is the class that the majority of the Decision Trees choose. For regression, the mean of the individual trees' predictions is returned as the output. Using Random Forests, the tendency of decision trees to overfit their training data can be overcome.
Discriminant Analysis (n=6): Discriminant Analysis is a supervised parametric classification algorithm that works with data including a dependent variable and independent variables and is mostly used to classify an observation into a certain group based on the independent variables in the data. Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are the two forms of Discriminant Analysis.

K-nearest neighbors (K-NN) (n=22): K-nearest neighbors (K-NN) is a non-parametric supervised ML algorithm that is used for both classification and regression purposes. In classification, the algorithm determines the label of a new sample not available in the training data by assigning it the label of the majority of the k nearest training data points (Figure 12). In regression, the output for each sample is the average of the values of the k nearest neighbors of that sample (not including the sample itself). In the reviewed literature, K-NN has only been used for classification. In the example of Figure 12, the label "Class C" is assigned to the new (black) datapoint since the majority of the 7 nearest datapoints to the new datapoint are from "Class C".
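The majority-vote rule for K-NN classification can be written compactly. The following is a minimal sketch using Euclidean distance; the function name `knn_predict`, the choice of `k=3`, and the toy samples (heart rate, normalized skin response) are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Assign `query` the majority label among its k nearest training points."""
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training samples: (heart rate, normalized skin response)
points = [(70, 0.20), (72, 0.30), (75, 0.25), (95, 0.80), (98, 0.90), (92, 0.85)]
labels = ["calm", "calm", "calm", "stress", "stress", "stress"]
prediction = knn_predict(points, labels, query=(94, 0.70))
```

In practice the features would normally be scaled first, since heart rate would otherwise dominate the distance over the skin response value.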
Support Vector Machines (SVM) (n=48): Support Vector Machine is a parametric supervised ML algorithm used for both classification and regression problems. It can solve linear problems and, by using non-linear kernels, nonlinear problems. For classification, the SVM algorithm finds a line (or a hyperplane) between each pair of classes of the training data such that the margin between that line or hyperplane and the closest point of each of the two classes is maximized (Figure 13). This is repeated for all pairs of classes in the dataset. The obtained lines are then used as boundaries between the classes. In regression, the SVM tries to find the line/hyperplane that has the maximum number of datapoints within a very small margin of ε (epsilon). That line/hyperplane is then used for regression.

Figure 13. Visual representation of Support Vector Machine algorithm
K-means clustering (n=4): K-means clustering is an unsupervised ML algorithm that aims to arrange objects into groups based on their similarity. To find those similarities, it calculates the distance of each data point to K randomly initialized cluster centroids and assigns each data point to its closest centroid. The location of each centroid is then updated to the average of all datapoints associated with that centroid. This process is repeated until there is no change in the location of the centroids. In ML models for stress detection, K-means clustering has been used in the literature for personalization of the ML models [42], [51] and for labeling the dataset [131], [140].
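The assignment and update loop described above can be sketched as follows. This is a minimal illustration in which the initial centroids are drawn at random from the data; the function name `kmeans`, the seed, and the toy samples are assumptions for the example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids taken from the data
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical samples: (heart rate, normalized skin response)
samples = [(60, 0.20), (62, 0.25), (61, 0.22), (95, 0.80), (97, 0.85), (96, 0.82)]
centroids, clusters = kmeans(samples, k=2)
```

When K-means is used for labeling or personalization, as in the studies cited above, the resulting cluster index of each sample or participant is what gets passed on to the downstream model.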
Neural Network (NN) (n=39): DL methods are a subset of ML methods, and NNs are at the heart of DL algorithms. A neural network is a method for implementing ML that uses interconnected nodes, or neurons, arranged in a layered structure resembling the human brain. The different types of NNs are explained below:
• Artificial Neural Network (ANN): A single perceptron (or neuron) can be thought of as an abstract Logistic Regression. Each layer of an ANN consists of a group of multiple perceptrons, or artificial neurons. Figure 14 shows an ANN with one hidden layer and its working mechanism. In Figure 14(a), the weights labeled (1) and (2) are those of the links connecting the first layer (input layer) to the hidden layer and those of the links connecting the hidden layer to the next layer (output layer), respectively. Figure 14(b) shows how a single neuron works. First, all the outputs of the previous layer are multiplied by the weights of the links connecting them to a given neuron of the next layer, and the products are summed together with a bias (summation and bias step). The result is then passed through an activation function (activation step).
• Convolutional Neural Network (CNN): CNNs are a form of neural network that is especially adept at handling data with a grid-like layout, such as images. Classification and computer vision applications are common uses for CNNs, also called ConvNets (Figure 15).
The purpose of the attention mechanism is to encourage the network to give greater attention to the small but significant portions of the input data by enhancing some portions and diminishing others. Since stress may alter only a small portion of physiological data (e.g., ECG), the attention mechanism can be used with RNNs to detect stress when large datasets are available [141], [87].
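The two steps of a single neuron described in the ANN item above (summation with bias, then activation) can be written out directly. This is a minimal sketch: the sigmoid activation and all weight values are assumptions made for illustration, and no training is performed.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Summation-and-bias step followed by the activation step."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def layer_output(inputs, weight_rows, biases):
    """Apply every neuron of one layer to the same inputs."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Forward pass through an ANN with one hidden layer."""
    hidden = layer_output(inputs, hidden_w, hidden_b)
    return layer_output(hidden, out_w, out_b)


# Hypothetical weights: two inputs, two hidden neurons, one output neuron
x = [1.0, 2.0]
hidden_w = [[0.5, -0.25], [0.1, 0.2]]
hidden_b = [0.0, -0.5]
out_w = [[1.0, -1.0]]
out_b = [0.0]

single = neuron_output(x, hidden_w[0], hidden_b[0])
output = forward(x, hidden_w, hidden_b, out_w, out_b)
print(single, output)
```

With a sigmoid activation, a single neuron computes the same function as a logistic regression model, which is the analogy drawn in the text.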
Other ML techniques (n=19):
• Voting ensemble classifier: In a voting ensemble, the final class label is determined either by the class chosen most frequently by the individual classification models or by the (optionally weighted) average of the output probabilities of those models. In the literature, this method has been used for the detection of PTSD [112] and of stress and stress-related MDs [68], [80], [103], [140], [142].
• Fuzzy C-means (FCM) clustering: FCM is a clustering approach that assigns every data point to all clusters with a certain degree of membership, instead of assigning each point to only one cluster. A data point that is near a cluster's center, for instance, has a high degree of membership in that cluster, while a data point that is distant from the center has a low degree of membership [143]. Since depression and anxiety are not discrete measures, some studies have used FCM as an alternative to other clustering techniques for the detection of these MDs [99], [101].
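The two techniques in the list above can be sketched briefly: majority ("hard") voting, probability-averaging ("soft") voting, and the FCM membership rule for fixed cluster centers. The function names, the fuzzifier value `m = 2`, and the toy numbers are assumptions made for the example; the full FCM algorithm also updates the centers iteratively, which is omitted here.

```python
import math
from collections import Counter


def hard_vote(predicted_labels):
    """Final label is the class predicted most often by the base models."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(probabilities, weights=None):
    """Final label is the class with the highest (weighted) mean probability.

    `probabilities` holds one dict per model, mapping label -> probability.
    """
    weights = weights or [1.0] * len(probabilities)
    total = sum(weights)
    averaged = {
        label: sum(w * p[label] for w, p in zip(weights, probabilities)) / total
        for label in probabilities[0]
    }
    return max(averaged, key=averaged.get), averaged


def fcm_memberships(point, centers, m=2.0):
    """Degree of membership of `point` in each cluster; the degrees sum to 1."""
    distances = [math.dist(point, c) for c in centers]
    zeros = distances.count(0.0)
    if zeros:  # the point coincides with one or more centers
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# Hypothetical outputs of three base classifiers for one sample
model_probs = [
    {"stress": 0.90, "no stress": 0.10},
    {"stress": 0.40, "no stress": 0.60},
    {"stress": 0.45, "no stress": 0.55},
]
hard = hard_vote([max(p, key=p.get) for p in model_probs])
soft, averaged = soft_vote(model_probs)

# A point close to the first of two cluster centers
memberships = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
print(hard, soft, memberships)
```

In this toy case the two voting rules disagree: two of three models lean towards "no stress", but the first model's high confidence tips the averaged probability towards "stress".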
In this article, the recent ML algorithms, preprocessing techniques, and data (e.g., physiological data, questionnaire data, etc.) used in the detection, prediction, and monitoring of stress and the most common MDs (i.e., depression, anxiety, and other stress-related MDs) have been reviewed. Based on this review, we conclude that among classic ML algorithms (excluding DL approaches), the supervised models Support Vector Machine (SVM) and Random Forest (RF) have been used most often and have achieved better performance in terms of model accuracy and robustness (measured by metrics such as the Area Under the Receiver Operating Characteristic curve (AUROC)). The accuracy of ML models is a critical indicator of their utility in real-world applications. The review shows that SVM consistently achieves high accuracy across various data types, including HR, HRV, and skin response. For instance, SVM achieved 93% accuracy with HR, PPG, and skin response data in study [34], and 96% with skin response data in study [140]. These results underscore SVM's robustness in handling complex, non-linear data. Random Forest also shows commendable performance, with an accuracy of 99.88% in study [144], reflecting the strength of ensemble learning in mitigating overfitting and noise. Moreover, among the predictors of stress and stress-related MDs, HR, HRV, and skin response have been used most often (Figure 16). These measures were the major explanatory factors in the ML algorithms used to predict stress and stress-related MDs. DL approaches are also becoming more popular, as they provide capabilities that classic ML algorithms cannot.
Since stress is a time-dependent event, the relationship between different time lags can be important for the detection of stress. Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) can take into account the relationships between datapoints across a time series in their decision making, and they have the potential to enhance detection. Deep learning models, specifically CNNs and LSTMs, show promising results, with CNNs achieving 92.8% accuracy on HRV and ECG data in study [139], indicating their potential for feature-rich physiological data. However, deep learning models require substantial data for training, which may limit their applicability in studies with smaller datasets. Among the selected features, statistical indicators of heart measurements, such as the mean and standard deviation of HR, along with time- and frequency-domain representations of HRV, such as RMSSD and total LF and HF power, were the most widely used. Heart measurements have also been used more often than other measurements because they are unobtrusive, non-invasive, affordable, and easy to measure, and because they explain a large portion of stress events. After these, skin response has been found to be one of the most important factors in the detection of stress and its related disorders. Time-frequency approaches to analyzing time series data are becoming more popular in this area, as they are suitable representations of data for DL approaches, which can be more accurate and robust. As an example, for DL algorithms, RNNs with attention mechanisms can help to find the portions of data related to stress and its related disorders with higher confidence.
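The time-domain heart features named above (mean HR, standard deviation of HR, and RMSSD) can be computed from a series of RR intervals as sketched below. The function name `hrv_features`, the dictionary keys, and the sample intervals are assumptions for illustration; the frequency-domain features (LF and HF power) need a spectral estimate and are not covered here.

```python
import math
from statistics import mean, stdev


def hrv_features(rr_ms):
    """Time-domain features from successive RR intervals given in milliseconds."""
    if len(rr_ms) < 2:
        raise ValueError("at least two RR intervals are required")
    # Instantaneous heart rate in beats per minute for each interval
    hr = [60000.0 / rr for rr in rr_ms]
    # RMSSD: root mean square of successive RR differences
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return {
        "mean_hr_bpm": mean(hr),
        "sd_hr_bpm": stdev(hr),  # sample standard deviation
        "rmssd_ms": rmssd,
    }


# Illustrative RR series (values are made up for the example)
features = hrv_features([800.0, 810.0, 790.0, 800.0])
flat = hrv_features([1000.0, 1000.0, 1000.0])
print(features)
```

In a real pipeline these features are typically computed over sliding windows after artifact and ectopic-beat removal, steps that this sketch leaves out.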
Most of the studies do not interpret their ML models and treat them as black boxes. This limits their contribution to the body of science. SHapley Additive exPlanations (SHAP) is a technique used by some studies to interpret models, for example to evaluate the features, find the most important ones, and determine in what direction each feature affects the predictions. The SHAP correlation plot provides insight into the distribution of the features themselves, as well as into the relationship between their influences on the model. In other words, it gives the importance of each feature for predicting the dependent variable by taking into account both the main effect of that feature and its interaction effects with other features in the data [46], [77], [105], [144], [145], [146].
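The quantity that SHAP attributes to each feature is a Shapley value. For a model with only a few features it can be computed exactly by enumerating all feature subsets, as in the sketch below. This is not the SHAP library and not the procedure of any reviewed study: the single-baseline treatment of "missing" features, the function `shapley_values`, and the two-feature `toy_model` with an interaction term are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical stress score with main effects and one interaction term."""
    hr, skin = features
    return 2.0 * hr + 1.0 * skin + 0.5 * hr * skin


x = [1.0, 2.0]
baseline = [0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split between the two features, which is how main and interaction effects both enter a feature's importance.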
Despite progress in stress detection methodologies, the exploration of personalized models has been limited. Most studies have not gone beyond basic normalization techniques, overlooking the fact that physiological measures are as distinct to individuals as biometric identifiers. A notable exception can be found in a select few studies [51], [113], [147], which have employed more sophisticated personalization techniques, integrating complex data transformations to account for individual variability.
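As a point of reference, the kind of basic normalization mentioned above can be as simple as standardizing each measurement against the same subject's own mean and standard deviation. The sketch below shows only this baseline step, not the more sophisticated personalization techniques of [51], [113], [147]; the function name, the record layout, and the values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_subject(records):
    """Z-score each value against its own subject's mean and standard deviation.

    `records` is a list of (subject_id, value) pairs.
    """
    by_subject = defaultdict(list)
    for subject_id, value in records:
        by_subject[subject_id].append(value)
    stats = {sid: (mean(vals), pstdev(vals)) for sid, vals in by_subject.items()}

    normalized = []
    for subject_id, value in records:
        mu, sigma = stats[subject_id]
        z = (value - mu) / sigma if sigma > 0 else 0.0
        normalized.append((subject_id, z))
    return normalized


# Two hypothetical subjects with different resting heart rates (bpm)
records = [("s1", 60.0), ("s1", 70.0), ("s1", 80.0),
           ("s2", 90.0), ("s2", 100.0), ("s2", 110.0)]
normalized = normalize_per_subject(records)
print(normalized)
```

After this step both subjects share the same scale, so a value of +1.22 means "high for this person" regardless of the person's absolute baseline.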

Strengths of the review
In undertaking this scoping review, we have explored the applications of machine learning (ML) in the field of stress detection in a manner that is both comprehensive and detailed. The review lays out a landscape in which diverse data types are not merely cataloged but analyzed for their roles and interconnections within the broader context of methodological approaches. This provides a robust understanding of the field's current state and its complexities.
This review has documented a comprehensive assessment of various physiological measurement techniques, including heart rate variability (HRV), electroencephalograms (EEG), and electrocardiograms (ECG). This assessment is not just a recounting of the types of data employed in the literature but a consideration of how each contributes to a multifaceted understanding of stress indicators. It acknowledges that the signals of stress are as complex as the condition itself, necessitating a wide range of investigative tools.
The review also examines a range of advanced preprocessing techniques, such as mRMR, SOM, SMOTE, and PCA. This examination sheds light on how different studies leverage these methods to refine the quality of the data fed into ML models, thereby potentially enhancing the models' accuracy and reliability in detecting stress. It illustrates how sophisticated data treatment can lead to more nuanced insights, even though our own methodology did not directly employ these techniques.

Limitations
Our scoping review acknowledges its inherent constraints, including a possible selection bias due to potential omissions of pertinent studies. It serves as a contemporary cross-section of the rapidly evolving domains of machine learning and mental health, underscoring the need for periodic scholarly review to sustain its relevance and precision. While we survey a broad spectrum of machine learning techniques applied to stress detection, we do not extensively assess their efficacy, which leaves room for future empirical investigations to evaluate these methods across diverse data cohorts and settings. Additionally, while we address the preprocessing techniques and their impact on model performance, our discussion does not delve into detailed technical analysis. Finally, the crucial issue of model interpretability is touched upon but not explored in depth, presenting an opportunity for further research.

Conclusions and Future Directions
The pivotal insights from this review underscore the potential of ML to redefine the approach to mental health care, particularly in the diagnosis and management of stress-related conditions and MDs. As we have discerned, there is an expansive field ripe for further exploration, with research gaps suggesting a number of promising directions. Guided by these insights, we can now chart a course for future research that not only expands the boundaries of our scientific understanding but also translates into tangible improvements in clinical practice.

Real-time and Naturalistic ML Applications
The scarcity of real-time studies in naturalistic settings has highlighted the importance of developing ML models that accurately reflect and respond to the complexities of real life. Future research must prioritize the creation of algorithms capable of operating amidst the unpredictability of daily life, providing immediate insights and adaptable interventions. These models hold the potential to transform practice by offering tools that can preemptively identify stress and MD symptoms, enabling clinicians to intervene before conditions worsen.

Temporal Data and Deep Learning
Our review illuminates the untapped potential of time series data in capturing the evolution of stress and MDs. Deep learning techniques, specifically designed to interpret complex, sequential data, could lead to breakthroughs in how we understand and predict mental health trajectories. For practice, this means more sophisticated diagnostic tools that can provide a nuanced picture of a patient's mental health over time, enabling personalized treatment plans that are responsive to the patient's changing condition.

Personalization in ML Models
The need for individualized care in mental health cannot be overstated. The heterogeneity of stress responses and MD symptoms calls for personalized ML models tailored to individual physiological and behavioral patterns. Future research should focus on leveraging multi-task learning to refine algorithms that adapt to individual baselines, enhancing the personalization of care. For clinicians, this means access to tools that can more accurately reflect and respond to the unique needs of each patient, reducing the risk of misdiagnosis and improving treatment efficacy.
Predictive analytics can be instrumental in identifying key factors that contribute to misdiagnosis and delayed help-seeking. Future studies should build on this knowledge to inform the creation of interventions that encourage timely and accurate diagnosis. In practice, this could lead to the development of targeted screening tools that assist clinicians in recognizing at-risk individuals more effectively. The integration of clinical expertise with ML innovation is crucial for the development of tools that are both advanced and clinically relevant. Collaboration between healthcare professionals, patients, and AI developers will be essential in creating user-centered tools that address real-world needs. This collaborative approach will likely result in AI applications that are more intuitive and effective in clinical settings.

Figure 1. Preferred items for scoping literature review and meta-analysis flowchart [35]

Figure 3. (a) Depiction of the heart's beat-to-beat measurements using the Blood Volume Pulse (BVP) signal. (b) Power Spectral Density (PSD) of RR intervals (the signal is bandpass filtered with cut-off frequencies of 0.04 Hz and 0.4 Hz)

Figure 7. Steps of a wrapper feature selection method

Figure 8. Steps of a filter feature selection method

Figure 9. Calculation of relevance and redundancy for (a) non-time-series features and (b) time-series features (DTW: Dynamic Time Warping)

Figure 10. Number of articles for each ML model

Figure 11. Structure of a Decision Tree

Figure 12. Example of K-NN classification with K = 7. In this example, the label of "Class C" is assigned to the new (black) datapoint since the majority of the 7-nearest datapoints to the new datapoint are from "Class C".

Figure 14. (a) Representation of an ANN with one hidden layer; the weights labeled (1) and (2) are those of the links connecting the input layer to the hidden layer and the hidden layer to the output layer, respectively. (b) Representation of how a single neuron works.

Figure 15. Representation of CNN for physiological signal

Figure 16. Distribution of ML models used for each type of data. In this figure, skin response and heart measures (including HR, HRV, and blood pressure) are shown separately due to their high usage and importance in the literature. Other psychophysiological measures include EEG, EMG, eye-tracking, and respiratory signals. Activity includes body movement. Sentiment data includes speech and text data. Finally, perceived measures include questionnaire and self-report data. Legend: SVM, NN, RF, LR, DT, KNN, NB, Boosting, LDA/QDA, K-means, Fuzzy.

Table 1. Keywords and search strategy for articles since 2017 (last 5 years)

Distribution of Articles by Type of Data
Iliou et al. proposed ILIOU, a data mapping and transformation method that identifies useful information for the detection of MDs, especially depression. This method outperforms common data preprocessing techniques such as Principal Component Analysis (PCA), Evolutionary Search Algorithm (ESA), and Isomap for the detection of depression [99].
Principal Component Analysis (PCA) (n=3): PCA is a method for lowering the dimensionality of datasets while maximizing interpretability and minimizing information loss. It does this by generating new variables that are uncorrelated and that successively maximize variance.
Cong et al. introduced X-A-BiLSTM, a DL model that combines XGBoost (to filter data and handle imbalanced data) with an Attention Bi-LSTM neural network (an LSTM with forward and backward memory and an attention mechanism), used for stress classification from text data.
The attention mechanism in RNNs is a new technique that is becoming popular for finding anomalies in physiological signals. However, based on the reviewed literature, this mechanism has only been used on text data (not on physiological signals) to detect stress. Therefore, the attention mechanism is a technology that can be further utilized to detect stress from physiological signals.
Unsupervised ML (and DL) algorithms such as clustering techniques have been used mostly in the preprocessing step, to label the data (if labels are not available) and to find a representation of the data that achieves the best performance in detection algorithms. For data preprocessing, feature selection (i.e., filter and wrapper methods) and feature extraction techniques have been commonly used. In feature extraction approaches, latent representations of data obtained by transformations, such as the output of the encoder in autoencoders, have been useful for removing noise and making the data more compact, which makes further computations more efficient. PCA and ICA are the other most common feature extraction approaches used in the literature.
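To make the PCA description concrete, the sketch below finds the principal components of two-dimensional data, using the closed-form eigen-decomposition of the 2x2 covariance matrix. It is restricted to two features for brevity, and the function name `pca_2d` and the data are assumptions made for the example; real applications would normally use a numerical library and handle any number of features.

```python
import math


def pca_2d(points):
    """First principal direction, both variances, and projected scores for 2-D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix in closed form
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    gap = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    lam1, lam2 = trace / 2.0 + gap, trace / 2.0 - gap

    # Eigenvector belonging to the larger eigenvalue
    if sxy != 0.0:
        direction = (lam1 - syy, sxy)
    else:
        direction = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(direction[0], direction[1])
    direction = (direction[0] / norm, direction[1] / norm)

    # New variable: projection of each centered point on the first component
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, (lam1, lam2), scores


# Perfectly correlated toy features lying on the line y = 2x
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, variances, scores = pca_2d(points)
explained = variances[0] / (variances[0] + variances[1])
print(direction, explained)
```

Here one component captures essentially all of the variance, so the two original features can be replaced by a single new variable with almost no information loss.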