https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9043535
The paper proposes a deep neural network that combines convolutional layers with LSTM layers to extract and classify activity features in mobile and wearable computing scenarios. The model replaces the fully-connected layer with a global average pooling (GAP) layer and adds batch normalization (BN) to speed up convergence. It achieves high accuracy and robustness on three public datasets, outperforming previous results with fewer parameters while adaptively extracting activity features.
Human activity recognition (HAR) has become popular due to its ability to extract features from daily activities and provide a basis for other intelligent applications. HAR technology has been widely used in various fields, including home behavior analysis, video surveillance, gait analysis, and gesture recognition. Sensor-based HAR has become more popular with the development of sensor technology and ubiquitous computing. HAR methods fall into two categories: approaches based on fixed sensors and approaches based on mobile sensors. Mobile sensor-based methods using accelerometers, gyroscopes, and magnetometers have received widespread attention due to their portability and high acceptance in daily life.
Early research on human activity recognition (HAR) mainly used traditional machine learning methods such as decision tree, SVM, and naïve Bayes. However, these methods rely heavily on manual feature extraction and human domain knowledge. To address this problem, researchers have turned to deep learning methods that can automatically extract appropriate features from raw sensor data. Several models using deep learning methods, including CNN, LSTM, and their combination, have been proposed for HAR. However, these models have a relatively complex overall network structure and a large number of parameters, resulting in high computational cost. To address these shortcomings, this paper proposes a novel deep neural network for HAR called LSTM-CNN, which can extract activity features automatically and classify them with few parameters. The model's performance is evaluated on three widely used public datasets, demonstrating high accuracy, good generalization ability, and fast convergence speed.
Three public datasets are used for evaluation: UCI-HAR, WISDM, and OPPORTUNITY.
The raw data collected by the motion sensors were pre-processed before being fed to the proposed LSTM-CNN model.
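The paper's exact pre-processing parameters are not repeated here, but sensor-based HAR pipelines typically segment the continuous signal into fixed-length overlapping windows. A minimal sketch, assuming a 128-sample window with 50% overlap (the UCI-HAR convention; the paper's own values may differ):

```python
import numpy as np

def sliding_windows(signal, window=128, step=64):
    """Split a (T, channels) signal into overlapping (window, channels) segments."""
    n = (len(signal) - window) // step + 1
    return np.stack([signal[i * step : i * step + window] for i in range(n)])

# Dummy recording: 1000 time steps, 9 sensor channels
data = np.random.randn(1000, 9)
windows = sliding_windows(data)
print(windows.shape)  # (14, 128, 9)
```

Each resulting window becomes one training sample for the network, labeled with the activity performed during that interval.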
The model consists of eight layers, including two LSTM layers for extracting temporal features, two convolutional layers for extracting spatial features, a max-pooling layer, a global average pooling layer, a batch normalization layer, and an output layer with a Softmax classifier.
The LSTM layer is used for the extraction of temporal features and has 32 memory cells in each layer.
The convolutional layers employ ReLU activation and have 64 and 128 convolution kernels, respectively.
The global average pooling layer replaces the fully-connected layer, and the batch normalization layer is added after the GAP layer to accelerate the convergence of the model.
The output layer consists of a fully-connected layer and a Softmax classifier, which converts the output of the upper layer into a probability vector. The model is trained and tested on three public datasets, including UCI-HAR, WISDM, and OPPORTUNITY.
The proposed network structure was built using Keras, a high-level neural networks API written in Python, with TensorFlow used as the backend. The model was trained in a fully-supervised manner using cross-entropy loss and Adam optimizer. The batch size was set to 192 and the number of epochs was 200, with a learning rate of 0.001. The model was trained on a PC with Ubuntu operating system and the gradient was back-propagated from the Softmax layer to the LSTM layer.
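A minimal Keras sketch of the architecture and training setup described above. The layer order and sizes follow the summary (two 32-cell LSTM layers, convolutions with 64 and 128 kernels, max-pooling, GAP, BN, Softmax output; Adam with learning rate 0.001 and batch size 192); the convolution kernel size and pooling size are illustrative assumptions, since the summary does not state them, and the input shape follows the UCI-HAR convention:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm_cnn(timesteps=128, channels=9, n_classes=6):
    return keras.Sequential([
        layers.Input(shape=(timesteps, channels)),
        layers.LSTM(32, return_sequences=True),   # temporal features
        layers.LSTM(32, return_sequences=True),
        layers.Conv1D(64, 5, activation="relu"),  # spatial features
        layers.Conv1D(128, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.GlobalAveragePooling1D(),          # replaces the FC layer
        layers.BatchNormalization(),              # speeds up convergence
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_lstm_cnn()
# Training setup per the paper: Adam, lr 0.001, batch size 192, cross-entropy.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Smoke test on random data (the paper trained for 200 epochs on real data).
X = np.random.randn(64, 128, 9).astype("float32")
y = keras.utils.to_categorical(np.random.randint(0, 6, 64), 6)
model.fit(X, y, batch_size=192, epochs=1, verbose=0)
```

Because the GAP layer collapses each feature map to a single average, the classifier head carries far fewer weights than a flattened fully-connected head would.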
The WISDM and OPPORTUNITY datasets are class-imbalanced, so overall classification accuracy is not an appropriate performance measure for them. The F1 score (F-measure) is a more informative indicator than accuracy because it accounts for both false positives and false negatives by combining precision and recall, two measures defined from the number of correctly recognized samples. The weighted F1 score offsets class imbalance by weighting each class according to its proportion of samples; the formula is given in the paper.
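The weighted F1 score referred to above can be written as follows (this is the standard definition, reconstructed here rather than copied from the paper, with $w_i$ the fraction of samples belonging to class $i$):

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}

F_1 = \sum_i 2\, w_i \cdot
      \frac{\mathrm{Precision}_i \times \mathrm{Recall}_i}
           {\mathrm{Precision}_i + \mathrm{Recall}_i},
\qquad w_i = \frac{n_i}{N}
```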
The performance of the LSTM-CNN model was evaluated using the three public datasets. The confusion matrices obtained from the test sets showed high overall accuracy: 95.80% for UCI-HAR, 95.75% for WISDM, and 92.63% for OPPORTUNITY. The model was compared with CNN and DeepConvLSTM models, using the F1 score to ensure fairness and consistency of the results. The LSTM-CNN model outperformed the other two models on the UCI-HAR and WISDM datasets, improving on the best-reported result by an average of 3%. It also showed significant improvements on the OPPORTUNITY dataset, with an increase of about 7% over the CNN model of Yang et al. The authors conclude that using a GAP layer instead of a fully-connected layer brings significant advantages in HAR tasks, and that the proposed method performs well across different public datasets.
Five different model architectures were compared, including a classical convolutional neural network (CNN) with a fully-connected layer, a CNN with a global average pooling (GAP) layer, a CNN with a GAP layer and batch normalization (BN) layer, a combination of two LSTM layers and CNN layers, and the proposed LSTM-CNN architecture. The results showed that replacing the fully-connected layer with a GAP layer significantly reduced the number of model parameters while maintaining the same performance. The addition of a BN layer further improved the model's accuracy. The combination of two LSTM layers and CNN layers outperformed the other architectures by 1%. Finally, the proposed LSTM-CNN architecture achieved an F1 score of 95.78% on the test set. It was also found that the use of LSTM layers slowed down the computation speed of the model due to the dependence on the output of the previous time step.
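The parameter saving from swapping the fully-connected head for GAP, noted in the comparison above, can be illustrated with a toy measurement (the feature-map size and dense width here are illustrative assumptions, not the paper's exact values):

```python
from tensorflow import keras
from tensorflow.keras import layers

def head_params(head_layers):
    """Count trainable parameters of a classifier head on a (30, 128) feature map."""
    model = keras.Sequential([
        layers.Input(shape=(30, 128)),   # assumed feature-map shape after the convolutions
        *head_layers,
        layers.Dense(6, activation="softmax"),
    ])
    return model.count_params()

# Flatten + dense head vs. global-average-pooling head
fc_params = head_params([layers.Flatten(), layers.Dense(128, activation="relu")])
gap_params = head_params([layers.GlobalAveragePooling1D()])
print(fc_params, gap_params)  # the dense head needs orders of magnitude more parameters
```

The GAP head's parameter count is just that of the final Softmax layer, which is why replacing the fully-connected layer shrinks the model so dramatically while leaving the convolutional feature extractor unchanged.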
It was found that the Adam optimizer performed the best, and increasing the number of filters in the second convolutional layer improved accuracy but increased the model parameters. The optimal batch size was found to be 192. The results highlight the importance of selecting appropriate hyper-parameters to optimize the performance of deep learning models.
The paper proposes a deep neural network for human activity recognition that combines convolutional layers with LSTM. A GAP layer is used to replace the fully-connected layer behind the convolutional layer, which reduces the model parameters while maintaining high accuracy. A BN layer is added after the GAP layer to speed up the convergence of the model. The proposed architecture is capable of learning temporal dynamics on various time scales, and the F1 score is used to evaluate the model's performance on three public datasets. The impact of hyper-parameters on model performance is also explored. The proposed LSTM-CNN model outperforms other methods and has good generalization.
It was my first time reading a paper about human activity recognition. This paper provides a detailed analysis of a deep neural network for HAR using a combination of convolutional layers and LSTM. The authors propose a novel architecture that replaces the fully-connected layer with a global average pooling (GAP) layer and includes a batch normalization (BN) layer to stabilize the output of the upper layer. The study also explores the impact of hyper-parameters such as the number of filters, optimizer type, and batch size on model performance. The results show that the proposed model outperforms other methods in the literature and achieves high recognition accuracy while using few model parameters. Overall, this paper provides valuable insight into the concepts of CNN, LSTM, BN, and GAP and their application in human activity recognition.