[Paper Review] LSTM-CNN Architecture for Human Activity Recognition

gredora · March 3, 2023


https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9043535

Abstract

The paper proposes a deep neural network that combines convolutional layers with LSTM to extract activity features and classify them in mobile and wearable computing scenarios. The proposed model replaces the fully-connected layer with a global average pooling (GAP) layer and adds batch normalization (BN) to speed up convergence. The model achieves high accuracy and robustness on three public datasets, outperforming previous results with fewer parameters while adaptively extracting activity features.

Introduction

Human activity recognition (HAR) has become a popular research field due to its ability to extract features from daily activities and provide a basis for other intelligent applications. HAR technology has been widely used in various fields, including home behavior analysis, video surveillance, gait analysis, and gesture recognition. Sensor-based HAR has grown more popular with the development of sensor technology and ubiquitous computing. HAR methods fall into two categories: approaches based on fixed sensors and approaches based on mobile sensors. Mobile-sensor-based methods using accelerometers, gyroscopes, and magnetometers have received widespread attention due to their portability and high acceptance in daily life.

Early research on HAR mainly used traditional machine learning methods such as decision trees, SVMs, and naïve Bayes. However, these methods rely heavily on manual feature extraction and human domain knowledge. To address this problem, researchers have turned to deep learning methods that can automatically extract appropriate features from raw sensor data. Several deep learning models, including CNNs, LSTMs, and their combinations, have been proposed for HAR, but they tend to have relatively complex network structures and large numbers of parameters, resulting in high computational cost. To address these shortcomings, this paper proposes a novel deep neural network for HAR called LSTM-CNN, which extracts activity features automatically and classifies them with few parameters. The model's performance is evaluated on three widely used public datasets, demonstrating high accuracy, good generalization ability, and fast convergence.

Dataset Description

UCI-HAR

  • consists of 6 basic activities, recorded by 30 subjects wearing a smartphone with embedded inertial sensors, and also includes postural transitions.

WISDM

  • contains 6 activities, recorded by 36 subjects carrying an Android phone in their front pants pocket. The dataset is unbalanced, with walking being the most common activity.

OPPORTUNITY

  • includes 17 sporadic gestures (18 classes including the Null class), recorded in a sensor-rich environment by 4 subjects performing morning activities. Sensors were placed on the body, on objects, and in the environment; the paper also illustrates the placement of the on-body sensors used.

Data Pre-Processing

In order to prepare the raw data collected by motion sensors for input to the proposed LSTM-CNN model, pre-processing steps were performed.

  1. Linear interpolation was used to fill in missing values indicated by NaN/0.
  2. Scaling and normalization were applied to bring the data into the range of 0 to 1.
  3. Segmentation was performed using a sliding window with an overlap rate of 50%, with a window size of 128 for the UCI-HAR and WISDM datasets and 24 for the OPPORTUNITY dataset. Segmentation preserves the temporal relationship between data points in an activity; the resulting short time series are fed into the LSTM-CNN model as input (see the sketch after this list).
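The paper does not publish code, so the following is a minimal sketch of these three steps in Python with NumPy/pandas. The function name, the treatment of 0 as a missing-value marker, and the majority-vote labeling of each window are my own conventions, not the authors':

```python
import numpy as np
import pandas as pd

def preprocess(raw, labels, window_size=128, overlap=0.5):
    """Sketch of the three pre-processing steps on a (T, C) sensor stream."""
    # 1. Linear interpolation over missing values. Per the description
    #    above, 0 is treated here as a missing-value marker (assumption).
    df = pd.DataFrame(raw).replace(0, np.nan).interpolate(limit_direction="both")
    # 2. Min-max scaling of each channel into the range [0, 1].
    data = ((df - df.min()) / (df.max() - df.min())).to_numpy()
    # 3. Sliding-window segmentation with 50% overlap
    #    (window_size=128 for UCI-HAR/WISDM, 24 for OPPORTUNITY).
    step = int(window_size * (1 - overlap))  # e.g. 64 for a 128-sample window
    windows, window_labels = [], []
    for start in range(0, len(data) - window_size + 1, step):
        windows.append(data[start:start + window_size])
        # One label per window via majority vote (a common convention;
        # the paper does not spell out its labeling rule).
        vals, counts = np.unique(labels[start:start + window_size],
                                 return_counts=True)
        window_labels.append(vals[np.argmax(counts)])
    return np.array(windows), np.array(window_labels)
```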

Proposed Architecture

The model consists of eight layers, including two LSTM layers for extracting temporal features, two convolutional layers for extracting spatial features, a max-pooling layer, a global average pooling layer, a batch normalization layer, and an output layer with a Softmax classifier.

The two LSTM layers extract temporal features, with 32 memory cells in each layer.

The convolutional layers employ ReLU activation and have 64 and 128 convolution kernels, respectively.

The global average pooling layer replaces the fully-connected layer, and the batch normalization layer is added after the GAP layer to accelerate the convergence of the model.

The output layer consists of a fully-connected layer and a Softmax classifier, which converts the output of the upper layer into a probability vector. The model is trained and tested on three public datasets, including UCI-HAR, WISDM, and OPPORTUNITY.
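A Keras sketch of the stack as described above (LSTM ×2 → Conv ×2 with max-pooling → GAP → BN → Softmax). The convolution kernel size, pooling size, and input shape (9 channels, as in UCI-HAR) are not restated in the review, so they are assumptions here:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lstm_cnn(window_size=128, n_channels=9, n_classes=6,
                   kernel_size=5):  # kernel/pool sizes are assumptions
    return models.Sequential([
        layers.Input(shape=(window_size, n_channels)),
        # Two LSTM layers with 32 memory cells each (temporal features).
        layers.LSTM(32, return_sequences=True),
        layers.LSTM(32, return_sequences=True),
        # Two convolutional layers with ReLU and 64/128 kernels
        # (spatial features), with max-pooling in between.
        layers.Conv1D(64, kernel_size, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size, activation="relu"),
        # GAP replaces the fully-connected layer; BN after it
        # accelerates convergence.
        layers.GlobalAveragePooling1D(),
        layers.BatchNormalization(),
        # Output layer: fully-connected layer + Softmax probability vector.
        layers.Dense(n_classes, activation="softmax"),
    ])
```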

Experimental Results

The proposed network was built using Keras, a high-level neural networks API written in Python, with TensorFlow as the backend. The model was trained in a fully-supervised manner using cross-entropy loss and the Adam optimizer. The batch size was set to 192, the number of epochs to 200, and the learning rate to 0.001. The model was trained on a PC running Ubuntu, with the gradient back-propagated from the Softmax layer to the LSTM layer.
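Those settings map directly onto Keras. This is a sketch assuming integer-encoded labels (one-hot labels would use categorical_crossentropy); X_train, y_train, X_test, and y_test are the placeholder outputs of the pre-processing sketch:

```python
import tensorflow as tf

# Reusing build_lstm_cnn from the architecture sketch above.
model = build_lstm_cnn(window_size=128, n_channels=9, n_classes=6)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",  # cross-entropy loss
    metrics=["accuracy"],
)
model.fit(X_train, y_train,
          batch_size=192, epochs=200,
          validation_data=(X_test, y_test))
```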

The WISDM and OPPORTUNITY datasets are imbalanced, so overall classification accuracy is not an appropriate measure of performance. The F-measure (F1 score) is a more useful indicator, as it takes both false positives and false negatives into account by combining two measures, precision and recall, which are defined in terms of correctly recognized samples. The weighted F1 score offsets class imbalance by weighting each class by its proportion of samples: F1 = Σᵢ 2·wᵢ · (precisionᵢ × recallᵢ) / (precisionᵢ + recallᵢ), where wᵢ = nᵢ/N is the fraction of samples belonging to class i.
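For reference, this weighted F1 is one call in scikit-learn; model, X_test, and y_test below are placeholders carried over from the training sketch:

```python
import numpy as np
from sklearn.metrics import f1_score

# Predicted class = argmax over the Softmax probability vector.
y_pred = np.argmax(model.predict(X_test), axis=1)
# average="weighted" weights each class's F1 by its share of samples.
f1 = f1_score(y_test, y_pred, average="weighted")
print(f"weighted F1: {f1:.4f}")
```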

The performance of the LSTM-CNN model was evaluated on the three public datasets. The confusion matrices obtained from the test sets showed high overall accuracy: 95.80% on UCI-HAR, 95.75% on WISDM, and 92.63% on OPPORTUNITY. The model was compared with CNN and DeepConvLSTM models, with the F1 score used to ensure fairness and consistency of the results. The LSTM-CNN model outperformed the other two models on the UCI-HAR and WISDM datasets, improving on the best previously reported results by an average of 3%. It also showed significant improvement on the OPPORTUNITY dataset, gaining about 7% over the CNN model of Yang et al. The authors conclude that using a GAP layer instead of a fully-connected layer brings significant advantages in HAR tasks, and that the proposed method performs well across different public datasets.

Impact of Structure

Five different model architectures were compared, including a classical convolutional neural network (CNN) with a fully-connected layer, a CNN with a global average pooling (GAP) layer, a CNN with a GAP layer and batch normalization (BN) layer, a combination of two LSTM layers and CNN layers, and the proposed LSTM-CNN architecture. The results showed that replacing the fully-connected layer with a GAP layer significantly reduced the number of model parameters while maintaining the same performance. The addition of a BN layer further improved the model's accuracy. The combination of two LSTM layers and CNN layers outperformed the other architectures by 1%. Finally, the proposed LSTM-CNN architecture achieved an F1 score of 95.78% on the test set. It was also found that the use of LSTM layers slowed down the computation speed of the model due to the dependence on the output of the previous time step.
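To make the parameter saving concrete, here is a toy comparison of the two classifier heads. The feature-map shape is illustrative, not the paper's exact configuration:

```python
from tensorflow.keras import layers, models

feature_map = (28, 128)  # (time steps after conv/pool, channels) -- assumed

fc_head = models.Sequential([
    layers.Input(shape=feature_map),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),  # fully-connected layer
    layers.Dense(6, activation="softmax"),
])
gap_head = models.Sequential([
    layers.Input(shape=feature_map),
    layers.GlobalAveragePooling1D(),       # no trainable parameters
    layers.Dense(6, activation="softmax"),
])
print(fc_head.count_params())   # 459,654 weights for the FC head
print(gap_head.count_params())  # 774 weights (128*6 + 6) for the GAP head
```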

Impact of Hyperparameters

It was found that the Adam optimizer performed the best, and increasing the number of filters in the second convolutional layer improved accuracy but increased the model parameters. The optimal batch size was found to be 192. The results highlight the importance of selecting appropriate hyper-parameters to optimize the performance of deep learning models.

Conclusion

The paper proposes a deep neural network for human activity recognition that combines convolutional layers with LSTM. A GAP layer is used to replace the fully-connected layer behind the convolutional layer, which reduces the model parameters while maintaining high accuracy. A BN layer is added after the GAP layer to speed up the convergence of the model. The proposed architecture is capable of learning temporal dynamics on various time scales, and the F1 score is used to evaluate the model's performance on three public datasets. The impact of hyper-parameters on model performance is also explored. The proposed LSTM-CNN model outperforms other methods and has good generalization.

Comment

This was my first time reading a paper about human activity recognition. The paper provides a detailed analysis of a deep neural network for HAR that combines convolutional layers with LSTM. The authors propose a novel architecture that replaces the fully-connected layer with a global average pooling (GAP) layer and adds a batch normalization (BN) layer to stabilize the output of the upper layer. The study also explores the impact of hyper-parameters such as the number of filters, optimizer type, and batch size on model performance. The results show that the proposed model outperforms other methods in the literature and achieves high recognition accuracy with few parameters. Overall, the paper offers valuable insight into CNN, LSTM, BN, and GAP and their application to human activity recognition.
