보름달 🌝 #2.1 Machine Learning - Menstrual Cycle Prediction (ENG)

보름달 · November 26, 2020

1. Introduction

    Menstrual cycle prediction is the core of the 보름달 service, because the menstrual cycle is the most important factor in female health. The cycle itself is an indicator of female health, and women need to know their own cycles so they can prepare for menstruation and plan for pregnancy. However, busy modern people find it hard to keep track of their cycles.

    보름달 uses the most recent cycles entered by the user to predict the user's next menstrual cycle and ovulation date. The goal is to predict cycles not only for users with regular cycles, but also for users with irregular cycles or cycles that have recently changed. As user-entered data accumulates, users will get the information they need, such as more accurate cycle and ovulation date predictions.

2. Datasets


    The raw data behind Open Cycle: Forecasting Ovulation for Family Planning covers cycle data for 1798 people and is cited in many studies, so we tried to obtain that dataset. We emailed many researchers, but they replied that they did not have access to the data. We therefore found another dataset as an alternative. Although we thought it was too small, we decided to use it, judging it the best option available. One problem was that this dataset contains only regular cycles. Since there were not enough cases of irregular cycles with patterns, or of recently changed cycles, we also used fake (synthetic) datasets.

    We used a Python Jupyter notebook to build a cleaned dataset from the raw data. We excluded people with only one or two cycles, because they do not have enough accumulated data. The cleaned dataset has rows with null values removed and includes the attributes that users enter in the application: ClientID, CycleNumber, LengthofCycle, MeanCycleLength, EstimatedDayofOvulation, LengthofMenses, MeanMensesLength, Age, NumberPregnancies, and BMI.
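
    The cleaning steps described above could look roughly like the following sketch. It is not the notebook we actually used: the rows are invented, only a few of the attribute names are shown, and both the `clean` helper and the order of the two filters (nulls first, then the cycle-count threshold) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical raw rows; values are invented for illustration only.
raw_rows = [
    {"ClientID": "a", "CycleNumber": 1, "LengthofCycle": 28, "LengthofMenses": 5},
    {"ClientID": "a", "CycleNumber": 2, "LengthofCycle": 29, "LengthofMenses": None},
    {"ClientID": "a", "CycleNumber": 3, "LengthofCycle": 27, "LengthofMenses": 4},
    {"ClientID": "a", "CycleNumber": 4, "LengthofCycle": 28, "LengthofMenses": 5},
    {"ClientID": "b", "CycleNumber": 1, "LengthofCycle": 30, "LengthofMenses": 6},
    {"ClientID": "b", "CycleNumber": 2, "LengthofCycle": 31, "LengthofMenses": 5},
]


def clean(rows, min_cycles=3):
    """Drop rows containing a null, then drop clients with fewer than min_cycles rows."""
    complete = [r for r in rows if all(v is not None for v in r.values())]
    counts = Counter(r["ClientID"] for r in complete)
    return [r for r in complete if counts[r["ClientID"]] >= min_cycles]


cleaned = clean(raw_rows)
print(len(cleaned), sorted({r["ClientID"] for r in cleaned}))
```

    In this toy input, client "b" has only two cycles and is excluded, and the row of client "a" with a missing LengthofMenses is dropped.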

3. Feature Engineering

    Feature engineering is the whole process of selecting the features fed into a model in order to improve its performance. We used the following methodology:

NN (Neural Network)
A neural network is a learning algorithm inspired by biological neural networks. It is a nonlinear model in which artificial neurons, connected by synapses into a network, acquire problem-solving ability by changing the strength of those synapses through learning.
source: LG CNS blog

    First, we learned how to use TensorFlow.js and, as a first practice example, built a simple nonlinear regression model that predicts the estimated day of ovulation of a cycle from its length. We started by plotting the relationship between the two, together with the model's predicted values.


   As we can see, the estimated day of ovulation appears to be linked to the cycle length: the ovulation day increases as the cycle length increases. The relationship between these two variables seems relevant.

   In TensorFlow, the number of epochs is the number of full passes over the training data, the batch size is the number of samples used in one training step, and the number of iterations is the number of batches trained in one epoch. Since our dataset is limited, one epoch is not enough. However, too many epochs can lead to overfitting, so we plotted the loss value against the number of epochs and the batch size. The loss on this model decreases significantly throughout training, reaching a final value of 0.0092 during training and 0.0096 during testing. For example, with a cycle length of 28 the model predicts ovulation on the 15th day, which matches the graph generated previously.
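
   The relationship between epochs, batch size, and iterations can be made concrete with a little arithmetic. The sample count, batch size, and epoch count below are hypothetical and are not the settings we trained with.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)


def total_updates(num_samples, batch_size, epochs):
    """Total number of training steps over all epochs."""
    return iterations_per_epoch(num_samples, batch_size) * epochs


# Hypothetical settings, for illustration only.
print(iterations_per_epoch(1000, 32))  # 32
print(total_updates(1000, 32, 50))     # 1600
```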

    We then did the same with other attributes of the dataset to look for relationships between them. We first tried to predict the ovulation day from the length of menses.

   Here we cannot observe a clear relationship between the two. The ovulation day seems to come earlier for lower values of the length of menses. However, the range of ovulation days is very broad for every value of the LengthofMenses variable.


   We also tried to predict the length of menses of a cycle from the cycle length, but could not observe a clear relationship between the two.

    We obtained similar results when using BMI, Age, and NumberPregnancies one at a time to predict the ovulation day. Moreover, when using LengthofCycle, LengthofMenses, BMI, Age, and NumberPregnancies together to predict the day of ovulation, we obtained results similar to those from LengthofCycle alone.

   As we can see on the graph, at the end of training we had a loss of 0.0093, and also 0.0093 on testing. This is not an improvement over the training loss of 0.0092 obtained before. The test loss was slightly lower (0.0093 versus 0.0096) and closer to the training loss when using these 5 features, but the difference is too small to be significant or to draw conclusions from.

    This nonlinear regression work served as “training” for us: we explored the features and learned how to clean a dataset, import it, create an AI model, and train it on the dataset. After it, we had to move to other AI models and use the dataset differently, for several reasons.
   First, our goal is not to wait until the end of a cycle to predict an ovulation day that has already happened. That would be pointless, because the user needs a prediction for the next cycle, not for a cycle that has already ended.
   Our AI models need to predict the length of the next cycle, the length of menses, and the estimated day of ovulation from the data obtained about previous cycles.

    Previous research papers that we read used basal body temperature (BBT) as their main feature to predict the day of ovulation, as it gives a clear and precise indication of when ovulation happened. However, this is not something we can obtain with our service: BBT needs to be measured daily, on waking, with a special kind of thermometer that most people do not own. What we could learn from this research is how they used CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory, a type of Recurrent Neural Network or RNN) models to predict the day of ovulation, and how effective those models were, especially the LSTM model.

    But before moving on to an LSTM or CNN model, we tried to predict the ovulation day using the mean cycle length and the length of menses of the current cycle. As ovulation happens a few days after menses, we could combine the mean value of previous cycles with the length of menses of the current cycle to try to predict the ovulation day. Sadly, this method did not give good results. A likely explanation is that, among the inputs, only the length of menses varies from cycle to cycle for an individual, as the ovulation day does. However, we have seen that the ovulation day shows no clear relationship with the length of menses and a clearer one with the cycle length, which here does not change between cycles of the same individual because we use the same mean value every time.

   We only managed to reach a loss of 0.187 on the training set and 0.2 on the testing set, and only after a far larger number of epochs than for the previous models. This could mean that we merely overfitted the data and do not necessarily have a working model.
   

4. Methodology

Time Series Analysis
Time series analysis is the analysis of a set of values observed sequentially in time. Unlike regression analysis, which presupposes that the independent variables are independent of each other and that the observations themselves are independent, time series analysis assumes that the values are autocorrelated and uses time as the independent variable. Order matters because later values are assumed to be affected by the values that precede them.
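
Autocorrelation can be checked directly on a series of cycle lengths. The sketch below uses a common sample autocorrelation estimate; the `autocorrelation` helper is our own illustration rather than part of the service, and the series is an invented repetition of the alternating pattern discussed later in this post.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a constant series carries no information about order
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


# Invented series: a user whose cycles alternate between short and long.
alternating = [28, 40, 27, 40, 28, 40, 27, 40]
print(autocorrelation(alternating, lag=1))  # strongly negative
print(autocorrelation(alternating, lag=2))  # strongly positive
```

A strongly negative value at lag 1 and a strongly positive value at lag 2 mean that each cycle tends to differ from the previous one and to resemble the one before that, which is exactly the kind of order-dependent structure a time series model can exploit.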

LSTM (Long Short-Term Memory)
LSTM is a type of RNN (Recurrent Neural Network). An RNN is an artificial neural network whose hidden nodes are connected by directed edges that form a cycle, and it is known to be suitable for processing sequential data such as voice and text. However, when relevant information and the point where it is used are far apart, the backpropagated gradient gradually shrinks, which greatly reduces the ability to learn. LSTM is designed to solve this problem by adding a cell state to the RNN's hidden state.
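
The shrinking gradient can be illustrated with a toy calculation. This is not an RNN: it only assumes, as a simplification, that each time step multiplies the backpropagated signal by the same factor below 1 (0.5 here, a made-up value).

```python
def gradient_scale(per_step_factor, steps):
    """Scale of a signal after being multiplied by the same factor at every step."""
    scale = 1.0
    for _ in range(steps):
        scale *= per_step_factor
    return scale


# Hypothetical per-step factor of 0.5: the signal fades quickly with distance.
for steps in (1, 5, 20):
    print(steps, gradient_scale(0.5, steps))
```

After 20 steps the scale is already below one millionth, which is why information from distant time steps barely influences learning in a plain RNN.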

CNN (Convolutional Neural Network)
A CNN is a kind of ANN (Artificial Neural Network) that uses the convolution operation. It learns patterns directly from the data, for tasks such as classification, so features do not have to be extracted manually.

    We did time series analysis using the 2 models above (LSTM and CNN).

5. Evaluation & Analysis

    We began working on an LSTM model, but this attempt was unsuccessful because we ran into errors we could not solve. The TensorFlow.js community is not as big as that of TensorFlow on Python, and fewer tutorials and examples are available, so we could not find enough documentation about the errors to fix them. The error concerned the dimension of the input, saying that it should be of dimension 3 instead of 2. However, when we printed our input tensor in the console, it was indeed a 3D tensor. We could not fix this specific error or understand why it happened.

    For that reason, we decided to work on another model that has given really good results in other time series projects: the 1D CNN. In a 1D CNN, the kernel of the convolutional layer moves in one direction only. Even though we have to reshape our input into a 3D tensor for the model to work, each sample is basically a 1-dimensional array of features.
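
    To show what "moving in one direction" means, here is a minimal pure-Python sketch of a single 1D kernel sliding over a sequence, plus the reshape into the [samples, timesteps, channels] layout that 1D convolution layers typically expect. The `conv1d` and `to_3d` helpers and the kernel weights are hypothetical; a trained layer learns its own weights, and this is not our TensorFlow.js model.

```python
def conv1d(sequence, kernel):
    """Slide the kernel along the sequence, one position at a time (no padding)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def to_3d(windows):
    """Reshape [samples][timesteps] into [samples][timesteps][1 channel]."""
    return [[[value] for value in window] for window in windows]


cycle_lengths = [28, 40, 27, 40]
kernel = [0.5, 0.5]  # hypothetical weights: average of two neighbouring cycles
print(conv1d(cycle_lengths, kernel))  # [34.0, 33.5, 33.5]
print(to_3d([cycle_lengths]))         # [[[28], [40], [27], [40]]]
```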

   When working with time series prediction, the goal is to predict the next value using previous values as input, so the label is the value that immediately follows the feature values in the dataset. For example, to predict the length of a woman's next cycle from n = 44 values of previous cycles, we choose a number of previous cycles as the features (let's take 4 as an example) and the value right after each group of features as its label. We transform the dataset with 2 for loops to get n - features = 44 - 4 = 40 samples, each composed of 4 features and a label.
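
   The windowing described above can be sketched as follows. This version uses one loop with a slice instead of two nested loops, and the `make_windows` helper and the 44 cycle lengths are invented for illustration.

```python
def make_windows(values, window_size):
    """Turn a series into (features, label) pairs: window_size values, then the next one."""
    features, labels = [], []
    for start in range(len(values) - window_size):
        features.append(values[start:start + window_size])
        labels.append(values[start + window_size])
    return features, labels


# Invented history of n = 44 cycle lengths.
history = [26 + (i % 5) for i in range(44)]
X, y = make_windows(history, window_size=4)
print(len(X), len(y))  # 40 40
print(X[0], y[0])      # [26, 27, 28, 29] 30
```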

2) Evaluation & Analysis


   After training, the loss value is 0.0018 on average, giving us excellent results when predicting a value.
   For example, with [28,40,27,40] as input, the output is 27, which is equal or very close to the value found in our dataset. Likewise, with [40,27,40,27], the output is 40, following the pattern found for this particular user.

   Using the same approach and the same type of model to predict the length of menses and the ovulation day, we get different results. For the length of menses, the loss value obtained is not as promising.

   But for the ovulation day, we achieve a good loss value. This can be explained by the link between the ovulation day and the cycle length that we illustrated earlier with a regular neural network. As the ovulation day changes with the cycle length, we could expect that if the model performed well on cycle length, it would perform well on ovulation day too.
    The loss is not as low as when predicting cycle length, but it is still 0.04.

6. Conclusion

     We achieved the results we aimed for with machine learning. The 보름달 service uses the information entered by the user to predict the user's next menstrual cycle and ovulation date. This is the key feature of women's health applications, and it will also be the main feature of the 보름달 service.

     Because of the limitations of the dataset, predictions may differ from reality. As real users enter their information and give feedback, accuracy should improve.

