Temporary

노하람 · September 4, 2023

MLOps_service

DS Team. We develop reusable MLOps modules.
Using the facade pattern, the MVP goal is that users with no knowledge of Docker, Kubernetes, or Kubeflow can create a pipeline and check its results with a single command through one interface.

Prerequisites

A Kubernetes cluster

0. How to use

Install a Kubernetes cluster and configure NFS and a LoadBalancer (MetalLB).
- Prerequisites
  - Install cluster management tools such as minikube, kustomize, and helm
  - git clone https://github.com/Musma/MLOps_service.git
  - cd MLOps_service
- Install and configure NFS: bash utils/kubernetes/make_nfs_folder.sh
- Install and configure Kubeflow: bash install_kubeflow_1.5.sh
  - Set the MetalLB IP range during installation
    - Example: 192.168.0.230 ~ 192.168.0.254
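
Before typing the MetalLB range into the installer, it may help to sanity-check it. The following is a minimal sketch using only the Python standard library; the helper name `count_addresses` is hypothetical and is not part of this repository.

```python
import ipaddress


def count_addresses(first: str, last: str) -> int:
    """Return how many IPv4 addresses lie in the inclusive range first..last."""
    start = ipaddress.IPv4Address(first)
    end = ipaddress.IPv4Address(last)
    if end < start:
        raise ValueError(f"range end {last} is below range start {first}")
    return int(end) - int(start) + 1


# The example range from this README gives 25 addresses for LoadBalancer services.
print(count_addresses("192.168.0.230", "192.168.0.254"))
```

An invalid address such as `192.168.0.300` raises an error, so a typo is caught before it reaches the cluster configuration.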

1. Usage(DSME)

Creating a pipeline
1. Create a pipeline that uses xgbregressor as the forecasting model and runs through RL training
  - Basic: python main.py pipeline --task forecasting --model xgbregressor --period 7 --ship_id INVAR --tank 1
  - Multiple models: python main.py pipeline --task forecasting --model linearregression xgbregressor xgbrfregressor kneighborsregressor linearsvr lineartreeregressor --period 7 --ship_id INVAR --tank 1
  - Hyperparameter optimization (HPO) settings: python main.py pipeline --task forecasting --model xgbregressor --period 7 --ship_id INVAR --tank 1 --ml_objective_type minimize --ml_objective_goal 0.01 --ml_objective_metric mean_absolute_error --ml_max_trial 4 --ml_parallel_trial 4 --rl_episode_min 100 --rl_episode_max 200 --rl_episode_step 100 --rl_objective_type maximize --rl_objective_goal 100 --rl_objective_metric target_goal_value --rl_max_trial 3 --rl_parallel_trial 3
Creating a prediction pipeline
1. Create a forecasting predictor: python main.py predict --task forecasting --ship_id INVAR --tank 1
2. Create an RL predictor: python main.py predict -t rl --ship_id INVAR --tank 1
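
When these commands are launched from another script, assembling the argument list in code avoids quoting mistakes. The sketch below only builds the command line; the helper name `build_pipeline_command` is hypothetical, and the flags are the ones shown in the usage lines above.

```python
import shlex
import sys


def build_pipeline_command(models, period, ship_id, tank):
    """Assemble the argv list for `python main.py pipeline ...`."""
    if period not in (1, 7, 21):
        raise ValueError("period must be 1, 7, or 21 days")
    return [
        sys.executable, "main.py", "pipeline",
        "--task", "forecasting",
        "--model", *models,
        "--period", str(period),
        "--ship_id", ship_id,
        "--tank", str(tank),
    ]


cmd = build_pipeline_command(["xgbregressor", "linearsvr"], 7, "INVAR", 1)
print(shlex.join(cmd[1:]))
# To actually launch it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a single string keeps each model name a separate argument without going through a shell.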

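2. Arguments

Command definitions
- See components.commands.py
- pipeline and predict are the commands mainly used
- Example: python main.py pipeline ~, python main.py predict ~
  - For the detailed arguments of each command, see the "pipeline command arguments" and "predict command arguments" sections below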

No.	command	description
0	data	Creates the ETL component
1	train	Tests model training; used with the TFJob inside the pipeline
2	pipeline	Creates a pipeline
3	predict	Creates an inference component

pipeline command arguments
- See components.component_loader.py
- Example: python main.py pipeline --task forecasting --model linearregression xgbregressor xgbrfregressor kneighborsregressor linearsvr lineartreeregressor --period 7 --ship_id INVAR --tank 1 --start_time "2023-01-06 00:01" --end_time "2023-01-31 23:59" --interval_second 24
  - Meaning: pipeline command / task is forecasting / six models are used / 7 days of training data (choose from 1, 7, 21) / ship INVAR, tank 1 / scheduling start time / scheduling end time / scheduling interval
  - Example 2 (short options): python main.py pipeline -t forecasting -m linearregression xgbregressor xgbrfregressor kneighborsregressor linearsvr lineartreeregressor -p 7 -si INVAR -ta 1 -s "2023-01-06 00:01" -e "2023-01-31 23:59" -is 24

No.	arguments	description	example	required
0	--task/-t	Task to run; pipeline uses `forecasting`	forecasting	Required
1	--model/-m	Forecasting ML model(s) to use	xgbregressor	Required
2	--period/-p	Training data period in days (1, 7, or 21)	7	Required
3	--ship_id/-si	Target ship (hull) for training ("INVAR", "2515", etc.)	INVAR	Required
4	--tank/-ta	Target cargo tank ID (tank number) for training	1	Required
5	--start_time/-s	Start time of the scheduledworkflows (pipeline scheduling)	2023-01-06 00:01	Optional
6	--end_time/-e	End time of the scheduledworkflows	2023-01-31 23:59	Optional
7	--interval_second/-is	Repeat interval of the scheduledworkflows (in hours)	24	Optional
8	--input_data_path	(Unused) Training data path	unused (set automatically)	Optional
9	--output_model_path	(Unused) Model save path	unused (set automatically)	Optional
10	--ml_objective_type/-mlotype	Whether to maximize or minimize the ML error metric	minimize	Optional
11	--ml_objective_goal/-mlogoal	Target value for the ML error metric	0.01	Optional
12	--ml_objective_metric/-mlometric	ML error metric to use	mean_absolute_error	Optional
13	--ml_max_trial/-mlmt	Total number of ML trials	3	Optional
14	--ml_parallel_trial/-mlpt	Number of ML trials trained in parallel (pods)	3	Optional
15	--rl_episode_min/-rlemin	Minimum number of RL episodes	3000	Optional
16	--rl_episode_max/-rlemax	Maximum number of RL episodes	4000	Optional
17	--rl_episode_step/-rlestep	RL episode step size	100	Optional
18	--rl_objective_type/-rlotype	Whether to maximize or minimize the RL metric	maximize	Optional
19	--rl_objective_goal/-rlogoal	Target value for the RL metric	100	Optional
20	--rl_objective_metric/-rlometric	RL metric to use	target_goal_value	Optional
21	--rl_max_trial/-rlmt	Total number of RL trials	3	Optional
22	--rl_parallel_trial/-rlpt	Number of RL trials trained in parallel (pods)	3	Optional
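
To make the table concrete, here is a minimal `argparse` sketch that mirrors the required flags and the scheduling flags. It is an illustration only, not the parser in component_loader.py; the function name `make_pipeline_parser` and the chosen types (for example `int` for the tank number) are assumptions.

```python
import argparse


def make_pipeline_parser() -> argparse.ArgumentParser:
    """Build a parser for a subset of the documented pipeline flags."""
    parser = argparse.ArgumentParser(prog="main.py pipeline")
    parser.add_argument("-t", "--task", required=True, choices=["forecasting"])
    # nargs="+" lets several model names follow one --model flag.
    parser.add_argument("-m", "--model", required=True, nargs="+")
    parser.add_argument("-p", "--period", required=True, type=int, choices=[1, 7, 21])
    parser.add_argument("-si", "--ship_id", required=True)
    parser.add_argument("-ta", "--tank", required=True, type=int)
    parser.add_argument("-s", "--start_time")
    parser.add_argument("-e", "--end_time")
    parser.add_argument("-is", "--interval_second", type=int)
    return parser


args = make_pipeline_parser().parse_args(
    ["-t", "forecasting", "-m", "linearregression", "xgbregressor",
     "-p", "7", "-si", "INVAR", "-ta", "1",
     "-s", "2023-01-06 00:01", "-e", "2023-01-31 23:59", "-is", "24"]
)
print(args.model, args.period, args.ship_id, args.tank)
```

With `choices` set on `--period`, a value other than 1, 7, or 21 is rejected at parse time instead of surfacing later in the pipeline.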

predict command arguments
- See components.component_loader.py
- Example: python main.py predict -t forecasting -si INVAR -ta 1 or python main.py predict -t rl -si INVAR -ta 1
- Loads the most recently trained model. (This will change to a scheme in which models trained over a recent period compete.)

No.	arguments	description	example	required
0	--task/-t	Task to run; predict takes `forecasting` or `rl`	forecasting	Required
1	--ship_id/-si	Target ship (hull) for inference ("INVAR", "2515", etc.)	INVAR	Required
2	--tank/-ta	Target cargo tank ID (tank number) for inference	1	Required
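
The predictor is described as loading the most recently trained model. One simple way to express "most recent" is by file modification time, sketched below. This README does not state the actual storage layout or file format, so the directory argument, the `*.pkl` pattern, and the helper name `latest_model` are all assumptions.

```python
from pathlib import Path
from typing import Optional


def latest_model(model_dir: str, pattern: str = "*.pkl") -> Optional[Path]:
    """Return the most recently modified file matching pattern, or None."""
    candidates = list(Path(model_dir).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


# Hypothetical usage:
#   model_path = latest_model("/mnt/nfs/forecasting/INVAR/1")
```

Returning `None` for an empty directory lets the caller decide whether a missing model is an error or a signal to train first.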

3. Notes

When using the train/pipeline commands, you can set hyperparameters for each model.
- The set of usable arguments is fixed, however, and even if you specify none, running pipeline tunes hyperparameters automatically over a predefined set and range.
  - Per-model arguments: see the per-model parser in components.forecasting.ml_model.src.<model>.py.
  - Per-model hyperparameter tuning ranges and tuning objective (Katib settings): see create_katib_experiment_task() in components.forecasting.ml_model.src.container_op.<model>_ops.py.
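
As a rough picture of how the HPO flags could relate to Katib, the sketch below maps them onto fields of a Katib v1beta1 Experiment spec (`objective`, `maxTrialCount`, `parallelTrialCount`, `parameters`). The mapping itself, the function name `build_katib_spec`, and the parameter name `episode` are assumptions for illustration; the real settings live in create_katib_experiment_task().

```python
def build_katib_spec(objective_type, goal, metric, max_trial, parallel_trial,
                     episode_min, episode_max, episode_step):
    """Map CLI-style HPO settings onto Katib v1beta1 Experiment spec fields."""
    return {
        "objective": {
            "type": objective_type,          # "minimize" or "maximize"
            "goal": goal,                    # tuning stops once this value is reached
            "objectiveMetricName": metric,
        },
        "maxTrialCount": max_trial,
        "parallelTrialCount": parallel_trial,
        "parameters": [
            {
                "name": "episode",           # hypothetical parameter name
                "parameterType": "int",
                # Katib expects feasibleSpace bounds as strings.
                "feasibleSpace": {
                    "min": str(episode_min),
                    "max": str(episode_max),
                    "step": str(episode_step),
                },
            }
        ],
    }


# Values taken from the RL flags of the HPO usage example in this README.
spec = build_katib_spec("maximize", 100, "target_goal_value", 3, 3, 100, 200, 100)
print(spec["objective"], spec["parameters"][0]["feasibleSpace"])
```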

4. API

Specification (Swagger): http://112.169.63.143:2470/docs
- See Swagger for the detailed parameters of the examples below.

Pipeline log request example:

curl -X 'GET' \
    'http://112.169.63.143:2470/api/v1/pipeline?date=20230117' \
    -H 'accept: application/json'
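
The same log request can be prepared from Python. This sketch only builds the URL shown in the curl example; the helper name `pipeline_log_url` is hypothetical, and the `YYYYMMDD` date format is inferred from the example value rather than stated by the API.

```python
from urllib.parse import urlencode

BASE_URL = "http://112.169.63.143:2470/api/v1"


def pipeline_log_url(date: str) -> str:
    """Build the GET URL for the pipeline logs of one day (date as YYYYMMDD)."""
    if len(date) != 8 or not date.isdigit():
        raise ValueError("date must look like 20230117")
    return f"{BASE_URL}/pipeline?{urlencode({'date': date})}"


print(pipeline_log_url("20230117"))
# To send it: urllib.request.urlopen(pipeline_log_url("20230117"))
```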

Pipeline creation request examples:

Immediate creation

curl -X 'POST' \
    'http://112.169.63.143:2470/api/v1/pipeline?command=pipeline&task=forecasting&period=7&ship_id=INVAR&tank=2&ml_objective_type=minimize&ml_objective_goal=0.01&ml_objective_metric=mean_absolute_error&ml_max_trial=3&ml_parallel_trial=3&rl_episode_min=3000&rl_episode_max=4000&rl_episode_step=100&rl_objective_type=maximize&rl_objective_goal=100&rl_objective_metric=target_goal_value&rl_max_trial=3&rl_parallel_trial=3' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "model": [
        "xgbregressor"
    ]
}'

Scheduled creation

curl -X 'POST' \
    'http://112.169.63.143:2470/api/v1/pipeline?command=pipeline&task=forecasting&period=7&ship_id=INVAR&tank=2&start_time=2023-02-01%2001%3A00&end_time=2023-12-31%2023%3A00&interval_second=24&ml_objective_type=minimize&ml_objective_goal=0.01&ml_objective_metric=mean_absolute_error&ml_max_trial=3&ml_parallel_trial=3&rl_episode_min=3000&rl_episode_max=4000&rl_episode_step=100&rl_objective_type=maximize&rl_objective_goal=100&rl_objective_metric=target_goal_value&rl_max_trial=3&rl_parallel_trial=3' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "model": [
        "xgbregressor"
    ]
}'
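
The two creation requests differ only in their query parameters, so they can share one builder. The sketch below constructs the request without sending it; the helper name `build_pipeline_request` is hypothetical, while the endpoint, query parameters, and JSON body follow the curl examples above.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://112.169.63.143:2470/api/v1"


def build_pipeline_request(models, **params) -> Request:
    """Build (but do not send) the POST request that creates a pipeline."""
    query = {"command": "pipeline", "task": "forecasting", **params}
    # quote_via=quote encodes spaces as %20 rather than "+", as in the curl example.
    url = f"{BASE_URL}/pipeline?{urlencode(query, quote_via=quote)}"
    body = json.dumps({"model": list(models)}).encode("utf-8")
    headers = {"accept": "application/json", "Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")


req = build_pipeline_request(
    ["xgbregressor"],
    period=7, ship_id="INVAR", tank=2,
    start_time="2023-02-01 01:00", end_time="2023-12-31 23:00", interval_second=24,
)
print(req.full_url)
```

Omitting `start_time`, `end_time`, and `interval_second` yields the immediate-creation form; passing the request to `urllib.request.urlopen` would send it.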

Forecasting/control predictor creation example

curl -X 'POST' \
    'http://112.169.63.143:2470/api/v1/predictor?command=predict&task=forecasting&ship_id=INVAR&tank=2' \
    -H 'accept: application/json' \
    -d ''

Pipeline deletion example

curl -X 'DELETE' \
    'http://112.169.63.143:2470/api/v1/pipeline' \
    -H 'accept: application/json'

5. File-by-file description

.
├── Dockerfile # Dockerfile for building the Mulops platform
├── MLproject.yaml  # for future builds
├── README.md
├── build_image.sh # image build script: `bash build_image.sh <dockerhub ID>`
├── config.ini # settings file that saves and loads the arguments of executed commands
├── install_kubeflow_1.5.sh # one-step install kubeflow 
├── main.py # Mulops entry point
├── pyproject.toml # for future builds
├── requirements.txt # Mulops requirements
├── components # pipeline components
│   ├── commands.py # process for each command (data, train, pipeline, predict, etc.)
│   ├── component_loader.py # argument parsing and execution for each command
│   ├── config.json # information on prebuilt models
│   ├── container_op.py # base class of the Mulops ContainerOp
│   ├── make_pipeline.py # forecasting-RL pipeline specification
│   ├── make_predictor.py # inference component specification
│   ├── mlflow_module.py # (Todo) MLflow module
│   ├── module_loader.py # parses the arguments suited to each module by task and model
│   ├── predict_component.py # compiles and runs the inference component pipeline
│   ├── predict_component.yaml # built inference pipeline specification
│   ├── workflow.py # compiles and runs the forecasting-RL pipeline
│   ├── workflow.yaml # built training pipeline specification
│   ├── data # ETL pipeline
│   │   ├── Dockerfile # Dockerfile for building the Mulops data component
│   │   ├── README.md
│   │   ├── build_image.sh
│   │   ├── component.yaml
│   │   ├── requirements.txt
│   │   ├── src # data source code
│   │   │   ├── acquisition # data acquisition modules
│   │   │   │   ├── config.json # acquisition config (MQTT client, etc.)
│   │   │   │   ├── get_file_data.py
│   │   │   │   ├── get_influx_data.py
│   │   │   │   ├── get_mqtt_data.py
│   │   │   │   ├── get_rest_api_data.py 
│   │   │   │   └── get_url_data.yaml
│   │   │   ├── load_pkg # about DSME data utils modules
│   │   │   │   ├── __init__.py
│   │   │   │   ├── display.py
│   │   │   │   ├── influx.py
│   │   │   │   ├── load_data.py
│   │   │   │   └── visualization.py
│   │   │   ├── preprocess # data preprocess modules
│   │   │   │   ├── dsme_hvac_utils.py
│   │   │   │   └── text_preprocessor.py
│   │   │   ├── container_op
│   │   │   │   └── data_ops.py # make data containerOp
│   │   │   ├── config.json # config for data collection, storage, etc.
│   │   │   ├── data_pipeline.py # defines and compiles the data pipeline workflow
│   │   │   ├── data_pipeline.yaml # built data pipeline
│   │   │   ├── load_data_and_preprocessing.py # DSME ETL(main)
│   │   │   └── versioning # (planned) add DVC (data version control)
│   │   └── tests
│   ├── forecasting # tabular data forecasting pipeline
│   │   ├── dl_model 
│   │   │   ├── Dockerfile # for building the DL image
│   │   │   ├── README.md
│   │   │   ├── build_image.sh # for building the DL training image
│   │   │   ├── component.yaml
│   │   │   ├── src # Deep Learning source code
│   │   │   │   ├── container_op # per-model ContainerOp subclasses; hyperparameter tuning and serving source code
│   │   │   │   │   ├── cnnlstm_ops.py
│   │   │   │   │   ├── gru_ops.py
│   │   │   │   │   ├── lstm_ops.py
│   │   │   │   │   └── simplernn_ops.py
│   │   │   │   ├── dsme_hvac_dl.py # DSME AutoML DL training code
│   │   │   │   ├── cnnlstm.py # DL Classes
│   │   │   │   ├── gru.py
│   │   │   │   ├── lstm.py
│   │   │   │   └── simplernn.py
│   │   │   └── tests1
│   │   └── ml_model 
│   │       ├── Dockerfile # for building the ML image
│   │       ├── README.md
│   │       ├── build_image.sh # for building the ML training image
│   │       ├── component.yaml
│   │       ├── src # Machine Learning source code
│   │       │   ├── container_op # per-model ContainerOp subclasses; hyperparameter tuning and serving source code
│   │       │   │   ├── extratreesregressor.py 
│   │       │   │   ├── kneighborsregressor.py
│   │       │   │   ├── linearregression.py
│   │       │   │   ├── linearsvr_ops.py
│   │       │   │   ├── lineartreeregressor_ops.py
│   │       │   │   ├── xgbregressor_ops.py
│   │       │   │   └── xgbrfregressor_ops.py
│   │       │   ├── dsme_hvac_ml.py
│   │       │   ├── utils.py # DSME-ML utils module
│   │       │   ├── extratreesregressor.py # ML Classes
│   │       │   ├── kneighborsregressor.py
│   │       │   ├── linearregression.py
│   │       │   ├── linearsvr.py
│   │       │   ├── lineartreeregressor.py
│   │       │   ├── xgbregressor.py
│   │       │   └── xgbrfregressor.py
│   │       └── tests
│   ├── predictor # inference pipeline
│   │   ├── container_op 
│   │   │   └── predictor_ops.py # inference ContainerOp
│   │   ├── src
│   │   │   ├── config.json # inference API and DB config
│   │   │   ├── forecasting_predictor.py # forecasting: preprocessing/inference/storage/API
│   │   │   ├── rl_predictor.py # control: preprocessing/inference/storage/API
│   │   │   └── utils.py
│   │   └── temp # reserved for future use of the KServe API
│   │       ├── common.py
│   │       ├── inference.py
│   │       ├── input.json
│   │       ├── kserve_example.py
│   │       ├── model-settings.json
│   │       └── test.py
│   ├── rl # control model (reinforcement learning) pipeline
│   │   ├── Readme.md
│   │   ├── src
│   │   │   ├── container_op
│   │   │   │   └── dqn_ops.py # RL DQN ContainerOp subclass; hyperparameter tuning and serving source code
│   │   │   ├── check_result.py # RL result calculation module
│   │   │   ├── config.json
│   │   │   ├── dsme_hvac_rl.py # RL training
│   │   │   ├── environment.py # RL environment setup
│   │   │   └── utils.py # RL utils module
│   │   └── test
└── utils
    ├── etc
    │   ├── configparser.py # saves the arguments set for running each pipeline (data, pipeline, predict)
    │   └── dict_module.py
    ├── fastapi # (test) fastapi module
    │   └── application.py
    ├── flask_api # (test) flaskapi module
    │   ├── application.py
    │   ├── templates
    │   │   ├── pipeline_params.html
    │   │   └── result.html
    │   └── wsgi.py
    ├── io # io utils used within the platform
    │   ├── json_io.py
    │   ├── pickle_io.py
    │   ├── test.ipynb
    │   ├── test.py
    │   └── xgb_io.py
    ├── katib # hyperparameter tuning result logger (implemented in components.container_op.py)
    │   └── logger
    │       └── logger.py
    ├── kfp # modules for using the kfp sdk (connects the server and the kubeflow dashboard through the api)
    │   └── python
    │       ├── api.py
    │       ├── compile.py
    │       ├── config.json
    │       ├── connect.py
    │       ├── kfp_server.py
    │       └── python_api_client.py
    ├── kubernetes
    │   ├── logger
    │   │   └── event_logger.py # Kubernetes event logger
    │   ├── delete_k8s_resources_op.py # ContainerOp that deletes pipeline resources (to be added as a component)
    │   ├── delete_pipeline_resources.sh # deletes all pipeline resources (run before re-running a pipeline)
    │   ├── make_nfs_folder.sh # NFS setup for using the Mulops system (server IP must be changed) - Todo
    │   └── manifests
    │       ├── certificate
    │       │   └── gateways-issuer.yaml # used when a Kubeflow gateway issue occurs
    │       ├── kubeflow # Kubeflow installation (kustomize)
    │       │   └── ...
    │       ├── pv # creates Persistent Volumes (NFS)
    │       │   └── nfs
    │       │       ├── data_pv.yaml
    │       │       ├── forecasting_pv.yaml
    │       │       ├── pv.yaml
    │       │       └── rl_pv.yaml
    │       └── pvc # creates Persistent Volume Claims (NFS)
    │           └── nfs
    │               ├── data_pvc.yaml
    │               ├── forecasting_pvc.yaml
    │               ├── pvc.yaml
    │               └── rl_pvc.yaml
    ├── mlflow # reference for future MLflow use
    │   ├── mlflow.py
    │   └── mlflow_component_example.py
    ├── tfjob # built for using Jobs (TFJob) inside the pipeline (uses docker.io/moey920/luncher:latest)
    │   ├── Dockerfile
    │   ├── build_image.sh
    │   ├── component.yaml
    │   ├── requirements.txt
    │   ├── test
    │   ├── common
    │   │   └── launch_crd.py
    │   ├── src
    │   │   ├── __init__.py
    │   │   └── launch_tfjob.py
    │   └── test.yaml
    └── time_module # time modules used by the Mulops system
        ├── __init__.py
        ├── converter.py
        └── get_time.py
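
The tree lists config.ini as the file that saves and loads the arguments of executed commands. A minimal sketch of that idea with the standard-library `configparser` follows; the one-section-per-command layout and the function names `save_arguments` and `load_arguments` are assumptions, not the repository's actual implementation.

```python
import configparser


def save_arguments(path: str, command: str, arguments: dict) -> None:
    """Store the arguments of one command under its own section."""
    config = configparser.ConfigParser()
    config.read(path)  # keep sections written by earlier commands, if any
    config[command] = {key: str(value) for key, value in arguments.items()}
    with open(path, "w", encoding="utf-8") as handle:
        config.write(handle)


def load_arguments(path: str, command: str) -> dict:
    """Return the saved arguments of a command, or an empty dict."""
    config = configparser.ConfigParser()
    config.read(path)
    return dict(config[command]) if config.has_section(command) else {}


# Hypothetical usage:
#   save_arguments("config.ini", "pipeline", {"period": 7, "ship_id": "INVAR"})
#   load_arguments("config.ini", "pipeline")
```

Values come back as strings, so a caller would convert numeric settings such as the period before use.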

x. Release history

v0.1 / 22.07.19 Noah.
v0.3~v0.4 / 22.08.02 Noah.
v0.7 / 22.08.25 Noah.
v0.8 / 22.09.06 Noah.
v0.9 / 22.09.07 Noah. Update forecasting DL/ML models
v2.0 / 22.11.17 Noah. Update DSME HVAC Pipeline for MVP Testing
v3.0 / 23.01.06 Noah. Update Systems(recurring run, fix bugs, add feats) & README.md
v3.1 / 23.01.11 Noah. Update Systems(rl predictor, forecasting models, find file path) & README.md
v3.2 / 23.01.13 Noah. Update README.md
v3.3 / 23.01.13 Noah. Update auto delete previous run, when execute Predictor
v3.4 / 23.01.17 Noah. Update trained(tuned) model api data component
v3.5 / 23.01.18 Noah. Update Mulops Backend /w fastAPI web sockets
v3.6 / 23.01.25 Noah. Update fastAPI. info http://112.169.63.143:2470/docs & README.md
v3.7 / 23.02.02 Noah. Update HPO args in Mulops System, add HVAC ship_id, tank
v3.8 / 23.02.08 Noah. Fix Predictor ship_id/tank error; separate by ship and tank when storing in influx; update offline installation v0.1
v3.9 / 23.02.10 Noah. Update multiple prediction & offline installation
v3.9.1 / 23.03.06 Noah. Update APIs, INVAR4, fix bugs

