An old memory of learning, the hard way, that ignorance is a sin.
The CUDA tag is attached to keep a record of those past sins.
That said, I'm switching back to NVIDIA now anyway, lol.
python -m pip install intel_extension_for_pytorch -f https://developer.intel.com/ipex-whl-stable-cpu
pip install --upgrade intel-extension-for-tensorflow[xpu]
pip install --upgrade intel-extension-for-tensorflow-weekly[gpu] -f https://developer.intel.com/itex-whl-weekly
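Quick sanity check after installing (a minimal sketch: the version attributes are standard, but the XPU probe only means anything with an Intel GPU build plus drivers):
# Sanity-check sketch: confirm the wheels import and report their versions.
import torch
import intel_extension_for_pytorch as ipex
print(torch.__version__, ipex.__version__)
print(torch.xpu.is_available() if hasattr(torch, "xpu") else "CPU-only build, no xpu backend")

import tensorflow as tf
import intel_extension_for_tensorflow as itex
print(tf.__version__, itex.__version__)
print(tf.config.list_physical_devices())  # an 'XPU' entry should show up when the [xpu] plugin is active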
pip install intel-extension-for-transformers
https://pypi.org/project/intel-extension-for-transformers/
https://pypi.org/project/intel-extension-for-pytorch/
https://pytorch.org/tutorials/recipes/recipes/intel_extension_for_pytorch.html
https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/cheat_sheet.html
https://www.intel.com/content/www/us/en/developer/articles/technical/introducing-intel-extension-for-pytorch-for-gpus.html
https://pypi.org/project/intel-extension-for-tensorflow/
https://github.com/intel/intel-extension-for-tensorflow/blob/main/docs/guide/practice_guide.md
# NeuralChat: build a chatbot with the default config and ask it a question.
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
# Weight-only quantization: INT4 weights with INT8 compute dtype.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig
model_name = "EleutherAI/gpt-j-6B"
config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
prompt = "Once upon a time, a little girl"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModel.from_pretrained(model_name, quantization_config=config)
gen_tokens = model.generate(inputs, max_new_tokens=300)
gen_text = tokenizer.batch_decode(gen_tokens)
print(gen_text)
# Same flow, but with INT8 weights and BF16 compute dtype.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig
model_name = "EleutherAI/gpt-j-6B"
config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int8")
prompt = "Once upon a time, a little girl"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModel.from_pretrained(model_name, quantization_config=config)
gen_tokens = model.generate(inputs, max_new_tokens=300)
gen_text = tokenizer.batch_decode(gen_tokens)
print(gen_text)
As for Hugging Face, there is an Intel section for data and models, but honestly just running on CPU seems easier.
pipeline에서 device (int or str or torch.device) — Defines the device (e.g., "cpu", "cuda:1", "mps", or a GPU ordinal rank like 1) on which this pipeline will be allocated.
But 1 means CUDA, and even passing 0 still grabs CUDA (CPU is -1). Sigh.
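To actually keep a pipeline on the CPU, pass the device explicitly (a minimal sketch; gpt2 is just a stand-in model, and -1 is the CPU ordinal):
from transformers import pipeline
# device=-1 pins the pipeline to CPU; device=0 or "cuda:0" would take the first GPU.
pipe = pipeline("text-generation", model="gpt2", device=-1)
print(pipe("Once upon a time, a little girl", max_new_tokens=20)[0]["generated_text"])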
Model | FP32 accuracy | INT4 accuracy (group size 32) | INT4 accuracy (group size 128) | Next-token latency |
---|---|---|---|---|
EleutherAI/gpt-j-6B | 0.643 | 0.644 | 0.64 | 21.98ms |
meta-llama/Llama-2-7b-hf | 0.69 | 0.69 | 0.685 | 24.55ms |
decapoda-research/llama-7b-hf | 0.689 | 0.682 | 0.68 | 24.84ms |
EleutherAI/gpt-neox-20b | 0.674 | 0.672 | 0.669 | 80.16ms |
mosaicml/mpt-7b-chat | 0.672 | 0.67 | 0.666 | 35.84ms |
tiiuae/falcon-7b | 0.698 | 0.694 | 0.693 | 36.1ms |
baichuan-inc/baichuan-7B | 0.474 | 0.471 | 0.47 | Coming Soon |
facebook/opt-6.7b | 0.65 | 0.647 | 0.643 | Coming Soon |
databricks/dolly-v2-3b | 0.613 | 0.609 | 0.609 | 22.02ms |
tiiuae/falcon-40b-instruct | 0.756 | 0.757 | 0.755 | Coming Soon |
# Cheat-sheet fragment: `model` and `data` are assumed to already exist.
import intel_extension_for_pytorch as ipex
# Move the model and inputs to the Intel GPU ('xpu'), then let IPEX apply its optimizations.
model = model.to('xpu')
data = data.to('xpu')
model = ipex.optimize(model)
https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/cheat_sheet.html
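Since the wheel installed above is the CPU build, here is a minimal CPU inference sketch (ResNet-50 from torchvision is just a stand-in model; BF16 assumes the CPU supports it):
import torch
import intel_extension_for_pytorch as ipex
import torchvision.models as models

# Stand-in model; ipex.optimize fuses ops and prepares weights for BF16 inference.
model = models.resnet50(weights="DEFAULT").eval()
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast():
    out = model(torch.rand(1, 3, 224, 224))
print(out.shape)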
Let's just forget PlaidML ever happened...
git clone https://github.com/intel/intel-extension-for-transformers.git itrex
cd itrex
pip install -r requirements.txt
pip install -v .
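A quick smoke test after the source build (just an import check, nothing more is implied):
# Confirms the package resolves to the freshly installed build.
import intel_extension_for_transformers as itrex
print(itrex.__file__)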