
Train a model with GPUs on GKE Autopilot mode


This quickstart shows you how to deploy a training model with GPUs in Google Kubernetes Engine (GKE) and store the predictions in Cloud Storage. This document is intended for GKE administrators who already have an Autopilot mode cluster and want to run GPU workloads for the first time.

You can also run these workloads on a Standard cluster if you create a separate GPU node pool in the cluster. For instructions, see Train a model with GPUs on GKE Standard mode.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the GKE and Cloud Storage APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

Install the Google Cloud CLI.

Note: If you installed the gcloud CLI previously, make sure that you have the latest version by running gcloud components update.

If you use an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command:

gcloud init

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Clone the sample repository

Run the following command in Cloud Shell:

git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \
    cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu

Create a cluster

In the Google Cloud console, go to the Create an Autopilot cluster page:

Go to Create an Autopilot cluster
In the Name field, enter gke-gpu-cluster.
In the Region list, select us-central1.
Click Create.

Create a Cloud Storage bucket

In the Google Cloud console, go to the Create a bucket page:

Go to Create a bucket
In the Name your bucket field, enter the following name:
```
PROJECT_ID-gke-gpu-bucket 
```
Replace PROJECT_ID with your Google Cloud project ID.
Click Continue.
For Location type, select Region.
In the Region list, select us-central1 (Iowa) and click Continue.
In the Choose a storage class for your data section, click Continue.
In the Choose how to control access to objects section, for Access control, select Uniform.
Click Create.
In the Public access will be prevented dialog, make sure that the Enforce public access prevention on this bucket checkbox is selected, and then click Confirm.
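
The bucket name in the steps above follows the pattern PROJECT_ID-gke-gpu-bucket. As a minimal sketch, the following shell function builds that name from a project ID and checks it against the basic Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). The function name `gpu_bucket_name` and the project ID `my-project-123` are hypothetical, and the check does not cover every naming restriction or whether the name is still available.

```shell
# Hypothetical helper (not part of the sample repository): derive the bucket
# name used in this guide from a project ID and check its basic shape.
gpu_bucket_name() (
  LC_ALL=C  # keep the a-z ranges below ASCII-only
  name="$1-gke-gpu-bucket"
  case "$name" in
    *[!a-z0-9._-]* | [!a-z0-9]* | *[!a-z0-9])
      echo "invalid bucket name: $name" >&2
      exit 1 ;;
  esac
  if [ "${#name}" -gt 63 ]; then
    echo "bucket name longer than 63 characters: $name" >&2
    exit 1
  fi
  printf '%s\n' "$name"
)

# Example with a placeholder project ID.
gpu_bucket_name "my-project-123"
```

The function body runs in a subshell, so the locale setting and the `name` variable do not leak into your Cloud Shell session.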

Configure your cluster to access the bucket using Workload Identity Federation for GKE

To let your cluster access the Cloud Storage bucket, do the following:

Create a Kubernetes ServiceAccount in your cluster.
Create an IAM allow policy that lets the ServiceAccount access the bucket.

Create a Kubernetes ServiceAccount in your cluster

In Cloud Shell, do the following:

Connect to your cluster:

gcloud container clusters get-credentials gke-gpu-cluster \
    --location=us-central1

Create a Kubernetes namespace:

kubectl create namespace gke-gpu-namespace

Create a Kubernetes ServiceAccount in the namespace:

kubectl create serviceaccount gpu-k8s-sa --namespace=gke-gpu-namespace

Create an IAM allow policy on the bucket

Grant the Storage Object Admin (roles/storage.objectAdmin) role on the bucket to the Kubernetes ServiceAccount:

gcloud storage buckets add-iam-policy-binding gs://PROJECT_ID-gke-gpu-bucket \
    --member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/gke-gpu-namespace/sa/gpu-k8s-sa \
    --role=roles/storage.objectAdmin \
    --condition=None

Replace PROJECT_ID with your Google Cloud project ID and PROJECT_NUMBER with your Google Cloud project number.
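
The --member value in the command above is long and easy to mistype. As a rough sketch, the following shell function assembles it from its four parts, following the format shown in that command. The function name `ksa_principal` and the sample project number and project ID are hypothetical placeholders.

```shell
# Hypothetical helper: build the Workload Identity Federation for GKE principal
# for a Kubernetes ServiceAccount, in the format used by the command above.
ksa_principal() {
  if [ "$#" -ne 4 ]; then
    echo "usage: ksa_principal PROJECT_NUMBER PROJECT_ID NAMESPACE SERVICE_ACCOUNT" >&2
    return 1
  fi
  printf 'principal://iam.googleapis.com/projects/%s/locations/global/workloadIdentityPools/%s.svc.id.goog/subject/ns/%s/sa/%s\n' \
    "$1" "$2" "$3" "$4"
}

# Placeholder values; substitute your own project number and project ID.
ksa_principal 123456789012 my-project gke-gpu-namespace gpu-k8s-sa
```

You could pass the printed value to --member. If you don't know your project number, `gcloud projects describe PROJECT_ID --format="value(projectNumber)"` prints it.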

Verify that Pods can access the Cloud Storage bucket

In Cloud Shell, create the following environment variables:
```
export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket
```
Replace PROJECT_ID with your Google Cloud project ID.
Create a Pod that has a TensorFlow container:
```
envsubst < src/gke-config/standard-tensorflow-bash.yaml | kubectl --namespace=gke-gpu-namespace apply -f -
```
This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Create a sample file in the bucket:

touch sample-file
gcloud storage cp sample-file gs://PROJECT_ID-gke-gpu-bucket

Wait for your Pod to become ready:
```
kubectl wait --for=condition=Ready pod/test-tensorflow-pod -n=gke-gpu-namespace --timeout=180s
```
When the Pod is ready, the output is the following:
```
pod/test-tensorflow-pod condition met
```
If the command times out, GKE might still be creating new nodes to run the Pod. Run the command again and wait for the Pod to become ready.

Open a shell in the TensorFlow container:

kubectl -n gke-gpu-namespace exec --stdin --tty test-tensorflow-pod --container tensorflow -- /bin/bash

Try to read the sample file that you created:
```
ls /data
```
The output shows the sample file.

List the GPUs that TensorFlow detects in the Pod:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

The output shows the GPU attached to the Pod, similar to the following:

...
PhysicalDevice(name='/physical_device:GPU:0',device_type='GPU')

Exit the container:
```
exit
```

Delete the sample Pod:

kubectl delete -f src/gke-config/standard-tensorflow-bash.yaml \
    --namespace=gke-gpu-namespace
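
The advice in this section to rerun kubectl wait after a timeout can be automated with a small retry loop. The following is a minimal sketch, not part of the sample repository: the `retry` function, its argument order, and the attempt count and delay in the usage comment are all assumptions chosen for illustration.

```shell
# Hypothetical helper: run a command up to max_attempts times, pausing
# delay_seconds between failed attempts.
retry() {
  max_attempts="$1"
  delay_seconds="$2"
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

# Intended use (needs a configured kubectl, so it is left commented out):
# retry 5 10 kubectl wait --for=condition=Ready pod/test-tensorflow-pod \
#     -n gke-gpu-namespace --timeout=180s
```

Because each kubectl wait call already blocks for up to its own --timeout, a short delay between attempts is usually enough.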

Train and predict using the `MNIST` dataset

In this section, you run a training workload on the MNIST example dataset.

Copy the example data to the Cloud Storage bucket:

gcloud storage cp src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Create the following environment variables:

export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket

Review the training Job:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-training-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_train_distributed.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"

Deploy the training Job:
```
envsubst < src/gke-config/standard-tf-mnist-train.yaml | kubectl -n gke-gpu-namespace apply -f -
```
This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status:
```
kubectl wait -n gke-gpu-namespace --for=condition=Complete job/mnist-training-job --timeout=180s
```
When the Job completes, the output is similar to the following:
```
job.batch/mnist-training-job condition met
```
If the command times out, GKE might still be creating new nodes to run the Pods. Run the command again and wait for the Job to complete.

Check the logs from the TensorFlow container:

kubectl logs -f jobs/mnist-training-job -c tensorflow -n gke-gpu-namespace

The output shows that the following events occur:

Install the required Python packages
Download the MNIST dataset
Train the model using a GPU
Save the model
Evaluate the model

...
Epoch 12/12
927/938 [============================>.] - ETA: 0s - loss: 0.0188 - accuracy: 0.9954
Learning rate for epoch 12 is 9.999999747378752e-06
938/938 [==============================] - 5s 6ms/step - loss: 0.0187 - accuracy: 0.9954 - lr: 1.0000e-05
157/157 [==============================] - 1s 4ms/step - loss: 0.0424 - accuracy: 0.9861
Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446
Training finished. Model saved

Delete the training workload:

kubectl -n gke-gpu-namespace delete -f src/gke-config/standard-tf-mnist-train.yaml
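
The envsubst step in this section reads K8S_SA_NAME and BUCKET_NAME from the environment. envsubst replaces a reference to an unset variable with an empty string, so a forgotten export yields a manifest with empty fields instead of an error. The following pre-flight check is one way to catch that early; it is only a sketch, and the function name `require_env` is hypothetical.

```shell
# Hypothetical helper: fail if any named variable is missing from the
# environment or is empty. printenv only sees exported variables, which is
# also what envsubst reads.
require_env() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "environment variable not set: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Intended use (needs kubectl, so it is left commented out):
# require_env K8S_SA_NAME BUCKET_NAME && \
#     envsubst < src/gke-config/standard-tf-mnist-train.yaml | \
#     kubectl -n gke-gpu-namespace apply -f -
```

The check reports every missing variable before it returns, so you can fix all of them in one pass.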

Deploy an inference workload

In this section, you deploy an inference workload that takes a sample dataset as input and returns predictions.

Copy the images for prediction to the bucket:

gcloud storage cp data/mnist_predict gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Review the inference workload:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-batch-prediction-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_batch_predict.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"

Deploy the inference workload:
```
envsubst < src/gke-config/standard-tf-mnist-batch-predict.yaml | kubectl -n gke-gpu-namespace apply -f -
```
This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status:

kubectl wait -n gke-gpu-namespace --for=condition=Complete job/mnist-batch-prediction-job --timeout=180s

The output is similar to the following:

job.batch/mnist-batch-prediction-job condition met

Check the logs from the TensorFlow container:

kubectl logs -f jobs/mnist-batch-prediction-job -c tensorflow -n gke-gpu-namespace

The output shows the prediction for each image and the model's confidence in that prediction, similar to the following:

Found 10 files belonging to 1 classes.
1/1 [==============================] - 2s 2s/step
The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
The image /data/mnist_predict/1.png is the number 1 with a 99.99 percent confidence.
The image /data/mnist_predict/2.png is the number 2 with a 100.00 percent confidence.
The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
The image /data/mnist_predict/4.png is the number 4 with a 100.00 percent confidence.
The image /data/mnist_predict/5.png is the number 5 with a 100.00 percent confidence.
The image /data/mnist_predict/6.png is the number 6 with a 99.97 percent confidence.
The image /data/mnist_predict/7.png is the number 7 with a 100.00 percent confidence.
The image /data/mnist_predict/8.png is the number 8 with a 100.00 percent confidence.
The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.
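
To spot the least certain prediction without reading every log line, you could filter the output with awk. The following is a rough sketch that assumes the exact sentence format shown in the sample output above. The function name `lowest_confidence` is hypothetical, and the two input lines are copied from that sample.

```shell
# Hypothetical helper: read prediction log lines on stdin and print the image
# path with the lowest confidence. Lines in any other format are ignored.
lowest_confidence() {
  awk '
    /^The image .* percent confidence\.$/ {
      # Field 3 is the image path; the confidence is the third field from the end.
      path = $3
      conf = $(NF - 2) + 0
      if (!seen || conf < min) { min = conf; min_path = path; seen = 1 }
    }
    END { if (seen) printf "%s %.2f\n", min_path, min }
  '
}

# Example with two lines from the sample output.
lowest_confidence <<'EOF'
The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.
EOF
```

For the two sample lines, the function prints `/data/mnist_predict/9.png 99.65`. Against a live Job, you could pipe the kubectl logs command from this section (without -f) into the function.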

Clean up

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, do one of the following:

Keep the GKE cluster: Delete the Kubernetes resources in the cluster and the Google Cloud resources
Keep the Google Cloud project: Delete the GKE cluster and the Google Cloud resources
Delete the project

Delete the Kubernetes resources in the cluster and the Google Cloud resources

Delete the Kubernetes namespace and the workloads that you deployed:

kubectl -n gke-gpu-namespace delete -f src/gke-config/standard-tf-mnist-batch-predict.yaml
kubectl delete namespace gke-gpu-namespace

Delete the Cloud Storage bucket:
1. Go to the Buckets page:
  
  Go to Buckets
2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.
3. Click Delete.
4. To confirm deletion, type DELETE, and then click Delete.
Delete the Google Cloud service account:
1. Go to the Service accounts page:
  
  Go to Service accounts
2. Select your project.
3. Select the checkbox for gke-gpu-sa@PROJECT_ID.iam.gserviceaccount.com.
4. Click Delete.
5. To confirm deletion, click Delete.

Delete the GKE cluster and the Google Cloud resources

Delete the GKE cluster:
1. Go to the Clusters page:
  
  Go to Clusters
2. Select the checkbox for gke-gpu-cluster.
3. Click Delete.
4. To confirm deletion, type gke-gpu-cluster, and then click Delete.
Delete the Cloud Storage bucket:
1. Go to the Buckets page:
  
  Go to Buckets
2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.
3. Click Delete.
4. To confirm deletion, type DELETE, and then click Delete.
Delete the Google Cloud service account:
1. Go to the Service accounts page:
  
  Go to Service accounts
2. Select your project.
3. Select the checkbox for gke-gpu-sa@PROJECT_ID.iam.gserviceaccount.com.
4. Click Delete.
5. To confirm deletion, click Delete.

Delete the project

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work that you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Learn more about using GPUs in GKE
