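How to Automatically Select an Idle GPU for TensorFlow and PyTorch and Settle Fights over Cards

Author: 汪思穎

2017-08-29 09:26

Lead: Let the system pick an idle GPU device automatically.

Leiphone editor's note: This article was written by 天清 and first appeared in the author's Zhihu column 世界那么大我想寫代碼. Leiphone republishes it with the author's permission.

Project repository: QuantumLiu/tf_gpu_manager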

***

Update: PyTorch is now supported.

***

Usage

git clone https://github.com/QuantumLiu/tf_gpu_manager

Just put manager.py in your training directory.

Use with gm.auto_choice() to have a device chosen automatically for the code block that follows.

import tensorflow as tf
from manager import GPUManager
from keras.layers import LSTM

gm = GPUManager()
# Ops created inside the block are placed on the GPU chosen by the manager
with gm.auto_choice():
    x = tf.placeholder(tf.float32, shape=(None, 20, 64))
    y = LSTM(32)(x)

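Background

As deep learning advances rapidly, the data and compute scale of deep learning tasks keeps growing, and doing any respectable work is all but impossible without a powerful GPU workstation.

Besides strong single-card performance, the number of GPUs matters too.

For the following reasons, multi-GPU workstations have become standard equipment in most labs:

A deep learning project is usually a collaboration among several people in a lab or group who share one or a few workstations. A single host with several GPUs is convenient.
Experiments require trying several parameter settings or running comparison runs. Running them in parallel on multiple GPUs saves time.
The model is computationally heavy, so different parts of it have to be placed on different GPUs.

Today, mainstream deep learning frameworks such as TensorFlow and PyTorch all support multi-GPU training.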

Take TensorFlow as an example: tensorflow\python\framework defines the device function, which returns a context manager object for the device (for example a GPU) on which ops will run.

def device(device_name_or_function):
  """Wrapper for `Graph.device()` using the default graph.

  See
  @{tf.Graph.device}
  for more details.

  Args:
    device_name_or_function: The device name or function to use in the context.

  Returns:
    A context manager that specifies the default device to use for newly created ops.
  """
  return get_default_graph().device(device_name_or_function)

In a training script, a with statement is all it takes to pin the ops that follow to a particular GPU.

with tf.device('/gpu:2'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

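So here come the questions:

1. When writing the training script, how do I know which GPU is idle and available?
2. What if a teammate's experiment collides with mine?
3. When this script is run at some point in the future, will it have to be edited again to match the situation?
4. When peers use my code to reproduce the experiment, their GPU setup will differ and they may not even have a GPU. Do they have to change the code too?

Of course, savvy developers know that nvidia-smi can query graphics card information: check GPU memory, temperature and power usage, then pick a suitable GPU.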


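Running this command before every training run, combined with good communication inside the team, takes care of the first two questions above, but the third and fourth remain hard to solve.

Besides, doesn't constantly fighting over cards with labmates and colleagues hurt productivity?

We need a solution that selects an idle GPU device automatically, without editing the script and without coordinating with teammates.

Implementation

How to get GPU status information efficiently

nvidia-smi is the official NVIDIA command-line tool for managing and monitoring GPU status. Like other command-line tools, nvidia-smi accepts many arguments.

From reading the documentation and learning from veteran users, we know that the --query-gpu option selects which GPU status fields to query and returns them as formatted output.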


Running the command:

nvidia-smi --help-query-gpu

lists every supported query parameter (too many to enumerate here).

Veteran users have already distilled the most useful ones for us.


In addition, I looked up index, name, power.draw and power.limit myself.


That gives us the basic idea: run the command with os.popen and parse the text it returns.

import os

def parse(line, qargs):
    '''
    line:
        a line of text
    qargs:
        query arguments
    return:
        a dict of gpu infos
    Parse one line of the csv format text returned by nvidia-smi
    '''
    numberic_args = ['memory.free', 'memory.total', 'power.draw', 'power.limit']  # fields that hold numbers
    # whether the card supports power management (laptop GPUs may not)
    power_manage_enable = lambda v: 'Not Support' not in v
    # strip the unit from strings such as '8114 MiB' or '35.50 W'
    to_numberic = lambda v: float(v.upper().strip().replace('MIB', '').replace('W', ''))
    # numeric fields become ints (1 when unsupported); all other fields are stripped text
    process = lambda k, v: ((int(to_numberic(v)) if power_manage_enable(v) else 1) if k in numberic_args else v.strip())
    return {k: process(k, v) for k, v in zip(qargs, line.strip().split(','))}

def query_gpu(qargs=[]):
    '''
    qargs:
        query arguments
    return:
        a list of dict
    Query GPU information
    '''
    qargs = ['index', 'gpu_name', 'memory.free', 'memory.total', 'power.draw', 'power.limit'] + qargs
    cmd = 'nvidia-smi --query-gpu={} --format=csv,noheader'.format(','.join(qargs))
    results = os.popen(cmd).readlines()
    return [parse(line, qargs) for line in results]
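
As a variation on the parsing step, here is a minimal sketch rather than the project's own code. nvidia-smi's --format option also accepts nounits, which drops the MiB and W suffixes so that fields can be converted directly. The helper names parse_csv_line and query_gpu_nounits are hypothetical and do not exist in manager.py. Mapping unparsable values such as [Not Supported] to None is an assumption of this sketch (the original code uses 1 instead), and the sample line is made up for illustration.

```python
import subprocess

NUMERIC_FIELDS = ('memory.free', 'memory.total', 'power.draw', 'power.limit')


def parse_csv_line(line, fields):
    """Turn one 'csv,noheader,nounits' line into a dict keyed by field name."""
    info = {}
    for key, raw in zip(fields, line.strip().split(',')):
        value = raw.strip()
        if key in NUMERIC_FIELDS:
            try:
                info[key] = float(value)
            except ValueError:
                # e.g. '[Not Supported]' on cards without power management
                info[key] = None
        else:
            info[key] = value
    return info


def query_gpu_nounits(fields):
    """Run nvidia-smi once and parse each output line (needs an NVIDIA driver)."""
    cmd = ['nvidia-smi',
           '--query-gpu={}'.format(','.join(fields)),
           '--format=csv,noheader,nounits']
    output = subprocess.check_output(cmd).decode('utf-8')
    return [parse_csv_line(line, fields)
            for line in output.splitlines() if line.strip()]


fields = ['index', 'gpu_name', 'memory.free', 'memory.total',
          'power.draw', 'power.limit']
# Made-up sample line, so the parser can be tried without a GPU.
sample = '0, GeForce GTX 1080, 7000, 8114, 35.50, 180.00'
print(parse_csv_line(sample, fields))
```

Passing the command as a list to subprocess avoids going through a shell, and keeping the parser separate from the query makes it possible to exercise it on machines that have no GPU.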

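How to measure how idle a GPU is

We can now read GPU status, but how do we measure idleness and rank the GPUs?

In deep learning, GPU idleness can be measured mainly by two indicators: free memory and power headroom.

Memory usage in turn splits into absolute size and proportion.

In the end we use three metrics:

Free memory size
Free memory ratio
Current power draw / power limit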
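
The three metrics can be computed from one parsed GPU record as in the sketch below. The function name idleness_metrics is hypothetical, and the numbers are made-up values in the units nvidia-smi reports (MiB and W); the dictionary keys follow the query fields used in this article.

```python
def idleness_metrics(gpu):
    """Return (free memory, free memory ratio, power load ratio) for one GPU dict."""
    free = gpu['memory.free']
    free_ratio = float(free) / gpu['memory.total']
    power_ratio = float(gpu['power.draw']) / gpu['power.limit']
    return free, free_ratio, power_ratio


# Made-up record for illustration only.
gpu = {'index': '0', 'memory.free': 6000, 'memory.total': 8000,
       'power.draw': 45, 'power.limit': 180}
print(idleness_metrics(gpu))
```

A larger free size or free ratio means a more idle card, while for the power ratio a smaller value means a more idle card.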

Earlier we stored the information for all GPUs in a list, where each element is a dict holding one GPU's information.

We use the built-in sorted function to rank the available GPUs.

For example, by memory usage:

def _sort_by_memory(self, gpus, by_size=False):
    if by_size:
        print('Sorted by free memory size')
        return sorted(gpus, key=lambda d: d['memory.free'], reverse=True)
    else:
        print('Sorted by free memory rate')
        return sorted(gpus, key=lambda d: float(d['memory.free']) / d['memory.total'], reverse=True)
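
The two orderings do not always agree. The sketch below applies the same two sort keys to made-up records for a large, mostly occupied card and a small, mostly free one.

```python
# Made-up records: GPU 0 is large but mostly in use, GPU 1 is small but mostly free.
gpus = [
    {'index': '0', 'memory.free': 8000, 'memory.total': 24000},
    {'index': '1', 'memory.free': 6000, 'memory.total': 8000},
]

by_size = sorted(gpus, key=lambda d: d['memory.free'], reverse=True)
by_rate = sorted(gpus, key=lambda d: float(d['memory.free']) / d['memory.total'],
                 reverse=True)

print([d['index'] for d in by_size])  # GPU 0 first: it has more free MiB
print([d['index'] for d in by_rate])  # GPU 1 first: a larger share of it is free
```

Ranking by size suits a job that simply needs as much memory as possible, while ranking by rate favors the card that other jobs are using least.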

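Full script

We define a GPUManager class. Over the lifetime of an instance, it refreshes the GPU status and records which GPUs have already been assigned.

Once instantiated, calling the auto_choice method returns a tf.device object directly.

Since the user's machine may have no GPU at all, we also add exception handling.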
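
The full script handles a missing GPU by raising an ImportError. As an alternative for readers who would rather keep a script running on machines without a GPU, here is a sketch that falls back to the CPU. The function choose_device_name and its nvidia_smi_found parameter are hypothetical and not part of manager.py; the returned strings follow the '/gpu:N' naming used with tf.device in this article.

```python
import shutil


def choose_device_name(gpus, nvidia_smi_found=None):
    """Return a TensorFlow-style device string, falling back to the CPU."""
    if nvidia_smi_found is None:
        # shutil.which returns None when the executable is not on PATH
        nvidia_smi_found = shutil.which('nvidia-smi') is not None
    if not nvidia_smi_found or not gpus:
        return '/cpu:0'
    # Pick the GPU with the most free memory among the parsed records
    freest = max(gpus, key=lambda d: d['memory.free'])
    return '/gpu:{}'.format(freest['index'])


# Made-up records for illustration only.
print(choose_device_name([], nvidia_smi_found=True))
print(choose_device_name([{'index': '0', 'memory.free': 10},
                          {'index': '2', 'memory.free': 50}],
                         nvidia_smi_found=True))
```

The returned string could then be passed to tf.device, so the same script would run unchanged on a GPU workstation and on a CPU-only laptop.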

import os

import tensorflow as tf
from tensorflow.python.client import device_lib


def check_gpus():
    '''
    GPU available check
    reference : http://feisky.xyz/machine-learning/tensorflow/gpu_list.html
    '''
    all_gpus = [x.name for x in device_lib.list_local_devices() if x.device_type == 'GPU']
    if not all_gpus:
        print('This script could only be used to manage NVIDIA GPUs, but no GPU found in your device')
        return False
    elif 'NVIDIA System Management' not in os.popen('nvidia-smi -h').read():
        print("'nvidia-smi' tool not found.")
        return False
    return True


if check_gpus():
    def parse(line, qargs):
        '''
        line:
            a line of text
        qargs:
            query arguments
        return:
            a dict of gpu infos
        Parse one line of the csv format text returned by nvidia-smi
        '''
        numberic_args = ['memory.free', 'memory.total', 'power.draw', 'power.limit']  # fields that hold numbers
        # whether the card supports power management (laptop GPUs may not)
        power_manage_enable = lambda v: 'Not Support' not in v
        # strip the unit from strings such as '8114 MiB' or '35.50 W'
        to_numberic = lambda v: float(v.upper().strip().replace('MIB', '').replace('W', ''))
        process = lambda k, v: ((int(to_numberic(v)) if power_manage_enable(v) else 1) if k in numberic_args else v.strip())
        return {k: process(k, v) for k, v in zip(qargs, line.strip().split(','))}

    def query_gpu(qargs=[]):
        '''
        qargs:
            query arguments
        return:
            a list of dict
        Query GPU information
        '''
        qargs = ['index', 'gpu_name', 'memory.free', 'memory.total', 'power.draw', 'power.limit'] + qargs
        cmd = 'nvidia-smi --query-gpu={} --format=csv,noheader'.format(','.join(qargs))
        results = os.popen(cmd).readlines()
        return [parse(line, qargs) for line in results]

    def by_power(d):
        '''
        helper function for sorting gpus by power
        '''
        power_infos = (d['power.draw'], d['power.limit'])
        if any(v == 1 for v in power_infos):
            print('Power management unable for GPU {}'.format(d['index']))
            return 1
        return float(d['power.draw']) / d['power.limit']

    class GPUManager():
        '''
        qargs:
            query arguments
        A manager that lists all available GPU devices, sorts them and
        chooses the most idle one. Each GPUManager object records whether
        a GPU has already been specified and prefers unspecified ones.
        '''
        def __init__(self, qargs=[]):
            self.qargs = qargs
            self.gpus = query_gpu(qargs)
            for gpu in self.gpus:
                gpu['specified'] = False
            self.gpu_num = len(self.gpus)

        def _sort_by_memory(self, gpus, by_size=False):
            if by_size:
                print('Sorted by free memory size')
                return sorted(gpus, key=lambda d: d['memory.free'], reverse=True)
            else:
                print('Sorted by free memory rate')
                return sorted(gpus, key=lambda d: float(d['memory.free']) / d['memory.total'], reverse=True)

        def _sort_by_power(self, gpus):
            return sorted(gpus, key=by_power)

        def _sort_by_custom(self, gpus, key, reverse=False, qargs=[]):
            if isinstance(key, str) and (key in qargs):
                return sorted(gpus, key=lambda d: d[key], reverse=reverse)
            if isinstance(key, type(lambda a: a)):
                return sorted(gpus, key=key, reverse=reverse)
            raise ValueError("The argument 'key' must be a function or a key in query args, please read the documentation of nvidia-smi")

        def auto_choice(self, mode=0):
            '''
            mode:
                0: (default) sorted by free memory size
                1: sorted by free memory rate
                2: sorted by power
            return:
                a TF device object
            Automatically choose the most idle GPU device, preferring
            ones that have not been specified yet
            '''
            # refresh the stored status while keeping the 'specified' flags
            for old_infos, new_infos in zip(self.gpus, query_gpu(self.qargs)):
                old_infos.update(new_infos)
            unspecified_gpus = [gpu for gpu in self.gpus if not gpu['specified']] or self.gpus

            if mode == 0:
                print('Choosing the GPU device has largest free memory...')
                chosen_gpu = self._sort_by_memory(unspecified_gpus, True)[0]
            elif mode == 1:
                print('Choosing the GPU device has highest free memory rate...')
                chosen_gpu = self._sort_by_memory(unspecified_gpus, False)[0]
            elif mode == 2:
                print('Choosing the GPU device by power...')
                chosen_gpu = self._sort_by_power(unspecified_gpus)[0]
            else:
                print('Given an unavailable mode, will be chosen by memory')
                chosen_gpu = self._sort_by_memory(unspecified_gpus)[0]
            chosen_gpu['specified'] = True
            index = chosen_gpu['index']
            print('Using GPU {i}:\n{info}'.format(i=index, info='\n'.join([str(k) + ':' + str(v) for k, v in chosen_gpu.items()])))
            return tf.device('/gpu:{}'.format(index))
else:
    raise ImportError('GPU available check failed')
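
The bookkeeping in auto_choice, which prefers GPUs that have not been handed out yet and reuses cards only once all are taken, can be tried in isolation with plain dicts. The sketch below is a simplification under stated assumptions: the function name pick is hypothetical, the records are made up, and the free memory stays fixed between calls, whereas the real method queries nvidia-smi again on every call.

```python
def pick(gpus):
    """Pick the GPU with the most free memory, preferring ones not handed out yet."""
    # An empty list is falsy, so once every GPU is specified we fall back to all of them
    candidates = [g for g in gpus if not g['specified']] or gpus
    chosen = max(candidates, key=lambda d: d['memory.free'])
    chosen['specified'] = True
    return chosen['index']


# Made-up records for illustration only.
gpus = [
    {'index': '0', 'memory.free': 8000, 'specified': False},
    {'index': '1', 'memory.free': 6000, 'specified': False},
]
picks = [pick(gpus) for _ in range(3)]
print(picks)
```

The first two calls hand out both cards in order of free memory; the third call finds no unspecified card left and falls back to the full list.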
