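flowflops: FLOPs Calculation for OneFlow Models
A third-party library for computing the FLOPs and parameter counts of OneFlow models.
Source code (stars welcome): https://github.com/Oneflow-Inc/flow-OpCounter
Introduction & Usage
Introduction to FLOPs & MACs
Many people are unclear about the relationship between FLOPs and MACs; see, for example, this ptflops issue (https://github.com/sovrasov/flops-counter.pytorch/issues/70).
The thop project addresses this question (https://github.com/Lyken17/pytorch-OpCounter/blob/master/benchmark/README.md); its explanation is reproduced below: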
MACs, FLOPs, what is the difference?
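FLOPs is short for floating-point operations, which include mul / add / div and so on.
MACs stands for multiply-accumulate operations, e.g. a <- a + (b x c).
As shown, one MAC consists of one mul and one add, which is why FLOPs is roughly twice the MACs in many places.
However, real-world applications are much more complex. Consider a matrix multiplication example: A is a matrix of shape m x n and B is a vector of length n.
    for i in range(m):
        for j in range(n):
            C[i][j] += A[i][j] * B[j]  # one mul-add
This takes m*n MACs and 2m*n FLOPs. But such an implementation is slow, and it needs parallelization to run faster.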
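The m*n and 2m*n counts can be checked with a small pure-Python sketch. The function name count_matvec_ops is hypothetical, and the result is stored in a length-m vector C[i] (an assumption made so the snippet runs; the pseudocode above indexes C[i][j]).

```python
def count_matvec_ops(A, B):
    """Naive matrix-vector product that also counts multiplies and adds."""
    m, n = len(A), len(B)
    C = [0.0] * m
    muls = adds = 0
    for i in range(m):
        for j in range(n):
            C[i] += A[i][j] * B[j]  # one mul-add, i.e. one MAC
            muls += 1
            adds += 1
    # MACs = number of multiply-accumulate steps, FLOPs = muls + adds
    return C, muls, muls + adds


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # m = 2, n = 3
B = [1.0, 0.0, -1.0]
C, macs, flops = count_matvec_ops(A, B)
print(C, macs, flops)
```

For m = 2 and n = 3 the loop body runs six times, so the sketch reports 6 MACs and 12 FLOPs.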
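    for i in range(m):
        parallel for j in range(n):
            d[j] = A[i][j] * B[j]  # one mul
        C[i][j] = sum(d)  # n adds
Now the MAC count is no longer m*n.
When comparing MACs / FLOPs, we want the number to be implementation-agnostic and as general as possible. Therefore, thop (https://github.com/Lyken17/pytorch-OpCounter) only counts the number of multiplications and ignores all other operations.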
Installation
    pip install flowflops
Usage
Two FLOPs counting strategies are currently supported: counting in Eager mode and counting in Graph mode.
Counting in Graph mode takes longer but gives more accurate results.
Example:
    import oneflow as flow
    import flowvision.models as models
    from flowflops import get_model_complexity_info

    model = models.resnet50()  # your own model, nn.Module
    dsize = (1, 3, 224, 224)  # B, C, H, W

    for mode in ["eager", "graph"]:
        print("====== {} ======".format(mode))
        total_flops, total_params = get_model_complexity_info(
            model, dsize,
            as_strings=False,
            print_per_layer_stat=False,
            mode=mode
        )
        print(total_flops, total_params)
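Output:
    ====== eager ======
    4121925096 25557032
    ====== graph ======
    4127444456 25557032
The two counting modes give slightly different outputs. This is because ResNet's forward code contains statements like out += identity, which add extra FLOPs. In Eager mode we only look at the layers defined in __init__(), so such operations are not hooked in Eager mode.
The add_n operators that Eager mode ignores can be tallied as follows:
    stage-one: (1, 256, 56, 56) * 3
    stage-two: (1, 512, 28, 28) * 4
    stage-three: (1, 1024, 14, 14) * 6
    stage-four: (1, 2048, 7, 7) * 3
The total is 5,519,360, which is exactly the difference between the outputs of the two modes: 4127444456 - 4121925096 = 5519360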
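The tally above can be reproduced with a few lines of Python. This is only an arithmetic check of the numbers quoted in the text, counting one addition per output element of each residual add.

```python
from math import prod

# (shape of each residual-add output, number of residual blocks in that stage)
residual_adds = [
    ((1, 256, 56, 56), 3),   # stage-one
    ((1, 512, 28, 28), 4),   # stage-two
    ((1, 1024, 14, 14), 6),  # stage-three
    ((1, 2048, 7, 7), 3),    # stage-four
]

# One add per output element, summed over all residual blocks.
missing = sum(prod(shape) * blocks for shape, blocks in residual_adds)
print(missing)
print(4127444456 - 4121925096)
```

Both printed values should agree, matching the Eager/Graph gap reported above.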
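Eager mode itself also shows a small discrepancy: ResNet50 is commonly quoted at 4.09G FLOPs, whereas 4.12G is computed here. This is because most studies ignore the FLOPs of operators such as ReLU, so the two figures differ slightly. To see which operators are typically skipped, look at the output of fvcore, a library developed for PyTorch:
    Skipped operation aten::batch_norm 53 time(s)
    Skipped operation aten::max_pool2d 1 time(s)
    Skipped operation aten::add_ 16 time(s)
    Skipped operation aten::adaptive_avg_pool2d 1 time(s)
    FLOPs: 4089184256
The ptflops package has the same issue, and the author has replied in the corresponding issue; see: https://github.com/sovrasov/flops-counter.pytorch/issues/94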
Flops Calculation in Eager & Graph Modes
Next, we take a slightly modified BasicBlock from ResNet18 as an example to walk through the two FLOPs counting methods. The network is defined below.
Throughout, the input shape is assumed to be (1, 32, 64, 64).
    import oneflow as flow
    import oneflow.nn as nn


    def conv3x3(
        in_planes: int, out_planes: int, stride: int = 1, groups: int = 1, dilation: int = 1
    ) -> nn.Conv2d:
        """3x3 convolution with padding"""
        return nn.Conv2d(
            in_planes,
            out_planes,
            kernel_size=3,
            stride=stride,
            padding=dilation,
            groups=groups,
            bias=True,
            dilation=dilation,
        )


    def conv1x1(in_planes: int, out_planes: int, stride: int = 1) -> nn.Conv2d:
        """1x1 convolution"""
        return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)


    class BasicBlock(nn.Module):
        expansion: int = 1

        def __init__(
            self,
            inplanes: int = 32,
            planes: int = 64,
            stride: int = 1,
            downsample=None,
            groups: int = 1,
            dilation: int = 1,
            norm_layer=None,
        ) -> None:
            super(BasicBlock, self).__init__()
            if norm_layer is None:
                norm_layer = nn.BatchNorm2d
            # Both self.conv1 and self.downsample layers downsample the input when stride != 1
            self.conv1 = conv3x3(inplanes, planes, stride)
            self.bn1 = norm_layer(planes)
            self.relu = nn.ReLU()
            self.downsample = downsample
            self.stride = stride
            self.fc = nn.Linear(planes, planes)

        def forward(self, x):
            identity = x

            out = self.conv1(x)
            out = self.bn1(out)
            out = self.relu(out)

            if self.downsample is not None:
                identity = self.downsample(x)

            out += flow.cat([identity, identity], dim=1)
            out = self.relu(out)

            return self.fc(out)
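Eager
In Eager mode, we only look at the layers defined in __init__(), namely:
    self.conv1 = conv3x3(inplanes, planes, stride)
    self.bn1 = norm_layer(planes)
    self.relu = nn.ReLU()
    self.fc = nn.Linear(planes, planes)
2D convolution
The principle of convolution is not repeated here; the formula is given directly: [formula omitted in source]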
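As a rough sketch of one common counting convention (the one used by ptflops-style hooks; that flowflops counts exactly this way is an assumption), the multiply-accumulate count of a convolution is the per-position kernel cost times the number of output positions, plus one add per output element when a bias is present. The helper name conv_flops and its parameters are hypothetical.

```python
from math import prod


def conv_flops(in_channels, out_channels, kernel_size, output_dims,
               batch_size=1, groups=1, bias=True):
    """MAC count of an N-d convolution (hypothetical helper, assumed convention)."""
    # Cost of producing all output channels at one spatial position.
    per_position = prod(kernel_size) * (in_channels // groups) * out_channels
    # Number of spatial positions over the whole batch.
    positions = batch_size * prod(output_dims)
    flops = per_position * positions
    if bias:
        flops += out_channels * positions  # one add per output element
    return flops


# conv1 of the BasicBlock above: 32 -> 64 channels, 3x3 kernel, 64x64 output
print(conv_flops(32, 64, (3, 3), (64, 64)))
```

Under this convention the bias term is small next to the kernel term (262,144 versus 75,497,472 for this layer), which is why the two are often reported together.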
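Normalization
batchnorm mainly computes the mean and variance and uses them to normalize the features and apply an affine transform; its FLOPs are: [formula omitted in source]
Without the affine transform, its FLOPs are: [formula omitted in source]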
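One simple convention, sketched below, charges one operation per input element for the normalization and doubles it when the affine transform is applied. This mirrors ptflops-style hooks and is an assumption, not necessarily the exact formula the article intended; batchnorm_flops is a hypothetical name.

```python
from math import prod


def batchnorm_flops(input_shape, affine=True):
    """Approximate FLOPs of a batch-norm layer (assumed convention)."""
    flops = prod(input_shape)  # normalization: one op per element
    if affine:
        flops *= 2             # scale-and-shift: one more op per element
    return flops


# bn1 of the BasicBlock above sees a (1, 64, 64, 64) tensor
print(batchnorm_flops((1, 64, 64, 64)))
print(batchnorm_flops((1, 64, 64, 64), affine=False))
```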
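Activation function
relu applies y = x if x > 0 else 0 to an input of shape (1, C, H, W), so its FLOPs are: [formula omitted in source]
Linear layer
The linear layer input has shape (N, C, H, W).
The linear layer weight has shape (W1, W).
The FLOPs of multiplying the two are: [formula omitted in source]
This is essentially the same as a matmul computation.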
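Treating the linear layer as a matmul over the last dimension, every input element is multiplied once for each of the W1 output features, which gives N * C * H * W * W1 multiply-accumulates. The sketch below encodes that count; the bias handling (adding out_features once) follows the ptflops convention and is an assumption, and linear_flops is a hypothetical name.

```python
from math import prod


def linear_flops(input_shape, out_features, bias=True):
    """MAC count of a linear layer applied to the last input dimension."""
    flops = prod(input_shape) * out_features
    if bias:
        flops += out_features  # assumed ptflops-style bias term
    return flops


# fc of the BasicBlock above: input (1, 64, 64, 64), weight (64, 64)
print(linear_flops((1, 64, 64, 64), 64))
```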
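Graph
In Graph mode, flow.nn.Module is compiled into flow.nn.Graph; the shape of each operator's input tensors is extracted from the Graph and then used to compute the network's FLOPs.
The Graph converted from the network above: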
    (GRAPHMyGraph):(
      (CONFIGGraphConfig(training=False,))
      (INPUTtensor(...,size=(1,32,64,64),dtype=oneflow.float32))
      (MODULEBasicBlock()):(
        (INPUTtensor(...,is_lazy='True',size=(1,32,64,64),dtype=oneflow.float32))
        (MODULEConv2d(32,64,kernel_size=(3,3),stride=(1,1),padding=(1,1),bias=False)):(
          (INPUTtensor(...,is_lazy='True',size=(1,32,64,64),dtype=oneflow.float32))
          (PARAMETERtensor(...,size=(64,32,3,3),dtype=oneflow.float32,requires_grad=True)):()
          (OPERATOR:model.conv1.weight()->(out:sbp=(B),size=(64,32,3,3),dtype=(oneflow.float32)):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OPERATOR:model.conv1-conv2d-0(_MyGraph_0_input.0.0_2/out:(sbp=(B),size=(1,32,64,64),dtype=(oneflow.float32)),model.conv1.weight/out:(sbp=(B),size=(64,32,3,3),dtype=(oneflow.float32)))->(model.conv1-conv2d-0/out_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32))):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OUTPUTtensor(...,is_lazy='True',size=(1,64,64,64),dtype=oneflow.float32))
        )
        (MODULEBatchNorm2d(64,eps=1e-05,momentum=0.1,affine=True,track_running_stats=True)):(
          (INPUTtensor(...,is_lazy='True',size=(1,64,64,64),dtype=oneflow.float32))
          (PARAMETERtensor(...,size=(64,),dtype=oneflow.float32,requires_grad=True)):()
          (PARAMETERtensor(...,size=(64,),dtype=oneflow.float32,requires_grad=True)):()
          (BUFFERtensor(...,size=(64,),dtype=oneflow.float32)):()
          (BUFFERtensor(...,size=(64,),dtype=oneflow.float32)):()
          (BUFFERtensor(...,size=(),dtype=oneflow.int64)):()
          (OPERATOR:model.bn1.running_mean()->(out:sbp=(B),size=(64),dtype=(oneflow.float32)):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OPERATOR:model.bn1.running_var()->(out:sbp=(B),size=(64),dtype=(oneflow.float32)):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OPERATOR:model.bn1.weight()->(out:sbp=(B),size=(64),dtype=(oneflow.float32)):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OPERATOR:model.bn1.bias()->(out:sbp=(B),size=(64),dtype=(oneflow.float32)):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OPERATOR:model.bn1-normalization-1(model.conv1-conv2d-0/out_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32)),model.bn1.running_mean/out:(sbp=(B),size=(64),dtype=(oneflow.float32)),model.bn1.running_var/out:(sbp=(B),size=(64),dtype=(oneflow.float32)),model.bn1.weight/out:(sbp=(B),size=(64),dtype=(oneflow.float32)),model.bn1.bias/out:(sbp=(B),size=(64),dtype=(oneflow.float32)))->(model.bn1-normalization-1/y_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32))):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OUTPUTtensor(...,is_lazy='True',size=(1,64,64,64),dtype=oneflow.float32))
        )
        (MODULEReLU()):(
          (INPUTtensor(...,is_lazy='True',size=(1,64,64,64),dtype=oneflow.float32))
          (INPUTtensor(...,is_lazy='True',size=(1,64,64,64),dtype=oneflow.float32))
          (OPERATOR:model.relu-relu-2(model.bn1-normalization-1/y_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32)))->(model.relu-relu-2/y_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32))):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OPERATOR:model.relu-relu-5(model-add_n-4/out_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32)))->(model.relu-relu-5/y_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32))):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OUTPUTtensor(...,is_lazy='True',size=(1,64,64,64),dtype=oneflow.float32))
          (OUTPUTtensor(...,is_lazy='True',size=(1,64,64,64),dtype=oneflow.float32))
        )
        (MODULELinear(in_features=64,out_features=64,bias=True)):(
          (INPUTtensor(...,is_lazy='True',size=(1,64,64,64),dtype=oneflow.float32))
          (PARAMETERtensor(...,size=(64,64),dtype=oneflow.float32,requires_grad=True)):()
          (PARAMETERtensor(...,size=(64,),dtype=oneflow.float32,requires_grad=True)):()
          (OPERATOR:model.fc.weight()->(out:sbp=(B),size=(64,64),dtype=(oneflow.float32)):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OPERATOR:model.fc-broadcast_matmul-6(model.relu-relu-5/y_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32)),model.fc.weight/out:(sbp=(B),size=(64,64),dtype=(oneflow.float32)))->(model.fc-broadcast_matmul-6/out_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32))):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OPERATOR:model.fc.bias()->(out:sbp=(B),size=(64),dtype=(oneflow.float32)):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OPERATOR:model.fc-broadcast_add-7(model.fc-broadcast_matmul-6/out_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32)),model.fc.bias/out:(sbp=(B),size=(64),dtype=(oneflow.float32)))->(model.fc-broadcast_add-7/z_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32))):placement=(oneflow.placement(type="cpu",ranks=[0])))
          (OUTPUTtensor(...,is_lazy='True',size=(1,64,64,64),dtype=oneflow.float32))
        )
        (OPERATOR:model-concat-3([_MyGraph_0_input.0.0_2/out:(sbp=(B),size=(1,32,64,64),dtype=(oneflow.float32)),_MyGraph_0_input.0.0_2/out:(sbp=(B),size=(1,32,64,64),dtype=(oneflow.float32))])->(model-concat-3/out_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32))):placement=(oneflow.placement(type="cpu",ranks=[0])))
        (OPERATOR:model-add_n-4([model.relu-relu-2/y_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32)),model-concat-3/out_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32))])->(model-add_n-4/out_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32))):placement=(oneflow.placement(type="cpu",ranks=[0])))
        (OUTPUTtensor(...,is_lazy='True',size=(1,64,64,64),dtype=oneflow.float32))
      )
      (OPERATOR:_MyGraph_0_input.0.0_2(...)->(...):placement=(oneflow.placement(type="cpu",ranks=[0])))
      (OPERATOR:_MyGraph_0_output.0.0_2(...)->(...):placement=(oneflow.placement(type="cpu",ranks=[0])))
      (OUTPUTtensor(...,is_lazy='True',size=(1,64,64,64),dtype=oneflow.float32))
    )
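The entries beginning with OPERATOR in the Graph carry the information we need. Note that a statement such as
    out += identity
is converted to an add_n operator of the form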
(OPERATOR:model-add_n-3([model.relu-relu-2/y_0:(sbp=(B),size=(1,32,64,64),dtype=(oneflow.float32)),_MyGraph_0_input.0.0_2/out:(sbp=(B),size=(1,32,64,64),dtype=(oneflow.float32))])->(model-add_n-3/out_0:(sbp=(B),size=(1,32,64,64),dtype=(oneflow.float32))):placement=(oneflow.placement(type="cpu",ranks=[0])))
This helps us compute the network's FLOPs more accurately.
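To illustrate how such an OPERATOR entry could be turned into an operator type plus input shapes, here is a minimal regex-based sketch. It assumes the "name-optype-index" naming seen in the dump above; parse_operator and both patterns are hypothetical and are not taken from the flowflops source.

```python
import re

# Matches e.g. "(OPERATOR:model.conv1-conv2d-0(" -> ("model.conv1", "conv2d", "0")
OP_RE = re.compile(r"\(OPERATOR:\s*([\w.]+)-(\w+)-(\d+)\(")
SIZE_RE = re.compile(r"size=\(([\d,\s]*)\)")


def _shapes(text):
    """Collect every size=(...) in the text as a tuple of ints."""
    return [tuple(int(v) for v in s.split(",") if v.strip())
            for s in SIZE_RE.findall(text)]


def parse_operator(line):
    """Return module name, op type and tensor shapes, or None for variable ops."""
    match = OP_RE.search(line)
    if match is None:
        return None
    inputs, _, outputs = line.partition("->")  # inputs come before the arrow
    return {
        "module": match.group(1),
        "op": match.group(2),
        "inputs": _shapes(inputs),
        "outputs": _shapes(outputs),
    }


line = (
    '(OPERATOR:model-add_n-3([model.relu-relu-2/y_0:(sbp=(B),size=(1,32,64,64),'
    'dtype=(oneflow.float32)),_MyGraph_0_input.0.0_2/out:(sbp=(B),size=(1,32,64,64),'
    'dtype=(oneflow.float32))])->(model-add_n-3/out_0:(sbp=(B),size=(1,32,64,64),'
    'dtype=(oneflow.float32))):placement=(oneflow.placement(type="cpu",ranks=[0])))'
)
info = parse_operator(line)
print(info["op"], info["inputs"])
```

Entries that only expose a parameter, such as model.conv1.weight(), do not follow the name-optype-index pattern and return None in this sketch.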
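Convolution
In flow.nn.Graph, conv3x3 and conv1x1 are decomposed into conv2d + bias_add (if bias==True).
Since we only look at the input of the convolution layer, while computing FLOPs requires the size of its output features, the output resolution has to be derived from the input, as follows:
    import math

    # in_dims: spatial size of the input; kernel_size, padding, strides: per-dimension settings
    output_dims = []
    for i, in_dim in enumerate(in_dims):
        d = math.ceil((in_dim - kernel_size[i] + 2 * padding[i]) / strides[i]) + 1
        if (in_dim - kernel_size[i] + 2 * padding[i]) % strides[i] != 0:
            d -= 1
        output_dims.append(d)
FLOPs can then be computed as usual.
The operator's output shape is not read directly because parsing it from the Graph would cost additional time.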
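Wrapped into a function, the output-size rule above can be tried on concrete layers. The wrapper name conv_output_dims is hypothetical; the body follows the snippet in the text, which amounts to floor((in - k + 2p) / s) + 1 and, as given, does not take dilation into account.

```python
import math


def conv_output_dims(in_dims, kernel_size, padding, strides):
    """Spatial output size of a convolution, one entry per dimension."""
    output_dims = []
    for i, in_dim in enumerate(in_dims):
        span = in_dim - kernel_size[i] + 2 * padding[i]
        d = math.ceil(span / strides[i]) + 1
        if span % strides[i] != 0:
            d -= 1  # undo the round-up when the stride does not divide evenly
        output_dims.append(d)
    return output_dims


# 3x3 conv, padding 1, stride 1 on a 64x64 input (conv1 of the BasicBlock)
print(conv_output_dims((64, 64), (3, 3), (1, 1), (1, 1)))
# 7x7 conv, padding 3, stride 2 on a 224x224 input (a typical ResNet stem)
print(conv_output_dims((224, 224), (7, 7), (3, 3), (2, 2)))
```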
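Normalization
In flow.nn.Graph, norm_layer (bn) is a single operator, and its FLOPs are computed the same way as in Eager mode.
Note that InstanceNorm and GroupNorm are decomposed into several glue operators in flow.nn.Graph, each of which has to be counted separately.
Activation function
In flow.nn.Graph, relu is a single operator, and its FLOPs are computed the same way as in Eager mode.
Linear layer
In flow.nn.Graph, linear is decomposed into matmul + broadcast_add (if bias==True); its FLOPs formula is essentially the same as in Eager mode.
Others
Some operators such as concat are also captured in flow.nn.Graph, for example this one from the Graph above: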
(OPERATOR:model-concat-3([_MyGraph_0_input.0.0_2/out:(sbp=(B),size=(1,32,64,64),dtype=(oneflow.float32)),_MyGraph_0_input.0.0_2/out:(sbp=(B),size=(1,32,64,64),dtype=(oneflow.float32))])->(model-concat-3/out_0:(sbp=(B),size=(1,64,64,64),dtype=(oneflow.float32))):placement=(oneflow.placement(type="cpu",ranks=[0])))
We consider such operators not to affect the network's FLOPs, so their FLOPs are set to 0.
Currently Supported Ops and Models
The tool currently supports the vast majority of operators and network layers, as well as most CNNs. The lists are as follows.
Eager
    # convolutions
    nn.Conv1d
    nn.Conv2d
    nn.Conv3d
    # activations
    nn.ReLU
    nn.PReLU
    nn.ELU
    nn.LeakyReLU
    nn.ReLU6
    # poolings
    nn.MaxPool1d
    nn.AvgPool1d
    nn.AvgPool2d
    nn.MaxPool2d
    nn.MaxPool3d
    nn.AvgPool3d
    # nn.AdaptiveMaxPool1d
    nn.AdaptiveAvgPool1d
    # nn.AdaptiveMaxPool2d
    nn.AdaptiveAvgPool2d
    # nn.AdaptiveMaxPool3d
    nn.AdaptiveAvgPool3d
    # BNs
    nn.BatchNorm1d
    nn.BatchNorm2d
    nn.BatchNorm3d
    # INs
    nn.InstanceNorm1d
    nn.InstanceNorm2d
    nn.InstanceNorm3d
    # FC
    nn.Linear
    # Upscale
    nn.Upsample
    # Deconvolution
    nn.ConvTranspose1d
    nn.ConvTranspose2d
    nn.ConvTranspose3d
    # RNN
    nn.RNN
    nn.GRU
    nn.LSTM
    nn.RNNCell
    nn.LSTMCell
    nn.GRUCell
Graph
    # conv
    "conv1d"
    "conv2d"
    "conv3d"
    # pool
    "max_pool_1d"
    "max_pool_2d"
    "max_pool_3d"
    "avg_pool_1d"
    "avg_pool_2d"
    "avg_pool_3d"
    "adaptive_max_pool1d"
    "adaptive_max_pool2d"
    "adaptive_max_pool3d"
    "adaptive_avg_pool1d"
    "adaptive_avg_pool2d"
    "adaptive_avg_pool3d"
    # activate
    "relu"
    "leaky_relu"
    "prelu"
    "hardtanh"
    "elu"
    "silu"
    "sigmoid"
    "sigmoid_v2"
    # add
    "bias_add"
    "add_n"
    # matmul
    "matmul"
    "broadcast_matmul"
    # norm
    "normalization"
    # scalar
    "scalar_mul"
    "scalar_add"
    "scalar_sub"
    "scalar_div"
    # stats
    "var"
    # math
    "sqrt"
    "reduce_sum"
    # broadcast
    "broadcast_mul"
    "broadcast_add"
    "broadcast_sub"
    "broadcast_div"
    # empty
    "reshape"
    "ones_like"
    "zero_like"
    "flatten"
    "concat"
    "transpose"
    "slice"
Results for Some FlowVision Models
    ====== eager ======
    +--------------------+----------+-------------+
    | Model              | Params   | FLOPs       |
    +--------------------+----------+-------------+
    | alexnet            | 61.1 M   | 718.16 MMac |
    | vgg11              | 132.86 M | 7.63 GMac   |
    | vgg11_bn           | 132.87 M | 7.64 GMac   |
    | squeezenet1_0      | 1.25 M   | 830.05 MMac |
    | squeezenet1_1      | 1.24 M   | 355.86 MMac |
    | resnet18           | 11.69 M  | 1.82 GMac   |
    | resnet50           | 25.56 M  | 4.12 GMac   |
    | resnext50_32x4d    | 25.03 M  | 4.27 GMac   |
    | shufflenet_v2_x0_5 | 1.37 M   | 43.65 MMac  |
    | regnet_x_16gf      | 54.28 M  | 16.01 GMac  |
    | efficientnet_b0    | 5.29 M   | 401.67 MMac |
    | densenet121        | 7.98 M   | 2.88 GMac   |
    +--------------------+----------+-------------+
    ====== graph ======
    +--------------------+----------+-------------+
    | Model              | Params   | FLOPs       |
    +--------------------+----------+-------------+
    | alexnet            | 61.1 M   | 718.16 MMac |
    | vgg11              | 132.86 M | 7.63 GMac   |
    | vgg11_bn           | 132.87 M | 7.64 GMac   |
    | squeezenet1_0      | 1.25 M   | 830.05 MMac |
    | squeezenet1_1      | 1.24 M   | 355.86 MMac |
    | resnet18           | 11.69 M  | 1.82 GMac   |
    | resnet50           | 25.56 M  | 4.13 GMac   |
    | resnext50_32x4d    | 25.03 M  | 4.28 GMac   |
    | shufflenet_v2_x0_5 | 1.37 M   | 43.7 MMac   |
    | regnet_x_16gf      | 54.28 M  | 16.02 GMac  |
    | efficientnet_b0    | 5.29 M   | 410.35 MMac |
    | densenet121        | 7.98 M   | 2.88 GMac   |
    +--------------------+----------+-------------+
Summary
This article briefly introduced how network FLOPs are computed for OneFlow models.
Reviewing editor: 湯梓紅
Original title: flowflops: OneFlow 模型的 Flops 計算
Source: WeChat official account GiantPandaCV (WeChat ID: GiantPandaCV). Please credit the source when reposting.