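JavaScript Type Inference (2): Let's Start Training
Preparing the training data
Next, we preprocess the type information collected in the previous section and turn it into data that can be used for training.
The code is in GetTypes.js, which creates three output directories: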
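const fs = require("fs");

// Input: the cleaned repositories from the previous section.
let root = "data/Repos-cleaned";
// Output directories for the three kinds of type data.
let outputDirGold = "data/outputs-gold/";
let outputDirAll = "data/outputs-all/";
let outputDirCheckJS = "data/outputs-checkjs";
try {
  fs.mkdirSync(outputDirGold);
  fs.mkdirSync(outputDirAll);
  fs.mkdirSync(outputDirCheckJS);
}
catch (err) {
  // mkdirSync throws if a directory already exists; the error is only logged.
  console.log(err);
}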
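The data in outputs-all is used for training. outputs-gold holds the type annotations that developers wrote by hand; this valuable data is reserved for the test set. outputs-checkjs is used for comparison against the results of the CheckJS tool.
The generated training data looks like this: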
let a = 0 ; let s = "s" ; console . log ( s ) ; O $number$ O O O O $string$ O O O $Console$ O $void$ O $string$ O O
class Test { public value : number ; constructor ( v ) { this . value = v ; } } let t = new Test ( 0 ) ; O $any$ O O $number$ O O O O O $number$ O O O O $number$ O $number$ O O O O $Test$ O O $any$ O O O O
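This is the same pairing of code tokens and type labels that we saw in the previous section.
The principle should be familiar by now, so we will not walk through the source code in detail.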
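To make the pairing concrete, here is a minimal Python sketch that lines up the tokens of the first sample with their labels and keeps only the typed ones. The function names pair_tokens and typed_tokens are my own, not part of GetTypes.js, and the sketch assumes the code half and the label half of a sample are available as two separate whitespace-separated strings.

```python
from typing import List, Tuple


def pair_tokens(source: str, target: str) -> List[Tuple[str, str]]:
    """Pair each code token with its label (hypothetical helper)."""
    tokens = source.split()
    labels = target.split()
    if len(tokens) != len(labels):
        raise ValueError(
            "token/label count mismatch: %d vs %d" % (len(tokens), len(labels))
        )
    return list(zip(tokens, labels))


def typed_tokens(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only tokens whose label is a type, dropping the O label."""
    return [(tok, label.strip("$")) for tok, label in pairs if label != "O"]


source = 'let a = 0 ; let s = "s" ; console . log ( s ) ;'
target = "O $number$ O O O O $string$ O O O $Console$ O $void$ O $string$ O O"

pairs = pair_tokens(source, target)
for token, type_name in typed_tokens(pairs):
    print(token, "->", type_name)
```

For the first sample this prints a -> number, s -> string, console -> Console, log -> void and s -> string; every other token carries the O label.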
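Splitting the training and test sets
Once the training data is ready, we can call lexer.py to split it into training, validation, and test sets.
Here is the split for the first 68 projects: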
File counts= 68
Processing 0: 0xProject__0x.js.json
Processing 1: 1backend__1backend.json
Processing 2: 2fd__graphdoc.json
Processing 3: 43081j__rar.js.json
Processing 4: 500tech__angular-tree-component.json
Processing 5: 5calls__5calls.json
Processing 6: 74th__vscode-vim.json
Processing 7: accounts-js__accounts.json
Processing 8: adriancarriger__angularfire2-offline.json
Processing 9: AFASSoftware__maquette.json
Processing 10: afrad__angular2-websocket.json
Processing 11: aggarwalankush__ionic-mosum.json
Processing 12: aggarwalankush__ionic-push-base.json
Processing 13: ahomu__Talkie.json
Processing 14: aikoven__typescript-fsa.json
Processing 15: aioutecism__amVim-for-VSCode.json
Processing 16: airbrake__airbrake-js.json
Processing 17: ajtoo__vscode-org-mode.json
Processing 18: akfish__node-vibrant.json
Processing 19: akserg__ng2-dnd.json
Processing 20: akserg__ng2-slim-loading-bar.json
Processing 21: akserg__ng2-toasty.json
Processing 22: alamgird__angular-next-starter-kit.json
Processing 23: Alberplz__angular2-color-picker.json
Processing 24: alefragnani__vscode-project-manager.json
Processing 25: alex3165__react-mapbox-gl.json
Processing 26: alexjlockwood__avocado.json
Processing 27: alexjlockwood__ShapeShifter.json
Processing 28: alexjoverm__tslint-config-prettier.json
Processing 29: alexjoverm__typescript-library-starter.json
Processing 30: AlexKhymenko__ngx-permissions.json
Processing 31: AlgusDark__bloomer.json
Processing 32: amcdnl__ngrx-actions.json
Processing 33: anandanand84__technicalindicators.json
Processing 34: andrei-markeev__ts2c.json
Processing 35: andrerpena__react-mde.json
Processing 36: andrucz__ionic2-rating.json
Processing 37: angular-redux__store.json
Processing 38: angular-ui__ui-router.json
Processing 39: angulartics__angulartics2.json
Processing 40: ant-design__ant-design-mobile.json
Processing 41: ant-design__ant-design.json
Processing 42: antivanov__js-crawler.json
Processing 43: APIs-guru__graphql-faker.json
Processing 44: APIs-guru__graphql-lodash.json
Processing 45: APIs-guru__graphql-voyager.json
Processing 46: appbaseio__mirage.json
Processing 47: arangodb__arangojs.json
Processing 48: argonjs__argon.json
Processing 49: arkon__ng-sidebar.json
Processing 50: artemsky__ng-snotify.json
Processing 51: artemsky__vue-snotify.json
Processing 52: artsy__emission.json
Processing 53: ascoders__gaea-editor.json
Processing 54: ascoders__react-native-image-viewer.json
Processing 55: ascoders__react-native-image-zoom.json
Processing 56: ashubham__webshot-factory.json
Processing 57: Asymmetrik__ngx-leaflet.json
Processing 58: atom-community__markdown-preview-plus.json
Processing 59: atom-haskell__ide-haskell.json
Processing 60: atom__atom-languageclient.json
Processing 61: aurelia__ux.json
Processing 62: aurelia__validation.json
Processing 63: auth0__angular2-jwt.json
Processing 64: avatsaev__angular-contacts-app-example.json
Processing 65: avatsaev__angular4-docker-example.json
Processing 66: aviabird__angularspree.json
Processing 67: Azure__kashti.json
Train projects: 54
Validation projects: 7
Test projects: 7
Train files: 2184
Validation files: 364
Test files: 187
Producing vocabularies
Size of source vocab: 3377
Size of target vocab: 707
Writing train/valid/test files
Overall tokens: 896479 train, 134374 valid and 60516 test
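The log above reports the split in projects first and files second, which suggests the split is made per project, so files from one project never end up on both sides. A rough Python sketch of such a split follows. It is not the code in lexer.py: the function name split_projects, the 10% fractions, the rounding, and the fixed seed are all my assumptions. With these choices 68 projects happen to give 54/7/7, the same counts as in the log, but the real rule may differ.

```python
import random
from typing import List, Tuple


def split_projects(
    projects: List[str],
    valid_frac: float = 0.1,  # assumed fraction, not taken from lexer.py
    test_frac: float = 0.1,   # assumed fraction, not taken from lexer.py
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Split whole projects into train/valid/test lists."""
    shuffled = sorted(projects)            # sort first so the result is reproducible
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    n_valid = round(len(shuffled) * valid_frac)
    n_test = round(len(shuffled) * test_frac)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


# Placeholder project names, only to show the resulting sizes.
demo_projects = ["project_%02d.json" % i for i in range(68)]
train, valid, test = split_projects(demo_projects)
print(len(train), len(valid), len(test))
```

Splitting by project rather than by file keeps project-specific identifiers and types out of the training data for the test projects, which makes the test score a more honest estimate.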
In the end, three files are generated: train.txt, valid.txt, and test.txt.
Let's take one line from them and look at its format:
<s> import 's' ; import { configure } from 's' ; import * as _UNKNOWN_ from 's' ; configure ( { adapter : new _UNKNOWN_ ( ) } ) ; </s> O O O O O O $any$ O O O O O O O $any$ O O O $any$ O O $any$ O O $any$ O O O O O O
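As before, this is the processed source code paired with the token type labels we generated in the first part.
Two vocabulary files, source_wl and target_wl, are generated as well.
source_wl is the vocabulary of source tokens, for example: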
.
(
)
,
;
:
{
}
's'
"s"
=
this
0
[
]
const
from
=>
import
null
return
if
export
let
expect
<
>
new
?
function
string
<s>
</s>
public
as
private
!
false
true
===
The last entry is _UNKNOWN_, which stands for tokens that are not in the vocabulary.
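The idea of an unknown-word entry can be sketched in a few lines of Python. This is only an illustration, not the logic of lexer.py: the function names and the min_count cutoff are assumptions of mine. Tokens seen fewer than min_count times are left out of the vocabulary and are rewritten as _UNKNOWN_, as in the train.txt sample line shown earlier.

```python
from collections import Counter
from typing import Iterable, List, Set

UNKNOWN = "_UNKNOWN_"


def build_vocab(token_sequences: Iterable[List[str]], min_count: int = 2) -> List[str]:
    """Return tokens seen at least min_count times, most frequent first,
    with the unknown marker appended as the last entry."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    vocab = [tok for tok, n in counts.most_common() if n >= min_count]
    vocab.append(UNKNOWN)
    return vocab


def replace_rare(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Rewrite every out-of-vocabulary token as the unknown marker."""
    return [tok if tok in vocab else UNKNOWN for tok in tokens]


sequences = [
    ["import", "x", ";"],
    ["import", "y", ";"],
]
vocab = build_vocab(sequences)
print(vocab)
print(replace_rare(["import", "x", ";"], set(vocab)))
```

Capping the vocabulary this way keeps the model's input layer small, at the cost of collapsing all rare identifiers into a single symbol.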
target_wl is the vocabulary of types. Here are its first few lines:
O
$any$
$string$
$number$
$complex$
$void$
$boolean$
$any[]$
$string[]$
$number[]$
$Assertion$
$undefined$
${}$
$HTMLElement$
$Promise$
$ExpectStatic$
$Promise<any>$
$PromiseConstructor$
$Promise<void>$
$Element$
$this$
$ErrorConstructor$
$ZeroEx$
$Math$
$SignedOrder$
$Projection$
$JSON$
$JsApi$
$StockData$
$Console$
$VNode$
$T$
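The first entry in the type list, O, is the label for "unknown", that is, tokens that carry no type.
In addition, test_projects.txt is generated, for example: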
43081j__rar.js.json
adriancarriger__angularfire2-offline.json
aikoven__typescript-fsa.json
alexjoverm__tslint-config-prettier.json
AlgusDark__bloomer.json
andrerpena__react-mde.json
arangodb__arangojs.json
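Format conversion
Before CNTK can process the data, we need to convert the txt files to the CTF format that CNTK expects.
The conversion script is available in the CNTK repository: https://github.com/microsoft/CNTK/blob/master/Scripts/txt2ctf.py
The commands are shown below, using Windows as the example. On other systems you can drop the full path and call python directly: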
& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/train.txt --output data/train.ctf
& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/valid.txt --output data/valid.ctf
& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/test.txt --output data/test.ctf
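Since the three commands differ only in the split name, they can also be driven from a small Python script, which avoids hard-coding the interpreter path. The sketch below is a convenience wrapper of my own, not part of CNTK. It uses only the --map, --input, and --output flags shown in the commands above and assumes txt2ctf.py sits in the current directory.

```python
import subprocess
import sys
from typing import List


def build_ctf_commands(data_dir: str = "data", script: str = "txt2ctf.py") -> List[List[str]]:
    """Build one txt2ctf.py command line per split (hypothetical helper)."""
    commands = []
    for split in ("train", "valid", "test"):
        commands.append([
            sys.executable, script,  # reuse the interpreter running this script
            "--map", data_dir + "/source_wl", data_dir + "/target_wl",
            "--input", data_dir + "/" + split + ".txt",
            "--output", data_dir + "/" + split + ".ctf",
        ])
    return commands


def convert_all(data_dir: str = "data") -> None:
    """Run the conversions in order; stop at the first failure."""
    for command in build_ctf_commands(data_dir):
        subprocess.run(command, check=True)


# Call convert_all() from the directory that contains txt2ctf.py and data/.
for command in build_ctf_commands():
    print(" ".join(command[1:]))
```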
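Training
With everything in place, we can call infer.py to start training.
Remember to install Microsoft's CNTK framework first.
Here are my training command and its output: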
C:\Python\Python36\python.exe .\infer.py
Selected GPU[0] GeForce GTX 960M as the process wide default device.
-------------------------------------------------------------------
Build info:
Built time: Apr 23 2019 21:50:08
Last modified date: Tue Apr 23 17:37:55 2019
Build type: Release
Build target: GPU
With ASGD: yes
Math lib: mkl
CUDA version: 10.0.0
CUDNN version: 7.6.2
Build Branch: HEAD
Build SHA1: ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
MPI distribution: Microsoft MPI
MPI version: 7.0.12437.6
-------------------------------------------------------------------
Training 4597857 parameters in 21 parameter tensors.
-------------------------------------------------------------------
Build info:
Built time: Apr 23 2019 21:50:08
Last modified date: Tue Apr 23 17:37:55 2019
Build type: Release
Build target: GPU
With ASGD: yes
Math lib: mkl
CUDA version: 10.0.0
CUDNN version: 7.6.2
Build Branch: HEAD
Build SHA1: ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
MPI distribution: Microsoft MPI
MPI version: 7.0.12437.6
-------------------------------------------------------------------
Learning rate per 1 samples: 0.001
Minibatch[ 1- 10]: loss = 1.052736 * 42461, metric = 14.26% * 42461;
Minibatch[ 11- 20]: loss = 0.671728 * 46088, metric = 13.34% * 46088;
Minibatch[ 21- 30]: loss = 0.486434 * 42913, metric = 8.57% * 42913;
Minibatch[ 31- 40]: loss = 0.542112 * 45928, metric = 9.83% * 45928;
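If you want to follow the trend of these progress lines without reading them by eye, a small parser is enough. The Python sketch below is my own helper and is not part of infer.py or CNTK. The regular expression is written against the line format shown above and may need adjusting if your CNTK version prints the lines differently.

```python
import re
from typing import Dict, Optional

# Matches lines such as:
# Minibatch[ 1- 10]: loss = 1.052736 * 42461, metric = 14.26% * 42461;
PROGRESS_RE = re.compile(
    r"Minibatch\[\s*(\d+)\s*-\s*(\d+)\s*\]:\s*"
    r"loss\s*=\s*([\d.]+)\s*\*\s*(\d+),\s*"
    r"metric\s*=\s*([\d.]+)%\s*\*\s*(\d+)"
)


def parse_progress(line: str) -> Optional[Dict[str, float]]:
    """Extract the numbers from one progress line, or None if it does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    first, last, loss, samples, metric, _ = match.groups()
    return {
        "first_minibatch": int(first),
        "last_minibatch": int(last),
        "loss": float(loss),
        "samples": int(samples),
        "metric_pct": float(metric),
    }


log_lines = [
    "Learning rate per 1 samples: 0.001",
    "Minibatch[ 1- 10]: loss = 1.052736 * 42461, metric = 14.26% * 42461;",
    "Minibatch[ 31- 40]: loss = 0.542112 * 45928, metric = 9.83% * 45928;",
]
records = [r for r in map(parse_progress, log_lines) if r is not None]
for record in records:
    print(record["last_minibatch"], record["loss"], record["metric_pct"])
```

In the output above, the loss falls from about 1.05 to about 0.54 over the first 40 minibatches, with a small rebound between minibatches 21-30 and 31-40.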
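Evaluating the results
In evaluation.py, set the model_file variable to the .cntk file produced by the training step above, then run the script to evaluate the trained model.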
model_file = "models/model-1.cntk"