JavaScript Type Inference (2): Let's Start Training


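Preparing the training data

In this step we preprocess the type information collected in the previous part and turn it into data that can be used for training.

The code lives in GetTypes.js, which creates three output directories: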

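const fs = require("fs");

let root = "data/Repos-cleaned";
let outputDirGold = "data/outputs-gold/";
let outputDirAll = "data/outputs-all/";
let outputDirCheckJS = "data/outputs-checkjs";
try {
    // recursive: true turns mkdirSync into a no-op for directories that already
    // exist, so one existing directory no longer skips the remaining calls.
    fs.mkdirSync(outputDirGold, { recursive: true });
    fs.mkdirSync(outputDirAll, { recursive: true });
    fs.mkdirSync(outputDirCheckJS, { recursive: true });
}
catch (err) {
    console.log(err);
}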

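Of these, outputs-all holds the data used for training. outputs-gold stores the type annotations that developers wrote by hand; this valuable data is reserved for the test set. outputs-checkjs is used for comparison against the results of the check js tool.

The generated training data looks like this: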

let a = 0 ; let s = "s" ; console . log ( s ) ;    O $number$ O O O O $string$ O O O $Console$ O $void$ O $string$ O O
class Test { public value : number ; constructor ( v ) { this . value = v ; } } let t = new Test ( 0 ) ;    O $any$ O O $number$ O O O O O $number$ O O O O $number$ O $number$ O O O O $Test$ O O $any$ O O O O

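This is the same alignment between code and type tokens that we saw in the previous part: each source token on the left is paired, position by position, with a type label on the right.

The principle should already be familiar, so we will not walk through the source code in detail.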
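To make the alignment concrete, here is a minimal Python sketch that splits one such line into (token, label) pairs. It assumes the code and the labels are separated by a tab (or a run of spaces, as in the listing above) and that neither tokens nor labels contain spaces; `parse_example` is a hypothetical helper, not part of GetTypes.js.

```python
import re


def parse_example(line):
    """Split one training line into a list of (token, label) pairs.

    Assumption: source and labels are separated by a tab or by two or
    more whitespace characters, and each side is space-delimited.
    """
    parts = re.split(r"\t+|\s{2,}", line.strip(), maxsplit=1)
    if len(parts) != 2:
        raise ValueError("expected a source part and a label part")
    tokens = parts[0].split(" ")
    labels = parts[1].split(" ")
    if len(tokens) != len(labels):
        raise ValueError(
            "token/label count mismatch: %d vs %d" % (len(tokens), len(labels))
        )
    return list(zip(tokens, labels))


if __name__ == "__main__":
    sample = (
        'let a = 0 ; let s = "s" ; console . log ( s ) ;\t'
        "O $number$ O O O O $string$ O O O $Console$ O $void$ O $string$ O O"
    )
    for token, label in parse_example(sample):
        # Only print the tokens that actually carry a type.
        if label != "O":
            print(token, label)
```

Running it on the first sample line shows that only identifiers such as a, s, console, and log carry a type, while keywords, literals, and punctuation are labeled O.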

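Splitting into training and test sets

Once the training data is ready, we can run lexer.py to split it into training, validation, and test sets.

Here is the split for the first 68 projects: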

File counts= 68
Processing 0: 0xProject__0x.js.json
Processing 1: 1backend__1backend.json
Processing 2: 2fd__graphdoc.json
Processing 3: 43081j__rar.js.json
Processing 4: 500tech__angular-tree-component.json
Processing 5: 5calls__5calls.json
Processing 6: 74th__vscode-vim.json
Processing 7: accounts-js__accounts.json
Processing 8: adriancarriger__angularfire2-offline.json
Processing 9: AFASSoftware__maquette.json
Processing 10: afrad__angular2-websocket.json
Processing 11: aggarwalankush__ionic-mosum.json
Processing 12: aggarwalankush__ionic-push-base.json
Processing 13: ahomu__Talkie.json
Processing 14: aikoven__typescript-fsa.json
Processing 15: aioutecism__amVim-for-VSCode.json
Processing 16: airbrake__airbrake-js.json
Processing 17: ajtoo__vscode-org-mode.json
Processing 18: akfish__node-vibrant.json
Processing 19: akserg__ng2-dnd.json
Processing 20: akserg__ng2-slim-loading-bar.json
Processing 21: akserg__ng2-toasty.json
Processing 22: alamgird__angular-next-starter-kit.json
Processing 23: Alberplz__angular2-color-picker.json
Processing 24: alefragnani__vscode-project-manager.json
Processing 25: alex3165__react-mapbox-gl.json
Processing 26: alexjlockwood__avocado.json
Processing 27: alexjlockwood__ShapeShifter.json
Processing 28: alexjoverm__tslint-config-prettier.json
Processing 29: alexjoverm__typescript-library-starter.json
Processing 30: AlexKhymenko__ngx-permissions.json
Processing 31: AlgusDark__bloomer.json
Processing 32: amcdnl__ngrx-actions.json
Processing 33: anandanand84__technicalindicators.json
Processing 34: andrei-markeev__ts2c.json
Processing 35: andrerpena__react-mde.json
Processing 36: andrucz__ionic2-rating.json
Processing 37: angular-redux__store.json
Processing 38: angular-ui__ui-router.json
Processing 39: angulartics__angulartics2.json
Processing 40: ant-design__ant-design-mobile.json
Processing 41: ant-design__ant-design.json
Processing 42: antivanov__js-crawler.json
Processing 43: APIs-guru__graphql-faker.json
Processing 44: APIs-guru__graphql-lodash.json
Processing 45: APIs-guru__graphql-voyager.json
Processing 46: appbaseio__mirage.json
Processing 47: arangodb__arangojs.json
Processing 48: argonjs__argon.json
Processing 49: arkon__ng-sidebar.json
Processing 50: artemsky__ng-snotify.json
Processing 51: artemsky__vue-snotify.json
Processing 52: artsy__emission.json
Processing 53: ascoders__gaea-editor.json
Processing 54: ascoders__react-native-image-viewer.json
Processing 55: ascoders__react-native-image-zoom.json
Processing 56: ashubham__webshot-factory.json
Processing 57: Asymmetrik__ngx-leaflet.json
Processing 58: atom-community__markdown-preview-plus.json
Processing 59: atom-haskell__ide-haskell.json
Processing 60: atom__atom-languageclient.json
Processing 61: aurelia__ux.json
Processing 62: aurelia__validation.json
Processing 63: auth0__angular2-jwt.json
Processing 64: avatsaev__angular-contacts-app-example.json
Processing 65: avatsaev__angular4-docker-example.json
Processing 66: aviabird__angularspree.json
Processing 67: Azure__kashti.json
Train projects: 54
Validation projects: 7
Test projects: 7
Train files: 2184
Validation files: 364
Test files: 187
Producing vocabularies
Size of source vocab: 3377
Size of target vocab: 707
Writing train/valid/test files
Overall tokens: 896479 train, 134374 valid and 60516 test
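The log shows that the split is made per project rather than per file, so files from one repository never end up in both the training and the test set. A minimal sketch of such a split follows. The 80/10/10 proportions and the fixed seed are assumptions chosen to reproduce the 54/7/7 counts above; lexer.py may use a different rule, and `split_projects` is a hypothetical name.

```python
import random


def split_projects(projects, train_fraction=0.8, seed=0):
    """Shuffle project names and split them into train/valid/test lists.

    Assumption: train takes `train_fraction` of the projects and the
    remainder is divided evenly between validation and test.
    """
    shuffled = list(projects)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    n_valid = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


if __name__ == "__main__":
    # Placeholder project names; the real ones come from the JSON file names.
    names = ["project_%d.json" % i for i in range(68)]
    train, valid, test = split_projects(names)
    print(len(train), len(valid), len(test))
```

Splitting by project is the stricter choice: files in one repository tend to share identifiers and types, so a per-file split would leak information from training into testing.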

At the end, three files are generated: train.txt, valid.txt, and test.txt.

Let's take one line from them and look at its format:

<s> import 's' ; import { configure } from 's' ; import * as _UNKNOWN_ from 's' ; configure ( { adapter : new _UNKNOWN_ ( ) } ) ; </s>    O O O O O O $any$ O O O O O O O $any$ O O O $any$ O O $any$ O O $any$ O O O O O O

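As before, this is the processed source code aligned with the token type labels we generated in part 1.

At the same time, two vocabulary files are generated, source_wl and target_wl.
source_wl is the vocabulary of source tokens, for example: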

.
(
)
,
;
:
{
}
's'
"s"
=
this
0
[
]
const
from
=>
import
null
return
if
export
let
expect
<
>
new
?
function
string
<s>
</s>
public
as
private
!
false
true
===

The last entry in the file is _UNKNOWN_, which stands for any token that is not in the vocabulary.
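A vocabulary with an _UNKNOWN_ fallback can be built along the following lines. This is a sketch, not the logic of lexer.py: the frequency cutoff `min_count` and the function names are assumptions introduced for illustration.

```python
from collections import Counter

UNKNOWN = "_UNKNOWN_"


def build_vocab(sequences, min_count=2):
    """Return tokens seen at least `min_count` times, most frequent first,
    with the unknown marker appended as the last entry.

    `min_count` is a hypothetical cutoff, not a value taken from lexer.py.
    """
    counts = Counter(tok for seq in sequences for tok in seq)
    vocab = [tok for tok, n in counts.most_common() if n >= min_count]
    vocab.append(UNKNOWN)
    return vocab


def encode(tokens, vocab):
    """Map tokens to vocabulary indices, using the unknown index as fallback."""
    index = {tok: i for i, tok in enumerate(vocab)}
    unknown_index = index[UNKNOWN]
    return [index.get(tok, unknown_index) for tok in tokens]


if __name__ == "__main__":
    corpus = [
        ["let", "a", "=", "0", ";"],
        ["let", "b", "=", "0", ";"],
    ]
    vocab = build_vocab(corpus)
    print(vocab)
    print(encode(["let", "a", "=", "0", ";"], vocab))
```

Rare identifiers such as a and b fall below the cutoff and are encoded with the index of _UNKNOWN_, which is what keeps the source vocabulary small (3377 entries in the run above).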

target_wl is the vocabulary of types. Here are its first few lines:

O
$any$
$string$
$number$
$complex$
$void$
$boolean$
$any[]$
$string[]$
$number[]$
$Assertion$
$undefined$
${}$
$HTMLElement$
$Promise$
$ExpectStatic$
$Promise<any>$
$PromiseConstructor$
$Promise<void>$
$Element$
$this$
$ErrorConstructor$
$ZeroEx$
$Math$
$SignedOrder$
$Projection$
$JSON$
$JsApi$
$StockData$
$Console$
$VNode$
$T$

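The first entry in the type vocabulary, O, means unknown: the token has no type label.

In addition, test_projects.txt is generated, for example: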

43081j__rar.js.json
adriancarriger__angularfire2-offline.json
aikoven__typescript-fsa.json
alexjoverm__tslint-config-prettier.json
AlgusDark__bloomer.json
andrerpena__react-mde.json
arangodb__arangojs.json

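Format conversion

Before CNTK can process the data, we still need to convert the txt files into the CTF format that CNTK expects.

The conversion tool is available in the CNTK repository: https://github.com/microsoft/CNTK/blob/master/Scripts/txt2ctf.py

The commands are shown below for Windows (PowerShell). On other systems there is no need to spell out the full path; just call python directly: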

& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/train.txt --output data/train.ctf
& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/valid.txt --output data/valid.ctf
& 'C:\Program Files\Python37\python.exe' txt2ctf.py --map data/source_wl data/target_wl --input data/test.txt --output data/test.ctf
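For orientation, CTF (CNTK Text Format) stores one sample per line: a sequence id followed by one field per stream, each written as a stream name and a sparse one-hot value `index:1`. The sketch below imitates that layout for a single source/label pair. It is a simplified illustration, not the txt2ctf.py code; the stream names S0 and S1 and the tab separators are assumptions about what the script emits, so compare against your own train.ctf.

```python
def to_ctf_lines(sequence_id, streams, vocabs):
    """Render aligned token streams as CTF-style lines.

    streams: list of token lists (here: source tokens, type labels)
    vocabs:  list of dicts mapping token -> index, one per stream
    Each output line holds the sequence id plus one sparse one-hot
    field per stream, e.g. "|S0 12:1".
    """
    lines = []
    longest = max(len(stream) for stream in streams)
    for position in range(longest):
        fields = [str(sequence_id)]
        for stream_index, (stream, vocab) in enumerate(zip(streams, vocabs)):
            # A stream that is shorter than the longest one contributes nothing.
            if position < len(stream):
                value = vocab[stream[position]]
                fields.append("|S%d %d:1" % (stream_index, value))
        lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    # Toy vocabularies; the real indices come from source_wl and target_wl.
    source_vocab = {"let": 0, "a": 1, ";": 2}
    target_vocab = {"O": 0, "$number$": 1}
    rows = to_ctf_lines(
        0,
        [["let", "a", ";"], ["O", "$number$", "O"]],
        [source_vocab, target_vocab],
    )
    print("\n".join(rows))
```

The index in each field is simply the line position of the token in the corresponding vocabulary file, which is why the --map argument lists source_wl and target_wl in the same order as the two columns of the txt files.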

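Training

With everything in place, we can run infer.py to start training.
Remember to install Microsoft's CNTK framework first.

Here are my training command and its output: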

C:\Python\Python36\python.exe .\infer.py
Selected GPU[0] GeForce GTX 960M as the process wide default device.
-------------------------------------------------------------------
Build info:

                Built time: Apr 23 2019 21:50:08
                Last modified date: Tue Apr 23 17:37:55 2019
                Build type: Release
                Build target: GPU
                With ASGD: yes
                Math lib: mkl
                CUDA version: 10.0.0
                CUDNN version: 7.6.2
                Build Branch: HEAD
                Build SHA1: ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
                MPI distribution: Microsoft MPI
                MPI version: 7.0.12437.6
-------------------------------------------------------------------
Training 4597857 parameters in 21 parameter tensors.
-------------------------------------------------------------------
Build info:

                Built time: Apr 23 2019 21:50:08
                Last modified date: Tue Apr 23 17:37:55 2019
                Build type: Release
                Build target: GPU
                With ASGD: yes
                Math lib: mkl
                CUDA version: 10.0.0
                CUDNN version: 7.6.2
                Build Branch: HEAD
                Build SHA1: ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
                MPI distribution: Microsoft MPI
                MPI version: 7.0.12437.6
-------------------------------------------------------------------
Learning rate per 1 samples: 0.001
 Minibatch[   1-  10]: loss = 1.052736 * 42461, metric = 14.26% * 42461;
 Minibatch[  11-  20]: loss = 0.671728 * 46088, metric = 13.34% * 46088;
 Minibatch[  21-  30]: loss = 0.486434 * 42913, metric = 8.57% * 42913;
 Minibatch[  31-  40]: loss = 0.542112 * 45928, metric = 9.83% * 45928;
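Each progress line reports the average loss and metric over a block of minibatches, together with the number of samples (tokens) in that block. To follow a long run, the lines can be parsed and averaged with a small script such as the sketch below. The regular expression is written against the log format shown above; treating the metric as an error rate is an assumption based on it falling as training proceeds.

```python
import re

# Matches lines such as:
#  Minibatch[   1-  10]: loss = 1.052736 * 42461, metric = 14.26% * 42461;
PROGRESS = re.compile(
    r"Minibatch\[\s*(\d+)-\s*(\d+)\]: "
    r"loss = ([\d.]+) \* (\d+), metric = ([\d.]+)% \* (\d+);"
)


def parse_progress(log_text):
    """Extract one record per progress line found in the training log."""
    rows = []
    for match in PROGRESS.finditer(log_text):
        rows.append({
            "first": int(match.group(1)),
            "last": int(match.group(2)),
            "loss": float(match.group(3)),
            "samples": int(match.group(4)),
            "metric_pct": float(match.group(5)),
        })
    return rows


def weighted_mean(rows, key):
    """Average `key` over all rows, weighting each row by its sample count."""
    total = sum(row["samples"] for row in rows)
    return sum(row[key] * row["samples"] for row in rows) / total


if __name__ == "__main__":
    log = (
        " Minibatch[   1-  10]: loss = 1.052736 * 42461, metric = 14.26% * 42461;\n"
        " Minibatch[  11-  20]: loss = 0.671728 * 46088, metric = 13.34% * 46088;\n"
    )
    rows = parse_progress(log)
    print(len(rows), "progress lines")
    print("mean loss:", weighted_mean(rows, "loss"))
    print("mean metric (%):", weighted_mean(rows, "metric_pct"))
```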

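Evaluating the results

In evaluation.py, set the model_file variable to the CNTK model file produced by the previous training step, then run the script to evaluate the trained model.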

model_file = "models/model-1.cntk"
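The metrics that evaluation.py reports are not shown here. As an illustration of what such an evaluation can measure, the sketch below computes token-level accuracy over the positions that carry a real type, skipping the O placeholder so that punctuation and keywords do not inflate the score. `typed_token_accuracy` is a hypothetical helper, not a function from evaluation.py.

```python
def typed_token_accuracy(gold, predicted, no_type="O"):
    """Fraction of typed positions where the prediction equals the gold label.

    Positions whose gold label is the `no_type` placeholder are ignored.
    Returns 0.0 when the sequence contains no typed position at all.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scored = [(g, p) for g, p in zip(gold, predicted) if g != no_type]
    if not scored:
        return 0.0
    correct = sum(1 for g, p in scored if g == p)
    return correct / len(scored)


if __name__ == "__main__":
    gold = ["O", "$number$", "O", "O", "O", "O", "$string$", "O"]
    predicted = ["O", "$number$", "O", "O", "O", "O", "$any$", "O"]
    print(typed_token_accuracy(gold, predicted))
```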