parse-latin
Latin-script (natural language) parser
Last updated 6 months ago by wooorm .

A Latin-script language parser for retext producing nlcst nodes.

Whether Old English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”), parse-latin does a good job of tokenizing it.

parse-latin also does a decent job of tokenizing Latin-like scripts such as Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), and Armenian (“Շատ հաճելი է”).

Install

npm:

npm install parse-latin

Use

var inspect = require('unist-util-inspect')
var Latin = require('parse-latin')

var tree = new Latin().parse('A simple sentence.')

console.log(inspect(tree))

Inspecting the tree yields:

RootNode[1] (1:1-1:19, 0-18)
└─ ParagraphNode[1] (1:1-1:19, 0-18)
   └─ SentenceNode[6] (1:1-1:19, 0-18)
      ├─ WordNode[1] (1:1-1:2, 0-1)
      │  └─ TextNode: "A" (1:1-1:2, 0-1)
      ├─ WhiteSpaceNode: " " (1:2-1:3, 1-2)
      ├─ WordNode[1] (1:3-1:9, 2-8)
      │  └─ TextNode: "simple" (1:3-1:9, 2-8)
      ├─ WhiteSpaceNode: " " (1:9-1:10, 8-9)
      ├─ WordNode[1] (1:10-1:18, 9-17)
      │  └─ TextNode: "sentence" (1:10-1:18, 9-17)
      └─ PunctuationNode: "." (1:18-1:19, 17-18)

API

ParseLatin(value)

Exposes the functionality needed to tokenize natural Latin-script languages into a syntax tree. If value is passed to the constructor, it does not need to be passed to #parse().

ParseLatin#tokenize(value)

Tokenize value (string) into letters and numbers (words), white space, and everything else (punctuation). The returned nodes are a flat list without paragraphs or sentences.

Returns

Array.<Node> — Nodes.

ParseLatin#parse(value)

Tokenize value (string) into an nlcst tree. The returned node is a RootNode containing paragraphs and sentences.

Returns

Node — Root node.
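
To make the shape of the returned tree more concrete, here is a minimal sketch of walking it. The tree below is written by hand to mirror the output shown under Use (positions omitted), so the snippet runs without the package installed. The helpers toText and collect are hypothetical names used only for illustration; in practice, utilities such as nlcst-to-string and unist-util-visit cover the same ground.

```javascript
// Hand-written nlcst-like tree for "A simple sentence." (positions omitted).
var tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: concatenate the text of a node and its descendants.
function toText(node) {
  var children = node.children || []
  if (typeof node.value === 'string') return node.value
  return children.map(toText).join('')
}

// Hypothetical helper: collect every node of a given type, depth first.
function collect(node, type, found) {
  var children = node.children || []
  found = found || []
  if (node.type === type) found.push(node)
  children.forEach(function (child) {
    collect(child, type, found)
  })
  return found
}

var words = collect(tree, 'WordNode').map(toText)
var sentences = collect(tree, 'SentenceNode').map(toText)

console.log(words) // [ 'A', 'simple', 'sentence' ]
console.log(sentences) // [ 'A simple sentence.' ]
```

The same helpers should work on a tree returned by #parse(), since they only rely on the type, value, and children fields visible in the inspected output.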

Algorithm

Note: The easiest way to see how parse-latin tokenizes and parses is the online parser demo, which shows the syntax tree corresponding to the typed text.

parse-latin splits text into white space, word, and punctuation tokens. It starts out with a simple definition, one that most other tokenizers use:

  • A “word” is one or more letter or number characters
  • A “white space” is one or more white space characters
  • A “punctuation” is one or more of anything else
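
The three definitions above can be sketched with a single regular expression. This is only an illustration of the idea, not the expressions parse-latin actually uses: the function name roughTokens and the token shape are assumptions, and the character classes are a simplification built on Unicode property escapes (which need a reasonably recent JavaScript runtime).

```javascript
// Simplified first pass: letters and numbers, white space, everything else.
// Assumption: \p{L} and \p{N} stand in for "letter or number characters".
var expression = /([\p{L}\p{N}]+)|(\s+)|([^\p{L}\p{N}\s]+)/gu

// Hypothetical helper: split `value` into rough word, whitespace, and
// punctuation tokens, in order of appearance.
function roughTokens(value) {
  var tokens = []
  var match

  expression.lastIndex = 0

  while ((match = expression.exec(value))) {
    tokens.push({
      type: match[1] ? 'word' : match[2] ? 'whitespace' : 'punctuation',
      value: match[0]
    })
  }

  return tokens
}

console.log(roughTokens('Hvað er að frétta?'))

// The hyphen comes out as separate punctuation at this stage.
console.log(
  roughTokens('non-profit').map(function (token) {
    return token.value
  })
) // [ 'non', '-', 'profit' ]
```

Note how non-profit is split into three tokens here, which is why a later merging step is needed.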

Then, it manipulates and merges those tokens into an nlcst syntax tree, adding sentences and paragraphs where needed. This step has to account for cases such as:

  • Some punctuation marks are part of the word they occur in, such as non-profit, she’s, G.I., 11:00, N/A, &c, nineteenth- and…
  • Some full-stops do not mark a sentence end, such as 1., e.g., id.
  • Although full-stops, question marks, and exclamation marks (sometimes) end a sentence, that end might not occur directly after the mark, such as .), ."
  • And many more exceptions
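
As an illustration of the first kind of exception, the sketch below merges a punctuation mark into a word when it sits directly between two word tokens. It is a simplified stand-in, not parse-latin's actual rules: the function name mergeInnerWordMarks, the token shape, and the list of marks are all assumptions made for this example.

```javascript
// Hypothetical set of marks treated as part of a word when they sit
// directly between two word tokens. Not the library's actual list.
var innerWordMarks = ['-', '’', "'", '/', ':']

// Merge `word, mark, word` runs in a flat token list into a single word.
function mergeInnerWordMarks(tokens) {
  var result = []
  var index = 0
  var token
  var previous
  var next

  while (index < tokens.length) {
    token = tokens[index]
    previous = result[result.length - 1]
    next = tokens[index + 1]

    if (
      token.type === 'punctuation' &&
      innerWordMarks.indexOf(token.value) !== -1 &&
      previous &&
      previous.type === 'word' &&
      next &&
      next.type === 'word'
    ) {
      // Glue the mark and the following word onto the previous word.
      result[result.length - 1] = {
        type: 'word',
        value: previous.value + token.value + next.value
      }
      index += 2
    } else {
      result.push(token)
      index += 1
    }
  }

  return result
}

// Rough tokens for "She’s non-profit.", as a first pass might produce them.
var rough = [
  {type: 'word', value: 'She'},
  {type: 'punctuation', value: '’'},
  {type: 'word', value: 's'},
  {type: 'whitespace', value: ' '},
  {type: 'word', value: 'non'},
  {type: 'punctuation', value: '-'},
  {type: 'word', value: 'profit'},
  {type: 'punctuation', value: '.'}
]

var merged = mergeInnerWordMarks(rough)

console.log(
  merged.map(function (token) {
    return token.value
  })
) // [ 'She’s', ' ', 'non-profit', '.' ]
```

The trailing full stop stays a separate token because no word follows it. Cases such as G.I. or e.g. need context-aware rules that this sketch does not attempt.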

License

MIT © Titus Wormer

Current Tags

  • 0.1.0-rc.3                                ...           0.1.0-rc.3 (6 years ago)
  • 0.1.0-rc.4                                ...           0.1.0-rc.4 (6 years ago)
  • 4.2.1                                ...           latest (6 months ago)
  • 0.4.0-rc.1                                ...           next (6 years ago)

30 Versions

  • 4.2.1                                ...           6 months ago
  • 4.2.0                                ...           a year ago
  • 4.1.1                                ...           2 years ago
  • 4.1.0                                ...           3 years ago
  • 4.0.3                                ...           3 years ago
  • 4.0.2                                ...           3 years ago
  • 4.0.1                                ...           3 years ago
  • 4.0.0                                ...           3 years ago
  • 3.2.0                                ...           4 years ago
  • 3.1.1                                ...           4 years ago
  • 3.1.0                                ...           4 years ago
  • 3.0.0                                ...           4 years ago
  • 2.0.0                                ...           5 years ago
  • 1.0.0                                ...           5 years ago
  • 0.5.2                                ...           5 years ago
  • 0.5.1                                ...           5 years ago
  • 0.5.0                                ...           5 years ago
  • 0.4.2                                ...           6 years ago
  • 0.4.1                                ...           6 years ago
  • 0.4.0                                ...           6 years ago
  • 0.4.0-rc.2                                ...           6 years ago
  • 0.4.0-rc.1                                ...           6 years ago
  • 0.3.0                                ...           6 years ago
  • 0.3.0-rc.1                                ...           6 years ago
  • 0.2.0                                ...           6 years ago
  • 0.1.3                                ...           6 years ago
  • 0.1.2                                ...           6 years ago
  • 0.1.0                                ...           6 years ago
  • 0.1.0-rc.4                                ...           6 years ago
  • 0.1.0-rc.3                                ...           6 years ago
