@lambda121/word-extractor
Node.js package to read Word .doc files
Last updated 6 months ago by lambda121 .
MIT · Repository · Bugs · Original npm · Tarball · package.json
$ cnpm install @lambda121/word-extractor 
SYNC missed versions from official npm registry.

Forked from morungos/node-word-extractor

What's different?
I needed buffer support but didn't want to deal with coffeescript so I modified the repo a bit. The main public change is the module is now an object with two methods fromFile and fromBuffer. I also removed 'bluebird' so the returned promises are native.

word-extractor

Read data from a Word document using node.js

Why use this module?

There are a fair number of npm components which can extract text from Word .doc files, but they all appear to require some external helper program, and involve either spawning a process or communicating with a persistent one. That raises the installation and deployment burden as well as the runtime one.

This module is intended to provide a much faster way of reading the text from a Word file, without leaving the node.js environment.

How do I install this module?

yarn add @gmr-fms/word-extractor

# Or using npm...
npm install @gmr-fms/word-extractor

How do I use this module?

const extract = require('word-extractor');
extract.fromFile('file.doc').then(doc => {
  console.log(doc.getBody());
});

The object returned from the extract() method is a promise that resolves to a document object, which then provides several views onto different parts of the document contents.

Methods

extract#fromFile(filePath) => Promise<Document>
extract#fromBuffer(buf) => Promise<Document>

Document#getBody()

Retrieves the content text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getFootnotes()

Retrieves the footnote text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getHeaders()

Retrieves the header and footer text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getAnnotations()

Retrieves the comment bubble text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

License

Copyright (c) 2016-2017. Stuart Watt.

Licensed under the MIT License.

Current Tags

  • 1.0.0                                ...           latest (6 months ago)

2 Versions

  • 1.0.0                                ...           6 months ago
  • 0.3.0                                ...           6 months ago
Maintainers (1)
Downloads
Today 0
This Week 0
This Month 0
Last Day 0
Last Week 2
Last Month 2
Dependencies (7)
Dev Dependencies (5)
Dependents (0)
None

Copyright 2014 - 2016 © taobao.org |