AI Open Course (2019-03-07), Prof. Lei Ming's "The AI Revolution and Trends": class notes and personal reflections

Summary: Class notes and personal reflections from Prof. Lei Ming's open-course lecture "The AI Revolution and Trends" (2019-03-07).

Output results


Top 10 words in the training corpus most similar to "morning":

[('afternoon', 0.8329864144325256), ('weekend', 0.7690818309783936), ('evening', 0.7469204068183899),
 ('saturday', 0.7191835045814514), ('night', 0.7091601490974426), ('friday', 0.6764787435531616),
 ('sunday', 0.6380082368850708), ('newspaper', 0.6365975737571716), ('summer', 0.6268560290336609),
 ('season', 0.6137701272964478)]



Top 10 words in the training corpus most similar to "email":

[('mail', 0.7432783842086792), ('contact', 0.6995242834091187), ('address', 0.6547545194625854),
 ('replies', 0.6502780318260193), ('mailed', 0.6334187388420105), ('request', 0.6262195110321045),
 ('sas', 0.6220622658729553), ('send', 0.6207413077354431), ('listserv', 0.617364227771759),
 ('compuserve', 0.5954489707946777)]
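The numbers above are the cosine similarities returned by a most_similar query. A minimal sketch of how such queries might be issued, assuming a Word2Vec model has already been trained and saved (the file name below is hypothetical, not the author's exact script):

# Hedged sketch: reproduce the two queries above against an already-trained model.
from gensim.models import Word2Vec

model = Word2Vec.load("word2vec_text8.model")  # hypothetical path to a previously trained model

for query in ("morning", "email"):
    print("Top 10 words most similar to %r:" % query)
    print(model.wv.most_similar(query, topn=10))  # list of (word, cosine similarity) pairs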



Design approach

(design flow diagram: image not reproduced in this export)
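Since the diagram itself is not shown, the implied pipeline is: stream a tokenized corpus, train a Word2Vec model with the class listed below, save it, then query the resulting word vectors. A hedged sketch, assuming the text8 corpus file is available locally (file names and hyperparameters are illustrative, not the author's exact settings):

# Hedged sketch of the implied training pipeline.
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

sentences = Text8Corpus("text8")  # stream sentences from disk instead of loading everything into memory
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4, sg=0)  # CBOW, defaults close to the class below
model.save("word2vec_text8.model")                  # full model; training can be resumed later
model.wv.save_word2vec_format("text8_vectors.txt")  # vectors only, original word2vec text format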



Core code

class Word2Vec(BaseWordEmbeddingsModel):

    """Train, use and evaluate neural networks described in https://code.google.com/p/word2vec/.

    Once you're finished training a model (=no more updates, only querying),
    store and use only the :class:`~gensim.models.keyedvectors.KeyedVectors` instance in `self.wv`
    to reduce memory.

    The model can be stored/loaded via its :meth:`~gensim.models.word2vec.Word2Vec.save` and
    :meth:`~gensim.models.word2vec.Word2Vec.load` methods.

    The trained word vectors can also be stored/loaded from a format compatible with the
    original word2vec implementation via `self.wv.save_word2vec_format`
    and :meth:`gensim.models.keyedvectors.KeyedVectors.load_word2vec_format`.

    Some important attributes are the following:

    Attributes
    ----------
    wv : :class:`~gensim.models.keyedvectors.Word2VecKeyedVectors`
        This object essentially contains the mapping between words and embeddings. After training,
        it can be used directly to query those embeddings in various ways. See the module level
        docstring for examples.

    vocabulary : :class:`~gensim.models.word2vec.Word2VecVocab`
        This object represents the vocabulary (sometimes called Dictionary in gensim) of the model.
        Besides keeping track of all unique words, this object provides extra functionality, such as
        constructing a huffman tree (frequent words are closer to the root), or discarding
        extremely rare words.

    trainables : :class:`~gensim.models.word2vec.Word2VecTrainables`
        This object represents the inner shallow neural network used to train the embeddings. The
        semantics of the network differ slightly in the two available training modes (CBOW or SG),
        but you can think of it as a NN with a single projection and hidden layer which we train
        on the corpus. The weights are then used as our embeddings (which means that the size of
        the hidden layer is equal to the number of features `self.size`).

    """

    def __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
                 max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
                 sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
                 trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH,
                 compute_loss=False, callbacks=(), max_final_vocab=None):

       """

       Parameters

       ----------

       sentences : iterable of iterables, optional

           The `sentences` iterable can be simply a list of lists of tokens, but for larger corpora,

           consider an iterable that streams the sentences directly from disk/network.

           See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.

            word2vec.Text8Corpus`

           or :class:`~gensim.models.word2vec.LineSentence` in :mod:`~gensim.models.

            word2vec` module for such examples.

           See also the `tutorial on data streaming in Python

           <https://rare-technologies.com/data-streaming-in-python-generators-iterators-

            iterables/>`_.

           If you don't supply `sentences`, the model is left uninitialized -- use if you plan to

            initialize it

           in some other way.

       size : int, optional

           Dimensionality of the word vectors.

       window : int, optional

           Maximum distance between the current and predicted word within a sentence.

       min_count : int, optional

           Ignores all words with total frequency lower than this.

       workers : int, optional

           Use these many worker threads to train the model (=faster training with multicore

            machines).

       sg : {0, 1}, optional

           Training algorithm: 1 for skip-gram; otherwise CBOW.

       hs : {0, 1}, optional

           If 1, hierarchical softmax will be used for model training.

           If 0, and `negative` is non-zero, negative sampling will be used.

       negative : int, optional

           If > 0, negative sampling will be used, the int for negative specifies how many "noise

            words"

           should be drawn (usually between 5-20).

           If set to 0, no negative sampling is used.

       ns_exponent : float, optional

           The exponent used to shape the negative sampling distribution. A value of 1.0

            samples exactly in proportion

           to the frequencies, 0.0 samples all words equally, while a negative value samples low-

            frequency words more

           than high-frequency words. The popular default value of 0.75 was chosen by the

            original Word2Vec paper.

           More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-

            Letelier suggest that

           other values may perform better for recommendation applications.

       cbow_mean : {0, 1}, optional

           If 0, use the sum of the context word vectors. If 1, use the mean, only applies when

            cbow is used.

       alpha : float, optional

           The initial learning rate.

       min_alpha : float, optional

           Learning rate will linearly drop to `min_alpha` as training progresses.

       seed : int, optional

           Seed for the random number generator. Initial vectors for each word are seeded with

            a hash of

           the concatenation of word + `str(seed)`. Note that for a fully deterministically-

            reproducible run,

           you must also limit the model to a single worker thread (`workers=1`), to eliminate

            ordering jitter

           from OS thread scheduling. (In Python 3, reproducibility between interpreter launches

            also requires

           use of the `PYTHONHASHSEED` environment variable to control hash randomization).

       max_vocab_size : int, optional

           Limits the RAM during vocabulary building; if there are more unique

           words than this, then prune the infrequent ones. Every 10 million word types need

            about 1GB of RAM.

           Set to `None` for no limit.

       max_final_vocab : int, optional

           Limits the vocab to a target vocab size by automatically picking a matching min_count.

            If the specified

           min_count is more than the calculated min_count, the specified min_count will be

            used.

           Set to `None` if not required.

       sample : float, optional

           The threshold for configuring which higher-frequency words are randomly

            downsampled,

           useful range is (0, 1e-5).

       hashfxn : function, optional

           Hash function to use to randomly initialize weights, for increased training

            reproducibility.

       iter : int, optional

           Number of iterations (epochs) over the corpus.

       trim_rule : function, optional

           Vocabulary trimming rule, specifies whether certain words should remain in the

            vocabulary,

           be trimmed away, or handled using the default (discard if word count < min_count).

           Can be None (min_count will be used, look to :func:`~gensim.utils.keep_vocab_item`),

           or a callable that accepts parameters (word, count, min_count) and returns either

           :attr:`gensim.utils.RULE_DISCARD`, :attr:`gensim.utils.RULE_KEEP` or :attr:`gensim.utils.

            RULE_DEFAULT`.

           The rule, if given, is only used to prune vocabulary during build_vocab() and is not

            stored as part of the

           model.

           The input parameters are of the following types:

               * `word` (str) - the word we are examining

               * `count` (int) - the word's frequency count in the corpus

               * `min_count` (int) - the minimum count threshold.

       sorted_vocab : {0, 1}, optional

           If 1, sort the vocabulary by descending frequency before assigning word indexes.

           See :meth:`~gensim.models.word2vec.Word2VecVocab.sort_vocab()`.

       batch_words : int, optional

           Target size (in words) for batches of examples passed to worker threads (and

           thus cython routines).(Larger batches will be passed if individual

           texts are longer than 10000 words, but the standard cython code truncates to that

            maximum.)

       compute_loss: bool, optional

           If True, computes and stores loss value which can be retrieved using

           :meth:`~gensim.models.word2vec.Word2Vec.get_latest_training_loss`.

       callbacks : iterable of :class:`~gensim.models.callbacks.CallbackAny2Vec`, optional

           Sequence of callbacks to be executed at specific stages during training.

       Examples

       --------

       Initialize and train a :class:`~gensim.models.word2vec.Word2Vec` model

       >>> from gensim.models import Word2Vec

       >>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

       >>> model = Word2Vec(sentences, min_count=1)

       """

        self.max_final_vocab = max_final_vocab
        self.callbacks = callbacks
        self.load = call_on_class_only

        self.wv = Word2VecKeyedVectors(size)
        self.vocabulary = Word2VecVocab(
            max_vocab_size=max_vocab_size, min_count=min_count, sample=sample,
            sorted_vocab=bool(sorted_vocab), null_word=null_word,
            max_final_vocab=max_final_vocab, ns_exponent=ns_exponent)
        self.trainables = Word2VecTrainables(seed=seed, vector_size=size, hashfxn=hashfxn)

        super(Word2Vec, self).__init__(
            sentences=sentences, workers=workers, vector_size=size, epochs=iter,
            callbacks=callbacks, batch_words=batch_words, trim_rule=trim_rule, sg=sg,
            alpha=alpha, window=window, seed=seed, hs=hs, negative=negative,
            cbow_mean=cbow_mean, min_alpha=min_alpha, compute_loss=compute_loss,
            fast_version=FAST_VERSION)

 

   def _do_train_job(self, sentences, alpha, inits):

       """Train the model on a single batch of sentences.

       Parameters

       ----------

       sentences : iterable of list of str

           Corpus chunk to be used in this training batch.

       alpha : float

           The learning rate used in this batch.

       inits : (np.ndarray, np.ndarray)

           Each worker threads private work memory.

       Returns

       -------

       (int, int)

            2-tuple (effective word count after ignoring unknown words and sentence length

             trimming, total word count).

       """

       work, neu1 = inits

       tally = 0

       if self.sg:

           tally += train_batch_sg(self, sentences, alpha, work, self.compute_loss)

       else:

           tally += train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)

       return tally, self._raw_word_count(sentences)

 

   def _clear_post_train(self):

       """Remove all L2-normalized word vectors from the model."""

       self.wv.vectors_norm = None

 

   def _set_train_params(self, **kwargs):

       if 'compute_loss' in kwargs:

           self.compute_loss = kwargs['compute_loss']

       self.running_training_loss = 0

 

   def train(self, sentences, total_examples=None, total_words=None,

       epochs=None, start_alpha=None, end_alpha=None, word_count=0,

       queue_factor=2, report_delay=1.0, compute_loss=False, callbacks=()):

       """Update the model's neural weights from a sequence of sentences.

       Notes

       -----

       To support linear learning-rate decay from (initial) `alpha` to `min_alpha`, and accurate

       progress-percentage logging, either `total_examples` (count of sentences) or

        `total_words` (count of

       raw words in sentences) **MUST** be provided. If `sentences` is the same corpus

       that was provided to :meth:`~gensim.models.word2vec.Word2Vec.build_vocab` earlier,

       you can simply use `total_examples=self.corpus_count`.

       Warnings

       --------

       To avoid common mistakes around the model's ability to do multiple training passes

        itself, an

       explicit `epochs` argument **MUST** be provided. In the common and recommended

        case

       where :meth:`~gensim.models.word2vec.Word2Vec.train` is only called once, you can

        set `epochs=self.iter`.

       Parameters

       ----------

       sentences : iterable of list of str

           The `sentences` iterable can be simply a list of lists of tokens, but for larger corpora,

           consider an iterable that streams the sentences directly from disk/network.

           See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.

            word2vec.Text8Corpus`

           or :class:`~gensim.models.word2vec.LineSentence` in :mod:`~gensim.models.

            word2vec` module for such examples.

           See also the `tutorial on data streaming in Python

           <https://rare-technologies.com/data-streaming-in-python-generators-iterators-

            iterables/>`_.

       total_examples : int, optional

           Count of sentences. Used to decay the `alpha` learning rate.

       total_words : int, optional

           Count of raw words in sentences. Used to decay the `alpha` learning rate.

       epochs : int, optional

           Number of iterations (epochs) over the corpus.

       start_alpha : float, optional

           Initial learning rate. If supplied, replaces the starting `alpha` from the constructor,

           for this one call to`train()`.

           Use only if making multiple calls to `train()`, when you want to manage the alpha

            learning-rate yourself

           (not recommended).

       end_alpha : float, optional

           Final learning rate. Drops linearly from `start_alpha`.

           If supplied, this replaces the final `min_alpha` from the constructor, for this one call to

            `train()`.

           Use only if making multiple calls to `train()`, when you want to manage the alpha

            learning-rate yourself

           (not recommended).

       word_count : int, optional

           Count of words already trained. Set this to 0 for the usual

           case of training on all words in sentences.

       queue_factor : int, optional

           Multiplier for size of queue (number of workers * queue_factor).

       report_delay : float, optional

           Seconds to wait before reporting progress.

       compute_loss: bool, optional

           If True, computes and stores loss value which can be retrieved using

           :meth:`~gensim.models.word2vec.Word2Vec.get_latest_training_loss`.

       callbacks : iterable of :class:`~gensim.models.callbacks.CallbackAny2Vec`, optional

           Sequence of callbacks to be executed at specific stages during training.

       Examples

       --------

       >>> from gensim.models import Word2Vec

       >>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

       >>>

       >>> model = Word2Vec(min_count=1)

       >>> model.build_vocab(sentences)  # prepare the model vocabulary

       >>> model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)  

        # train word vectors

       (1, 30)

       """

       return super(Word2Vec, self).train(

           sentences, total_examples=total_examples, total_words=total_words,

           epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha,

            word_count=word_count,

           queue_factor=queue_factor, report_delay=report_delay,

            compute_loss=compute_loss, callbacks=callbacks)

 

    def score(self, sentences, total_sentences=int(1e6), chunksize=100, queue_factor=2, report_delay=1):
        """Score the log probability for a sequence of sentences.

        This does not change the fitted model in any way (see :meth:`~gensim.models.word2vec.Word2Vec.train` for that).

        Gensim has currently only implemented score for the hierarchical softmax scheme,
        so you need to have run word2vec with `hs=1` and `negative=0` for this to work.

        Note that you should specify `total_sentences`; you'll run into problems if you ask to
        score more than this number of sentences, but it is inefficient to set the value too high.

        See the `article by Matt Taddy: "Document Classification by Inversion of Distributed Language Representations"
        <https://arxiv.org/pdf/1504.07295.pdf>`_ and the
        `gensim demo <https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb>`_ for examples of
        how to use such scores in document classification.

        Parameters
        ----------
        sentences : iterable of list of str
            The `sentences` iterable can be simply a list of lists of tokens, but for larger corpora,
            consider an iterable that streams the sentences directly from disk/network.
            See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus`
            or :class:`~gensim.models.word2vec.LineSentence` in the :mod:`~gensim.models.word2vec` module for such examples.
        total_sentences : int, optional
            Count of sentences.
        chunksize : int, optional
            Chunksize of jobs.
        queue_factor : int, optional
            Multiplier for size of queue (number of workers * queue_factor).
        report_delay : float, optional
            Seconds to wait before reporting progress.

        """

        if FAST_VERSION < 0:
            warnings.warn("C extension compilation failed, scoring will be slow. "
                          "Install a C compiler and reinstall gensim for fastness.")

        logger.info(
            "scoring sentences with %i workers on %i vocabulary and %i features, "
            "using sg=%s hs=%s sample=%s and negative=%s",
            self.workers, len(self.wv.vocab), self.trainables.layer1_size, self.sg,
            self.hs, self.vocabulary.sample, self.negative)

       if not self.wv.vocab:

           raise RuntimeError("you must first build vocabulary before scoring new data")

       if not self.hs:

           raise RuntimeError(

               "We have currently only implemented score for the hierarchical softmax scheme, "

               "so you need to have run word2vec with hs=1 and negative=0 for this to work.")

        def worker_loop():
            """Compute log probability for each sentence, lifting lists of sentences from the jobs queue."""
            work = zeros(1, dtype=REAL)  # for sg hs, we actually only need one memory loc (running sum)
            neu1 = matutils.zeros_aligned(self.trainables.layer1_size, dtype=REAL)

           while True:

               job = job_queue.get()

               if job is None: # signal to finish

                   break

               ns = 0

               for sentence_id, sentence in job:

                   if sentence_id >= total_sentences:

                       break

                   if self.sg:

                       score = score_sentence_sg(self, sentence, work)

                   else:

                       score = score_sentence_cbow(self, sentence, work, neu1)

                   sentence_scores[sentence_id] = score

                   ns += 1

             

               progress_queue.put(ns) # report progress

     

        start, next_report = default_timer(), 1.0
        # buffer ahead only a limited number of jobs.. this is the reason we can't simply use ThreadPool :(

       job_queue = Queue(maxsize=queue_factor * self.workers)

       progress_queue = Queue(maxsize=(queue_factor + 1) * self.workers)

       workers = [threading.Thread(target=worker_loop) for _ in xrange(self.workers)]

       for thread in workers:

           thread.daemon = True # make interrupting the process with ctrl+c easier

           thread.start()

     

       sentence_count = 0

       sentence_scores = matutils.zeros_aligned(total_sentences, dtype=REAL)

       push_done = False

       done_jobs = 0

       jobs_source = enumerate(utils.grouper(enumerate(sentences), chunksize))

       # fill jobs queue with (id, sentence) job items

       while True:

           try:

               job_no, items = next(jobs_source)

               if (job_no - 1) * chunksize > total_sentences:

                   logger.warning("terminating after %i sentences (set higher total_sentences if you

                    want more).", total_sentences)

                   job_no -= 1

                   raise StopIteration()

               logger.debug("putting job #%i in the queue", job_no)

               job_queue.put(items)

           except StopIteration:

               logger.info("reached end of input; waiting to finish %i outstanding jobs", job_no -

                done_jobs + 1)

               for _ in xrange(self.workers):

                   job_queue.put(None) # give the workers heads up that they can finish -- no more

                    work!

             

               push_done = True

           try:

               while done_jobs < (job_no + 1) or not push_done:

                   ns = progress_queue.get(push_done) # only block after all jobs pushed

                   sentence_count += ns

                   done_jobs += 1

                   elapsed = default_timer() - start

                   if elapsed >= next_report:

                       logger.info("PROGRESS: at %.2f%% sentences, %.0f sentences/s", 100.0 *

                        sentence_count, sentence_count / elapsed)

                       next_report = elapsed + report_delay # don't flood log, wait report_delay

                        seconds

               else:

                   break # loop ended by job count; really done

         

           except Empty:

               pass # already out of loop; continue to next push

     

       elapsed = default_timer() - start

       self.clear_sims()

       logger.info("scoring %i sentences took %.1fs, %.0f sentences/s", sentence_count,

        elapsed, sentence_count / elapsed)

       return sentence_scores[:sentence_count]

 

   def clear_sims(self):

       """Remove all L2-normalized word vectors from the model, to free up memory.

       You can recompute them later again using the :meth:`~gensim.models.word2vec.

        Word2Vec.init_sims` method.

       """

       self.wv.vectors_norm = None

 

    def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict'):

       """Merge in an input-hidden weight matrix loaded from the original C word2vec-tool

        format,

       where it intersects with the current vocabulary.

       No words are added to the existing vocabulary, but intersecting words adopt the file's

        weights, and

       non-intersecting words are left alone.

       Parameters

       ----------

       fname : str

           The file path to load the vectors from.

       lockf : float, optional

           Lock-factor value to be set for any imported word-vectors; the

           default value of 0.0 prevents further updating of the vector during subsequent

           training. Use 1.0 to allow further training updates of merged vectors.

       binary : bool, optional

           If True, `fname` is in the binary word2vec C format.

       encoding : str, optional

           Encoding of `text` for `unicode` function (python2 only).

       unicode_errors : str, optional

           Error handling behaviour, used as parameter for `unicode` function (python2 only).

       """

       overlap_count = 0

       logger.info("loading projection weights from %s", fname)

       with utils.smart_open(fname) as fin:

           header = utils.to_unicode(fin.readline(), encoding=encoding)

           vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format

           if not vector_size == self.wv.vector_size:

               raise ValueError("incompatible vector size %d in file %s" % (vector_size, fname)) #

                TOCONSIDER: maybe mismatched vectors still useful enough to merge (truncating/padding)?

           if binary:

               binary_len = dtype(REAL).itemsize * vector_size

               for _ in xrange(vocab_size): # mixed text and binary: read text first, then binary

                   word = []

                   while True:

                       ch = fin.read(1)

                       if ch == b' ':

                           break

                       if ch != b'\n': # ignore newlines in front of words (some binary files have)

                           word.append(ch)

                 

                   word = utils.to_unicode(b''.join(word), encoding=encoding,

                    errors=unicode_errors)

                   weights = fromstring(fin.read(binary_len), dtype=REAL)

                   if word in self.wv.vocab:

                       overlap_count += 1

                       self.wv.vectors[self.wv.vocab[word].index] = weights

                        self.trainables.vectors_lockf[self.wv.vocab[word].index] = lockf  # lock-factor: 0.0=no changes

         

           else:

               for line_no, line in enumerate(fin):

                    parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")

                   if len(parts) != vector_size + 1:

                       raise ValueError("invalid vector on line %s (is this really the text format?)" %

                        line_no)

                   word, weights = parts[0], [REAL(x) for x in parts[1:]]

                   if word in self.wv.vocab:

                       overlap_count += 1

                       self.wv.vectors[self.wv.vocab[word].index] = weights

                        self.trainables.vectors_lockf[self.wv.vocab[word].index] = lockf  # lock-factor: 0.0=no changes

     

       logger.info("merged %d vectors into %s matrix from %s", overlap_count, self.wv.vectors.

        shape, fname)

   @deprecated("Method will be removed in 4.0.0, use self.wv.__getitem__() instead")

 

   def __getitem__(self, words):

       """Deprecated. Use `self.wv.__getitem__` instead.

        Refer to the documentation for :meth:`~gensim.models.keyedvectors.Word2VecKeyedVectors.__getitem__`.

       """

       return self.wv.__getitem__(words)

 

   @deprecated("Method will be removed in 4.0.0, use self.wv.__contains__() instead")

   def __contains__(self, word):

       """Deprecated. Use `self.wv.__contains__` instead.

        Refer to the documentation for :meth:`~gensim.models.keyedvectors.Word2VecKeyedVectors.__contains__`.

       """

       return self.wv.__contains__(word)

 

   def predict_output_word(self, context_words_list, topn=10):

       """Get the probability distribution of the center word given context words.

       Parameters

       ----------

       context_words_list : list of str

           List of context words.

       topn : int, optional

           Return `topn` words and their probabilities.

       Returns

       -------

       list of (str, float)

           `topn` length list of tuples of (word, probability).

       """

       if not self.negative:

            raise RuntimeError(
                "We have currently only implemented predict_output_word for the negative sampling scheme, "
                "so you need to have run word2vec with negative > 0 for this to work.")

       if not hasattr(self.wv, 'vectors') or not hasattr(self.trainables, 'syn1neg'):

           raise RuntimeError("Parameters required for predicting the output words not found.")

       word_vocabs = [self.wv.vocab[w] for w in context_words_list if w in self.wv.vocab]

       if not word_vocabs:

           warnings.warn("All the input context words are out-of-vocabulary for the current

            model.")

           return None

       word2_indices = [word.index for word in word_vocabs]

       l1 = np_sum(self.wv.vectors[word2_indices], axis=0)

       if word2_indices and self.cbow_mean:

           l1 /= len(word2_indices)

       # propagate hidden -> output and take softmax to get probabilities

       prob_values = exp(dot(l1, self.trainables.syn1neg.T))

       prob_values /= sum(prob_values)

        # returning the most probable output words with their probabilities
        top_indices = matutils.argsort(prob_values, topn=topn, reverse=True)

       return [(self.wv.index2word[index1], prob_values[index1]) for index1 in top_indices]

 

   def init_sims(self, replace=False):

       """Deprecated. Use `self.wv.init_sims` instead.

       See :meth:`~gensim.models.keyedvectors.Word2VecKeyedVectors.init_sims`.

       """

       if replace and hasattr(self.trainables, 'syn1'):

           del self.trainables.syn1

       return self.wv.init_sims(replace)

 

   def reset_from(self, other_model):

       """Borrow shareable pre-built structures from `other_model` and reset hidden layer

        weights.

       Structures copied are:

           * Vocabulary

           * Index to word mapping

           * Cumulative frequency table (used for negative sampling)

           * Cached corpus length

       Useful when testing multiple models on the same corpus in parallel.

       Parameters

       ----------

       other_model : :class:`~gensim.models.word2vec.Word2Vec`

           Another model to copy the internal structures from.

       """

       self.wv.vocab = other_model.wv.vocab

       self.wv.index2word = other_model.wv.index2word

       self.vocabulary.cum_table = other_model.vocabulary.cum_table

       self.corpus_count = other_model.corpus_count

       self.trainables.reset_weights(self.hs, self.negative, self.wv)

 

   @staticmethod

   def log_accuracy(section):

       """Deprecated. Use `self.wv.log_accuracy` instead.

       See :meth:`~gensim.models.word2vec.Word2VecKeyedVectors.log_accuracy`.

       """

       return Word2VecKeyedVectors.log_accuracy(section)

 

   @deprecated("Method will be removed in 4.0.0, use self.wv.evaluate_word_analogies()

    instead")

   def accuracy(self, questions, restrict_vocab=30000, most_similar=None,

    case_insensitive=True):

       """Deprecated. Use `self.wv.accuracy` instead.

       See :meth:`~gensim.models.word2vec.Word2VecKeyedVectors.accuracy`.

       """

       most_similar = most_similar or Word2VecKeyedVectors.most_similar

       return self.wv.accuracy(questions, restrict_vocab, most_similar, case_insensitive)

 

   def __str__(self):

       """Human readable representation of the model's state.

       Returns

       -------

       str

            Human readable representation of the model's state, including the vocabulary size, vector size
            and learning rate.

       """

       return "%s(vocab=%s, size=%s, alpha=%s)" % (

           self.__class__.__name__, len(self.wv.index2word), self.wv.vector_size, self.alpha)

 

   def delete_temporary_training_data(self, replace_word_vectors_with_normalized=False):

       """Discard parameters that are used in training and scoring, to save memory.

       Warnings

       --------

       Use only if you're sure you're done training a model.

       Parameters

       ----------

       replace_word_vectors_with_normalized : bool, optional

           If True, forget the original (not normalized) word vectors and only keep

           the L2-normalized word vectors, to save even more memory.

       """

       if replace_word_vectors_with_normalized:

           self.init_sims(replace=True)

       self._minimize_model()

 

   def save(self, *args, **kwargs):

       """Save the model.

        This saved model can be loaded again using :func:`~gensim.models.word2vec.Word2Vec.load`, which supports
        online training and getting vectors for vocabulary words.

       Parameters

       ----------

       fname : str

           Path to the file.

       """

       # don't bother storing the cached normalized vectors, recalculable table

       kwargs['ignore'] = kwargs.get('ignore', ['vectors_norm', 'cum_table'])

       super(Word2Vec, self).save(*args, **kwargs)

 

   def get_latest_training_loss(self):

       """Get current value of the training loss.

       Returns

       -------

       float

           Current training loss.

       """

       return self.running_training_loss

 

    @deprecated(
        "Method will be removed in 4.0.0, keep just_word_vectors = model.wv to retain just the KeyedVectors instance")
    def _minimize_model(self, save_syn1=False, save_syn1neg=False, save_vectors_lockf=False):

       if save_syn1 and save_syn1neg and save_vectors_lockf:

           return

       if hasattr(self.trainables, 'syn1') and not save_syn1:

           del self.trainables.syn1

       if hasattr(self.trainables, 'syn1neg') and not save_syn1neg:

           del self.trainables.syn1neg

       if hasattr(self.trainables, 'vectors_lockf') and not save_vectors_lockf:

           del self.trainables.vectors_lockf

       self.model_trimmed_post_training = True

 

   @classmethod

   def load_word2vec_format(

       cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',

       limit=None, datatype=REAL):

       """Deprecated. Use :meth:`gensim.models.KeyedVectors.load_word2vec_format`

        instead."""

       raise DeprecationWarning("Deprecated. Use gensim.models.KeyedVectors.

        load_word2vec_format instead.")

 

   def save_word2vec_format(self, fname, fvocab=None, binary=False):

       """Deprecated. Use `model.wv.save_word2vec_format` instead.

       See :meth:`gensim.models.KeyedVectors.save_word2vec_format`.

       """

       raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")

 

   @classmethod

   def load(cls, *args, **kwargs):

       """Load a previously saved :class:`~gensim.models.word2vec.Word2Vec` model.

       See Also

       --------

       :meth:`~gensim.models.word2vec.Word2Vec.save`

           Save model.

       Parameters

       ----------

       fname : str

           Path to the saved file.

       Returns

       -------

       :class:`~gensim.models.word2vec.Word2Vec`

           Loaded model.

       """

       try:

           model = super(Word2Vec, cls).load(*args, **kwargs)

            # for backward compatibility for `max_final_vocab` feature

           if not hasattr(model, 'max_final_vocab'):

               model.max_final_vocab = None

               model.vocabulary.max_final_vocab = None

           return model

       except AttributeError:

            logger.info('Model saved using code from earlier Gensim Version. Re-loading old model in a compatible way.')

           from gensim.models.deprecated.word2vec import load_old_word2vec

           return load_old_word2vec(*args, **kwargs)
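To tie the listing back to the results at the top, here is a small, hedged usage sketch of the class above (gensim 3.x API). The toy corpus is only for illustration; with a real corpus such as text8, the most_similar output would resemble the lists shown earlier.

# Hedged end-to-end sketch using the Word2Vec class shown above.
from gensim.models import Word2Vec

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]  # toy corpus for illustration only

model = Word2Vec(min_count=1, size=50, negative=5, hs=0)      # negative sampling, matching the defaults above
model.build_vocab(sentences)                                  # scan the corpus and build the vocabulary
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)  # explicit epochs, per the docstring

print(model.wv.most_similar("cat", topn=3))                   # query the trained KeyedVectors
print(model.predict_output_word(["dog", "say"], topn=3))      # requires negative > 0, per the docstring
model.save("w2v_demo.model")                                  # normalized vectors / cum_table are not persisted
reloaded = Word2Vec.load("w2v_demo.model")                    # supports continued training after loading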


