ESRE series (1): How to deploy natural language processing (NLP): text embeddings and vector search


Author

Mayya Sharipova




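With the release of Elastic Stack 8.0, you can upload PyTorch machine learning models into Elasticsearch and run natural language processing (NLP) inside the Elastic Stack. NLP opens up new opportunities: you can extract information, classify text, and build better search experiences with dense vectors and vector search.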


In this multi-part series, we walk through several end-to-end examples using a variety of PyTorch NLP models.


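As part of the natural language processing (NLP) series, this post shows how to use a text embedding model to generate vector representations of text content, and how to run vector search on the generated vectors. We deploy a publicly available model to Elasticsearch and use it in an ingest pipeline to generate embeddings from text documents. We then show how to use those embeddings in vector search to find documents that are semantically similar to a given query.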


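Vector search, often called semantic search, goes beyond traditional keyword-based search: it lets users find documents that are semantically similar to the query even when they share no keywords with it, which yields a broader range of results. Vector search operates on dense vectors and uses k-nearest neighbor (kNN) search to find similar vectors. To make this possible, text content must first be converted into a numeric vector representation using a text embedding model.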
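To make the idea of k-nearest neighbor search concrete, here is a minimal sketch in plain Python. It is an exact, brute-force search over toy 3-dimensional vectors that we made up for illustration; Elasticsearch uses an approximate algorithm on much larger vectors, so this shows only the concept, not how the engine implements it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, vectors, k):
    """Return the k (doc_id, similarity) pairs closest to the query (exact search)."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; a real model produces hundreds of dimensions.
docs = {
    "weather": [0.9, 0.1, 0.0],
    "climate": [0.8, 0.3, 0.1],
    "cooking": [0.0, 0.2, 0.9],
}

top = knn([1.0, 0.2, 0.0], docs, k=2)
print(top)
```

The two documents whose vectors point in nearly the same direction as the query come back first, even though nothing in the comparison looks at keywords.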


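We use the public dataset from the MS MARCO Passage Ranking Task for the demonstration. The dataset contains real questions from the Microsoft Bing search engine together with human-generated answers. It is a good resource for testing vector search for two reasons: first, question answering is one of the most common use cases for vector search; second, the top papers on the MS MARCO leaderboard all use vector search in some form.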


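In our example, we take a sample of this dataset, use a model to generate text embeddings, and then run vector search on them. We also want to do a quick check of the quality of the results that vector search produces.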


1. Deploy the text embedding model

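The first step is to install a text embedding model. For our model we use msmarco-MiniLM-L-12-v3 from Hugging Face. This is a sentence-transformer model that takes a sentence or a paragraph and maps it to a 384-dimensional dense vector. The model is optimized for semantic search and was trained specifically on the MS MARCO Passage dataset, which makes it suitable for our task. Besides this model, Elasticsearch supports a number of other text embedding models; the full list of supported models is in the Elastic documentation.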


We install the model with the Eland Docker agent that we built in the NER example. Run the following script to import the model into the local cluster and deploy it:

eland_import_hub_model \
  --url https://<user>:<password>@localhost:9200/ \
  --hub-model-id sentence-transformers/msmarco-MiniLM-L-12-v3 \
  --task-type text_embedding \
  --start

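This time, --task-type is set to text_embedding, and the --start option is passed to the Eland script so that the model is deployed automatically, without having to start it in the Model Management UI. To speed up inference, you can increase the number of inference threads with the inference_threads parameter.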


We can test that the model was deployed successfully by running this example in the Kibana Console:

POST /_ml/trained_models/sentence-transformers__msmarco-minilm-l-12-v3/deployment/_infer
{
  "docs": {
    "text_field": "how is the weather in jamaica"
  }
}

The result contains the predicted dense vector (only the first few of its 384 values are shown here):

{
  "predicted_value" : [
    0.3345310091972351,
    -0.305600643157959,
    0.2592800557613373,
    ...
  ]
}
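If you call the model from a script rather than from the Kibana Console, it can help to check the size of the returned vector before using it. The sketch below only builds the request path and body shown above and validates a response of the shape shown above; the function names are our own, and actually sending the request (with an HTTP client of your choice) is left out on purpose.

```python
EXPECTED_DIMS = 384  # dimensionality stated for msmarco-MiniLM-L-12-v3 in this article


def build_infer_request(model_id, text):
    """Return (path, body) for the _infer call used in this article."""
    path = f"/_ml/trained_models/{model_id}/deployment/_infer"
    body = {"docs": {"text_field": text}}
    return path, body


def extract_embedding(response, expected_dims=EXPECTED_DIMS):
    """Pull the dense vector out of an _infer response and sanity-check its length."""
    vector = response["predicted_value"]
    if len(vector) != expected_dims:
        raise ValueError(f"expected {expected_dims} dimensions, got {len(vector)}")
    return vector


path, body = build_infer_request(
    "sentence-transformers__msmarco-minilm-l-12-v3",
    "how is the weather in jamaica",
)
print(path)
print(body)
```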


2. Load the initial data

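As mentioned in the introduction, we use the MS MARCO Passage Ranking dataset. The dataset is quite large, with more than 8 million passages. For our example, we use a subset of it that was used in the testing stage of the 2019 TREC Deep Learning Track. The dataset for the reranking task, msmarco-passagetest2019-top1000.tsv, contains 200 queries, and for each query a list of relevant text passages extracted by a simple IR system. From this dataset we extracted all unique passages with their IDs and put them into a separate tsv file, 182,469 passages in total. We use this file as our dataset.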


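We use Kibana's file upload feature to upload this dataset. Kibana file upload lets us provide custom names for the fields. We name them id, of type long, for the passage ID, and text, of type text, for the passage content. The index name is collection. After the upload, we can see an index named collection with 182,469 documents.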
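The article does not show how the unique passages were extracted, so the following is only one possible sketch. It assumes the top1000 file is tab-separated with four columns in the order query ID, passage ID, query text, passage text; check that against your copy of the file before relying on it. The sample rows are made up.

```python
import csv
import io


def unique_passages(rows):
    """Map passage ID -> passage text, keeping the first occurrence of each ID.

    Each row is assumed to be (qid, pid, query, passage).
    """
    passages = {}
    for _qid, pid, _query, passage in rows:
        passages.setdefault(int(pid), passage)
    return passages


def write_collection(passages, handle):
    """Write one 'id<TAB>text' line per passage, ready for Kibana file upload."""
    for pid, text in passages.items():
        handle.write(f"{pid}\t{text}\n")


# Made-up sample standing in for msmarco-passagetest2019-top1000.tsv.
sample = io.StringIO(
    "1\t10\tfirst query\tpassage A\n"
    "1\t11\tfirst query\tpassage B\n"
    "2\t10\tsecond query\tpassage A\n"
)
reader = csv.reader(sample, delimiter="\t", quoting=csv.QUOTE_NONE)
passages = unique_passages(reader)

out = io.StringIO()
write_collection(passages, out)
print(out.getvalue())
```

For the real file, replace the StringIO objects with open(...) handles; the same passage often appears under several queries, which is why deduplicating by passage ID shrinks the file.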




3. Create the pipeline

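We want to process the initial data with an inference processor that adds an embedding to each passage. To do that, we create a text embedding ingest pipeline and then reindex the initial data through this pipeline.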


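In the Kibana Console, we create an ingest pipeline (as we did in the previous blog post), this time for text embeddings, and call it text-embeddings. The passages are in a field named text. As before, we define a field_map to map text to the field text_field that the model expects. Likewise, the on_failure handler is set to index failures into a different index: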

PUT _ingest/pipeline/text-embeddings
{
  "description": "Text embedding pipeline",
  "processors": [
    {
      "inference": {
        "model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
        "target_field": "text_embedding",
        "field_map": {
          "text": "text_field"
        }
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Index document to 'failed-<index>'",
        "field": "_index",
        "value": "failed-{{{_index}}}"
      }
    },
    {
      "set": {
        "description": "Set error message",
        "field": "ingest.failure",
        "value": "{{_ingest.on_failure_message}}"
      }
    }
  ]
}
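To see what this pipeline does to a single document, here is a plain-Python imitation of its logic. It is not Elasticsearch code: run_pipeline and the embed callable are hypothetical stand-ins, and the sketch only mirrors the three behaviors configured above (the field_map rename, writing to text_embedding, and rerouting failures to a failed-<index> index with the error message stored in ingest.failure).

```python
MODEL_ID = "sentence-transformers__msmarco-minilm-l-12-v3"
FIELD_MAP = {"text": "text_field"}  # document field -> field the model expects


def run_pipeline(doc, index, embed):
    """Imitate the text-embeddings pipeline for one document.

    Returns (target_index, document). `embed` is any callable text -> list of floats.
    """
    try:
        model_input = {model_field: doc[doc_field] for doc_field, model_field in FIELD_MAP.items()}
        enriched = dict(doc)
        enriched["text_embedding"] = {
            "predicted_value": embed(model_input["text_field"]),
            "model_id": MODEL_ID,
        }
        return index, enriched
    except Exception as error:  # mirrors the on_failure handlers
        failed = dict(doc)
        failed["ingest"] = {"failure": str(error)}
        return f"failed-{index}", failed


def fake_embed(text):
    """Stand-in for the deployed model; returns a dummy 3-value vector."""
    return [0.0, 0.0, 0.0]


ok_index, ok_doc = run_pipeline({"id": 1, "text": "hello"}, "collection-with-embeddings", fake_embed)
bad_index, bad_doc = run_pipeline({"id": 2}, "collection-with-embeddings", fake_embed)
print(ok_index, sorted(ok_doc))
print(bad_index, bad_doc["ingest"]["failure"])
```

Note that during a reindex the _index value is the destination index, so in this imitation failed documents end up in failed-collection-with-embeddings.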


4. Reindex the data

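We want to reindex documents from the collection index into a new collection-with-embeddings index by pushing them through the text-embeddings pipeline, so that documents in collection-with-embeddings have an additional field for the passage embedding. Before we do that, we need to create the destination index and define its mapping, in particular for the text_embedding.predicted_value field, where the ingest processor stores the embeddings. Without this step, the embeddings would be indexed into regular float fields and could not be used for vector search. The model we use produces 384-dimensional embeddings, so we use an indexed dense_vector field type with 384 dimensions, as follows: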

PUT collection-with-embeddings
{
  "mappings": {
    "properties": {
      "text_embedding.predicted_value": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      },
      "text": {
        "type": "text"
      }
    }
  }
}

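Finally, we are ready to reindex. Because reindexing takes some time to process all documents and run inference on them, we reindex in the background by calling the API with the wait_for_completion=false flag.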

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "collection"
  },
  "dest": {
    "index": "collection-with-embeddings",
    "pipeline": "text-embeddings"
  }
}

The command above returns a task ID. We can monitor the progress of the task with:

GET _tasks/<task_id>

Alternatively, track the progress by watching Inference count increase in the model stats API or in the model stats UI.
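When polling the task from a script, the interesting part of the response is the status object. The sketch below turns a task response into a completion flag and a progress fraction. It assumes the usual shape of a reindex task response (a top-level completed flag and task.status with total, created, updated and deleted counters); the sample values are invented.

```python
def reindex_progress(task_response):
    """Return (completed, fraction_done) for a reindex task response."""
    status = task_response["task"]["status"]
    total = status.get("total", 0)
    done = status.get("created", 0) + status.get("updated", 0) + status.get("deleted", 0)
    fraction = done / total if total else 0.0
    return task_response.get("completed", False), fraction


# Invented sample: 50,000 of the 182,469 passages processed so far.
sample_response = {
    "completed": False,
    "task": {"status": {"total": 182469, "created": 50000, "updated": 0, "deleted": 0}},
}

completed, fraction = reindex_progress(sample_response)
print(f"completed={completed} progress={fraction:.1%}")
```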



The reindexed documents now contain the inference results: the vector embeddings. For example, one of the documents looks like this:

{
    "id": "G7PPtn8BjSkJO8zzChzT",
    "text": "This is the definition of RNA along with examples of types of RNA molecules. This is the definition of RNA along with examples of types of RNA molecules. RNA Definition",
    "text_embedding":
    {
     "predicted_value":
        [
            0.057356324046850204,
            0.1602816879749298,
            -0.18122544884681702,
            0.022277727723121643,
            ....
        ],
        "model_id": "sentence-transformers__msmarco-minilm-l-12-v3"
    }
}


5. Vector search

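Currently, we do not support implicitly generating embeddings from query terms during a search request, so our semantic search is a two-step process: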

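  1. Obtain a text embedding from the text query. For this, we use the model's _infer API.
  2. Use vector search to find documents that are semantically similar to the query text. In Elasticsearch v8.0, we introduced a new _knn_search endpoint that supports efficient approximate nearest neighbor search on indexed dense_vector fields. We use the _knn_search API to find the closest documents.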


For example, given the text query "how is the weather in jamaica", we first run the _infer API to get its embedding as a dense vector:

POST /_ml/trained_models/sentence-transformers__msmarco-minilm-l-12-v3/deployment/_infer
{
  "docs": {
    "text_field": "how is the weather in jamaica"
  }
}

Next, we plug the resulting dense vector into _knn_search, as follows:

GET collection-with-embeddings/_knn_search
{
  "knn": {
    "field": "text_embedding.predicted_value",
    "query_vector": [
      0.3345310091972351,
      -0.305600643157959,
      0.2592800557613373,
      ...
    ],
    "k": 10,
    "num_candidates": 100
  },
  "_source": [
    "id",
    "text"
  ]
}
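The two steps can be chained in a small helper. This is a sketch under stated assumptions: infer and search are callables that you supply (for example thin wrappers around an HTTP client), build_knn_search and semantic_search are our own names, and the check that k does not exceed num_candidates is a local sanity check rather than a statement of the exact server-side rule.

```python
def build_knn_search(query_vector, k=10, num_candidates=100,
                     field="text_embedding.predicted_value",
                     source_fields=("id", "text")):
    """Build the _knn_search request body used in this article."""
    if k > num_candidates:
        raise ValueError("k should not exceed num_candidates")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": list(source_fields),
    }


def semantic_search(text, infer, search, k=10):
    """Step 1: text -> embedding via `infer`. Step 2: embedding -> hits via `search`."""
    vector = infer(text)
    return search(build_knn_search(vector, k=k))


# Fake callables so the sketch runs without a cluster.
def fake_infer(text):
    return [0.1, 0.2, 0.3]


def fake_search(body):
    return {"received_k": body["knn"]["k"], "dims": len(body["knn"]["query_vector"])}


result = semantic_search("how is the weather in jamaica", fake_infer, fake_search, k=5)
print(result)
```

Keeping the HTTP calls behind two injected callables makes the flow easy to test and easy to swap onto whichever client library you already use.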

As a result, we get the top 10 documents closest to the query, sorted by their proximity to the query:

"hits" : [
      {
        "_index" : "collection-with-embeddings",
        "_id" : "47TPtn8BjSkJO8zzKq_o",
        "_score" : 0.94591534,
        "_source" : {
          "id" : 434125,
          "text" : "The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading."
        }
      },
      {
        "_index" : "collection-with-embeddings",
        "_id" : "3LTPtn8BjSkJO8zzKJO1",
        "_score" : 0.94536424,
        "_source" : {
          "id" : 4498474,
          "text" : "The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year"
        }
      },
      {
        "_index" : "collection-with-embeddings",
        "_id" : "KrXPtn8BjSkJO8zzPbDW",
        "_score" :  0.9432083,
        "_source" : {
          "id" : 190804,
          "text" : "Quick Answer. The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading"
        }
      },
...
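The _score values are not raw cosine similarities. For a dense_vector field mapped with cosine similarity, Elasticsearch documents the kNN score as (1 + cosine) / 2, which keeps scores non-negative. Assuming that formula applies to your version, a short sketch of the conversion in both directions:

```python
def cosine_to_score(cosine):
    """Map a cosine similarity in [-1, 1] to a kNN score in [0, 1]."""
    return (1.0 + cosine) / 2.0


def score_to_cosine(score):
    """Invert cosine_to_score."""
    return 2.0 * score - 1.0


top_hit_score = 0.94591534  # _score of the first hit above
print(score_to_cosine(top_hit_score))
```

So the first hit's score of about 0.946 corresponds to a cosine similarity of roughly 0.89 between the query and passage embeddings.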


6. Quick validation

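Because we use only a subset of the MS MARCO dataset, we cannot run a full evaluation. What we can do is a simple check on a few queries to confirm that we get relevant results rather than random ones. From the TREC 2019 Deep Learning Track judgments for the Passage Ranking Task, we take the last 3 queries, submit them to our vector similarity search, get the top 10 results, and consult the TREC judgments to see how relevant the results are. In the Passage Ranking Task, passages are judged on a four-point scale: Irrelevant (0), Related (the passage is on-topic but does not answer the question) (1), Highly relevant (2), and Perfectly relevant (3).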


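Note that our validation is not a rigorous evaluation; it is only meant as a quick demonstration. Because we indexed only passages that are already known to be related to the queries, this is a much easier task than the original passage retrieval task. We intend to run a rigorous evaluation on the MS MARCO dataset in the future.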


Query #1124210 "tracheids are part of _____", submitted to our vector search, returns the following results:

Passage ID | Relevance rating | Passage
2258591 | 2 - Highly relevant | Tracheid of oak shows pits along the walls. It is longer than a vessel element and has no perforation plates. Tracheids are elongated cells in the xylem of vascular plants that serve in the transport of water and mineral salts. Tracheids are one of two types of tracheary elements, vessel elements being the other. Tracheids, unlike vessel elements, do not have perforation plates.racheids provide most of the structural support in softwoods, where they are the major cell type. Because tracheids have a much higher surface to volume ratio compared to vessel elements, they serve to hold water against gravity (by adhesion) when transpiration is not occurring.
2258592 | 3 - Perfectly relevant | Tracheid. a dead lignified plant cell that functions in water conduction. Tracheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae. Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores.racheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae. Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores.
2258596 | 2 - Highly relevant | Woody angiosperms have also vessels. The mature tracheids form a column of superposed, cylindrical dead cells whose end walls have been perforated, resulting in a continuous tube called vessel (trachea). Tracheids are found in all vascular plants and are the only conducting elements in gymnosperms and ferns. Tracheids have Pits on their end walls. Pits are not nearly as efficient for water translocation as Perforation Plates found in vessel elements. Woody angiosperms have also vessels. The mature tracheids form a column of superposed, cylindrical dead cells whose end walls have been perforated, resulting in a continuous tube called vessel (trachea). Tracheids are found in all vascular plants and are the only conducting elements in gymnosperms and ferns
2258595 | 2 - Highly relevant | Summary: Vessels have perforations at the end plates while tracheids do not have end plates. Tracheids are derived from single individual cells while vessels are derived from a pile of cells. Tracheids are present in all vascular plants whereas vessels are confined to angiosperms. Tracheids are thin whereas vessel elements are wide. Tracheids have a much higher surface-to-volume ratio as compared to vessel elements. Vessels are broader than tracheids with which they are associated. Morphology of the perforation plate is different from that in tracheids. Tracheids are thin whereas vessel elements are wide. Tracheids have a much higher surface-to-volume ratio as compared to vessel elements. Vessels are broader than tracheids with which they are associated. Morphology of the perforation plate is different from that in tracheids.
131190 | 3 - Perfectly relevant | Xylem tracheids are pointed, elongated xylem cells, the simplest of which have continuous primary cell walls and lignified secondary wall thickenings in the form of rings, hoops, or reticulate networks.
7443586 | 2 - Highly relevant | 1 The xylem tracheary elements consist of cells known as tracheids and vessel members, both of which are typically narrow, hollow, and elongated. Tracheids are less specialized than the vessel members and are the only type of water-conducting cells in most gymnosperms and seedless vascular plants.
181177 | 2 - Highly relevant | In most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes.
2947055 | 0 - Irrelevant | Cholesterol belongs to the groups of lipids called _______.holesterol belongs to the groups of lipids called _______.
6541866 | 2 - Highly relevant | In most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes. In most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes.


Query #1129237 "hydrogen is a liquid below what temperature" returns the following results:

Passage ID | Relevance rating | Passage
8588222 | 0 - Irrelevant | Answer: Hydrogen is a liquid below what temperature? By signing up, you'll get thousands of step-by-step solutions to your homework questions.... for Teachers for Schools for Companies
128984 | 3 - Perfectly relevant | Hydrogen gas has the molecular formula H 2. At room temperature and under standard pressure conditions, hydrogen is a gas that is tasteless, odorless and colorless. Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F). Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form. Liquid hydrogen is also used as a rocket fuel.
8588219 | 3 - Perfectly relevant | User:Hydrogen is a liquid below what temperature? a.100°C;c. -183°C;b. -253°C;d.0°C Weegy:0 degrees C Weegy:Hydrogen is a liquid below 253 degrees C. User:What is the boiling point of oxygen? a.100 degrees C c. -57 degrees C b.8 degrees C d. -183 degrees C Weegy:The boiling point of oxygen is -183 degrees C.
3905057 | 3 - Perfectly relevant | Hydrogen is a colorless, odorless, tasteless gas. Its density is the lowest of any chemical element, 0.08999 grams per liter. By comparison, a liter of air weighs 1.29 grams, 14 times as much as a liter of hydrogen. Hydrogen changes from a gas to a liquid at a temperature of -252.77°C (-422.99°F) and from a liquid to a solid at a temperature of -259.2°C (-434.6°F). It is slightly soluble in water, alcohol, and a few other common liquids.
4254811 | 3 - Perfectly relevant | At STP (standard temperature and pressure) hydrogen is a gas. It cools to a liquid at -423 °F, which is only about 37 degrees above absolute zero. Eleven degrees cooler, at … -434 °F, it starts to solidify.
2697752 | 2 - Highly relevant | Hydrogen's state of matter is gas at standard conditions of temperature and pressure. Hydrogen condenses into a liquid or freezes solid at extremely cold... Hydrogen's state of matter is gas at standard conditions of temperature and pressure. Hydrogen condenses into a liquid or freezes solid at extremely cold temperatures. Hydrogen's state of matter can change when the temperature changes, becoming a liquid at temperatures between minus 423.18 and minus 434.49 degrees Fahrenheit. It becomes a solid at temperatures below minus 434.49 F. Due to its high flammability, hydrogen gas is commonly used in combustion reactions, such as in rocket and automobile fuels.
6080460 | 3 - Perfectly relevant | Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F). Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form. Liquid hydrogen is also used as a rocket fuel. Hydrogen is found in large amounts in giant gas planets and stars, it plays a key role in powering stars through fusion reactions. Hydrogen is one of two important elements found in water (H 2 O). Each molecule of water is made up of two hydrogen atoms bonded to one oxygen atom.
128989 | 3 - Perfectly relevant | Confidence votes 11.4K. At STP (standard temperature and pressure) hydrogen is a gas. It cools to a liquid at -423 °F, which is only about 37 degrees above absolute zero. Eleven degrees cooler, at -434 °F, it starts to solidify.
1959030 | 0 - Irrelevant | While below 4 °C the breakage of hydrogen bonds due to heating allows water molecules to pack closer despite the increase in the thermal motion (which tends to expand a liquid), above 4 °C water expands as the temperature increases. Water near the boiling point is about 4% less dense than water at 4 °C (39 °F)
3905800 | 0 - Irrelevant | Hydrogen is the lightest of the elements with an atomic weight of 1.0. Liquid hydrogen has a density of 0.07 grams per cubic centimeter, whereas water has a density of 1.0 g/cc and gasoline about 0.75 g/cc. These facts give hydrogen both advantages and disadvantages.



Query #1133167 "how is the weather in jamaica" returns the following results:

Passage ID | Relevance rating | Passage
434125 | 3 - Perfectly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year.
4498474 | 3 - Perfectly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year.
190804 | 3 - Perfectly relevant | Quick Answer. The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading.
1824479 | 3 - Perfectly relevant | A: The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year.
1824480 | 3 - Perfectly relevant | Quick Answer. The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year.
1824488 | 2 - Highly relevant | Learn About the Weather of Jamaica The weather patterns you'll encounter in Jamaica can vary dramatically around the island Regardless of when you visit, the tropical climate and warm temperatures of Jamaica essentially guarantee beautiful weather during your vacation. Average temperatures in Jamaica range between 80 degrees Fahrenheit and 90 degrees Fahrenheit, with July and August being the hottest months and February the coolest.
4922619 | 2 - Highly relevant | Weather. Jamaica averages about 80 degrees year-round, so climate is less a factor in booking travel than other destinations. The days are warm and the nights are cool. Rain usually falls for short periods in the late afternoon, with sunshine the rest of the day.
190806 | 2 - Highly relevant | It is always important to know what the weather in Jamaica will be like before you plan and take your vacation. For the most part, the average temperature in Jamaica is between 80 °F and 90 °F (27 °FCelsius-29 °Celsius). Luckily, the weather in Jamaica is always vacation friendly. You will hardly experience long periods of rain fall, and you will become accustomed to weeks upon weeks of sunny weather.
2613296 | 2 - Highly relevant | Average temperatures in Jamaica range between 80 degrees Fahrenheit and 90 degrees Fahrenheit, with July and August being the hottest months and February the coolest. Temperatures in Jamaica generally vary approximately 10 degrees from summer to winter
1824486 | 2 - Highly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably...


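We can see that for all 3 queries Elasticsearch returned mostly relevant results. In the tables above, the top-ranked result is rated highly or perfectly relevant for queries #1124210 and #1133167; for query #1129237 the first result is rated 0, and the next seven are rated 2 or 3.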
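The ratings in the three tables above can be tallied with a few lines of Python. The lists below are copied from the tables in rank order (the first table lists nine passages); treating a rating of 2 or higher as "relevant" is our own threshold for this quick check, not an official TREC metric.

```python
# Relevance ratings copied from the tables above, in rank order.
ratings = {
    "1124210": [2, 3, 2, 2, 3, 2, 2, 0, 2],
    "1129237": [0, 3, 3, 3, 3, 2, 3, 3, 0, 0],
    "1133167": [3, 3, 3, 3, 3, 2, 2, 2, 2, 2],
}


def share_relevant(grades, threshold=2):
    """Fraction of returned passages rated at or above the threshold."""
    return sum(1 for grade in grades if grade >= threshold) / len(grades)


summary = {query_id: share_relevant(grades) for query_id, grades in ratings.items()}
for query_id, share in summary.items():
    print(f"query {query_id}: {share:.0%} of listed results rated 2 or 3")
```

With these numbers, between 70% and 100% of the listed results per query are rated highly or perfectly relevant.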



