A full-text search plugin for Cassandra

Introduction:

https://github.com/Stratio/cassandra-lucene-index

Stratio’s Cassandra Lucene Index

Stratio’s Cassandra Lucene Index, derived from Stratio Cassandra, is a plugin for Apache Cassandra that extends its index functionality to provide near-real-time search in the style of ElasticSearch or Solr, including full-text search capabilities and free multivariable, geospatial and bitemporal search. This is achieved through an Apache Lucene-based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio’s Cassandra indexes are one of the core modules on which Stratio’s BigData platform is based.

[Figure: architecture]

Index relevance searches allow you to retrieve the n most relevant results satisfying a search. The coordinator node sends the search to each node in the cluster, each node returns its n best results, and the coordinator then combines these partial results and gives you the n best of them, avoiding a full scan. You can also base the sorting on a combination of fields.
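
The scatter-gather step described above can be illustrated with a minimal sketch. It is not the plugin's implementation: the `Hit` type, the function names and the sample scores are all hypothetical, and real nodes would rank Lucene documents rather than tuples.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical hit representation: (relevance score, row key).
Hit = Tuple[float, str]


def node_top_n(hits: Iterable[Hit], n: int) -> List[Hit]:
    """Return the n highest-scoring hits held by one node, best first."""
    return heapq.nlargest(n, hits, key=lambda hit: hit[0])


def coordinator_top_n(partials: Iterable[List[Hit]], n: int) -> List[Hit]:
    """Merge the per-node partial results and keep the n best overall."""
    merged = (hit for partial in partials for hit in partial)
    return heapq.nlargest(n, merged, key=lambda hit: hit[0])


# Made-up data for three nodes.
node_a = [(0.91, "row-1"), (0.40, "row-2"), (0.75, "row-3")]
node_b = [(0.88, "row-4"), (0.10, "row-5")]
node_c = [(0.95, "row-6"), (0.60, "row-7")]

# Each node only ships its own n best hits to the coordinator.
partials = [node_top_n(hits, 2) for hits in (node_a, node_b, node_c)]
best = coordinator_top_n(partials, 2)
print(best)
```

Because every node sends at most n hits, the coordinator merges at most n times the number of nodes candidates, regardless of how many rows match the search.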

Any cell in the tables can be indexed, including those in the primary key as well as collections. Wide rows are also supported. You can scan token/key ranges, apply additional CQL3 clauses and page on the filtered results.

Index filtered searches are a powerful help when analyzing the data stored in Cassandra with MapReduce frameworks such as Apache Hadoop or, even better, Apache Spark. Adding Lucene filters to the job input can dramatically reduce the amount of data to be processed, avoiding a full scan.

[Figure: spark_architecture]

The following benchmark result gives an idea of the performance to expect when combining Lucene indexes with Spark. We run successive queries requesting from 1% to 100% of the stored data. The index performs well for queries requesting strongly filtered data, but performance decays for less restrictive queries. As the number of records returned by the query increases, we reach a point where the index becomes slower than a full scan. So the decision to use indexes in your Spark jobs depends on query selectivity, and the trade-off between the two approaches depends on the particular use case. Generally, combining Lucene indexes with Spark is recommended for jobs retrieving no more than 25% of the stored data.
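
As a rough illustration of that rule of thumb, the sketch below decides between the index and a full scan from an estimated selectivity. The function name and its parameters are hypothetical, and the 25% default is only the general recommendation above; the real crossover point depends on the use case and should be measured.

```python
def should_use_lucene_index(expected_rows: int, total_rows: int,
                            threshold: float = 0.25) -> bool:
    """Return True when the job is selective enough to favour the index.

    expected_rows: estimated number of rows the job will retrieve.
    total_rows:    number of rows stored in the table.
    threshold:     maximum fraction of the data for which the index is
                   still expected to beat a full scan (assumed 25%).
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    selectivity = expected_rows / total_rows
    return selectivity <= threshold


# A job reading 1% of the data favours the index; one reading 60% does not.
print(should_use_lucene_index(10_000, 1_000_000))
print(should_use_lucene_index(600_000, 1_000_000))
```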

[Figure: spark_performance]

This project is not intended to replace Apache Cassandra denormalized tables, inverted indexes, and/or secondary indexes. It is just a tool for running certain kinds of queries that are really hard to address with Apache Cassandra's out-of-the-box features, filling the gap between real-time and analytics.

[Figure: oltp_olap]

More detailed information is available at Stratio’s Cassandra Lucene Index documentation.

Features

Stratio’s Cassandra Lucene Index and its integration with Lucene search technology provide:

  • Full text search (language-aware analysis, wildcard, fuzzy, regexp)
  • Boolean search (and, or, not)
  • Sorting by relevance, column value, and distance
  • Geospatial indexing (points, lines, polygons and their multiparts)
  • Geospatial transformations (bounding box, buffer, centroid, convex hull, union, difference, intersection)
  • Geospatial operations (intersects, contains, is within)
  • Bitemporal search (valid and transaction time durations)
  • CQL complex types (list, set, map, tuple and UDT)
  • CQL user defined functions (UDF)
  • CQL paging, even with sorted searches
  • Columns with TTL
  • Third-party CQL-based drivers compatibility
  • Spark and Hadoop compatibility
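
To give a feel for how a boolean full-text search might be issued through CQL, here is a hedged sketch that only builds the statement text. The `expr(index, '...')` form and the JSON search keys (`filter`, `type`, `must`, `field`, `value`) follow the plugin's documentation for Cassandra 3.x as best recalled, so treat them as assumptions and check the documentation for the version you run. The table, index and field names are hypothetical, and the helper itself is not part of the plugin.

```python
import json


def lucene_search_cql(table: str, index: str, search: dict,
                      limit: int = 10) -> str:
    """Build a CQL SELECT that passes a JSON search to a Lucene index.

    Assumes the expr(<index>, '<json>') syntax; the identifiers are
    interpolated as-is, so they must come from trusted code.
    """
    expression = json.dumps(search)
    # CQL string literals escape a single quote by doubling it.
    escaped = expression.replace("'", "''")
    return (f"SELECT * FROM {table} "
            f"WHERE expr({index}, '{escaped}') LIMIT {limit};")


# Hypothetical search: user names starting with "a" AND the phrase "big data".
search = {
    "filter": {
        "type": "boolean",
        "must": [
            {"type": "wildcard", "field": "user", "value": "a*"},
            {"type": "phrase", "field": "body", "value": "big data"},
        ],
    }
}
cql = lucene_search_cql("tweets", "tweets_index", search)
print(cql)
```

The resulting string can be handed to any CQL-based driver, which is what makes the third-party driver compatibility listed above possible.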

Not yet supported:

  • Thrift API
  • Legacy compact storage option
  • Indexing counter columns
  • Indexing static columns
  • Partitioners other than Murmur3

Requirements

  • Cassandra (identified by the first three numbers of the plugin version)
  • Java >= 1.8 (OpenJDK and Sun have been tested)
  • Maven >= 3.0
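
The version rule in the first requirement can be expressed as a small helper. This is only an illustration: the function name is hypothetical and the version string used below is an example, not a statement about which releases exist.

```python
def cassandra_version_for_plugin(plugin_version: str) -> str:
    """Return the Cassandra version a plugin version targets.

    Per the requirement above, this is taken to be the first three
    dot-separated numbers of the plugin version.
    """
    parts = plugin_version.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected at least three numbers: {plugin_version!r}")
    return ".".join(parts[:3])


# Example version string, used only for illustration.
print(cassandra_version_for_plugin("3.0.10.3"))
```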

This article is reposted from 张昺华-sky's cnblogs (博客园) blog. Original link: http://www.cnblogs.com/bonelee/p/6757830.html. To republish it, please contact the original author.