Lucene/Solr 4.0-ALPHA – What’s In A Name?

Lucene & Solr 4.0-ALPHA were released on July 3, 2012. This is a huge milestone for the project, and the culmination of an idea that was spawned 2 years ago with the creation of the 4x branch. I’ve included the highlights from the release announcements below, but that’s not really the point of this post. What I’d really like to talk about today is why this release is called an “ALPHA”, and what it means to you as a user.

Is This Really A “Release” ?

First and foremost: do not mistake the 4.0-ALPHA release for a “nightly build”, a “jenkins snapshot”, or a “release candidate”.

This is an official Apache release, voted on by the Lucene PMC, and available from the Apache Mirrors.

Why Is It Named 4.0-ALPHA ? Why Not Just 4.0 ?

The reason this release is “4.0-ALPHA” stems from the history of Lucene version numbering, and from the strong Java API and index (file) format backwards-compatibility commitments the Lucene project tries to live up to. When you develop a project using a Lucene-Core or Solr release, our goal (as Lucene developers) is to ensure that you will be able to upgrade cleanly and easily to any future release in the same “major version” (i.e., “3.X” -> “3.Y”) without needing to change your code, modify your configuration, or re-index all of your data. You may choose to change your code/configs based on new features that are available, or to take advantage of new performance improvements, but it should not be required.

In the long-ago past of Lucene 1.X, 2.X, and 3.0, this was relatively easy for us to be confident of, because there was only a single “trunk” of development for all major releases. So releases like “2.0” and “3.0” were really nothing more than removing the deprecations from the previous “1.X” and “2.X” releases. This made backwards compatibility fairly easy to ensure, but it was also very limiting in how and when radical improvements could be made to the APIs. The shift to parallel 3x and 4x branches has allowed a lot of really amazing feature development that wouldn’t have been possible before, but it also means that the 4.0 release will contain a lot of completely new code and APIs that most users will have never seen, or had an opportunity to give feedback on.

Hence the idea for having formal alpha and beta releases for 4.0 was born. The motivation behind these releases is:

  • The 4.0-ALPHA release means we are confident that the index file formats have been fully “baked” and will be supported through all 5.X versions. 4.0-ALPHA users should not need to worry about index incompatibilities when upgrading to any future “4.X” (unless some seriously heinous bug is reported against 4.0-ALPHA that can’t be fixed in any other way, but this is a very small risk faced in every release)
  • Based on the feedback and bug reports from users of the 4.0-ALPHA release, there may be a 4.0-BETA release once we are confident that the public Java APIs and config file syntax have been fully “baked” and will be supported until 5.0. 4.0-BETA users should not need to worry about changing any configs or applications they write against the 4.0-BETA APIs when upgrading to any future “4.X” release (unless some seriously heinous bug is reported against 4.0-BETA that can’t be fixed in any way other than changing the public APIs or config file syntax)
  • The 4.0 (final) release will be based on the feedback and bug reports from users of the 4.0-ALPHA and 4.0-BETA releases. 4.0 users should not need to worry about making any changes to their configs or application code when upgrading to any future “4.X” release.

Or to put it another way:

  • We want the 4.0 release to be rock solid and dependable.
  • We don’t want anyone who thinks “It’s a dot-oh release, so it probably sucks” to be right.
  • We want to be able to support a high-quality commitment to backwards compatibility for all 4.0 users as we move forward with future “4.X” releases.

What’s Next ?

In order for any of this to happen, in order for any of these releases to be worth the hard work all of the Lucene developers have already put into them, in order for any of the effort needed to move towards 4.0 “final” to be worthwhile at all, we need your feedback. We need real life users to download Solr, download Lucene-Core. We need users to try out the Solr 4.0 Tutorial, and review the Lucene Core Javadocs. We need existing Solr users to review the Solr CHANGES.txt and upgrade their installations. We need existing Lucene-Core users to review MIGRATE.txt and upgrade their applications.

And when users like you have tried all of these various things, we need you to tell us about your experience. We need you to post questions to the mailing lists if things don’t make sense to you. We need you to submit bug reports for errors you encounter.

We need you, to help us, make Lucene & Solr 4.0 a rock solid release.

Appendix: Release Highlights

Solr 4.0-alpha Release Highlights

The largest set of features goes by the development code-name “Solr Cloud” and involves bringing easy scalability to Solr. See http://wiki.apache.org/solr/SolrCloud for more details.

  • Distributed indexing designed from the ground up for near real-time (NRT) and NoSQL features such as realtime-get, optimistic locking, and durable updates.
  • High availability with no single points of failure.
  • Apache Zookeeper integration for distributed coordination and cluster metadata and configuration storage.
  • Immunity to split-brain issues, thanks to ZooKeeper’s quorum-based consensus protocol (Zab, a Paxos-like algorithm).
  • Updates can be sent to any node in the cluster; they are automatically forwarded to the correct shard and replicated to multiple nodes for redundancy.
  • Queries sent to any node automatically perform a full distributed search across the cluster with load balancing and fail-over.
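The update-forwarding behavior above is easiest to picture as deterministic hash routing: every node can compute the same shard for a given document ID. The sketch below is a simplified, hypothetical stand-in (plain Python, not Solr code); SolrCloud actually maps a hash of the document's unique key onto per-shard hash ranges stored in ZooKeeper.

```python
import hashlib


def route_to_shard(doc_id: str, num_shards: int) -> int:
    """Pick a shard for a document by hashing its unique key.

    A simplified stand-in for SolrCloud routing: any node that computes
    this gets the same answer, so an update sent to the "wrong" node can
    simply be forwarded to the shard it hashes to.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Deterministic: the same document always routes to the same shard.
shard = route_to_shard("doc-42", num_shards=4)
assert shard == route_to_shard("doc-42", num_shards=4)
```

Because the routing function is pure, there is no central "router" node to fail, which is part of how the no-single-point-of-failure property above is achieved.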

Solr 4.0-alpha includes more NoSQL features for those using Solr as a primary data store:

  • Update durability – A transaction log ensures that even uncommitted documents are never lost.
  • Real-time Get – The ability to quickly retrieve the latest version of a document, without the need to commit or open a new searcher.
  • Versioning and Optimistic Locking – combined with real-time get, this allows read-update-write functionality that ensures no conflicting changes were made concurrently by other clients.
  • Atomic updates – the ability to add, remove, change, and increment fields of an existing document without having to send in the complete document again.
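The read-update-write cycle enabled by real-time get plus versioning can be illustrated with a toy in-memory store. This is a conceptual Python sketch, not Solr code (Solr tracks versions in a `_version_` field and rejects updates whose supplied version no longer matches):

```python
class ConflictError(Exception):
    """Raised when a write's expected version is stale."""


class VersionedStore:
    """Toy document store demonstrating optimistic locking."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, doc)
        self._version = 0

    def get(self, doc_id):
        # "Real-time get": returns the latest version, committed or not.
        return self._docs[doc_id]

    def put(self, doc_id, doc, expected_version=None):
        current = self._docs.get(doc_id)
        if expected_version is not None:
            if current is None or current[0] != expected_version:
                raise ConflictError("stale version; re-read and retry")
        self._version += 1
        self._docs[doc_id] = (self._version, doc)
        return self._version


store = VersionedStore()
store.put("doc1", {"qty": 5})
version, doc = store.get("doc1")                       # read
doc = {"qty": doc["qty"] - 1}                          # update
store.put("doc1", doc, expected_version=version)       # write
```

A second client that read the document before this write would still hold the old version number, and its `put` would raise `ConflictError` instead of silently clobbering the newer data; that is the "no conflicting concurrent changes" guarantee described above.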

There are many other features coming in Solr 4, such as:

  • Pivot Faceting – Multi-level or hierarchical faceting where the top constraints for one field are found for each top constraint of a different field.
  • Pseudo-fields – The ability to alias fields, or to add metadata along with returned documents, such as function query values and results of spatial distance calculations.
  • A spell checker implementation that can work directly from the main index instead of creating a sidecar index.
  • Pseudo-Join functionality – The ability to select a set of documents based on their relationship to a second set of documents.
  • Function query enhancements including conditional function queries and relevancy functions.
  • New update processors to facilitate modifying documents prior to indexing.
  • A brand new web admin interface, including support for SolrCloud.
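Of these, pivot faceting is the easiest to picture with a small worked example. The function below is an illustrative Python sketch of the idea only (not Solr's implementation, which computes these counts directly against the index):

```python
from collections import Counter


def pivot_facet(docs, outer_field, inner_field):
    """Two-level facet: for each top value of outer_field, count the
    top values of inner_field among the matching documents."""
    outer = Counter(d[outer_field] for d in docs)
    pivots = []
    for value, count in outer.most_common():
        inner = Counter(d[inner_field] for d in docs if d[outer_field] == value)
        pivots.append((value, count, inner.most_common()))
    return pivots


docs = [
    {"cat": "book", "genre": "scifi"},
    {"cat": "book", "genre": "scifi"},
    {"cat": "book", "genre": "fantasy"},
    {"cat": "dvd",  "genre": "scifi"},
]

result = pivot_facet(docs, "cat", "genre")
# → [("book", 3, [("scifi", 2), ("fantasy", 1)]), ("dvd", 1, [("scifi", 1)])]
```

The same idea extends to deeper hierarchies by recursing on each inner constraint.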

Lucene 4.0-alpha Release Highlights

  • The index formats for terms, postings lists, stored fields, term vectors, etc are pluggable via the Codec api. You can select from the provided implementations or customize the index format with your own Codec to meet your needs.
  • Similarity has been decoupled from the vector space model (TF/IDF). Additional models such as BM25, Divergence from Randomness, Language Models, and Information-based models are provided (see http://www.lucidimagination.com/blog/2011/09/12/flexible-ranking-in-lucene-4).
  • Added support for per-document values (DocValues). DocValues can be used for custom scoring factors (accessible via Similarity), for pre-sorted Sort values, and more.
  • When indexing via multiple threads, each IndexWriter thread now flushes its own segment to disk concurrently, resulting in substantial performance improvements (see http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html).
  • Per-document normalization factors (“norms”) are no longer limited to a single byte. Similarity implementations can use any DocValues type to store norms.
  • Added index statistics such as the number of tokens for a term or field, number of postings for a field, and number of documents with a posting for a field: these support additional scoring models (see http://blog.mikemccandless.com/2012/03/new-index-statistics-in-lucene-40.html).
  • Implemented a new default term dictionary/index (BlockTree) that indexes shared prefixes instead of every n’th term. This is not only more time- and space-efficient, but can also sometimes avoid going to disk at all for terms that do not exist. Alternative term dictionary implementations are provided and pluggable via the Codec api.
  • Indexed terms are no longer UTF-16 char sequences; instead, terms can be any binary value encoded as a byte array. By default, text terms are now encoded as UTF-8 bytes. Sort order of terms is now defined by their binary value, which is identical to UTF-8 sort order.
  • Substantially faster performance when using a Filter during searching.
  • File-system based directories can rate-limit the IO (MB/sec) of merge threads, to reduce IO contention between merging and searching threads.
  • Added a number of alternative Codecs and components for different use-cases.
  • Term offsets can be optionally encoded into the postings lists and can be retrieved per-position.
  • A new AutomatonQuery returns all documents containing any term matching a provided finite-state automaton (see http://www.slideshare.net/otisg/finite-state-queries-in-lucene).
  • FuzzyQuery is 100-200 times faster than in past releases (see http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html).
  • A new spell checker, DirectSpellChecker, finds possible corrections directly against the main search index without requiring a separate index.
  • Various in-memory data structures such as the term dictionary and FieldCache are represented more efficiently with less object overhead (see http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html).
  • All search logic is now required to work per segment; IndexReader was therefore refactored to differentiate between atomic and composite readers (see http://blog.thetaphi.de/2012/02/is-your-indexreader-atomic-major.html).
  • Lucene 4.0 provides a modular API, consolidating components such as Analyzers and Queries that were previously scattered across Lucene core, contrib, and Solr. These modules also include additional functionality such as UIMA analyzer integration and a completely reworked spatial search implementation.
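The claim above that sorting terms by their binary value matches UTF-8 sort order can be sanity-checked outside Lucene. Python compares strings by Unicode code point, and UTF-8 was designed so that byte-wise comparison preserves exactly that order, so the two sorts below always agree:

```python
words = ["zebra", "élan", "apple", "naïve", "日本"]

# Sorting by code point (Python's default string order) and sorting by
# the raw UTF-8 bytes of each word produce the same ordering.
by_codepoint = sorted(words)
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

assert by_codepoint == by_utf8_bytes
```

This order-preserving property of UTF-8 is what lets Lucene 4.0 define term order purely by byte comparison while remaining consistent for encoded text terms.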

Please read CHANGES.txt and MIGRATE.txt for a full list of new features and notes on upgrading. In particular, the new APIs are not compatible with previous versions of Lucene; however, file format backwards compatibility is provided for indexes from the 3.0 series.

