Lucene & Solr 4 in Practice (1)

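Introduction: Over the holiday I went back through the posts I had written on my Sina blog, tidied them up, and moved them here. Solr & Lucene 4.0 is good, very good, and very powerful. For anyone who has followed the projects ever since Lucene 2.0 and Solr 0.9, the 4.x series changed a great deal in architecture, style, and API. More importantly, it opens up many more hooks for application-level tuning and demands more specialized knowledge. The framework has grown in capacity, in inclusiveness, and in how well it suits information-retrieval research. Getting the demo to run is easy; going deep is hard. There are simply a lot of topics to sort out and understand.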

Practice (1): Notes on selected features of the solr-core pom and schema

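Note: the settings in the pom must be completely correct, and every project it depends on must be correct as well, before the following can be run:
build.bat > mvnlog.txt  //mvn
The console output is written directly to the file mvnlog.txt.
The versions of the jars that a child project's pom depends on are looked up in the parent pom; apart from the version, all other dependency information must stay consistent between the two.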

Version numbers are a fairly strong constraint: for example, only Jetty server 8 works; 6, 7, and 9 do not.
   <!--new addby -->

 <!-- Solr Cloud zkcli needs this jar -->

         <optional>true</optional><!-- Only used for tests and one command-line utility: JettySolrRunner -->
         <!-- <version>9.0.0.M4</version> -->
         <!-- <version>7.0.0.RC4</version> -->

         <optional>true</optional><!-- Only used for tests and one command-line utility: JettySolrRunner -->
         <!-- <version>9.0.0.M4</version> -->
         <!--  <version>7.0.0.RC4</version> -->

         <optional>true</optional><!-- Only used for tests and one command-line utility: JettySolrRunner -->
         <!-- <version>9.0.0.M4</version> -->
         <!-- <version>7.0.0.RC4</version> -->

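schema version
Topics: version information and upgrading from older versions, use of the bytes (binary) type, use of the boolean type, int / pint / tint and their range-query and sorting behaviour, geospatial data structures, tokenization support for all languages, and use of the random field.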

This schema includes many optional features and should not be used for benchmarking.
To improve performance one could:
- set stored="false" for all fields possible (esp large fields) when you only need to search on the field
  but don't need to return the original value.
- set indexed="false" if you don't need to search on the field, but only
  return the field as a result of searching on other indexed fields.
- remove all unneeded copyField statements
- for best index size and searching performance, set indexed="false" for all general text fields,
  use copyField to copy them to the catchall "text" field, and use that for searching.
- For maximum indexing performance, use the StreamingUpdateSolrServer Java client (named ConcurrentUpdateSolrServer in SolrJ 4.x).
- Remember to run the JVM in server mode, and use a higher logging level
  that avoids logging every request

  The attribute "name" is the name of this schema and is only used for display purposes.
     version="x.y" is Solr's version number for the schema syntax and
     semantics.  It should not normally be changed by applications.
     1.0: multiValued attribute did not exist, all fields are multiValued
          by nature
     1.1: multiValued attribute introduced, false by default
     1.2: omitTermFreqAndPositions attribute introduced, true by default
          except for text fields.
     1.3: removed optional field compress feature
     1.4: autoGeneratePhraseQueries attribute introduced to drive QueryParser
          behavior when a single string produces multiple tokens.  Defaults
          to off for version >= 1.4
     1.5: omitNorms defaults to true for primitive field types
          (int, float, boolean, string...)
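   2. sortMissingLast / sortMissingFirst
    sortMissingLast = true: documents that have no value in the field sort after documents that do, whether the sort is ascending or descending.
    sortMissingFirst = true: documents that have no value in the field sort before documents that do, whether the sort is ascending or descending.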
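As a rough illustration of what these two attributes control, the sketch below mimics the ordering in plain Python. It is not Solr code: `sort_docs` and the sample documents are hypothetical, and the sketch only assumes that documents lacking the field are placed last (or first) regardless of sort direction.

```python
def sort_docs(docs, field, descending=False, missing="last"):
    """Sort dicts by `field`; docs lacking the field go first or last
    regardless of sort direction (hypothetical helper, not a Solr API)."""
    present = [d for d in docs if d.get(field) is not None]
    absent = [d for d in docs if d.get(field) is None]
    present.sort(key=lambda d: d[field], reverse=descending)
    return present + absent if missing == "last" else absent + present


docs = [
    {"id": "a", "price": 30},
    {"id": "b"},               # no price value
    {"id": "c", "price": 10},
]

asc_last = [d["id"] for d in sort_docs(docs, "price", missing="last")]
desc_last = [d["id"] for d in sort_docs(docs, "price", descending=True, missing="last")]
asc_first = [d["id"] for d in sort_docs(docs, "price", missing="first")]
```

In both `asc_last` and `desc_last` the document without a price ends up at the end, which is the point of the "missing last" setting.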
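    3. positionIncrementGap=100
Only meaningful for a fieldType used with multiValued="true" fields: it sets the position gap inserted between successive values, so that phrase queries do not match across value boundaries.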
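The effect of the gap can be sketched with a toy position model. This is a simplified illustration, not Lucene's implementation: `token_positions` and `phrase_matches` are hypothetical helpers, tokens are split on whitespace, and a phrase is treated as two terms whose positions differ by at most 1 + slop.

```python
def token_positions(values, gap):
    """Assign a position to every token of a multi-valued field.
    Between two values the position jumps by an extra `gap`."""
    positions = []
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # extra distance between consecutive values
        for token in value.lower().split():
            pos += 1
            positions.append((token, pos))
    return positions


def phrase_matches(positions, first, second, slop=0):
    """True if `second` occurs after `first` within 1 + slop positions."""
    firsts = [p for t, p in positions if t == first]
    seconds = [p for t, p in positions if t == second]
    return any(0 < s - f <= 1 + slop for f in firsts for s in seconds)


authors = ["John Smith", "Mary Jones"]   # two values of one multi-valued field
no_gap = token_positions(authors, gap=0)
with_gap = token_positions(authors, gap=100)

crosses_without_gap = phrase_matches(no_gap, "smith", "mary")
crosses_with_gap = phrase_matches(with_gap, "smith", "mary")
```

With a gap of 0 the phrase "smith mary" matches even though the two words come from different values; with a gap of 100 it no longer does, while a phrase inside one value still matches.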
     Default numeric field types. For faster range queries, consider the tint/tfloat/tlong/tdouble types.
   <fieldType name="int"    class="solr.TrieIntField"    precisionStep="0" positionIncrementGap="0"/>
   <fieldType name="float"  class="solr.TrieFloatField"  precisionStep="0" positionIncrementGap="0"/>
   <fieldType name="long"   class="solr.TrieLongField"   precisionStep="0" positionIncrementGap="0"/>
   <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>

    Numeric field types that index each value at various levels of precision
    to accelerate range queries when the number of values between the range
    endpoints is large. See the javadoc for NumericRangeQuery for internal
    implementation details.

    Smaller precisionStep values (specified in bits) will lead to more tokens
    indexed per value, slightly larger index size, and faster range queries.
    A precisionStep of 0 disables indexing at different precision levels.
   <fieldType name="tint"    class="solr.TrieIntField"    precisionStep="8" positionIncrementGap="0"/>
   <fieldType name="tfloat"  class="solr.TrieFloatField"  precisionStep="8" positionIncrementGap="0"/>
   <fieldType name="tlong"   class="solr.TrieLongField"   precisionStep="8" positionIncrementGap="0"/>
   <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>
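To make the trade-off between precisionStep and index size concrete, here is a simplified model of which terms are indexed for a single value. It is only a sketch: `trie_terms` is a hypothetical helper, it handles non-negative integers only, and it ignores the sign handling and byte encoding that the real implementation performs.

```python
def trie_terms(value, bits=32, precision_step=8):
    """Return the (shift, prefix) pairs indexed for one non-negative value
    in a simplified model of trie encoding."""
    if precision_step <= 0 or precision_step >= bits:
        # Model of precisionStep="0": only the full-precision term is indexed.
        return [(0, value)]
    # One term per precision level: the value with its lowest `shift` bits dropped.
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]


int_terms = trie_terms(1000, bits=32, precision_step=8)    # like "tint"
long_terms = trie_terms(1000, bits=64, precision_step=8)   # like "tlong"
plain_terms = trie_terms(1000, bits=32, precision_step=0)  # like "int"
finer_terms = trie_terms(1000, bits=32, precision_step=4)  # smaller step, more terms
```

In this model a 32-bit value with precisionStep 8 produces four terms and a 64-bit value eight, while precisionStep 0 produces one. A range query can then cover large stretches of the range with a few low-precision terms instead of enumerating every full-precision value.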
   <!-- Binary data type. The data should be sent/retrieved as Base64 encoded Strings -->
   <fieldType name="binary" class="solr.BinaryField"/>
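On the client side, this means encoding the bytes before sending them and decoding the returned string. A minimal sketch using only the Python standard library follows; the payload is arbitrary example data, and how the string is placed into a document depends on the client in use.

```python
import base64

raw = bytes([0x00, 0xFF, 0x10, 0x20])             # arbitrary binary payload
encoded = base64.b64encode(raw).decode("ascii")   # text form to put in the document
decoded = base64.b64decode(encoded)               # what a client does with the returned value
```

The round trip is lossless, so `decoded` equals `raw`.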
     The plain numeric and date field types below should only be used for compatibility with existing indexes (created with Lucene or older Solr versions).
     Use Trie based fields instead. As of Solr 3.5 and 4.x, Trie based fields support sortMissingFirst/Last.

     Plain numeric field types that store and index the text
     value verbatim (and hence don't correctly support range queries, since the
     lexicographic ordering isn't equal to the numeric ordering)
   <fieldType name="pint"    class="solr.IntField"/>
   <fieldType name="plong"   class="solr.LongField"/>
   <fieldType name="pfloat"  class="solr.FloatField"/>
   <fieldType name="pdouble" class="solr.DoubleField"/>
   <fieldType name="pdate"   class="solr.DateField" sortMissingLast="true"/>
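Why verbatim text terms break range queries can be seen with a few numbers. The sketch below is plain Python, not Solr code; `text_range` is a hypothetical helper that compares terms as strings, the way an index of verbatim text values would.

```python
values = [9, 10, 100, 25]

numeric_order = sorted(values)
lexicographic_order = sorted(str(v) for v in values)


def text_range(terms, low, high):
    """Range check done on strings, as an index of verbatim text terms would."""
    return [t for t in terms if low <= t <= high]


# Intended meaning: values between 10 and 30.
in_range = text_range([str(v) for v in values], "10", "30")
```

Compared as strings, "100" falls between "10" and "30", so the range wrongly includes it, and "9" sorts after "25".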
   <!-- The "RandomSortField" is not used to store or search any
        data.  You can declare fields of this type in your schema
        to generate pseudo-random orderings of your docs for sorting
        or function purposes.  The ordering is generated based on the field
        name and the version of the index. As long as the index version
        remains unchanged, and the same field name is reused,
        the ordering of the docs will be consistent.
        If you want different pseudo-random orderings of documents,
        for the same version of the index, use a dynamicField and
        change the field name in the request.
   -->
   <fieldType name="random" class="solr.RandomSortField" indexed="true" />
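The property described in that comment (same field name and same index version give the same order, a different field name gives a different order) can be imitated with a seeded shuffle. This is only an analogy in plain Python: `random_order`, the field names, and the seed formula are assumptions made for illustration, not how Solr derives its ordering.

```python
import random
import zlib


def random_order(doc_ids, field_name, index_version):
    """Shuffle doc ids with a seed derived from field name and index version
    (hypothetical helper; the seed formula is an assumption for illustration)."""
    seed = zlib.crc32(f"{field_name}:{index_version}".encode("utf-8"))
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled


docs = list(range(20))
first = random_order(docs, "random_1234", index_version=7)
again = random_order(docs, "random_1234", index_version=7)   # same name, same version
other = random_order(docs, "random_5678", index_version=7)   # different field name
```

Here `first` and `again` are identical, while `other` is a different permutation of the same documents, which is why changing the dynamic field name in the request yields a fresh ordering.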
    <!-- Arabic -->
    <!-- Bulgarian -->
    <!-- Catalan -->
    <!-- CJK bigram -->
    <!-- Czech -->
    <!-- Danish -->
    <!-- German -->
    <!-- Greek -->
    <!-- Spanish -->
    <!-- Basque -->
    <!-- Persian -->
    <!-- Finnish -->
    <!-- French -->
    <!-- Irish -->
    <!-- Galician -->
    <!-- Hindi -->
    <!-- Hungarian -->
    <!-- Armenian -->
    <!-- Indonesian -->
    <!-- Italian -->
    <!-- Latvian -->
    <!-- Dutch -->
    <!-- Norwegian -->
    <!-- Portuguese -->
    <!-- Romanian -->
    <!-- Russian -->
    <!-- Swedish -->
    <!-- Thai -->
    <!-- Turkish -->

