Considerations when migrating from Lucene or Solr to Azure Search

Last updated Thursday, May 17, 2018 in Sitecore Experience Platform for Administrator, Developer
Keywords: Azure, Cloud, Migration

The Sitecore Azure Search provider integrates the Sitecore Search engine with the Microsoft Azure Search service.

Azure Search is a Solr replacement in distributed installations for on-premise, Azure IaaS, and Azure PaaS solutions. Although Azure Search supports the Lucene query syntax, its behavior is different. Therefore, migrating a solution might require additional actions.

This topic describes the limitations of Azure Search and what to review when you migrate.

Limitations

The Azure Search service has a few limitations that are not present in Solr or Lucene. Therefore, ensure you are familiar with the topics in the following sections.

Functional differences in configuration

There are a few differences in functionality between Lucene, Solr, and Sitecore Azure Search; for example, unlike Lucene or Solr, Azure Search requires a defined schema for all indices. The Sitecore Azure Search provider automatically creates these schemas while indexing. Unlike Solr, you do not need to create a schema manually.

If the configuration for a field is not present, it is resolved by the incoming field values. Therefore, to ensure predictable behavior, define the fields that you are indexing in the configuration.

You can configure fields in two ways:

  • <fieldNames hint="raw:AddFieldByFieldName">
  • <fieldTypes hint="raw:AddFieldByFieldTypeName">

The field node in both cases can include the following cloud-specific attributes:

  • cloudFieldName - this defines the name used in the stored document. You can only define this attribute for the field node in the fieldNames section.
  • searchable, retrievable, facetable, filterable, sortable - these attributes instruct the Azure Search service how to handle the field.
  • cloudAnalyzer - the type of analyzer to apply. There are currently only two predefined analyzers supported:
    • lowercase_keyword - the same as the Lucene analyzer.
    • language - the analyzer for culture-specific data.

For example:

<field fieldName="_fullpath"            cloudFieldName="fullpath_1"         searchable="YES"  retrievable="YES"  facetable="YES"  filterable="YES"  sortable="YES"  boost="1f" type="System.String"   settingType="Sitecore.ContentSearch.Azure.CloudSearchFieldConfiguration, Sitecore.ContentSearch.Azure" cloudAnalyzer="lowercase_keyword" />

Fields with the same name and type

Lucene groups fields that have the same name in a document and effectively stores them in an array. However, Solr and Azure Search store fields that have the same name and type only once (skipping the duplicates and saving only one). We recommend that you avoid having fields with the same name in an item or document.
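
One way to catch repeated field names before they reach the index is a small check over the names you plan to index. The following is a minimal sketch only: FieldNameCheck and FindDuplicates are hypothetical names, not part of the Sitecore API, and comparing names case-insensitively is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Sitecore.
public static class FieldNameCheck
{
    // Returns each field name that occurs more than once.
    // Assumption: names are compared case-insensitively.
    public static IReadOnlyList<string> FindDuplicates(IEnumerable<string> fieldNames) =>
        fieldNames
            .GroupBy(name => name, StringComparer.OrdinalIgnoreCase)
            .Where(group => group.Count() > 1)
            .Select(group => group.Key)
            .ToList();
}
```

For example, passing the names title, body, and Title reports title as a duplicate, which signals that only one of the two values would be stored.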

Fields limitation in the index

Azure Search is limited to 1,000 fields per index, which applies to all available service tiers. The Sitecore Azure Search component includes an exclude field list for content indexes that you can extend with specific items, for example:

  • Sitecore.ContentSearch.Azure.Index.Master.ExcludeFields.config
  • Sitecore.ContentSearch.Azure.Index.Web.ExcludeFields.config

You can also move solution-related content from the master/web indices into a dedicated one.

Queries and the behavior of predicates

When using queries and predicates with Azure Search, consider the following:

  • The Filter and Where predicates transform into the same query strings. Currently, there is no way to force the search (Lucene) or filter (OData) queries.
  • Filters against values containing phrase tokens can return more documents than expected. For example, suppose that an index has a Language field that contains en, en-us, en-au, or en-gb, depending on the document. The following query returns the documents for all four values, not only those for en:

    queryable.Filter(item => item.Language.Equals("en"))

  • For language-specific queries, you can use the field parsedlanguage instead of language, for example:

    queryable.Filter(item => item.ParsedLanguage.Equals("japanese_japan"))

  • Currently, boosting is supported only at query time. If you define boosting in the configuration, you must move it into the query.
  • The predicates StartsWith, EndsWith, and Contains can return more records than expected for conditions with multiple words, because the conditions are translated into regex statements. For example, if three documents contain the text "apple", "pineapple", and "pineapple is not actually an apple" respectively, then a query with the condition Text.EndsWith("an apple") returns all three documents.
  • Fuzzy query semantics are different in Azure Search, for example:

    When you use like as a query for a pattern or similarity, Azure Search interprets the similarity parameter as the Damerau-Levenshtein distance, with a value between 0 and 2. This differs from Lucene in Sitecore, which implements the similarity parameter by using the BM25 similarity.
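
To get a feel for what a distance of 0, 1, or 2 means in practice, it can help to compute the edit distance between two terms yourself. The following is a minimal sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment), written with the standard library only. EditDistance and Compute are hypothetical names, and the sketch is not the implementation that Azure Search uses internally.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class EditDistance
{
    // Restricted Damerau-Levenshtein (optimal string alignment) distance:
    // counts insertions, deletions, substitutions, and swaps of two adjacent characters.
    public static int Compute(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // delete all of a
        for (int j = 0; j <= b.Length; j++) d[0, j] = j; // insert all of b

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int best = Math.Min(
                    Math.Min(d[i - 1, j] + 1,   // deletion
                             d[i, j - 1] + 1),  // insertion
                    d[i - 1, j - 1] + cost);    // substitution or match

                // Swap of two adjacent characters counts as a single edit.
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                {
                    best = Math.Min(best, d[i - 2, j - 2] + 1);
                }

                d[i, j] = best;
            }
        }

        return d[a.Length, b.Length];
    }
}
```

For example, sitecore and sitecroe differ by one adjacent swap, so their distance is 1, while apple and pineapple are four insertions apart and would fall outside the maximum distance of 2.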

Paging support

From Sitecore 8.2-Update-7 and Sitecore 9.0 Update-2, a single search query returns 1,000 documents by default. If the query finds more documents, then CloudSearchResults automatically iterates through all of them.

You can limit the number of documents that are returned by a single request to 50 by setting ContentSearch.Azure.LimitSearchResultsPerRequest to true or by implementing your own iterator with the Top and Skip LINQ extensions.

Note

The maximum value supported by Skip is 100,000.
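
If you implement your own iterator, the loop could look roughly like the following sketch. It uses only the standard LINQ Skip and Take operators over an IQueryable<T>. BatchedQuery and ReadInBatches are hypothetical names, and the sketch assumes that the query has a stable sort order so that pages do not overlap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only.
public static class BatchedQuery
{
    // Maximum Skip value, as stated in the note above.
    public const int MaxSkip = 100000;

    // Lazily yields results page by page using Skip and Take.
    public static IEnumerable<T> ReadInBatches<T>(IQueryable<T> query, int batchSize)
    {
        // Because this is an iterator method, the check runs when enumeration starts.
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        for (int skip = 0; skip <= MaxSkip; skip += batchSize)
        {
            // Each iteration issues one request for a single page.
            List<T> page = query.Skip(skip).Take(batchSize).ToList();

            foreach (T item in page)
            {
                yield return item;
            }

            // A short page means there are no more results.
            if (page.Count < batchSize) yield break;
        }
    }
}
```

For example, BatchedQuery.ReadInBatches(queryable, 500) enumerates the results in pages of 500 and stops when a page comes back short or when the next Skip value would exceed 100,000.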

Language support

To search with a language-specific context, you must use a corresponding language analyzer during indexing. For fields that you want to index with a language context, you must set the cloudAnalyzer="language" attribute during configuration. The list of supported languages is limited to the languages for which Azure Search provides a language analyzer.

Azure Search automatically picks up specific language analyzers during indexing by using the configuration defined in the cloudCultureBasedAnalyzerConfiguration section of the Sitecore.ContentSearch.Azure.DefaultIndexConfiguration.config file.

When migrating from Lucene or Solr to Azure Search, ensure you:

  • Add the necessary field definitions to the search index configurations in Sitecore for all related Sitecore instances, for example, Content Management and Content Delivery.

    Note

    The Azure Search configuration provides reasonable defaults for supported .NET types and built-in Sitecore field types.

  • Review the fields that are stored by default to avoid reaching the 1,000 fields limit.
  • Review your search queries and verify the behavior of queries that use StartsWith, EndsWith, and Contains with multi-word conditions. You can also consider rewriting the queries so that the returned results are refined on the client side.

    For example, to pre-filter results on the service side, change the following query:

    var results = from item in index where item.Field.StartsWith("sitecore example") select item;

    to:

    var intermediateResults = from item in index.Take(100) where item.Field.Contains("sitecore example") select item;

    Then, to return precise results, filter the intermediate results in memory on the client side:

    var results = intermediateResults.AsEnumerable().Where(item => item.Field.StartsWith("sitecore example"));

  • Avoid using StartsWith to match the FullPath prefix when you want to return the descendants of an item. Instead use the built-in Path field and item ID, for example:

    var descendants = from item in index where item.Path == rootItem.ID select item;

  • Review queries that use fuzzy search and replace the similarity ratio (a floating-point number between 0 and 1) with the Damerau-Levenshtein distance (an integer between 0 and 2).
  • Review queries that iterate over large numbers of results (for example, when using the List Manager API). To avoid iterating over more than 100,000 results in a single query, rewrite the queries to partition results using one of the search fields.
  • If your search page supports a language selection of its results, review queries to ensure that precise results are returned for regional dialects, for example, en-US or en-GB.
  • If you must routinely search over a large set of domain-specific items, such as news articles, products, or events, consider moving indexing and search into a dedicated index and define a precise index schema in the Sitecore configuration. Use ExcludeTemplate and ExcludeTemplateField to control which items and fields are included in the index. This helps keep the number of fields in the index under 1,000.
  • If you use index-time boosting on the fields, and want to obtain the equivalent behavior using Azure Search, consider rewriting the queries so they include boost weights.
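
When you replace a similarity ratio with an edit distance in fuzzy queries, there is no exact equivalent, so you need some rule of thumb. The following sketch shows one possible heuristic: allow roughly (1 - similarity) × term length edits, capped at the maximum distance of 2. This rule is an assumption for illustration, not a conversion defined by Sitecore or Azure Search, and FuzzyMigration and ToEditDistance are hypothetical names.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class FuzzyMigration
{
    // Largest distance accepted for fuzzy queries, as described in this topic.
    public const int MaxDistance = 2;

    // Heuristic (assumption): a lower similarity ratio allows more edits,
    // scaled by the length of the search term and capped at MaxDistance.
    public static int ToEditDistance(float similarity, int termLength)
    {
        if (similarity < 0f || similarity > 1f)
            throw new ArgumentOutOfRangeException(nameof(similarity));
        if (termLength < 0)
            throw new ArgumentOutOfRangeException(nameof(termLength));

        int edits = (int)((1.0 - similarity) * termLength);
        return Math.Min(edits, MaxDistance);
    }
}
```

With this heuristic, a ratio of 0.8 on an eight-character term maps to a distance of 1, and a ratio of 0.5 on the same term is capped at 2. Verify the chosen distances against your own content, because the two models do not rank matches in the same way.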