This Solr tutorial explains the basics of Search and how to implement them using Apache Solr. The examples of this Solr tutorial are based on Solr 6.1.
By the end of this Solr tutorial, you will have a working Solr instance with a concrete example, and everything you need to start exploring the advanced features of Solr.
How Do Solr and Search Work?
Solr is an open source solution for building Search Engines. Before talking about Solr, let’s first understand how search works.
Search is a process of retrieving, from a collection of documents, those that are relevant to a user’s information need.
A user may ask for something that is available in a document, but express it in a different vocabulary.
The implementation of this process requires a specification of an Information Retrieval model incorporating:
- indexing of the documents: The documents can be in several formats, and can have several types of content, several vocabularies, etc. In order to make them searchable, the documents need to be transformed into a format that can be “understood” by the machine. This process is called indexing or indexation.
- querying: the user’s information need can be expressed in different ways (keywords, navigation, image, etc.). In order to “understand” what the user needs, certain processing needs to be applied to the query.
- matching between the document and the user query: a document containing the word “car” should match a query containing the word “auto”. This type of matching should be defined by the search engine.
- ranking of the selected result set: the returned result set can be very large, and hence difficult for the user to scan. A ranking function is needed to sort the result set so that the most relevant documents appear at the top.
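The four steps above can be illustrated with a toy sketch in Python. This is not how Solr works internally; it is a deliberately small model, and the synonym table, the sample documents, and the scoring rule (count of matched query terms) are all assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical synonym table for this sketch; Solr reads synonyms from synonyms.txt.
SYNONYMS = {"auto": {"car"}, "automobile": {"car"}}


def tokenize(text):
    """Shared processing step: lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_index(documents):
    """Indexing: map each term to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Matching and ranking: score each document by the number of matched query terms."""
    scores = defaultdict(int)
    for term in tokenize(query):
        # Expand the query term with its synonyms before matching.
        doc_ids = set()
        for candidate in {term} | SYNONYMS.get(term, set()):
            doc_ids |= index.get(candidate, set())
        for doc_id in doc_ids:
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {1: "brand new car", 2: "not afraid of the snow", 3: "new car for the snow"}
index = build_index(docs)
print(search(index, "auto snow"))  # [3, 1, 2]
```

Document 3 comes first because it matches both query terms: “snow” directly, and “auto” through the synonym “car”.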
The search process is described in the following illustration.
Apache Solr is a great solution to implement a Search Engine. In this Solr tutorial, you will learn how to process and index documents, how to formulate and interpret a user query, and how to retrieve and rank documents.
To get started, please download the latest Solr release from the Apache download mirrors. Java 8 or greater is required. A JDK from Oracle is preferred since it is the best tested with Solr. On the download page under “Java SE Downloads”, select the latest “Java Platform (JDK)” and install it.
The examples presented in this Solr Tutorial are based on Solr 6.1.0. Once you uncompress the downloaded file, you’ll get a directory named solr-6.1.0, with the following structure:
The bin directory has the scripts that we will use to start and stop Solr.
The exampledocs directory contains the documents we want to index, as well as the script we will use to communicate with Solr.
To launch Solr, type the following in your terminal: $ bin/solr start
If everything goes well, you should have access to the Solr Admin UI through the following URL: http://localhost:8983/solr. The Admin UI looks as follows:
Solr is now up and running. Solr ships with a few working examples, but for this Solr tutorial, we are going to start from scratch and create a new example. First, let’s create a document collection called “cars”:
$ bin/solr create -c cars
In Solr vocabulary, we call this document collection a “core”. It’s an instance where we can add documents and search them. We can have as many cores as we want.
You can also create a Solr core through the Admin UI under section Core Admin in the menu on your left-hand side.
If everything goes well, your core is now created and stored under solr-6.1.0/server/solr/cars.
You can access the core “cars” through the Admin UI using the Core Selector drop-down menu.
If you want to stop Solr for whatever reason, here’s the syntax: $ bin/solr stop -all
Solr is very easy to configure. It can be done through the Admin UI or directly through files. The most important files to know are the indexing schema file (managed-schema) and the Solr configuration file (solrconfig.xml). Both of them are located under server/solr/cars/conf/
The schema file is accessible under section Schema in the Admin UI. Under section Files, you can access the Solr configuration file, as well as other files relevant to the configuration.
Solr Indexing Schema
This file is used to tell Solr how to index the documents that we want to make available for search. It’s a powerful way to make sure that Solr is indexing your content in the right way. This file contains the document fields definition and the types of these fields.
The schema also contains the field “id”, which is the unique key for each document. The “id” field is already pre-defined in every schema (<uniqueKey>id</uniqueKey>).
What’s interesting about Solr is that you can define your own types to tell Solr how the fields should be processed and indexed.
Solr supports two types of fields: dynamic fields and “static” fields.
There are a number of options for defining new fields:
- Edit the schema file to define the fields. This can be done through the Admin UI or by editing the file directly.
- Use the Schema API to define new fields.
- Use dynamicFields, a form of convention-over-configuration that maps field names to field types based on patterns in the field name. For example, every field ending in “_i” is taken to be an integer.
- Use “schemaless” mode, where field types are auto-detected (guessed) based on the first value seen for that field
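As an illustration of the Schema API option, the sketch below builds (but does not send) the HTTP request that would add a field to the “cars” core. It assumes Solr is reachable at http://localhost:8983/solr; the helper name add_field_request is made up for this example.

```python
import json
from urllib.request import Request

# Assumed address of a local Solr instance.
BASE_URL = "http://localhost:8983/solr"


def add_field_request(core, name, field_type, indexed=True, stored=True):
    """Build a Schema API request that adds one field to the given core."""
    payload = {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": stored,
        }
    }
    return Request(
        url=f"{BASE_URL}/{core}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )


request = add_field_request("cars", "make", "string")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# With Solr running, the request could be sent with urllib.request.urlopen(request).
```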
Dynamic fields include the essential benefit of schemaless mode, namely the ability to add new fields on the fly without having to pre-define them.
Our schema has some common dynamicField patterns defined for use:
| Field Suffix | Multivalued Suffix | Type | Description |
|---|---|---|---|
| _t | _txt | text_general | Indexed for full-text search so individual words or phrases may be matched. |
| _s | _ss | string | A string value is indexed as a single unit. This is good for sorting, faceting, and analytics. It’s not good for full-text search. |
| _i | _is | int | A 32-bit signed integer |
| _l | _ls | long | A 64-bit signed long |
| _f | _fs | float | IEEE 32-bit floating point number (single precision) |
| _d | _ds | double | IEEE 64-bit floating point number (double precision) |
| _b | _bs | boolean | true or false |
| _dt | _dts | date | A date in Solr’s date format |
| _p | n/a | location | A latitude and longitude pair for geo-spatial search |
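To make the convention concrete, here is a small Python sketch that resolves a field name to its type using the suffixes from the table. It only mimics the naming convention; Solr performs this matching itself through the dynamicField patterns in the schema, and the function name resolve_dynamic_field is invented for this example.

```python
# Suffix -> (field type, multivalued), copied from the table above.
DYNAMIC_SUFFIXES = {
    "_t": ("text_general", False), "_txt": ("text_general", True),
    "_s": ("string", False), "_ss": ("string", True),
    "_i": ("int", False), "_is": ("int", True),
    "_l": ("long", False), "_ls": ("long", True),
    "_f": ("float", False), "_fs": ("float", True),
    "_d": ("double", False), "_ds": ("double", True),
    "_b": ("boolean", False), "_bs": ("boolean", True),
    "_dt": ("date", False), "_dts": ("date", True),
    "_p": ("location", False),
}


def resolve_dynamic_field(name):
    """Return (type, multivalued) for a field name, or None if no suffix matches."""
    _, separator, suffix = name.rpartition("_")
    if not separator:
        return None  # no underscore, so no dynamic field pattern applies
    return DYNAMIC_SUFFIXES.get("_" + suffix)


print(resolve_dynamic_field("make_s"))    # ('string', False)
print(resolve_dynamic_field("tags_ss"))   # ('string', True)
print(resolve_dynamic_field("make"))      # None
```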
Field types let you specify explicitly the parsing and analysis that should be applied to the content. As described above, there are two phases during the search process. For each phase, you can define specific types:
- indexing phase: specify the way you want your documents to be analyzed and indexed.
- querying phase: specify the way you want the user query to be parsed and analyzed
There are three categories of types:
- predefined: int, float, string, date, boolean, etc.
- defined by Solr: text, phonetic, location, etc.
- customized: you build them by yourself to satisfy specific needs
An analyzer is built from two categories of components:
- Tokenizers: define how to split the text, e.g., using punctuation, spaces, etc.
- Filters: define how you want to process your text, e.g., removing stop words, stemming, lemmatization, expanding keywords with their synonyms, protecting some specific keywords, etc.
Here’s an example of the predefined type “text_general.” In this example, we have two analyzers:
- Indexing analyzer (<analyzer type="index">): applied to the document fields at indexing time. It has a standard tokenizer, a stopword filter, and a lower case filter.
- Querying analyzer (<analyzer type="query">): applied at querying time. It has a standard tokenizer, a stopword filter, a synonym filter, and a lower case filter.
stopwords.txt and synonyms.txt are two text files that contain, respectively, the stopwords (e.g., a, the, an, etc.) that should be removed from the text and the synonym relationships (e.g., automobile, auto, car) that should be applied for text expansion. These two files are shipped with Solr and can be modified based on your needs.
In this example, synonyms are applied only at query time, so there is no need to also apply synonym expansion during the indexing phase.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
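The effect of these two analyzer chains can be approximated in a few lines of Python. This is only a rough imitation for illustration: the real standard tokenizer follows Unicode text segmentation rules, the real synonym filter keeps synonyms at the same token position rather than appending them, and the stopword and synonym lists below are assumed samples, not the contents of the shipped files.

```python
import re

# Assumed sample contents; the real lists live in stopwords.txt and synonyms.txt.
STOPWORDS = {"a", "an", "the"}
SYNONYM_GROUPS = [{"automobile", "auto", "car"}]


def standard_tokenize(text):
    """Rough stand-in for the standard tokenizer: keep runs of word characters."""
    return re.findall(r"\w+", text)


def stop_filter(tokens):
    """Drop stopwords, ignoring case (ignoreCase="true")."""
    return [token for token in tokens if token.lower() not in STOPWORDS]


def synonym_filter(tokens):
    """Replace each token with every member of its synonym group (expand="true")."""
    expanded = []
    for token in tokens:
        group = next((g for g in SYNONYM_GROUPS if token.lower() in g), None)
        expanded.extend(sorted(group) if group else [token])
    return expanded


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


def analyze_index(text):
    """Index-time chain: tokenizer, stop filter, lower case filter."""
    return lowercase_filter(stop_filter(standard_tokenize(text)))


def analyze_query(text):
    """Query-time chain: tokenizer, stop filter, synonym filter, lower case filter."""
    return lowercase_filter(synonym_filter(stop_filter(standard_tokenize(text))))


print(analyze_index("The Brand new Car"))  # ['brand', 'new', 'car']
print(analyze_query("an Auto"))            # ['auto', 'automobile', 'car']
```

Because the query “an Auto” expands to include “car”, it shares a term with the indexed text “The Brand new Car”, which is what makes the two match.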
The following illustration describes how Solr uses the types to process documents and queries.
Once the types are defined, we can use them to create the fields of the documents that are going to be indexed. A field definition specifies how the field must be indexed and searched by Solr.
The syntax with the basic attributes is as follows:
<field name="…" type="…" indexed="true|false" stored="true|false" required="true|false" … />
- name: the name of the field within the index. This name should be unique within a schema
- type: one of the types that are defined as described in the above section
- indexed (true|false): specify whether the field is searchable or not. In some cases, you might have sections of your documents that you don’t want to include in the search process. In such cases, you set indexed to false.
- stored (true|false): specify whether a field should be stored so that it can be returned in the result set.
- required (true|false): specify whether a field must exist in the document for the document to be indexed.
Let’s take a basic example describing cars. A car has a make, a model, a description, a colour and a price. To index the documents describing cars, we could define the following fields within the schema:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="make" type="string" indexed="true" stored="true" required="true" />
<field name="model" type="string" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true" />
<field name="colour" type="string" indexed="true" stored="true"/>
<field name="price" type="tint" indexed="true" stored="true"/>
As described above, you can also use dynamic fields. Instead of explicitly defining the field names in the schema, you can use a field suffix in the documents that you want to index. For instance, you don’t create the field “make” in the schema; instead, you use “make_s” in the document. For the document fields in this Solr tutorial, we have chosen explicit configuration (the fields defined above) over the dynamic field convention.
The above XML section should be added to the schema file (managed-schema). You can do it through the Admin UI or by editing the file manually. The managed-schema file is located under solr-6.1.0/server/solr/cars/conf.
The existing schema already has some fields and types. You should always make sure that the fields and types you create don’t conflict with the existing ones.
If you decide to edit the file manually, you should restart the server to make the changes effective (syntax: $ bin/solr restart).
As shown in the following screenshot, the new fields are now available in the schema.
In the Admin UI, you can use the Analysis feature to analyse how a document field value and a query field value are processed by Solr. It will show you visually the application of the analyzers, the output of each step, and the final matching between a query and a document.
Now that the fields and the types are defined, you can start preparing the documents that you want to index. There are three criteria to respect while preparing your documents for indexing:
- the document must use the same field names as defined in the schema. For dynamic fields, you should simply respect the field suffix, as described above
- the values of the document fields must respect the types of these fields
- the fields that are defined as “required” in the schema must be present in the document, otherwise the document won’t be indexed.
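These three criteria can be checked before sending anything to Solr. The sketch below is a hypothetical pre-flight check, not a Solr feature: the SCHEMA dictionary is a simplified, hand-written mirror of the car fields defined earlier, mapped to Python types for the purpose of this example.

```python
# Simplified mirror of the car fields: name -> (expected Python type, required).
SCHEMA = {
    "id": (str, True),
    "make": (str, True),
    "model": (str, False),
    "description": (str, False),
    "colour": (str, False),
    "price": (int, False),
}


def validate(document):
    """Return a list of problems; an empty list means the document looks indexable."""
    problems = []
    for name, value in document.items():
        if name not in SCHEMA:
            problems.append(f"unknown field: {name}")
        elif not isinstance(value, SCHEMA[name][0]):
            problems.append(f"wrong type for {name}: expected {SCHEMA[name][0].__name__}")
    for name, (_, required) in SCHEMA.items():
        if required and name not in document:
            problems.append(f"missing required field: {name}")
    return problems


print(validate({"id": "1", "make": "BMW", "price": 45000}))  # []
print(validate({"id": "2", "price": "cheap"}))
```

The second call reports two problems: the price is not an integer, and the required field “make” is missing.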
You can put several documents within the same file and index all of them at once. Here’s the syntax of the file that can be accepted by Solr:
<add>
  <doc>
    <field name="field 1">value 1</field>
    <field name="field 2">value 2</field>
    ...
    <field name="field n">value n</field>
  </doc>
  <doc>
    ...
  </doc>
  ...
</add>
Here is a basic example of documents describing cars. Please copy the following text into a file and name it cars.xml
<add>
  <doc>
    <field name="id">1</field>
    <field name="make">BMW</field>
    <field name="model">X5</field>
    <field name="description">Brand new car</field>
    <field name="colour">Grey</field>
    <field name="price">45000</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="make">Audi</field>
    <field name="model">A4</field>
    <field name="description">Not afraid of the snow</field>
    <field name="colour">Grey</field>
    <field name="price">40000</field>
  </doc>
</add>
So far, we haven’t indexed any documents. As shown in the following screenshot, Solr has zero documents in the index.
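If your car data lives in a program rather than in a hand-written file, the same XML can be generated. Here is a minimal Python sketch using the standard library; the function name build_add_xml is made up for this example, and the two cars are the ones from the file above.

```python
import xml.etree.ElementTree as ET

cars = [
    {"id": "1", "make": "BMW", "model": "X5",
     "description": "Brand new car", "colour": "Grey", "price": 45000},
    {"id": "2", "make": "Audi", "model": "A4",
     "description": "Not afraid of the snow", "colour": "Grey", "price": 40000},
]


def build_add_xml(documents):
    """Serialize a list of dictionaries into Solr's <add><doc>...</doc></add> format."""
    add = ET.Element("add")
    for document in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in document.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


xml_payload = build_add_xml(cars)
print(xml_payload)
# To save it as a file: open("cars.xml", "w", encoding="utf-8").write(xml_payload)
```

Building the XML through a library rather than string concatenation means that characters such as & or < in a description are escaped automatically.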
You can index your documents by calling the Solr server through an HTTP request or through a graphical UI. For the purpose of this example, we are going to use curl to communicate with the Solr server through HTTP.
In order to communicate with Solr, you need to specify the server where Solr is hosted, the action you want to perform (update, delete, etc.) and the path to the documents that you want to index.
Syntax: curl http://Solr_server_address/solr/core_name/update -H 'Content-type:text/xml' --data-binary @documents.xml
For our example,
curl http://localhost:8983/solr/cars/update -H 'Content-type:text/xml' --data-binary @cars.xml
Make sure to specify the right path for the file containing the documents that you want to index.
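If you would rather send the documents from a program than from curl, the same request can be prepared with Python’s standard library. The sketch below only builds the request so that it can be inspected without a running server; the local address and the helper name build_update_request are assumptions of this example.

```python
from urllib.request import Request

# Assumed address of a local Solr instance.
BASE_URL = "http://localhost:8983/solr"


def build_update_request(core, xml_bytes):
    """Prepare the POST request that the curl command above would send."""
    return Request(
        url=f"{BASE_URL}/{core}/update",
        data=xml_bytes,
        headers={"Content-type": "text/xml"},
        method="POST",
    )


payload = b'<add><doc><field name="id">1</field><field name="make">BMW</field></doc></add>'
request = build_update_request("cars", payload)
print(request.get_method(), request.full_url)
# With Solr running, the request could be sent with urllib.request.urlopen(request).
# To send a file instead, read it first: open("cars.xml", "rb").read()
```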
You should now have two documents indexed, as described in the following screenshot.
If you want to delete documents, you can create a file containing the tags (<delete> </delete>) and specify the criteria of deletion. Here are two examples:
- delete the document with id 1: <delete><id>1</id></delete>
- delete all the cars made by Ford: <delete><query>make:Ford</query></delete>
Once you apply an update or a delete, you need to commit your changes. The same principle is applied: call Solr server and specify the “commit” action:
Syntax: curl http://Solr_server_address/solr/core_name/update --data-binary '<commit/>' -H 'Content-type:text/xml'
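The delete and commit messages are small enough to generate on the fly. The following Python sketch builds them with the standard library; the function names are invented for this example, and the resulting strings would be posted to the update URL in the same way as the documents.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Build the message that deletes a single document by its unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build the message that deletes every document matching a query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


def commit():
    """Build the message that makes pending updates and deletes visible."""
    return ET.tostring(ET.Element("commit"), encoding="unicode")


print(delete_by_id(1))               # <delete><id>1</id></delete>
print(delete_by_query("make:Ford"))  # <delete><query>make:Ford</query></delete>
print(commit())
```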
Now that we have documents within the index, we can start searching for them. During the search process, there are two steps:
Step 1: Query formulation and execution
The concept is very similar to the indexing. We call Solr server and specify that the action is a “select.” We also specify the user query and a set of search parameters. Solr will select, from the index, those documents that match the user query respecting the set of parameters. Here’s the syntax of the http request that is sent to Solr:
In q, you can specify the set of searched keywords. You can also use operators to define whether a keyword is optional or not, and you can restrict a keyword to a field with the field:value syntax.
In addition to the query (q), there are several parameters that you can use to build a powerful search experience. Here are a few examples:
- sort (asc|desc): sort the returned documents in a specific order. You can combine several sort criteria in one single query, e.g., sort=inStock desc, price asc
- rows: specify the number of documents that you want Solr to return in the result set. This is useful for pagination. By default this is set to 10.
- start: used for pagination to specify the number of the document where Solr should start displaying the results. By default this is set to 0.
- fq: it’s a very powerful parameter that helps you to filter the returned documents. For instance, to search for BMWs cheaper than $15000, you can specify the following parameters: fq=price:[* TO 15000]&fq=make:BMW
- fl: it’s used to specify the set of fields that you want to return in your result set. For instance, if you want to return only the make, the price, and the colour, you could do the following: fl=make,price,colour. If you don’t specify this parameter, Solr will return all the stored fields of each document.
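The parameters above are combined into a single query string, and special characters such as spaces and brackets need to be URL-encoded. As a sketch, the Python helper below assembles a select URL for the “cars” core; the local address and the function name build_select_url are assumptions of this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed address of a local Solr instance.
BASE_URL = "http://localhost:8983/solr"


def build_select_url(core, q, fq=(), fl=(), sort=None, start=0, rows=10, wt="xml"):
    """Assemble a select URL; fq may be repeated, fl is joined with commas."""
    params = [("q", q)]
    params.extend(("fq", clause) for clause in fq)
    if fl:
        params.append(("fl", ",".join(fl)))
    if sort:
        params.append(("sort", sort))
    params.extend([("start", start), ("rows", rows), ("wt", wt)])
    return f"{BASE_URL}/{core}/select?{urlencode(params)}"


url = build_select_url(
    "cars",
    "*:*",
    fq=["price:[* TO 15000]", "make:BMW"],
    fl=["make", "price", "colour"],
    sort="price asc",
)
print(url)
# Decoding the query string again shows the two filters that were sent.
print(parse_qs(urlsplit(url).query)["fq"])  # ['price:[* TO 15000]', 'make:BMW']
```

For pagination, page n (counting from 0) corresponds to start = n * rows.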
Step 2: Search Results
In this step, you get the result set from Solr in the format and order that you specified in your query. You can request the response in either XML or JSON format.
The most basic query you can make is q=*:*. This returns all the documents available in the index.
Here’s another example using search parameters:
- The query is requesting all the cars made by BMW (q=make:BMW).
- The search results set should start from the first response document (start=0).
- The number of documents that should be returned in the result set is 10 (rows=10).
- Limit the returned fields to the make, the model, and the price (fl=make,model,price).
- The format of the response should be XML (wt=xml).
The returned response is described below. In the responseHeader section, Solr returns the query and the search parameters. numFound contains the number of documents found by Solr that match the user query. The section <doc> contains the response documents. More precisely, it contains the fields that were specified in the fl parameter.
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">6</int>
    <lst name="params">
      <str name="q">make:BMW</str>
      <str name="fl">make,model,price</str>
      <str name="start">0</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="make">BMW</str>
      <str name="model">X5</str>
      <int name="price">45000</int>
    </doc>
  </result>
</response>
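A client application usually turns this response into its own data structures. The Python sketch below parses a trimmed copy of the response above with the standard library; the function name parse_response is invented for this example, and only the str and int value types that appear in this response are handled.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown above (responseHeader omitted).
RESPONSE = """
<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="make">BMW</str>
      <str name="model">X5</str>
      <int name="price">45000</int>
    </doc>
  </result>
</response>
"""


def parse_response(xml_text):
    """Return (numFound, list of documents as dictionaries)."""
    root = ET.fromstring(xml_text)
    result = root.find("result[@name='response']")
    num_found = int(result.get("numFound"))
    documents = []
    for doc in result.findall("doc"):
        fields = {}
        for child in doc:
            # <int> elements become Python integers; everything else stays a string.
            value = int(child.text) if child.tag == "int" else child.text
            fields[child.get("name")] = value
        documents.append(fields)
    return num_found, documents


print(parse_response(RESPONSE))
# (1, [{'make': 'BMW', 'model': 'X5', 'price': 45000}])
```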
Welcome to the Search and Solr community!
In this Solr tutorial, you’ve discovered how easy it is to set up Solr. Now you have everything you need to get started and discover more advanced features.
Should you have any comments about this Solr Tutorial, please get in touch.