Search and Analyze data in Amazon OpenSearch using SQL
Learn how to use SQL queries to retrieve data stored in Amazon OpenSearch.
About | |
---|---|
✅ AWS experience | 200 - Intermediate |
⏱ Time to complete | 60 minutes |
💰 Cost to complete | Free tier eligible |
🧩 Prerequisites | - AWS account -CDK installed: Visit Get Started with AWS CDK to learn more. |
💻 Code Sample | Code sample used in tutorial on GitHub |
📢 Feedback | Any feedback, issues, or just a 👍 / 👎 ? |
⏰ Last Updated | 2023-07-24 |
Note: This tutorial uses a domain with open access. For the highest level of security, we recommend that you put your domain inside a virtual private cloud (VPC).
- Go to https://aws.amazon.com and choose Sign In to the Console.
- Under Analytics, choose Amazon OpenSearch Service.
- Under the "Get Started" choice dialog, select "Managed Clusters" then click on Create domain.
- Provide a name for the domain. The examples in this tutorial use the name movies.
- For the domain creation method, choose Standard create. (Note: To quickly configure a production domain with best practices, you can choose Easy create. For the development and testing purposes of this tutorial, we'll use Standard create.)
- For templates, choose Dev/test.
- For the deployment option, choose Domain with standby.
- For Version, choose the latest version.
- For now, ignore the Data nodes, Warm and cold data storage, Dedicated master nodes, Snapshot configuration, and Custom endpoint sections.
- For simplicity in this tutorial, use a public access domain. Under Network, choose Public access.
- In the fine-grained access control settings, keep the Enable fine-grained access control check box selected. Select Create master user and provide a username and password.
- For now, ignore the SAML authentication and Amazon Cognito authentication sections.
- For Access policy, choose Only use fine-grained access control. In this tutorial, fine-grained access control handles authentication, not the domain access policy.
- Ignore the rest of the settings and choose Create. New domains typically take 15–30 minutes to initialize, but can take longer depending on the configuration. After your domain initializes, select it to open its configuration pane. Note the domain endpoint under General information (for example,
https://search-my-domain.us-east-1.es.amazonaws.com
), which you'll use in the next step.
- Navigate to your OpenSearch service on the AWS console. On the Dashboard section, your domain should be listed under the header “Name”. Click on any of your domains.
- Click on the OpenSearch Dashboard URL link for this domain.
- Login into OpenSearch Dashboards with the username and password you created in Step 1.11
- After login is successful, you will see the screens below in sequence - click “Add data”, Select “Global” tenant and click on the “Add data” button under “Sample eCommerce orders”. That will ingest the sample data comprising of e-commerce orders, into this OpenSearch domain, which you will query using SQL for the rest of this tutorial.
Note: If the Sample eCommerce Orders was already ingested you will see “View data” instead, you can skip the step above in that case.**
- Click the Hamburger Icon (the icon top lef under "Open Search Dashboards" - it looks like 3 horizontal lines above each other) to expand the menu, and then click on “Query Workbench”. This will open up the “Query Editor” where you can enter your SQL queries, with the results showing in the “Output” pane below that.
SQL | OpenSearch | Description |
---|---|---|
column | field | In both cases, at the lowest level, data is stored in named entries, of a variety of data types, containing one value. SQL calls such an entry a column while OpenSearch a field. Notice that in OpenSearch a field can contain multiple values of the same type (essentially a list) while in SQL, a column can contain exactly one value of said type. OpenSearch SQL will do its best to preserve the SQL semantic and, depending on the query, reject those that return fields with more than one value. |
row | document | Column s and field s do not exist by themselves; they are part of a row or a document . The two have slightly different semantics: a row tends to be strict (and have more enforcements) while a document tends to be a bit more flexible or loose (while still having a structure). |
table | index | The target against which queries, whether in SQL or OpenSearch get executed against. |
schema | implicit | In RDBMS, schema is mainly a namespace of tables and typically used as a security boundary. OpenSearch does not provide an equivalent concept for it. However when security is enabled, OpenSearch automatically applies the security enforcement so that a role sees only the data it is allowed to (in SQL jargon, its schema). |
catalog or database | cluster instance or domain | In SQL, catalog or database are used interchangeably and represent a set of schemas that is, a number of tables. In OpenSearch the set of indices available are grouped in a cluster`` or domain . The semantics also differ a bit; a database is essentially yet another namespace (which can have some implications on the way data is stored) while an OpenSearch cluster is a runtime instance, or rather a set of at least one OpenSearch instance (typically running distributed). In practice this means that while in SQL one can potentially have multiple catalogs inside an instance, in OpenSearch one is restricted to only one. |
cluster | cluster (federated) | Traditionally in SQL, cluster refers to a single RDMBS instance which contains a number of catalog s or database s (see above). While RDBMS tend to have only one running instance, on a single machine (not distributed), OpenSearch goes the opposite way and by default, is distributed and multi-instance. Further more, an OpenSearch cluster can be connected to other cluster s in a federated fashion thus cluster means:single cluster : Multiple Elasticsearch instances typically distributed across machines, running within the same namespace.multiple clusters :: Multiple clusters, each with its own namespace, connected to each other in a federated setup.Cross-cluster search in Amazon OpenSearch Service lets you perform queries and aggregations across multiple connected domains. It often makes more sense to use multiple smaller domains instead of a single large domain, especially when you're running different types of workloads. |
- To list all your indexes in your current domain, run the SQL query below in the Query Workbench:
opensearch_dashboards_sample_data_ecommerce
)TABLE_NAME |
---|
.kibana_1 |
.opendistro_security |
fruit |
opensearch_dashboards_sample_data_ecommerce |
opensearch_dashboards_sample_data_flights |
opensearch_dashboards_sample_data_logs |
.kibana |
- Retrieve a limited set of documents (5) from the sample e-commerce table (sample results follow the SQL query below)
Note: the (+) sign next to “order” indicates that there are nested JSON documents which we can see when we expand it. Later we will see how we can query nested JSON document fields using SQL.
geoip
entities attached (related to) this order.geoip
headings (for order id 584677). Under geoip
click on (+) next to Africa and (+) next to location. This will display all nested data related to this order.- We want retrieve a specific order, so we need to get the datatypes of all the columns in this table, using the SQL query below, and scrolling through the results to find the datatype of the column order_id.
- Since the datatype of the column
order_id
is keyword, we need to enclose the search in quotes (updated).
- Use the SQL
SELECT
clause, along withFROM
,WHERE
,GROUP BY
,HAVING
,ORDER BY
, andLIMIT
to search and aggregate data. Among these clauses,SELECT
andFROM
are required, as they specify which fields to retrieve and which indexes to retrieve them from. All other clauses are optional.
type | day_of_week_i | total_quantity | taxless_total_price |
---|---|---|---|
order | 4 | 4 | 133.96 |
order | 4 | 4 | 86.96 |
order | 4 | 4 | 112.96 |
order | 4 | 4 | 121.96 |
order | 4 | 4 | 70.96 |
- Use the
DISTINCT
clause to get back only unique field values. You can specify one or more field names:
manufacturer |
---|
Angeldale |
Champion Arts |
Crystal Lighting |
Elitelligence |
Gnomehouse |
Gnomehouse mom |
- To use an aggregate function like SUM - enclose a single field name/column or expression as a parameter. If you specify a list of one or more fields before the aggregate function, you must specify the same list as part of the
GROUP BY
clause otherwise the aggregate function will calculate the expression over all the documents in the index. Run the query below to calculate the sum of total_quantity by manufacturer.
GROUP BY
clause to define subsets of a result set. You can specify the field name (column name) to aggregate on in the GROUP BY
clause. For example, the following query returns the department numbers and the total sales for each department:manufacturer | sum(total_quantity) |
---|---|
102 | |
Angeldale | 1220 |
Champion Arts | 556 |
Crystal Lighting | 50 |
Elitelligence | 3076 |
Gnomehouse | 1046 |
Gnomehouse mom | 60 |
Karmanite | 34 |
Low Tide Media | 3507 |
Microlutions | 799 |
Oceanavigations | 2743 |
Primemaster | 152 |
Pyramidustries | 2142 |
Pyramidustries active | 328 |
Spherecords | 958 |
Spherecords Curvy | 122 |
Spherecords Maternity | 128 |
Spritechnologies | 616 |
Tigress Enterprises | 2388 |
Tigress Enterprises Curvy | 138 |
Tigress Enterprises MAMA | 148 |
- You can use an expression in the GROUP BY clause. For example, the following query returns the average sales or total_price and count of records for each year and month:
year(order_date) | month(order_date) | avg(taxless_total_price) | count(*) |
---|---|---|---|
2023 | 6 | 74.69634 | 1338 |
2023 | 7 | 75.19941 | 3337 |
Function | Description |
---|---|
AVG | Returns the average of the results. |
COUNT | Returns the number of results. |
SUM | Returns the sum of the results. |
MIN | Returns the minimum of the results. |
MAX | Returns the maximum of the results. |
VAR_POP or VARIANCE | Returns the population variance of the results after discarding nulls. Returns 0 when there is only one row of results. |
VAR_SAMP | Returns the sample variance of the results after discarding nulls. Returns null when there is only one row of results. |
STD or STDDEV | Returns the sample standard deviation of the results. Returns 0 when there is only one row of results. |
STDDEV_POP | Returns the population standard deviation of the results. Returns 0 when there is only one row of results. |
STDDEV_SAMP | Returns the sample standard deviation of the results. Returns null when there is only one row of results. |
- Use aggregate expressions as part of larger expressions in SELECT. The following query calculates the average commission for each manufacturer as 5% of the average sales:
manufacturer | avg_commission |
---|---|
10.1878 | |
Angeldale | 5.70938 |
Champion Arts | 3.46576 |
Crystal Lighting | 4.1681 |
Elitelligence | 3.42215 |
Gnomehouse | 4.50266 |
Gnomehouse mom | 4.33275 |
Karmanite | 7.10333 |
Low Tide Media | 4.22025 |
Microlutions | 3.64217 |
Oceanavigations | 4.53148 |
Primemaster | 7.92308 |
Pyramidustries | 3.38982 |
Pyramidustries active | 3.2559 |
Spherecords | 3.09347 |
Spherecords Curvy | 3.22758 |
Spherecords Maternity | 3.54196 |
Spritechnologies | 3.83885 |
Tigress Enterprises | 3.60001 |
Tigress Enterprises Curvy | 3.53235 |
Tigress Enterprises MAMA | 3.85656 |
WHERE
and HAVING
clauses in SQL- Both
WHERE
andHAVING
are used to filter results. TheWHERE
filter is applied before theGROUP BY
phase, so you cannot use aggregate functions in aWHERE
clause. However, you can use theWHERE
clause to limit the rows to which the aggregate is then applied. Below is a list of operators you can use with the WHERE clause to specify a condition to filter the results.
Operators | Behavior |
---|---|
‘= | Equal to. |
--- | --- |
<> | Not equal to. |
> | Greater than. |
< | Less than. |
>= | Greater than or equal to. |
<= | Less than or equal to. |
IN | Specify multiple OR operators. |
BETWEEN | Similar to a range query. For more information about range queries, see Range query. |
LIKE | Use for full-text search. For more information about full-text queries, see Full-text queries. |
IS NULL | Check if the field value is NULL . |
IS NOT NULL | Check if the field value is NOT NULL . |
=
, <>
, >
, >=
, <
, <=
) with boolean operators NOT
, AND
, or OR
to build more complex expressions.order_id |
---|
723055 |
727462 |
727269 |
727381 |
718424 |
- The
HAVING
filter is applied after theGROUP BY
phase, so you can use theHAVING
clause to limit the groups that are included in the results.
manufacturer | sum(total_quantity) |
---|---|
Angeldale | 1220 |
Elitelligence | 3076 |
Gnomehouse | 1046 |
Low Tide Media | 3507 |
Oceanavigations | 2743 |
Pyramidustries | 2142 |
Tigress Enterprises | 2388 |
- Use an alias for an aggregate expression in the
HAVING
clause. The following query uses an alias to return the total quantity by manufacturer where the sum exceeds 2,000.
manufacturer | tot_qty |
---|---|
Elitelligence | 3076 |
Low Tide Media | 3507 |
Oceanavigations | 2743 |
Pyramidustries | 2142 |
Tigress Enterprises | 2388 |
- The
DELETE
statement deletes documents that satisfy the predicates in theWHERE
clause. If you don’t specify theWHERE
clause, all documents are deleted. It is disabled by default. To enable theDELETE
functionality in SQL, you need to update the configuration by sending the following request using the console in Dev Tools.
deleted_rows
field shows the number of documents deleted.- Using Inner join
JOIN
clause with an INNER
keyword. The join predicate(s) is specified by the ON clause.a.account****_number | a.firstname | a.lastname | e.id | e.name |
---|---|---|---|---|
6 | Hattie | Bond | 6 | Jane Smith |
- Use Cross join, also known as cartesian join, to combine each document from the first index with each document from the second. The result set is the the cartesian product of documents of both indexes. This operation is similar to the inner join without the
ON
clause that specifies the join condition.
Note: It’s risky to perform a cross join on two indexes of large or even medium size. It might trigger a circuit breaker that terminates the query to avoid running out of memory.
a.account_number | a.firstname | a.lastname | e.id | e.name |
---|---|---|---|---|
1 | Amber | Duke | 3 | Bob Smith |
1 | Amber | Duke | 4 | Susan Smith |
1 | Amber | Duke | 6 | Jane Smith |
6 | Hattie | Bond | 3 | Bob Smith |
6 | Hattie | Bond | 4 | Susan Smith |
6 | Hattie | Bond | 6 | Jane Smith |
13 | Nanette | Bates | 3 | Bob Smith |
13 | Nanette | Bates | 4 | Susan Smith |
13 | Nanette | Bates | 6 | Jane Smith |
18 | Dale | Adams | 3 | Bob Smith |
18 | Dale | Adams | 4 | Susan Smith |
18 | Dale | Adams | 6 | Jane Smith |
- Use left outer join to retain rows from the first index if it does not satisfy the join predicate. The keyword
OUTER
is optional.
a.account****_number | a.firstname | a.lastname | e.id | e.name |
---|---|---|---|---|
1 | Amber | Duke | null | null |
6 | Hattie | Bond | 6 | Jane Smith |
13 | Nanette | Bates | null | null |
18 | Dale | Adams | null | null |
- Use subquery which is a complete
SELECT
statement used within another statement and enclosed in parenthesis:
a1.firstname | a1.lastname | a1.balance |
---|---|---|
Amber | Duke | 39225 |
Nanette | Bates | 32838 |
- Use the MATCH function in SQL for full-text search (a subset of full-text queries available in OpenSearch is supported). You can search documents that match a
string
,number
,date
, orboolean
value for a given field. TheMATCHQUERY
andMATCH_QUERY
functions are synonyms for theMATCH
relevance function.
Syntax:
analyzer
boost
address |
---|
789 Madison Street |
671 Bristol Street |
address |
---|
789 Madison Street |
671 Bristol Street |
880 Holmes Lane |
- To search for text in multiple fields, use
MULTI_MATCH
function. This function maps to themulti_match
query used in search engine, to returns the documents that match a provided text, number, date or boolean value with a given field or fields. eg. To search forDale
in either thefirstname
orlastname
fields could be called from SQL usingmulti_match
function. TheMATCHQUERY
andMATCH_QUERY
functions are synonyms for theMATCH
relevance function.
firstname | lastname |
---|---|
Dale | Adams |
- To return a relevance score along with every matching document, use the
SCORE
,SCOREQUERY
, orSCORE_QUERY
functions.
SCORE
function expects two arguments. The first argument is the MATCH_QUERY
expression. The second argument is an optional floating-point number to boost the score (the default value is 1.0):SCORE
function to boost the documents’ scores:account_number | address | score |
---|---|---|
1 | 880 Holmes Lane | 0.5 |
--- | --- | --- |
6 | 671 Bristol Street | 100 |
13 | 789 Madison Street | 100 |
For a complete REST API reference for the SQL plugin, see SQL/PPL API.
_plugins/_sql
endpoint:_sql
using the following format:The OpenSearch Java Database Connectivity (JDBC) driver comes in the form of a JAR file which can be downloaded as per the information provided in this link the SQL repository on GitHub. Installing this will let you setup a connection to OpenSearch from a SQL client such as DBeaver, and you can then run SQL queries from that client. Similarly Amazon Quicksight has a connector to OpenSearch for data visualization. However the most common way for visualizing data in OpenSearch is via OpenSearch dashboards which is a fork from Kibana.
A note about Query DSL: OpenSearch provides a search language called query domain-specific language (DSL) that you can use to search your data. Query DSL is a flexible language with a JSON interface. With query DSL, you need to specify a query in thequery
parameter of the search. One of the simplest searches in OpenSearch uses thematch_all
query, which matches all documents in an index:
Broadly, you can classify queries into two categories—leaf queries and compound queries:
- Full-text queries: Use full-text queries to search text documents. For an analyzed text field search, full-text queries split the query string into terms with the same analyzer that was used when the field was indexed. For an exact value search, full-text queries look for the specified value without applying text analysis. To learn more, see Full-text queries.
- Term-level queries: Use term-level queries to search documents for an exact specified term, such as an ID or value range. Term-level queries do not analyze search terms or sort results by relevance score. To learn more, see Term-level queries.
- Geographic and xy queries: Use geographic queries to search documents that include geographic data. Use xy queries to search documents that include points and shapes in a two-dimensional coordinate system. To learn more, see Geographic and xy queries.
- Joining queries: Use joining queries to search nested fields or return parent and child documents that match a specific query. Types of joining queries include
nested
,has_child
,has_parent
, andparent_id
queries. - Span queries: Use span queries to perform precise positional searches. Span queries are low-level, specific queries that provide control over the order and proximity of specified query terms. They are primarily used to search legal documents. To learn more, see Span queries.
- Specialized queries: Specialized queries include all other query types (
distance_feature
,more_like_this
,percolate
,rank_feature
,script
,script_score
,wrapper
, andpinned_query
).
Comment: You can further explore the latest version of OpenSearch by reviewing the documentation on this site OpenSearch documentation. To build an application using the OpenSearch SQL API, review this link OpenSearch SQL API and refer to the different SQL functions and operators which you can use depending on your application requirements.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.