
Troubleshooting the ML Commons Framework
You have completed the implementation of your OpenSearch models, but something is not functioning correctly. To resolve issues like this, you need to learn how to troubleshoot the engine that powers the models feature. Start by retrieving the default settings of the ML Commons plugin:
GET _cluster/settings?include_defaults=true&filter_path=defaults.plugins.ml_commons
{
  "defaults": {
    "plugins": {
      "ml_commons": {
        "monitoring_request_count": "100",
        "allow_custom_deployment_plan": "false",
        "sync_up_job_interval_in_seconds": "10",
        "ml_task_timeout_in_seconds": "600",
        "task_dispatcher": {
          "eligible_node_role": {
            "local_model": [
              "data",
              "ml"
            ],
            "remote_model": [
              "data",
              "ml"
            ]
          }
        },
        "trusted_url_regex": "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
        "rag_pipeline_feature_enabled": "false",
        "task_dispatch_policy": "round_robin",
        "max_ml_task_per_node": "10",
        "exclude_nodes": {
          "_name": ""
        },
        "model_access_control_enabled": "false",
        "native_memory_threshold": "90",
        "model_auto_redeploy": {
          "lifetime_retry_times": "3",
          "enable": "false"
        },
        "jvm_heap_memory_threshold": "85",
        "memory_feature_enabled": "false",
        "only_run_on_ml_node": "true",
        "max_register_model_tasks_per_node": "10",
        "allow_registering_model_via_local_file": "false",
        "update_connector": {
          "enabled": "false"
        },
        "max_model_on_node": "10",
        "trusted_connector_endpoints_regex": [
          """^https://runtime\.sagemaker\..*[a-z0-9-]\.amazonaws\.com/.*$""",
          """^https://api\.openai\.com/.*$""",
          """^https://api\.cohere\.ai/.*$""",
          """^https://bedrock-runtime\..*[a-z0-9-]\.amazonaws\.com/.*$"""
        ],
        "remote_inference": {
          "enabled": "true"
        },
        "connector_access_control_enabled": "false",
        "enable_inhouse_python_model": "false",
        "max_deploy_model_tasks_per_node": "10",
        "allow_registering_model_via_url": "false"
      }
    }
  }
}
- plugins.ml_commons.model_access_control_enabled: models deployed to OpenSearch can be controlled with granular roles that you tie to them. This setting enables that behavior, as opposed to allowing anyone to use models at any time. If you are working with a cluster where this setting is enabled, check whether someone associated a role with the model; that may explain why you get access errors every time you try to deploy it.
- plugins.ml_commons.native_memory_threshold: this setting sets an upper bound on RAM (also known as native memory) utilization before tasks stop being allowed to execute. It defaults to 90, which means that if RAM utilization goes over 90%, a circuit breaker stops tasks from being executed. For a really busy OpenSearch cluster that also has to serve search requests, this is something you want to watch out for.
- plugins.ml_commons.jvm_heap_memory_threshold: this setting sets an upper bound on JVM heap utilization before tasks stop being allowed to execute. It defaults to 85, which means that if heap utilization goes over 85%, a circuit breaker stops tasks from being executed. It is important to note that the JVM heap may reach this threshold more frequently during peak times. Once garbage collection finishes, heap usage shrinks, but it may fill up again pretty quickly.
- plugins.ml_commons.model_auto_redeploy.enable: as you may have learned at this point, every model deployment is executed by a task in the OpenSearch cluster. The nodes responsible for executing these tasks can fail at any time, and by default there is no "do it again" according to this setting. Setting it to true tells OpenSearch to attempt a redeploy if a model is found not deployed or partially deployed. This may explain why, even after bouncing your cluster, the model still doesn't work. When this setting is set to true, you can optionally use the property plugins.ml_commons.model_auto_redeploy.lifetime_retry_times to specify how many redeploy attempts should happen.
- plugins.ml_commons.trusted_connector_endpoints_regex: this setting controls which endpoints are allowed to handle inference requests. By default, only a small set of endpoints is on the list. If you ever need to use a custom model, you will need to add your endpoint to this list. Failing to do so may be the reason why your models show as deployed but always fail to handle inference requests: it just means your endpoint is not allow-listed.
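Several of these settings can be changed at runtime through the cluster settings API. As a minimal sketch (assuming a local cluster at http://localhost:9200 with security disabled; the helper names are mine), you could build and send such an update like this:

```python
import json
import urllib.request

def build_ml_settings(auto_redeploy=True, retries=3, extra_trusted_endpoints=None):
    """Build a persistent cluster-settings payload for common ML Commons knobs."""
    settings = {
        "plugins.ml_commons.model_auto_redeploy.enable": auto_redeploy,
        "plugins.ml_commons.model_auto_redeploy.lifetime_retry_times": retries,
    }
    if extra_trusted_endpoints:
        # Note: this REPLACES the list, so include any default endpoints you still need.
        settings["plugins.ml_commons.trusted_connector_endpoints_regex"] = extra_trusted_endpoints
    return {"persistent": settings}

def apply_settings(payload, endpoint="http://localhost:9200"):
    """PUT the payload to _cluster/settings (requires a reachable cluster)."""
    req = urllib.request.Request(
        f"{endpoint}/_cluster/settings",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The endpoint regex below is a hypothetical example, not a real default.
payload = build_ml_settings(
    extra_trusted_endpoints=[r"^https://my-model-host\.example\.com/.*$"]
)
```

Keep in mind that array-valued settings such as trusted_connector_endpoints_regex are replaced wholesale by an update like this, not merged.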
The plugins.ml_commons.max_ml_task_per_node setting is a bit trickier, as it has to do with resource utilization. Problems related to resource utilization only arise under certain load conditions and are harder to identify and troubleshoot. In a nutshell, this setting controls how many tasks ML nodes are allowed to execute, and it defaults to 10. For small workloads without many concurrent tasks, this won't be a problem. However, think about scenarios where you have fewer ML nodes and they are responsible for handling a considerable number of tasks. If you need to scale up to more tasks per node, you can increase the value of this setting. However, there is another trick you must be aware of. Tasks are executed as threads, and these threads are taken from a pool. Even if you increase the number of tasks an ML node can handle, you must ensure that the thread pool for the specific task type is large enough to afford the amount of concurrency needed. To query the thread pools used by the ML Commons plugin, you can use the following command:
GET _cluster/settings?include_defaults=true&filter_path=defaults.thread_pool.ml_commons
{
  "defaults": {
    "thread_pool": {
      "ml_commons": {
        "opensearch_ml_deploy": {
          "queue_size": "10",
          "size": "9"
        },
        "opensearch_ml_execute": {
          "queue_size": "10",
          "size": "9"
        },
        "opensearch_ml_register": {
          "queue_size": "10",
          "size": "9"
        },
        "opensearch_ml_train": {
          "queue_size": "10",
          "size": "9"
        },
        "opensearch_ml_predict": {
          "queue_size": "10000",
          "size": "20"
        },
        "opensearch_ml_general": {
          "queue_size": "100",
          "size": "9"
        }
      }
    }
  }
}
If a thread pool proves too small for your workload, you can increase its size as needed. To inspect how deployed models are behaving at runtime, use the Profile API:
GET /_plugins/_ml/profile/models
{
  "nodes": {
    "QIpgbLWFSwyTFtWz5j-OvA": {
      "models": {
        "s_kvA4wBfndRacpb8I1Y": {
          "model_state": "DEPLOYED",
          "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@687c2ebe",
          "target_worker_nodes": [
            "QIpgbLWFSwyTFtWz5j-OvA"
          ],
          "worker_nodes": [
            "QIpgbLWFSwyTFtWz5j-OvA"
          ],
          "model_inference_stats": {
            "count": 8,
            "max": 2322.292209,
            "min": 469.437416,
            "average": 1250.456260875,
            "p50": 1197.6908130000002,
            "p90": 1667.658159,
            "p99": 2256.828804
          },
          "predict_request_stats": {
            "count": 8,
            "max": 2324.38096,
            "min": 471.412834,
            "average": 1252.4851045,
            "p50": 1199.588,
            "p90": 1669.7088755,
            "p99": 2258.9137515499997
          }
        }
      }
    }
  }
}
Note the two groups of statistics: model_inference_stats and predict_request_stats. The former deals with the actual inferences executed by the model, whereas the latter deals with the predict requests made to the model. Your troubleshooting exercise should consider the computed values of the metrics for each group, given the number of requests displayed in the field count. That should give you a good idea of whether the models are indeed the culprit. You may notice a mismatch between count and the actual number of requests executed. This can happen because the Profile API monitors only the last 100 requests. To change the number of monitored requests, update the following cluster setting:
PUT _cluster/settings
{
  "persistent" : {
    "plugins.ml_commons.monitoring_request_count" : 1000000
  }
}
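Once enough requests are being monitored, you can act on the Profile API numbers programmatically. A minimal sketch (the latency budget and helper name are mine), based on the response shape shown earlier:

```python
def slow_models(profile_response, max_avg_ms=1000.0):
    """Return (node_id, model_id, avg_ms) for models over the latency budget.

    Expects the JSON shape returned by GET /_plugins/_ml/profile/models.
    """
    offenders = []
    for node_id, node in profile_response.get("nodes", {}).items():
        for model_id, model in node.get("models", {}).items():
            stats = model.get("model_inference_stats")
            if stats and stats["average"] > max_avg_ms:
                offenders.append((node_id, model_id, stats["average"]))
    return offenders

# Sample trimmed from the Profile API response shown above.
sample = {
    "nodes": {
        "QIpgbLWFSwyTFtWz5j-OvA": {
            "models": {
                "s_kvA4wBfndRacpb8I1Y": {
                    "model_state": "DEPLOYED",
                    "model_inference_stats": {"count": 8, "average": 1250.456260875},
                }
            }
        }
    }
}
offenders = slow_models(sample)  # the sample model averages ~1250 ms, over a 1000 ms budget
```

A check like this is easy to wire into a periodic health script, so model latency regressions surface before users complain.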
During a search, each shard executes the query and returns its matching documents with their _source field. Once the coordinating node has gathered all the results, it combines them into a unified response to send back to the client. To see where a search request spends its time, add "profile": true to your search request body. For example:
GET /nlp_pqa_2/_search
{
  "profile": true,
  "_source": [ "question" ],
  "size": 30,
  "query": {
    "neural": {
      "question_vector": {
        "query_text": "What is the meaning of life?",
        "model_id": "-OnayIsBvAWGexYmHu8G",
        "k": 30
      }
    }
  }
}
{
  "took": 774,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "nlp_pqa_2",
        "_id": "1",
        "_score": 1,
        "_source": {
          "question": "What is the meaning of life?"
        }
      },
      {
        "_index": "nlp_pqa_2",
        "_id": "3",
        "_score": 0.3856697,
        "_source": {
          "question": "How many legs does an Elephant have?"
        }
      },
      {
        "_index": "nlp_pqa_2",
        "_id": "4",
        "_score": 0.38426778,
        "_source": {
          "question": "How many legs does a Giraffe have?"
        }
      },
      {
        "_index": "nlp_pqa_2",
        "_id": "2",
        "_score": 0.34972358,
        "_source": {
          "question": "Does this work with xbox?"
        }
      }
    ]
  },
  "profile": {
    "shards": [
      {
        "id": "[3mWnAgBCTvO_NM_zp2p_pg][nlp_pqa_2][2]",
        "inbound_network_time_in_millis": 0,
        "outbound_network_time_in_millis": 0,
        "searches": [
          {
            "query": [
              {
                "type": "KNNQuery",
                "description": "",
                "time_in_nanos": 10847,
                "breakdown": {
                  "set_min_competitive_score_count": 0,
                  "match_count": 0,
                  "shallow_advance_count": 0,
                  "set_min_competitive_score": 0,
                  "next_doc": 0,
                  "match": 0,
                  "next_doc_count": 0,
                  "score_count": 0,
                  "compute_max_score_count": 0,
                  "compute_max_score": 0,
                  "advance": 0,
                  "advance_count": 0,
                  "score": 0,
                  "build_scorer_count": 0,
                  "create_weight": 10847,
                  "shallow_advance": 0,
                  "create_weight_count": 1,
                  "build_scorer": 0
                }
              }
            ],
            "rewrite_time": 6965,
            "collector": [
              {
                "name": "SimpleTopScoreDocCollector",
                "reason": "search_top_hits",
                "time_in_nanos": 6605
              }
            ]
          }
        ],
        "aggregations": []
      },
      {
        "id": "[3mWnAgBCTvO_NM_zp2p_pg][nlp_pqa_2][3]",
        "inbound_network_time_in_millis": 0,
        "outbound_network_time_in_millis": 0,
        "searches": [
          {
            "query": [
              {
                "type": "KNNQuery",
                "description": "",
                "time_in_nanos": 79843642,
                "breakdown": {
                  "set_min_competitive_score_count": 0,
                  "match_count": 0,
                  "shallow_advance_count": 0,
                  "set_min_competitive_score": 0,
                  "next_doc": 615,
                  "match": 0,
                  "next_doc_count": 1,
                  "score_count": 1,
                  "compute_max_score_count": 0,
                  "compute_max_score": 0,
                  "advance": 1822,
                  "advance_count": 1,
                  "score": 4185,
                  "build_scorer_count": 2,
                  "create_weight": 10888,
                  "shallow_advance": 0,
                  "create_weight_count": 1,
                  "build_scorer": 79826132
                }
              }
            ],
            "rewrite_time": 2486,
            "collector": [
              {
                "name": "SimpleTopScoreDocCollector",
                "reason": "search_top_hits",
                "time_in_nanos": 40952
              }
            ]
          }
        ],
        "aggregations": []
      },
      {
        "id": "[3mWnAgBCTvO_NM_zp2p_pg][nlp_pqa_2][4]",
        "inbound_network_time_in_millis": 0,
        "outbound_network_time_in_millis": 0,
        "searches": [
          {
            "query": [
              {
                "type": "KNNQuery",
                "description": "",
                "time_in_nanos": 81504014,
                "breakdown": {
                  "set_min_competitive_score_count": 0,
                  "match_count": 0,
                  "shallow_advance_count": 0,
                  "set_min_competitive_score": 0,
                  "next_doc": 1321,
                  "match": 0,
                  "next_doc_count": 1,
                  "score_count": 1,
                  "compute_max_score_count": 0,
                  "compute_max_score": 0,
                  "advance": 435,
                  "advance_count": 1,
                  "score": 16599,
                  "build_scorer_count": 2,
                  "create_weight": 76898,
                  "shallow_advance": 0,
                  "create_weight_count": 1,
                  "build_scorer": 81408761
                }
              }
            ],
            "rewrite_time": 3020,
            "collector": [
              {
                "name": "SimpleTopScoreDocCollector",
                "reason": "search_top_hits",
                "time_in_nanos": 45490
              }
            ]
          }
        ],
        "aggregations": []
      },
      {
        "id": "[BP2uaV4iScmS_zRntM65AQ][nlp_pqa_2][0]",
        "inbound_network_time_in_millis": 1,
        "outbound_network_time_in_millis": 2,
        "searches": [
          {
            "query": [
              {
                "type": "KNNQuery",
                "description": "",
                "time_in_nanos": 102327857,
                "breakdown": {
                  "set_min_competitive_score_count": 0,
                  "match_count": 0,
                  "shallow_advance_count": 0,
                  "set_min_competitive_score": 0,
                  "next_doc": 509,
                  "match": 0,
                  "next_doc_count": 1,
                  "score_count": 1,
                  "compute_max_score_count": 0,
                  "compute_max_score": 0,
                  "advance": 903,
                  "advance_count": 1,
                  "score": 2298,
                  "build_scorer_count": 2,
                  "create_weight": 57221,
                  "shallow_advance": 0,
                  "create_weight_count": 1,
                  "build_scorer": 102266926
                }
              }
            ],
            "rewrite_time": 8032,
            "collector": [
              {
                "name": "SimpleTopScoreDocCollector",
                "reason": "search_top_hits",
                "time_in_nanos": 26020
              }
            ]
          }
        ],
        "aggregations": []
      },
      {
        "id": "[BP2uaV4iScmS_zRntM65AQ][nlp_pqa_2][1]",
        "inbound_network_time_in_millis": 1,
        "outbound_network_time_in_millis": 5,
        "searches": [
          {
            "query": [
              {
                "type": "KNNQuery",
                "description": "",
                "time_in_nanos": 99278876,
                "breakdown": {
                  "set_min_competitive_score_count": 0,
                  "match_count": 0,
                  "shallow_advance_count": 0,
                  "set_min_competitive_score": 0,
                  "next_doc": 1305,
                  "match": 0,
                  "next_doc_count": 1,
                  "score_count": 1,
                  "compute_max_score_count": 0,
                  "compute_max_score": 0,
                  "advance": 1920,
                  "advance_count": 1,
                  "score": 17296,
                  "build_scorer_count": 2,
                  "create_weight": 57394,
                  "shallow_advance": 0,
                  "create_weight_count": 1,
                  "build_scorer": 99200961
                }
              }
            ],
            "rewrite_time": 7244,
            "collector": [
              {
                "name": "SimpleTopScoreDocCollector",
                "reason": "search_top_hits",
                "time_in_nanos": 53085
              }
            ]
          }
        ],
        "aggregations": []
      }
    ]
  }
}
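The profile output above is verbose, so it can help to condense it before analyzing it. A sketch (the helper name is mine) that sums the reported query time per shard, letting you spot the slowest shard at a glance:

```python
def shard_query_times(profile_section):
    """Map shard id -> total query time in ms from a search 'profile' section."""
    times = {}
    for shard in profile_section.get("shards", []):
        nanos = sum(
            q.get("time_in_nanos", 0)
            for search in shard.get("searches", [])
            for q in search.get("query", [])
        )
        times[shard["id"]] = nanos / 1_000_000  # nanoseconds -> milliseconds
    return times

# Sample trimmed from the profile response shown above (hypothetical node id).
sample = {
    "shards": [
        {"id": "[node][nlp_pqa_2][2]", "searches": [{"query": [{"time_in_nanos": 10_847}]}]},
        {"id": "[node][nlp_pqa_2][3]", "searches": [{"query": [{"time_in_nanos": 79_843_642}]}]},
    ]
}
times = shard_query_times(sample)
slowest = max(times, key=times.get)  # shard [3] dominates in this sample
```

Comparing these per-shard numbers with the model latency from the Profile API tells you whether time is going into the search itself or into the model.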
The response includes a section called profile containing some interesting data about the execution of the individual components of the search request. Analyzing this data allows you to debug slower requests and understand how to improve their performance. The trick here is to cross-reference the time taken by the models with the time spent in the actual search execution. The time taken by the model can be measured with the profiling approach from the previous section.

Permission problems usually surface as 401 or 403 HTTP codes. This means that the request was processed successfully up to the point where the user credentials were verified and the user permissions were checked. This is actually good news, since you won't have to investigate the error itself; you only need to investigate whether the resource being used should be available to the user. A common cause is a model group configured with backend_roles that prohibit certain users from accessing it. To illustrate this, see the group model_group_test below:
POST /_plugins/_ml/model_groups/_register
{
  "name": "model_group_test",
  "description": "This is an example description",
  "access_mode": "restricted",
  "backend_roles" : ["data_scientists", "administrators"]
}
Users who try to deploy a model associated with the group model_group_test and who are not part of the roles data_scientists and administrators won't be able to complete the deployment request successfully.

You could clone the ML Commons repository with the git command. However, for this debugging exercise you will need an IDE to watch the execution of the code, so it is a better idea to start the cloning process from your IDE. I will show you examples using both IntelliJ IDEA and Visual Studio Code.

In IntelliJ IDEA, create a run configuration of the type Remote JVM Debug. Name it something meaningful. Set the debugger mode option to Listen to remote JVM, and select the Auto restart checkbox. Apply the configuration, then click Debug. Next, start OpenSearch with the debug JVM enabled:
./gradlew run --debug-jvm
The code you want to debug lives in the plugin folder of the project. Specifically, navigate to the following folder:
{PROJECT_DIR}/plugin/src/main/java/org/opensearch/ml/action
Open the class TransportRegisterModelGroupAction in the editor, and create a breakpoint on the first line after the declaration of the method doExecute(). Then trigger it by registering a model group:
POST /_plugins/_ml/model_groups/_register
{
  "name": "amazon_bedrock_models",
  "description": "Model group for Amazon Bedrock models"
}
With Visual Studio Code, you can attach to an OpenSearch instance whose JVM exposes a debug agent on port 5005. The following Docker Compose file starts such an instance:
version: '3'
services:
  opensearch:
    image: opensearchproject/opensearch:2.11.1
    container_name: opensearch
    hostname: opensearch
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "DISABLE_INSTALL_DEMO_CONFIG=true"
      - "DISABLE_SECURITY_PLUGIN=true"
      # A single OPENSEARCH_JAVA_OPTS entry carries both the heap sizing and the
      # JDWP debug agent; two entries with the same key would overwrite each other.
      - "OPENSEARCH_JAVA_OPTS=-Xms2g -Xmx2g -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    ports:
      - 9200:9200
      - 9600:9600
      - 5005:5005
    healthcheck:
      interval: 20s
      retries: 10
      test: ["CMD-SHELL", "curl -s http://localhost:9200"]
networks:
  default:
    name: opensearch_network
Note that the container exposes the JVM debug agent on port 5005 through the -agentlib:jdwp option in OPENSEARCH_JAVA_OPTS. Start this instance by running the command:
docker compose up -d
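Before attaching the debugger, you may want to verify that the debug agent is actually reachable. A small sketch (the helper name is mine):

```python
import socket

def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Once the container is up, the JDWP agent should accept connections:
# is_port_open("localhost", 5005)
```

If this returns False, check that the container is running and that port 5005 is published before blaming the IDE configuration.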
Next, create a file named launch.json in the .vscode folder and add the following JSON code:
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug ML Commons",
      "type": "java",
      "request": "attach",
      "hostName": "localhost",
      "port": 5005
    }
  ]
}
The code you want to debug lives in the plugin folder of the project. Specifically, navigate to the following folder:
{PROJECT_DIR}/plugin/src/main/java/org/opensearch/ml/action
Open the class TransportRegisterModelGroupAction in the editor, and create a breakpoint on the first line after the declaration of the method doExecute(). Go to the Run and Debug section and click the ▶️ button right next to the option Debug ML Commons. Finally, trigger the breakpoint by registering a model group:
POST /_plugins/_ml/model_groups/_register
{
  "name": "amazon_bedrock_models",
  "description": "Model group for Amazon Bedrock models"
}