Troubleshooting the ML Commons Framework

You have implemented models in OpenSearch, but something doesn't seem to be working correctly. To resolve the issue, you need to learn how to troubleshoot the engine that powers the models feature.

Published Dec 20, 2023
In this series, you have been exploring the models feature of OpenSearch. By now, we hope you are aware of its capabilities and are enthusiastic about the incredible possibilities it opens up for building applications. However, it is unrealistic to assume that everything will be perfect. You need to know how to troubleshoot problems that may arise when things don't go as planned.
Here, you will learn a few things about the OpenSearch ML Commons Framework that will help you feel comfortable enough to troubleshoot issues on your own. We hope you won't have to—but as Thor said in the Thor Ragnarok movie: "A wise king never seeks out war. But he must always be ready for it."

ML Commons Framework must-know settings

If there is one pattern in the adoption of any modern software technology, it is that most users struggle with problems related to the default values of important settings. Not knowing which settings exist, what their default values are, and what impact they have when deploying applications is a source of many problems. The ML Commons Framework is no different.
You should spend some time learning which settings are available in the ML Commons Framework and what their default values are. To query all of the plugin's settings and retrieve their default values, you can use the cluster settings API.
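One way to do this, assuming a local cluster reachable at localhost:9200 without security in front of it, is to ask the cluster settings API for the defaults and filter for the plugin's namespace:

```bash
curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
  | grep "plugins.ml_commons"
```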
You should see an output listing every setting under plugins.ml_commons along with its default value.
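Abridged and illustrative (the exact settings and their defaults depend on the version you are running), the filtered output looks roughly like this:

```
"plugins.ml_commons.jvm_heap_memory_threshold" : "85",
"plugins.ml_commons.max_ml_task_per_node" : "10",
"plugins.ml_commons.model_access_control_enabled" : "false",
"plugins.ml_commons.model_auto_redeploy.enable" : "false",
"plugins.ml_commons.native_memory_threshold" : "90",
"plugins.ml_commons.trusted_connector_endpoints_regex" : [ ... ]
```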
As you may have noticed, there is a fair number of settings used by the ML Commons Framework. Some of them are self-explanatory, so I won't go over all of them in detail. Instead, I summarize below the top five settings you may want to know in more detail.
  1. plugins.ml_commons.model_access_control_enabled: models deployed in OpenSearch can be fully controlled with granular roles that you tie to them. This setting enables that behavior, as opposed to allowing anyone to use any model at any time. If you are working with a cluster that has this setting enabled, check whether someone forgot to associate a role with the model; that may explain why you get access errors every time you try to deploy it.
  2. plugins.ml_commons.native_memory_threshold: this setting defines an upper bound for how much RAM (also known as native memory) utilization is tolerated before tasks are no longer allowed to execute. It defaults to 90, which means that if RAM utilization goes over 90%, a circuit breaker stops tasks from being executed. For a really busy OpenSearch cluster that also has to serve search requests, this is something you want to watch out for.
  3. plugins.ml_commons.jvm_heap_memory_threshold: this setting defines an upper bound for how much JVM heap utilization is tolerated before tasks are no longer allowed to execute. It defaults to 85, which means that if JVM heap utilization goes over 85%, a circuit breaker stops tasks from being executed. Keep in mind that the JVM heap may reach this threshold more frequently during peak times. Once garbage collection finishes, heap usage shrinks, but it may fill back up fairly quickly.
  4. plugins.ml_commons.model_auto_redeploy.enable: as you may have learned by now, every model deployment is executed as a task in the OpenSearch cluster. The nodes responsible for executing these tasks can fail at any time, and by default there is no "do it again" behavior; this setting controls that. Setting it to true tells OpenSearch to attempt a redeploy if a model is found not deployed or only partially deployed. This may explain why, even after bouncing your cluster, a model still doesn't work. When this setting is set to true, you can optionally use the property plugins.ml_commons.model_auto_redeploy.lifetime_retry_times to specify how many redeploy attempts should be made.
  5. plugins.ml_commons.trusted_connector_endpoints_regex: this setting controls which endpoints are allowed to handle inference requests. By default, only a small set of endpoints is on the list. If you ever need to use a custom model, you will need to add your endpoint to this list (see the example right after this list). Failing to do so may be the reason why your models show as deployed but always fail to handle inference requests: it simply means your endpoint is not allow-listed.
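As an illustration of how you would change one of these values, the sketch below adds a hypothetical custom endpoint to the trusted connector list through the cluster settings API; the regex and the localhost endpoint are assumptions, so adapt them to your environment. Also keep in mind that assigning a list setting replaces its current value, so include any of the default endpoints you still rely on.

```bash
curl -s -X PUT "http://localhost:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -d '{
    "persistent": {
      "plugins.ml_commons.trusted_connector_endpoints_regex": [
        "^https://my-custom-endpoint\\.example\\.com/.*$"
      ]
    }
  }'
```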
While the settings discussed above have to do with the plugin's behavior and the problems that may arise from not knowing them, the plugins.ml_commons.max_ml_task_per_node setting is a bit trickier, as it has to do with resource utilization. Problems related to resource utilization only arise under certain load conditions and are harder to identify and troubleshoot. In a nutshell, this setting controls how many tasks each ML node is allowed to execute. For small workloads without many concurrent tasks, this won't be a problem. However, think about scenarios where you have few ML nodes and they are responsible for handling a considerable number of tasks.
You may hit the limit imposed by the default value, which is 10. If you need to scale up to more tasks per node, you can increase this setting. However, there is another trick you must be aware of. Tasks are executed as threads, and these threads are taken from a pool. Even if you increase the number of tasks an ML node can handle, you must ensure the thread pool for the specific task type is large enough to afford the amount of concurrency needed. To query the thread pools used by the ML Commons plugin, you can use the _cat thread pool API.
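One way to list them, again assuming a local, unsecured endpoint, is the generic _cat API filtered down to the plugin's pools:

```bash
curl -s "http://localhost:9200/_cat/thread_pool?v&h=node_name,name,type,size,queue_size,active,queue" \
  | grep -E "node_name|ml"
```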
You should see an output listing the plugin's thread pools, along with their type, size, and queue capacity.
Make sure to adjust the size as needed.

Profiling your deployed models

In some cases, users may complain about certain aspects of the application, specifically its rather slow performance. After some initial troubleshooting, you may find that one possible reason for this sluggishness is the calls made to models. A nice way to investigate this further is the Profile API provided by the ML Commons Framework.
To investigate the performance of your models, call the Profile API.
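A minimal call that profiles all deployed models looks like this (the localhost endpoint is an assumption):

```bash
curl -s "http://localhost:9200/_plugins/_ml/profile/models?pretty"
```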
You should see an output describing each deployed model on each node.
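Abridged and illustrative (node and model IDs are placeholders and the numbers are invented for the sake of the example), the response has roughly this shape:

```json
{
  "nodes": {
    "qTduw0FJTrmGrqMrxH0dcA": {
      "models": {
        "WWQI44MBbzI2oUKAvNUt": {
          "model_state": "DEPLOYED",
          "predict_request_stats": {
            "count": 100,
            "max": 113.21,
            "min": 38.40,
            "average": 62.43,
            "p50": 59.25,
            "p90": 72.64,
            "p99": 112.81
          },
          "model_inference_stats": {
            "count": 100,
            "max": 109.39,
            "min": 36.88,
            "average": 59.71,
            "p50": 56.80,
            "p90": 70.12,
            "p99": 108.49
          }
        }
      }
    }
  }
}
```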
Note the hierarchical structure of this output. The analysis is broken down on a per-node basis, followed by a per-model basis. Then, for each deployed model, there are two groups: model_inference_stats and predict_request_stats. The former deals with the actual inferences executed by the model, whereas the latter deals with the predict requests made to the model. Your troubleshooting exercise should consider the computed metrics for each group, given the number of requests displayed in the count field. This should give you a good idea of whether the models are indeed the culprit.
You may notice a discrepancy between the value reported in the count field and the actual number of requests executed. This happens because the Profile API only monitors the last 100 requests. To change the number of monitored requests, update the cluster setting that controls the monitoring window.
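If I recall the plugin's settings correctly, the setting in question is plugins.ml_commons.monitoring_request_count; the sketch below raises the monitoring window to 1,000 requests (verify the setting name against the documentation for your version):

```bash
curl -s -X PUT "http://localhost:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -d '{
    "persistent": {
      "plugins.ml_commons.monitoring_request_count": 1000
    }
  }'
```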

Profiling your search requests

Searching data with OpenSearch is more complex than querying a relational database. The reason lies in OpenSearch's shared-nothing architecture, which distributes documents across various shards. Consequently, when a search request is initiated, its execution becomes more intricate because there is no way to know in advance which documents will match the query or where they are stored. This is the reason OpenSearch applies the query-then-fetch approach. In a nutshell, here is how it works.
In the initial query phase, the query is sent to each shard in the index. Each shard performs the search and generates a queue of matching documents. This helps identify the documents that meet the search criteria. However, we still need to retrieve the actual documents themselves in the fetch phase. In this phase, the coordinating node decides which documents to fetch. These documents may come from one or multiple shards involved in the original search. The coordinating node sends a request to the relevant shard copy, which then loads the document bodies into the _source field. Once the coordinating node has gathered all the results, it combines them into a unified response to send back to the client.
Executing search requests in OpenSearch can be complicated because it is a complex distributed system. Various parts of the system can fail or become slow and result in poor performance. This means you need to have something in your pocket for when performance issues occur, and if you integrate models with search requests, they surely can. For instance, in part three of this series, you saw that you can leverage models in conjunction with neural queries to create really amazing content experiences out of your data. If you ever find yourself suspecting that models may be slowing down your searches, you can leverage the Profile API to troubleshoot your search requests.
Getting started with the Profile API is quite simple: just add "profile": true to your search request body.
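Here is a sketch using a neural query; the index name, vector field, query text, and model ID are all hypothetical, so replace them with your own (any query type can be profiled the same way):

```bash
curl -s -X GET "http://localhost:9200/my-index/_search?pretty" \
  -H "Content-Type: application/json" \
  -d '{
    "profile": true,
    "query": {
      "neural": {
        "my_embedding_field": {
          "query_text": "troubleshooting slow model-backed searches",
          "model_id": "<your_model_id>",
          "k": 5
        }
      }
    }
  }'
```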
You should receive a response that carries profiling data for the query.
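Abridged and illustrative (timings, shard IDs, and query types will differ in your cluster), the relevant portion of the response looks roughly like this:

```json
{
  "took": 32,
  "profile": {
    "shards": [
      {
        "id": "[qTduw0FJTrmGrqMrxH0dcA][my-index][0]",
        "searches": [
          {
            "query": [
              {
                "type": "KNNQuery",
                "description": "",
                "time_in_nanos": 2849185,
                "breakdown": {
                  "build_scorer": 1431045,
                  "score": 905143,
                  "next_doc": 399832
                }
              }
            ],
            "rewrite_time": 14670,
            "collector": [
              {
                "name": "SimpleTopScoreDocCollector",
                "reason": "search_top_hits",
                "time_in_nanos": 1370559
              }
            ]
          }
        ]
      }
    ]
  }
}
```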
Note how the response comes back with an additional field called profile containing some interesting data about the execution of the individual components of the search request. Analyzing this data allows you to debug slower requests and understand how to improve their performance. The trick here is to cross-reference the time taken by the models with the time spent in the actual search execution. The time taken by the models can be measured with the profiling approach from the previous section.

Not working or not available to you?

There is one troubleshooting technique that must be your first instinct when dealing with problems reported by developers using OpenSearch: always check the HTTP code returned. As you may know, everything OpenSearch does is provided to developers via REST APIs. For this reason, there will always be an HTTP code for you to check. This is important because, depending on the HTTP code returned, you may save hours of troubleshooting just by figuring out that an error may not actually be an error.
A good example is a request that looks like it failed but was, in reality, sent by a user who has no permission for it. If you receive a 401 or 403 HTTP code, the request was successful up to the point where the user credentials were verified and the user permissions were checked. This is actually good news, since you won't have to investigate the alleged error. You just need to verify whether the resource being used should or should not be available to that user.
For instance, consider the support the ML Commons Framework provides for model access control. It may explain why every attempt by a user to register or deploy a model is failing. It may be the case that the model is trying to use a model group whose access mode is restricted, or one that is intentionally not visible to that user. This can happen because a model group can be created with access restricted to certain users, using organizational conventions called backend_roles that prevent other users from accessing it. To illustrate this, see the group model_group_test below.
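A sketch of how such a group could be created with the model group registration API follows; the description is made up, and the access mode and backend roles mirror the scenario just described:

```bash
curl -s -X POST "http://localhost:9200/_plugins/_ml/model_groups/_register" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "model_group_test",
    "description": "Model group restricted to selected backend roles",
    "access_mode": "restricted",
    "backend_roles": ["data_scientists", "admins"]
  }'
```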
Here, any developer who tries to deploy a model belonging to the model group model_group_test and who is not part of the data_scientists or admins roles won't be able to complete the deployment request successfully.

When all else fails, debug the code

As stated in the beginning of this blog post, it is unrealistic to assume that everything will be perfect. If you went through all sections and still find yourself without a clue about any issues with the ML Commons Framework, you can pursue one last option:
🐞 Debugging the source code for the project.
Now, I understand if you feel uncomfortable with this if you are not a software engineer. But hopefully the instructions below will guide you in the right direction so you can accomplish this fearsome task, and I believe your efforts will pay off in the end. Debugging the source code of the ML Commons Framework is the best way for you to understand, at an implementation level, the behavior that may be haunting your applications.
Before moving further, make sure that you take care of the following dependencies:
Once the dependencies are taken care of, you can fork the project on GitHub. Go to the project URL and fork the project. Then, retrieve the URL of your fork so you can clone it locally.
With the project URL of your fork, you can start cloning the project locally. There are many ways to do this, including doing it in a terminal using the git command. However, for this debugging exercise, you will need to use an IDE to watch the execution of the code. For this reason, it is a better idea to start the cloning process with your IDE. I will show you examples using both IntelliJ IDEA and Visual Studio Code.

IntelliJ IDEA

With IntelliJ, once you clone your project fork, the IDE will automatically trigger the execution of the Gradle build, which gets the project ready to be used. This process may take some time, depending on your computer's resources. Give it time to finish. Then, you can start configuring remote debugging.
Create a new Run/Debug configuration of the type Remote JVM Debug. Give it a meaningful name. Set the debugger mode option to Listen to remote JVM, and select the Auto restart check box. Apply the configuration, then click Debug.
This will keep IntelliJ in listening mode, waiting for the JVM with the debug agent to start. For this, you will need to start an instance of OpenSearch containing the ML Commons Framework. The good news is that the project contains everything you need for this. Just open a new terminal at the root of the project.
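The command below is most likely the one you want; --debug-jvm is the standard Gradle flag the OpenSearch build uses to start the node with a debug agent pointed at port 5005 (check the project's developer guide if your version behaves differently):

```bash
./gradlew run --debug-jvm
```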
It may take a while for the code to finish building and the instance to start. But once it does, you will be ready to start your debugging exercise. At this point, any breakpoint you set in the source code will be hit by the debugger once the code reaches that point.
Ideally, you would know the source code off the top of your head before starting a debugging exercise. After all, you must know where to look if you suspect something about the codebase. But you don't need to spend lots of time studying the ML Commons Framework source code. You can start with the actions that are triggered every time you send a REST command to train, deploy, and run inferences with models. These actions can be found in the plugin module of the project.
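At the time of writing, these transport actions live under the path below; if the layout has changed in your checkout, searching for the class TransportRegisterModelGroupAction will get you there:

```
plugin/src/main/java/org/opensearch/ml/action
```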
There, you will find packages containing all the entities you are likely to be familiar with. For this example, let's see how you could debug a request to register a new model group. Open the Java class TransportRegisterModelGroupAction in the editor, and set a breakpoint on the first line after the declaration of the doExecute() method.
Now you can send a REST API call to OpenSearch to register a new model group.
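Any registration request will do; here is a minimal, hypothetical example (the name and description are made up, and the endpoint assumes the local cluster started by Gradle):

```bash
curl -s -X POST "http://localhost:9200/_plugins/_ml/model_groups/_register" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my_debug_model_group",
    "description": "Model group created to exercise the breakpoint"
  }'
```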
IntelliJ will catch the exact moment the JVM executes that request and stop the code for you right where you set the breakpoint.
🎥 Here is an end-to-end demo of the instructions given so far for you to follow along.

Visual Studio Code

With Visual Studio Code (VSCode), once you clone your project fork, the IDE will automatically trigger the execution of the Gradle build, which gets the project ready to be used. This process may take some time, depending on your computer's resources. Give it time to finish. Then, you can start configuring remote debugging.
Because VSCode's Java debugger has limited options compared to IntelliJ, you will need a different approach to attach the debugger to the remote JVM. VSCode doesn't allow you to listen for a remote JVM, which causes the Gradle build to fail because it won't find anything listening on port 5005.
As such, you will need to configure your own OpenSearch instance with remote debugging enabled. The easiest way to create a new instance of OpenSearch is with Docker, using a Docker Compose file.
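A minimal sketch of such a file is shown below; the image tag, heap size, and disabled security plugin are assumptions made for a local debugging setup, so match the version to the branch you cloned and harden it as needed:

```yaml
# docker-compose.yml -- minimal single-node setup for remote debugging
services:
  opensearch:
    image: opensearchproject/opensearch:2.11.1   # match the version of your ML Commons checkout
    environment:
      - discovery.type=single-node
      - DISABLE_SECURITY_PLUGIN=true             # local debugging only; never do this in production
      # The JDWP agent below is what makes the JVM accept debugger connections on port 5005
      - "OPENSEARCH_JAVA_OPTS=-Xms2g -Xmx2g -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005"
    ports:
      - "9200:9200"   # REST API
      - "5005:5005"   # remote debugging port
```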
This file defines one OpenSearch instance configured to accept debugger connections over port 5005, through the JDWP agent added to OPENSEARCH_JAVA_OPTS. Start the instance with Docker Compose.
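(If your Docker installation ships the standalone docker-compose binary instead of the compose plugin, use that instead.)

```bash
docker compose up -d
```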
Now, create a new file called launch.json in the .vscode folder and add a debug configuration that attaches to the running JVM.
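A minimal configuration that attaches VSCode's Java debugger to that port could look like this; it assumes the Java extension pack is installed, and the configuration name Debug ML Commons is the one referenced in the next step:

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "java",
      "name": "Debug ML Commons",
      "request": "attach",
      "hostName": "localhost",
      "port": 5005
    }
  ]
}
```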
Ideally, you would know the source code off the top of your head before starting a debugging exercise. After all, you must know where to look if you suspect something about the codebase. But you don't need to spend lots of time studying the ML Commons Framework source code. You can start with the actions that are triggered every time you send a REST command to train, deploy, and run inferences with models. These actions can be found in the plugin module of the project.
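As before, at the time of writing these transport actions live under the path below; search for TransportRegisterModelGroupAction if the layout has changed:

```
plugin/src/main/java/org/opensearch/ml/action
```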
There, you will find packages containing all the entities you are likely to be familiar with. For this example, let's see how you could debug a request to register a new model group. Open the Java class TransportRegisterModelGroupAction in the editor, and set a breakpoint on the first line after the declaration of the doExecute() method.
You are now ready to configure VSCode to attach its debugger to OpenSearch. Go to the Run and Debug section and click the ▶️ button next to the Debug ML Commons option.
Now you can send a REST API call to OpenSearch to register a new model group.
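The same minimal, hypothetical request used in the IntelliJ walkthrough works here too, this time against the Docker instance:

```bash
curl -s -X POST "http://localhost:9200/_plugins/_ml/model_groups/_register" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my_debug_model_group",
    "description": "Model group created to exercise the breakpoint"
  }'
```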
VSCode will catch the exact moment the JVM executes that request and stop the code for you right where you set the breakpoint.
🎥 Here is an end-to-end demo of the instructions given so far for you to follow along.

Summary

The models feature opens the door to exciting use cases where data can be magnified by the power of ML models and Generative AI. Combined with the simplicity of OpenSearch, it enables your teams to create cutting-edge applications with very little effort.
I hope you have enjoyed reading this series. Please make sure to share this content within your social media circles so others can benefit from it as well. If you want to discover more about the amazing world of Generative AI, take a look at this space and don't forget to subscribe to the AWS Developers YouTube channel. I'm sure you will be amazed by the new content to come. Finally, follow me on LinkedIn if you want to geek out about technologies in general.
See you, next time!
