Sphractal is a package that provides functionality to estimate the fractal dimension of complex 3D surfaces formed from overlapping spheres via a box-counting algorithm.
Background
Atomistic objects in the molecular and nanosciences are often represented as collections of spheres whose radii correspond to the atomic radii of the individual components.
Some examples of these objects (covering both fine- and coarse-grained representations of the individual components)
are small molecules, proteins, nanoparticles, polymers, and porous materials such as zeolites and metal-organic frameworks (MOFs).
The overall properties of these objects are often significantly influenced by their surface properties, in particular the surface area available for interaction with other entities, which is related to the surface roughness.
Fractal dimension allows the surface complexity/roughness of objects to be measured quantitatively.
The fractal dimension can be estimated by applying the box-counting algorithm to surfaces represented either as approximated point clouds that are subsequently voxelised, or as mathematically exact surfaces.
Features
Aims
Representation of the surface as either voxelised point clouds or mathematically exact surfaces.
Efficient algorithm for 3D box-counting calculations.
Customisable parameters to control the level of detail and accuracy of the calculation.
Installation
Use pip or conda to install Sphractal:
pip install sphractal
conda install -c conda-forge sphractal
Special Requirement for Point Cloud Surface Representation
Sphractal requires an executable compiled from another freely available repository for the functionality related to the voxelised point cloud surface representation to operate properly.
This can be set up by:
Downloading the source code from the repository to a directory of your choice:
git clone https://github.com/jon-ting/fastbc.git
Compiling the code into an executable file (this works on any operating system) using either of the compilation routes described in the repository's README.md. The route you choose determines whether the box-counting algorithm runs with GPU acceleration. Feel free to rename the output file from the compilation.
(Optional) Setting the path to the compiled file as an environment variable accessible to Python (replace <PATH_TO_FASTBC> with the absolute path to the executable you just built); alternatively, you can always pass the path to the compiled file to the relevant functions:
export FASTBC=<PATH_TO_FASTBC>
Note that for the environment variable to be persistent (to still exist after the terminal is closed), the line should be added to your ~/.bashrc.
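For example, the export line can be appended to ~/.bashrc and picked up in the current shell (the placeholder path is left unchanged):
echo 'export FASTBC=<PATH_TO_FASTBC>' >> ~/.bashrc
source ~/.bashrc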
Usage
from sphractal import getExampleDataPath, runBoxCnt

inpFile = getExampleDataPath()  # Replace with the path to your xyz or lmp file
boxCntResults = runBoxCnt(inpFile)
The package was created with cookiecutter using the
py-pkgs-cookiecutter template.
Speeding up the inner functions via just-in-time compilation with Numba was inspired by advice received during the NCI-NVIDIA Open Hackathon 2023.
Using risset to install plugins also ensures integration
with other tools like CsoundQt. Risset can also be used
to show manual pages, list opcodes, etc.
Manual Installation
Plugins can be manually downloaded from the releases page:
You will probably have to overcome Apple’s security mechanism to use the plugins.
Right-click on each plugin and choose “Open with Terminal”. Confirm “Open” in the dialog panel.
After installing the toolchain, on Unix systems, you need to source a file that will export the environment variables. This file is generated by espup and is located in your home directory by default. There are different ways to source the file:
Source this file in every terminal:
Source the export file: . $HOME/export-esp.sh
This approach requires running the command in every new shell.
Create an alias for executing the export-esp.sh:
Copy and paste the following command to your shell’s profile (.profile, .bashrc, .zprofile, etc.): alias get_esprs='. $HOME/export-esp.sh'
Refresh the configuration by restarting the terminal session or by running source [path to profile], for example, source ~/.bashrc.
This approach requires running the alias in every new shell.
Add the environment variables to your shell profile directly:
Add the content of $HOME/export-esp.sh to your shell’s profile: cat $HOME/export-esp.sh >> [path to profile], for example, cat $HOME/export-esp.sh >> ~/.bashrc.
Refresh the configuration by restarting the terminal session or by running source [path to profile], for example, source ~/.bashrc.
Important
On Windows, environment variables are automatically injected into your system and don’t need to be sourced.
Usage
Usage: espup <COMMAND>
Commands:
completions Generate completions for the given shell
install Installs Espressif Rust ecosystem
uninstall Uninstalls Espressif Rust ecosystem
update Updates Xtensa Rust toolchain
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
Usage: espup completions [OPTIONS] <SHELL>
Arguments:
<SHELL> Shell to generate completions for [possible values: bash, zsh, fish, powershell, elvish, nushell]
Options:
-l, --log-level <LOG_LEVEL> Verbosity level of the logs [default: info] [possible values: debug, info, warn, error]
-h, --help Print help
Install Subcommand
Note
Xtensa Rust destination path
Installation paths can be modified by setting the environment variables CARGO_HOME and RUSTUP_HOME before running the install command. By default, toolchains will be installed under <rustup_home>/toolchains/esp, although this can be changed using the -a/--name option.
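For example, to place everything under a custom prefix, set both variables before installing (the directories below are illustrative, not defaults):
export RUSTUP_HOME=/opt/esp/rustup
export CARGO_HOME=/opt/esp/cargo
espup install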
Note
GitHub API
During the installation process, several GitHub API queries are made, and these are subject to rate limits. The number of queries should not hit the limit unless you run the espup install command many times in a short span of time. We recommend setting the GITHUB_TOKEN environment variable when using espup in CI (ideally via the xtensa-toolchain action), and making sure GITHUB_TOKEN is not set when using espup on a host machine. See esp-rs/xtensa-toolchain#15 for more details.
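In a CI job this typically amounts to exporting the token (here assumed to come from your CI secret store) before invoking espup:
# in your CI configuration, before running espup install
export GITHUB_TOKEN=<token from your CI secret store>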
Usage: espup install [OPTIONS]
Options:
-d, --default-host <DEFAULT_HOST>
Target triple of the host
[possible values: x86_64-unknown-linux-gnu, aarch64-unknown-linux-gnu, x86_64-pc-windows-msvc, x86_64-pc-windows-gnu, x86_64-apple-darwin, aarch64-apple-darwin]
-r, --esp-riscv-gcc
Install Espressif RISC-V toolchain built with crosstool-NG
Only install this if you don't want to use the system's RISC-V toolchain
-f, --export-file <EXPORT_FILE>
Relative or full path for the export file that will be generated. If no path is provided, the file will be generated under home directory (https://docs.rs/dirs/latest/dirs/fn.home_dir.html)
[env: ESPUP_EXPORT_FILE=]
-e, --extended-llvm
Extends the LLVM installation.
This will install the whole LLVM instead of only installing the libs.
-l, --log-level <LOG_LEVEL>
Verbosity level of the logs
[default: info]
[possible values: debug, info, warn, error]
-a, --name <NAME>
Xtensa Rust toolchain name
[default: esp]
-b, --stable-version <STABLE_VERSION>
Stable Rust toolchain version.
Note that only RISC-V targets use stable Rust channel.
[default: stable]
-k, --skip-version-parse
Skips parsing Xtensa Rust version
-s, --std
Only install toolchains required for STD applications.
With this option, espup will skip GCC installation (it will be handled by esp-idf-sys), hence you won't be able to build no_std applications.
-t, --targets <TARGETS>
Comma or space separated list of targets [esp32,esp32c2,esp32c3,esp32c6,esp32h2,esp32s2,esp32s3,esp32p4,all]
[default: all]
-v, --toolchain-version <TOOLCHAIN_VERSION>
Xtensa Rust toolchain version
-h, --help
Print help (see a summary with '-h')
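For illustration only, a typical invocation that installs a subset of targets and writes the export file to a chosen location (both values are placeholders) might look like:
espup install --targets esp32,esp32s3 --export-file ~/export-esp.sh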
Uninstall Subcommand
Usage: espup uninstall [OPTIONS]
Options:
-l, --log-level <LOG_LEVEL> Verbosity level of the logs [default: info] [possible values: debug, info, warn, error]
-a, --name <NAME> Xtensa Rust toolchain name [default: esp]
-h, --help Print help
Update Subcommand
Usage: espup update [OPTIONS]
Options:
-d, --default-host <DEFAULT_HOST>
Target triple of the host
[possible values: x86_64-unknown-linux-gnu, aarch64-unknown-linux-gnu, x86_64-pc-windows-msvc, x86_64-pc-windows-gnu, x86_64-apple-darwin, aarch64-apple-darwin]
-f, --export-file <EXPORT_FILE>
Relative or full path for the export file that will be generated. If no path is provided, the file will be generated under home directory (https://docs.rs/dirs/latest/dirs/fn.home_dir.html)
[env: ESPUP_EXPORT_FILE=]
-e, --extended-llvm
Extends the LLVM installation.
This will install the whole LLVM instead of only installing the libs.
-l, --log-level <LOG_LEVEL>
Verbosity level of the logs
[default: info]
[possible values: debug, info, warn, error]
-a, --name <NAME>
Xtensa Rust toolchain name
[default: esp]
-b, --stable-version <STABLE_VERSION>
Stable Rust toolchain version.
Note that only RISC-V targets use stable Rust channel.
[default: stable]
-k, --skip-version-parse
Skips parsing Xtensa Rust version
-s, --std
Only install toolchains required for STD applications.
With this option, espup will skip GCC installation (it will be handled by esp-idf-sys), hence you won't be able to build no_std applications.
-t, --targets <TARGETS>
Comma or space separated list of targets [esp32,esp32c2,esp32c3,esp32c6,esp32h2,esp32s2,esp32s3,all]
[default: all]
-v, --toolchain-version <TOOLCHAIN_VERSION>
Xtensa Rust toolchain version
-h, --help
Print help (see a summary with '-h')
Enable Tab Completion for Bash, Fish, Zsh, or PowerShell
espup supports generating completion scripts for Bash, Fish, Zsh, and
PowerShell. See espup help completions for full details, but the gist is as
simple as using one of the following:
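For example, for Bash you could write the generated script to a location picked up by bash-completion; this assumes the script is printed to standard output, and the destination path is only illustrative:
espup completions bash > ~/.local/share/bash-completion/completions/espup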
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
The system must be available only to authorized users. If a user is not
authorized, any attempt to open a page must redirect them to the login
page. On the login page the user can enter a login and password to sign in,
or register. Registration requires the following fields: full name, login, and password.
Each authorized user has their own phone book, i.e. each
user sees only the entries they created themselves.
Important (mandatory requirements)
An admin panel for managing users is not required.
Phone numbers must be validated and conform to the Ukrainian format, for example: +380(66) 1234567
The application must include JUnit tests covering the code as thoroughly as possible.
Using Mockito is encouraged.
The project must be built with Maven.
Spring Boot must be used to run the application.
All application settings must reside in a properties file whose path is
passed as a JVM argument (-Dlardi.conf=/path/to/file.properties); see the launch example below.
The configuration file specifies the storage type. The storage type is read once
at JVM startup (changes to the configuration file take effect only after the JVM
is restarted).
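A minimal sketch of launching the packaged application with the configuration path passed as a JVM argument (the JAR name is illustrative):
java -Dlardi.conf=/path/to/file.properties -jar phonebook.jar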
Implement at least two storage variants: a DBMS (MySQL) and a file store (XML/
JSON/CSV, your choice).
The storage settings must be specified in the configuration file (host and user for the DBMS, or a file path for the file store).
For the file store, if the file(s) do not exist, they must be created.
For the DBMS store, the README.md file must contain the SQL query that creates all
required tables.
Data validation must be performed on the server side.
The application must have a clear logical separation between the presentation, the logic, and
the data source.
apollo-server doesn’t ship with any comprehensive logging, and instead offloads that responsibility to the user's resolvers or context handler. This module provides uniform logging for the entire GraphQL request lifecycle, as provided by plugin hooks in apollo-server. The console/terminal result will resemble the image below:
This module requires an Active LTS Node version (v10.23.1+).
Install
Using npm:
npm install apollo-log
Usage
Setting up apollo-log is straightforward. Import and call the plugin function, passing any desired options, and pass the plugin in an array to apollo-server.
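A minimal sketch, assuming the plugin export is named ApolloLogPlugin and showing only the prefix and timestamp options documented below (check the package's type definitions for the exact signature):
import { ApolloServer, gql } from 'apollo-server';
import { ApolloLogPlugin } from 'apollo-log'; // assumed export name

// Trivial schema so the example is self-contained.
const typeDefs = gql`
  type Query {
    ping: String
  }
`;
const resolvers = { Query: { ping: () => 'pong' } };

// Pass the plugin in an array to apollo-server, as described above.
const plugins = [ApolloLogPlugin({ prefix: 'apollo', timestamp: true })];

const server = new ApolloServer({ typeDefs, resolvers, plugins });
server.listen().then(({ url }) => console.log(`Server ready at ${url}`));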
events
Specifies which Apollo lifecycle events will be logged. The requestDidStart event is always logged; by default, didEncounterErrors and willSendResponse are also logged.
mutate
Type: Function
Default: (data: Record<string, string>) => Record<string, string>
If specified, allows inspecting and mutating the data logged to the console for each message.
prefix
Type: String
Default: apollo
Specifies a prefix, colored by level, prepended to each log message.
timestamp
Type: Boolean
If true, will prepend a timestamp in the HH:mm:ss format to each log message.
End to end encryption message processor which can be used to encrypt messages between two or more devices running on any platform and sent over any messaging protocol.
It uses an external library for the implementation of the Double Ratchet Algorithm to handle session creation and key exchange (https://matrix.org/docs/projects/other/olm).
The key exchange and message format are loosely based on the OMEMO protocol, which utilises 128-bit AES-GCM. Although OMEMO is an extension of the XMPP protocol, it doesn’t require XMPP as the transmission medium. The message format is output as JSON and can be reconfigured for transmission at the developer’s discretion.
The LocalStorage interface will need to be implemented in order to provide a means of storing the sessions. In the example below we’ve used node-localstorage, which is sufficient for our needs; other situations may require a different storage mechanism, so the implementation is left to the developer.
Here’s a contrived example simulating sending a message between Alice and Bob:
import { LocalStorage } from 'node-localstorage';
import { OmemoManager } from 'e2ee-msg-processor-js';
(async () => {
await OmemoManager.init();
const aliceLocalStorage = new LocalStorage('./local_storage/aliceStore');
const aliceOmemoManager = new OmemoManager('alice', aliceLocalStorage);
const bobLocalStorage = new LocalStorage('./local_storage/bobStore');
const bobOmemoManager = new OmemoManager('bob', bobLocalStorage);
//bundle and device id need to be published via XMPP pubsub or an equivalent service so that they are available to Alice and other clients' devices wishing to communicate with Bob
const bobsBundle = bobOmemoManager.generateBundle();
aliceOmemoManager.processDevices('bob', [bobsBundle]);
//This message object can be mapped to an XMPP send query or just sent as JSON over TLS or some other secure channel.
const aliceToBobMessage = await aliceOmemoManager.encryptMessage('bob', 'To Bob from Alice');
//Bob will then receive the message and process it
const aliceDecrypted = await bobOmemoManager.decryptMessage(aliceToBobMessage);
console.log(aliceDecrypted);
//Bob can then reply without the need for a key bundle from Alice
const bobToAliceMessage = await bobOmemoManager.encryptMessage('alice', 'To Alice from Bob');
const bobDecrypted = await aliceOmemoManager.decryptMessage(bobToAliceMessage);
console.log(bobDecrypted);
})();
WARNING: THIS LIBRARY IS UNTESTED AND THEREFORE INSECURE. USE AT YOUR OWN RISK.
If you’re a cryptography researcher then please by all means try and break this and submit an issue or a PR.
Model Storage: The trained model should be stored with a unique version, along with its hyperparameters and accuracy, in a storage solution like S3. This requires extending the Python script to persist this information.
Scheduled Training: The model should be trained daily, with a retry mechanism for failures and an SLA defined for the training process.
Model Quality Monitoring: Model accuracy should be tracked over time, with a dashboard displaying weekly average accuracy.
Alerting: Alerts should be configured for:
If the latest model was generated more than 36 hours ago.
If the model accuracy exceeds a predefined threshold.
Model API Access: The model should be accessible via an API for predictions. The API should pull the latest model version whenever available.
Architecture Diagram
The architecture is designed to orchestrate the MLOps lifecycle across multiple components, as shown in the diagram.
Component Descriptions
GitHub: Repository for storing the source code of the machine learning pipeline. It supports CI/CD to deploy the pipeline to the target environment.
Kubernetes: Container orchestrator that runs and manages all pipeline components.
Kubeflow: Manages and schedules the machine learning pipeline, deployed on Kubernetes.
MLFlow: Tracks experiments and serves as a model registry.
Minio: S3-compatible object storage for training datasets and MLFlow model artifacts.
MySQL: Backend database for MLFlow, storing information on experiments, runs, metrics, and parameters.
KServe: Exposes the trained model as an API, allowing predictions at scale over Kubernetes.
Grafana: Generates dashboards for accuracy metrics and manages alerting.
Slack: Receives notifications for specific metrics and alerts when data is unavailable.
System Workflow
Pipeline Development: An ML Engineer creates or modifies the training pipeline code in GitHub.
CI/CD Deployment: CI/CD tests the pipeline, and once cleared, deploys it to Kubeflow with a user-defined schedule.
Pipeline Execution:
The pipeline is triggered on schedule, initiating the sequence of tasks.
Data Fetching: Raw data is read from Minio.
Preprocessing: Data is preprocessed and split into training and validation/test sets.
Model Training: The model is trained, with hyperparameters, metrics, and model weights stored in MLFlow.
Deployment: The trained model is deployed via KServe, making it available as an API on Kubernetes.
Notifications: Slack notifications are triggered if pipeline metrics exceed defined thresholds (e.g., accuracy > 95%).
Monitoring and Alerting:
Grafana Dashboard: Utilizes MLFlow’s MySQL database to visualize metrics, such as model accuracy.
Slack Alerts: Alerts are sent to Slack if no model has been updated within the last 36 hours.
Implementation Details
CI/CD: GitHub is shown in the architecture as the source and CI/CD provider, but it is not fully implemented here as my local resources could not connect with GitHub Actions (self-hosted runner setup was not used).
Training Pipeline SLA Dimensions
These SLA dimensions represent potential enhancements that could further improve the pipeline’s reliability and performance:
Training Frequency: Ideally, the pipeline should train the model daily to maintain relevance. Improved scheduling would enhance consistency.
Retry Mechanism: Implementing retries for errors would improve resilience.
Execution Time Limits: Adding maximum execution times for training runs would prevent long-running processes and increase efficiency.
Availability: Regular model updates (e.g., every 24-36 hours) would improve reliability, with alerts for delayed runs providing faster issue resolution.
Alerting: Alerts for accuracy deviations would aid in quicker troubleshooting.
Resource Usage: Resource limits for CPU and memory would optimize system performance and prevent overuse.
Data Freshness: Ensuring that input data is frequently updated would enhance model quality.
Enhanced Monitoring: Tracking additional metrics like accuracy trends, execution times, and resource utilization would improve insight into pipeline performance.
These are aspirational improvements that would make the pipeline more robust and production-ready. The current implementation covers some aspects and could be improved if sufficient time and resources become available.
Setup Guide
This guide provides detailed instructions for setting up an MLOps environment using Minikube, Kubeflow, MLflow, KServe, Minio, Grafana, and Slack for notifications. It covers prerequisites, environment setup, and necessary configurations to deploy and monitor a machine learning pipeline.
This guide assumes you have a basic understanding of Kubernetes, Docker and Python.
Python3.8+ and Docker should be installed on your system before starting the guide.
Moreover, a Slack namespace is required with permissions to setup the App and Generate webhooks for Alerts setup.
Pre-requisites
Install Minikube
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube && rm minikube-linux-amd64
# at least 4 CPUs, 8 GB RAM, and 40 GB of disk space
minikube start --cpus 4 --memory 8096 --disk-size=40g
Link kubectl from Minikube if it's not already installed:
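One common approach (an assumption here, not the only option) is to alias kubectl to the binary bundled with Minikube:
alias kubectl="minikube kubectl --"
kubectl version --client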
export PIPELINE_VERSION=2.3.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION"
kubectl set env deployment/ml-pipeline-ui -n kubeflow DISABLE_GKE_METADATA=true
kubectl wait --for=condition=ready pod -l application-crd-id=kubeflow-pipelines --timeout=1000s -n kubeflow
# You can Ctrl+C if all pods except proxy-agent are running
Note : Allow up to 20 minutes for all Kubeflow pods to be running. Verify the status:
kubectl get pods -n kubeflow
Ensure that all pods are running (except proxy-agent, which is not applicable here as it serves as a proxy for connecting to Google Cloud SQL and uses GCP metadata).
Take note of the address to access ml-pipeline-ui as you’ll use it for pipeline setup.
MLflow Setup
Build and push the MLflow-with-MySQL Docker image to Docker Hub (only needed if your system is not linux/amd64; otherwise skip this step, as the default syedshameersarwar/mlflow-mysql:v2.0.1 image works on linux/amd64):
cd ./mlflow
docker build -t mlflow-mysql .
docker login
# create docker repository on docker hub for storing the image
docker tag mlflow-mysql:latest <dockerhub-username>/<docker-repository>:latest
docker push <dockerhub-username>/<docker-repository>:latest
# Make sure to update mlflow.yaml to reference the pushed image
Deploy MLflow Components :
kubectl create ns mlflow
kubectl apply -f pvc.yaml -n mlflow
# Update secret.yaml with desired base64-encoded MySQL and Minio credentials, defaults are provided in the file
kubectl apply -f secret.yaml -n mlflow
kubectl apply -f minio.yaml -n mlflow
kubectl apply -f mysql.yaml -n mlflow
# Check if MySQL and Minio pods are running
kubectl get pods -n mlflow
# if all pods are running, proceed to the next step
kubectl apply -f mlflow.yaml -n mlflow
Verify Deployment : Check if MLflow pod is running.
kubectl get pods -n mlflow
Expose MLflow Service via Minikube :
minikube service mlflow-svc -n mlflow
Note the address to access MLflow UI.
KServe Setup
Clone and Install KServe :
cd ..
git clone https://github.com/kserve/kserve.git
cd kserve
./hack/quick_install.sh
Verify Installation : Check all necessary pods:
kubectl get pods -n kserve
kubectl get pods -n knative-serving
kubectl get pods -n istio-system
kubectl get pods -n cert-manager
# if all pods are running, go back to the root directory
cd ..
Configure Service Account and Cluster Role :
Copy the Minio credentials (base64-encoded) from MLflow's secret.yaml to sa/kserve-mlflow-sa.yaml.
The user field will populate AWS_ACCESS_KEY_ID in sa/kserve-mlflow-sa.yaml.
The secretkey field will populate AWS_SECRET_ACCESS_KEY in sa/kserve-mlflow-sa.yaml.
Leave the region field as it is. (base64 encoded for us-east-1).
Apply the service account and cluster role:
# allow kserve to access Minio
kubectl apply -f sa/kserve-mlflow-sa.yaml
# allow Kubeflow to access kserve and create inferenceservices
kubectl apply -f sa/kserve-kubeflow-clusterrole.yaml
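To reach the Minio console referenced in the next step, port-forward the Minio console service from the mlflow namespace; the service name depends on what minio.yaml defines, so it is shown here as a placeholder:
# replace <minio-service> with the Minio service name defined in minio.yaml
kubectl port-forward svc/<minio-service> -n mlflow 9001:9001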
Access localhost:9001 with the (base64-decoded) credentials you set up in MLflow's secret.yaml and create two buckets:
data
mlflow
Then, navigate to Object Browser -> mlflow -> Create a new path called experiments.
Now upload the iris.csv dataset to the data bucket.
You can close the port-forwarding once done by pressing Ctrl+C.
Slack Notifications
Create a Slack App for notifications:
Follow the Slack API Quickstart Guide and obtain a webhook URL for your Slack workspace. (You can skip the invite part of step 3, and step 4 entirely.)
Test with curl as shown in the Slack setup guide.
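For reference, the webhook can be tested with a curl call like the following (replace the URL with the webhook obtained above):
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Hello from the iris pipeline"}' \
  https://hooks.slack.com/services/XXXX/YYYY/ZZZZ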
Pipeline Compilation and Setup
Compile Pipeline :
# Create a python virtual environment for pipeline compilation
mkdir src
cd src
python3 -m venv env
source env/bin/activate
pip install kfp[kubernetes]
Create a pipeline.py file in this directory and include the contents from src/pipeline.py.
Update the Slack webhook_url on line #336 in pipeline.py.
Generate Pipeline YAML :
python3 pipeline.py
# deactivate the virtual env after pipeline generation
deactivate
The generated file iris_mlflow_kserve_pipeline.yaml is now ready for upload to Kubeflow.
Upload and Schedule Pipeline in Kubeflow :
Visit the ml-pipeline-ui address from Kubeflow setup and Click on Upload Pipeline.
Give the pipeline a name (e.g., IrisExp); keep it short, as I hit an issue with long names where Kubeflow could not read pod information due to overly long pod names.
Upload the generated iris_mlflow_kserve_pipeline.yaml file.
Go to Experiments -> Create New Experiment and name it iris-mlflow-kserve.
Configure a recurring run :
Go to Recurring Runs -> Create Recurring Run -> select the pipeline created above.
Keep the recurring run config name short as well, for the same reason mentioned above.
Select the iris-mlflow-kserve experiment.
Set up the Run Type and Run Trigger details as follows:
Run Type : Recurring
Trigger Type : Periodic, every 1 day, Maximum concurrent runs: 1, Catchup: False.
Run Parameters : data_path: /data
Click Start and the pipeline will run daily starting from the next day at the time of run creation.
You can also manually trigger a one-off run for testing purposes.
Follow the same steps as above but select Run Type as One-off and click Start.
This will trigger the pipeline immediately. You can monitor the pipeline in the Runs tab.
Model Inference
After a successful pipeline execution, you can get the API endpoint for inference:
kubectl get inferenceservice -n kubeflow
Note the service name and URL host.
Make Prediction Request :
Create an input.json file with the following content:
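The exact payload and route depend on how the pipeline registers and serves the model; a minimal sketch assuming the KServe v1 predict protocol and the four iris features (sepal length, sepal width, petal length, petal width) is:
{"instances": [[6.8, 2.8, 4.8, 1.4]]}
A request can then be sent through the Istio ingress; the service name, model name, and port lookup below are assumptions based on a standard KServe-on-Minikube setup:
# resolve the ingress address exposed by Minikube and the inference service host
INGRESS_HOST=$(minikube ip)
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice <service-name> -n kubeflow -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# <model-name> is the name under which the pipeline deployed the model
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/<model-name>:predict" -d @./input.json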
You should see the predictions for the input data.
You can also visit MLflow UI to see the model versions and experiment details with metrics, hyperparameters, and artifacts.
Grafana Setup for Monitoring and Alerts
Grafana is used to set up the dashboard for monitoring accuracy metric trends on an hourly, daily, and weekly basis. In addition, no-data alerts are configured to notify if no new model has been trained within 36 hours.
Deploy Grafana :
# move to the root directory if not there already
cd ..
cd grafana
kubectl create ns grafana
kubectl apply -f grafana.yaml -n grafana
# Check if Grafana pod is running
kubectl get pods -n grafana
# if pod is running, expose the service
minikube service grafana -n grafana
Log in to Grafana with the default credentials (admin/admin) and change the password if you want.
Set Up MySQL as Data Source :
Go to Connections -> Data Sources -> Add data source -> MySQL and configure with MySQL credentials (base64-decoded) from secret.yaml of MLFlow.
Host : mysql-svc.mlflow:3306
Database : <database> in secret.yaml (mysql-secrets) during mlflow setup (default: mlflowdb)
User: <username> in secret.yaml (mysql-secrets) during mlflow setup (default: admin)
Password: <password> in secret.yaml (mysql-secrets) during mlflow setup (default: vIRtUaLMinDs)
Save and Test the connection.
Add Dashboard for Accuracy Metrics
Import the Dashboard:
Navigate to Dashboards -> New -> Import.
Upload the file accuracy_metrics.json from grafana/.
Optionally, change the dashboard name.
Select the MySQL datasource created in previous steps.
Click Import to make the dashboard visible.
Set Up Alerts in Grafana
Create a Slack Contact Point:
Navigate to Alerting -> Contact points -> Create contact point.
Provide a descriptive name for the contact point.
Under Integration, select Slack and enter the webhook URL obtained during Slack setup.
Test the configuration to ensure the Slack notification works.
Click Save.
Create an Alert Rule:
Navigate to Alerting -> Alert Rules -> New Alert Rule.
Provide a name for the rule, such as Iris No Data Alert.
Under Define query, select Code instead of Query Builder, and enter the following SQL query:
SELECT COUNT(m.timestamp) AS recordcount
FROM experiments AS e
INNER JOIN runs AS r ON r.experiment_id = e.experiment_id
INNER JOIN metrics AS m ON r.run_uuid = m.run_uuid
WHERE e.name = 'iris-experiment'
HAVING MAX(FROM_UNIXTIME(m.timestamp / 1000)) > DATE_ADD(NOW(), INTERVAL -36 HOUR);
In Rule Type -> Expressions, delete the default expressions and add a new Threshold expression.
Set the alert condition: WHEN Input A IS BELOW 1.
Click Preview to verify:
The status should be Normal if a new model run has occurred within the last 36 hours, otherwise it will show Alert.
Configure Evaluation Behavior:
Create a new folder named no-data.
Create an evaluation group named no-data with an evaluation interval of 5 minutes.
Set Pending period to None.
Under No data and error handling:
Select Alerting for Alert state if no data or all values are null.
Select Normal for Alert state if execution error or timeout.
Configure Labels and Notifications:
Add the Slack contact point created earlier under Label and Notifications.
Provide a summary and description for the notification message.
Save and Exit:
Save the rule to complete the setup.
Expected Outcome
You should receive a Slack notification if no model has been trained and registered within the last 36 hours in the experiment iris-experiment.
The Grafana dashboard also provides insights into the average accuracy metrics over different periods. While the current pipeline runs daily, this setup would offer useful insights if the training frequency changes, including hourly, daily, or weekly trends.
Limitations and Alternatives
Retry Issue: Although retries are configured, a known issue with Kubeflow (issue #11288) prevents them from working as expected. Alternatives such as Apache Airflow or Vertex AI could address this limitation.
Single-Environment Setup: The current setup operates in a single environment, lacking the flexibility of development, staging, and production environments.
Manual Intervention: There is no manual review process before deploying a model to production, which may be beneficial. Alternatives like Apache Airflow’s custom sensors could allow manual interventions.
Kubernetes Dependency: As a fully Kubernetes-native system, each pipeline component runs as a pod. This design is suitable for high-resource nodes but may not work well in low-resource environments.
Additional Considerations: Code readability, testability, scalability, GPU node scheduling, distributed training, and resource optimization are important aspects to consider for long-term scalability and robustness.
Cleanup
minikube stop
minikube delete
Conclusion
This guide provides a comprehensive setup for an MLOps lifecycle, covering model training, monitoring, alerting, and API deployment. While the implementation is limited by time and resource constraints, it offers a solid foundation for a production-ready MLOps environment. The architecture diagram, system workflow, and SLA dimensions provide a clear understanding of the system’s components and requirements. By following the setup guide, users can deploy and monitor the machine learning pipeline, track model accuracy, and receive alerts for critical metrics. The guide also highlights potential enhancements and alternative solutions to address limitations and improve the system’s reliability and performance.
Trust QR is a platform that uses Blockchain technology and QR codes to combat counterfeiting. Companies can register their products on the platform, and each product will be assigned a unique QR code. Consumers can scan the QR code on a product to validate product authenticity, ensuring it matches the stated brand and providing manufacturing and expiry date details.
About
This project is based on blockchain technology.