
Comet Model Production Monitoring (MPM)

Comet's Model Production Monitoring (MPM) helps you maintain high quality ML models by monitoring and alerting on defective predictions from models deployed in production. MPM supports all model types and allows you to monitor all your models in one place. It integrates with Comet Experiment Management, allowing you to track model performance from training to production.

Linux / All in one installs

Comet MPM is an optional module that can be installed alongside Experiment Management (EM). To enable it, you must set a specific flag. For all-in-one or Linux installations, enable it by running the following command:

cometctl aio install --mpm

This command installs the necessary modules and dependencies, including a PostgreSQL/TimescaleDB server. Alternatively, you can use managed services like timescale.com. If you choose this option, ensure you apply the following configuration settings:

max_connections: 1000
temp_file_limit: -1
max_locks_per_transaction: 1024
max_worker_processes: 24
timescaledb.max_background_workers: 24
work_mem: 5242
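
If your managed provider exposes a SQL connection, you can verify that these settings are in effect with psql (an optional quick check; the host and user below are placeholders for your own instance):

# Confirm the recommended settings on the managed TimescaleDB/PostgreSQL server
psql "host=my-timescale.example.com user=postgres dbname=postgres" \
  -c "SHOW max_connections;" \
  -c "SHOW max_worker_processes;" \
  -c "SHOW timescaledb.max_background_workers;" \
  -c "SHOW work_mem;"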

Helm chart / Kubernetes

To proceed with applying the Helm chart, please refer first to our Helm chart documentation, which covers the necessary Helm commands. After that, follow the steps below to enable MPM (Model Production Monitoring).

To enable Comet MPM using a Helm chart, update your override values by setting the enabled field for the mpm section to true:

# ...
comet:
  # ...
  mpm:
    enabled: true
# ...
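
After updating your override values, apply them with your usual Helm release command. The release name, chart reference, namespace, and values file below are placeholders for whatever you used when following the Helm chart documentation referenced above:

# Roll out the updated override values to your existing Comet release
helm upgrade <release-name> <comet-chart> -f override-values.yaml -n <namespace>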

Note, however, that MPM will NOT work without a TimescaleDB instance configured for it, as outlined further down.

Resource Requirements

Our general recommendation is to have 3 MPM pods, each with access to 16 vCPUs/cores and 32 GiB of memory/RAM, along with a single TimescaleDB replica with its own 16 vCPU/32 GiB allocation. This should support an average utilization of about 4500 predictions per second.

As usual, we recommend isolating your MPM pods from other workloads on your Kubernetes cluster, or setting aggressive resource reservations/requests, to ensure optimal performance.

Ideally, these should also be on a separate node pool from the one used for the core Comet application/Experiment Management. This is not necessary, however, if you are not using the EM product (or use it only minimally), or if you set aggressive resource reservations/requests for MPM.

We recommend a dedicated pool of 3 16vCPU/32GiB nodes (or whichever number matches your MPM replicaCount). See the TimescaleDB section on Nodes for further details about node pools when using the internal TimescaleDB.

To specify the node pool, the nodeSelector allows you to set a mapping of labels that must be present on any node on which the MPM pods will be scheduled to run.

# ...
backend:
  # ...
  mpm:
    nodeSelector:
      agentpool: comet-mpm
    # tolerations: []
    # affinity: {}
# ...
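
If your nodes do not already carry a matching label (for example, one applied automatically by your managed node pool), you can add it yourself; the node names below are placeholders:

# Label the nodes in the dedicated MPM pool so the nodeSelector above matches them
kubectl label nodes mpm-node-1 mpm-node-2 mpm-node-3 agentpool=comet-mpm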

To aggressively set the resource reservations/requests or adjust the pod count, you can use the following settings:

WARNING: When setting aggressive resource reservations, you must either have spare nodes or much larger nodes if you wish to maintain availability when updating the pods. Otherwise you will not have enough capacity to run more than your configured pod count and will need to scale down and back up to replace pods.

# ...
backend:
  # ...
  mpm:
    # replicaCount: 3
    memoryRequest: 32Gi
    # memoryLimit: 32Gi
    cpuRequest: 16000m
    # cpuLimit: 16000m
# ...

TimescaleDB

MPM has a critical dependency on TimescaleDB to store its data. So you will also need to configure the connection to a TimescaleDB instance.

As with most data layers, we recommend using a managed service for this, such as Timescale Cloud (timescale.com).

To configure the connection to your TimescaleDB instance set the following details in your override values:

# ...
timescaleDB:
  host: my-timescale.com
  user: postgres
  password: "passW0rd!"
# ...
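
As a quick sanity check, you can verify connectivity to the external TimescaleDB instance from inside the cluster. The image tag is illustrative, and the connection details mirror the placeholder values above:

# Run a throwaway pod and test the connection to the external TimescaleDB
kubectl run timescale-check --rm -it --restart=Never --image=postgres:14 -- \
  psql "host=my-timescale.com user=postgres password=passW0rd! dbname=postgres" -c "SELECT version();"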

Internal TimescaleDB

As with other data layers, we also support deploying TimescaleDB as part of your self-hosted Comet installation.

If you plan to use the internal TimescaleDB, you should also configure the following settings in your override values. Make sure to set storageClass to the storage class appropriate to your cluster:

# ...
timescaleDB:
  enableInternalTimescaleDB: true
timescaleDBInternal:
  persistentVolumes:
    data:
      storageClass: "default"
    wal:
      storageClass: "default"
# ...
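
If you are unsure which storage classes are available in your cluster, you can list them before filling in these values:

# List the storage classes available in the cluster
kubectl get storageclass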

For this we use the official TimescaleDB-Single Helm chart as a subchart. Values specific to this subchart can be set under the timescaleDBInternal key. See the TimescaleDB-Single chart documentation for full details on the possible settings.

Storage

TimescaleDB uses two storage volumes: data and wal. data is where your events are stored long-term once they have been fully ingested. wal is the "Write Ahead Log" where data is staged prior to full ingestion. It is essentially a buffer for accepting data faster than it can be fully processed.

Be sure to set storageClass to the value appropriate to your cluster.

# ...
timescaleDBInternal:
  # ...
  persistentVolumes:
    data:
      size: 1Ti
      storageClass: "default"
    wal:
      size: 64Gi
      storageClass: "default"
# ...
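
After deployment, you can confirm that both volumes were provisioned at the expected sizes; the exact resource names depend on your release name, so the namespace below is a placeholder:

# Check the provisioned TimescaleDB data and WAL volumes
kubectl get pvc -n <comet-namespace> | grep -i timescale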
Nodes

We recommend assigning TimescaleDB to a node pool. Either:

  1. A node pool dedicated for the Comet MPM TimescaleDB. Node count should match replicaCount.
  2. A node pool dedicated for Comet MPM. Be sure to increase the pool size by 1 or whatever your TimescaleDB replica count is, or to specify resources as outlined below.
  3. A node pool dedicated for Comet. Best for low performance needs, or with aggressive resource reservation.

As with Comet itself, if no dedicated node pool is possible, be certain to set aggressive resource requests/reservations to ensure best performance.

The nodeSelector allows you to set a mapping of labels that must be set on any node on which TimescaleDB will be scheduled to run.

# ...
timescaleDBInternal:
  # ...
  nodeSelector:
    agentpool: comet-mpm-data
  # tolerations: []
  # affinity: {}
# ...
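
If you also taint the dedicated nodes so that other workloads stay off them, remember to add a matching entry under tolerations in the block above. The node name and taint key below are illustrative:

# Label the dedicated data nodes and (optionally) taint them for exclusive use
kubectl label nodes timescale-node-1 agentpool=comet-mpm-data
kubectl taint nodes timescale-node-1 dedicated=comet-mpm-data:NoSchedule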
Resources

When performance matters and TimescaleDB has not been assigned to a dedicated node pool (or it shares a pool with the MPM pods), be sure to set aggressive resource reservations/requests that match your needs.

WARNING: When setting aggressive resource reservations, you must either have spare nodes or much larger nodes if you wish to maintain availability when updating the pods. Otherwise you will not have enough capacity to run more than your configured pod count and will need to scale down and back up to replace pods.

# ...
timescaleDBInternal:
  # ...
  resources:
    requests:
      cpu: 16000m
      memory: 32Gi
    # limits:
    #   cpu: 16000m
    #   memory: 32Gi
# ...
Image

Should you want to specify a custom TimescaleDB container image, repository, or registry (or adjust the pullPolicy), use the nested image key.

The container image must be a timescaledb-ha image.

Due to how the subchart works, if you wish to pull the image from a registry that is not Docker Hub, prefix the registry URI to the repository name rather than setting it as a separate registry key.

# ...
timescaleDBInternal:
  # ...
  image:
    # tag: pg14.6-ts2.9.2-patroni-dcs-failsafe-p0
    # repository: timescale/timescaledb-ha
    # pullPolicy: Always
# ...
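
For example, to pull the image from a private registry, the values might look like the following; the registry hostname is a placeholder:

# ...
timescaleDBInternal:
  # ...
  image:
    # Registry URI is prefixed to the repository name, rather than set as a separate registry key
    repository: my-registry.example.com/timescale/timescaledb-ha
    tag: pg14.6-ts2.9.2-patroni-dcs-failsafe-p0
# ...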
Service Account

The TimescaleDB subchart relies on a Kubernetes Service Account for some of its behavior, including replica scaling. By default, we have it use the Service Account created for Comet by the Helm chart. If this service account's creation is disabled for any reason, you can have the TimescaleDB subchart create its own, or you can specify the name of another existing service account to use for this purpose.

# ...
timescaleDBInternal:
  # ...
  serviceAccount:
    # create: false
    name: my-service-account
# ...
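
Before pointing the subchart at an existing service account, you can confirm that it exists in the Comet namespace; the name and namespace below are placeholders:

# Verify the existing service account referenced above
kubectl get serviceaccount my-service-account -n <comet-namespace>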
Replicas

A median load should not require any additional replicas, but should you want them, you can set replicaCount to a number higher than 1 like so:

# ...
timescaleDBInternal:
  # ...
  replicaCount: 3
# ...