Cluster

To see the complete accounting details for a job:

```bash
sacct -j JOBID --format=JobID,JobName,Partition,AllocCPUS,Elapsed,State,ExitCode,MaxRSS,MaxVMSize,MaxDiskRead,MaxDiskWrite,AveRSS,AveVMSize,AvePages,AveDiskRead,AveDiskWrite
```

Replace JOBID with the numeric ID of the job. MaxRSS is the maximum resident set size, i.e. the peak memory the job used; it is reported in KB, so divide by 10^6 to get a rough figure in GB.
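
If you would rather read the memory figures directly in GB, sacct also has a `--units` option (available in most current Slurm versions) that converts the values for you:

```bash
# Report memory columns in GB instead of KB (assumes a Slurm version that supports --units)
sacct -j JOBID --units=G --format=JobID,JobName,Elapsed,State,MaxRSS,MaxVMSize
```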


Monitoring the memory usage of your job in real time can help you troubleshoot issues and optimize performance. On a Slurm-managed cluster, you can use Slurm's built-in commands to check resource usage.

Here are some ways to do it:

### squeue
You can use `squeue` to check the status of your job:

```bash
squeue -u <username>
```
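
If the default columns are not detailed enough, `squeue` also accepts an output format string. The specifiers below (%i job ID, %j name, %T state, %M elapsed time, %m requested memory, %R nodelist/reason) are standard, but check `man squeue` on your cluster:

```bash
# Custom columns: job ID, name, state, elapsed time, requested memory, nodelist/reason
squeue -u <username> -o "%.10i %.20j %.8T %.10M %.10m %R"
```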

### scontrol show job
Use `scontrol` to get detailed information about your job, including its resource allocation:

```bash
scontrol show job <job_id>
```
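
The output of `scontrol show job` is verbose. One quick way to pull out just the resource-related lines is to filter it; the exact field names (TRES, NumCPUs, MinMemory, ...) can vary between Slurm versions:

```bash
# Show only the allocation-related lines (field names vary by Slurm version)
scontrol show job <job_id> | grep -E "TRES|NumNodes|NumCPUs|Mem"
```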

### sacct
The `sacct` command shows the accounting information for all jobs and job steps in the Slurm job accounting log or Slurm database. You can filter it by job ID to check resource usage:

```bash
sacct -j <job_id> --format=JobID,JobName,MaxRSS
```

Here, `MaxRSS` represents the maximum resident set size, or the maximum amount of memory used by the job.
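
Keep in mind that sacct's memory fields are typically only populated once a job step has finished. For a job that is still running, `sstat` reports live statistics instead; for jobs submitted with sbatch you usually need to query the batch step explicitly:

```bash
# Live memory statistics for a running job (the .batch step name assumes an sbatch job)
sstat -j <job_id>.batch --format=JobID,MaxRSS,AveRSS,MaxVMSize
```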

### Real-time Monitoring with SSH

If you can SSH into the cluster node where your job is running, you can use commands like `top` or `htop`, filtered by your user name or process ID, to monitor memory usage in real time. Be cautious when doing this: excessive use could be considered bad practice or may be against your cluster's usage policies.

```bash
top -u <username>
```

or to find by process ID (PID):

```bash
top -p <pid>
```
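
As an alternative to an interactive view, a one-shot `ps` listing sorted by resident memory can be easier to capture in a log (GNU ps syntax assumed here):

```bash
# Snapshot of your processes on the node, largest resident memory first (GNU ps)
ps -u <username> -o pid,rss,vsz,etime,cmd --sort=-rss | head
```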

### Monitoring Scripts
You can also write a script that utilizes these commands and runs them at a regular interval to keep track of the memory usage. Be cautious with how often you run such a script, however, as excessive polling could cause problems.
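
As a minimal sketch of such a script (the job ID, polling interval, and log file name are placeholders to adapt), the loop below appends an sstat reading every 60 seconds for as long as the job remains in the queue:

```bash
#!/bin/bash
# Minimal polling sketch: log memory usage of a running job at a fixed interval.
# JOBID, INTERVAL, and LOGFILE are placeholders; adjust them to your job and your cluster's policies.
JOBID=$1
INTERVAL=60
LOGFILE="mem_${JOBID}.log"

# Keep polling while the job is still listed by squeue
while squeue -j "$JOBID" -h 2>/dev/null | grep -q .; do
    date >> "$LOGFILE"
    sstat -j "${JOBID}.batch" --format=JobID,MaxRSS,AveRSS >> "$LOGFILE"
    sleep "$INTERVAL"
done
```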

Always adhere to the guidelines or policies of your specific computing environment when running monitoring or diagnostic commands.


Logging into a specific node and monitoring its resource utilization is generally possible within a cluster. Here's how you can do it:

1. **SSH into the Node**: If your cluster allows for direct SSH into compute nodes, you can use SSH to log in to the specific node.
    ```bash
    ssh username@CPU20
    ```
    Replace `username` with your actual username on the cluster.

2. **Use `top` or `htop`**: Once you are logged into the specific node, you can use commands like `top` or `htop` to see memory usage. Run `top` and look for the memory stats.
    ```bash
    top
    ```

3. **Check Memory Usage with `free`**: This command can give you a summary of the memory usage on the node.
    ```bash
    free -h
    ```
    The `-h` flag makes the output human-readable, showing sizes in GB, MB, etc.

4. **Slurm Commands**: If the cluster uses Slurm, you can use the `scontrol show node=CPU20` command to get information about a specific node (see the example after this list). Note that this would generally be run from the head node, or from a node where you have the appropriate permissions.

5. **Node Specific Monitoring Tools**: Some clusters have specialized software for monitoring resource utilization on nodes. Check your cluster's documentation for details.
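
For example, from a login node you could pull just the memory-related fields for the node mentioned in step 4; RealMemory, AllocMem, and FreeMem are the usual field names, but they can vary by Slurm version:

```bash
# Memory-related fields for a specific node (field names may vary by Slurm version)
scontrol show node CPU20 | grep -E "RealMemory|AllocMem|FreeMem"
```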

Note: Your ability to SSH into a node and run these commands might be restricted based on your cluster's policies. If you are not sure about this, you may need to consult your cluster's documentation or get in touch with the system administrators.
