Profiler data not shown in Tensorboard, but files are stored

Hi there,

I’ve been struggling to get the TF profiler running and its results displayed in TensorBoard (TB). The files are stored in my TB log directory under the subfolder “profile”, as expected. However, when I start TB, the “profile” tab is missing (everything else is displayed; see screenshot below).

From the log of the run (below), I can see that the profiler session gets initialized but is followed by a session tear down. I don’t see the expected “Successfully opened dynamic library libcupti.so.**.*” message either, so there might be an issue with the CUPTI library…

If it helps for further debugging:

When checking the installation path as described here: Optimize TensorFlow performance using the Profiler  |  TensorFlow Core

I get:

/sbin/ldconfig.real: Can't stat /usr/local/cuda/extras/CUPTI/lib64: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib64: No such file or directory
/sbin/ldconfig.real: Path `/usr/local/cuda/targets/x86_64-linux/lib' given more than once
/sbin/ldconfig.real: Can't stat /usr/local/cuda-11/targets/x86_64-linux/lib: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib64: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/lib/x86_64-linux-gnu: No such file or directory
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib' given more than once
libcupti.so.11.2 -> libcupti.so.2020.3.1
/sbin/ldconfig.real: /lib/x86_64-linux-gnu/ld-2.31.so is the dynamic linker, ignoring

so it seems like it finds the library.
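
To double-check this from Python, here is a minimal sketch that asks the dynamic linker for CUPTI and then tries to load it. It uses only the standard library; the function name check_cupti is made up for illustration, and finding the library this way does not prove that TensorFlow loads the same copy.

```python
import ctypes
import ctypes.util


def check_cupti():
    """Return (library_name, loaded) for the CUPTI shared library.

    library_name is None if the linker cannot find libcupti at all;
    loaded is True only if the library could actually be opened.
    """
    found = ctypes.util.find_library("cupti")
    if found is None:
        return None, False
    try:
        ctypes.CDLL(found)  # raises OSError if the file cannot be loaded
        return found, True
    except OSError:
        return found, False


name, loaded = check_cupti()
print(f"CUPTI library: {name}, loadable: {loaded}")
```

If the library is found but not loadable, a missing dependency or a mismatched CUDA version would be my first guess.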

I’ve also run this example, tensorboard/tensorboard_profiling_keras.ipynb at master · tensorflow/tensorboard · GitHub, in a notebook on the Renku machine to see whether it would display any profile data in TB, but it doesn’t. Again, the files are saved.
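
To confirm what "saved" means here, this is a rough sketch that lists the profile runs found under a log directory. It assumes the layout that appears in my log further down (<logdir>/plugins/profile/<run>/*.xplane.pb); the function name find_profile_runs and the example path are only for illustration.

```python
from pathlib import Path


def find_profile_runs(logdir):
    """Map each profile run directory under logdir to its xplane.pb files.

    Only files located in .../plugins/profile/<run>/ are counted.
    """
    runs = {}
    for trace in Path(logdir).rglob("*.xplane.pb"):
        run_dir = trace.parent
        in_profile = run_dir.parent.name == "profile"
        in_plugins = run_dir.parent.parent.name == "plugins"
        if in_profile and in_plugins:
            runs.setdefault(str(run_dir), []).append(trace.name)
    return runs


# Example path (hypothetical): point this at the directory passed to --logdir.
for run, files in find_profile_runs("./logs").items():
    print(run, files)
```

If runs show up here but not in TB, the files themselves are probably fine and the problem is more likely on the TB/plugin side.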

Also, the profile plugin (tensorboard-plugin-profile) has been installed (see below).
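
Since a version mismatch between the three packages could matter, here is a small sketch that prints the installed versions in the active environment. It uses importlib.metadata from the standard library; the package list is simply the three packages mentioned in this thread, and installed_versions is a made-up helper name.

```python
from importlib import metadata

PACKAGES = ("tensorflow", "tensorboard", "tensorboard-plugin-profile")


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions


for pkg, ver in installed_versions().items():
    print(f"{pkg}: {ver or 'not installed'}")
```

It is worth running this with the same Python interpreter that starts TB, since a notebook kernel and a terminal can point at different environments.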

I am unsure how to proceed from there to resolve this issue. Any hints and inputs would be appreciated. Thank you.

image

Log from original run:
python src/main_pretrain.py --config src/config/config.yaml

/opt/conda/lib/python3.9/site-packages/tensorflow_addons/utils/ensure_tf_install.py:53: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.6.0 and strictly below 2.9.0 (nightly versions are not supported).
The versions of TensorFlow you are currently using is 2.9.0 and is not supported.
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you’re using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons’s version.
You can find the compatibility matrix in TensorFlow Addon’s readme:
https://github.com/tensorflow/addons
warnings.warn(
WARNING:tensorflow:From /home/jovyan/work/das-oa-ssl/src/models/layers/preprocessing.py:43: calling function (from tensorflow.python.eager.def_function) with experimental_compile is deprecated and will be removed in a future version.
Instructions for updating:
experimental_compile is deprecated, use jit_compile instead
2022-08-01 12:37:48.924618: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:48.956007: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:48.956468: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:48.957094: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING
Your GPU may run slowly with dtype policy mixed_float16 because it does not have compute capability of at least 7.0. Your GPU:
Tesla P100-PCIE-12GB, compute capability 6.0
See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING
Your GPU may run slowly with dtype policy mixed_float16 because it does not have compute capability of at least 7.0. Your GPU:
Tesla P100-PCIE-12GB, compute capability 6.0
See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
TF policy set to mixed_float16
INFO:main:TF version: 2.9.0
INFO:main:Current working dir: /home/jovyan/work/das-oa-ssl
INFO:src.util.misc:Directory logs/2022-08-01_14:37:48_CEST has been created.
INFO:src.util.misc:Directory logs/2022-08-01_14:37:48_CEST/images has been created.
INFO:src.util.misc:Log directories have been created at ./logs/2022-08-01_14:37:48_CEST and path have been set.
INFO:main:Chosen distribution strategy: None
DEBUG:src.data.generators:length data lookup for train/train: 10
DEBUG:src.data.generators:length data lookup for train/validation: 10
DEBUG:src.data.generators:length data lookup for val/None: 10
2022-08-01 12:37:49.795147: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-01 12:37:49.797228: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:49.797673: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:49.797976: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:50.780309: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:50.780667: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:50.780905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:50.781224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 11319 MB memory: -> device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:00:05.0, compute capability: 6.0
INFO:main:Computing min/max percentiles for clipping.
INFO:main:min percentile: -313996.0625, max percentile: 331386.75.
INFO:main:Effective batch size: 2
DEBUG:main:Base learning rate: 0.001
INFO:main:Actual learning rate: 7.8125e-06
INFO:main:Number of total training steps: 10
DEBUG:src.models.layers.image_augmentation:Calling image augmentation in test mode
DEBUG:src.models.layers.patch_embeddings:Calling patch embedding with data type x: <dtype: 'float16'>
2022-08-01 12:37:57.438061: I tensorflow/core/profiler/lib/profiler_session.cc:99] Profiler session initializing.
2022-08-01 12:37:57.438143: I tensorflow/core/profiler/lib/profiler_session.cc:114] Profiler session started.
2022-08-01 12:37:57.482416: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1665] Profiler found 1 GPUs
2022-08-01 12:37:58.083321: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session tear down.
2022-08-01 12:37:58.087979: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1799] CUPTI activity buffer flushed
Epoch 1/2
2022-08-01 12:37:58.236518: I tensorflow/core/profiler/lib/profiler_session.cc:99] Profiler session initializing.
2022-08-01 12:37:58.236585: I tensorflow/core/profiler/lib/profiler_session.cc:114] Profiler session started.
DEBUG:src.models.layers.image_augmentation:Calling image augmentation in training mode
DEBUG:src.models.layers.patch_embeddings:Calling patch embedding with data type x: <dtype: 'float16'>
INFO:src.models.mae_refactored:scaling loss because policy mixed_float16
INFO:src.models.mae_refactored:scaling grads because policy mixed_float16
DEBUG:src.models.layers.image_augmentation:Calling image augmentation in training mode
DEBUG:src.models.layers.patch_embeddings:Calling patch embedding with data type x: <dtype: 'float16'>
INFO:src.models.mae_refactored:scaling loss because policy mixed_float16
INFO:src.models.mae_refactored:scaling grads because policy mixed_float16
2022-08-01 12:38:38.457414: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8101
2022-08-01 12:38:43.129074: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2022-08-01 12:38:43.152367: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1799] CUPTI activity buffer flushed
2022-08-01 12:38:43.427292: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:521] GpuTracer has collected 8749 callback api events and 8688 activity events.
2022-08-01 12:38:43.860755: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session tear down.
2022-08-01 12:38:44.633636: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43

2022-08-01 12:38:44.967168: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for trace.json.gz to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.trace.json.gz
2022-08-01 12:38:45.597159: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43

2022-08-01 12:38:45.621946: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for memory_profile.json.gz to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.memory_profile.json.gz
2022-08-01 12:38:45.642457: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43
Dumped tool data for xplane.pb to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.xplane.pb
Dumped tool data for overview_page.pb to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.overview_page.pb
Dumped tool data for input_pipeline.pb to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.kernel_stats.pb
5/Unknown - 48s 212ms/step - loss: 0.7835 - mae: 0.6896
DEBUG:src.models.layers.image_augmentation:Calling image augmentation in test mode
DEBUG:src.models.layers.patch_embeddings:Calling patch embedding with data type x: <dtype: 'float16'>
INFO:src.callbacks:Seconds per epoch end 000: 99.4s
INFO:src.util.misc:File ./src/config/config.yaml has been saved to ./logs/2022-08-01_14:37:48_CEST.

Hi there,

Did you have a chance to look into this?
I have managed to get it running locally on my machine by updating tensorboard, tensorflow, and tensorboard-plugin-profile to the latest versions. The PROFILE option now gets displayed in TB there.

Upgrading the packages on Renku didn’t do the trick, though: I still don’t see the PROFILE option in TB there.

Raphaela

Okay, some more information that might be helpful. If I don’t install the profiler plugin, I can see the PROFILE option in the drop-down, and TB tells me to run

pip install -U tensorboard-plugin-profile

The install terminates successfully. After that, there’s no longer a PROFILE option in TB, neither in the drop-down nor on the top bar.
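
One thing that might explain the tab vanishing after the install is the plugin being registered but failing to import. Here is a hedged sketch to test that: it lists the entry points in the "tensorboard_plugins" group and tries to load each one. That group name is my assumption about how TB discovers installed plugins, and check_tensorboard_plugins is a made-up helper name.

```python
from importlib import metadata


def check_tensorboard_plugins(group="tensorboard_plugins"):
    """Try to load every entry point in the given group.

    Returns {entry point name: "ok" or "failed: <exception>"}.
    The group name is an assumption, not something confirmed in this thread.
    """
    eps = metadata.entry_points()
    if hasattr(eps, "select"):        # Python 3.10 and newer
        candidates = eps.select(group=group)
    else:                             # Python 3.8 / 3.9 return a plain dict
        candidates = eps.get(group, [])
    results = {}
    for ep in candidates:
        try:
            ep.load()                 # imports the plugin module
            results[ep.name] = "ok"
        except Exception as exc:
            results[ep.name] = f"failed: {exc!r}"
    return results


for name, status in check_tensorboard_plugins().items():
    print(f"{name}: {status}")
```

An empty result would suggest the plugin is not visible in this environment at all, while a "failed" entry should show the import error that TB may be swallowing.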