Hi there,
I’ve been struggling to get the TF Profiler running and its results displayed in TensorBoard (TB). The files are stored in my TB log directory under the subfolder “profile”, as expected. However, when I start TB, the “profile” tab is missing (everything else is displayed, see screenshot below).
From the log of the run (below) I can see that the profiler session gets initialized, but it is followed by a session tear down. I also don’t see the expected “Successfully opened dynamic library libcupti.so.**.*” message, so there might be an issue with the CUPTI library…
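To rule out a wrong directory layout, the log directory can be checked with a short script. This is only a sketch: the helper name find_profile_runs is made up for illustration, and it assumes the layout that shows up in my log further down, i.e. <logdir>/plugins/profile/<run>/*.xplane.pb. As far as I understand, TB has to be started with --logdir pointing at the directory that contains plugins/ (or a parent of it).

```python
import tempfile
from pathlib import Path


def find_profile_runs(logdir):
    """Return the names of profile runs found under <logdir>/plugins/profile/.

    Hypothetical helper: assumes each run is a subdirectory holding at
    least one *.xplane.pb file, as seen in the log of my run.
    """
    profile_root = Path(logdir) / "plugins" / "profile"
    if not profile_root.is_dir():
        return []
    runs = []
    for run_dir in sorted(profile_root.iterdir()):
        # A run counts only if it holds at least one xplane.pb trace file.
        if run_dir.is_dir() and any(run_dir.glob("*.xplane.pb")):
            runs.append(run_dir.name)
    return runs


if __name__ == "__main__":
    # Demo on a throwaway directory that mimics the layout from the log.
    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp) / "plugins" / "profile" / "2022_08_01_12_38_43"
        run.mkdir(parents=True)
        (run / "host-0.xplane.pb").touch()
        print(find_profile_runs(tmp))
```

An empty list for the directory passed to --logdir would suggest that TB is pointed one level too deep or too shallow.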
If it helps for further debugging:
When checking the installation path as described here: Optimize TensorFlow performance using the Profiler | TensorFlow Core
I get:
/sbin/ldconfig.real: Can't stat /usr/local/cuda/extras/CUPTI/lib64: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib64: No such file or directory
/sbin/ldconfig.real: Path `/usr/local/cuda/targets/x86_64-linux/lib' given more than once
/sbin/ldconfig.real: Can't stat /usr/local/cuda-11/targets/x86_64-linux/lib: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib64: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/lib/x86_64-linux-gnu: No such file or directory
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib' given more than once
libcupti.so.11.2 -> libcupti.so.2020.3.1
/sbin/ldconfig.real: /lib/x86_64-linux-gnu/ld-2.31.so is the dynamic linker, ignoring
so it seems like it finds the library.
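That ldconfig lists the library does not by itself prove that the Python process running TensorFlow can open it. A rough way to check both is sketched below. The helper names cupti_lines and can_load are made up for illustration, and the library name libcupti.so.11.2 is simply taken from the output above.

```python
import ctypes


def cupti_lines(ldconfig_output):
    """Keep only the lines of ldconfig output that mention libcupti."""
    return [
        line.strip()
        for line in ldconfig_output.splitlines()
        if "libcupti" in line
    ]


def can_load(library_name):
    """Return True if the dynamic loader can open the library in this process."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library is missing or one of its dependencies fails.
        return False
    return True


if __name__ == "__main__":
    sample = (
        "/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib: "
        "No such file or directory\n"
        "\tlibcupti.so.11.2 -> libcupti.so.2020.3.1\n"
    )
    print(cupti_lines(sample))
    # Result depends on the machine; run this in the same environment as TF.
    print(can_load("libcupti.so.11.2"))
```

If can_load returns False in the environment where training runs, the library is visible to ldconfig but not to that process, which would point at LD_LIBRARY_PATH rather than at TB.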
I’ve also run this example (tensorboard/tensorboard_profiling_keras.ipynb at master · tensorflow/tensorboard · GitHub) in a notebook on the renku machine to see whether it would display any profile data in TB, but it doesn’t. Again, the files are saved.
Also, the profile plugin (tensorboard_plugin_profile) has been installed (see below).
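Since the plugin only helps if it lives in the same Python environment as the tensorboard executable that gets started, the installed versions can be printed from that environment. This is a minimal sketch using only the standard library. The helper name installed_version is made up, and the distribution names are the ones I assume are relevant here.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed distribution names; run with the interpreter that launches TB.
    for name in ("tensorflow", "tensorboard", "tensorboard-plugin-profile"):
        print(name, installed_version(name))
```

A None for the plugin, or a version that does not match the TensorFlow release, would be the first thing to look at.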
I am unsure how to proceed from here to resolve this issue. Any hints or input would be appreciated. Thank you.
Log from original run:
python src/main_pretrain.py --config src/config/config.yaml
/opt/conda/lib/python3.9/site-packages/tensorflow_addons/utils/ensure_tf_install.py:53: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.6.0 and strictly below 2.9.0 (nightly versions are not supported).
The versions of TensorFlow you are currently using is 2.9.0 and is not supported.
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you’re using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons’s version.
You can find the compatibility matrix in TensorFlow Addon’s readme:
https://github.com/tensorflow/addons
warnings.warn(
WARNING:tensorflow:From /home/jovyan/work/das-oa-ssl/src/models/layers/preprocessing.py:43: calling function (from tensorflow.python.eager.def_function) with experimental_compile is deprecated and will be removed in a future version.
Instructions for updating:
experimental_compile is deprecated, use jit_compile instead
2022-08-01 12:37:48.924618: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:48.956007: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:48.956468: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:48.957094: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING
Your GPU may run slowly with dtype policy mixed_float16 because it does not have compute capability of at least 7.0. Your GPU:
Tesla P100-PCIE-12GB, compute capability 6.0
See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING
Your GPU may run slowly with dtype policy mixed_float16 because it does not have compute capability of at least 7.0. Your GPU:
Tesla P100-PCIE-12GB, compute capability 6.0
See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
TF policy set to mixed_float16
INFO:main:TF version: 2.9.0
INFO:main:Current working dir: /home/jovyan/work/das-oa-ssl
INFO:src.util.misc:Directory logs/2022-08-01_14:37:48_CEST has been created.
INFO:src.util.misc:Directory logs/2022-08-01_14:37:48_CEST/images has been created.
INFO:src.util.misc:Log directories have been created at ./logs/2022-08-01_14:37:48_CEST and path have been set.
INFO:main:Chosen distribution strategy: None
DEBUG:src.data.generators:length data lookup for train/train: 10
DEBUG:src.data.generators:length data lookup for train/validation: 10
DEBUG:src.data.generators:length data lookup for val/None: 10
2022-08-01 12:37:49.795147: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-01 12:37:49.797228: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:49.797673: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:49.797976: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:50.780309: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:50.780667: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:50.780905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-01 12:37:50.781224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 11319 MB memory: -> device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:00:05.0, compute capability: 6.0
INFO:main:Computing min/max percentiles for clipping.
INFO:main:min percentile: -313996.0625, max percentile: 331386.75.
INFO:main:Effective batch size: 2
DEBUG:main:Base learning rate: 0.001
INFO:main:Actual learning rate: 7.8125e-06
INFO:main:Number of total training steps: 10
DEBUG:src.models.layers.image_augmentation:Calling image augmentation in test mode
DEBUG:src.models.layers.patch_embeddings:Calling patch embedding with data type x: <dtype: 'float16'>
2022-08-01 12:37:57.438061: I tensorflow/core/profiler/lib/profiler_session.cc:99] Profiler session initializing.
2022-08-01 12:37:57.438143: I tensorflow/core/profiler/lib/profiler_session.cc:114] Profiler session started.
2022-08-01 12:37:57.482416: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1665] Profiler found 1 GPUs
2022-08-01 12:37:58.083321: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session tear down.
2022-08-01 12:37:58.087979: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1799] CUPTI activity buffer flushed
Epoch 1/2
2022-08-01 12:37:58.236518: I tensorflow/core/profiler/lib/profiler_session.cc:99] Profiler session initializing.
2022-08-01 12:37:58.236585: I tensorflow/core/profiler/lib/profiler_session.cc:114] Profiler session started.
DEBUG:src.models.layers.image_augmentation:Calling image augmentation in training mode
DEBUG:src.models.layers.patch_embeddings:Calling patch embedding with data type x: <dtype: 'float16'>
INFO:src.models.mae_refactored:scaling loss because policy mixed_float16
INFO:src.models.mae_refactored:scaling grads because policy mixed_float16
DEBUG:src.models.layers.image_augmentation:Calling image augmentation in training mode
DEBUG:src.models.layers.patch_embeddings:Calling patch embedding with data type x: <dtype: 'float16'>
INFO:src.models.mae_refactored:scaling loss because policy mixed_float16
INFO:src.models.mae_refactored:scaling grads because policy mixed_float16
2022-08-01 12:38:38.457414: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8101
2022-08-01 12:38:43.129074: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2022-08-01 12:38:43.152367: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1799] CUPTI activity buffer flushed
2022-08-01 12:38:43.427292: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:521] GpuTracer has collected 8749 callback api events and 8688 activity events.
2022-08-01 12:38:43.860755: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session tear down.
2022-08-01 12:38:44.633636: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43
2022-08-01 12:38:44.967168: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for trace.json.gz to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.trace.json.gz
2022-08-01 12:38:45.597159: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43
2022-08-01 12:38:45.621946: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for memory_profile.json.gz to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.memory_profile.json.gz
2022-08-01 12:38:45.642457: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43
Dumped tool data for xplane.pb to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.xplane.pb
Dumped tool data for overview_page.pb to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.overview_page.pb
Dumped tool data for input_pipeline.pb to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to ./logs/2022-08-01_14:37:48_CEST/tb_logs/plugins/profile/2022_08_01_12_38_43/rwagner-das-2doa-2dssl-7b0630ac-0.kernel_stats.pb
5/Unknown - 48s 212ms/step - loss: 0.7835 - mae: 0.6896
DEBUG:src.models.layers.image_augmentation:Calling image augmentation in test mode
DEBUG:src.models.layers.patch_embeddings:Calling patch embedding with data type x: <dtype: 'float16'>
INFO:src.callbacks:Seconds per epoch end 000: 99.4s
INFO:src.util.misc:File ./src/config/config.yaml has been saved to ./logs/2022-08-01_14:37:48_CEST.
…