I had to download a file (HowardSprings_2020_L3.nc) manually because I could not get a direct link to it. I placed it in the folder of the respective dataset and then tried to add it with
renku dataset add HowardSprings temp/HowardSprings_2020_L3.nc. This failed with a warning that I should provide a URL for proper tracking. So I contacted the data supplier and he gave me a direct link to a complete file,
HowardSprings_L3.nc. I ran
renku dataset add HowardSprings URL, forgetting that I still had the untracked
HowardSprings_2020_L3.nc in the folder. Apparently, renku first added and committed that file, with the commit message
renku dataset: committing 1 newly added files, and then downloaded the second file,
HowardSprings_L3.nc, and committed it with the commit message
renku dataset add HowardSprings http://dap.ozflux.org.au/thredds/fileServer/ozflux/sites/HowardSp.... Wouldn’t it be better to warn the user that the repo is dirty before taking executive decisions?
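A manual check before running renku dataset add would probably have caught the leftover file. The following is only a sketch using plain git: the helper names untracked_paths and require_clean_tree are made up for illustration and are not part of renku, and it assumes the project is an ordinary git working tree.

```shell
# Hypothetical helpers, not part of renku.

# Print the untracked paths found in `git status --porcelain` output on stdin.
# Porcelain v1 marks untracked entries with a leading "?? ".
untracked_paths() {
  sed -n 's/^?? //p'
}

# Return 0 if the working tree is clean; otherwise print the pending
# changes to stderr and return 1 (2 if git itself fails).
require_clean_tree() {
  status=$(git status --porcelain) || return 2
  if [ -n "$status" ]; then
    echo "Working tree is not clean:" >&2
    printf '%s\n' "$status" >&2
    return 1
  fi
}

# Example use (URL stands for the direct link from the data supplier):
#   git status --porcelain | untracked_paths
#   require_clean_tree && renku dataset add HowardSprings "$URL"
```

With the stray HowardSprings_2020_L3.nc still in the folder, require_clean_tree would return 1 and the add would not run until the file is committed, moved, or deleted.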
This is all in the repo
https://renkulab.io/projects/ba_math_gen-16/et-data. When I browse the datasets, I find both files, but there is no lineage information for either of them. When I look at
HowardSprings_L3.nc in the GitLab interface, it points to the commit
renku dataset: committing 1 newly added files instead of the
renku dataset add commit. But then, the latter does not contain the full path anyway, so I am at a loss as to how to recover the origin of the files in the dataset. Could anyone help me out here? I had trusted that the lineage would be recorded when running
renku dataset add, since this was implied by the error message of my first attempt, but if this is not the case, I will have to document it for each file in a README. I am not sure how
renku dataset update should work if the link is truncated, though.