Track origins of files in datasets?

schymans · 21 March 2021 13:47

I had to download a file (HowardSprings_2020_L3.nc) manually because I could not get a direct link to it. I placed it in the folder of the respective dataset and then tried to add it with renku dataset add HowardSprings temp/HowardSprings_2020_L3.nc. This failed with a warning that I should provide a URL for proper tracking. So I contacted the data supplier, who gave me a direct link to a complete file, HowardSprings_L3.nc. I then ran renku dataset add HowardSprings URL, forgetting that the untracked HowardSprings_2020_L3.nc was still in the folder. Apparently renku first added and committed that untracked file, with the commit message "renku dataset: committing 1 newly added files", and then downloaded the second file, HowardSprings_L3.nc, and committed it with the commit message "renku dataset add HowardSprings http://dap.ozflux.org.au/thredds/fileServer/ozflux/sites/HowardSp...". Wouldn’t it be better to warn the user that the repo is dirty before making decisions on their behalf?
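Until such a warning exists, a guard along the following lines could be run before renku dataset add. This is only a sketch: it assumes the project is an ordinary Git repository with git available on the PATH, and the helper names pending_paths and repo_is_clean are made up for illustration and are not part of renku.

```python
import subprocess


def pending_paths(porcelain_output: str) -> list[str]:
    """Return the paths listed in `git status --porcelain` (v1) output.

    Each line consists of a two-character status code, a space, and the path.
    Untracked files appear with the status code `??`.
    """
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def repo_is_clean(repo_dir: str = ".") -> bool:
    """True if Git reports no modified, staged, or untracked files in repo_dir."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return not pending_paths(result.stdout)
```

If repo_is_clean() returns False, the stray files can be committed, removed, or ignored deliberately before adding the new file, so that nothing is swept into an automatic commit.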

This is all in the repo https://renkulab.io/projects/ba_math_gen-16/et-data. When I browse the datasets, I find both files, but there is no lineage information for either of them. When I look at HowardSprings_L3.nc in the GitLab interface, it points to the commit "renku dataset: committing 1 newly added files" instead of the renku dataset add commit. But the latter does not contain the file path anyway, so I am at a loss as to how to recover the origin of the files in the dataset. Could anyone help me out here? I trusted that the lineage would be recorded when running renku dataset add, since the error message from my first attempt implied this. If that is not the case, I will have to document the origin of each file in a readme. I am also not sure how renku dataset update is supposed to work if the link is truncated.
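As a stopgap for the readme approach, the origin of each file could be written to a small manifest kept next to the data. The sketch below is not a renku feature: the function names, the manifest file, and its JSON layout are assumptions chosen for illustration. It stores the full source URL together with a SHA-256 checksum, so the record does not depend on a commit message.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_origin(manifest: Path, data_file: Path, source_url: str) -> dict:
    """Add or overwrite the manifest entry for data_file and return it."""
    entries = json.loads(manifest.read_text()) if manifest.exists() else {}
    entries[str(data_file)] = {
        "source_url": source_url,  # the complete URL, not a shortened one
        "sha256": sha256_of(data_file),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    manifest.write_text(json.dumps(entries, indent=2, sort_keys=True))
    return entries[str(data_file)]
```

Calling record_origin once per downloaded file, with the complete link from the data supplier, and committing the manifest alongside the data keeps the origin recoverable even when the lineage view shows nothing.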
