- 1 About the service
- 2 Technology explained
- 3 Transferring data
- 4 Features and limitations
- 4.1 I logged in using an SFTP client but cannot store data in my home directory. What's the matter?
- 4.2 How many copies of the data are made and where are they stored?
- 4.3 How long will the data remain in the archive?
- 4.4 How can I make sure my data did not change? Do you support checksums?
- 4.5 Can I see checksums?
- 4.6 How much data can I store?
- 4.7 System wide limits
- 5 Upcoming features
- 6 Security
- 7 Assistance
- 8 Registration and access
- 9 Usage tips
- 9.1 I have a suggestion for improvement. What is the reward if my idea gets implemented?
- 9.2 Do you have recommendations regarding the size of the files?
- 9.3 Do many small files take up more space than a single large file?
- 9.4 I want to routinely create and validate checksums of large amounts of files
- 9.5 Accessing my data takes a long time or my client runs into a timeout. Why?
- 9.6 I deleted [a file | some files | a directory | my files]. Can I recover the lost data?
- 9.7 Errors while transferring data
About the service
Who are the designated users?
Employees of KIT, universities and institutions in Baden-Wuerttemberg
What's with the bw in the name bwDataArchive?
bw stands for Baden-Wuerttemberg, the state in which KIT is located. The initial investments and the development of the service were funded by the Ministry of Science, Research and Arts of Baden-Wuerttemberg. The name bwDataArchiv should be self-explanatory, except maybe for the missing 'e' in Archiv, which is the German word for archive. The English name of the bwDataArchive brand is RDA, the Research Data Archive (with e :-)), but you will also find bwDataArchive (with e :-)) used as a name for the service (the project name was without 'e').
Multiple service models
The service offers three service models. Your home organisation can apply for a contract with bwDataArchive. It is up to your home organisation to decide who has access and is allowed to store data. Universities in Baden-Wuerttemberg typically have an IDP which forwards the entitlement for users who are allowed to use the archive (idp-service-model). Organisations outside the BWIDM federation designate an administrator who can invite users to register for the service using an advanced administration portal (admin-service-model). The third model is the sa-service-model, in which a single service account (sa) of an organisation or project is used to allow an application to store data in the archive. The projects RADAR and bwDataDiss use this model.
Technology explained
What storage technologies do you use?
What is HPSS?
HPSS is a data management application that is developed jointly by several computing centres that require long-term storage for large amounts of data. For more detailed information, see the HPSS web site.
How is the data stored?
Data is securely stored on magnetic tape. We use the following tape drives and technologies: LTO5 (max. 1.5 TB per cartridge), STK 10kC (max. 4 TB per cartridge), STK 10kD (max. 8 TB per cartridge) and IBM TS1140 (max. 4.5 TB per cartridge).
Transferring data
What protocols do you support?
The service supports SFTP and GridFTP for uploading and downloading. See https://www.rda.kit.edu/english/transmit.php
Features and limitations
I logged in using an SFTP client but cannot store data in my home directory. What's the matter?
Your home directory is read-only and contains at least the directory private. To store data, change into the directory private. In the future, other directories will be visible in your home directory, e.g. a directory public for open data.
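As an illustration, an upload can be scripted so that files always land in private. The sketch below only builds the text of a batch file for the `-b` option of the OpenSSH sftp client. The function name, the optional subdirectory and the file names are made up for this example, and paths containing spaces would need quoting.

```python
from pathlib import PurePosixPath


def build_sftp_batch(local_files, remote_subdir=""):
    """Return the text of an sftp batch file that uploads into private/.

    The home directory is read-only, so the batch changes into 'private'
    first. remote_subdir is an optional, hypothetical subdirectory that
    must already exist on the server.
    """
    target = PurePosixPath("private") / remote_subdir
    lines = [f"cd {target}"]
    for path in local_files:
        lines.append(f"put {path}")
    lines.append("bye")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # File names are examples only.
    print(build_sftp_batch(["results-2020.tar", "rawdata.tar"]))
```

Saved to a file, the output could be passed to the client with `sftp -b <batch file> <user>@<archive host>`, using the login data you received at registration.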
How many copies of the data are made and where are they stored?
All data in bwDataArchive has at least 2 copies. Data is moved to disk and from there duplicated to tape storage. There are tape libraries in two data centres in CN as well as in CS.
How long will the data remain in the archive?
The regular retention time for files on bwDataArchive is ten years. After this ten-year period bwDataArchive will delete your data. A warning message is sent 6 months ahead to the registered mail addresses. (This is probably the biggest reason to keep at least one of the two possible mail addresses up to date.) Contact us at least three months in advance to have the retention time extended. If you want to terminate the cooperation with bwDataArchive, or if your data is no longer needed, you can delete your data yourself.
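To keep track of these deadlines for your own data, the dates can be computed from the day a file was archived. This is a minimal sketch; it assumes that the ten years are counted from the upload date, which you should confirm with the service. The function and key names are made up for this example.

```python
import calendar
from datetime import date


def add_months(day, months):
    """Shift a date by a number of months, clamping to the month's last day."""
    index = day.year * 12 + (day.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def retention_dates(archived_on):
    """Return the key dates for a file archived on the given day.

    Assumption: the ten-year retention period starts on the upload date.
    """
    expiry = add_months(archived_on, 10 * 12)
    return {
        "expiry": expiry,
        "warning_mail": add_months(expiry, -6),
        "latest_extension_request": add_months(expiry, -3),
    }


if __name__ == "__main__":
    for name, day in retention_dates(date(2020, 5, 15)).items():
        print(name, day.isoformat())
```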
How can I make sure my data did not change? Do you support checksums?
We store an MD5 checksum for every file. When the file is read, the checksum is computed again and compared with the stored checksum. If they do not match, the file is not delivered to the user. For detailed information see https://www.rda.kit.edu/img/FAQ-bwDataArchiv%20Data%20Protection%20%20-%20V2.pdf
In addition, at a more basic level the data is protected with checksums both on disk and on tape.
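The check-on-read principle described above can be sketched in a few lines. This is a toy model for illustration only, not the code used by the archive. The record layout and the names `store`, `read` and `ChecksumMismatch` are made up for this example.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when the stored and the recomputed checksum differ."""


def store(data):
    """Keep the data together with the MD5 checksum computed at write time."""
    return {"data": data, "md5": hashlib.md5(data).hexdigest()}


def read(record):
    """Recompute the checksum on read and refuse delivery on a mismatch."""
    if hashlib.md5(record["data"]).hexdigest() != record["md5"]:
        raise ChecksumMismatch("stored and recomputed MD5 differ")
    return record["data"]


if __name__ == "__main__":
    record = store(b"measurement data")
    print(read(record) == b"measurement data")
```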
Can I see checksums?
At the moment you cannot see the checksums. Please contact us by e-mail. We will find a solution.
How much data can I store?
The amount of data you can store depends on the agreement between your home organisation and the bwDataArchive service. Your organisation receives accounting data and is responsible for any measures that limit the amount. The service itself applies no quota limitation, for the following reasons:
- Transfers may take several hours or more. When the quota limit is passed, pending transfers fail unexpectedly. For large transfers involving thousands of files it quickly becomes impossible to tell which files were transferred and which are still pending.
- Selecting data and deciding which files should be archived and which can be discarded is cumbersome and may take more time and cost more than can reasonably be expected from the researcher.
System wide limits
maximum file size: 600 GB
maximum retention time: 10 years
maximum number of files: 10^12
maximum throughput (GridFTP): ~1000 MB/s
maximum throughput (sftp, single stream): ~70 MB/s
recommended minimal file size: 100 MB
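To get a feeling for these limits, the best-case transfer time of a file can be estimated from its size and the throughput figures above. This is a rough sketch; it assumes 1 GB = 1000 MB and a sustained transfer at the maximum rate, which real transfers will usually not reach. The function name is made up for this example.

```python
def transfer_hours(size_gb, throughput_mb_per_s):
    """Estimate the transfer time in hours, using 1 GB = 1000 MB."""
    seconds = size_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    for label, rate in [("sftp, single stream", 70), ("GridFTP", 1000)]:
        print(f"600 GB via {label}: about {transfer_hours(600, rate):.1f} h")
```

For a file of the maximum size of 600 GB this gives roughly 2.4 hours over a single sftp stream and about 10 minutes over GridFTP.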
Upcoming features
We are working hard to implement the following requests.
Security
Data transferred to and from the research data archive is encrypted using 2048-bit RSA keys. This ensures secure communication and protects against eavesdropping. Additionally, the encryption guarantees that the data arriving at the archive is the same as the data sent to the archive: the encrypted channel functions as a 2048-bit error detection system. Before the first communication between hosts, a key exchange is needed. During the initial SSH connection a fingerprint of the server key is generated, which the client can use to verify the authenticity of the server.
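For illustration, the fingerprint that OpenSSH clients display in the SHA256 format can be recomputed from a public key line, for example to compare it with a fingerprint obtained from the service through a trusted channel. This is a minimal sketch; the function name is made up, and which key types and fingerprint formats the archive servers actually present is not stated here.

```python
import base64
import hashlib


def sha256_fingerprint(public_key_line):
    """Return the OpenSSH-style SHA256 fingerprint of a public key line.

    The line has the form '<type> <base64 blob> [comment]', as found in a
    .pub file or, after the host field, in known_hosts.
    """
    blob = base64.b64decode(public_key_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the base64 digest without trailing '=' padding.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

If the fingerprint shown by your client at the first connection differs from the one published by the service, do not continue and contact support.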
Assistance
I have a question. Who do I ask?
See Support and help: https://www.rda.kit.edu/english/65.php
I did everything right. Still my client cannot access the archive. What could be wrong?
Please contact the bwDataArchive admins by e-mail or, if you are a user from Baden-Wuerttemberg, alternatively via the Baden-Württemberg Support Portal https://bw-support.scc.kit.edu/. Describe your problem and what you have done so far, and add supporting material such as screenshots.
Registration and access
Where do I register for the service?
I have registered but login fails. What is wrong?
Maybe the registration workflow did not finish completely. This can happen because of network errors or unexpected browser behaviour. Go to https://bwidm.scc.kit.edu/user/index.xhtml, log in with your credentials and unregister from the service. Then register again. You will receive an email after you have registered successfully.
I lost my password. How do I obtain a new one?
Log in using Shibboleth (https://www.rda.kit.edu/bwDA/shibbLogin.php), click on 'manage your account' and change your password.
Why do I need a different password for the archive? Can't I use the one I use at my home institution?
The data will stay in the archive for at least 10 years. By that time you may have left your organisation and lost your account there, but your data is still archived and accessible only with the archive-specific account. Of course you can set the same password for convenience, but this is not recommended for security reasons.
Usage tips
I have a suggestion for improvement. What is the reward if my idea gets implemented?
You will be named on the bwDataArchive Hall of Fame pages and are eligible for 10 years of 1 TB of free storage.
Do you have recommendations regarding the size of the files?
The objects you store should be as large as possible. But "what is large?", you may ask. Remember that your data is stored on tape. Access to data includes time for locating, winding and positioning. Although current tape drives can read at speeds over 300 MB/s, positioning time dominates as files become smaller. So if you can easily construct files of several gigabytes, please do so. The system accepts files up to 600 GB, but uploading or downloading 600 GB takes a considerable amount of time. On the other hand, the system (HPSS) aggregates smaller files into larger compounds in order to speed up writing. However, we do NOT recommend writing many small files, since reading them back may take much longer than reading a single aggregated tar file.
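One way to follow this recommendation is to pack a directory of small files into a single tar file before uploading. This is a minimal sketch using the Python standard library; the function name and paths are made up for this example, and the tar file should be written outside the directory being packed.

```python
import tarfile
from pathlib import Path


def bundle_directory(source_dir, archive_path):
    """Pack a directory tree into one uncompressed tar file.

    Returns the number of regular files stored in the archive.
    """
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w") as archive:
        # arcname keeps only the directory name, not the full local path.
        archive.add(str(source), arcname=source.name)
    with tarfile.open(str(archive_path), "r") as archive:
        return sum(1 for member in archive if member.isfile())
```

The archive is written uncompressed (mode "w") to keep packing fast; whether compression pays off depends on your data.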
Do many small files take up more space than a single large file?
Yes on disk, no on tape. The archive system (HPSS) caches files on disk before sending them off to tape. To keep up speed during IO, disks have fixed-size allocations. Strictly speaking, it is the file system software managing the disk space that determines the allocations. Since there is no file system on tape, the space used for one large file is the same as for 10 files, each one tenth of the size of the large file. The answer to the question is therefore: after the data has been migrated to tape, small files do not take up more space. (This does not take into account that each file on tape has a small header of a few bytes, so many very small files still take up slightly more tape space than the equivalent content in one large file. HPSS aggregates small files into larger objects, so the question is academic.) We do not encourage writing many small files (see above).
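The effect of fixed-size allocations on disk can be illustrated with a small calculation. The allocation unit of 4 MiB used here is purely hypothetical and chosen only for this example; the actual value used by the archive is not stated in this FAQ.

```python
def allocated_bytes(file_sizes, allocation_unit):
    """Sum the disk space used when each file is rounded up to whole units."""
    return sum(
        ((size + allocation_unit - 1) // allocation_unit) * allocation_unit
        for size in file_sizes
    )


if __name__ == "__main__":
    unit = 4 * 1024 * 1024  # hypothetical allocation unit of 4 MiB
    one_large = allocated_bytes([10_000_000], unit)
    ten_small = allocated_bytes([1_000_000] * 10, unit)
    print(f"one file of 10 MB:  {one_large} bytes allocated")
    print(f"ten files of 1 MB:  {ten_small} bytes allocated")
```

With these assumed numbers the large file occupies 3 units, while the ten small files occupy 10 units for the same content.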
I want to routinely create and validate checksums of large amounts of files
This tool may be of help: Hash build and check. Also check the other entries regarding checksums in this FAQ.
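If you prefer to script this yourself, the following minimal sketch builds a table of MD5 checksums for a directory tree and later checks a copy of the tree against it. It is not the tool mentioned above; the function names are made up for this example, and the resulting dictionary could, for instance, be saved as JSON next to the data.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its MD5 hex digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_md5(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose digest differs."""
    root = Path(root)
    failed = []
    for relative, expected in manifest.items():
        path = root / relative
        if not path.is_file() or file_md5(path) != expected:
            failed.append(relative)
    return failed
```

A typical routine would be to call `build_manifest` before uploading and `verify_manifest` on the downloaded copy; an empty list means that all files match.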
Accessing my data takes a long time or my client runs into a timeout. Why?
Long response time may be due to several reasons:
- Was your data stored a long time ago? Then it is probably no longer on disk and has to be copied in from tape. This may take several minutes (or even hours), depending on the current archive data traffic.
- Retrieval of lots of small files takes _much_ longer than of a few large files. We do not guarantee access times for files < 100 MB and recommend GB-sized files for best performance.
- Something is broken (but we are fixing it).
I deleted [a file | some files | a directory | my files]. Can I recover the lost data?
Straight answer: No.
Errors while transferring data
Please check the following:
- Do you have read permission on the files you would like to transfer?
- Symbolic links may point to files you have no permission to read.
- Did you change into the private directory? Your home directory is read-only!
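The first two checks can be done locally before starting a transfer. The following is a minimal sketch for POSIX systems; the function name is made up for this example.

```python
import os


def find_unreadable(root):
    """Return paths under root that the current user cannot read.

    os.access follows symbolic links, so broken links and links that
    point to unreadable targets are reported as well.
    """
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.access(path, os.R_OK):
                problems.append(path)
    return sorted(problems)


if __name__ == "__main__":
    for path in find_unreadable("."):
        print("cannot read:", path)
```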