BwDataArchiv FAQs

Revision as of 14:06, 26 April 2017 by Jvw (talk | contribs) (System wide limits)

About the service

Who are the designated users?

Employees of KIT, universities and institutions in Baden-Wuerttemberg

What's with the bw in the name bwDataArchiv?

bw stands for Baden-Wuerttemberg, the state in which KIT is located. The initial investments and the development of the service were funded by the Ministry of Science, Research and the Arts of Baden-Wuerttemberg. The name bwDataArchiv should be self-explanatory, except maybe for the missing 'e' in Archiv, which is the German word for archive. The English name of the bwDataArchiv brand is RDA, the Research Data Archive (with e :-)).

Multiple service models

The service offers three service models. Your home organisation can apply for a contract with bwDataArchiv. It's up to your home organisation to decide who has access and is allowed to store data. Universities in Baden-Wuerttemberg typically have an IDP (identity provider), which forwards the entitlement for users allowed to use the archive (idp-service-model). Organisations outside the BWIDM federation designate an administrator who can invite users to register for the service using an advanced administration portal (admin-service-model). The third model is the sa-service-model, in which a single service account (sa) from an organisation or project is used to allow an application to store data in the archive. The projects RADAR and bwDataDiss use this model.

Technology

What storage technologies do you use?

See https://www.rda.kit.edu/technologie.php

What is HPSS?

HPSS (High Performance Storage System) is a data management application that is being developed at several computer centres that require long-term storage for large amounts of data. See the HPSS web site for more detailed information.

How is the data securely stored?

Data is stored on magnetic tape. We use the following tape drives and technologies: LTO5 (max. 1.5 TB per cartridge), STK 10kC (max. 4 TB per cartridge), STK 10kD (max. 8 TB per cartridge) and IBM TS1140 (max. 4.5 TB per cartridge).

Features

I have a suggestion for improvement. What is the award if my idea gets implemented?

You will be named on the bwDataArchiv Hall of Fame pages and are eligible for 10 years of 1 TB of free storage.

How many copies of the data are made and where are they stored?

All data in bwDataArchiv has at least two copies. Data is first written to disk and from there duplicated to tape storage. There are tape libraries in two data centres at Campus North (CN) as well as at Campus South (CS).

How long will the data remain in the archive?

The regular retention time for files on bwDataArchiv is ten years. After this ten-year period bwDataArchiv will delete your data. A warning message is sent six months in advance to the registered mail addresses. (This is probably the most important reason to keep at least one of the two possible mail addresses up to date.) Contact us at least three months in advance to have the retention time extended. If you want to end the cooperation with bwDataArchiv, or if your data is no longer needed, you can delete your data yourself.
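To keep track of these deadlines on your side, the dates can be worked out with a few lines of Python. This is only a sketch: the FAQ does not say exactly how the archive counts the period, so the code assumes that retention starts on the day the file was stored. The names `shift_months` and `retention_dates` are made up for this example.

```python
import calendar
from datetime import date

def shift_months(day, months):
    """Move a date by a number of months, clamping to the last day of the month."""
    index = day.year * 12 + (day.month - 1) + months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))

def retention_dates(stored_on):
    """Return the assumed expiry, warning and contact dates for a stored file."""
    expiry = shift_months(stored_on, 10 * 12)  # ten-year retention (assumed start: storage date)
    return {
        "expiry": expiry,
        "warning_mail": shift_months(expiry, -6),      # warning is sent six months ahead
        "contact_deadline": shift_months(expiry, -3),  # ask for an extension by this date
    }
```

For a file stored on 26 April 2017, `retention_dates(date(2017, 4, 26))` gives an expiry date of 26 April 2027, a warning mail around 26 October 2026 and a contact deadline of 26 January 2027.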

How can I make sure my data did not change? Do you support checksums?

We store an MD5 checksum for every file. When a file is read, the checksum is computed again and compared with the stored checksum. If they do not match, the file is not delivered to the user. For detailed information see https://www.rda.kit.edu/img/FAQ-bwDataArchiv%20Data%20Protection%20%20-%20V2.pdf
In addition, at a more basic level the data is protected with checksums on disk and on tape.

Can I see the checksums?

At the moment you cannot see the checksums. Please contact us by e-mail and we will find a solution.
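The verify-on-read behaviour described above can be illustrated with a small Python sketch. It is not the HPSS implementation; `read_verified` and `ChecksumMismatch` are hypothetical names, and the example keeps the whole file in memory, which is only sensible for small files.

```python
import hashlib

class ChecksumMismatch(Exception):
    """Raised when the recomputed checksum differs from the stored one."""

def read_verified(path, stored_md5, chunk_size=1024 * 1024):
    """Read a file and return its content only if its MD5 matches stored_md5."""
    digest = hashlib.md5()
    chunks = []
    with open(path, "rb") as handle:
        # Read in chunks so the checksum is updated incrementally.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            chunks.append(chunk)
    if digest.hexdigest() != stored_md5:
        # Mirror the documented behaviour: no match, no delivery.
        raise ChecksumMismatch(path)
    return b"".join(chunks)
```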

System wide limits

maximum file size: 600 GB

maximum retention time: 10 years

maximum number of files: 10^12

maximum throughput (GridFTP): ~1000 MB/s

maximum throughput (sftp, single stream): ~70 MB/s

recommended minimum file size: 100 MB
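The limits above can be turned into a quick pre-upload check. The following Python sketch assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes), which the FAQ does not specify; the function names are invented for this example, and the throughput figures are the approximate maxima listed above, not guaranteed rates.

```python
MAX_FILE_BYTES = 600 * 10**9          # maximum file size (assuming decimal GB)
MIN_RECOMMENDED_BYTES = 100 * 10**6   # recommended minimum (assuming decimal MB)

def classify_size(size_bytes):
    """Classify a file size against the system-wide limits."""
    if size_bytes > MAX_FILE_BYTES:
        return "too large"
    if size_bytes < MIN_RECOMMENDED_BYTES:
        return "below recommended size"
    return "ok"

def transfer_hours(size_bytes, rate_mb_per_s):
    """Best-case transfer time in hours at a sustained rate in MB/s."""
    seconds = size_bytes / (rate_mb_per_s * 10**6)
    return seconds / 3600
```

At the quoted rates, a file of the maximum size needs about 2.4 hours over a single sftp stream (`transfer_hours(600 * 10**9, 70)`) and about 10 minutes over GridFTP (`transfer_hours(600 * 10**9, 1000)`).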

Features (in preparation)

Storing data in shared area

Security

Encryption

Data travelling to and from the research data archive is encrypted using 2048-bit RSA keys. This secures the communication and protects against eavesdropping. In addition, the encryption guarantees that the data arriving at the archive is the same as the data sent to the archive: the encrypted channel acts as a 2048-bit error detection system. Before the first communication between hosts, a key exchange is needed. During the initial SSH handshake a fingerprint is generated, which the client can use to verify the authenticity of the server.
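If you want to see what such a fingerprint is, the Python sketch below derives an OpenSSH-style SHA256 fingerprint from the base64 part of a public host key (the second field of an `ssh-rsa AAAA...` line). The function name is made up for this example, and no real bwDataArchiv key is shown here; obtain the genuine host key or fingerprint from a trusted source before comparing.

```python
import base64
import hashlib

def sha256_fingerprint(key_blob_b64):
    """Return the OpenSSH-style SHA256 fingerprint of a base64 public key blob."""
    raw_key = base64.b64decode(key_blob_b64)
    digest = hashlib.sha256(raw_key).digest()
    # OpenSSH prints the base64 digest without the trailing '=' padding.
    encoded = base64.b64encode(digest).decode("ascii").rstrip("=")
    return "SHA256:" + encoded
```

The result has the same form as the fingerprint an SSH client displays on first connection, so the two strings can be compared character by character.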

Assistance

I have a question. Who do I ask?

Support and help https://www.rda.kit.edu/english/65.php

I did everything right. Still my client cannot access the archive. What could be wrong?

Please contact bwDataArchiv by e-mail or, if you are a user from Baden-Württemberg, alternatively via the Baden-Württemberg Support Portal at https://bw-support.scc.kit.edu/. Describe your problem and what you have done, and add, for example, some screenshots.

Registration and access

Where do I register for the service?

Visit https://www.rda.kit.edu/bwDA

I have registered but still cannot access the service. What is wrong?

Maybe the registration workflow did not finish completely. This can happen because of network errors or unexpected browser behaviour. Go to https://bwidm.scc.kit.edu/user/index.xhtml, log in with your credentials and unregister from the service. Then register again. You will receive an email after you have registered successfully.

I lost my password

Why do I need a different password for the archive? Can't I use the one I use at my home institution?

  • The data will stay in the archive for at least 10 years. By then you may have left the organisation while your data is still there.

Usage tips

Do you have recommendations regarding the size of the files?

The objects you store should be as large as possible. But "what is large?", you may ask. Remember that your data is stored on tape. Access to data includes time for locating, winding and positioning. Although current tape drives can read at speeds over 300 MB/s, the positioning time dominates as files become smaller. So if you can easily construct files of several gigabytes, please do so. The system accepts files up to 600 GB, but uploading and downloading 600 GB takes a considerable amount of time. On the other hand, the system (HPSS) aggregates smaller files into larger compounds in order to speed up writing. The bottom line is that if you are not routinely generating large (multiple GB) files, forget about the size and store the files without ZIPping.
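The effect of positioning time can be made concrete with a rough calculation. In the Python sketch below, the drive rate of 300 MB/s comes from the text above, while the positioning time of 60 seconds is purely an assumed figure for illustration; real values vary with the tape, the drive and the current load.

```python
def effective_rate_mb_s(file_mb, drive_rate_mb_s=300.0, positioning_s=60.0):
    """Effective read rate for one file, including an assumed positioning time."""
    read_s = file_mb / drive_rate_mb_s   # time spent actually streaming data
    return file_mb / (positioning_s + read_s)
```

Under these assumptions a 100 MB file is read at under 2 MB/s effectively, while a 10 GB file (`effective_rate_mb_s(10000)`) reaches about 107 MB/s. The effective rate only approaches the drive rate for very large files.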

Do many small files take up more space than a single large file?

Yes on disk, but not on tape. The archive system (HPSS) caches files on disk before sending them off to tape. To keep up speed during I/O, disks have fixed-size allocations. Strictly speaking, it is the file system software managing the disk space that determines the allocations. Since there is no file system on tape, the space used for one large file is the same as for 10 files that are each one tenth of the size of the large file. The answer to the question is therefore: after data has been migrated to tape, small files do not take up more space. (This does not take into account that each file on tape has a small header of a few bytes, so many very small files still take up more tape space than the equivalent content in one large file. HPSS aggregates small files into larger objects, which makes the question academic.)
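The disk side of this answer is simple rounding arithmetic, sketched below in Python. The allocation unit of 4 MiB is an assumption chosen for the example; the FAQ does not state the actual allocation size used on the disk cache.

```python
def allocated_bytes(file_sizes, allocation_unit):
    """Disk space used when every file is rounded up to whole allocation units."""
    total = 0
    for size in file_sizes:
        units = -(-size // allocation_unit)   # ceiling division
        total += units * allocation_unit
    return total

ALLOCATION_UNIT = 4 * 1024 * 1024   # assumed 4 MiB, for illustration only

one_large = allocated_bytes([10_000_000], ALLOCATION_UNIT)
ten_small = allocated_bytes([1_000_000] * 10, ALLOCATION_UNIT)
```

With this assumed unit, the single 10 MB file occupies three units (12,582,912 bytes), while the ten 1 MB files occupy ten units (41,943,040 bytes), although both hold the same amount of data.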

I want to routinely create and validate checksums of large amounts of files

The tool "Hash build and check" may be of help. Also check the other entries regarding checksums in this FAQ.
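If you prefer a script of your own, the following Python sketch shows one possible approach: build a manifest of MD5 checksums for a directory tree before uploading, and check a retrieved copy against it later. It is not the tool mentioned above, and the function names are invented for this example.

```python
import hashlib
import os

def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 checksum."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = file_md5(full_path)
    return manifest

def verify_manifest(root, manifest):
    """Return the sorted relative paths that are missing or have changed."""
    failed = []
    for relative_path, expected in manifest.items():
        full_path = os.path.join(root, relative_path)
        if not os.path.isfile(full_path) or file_md5(full_path) != expected:
            failed.append(relative_path)
    return sorted(failed)
```

An empty list from `verify_manifest` means every file listed in the manifest is present and unchanged. Files that exist under `root` but are not in the manifest are not reported.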

Accessing my data takes a long time. Why?

A long response time may be due to several reasons:

  • Was your data stored a long time ago? Then it is probably no longer on disk and has to be copied in from tape. This may take up to several hours, depending on the current archive data traffic.
  • Retrieval of lots of small files takes longer than of a few large files.
  • Something is broken (but we are fixing it).

I deleted [a file, some files, a directory, my files]. Can I recover the lost data?

Straight answer: No.

Transferring data

What protocols do you support?

The service supports sftp and GridFTP for uploading and downloading. See https://www.rda.kit.edu/english/transmit.php