Difference between revisions of "hidden:BWA ProjectPlanning item SM9 COS reconfiguration and data migration"

From Lsdf
(Telco IBM 9.9.2015)
(Mails since October)

Revision as of 18:43, 20 November 2015

  • Status report HPSS-Statusbericht-10.8.2015, Ahmad, 10.8.2015
    • Long discussions with HPSS support about the most suitable method
    • scrub changecos, 50 TB tape and 17 TB disk each. How: Tape-oldDSC-NewDSC-Tape
    • mv produced an error message that HPSS could not explain
      • The file was copied nevertheless
      • Unsupported change COS
      • How to: scrub change COS is to be modified. Change COS starts at the tape. Disk. Tape
      • The number of drives must also be specified.
      • Since there are only two drives, the whole system could be blocked.
      • Only one drive is used now
      • Long run time with one drive
      • Split into chunks of 15000 files
      • Waiting for drives from the test system
      • cp causes even more errors, e.g. 'Path is too long'
    • Scripts prepared. Bottleneck: number of drives (only two)
    • Change COS after the Power8 tests, so that more drives are available
    • Tapes 'read only'? Data will not be mixed.
    • Always the same tape class for the new and the old COS
    • Experience from Thomas Bönisch with fewer files: 3 minutes per file
    • Snapshot of the data in the old COS; new mount point; all incoming files are written to the new COS
    • Migrate all data to tape beforehand
    • The COSs can coexist
    • The old COS can remain in place for longer; it can also be changed later.
    • Announce downtime
    • Migration to tape
    • Snapshot of all files on tape with all IDs
    • New mount point
      • New path
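
The splitting of the file list into chunks of 15000 files noted above could be sketched as follows. This is a minimal illustration only: the function name `chunk_paths` and the idea of holding the paths in an in-memory list are assumptions, and the scripts actually used for the change COS run are not part of these notes.

```python
from typing import Iterator, List


def chunk_paths(paths: List[str], chunk_size: int = 15000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size paths (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(paths), chunk_size):
        # The last slice may be shorter than chunk_size.
        yield paths[start:start + chunk_size]
```

Each yielded chunk could then be handed to one change COS run, which keeps a single run bounded when only one tape drive is available.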
  • Report Ahmad, 24.8.2015
    • Bitfile ID and date remain unchanged by change COS
    • 2 problems with files from Ursula could not be solved
    • A change COS report will be written
  • Report Ahmad and telco with IBM, 02.09.2015
    • No answer yet from Scott regarding the 30000 files that were not copied
      • The developer's answer is expected in the afternoon
    • Read only files?
    • Afterwards the changes for E-to-E protection can begin
  • Telco with IBM, 9.9.2015
    • The script is not finished yet
      • Scott meets the developer in the afternoon and possibly on Friday
      • It is still unclear where the files are
      • Procedure: metadata are compared
  • Mail Jonathan 6 Oct 2015
    • As for the changeCOS, I heard the "rescue script" was finished. This item can be worked on this week. And I think it also would be good to talk a bit about why it didn't work in the first place (special characters) and what can be learned from that for the future.
  • Mail Jonathan 7 Oct 2015
    • HPSS rescue script
      • Some files do not have a full path, possibly because of special characters
      • Scott will work with high priority on it this week
  • Mail Scott 8 Oct 2015
    • I am still having trouble capturing the paths of a large number of the files. The searches of the HPSS namespace are not returning paths as expected, even with a manual query. I have escalated this to development for some assistance.
    • The list of offending files can be found here: /tmp/findFile/file-path-list--1204--15-09-23--20:14:43.log
    • You will see that a large number of them do not have the full paths. I have submitted the steps I have run to development and I am waiting for their response. My expectation is that they will ask for some DB2 queries, so I submitted those as well. I will continue to work on this as a high priority and send an update once I hear back from them.
  • Mail Scott 09 Oct 2015
    • I just swapped emails with development, and after a quick modification to the approach, I have captured all the paths of the offending files. You can find the output file here: /tmp/findFile/path.out
  • From mail Dorin, Fri, 9 Oct
    • Reasons why these files were not put on the change COS list (only a random sample was analysed).
    • Two kinds of problems:
    • 1. File names containing blank characters:
      • Ex.: /SFTP/KIT/kr7047/Load_Test/OS Images for VB/FreeBSD-10.0-RELEASE-amd64-disc1.iso
      • Because of an awk field selection in our script, we missed files of this kind. We can correct it.
    • 2. Files of 0 bytes:
      • Ex.: /SFTP/KIT/ks2420/Isabel_Kraut/Output/nest3/pf7_all/Job_ic2_151868.err
      • scrub> getattrs /SFTP/KIT/ks2420/Isabel_Kraut/Output/nest3/pf7_all/Job_ic2_151868.err
        • DataLength : 0 bytes
        • DontPurge : Off
        • scrub> dump
        • /SFTP/KIT/ks2420/Isabel_Kraut/Output/nest3/pf7_all/Job_ic2_151868.err
            Storage Level 0:
            No segments at this level
            Storage Level 1:
            No segments at this level
            Storage Level 2:
            No segments at this level
        • Since we created the list from the tape content, we missed these files too (they are not on tape).
        • We do not know yet how we got these files. Are they really empty files, or was there an error at transfer time? For example, the user started a file transfer and the file was created in HPSS, but the data transfer subsequently failed.
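
The first problem, paths being cut off by a whitespace-based field selection, can be illustrated with a small sketch. The listing format assumed here (an identifier, one blank, then the path) and the identifier `12345` are hypothetical; the real script and its input format are not shown in these notes. `path_naive` mimics what a selection such as awk's `$2` does, while `path_safe` splits off only the first field.

```python
def path_naive(line: str) -> str:
    """Return only the second whitespace-separated field, like awk '{print $2}'."""
    return line.split()[1]


def path_safe(line: str) -> str:
    """Split off the first field only; the remainder is the path, blanks included."""
    return line.rstrip("\n").split(None, 1)[1]


# Hypothetical listing line; only the path itself is taken from the mail above.
example = "12345 /SFTP/KIT/kr7047/Load_Test/OS Images for VB/FreeBSD-10.0-RELEASE-amd64-disc1.iso\n"
truncated = path_naive(example)  # stops at the first blank inside the path
complete = path_safe(example)    # keeps the whole path
```

With the naive selection the path ends at "OS", so the file can never be matched later; limiting the number of splits avoids that without any special quoting.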
  • Mail Scott 9 Oct 2015
    • I have sent an email off to development about how to changecos the zero byte files.
    • As for why they are zero bytes, both of your scenarios are possible. They could be empty error files, like in your example, or they could be a failed transfer. A failed transfer should have some logs associated with it. You can check the time stamp of the file and compare it to the HPSS logs to see if there was a problem somewhere.
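
Comparing a file's time stamp with log entries could be sketched like this. The log line layout (a leading `YYYY-MM-DD HH:MM:SS` time stamp) is purely an assumption for illustration and is not the documented HPSS log format; `log_lines_near` and the five-minute window are likewise hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

# Assumed layout of the time stamp at the start of each log line.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def log_lines_near(lines: Iterable[str], file_time: datetime,
                   window: timedelta = timedelta(minutes=5)) -> List[str]:
    """Return the log lines whose leading time stamp lies within window of file_time."""
    matches = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], LOG_TIME_FORMAT)
        except ValueError:
            continue  # line does not start with a time stamp in the assumed format
        if abs(stamp - file_time) <= window:
            matches.append(line)
    return matches
```

The file time could be taken from `datetime.fromtimestamp(os.stat(path).st_mtime)` on the mounted file system; local time versus UTC in the logs would need to be checked first.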
  • Mail Scott 12.10.2015
    • I spoke to Jae, and he told me that one solution is to capture all the attributes of the file (owner, permissions, etc.) and then delete and recreate the file using the correct COS. Since they are zero byte files, this can work. The limitation here is that the time stamps will be changed.
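
A minimal sketch of the capture, delete and recreate idea, assuming the files are reachable through a POSIX mount (for example the FUSE mount) and that new files created through that mount land in the correct COS. Both points are assumptions; how the COS is actually selected is not covered here, and `recreate_empty_file` is a hypothetical helper, not an HPSS tool.

```python
import os
import stat


def recreate_empty_file(path: str) -> bool:
    """Delete and recreate a zero-byte file, restoring mode and, if permitted, owner.

    Returns True if the original owner and group could be restored.
    Time stamps are not preserved, as noted in the mail.
    """
    before = os.stat(path)
    if before.st_size != 0:
        raise ValueError("refusing to recreate a non-empty file: %s" % path)
    os.remove(path)
    # O_EXCL makes sure we really create a new file rather than reuse one.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    os.close(fd)
    try:
        os.chown(path, before.st_uid, before.st_gid)
        owner_restored = True
    except PermissionError:
        owner_restored = False  # changing the owner usually needs privileges
    # Set the mode last, since chown may clear set-user-ID bits.
    os.chmod(path, stat.S_IMODE(before.st_mode))
    return owner_restored
```

The size check guards against deleting a file that turned out to contain data after all, which matters here because many of the zero byte files were not expected to be empty.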
  • Mail Ahmad, 14 Oct 2015
    • I've just copied two file lists to the core server:
      • /tmp/findFile/file_attrs_without_spaces.txt (list of the zero byte files you asked for)
      • /tmp/findFile/file_attrs_with_spaces.txt (list of the files with spaces in the file name)
  • Mail Ahmad 14 Oct 2015
    • The list from Jos with 19K non-zero files: /tmp/findFile/jos_non_zero_files.txt
    • These file paths should be included in /tmp/findFile/file_attrs_without_spaces.txt
  • Mail Jonathan, 14 Oct 2015
    • change COS
      • The files with white spaces in their file names have been migrated to the new COS successfully.
      • The majority of the unmigrated files are zero byte files. These files are not on tape and therefore do not appear on a file list retrieved from tapes. Need to evaluate how to proceed here.
      • About two thirds of these zero byte files are known by the file creator (Jos) not to be empty outside HPSS. Need to investigate why they have zero size within HPSS. Ahmad has prepared a list of file attributes. Scott will investigate the log files.
  • Mail Jonathan 21.10.2015
    • changeCOS
      • Files with white spaces in name have been migrated successfully.
      • For the zero byte files issue:
        • The HPSS log files do not indicate connection errors related to times when the files in question were written into HPSS.
        • Please ask Jos to try to put some of the files in question into HPSS again. If errors are seen on the client side, or if files are again empty where they should not be, we need to look into the /var/log/messages files on the gateway server.
        • Also, we can turn on additional logging for the FUSE client.
  • Mail Scott 26.10.2015, setting up logs
    • Here is the info that I mentioned in the call. Because there is no logging available for the times when Jos's files failed to transfer, we should first look at the user's files. I know that their files may in fact be zero byte files, but if there is an error associated with their time stamps, it may give us a clue as to the problem.
      • Following this, we should turn API logging on, on the client. We can do this by editing the /var/hpss/etc/env.conf file to contain the following entries:
        • HPSS_API_DEBUG_PATH=/var/hpss/tmp/API_DEBUG.out
        • HPSS_API_DEBUG=7
    • For the mount to pick up this extra API logging, we need to dismount and remount the file system. If that is not an option, we can create an API resource file. This will log the same things as above, but without remounting the file system. To start it:
      • vi /var/hpss/tmp/hpss.api.resource
      • and then put this line in it, without the quotes:
      • '7 /var/hpss/tmp/API_DEBUG.out'
    • In 30 seconds, the CLAPI will take the trigger from the resource file and begin logging the API debug messages.
    • We can also increase VFS-FUSE logging. This requires that we dismount and remount the FS. The developer told me that it should log any errors in case the mount cannot be dismounted. To turn on the TRACE logging, we just need to add 'trace=5' as one of the mount options and remount. This will log everything, so if we turn it on, we will need to plan to turn it off again so as not to impact performance on that node.
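
The resource-file variant described in this mail amounts to writing one line, the debug level followed by the output path, into the resource file. A small sketch of that step is shown below; `write_api_resource` is a hypothetical helper, and the only facts taken from the mail are the line layout, the level 7 and the file locations.

```python
from pathlib import Path


def write_api_resource(resource_file: str, debug_out: str, level: int = 7) -> str:
    """Write '<level> <debug_out>' as the single line of the API resource file."""
    line = "%d %s\n" % (level, debug_out)
    Path(resource_file).write_text(line)
    return line
```

On the client this would be called as `write_api_resource("/var/hpss/tmp/hpss.api.resource", "/var/hpss/tmp/API_DEBUG.out")`; the debug output can grow quickly, so the target file system should have enough free space.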
  • Mail Ahmad, 28.10.2015
    • 1. I have just activated the API logging with the second option, using the resource file
      • /var/hpss/tmp/hpss.api.resource
      • But I changed the API_DEBUG.out path to /data. /data has about 75 GB of space.
      • # cat /var/hpss/tmp/hpss.api.resource
      • 7 /data/hpss/tmp/API_DEBUG.out
    • 2. I also tried option one, editing /var/hpss/etc/env.conf and adding both environment variables
      • HPSS_API_DEBUG_PATH=/data/hpss/tmp/API_DEBUG.out
      • HPSS_API_DEBUG=7
    • But a remount was not possible because some users were logged in via sftp.
    • 3. Option three, adding trace=5 to the mount options, has not been done for the same reason as in 2, but if needed I can do this later.
  • Mail Scott 28.10.2015
    • We can make a couple of runs with the API logging enabled, and if this doesn't provide enough info, we can consider dismounting the FS. The FUSE developer told me that all failed transfers to HPSS should be logged, just not in the level of detail that comes with TRACE.
    • Thanks for the update. Please let us know what you find in the logs.
  • Mail Ahmad, 28.10.2015
    • sftp logging has now been set up on both archive-sftp-01 and archive-sftp-02.
      • Location: /var/log/sftp
      • I will ask Jos to repeat the copy to HPSS. Maybe he can do this tomorrow.
      • Should we mount the older COS (1204) again for this test? Or can we use the new working COS 1205? Does it matter?
        • Mail Scott, 28.10.2015: Let's leave COS 1205 mounted. It should not matter since we are looking for new transfer errors.
  • Mail Jonathan 30.10.2015
    • changeCOS
      • Additional logging for FUSE and separate logs for SFTP have been enabled (see Ahmad's emails from Oct. 28). New transfers with data that has failed before may or may not provide new evidence.
  • Mail Tobias 18.11.2015
    • Change COS
      • All the files have the same checksum (fuse/hpsssum). The next step for KIT is to contact the user and try to get the deleted files back from backup. The open question is still what happens with the zero byte files. The problem now is that there are no error logs and the failure could not be reproduced. Nevertheless, Scott will contact the FUSE developers.
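
A checksum comparison of this kind can be sketched with Python's `hashlib`, for instance to compare a file read through the FUSE mount with the user's original copy. The choice of SHA-256 and the helper names are assumptions; which algorithm the fuse/hpsssum comparison used is not stated in these notes.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256", block_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in blocks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str, algorithm: str = "sha256") -> bool:
    """True if both files have the same digest under the given algorithm."""
    return file_digest(path_a, algorithm) == file_digest(path_b, algorithm)
```

Two zero byte files always compare as equal, so a matching checksum alone does not show that a transfer succeeded; the size should be checked as well.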
  • Mail Ahmad, 18.11.2015
    • A list of all zero byte files of the affected users has been created
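
A list of zero byte files like the one mentioned above could be produced by walking a user's directory on the mounted file system. This is a sketch under that assumption; `find_zero_byte_files` is a hypothetical helper, and how the actual list was generated is not recorded here.

```python
import os
import stat
from typing import List


def find_zero_byte_files(root: str) -> List[str]:
    """Return the sorted paths of all regular files of size 0 below root."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # lstat: do not follow symbolic links
            if stat.S_ISREG(info.st_mode) and info.st_size == 0:
                empty.append(path)
    return sorted(empty)
```

Because the paths are kept as whole strings, file names containing blanks are handled without special treatment.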