hidden:System Logs (HPSS, TSM)
Format of the entries
date: service [hpss|tsm|druva]: admin name: record: link to further info in gridka or lsdf wiki
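A minimal sketch, in Python, of how an entry in the format above could be split into its fields. The names `LogEntry` and `parse_entry` are illustrative assumptions and not part of any existing tooling. The optional wiki link is simply left inside the record text, and the older entries copied from the text file do not follow this format strictly.

```python
from datetime import date
from typing import NamedTuple


class LogEntry(NamedTuple):
    day: date
    service: str
    admin: str
    record: str


def parse_entry(line: str) -> LogEntry:
    """Split 'DD.MM.YYYY: service: admin: record' into its four fields."""
    # Split on the first three colons only; the record itself may contain more.
    raw_date, service, admin, record = (part.strip() for part in line.split(":", 3))
    day, month, year = (int(x) for x in raw_date.split("."))
    return LogEntry(date(year, month, day), service, admin, record)


entry = parse_entry("28.11.2016: hpa:C Pfeiler: RedHat updates applied.")
print(entry.day.isoformat(), entry.service, entry.admin)
```

Splitting on only the first three colons keeps paths, times and similar colon-bearing text in the record intact.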
13.02.2017: TSM: MB: TSM server sb03 database reallocated to 4 volumes.
07.02.2017: icinga: Karin Schaefer: yum update (with kernel update) on monitoring servers scc/gridka-ora-mon.
28.11.2016: hpa: C Pfeiler: Further RedHat images provided at /export/hpa/RH_repos/RHEL/. 7.3 copied into /export/hpa/RH_repos/rhel_repo/7.3
28.11.2016: hpa: C Pfeiler: RedHat updates applied.
28.11.2016: hpss prod.: jvw: Added allow_other option to the LSDF-Archiv hpssfs mount command in /etc/fstab because the user could not chdir since the id is not known.
25.11.2016: hpss prod.: Dorin Lobontu: Added new volumes for the primary copy as well as for the secondary copy (125 T10KD cartridges and 258 TS1140 cartridges).
14.11.2016: hpss prod.: jvw: Removed scp and rsync entries from /etc/rssh.conf on archive-sftp-0. This ensures the proper warning message if a user tries to log in with ssh.
Entries below are from Ahmad and were copied from a text file.
4.8.2016: hpss prod. UDA-Checksum: all Fuse mounts mounted on archive-sftp-01/-02 with cksum=md5,nch=g options.
4.8.2016: hpss prod. New Fuse bugfix upgrade hpssfs-fuse-2.0.1-0.el6.x86_64 on archive-sftp-01/-02 (rpm from Scott/IBM).
/SFTP/KIT -> /hpss/bwda/000000/<user>
Projects: new directories for bwdatadiss and radar.
1-4.8.2016 hpss prod.: Downtime for HPSS-FS layout change (1. phase) without username, uid and gid changes, done via scrub rename command.
(!) yum update failed because of a missing x86_64 library that is not included in the RHEL repository (Bugzilla).
29.7.2016 hpss prod. + hpss test: yum security updates on HPSS servers + frontends (archive-sftp-01/-02 and BWDAHUB) + reboot.
8.7.2016 hpss prod. Update to hpssfs-fuse bug fix (Software/HPSS/fuse-BugFix-v2001-2010/hpssfs-fuse-2.0.0-1.el6.x86_64.rpm) installed on both archive-sftp-01/-02. Mounted as before without cksum option.
4.7.2016 hpss prod. 3x tape drives (101 -> mvr1 and 301, 302 -> mvr3) had "suspect" status starting on 30.6, first on one drive, then on the two others over the following days, and were marked repaired. Martin checked the library logs -> OK. Email sent to Waldecker on 1.7 -> no reply.
20.6.2016 hpss test: Upgrade from 743p2 to 743p3 and test reading files (written in 743p2) with the E2E-configured system (743p3).
/db2_backup/offline/hcfg/HCFG.0.hpssdb.DBPART000.20151221154752.001
/db2_backup/offline/hsubsys1/HSUBSYS1.0.hpssdb.DBPART000.20151221154835.001
16.6.2016 hpss test: E2E tests: Downgrade from 743p3 to 743p2 based on db2 offline backup to test read of files:
30.5.2016 hpss test: Test system reconfiguration with Tape SC 101, 102 -> Done by Dorin.
gftp03: /root/Software/HPSS/fuse-BugFix-v2001-2010/fuse_with_BZ6036 . ver. hpssfs-fuse-2.0.1-0.el6.x86_64.rpm
24.5.2016 hpss test: fuse bug fix rpm (got from Scott) installed on test FE archive-gftp-03.
24.5.2016: hpss prod. Drive ID 101, IBM TS1140 on mvr01 -> Suspect status.
* sftp-01/-02 # uname -r -> 2.6.32-573.26.1.el6.x86_64
* for HPSS server (cr01, cr02, mvr01, 02, 03, 04) (!)
they were already up to date? # uname -r -> 2.6.32-573.18.1.el6.x86_64
hpss test: Update done by Dorin. For tcr03, tmvr # uname -r -> 2.6.32-573.22.1.el6.x86_64
archive-gftp-03 done by Ahmad, uname -r -> 2.6.32-573.26.1.el6.x86_64
24.5.2016 hpss prod. Maintenance: RHEL yum update for HPSS and FE (archive-sftp-01/-02).
19.5.2016 hpss prod. E2E disk reconfiguration done. -> Email from HPSS Support/IBM Tobias.
After testing access I forgot to set the right permissions -> Ahmad.
(!) - Still to decide about the permissions of all project directories on production.
18.5.2016 hpss prod. Bareos hpss folder permissions changed from 777 to 771: scrub> chmod /hpss/bareos 771
18.5.2016 hpss prod. COS 122 (GridKa SC 12) Allowed; Core Server -> restart, MPS -> restart to activate SC 12 for disk resources to be imported/created into HPSS after the E2E reconfiguration. Request from HPSS Support/Tobias.
18.5.2016 hpss prod. Mover1/Drive 101 was marked Suspect, IBM_IU1_PVR (Major). -> Marked Repaired.
11.5.2016: hpss prod. TSM team will take 50x STKT10KD volumes from the STK HPSS pool. These volumes are not imported and created into HPSS, therefore no further work is needed. BUT we have to get new ones later so as not to run short of volumes for the first copy.
11.5.2016: hpss prod. Totally broken IBM library, IBM_IU1_01 PVR (Major). -> Martin B. => no second copy.
Refer to howtos: ghi-thresholds.howto
9.5.2016: prod. GHI: apply threshold policy as described by Scott/IBM. Done together with GPFS admin (Ludmilla).
6.5.2016 hpss prod. 100 new tape volumes from IBM TS1140 (SC 205, sec. copy) added (imported/created) after a warning message from the HPSS GUI about no space left on tape.
3.5.2016 hpss prod. New project bareos set up, COS 1204, Large SC 23. -> FrontEnd (bht.lsdf.kit.edu, Stephanie Boehringer) -> Done by Ahmad.
April 2016: hpss prod. New project gridka-dcache set up with Fuse. COS 123. Done by
- login OK
(!) After reboot the hpss_disk_mvr05 server could be started via HPSS GUI, also PVL -> OK
(!)
Only hpssmvr07 resources still in Unknown status. TODO: decide what to do with them.
hscroot@hpss-hmc-a:~> chsysstate -r lpar -m Server-8247-22L-SN212960A -o shutdown --immed --restart -n redhat7-03
- To reboot hpssmvr05 aka redhat7-03:
29.4.2016: Test. hpssmvr05 rebooted because the host was unreachable via ping/ssh and the HMC console login via vterm of mkvterm hangs after the "Open Completed" message.
# Stefan Waldecker replaced the drive's broken jbec. (-> Martin)
Then disabled all Migration and Purge Policies assigned to SCs (1, 2, 3).
For the default SC 99 I disabled the Migration Policy "Test-Migrate-Disk-To-Tape-01" before.
Reason: The Tape SCs 10 and 11 did not have any tape volumes assigned and the MPS process kept failing, writing a lot of log files to /var/hpss/log every minute. And PVL went "Major".
How to disable M&P: GUI: Configure/Storage Space/Storage Classes -> mark SC 1, 2, 3 -> Configure -> Mig/Purge Policy -> NONE
Core Server restart (Shutdown/Start)
MPS restart (Shutdown/Start)
(!) SC 501 and SC 701 kept unchanged.
29.4.2016: Test. Backed up all hpss configs via /opt/hpss/bin/lshpss into /root/Software/HPSS/hpss-configs/
27.4.2016: hpss prod. IBM-1140 drive 0000078DB20E problem solved. -> HPSS OK
On mover3: /dev/hpss/st.1b.1-03592E07-0000078DB20E (missing)
# sg_map -st -x -i
# lsscsi showed only one tape device -> Martin informed.
25.4.2016 hpss prod. IBM-1140 drive (0000078DB20E) went Suspect on 25.4 -> mover3 Suspect -> HPSS, PVL Major.
22.4.2016: hpss prod. GHI: ddf-s-005# chmod +x /var/hpss/ghi/etc/ghi_backup_migration.ksh
21.4.2016 hpss prod. COS 1205 (Medium, 2x copies) max file size changed from 500GB to 600GB. (Jos' request to allow end users to tar 8192x68MB into one file of a simulation run, see mail from <email@example.com>)
For all SCs (13, 21, 22, 23) purge policy changed (see 13.4.2016):
Start purge when space used exceeded: from 0% to 50%
Stop purge when space used falls to: from 0% to 40%
21.4.2016 hpss prod.
Change purge policies while reformat of first storage system is still running.
- Migrate, Purge and Repack is done for D0300100 - D0401800, therefore all disks on Storage Unit 2 are empty.
- Firmware update on Storage Unit 2 is done.
- All Logical Drives are deleted and recreated with T10PI enabled.
- Formatting is expected to take up to 12 days (((36 Logical Drives / 8 in parallel) * 60 hours) / 24 hours).
16.4 Status mail from Tobias
- dump_acct_sum created a list of all COSes and the number of files belonging to each.
- Purge policies (SCs: 13, 21, 22, 23) changed to 0%:
Start purge when space used exceeded: 0%
Stop purge when space used falls to: 0%
- Most files on disk were only on tape and purged from disk 15.4
- Remaining files on disk repacked, firmware upgrade done by IBM/Tobias Elpelt.
- IBM/Tobias removed from HPSS and started formatting first disk system.
- Older COSes (1201, 1202, 1203) were Allowed and activated again, Core Server and MPS restarted.
- Delete all files belonging to older COSes and other remaining unneeded files.
13.4.2016 hpss prod. Start reconfiguration procedure to reformat disk storage systems for E2E support:
Core Server Shutdown/Start
MPS Shutdown/Start
5.4.2016 hpss prod. UTF8 support: GUI/Global/Global Flags/Object names can contain unprintable characters (ON)
-> also site statistics report sent to Jae Kerr, produced by this new script
30.3.2016 hpss prod. /etc/cron.d/hpssstat changed to /usr/local/bin/hpssstats2.py (new version downloaded from hpss wiki)
29.3.2016 hpss prod. Tape mover on hpssmvr01 went Suspect. Reason: drive (101) suspect status.
16.3.2016 hpss prod. Tape mover on hpssmvr02 went Suspect. Reason: drive (201) suspect status.
Notice: Work done by Dorin. Reason: to provide LSDF with tape storage due to lack of space on other libs.
10.3.2016 hpss prod. STK Library (CN_STK_LIB01) has been partitioned. The 2x STK drives of STK_01_PVR on hpssmvr02 changed drive address to 3,1,1,12 and 3,1,1,0.
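The formatting estimate quoted in the status mail of 16.4 (see the 21.4.2016 entry above) can be reproduced with a few lines of Python. This is only a worked version of the arithmetic given in that entry; the variable names are illustrative, and the 60 hours per batch is the figure from the mail, not a measured value.

```python
import math

logical_drives = 36    # logical drives to format (from the status mail)
parallel = 8           # drives formatted in parallel
hours_per_batch = 60   # formatting time per batch, as quoted in the mail

# Formula as written in the mail: ((36 / 8) * 60) / 24
days = (logical_drives / parallel) * hours_per_batch / 24
print(f"{days:.2f} days, i.e. up to {math.ceil(days)} days")
```

The formula yields 11.25 days, which the mail rounds up to 12.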
4.3.2016 test: Upgrade of HPSS test system from hpss-7.4.3p2 to hpss-7.4.3p3 including DB2 conversion.
23.2.2016: test: FUSE and HPSS client rpms updated on archive-tgftp.lsdf.kit.edu to hpssfs-fuse-2.0.1-0.el6.x86_64.rpm. Updating the HPSS client software to 7.4.3p2-1 was also needed to be able to update the fuse package.
22.2.2016: hpss prod. SFTP-01/02: yum-autoupdate has been installed with email to root,firstname.lastname@example.org in /etc/sysconfig/yum-autoupdate -> Dorin
- Check if BWDAHUB has been updated by Frank!
(!!!) TODO: - powerMovers mvr05 and mvr07 still to be updated (!!!)
22.2.2016: hpss prod. cr01/02, mvr01, 02, 03, 04, SFTP-01/02: OS update due to glibc security bug (CVE-2015-7547).
19.2.2016: test: hpsstcr03 and hpsstmvr OS update due to glibc security bug (CVE-2015-7547).
rm: cannot remove `file_50MB.txt.copy.61991': Invalid argument
19.2.2016 prod. Problem: After deletion of 23 files under /home/ahmad2/*, deleting them from Trash .Trash/root/* -> error: see code sftp-01:/root/coding_ahmad
TODO: still 10 files of cosid 1204.
19.2.2016 prod. changecos of zero files: rm zero file, touch zerofile, chmod user. zerofile, rm ./Trash/root/zerofile
Now permission set correctly: ddf-s-005# -rwxr--r--. 1 root root 13028 Jul 19 2014 /var/hpss/ghi/etc/ghi_backup_migration.ksh
(!!) TODO: check and keep an eye on it to see if everything is OK!
11.2.2016 prod. GHI: Problem of missing policy backup and migration solved: /var/hpss/ghi/etc/ghi_backup_migration.ksh had no x-permission.
-> Email from Support/Tobias (Re: [hpss-scc] KIT HPSS call - agenda for 2016-02-10) -> sent on 11.2.2016.
10.2.2016 Test. File deletion from TrashCan not possible due to removed COS (104) that the file belongs to. Log - Critical.
10.2.2016 hpss prod. - GHI policy backup and migration still not running. -> Email sent to support/telco to look at (-> Tobias/Scott)
(!) GHI: - OS still on old 6.4, but Scott said that when we go productive, ddf-s nodes should be upgraded to RHEL 6.7.
- for the hpss re-compilation and setup of ghi will be needed. See howto!
4.2.2016 hpss prod. RHEL upgrade 6.4 to 6.7 done. HPSS upgrade 743p1 -> 743p2 done. 8x ddf-s-* GHI nodes upgrade to hpss-743p2 done. (Downtime for one week) -> Ahmad, Dorin
(!) RDAC migration for test system was tested by Tobias (IBM) and continued for testCore by Dorin in Jan/16.
3.2.2016 hpss prod. Migration from RDAC to Multipath for disk storage for all HPSS servers done. -> started in mid Jan/16. -> Dorin
17.12.2015 hpss prod. API logging on frontends archive-sftp-01/-02 disabled. /var/hpss/tmp/hpss.api.resource.disabled
- Ahmad adapted the changes as described by IBM/Scott (10.12) and copied both files to all nodes (ddf-s-001-008)
- Started GHI again: ddf-s-005]# initctl start start_ghi_run
# initctl start ghi_iom_hpss
# /opt/hpss/bin/ghistartup -g
14.12.2015 hpss prod. GHI:
/var/hpss/ghi/policy/hpss/backup_migration.policy
/var/hpss/ghi/policy/hpss/migrate.policy
RULE 'scratch_exclude' EXCLUDE WHERE path_name like '%/scratch/.ghi/%'
Should be changed to:
RULE 'scratch_exclude' EXCLUDE WHERE path_name like '%/scratch/%'
10.12.2015 hpss prod. GHI: Scott found out that not the whole scratch dir was excluded: (see email)
10.12.2015: hpss test: RDAC-Multipath: IBM/Tobias Elpelt tested migration of the disk drivers from RDAC to Multipath. hpsstmvr shut down by Ahmad and restarted by Tobias after the work was done. Tobias -> Email (how-to RDAC to dm-multipath migration 11.12.2015)
8.12.2015 hpss prod. After adding tape volumes -> Major errors, all disks 100% full, PVL for PVRs -> Down. GHI stopped, ticket to IBM Support. Uwe, Jae solved it: DB2 user (hpss) did not exist.
8.12.2015 hpss prod. New tape volumes imported and created into PVRs STK_01 and IBM_IU1_01. (TK005000-TK014900 and ?-TS009900)
7.12.2015: hpss prod. HPSS downtime for one week for maintenance. OS/RHEL 6.4 -> 6.7 and HPSS 743p1 -> 743p2 upgrade. Email sent to users.
2.12.2015 hpss prod. ChangeCos for zero files issue closed.
TODO: recreate zero files in new COS 1205 after deleting from old COS 1204.
25.11.2015: hpss prod. GHI: Scott checked and responded. See email and ghi-problems-scott.howto. Now GHI is OK! But still some migration errors. -> Scott.
24.11.2015: hpss prod. OK: Mover03 and drives marked repaired after cancelling jobs. -> Dorin
23.11.2015: prod. Suspect: Mover03 and suspect drives of the IBM lib -> broken air conditioning at the weekend!
19.11.2015: hpss prod. GHI: still showed error messages. Email sent to IBM Support -> Scott
- /opt/hpss/bin/htar.ksh was not on any of the GHI nodes.
- Scott upgraded GHI-HTAR to the latest version for 2.4, GHI-HTAR 220.127.116.11g, for bug fixes.
- Scott: patch released for GHI 18.104.22.168. for bug fixes including a Sev 1 fix. Postponed for later.
19.11.2015 hpss prod. GHI: Scott's findings:
18.11.2015 hpss prod. GHI: Due to error messages, email sent to IBM Support -> Scott responded and found out 19.11.2015:
- delete old/create new Hier and COS 1102, 1103 via GUI for ghi policies.
- on ddf-s-001 # mmchfs hpss -z Yes (see 2.11.2015)
# mmdsh chmod +x /opt/hpss/bin/ghi_iom
- mmmount /hpss -a
# chmod +x /var/hpss/ghi/etc/start_ghi_run
- change log path in /etc/init/start_ghi_run.conf (>> /data/ghi_log/start_ghi_run.log)
- initctl start start_ghi_run
- had to start ghi_iom manually: (Update: I should have used # initctl start ghi_iom_hpss)
# ddf-s-001: # mmdsh /opt/hpss/bin/ghi_iom hpss 8012 (see ddf-s-005: /etc/init/ghi_iom_hpss.conf)
- /opt/hpss/bin/ghistartup -g
- on ddf-s-005 # chmod +x /var/hpss/ghi/etc/ghi_backup_migration.ksh
17.11.2015 hpss prod. GHI:
16.11.2015 Test. RHEL OS upgrade from 6.4 -> 6.7 on hpsstcr03 and hpsstmvr, and recompile rdac DCS3700 on both servers.
Reason: Found out (9.11.15) that after the OS upgrade (2 weeks ago) from 2.6.32-358.el6.x86_64 to 2.6.32-358.23.2.el6.x86_64
/lib/modules/2.6.32-358.23.2.el6.x86_64/kernel/drivers/scsi/ mppUpper.ko mppVhba.ko (were missing)
Solution (IBM DCS3700 redbook):
- cp mvr1:~/Software/rdac-LINUX-09.03.0C05.0652-source.tar.gz mvrt:~Software/DCS3700/
- untar
- make uninstall
- make clean (in case *.o are compiled)
- make
- make install
- Add to /boot/grub/menu.lst:
title Red Hat Enterprise Linux Server (2.6.32-358.23.2.el6.x86_64) with MPP
root (hd0,0)
kernel /vmlinuz-2.6.32-358.23.2.el6.x86_64 ro root=/dev/sda3 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet console=tty0 console=ttyS0,115200n81r
initrd /mpp-2.6.32-358.23.2.el6.x86_64.img
- reboot
5.11.2015: Test. After hpsstmvr reboot: got disk read error. Therefore disk-tmvr broken.
- LOGs: CORE0068: Space usage in tablespace STORAGESEGTAPEABIX of database subsys1 has exceeded critical threshold of 90%; usage at 1
- CORE0069: Space usage in tablespace TAPESEGUNLINK of database subsys1 has exceeded warning threshold of 85%; usage at 89%
- Solution (Jae IBM): Check both "global" and "subsystem" configurations. Make sure the "Metadata Space Monitor Interval" value is set to 0 seconds.
4.11.2015 Test. Core Server GUI broken status:
- P8 test COSes (500, 700) disabled (GUI/Subsystems/Configure).
- Data purged from DSC 1, 2, 3. MPS should be started to see the SCs under GUI/Monitor/Storage Classes active.
- Due to Core Server Major and Critical messages, /opt/hpss/bin/rc.hpss stop/start and then Purge/Start.
4.11.2015 hpss test:
4.11.2015 prod. GHI: old Hierarchy 1401 and COS 1401 (named "GHI Metadata") deleted and new ones with the same IDs created with new Tape SC 204 and 205 (DSC 31, TSC 204, 205). Restart: Core Server/PVL/MPS
(!) Before enabling scripts and starting ghi you have to run (mmchfs hpss -z Yes).
2.11.2015 prod.
GHI: all daemons stopped: On ddf-s-005: disable scripts (1. chmod -x /opt/hpss/bin/ghi_iom) and (2. chmod -x /var/hpss/ghi/etc/start_ghi_run) and (mmchfs hpss -z No # to mount hpss without ghi started!!)
28.10.2015 prod. sftp logging set up on sftp-01/02. Logfile: /var/log/sftp
28.10.2015 prod. API logging activated via resource file as described by Scott (IBM). See email.
28.10.2015 test. Jonathan (IBM) put P8 disk storage OFF to solve PVL major status.
23.10.2015 prod. Purge started manually from GUI for Disk SC 22, due to user errors and log errors (No space left on device). -> Scott advised to purge since there were no more segments because of fragmentation. HPSS telco 21.10.2015.
20.10.2015 prod. changecos for log files in /log, March - 14.10.2015, from COSID 1102 into COSID 1205. And reinitialized log daemon and client daemon.
23.10.2015 prod. Disk SC 22 purge policy changed from space exceeds 90% -> 70%; Stop Purge space left unchanged.
17.9.2015 prod. Cron jobs on core server cr2 for accounting and site statistics. /etc/cron.d/(hpssstat, hpssacct)
11.9.2015 prod. Cron jobs on core server cr1 for accounting and site statistics. /etc/cron.d/(hpssstat, hpssacct)
10.9.2015 prod. Trashcans activated at 16:03. GUI/Global/Trashcans Settings (5, 86400, 864000, 3600)
7.9.2015 hpss prod. IBM_IU1_01 was still in Major, reporting null mounts. Only PVR shutdown/start helped. -> OK
2.9.2015 hpss prod. IBM_IU1_01 PVR went Major. Cause: CN_IBM_N01 broken -> Martin Beizinger. Maintenance done 4.9.
2.9.2015 prod. STK drive change: broken STK drive (id 202) connected to hpssmvr02 replaced with a new one and added to devices & drives. (old: mvr02:/dev/hpss/st.1b.1-T10000D-579004000755 -> new: mvr02:/dev/hpss/st.1b.1-T10000D-579004000661)
27.8.2015 prod. Class of Service changed from 1102 to 1205 for Log Daemon: GUI/Servers/Log Daemon/specific/Archive Class of Service
26.8.2015 prod. Max file size for 1205 changed to max 500GB, based on new calculations that came from HPSS Support.
-> Scott (Email)
25.8.2015 prod. 8 PM: end of changecos. But 33555 missing files still have COSID 1204. To be reported to HPSS Support.
24.8.2015 prod. GHI manager on ddf-s-005: disk 100% full due to /var/hpss/ghi/log/start_ghi_run.log (12GB). Temporary solution: copied file to ddf-s-005:/data/ghi_log_backup/ and emptied /var/hpss/ghi/log/start_ghi_run.log
20.8.2015 prod. 10 new tape volumes for STK_01_PVR added. 30 new tape volumes for IBM_IU1_01 added.
20.8.2015 prod. STK drives of PVR STK_01 went to suspect status during change cos for tape volume TK001500.
17.8.2015 prod. On sftp servers (sftp-01/02): unmounted fuse 1204 and mounted fuse with new COS 1205 but OLD PATH. /SFTP/KIT
17.8.2015 prod. 11 AM: start of changecos from 1204 to 1205.
13.8.2015 prod. Max. file size for COS 1204 and 1025 extended to 2TB. (OK came from Jos and HPSS Support/Scott)
Martin reported the drive. I checked the drives in GUI as (Mark Repaired) => Green OK
# Suspect drives after firmware update
11.8.2015 hpss prod. T10:IBM|03592E07 0000078DB20E
13.8.2015 hpss prod. T10:IBM|03592E07 0000078DB1EE
GHI shutdown -g before upgrade and startup afterwards on ddf-s-005 successful. (Ahmad)
After shutdown applypolicy jobs kept running. -> Scott said not to worry, they will stop after a timeout.
GHI logfiles from the applypolicy jobs filled the local disk. -> Scott said to disable applypolicy scripts since GHI is not used. (he didn't confirm)
iom processes kept running. -> Scott: no problem, they are started automatically. You can only restart with -i.
3.8.2015 hpss prod. Due to GPFS-3.5-18 bug, upgrade to 3.5-25 on node ddf-s-006 done by (Ludmilla/Ursula).
29.7.2015 Test. Reformatting HPSS_T_Cache1 disk as preparation for the End2End protection (T10PI). -> IBM Jonathan
Core Server and PVL restart.
9.4.2015 prod. GUI: Migration/Purge server restarted. Cause: drives mounted for days. PVL job cancel had no effect, PVL (Major).
9.4.2015 prod.
GUI: Configure/Subsystems/Configure/Allow disabled for ID 1201, 1203. Error: "Disk migration Failure". (No tape behind)
8.4.2015 Support ticket submitted by IBM-DE Kay Jenke for broken disk. (01V4NJL,724 for the Disk Alert)
7.4.2015 Test. Drive failure: Enclosure 99, Drawer 2, Slot 9.
Jan/Feb 2015 hpss prod. HPSS upgrade to 7.4.3p1.
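The scratch exclusion fix recorded under 10.12.2015 and 14.12.2015 can be illustrated with a small Python sketch. It assumes that `like` in the policy rule follows SQL semantics (`%` matches any run of characters, `_` a single character). The helper `sql_like` and the two example paths are hypothetical and only serve to show why the old pattern excluded just the `.ghi` subdirectory rather than the whole scratch tree.

```python
import re


def sql_like(pattern: str, value: str) -> bool:
    """Emulate SQL LIKE: '%' = any run of characters, '_' = one character."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


OLD_RULE = "%/scratch/.ghi/%"   # pattern before the fix
NEW_RULE = "%/scratch/%"        # pattern after the fix

# Hypothetical paths, for illustration only.
paths = ["/hpss/scratch/.ghi/state", "/hpss/scratch/user1/run.tmp"]
for path in paths:
    print(path, sql_like(OLD_RULE, path), sql_like(NEW_RULE, path))
```

Under these assumptions the first path is matched by both patterns, while the second is matched only by the new one.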