hidden:System Logs (HPSS, TSM): Difference between revisions

Revision as of 12:11, 26 October 2017

Format of the entries

date: service [hpss|tsm|druva]: admin name: record: link to further info in gridka or lsdf wiki
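
The entry format above can be parsed mechanically, for example to filter the log by service or admin. The following is a minimal sketch, assuming the first line of an entry has the shape `DD.MM.YYYY: service: admin: record`. The function name `parse_entry` and the returned field names are illustrative and not part of this wiki. Indented continuation lines and the older free-form entries further down are not handled.

```python
import re
from datetime import date

# Assumed shape of an entry's first line: "DD.MM.YYYY: service: admin: record".
ENTRY_RE = re.compile(
    r"^(?P<day>\d{1,2})\.(?P<month>\d{1,2})\.(?P<year>\d{4}):\s*"
    r"(?P<service>[^:]+?):\s*"
    r"(?P<admin>[^:]+?)\s*:\s*"
    r"(?P<record>.*)$"
)

def parse_entry(line):
    """Split one log line into its fields, or return None if it does not match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    return {
        "date": date(int(m["year"]), int(m["month"]), int(m["day"])),
        "service": m["service"].strip(),
        "admin": m["admin"].strip(),
        "record": m["record"].strip(),
    }

example = parse_entry(
    "18.07.2017: HPSS BwDA:     DL : new host certificates installed on sftp01/02"
)
print(example)
```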

26.10.2017: HPSS BwDA:     MB : on hpssrc deleted entry in crontab
                                # sync storage array time
                                #MB: 26.10.2017# 0 1 * * * /usr/local/bin/sync_storage_time.sh
                                The storage is managed only on hpa

19.10.2017: HPSS BwDA:     KS : Tape Drive exchange
                                STK-LIB02, Bay 32, BwDA_3,2,1,0_102 SN:579004003070  New SN:579004007294

04.10.2017: HPSS BwDA:     MB : Tape Drive exchange
                                STK-LIB02, Bay 45, BwDA_3,1,1,12_201 SN: 579004001831  New SN:579004000601

28.09.2017: TSM GridKa:    MB : Robots 1.3.0.2.0 and 1.3.0.1.0 replaced in STK_LIB01

27.09.2017: TSM GridKa:    MB : Tape Drive exchange
                                STK_ACS2_7 DriveId: 2,0,1,4 Bay 63 WWN: 500104F000A4B331 SN 576004009753 new SN 579004004337

07.09.2017: TSM GridKa:    MB : Tape Drive exchange
                                STK_ACS2D_11 DriveId: 2,3,1,12 Bay 13 WWN: 500104F000A4B29B SN 579004007298 new SN 579004001438

04.09.2017: HPA:           MB : - enable target Graphical Interface (runlevel 5) for Storage Management GUI
                                - installation of SanTricity Software for storage management, start the GUI with "SMclient"
                                  Software is on hpa:/export/hpa/Software/storage/
                                - firmware stored for E5600 on hpa:/export/hpa/Software/storage/firmware/E5600 

28.08.2017: TSM GridKa:    MB : Tape Drive exchange
                                STK_ACS2D_07 DriveId: 2,2,1,8  WWN: 500104F000A4B2CE SN: 579004007291 New SN: 579004000138

28.08.2017: TSM GridKa:    MB : 2 new cron jobs for database protection on a01-013-110, GRID1
                                # Backup the database backup from disk to NB03, tape
                                15 12 * * *   dsmc i /db_backup -se=backup  1>/dev/null 2>/dev/null
                                # database protection , backup /archlog all 15 minutes to tape on NB03
                                15 * * * * dsmc i /archlog -se=backup   1>/dev/null 2>/dev/null
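
In standard five-field cron syntax, a minute field of `15` matches one minute per hour, whereas an interval is written with a step such as `*/15`. The sketch below evaluates only the minute field and supports only `*`, `*/n`, plain numbers and comma lists. The helper names are hypothetical. This log does not say which cadence was intended for the /archlog job, so the sketch only shows what each form matches.

```python
def field_matches(field, value):
    """Return True if one cron field matches the given integer value.

    Supports only '*', '*/n', plain numbers and comma-separated lists of these.
    """
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False

def minutes_fired_per_hour(expr):
    """List the minutes of an hour at which the expression's minute field fires."""
    minute_field = expr.split()[0]
    return [m for m in range(60) if field_matches(minute_field, m)]

hourly = minutes_fired_per_hour("15 * * * *")
quarter_hourly = minutes_fired_per_hour("*/15 * * * *")
print(hourly)          # one match per hour
print(quarter_hourly)  # four matches per hour
```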

22.08.2017: TSM GridKa:    MB : Tape Drives exchange
                                STK_ACS2_16, DriveId: 2,1,1,14 WWN: 500104F000A4B2E3 SN: 576004004932 New SN 576004003756
                                STK_ACS2D_01,DriveId: 2,0,1,11 WWN: 500104F000A4B30A SN: 579004007300 New SN 579004000737

16.08.2017: TSM GridKa:    MB : Special build firmware installed on T10000C drives RE363101-5.50, should prevent FSC37F6 in the future

11.08.2017: TSM GridKa:    MB : TSM Server GRID1 Admin Schedule BA_DB_SNAP renamed to BA_DB_DISK, now every day one full backup to disk at 12:00 in directory /db_backup

10.08.2017: TSM GridKa:    MB : TSM Server GRID1 /actlog and /archlog moved to other disks, created new filesystem for database backups /db_backup, admin schedule BA_DB_SNAP now runs every day

25.07.2017: TSM GridKa:    MB : Robot 1.2.0.2.0 replaced in STK_LIB01

24.07.2017: TSM GridKa:    MB : Tape Drives exchange
                                Bay 55  2,0,1,6  SN:576004006401 TRAY_SN:464940G 1450J70010 New_SN:576004009059 T10000C
                                Bay 58  2,0,1,9  SN:576004009749 TRAY_SN:464970G 1539J700104 New_SN:576004003752 T10000C
                                Bay 59  2,0,1,5  SN:576004006915 TRAY_SN:464940G 1450J70009 New_SN:576004009398 T10000C

20.07.2017: TSM GridKa:    MB : Robot 1.1.0.1.0 replaced in STK_LIB01

18.07.2017: HPSS BwDA:     DL : new host certificates installed on sftp01/02

11.07.2017: TSM GridKa:    MB : Robot 1.2.0.1.0 replaced in STK_LIB01

06.07.2017: TSM GridKa:    MB : New replacement cartridges inserted: TA2110, TA2112, TA2416, TA2449, TA1820

05.07.2017: HPA:           KS : included /var into backup again, produces occasional TSM Error Mails

05.07.2017: Gridka:        MB : On TSM Server GRID1 "set actlogretention 35" ==> ANR2090I Activity log retention set to 35 for management by DATE.

29.06.2017: GridKa:        MB : Drive 576004002418 2,1,1,4 replaced, new SN:576004006167 T10000C

07.06.2017: TSM GridKa:    MB : T10000D drives for ALICE and CMS put into operation

01.06.2017: TSM GridKa:    MB : Firmware update of T10000D drives to 4.12.106-5.60 in STK-LIB01

30.05.2017: HPA:           JvW: Excluded /var from backup for the time being because the activity there causes backup errors. Need to change the local storage to LVM

18.05.2017: TSM GridKa:    MB : Drive 579004009748 2,0,1,13 Bay 57 replaced, new SN:579004006798 T10000C

28.04.2017: TSM KIT:       MB : time adjustment, changed crontab entry to ".... ntp.scc.kit.edu ..." on TSM Servers scc-tsm-n01,n02,s01,s02 

27.04.2017: HPSS-Prod:     JvW: added /opt/hpss/bin to the root PATH variable in bash_profile. Will add a corresponding rule in Ansible

26.04.2017: GridKa:        MB : Update TSM database from V5.5 to 6.3.5.100, now DB2 !

24.04.2017: HPSS-Prod:     DL : Migration Policy 13(test GridKA) "Number of Migration Streams per File Family" changed from 4 to 1(decided in HPSS GridKa JF)

13.04.2017: HPSS-Prod:     AH : archive-sftp-0[12456] set default values: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes in /etc/sysctl.conf

11.04.2017: HPSS-Prod:     AH : hosts archive-sftp-0[456] set some limits like archive-sftp-0[12] in /etc/security/limits.conf and /etc/sysctl.conf.

11.04.2017: HPSS-Test:     AH : hpssmvr7 mover server (Power8) and its NetApp-Disk Devices removed from HPSS.

11.04.2017: HPSS-Prod:     AH : hosts archive-sftp-0[456] yum repo changed from rhs.scc.kit.edu to RHEL with new registration via subscription-manager.

10.04.2017: HPSS-Prod:     JvW: Config changes for the archive-sftp-[0-6] hosts. Most importantly, all host keys were synced. Users of sftp complained about wrong fingerprints after the new hosts were added to production

06.04.2017: TSM LSDF:      MB : Drive 579004003031 1,0,1,7 replaced on 06.04.2017, new SN:579004007136 T10000D

04.04.2017: HPSS-Prod:     AH : hosts archive-sftp-0[456] installed/configured hpss-7.4.3p2,hpssfuse 2.0.1-Bugfix(Scott),pam_mysql,chroot,autofs map files.

03.04.2017: GridKa:        MB : Drive 576004002668 2,1,1,8 replaced on 03.04.2017, new SN:576004001220 T10000C

01.04.2017: HPSS-Prod:     JvW: cleaning yum db after switching to RH subscription
01.04.2017: HPSS-Test:     JvW: cleaning yum db after switching to RH subscription

01.04.2017: HPSS-Prod:     JvW: new hosts archive-sftp-0[456] installed with RH6.8

29.03.2017: HPSS-Prod:     JvW: enabled syslog in hpss. HPSS logging now additionally goes to hpa via rsyslog

29.03.2017: HPSS-Prod:     AH : hpsscr/GUI change (Max. VVs To Write): TapeSC 204 (from 5 to 1), TapeSC 205 (from 3 to 1).

29.03.2017: HPSS-Prod:     AH : hpsscr/GUI Migration policy (SC22) change (# Streams Per FF, Total M. Streams) from 3, 3 to 1, 1.

29.03.2017: HPSS-Prod:     AH : hpsscr/GUI purge policy (SC21,SC22) from 60min,50%,40% to 1440min,80%,70% (Last file access, Start purge, Stop purge) 
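
The three purge values (last file access, start purge, stop purge) describe a threshold pair with hysteresis plus an idle-time filter. The sketch below illustrates that general idea only and is not HPSS's actual purge algorithm. The function names and the usage samples are hypothetical. It assumes purging starts once usage exceeds the start threshold and stops once usage has fallen to the stop threshold.

```python
def purge_active(used_pct, currently_purging, start_pct, stop_pct):
    """Return whether purging is active after observing the current usage."""
    if not currently_purging and used_pct > start_pct:
        return True   # usage exceeded the start threshold
    if currently_purging and used_pct <= stop_pct:
        return False  # usage has fallen to the stop threshold
    return currently_purging

def is_purge_candidate(minutes_since_last_access, min_idle_minutes):
    """A file qualifies only if it has been idle at least the configured time."""
    return minutes_since_last_access >= min_idle_minutes

def simulate(usage_samples, start_pct, stop_pct):
    """Track the purge state over a series of disk usage percentages."""
    purging = False
    states = []
    for used in usage_samples:
        purging = purge_active(used, purging, start_pct, stop_pct)
        states.append(purging)
    return states

samples = [75, 81, 78, 72, 70, 75]  # hypothetical disk usage in percent
old_policy = simulate(samples, start_pct=50, stop_pct=40)
new_policy = simulate(samples, start_pct=80, stop_pct=70)
print(old_policy)
print(new_policy)
print(is_purge_candidate(90, 60), is_purge_candidate(90, 1440))
```

With these sample values the old thresholds keep purging active throughout, while the new ones purge only between the 81% and 70% samples. A file idle for 90 minutes qualifies under the old 60-minute setting but not under the new 1440-minute one.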

24.03.2017: TSM LSDF:      MB : 4 LTO5 drives installed in scc-tsmlib-n01.scc.kit.edu: F4R1, F4R3, F4R6, F4R8

23.03.2017: HPSS-Prod:     AH : archive-sftp-01/-02 set hpss-fuse option stream=1 (MB) for all project mounts /hpss/<project>

22.03.2016: HPSS:          MB : Drive  3,1,1,12 SN:579004000634 exchanged, new SN: 579004001831

21.03.2017: HPSS-Prod:     AH : archive-sftp-01/-02 set hpss-fuse option stream=1 (MB) for /hpss/bwda

21.03.2017: HPSS-Prod:     AH : archive-sftp-01/-02 edit /etc/sysctl.conf set vm.vfs_cache_pressure=100. (Default)

17.03.2017: HPSS-Prod:     AH : archive-sftp-01/-02 edit /etc/sysctl.conf set vm.vfs_cache_pressure=1. (Test!)

15.03.2017: HPSS-Prod:     AH : archive-sftp-01 install smem package (yum --enablerepo=epel install smem python-matplotlib).

14.03.2017: GridKa:        MB : Drive 576004002628 2,1,1,0 replaced on 14.03.2017, new SN:576004003031

14.03.2017: HPSS-Prod:     AH : archive-sftp-01/-02 edit /etc/sysctl.conf set net.ipv4.tcp_keepalive_time=600, net.ipv4.tcp_keepalive_intvl=10, net.ipv4.tcp_keepalive_probes=6. 
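
On Linux, these three settings together determine roughly how long an idle connection to a dead peer survives: the idle time before the first probe plus one probe interval per unanswered probe. The sketch below does that arithmetic for the values in the entry above and for the Linux defaults (7200, 75, 9). The function name is illustrative. The settings only affect sockets that have keepalive enabled.

```python
def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Approximate seconds until an idle connection to a dead peer is dropped."""
    return time_s + intvl_s * probes

configured = keepalive_detection_seconds(600, 10, 6)
linux_defaults = keepalive_detection_seconds(7200, 75, 9)
print(configured)      # seconds with the values from this entry
print(linux_defaults)  # seconds with the Linux default values
```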

13.03.2017: HPSS-Prod:     AH : archive-sftp-01/-02 edit  EXCLUDE_MOUNTPOINTS in /etc/rear/local.conf.

10.03.2017: HPSS-Prod:     AH : archive-sftp-01/-02 change stream=16M to default 8M for /hpss/bwda fuse mount in /etc/fstab. remount /hpss/bwda.

10.03.2017: HPSS-Prod:     AH : archive-sftp-01/-02 disable /etc/cron.daily/mlocate.cron for updatedb. (ToDo: only exclude fuse/bind mounts in /etc/updatedb.conf)

09.03.2017: TSM:           MB : CVE-2017-2636 Mitigation,done on LSDF-01/02, scc-histor-n01 and GRID1,  # echo "install n_hdlc /bin/true">> /etc/modprobe.d/disable-n_hdlc.conf

08.03.2017: HPSS-Prod:     AH : archive-sftp-01/-02 edit /etc/sysctl.conf change vm.swappiness=20 (it was =0).
  
03.03.2017: HPSS-Prod:     DL : tape storage class 103  : Maximum VVs To Write: 4 (restart Migration/Purge Server and restart Core Server)

03.03.2017: HPSS-Prod:     AH : on both archive-sftp-01/02: add /etc/fail2ban/action.d/iptables-common.local to solve a fail2ban error related to iptables-v1.4.7

03.03.2017: HPSS-Prod:     AH : Downtime 10-12 am, on both (archive-sftp-01/02):yum --security update, vm.overcommit_memory=2, 'rvl=86400' for hpssfs mounts.  

03.03.2017: HPSS-Prod:     AH : Downtime 10-12 am, on both (archive-sftp-01/02):fuse remounts needed to apply ulimits for max open files changed on 24.2.2017.

28.02.2017: HPSS-Prod:     DL : Migration Policy 13(test GridKA) "Aggregate Files to tape" enabled, "Max Files in Aggregate" set to 1000

28.02.2017: HPSS-Prod:     DL : Migration Policy 13(test GridKA) "Number of Migration Streams per File Family" and "Total Migration Streams" changed from 1 to 4. 

27.02.2017: TSM:           MB : To prevent warning messages "sb03: setopt idletimeout 30"

27.02.2017: TSM:           MB : Recabling SAN at TSM CN

24.02.2017: HPSS-Prod:     DL :  Maximum Open BitFiles: 10000 (it was 2000 before)
24.02.2017: HPSS-Test:     DL :  Maximum Open BitFiles: 10000 (it was 2000 before)

24.02.2017: HPSS-Prod:     AH : ulimits for max open files and #procs increased on sftp-01/-02 in (/etc/security/limits.conf).

17.02.2017: GridKa:        MB : Drive 576004009255 2,0,1,1 replaced on 17.02.2017, new SN:576004004986

17.02.2017: GridKa:        MB : Drive 576004009226 2,0,1,2 replaced on 17.02.2017, new SN:576004003733

13.02.2017: TSM:           MB : TSM server sb03 database reallocated to 4 volumes

10.02.2017: GridKa:        MB : Drive 576004001000 2,1,1,14 replaced on 10.02.2017, new SN:576004004932

07.02.2017: icinga:        KS : yum update (with kernel update) on monitoring servers scc/gridka-ora-mon



28.11.2016: HPA:           CEP: Further RedHat images provided at /export/hpa/RH_repos/RHEL/. 7.3 copied into /export/hpa/RH_repos/rhel_repo/7.3
28.11.2016: HPA:           CEP: RedHat updates applied.
28.11.2016: HPSS-Prod:     JvW: added allow_other option to LSDF-Archiv hpssfs mount command in /etc/fstab because the user could not chdir since the id is not known.
25.11.2016: HPSS-Prod:     DL : added new volumes for primary copy as well as for secondary copy (125 T10KD cartridges and 258 TS1140 cartridges) 
14.11.2016: HPSS-Prod:     JvW: Removed scp and rsync entries from /etc/rssh.conf on archive-sftp-0[12]. This will ensure the proper warning message if a user tries to log in with ssh.

Entries below are from Ahmad and copied from text file.

4.8.2016: hpss prod. UDA-Checksum : all Fuse mounts mounted on archive-sftp-01/-02 with cksum=md5,nch=g options
4.8.2016: hpss prod. new Fuse-Bugfix upgrade hpssfs-fuse-2.0.1-0.el6.x86_64 on archive-sftp-01/-02 (rpm from Scott/IBM)
/SFTP/KIT -> /hpss/bwda/000000/<user>
Projects new directories for bwdatadiss and radar
1-4.8.2016 hpss prod.: Downtime for HPSS-FS layout change (1. phase) without username, uid and gid changes, done via scrub rename command
(!) yum update failed because of a missing x86_64 library that is not included in the rhel repository (Bugzilla)

29.7.2016 hpss prod.+ hpss test: yum security updates on HPSS Server + Frontends (archive-sftp-01/-02 and BWDAHUB) + reboot
8.7.2016 hpss prod. Update to hpssfs-fuse-Bug-Fix (Software/HPSS/fuse-BugFix-v2001-2010/hpssfs-fuse-2.0.0-1.el6.x86_64.rpm) installed on both archive-sftp-01/-02. Mounted as before without cksum option.

4.7.2016 hpss prod. 3x tape drives (101->mvr1 and 301,302->mvr3) had "suspect" status starting on 30.6, first one drive and then the two others over the following days, and were marked repaired. Martin checked the library logs -> OK. Email sent to Waldecker on 1.7 -> no reply.

20.6.2016 hpss test: Upgrade from 743p2 to 743p3 and test reading files (written in 743p2) with e2e configured system (743p3).
/db2_backup/offline/hcfg/HCFG.0.hpssdb.DBPART000.20151221154752.001
/db2_backup/offline/hsubsys1/HSUBSYS1.0.hpssdb.DBPART000.20151221154835.001
16.6.2016 hpss test: E2E Tests: Downgrade from 743p3 to 743p2 based on db2 offline backup to test read of files:
30.5.2016 hpss test: Testsystem reconfiguration with Tape SC 101, 102 -> Done by Dorin
gftp03: /root/Software/HPSS/fuse-BugFix-v2001-2010/fuse_with_BZ6036
. ver. hpssfs-fuse-2.0.1-0.el6.x86_64.rpm
24.5.2016 hpss test: fuse BugFix (got from Scott) rpm installed on test FE archive-gftp-03
24.5.2016: hpss prod. drive ID 101, IBM TS1140 on mvr01 -> Suspect status
* sftp-01/-02 # uname -r -> 2.6.32-573.26.1.el6.x86_64
* for HPSS server (cr01,cr02, mvr01,02,03,04) (!) they were already up to date?
# uname -r -> 2.6.32-573.18.1.el6.x86_64
hpss test: Update done by Dorin. for tcr03, tmvr
# uname -r -> 2.6.32-573.22.1.el6.x86_64
archive-gftp-03 done by Ahmad, uname -r -> 2.6.32-573.26.1.el6.x86_64

24.5.2016 hpss prod. Maintenance: RHEL yum update for HPSS and FE (archive-sftp-01/-02)

19.5.2016 hpss prod. E2E disk reconfiguration done. -> Email from HPSS Support/IBM Tobias
After testing access I forgot to set the right permissions -> Ahmad.
(!) - Still to decide about all projects directory permission on production

18.5.2016 hpss prod. Bareos hpss folder permissions changed from 777 to 771: scrub> chmod /hpss/bareos 771
18.5.2016 hpss prod. COS 122 (GridKa SC 12) Allowed, Core Server -> restart, MPS -> restart to activate SC 12 for Disk resources to be imported/Created into HPSS after the E2E reconfiguration. Request from HPSS Support/Tobias
18.5.2016 hpss prod. Mover1/Drive 101 was marked Suspect, IBM_IU1_PVR (Major). -> Marked Repaired

11.5.2016: hpss prod. TSM team will take 50x STKT10KD volumes from STK HPSS pool. These volumes are not imported and created in HPSS, therefore no further work is needed. BUT we have to get new ones later so as not to run short of volumes for the first copy.
11.5.2016: hpss prod. Total broken IBM Library, IBM_IU1_01 PVR(Major). -> Martin B. => no second copy.
Refer to howtos: ghi-thresholds.howto
9.5.2016: prod. GHI. apply threshold policy as described by Scott/IBM. Done together with GPFS admin (Ludmilla)

6.5.2016 hpss prod. 100 new tape volumes from IBM TS1140 (SC 205 sec. copy) added (imported/created) after a warning message from the HPSS GUI about no space left on tape.

3.5.2016 hpss prod. new project bareos setup, COS 1204, Large SC 23. -> FrontEnd (bht.lsdf.kit.edu, Stephanie Boehringer) -> Done by Ahmad
April 2016: hpss prod. New project gridka-dcache set up with Fuse. COS 123. Done by

- login OK
(!) After reboot the hpss_disk_mvr05 server could be started via HPSS GUI also PVL -> OK
(!) Only hpssmvr07 resources still in Unknown status, TODO: decide what to do with them.

hscroot@hpss-hmc-a:~> chsysstate -r lpar -m Server-8247-22L-SN212960A -o shutdown --immed --restart -n redhat7-03
- To reboot hpssmvr05 aka redhat7-03:
29.4.2016: Test. hpssmvr05 rebooted because the host was unreachable via ping/ssh and the hmc console login via vterm of mkvterm hung after the "Open Completed" message.
# Stefan Waldecker replaced the drive's broken jbec. (-> Martin)
Then disabled all Migration and Purge Policies assigned to SCs (1,2,3)
For the default SC 99 I disabled the Migration Policy "Test-Migrate-Disk-To-Tape-01" before.
Reason: Tape SC 10 and 11 did not have any tape volumes assigned, and the MPS process kept failing, writing a lot of log files to /var/hpss/log every minute. The PVL also went "Major".
Howto disable M&P: GUI: Configure/Storage Space/Storage Classes -> mark SC1,2,3 -> Configure -> Mig/Purge Policy -> NONE
Core server Restart (Shutdown/Start)
MPS Restart (Shutdown/Start)

(!) SC 501 and SC 701 kept unchanged.

29.4.2016: Test. backup all hpss configs via /opt/hpss/bin/lshpss into /root/Software/HPSS/hpss-configs/

27.4.2016: hpss prod. IBM-1140 Drive 0000078DB20E problem Solved. -> HPSS OK
On movr3 : /dev/hpss/st.1b.1-03592E07-0000078DB20E (missing)
# sg_map -st -x -i
# lsscsi
showed only one tape device -> Martin informed.
25.4.2016 hpss prod. IBM-1140 Drive (0000078DB20E) went Suspect on 25.4, -> mover3 Suspect -> HPSS, PVL Major.
22.4.2016: hpss prod. GHI: ddf-s-005# chmod +x /var/hpss/ghi/etc/ghi_backup_migration.ksh

21.4.2016 hpss prod. COS 1205 (Medium 2x copies) Max file Size changed from 500GB to 600GB. (Jos Request to allow End user to tar 8192x68MB in one file of a simulation run see mail from <agathe.chouippe@kit.edu>)
For all SCs (13, 21, 22, 23) purge policy changed (see. 13.4.2016):
Start purge when space used exceeded: from 0% to 50%
Stop Purge when space used falls to: from 0% to 40%
21.4.2016 hpss prod. Change purge policies while reformat of first storage system still running
- Migrate, Purge and Repack is done for D0300100 - D0401800, therefor all disk on Storage Unit 2 are empty.
- Firmware Update on Storage Unit 2 is done.
- All Logical Drives are deleted and recreated with T10PI enabled.
- Formatting is expected to take up to 12 days (((36 Logical Drives / 8 in parallel) * 60 hours) / 24 hours).
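
The formatting estimate above can be checked with a few lines of arithmetic. This is a sketch that takes the formula in the list item at face value, assuming 60 hours per batch of 8 logical drives formatted in parallel. The second figure rounds up to whole batches, which is an additional assumption and not stated in the log. Variable names are illustrative.

```python
import math

logical_drives = 36
drives_in_parallel = 8
hours_per_batch = 60

# Formula as written in the log: ((36 / 8) * 60) / 24
estimate_days = (logical_drives / drives_in_parallel) * hours_per_batch / 24

# Variant assuming a partially filled last batch still takes the full 60 hours.
whole_batches = math.ceil(logical_drives / drives_in_parallel)
whole_batch_days = whole_batches * hours_per_batch / 24

print(estimate_days)
print(whole_batch_days)
```

Both results bracket the "up to 12 days" quoted in the status update.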

16.4 Status mail from Tobias
- dump_acct_sum created a list of all coses and #Files belonging to
- Purge Policies (SCs: 13, 21, 22, 23) changed to 0%
Start purge when space used exceeded: 0%
Stop Purge when space used falls to: 0%
- Most files on disk were only on Tape and purged from disk
15.4 - Rest files on disk Repacked, Firmware upgrade done by IBM/Tobias Elpelt
- IBM/Tobias removed the first disk system from HPSS and started formatting it.
- Older COSes (1201, 1202, 1203) were Allowed and activated again, Core_server and MPS restarted.
- Delete all files belonging to older COSes and other remaining unneeded files.

13.4.2016 hpss prod. Start reconfiguration procedure to reformat disk storage systems for E2E support:
Core Server Shutdown/Start
MPS Shutdown/Start

5.4.2016 hpss prod. UTF8 Support GUI/Global/Global Flags/Object names can contain unprintable characters (ON)
-> also site statistics report sent to Jae Kerr produced by this new script

30.3.2016 hpss prod. /etc/cron.d/hpssstat changed to /usr/local/bin/hpssstats2.py (new version downloaded from hpss wiki)
29.3.2016 hpss prod. tape mover on hpssmvr01 went Suspect. Reason: drive (101) suspect status.
16.3.2016 hpss prod. tape mover on hpssmvr02 went Suspect. Reason: drive (201) suspect status.
Notice: Work done by Dorin. Reason: to provide LSDF with tape storage due to lack of space on other libs.
10.3.2016 hpss prod. STK Library (CN_STK_LIB01) has been partitioned. The 2x STK-Drives of STK_01_PVR on hpssmvr02 changed drive address to 3,1,1,12 and 3,1,1,0.
4.3.2016 test: upgrade HPSS Testsystem from hpss-7.4.3p2 to hpss-7.4.3p3 including DB2 conversion

23.2.2016: test: FUSE and HPSS client rpms updated on archive-tgftp.lsdf.kit.edu to hpssfs-fuse-2.0.1-0.el6.x86_64.rpm. Updating the HPSS client software to 7.4.3p2-1 was also needed to be able to update the fuse package.

22.2.2016: hpss prod. SFTP-01/02 yum-autoupdate has been installed with email to root,hpss-admin@lists.kit.edu in /etc/sysconfig/yum-autoupdate -> Dorin
- Check if BWDAHUB has been updated by Frank!
(!!!)TODO: - powerMovers mvr05 and mvr07 still to be updated(!!!)
22.2.2016: hpss prod. cr01/02, mvr01, 02,03,04, SFTP-01/02, OS update due to glibc security bug (CVE-2015-7547)
19.2.2016: test: hpsstcr03 and hpsstmvr OS update due to glibc security bug (CVE-2015-7547)
rm: cannot remove `file_50MB.txt.copy.61991': Invalid argument
19.2.2016 prod. Problem: After deletion of 23 files under /home/ahmad2/* deleting them from Trash .Trash/root/* -> error:
see. code sftp-01:/root/coding_ahmad
TODO: still 10 files of cosid 1204.
19.2.2016 prod. changecos of zerofiles: rm zero file, touch zerofile, chmod user. zerofile, rm ./Trash/root/zerofile
now permission set correctly:
ddf-s-005# -rwxr--r--. 1 root root 13028 Jul 19 2014 /var/hpss/ghi/etc/ghi_backup_migration.ksh
(!!)TODO: check and keep an eye on it to see if every thing is OK!
11.2.2016 prod. GHI. Problem of missing policy backup and migration solved: /var/hpss/ghi/etc/ghi_backup_migration.ksh had no x-permission.
-> Email from Support/tobias (Re: [hpss-scc] KIT HPSS call - agenda for 2016-02-10) -> sent on 11.2.2016.
10.2.2016 Test File deletion from TrashCan not possible due to removed COS (104) that the file belongs to. log - Critical.
10.2.2016 hpss prod. - GHI policy backup and migration still not running. -> Email sent to support/telco to look at (-> tobias/scott)
(!) GHI:- OS still on old 6.4, but scott said that when we go into production the ddf-s nodes should be upgraded to RHEL 6.7.
- for the hpss re-compilation and setup of ghi will be needed. s. howto!
4.2.2016 hpss prod. RHEL upgrade 6.4 to 6.7 Done. HPSS Upgrade 743p1-743p2 done. 8xddf-s-* GHI nodes upgrade to hpss-743p2 done. (Downtime for one week) -> ahmad, dorin
(!) RDAC migration for Testsystem was tested by Tobias IBM and continued for testCore by Dorin. in Jan/16

3.2.2016 hpss prod. Migration from RDAC to Multipath for disk storage for all HPSS Server done. -> started in mid Jan/16. -> Dorin
17.12.2015 hpss prod. API logging on FrontEnds archive-sftp-01/-02 disabled. /var/hpss/tmp/hpss.api.resource.disabled
- Ahmad adapted the changes as described by IBM/Scott (10.12) and copied both files to all nodes (ddf-s-001-008)
- Started GHI again:
ddf-s-005]# initctl start start_ghi_run
# initctl start ghi_iom_hpss
# /opt/hpss/bin/ghistartup -g

14.12.2015 hpss prod. GHI:
/var/hpss/ghi/policy/hpss/backup_migration.policy
/var/hpss/ghi/policy/hpss/migrate.policy
RULE 'scratch_exclude' EXCLUDE WHERE path_name like '%/scratch/.ghi/%'
Should be changed to:
RULE 'scratch_exclude' EXCLUDE WHERE path_name like '%/scratch/%'
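
The difference between the two rules lies in the `like` pattern, where `%` matches any run of characters. The sketch below translates such a pattern into a regular expression to show which paths each rule would exclude. It is an illustration of SQL-style `LIKE` matching in Python, not the GPFS policy engine, and the example paths are hypothetical.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL-style LIKE pattern ('%' and '_' wildcards) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)

old_rule = like_to_regex("%/scratch/.ghi/%")
new_rule = like_to_regex("%/scratch/%")

# Hypothetical paths, for illustration only.
paths = [
    "/hpss/scratch/.ghi/tmpfile",
    "/hpss/scratch/user1/run.dat",
    "/hpss/projects/data.bin",
]
results = {p: (bool(old_rule.match(p)), bool(new_rule.match(p))) for p in paths}
for path, (old_hit, new_hit) in results.items():
    print(path, "old rule:", old_hit, "new rule:", new_hit)
```

Under these assumptions the old pattern covers only the `.ghi` subdirectory, while the new pattern covers everything below a `scratch` directory.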

10.12.2015 hpss prod. GHI: Scott found out that not the whole scratch dir was excluded: (s. email)

10.12.2015: hpss test: RDAC-Multipath: IBM/Tobias Elpelt Tested migration from Disk-Drivers-RDAC-Multipath. hpsstmvr shutdown by Ahmad, and restarted by tobias after work done. tobias->Email (how-to RDAC to dm-multipath migration 11.12.2015)
8.12.2015 hpss prod. After adding tape volumes -> Major errors, all disks 100% full, PVL for PVRs -> Down. GHI stopped, ticket to IBM Support. Uwe, Jae solved it: DB2 user (hpss) did not exist.
8.12.2015 hpss prod. New Tape volumes imported and created into PVRs STK_01 and IBM_IU1_01. (TK005000-TK014900 and ?-TS009900)
7.12.2015: hpss prod. HPSS Downtime for one week for maintenance. OS/RHEL6.4-6.7 and HPSS 743p1->743p2 Upgrade. Email sent to users.

2.12.2015 hpss prod. ChangeCos for zero files issue closed. TODO: recreate zero files in new COS 1205 after deleting from old COS 1204.
25.11.2015: hpss prod. GHI: Scott checked and responded. See email and ghi-problems-scott.howto. Now GHI is OK! But still some migration errors. -> Scott.
24.11.2015: hpss prod. OK: Mover03 and drives marked repaired after cancelling jobs. -> Dorin
23.11.2015: prod. Suspect: Mover03 and suspect drives of the IBM Lib -> broken air conditioning at the weekend!
19.11.2015: hpss prod. GHI: still showed error messages. Email sent to IBM Support -> Scott
- /opt/hpss/bin/htar.ksh was not on any of GHI nodes.
- scott upgraded GHI-HTAR to the latest version for 2.4,GHI-HTAR 5.0.0.1g, for bug fixes.
- Scott: patch released for GHI 2.4.0.1 for bug fixes including a Sev 1 fix. Postponed for later.
19.11.2015 hpss prod. GHI Scott findings:
18.11.2015 hpss prod. GHI: Due to error messages email sent to IBM Support -> Scott responded and found out 19.11.2015:
- delete old/create new Hier and COS 1102, 1103 via GUI for ghi policies.
- on ddf-s-001 # mmchfs hpss -z Yes (see. 2.11.2015)
# mmdsh chmod +x /opt/hpss/bin/ghi_iom
- mmmount /hpss -a
# chmod +x /var/hpss/ghi/etc/start_ghi_run
- change log path in /etc/init/start_ghi_run.conf (>> /data/ghi_log/start_ghi_run.log)
- initctl start start_ghi_run
- had to start ghi_iom manually: (Update: I should have used# initctl start ghi_iom_hpss)
# ddf-s-001: # mmdsh /opt/hpss/bin/ghi_iom hpss 8012 (see. ddf-s-005: /etc/init/ghi_iom_hpss.conf)
- /opt/hpss/bin/ghistartup -g
- on ddf-s-005 # chmod +x /var/hpss/ghi/etc/ghi_backup_migration.ksh

17.11.2015 hpss prod. GHI:
16.11.2015 Test. RHEL-OS upgrade from 6.4 -> 6.7 on hpsstcr03 and hpsstmvr, and recompile rdac DCS3700 on both servers.
Reason: Found out (9.11.15) that after the OS upgrade (2 weeks ago) from 2.6.32-358.el6.x86_64 to 2.6.32-358.23.2.el6.x86_64
/lib/modules/2.6.32-358.23.2.el6.x86_64/kernel/drivers/scsi/
mppUpper.ko mppVhba.ko (were missing)
Solution (IBM DCS3700 redbook):
- cp mvr1:~/Software/rdac-LINUX-09.03.0C05.0652-source.tar.gz mvrt:~Software/DCS3700/
- untar
- make uninstall
- make clean (in case *.o are compiled)
- make
- make install
- Add to /boot/grub/menu.lst:
title Red Hat Enterprise Linux Server (2.6.32-358.23.2.el6.x86_64) with MPP
root (hd0,0)
kernel /vmlinuz-2.6.32-358.23.2.el6.x86_64 ro root=/dev/sda3 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet console=tty0 console=ttyS0,115200n81r
initrd /mpp-2.6.32-358.23.2.el6.x86_64.img
- reboot

5.11.2015: Test. After hpsstmvr reboot: got disk read error. Therefore disk-tmvr broken
- LOGs: CORE0068: Space usage in tablespace STORAGESEGTAPEABIX of database subsys1 has exceeded critical threshold of 90%; usage at 1
- CORE0069: Space usage in tablespace TAPESEGUNLINK of database subsys1 has exceeded warning threshold of 85%; usage at 89%
- Solution: (Jae IBM): Check both "global" and "subsystem" configurations.
Make sure the "Metadata Space Monitor Interval" value is set to 0 seconds.

4.11.2015 Test. Core Server GUI Broken Status:
- P8-TestCOSes (500, 700) disabled (GUI/Subsystems/Configure).
- Data purged from DSC 1,2,3. MPS should be started to see the SCs under GUI/Monitor/Storage Classes active
- Due to CoreServer Major and Critical messages, /opt/hpss/bin/rc.hpss stop/start
and then Purge/Start

4.11.2015 hpss test:
4.11.2015 prod. GHI: old Hierarchy 1401 and COS 1401 (named "GHI Metadata") deleted and new ones with same IDs created with new Tape SC 204 and 205 (DSC 31, TSC 204, 205). restart: core Server/PVL/MPS
(!) Before enabling scripts and starting ghi you have to (mmchfs hpss -z Yes).
2.11.2015 prod. GHI: all daemons stopped: On ddf-s-005: disable scripts (1. chmod -x /opt/hpss/bin/ghi_iom) and (2. chmod -x /var/hpss/ghi/etc/start_ghi_run) and (mmchfs hpss -z No # to mount hpss without ghi started!!)
28.10.2015 prod. on sftp-01/02 sftp logging set up. logfile: /var/log/sftp
28.10.2015 prod. API logging activated via resource file as described by Scott (IBM). See email.
28.10.2015 test. Jonathan (IBM) put P8 Disk Storage OFF to solve PVL major status.
23.10.2015 prod. Purge started from GUI manually for Disk SC 22, due to user errors and log errors (No space left on device). -> Scott advised purging since no segments were left because of fragmentation. HPSS telco 21.10.2015.
23.10.2015 prod. Disk SC 22 purge policy changed from space exceeds 90% -> 70%; stop purge threshold left unchanged.
20.10.2015 prod. changecos for log files in /log March-14.10.2015 from COSID 1102 into COSID 1205. And reinitialize log daemon and client daemon
17.9.2015 prod. cronjobs on core server cr2 for accounting and site statistics. /etc/cron.d/(hpssstat, hpssacct)
11.9.2015 prod. cronjobs on core server cr1 for accounting and site statistics. /etc/cron.d/(hpssstat, hpssacct)
10.9.2015 prod. Trashcans activated at 16:03. GUI/Global/Trashcans Settings (5, 86400,864000, 3600)
7.9.2015 hpss prod. IBM_IU1_01 was still in Major. reporting Null mounts. Only PVR shutdown/start helped.-> OK
2.9.2015 hpss prod. IBM_IU1_01 PVR went Major. Cause: CN_IBM_N01 broken -> Martin Beizinger Maintenance, done 4.9.
2.9.2015 prod. STK-Drive change: broken STK-drive (id 202) connected to hpssmvr02 replaced with a new one and added to devices&drives. (old: mvr02:/dev/hpss/st.1b.1-T10000D-579004000755 -> new: mvr02:/dev/hpss/st.1b.1-T10000D-579004000661)
27.8.2015 prod. Class Of Service changed from 1102 to 1205 for LogDaemon GUI/Servers/Log Daemon/specific/Archive Class of Service
26.8.2015 prod. Max file size for 1205 changed to max 500GB. Based on new calculations came from HPSS Support. -> Scott (Email)
25.8.2015 prod. 8 PM end of changecos . But 33555 missing-Files still have COSID 1204. To be reported to HPSS Support.
24.8.2015 prod. GHI manager on ddf-s-005 disk 100% full due to /var/hpss/ghi/log/start_ghi_run.log (12GB). Temporary solution: copied file to ddf-s-005:/data/ghi_log_backup/ and emptied /var/hpss/ghi/log/start_ghi_run.log
20.8.2015 prod. 10 new tape volumes for STK_01_PVR added. 30 new tape volumes for IBM_IU1_01 added
20.8.2015 prod. STK drives for PVR STK_01 PVR went to suspect status during change cos for tape volume TK001500.
17.8.2015 prod. on sftp servers (sftp-01/02) unmount fuse 1204 and mount fuse with new cos 1205 but OLD PATH. /SFTP/KIT
17.8.2015 prod. 11 AM start of changecos from 1204 to 1205
13.8.2015 prod. Max. File size for COS 1204 and 1025 extended to 2TB. (OK came from Jos and HPSS Support-Scott)
Martin reported the drive. I Checked the drives in GUI as (Mark Repaired) => Green OK
# Suspect drives after firmware update
11.8.2015 hpss prod. T10:IBM|03592E07 0000078DB20E
13.8.2015 hpss prod. T10:IBM|03592E07 0000078DB1EE

GHI shutdown -g before upgrade and startup afterwards on ddf-s-005 successfully. (Ahmad)
After shutdown applypolicy jobs kept running. -> Scott said not to worry, they will stop after a timeout.
GHI logfiles written during applypolicy jobs filled the local disk.
-> Scott said to disable applypolicy scripts since GHI not used. (he didn't confirm)
iom processes kept running. -> Scott: not a problem, they are started automatically. You can only restart with -i.

3.8.2015 hpss prod. Due to GPFS-3.5-18 bug, upgrade to 3.5-25 on node ddf-s-006 done by (Ludmilla/Ursula)
29.7.2015 Test. Reformatting HPSS_T_Cache1 disk as preparation for the End2End Protection (T10PI). -> IBM Jonathan
core Server and PVL restart.
9.4.2015 prod. gui Migration/Purge server restarted. Cause: drives mounted for days. PVL job cancel had no effect, PVL (Major)
9.4.2015 prod. gui configure/Subsystems/Configure/Allow disabled for ID 1201, 1203, Error: "Disk migration Failure". (No Tape behind)
8.4.2015 Support Ticket submitted by IBM-DE Kay Jenke for broken Disk. (01V4NJL,724 for the Disk Alert)
7.4.2015 Test Drive failure Enclosure 99, Drawer 2, Slot 9

Jan/Feb. 2015 hpss prod. HPSS upgrade to 7.4.3p1
