hidden:System Logs (HPSS, TSM)
Format of the entries
date: service [Druva|HPA|HPSS|HPSS BwDA|HPSS-Prod|HPSS-Test|Icinga|TSM|TSM GridKa|TSM KIT|TSM LSDF]: admin [CEP,DL,DR,FB,HF,IM,JvW,MB] name: record: link to further info in gridka or lsdf wiki
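The entry format above can be split into its fields mechanically. The following is a minimal sketch, assuming the colon-delimited form `date: service: admin: record` that most entries from 2017 and 2018 use; the names `LogEntry`, `ENTRY_RE` and `parse_entry` are hypothetical and not part of any existing tooling. Entries written without a colon after the date (as in the 2019 entries) are deliberately not matched.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    day: date
    service: str
    admin: str
    record: str


# date: service: admin: record (the record itself may contain further colons)
ENTRY_RE = re.compile(
    r"^(?P<day>\d{1,2}\.\d{1,2}\.\d{4}):\s*"
    r"(?P<service>[^:]+?)\s*:\s*"
    r"(?P<admin>[^:]+?)\s*:\s*"
    r"(?P<record>.*)$"
)


def parse_entry(line: str) -> Optional[LogEntry]:
    """Return the parsed entry, or None if the line is not in the colon form."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    day = datetime.strptime(match.group("day"), "%d.%m.%Y").date()
    return LogEntry(day, match.group("service"), match.group("admin"),
                    match.group("record"))


if __name__ == "__main__":
    sample = "07.12.2017: TSM GridKa: MB : Tape Drive exchange STK-LIB01, Bay 13"
    print(parse_entry(sample))
```

Returning `None` for non-matching lines keeps continuation lines (extra SR numbers, command output) out of the parsed result instead of guessing at their fields.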
06.03.2019 HPSS bwDA DL: Tape drive exchanged STK-LIB02 Oracle SR#3-19419206511 Drive 1,3,-2,1,1 Bay32 SN: 579004007294 TraySN: 464970G+1508J90344 replaced by S/N 579004002234
01.03.2019 TSM GridKa MB: Tape Drive exchange SR 3-19477907621 : Drive with stuck cartridge STK-LIB01 DriveId: 2,1,1,7,0 Bay35 SN: 579004007108 TraySN: 464970G+1717J92551 NewSN: 579004007613
28.02.2019 ALL Libs MB: Upgrade MDVOP to SDP2 2.5.0-2
25.02.2019 TSM LSDF MB: STK-LIB02, BAY48 Firmware-Update STK tapedrive to Microcode 4.16.101-5.60
21.02.2019 TSM LSDF KS: STK-LIB02 Firmware-Update STK tapedrives to Microcode 4.16.101-5.60 HPSS (drives Bay 32 and 45 already done before, drive Bay 48 to be done later since loaded)
21.02.2019 TSM GridKa MB: Tape Drive exchange SR 3-19365788616 : Drive communication time-out. STK-LIB01 DriveId: 2,1,1,7,0 Bay35 SN: 579004006466 TraySN: 464970G+1717J92551 NewSN: 579004007108
18.02.2019 HPSS BwDA KS: STKLIB02 Firmware-Update STK tapedrive Bay 32 (Device Address: 1.3.-2.1.1 SN: 579004007294) to Microcode 4.16.101-5.60
14.02.2019 TSM GridKa MB: Tape Drive discarded Library Name: STK_ACS2 Drive Name: STK_ACS2_08 Bay56 ACS DriveId: 2,0,1,2 Serial Number: 576004003733 TraySerial: 464970G+1450J70004
11.02.2019 HPSS BwDA KS: STKLIB02 Firmware-Update STK tapedrive Bay 45 (Device Address: 1.2.2.1.1 SN: 579004000601) to Microcode 4.16.101-5.60
08.01.2019 TSM GridKa IM: Tape Drive exchange SR #3-19153851791 : Drive STK-LIB01 DriveId: 2,3,1,4 Serial Number: 579004000568
    STK_ACS2D_09 DriveId: 2,3,1,4 Bay15 SN: 579004000568 TraySN: 464970G+1717J92547 NewSN: 579004000199
    STK_ACS2_07 DriveId: 2,0,1,13 Bay57 SN: 576004004844 TraySN: 464970G+1539J70102 NewSN: 576004005745
08.01.2019 TSM GridKa MB: Tape Drive exchange
    SR #3-18972975069 : Drive communication time-out STK-LIB01 DriveId: 2,0,1,14 Bay53 SN: 576004009274 TraySN: 464970G+1450J70006 New SN: 576004003015
    SR #3-19003542481 : Code: 604, Drive not communicating, LIB01 Bay35 STK-LIB01 DriveId: 2,1,1,7 Bay35 SN: 579004007299 TraySN: 464970G+1717J92551 NewSN: 579004006466
19.12.2018 TSM GridKa MB: Tape Drive exchange SR 3-18936661441 : Drive defective Bay47 STK-LIB01 DriveId: 2,1,1,4 Bay47 SN: 576004006167 TraySN: 576000200822 New SN: 576004001844
17.12.2018 TSM GridKa MB: Tape Drive exchange SR 3-18928977311 : Drive STK_ACS2D_05 errors with different tapes STK-LIB01 DriveId: 2,1,1,15 Bay33 SN: 579004007289 TraySN: 464970G+1717J92538 New SN: 579004002362
17.12.2018 TSM LSDF MB: Tape Drive exchange SR 3-18915577271 : rewindUnload Successful with mediaError STK-LIB02 DriveId: 1,0,1,2 Bay56 SN: 579004003143 TraySN: 464970G+1511J90413 New SN: 579004004270
13.12.2018 TSM Gridka MB: Tape Drive exchange
    SR 3-18912865575 : Drive communication time-out STK-LIB01 DriveId: 2,1,1,11 Bay34 SN: 579004007285 TraySN: 464970G+1717J92537 New SN: 579004005690
    SR 3-18820971501 : Drive communication time-out STK_LIB01 DriveId: 2,3,1,4 Bay15 SN: 579004007296 TraySN: 464970G+1717J92542 New SN: 579004000568
11.12.2018 TSM Gridka MB: Error message seen in TSM:
    12/07/2018 18:34:01 ANR3619W (Session: 1201176, Origin: STA-F01-032-105) The user limit for open files is below the recommended minimum value of 8192. (SESSION: 1201176)
    [root@a01-013-110]# ulimit -n 16384
    Edit /etc/security/limits.d/90-nproc.conf and adjust this line:
    * soft nproc 16348
06.12.2018 STK-Libs MB: Firmware update to FRS 8.51 on STKLIB01 and STKLIB02
30.11.2018 TSM Gridka MB: Two failed DB backups, new strange behaviour. Changed the value of "tcpserveraddr" in /opt/tivoli/tsm/client/api/bin64/dsm.sys from "localhost" to 127.0.0.1
13.11.2018 TSM Gridka MB: Tape Drive exchange
    SR 3-18606189736 : Drive communication time-out STK-LIB01 DriveId: 2,2,1,4 Bay31 SN: 579004007232 TraySN:1717J92549 NEW SN: 579004001928
    SR 3-18686908491 : Drive not communicating SN: 1450J70005, Bay54 STK-LIB01 DriveId: 2,0,1,10 Bay54 SN: 576004002409 TraySN:14150J70005 NEW SN: 576004003048
06.11.2018 BWDataArchiv MB: Firmware-Update STK tapedrives to 4.15.102-5.60 STK-LIB02
15.10.2018 TSM GridKa MB: Firmware-Update STK tapedrives to 4.15.102-5.60 STK-LIB01
10.09.2018 TSM LSDF KIT: 6 LTO5 drives from K&P for library scc-tsmlib01.scc.kit.edu, formerly IU1
    FXDY   Lib-SN      SN from K&P  WWN / usage
    F2R1   00078ACEAE  00078A27A1   KIT
    F1R7   00078ACE98  00078AEC29   50:05:07:63:00:52:04:07
    F1R8   00078ACE8E  00078AEC60   50:05:07:63:00:52:04:08
    F4R4   00078A18AE  00078A11FC   50:05:07:63:00:52:04:34
    F4R5   00078A1E49  00078A2737   50:05:07:63:00:52:04:35
    F4R12  00078A1E48  00078A11ED   50:05:07:63:00:52:04:3C
11.09.2018 TSM GridKa: KS: Tape Drive exchange SR 3-18266690351 : STK-LIB01 DriveId: 2,0,1,11 Bay 50 SN: 579004001302 Tray: 1717J92543 NEW SN: 579004008352
06.09.2018 TSM GridKa: KS: Tape Drive exchange SR 3-18245040931 : STK-LIB01 DriveId: 2,2,1,7 Bay 19 SN: 579004008311 Tray: 1826J92950 NEW SN: 579004000147
04.09.2018 TSM GridKa: MB: Tape Drive exchange SR 3-18208933461: STK-LIB01 DriveId: 2,0,1,13, Bay 57, SN: 576004003504 Tray: 1539J70102 NEW SN: 576004004844
14.08.2018 TSM LSDF-01: MB: Tape Drive exchange SR 3-18045355851: STK-LIB02 DriveId: 1,0,1,3, Bay 52, SN: 576004003156 Tray: 1511J90411 NEW SN: 576004006742
27.06.2018: TSM GridKa: IM: Tape Drive exchange STKLIB01 SR 3-17749380856: ACS DriveId: 2,0,1,0 Bay64 Serial Number: 576004007817 Tray: 1539J70107 NEW SN: 576004009751
08.06.2018: LSDF TSM01: DL: STK-LIB02 Drive 1.1.-1.1.4(Bay51) STK_ACS1_02/SN:579004002420 replaced by SN:579004002581
07.06.2018: TSM GridKa: IM: Tape Drive exchange STK-LIB01, STK_ACS2_00 DriveId: 2,0,1,0, Bay 64, SN: 576004009068 NEW SN: 576004007817
24.05.2018: HPSS bwDA: DL: STK-LIB02 Drive 1,2,-2,1,1(Bay48) bwDA_3_1_1_0_202/SN 579004000661 replaced by SN:579004001489
16.05.2018: HPSS BwDA: KS: Tape Drive exchange STK-LIB02 SR 3-17482036161 : Drive BwDA_3,2,1,4_303, Bay 31, SN:579004003155 New SN:579004000742
09.05.2018: TSM GridKa: IM: Tape Drive exchange STKLIB01
    SR 3-17434695021 : ACS DriveId: 2,3,1,8 Bay14 Serial Number: 579004007295 Tray: 1717J92550 NEW SN: 579004002914
    SR 3-17434587291 : ACS DriveId: 2,0,1,0 Bay64 Serial Number: 576004009751 Tray: 1539J70107 NEW SN: 576004009068
    SR 3-17366093101 : ACS DriveId: 2,0,1,13 Bay57 Serial Number: 576004006798 Tray: 1539J70102 NEW SN: 576004003504
    SR 3-17406261378 : ACS DriveId: 2,1,1,12 Bay45 Serial Number: 576004006015 Tray: 576000200839 NEW SN: 576004006786
17.04.2018: TSM GridKa: MB: Tape Drive exchange STK-LIB01, STK_ACS2_03 DriveId: 2,0,1,12, Bay 61, SN: 576004009752 NEW SN: 576004002012
16.04.2018: TSM GridKa IM, MB: new StorageAgents on f01-032-101,103,105,107; ActiveLogSize reduced from 8192 to 7168
26.03.2018: ERMM MB: ERMM and library OU1 scc-tsmlib-n02 shut down
22.03.2018: TSM GridKa: MB: Tape Drive exchange STK-LIB01 STK_ACS2D_08 Bay 29 DriveId: 2,2,1,12 SN: 579004007284 New SN: 579004006645
22.02.2018: TSM GridKa: MB: Tape Drive exchange STK-LIB01, STK_ACS2_2 DriveId: 2,0,1,10, Bay 54, SN: 576004009252 NEW SN: 576004002409
13.02.2018: TSM GridKa: MB: Altered yum repos, disabled ROCKs and enabled SL Linux
23.01.2018: HPSS HPA: DL: default gateway changed on hpa from 172.18.92.200 to 172.18.92.1
16.01.2018: HPSS: DL: Red Hat upgrade from 6.7 to 6.9 on all hpss machines; security updates on all archive-sftp-0{1,2,4,5,6} (Meltdown and Spectre)
12.01.2018: TSM GridKa: MB: Tape Drive exchange STK-LIB01 STK_ACS2D_2,0,1,11 Bay 50 DriveId: 2,0,1,11 SN: 579004000737 New SN: 579004001302
09.01.2018: TSM LSDF: MB: Documentation of TSM LSDF-01 Installation on new hardware Media:LSDF-01_NEW-2017.txt
09.01.2018: TSM LSDF: MB: Installed and started yum-cron on lsdf-tsm01, /etc/sysconfig/yum-cron: checkonly=yes
07.12.2017: TSM GridKa: MB: Tape Drive exchange STK-LIB01, STK_ACS2D_11 DriveId: 2,3,1,12, Bay 13, SN: 579004001438 NEW SN: 579004000672
05.12.2017: TSM LSDF: MB: Tape Drive exchange STK-LIB02, Bay 51, LSDF_1,0,1,7 SN: 579004002367 NEW SN: 579004002420
05.12.2017: TSM LSDF: MB: Drives Frame1/R7 00078ACE98 and Frame4/R4 00078A18AE deleted in TSM and logical library
17.11.2017: SFTPServer: DL: in /etc/fail2ban/fail2ban.conf replaced logtarget = SYSLOG with logtarget = /var/log/fail2ban.log and restarted fail2ban on bwdahub
13.11.2017: TSM GridKa: MB : Firmware update T10000D STKLIB01 Bay13-Bay15, 4.13.101-5.60
02.11.2017: TSM LSDF: MB : Tape Drive exchange STK-LIB02, Bay 51, LSDF_1,0,1,7 SN: 579004007136 New SN:579004002367
26.10.2017: HPSS BwDA: MB : on hpssrc deleted this entry in crontab, since the storage is managed only on hpa:
    # sync storage array time
    #MB: 26.10.2017# 0 1 * * * /usr/local/bin/sync_storage_time.sh
19.10.2017: HPSS BwDA: KS : Tape Drive exchange STK-LIB02, Bay 32, BwDA_3,2,1,0_102 SN:579004003070 New SN:579004007294
04.10.2017: HPSS BwDA: MB : Tape Drive exchange STK-LIB02, Bay 45, BwDA_3,1,1,12_201 SN: 579004001831 New SN:579004000601
28.09.2017: TSM GridKa: MB : Robots 1.3.0.2.0 and 1.3.0.1.0 replaced in STK_LIB01
27.09.2017: TSM GridKa: MB : Tape Drive exchange STK_ACS2_7 DriveId: 2,0,1,4 Bay 63 WWN: 500104F000A4B331 SN 576004009753 new SN 579004004337
07.09.2017: TSM GridKa: MB : Tape Drive exchange STK_ACS2D_11 DriveId: 2,3,1,12 Bay 13 WWN: 500104F000A4B29B SN 579004007298 new SN 579004001438
04.09.2017: HPA: MB :
    - enable target Graphical Interface (runlevel 5) for Storage Management GUI
    - installation of SanTricity Software for storage management, start the GUI with "SMclient". Software is on hpa:/export/hpa/Software/storage/
    - firmware stored for E5600 on hpa:/export/hpa/Software/storage/firmware/E5600
28.08.2017: TSM GridKa: MB : Tape Drive exchange STK_ACS2D_07 DriveId: 2,2,1,8 WWN: 500104F000A4B2CE SN: 579004007291 New SN: 579004000138
28.08.2017: TSM GridKa: MB : 2 new cron jobs for database protection on a01-013-110, GRID1
    # Backup the database backup from disk to NB03, tape
    15 12 * * * dsmc i /db_backup -se=backup 1>/dev/null 2>/dev/null
    # database protection, backup /archlog every 15 minutes to tape on NB03
    15 * * * * dsmc i /archlog -se=backup 1>/dev/null 2>/dev/null
22.08.2017: TSM GridKa: MB : Tape Drives exchange
    STK_ACS2_16, DriveId: 2,1,1,14 WWN: 500104F000A4B2E3 SN: 576004004932 New SN 576004003756
    STK_ACS2D_01, DriveId: 2,0,1,11 WWN: 500104F000A4B30A SN: 579004007300 New SN 579004000737
16.08.2017: TSM GridKa: MB : Special build firmware installed on T10000C drives RE363101-5.50, should prevent FSC37F6 in the future
11.08.2017: TSM GridKa: MB : TSM Server GRID1 Admin Schedule BA_DB_SNAP renamed to BA_DB_DISK, now one full backup to disk every day at 12:00 in directory /db_backup
10.08.2017: TSM GridKa: MB : TSM Server GRID1 /actlog and /archlog moved to other disks, created new filesystem /db_backup for database backups, admin schedule BA_DB_SNAP now runs every day
25.07.2017: TSM GridKa: MB : Robot 1.2.0.2.0 replaced in STK_LIB01
24.07.2017: TSM GridKa: MB : Tape Drives exchange
    Bay 55 2,0,1,6 SN:576004006401 TRAY_SN:464940G 1450J70010 New_SN:576004009059 T10000C
    Bay 58 2,0,1,9 SN:576004009749 TRAY_SN:464970G 1539J700104 New_SN:576004003752 T10000C
    Bay 59 2,0,1,5 SN:576004006915 TRAY_SN:464940G 1450J70009 New_SN:576004009398 T10000C
20.07.2017: TSM GridKa: MB : Robot 1.1.0.1.0 replaced in STK_LIB01
18.07.2017: HPSS BwDA: DL : new host certificates installed on sftp01/02
11.07.2017: TSM GridKa: MB : Robot 1.2.0.1.0 replaced in STK_LIB01
06.07.2017: TSM GridKa: MB : New replacement cartridges inserted: TA2110, TA2112, TA2416, TA2449, TA1820
05.07.2017: HPA: KS : included /var in the backup again, produces occasional TSM error mails
05.07.2017: GridKa: MB : On TSM Server GRID1 "set actlogretention 35" ==> ANR2090I Activity log retention set to 35 for management by DATE.
29.06.2017: GridKa: MB : Drive 576004002418 2,1,1,4 replaced, new SN:576004006167 T10000C
07.06.2017: TSM GridKa: MB : T10000D drives for ALICE and CMS put into operation
01.06.2017: TSM GridKa: MB : Firmware update of T10000D drives to 4.12.106-5.60 in STK-LIB01
30.05.2017: HPA: JvW: Excluded /var from backup for the time being because its activity causes backup errors. Need to change the local storage to LVM
18.05.2017: TSM GridKa: MB : Drive 579004009748 2,0,1,13 Bay 57 replaced, new SN:579004006798 T10000C
28.04.2017: TSM KIT: MB : time adjustment, changed crontab entry to ".... ntp.scc.kit.edu ..." on TSM Servers scc-tsm-n01,n02,s01,s02
27.04.2017: HPSS-Prod: JvW: added /opt/hpss/bin to the root PATH variable in bash_profile. Will add a corresponding rule in Ansible
26.04.2017: GridKa: MB : Update TSM database from V5.5 to 6.3.5.100, now DB2!
24.04.2017: HPSS-Prod: DL : Migration Policy 13 (test GridKA) "Number of Migration Streams per File Family" changed from 4 to 1 (decided in HPSS GridKa JF)
13.04.2017: HPSS-Prod: AH : archive-sftp-0[12456] set default values: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes in /etc/sysctl.conf
11.04.2017: HPSS-Prod: AH : hosts archive-sftp-0[456] set some limits like archive-sftp-0[12] in /etc/security/limits.conf and /etc/sysctl.conf.
11.04.2017: HPSS-Test: AH : hpssmvr7 mover server (Power8) and its NetApp disk devices removed from HPSS.
11.04.2017: HPSS-Prod: AH : hosts archive-sftp-0[456] yum repo changed from rhs.scc.kit.edu to RHEL with new registration via subscription-manager.
10.04.2017: HPSS-Prod: JvW: Config changes for the archive-sftp-[0-6] hosts. Most importantly, all host keys were synced. Users of sftp complained about wrong fingerprints after the new hosts were added to production
06.04.2017: TSM LSDF: MB : Drive 579004003031 1,0,1,7 replaced on 06.04.2017, new SN:579004007136 T10000D
04.04.2017: HPSS-Prod: AH : hosts archive-sftp-0[456] installed/configured hpss-7.4.3p2, hpssfuse 2.0.1-Bugfix (Scott), pam_mysql, chroot, autofs map files.
03.04.2017: GridKa: MB : Drive 576004002668 2,1,1,8 replaced on 03.04.2017, new SN:576004001220 T10000C
01.04.2017: HPSS-Prod: JvW: cleaning yum db after switching to RH subscription
01.04.2017: HPSS-Test: JvW: cleaning yum db after switching to RH subscription
01.04.2017: HPSS-Prod: JvW: new hosts archive-sftp-0[456] installed with RH6.8
29.03.2017: HPSS-Prod: JvW: enabled syslog in hpss (HPSS logging now additionally goes to hpa via rsyslog)
29.03.2017: HPSS-Prod: AH : hpsscr/GUI change (Max. VVs To Write): TapeSC 204 (from 5 to 1), TapeSC 205 (from 3 to 1).
29.03.2017: HPSS-Prod: AH : hpsscr/GUI Migration policy (SC22) change (# Streams Per FF, Total M. Streams) from 3, 3 to 1, 1.
29.03.2017: HPSS-Prod: AH : hpsscr/GUI purge policy (SC21,SC22) from 60min,50%,40% to 1440min,80%,70% (Last file access, Start purge, Stop purge)
24.03.2017: TSM LSDF: MB : 4 LTO5 drives installed in scc-tsmlib-n01.scc.kit.edu, F4R1,F4R3,F4R6,F4R8
23.03.2017: HPSS-Prod: AH : archive-sftp-01/-02 set hpss-fuse option stream=1 (MB) for all project mounts /hpss/<project>
22.03.2016: HPSS: MB : Drive 3,1,1,12 SN:579004000634 exchanged, new SN: 579004001831
21.03.2017: HPSS-Prod: AH : archive-sftp-01/-02 set hpss-fuse option stream=1 (MB) for /hpss/bwda
21.03.2017: HPSS-Prod: AH : archive-sftp-01/-02 edit /etc/sysctl.conf set vm.vfs_cache_pressure=100. (Default)
17.03.2017: HPSS-Prod: AH : archive-sftp-01/-02 edit /etc/sysctl.conf set vm.vfs_cache_pressure=1. (Test!)
15.03.2017: HPSS-Prod: AH : archive-sftp-01 install smem package (yum --enablerepo=epel install smem python-matplotlib).
14.03.2017: GridKa: MB : Drive 576004002628 2,1,1,0 replaced on 14.03.2017, new SN:576004003031
14.03.2017: HPSS-Prod: AH : archive-sftp-01/-02 edit /etc/sysctl.conf set net.ipv4.tcp_keepalive_time=600, net.ipv4.tcp_keepalive_intvl=10, net.ipv4.tcp_keepalive_probes=6.
13.03.2017: HPSS-Prod: AH : archive-sftp-01/-02 edit EXCLUDE_MOUNTPOINTS in /etc/rear/local.conf.
10.03.2017: HPSS-Prod: AH : archive-sftp-01/-02 change stream=16M to default 8M for /hpss/bwda fuse mount in /etc/fstab. Remount /hpss/bwda.
10.03.2017: HPSS-Prod: AH : archive-sftp-01/-02 disable /etc/cron.daily/mlocate.cron for updatedb. (ToDo: only exclude fuse/bind mounts in /etc/updatedb.conf)
09.03.2017: TSM: MB : CVE-2017-2636 mitigation, done on LSDF-01/02, scc-histor-n01 and GRID1: # echo "install n_hdlc /bin/true">> /etc/modprobe.d/disable-n_hdlc.conf
08.03.2017: HPSS-Prod: AH : archive-sftp-01/-02 edit /etc/sysctl.conf change vm.swappiness=20 (it was =0).
03.03.2017: HPSS-Prod: DL : tape storage class 103 : Maximum VVs To Write: 4 (restart Migration/Purge Server and restart Core Server)
03.03.2017: HPSS-Prod: AH : on both archive-sftp-01/02: add /etc/fail2ban/action.d/iptables-common.local to solve a fail2ban error related to iptables-v1.4.7
03.03.2017: HPSS-Prod: AH : Downtime 10-12 am, on both (archive-sftp-01/02): yum --security update, vm.overcommit_memory=2, 'rvl=86400' for hpssfs mounts.
03.03.2017: HPSS-Prod: AH : Downtime 10-12 am, on both (archive-sftp-01/02): fuse remounts needed to apply ulimits for max open files changed on 24.2.2017.
28.02.2017: HPSS-Prod: DL : Migration Policy 13 (test GridKA) "Aggregate Files to tape" enabled, "Max Files in Aggregate" set to 1000
28.02.2017: HPSS-Prod: DL : Migration Policy 13 (test GridKA) "Number of Migration Streams per File Family" and "Total Migration Streams" changed from 1 to 4.
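As a worked example for the keepalive values set on 14.03.2017: on Linux an idle TCP connection is probed after `tcp_keepalive_time` seconds, then every `tcp_keepalive_intvl` seconds, and is dropped after `tcp_keepalive_probes` unanswered probes. The sketch below only does that arithmetic; the function name `keepalive_detection_seconds` is hypothetical.

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Seconds until an idle, dead TCP peer is detected and the connection dropped."""
    return time_s + intvl_s * probes


# Values from the 14.03.2017 entry (archive-sftp-01/-02, /etc/sysctl.conf)
configured = keepalive_detection_seconds(600, 10, 6)

# Linux kernel defaults (7200 s, 75 s, 9 probes) for comparison
default = keepalive_detection_seconds(7200, 75, 9)

if __name__ == "__main__":
    print(f"configured: {configured} s ({configured / 60:.0f} min)")
    print(f"default:    {default} s ({default / 60:.2f} min)")
```

With the logged settings a dead peer is noticed after 660 s (11 minutes) instead of 7875 s with the kernel defaults.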
27.02.2017: TSM: MB : To prevent warning messages "sb03: setopt idletimeout 30"
27.02.2017: TSM: MB : Recabling SAN at TSM CN
24.02.2017: HPSS-Prod: DL : Maximum Open BitFiles: 10000 (it was 2000 before)
24.02.2017: HPSS-Test: DL : Maximum Open BitFiles: 10000 (it was 2000 before)
24.02.2017: HPSS-Prod: AH : ulimits for max open files and #procs increased on sftp-01/-02 (in /etc/security/limits.conf).
17.02.2017: GridKa: MB : Drive 576004009255 2,0,1,1 replaced on 17.02.2017, new SN:576004004986
17.02.2017: GridKa: MB : Drive 576004009226 2,0,1,2 replaced on 17.02.2017, new SN:576004003733
13.02.2017: TSM: MB : TSM server sb03 database reallocated to 4 volumes
10.02.2017: GridKa: MB : Drive 576004001000 2,1,1,14 replaced on 10.02.2017, new SN:576004004932
07.02.2017: Icinga: KS : yum update (with kernel update) on monitoring servers scc/gridka-ora-mon
----------------------------------------
28.11.2016: HPA: CEP: Further RedHat images provided at /export/hpa/RH_repos/RHEL/. 7.3 copied into /export/hpa/RH_repos/rhel_repo/7.3
28.11.2016: HPA: CEP: RedHat updates applied.
28.11.2016: HPSS-Prod: JvW: added allow_other option to LSDF-Archiv hpssfs mount command in /etc/fstab because the user could not chdir since the id is not known.
25.11.2016: HPSS-Prod: DL : added new volumes for primary copy as well as for secondary copy (125 T10KD cartridges and 258 TS1140 cartridges)
14.11.2016: HPSS-Prod: JvW: Removed scp and rsync entries from /etc/rssh.conf on archive-sftp-0[12]. This ensures the proper warning message if a user tries to log in with ssh.
The entries below are from Ahmad and were copied from a text file.
4.8.2016: hpss prod. UDA-Checksum: all Fuse mounts mounted on archive-sftp-01/-02 with cksum=md5,nch=g options
4.8.2016: hpss prod. new Fuse-Bugfix upgrade hpssfs-fuse-2.0.1-0.el6.x86_64 on archive-sftp-01/-02 (rpm from Scott/IBM)
/SFTP/KIT -> /hpss/bwda/000000/<user>
Projects: new directories for bwdatadiss and radar
1-4.8.2016 hpss prod.: Downtime for HPSS-FS layout change (1st phase) without username, uid and gid changes, done via scrub rename command
(!) yum update failed because of a missing x86_64 library that is not included in the rhel repository (Bugzilla)
29.7.2016 hpss prod. + hpss test: yum security updates on HPSS Server + Frontends (archive-sftp-01/-02 and BWDAHUB) + reboot
8.7.2016 hpss prod. Update to hpssfs-fuse-Bug-Fix (Software/HPSS/fuse-BugFix-v2001-2010/hpssfs-fuse-2.0.0-1.el6.x86_64.rpm) installed on both archive-sftp-01/-02. Mounted as before without cksum option.
4.7.2016 hpss prod. 3x tape drives (101->mvr1 and 301,302->mvr3) had "suspect" status, starting on 30.6 with one drive, then the two others on the following days, and were marked repaired. Martin checked the library logs -> OK. Email sent to Waldecker on 1.7 -> no reply.
20.6.2016 hpss test: Upgrade from 743p2 to 743p3 and test reading files (written in 743p2) with the e2e-configured system (743p3).
/db2_backup/offline/hcfg/HCFG.0.hpssdb.DBPART000.20151221154752.001
/db2_backup/offline/hsubsys1/HSUBSYS1.0.hpssdb.DBPART000.20151221154835.001
16.6.2016 hpss test: E2E Tests: Downgrade from 743p3 to 743p2 based on db2 offline backup to test read of files:
30.5.2016 hpss test: Testsystem reconfiguration with Tape SC 101, 102 -> Done by Dorin
gftp03: /root/Software/HPSS/fuse-BugFix-v2001-2010/fuse_with_BZ6036 . ver. hpssfs-fuse-2.0.1-0.el6.x86_64.rpm
24.5.2016 hpss test: fuse BugFix (got from Scott) rpm installed on test FE archive-gftp-03
24.5.2016: hpss prod. drive ID 101, IBM TS1140 on mvr01 -> Suspect status
* sftp-01/-02 # uname -r -> 2.6.32-573.26.1.el6.x86_64
* for HPSS server (cr01,cr02, mvr01,02,03,04) (!) they were already up to date? # uname -r -> 2.6.32-573.18.1.el6.x86_64
hpss test: Update done by Dorin. for tcr03, tmvr # uname -r -> 2.6.32-573.22.1.el6.x86_64
archive-gftp-03 done by Ahmad, uname -r -> 2.6.32-573.26.1.el6.x86_64
24.5.2016 hpss prod. Maintenance: RHEL yum update for HPSS and FE (archive-sftp-01/-02)
19.5.2016 hpss prod. E2E disk reconfiguration done. -> Email from HPSS Support/IBM Tobias
After testing access I forgot to set the right permissions -> Ahmad.
(!) - Still to decide about the directory permissions of all projects on production
18.5.2016 hpss prod. Bareos hpss folder permissions change: frp, 777 to scrub> chmod /hpss/bareos 771
18.5.2016 hpss prod. COS 122 (GridKa SC 12) Allowed, Core Server -> restart, MPS -> restart to activate SC 12 for disk resources to be imported/created into HPSS after the E2E reconfiguration. Request from HPSS Support/Tobias
18.5.2016 hpss prod. Mover1/Drive 101 was marked Suspect, IBM_IU1_PVR (Major). -> Marked Repaired
11.5.2016: hpss prod. TSM team will take 50x STKT10KD volumes from STK HPSS pool. These volumes are not imported and created into HPSS, therefore no further work needed. BUT we have to get new ones later so as not to run short of volumes for the first copy.
11.5.2016: hpss prod. Totally broken IBM Library, IBM_IU1_01 PVR (Major). -> Martin B. => no second copy.
Refer to howtos: ghi-thresholds.howto
9.5.2016: prod. GHI. Apply threshold policy as described by Scott/IBM. Done together with GPFS admin (Lusmilla)
6.5.2016 hpss prod. 100 new tape volumes from IBM TS1140 (SC 205 sec. copy) added (imported/created) after a warning message by the HPSS GUI about no space left on tape.
3.5.2016 hpss prod. new project bareos setup, COS 1204, Large SC 23. -> FrontEnd (bht.lsdf.kit.edu, Stephanie Boehringer) -> Done by Ahmad
April 2016: hpss prod. New project gridka-dcache set up with Fuse. COS 123. Done by
- login OK
(!) After reboot the hpss_disk_mvr05 server could be started via HPSS GUI, also PVL -> OK
(!) Only hpssmvr07 resources still in Unknown status, TODO: Decide what to do with them.
hscroot@hpss-hmc-a:~> chsysstate -r lpar -m Server-8247-22L-SN212960A -o shutdown --immed --restart -n redhat7-03
- To reboot hpssmvr05 aka redhat7-03:
29.4.2016: Test. hpssmvr05 rebooted because the host was unreachable via ping/ssh and hmc console login via vterm or mkvterm hangs after the Open Completed message.
# Stefan Waldecker replaced the drive's broken jbec. (-> Martin)
Then disabled all Migration and Purge Policies assigned to SCs (1,2,3)
For the default SC 99 I disabled the Migration Policy "Test-Migrate-Disk-To-Tape-01" before.
Reason: The Tape SCs 10 and 11 did not have any tape volumes assigned and the MPS process kept failing with a lot of log files in /var/hpss/log every minute. And PVL went "Major".
Howto disable M&P: GUI: Configure/Storage Space/Storage Classes -> mark SC1,2,3 -> Configure -> Mig/Purge Policy -> NONE; Core server restart (Shutdown/Start); MPS restart (Shutdown/Start)
(!) SC 501 and SC 701 kept unchanged.
29.4.2016: Test. backup all hpss configs via /opt/hpss/bin/lshpss into /root/Software/HPSS/hpss-configs/
27.4.2016: hpss prod. IBM-1140 Drive 0000078DB20E problem solved. -> HPSS OK
On movr3 : /dev/hpss/st.1b.1-03592E07-0000078DB20E (missing) # sg_map -st -x -i # lsscsi showed only one tape device -> Martin informed.
25.4.2016 hpss prod. IBM-1140 Drive (0000078DB20E) went Suspect on 25.4, -> mover3 Suspect -> HPSS, PVL Major.
22.4.2016: hpss prod. GHI: ddf-s-005# chmod +x /var/hpss/ghi/etc/ghi_backup_migration.ksh
21.4.2016 hpss prod. COS 1205 (Medium 2x copies) max file size changed from 500GB to 600GB. (Jos' request to allow end user to tar 8192x68MB in one file of a simulation run, see mail from <agathe.chouippe@kit.edu>)
For all SCs (13, 21, 22, 23) purge policy changed (see 13.4.2016): Start purge when space used exceeded: from 0% to 50%; Stop purge when space used falls to: from 0% to 40%
21.4.2016 hpss prod. Change purge policies while reformat of first storage system still running
- Migrate, Purge and Repack is done for D0300100 - D0401800, therefore all disks on Storage Unit 2 are empty.
- Firmware update on Storage Unit 2 is done.
- All Logical Drives are deleted and recreated with T10PI enabled.
- Formatting is expected to take up to 12 days (((36 Logical Drives / 8 in parallel) * 60 hours) / 24 hours).
16.4 Status mail from Tobias
- dump_acct_sum created a list of all COSes and #Files belonging to
- Purge Policies (SCs: 13, 21, 22, 23) changed to 0%: Start purge when space used exceeded: 0%; Stop purge when space used falls to: 0%
- Most files on disk were only on tape and purged from disk
15.4
- Remaining files on disk repacked, firmware upgrade done by IBM/Tobias Elpelt
- IBM/Tobias removed from HPSS and started formatting first disk system. ()
- Older COSes (1201, 1202, 1203) were Allowed and activated again, Core_server and MPS restarted.
- Delete all files belonging to older COSes and other remaining unneeded files.
13.4.2016 hpss prod. Start reconfiguration procedure to reformat disk storage systems for E2E support:
Core Server Shutdown/Start, MPS Shutdown/Start
5.4.2016 hpss prod. UTF8 Support GUI/Global/Global Flags/Object names can contain unprintable characters (ON)
-> also site statistics report sent to Jae Kerr produced by this new script
30.3.2016 hpss prod. /etc/cron.d/hpssstat changed to /usr/local/bin/hpssstats2.py (new version downloaded from hpss wiki)
29.3.2016 hpss prod. tape mover on hpssmvr01 went suspect. Reason: drive (101) suspect status.
16.3.2016 hpss prod. tape mover on hpssmvr02 went Suspect. Reason: drive (201) suspect status.
Notice: Work done by Dorin. Reason: to provide LSDF with tape storage due to lack of space on other libs.
10.3.2016 hpss prod. STK Library (CN_STK_LIB01) has been partitioned. The 2x STK drives of STK_01_PVR on hpssmvr02 changed drive address to 3,1,1,12 and 3,1,1,0.
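The formatting estimate in the 21.4.2016 entry, (((36 Logical Drives / 8 in parallel) * 60 hours) / 24 hours), can be checked with a few lines. This is a sketch only: the function names are hypothetical, and the second variant rests on the assumption (not stated in the log) that drives are formatted in whole batches of 8.

```python
import math


def format_days(drives: int, parallel: int, hours_per_drive: float) -> float:
    """Estimate exactly as written in the log entry (fractional batches)."""
    return drives / parallel * hours_per_drive / 24


def format_days_whole_batches(drives: int, parallel: int,
                              hours_per_drive: float) -> float:
    """Assumption: a partly filled last batch still takes the full time."""
    return math.ceil(drives / parallel) * hours_per_drive / 24


if __name__ == "__main__":
    print(format_days(36, 8, 60))                # log formula
    print(format_days_whole_batches(36, 8, 60))  # whole-batch assumption
```

The log formula yields 11.25 days, which matches the "up to 12 days" in the entry; under the whole-batch assumption the estimate would be 12.5 days.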
4.3.2016 test: upgrade HPSS Testsystem vrom hpss-7.4.3p2 to hpss-7.4.3p3 including DB2 Conversation 23.2.2016: test: FUSE und HPSS client rpms updated on archive-tgftp.lsdf.kit.edu to hpssfs-fuse-2.0.1-0.el6.x86_64.rpm. But also updating HPSS client software to 7.4.3p2-1 was also needed to be able to update the fuse package. 22.2.2016: hpss prod. SFTP-01/02 yum-autoupdate has been installed with email to root,hpss-admin@lists.kit.edu in /etc/sysconfig/yum-autoupdate -> Dorin - Check if BWDAHUB has been updated by Frank! (!!!)TODO: - powerMovers mvr05 and mvr07 still to be updated(!!!) 22.2.2016: hpss prod. cr01/02, mvr01, 02,03,04, SFTP-01/02, OS update duo to glibc security bug (CVE-2015-7547) 19.2.2016: test: hpsstcr03 and hpsstmvr OS update duo to glibc security bug (CVE-2015-7547) rm: cannot remove `file_50MB.txt.copy.61991': Invalid argument 19.2.2016 prod. Problem: After deletion of 23 files under /home/ahmad2/* deleting them from Trash .Trash/root/* -> error: see. code sftp-01:/root/coding_ahmad TODO: still 10 files of cosid 1204. 19.2.2016 prod. changecos of zerofiles: rm zero file, touch zerofile, chmod user. zerofile, rm ./Trash/root/zerofile now permission set currectly: ddf-s-005# -rwxr--r--. 1 root root 13028 Jul 19 2014 /var/hpss/ghi/etc/ghi_backup_migration.ksh (!!)TODO: check and keep an eye on it to see if every thing is OK! 11.2.2016 prod. GHI. Problem of missing policy backup and migration solved: /var/hpss/ghi/etc/ghi_backup_migration.ksh had no x-permission. -> Email from Support/tobias (Re: [hpss-scc] KIT HPSS call - agenda for 2016-02-10) -> sent on 11.2.2016. 10.2.2016 Test File deletion from TrashCan not possible due to removed COS (104) that the file belongs to. log - Critical. 10.2.2016 hpss prod. - GHI policy backup and migration still not running. -> Email send an support/Telco. to look at (-> tobias/scott) (!) GHI:- OS still on old 6.4. but scott said, when we go productiv ddf-s nodes should be upgraded to RHEL 6.7. 
- for the hpss re-compilation and setup of ghi will be needed. s. howto! 4.2.2016 hpss prod. RHEL upgrade 6.4 to 6.7 Done. HPSS Upgrade 743p1-743p2 done. 8xddf-s-* GHI nodes upgrade to hpss-743p2 done. (Downtime for one week) -> ahmad, dorin (!) RDAC migration for Testsystem was tested by Tobias IBM and continued for testCore by Dorin. in Jan/16 3.2.2016 hpss prod. Migration from RDAC to Multipath for disk storage for all HPSS Server done. -> started in mid Jan/16. -> Dorin 17.12.2015 hpss prod. API logging on FrontEnds archive-sftp-01/-02 disabled. /var/hpss/tmp/hpss.api.resource.disabled - Ahmad adapted the changes as discribed by IBM/Scott (10.12) and copied both files to all nodes (ddf-s-001-008) - Started GHI again: ddf-s-005]# initctl start start_ghi_run # initctl start ghi_iom_hpss # /opt/hpss/bin/ghistartup -g 14.12.2015 hpss prod. GHI: /var/hpss/ghi/policy/hpss/backup_migration.policy /var/hpss/ghi/policy/hpss/migrate.policy RULE 'scratch_exclude' EXCLUDE WHERE path_name like '%/scratch/.ghi/%' Should be changed to: RULE 'scratch_exclude' EXCLUDE WHERE path_name like '%/scratch/%' 10.12.2105 hpss prod. GHI: Scott found out, not whole scratch dir was excluded: (s. email) 10.12.2015: hpss test: RDAC-Multipath: IBM/Tobias Elpelt Tested migration from Disk-Drivers-RDAC-Multipath. hpsstmvr shutdown by Ahmad, and restarted by tobias after work done. tobias->Email (how-to RDAC to dm-multipath migration 11.12.2015) 8.12.2015 hpss prod. After adding tape volumess -> Major erros, All Disks 100% full, PVL for PVRs -> Down. GHI stopped, Ticket to IBM Support. Uwe, Jae solved> DB2 user (hpss) was not existing. 8.12.2015 hpss prod. New Tape volumes imported and created into PVRs STK_01 and IBM_IU1_01. (TK005000-TK014900 and ?-TS009900) 7.12.2015: hpss prod. HPSS Downtime for one week for maintenance. OS/RHEL6.4-6.7 and HPSS 743p1->743p2 Upgrade. Email sent to users. 2.12.2015 hpss prod. ChangeCos for zero files issue closed. 
TODO: recreate zeot files in neew COS 1205 after deleting from old COS 1204. 25.11.2015: hpss prod. GHI: Scott checked and responded. See email and ghi-problems-scott.howto. Now GHI is OK! But still some migration errors. -> Scott. 24.11.2015: hpss prod. OK: Mover03 and drives markted repaired afer cancel jobs. -> Dorin 23.11.2105; prod. Suspect: Mover03 and suspect drives of the IBM Lib -> broken air condition at weekend! 19.11.2015: hpss prod. GHI: still showed error messages. Email sent to IBM Support -> Scott - /opt/hpss/bin/htar.ksh was not on any of GHI nodes. - scott upgraded GHI-HTAR to the latest version for 2.4,GHI-HTAR 5.0.0.1g, for bug fixes. - Scott: patch released for GHI 2.4.0.1. for bug fixes icluding a Sev 1 fix. Postponed for later. 19.11.2015 hpss prod. GHI Scott findings: 18.11.2015 hpss prod. GHI: Due to error messages email sent to IBM Support-> Scott repsponded and found out 19.11.2015: - delete old/create new Hier and COS 1102, 1103 via GUI for ghi policies. - on ddf-s-001 # mmchfs hpss -z Yes (see. 2.11.2015) # mmdsh chmod +x /opt/hpss/bin/ghi_iom - mmmount /hpss -a # chmod +x /var/hpss/ghi/etc/start_ghi_run - change log path in /etc/init/start_ghi_run.conf (>> /data/ghi_log/start_ghi_run.log) - initctl start start_ghi_run - had to start ghi_iom manually: (Update: I should have used# initctl start ghi_iom_hpss) # ddf-s-001: # mmdsh /opt/hpss/bin/ghi_iom hpss 8012 (see. ddf-s-005: /etc/init/ghi_iom_hpss.conf) - /opt/hpss/bin/ghistartup -g - on ddf-s-005 # chmod +x /var/hpss/ghi/etc/ghi_backup_migration.ksh 17.11.2015 hpss prod. GHI: 16.11.2015 Test. RHEL-OS Upgrade from 6.4 -> 6.7 on hpsstcr03 and hpsstmvrand recomplie rdac DCS3700 both Servers. 
Reason: Found out (9.11.15) that after the OS upgrade (2 weeks ago) from 2.6.32-358.el6.x86_64 to 2.6.32-358.23.2.el6.x86_64 the modules mppUpper.ko and mppVhba.ko were missing from /lib/modules/2.6.32-358.23.2.el6.x86_64/kernel/drivers/scsi/.
Solution (IBM DCS3700 redbook):
- cp mvr1:~/Software/rdac-LINUX-09.03.0C05.0652-source.tar.gz mvrt:~Software/DCS3700/
- untar
- make uninstall
- make clean (in case *.o are compiled)
- make
- make install
- Add to /boot/grub/menu.lst:
title Red Hat Enterprise Linux Server (2.6.32-358.23.2.el6.x86_64) with MPP
root (hd0,0)
kernel /vmlinuz-2.6.32-358.23.2.el6.x86_64 ro root=/dev/sda3 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet console=tty0 console=ttyS0,115200n81r
initrd /mpp-2.6.32-358.23.2.el6.x86_64.img
- reboot
5.11.2015 Test. After hpsstmvr reboot: got disk read error. Therefore disk-tmvr broken.
- Logs: CORE0068: Space usage in tablespace STORAGESEGTAPEABIX of database subsys1 has exceeded critical threshold of 90%; usage at 1
- CORE0069: Space usage in tablespace TAPESEGUNLINK of database subsys1 has exceeded warning threshold of 85%; usage at 89%
- Solution (Jae, IBM): Check both "global" and "subsystem" configurations. Make sure the "Metadata Space Monitor Interval" value is set to 0 seconds.
4.11.2015 Test. Core Server GUI broken. Status:
- P8-TestCOSes (500, 700) disabled (GUI/Subsystems/Configure).
- Data purged from DSC 1,2,3. MPS should be started to see the SCs under GUI/Monitor/Storage Classes active.
- Due to Core Server Major and Critical messages: /opt/hpss/bin/rc.hpss stop/start and then Purge/Start.
4.11.2015 hpss test:
4.11.2015 prod. GHI: old Hierarchy 1401 and COS 1401 (named "GHI Metadata") deleted and new ones with the same IDs created with new Tape SC 204 and 205 (DSC 31, TSC 204, 205). Restart: Core Server/PVL/MPS. (!) Before enabling the scripts and starting GHI you have to run (mmchfs hpss -z Yes).
2.11.2015 prod.
GHI: all daemons stopped. On ddf-s-005: disable scripts (1. chmod -x /opt/hpss/bin/ghi_iom) and (2. chmod -x /var/hpss/ghi/etc/start_ghi_run) and (mmchfs hpss -z No # to mount hpss without GHI started!!)
28.10.2015 prod. sftp logging set up on sftp-01/02. Logfile: /var/log/sftp
28.10.2015 prod. API logging activated via resource file as described by Scott (IBM). See email.
28.10.2015 test. Jonathan (IBM) put P8 Disk Storage OFF to solve PVL major status.
23.10.2015 prod. Purge started manually from the GUI for Disk SC 22, due to user errors and log errors (No space left on device). -> Scott advised purging since there were no more segments because of fragmentation. HPSS telco 21.10.2015.
20.10.2015 prod. changecos for log files in /log march-14.10.2015 from COSID 1102 to COSID 1205, and reinitialized log daemon and client daemon.
23.10.2015 prod. Disk SC 22 purge policy changed from "Space exceeds 90%" -> 70%; "Stop Purge" space left unchanged.
17.9.2015 prod. Cronjobs on core server cr2 for accounting and site statistics. /etc/cron.d/(hpssstat, hpssacct)
11.9.2015 prod. Cronjobs on core server cr1 for accounting and site statistics. /etc/cron.d/(hpssstat, hpssacct)
10.9.2015 prod. Trashcans activated at 16:03. GUI/Global/Trashcans Settings (5, 86400, 864000, 3600)
7.9.2015 hpss prod. IBM_IU1_01 was still in Major, reporting null mounts. Only PVR shutdown/start helped. -> OK
2.9.2015 hpss prod. IBM_IU1_01 PVR went Major. Cause: CN_IBM_N01 broken -> Martin Beizinger maintenance, done 4.9.
2.9.2015 prod. STK drive change: broken STK drive (id 202) connected to hpssmvr02 replaced with a new one and added to devices&drives. (old: mvr02:/dev/hpss/st.1b.1-T10000D-579004000755 -> new: mvr02:/dev/hpss/st.1b.1-T10000D-579004000661)
27.8.2015 prod. Class of Service changed from 1102 to 1205 for LogDaemon. GUI/Servers/Log Daemon/specific/Archive Class of Service
26.8.2015 prod. Max file size for 1205 changed to max 500GB, based on new calculations from HPSS Support.
-> Scott (email)
25.8.2015 prod. 8 PM end of changecos. But 33555 missing files still have COSID 1204. To be reported to HPSS Support.
24.8.2015 prod. GHI manager on ddf-s-005: disk 100% full due to /var/hpss/ghi/log/start_ghi_run.log (12GB). Temporary solution: copied file to ddf-s-005:/data/ghi_log_backup/ and emptied /var/hpss/ghi/log/start_ghi_run.log
20.8.2015 prod. 10 new tape volumes for STK_01_PVR added. 30 new tape volumes for IBM_IU1_01 added.
20.8.2015 prod. STK drives for PVR STK_01 went to suspect status during changecos for tape volume TK001500.
17.8.2015 prod. On sftp servers (sftp-01/02): unmounted fuse 1204 and mounted fuse with new COS 1205 but OLD PATH. /SFTP/KIT
17.8.2015 prod. 11 AM start of changecos from 1204 to 1205.
13.8.2015 prod. Max file size for COS 1204 and 1025 extended to 2TB. (OK came from Jos and HPSS Support/Scott)
Martin reported the drive. I checked the drives in the GUI (Mark Repaired) => green, OK
# Suspect drives after firmware update
11.8.2015 hpss prod. T10:IBM|03592E07 0000078DB20E
13.8.2015 hpss prod. T10:IBM|03592E07 0000078DB1EE
GHI shutdown -g before upgrade and startup afterwards on ddf-s-005 successful. (Ahmad)
After shutdown, applypolicy jobs kept running. -> Scott said not to worry; they will stop after a timeout.
GHI log files written during the applypolicy jobs filled the local disk. -> Scott said to disable applypolicy scripts since GHI is not used. (He didn't confirm.)
iom processes kept running. -> Scott: not a problem, they are started automatically. You can only restart with -i.
3.8.2015 hpss prod. Due to a GPFS-3.5-18 bug, upgrade to 3.5-25 on node ddf-s-006 done by (Ludmilla/Ursula).
29.7.2015 Test. Reformatting HPSS_T_Cache1 disk as preparation for the End2End Protection (T10PI). -> IBM Jonathan. Core Server and PVL restart.
9.4.2015 prod. GUI: Migration/Purge server restarted. Cause: drives mounted for days. PVL job cancel had no effect, PVL (Major).
9.4.2015 prod.
GUI Configure/Subsystems/Configure/Allow disabled for ID 1201, 1203. Error: "Disk migration Failure". (No tape behind)
8.4.2015 Support ticket submitted by IBM-DE Kay Jenke for broken disk. (01V4NJL,724 for the disk alert)
7.4.2015 Test. Drive failure: Enclosure 99, Drawer 2, Slot 9
Jan/Feb. 2015 hpss prod. HPSS upgrade to 7.4.3p1
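Note on the 'scratch_exclude' rule change logged in December 2015: the following is a minimal Python sketch, not part of GHI or GPFS, of why the old pattern did not exclude the whole scratch directory. It assumes that LIKE in the policy rule uses the usual SQL wildcards (% for any run of characters, _ for one character). The sample paths and the helper names like_to_regex, like and excluded are hypothetical and only serve the illustration.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (% = any run, _ = one char) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def like(path, pattern):
    """Return True if the whole path matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(path) is not None

def excluded(paths, pattern):
    """Return the paths that an EXCLUDE rule with this pattern would skip."""
    return [p for p in paths if like(p, pattern)]

# Patterns taken from the log entry.
OLD_RULE = "%/scratch/.ghi/%"
NEW_RULE = "%/scratch/%"

# Hypothetical sample paths, for illustration only.
SAMPLE_PATHS = [
    "/hpss/scratch/.ghi/tmpfile",
    "/hpss/scratch/user1/data.bin",
    "/hpss/projects/run42/out.dat",
]

if __name__ == "__main__":
    print("old rule excludes:", excluded(SAMPLE_PATHS, OLD_RULE))
    print("new rule excludes:", excluded(SAMPLE_PATHS, NEW_RULE))
```

Under these assumptions the old pattern only skips files below scratch/.ghi/, while the new pattern skips everything below scratch/, which matches what Scott reported.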