primary drive failing... how to rescue system without a backup
- Monday morning... email from research scientist... the electromagnetic analysis software license server is down...
- log into the system... ps -ef | grep license-server-process ... nothing...
- dmesg | more... i/o error..., try ls -al... command not found..., (not good)... init 6... command not found... (not good worse...)
- press power button... wait for the electrons to settle to the bottom of the capacitors... press power again
- system boots up... check license process... happiness... check dmesg (system log files)
- find an odd entry... search for the string... find it is an indication of a failing drive, learn of the new command below
- try the command: --> sudo /usr/sbin/smartctl -i /dev/sda and for /dev/sdb... details presented as shown below
- --> sudo /usr/sbin/smartctl -H /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
Log Sense failed, IE page [scsi response fails sanity test] (NOT GOOD... again)
- about this time commands stop working, i/o errors... been here, done that, pull the power cord, yank the drive, toss it in the freezer
- pull disk from freezer, find another drive of the same size, connect both to the SATA power and data
- insert Gparted drive (see http://johnmeister.com/linux/Notes/Gparted-for-Recovery/ALL.html)
- instead of using the GUI tool to rework partitions, open a terminal window, change colors (black text on yellow), type sudo su -
- type fdisk -l - note that the failing drive (or so I thought) was /dev/sda, new drive, /dev/sdb
- in the terminal window I type; dd if=/dev/sda of=/dev/sdb ; place a fan aimed at the drives with the side cover off...
- I watch the blinking light showing disk activity... email users, and let it run... overnight
- come in, it's done, execute fdisk /dev/sdb... minor error... type "w" to write, it prints... type fdisk -l, both drives read the same
- success, cloned the 500GB drive perfectly, file systems match, mount it on the magical "a" mount point: mkdir a ; mount /dev/sdb1 a
- cd a, ls -al... all good... about this time I notice that there's no swap... wait... the 750GB drive was the primary... uh oh...
- I look at the 750GB drive, look in cabinets, drawers, etc... no happiness... the failing drive was the 750, not the 500!!! (too rushed, or senior moment?)
- at this point I also realize that the workstation is obsolete, and this is the 3rd or 4th time I've rescued it...
- problem is this device provides samba, web and license support (tied to hardware!)... the OS is old... but it works... PLAN B
- I realize that the 750GB drive is failing, fading fast... so... the 750GB goes into the freezer, this time...
- I remove Gparted from the drive, unplug the old workstation and raid another device at my disposal that was wasted by using Microsoft on it
- I set it up, install Gparted, bring it up and check BIOS... then get into Gparted and NUKE the drive... it was encrypted.
- I create a new partition using ext3, apply it, eject Gparted and insert the SuSE 13.2 DVD, reboot to DVD
- install 13.2, configure with the failing system's IP, etc. THEN pull the failing 750GB drive out of the freezer, power down the system and connect it.
- I bring up the new system, do some basic config stuff so I can use the system... sudo su -
- make sure the system has the correct IP, gateway, network, netmask, hostname
- as luser, update /home/luser/.ssh and .bashrc and .History... then sudo su -
- cd to the primary user, mkdir OLD-SYSTEM-FILES and TMP... mount the drive: mount /dev/sdb1 /home/luser/TMP
- then begin to look at system files in TMP and cp -rp directories such as etc, var/log, home/luser, srv/www, and so on to OLD-SYSTEM-FILES
- all the while watching the condition of the 750GB failing drive... making sure the /etc gets copied with the primary user...
- it's clear a few files are corrupted... ls -al shows up ??? in a few places...
- once the key original /etc files are present start backing up the "default" files, mv /etc/apache2 /etc/apache2-original-2015-06-16
- cp -rp /home/luser/OLD-SYSTEM-FILES/etc/apache2 /etc (as superuser of course)... verify that the config files are correct.
- try to start apache2... /etc/init.d/apache2 configcheck ; /etc/init.d/apache2 start ; ps -ef | grep http
- repeat process for NFS/Samba and other tools...
- verify that samba and the web server are working... if something starts going sideways, check the gateway again...
- after validating that the web server works, cgi-bin scripts work and that the system is on the network, email users, and head home and type notes.
- to bring up the license server from the failing 750GB drive, the plan is to copy directly from the 750GB disk to a 500GB disk, excluding luser files.
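Looking back, a small pre-flight check before dd would have helped: printing both sizes makes it hard to mistake the 500GB drive for the 750GB one, and a raw clone from a bigger disk to a smaller one gets refused (which is why the plan for the license server is a file copy). What follows is only a sketch, assuming GNU coreutils and util-linux. The names dev_size and safe_clone are made up for illustration, and conv=noerror,sync is a swapped-in dd option (not what was typed that night) that keeps reading past bad sectors and pads the unreadable blocks with zeros.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original session.

# Print the size in bytes of a block device or a regular file.
dev_size() {
    if [ -b "$1" ]; then
        blockdev --getsize64 "$1"      # util-linux; needs root for real disks
    else
        wc -c < "$1" | tr -d ' '       # plain files, handy for a dry run
    fi
}

# Clone SRC to DST only if DST is at least as large as SRC.
safe_clone() {
    local src=$1 dst=$2 src_size dst_size
    src_size=$(dev_size "$src") || return 1
    dst_size=$(dev_size "$dst") || return 1
    echo "source: $src ($src_size bytes)"
    echo "target: $dst ($dst_size bytes)"
    if [ "$dst_size" -lt "$src_size" ]; then
        echo "refusing: target is smaller than source" >&2
        return 1
    fi
    # noerror: keep going after a read error
    # sync:    pad short or failed reads with zeros so offsets stay aligned
    # notrunc: do not truncate the target (only matters for image files)
    dd if="$src" of="$dst" bs=64K conv=noerror,sync,notrunc
}
```

As root it would be called as safe_clone /dev/sda /dev/sdb, after reading the two printed sizes against what fdisk -l and smartctl -i report for each drive.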
smartctl info
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 90 to 87
http://www.linuxjournal.com/content/know-when-your-drives-are-failing-smartd
--> /usr/sbin/smartctl -i /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
Smartctl open device: /dev/sda failed: Permission denied
------------------------------------------------
--> sudo /usr/sbin/smartctl -i /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Blue (SATA)
Device Model: WDC WD7500AAKS-00RBA0
Serial Number: WD-WCAPT0448703
LU WWN Device Id: 5 0014ee 255de83fa
Firmware Version: 30.04G30
User Capacity: 750,156,374,016 bytes [750 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA/ATAPI-7 (minor revision not indicated)
Local Time is: Mon Jun 15 13:52:23 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
------------------------------------------------
--> sudo /usr/sbin/smartctl -i /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.9
Device Model: ST3500641AS
Serial Number: 3PM1VB7X
Firmware Version: 3.ADG
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA/ATAPI-7 (minor revision not indicated)
Local Time is: Mon Jun 15 13:53:10 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
------------------------------------------------
--> sudo /usr/sbin/smartctl -H /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
------------------------------------------------
--> sudo /usr/sbin/smartctl -H /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
--> sudo smartctl -t short /dev/sda
sudo: smartctl: command not found
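The "command not found" here most likely means /usr/sbin was not on the PATH that sudo searched, which is why every other call in these notes spells out /usr/sbin/smartctl. A minimal sketch of a lookup helper follows; find_tool is a made-up name, and the default directory list is an assumption about where admin tools usually live.

```shell
#!/bin/bash
# Hypothetical helper: print the full path of a tool, looking in PATH first
# and then in the usual sbin directories (or in directories passed as extra
# arguments).
find_tool() {
    local name=$1 dir
    shift
    if command -v "$name" >/dev/null 2>&1; then
        command -v "$name"
        return 0
    fi
    # No directories given: fall back to the common admin locations.
    [ "$#" -gt 0 ] || set -- /usr/sbin /sbin /usr/local/sbin
    for dir in "$@"; do
        if [ -x "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    echo "$name not found" >&2
    return 1
}
```

Usage would be along the lines of: sudo "$(find_tool smartctl)" -t short /dev/sda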
------------------------------------------------
luser@linuxBox [/home/luser/CONFIG]
------------------------------------------------
--> sudo /usr/sbin/smartctl -t short /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Jun 15 14:00:49 2015
Use smartctl -X to abort test.
------------------------------------------------
luser@linuxBox [/home/luser/CONFIG]
------------------------------------------------
--> sudo /usr/sbin/smartctl -t short /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Jun 15 14:01:02 2015
Use smartctl -X to abort test.
------------------------------------------------
--> sudo /usr/sbin/smartctl -H /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
Log Sense failed, IE page [scsi response fails sanity test]
------------------------------------------------
--> sudo /usr/sbin/smartctl -H /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
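The two -H results above can be told apart by a script as well as by eye. A minimal sketch, assuming the output strings look like the ones pasted in these notes; health_verdict is a made-up name. Note that PASSED is not a clean bill of health: /dev/sda reported PASSED a few minutes before it answered with "Log Sense failed".

```shell
#!/bin/bash
# Hypothetical helper: read "smartctl -H" output on stdin and classify it.
health_verdict() {
    local out
    out=$(cat)
    if printf '%s\n' "$out" | grep -q 'self-assessment test result: PASSED'; then
        echo OK
    elif printf '%s\n' "$out" | grep -q 'Log Sense failed'; then
        echo SUSPECT    # drive answered with garbage, as /dev/sda did above
        return 1
    else
        echo UNKNOWN    # no recognizable verdict in the output
        return 1
    fi
}
```

Usage would be along the lines of: sudo /usr/sbin/smartctl -H /dev/sda | health_verdict || echo "copy the data off NOW"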
------------------------------------------------
--> sudo /usr/sbin/smartctl -t long /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
Extended Background Self Test has begun
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
Use smartctl -X to abort test
-------------------------------------------------------------------------
------------------------------------------------
--> sudo /usr/sbin/smartctl --help
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
Usage: smartctl [options] device
============================================ SHOW INFORMATION OPTIONS =====
-h, --help, --usage
Display this help and exit
-V, --version, --copyright, --license
Print license, copyright, and version information and exit
-i, --info
Show identity information for device
--identify[=[w][nvb]]
Show words and bits from IDENTIFY DEVICE data (ATA)
-g NAME, --get=NAME
Get device setting: all, aam, apm, lookahead, security, wcache, rcache, wcreorder
-a, --all
Show all SMART information for device
-x, --xall
Show all information for device
--scan
Scan for devices
--scan-open
Scan for devices and try to open each device
================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====
-q TYPE, --quietmode=TYPE (ATA)
Set smartctl quiet mode to one of: errorsonly, silent, noserial
-d TYPE, --device=TYPE
Specify device type to one of: ata, scsi, sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,p][,x][,N], usbsunplus, marvell, areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, cciss,N, auto, test
-T TYPE, --tolerance=TYPE (ATA)
Tolerance: normal, conservative, permissive, verypermissive
-b TYPE, --badsum=TYPE (ATA)
Set action on bad checksum to one of: warn, exit, ignore
-r TYPE, --report=TYPE
Report transactions (see man page)
-n MODE, --nocheck=MODE (ATA)
No check if: never, sleep, standby, idle (see man page)
============================== DEVICE FEATURE ENABLE/DISABLE COMMANDS =====
-s VALUE, --smart=VALUE
Enable/disable SMART on device (on/off)
-o VALUE, --offlineauto=VALUE (ATA)
Enable/disable automatic offline testing on device (on/off)
-S VALUE, --saveauto=VALUE (ATA)
Enable/disable Attribute autosave on device (on/off)
-s NAME[,VALUE], --set=NAME[,VALUE]
Enable/disable/change device setting: aam,[N|off], apm,[N|off],
lookahead,[on|off], security-freeze, standby,[N|off|now],
wcache,[on|off], rcache,[on|off], wcreorder,[on|off]
======================================= READ AND DISPLAY DATA OPTIONS =====
-H, --health
Show device SMART health status
-c, --capabilities (ATA)
Show device SMART capabilities
-A, --attributes
Show device SMART vendor-specific Attributes and values
-f FORMAT, --format=FORMAT (ATA)
Set output format for attributes: old, brief, hex[,id|val]
-l TYPE, --log=TYPE
Show device log. TYPE: error, selftest, selective, directory[,g|s],
xerror[,N][,error], xselftest[,N][,selftest],
background, sasphy[,reset], sataphy[,reset],
scttemp[sts,hist], scttempint,N[,p],
scterc[,N,M], devstat[,N], ssd,
gplog,N[,RANGE], smartlog,N[,RANGE]
-v N,OPTION , --vendorattribute=N,OPTION (ATA)
Set display OPTION for vendor Attribute N (see man page)
-F TYPE, --firmwarebug=TYPE (ATA)
Use firmware bug workaround:
none, nologdir, samsung, samsung2, samsung3, xerrorlba, swapid
-P TYPE, --presets=TYPE (ATA)
Drive-specific presets: use, ignore, show, showall
-B [+]FILE, --drivedb=[+]FILE (ATA)
Read and replace [add] drive database from FILE
[default is +/etc/smart_drivedb.h
and then /usr/share/smartmontools/drivedb.h]
============================================ DEVICE SELF-TEST OPTIONS =====
-t TEST, --test=TEST
Run test. TEST: offline, short, long, conveyance, force, vendor,N,
select,M-N, pending,N, afterselect,[on|off]
-C, --captive
Do test in captive mode (along with -t)
-X, --abort
Abort any non-captive test on device
=================================================== SMARTCTL EXAMPLES =====
smartctl --all /dev/hda (Prints all SMART information)
smartctl --smart=on --offlineauto=on --saveauto=on /dev/hda
(Enables SMART on first disk)
smartctl --test=long /dev/hda (Executes extended disk self-test)
smartctl --attributes --log=selftest --quietmode=errorsonly /dev/hda
(Prints Self-Test & Attribute errors)
smartctl --all --device=3ware,2 /dev/sda
smartctl --all --device=3ware,2 /dev/twe0
smartctl --all --device=3ware,2 /dev/twa0
smartctl --all --device=3ware,2 /dev/twl0
(Prints all SMART info for 3rd ATA disk on 3ware RAID controller)
smartctl --all --device=hpt,1/1/3 /dev/sda
(Prints all SMART info for the SATA disk attached to the 3rd PMPort
of the 1st channel on the 1st HighPoint RAID controller)
smartctl --all --device=areca,3/1 /dev/sg2
(Prints all SMART info for 3rd ATA disk of the 1st enclosure
on Areca RAID controller)
----------------------
OOPS…
------------------------------------------------
--> ps -ef | grep smart
-bash: /usr/bin/ps: Input/output error
THE SYSTEM WAS FAILING DURING THE TEST!!!
note: when cutting and pasting the info below, extra blank lines were added for some reason; to get rid of them the following steps were taken in vi:
- >esc< :set nu
- >esc< :13,542g/^$/d (266 fewer lines)
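The same cleanup can be done outside the editor with sed. A minimal sketch: strip_blank_range is a made-up name, and the file names in the usage line are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: delete empty lines between two line numbers
# (inclusive) and print the result; the input file is left untouched.
strip_blank_range() {
    local first=$1 last=$2 file=$3
    sed "${first},${last}{/^\$/d;}" "$file"
}
```

Usage would be along the lines of: strip_blank_range 13 542 notes.txt > notes-clean.txt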
Statistics Based on 49,056 Hard Drives