My Solaris Notes: 2016

Saturday, February 13, 2016

VCS Cheat Sheet

LLT and GAB

VCS uses two components, LLT and GAB, to share data over the private networks among systems.

These components provide the performance and reliability required by VCS.

LLT	LLT (Low Latency Transport) provides fast, kernel-to-kernel communications and monitors network connections. The system administrator configures LLT by creating a configuration file (llttab) that describes the systems in the cluster and the private network links among them. LLT runs at layer 2 of the network stack.
GAB	GAB (Group membership and Atomic Broadcast) provides the global message order required to maintain a synchronised state among the systems, and monitors disk communications such as those required by the VCS heartbeat utility. The system administrator configures the GAB driver by creating a configuration file (gabtab).

LLT and GAB files

/etc/llthosts	The file is a database, containing one entry per system, that links the LLT system ID with the host name. The file is identical on each server in the cluster.
/etc/llttab	The file contains information that is derived during installation and is used by the utility lltconfig.
/etc/gabtab	The file contains the information needed to configure the GAB driver. This file is used by the gabconfig utility.
/etc/VRTSvcs/conf/config/main.cf	The VCS configuration file. The file contains the information that defines the cluster and its systems.

gabdiskconf	-i Initialises the disk region; -s Start block; -S Signature
gabdiskhb (heartbeat disks)	-a Add a GAB disk heartbeat resource; -s Start block; -p Port; -S Signature
gabconfig	-c Configure the driver for use; -n Number of systems in the cluster
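
As a rough illustration of the two simplest files above, the following Python sketch renders an /etc/llthosts body and a one-line /etc/gabtab. The function names are hypothetical, numbering the node IDs from 0 and the /sbin/gabconfig path are assumptions, and the host names sun1 and sun2 are borrowed from the service group examples later in this sheet.

```python
def render_llthosts(hostnames):
    """Build /etc/llthosts text: one "<LLT system ID> <host name>" line per system."""
    # Assumption: IDs are assigned 0, 1, 2, ... in the order the hosts are given.
    return "".join(f"{node_id} {name}\n" for node_id, name in enumerate(hostnames))


def render_gabtab(num_systems, gabconfig="/sbin/gabconfig"):
    """Build /etc/gabtab text: a single gabconfig call seeded with the node count."""
    # Assumption: gabconfig lives in /sbin; adjust the path for your install.
    return f"{gabconfig} -c -n{num_systems}\n"


if __name__ == "__main__":
    print(render_llthosts(["sun1", "sun2"]), end="")
    print(render_gabtab(2), end="")
```

Because /etc/llthosts must be identical on every server, generating it once and copying it out is one way to avoid drift between nodes.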

LLT and GAB Commands

Verifying that links are active for LLT	lltstat -n
verbose output of the lltstat command	lltstat -nvv | more
open ports for LLT	lltstat -p
display the values of LLT configuration directives	lltstat -c
lists information about each configured LLT link	lltstat -l
List all MAC addresses in the cluster	lltconfig -a list
stop LLT	lltconfig -U
start LLT	lltconfig -c
verify that GAB is operating	gabconfig -a Note: port a indicates that GAB is communicating, port h indicates that VCS is started
stop GAB	gabconfig -U
start GAB	gabconfig -c -n <number of nodes>
override the seed values in the gabtab file	gabconfig -c -x

GAB Port Membership

List Membership	gabconfig -a
Unregister port f	/opt/VRTS/bin/fsclustadm cfsdeinit
Port	Function
a	GAB driver
b	I/O fencing (designed to guarantee data integrity)
d	ODM (Oracle Disk Manager)
f	CFS (Cluster File System)
h	VCS (VERITAS Cluster Server: high availability daemon)
o	VCSMM driver (kernel module needed for Oracle and VCS interface)
q	QuickLog daemon
v	CVM (Cluster Volume Manager)
w	vxconfigd (module for CVM)
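
To make the port list above easier to apply, here is a small Python sketch that reads saved "gabconfig -a" output and names the function behind each registered port. The line shape "Port <letter> gen <id> membership <nodes>" and the two sample lines are assumptions for illustration, not output captured from a real cluster, and the function name is hypothetical.

```python
import re

# Port letters and their functions, copied from the port list above.
PORT_FUNCTIONS = {
    "a": "GAB driver",
    "b": "I/O fencing",
    "d": "ODM (Oracle Disk Manager)",
    "f": "CFS (Cluster File System)",
    "h": "VCS (high availability daemon)",
    "o": "VCSMM driver",
    "q": "QuickLog daemon",
    "v": "CVM (Cluster Volume Manager)",
    "w": "vxconfigd (module for CVM)",
}

# Assumed shape of a membership line, e.g. "Port a gen a36e0003 membership 01".
PORT_LINE = re.compile(r"^Port\s+(\S)\s+gen\s+\S+\s+membership\s+(\S+)")


def registered_ports(gabconfig_output):
    """Map each port letter found in the output to its function name."""
    found = {}
    for line in gabconfig_output.splitlines():
        match = PORT_LINE.match(line.strip())
        if match:
            port = match.group(1)
            found[port] = PORT_FUNCTIONS.get(port, "unknown")
    return found


# Illustrative sample only; the generation numbers are made up.
SAMPLE = """GAB Port Memberships
===============================================================
Port a gen a36e0003 membership 01
Port h gen fd570002 membership 01
"""

if __name__ == "__main__":
    for port, function in registered_ports(SAMPLE).items():
        print(port, function)
```

With the sample above, ports a and h are reported, which matches the note that port a means GAB is communicating and port h means VCS is started.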

Cluster daemons

High Availability Daemon	had
Companion Daemon	hashadow
Resource Agent daemon	<resource>Agent
Web Console cluster management daemon	CmdServer

Cluster Log Files

Log Directory	/var/VRTSvcs/log
primary log file (engine log file)	/var/VRTSvcs/log/engine_A.log

Starting and Stopping the cluster

"-stale" instructs the engine to treat the local config as stale; "-force" instructs the engine to treat a stale config as a valid one	hastart [-stale|-force]
Bring the cluster into running mode from a stale state using the configuration file from a particular server	hasys -force <server_name>
Stop the cluster on the local server. Note: This will also bring any clustered resources offline.	hastop -local
Stop cluster on local server but evacuate (failover) the application/s to another node within the cluster	hastop -local -evacuate
Stop the cluster on all nodes but leave the clustered resources online.	hastop -all -force

Cluster Status

display cluster summary	hastatus -summary
continually monitor cluster	hastatus
verify the cluster is operating	hasys -display

Cluster Details

information about a cluster	haclus -display
value for a specific cluster attribute	haclus -value <attribute>
modify a cluster attribute	haclus -modify <attribute name> <new>
Enable LinkMonitoring	haclus -enable LinkMonitoring
Disable LinkMonitoring	haclus -disable LinkMonitoring

Users

add a user	hauser -add <username>
modify a user	hauser -update <username>
delete a user	hauser -delete <username>
display all users	hauser -display

System Operations

add a system to the cluster	hasys -add <sys>
delete a system from the cluster	hasys -delete <sys>
Modify a system's attributes	hasys -modify <sys> <modify options>
list a system state	hasys -state
Force a system to start	hasys -force
Display a system's attributes	hasys -display [-sys]
List all the systems in the cluster	hasys -list
Change the load attribute of a system	hasys -load <system> <value>
Display the value of a system's node ID (/etc/llthosts)	hasys -nodeid
Freeze a system (the system cannot be taken offline and no groups can be brought online)	hasys -freeze [-persistent] [-evacuate] Note: main.cf must be in write mode
Unfreeze a system (re-enable bringing groups and resources back online)	hasys -unfreeze [-persistent] Note: main.cf must be in write mode

Dynamic Configuration

The VCS configuration must be in read/write mode in order to make changes. While it is in read/write mode the configuration is considered stale, and a .stale file is created in $VCS_CONF/conf/config. When the configuration is put back into read-only mode, the .stale file is removed.

Change configuration to read/write mode	haconf -makerw
Change configuration to read-only mode	haconf -dump -makero
Check what mode the cluster is running in	haclus -display | grep -i 'readonly' (0 = write mode, 1 = read only mode)
Check the configuration file	hacf -verify /etc/VRTSvcs/conf/config Note: you can point to any directory as long as it has main.cf and types.cf
convert a main.cf file into cluster commands	hacf -cftocmd /etc/VRTSvcs/conf/config -dest /tmp
convert a command file into a main.cf file	hacf -cmdtocf /tmp -dest /etc/VRTSvcs/conf/config
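
Every change in the sections that follow is bracketed by the same two haconf calls, so it can help to build the bracket once. The Python sketch below only assembles command strings and does not run anything; the function names are hypothetical, and the 0/1 meanings are taken from the readonly row above.

```python
def in_write_session(commands):
    """Bracket configuration commands with the read/write and read-only switches."""
    return ["haconf -makerw", *commands, "haconf -dump -makero"]


def describe_mode(readonly_value):
    """Translate the ReadOnly value shown by haclus -display into words."""
    modes = {"0": "write mode", "1": "read only mode"}
    return modes[str(readonly_value).strip()]


if __name__ == "__main__":
    for command in in_write_session(["hagrp -add groupw"]):
        print(command)
    print(describe_mode("1"))
```

Keeping the closing "haconf -dump -makero" in the same list as the changes makes it harder to leave the configuration in its stale, writable state by accident.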

Service Groups

add a service group	haconf -makerw; hagrp -add groupw; hagrp -modify groupw SystemList sun1 1 sun2 2; hagrp -autoenable groupw -sys sun1; haconf -dump -makero
delete a service group	haconf -makerw; hagrp -delete groupw; haconf -dump -makero
change a service group	haconf -makerw; hagrp -modify groupw SystemList sun1 1 sun2 2 sun3 3; haconf -dump -makero Note: use "hagrp -display <group>" to list attributes
list the service groups	hagrp -list
list the groups dependencies	hagrp -dep <group>
list the parameters of a group	hagrp -display <group>
display a service group's resource	hagrp -resources <group>
display the current state of the service group	hagrp -state <group>
clear a faulted non-persistent resource in a specific group	hagrp -clear <group> [-sys <sys>]
Change the system list in a cluster:
# remove the host
hagrp -modify grp_zlnrssd SystemList -delete <hostname>
# add the new host (don't forget to state its position)
hagrp -modify grp_zlnrssd SystemList -add <hostname> 1
# update the autostart list
hagrp -modify grp_zlnrssd AutoStartList <host> <host>
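
The SystemList examples above pair each host with a number (sun1 1 sun2 2). A minimal Python sketch of building that argument from an ordered host list follows; it only formats strings, the function names are hypothetical, and numbering from 1 in list order is an assumption taken from those examples. The resulting commands still need to be run between "haconf -makerw" and "haconf -dump -makero".

```python
def system_list_args(hosts, first_number=1):
    """Return 'host1 1 host2 2 ...' in the order the hosts are given."""
    parts = []
    for offset, host in enumerate(hosts):
        parts += [host, str(first_number + offset)]
    return " ".join(parts)


def add_group_commands(group, hosts):
    """hagrp commands that create a group and set its SystemList."""
    return [
        f"hagrp -add {group}",
        f"hagrp -modify {group} SystemList {system_list_args(hosts)}",
        # Mirrors the example above, which auto-enables on the first host.
        f"hagrp -autoenable {group} -sys {hosts[0]}",
    ]


if __name__ == "__main__":
    for command in add_group_commands("groupw", ["sun1", "sun2"]):
        print(command)
```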

Service Group Operations

Start a service group and bring its resources online	hagrp -online <group> -sys <sys>
Stop a service group and take its resources offline	hagrp -offline <group> -sys <sys>
Switch a service group from one system to another	hagrp -switch <group> -to <sys>
Enable all the resources in a group	hagrp -enableresources <group>
Disable all the resources in a group	hagrp -disableresources <group>
Freeze a service group (disable onlining and offlining)	hagrp -freeze <group> [-persistent] Note: to check, run "hagrp -display <group> | grep TFrozen"
Unfreeze a service group (enable onlining and offlining)	hagrp -unfreeze <group> [-persistent] Note: to check, run "hagrp -display <group> | grep TFrozen"
Enable a service group. Only enabled groups can be brought online	haconf -makerw; hagrp -enable <group> [-sys]; haconf -dump -makero Note: to check, run "hagrp -display | grep Enabled"
Disable a service group. Stops the group from being brought online	haconf -makerw; hagrp -disable <group> [-sys]; haconf -dump -makero Note: to check, run "hagrp -display | grep Enabled"
Flush a service group and enable corrective action.	hagrp -flush <group> -sys <system>

Resources

add a resource	haconf -makerw; hares -add appDG DiskGroup groupw; hares -modify appDG Enabled 1; hares -modify appDG DiskGroup appdg; hares -modify appDG StartVolumes 0; haconf -dump -makero
delete a resource	haconf -makerw; hares -delete <resource>; haconf -dump -makero
change a resource	haconf -makerw; hares -modify appDG Enabled 1; haconf -dump -makero Note: list parameters with "hares -display <resource>"
make a resource attribute global (same value on all systems)	hares -global <resource> <attribute> <value>
make a resource attribute local (per-system value)	hares -local <resource> <attribute> <value>
list the parameters of a resource	hares -display <resource>
list the resources	hares -list
list the resource dependencies	hares -dep
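
The "add a resource" row above is one hares -add followed by one hares -modify per attribute. The Python sketch below expands an attribute mapping into that sequence; it is a string builder only, the function name is hypothetical, and the appDG values are the ones from the row above.

```python
def add_resource_commands(name, resource_type, group, attributes):
    """Return the hares commands that add a resource and set its attributes."""
    commands = [f"hares -add {name} {resource_type} {group}"]
    # Dicts keep insertion order, so attributes are emitted as listed.
    for attribute, value in attributes.items():
        commands.append(f"hares -modify {name} {attribute} {value}")
    return commands


if __name__ == "__main__":
    settings = {"Enabled": 1, "DiskGroup": "appdg", "StartVolumes": 0}
    for command in add_resource_commands("appDG", "DiskGroup", "groupw", settings):
        print(command)
```

As with service groups, these commands only take effect while the configuration is in read/write mode.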

Resource Operations

Online a resource	hares -online <resource> [-sys]
Offline a resource	hares -offline <resource> [-sys]
display the state of a resource (offline, online, etc.)	hares -state
display the parameters of a resource	hares -display <resource>
Offline a resource and propagate the command to its children	hares -offprop <resource> -sys <sys>
Cause a resource agent to immediately monitor the resource	hares -probe <resource> -sys <sys>
Clearing a resource (automatically initiates the onlining)	hares -clear <resource> [-sys]

Resource Types

Add a resource type	hatype -add <type>
Remove a resource type	hatype -delete <type>
List all resource types	hatype -list
Display a resource type	hatype -display <type>
List the resources of a particular type	hatype -resources <type>
Display the value of a particular resource type's attribute	hatype -value <type> <attr>

Resource Agents

add an agent	pkgadd -d . <agent package>
remove an agent	pkgrm <agent package>
change an agent	n/a
list all ha agents	haagent -list
Display an agent's run-time information, i.e. has it started, is it running?	haagent -display <agent_name>
Display agent faults	haagent -display | grep Faults

Resource Agent Operations

Start an agent	haagent -start <agent_name> [-sys]
Stop an agent	haagent -stop <agent_name> [-sys]

Friday, February 12, 2016

Workaround for "FATAL: system is not bootable, boot command is disabled" at the OBP prompt

{1} ok boot
FATAL: system is not bootable, boot command is disabled

Workaround:

{1} ok setenv auto-boot? false
auto-boot? =          false
{1} ok reset-all

SC Alert: Host System has Reset
 
Sun Fire V210, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.11.4, 4096 MB memory installed, Serial #xxxxxxxx.
Ethernet address 0:3:ba:xx:xx:xx, Host ID: 83xxxxxx.

{1} ok setenv auto-boot? true
auto-boot? =          true
{1} ok boot

How to Collect a Snapshot on SPARC M series servers Mx000 and M10-x systems - M3000/M4000/M5000/M8000/M9000

Running snapshot

The syntax varies slightly between the Mx000 and the M10-x. The M10 requires the "-a" option to collect logs
from all chassis.

The two most common usages are described below. The first example for each platform does not
require physical presence at the system and works across the network. The second uses a USB
stick, and requires someone on site to install and remove the USB stick for transporting the data
to another system for upload. The command requires platadm or fieldeng privileges.

Mx000 Example of collecting a snapshot

XSCF> snapshot -L F -t <username>@<hostname or IP addr>:<location to write to>

Mx000 Example of collecting a snapshot to an external USB stick:

XSCF> snapshot -L F -d usb0

M10-x Example of collecting a snapshot

XSCF> snapshot -a -L F -t <username>@<hostname or IP addr>:<location to write to>

M10-x Example of collecting a snapshot to an external USB stick:

XSCF> snapshot -a -L F -d usb0
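
The four variants above differ only in the "-a" flag and the destination, which the following Python sketch captures. It just formats the command line to type at the XSCF> prompt; the function name and the "mx000"/"m10" platform keys are hypothetical labels, not XSCF terms.

```python
def snapshot_command(platform, target=None, usb=False):
    """Build a snapshot command line for the XSCF> prompt.

    platform: "mx000" or "m10" (hypothetical keys used only by this sketch).
    target:   "<username>@<host>:<directory>" for a network transfer.
    usb:      True to write to the USB stick (usb0) instead.
    """
    if usb == (target is not None):
        raise ValueError("give either a network target or usb=True, not both")
    if platform not in ("mx000", "m10"):
        raise ValueError(f"unknown platform: {platform}")
    args = ["snapshot"]
    if platform == "m10":
        args.append("-a")  # the M10 needs -a to collect logs from all chassis
    args += ["-L", "F"]
    args += ["-d", "usb0"] if usb else ["-t", target]
    return " ".join(args)


if __name__ == "__main__":
    print(snapshot_command("mx000", target="root@10.3.2.121:/tmp"))
    print(snapshot_command("m10", usb=True))
```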

Bear in mind that the external media device connected to the XSCF's USB port is expected to have
a partition 1 formatted with the FAT32 filesystem. The external USB device can have multiple
partitions, as long as partition 1 is FAT32; that partition is the one the snapshot
command uses. For more details on the snapshot command, see its man page.

Example:

XSCF> snapshot -L F -t root@10.3.2.121:/tmp

Downloading Public Key from '10.3.2.121'...

Public Key Fingerprint: 98:c0:ba:95:4d:70:9a:dc:24:01:09:5f:94:43:07:c7

Accept this public key (yes/no)? yes

Enter ssh password for user 'root' on host '10.3.2.121':

Setting up ssh connection to root@10.3.2.121...

Collecting data into root@10.3.2.121:/tmp/servera-mgmt-lan0_10.3.2.150_2015-10-08T04-51-30.zip

Data collection complete

XSCF>
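
When several archives pile up in the same directory, the file name in the transcript above can be split back into its parts. This Python sketch assumes the name always follows the pattern seen in that single example, "<host>_<address>_<YYYY-MM-DDTHH-MM-SS>.zip"; that pattern is inferred from the one file name shown here and is not a documented format, and the function name is hypothetical.

```python
from datetime import datetime


def parse_snapshot_name(path):
    """Split a snapshot archive name into host, address and collection time."""
    name = path.rsplit("/", 1)[-1]
    if name.endswith(".zip"):
        name = name[: -len(".zip")]
    # Split from the right so underscores inside the host part are kept.
    host, address, stamp = name.rsplit("_", 2)
    return {
        "host": host,
        "address": address,
        "collected": datetime.strptime(stamp, "%Y-%m-%dT%H-%M-%S"),
    }


if __name__ == "__main__":
    info = parse_snapshot_name(
        "/tmp/servera-mgmt-lan0_10.3.2.150_2015-10-08T04-51-30.zip"
    )
    print(info["host"], info["address"], info["collected"].isoformat())
```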

Troubleshooting Steps

SSH: Could not resolve hostname

XSCF> snapshot -L F -t username@hostname:/home/username

Downloading Public Key from 'hostname'...

Error downloading key for host 'hostname'

- Program exited unexpectedly: /usr/bin/ssh

- Output: "ssh: Could not resolve hostname <hostname>: Temporary failure in name resolution"

Error with SSH settings

Resolution: Use the IP address instead of the host name.

===================================================

Unable to mount USB device

After inserting a USB memory stick into the maintenance port of an OPL XSCF, snapshot reported
that it is unable to mount the USB device.

Sun SPARC Enterprise M3000/M4000/M5000/M8000/M9000 (OPL) Servers: snapshot
"Unable to mount USB device".

The snapshot command expects the USB device to have a partition 1 with a FAT32 filesystem
(/dev/sda1). If your device does not have any partitions, the XSCF will see your USB memory stick
as /dev/sda. Without the partition, snapshot cannot mount the device correctly and
reports "Unable to mount USB device".

Resolution: To create a partition table on your USB stick, you can plug your USB memory stick
into a Windows Computer, which should re-partition and format the device with one large partition
as a FAT32 file system.

This issue is fixed in XCP 1050.

How to take an ILOM Snapshot - SPARC T series servers

From the CLI:

***

Log into the ILOM in an admin role and run the following from the ILOM shell:

-> set /SP/diag/snapshot dataset=normal

-> set /SP/diag/snapshot dump_uri=sftp://userid@<server IP>/tmp

-> cd /SP/diag/snapshot

-> show

The snapshot should be running at this point.

You can run another "show" every once in a while to check the progress.

Once it has stopped running, the snapshot should be available on the SFTP server given in dump_uri.

***
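
The CLI steps above can be written out as a short Python sketch that fills in the dump_uri from its parts. It only returns the lines to type at the ILOM "->" prompt; the function name is hypothetical, and the sftp scheme and "normal" dataset are simply the values used in the steps above.

```python
def ilom_snapshot_commands(user, server, directory, dataset="normal"):
    """Return the ILOM CLI lines for an SFTP snapshot, in the order shown above."""
    dump_uri = f"sftp://{user}@{server}{directory}"
    return [
        f"set /SP/diag/snapshot dataset={dataset}",
        f"set /SP/diag/snapshot dump_uri={dump_uri}",
        "cd /SP/diag/snapshot",
        "show",  # repeat this one to watch the progress
    ]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only placeholder address.
    for line in ilom_snapshot_commands("userid", "192.0.2.10", "/tmp"):
        print("->", line)
```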

Alternatively, you can collect the snapshot through the web UI:

***

If you have access to the WebUI, select "Maintenance", then "Snapshot".

Select "Data Set" "Normal".

Select "Transfer Method" "Browser"

XSCF commands/Cheat Sheet

Abbreviations:

XSCF : eXtended System Control Facility
IOU : I/O Unit

The IOU includes PCI slots.

PSB : Physical System Board

On an M4000 or M5000 server, a PSB includes at least 1 CPUM, 1 MEMB and 1 IOU.

An M4000 server includes 1 PSB (PSB#00) and an M5000 server includes 2 PSBs (PSB#00 and PSB#01).

XSB : Extended System Board

The PSB is configured either in Uni-XSB mode or Quad-XSB mode.

In uni-XSB mode, the XSB are named XSB#XX-0.

In quad-XSB mode, the XSB are named XSB#XX-0, XSB#XX-1, XSB#XX-2 and XSB#XX-3

XX represents the PSB number.
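
The naming rule above is mechanical enough to sketch in Python. The function name and the "uni"/"quad" keys are hypothetical, and the output uses the short "XX-N" form that appears in the addboard and setdcl examples later in this sheet.

```python
def xsb_names(psb_number, mode):
    """List the XSB names a PSB provides in uni-XSB or quad-XSB mode."""
    count = {"uni": 1, "quad": 4}[mode]
    # XX is the two-digit PSB number; the suffix runs from 0 upward.
    return [f"{psb_number:02d}-{index}" for index in range(count)]


if __name__ == "__main__":
    print(xsb_names(0, "uni"))
    print(xsb_names(1, "quad"))
```

So PSB#00 in uni-XSB mode yields a single XSB, 00-0, while PSB#01 in quad-XSB mode yields 01-0 through 01-3.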

LSB : Logical System Board

Before adding an XSB to a domain, it must be assigned an LSB number in the DCL.

DCL : Domain Component List

Each domain has its own DCL. Each DCL contains 16 LSBs.

DCU : Domain Configuration Unit
XCP : XSCF Control Package

Domain Administration:

List all domains on the frame:

XSCF> showdomainstatus -a

DID Domain Status

00 Running

01 Running

02 -

03 -
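
For scripted checks, the listing above can be turned into a mapping from domain ID to status. The Python sketch below parses text shaped like that listing, treating "-" as an unconfigured domain; that reading of "-" and the function name are assumptions, and real output may carry extra spacing or other status strings.

```python
# Sample text copied from the listing above.
SAMPLE = """DID Domain Status
00 Running
01 Running
02 -
03 -
"""


def parse_domain_status(text):
    """Return {domain_id: status or None} from showdomainstatus -a output."""
    status = {}
    for line in text.splitlines():
        fields = line.split(None, 1)
        # Skip the header and blank lines: data rows start with a numeric DID.
        if len(fields) == 2 and fields[0].isdigit():
            value = fields[1].strip()
            status[int(fields[0])] = None if value == "-" else value
    return status


if __name__ == "__main__":
    for domain_id, state in parse_domain_status(SAMPLE).items():
        print(domain_id, state)
```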

Connecting to the console:

XSCF> console -d [domain #]

-f force connection to writable console session

-r read only connection

-y answer yes to all prompts

Exiting the console:

Press the Enter key and then enter "#.".

To see who has the console:

showconsolepath -d [domain #]

To display logs:

showlogs -d [domain #] console -> console log

showlogs -M -d [domain #] console -> console log paged

showlogs -r -M -d [domain #] console -> console log paged in reverse order

showlogs -d [domain #] panic -> panic log

Poweron all domains

XSCF> poweron -a

Poweron only domain 0

XSCF> poweron -d 0

Poweroff all domains

XSCF> poweroff -a

Poweroff domain 0

XSCF> poweroff -d 0

Reboot XSCF

XSCF> rebootxscf

To send a break:

sendbreak -d [domain #]

If the sendbreak command does not work then you might need to turn off the domain's secure mode.

To check current setting, at the XSCF> prompt, type:

showdomainmode -d [domain #]

To change the security mode, at the XSCF> prompt, type:

setdomainmode -d [domain #] -m secure=off

To forcibly reset a hung domain:

reset -d [domain #] level

Where level is "por", "panic" or "xir":

– "por" to reset the domain

– "panic" to panic the server and generate a core

– "xir" to reset the domains' CPUs

To see all the commands available:

[tab] [tab]

Man pages:

man [command]

To exit XSCF

exit or ctrl-D

XSCF Firmware upgrade

Check current firmware level

XSCF> version -c xcp -v

Check version of staged firmware

XSCF> getflashimage -l

Download new firmware

XSCF> getflashimage -y -v http://IP:PORT/firmware/FFXCP1120.tar.gz

or

XSCF> getflashimage -y -v -u <user> ftp://IP:PORT/firmware/FFXCP1120.tar.gz

XSCF> getflashimage -l

Check whether the upgrade is possible

XSCF> flashupdate -c check -m xcp -s 1120

Update the firmware

XSCF> flashupdate -y -c update -m xcp -s 1120

Confirm that the XSCF firmware update has finished

XSCF> showlogs monitor

Verify new version of firmware

XSCF> version -c xcp -v

Reboot the domains: The domains should be rebooted soon after firmware upgrade is performed.
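
The upgrade steps above reuse one value, the XCP version (1120), which also appears in the image file name. As a rough sketch, the Python below pulls that number out of the file name and lists the XSCF commands in the order given above. The function names are hypothetical, and reading the version from the "XCP<digits>" part of the name is an assumption based on the single example FFXCP1120.tar.gz.

```python
import re


def xcp_version(image_name):
    """Extract the digits after 'XCP' from a firmware image name or URL."""
    match = re.search(r"XCP(\d+)", image_name)
    if not match:
        raise ValueError(f"no XCP version found in {image_name!r}")
    return match.group(1)


def upgrade_steps(image_url):
    """Return the XSCF commands for a firmware upgrade, in order."""
    version = xcp_version(image_url)
    return [
        "version -c xcp -v",                         # current level
        f"getflashimage -y -v {image_url}",          # download
        "getflashimage -l",                          # confirm staged image
        f"flashupdate -c check -m xcp -s {version}",  # can we upgrade?
        f"flashupdate -y -c update -m xcp -s {version}",
        "showlogs monitor",                          # confirm it finished
        "version -c xcp -v",                         # verify new level
    ]


if __name__ == "__main__":
    for step in upgrade_steps("http://IP:PORT/firmware/FFXCP1120.tar.gz"):
        print("XSCF>", step)
```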

User Administration

Creating a New user

XSCF> adduser -u admin

Delete a user

XSCF> deleteuser admin

Disable a user

XSCF> disableuser admin

Enable a user

XSCF> enableuser admin

Display user account information

XSCF> showuser -a

Set or change a User (admin) password

XSCF> password admin

Network

• Show the network configuration

XSCF> showssh

XSCF> showhostname -a

XSCF> shownetwork -a

XSCF> showroute -a

XSCF> showntp -a

XSCF> shownameserver

Domain Components FRU

• Display

XSCF> showfru -a sb

Device Location XSB Mode Memory Mirror Mode

sb 00 Uni no

sb 01 Uni no

XSCF> showfru sb 0

• Define

To Uni-XSB

XSCF> setupfru -x 1 sb 0

To Quad-XSB

XSCF> setupfru -x 4 sb 0

DCL

• Display

XSCF> showdcl -a -v

XSCF> showdcl -v -d 0

• Assign an LSB number to an XSB

XSCF> setdcl -d 0 -a 0=00-0

XSB 00-0 is assigned LSB number 0 in the DCL of domain 0.

• Remove

XSCF> setdcl -d 0 -r 00

Board

• Display all boards

XSCF> showboards -a -v

• Add a board to a domain

XSCF> addboard -d 0 -c assign 00-0

This adds XSB 00-0 to domain 0.

• Remove a board from a domain

XSCF> deleteboard -c unassign 00-0

Device Information

Display a summary of the hardware configuration

XSCF> showhardconf -u

Display page by page

XSCF> showhardconf -M

Display the devices attached to a domain (the OS must be running on the domain)

XSCF> showdevices -d 0

DSCP must be configured and the SMF services started, otherwise the command returns the following error message:

Can't get device information from DomainID 1.

Display the hardware with degraded status

XSCF> showstatus

MBU_B;

MEMB#0;

* MEM#0A; Status:Faulted;

Replace the FRU (FAN and PSU for M4000/M5000)

XSCF> replacefru