TECH ALERT: Images written as part of a Storage Lifecycle Policy (SLP) may be prematurely marked as "lifecycle complete" prior to scheduled duplication due to a problem allocating resources from the operating system. This may lead to data loss when these images expire, having not been duplicated.
Details:
Introduction:
Due to an issue in nbstserv
which can result in requests for more open file descriptors than can be
allocated, images written as part of a Storage Lifecycle Policy (SLP) may be
erroneously marked "Lifecycle Complete" before they are duplicated. This
can lead to the only copies of an image expiring prematurely, resulting in data
loss when the duplicated images are unexpectedly unavailable... due to the fact
that they were never created.
This condition has only been reported on
Solaris master servers. Although it may be seen on any supported operating
system, the underlying cause - problems attempting to access a file descriptor
within a certain range (in the case of Solaris, values from 0-255) due to use of
all values in that range - is uniquely constrained under Solaris so that it is
more likely to be encountered there than upon systems using a different
operating system, where the range of open file descriptors in use can rise well
into the thousands before this issue would theoretically be seen.
This
condition occurs when nbstserv requests an open file descriptor when none
is available due to a file descriptor leak, resulting in an erroneous "defined
storage service not found" report. When this happens prior to an SLP
duplication, the original image is marked as complete, even though
duplication did not take place. The image is then reassigned with the
retention level as configured in the SLP backup setting - often a much shorter
interval than the one assigned to the duplicated image - and expire as scheduled
based upon that retention level.
What is Affected:
This
condition is rare, but all operating systems are potentially affected
running the following NetBackup versions and utilizing Storage Lifecycle
Policies:
Note: This issue has only
been reported on Solaris systems. It is expected that this issue would be
extremely rare on other operating systems or architectures, if it could occur at
all.
The following symptoms make this issue more likely to
occur:
- Solaris
master
- Many
Storage Lifecycle Policies configured and running
- nbstserv
process running for a very long period of time
- Regularly
activating and deactivating SLP may contribute to a greater use of open file
descriptors
- Displaying
many SLPs in a short time using the GUI may also exacerbate this
issue
How to Determine if
Affected:
Examine the nbstserv log (OID 226) for log entries
containing the phrase "defined storage service not
found":
# cd
/usr/openv/logs
# grep -i defined
51216-226* | grep "storage service not found"
If results are
returned, the log should be further scrutinized, especially if a reject
image message is seen.
In the following example, this error is
recorded and the image is marked
complete:
03/18/09 07:16:15.347 226 PID:19187
TID:5 [No context]
[ReplicationSession::SelectImagesForProcessing] 5: ReplicationSession::SelectImagesForProcessing:
####
03/18/09 07:16:15.347 226
PID:19187 TID:5 [No context]
[ReplicationSession::SelectImagesForProcessing] 5: ReplicationSession::SelectImagesForProcessing:
######## new image:
app1_1237360465
03/18/09 07:16:15.347
226 PID:19187 TID:5 [No context]
[ReplicationSession::SelectImagesForProcessing] 5: ReplicationSession::SelectImagesForProcessing:
####
03/18/09 07:16:15.347 226
PID:19187 TID:5 [No context] [ReplicationSession::getStorageServiceRec] +++
ENTERING +++ : obj = 31d9c70
(ReplicationSession.cpp:2032)
03/18/09
07:16:15.348 226 PID:19187 TID:5 [No context]
[ReplicationSession::getStorageServiceRec] --- EXITING --- : obj = 31d9c70
(ReplicationSession.cpp:2032)
03/18/09
07:16:15.348 226 PID:19187 TID:5 [No context]
[ReplicationSession::SelectImagesForProcessing] reject image: app1_1237360465
defined storage service not found, Storage
service=SLP-PROD-1-MONTH(ReplicationSession.cpp:1199)
03/18/09
07:16:15.348 226 PID:19187 TID:5 [No context]
[ImageInProcess::processLifecycleComplete] backupid= app1_1237360465 storage
service=
SLP-PROD-1-MONTH(ImageInProcess.cpp:841)
03/18/09
07:16:15.348 226 PID:19187 TID:5 [No context]
[ImageInProcess::findMaxFixedRetention] +++ ENTERING +++ : obj = 744630
(ImageInProcess.cpp:723)
03/18/09
07:16:15.348 226 PID:19187 TID:5 [No context]
[ImageInProcess::findMaxFixedRetention] --- EXITING --- : obj = 744630
(ImageInProcess.cpp:723)
03/18/09
07:16:15.348 226 PID:19187 TID:5 [No context]
[ImageInProcess::processLifecycleComplete] set fixed expiration for
app1_1237360465 copy number 1, Current Expiration time=2147483647, Current
Try_to_keep=1237619665, New Expiration time =
1237619665(ImageInProcess.cpp:892)
03/18/09
07:16:15.489 226 PID:19187 TID:5 [No context] [iipCopy::setExpireTime] +++
ENTERING +++ : obj = b4dbfe0
(ImageInProcess.cpp:216)
03/18/09
07:16:15.489 137 PID:19187 TID:5 [No context] [emmlib_UpdateImageCopyExpireTime]
(1) Updating Image copy records from EMM
database
MasterServerName
= < master1
>
BackupID
= < app1_1237360465
>
CopyNumber
= < 1
>
ExpireTime
= < 1237619665 >
03/18/09
07:16:15.514 137 PID:19187 TID:5 [No context] [emmlib_UpdateImageCopyExpireTime]
(1) updateImageCopy successful, emmError = 0, nbError =
0
03/18/09 07:16:15.514 226 PID:19187
TID:5 [No context] [iipCopy::setExpireTime] --- EXITING --- : obj = b4dbfe0
(ImageInProcess.cpp:216)
...
03/18/09
07:16:15.606 226 PID:19187 TID:5 [No context]
[catalogImgRec::retrieveCatalogImgRecord] +++ ENTERING +++ : obj = 744634
(ImageInProcess.cpp:2264)
03/18/09
07:16:15.607 226 PID:19187 TID:5 [No context]
[catalogImgRec::retrieveCatalogImgRecord] 5: catalogImgRec::retrieveCatalogImgRecord:
pfi type = false
03/18/09
07:16:15.607 226 PID:19187 TID:5 [No context]
[catalogImgRec::retrieveCatalogImgRecord] 5: catalogImgRec::retrieveCatalogImgRecord:
returning catalog record for image: app1_1237360465, primary copy: 1, number of
copies: 1, PFI type: false
03/18/09
07:16:15.608 226 PID:19187 TID:5 [No context]
[catalogImgRec::retrieveCatalogImgRecord] --- EXITING --- : obj = 744634
(ImageInProcess.cpp:2264)
03/18/09
07:16:15.608 226 PID:19187 TID:5 [No context] [catalogImgRec::setComplete] +++
ENTERING +++ : obj = 744634
(ImageInProcess.cpp:2179)
03/18/09
07:16:15.609 226 PID:19187 TID:5 [No context] [catalogImgRec::setComplete]
catalog image app1_1237360465 set to completed: 3 status:
0(ImageInProcess.cpp:2211)
03/18/09 07:16:15.609 226 PID:19187
TID:5 [No context] [catalogImgRec::setComplete] --- EXITING --- : obj = 744634
(ImageInProcess.cpp:2179)
03/18/09
07:16:15.609 226 PID:19187 TID:5 [No context] [catalogImgRec::setComplete]
catalog image app1_1237360465 set to completed: 3 status:
0(ImageInProcess.cpp:2211)
03/18/09
07:16:15.609 226 PID:19187 TID:5 [No context] [catalogImgRec::setComplete] ---
EXITING --- : obj = 744634
(ImageInProcess.cpp:2179)
03/18/09
07:16:15.609 226 PID:19187 TID:5 [No context]
[ImageInProcess::processLifecycleComplete] setting image: app1_1237360465 as
lifecycle complete in EMM
database(ImageInProcess.cpp:1071)
03/18/09
07:16:15.609 226 PID:19187 TID:5 [No context] [ImageInProcess::setImageState]
+++ ENTERING +++ : obj = 744630
(ImageInProcess.cpp:1300)
03/18/09
07:16:15.610 137 PID:19187 TID:5 [No context]
[emmlib_UpdateImageStorageServiceState] (1) Updating Image records from EMM
database
MasterServerName
= < master1
>
BackupID
= < app1_1237360465
>
StorageServiceState
= < 3 >
03/18/09 07:16:15.632
137 PID:19187 TID:5 [No context] [emmlib_UpdateImageStorageServiceState] (1)
updateImage failed, emmError = 2007031, nbError =
0
03/18/09 07:16:15.633 226 PID:19187
TID:5 [No context] [ImageInProcess::setImageState] --- EXITING --- : obj =
744630
(ImageInProcess.cpp:1300)
03/18/09
07:16:15.633 226 PID:19187 TID:5 [No context]
[ImageInProcess::processLifecycleComplete] image: app1_1237360465 - failed to
set as lifecycle complete in EMM
database!(ImageInProcess.cpp:1075)
From the bpdbm log
on the master:
07:14:45.916 [506] <4>
emmlib_UpdateImage: (0)
MasterServerName
= < master1 > < TRUE
>
BackupID
= < app1_1237360465 > < TRUE
>
ClientName
= < NULL > < FALSE
>
BackupTime
= < 0 > < FALSE
>
PolicyName
= < NULL > < FALSE
>
ClientType
= < 0 > < FALSE
>
ScheduleType
= < 0 > < FALSE
>
StorageServiceName
= < SLP-PROD-1-MONTH > < TRUE
>
StorageServiceState
= < 1 > < TRUE
>
TimeInProcess
= < 0 > < FALSE
>
ClassificationID
= < NULL > < FALSE
>
07:14:45.939 [506] <4>
emmlib_UpdateImage: (0) updateImage successful, emmError = 0, nbError =
0
07:14:45.939 [506] <4>
db_EMMSetStorageState: emmlib_UpdateImage successfully updated
image(0)
07:14:45.940 [506] <4>
DeleteConnectionID: (-) Cleared Connection ID
0
07:14:45.941 [506] <4>
uninitialize: (0) Destroying the
orb
07:14:45.946 [506] <4>
emmlib_uninitialize: (0) Successfully released EMM session and database
objects
07:14:45.946 [506] <2>
bpdbm: request complete: exit status 0
bpdbm
transaction once nbstserv sends request to
update:
07:16:15.608 [921] <2>
get_image_by_bid: likely found backupid in
/usr/openv/netbackup/db/images/app1/1237000000/UNIX-PROD_1237360465_INCR
07:16:15.609
[921] <2> get_image_by_bid: found
app1_1237360465
07:16:15.609 [921]
<2> write_imagerec:
/usr/openv/netbackup/db/images/app1/1237000000/UNIX-PROD_1237360465_INCR
07:16:15.609
[921] <2> write_imagerec: fragnum: 1, expir: 1237619665, vers: 8 copy:
1
07:16:15.609 [921]
<2> write_imagerec: NULL
prev_backup_image
07:16:15.609 [921]
<2> read_next_image: calling db_IMAGEreceive() for next
image
07:16:15.634 [919] <2>
image_by_backupid: trying app1 for
app1_1237360463
07:16:15.635 [919]
<2> image_by_backupid: likely found backupid in
/usr/openv/netbackup/db/images/app1/1237000000/UNIX-PROD_1237360463_INCR
07:16:15.635
[919] <2> image_by_backupid: found
app1_1237360463
07:16:15.635 [919]
<2> image_by_backupid: calling db_IMAGEreceive() for next
image
07:16:15.637 [921] <2>
update_and_chgexp: Q_IMAGE_UPDATE
Once that message is
recorded, examine the corresponding log_<ctime> file located in
/usr/openv/netbackup/db/error. In this following example,
log_1239826000 is examined:
1239826258 1
33412 4 slpmedia 756 753 0 slpmedia bptm begin writing backup id
slpmedia_1239826251, copy 1, fragment 1, destination path
/opt/fb1
1239826258 1 4 4
slpmedia 756 753 0 slpmedia bptm successfully wrote backup id
slpmedia_1239826251, copy 1, fragment 1, 800 Kbytes at 9638.554
Kbytes/sec
1239826361
1 2 4 slpmedia 0 0 0 *NULL* bpdbm successfully modified backupid
slpmedia_1239826251
In the above excerpt, it can be seen
that the backup image in question was modified a mere 103 seconds after it was
written, due to being set "SLP complete."
Detection in the
nbstserv log file can be hampered if nbstserv has stopped
logging. This can occur when the log normally rolls over at midnight if
nbstserv attempts to open a file descriptor when it becomes time to roll
over the log, but cannot due to this issue.
On Solaris systems,
pfiles can be used to monitor file descriptors in use. Two examples of
pfiles output follow.
This is an example of output from a system
where this issue may be seen:
# pgrep
nbstserv
1102
#
pfiles 1102
1102:
/usr/openv/netbackup/bin/nbstserv
Current
rlimit: 8192 file descriptors
0: S_IFCHR mode:0666 dev:351,0 ino:6815752 uid:0 gid:3
rdev:13,2
O_RDONLY
/devices/pseudo/mm@0:null
1: S_IFCHR mode:0666 dev:351,0 ino:6815752 uid:0 gid:3
rdev:13,2
O_WRONLY|O_CREAT|O_TRUNC
/devices/pseudo/mm@0:null
2: S_IFCHR mode:0666 dev:351,0 ino:6815752 uid:0 gid:3
rdev:13,2
O_WRONLY|O_CREAT|O_TRUNC
/devices/pseudo/mm@0:null
3: S_IFREG mode:0664 dev:32,14 ino:2067521 uid:0 gid:0
size:30197009
O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
/opt/openv653/logs/51216-226-2216927724-090428-0000000000.log
4: S_IFDOOR mode:0444 dev:360,0 ino:57 uid:0 gid:0
size:0
O_RDONLY|O_LARGEFILE
FD_CLOEXEC door to
nscd[180]
/var/run/name_service_door
5: S_IFREG mode:0664 dev:32,14 ino:2067521 uid:0 gid:0
size:30197009
O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
/opt/openv653/logs/51216-226-2216927724-090428-0000000000.log
.
. [uninterrupted sequence
of file descriptors from 6 to 248]
.
249: S_IFSOCK mode:0666 dev:357,0
ino:1883 uid:0 gid:0
size:0
O_RDWR
SOCK_STREAM
SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
sockname:
AF_INET <IP Address> port:
35837
peername:
AF_INET <IP Address> port:
13724
250: S_IFSOCK mode:0666
dev:357,0 ino:1882 uid:0 gid:0
size:0
O_RDWR
SOCK_STREAM
SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
sockname:
AF_INET <IP Address> port:
35843
peername:
AF_INET <IP Address> port:
13724
251: S_IFSOCK mode:0666
dev:357,0 ino:1881 uid:0 gid:0
size:0
O_RDWR
SOCK_STREAM
SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
sockname:
AF_INET <IP Address> port:
35849
peername:
AF_INET <IP Address> port:
13724
252: S_IFSOCK mode:0666
dev:357,0 ino:1880 uid:0 gid:0
size:0
O_RDWR
SOCK_STREAM
SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
sockname:
AF_INET <IP Address> port:
35855
peername:
AF_INET <IP Address> port:
13724
253: S_IFSOCK mode:0666
dev:357,0 ino:1879 uid:0 gid:0
size:0
O_RDWR
SOCK_STREAM
SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
sockname:
AF_INET <IP Address> port:
35861
peername:
AF_INET <IP Address> port:
13724
254: S_IFSOCK mode:0666
dev:357,0 ino:1878 uid:0 gid:0
size:0
O_RDWR
SOCK_STREAM
SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
sockname:
AF_INET <IP Address> port:
35867
peername:
AF_INET <IP Address> port:
13724
255: S_IFSOCK mode:0666
dev:357,0 ino:1877 uid:0 gid:0
size:0
O_RDWR
SOCK_STREAM
SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
sockname:
AF_INET <IP Address> port:
35873
peername:
AF_INET <IP Address> port:
13724
256: S_IFIFO mode:0000
dev:359,0 ino:632 uid:0 gid:0
size:0
O_RDWR|O_NONBLOCK
FD_CLOEXEC
257: S_IFIFO mode:0000
dev:359,0 ino:632 uid:0 gid:0
size:0
O_RDWR
FD_CLOEXEC
.
. [uninterrupted
sequence of file descriptors from 258 to 622]
.
623: S_IFSOCK mode:0666 dev:357,0
ino:42470 uid:0 gid:0
size:0
O_RDWR
SOCK_STREAM
SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
sockname:
AF_INET <IP Address> port:
39392
peername:
AF_INET <IP Address> port: 13724
Note: In
the example above, all file descriptors from 0 to 255 have been
exhausted. In general, it is abnormal for nbstserv to consume this
many file descriptors.
This is an example output from a system running
normally:
Current rlimit: 8192 file
descriptors
0: S_IFIFO
mode:0000 dev:358,0 ino:433506 uid:0 gid:0
size:723
O_RDWR
1: S_IFCHR mode:0620 dev:350,0 ino:12582954 uid:0 gid:7
rdev:24,19
O_RDWR|O_LARGEFILE
/devices/pseudo/pts@0:19
2: S_IFCHR mode:0620 dev:350,0 ino:12582954 uid:0 gid:7
rdev:24,19
O_RDWR|O_LARGEFILE
/devices/pseudo/pts@0:19
3: S_IFDOOR mode:0444 dev:359,0 ino:58 uid:0 gid:0
size:0
O_RDONLY|O_LARGEFILE
FD_CLOEXEC door to
nscd[167]
/var/run/name_service_door
4: S_IFREG mode:0600 dev:32,8 ino:10419 uid:0 gid:0
size:537
O_RDWR|O_CREAT
/etc/vx/vss/VRTSat.conf.lock
5: S_IFREG mode:0600 dev:32,12 ino:25252 uid:0 gid:1
size:20
O_RDWR|O_CREAT
/var/VRTSat/.VRTSat/profile/VRTSatlocal.conf.lock
6: S_IFREG mode:0600 dev:32,12 ino:24904 uid:0 gid:0
size:0
O_RDWR
/var/VRTSat/.VRTSat/profile/certstore/keystore/KeyStore.lock
7: S_IFREG mode:0664 dev:32,14 ino:2067530 uid:0 gid:0
size:49
O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
/opt/openv653/logs/51216-261-2216927724-090426-0000000000.log
256: S_IFIFO mode:0000 dev:358,0 ino:433668 uid:0 gid:0
size:0
O_RDWR|O_NONBLOCK
FD_CLOEXEC
257: S_IFIFO mode:0000
dev:358,0 ino:433668 uid:0 gid:0
size:0
O_RDWR
FD_CLOEXEC
258: S_IFIFO mode:0000
dev:358,0 ino:433669 uid:0 gid:0
size:0
O_RDWR|O_NONBLOCK
FD_CLOEXEC
259: S_IFIFO mode:0000
dev:358,0 ino:433669 uid:0 gid:0
size:0
O_RDWR
FD_CLOEXEC
260: S_IFSOCK mode:0666
dev:356,0 ino:51296 uid:0 gid:0
size:0
O_RDWR|O_NONBLOCK
FD_CLOEXEC
SOCK_STREAM
SO_KEEPALIVE,SO_SNDBUF(65536),SO_RCVBUF(57344),IP_NEXTHOP(0.0.224.0)
sockname:
AF_INET <IP Address> port:
44505
peername:
AF_INET <IP Address> port: 1556
In this example, far
fewer file descriptors are in use and plenty in the 0-255 range are
available.
Formal Resolution:
A partial resolution to this
issue is included in the following release:
In this release, the conditions leading to the
file descriptor leak will be addressed. This should dramatically reduce
the already rare potential for this issue to be seen.
The formal
resolution to this issue (Etrack 1588246) is scheduled to be included in
the following release:
- NetBackup
6.5 Release Update 5 (6.5.5), scheduled for release in the fourth quarter of
2009.
When NetBackup 6.5.5 are released, please
visit the following link for download and readme
information:
http://www.symantec.com/business/support/overview.jsp?pid=15143
Workaround:Stop and restart
nbstserv before the
amount of file descriptors (FD) grows too large.
pfiles may be
utilized to monitor file descriptor use by
nbstserv.
If this
workaround is unfeasible, please contact Symantec technical support, referencing
this document ID and Etrack
1588155 to obtain a fix for this
issue.
Best Practices:
Symantec strongly recommends the
following best practices:
1. Always perform a full backup prior to and after
any changes to your environment.
2. Always make sure that your environment is
running the latest version and patch level.
3. Perform periodic "test"
restores.
4. Subscribe to technical articles.
How to Subscribe to
Email Notification:Article Subscription:Subscribe to this
TechNote for any updates that are made to this article, by clicking on the
following
link:
http://maillist.support.veritas.com/notification.asp?doc=324189
Software Alerts:If you have not received this from the
Symantec Technical Support Email Notification Service, please click on the
following link to subscribe to future
Notifications:
http://maillist.entsupport.symantec.com/subscribe.asp
Products Applied:
NetBackup Enterprise Server 6.5, 6.5.1, 6.5.2, 6.5.3, 6.5.4
Subjects:
NetBackup Enterprise Server
Publishing Status: Techalert
Languages:
English (US)
Operating Systems:
Windows 2000Server SP4
AIX
5.1, 5.2, 5.3
TRU64
5.1B
HP-UX
11.11, 11i v2 (IA64), 11i v2 (PA-RISC), 11iv3
Solaris
10, 8.0, 9.0
Linux
Open Enterprise Server, RHEL 3.0 (AS, ES, WS), RHEL 4.0, RHEL 4.0 (IA64), RHEL 5, Red Flag DC Server 4.1, Red Flag DC Server 5.0 SP1, RedHat Enterprise Server 2.1 (AS, ES, WS), SLES 10 SP1, SLES 8, SLES 9
Windows Server 2003
DataCenter SP1, Enterprise ServerSP1, R2, Standard Server SP1, Web Server SP1