Accurate time is critical to every infrastructure, especially when most of it is automated in one form or another.

Just think about it for a second: everything from Kerberos ticket validation to log correlation, from access logging to security auditing, relies on accurate timestamps.

How do we ensure that time is consistently accurate across all our infrastructure, and which are the critical parts to monitor?

An Introduction to UTC and time standardization…

Freely adapted from NTP FAQ - (http://www.ntp.org/ntpfaq/NTP-a-faq.htm)

What is UTC?

UTC (Coordinated Universal Time) is the official standard for the current time. UTC evolved from the former GMT (Greenwich Mean Time), which was once used to set the clocks on ships before they left for a long journey.

  • Universal means that the time can be used everywhere in the world, meaning that it is independent of time zones (i.e. it’s not local time). To convert UTC to local time, one has to add or subtract the offset of the local time zone.

  • Coordinated means that several institutions contribute their estimate of the current time, and UTC is built by combining these estimates.
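As a minimal sketch of the conversion described above, the snippet below applies a fixed offset to a UTC timestamp using Python's standard library. The two zones (CET and EST) and the sample date are arbitrary choices for illustration, and real time zones also need daylight saving rules, which this sketch ignores.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets, chosen only for illustration (no DST handling).
CET = timezone(timedelta(hours=1), "CET")    # UTC+1
EST = timezone(timedelta(hours=-5), "EST")   # UTC-5


def utc_to_local(utc_time: datetime, zone: timezone) -> datetime:
    """Convert a timezone-aware UTC datetime to the given fixed-offset zone."""
    return utc_time.astimezone(zone)


utc_noon = datetime(2024, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
print(utc_to_local(utc_noon, CET).isoformat())  # 2024-01-15T13:00:00+01:00
print(utc_to_local(utc_noon, EST).isoformat())  # 2024-01-15T07:00:00-05:00
```

Both results describe the same instant; only the local representation changes.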

The UTC second is the SI second, defined by the 13th General Conference on Weights and Measures in 1967 as “The second is the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the cesium-133 atom.”

Unfortunately the earth’s rotation is not very much impressed by the definition of the UTC second. Having 86400 UTC seconds per day on an earth that’s slowing down would mean that midnight would eventually fall in the middle of the day. As this is probably unacceptable, extra seconds can be added to or removed from the UTC time-scale to keep it synchronized with the earth’s rotation. These adjustments are called leap seconds.

During a leap second, either one second is removed from the current day, or a second is added. In both cases this happens at the end of the UTC day. If a leap second is inserted, the time in UTC is specified as 23:59:60. In other words, it takes two seconds from 23:59:59 to 0:00:00 instead of one. If a leap second is deleted, time will jump from 23:59:58 to 0:00:00 in one second instead of two.
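To make the two cases concrete, here is a small sketch that lists the clock labels around the end of a UTC day. The function names are my own and the code only illustrates the labelling rule described above; it does not consult any real leap second table.

```python
SECONDS_PER_DAY = 86_400


def day_length(leap: int) -> int:
    """Length of a UTC day in seconds: leap is +1 (inserted), -1 (deleted) or 0."""
    return SECONDS_PER_DAY + leap


def final_labels(leap: int) -> list[str]:
    """Clock labels from 23:59:57 up to and including the next midnight."""
    if leap not in (-1, 0, 1):
        raise ValueError("leap must be -1, 0 or +1")
    last_second = 59 + leap  # 60 when a second is inserted, 58 when one is deleted
    labels = [f"23:59:{second:02d}" for second in range(57, last_second + 1)]
    labels.append("00:00:00")
    return labels


print(day_length(+1), final_labels(+1))
# 86401 ['23:59:57', '23:59:58', '23:59:59', '23:59:60', '00:00:00']
print(day_length(-1), final_labels(-1))
# 86399 ['23:59:57', '23:59:58', '00:00:00']
```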

Understanding and Designing an NTP implementation.

Hardware Clock vs. Software Clock

As many of you know, computer motherboards have a specialized hardware component called a real-time clock. This clock (driven by a quartz crystal that oscillates at a specific frequency) is what is commonly called the Hardware Clock. It keeps ticking when the machine is powered off, because a battery backs it up independently of the regular power supply.

The operating system usually reads this clock at boot and uses it as the starting point for the Software Clock. A Software Clock is a software representation of a clock and is usually implemented by the kernel of the operating system of your choice. In layman’s terms, this is the clock that serves the operating system and the applications running on it.
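As a quick illustration of the software clock from an application's point of view, the sketch below reads the system time in two ways with Python's standard library. Both calls ask the operating system for the current time, so they should agree closely; the variable names are only for illustration.

```python
import time
from datetime import datetime, timezone

# Seconds since 1970-01-01 00:00:00 UTC, as reported by the operating system.
epoch_seconds = time.time()

# The same system clock, presented as a timezone-aware UTC datetime.
utc_now = datetime.now(timezone.utc)

print(f"epoch seconds: {epoch_seconds:.3f}")
print(f"UTC datetime:  {utc_now.isoformat()}")
```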

Keeping time synchronized.

Once this software clock is up and running, we still need to make sure it’s accurately synced across all our devices. The most commonly used method to accomplish this is NTP, the Network Time Protocol.

From an architectural perspective, NTP relies on a hierarchy of servers to distribute the load and preserve accuracy. The hierarchy is made up of different stratum levels, where a lower stratum means being closer to the actual clock source (usually a very accurate GPS clock or a specialized industrial atomic clock).

Ideally, every enterprise should have a reference clock in house (for greater accuracy and to avoid dependency on external network services), on top of which an internal hierarchy of NTP servers should be built.

If such a reference clock cannot be justified, NTP.org offers a large pool of very accurate public time servers (pool.ntp.org); they are geographically distributed and can be used for free.
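For readers curious about how an NTP client estimates how far off its clock is, here is a minimal sketch of the standard offset and delay calculation from the four timestamps of one request/response exchange. The sample timestamps are invented: they assume a client that is 2 seconds behind the server, 10 ms of network latency each way and 1 ms of server processing time.

```python
def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """Return (offset, delay) in seconds for one NTP exchange.

    t1: client transmit time (client clock)
    t2: server receive time  (server clock)
    t3: server transmit time (server clock)
    t4: client receive time  (client clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2   # positive: the client clock is behind
    delay = (t4 - t1) - (t3 - t2)          # round trip minus server processing
    return offset, delay


# Hypothetical exchange, values made up for illustration.
offset, delay = ntp_offset_and_delay(t1=100.000, t2=102.010, t3=102.011, t4=100.021)
print(f"offset={offset:.3f}s delay={delay:.3f}s")  # offset=2.000s delay=0.020s
```

The calculation assumes the network path is roughly symmetric; asymmetric latency shows up as an error in the offset estimate.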
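As a rough sketch of what pointing a client at the public pool looks like, the snippet below renders `server` lines in the style used by ntp.conf. The four `N.pool.ntp.org` hostnames are the generic pool names; the helper function and the choice of the `iburst` option are my own assumptions, so check your distribution's defaults before copying the result.

```python
# Generic pool hostnames; regional variants exist but are not shown here.
POOL_SERVERS = [f"{index}.pool.ntp.org" for index in range(4)]


def render_server_lines(servers: list[str]) -> str:
    """Render one ntp.conf-style 'server' line per host (hypothetical helper)."""
    return "\n".join(f"server {host} iburst" for host in servers)


print(render_server_lines(POOL_SERVERS))
# server 0.pool.ntp.org iburst
# server 1.pool.ntp.org iburst
# server 2.pool.ntp.org iburst
# server 3.pool.ntp.org iburst
```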

Distributing the time.

Keeping time accurate is important in every layer of your infrastructure, which means that all the computing devices in your network should be synchronized using NTP.

It is obviously not sane to point all your NTP clients to a single time source (such as your reference clock). The best way to achieve correct time synchronization is to build a hierarchy of low-stratum NTP servers underneath it (stratum 1 and 2, assuming your reference clock is stratum 0) and distribute the load among them.

NTP also prefers to have access to several lower-stratum time sources, since it can apply an agreement algorithm to detect insanity on the part of any one of them. To make these algorithms work properly, it is critical to give every client at least three different time sources.

This diagram depicts how a medium-sized NTP hierarchy is typically built:
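To show why three sources are the minimum for outvoting a bad one, here is a deliberately simplified sketch. It is not the selection algorithm ntpd actually uses; it just compares each source's offset with the median and flags the ones that disagree. The source names, offsets and the 50 ms tolerance are all made up.

```python
from statistics import median


def find_outliers(offsets: dict[str, float], tolerance: float = 0.050) -> list[str]:
    """Return the sources whose offset (seconds) is far from the median.

    Simplified illustration only: with fewer than three sources there is
    no majority, so a single bad clock cannot be identified.
    """
    if len(offsets) < 3:
        raise ValueError("need at least three sources to outvote a bad one")
    middle = median(offsets.values())
    return sorted(name for name, value in offsets.items()
                  if abs(value - middle) > tolerance)


# Hypothetical measurements: two sources agree, one is badly off.
samples = {"ntp-a": 0.004, "ntp-b": -0.002, "ntp-c": 1.750}
print(find_outliers(samples))  # ['ntp-c']
```

With only two sources that disagree, there is no way to tell which one is lying, which is why the function refuses to answer in that case.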

NTP hierarchy layout

As you can see, starting from the reference clock (usually GPS-based), we have the core pieces of the infrastructure (the network core switches) serving as stratum 1 servers. Core switches are usually the best choice for low-stratum servers in an enterprise.

Core switches get their time from the reference clock (and potentially from an external stratum 1 server too) and have a peer relationship with each other to ensure sanity. From there we move on to servers (usually identity systems like Active Directory), which in turn distribute time to the highest stratum layer, usually the clients at the edge of the network.
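The layering described above can be sketched as a small data structure in which each node records where it gets its time from, and its stratum is one more than that of its source. The node names are hypothetical and only mirror the roles in the description (reference clock, core switches, identity server, edge client).

```python
# Hypothetical topology: each node maps to the source it synchronizes from.
UPSTREAM = {
    "gps-reference": None,              # reference clock, stratum 0
    "core-switch-1": "gps-reference",
    "core-switch-2": "gps-reference",
    "domain-controller": "core-switch-1",
    "workstation": "domain-controller",
}


def stratum(node: str) -> int:
    """Count the hops from a node up to the reference clock."""
    level = 0
    while UPSTREAM[node] is not None:
        node = UPSTREAM[node]
        level += 1
    return level


for name in UPSTREAM:
    print(f"{name}: stratum {stratum(name)}")
# gps-reference: stratum 0
# core-switch-1: stratum 1
# core-switch-2: stratum 1
# domain-controller: stratum 2
# workstation: stratum 3
```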

Troubleshoot NTP behavior and avoid pitfalls.

NTP behavior can be puzzling at times, and understanding what is happening behind the scenes (and the terminology) can be a daunting task. I’ll try to summarize a few tips and commands you can use to pinpoint errors easily, or just to better understand how the clock is behaving.

One of my favorite tools is ntpq, which is found on most UNIX machines and ships with most ntpd distributions. ntpq queries the ntpd daemon that runs locally and keeps the software clock in sync.

A common way to query ntpd for status is to issue ntpq -p (print the peer list).

~ # ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
infrastructure. .LOCL.           1 u   51   64  377    4.959  2683.47   9.507

This table sums up all the fields that are returned by the command:

Remote: the NTP server you specified in your configuration (in this case infrastructure.mydomain.lab, truncated in the output).
Refid: the time source of that NTP server; this can be a local source (like mine, which says .LOCL.) or another upstream NTP server.
St (stratum): how far the server is from the time source; the time source is stratum 0, so stratum 1 means one hop away from it.
T (type): the type of source; u means unicast (other types are l: local, b: broadcast or multicast, s: symmetric (peer), A: manycast server, B: broadcast server, M: multicast server).
When: the number of seconds since the last response was received.
Poll: the polling interval, in seconds.
Reach: a bit vector (in octal) of the last eight poll attempts (1 = replied, 0 = no reply). NTP runs over UDP, so an occasional dropped packet is not unusual.
Delay: the round-trip time to receive a reply, in milliseconds.
Offset: the time difference between the local clock and the source, in milliseconds.
Jitter: the variation in offset between successive samples, in milliseconds.
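If you want to process this output in a script, the sketch below parses the sample row shown earlier and expands the octal reach value into its eight individual poll results. The function names are my own, and the parser assumes the ten whitespace-separated columns listed above; it leaves the when column as text, since ntpq does not always print it as a plain number.

```python
FIELDS = ["remote", "refid", "st", "t", "when", "poll",
          "reach", "delay", "offset", "jitter"]


def parse_peer_line(line: str) -> dict:
    """Parse one data row of `ntpq -p` output into a dictionary."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    peer = dict(zip(FIELDS, values))
    # ntpq may prefix the remote name with a tally symbol such as * or +.
    peer["remote"] = peer["remote"].lstrip("*+-#")
    peer["st"] = int(peer["st"])
    peer["poll"] = int(peer["poll"])
    peer["reach"] = int(peer["reach"], 8)        # printed in octal
    for key in ("delay", "offset", "jitter"):    # all in milliseconds
        peer[key] = float(peer[key])
    return peer


def reach_history(reach: int) -> list[bool]:
    """Expand the reach register; index 0 is the most recent poll."""
    return [bool((reach >> bit) & 1) for bit in range(8)]


line = "infrastructure. .LOCL.           1 u   51   64  377    4.959  2683.47   9.507"
peer = parse_peer_line(line)
print(f"{peer['remote']} stratum={peer['st']} offset={peer['offset'] / 1000:.3f}s")
print(reach_history(peer["reach"]))
```

A reach of 377 (octal) is 11111111 in binary, meaning all of the last eight polls were answered; a value of 376 would mean the most recent poll went unanswered.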

One of the most common pitfalls is that, after a correct ntpd setup, ntpq -p reports a massive skew from the time coming from the source, and it seems that ntpd is not even trying to bring the clock in sync.

This is due to a feature built into ntpd called the panic threshold, which defaults to 1000 seconds. If the skew between the local software clock and the time reference is larger than the panic threshold, ntpd will not try to sync the time. You will have to manually bring the local software clock within the panic threshold and then let ntpd handle the remaining time drift.

To synchronize the clock manually, stop ntpd and run ntpd -q -g. The -g flag tells ntpd to synchronize even if the skew is greater than 1000 seconds, and -q makes ntpd set the clock once and exit.
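The decision can be sketched as a simple threshold check. The 1000 second panic threshold comes from the text above; the 128 ms step threshold is ntpd's documented default but is not discussed in this article, so treat it as an added assumption. The function and its return labels are hypothetical, not part of ntpd.

```python
STEP_THRESHOLD = 0.128     # seconds; assumed ntpd default, not from this article
PANIC_THRESHOLD = 1000.0   # seconds; the default described above


def correction_strategy(offset_seconds: float) -> str:
    """Classify how a clock offset of the given size would be handled."""
    magnitude = abs(offset_seconds)
    if magnitude > PANIC_THRESHOLD:
        return "refuse"   # too far off: manual intervention needed
    if magnitude > STEP_THRESHOLD:
        return "step"     # jump the clock to the correct time
    return "slew"         # gradually speed up or slow down the clock


print(correction_strategy(0.010))     # slew
print(correction_strategy(2.68347))   # step
print(correction_strategy(-3600.0))   # refuse
```

Note that the 2683.47 ms offset from the sample ntpq output is well inside the panic threshold, so it falls in the step range in this sketch.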

Once the time is back within the panic threshold, restart ntpd.
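The whole procedure can be scripted. The sketch below is a dry run by default and only prints the commands. It assumes a systemd-based machine where the service is called ntpd (on some distributions it is called ntp), and it needs root privileges to run for real, so adapt it before using it.

```python
import subprocess


def manual_sync_commands(service: str = "ntpd") -> list[list[str]]:
    """Commands for a one-off sync; the service name is an assumption."""
    return [
        ["systemctl", "stop", service],    # stop the running daemon
        ["ntpd", "-q", "-g"],              # set the clock once, ignoring the panic threshold
        ["systemctl", "start", service],   # hand control back to the daemon
    ]


def run_manual_sync(dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for command in manual_sync_commands():
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


run_manual_sync()  # dry run: prints the three commands without executing them
```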

Another common misconception is that you can use hwclock to manipulate the time on your machine. hwclock reads and writes the Hardware Clock, which is only used to initialize the time at boot; setting it does not affect the current time on the software clock.

Using hwclock and then restarting ntpd will have no useful effect: ntpd will not pick up the time from the battery-backed clock at that point. In fact, even using hwclock and then rebooting will have no useful effect, as shutdown scripts usually copy the software clock to the battery-backed clock, which overwrites whatever changes you made with hwclock.

Basically, hwclock is not something a user should ever run except for very specialized troubleshooting purposes.

Geektastic: Build your own super-accurate reference clock.

While there are countless options for commercial-grade reference clocks (GPS or even atomic), those are usually very expensive and out of reach for the regular enthusiast. However, it’s possible to build a very accurate GPS reference clock out of inexpensive parts you can find on eBay.

Soekris Net4501 boards are very inexpensive SBCs (single board computers) that geeks all around the world have learnt to love, as they provide a cheap starting point for all sorts of geeky weekend projects. One of these projects, by Jason Rabel over at extremeoverclocking.com, shows how to build a stratum 1 reference clock out of a Soekris 4501, a Motorola GPS time receiver and FreeBSD. Check out his wonderful tutorial here: http://www.extremeoverclocking.com/articles/howto/Building_S1_NTP_Server_1.html

About the author: Fabio Rapposelli
