
TELDAT Blog


Improving device reliability and redundancy

Mar 5, 2024

In a previous article we looked at how to mathematically calculate the reliability of electronic equipment, that is, the probability of it working correctly for a given period of time. This is best characterised by the Mean Time Between Failures (MTBF) or by its inverse, the failure rate, usually expressed in Failures In Time (FIT). MTBF values for electronic equipment are often in the order of hundreds of thousands of hours.

Cumulative probability

Source: Ricardo Saiz

The probability of a failure occurring within a time t is best expressed as an exponential function, which is close to a straight line for intervals that are short compared with the MTBF.
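As a minimal sketch of this relationship, assuming the usual constant-failure-rate model P(t) = 1 - e^(-t/MTBF), the snippet below compares the exact probability with the straight-line approximation t/MTBF. The function name is illustrative, and the MTBF value is borrowed from the power supply example later in this article.

```python
import math

def failure_probability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of at least one failure within t_hours,
    assuming a constant failure rate of 1/mtbf_hours."""
    return 1.0 - math.exp(-t_hours / mtbf_hours)

MTBF_HOURS = 200_000.0  # illustrative value, in hours

for t in (24, 720, 8_760, 87_600):
    exact = failure_probability(t, MTBF_HOURS)
    linear = t / MTBF_HOURS  # straight-line approximation, good while t << MTBF
    print(f"t={t:>6} h  exact={exact:.5f}  linear={linear:.5f}")
```

The two columns stay close for short intervals and drift apart as t approaches the MTBF.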

 

How do you calculate the MTBF of a device?

Device reliability depends on that of its component parts (solderable electronic components, modules, wiring, etc.). The inverse of the total MTBF is the sum of the inverses of the MTBF of each part, much like resistances in parallel. Just as admittances add up in an electrical circuit, the FIT of a device made up of many components (parallel paths leading to a failure) is the sum of the FITs of all of them. This is why it is easier to work with FIT than with MTBF.
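The addition rule can be sketched in a few lines of Python. This assumes the common convention that 1 FIT is one failure per 10^9 device-hours. The component names and FIT values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
HOURS_PER_BILLION = 1e9  # FIT counts failures per 10^9 device-hours

def mtbf_to_fit(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to a failure rate in FIT."""
    return HOURS_PER_BILLION / mtbf_hours

def fit_to_mtbf(fit: float) -> float:
    """Convert a failure rate in FIT to an MTBF in hours."""
    return HOURS_PER_BILLION / fit

# Hypothetical component failure rates, for illustration only
component_fit = {
    "power supply": 5_000.0,
    "cpu module": 1_000.0,
    "memory": 500.0,
    "connectors and wiring": 1_500.0,
}

# Any single component failing takes the device down, so the rates add up
device_fit = sum(component_fit.values())
device_mtbf = fit_to_mtbf(device_fit)

print(f"device FIT:  {device_fit:.0f}")
print(f"device MTBF: {device_mtbf:.0f} hours")
```

Note that the resulting device MTBF is always lower than that of its weakest component.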

 

How to make devices more reliable?

In turn, the FIT of a component is not an immutable value but depends on the environment and (more specifically) on the temperature. Heat is directly related to the failure rate, and indeed to the speed of many physical processes and chemical reactions. Swedish scientist Svante Arrhenius (1859 – 1927) was the first to model this relationship, in 1889, with the equation that bears his name:

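$$k = A\, e^{-E_a / (k_B T)}$$

where k is the rate of the process, A is a constant, E_a is the activation energy, k_B is the Boltzmann constant and T is the absolute temperature.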

According to this formula, when the absolute temperature is close to zero, reactions stop. However, they accelerate significantly with increasing temperature.
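To get a feel for how strongly temperature matters, one can take the ratio of two Arrhenius rates, often called the acceleration factor. The sketch below is illustrative only: the function name and the activation energy of 0.7 eV are assumptions, and real values depend on the component and its failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def acceleration_factor(temp_ref_c: float, temp_use_c: float,
                        activation_energy_ev: float = 0.7) -> float:
    """Ratio of the Arrhenius rate at temp_use_c to the rate at temp_ref_c.

    The 0.7 eV default is an assumed placeholder, not a measured value.
    """
    t_ref = temp_ref_c + 273.15  # convert Celsius to kelvin
    t_use = temp_use_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_ref - 1.0 / t_use)
    return math.exp(exponent)

af = acceleration_factor(25.0, 55.0)
print(f"Failure rate at 55 C is about {af:.1f}x the rate at 25 C")
```

Under these assumptions, a rise of 30 degrees multiplies the failure rate roughly tenfold, which is why ventilation advice deserves to be taken seriously.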

 

High service availability

Our device will become less reliable as temperatures rise, but how can we make it more reliable? We can’t fight the laws of physics, but we can use them to make the best engineering decisions. In addition to heeding the advice found in manuals (“do not cover the ventilation slots” or “install the device far from heat sources”), we can improve the reliability of the system as a whole, that is, its service availability, which is ultimately what matters.

 

Redundancy of devices

In a router or switch, we can duplicate the power supply (one of the elements with the highest failure rate). The probability of a power supply failing within an interval t is:

$$P(t) = 1 - e^{-t/\mathrm{MTBF}}$$

This formula equals 0 when t=0, but its derivative is:

$$P'(t) = \frac{1}{\mathrm{MTBF}}\, e^{-t/\mathrm{MTBF}}, \qquad P'(0) = \frac{1}{\mathrm{MTBF}}$$

The device will stop working only if both power supply units (PSUs) fail. Assuming the two units fail independently, the probability of this happening is the previous formula squared:

$$P_2(t) = \left(1 - e^{-t/\mathrm{MTBF}}\right)^2$$

As in the previous case, this formula equals 0 at t = 0. However, its derivative is also 0 at that point.
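The claim about the derivative can be checked with a short derivation under the same exponential model. The notation here is an assumption added for clarity: M stands for the MTBF of one PSU and P_2 for the probability that both units have failed.

```latex
P_2(t) = \left(1 - e^{-t/M}\right)^2
\quad\Longrightarrow\quad
P_2'(t) = \frac{2}{M}\, e^{-t/M} \left(1 - e^{-t/M}\right),
\qquad
P_2'(0) = \frac{2}{M} \cdot 1 \cdot 0 = 0
```

For a single unit the slope at t = 0 is 1/M, so the redundant pair starts out flat where the single unit is already climbing.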

Formula 2

Source: Ricardo Saiz

With two units working simultaneously (only one of which is essential), the failure probability follows a very different curve. This is especially true for periods that are short compared to the MTBF. Let’s look at a simple example.

We have a power supply with an MTBF of 200,000 hours. What are the chances of it failing within a year?

$$P(1\ \text{year}) = 1 - e^{-8760/200000} \approx 0.043$$

200,000 hours may seem like a long time, but there is a 4.3% chance of it breaking down in the first year of use. If we have a pool of 23 devices, we will suffer an average of one breakdown per year (with the ensuing service outage).

If we set up two PSUs working redundantly, the probability of a critical failure over a year is:

$$P_2(1\ \text{year}) = \left(1 - e^{-8760/200000}\right)^2 \approx 0.0018$$

This amounts to only 0.18%.
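The figures in this example can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the text: a constant failure rate, independent PSUs, and no repair during the year. The variable names are illustrative.

```python
import math

HOURS_PER_YEAR = 8_760      # 365 days x 24 hours
PSU_MTBF_HOURS = 200_000    # MTBF from the example above
FLEET_SIZE = 23             # pool of devices from the example above

# Probability that one PSU fails within a year
p_single = 1.0 - math.exp(-HOURS_PER_YEAR / PSU_MTBF_HOURS)

# Both independent PSUs must fail for the device to stop working
p_redundant = p_single ** 2

# Average number of breakdowns per year in a pool of single-PSU devices
expected_failures = FLEET_SIZE * p_single

print(f"single PSU:    {p_single:.2%}")
print(f"redundant PSU: {p_redundant:.2%}")
print(f"expected breakdowns per year in a pool of {FLEET_SIZE}: {expected_failures:.2f}")
```

Redundancy cuts the yearly risk by a factor of more than 20 here, and the benefit grows as the interval gets shorter relative to the MTBF.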

If we also connect each PSU to a separate electrical circuit (e.g., an uninterruptible power supply or UPS), we obtain another advantage: a power cut is much less likely to leave us temporarily without service.

If our equipment sends an alert to the network administrator when it detects a failure, the faulty device can be replaced within a short period of time, ideally before a second, critical failure occurs.

Combining redundancy with diligent fault detection and remediation yields extremely high service availability. This is because, after a failure occurs, suffering another breakdown during the time it takes to repair the device (presumably hours or a few days) is unlikely. Graphically, we remain in the flat area of the grey line (i.e., where the derivative is almost zero).

formula 4

Source: Ricardo Saiz
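To put a number on "unlikely", the sketch below estimates the chance that the surviving PSU also fails before the faulty one is replaced. It assumes the same exponential model and the 200,000-hour MTBF from the earlier example. The function name and the repair windows are hypothetical.

```python
import math

PSU_MTBF_HOURS = 200_000  # MTBF from the earlier example

def second_failure_risk(repair_hours: float, mtbf_hours: float) -> float:
    """Probability that the remaining unit fails during the repair window,
    assuming a constant failure rate of 1/mtbf_hours."""
    # -expm1(-x) equals 1 - exp(-x) but keeps precision for very small x
    return -math.expm1(-repair_hours / mtbf_hours)

# Hypothetical repair windows, for illustration only
for label, hours in (("4 hours", 4), ("2 days", 48), ("1 week", 168)):
    risk = second_failure_risk(hours, PSU_MTBF_HOURS)
    print(f"repair within {label:>7}: {risk:.4%} chance of a second failure")
```

Even with a week-long repair window the risk stays below 0.1%, which is why prompt alerts and replacement matter as much as the redundancy itself.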

 

MTBF findings and more

Teldat devices (such as the new generation of switches, some of which are equipped with redundant power supplies to meet the most demanding requirements) offer MTBF figures ranging between 500,000 and one million hours. We also carry out a rigorous Reliability, Availability, Maintainability and Safety (RAMS) analysis for equipment intended for special scenarios, such as railways. Using Fault Tree Analysis (FTA), we can identify potential failures and design alternative operating modes in the event of simple failures. As a result, our service availability figures are close to 100%.
