Learn Vital Equipment Reliability Basics for Equipment Reliability Management Success

Learning how to run your industrial plant and equipment to get outstanding reliability requires an understanding of its failure mechanisms. You can have tremendous equipment reliability if you use the right maintenance strategies at the right times during the equipment life cycle to prevent the causes of failure.

 


 

Abstract:
Equipment Reliability Basics 101 for Reliability Management. When you understand the behaviour of equipment over its lifetime you will understand why proactive maintenance (as opposed to repairs) is critical. Our technologically based society depends on machinery and equipment to keep it operating. Knowing how equipment behaves during its operating life, and why it fails from time to time, means you can set up the right maintenance strategies to deliver lifetime reliability. Learn to apply the correct ways to use equipment failure curves to select your maintenance strategies. Change your maintenance and operating practices to control and eliminate failure and ensure you deliver outstanding equipment reliability.

Keywords: Equipment reliability management, reliability curves, equipment failure curves

The Truth about Machinery and Equipment
When you buy equipment, like a car or an air compressor or a computer, you buy a product made of numerous parts operating together as a system to deliver a function you want. The parts form interconnected systems that deliver you a service.

A car has thousands of items in it making up several systems – the engine, transmission, fuel supply, braking, chassis, body, etc. Mostly all that you want from a car is to take you from place to place.

An air compressor is made of hundreds of parts combined into systems – the compressor, oil supply, pressure receiver, motor drive, etc. You just want the compressed air.

A computer is made of hundreds of parts – the central processing unit, image processing, memory storage, monitor, keyboard, etc. You just want to type letters, analyse data and communicate with the world.

We use machinery and equipment to do particular functions we need done. When they cannot provide that function we say that they have failed or broken.

All equipment will eventually fail. It is the nature of our early 21st Century technology that equipment parts do not last forever. In some cases they do not last very long at all, maybe only hours. In a few cases they will last untouched for a generation or more.

The truth about equipment is that it will fail. What we do not know is what part will fail or when it will fail, but you can guarantee with certainty that one day your car will stop dead, the air compressor will not work and the computer will ‘die’. The aim of maintenance is to control the timing of the failures so that you can select when they happen.

Reliability Research Turns Up New Facts
Research done in the late 1970s and the 1980s on equipment used in airplanes and naval ships found six separate patterns that reflected how equipment failed. The research was tested again at the start of the 21st century, again by the US Navy, and was confirmed as correct. These failure patterns, or failure curves, are shown in Figure 1.

The curves reflected particular equipment populations. For example, numerous airplane assemblies of the same type were tracked through their working lives. There were also numerous navy ship assemblies tracked for their entire history. Their frequency of failure, the inverse of which is their reliability, was graphed. In all cases those assemblies fitted one of the reliability curves from ‘A’ to ‘F’.

What the research found was that no matter the equipment type, its lifetime reliability would match one of the curves. The researchers went as far as trending the equipment types that matched each curve. The percentages fitted to each type of curve are also shown in Figure 1. A piece of equipment will fall into one of the six failure patterns shown. The curves do not mean that, of 100 identical items of equipment, some will fit each of the curves. Rather, they mean that all 100 identical items will fit one reliability curve, depending on what happens to the equipment.

The range and difference of values between the airline industry and naval results, as well as within the naval results, are believed to reflect the different environments and duty requirements within the respective industries. For example, the naval results came from both surface ships and submarines. Each has very different service requirements matching different operating conditions.

By understanding these failure, or reliability, curves and the information they contain, you can select the right maintenance strategies to deliver lifetime equipment reliability!

Figure 1: Equipment reliability curves, or equipment failure curves

A very important observation you must make from the patterns of the failure curves is that three significant phases cover most equipment life.

The first period is early life, or childhood, or wear-in. About 70% of airline equipment and 10% to 30% of naval equipment suffered higher rates of failure very early in their life. The second is the long mid-life where all equipment suffers a constant chance of failure or, as with curve ‘C’, an increasing chance of failure. The third period, or end of life, or wear-out, includes about 6% of airline equipment and 12% to 20% of naval equipment. At this stage the chance of failure rises quickly because of age or use: the parts are worn out. Curve ‘C’ constantly slopes upward and means that for 5% of airline equipment and 3% to 17% of naval equipment the chance of failure increases the longer it is in use.
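The three phases can be pictured with a simple model. The sketch below uses the Weibull hazard function, a standard reliability engineering tool that this article does not itself use, to show a failure rate that falls, stays constant, or rises with age. The shape values, the characteristic life and the function name are assumptions chosen only for illustration; they are not taken from the airline or naval data.

```python
def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate h(t) = (beta / eta) * (t / eta) ** (beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)


# Hypothetical shape parameters, one per phase of equipment life.
ZONES = {
    "early life (beta < 1)": 0.5,
    "mid-life (beta = 1)": 1.0,
    "wear-out (beta > 1)": 3.0,
}
ETA = 1000.0  # assumed characteristic life, in operating hours

for name, beta in ZONES.items():
    # Compare the failure rate early and late in the same time window.
    early = weibull_hazard(100.0, beta, ETA)
    late = weibull_hazard(900.0, beta, ETA)
    if early > late:
        trend = "falling"
    elif early < late:
        trend = "rising"
    else:
        trend = "constant"
    print(f"{name}: failure rate is {trend}")
```

A shape value below 1 gives the falling failure rate of early life, a value of exactly 1 gives the flat mid-life, and a value above 1 gives the rising rate of wear-out. Real equipment would need its own failure history to estimate these numbers.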

The second observation you must make is from curve ‘D’. Only 6% to 11% of equipment is certain to work right when first used. From then on the chance of failure rises quickly till it becomes constant. The implication is that about 90% of all equipment can fail early in life. Not that they will fail, but they could.

Another important observation to make from the equipment failure curves, or better to call them ‘equipment likelihood of continuing in service’ curves, is that at the time Patterns ‘D’, ‘E’ and ‘F’ represented 87% – 94% of airline equipment and 54% – 84% of naval equipment. In other words, once equipment is past its infant stage, you cannot predict when most of your equipment will fail! It may last for years or it may fail tomorrow. What is certain is that for that type of equipment there will be a constant rate of equipment failures.
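To see why a constant rate of failure means you cannot predict when an individual item will fail, consider this minimal sketch. It assumes the usual constant-failure-rate model, in which the chance of surviving to time t is exp(-rate × t). The failure rate of one failure per 10,000 hours and the function names are hypothetical.

```python
import math


def reliability(t, failure_rate):
    """Chance of surviving to time t when the failure rate is constant."""
    return math.exp(-failure_rate * t)


def chance_of_surviving_next(hours, age, failure_rate):
    """Chance of running another `hours`, given the item has reached `age`."""
    return reliability(age + hours, failure_rate) / reliability(age, failure_rate)


FAILURE_RATE = 1 / 10_000  # assumed: one failure per 10,000 operating hours

for age in (0, 5_000, 20_000):
    p = chance_of_surviving_next(1_000, age, FAILURE_RATE)
    print(f"age {age:>6} h: chance of surviving the next 1,000 h = {p:.3f}")
```

Every age gives the same answer, about 0.905. Under this model an old item is no more likely to fail in the next 1,000 hours than a young one, which is why age alone tells you nothing about when to replace it.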

The final observation I will make from the failure curves, and perhaps the most important, is that the likelihood of failure is not a fixed number. It cannot be said that (say) 2% of all computers will be failed at any one time. It cannot be said that 1% of all cars are failed at some point. The actual chance of failure is not fixed; it can be influenced. This is a wonderful discovery because now you can choose to use the right maintenance strategies to drive the rate of failure down to very low levels.

Of course you can choose to do nothing and just live life by ‘luck’ and repair things when they break! That is called ‘reactive maintenance’. It is the most costly, the most unsafe and the most production-losing form of maintenance you can do.

Maintenance Practices Need To Match Reliability Needs
All we can ever hope to do to change the equipment reliability curves is either to:

  • extend the length of time between failures, or,
  • prevent equipment failures by replacing the ‘worrisome’ components before they fail.

Making equipment more reliable is all about extending the time between failures. Preventing failure by renewal of components and lubricants is what maintenance is all about.

Maintenance is done to keep things working the way that they were meant to work. When equipment is not maintained you can guarantee that it will fail and stop sooner rather than later! The only really useful questions to ask about maintenance are: How much maintenance do I need to do? What is the right maintenance to do? And when do I need to do it?

Doing maintenance provides a means to reduce the likelihood of failure. Do the maintenance and the machine will keep working as it should for longer. Do not do the maintenance and the machine will fail sooner than it would have.

It is important that you understand the difference between maintenance and repair! A repair is done when something is broken and needs to be fixed. Maintenance is done to stop a thing from breaking. Maintenance is cheap. It is repair which is expensive. You want to be doing enough of the right maintenance to not have any repairs. A repair means something has failed and equipment is not performing its function. When your car is broken down you can’t get to where you need to be! A repair is a very stressful situation for all involved, whereas maintenance is just a normal part of life.

Increased reliability is achieved by doing maintenance well, doing it completely and doing it on time. Increased reliability is also achieved by better design, better assembly, better installation and better operation.

Let’s take a closer look at how to use the ‘equipment likelihood of continuing in service’ curves, that is the reliability curves, to really understand the maintenance strategies you need to be doing in your business.

Figure 2: The three maintenance zones of equipment life

Figure 2 shows the zones of equipment life. We know that one curve cannot describe all failure patterns, but nearly all equipment reliability can be described by some combination of the three zones shown in Figure 2. Curve ‘D’ is the only exception. But ‘D’ covers few equipment types and even those soon revert to standard likelihoods of failure.

Maintenance for Controlling Equipment Infant Mortality
Given that with today’s technology infant mortality is a possibility for up to 70% of equipment types, what do you do to reduce the chance it will happen to your equipment? Figure 3 shows the infant mortality zone and asks the important questions that need to be asked by every company that uses equipment.

Figure 3: Equipment infant mortality maintenance zone strategies

What are the causes of premature failure when equipment is new? Why should up to 70% of brand new plant and equipment fail soon after they start into service? The answers are very embarrassing! I can tell you what I have seen, and sadly done myself, over the last twenty years of my maintenance and engineering career that explain a great part of infant mortality.

First is that equipment and plant designers do not know what they don’t know. You will find that equipment designed by a novice engineer will have many more failures than equipment designed by an experienced engineer.

Yet even a very experienced engineer will not know the true process conditions in which the equipment will be used, and will not know how the process conditions affect the equipment’s lifetime performance. They will not really know how the plant operators will use the equipment. Here then is the first maintenance strategy to adopt – review every equipment selection and design around a table with a team of experienced operators, maintainers, engineers and process people before building or buying it. Go into the design details, ask why it’s designed that way, look for compatibility problems, spot what problems you can and get them fixed!

Part of the answer to the question of how to reduce and lessen the impact of infant mortality can be found in the difference between the results of curve ‘F’ for the airline industry and the naval industry.

It is standard navy practice to specify in their equipment supply contracts that equipment must be tested to full specification before it is accepted. That allows most of the early failure causes to be fixed before the navy gets the equipment. That then is the second maintenance strategy to adopt – only accept equipment when it has been tested to full specification and duty for a good period of time.

When equipment is built you do not know the skills and abilities of the fabricators and assemblers. Like the new design engineer mentioned above, a person new to fabrication or assembly will make more errors than one well experienced in their trade. People work to different levels of accuracy. Unless their work system proactively controls and rectifies variance you will get inaccurate parts assembled together.

A machine or plant built of parts that do not fit together properly will soon fail. So your third maintenance strategy is to have a quality system that proves your equipment is made accurately, installed accurately and rebuilt accurately. You might want to go further and only buy from manufacturers, and use installers, that crave accuracy and live ‘quality’. You can tell them apart from the rest because they can show you records of proof that their parts are made right and went together right.

New equipment has not yet proven itself. How do you know that it will behave as it was designed to perform when you have never seen it operating? When up to 70% of new equipment first starts up, the failure curves tell us that it is at much higher risk of failure than once it has been operated for some time. Your fourth maintenance strategy to reduce equipment infant mortality is to have a planned and complete commissioning process where each equipment item is individually checked and proven ready to go into operation.

The fifth strategy to offer you at the moment for minimising downtime at the infant mortality stage is to be sure you have access to critical spares and replacements. Infant mortality is why manufacturers are forced to have a warranty period. Be sure that they can give you the parts you need if their equipment fails soon after start-up.

Maintenance for Controlling Random Failures
Once your equipment is past its early life it enters a long period of time where it suffers random failures. Figure 4 tells us that from time to time something will go wrong and the equipment fails. Given that there will be random failures, what can you do to reduce the chance that they will happen to you? How do you drive the likelihood of failure down, down, down so that you have fewer problems?

Figure 4: Equipment random failure maintenance zone strategies

There is an old maintenance adage that you ‘don’t touch it till it breaks’. Since the discovery of the six reliability curves it has come to be viewed as the thinking that causes ‘breakdown’ maintenance and ‘reactive maintenance’. But if seen in another light, there is more than a grain of truth behind its intention.

We can see from the reliability curves that for nearly 100% of equipment in the period between run-in and wear-out, you cannot predict when a particular piece of equipment will fail. All you know for sure is that at any one time a proportion of the equipment type population will have failed. That makes it almost pointless to overhaul a machine and replace its parts on a fixed schedule, because most of the time they are perfectly fine. So in a sense you should leave a well operating piece of equipment alone. Hence ‘don’t touch it till it breaks’ should become ‘why touch it till it breaks?’

If during the random failure period it is not sensible to replace parts based on the passing of time, what should be done? We touched on it earlier – we must monitor the condition of the machine or equipment to see if it is changing. And only when we see a change for the worse do we ‘touch it’! The first maintenance strategy for the random failure zone is to find ways to non-intrusively monitor the condition of equipment, looking for signs of impending failure. When you find evidence of degrading performance, schedule the equipment for repair.
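A condition monitoring check can be as simple as comparing recent readings against a healthy baseline. The sketch below is one minimal way to do it and is not a standard method: the function name, the window sizes, the alarm ratio of 1.5 and the vibration readings are all hypothetical values you would set from your own equipment history.

```python
from statistics import mean


def needs_attention(readings, baseline_count=5, recent_count=3, alarm_ratio=1.5):
    """Return True when the recent average exceeds alarm_ratio times the baseline average."""
    if len(readings) < baseline_count + recent_count:
        return False  # not enough history yet to judge a trend
    baseline = mean(readings[:baseline_count])   # earliest readings, assumed healthy
    recent = mean(readings[-recent_count:])      # latest readings
    return recent > alarm_ratio * baseline


# Hypothetical weekly bearing vibration readings in mm/s (illustrative only).
steady = [2.0, 2.1, 1.9, 2.0, 2.1, 2.0, 2.2, 2.1]
degrading = [2.0, 2.1, 1.9, 2.0, 2.1, 2.8, 3.4, 4.1]

print("steady machine needs attention:", needs_attention(steady))
print("degrading machine needs attention:", needs_attention(degrading))
```

The steady machine is left alone and the degrading machine is flagged so its repair can be scheduled. Averaging a few readings, rather than reacting to a single one, reduces the chance that one noisy measurement triggers unnecessary work.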

Random failure also includes a component of accumulated damage and stress. Say you were to drive your family car off-road once and treat it as you would an off-road vehicle. You run over boulders, go through streams, and rev the engine hard. How much time would pass before the car would fail? You cannot be sure. One time might not make a big difference to when it next failed. But if you took it off-road every weekend and treated it rough then how long before it would fail?

Again you could not be sure, but you could confidently say that because you have overstressed it and treated it in ways it was not designed to be used, the chance that it would fail sooner rather than later has increased. Your second maintenance strategy in the random failure zone is to treat equipment as it was designed to be treated. You must practice ‘precision operation’ and run it without stress or overload. If you overburden it and stress it you will definitely be causing it to fail sooner and more often! It may even require that the operators be trained in how to ‘love and care’ for their machines!

To extend equipment life between failures, also known as increasing the mean time between failures (MTBF), you must manage the environment in which the equipment works. For example, keep the oil in your car and machines clean by replacing it or cleaning it regularly, keep air circulating past electric motors clean, clean away dirt build-up on equipment, keep water and moisture away from oil and grease, and regularly replace used grease in roller bearings with fresh, clean grease. The third maintenance strategy to lower random failures is to religiously do the preventative maintenance and keep your equipment’s internal and local environment clean and fresh.
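MTBF is usually worked out as total operating time divided by the number of failures in that time. Here is a minimal sketch of the arithmetic; the function name and the hour and failure counts are hypothetical, chosen only to show the effect of halving the number of failures.

```python
def mean_time_between_failures(operating_hours, failure_count):
    """MTBF = total operating time divided by the number of failures."""
    if failure_count <= 0:
        raise ValueError("MTBF needs at least one recorded failure")
    return operating_hours / failure_count


# Hypothetical year of operation for the same machine, before and after
# its oil and grease were kept clean.
before = mean_time_between_failures(8_000, 4)
after = mean_time_between_failures(8_000, 2)

print(f"MTBF before: {before:.0f} h")
print(f"MTBF after:  {after:.0f} h")
```

With these made-up numbers, halving the failures doubles the MTBF from 2,000 to 4,000 hours. Tracking the figure over time shows whether your preventative maintenance is actually extending the time between failures.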

Since you will not be stripping your equipment down until there is evidence that it is needed, it becomes important that two things happen when it is stripped and rebuilt. The first is to check it for any other possible failure modes or impending problems, and the second is to ensure it is rebuilt to the manufacturer’s specifications and the highest quality.

This requirement for creative, watchful disassembly and precision quality when stripping and rebuilding is known as precision maintenance. Precision maintenance includes balancing rotating equipment, checking shaft alignment is within close tolerance, reinstalling the equipment without stress and testing bearing vibration after the rebuild to prove the machine is running perfectly. So your fourth strategy for the random failure zone is to practice and prove you have done precision maintenance on your equipment.

If during the random failure period of your equipment’s life you notice that the equipment is not failing randomly, but is in fact failing regularly and predictably, then you have a recurring failure. This can be good because now you can schedule a replacement before the failure happens, and that puts you in control of the situation. It also gives the opportunity to apply the fifth strategy for the random failure zone, which is to design out predictable failures. If you do not design them out they will continually recur at the most inconvenient times.
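One rough way to tell a recurring failure from a random one is to look at how much the running time between failures varies. The sketch below uses the coefficient of variation (standard deviation divided by mean): truly random, constant-rate failures give a value near 1, while regular failures give a value well below 1. The function name, the 0.3 cut-off and the failure histories are assumptions for illustration, not established limits.

```python
from statistics import mean, stdev


def looks_recurring(intervals, max_cv=0.3):
    """Return True when the times between failures are nearly regular."""
    if len(intervals) < 3:
        return False  # too few failures to judge
    cv = stdev(intervals) / mean(intervals)  # coefficient of variation
    return cv < max_cv


# Hypothetical operating hours between successive failures.
regular = [410, 395, 420, 405, 400]
scattered = [120, 900, 45, 610, 1500]

print("regular history looks recurring:", looks_recurring(regular))
print("scattered history looks recurring:", looks_recurring(scattered))
```

The regular history fails about every 400 hours, so a replacement can be scheduled ahead of it while the cause is designed out. The scattered history gives no such warning and is better handled by condition monitoring.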

Maintenance for Controlling Wear-Out Failures
When equipment is worn out its failure rate increases. It stops working more often than it used to. Over the years parts wear and don’t fit together properly, or components suffer cumulative stress and fatigue. Sometimes equipment frames become badly corroded or are so deformed from poor operating or maintenance practices that internal parts cannot keep their running tolerances. When wear-out zone failures occur, your maintenance options become limited.

Figure 5: Equipment wear-out failure maintenance zone strategies

Ideally your aim at the wear-out zone is to put the equipment back into the random failure phase or to reduce the rate of failure to levels you are willing to accept. Figure 5 shows the possible ways to combat the problems of equipment wearing-out.

By replacing only the worn parts it may be possible to extend the time to total replacement. But when new parts are run along with old parts you have only restarted the failure rate of the new parts. The old parts will continue to experience their own failure rate likelihood. This means you will gain some improvement in the equipment’s reliability, but the old component reliabilities will cause premature failures not experienced in a new machine.
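The effect of mixing new and old parts can be shown with a simple series model, in which the machine runs only while every part survives, so the machine’s reliability is the product of the part reliabilities. This is a sketch under the assumption that parts fail independently. The part names and survival figures are hypothetical, and the new-part figure ignores any infant mortality of the new part.

```python
import math


def system_reliability(part_reliabilities):
    """Series system: every part must survive for the machine to keep running."""
    return math.prod(part_reliabilities)


# Hypothetical chance that each part survives the next year of service.
worn_machine = {"bearing": 0.70, "seal": 0.80, "coupling": 0.85}
after_bearing_swap = {**worn_machine, "bearing": 0.98}   # only the worst part renewed
all_new = {"bearing": 0.98, "seal": 0.98, "coupling": 0.98}

for label, parts in [
    ("worn machine", worn_machine),
    ("bearing replaced", after_bearing_swap),
    ("all parts new", all_new),
]:
    print(f"{label}: {system_reliability(parts.values()):.3f}")
```

Replacing only the bearing lifts the machine from roughly 0.48 to roughly 0.67, but it stays well short of the roughly 0.94 of an all-new machine because the old seal and coupling still carry their own likelihood of failure.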

Maintenance strategies for the wear-out zone are typically replacement with new or a total rebuild to as-new condition. Replacement of worn parts with new will provide a temporary improvement in reliability.

But… if you put in a lot of new components and assemblies, they all start at the infant mortality stage with its associated higher failure rates!

To get a better understanding of the maintenance and operating strategies you can choose and how they work, read the article titled Equipment Maintenance Strategy 101.

Further Equipment Reliability Assistance

Please contact me if you would like more information or have any questions about this article. I can be contacted at the email address found on the ‘Contact Us’ page.

My best regards to you,

Mike Sondalini
Managing Director
Lifetime Reliability Solutions