The Time-Triggered Architecture explained

How to make computer systems safe?

Why do planes fly? What’s going on inside an autonomous vehicle? There are many technologies we take for granted, and we rarely think about what it takes for them to function safely. We often leave it up to engineers and scientists to figure these things out for us. But the way our computer systems work is fascinating, multifaceted, and more “human” than you might think. In this guide, we would like to introduce you to the kind of technology TTTECH pioneered: a way of making computer systems safer. It will not explain why planes fly, but it will at least make it clearer why they continue to do so. Let’s start at the beginning.

Systems need rules

Human life is governed by many rules. Some of these, like gravity, are imposed on us by nature. Some of them, we impose on ourselves to structure our society and make it easier to share resources. Think of a traffic intersection, where vehicles of different types and pedestrians come together. In human society, we have developed a system of traffic rules to determine who gets to use the shared resource “road” at which point. For example, we may install a traffic light at the intersection or use a zebra crossing. Everyone who participates in road traffic shares the same set of knowledge about the rules that apply, so that they know how to behave in the presence of a traffic light.

Our computing systems are no different because, after all, they were built by humans who simply transferred their intrinsic logic and rules from the physical to the virtual sphere. In a so-called distributed computer system, several computers share resources, such as network bandwidth or access to a sensor or actuator, much like humans share a road network. The way information is transmitted in a computer system follows a set of rules, just as road traffic does.

There are points at which the physical world and computer systems overlap, where hardware and software components interact to perform some tasks in the physical world. We call these cyber-physical systems. A washing machine is one example, a car is another.

Typically, the larger a cyber-physical system is, the more computers it contains. A washing machine may need only one computer, while a modern car contains well over a hundred, many of them networked with each other.

Getting the overall cyber-physical system to perform the expected tasks, e.g., automatically braking when the car identifies an obstacle, can be a real challenge, especially when multiple distributed computers need to communicate with each other to coordinate their work. Therefore, a set of rules is required for the interplay between the hardware and software components. We call this set of rules the “correctness-by-design” paradigm: when we build the distributed computer system following said set of rules, we can be assured that it works.

Time is on our side!

One correctness-by-design paradigm is the Time-Triggered Architecture, which was pioneered by TTTECH. It systematically exploits the nature of time. Take trains as an example: trains run on a schedule, which means that we plan in advance which train will use which track at which time. Similarly, we schedule the points in time at which actions in a distributed computer system are supposed to occur. When we then execute this plan, we can be sure that all the elements in the system do what they must do when they are expected to do it. Calculating such schedules can be computationally intensive, but this is not a problem because it happens during the system’s development, not during operation. While the system is operating, identifying the next action is trivial: the system simply looks it up in the schedule. Different cyber-physical systems have different characteristics, and some may time-schedule only a few critical actions, while other systems may schedule most actions.
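As a minimal sketch of this idea, a precomputed schedule can be stored as a sorted table of offsets within a repeating cycle, so that finding the next action at runtime is a simple lookup. The cycle length, the schedule entries, and the `next_action` helper are hypothetical and chosen for illustration only; they do not describe any particular product.

```python
from bisect import bisect_left

# Hypothetical schedule, computed offline during development:
# (offset in microseconds within the cycle, action to perform).
CYCLE_US = 1000
SCHEDULE = [
    (0, "sensor: send obstacle data"),
    (250, "controller: compute braking command"),
    (600, "actuator: apply brakes"),
]
OFFSETS = [offset for offset, _ in SCHEDULE]


def next_action(now_us):
    """Return (start time, action) of the first action scheduled at or after now_us."""
    phase = now_us % CYCLE_US          # position inside the current cycle
    cycle_start = now_us - phase
    i = bisect_left(OFFSETS, phase)    # first slot whose offset is >= phase
    if i == len(SCHEDULE):             # past the last slot: wrap to the next cycle
        return cycle_start + CYCLE_US + OFFSETS[0], SCHEDULE[0][1]
    return cycle_start + OFFSETS[i], SCHEDULE[i][1]
```

For instance, `next_action(100)` returns the controller action at time 250, and `next_action(700)` wraps around to the sensor action at time 1000. All the hard work (deciding the offsets) is assumed to have happened before the system runs.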

Through clock synchronization, we give every node in the network the same perception of time by continually resynchronizing their clocks.

In a cyber-physical system, we also need to consider clock drift. This is a physical phenomenon that means all clocks eventually begin to drift apart and show different times. No two clocks will keep the same time unless resynchronized regularly. This isn’t usually a problem in our everyday interactions, because the drift is so tiny that we barely perceive it until one day we notice our watch is suddenly one minute behind. In cyber-physical systems, however, data transfers must be timed with microsecond precision. To follow the schedule, the components of our network need a clock. If each computer in the system has its own clock, the clocks will inevitably drift apart, eventually leading to system failure.
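A rough back-of-the-envelope calculation shows why regular resynchronization matters. The sketch below assumes two clocks that start perfectly aligned and drift in opposite directions at the same bounded rate; the 50 ppm drift rate and the 1 microsecond precision are illustrative assumptions, not figures from this guide, and the function name is hypothetical.

```python
def max_resync_interval_s(precision_s, drift_ppm):
    """Worst-case time until two free-running clocks differ by more than precision_s.

    Assumes each clock drifts by at most drift_ppm parts per million, so two
    clocks drifting in opposite directions diverge at twice that rate.
    """
    drift_rate = drift_ppm * 1e-6
    return precision_s / (2 * drift_rate)


# Illustrative values: 1 microsecond precision, 50 ppm oscillators.
interval = max_resync_interval_s(precision_s=1e-6, drift_ppm=50)
```

Under these assumptions, the clocks would have to be resynchronized roughly every 10 milliseconds to stay within one microsecond of each other. A real design would also have to budget for the error of the synchronization step itself.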

How can we prevent this? Clock synchronization: we must give everyone in the network the same idea of time and continually synchronize all clocks, even in the event that some clocks, like any other physical component, fail. What sounds easy enough on paper is extremely challenging in a distributed system. Components in, say, a plane come from hundreds of different vendors, each with its own material properties and software.
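One classic technique from the research literature for synchronizing clocks despite faulty ones is the fault-tolerant average: each node measures how far the other clocks are from its own, discards the most extreme readings, and corrects its clock by the average of the rest. The sketch below illustrates that general idea only; it is not claimed to be the mechanism of any specific standard or product, and the function name and sample values are hypothetical.

```python
def fault_tolerant_average(offsets_us, k):
    """Average the measured clock offsets after discarding the k largest and k smallest.

    offsets_us: offsets (in microseconds) of the other clocks relative to ours.
    k: number of faulty clocks whose readings should not be able to skew the result.
    """
    if len(offsets_us) <= 2 * k:
        raise ValueError("need more than 2*k readings to discard k from each end")
    trimmed = sorted(offsets_us)[k:len(offsets_us) - k]
    return sum(trimmed) / len(trimmed)


# One clock has failed and reports a wildly wrong offset of 5000 microseconds.
correction = fault_tolerant_average([-2, 1, 3, 0, 5000], k=1)
```

Here the faulty reading of 5000 is discarded along with the smallest reading, so the correction is the average of 0, 1, and 3 rather than a value dominated by the broken clock. Analyses of this kind of algorithm commonly require at least 3k + 1 clocks to tolerate k arbitrarily faulty ones, which is stricter than the simple length check in this sketch.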

Rules can be broken

The element of time is thus introduced into our system as a set of rules laid out in the Time-Triggered Architecture. However, as you know, rules are violated all the time. You have surely seen someone cross a street at a red light. Again, the same is true for distributed computer systems. There are countless reasons a system might fail. The question is, which failures can be tolerated, and which can’t? If somebody crosses at a red light when there are no cars in sight, that is a rule breach we can accept. If they cross in busy traffic, we may have a potential safety incident on our hands. Therefore, rules in our society are regularly enforced, for example by the police. And again, the concept of policing is mirrored in our computer systems where special mechanisms make sure the rules are followed by everyone in a network.

Fault tolerance

Some systems, like those found in cars, planes, or our energy supply, must be safer than others. To make absolutely certain a system remains safe, we introduce additional rules that determine how the system behaves even if individual components fail. We call such systems fault-tolerant systems. In our traffic example, we could make a crossing even safer by adding rules that also account for human misbehavior: if we wanted to make sure nobody ever crosses the road at a red light, a traffic light alone probably wouldn’t be enough. We’d also have to install a physical barrier, like a fence that goes up whenever the light is red.

How far we take this also depends on whether the system is critical or non-critical. Critical systems are systems that would pose a large risk if failures were not handled adequately. In a plane, for example, the systems that keep the plane flying are safety-critical: if they failed, the plane would crash. The onboard entertainment system, on the other hand, would be seen as non-critical. If any critical part of a plane fails, the plane must keep operating at least until it has landed safely. We call these types of applications fail-operational: any failure that could affect the safety-critical systems must be mitigated. A faulty robot arm in an industrial plant, on the other hand, can in many cases be stopped until somebody comes to repair it. We call this principle fail-safe: in case of failure, the system goes to a safe state.
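The difference between the two principles can be sketched in a few lines. The example below assumes one common way of achieving fail-operational behavior, namely running redundant replicas and taking a majority vote, and a plausibility check for the fail-safe case. All names (`vote`, `fail_operational`, `fail_safe`, `SAFE_STATE`) and the commands are hypothetical illustrations, not a description of a specific system.

```python
from collections import Counter

SAFE_STATE = "stop"  # hypothetical safe command for a fail-safe system


def vote(replica_outputs):
    """Return the value a strict majority of replicas agree on, or None if there is none."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    return value if count > len(replica_outputs) / 2 else None


def fail_operational(replica_outputs):
    """Keep operating on the majority value; one wrong replica out of three is masked."""
    result = vote(replica_outputs)
    if result is None:
        raise RuntimeError("no majority: more faults than this design tolerates")
    return result


def fail_safe(output, is_plausible):
    """Pass the command through if it passes the check, otherwise go to the safe state."""
    return output if is_plausible(output) else SAFE_STATE
```

With three replicas reporting `["brake", "brake", "coast"]`, `fail_operational` still returns `"brake"`, so the function continues despite one faulty replica. `fail_safe`, by contrast, gives up the function and returns `"stop"` as soon as a command fails its check.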

Because the Time-Triggered Architecture predefines which actions are supposed to occur at which times, it is simple to identify when actions do not occur as specified. Furthermore, fault-tolerance mechanisms can also be preplanned. The Time-Triggered Architecture is thus an ideal fit for building fault-tolerant systems.
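To illustrate why a predefined schedule makes faults easy to spot, the sketch below compares the planned times with the times at which actions were actually observed and reports everything that is missing or too far off. The schedule, the observation format, the tolerance, and the `missed_actions` helper are all assumptions made for this example.

```python
def missed_actions(schedule, observed, tolerance_us):
    """Return scheduled actions not observed within tolerance_us of their planned time.

    schedule: list of (planned time in microseconds, action name).
    observed: dict mapping action name to the time it was actually seen.
    """
    missed = []
    for planned_us, action in schedule:
        seen_us = observed.get(action)
        if seen_us is None or abs(seen_us - planned_us) > tolerance_us:
            missed.append(action)
    return missed


schedule = [(0, "sensor"), (250, "controller"), (600, "actuator")]
observed = {"sensor": 2, "controller": 320}  # controller is late, actuator is silent
faults = missed_actions(schedule, observed, tolerance_us=10)
```

In this run, `faults` is `["controller", "actuator"]`: the controller acted 70 microseconds late and the actuator never acted at all. Without a schedule, there would be no reference against which either deviation could be detected.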

In applications like planes or cars, hundreds of computer systems have to work together.

Let’s recap

In a cyber-physical system hardware and software intersect in extremely complicated ways. Every system is different, and depending on its function in the real world, the impact of failure is more or less dramatic. To build a fault-tolerant system we must therefore know three things:

What the system is for – its purpose
Which rules are required to make it fulfill its purpose
How rules are enforced – eliminate failures by design and tolerate failures without compromising the system operation

What are the benefits for business?

We have discussed safety in depth as one major benefit of applying a Time-Triggered Architecture. We all benefit from a plane not crashing, but there are additional benefits, of course.

The correctness-by-design paradigm sets rules for the interplay between the hardware and software components. This minimizes test efforts because new components can be integrated into the system while still producing exactly the expected results. In the automotive industry, for example, this can decrease the number of recalls, because the system’s intrinsic set of rules continues to hold even when elements are added to it, leading to cost savings and happier customers.

Applying our solution to the design of new systems, or integrating it into legacy systems, requires a series of customized steps. As we have shown above – all systems are different. We apply the Time-Triggered Architecture to a concrete system by:

finding correct instantiations of the rules (system design and configuration) and
applying these rules (realization of the system / embedded).

Our basic ruleset is based on decades of research and has been validated during more than two billion cumulative flight hours in airplanes, two million kilometers covered in deep space, as well as through rigorous reviews performed by certification bodies, academics, and scientific partners. We refine it each time and combine it with our experience to adapt it to each new system. The core methodology is well established, but there is no one-size-fits-all solution. That’s why we work together with our customers and partners in customizing the solution to their precise needs.

To make sure the solutions we develop work in each of our use cases, such as automotive, aerospace, or space, we subject them to very strict procedures that are specific to each application and industry. Our employees are among the top researchers in various interdisciplinary areas (like real-time systems, networking, or fault tolerance). We follow strict processes to fulfill industry-dependent certification requirements and partner with academia to advance our methods beyond the current state of the art, which is illustrated by the large number of high-quality publications and registered patents coming out of TTTECH. Our engineers follow top industry standards to guarantee the highest quality of development.

Go deeper?

Now that you have understood TTTECH’s core technology, why not go deeper?

Technologies

Time-Triggered Ethernet

A fault-tolerant clock synchronization mechanism (SAE AS6802) is used to synchronize non-faulty networking components even in the presence of faulty clocks. This provides Guarantee of Service for safety systems, even in the presence of faults (fail operationality). Messages are forwarded in an extremely precise way, down to the individual packet. This enables granular control over time-scheduled traffic.

Time-Sensitive Networking (IEEE TSN)

Time-Sensitive Networking (TSN) is a set of IEEE 802 Ethernet sub-standards that are defined by the IEEE TSN task group. These standards enable fully deterministic real-time communication over Ethernet. TSN achieves determinism over Ethernet by using a global sense of time and a schedule which is shared between network components.
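A shared schedule of this kind can be pictured as a cyclic list of time windows, each of which opens the “gate” for certain classes of traffic. The sketch below is loosely modeled on the gate control list concept used for scheduled traffic in TSN; the cycle length, window lengths, traffic class names, and the `open_gates` helper are hypothetical and greatly simplified.

```python
CYCLE_NS = 1_000_000  # hypothetical 1 ms cycle, repeated forever

# (window length in nanoseconds, traffic classes allowed to transmit in that window)
GATE_CONTROL_LIST = [
    (200_000, {"time_critical"}),
    (800_000, {"best_effort"}),
]


def open_gates(now_ns):
    """Return the set of traffic classes allowed to transmit at synchronized time now_ns."""
    phase = now_ns % CYCLE_NS  # position inside the current cycle
    window_end = 0
    for length_ns, classes in GATE_CONTROL_LIST:
        window_end += length_ns
        if phase < window_end:
            return classes
    return set()  # windows shorter than the cycle: everything closed (sketch assumption)
```

Because every component evaluates the same list against the same synchronized clock, time-critical frames get the first 200 microseconds of each cycle to themselves and cannot be delayed by best-effort traffic.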