The Civil Aviation Department (CAD) must reboot the workstations of the new air traffic management system (ATMS) every two weeks to prevent sluggish system operations, and must check its sub-systems manually every hour, FactWire can reveal.
One of the ATMS sub-systems, the Tower Electronic Flight Strips (TEFS) system, has required an hourly manual check-up since a system failure at the end of last year. The server has to be rebooted whenever its memory usage exceeds a certain amount, yet a malfunction still occurred in early May this year.
Front-line air traffic controllers said problems such as so-called ghost targets and other false alarms were still occurring despite a software update in March, and at least two cases of ‘loss of separation’ between aircraft had occurred.
One systems engineering specialist said that a system or server that required such a manual reboot was ‘unreasonable’, and inferred that there was a ‘problem in the algorithm’ of the new ATMS.
The new ATMS workstations integrate aviation, surveillance, detection and communication data and display the situation within the region so that air traffic controllers can manage flights. FactWire has acquired one of the ‘Bi-weekly Workstation Restart Schedules for ATMS Workstations, CCWS and TCWS’ prepared by the CAD for the maintenance staff. The schedule sets out a rotating restart every two weeks for a total of 53 workstations at the new Air Traffic Control Centre and Control Tower. Some have been assigned a weekly restart. Restarts take place between midnight and 7am Hong Kong time, the least busy period. Three workstations at the Centre and one to two at the Control Tower have to be restarted each day. There are around 70 new ATMS workstations in total across the Air Traffic Control Centre, Control Tower and Backup Centre and Tower, according to sources.
Sources stated that the measures were implemented this year to prevent sluggish operations or a ‘system crash’, adding that ‘this type of measure is normally applicable only to systems that are very old and always malfunction, so it is shameful that the new ATMS has to rely on this to operate smoothly’.
Professor Anthony So Man-cho, Assistant Dean (Student Affairs) of the Faculty of Engineering at the Chinese University of Hong Kong, is a systems design and algorithm optimisation specialist. He said he had never heard of the old ATMS or any overseas system employing such regular restart measures.
He said of the reboots: ‘I think it isn’t very reasonable… the measures are quite backward, requiring a person to stand by and reboot the machine at a specific time. Everyone knows that even computers at home do not require constant reboots, and you only have to close and open the lid to use a laptop.
‘The air traffic control system is a massive system which should remain on standby, so constant reboots would raise eyebrows. Why does it need to be done? Moreover, since the system is so expensive, does it have to be done in manual, backward ways?’
FactWire also found that manual ‘health checks’ have to be carried out by the maintenance staff hourly for two sub-systems incorporated into the new ATMS, namely the TEFS System and the Arrival Manager System (AMAN). The ‘Guidelines for ATMS Tower EFS/ITF Server Reboot’ indicates that if the staff discover during a routine check that the ‘JVM memory heap used exceeds 2.5 Gbytes’, ‘the server has no response’ or there is a ‘failure indication on the client screen’, they must report it to management and reboot the server, which takes around 15 minutes.
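The three reboot conditions quoted from the guidelines amount to a simple decision rule. The Python sketch below is illustrative only: the names `HealthCheck` and `reboot_reasons` are hypothetical, the reading of 2.5 Gbytes as 2.5 × 1024³ bytes is an assumption, and nothing here is taken from the CAD's or the supplier's actual software.

```python
from dataclasses import dataclass

# Threshold quoted in the guidelines. Treating "2.5 Gbytes" as
# 2.5 * 1024**3 bytes is an assumption made for this sketch.
HEAP_LIMIT_BYTES = int(2.5 * 1024**3)


@dataclass
class HealthCheck:
    """One hourly observation of the server (hypothetical structure)."""
    heap_used_bytes: int
    server_responding: bool
    client_failure_shown: bool


def reboot_reasons(check: HealthCheck) -> list[str]:
    """Return every guideline condition that the observation triggers.

    An empty list means no reboot is called for by these three rules.
    """
    reasons = []
    if check.heap_used_bytes > HEAP_LIMIT_BYTES:
        reasons.append("JVM heap above 2.5 GB")
    if not check.server_responding:
        reasons.append("server not responding")
    if check.client_failure_shown:
        reasons.append("failure indication on client screen")
    return reasons


# Example: a responsive server whose heap has grown to 3 GB.
print(reboot_reasons(HealthCheck(3 * 1024**3, True, False)))
```

A rule this simple could in principle be evaluated automatically rather than by a person every hour, which is the point the specialists quoted in this report raise.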
The CAD responded to FactWire’s enquiries and admitted that it would carry out a ‘regular maintenance procedure’ based on international cases, recommendations by contractors and actual operations, to ensure the safe and smooth running of the system. The procedure includes ‘closely monitoring the performances and functions of every sub-system (including the TEFS and AMAN system), and carrying out all types of inspection and maintenance work regularly for every workstation at the new Air Traffic Control Centre and Control Tower (including a regular restart for workstations and sub-systems)’.
However, the CAD did not directly say whether it rebooted the workstations every two weeks or checked the sub-systems manually every hour, nor did it say when the practice began or give the reasons behind it.
The CAD said that flights and aviation safety would not be affected by the procedure, and that ‘the old ATMS and different brands of air traffic control systems around the world also need to carry out regular maintenance procedures’.
However, the hourly checks do not seem to prevent hiccups. The TEFS system, which provides flight plan data for departing and arriving flights, malfunctioned between 5:40am and 7:30am on May 2, failing to process some data. Alarms sounded twice at the workstations of several front-line air traffic controllers, and the workstations’ touch-screen monitors became unresponsive.
The technical support staff then discovered that the TEFS system switched from its main server to the backup server at around 5:30am, and the unresponsive workstations restarted at around 6am. The system server was rebooted at around 7am, resuming service at 7:30am. The CAD did not announce this incident to the public.
The TEFS system also went wrong on December 18 last year, failing to process some flight plan data. The CAD told the media then that the supplier was Frequentis from Austria and not US-based Raytheon Company, and that ‘situations would sometimes occur regardless of an independent or integrated new ATMS’ and ‘it could be dealt with through a system restart, which would not affect aviation safety’.
Sources told FactWire that the TEFS system had functioned normally when the old ATMS was still in use. Problems such as malfunctioning touch-screen monitors or sluggish system operation were recent, occurring only after the new ATMS went into operation, they said. All technical problems have been followed up by Raytheon, and the CAD also implemented the hourly server memory check after the hiccups last year.
The CAD apparently issued an urgent notice to Raytheon after the May 2 incident, demanding swift action, including an in-depth investigation into whether it was related to last year’s incident. The CAD demanded solutions to ‘reduce the negative image brought to the CAD and Raytheon’.
The AMAN system, which provides the order for arriving flights, also requires an hourly ‘health check’. The system failed three times, on November 18 last year and on January 2 and February 12 this year, when some flight orders could not be shown on screen and air traffic control officers had to handle the landing sequences manually. The CAD replied to media enquiries in April saying the air traffic control staff could definitely deal with these situations by attempting to reboot the systems.
Professor So said the sub-systems and the ATMS needed to exchange and assimilate each other’s data, so there could be a problem in the ATMS algorithm for merging data.
He said: ‘This situation is like making calculations on the whiteboard. When the board is filled, I won’t have enough space to draw formulas for the next question. I’ll therefore have to clean it up, but the system cannot recognize used memory and clear it up for more space, so it requires manual labour to reset the memory. This is possible, but should a good system work like this? The answer is never. A good system should know how to sort and release used memory at a specific time.’
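The whiteboard analogy can be illustrated with a toy Python sketch. It is purely illustrative and says nothing about how the ATMS or its sub-systems are actually written: the class names `UnboundedLog` and `BoundedLog` are hypothetical, and a fixed-size buffer is only one of several ways a program can release memory it no longer needs.

```python
from collections import deque


class UnboundedLog:
    """Keeps every processed record forever, so memory use only grows."""

    def __init__(self):
        self.records = []

    def process(self, record):
        self.records.append(record)


class BoundedLog:
    """Keeps only the most recent `capacity` records; older ones are dropped."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item when it is full.
        self.records = deque(maxlen=capacity)

    def process(self, record):
        self.records.append(record)


leaky, tidy = UnboundedLog(), BoundedLog(capacity=100)
for i in range(10_000):
    leaky.process(i)
    tidy.process(i)

# The first store has kept everything; the second has wiped its own board.
print(len(leaky.records), len(tidy.records))  # 10000 100
```

The first store is the whiteboard that is never wiped and eventually needs someone to clear it by hand; the second clears old work by itself.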
FactWire reported on February 10 this year on the problems since the new ATMS was implemented, including ‘ghost targets’, ‘target drops’ and ‘split tracks’. At the end of last year, Raytheon had admitted during a presentation to the CAD that the system’s algorithm was lacking and promised to upgrade the software. The CAD issued a press release that night and did not deny FactWire’s report. On April 3, an expert panel set up by the CAD published an interim report on ‘teething issues’ arising from the commissioning of the new ATMS. At that time, Director-General of Civil Aviation Simon Li Tin-chui stated that new software supplied by Raytheon had been installed on March 20, and the problems were ‘almost all solved’.
But CAD staff have told FactWire that there has been no apparent improvement and that in fact there have been two incidents of ‘loss of separation’. This occurs when flights fail to maintain the correct distance between them as specified by the International Civil Aviation Organisation (ICAO). The CAD has not made this information public (See table).
For the aircraft in the two incidents, the horizontal distance had to be at least five nautical miles (NM) and the vertical distance at least 1,000ft. On April 21, about 50NM southeast of Hong Kong International Airport, VN578 from Hanoi to Taipei and 3U8845 from Sanya in Hainan to Nanjing were both at an altitude of 31,000ft, with horizontal and vertical distances of 3.5NM and 700ft respectively. And on May 10, about 120NM east of Hong Kong airport, CX50 from Hong Kong bound for Shanghai and KE313 from Seoul bound for Hong Kong were both at an altitude of 29,000ft, with horizontal and vertical distances of only 3.2NM and 500ft respectively.
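The figures for the two incidents can be checked against the stated minima with a few lines of Python. This is a sketch under one assumption not spelled out above: that separation counts as lost only when the horizontal and the vertical distance are both below their minimum, which is the usual air traffic control convention. The function name `loss_of_separation` is hypothetical.

```python
# Minima stated in this report for the aircraft in the two incidents.
H_MIN_NM = 5.0     # horizontal minimum, nautical miles
V_MIN_FT = 1000.0  # vertical minimum, feet


def loss_of_separation(horizontal_nm, vertical_ft):
    """True when both distances are below their minimum (assumed convention)."""
    return horizontal_nm < H_MIN_NM and vertical_ft < V_MIN_FT


# Distances as reported for each incident: (horizontal NM, vertical ft).
incidents = {
    "April 21 (VN578 / 3U8845)": (3.5, 700),
    "May 10 (CX50 / KE313)": (3.2, 500),
}

for label, (h, v) in incidents.items():
    print(label, loss_of_separation(h, v))
```

Both reported pairs fall below both minima, by 1.5NM and 300ft on April 21 and by 1.8NM and 500ft on May 10.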
Professor So worries that the algorithm used by the new system may not be suitable.
‘The new ATMS may have employed an algorithm or a series of algorithms … to conduct data merge. If a bug fix or procedure amendment is carried out under the same algorithmic framework, it merely rectifies the mistake within the framework and is a small adjustment. The more fundamental problem could be whether this algorithm … is fit for this kind of work.’
Professor So pointed out that the algorithm was the core of the whole data processing system, so the problems could hardly be described as ‘teething problems’ if the algorithm was to blame. ‘If the root cause is the algorithm, or it is inadequate and requires an upgrade, this concerns the upgrade of the whole system and … equates to almost purchasing a new system’, he said.
FactWire’s report on February 10 revealed at least 80 cases of malfunction in radar detection since the implementation of the new ATMS in November last year, including 30 false alarms. Raytheon has admitted that the problems were related to the ‘algorithm’ of the system. At least six safety incidents of ‘loss of separation’ occurred in January this year, a frequency ‘rarely found in years’.
The interim report released on April 3 said the new system had been providing ‘safe, reliable and generally smooth air traffic services’, and the cases of ‘loss of separation’ in January were ‘minor in nature’.