On June 26th at 2:57pm EDT, our Network Operations Center began receiving system alerts, which prompted the Platform Engineering and Platform Operations teams to begin investigating. After anomalies were confirmed at 3:10pm EDT, we began posting notices while we carried out a deeper analysis.
This was a complex problem to identify. We investigated the Hardware Security Module (HSM) proxies, the storage system, active directory, and the network. Because each of these systems was displaying anomalies, each was isolated, failed over, and restarted, but without success, as the root cause was elsewhere.
As we looked more closely at these systems, we discovered duplicate IP addresses on the network. These were the result of human error during the deployment of our new dispersed memory-resident auto-replicating global directory. A virtual control server was serving erroneous IP addresses, which caused network collisions and prevented the HSM proxies, storage system, and other components from behaving predictably. Services were restored at 5:49pm EDT.
No documents or other NetDocuments objects were damaged. This issue affected the US region only; the European and Australian regions were not affected. Users in the US region were unable to open documents because the HSM proxies were affected by the erroneous IP addresses. Documents in emergency backup locations such as the echo location, ndOffice echo, ndMirror, and ndSync were not affected and remained available for users to access.
To prevent similar incidents in the future, we have deployed code to production that detects duplicate IP addresses. New servers are now automatically blocked from joining the network if a duplicate IP address is identified. As a collateral benefit, we are also fine-tuning the emergency processes and vendor responsiveness for each sub-system that was investigated unnecessarily in this incident, including the HSM proxies, storage, and network systems.
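For readers curious about what a duplicate-address check can look like, the following is a minimal illustrative sketch in Python. It is not the code we deployed. The function names, host names, and the in-memory list of (host, IP) assignments are all hypothetical, chosen only to show the idea of refusing a new server whose address is already in use.

```python
import ipaddress
from collections import defaultdict


def find_duplicate_ips(assignments):
    """Return {ip: [hosts]} for every address claimed by more than one host.

    `assignments` is a hypothetical list of (host_name, ip_string) pairs.
    """
    hosts_by_ip = defaultdict(list)
    for host, ip in assignments:
        # Parse the string so equivalent spellings compare as equal.
        hosts_by_ip[ipaddress.ip_address(ip)].append(host)
    return {
        str(ip): hosts
        for ip, hosts in hosts_by_ip.items()
        if len(hosts) > 1
    }


def may_join_network(new_host, new_ip, assignments):
    """Return True only if no other host already holds `new_ip`."""
    candidate = ipaddress.ip_address(new_ip)
    return all(
        host == new_host or ipaddress.ip_address(ip) != candidate
        for host, ip in assignments
    )


# Hypothetical inventory used purely for illustration.
inventory = [
    ("hsm-proxy-1", "10.0.0.5"),
    ("storage-1", "10.0.0.6"),
]

allowed = may_join_network("new-server", "10.0.0.7", inventory)
blocked = may_join_network("new-server", "10.0.0.5", inventory)
conflicts = find_duplicate_ips(inventory + [("new-server", "10.0.0.5")])
```

In this sketch, a server requesting an unused address is allowed, while one requesting an address already held by another host is rejected before it can cause a collision. A production check would also need to consult the live network rather than a static list.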
NetDocuments takes every incident very seriously. Although we have made measurable improvements in system reliability in the recent past, especially in the area of system self-remediation, this particular issue was new and without precedent. We are keenly aware of the trouble this caused and are very grateful for your support.