Service Issue Reported

Incident Report for NetDocuments US

Postmortem

On June 26th at 2:57pm EDT, our Network Operations Center began receiving system alerts, which prompted the Platform Engineering and Platform Operations teams to start investigating. After the anomalies were confirmed at 3:10pm EDT, we began posting notices while we carried out deeper analysis.

This was a complex problem to identify. We investigated the Hardware Security Module (HSM) proxies, the storage system, Active Directory, and the network. Because each of these systems was displaying anomalies, each was isolated, failed over, and restarted, but without success, as the root cause was elsewhere.

As we looked more deeply into these systems, we discovered duplicate IP addresses on the network, the result of human error during the deployment of our new dispersed memory-resident auto-replicating global directory. A virtual control server was serving erroneous IP addresses, which caused network collisions and prevented the HSM proxies, storage system, and other components from behaving predictably. Services were restored at 5:49pm EDT.

No documents or other NetDocuments objects were damaged. This issue affected the US region only; the European and Australian regions were not affected. Users in the US region were unable to open documents because the HSM proxies were affected by the wrong IP addresses. Documents in emergency backup locations such as the echo location, ndOffice echo, ndMirror, and ndSync were not affected and remained available for users to access.

To prevent similar incidents in the future, we have deployed code to production that detects duplicate IP addresses. New servers will now be automatically stopped from entering the network if a duplicate IP address is identified. As a collateral benefit, we are also fine-tuning the emergency processes and vendor responsiveness for each sub-system that was unnecessarily investigated during this incident, including the HSM proxies, storage, and network systems.
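
For readers who want a concrete picture of this kind of safeguard, the following is a minimal Python sketch of a duplicate IP check. It is an illustration only and not the code deployed in production: the function names, the host names, and the idea of a simple host-to-address inventory are all assumptions made for the example.

```python
import ipaddress
from collections import defaultdict


def find_duplicate_ips(assignments):
    """Return {ip: [hosts]} for every address assigned to more than one host.

    `assignments` is a hypothetical inventory mapping host name -> IP string.
    """
    hosts_by_ip = defaultdict(list)
    for host, ip in assignments.items():
        # Parse the address so equivalent spellings compare equal.
        hosts_by_ip[ipaddress.ip_address(ip)].append(host)
    return {
        str(ip): sorted(hosts)
        for ip, hosts in hosts_by_ip.items()
        if len(hosts) > 1
    }


def may_join(assignments, new_host, new_ip):
    """Return True only if no other host already holds `new_ip`."""
    candidate = ipaddress.ip_address(new_ip)
    return all(
        ipaddress.ip_address(ip) != candidate
        for host, ip in assignments.items()
        if host != new_host
    )


if __name__ == "__main__":
    # Hypothetical inventory; the host names are invented for illustration.
    inventory = {
        "hsm-proxy-1": "10.0.0.5",
        "storage-1": "10.0.0.6",
        "directory-1": "10.0.0.5",
    }
    print(find_duplicate_ips(inventory))
    print(may_join(inventory, "new-server", "10.0.0.6"))
```

In this sketch, a provisioning step would call may_join before a new server is admitted and refuse the server when it returns False, while find_duplicate_ips could be run periodically to audit addresses already in use.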

NetDocuments takes every incident very seriously. Although we have made measurable improvements in system reliability in the recent past, especially in the area of system self-remediation, this particular issue was new and without precedent. We are keenly aware of the trouble this caused and are very grateful for your support.

Posted Jun 27, 2018 - 13:00 EDT

Resolved

Our engineering team has restored all service functionality. We will continue to monitor the service. We will discontinue scheduled updates at this time and provide relevant updates should any additional unanticipated issues arise. If you continue to experience performance issues, please contact Technical Support. We will conduct a postmortem of this event, which will be posted within 48 hours. Thank you for your patience and understanding.

Posted Jun 26, 2018 - 17:48 EDT

Update

Our engineering team continues to investigate this service disruption. We are actively working to mitigate this issue and will provide an additional update within 15 minutes.

Posted Jun 26, 2018 - 15:41 EDT

Identified

We are aware of a service-impacting issue in which customers are receiving an error when interacting with documents. Our engineering team is investigating and will provide updates within 15 minutes.

Posted Jun 26, 2018 - 14:57 EDT

This incident affected: Platform.