Service Issue

Incident Report for NetDocuments US

Postmortem

This past weekend (4/27/2019), NetDocuments held a scheduled maintenance window for the US Service. Over the past year, we have been implementing a Couchbase Directory Service as part of ongoing updates to our global infrastructure. During this maintenance window, our engineering teams migrated Repositories, Groups, Cabinets and User Data to the new Couchbase Service. The same work had already been completed in both our Australian and EU-based data centers. The US migration completed successfully, and the US Service was in a normal state following the updates.

At approximately 9:50am EDT on Monday, April 29th, we began to see performance issues with the US Service. Our teams were immediately alerted to the issue and began their investigations, guided by Couchbase Support. The Couchbase Directory capacity was increased twice between 10:15am EDT and 12:15pm EDT based upon input from the Couchbase Support Teams. During this same time, a code review was also undertaken in order to vet the updates that had been put into place over the weekend.

At approximately 1:30pm EDT, a potential code issue was discovered: a query was running more frequently than it does in normal operation. After review, we determined that a code-based update could be developed and safely deployed to reduce the frequency of the query. At approximately 3:10pm EDT, the engineering team began deploying the patch across the server pools. Performance began to improve as the patch was installed. The Service returned to a normal state and the issue was considered resolved.

No data or inter-process anomalies were identified, and we maintained our change control procedures throughout. The Couchbase Directory Service deployment will increase service scalability and availability. It is part of our continuous drive toward software-based datacenters, which has included the successful migration to object store with erasure coding, HSM-based cryptography, hyperconverged technology, and the Solr platform. Once the Couchbase service was optimized and fully running in the US Service, our metrics showed measurable and significant improvements in service scalability. To prevent future issues, we will increase the verbosity of our internal logs to further improve our ability to detect even very minor changes in query behavior. We apologize for this incident.
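
As an illustration only, and not a description of NetDocuments' actual tooling, the kind of check that more detailed query logging makes possible can be sketched as follows. The sketch assumes each logged query execution has a timestamp in seconds, buckets executions into fixed windows, and flags a window whose count sits well above a historical baseline. The function names, the 60-second window, and the threshold of 3 standard deviations are all hypothetical choices.

```python
from collections import Counter
from statistics import mean, pstdev


def count_per_window(timestamps, window_seconds=60):
    """Count query executions in each fixed-size time window.

    Windows with no executions between the first and last one are
    reported as 0 so that quiet periods are not silently skipped.
    """
    counts = Counter(int(ts // window_seconds) for ts in timestamps)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    return [counts.get(window, 0) for window in range(first, last + 1)]


def is_rate_anomalous(baseline_counts, current_count, threshold=3.0):
    """Return True if current_count is far above the baseline.

    "Far above" means more than `threshold` standard deviations over
    the baseline mean. The deviation is floored at 1.0 (an assumption)
    so a perfectly flat baseline does not flag tiny fluctuations.
    """
    mu = mean(baseline_counts)
    sigma = max(pstdev(baseline_counts), 1.0)
    return (current_count - mu) / sigma > threshold


# Hypothetical per-minute counts for one query during normal operation.
baseline = [10, 12, 11, 9, 10]
print(is_rate_anomalous(baseline, 12))  # within normal variation
print(is_rate_anomalous(baseline, 50))  # well above the baseline
```

A check like this only reports that a query's frequency has changed; deciding whether the change is a defect still requires the kind of code review described above.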

Posted May 01, 2019 - 15:31 EDT

Resolved

Our engineering team has implemented a fix to resolve the performance issue. If you continue to experience issues, please open a ticket with our Support team for further assistance. We will conduct a postmortem of this event, which will be posted within 48 hours.

We prefer that you go to support.netdocuments.com to submit a support request for your issue, so we can manage it more efficiently and provide you with access to your cases.

However, if you need to contact Technical Support by phone, use the following numbers:

US customers call 801-226-6882 or 866-NETDOCS (638-3627)
EU customers call +44(0) 2034.556770
AU customers call +61 2 8310 4319

Posted Apr 29, 2019 - 15:50 EDT

Update

We are seeing performance improvements within the Service. We will continue to update this post as required.

Posted Apr 29, 2019 - 15:29 EDT

Update

Our engineering team is in the process of updating the Service in order to rectify the current performance issues. We will continue to update this posting as that process is completed.

Posted Apr 29, 2019 - 15:17 EDT

Update

We are continuing to monitor for any further issues.

Posted Apr 29, 2019 - 15:11 EDT

Update

The performance issues continue and we are working to resolve them as quickly as possible. We understand the impact this has had on our customers. We anticipate a resolution shortly. Thank you for your patience.

Posted Apr 29, 2019 - 14:45 EDT

Update

The performance issue continues and we are working to resolve it as quickly as possible. Customers may experience slow logins or overall slowness within the Service once logged in.

Posted Apr 29, 2019 - 13:35 EDT

Update

We are continuing to monitor for any further issues.

Posted Apr 29, 2019 - 13:33 EDT

Monitoring

A fix has been implemented and we are monitoring the results. Our Engineering team has taken action to resolve the intermittent performance issues. The primary impact of this was slow logins and/or general slowness once logged in.

Posted Apr 29, 2019 - 12:37 EDT

Update

We are continuing to work on a fix for this issue.

Posted Apr 29, 2019 - 12:35 EDT

Update

We are continuing to work on a fix for this issue.

Posted Apr 29, 2019 - 12:03 EDT

Update

The performance issue continues and we are working to resolve it as quickly as possible. Customers may experience slow logins or overall slowness within the Service once logged in.

Posted Apr 29, 2019 - 11:59 EDT

Identified

The performance issue continues and we are working to resolve it as quickly as possible.

Posted Apr 29, 2019 - 11:29 EDT

Monitoring

Our Engineering team has taken action to resolve the intermittent performance issues. The primary impact of this was slow logins. We are currently monitoring the Service.

Posted Apr 29, 2019 - 10:13 EDT

Investigating

The Service is currently experiencing intermittent performance issues. We are working to resolve the situation and apologize for any inconvenience.

Posted Apr 29, 2019 - 10:00 EDT

This incident affected: Platform.