Carbonite Crash Reporter

Overview

The crash reporter is intended to catch and handle exceptions and signals produced at runtime by any app that loads it. On startup, if configured to do so, the crash reporter installs itself in the background and waits for an unhandled exception or signal to occur. This incurs no performance overhead; for the most part the crash reporter plugin sits idle until a crash actually occurs. The only exception is that it monitors changes to the /crashreporter/ branch in the settings registry (managed by the carb::settings::ISettings interface, if present).

The crash reporter plugin does not depend on any other plugins. It will, however, make use of the carb::settings::ISettings interface if it is loaded in the process at the time the crash reporter plugin is loaded. The plugin monitors any changes to the /crashreporter/ settings branch and may change its configuration at runtime in response. See Configuration Options below for more information on the specific settings that control its behavior.

The implementation of the crash reporter plugin referred to here is based on the Google Breakpad project. The specific plugin is called carb.crashreporter-breakpad.plugin.

Setting Up the Crash Reporter

When the Carbonite framework is initialized and configured, by default an attempt will be made to find and load an implementation of the carb.crashreporter-*.plugin plugin. This normally occurs after the initial set of plugins has been loaded, including the plugin that implements the carb::settings::ISettings interface. If a crash reporter implementation plugin is successfully loaded, it will be ‘registered’ by the Carbonite framework using a call to carb::crashreporter::registerCrashReporterForClient(). This will ensure the crash reporter’s main interface carb::crashreporter::ICrashReporter is loaded and available to all modules. The default behavior of loading the crash reporter plugin can be overridden using the flag carb::StartupFrameworkDesc::disableCrashReporter when starting the framework. If this is set to true, the search, load, and registration for the plugin will be skipped. In that case, it will be up to the host app to explicitly load and register its own crash reporter if its services are desired.
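As an illustrative sketch (not compilable on its own; it assumes the Carbonite SDK headers are available, and helper names such as StartupFrameworkDesc::getDefault() and carb::startupFramework() are assumptions beyond what this document states), a host app might opt out of the default crash reporter like this:

```cpp
// Sketch only: assumes the Carbonite SDK is available.
#include <carb/StartupUtils.h>  // assumed header for the startup helpers

int main(int argc, char** argv)
{
    // Assumed helper that fills in default startup parameters.
    carb::StartupFrameworkDesc desc = carb::StartupFrameworkDesc::getDefault();

    // Skip the automatic search, load, and registration of
    // carb.crashreporter-*.plugin (documented above).
    desc.disableCrashReporter = true;

    carb::startupFramework(desc);  // assumed startup entry point

    // The host app is now responsible for loading and registering its own
    // crash reporter, e.g. via
    // carb::crashreporter::registerCrashReporterForClient(), if crash
    // reporting services are desired.
    return 0;
}
```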

Once the crash reporter is loaded and registered, the Carbonite framework will attempt to upload old crash dump files if the /app/uploadDumpsOnStartup setting is true (the default). This upload happens asynchronously in the background and does not interfere with other tasks. If the process tries to exit early, however, this background uploading could delay the exit of the process until the current upload (if any) finishes.

Most host apps will not need to interact with the crash reporter very much after this point. The only functionality that may be useful for a host app is to provide the crash reporter with various bits of metadata about the process throughout its lifetime. Providing this metadata is discussed below in Crash Handling.

Crash Handling

When a crash does occur in the app, the crash reporter will catch it. Upon catching a crash, the crash reporter plugin will create a crash dump file and collect metadata from the running app. The format of the crash dump file will differ depending on the platform.

On Windows, a minidump file compatible with Microsoft Visual Studio will be created. On Linux, a proprietary crash dump file will be created. This crash dump file can be converted to a standard Linux core dump file with a helper tool from the Breakpad library (distributed separately in the Google Breakpad packman package; the tool is located at utils/minidump-2-core in that package). A minidump or core dump file contains some portions of the state of the process at the time it crashed: the list of running threads, each thread’s register state, portions of stack memory, the list of loaded modules, and selected memory blocks that were referenced on the various thread stacks. From this state information, some investigation into what may have caused the crash can be done. The dump files do not contain all of the process’ state information by default, since that could amount to several gigabytes of data.

The metadata is collected from multiple sources, both at crash time and as the program runs. The metadata is simply a set of key-value pairs specified by the host app. Metadata values may be any string, integer, floating point, or boolean value (arrays of these values are not currently supported) and are collected from these sources:

  • Any values written to the /crashreporter/data/ branch of the settings registry. This registers a constant metadata key-value pair and is best used for values that do not change at all or do not change frequently throughout the app’s lifetime. These metadata values are collected and stored immediately.

  • Any value specified in a call to carb::crashreporter::addCrashMetadata(). This is just a helper wrapper for adding metadata values through the /crashreporter/data/ settings branch.

  • Any ‘volatile’ metadata values specified with carb::crashreporter::ICrashReporter::addVolatileMetadata(). This registers a value to be collected at crash time through a callback function. This type of metadata is intended for values that change frequently and would be too expensive to update immediately every time they change; only the last value at the time of a crash matters.
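The two kinds of metadata above can be sketched as follows (not standalone-compilable without the Carbonite SDK; the exact signatures, header path, and the callback shape are assumptions):

```cpp
// Sketch only: assumes the Carbonite SDK; signatures below are assumptions.
#include <string>
#include <carb/crashreporter/CrashReporterUtils.h>  // assumed header

// Hypothetical app function whose result changes frequently.
std::string getLastOpenedFile();

void registerCrashMetadata(carb::crashreporter::ICrashReporter* crashReporter)
{
    // Static metadata: written through the /crashreporter/data/ settings
    // branch and stored immediately. Best for values that rarely change.
    carb::crashreporter::addCrashMetadata("buildConfig", "release");

    // Volatile metadata: collected only at crash time via a callback, so a
    // frequently changing value doesn't have to be pushed on every change.
    // (Hypothetical callback shape.)
    crashReporter->addVolatileMetadata("lastOpenedFile",
                                       [] { return getLastOpenedFile(); });
}
```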

Loading a Crash Dump to Investigate

On Windows, a crash dump file can be opened by dragging it into Visual Studio and then selecting “Debug with native only” on the right-hand side of the window. This will attempt to load the state of the process at the time of the crash and search the available symbol servers for symbols and code for the modules that were loaded at the time of the crash. The specific symbol and source servers needed to collect this information depend on the project being debugged.

Once loaded, many of the features of the Visual Studio debugger will be available. Note that symbols and source code may or may not be available for every module depending on your access to such resources. Some restrictions in this mode are that you won’t be able to step through code or change the instruction pointer’s position. Also, global data may not be available depending on the contents of the crash dump file.

If a particular crash is repeatable, the /crashreporter/dumpFlags setting can be used to collect more information in the crash dump file that is created. Note though that some of the flags that are available can make the crash dump very large. On Windows, the following dump flags are available:

  • Normal: only capture enough information for basic stack traces of each thread.

  • WithDataSegs: include the memory for the data sections of each module. This can make the dump file very large because it will include the global memory space for each loaded module.

  • WithFullMemory: include all of the process’ mapped memory in the dump file. This can cause the dump file to become very large. This will however result in the most debuggable dump file in the end.

  • WithHandleData: includes all of the OS level information about open handles in the process.

  • FilterMemory: attempts to filter out blocks of memory that are not strictly needed to generate a stack trace for any given thread.

  • ScanMemory: attempts to scan stack memory for values that may be pointers to interesting memory blocks to include in the dump file. This can result in a larger dump file if a lot of large blocks are included as a result of the scan.

  • WithUnloadedModules: attempts to include a list of modules that had been recently unloaded by the process.

  • WithIndirectlyReferencedMemory: includes blocks of memory that are referenced on the stack of each thread. This can result in a significantly larger dump file.

  • FilterModulePaths: filters out module paths that may include user names or other user related directories. This can avoid potential issues with personally identifying information (PII), but might result in some module information not being found while loading the dump file.

  • WithProcessThreadData: includes full process and thread information from the operating system.

  • WithPrivateReadWriteMemory: searches the process’s virtual memory space and includes all pages that have the PAGE_READWRITE protection.

  • WithoutOptionalData: attempts to remove memory blocks that may be specific to the user or are not strictly necessary to create a usable dump file. This does not guarantee that the dump file will be devoid of PII; it just reduces the possibility.

  • WithFullMemoryInfo: includes information about the various memory regions in the process. This is simply the page allocation, protections, and state information, not the data in those memory regions itself.

  • WithThreadInfo: includes full thread state information. This includes thread context and stack memory. Depending on the number of threads and amount of stack space used, this can make the dump file larger.

  • WithCodeSegs: includes code segments from each module. Depending on the number and size of modules loaded, this can make the dump file much larger.

  • WithoutAuxiliaryState: disables the automatic collection of some extra memory blocks.

  • WithFullAuxiliaryState: includes memory and state from auxiliary data providers. This can cause the dump file to become much larger.

  • WithPrivateWriteCopyMemory: includes memory blocks that have the PAGE_WRITECOPY protection. This can make the dump file larger if a lot of large blocks exist.

  • IgnoreInaccessibleMemory: if the WithFullMemory flag is also used, this prevents the dump file generation from failing if an inaccessible region of memory is encountered. The unreadable pages will not be included in the dump file.

  • WithTokenInformation: includes security token information in the dump file.

  • WithModuleHeaders: includes the headers from each loaded module.

  • FilterTriage: adds filter-triage-related data (it is not clear exactly what this adds).

  • WithAvxXStateContext: includes the AVX state context for each thread (x86_64 only).

  • WithIptTrace: includes additional Intel Processor Trace information in the dump file.

On Linux, the process for loading a crash dump file is not entirely defined yet. Depending on how in depth the investigation needs to be, there are two currently known methods. Both require some tools from the Breakpad SDK. The following methods are suggested but not officially supported yet:

  • use the minidump-2-core tool from Breakpad to convert the crash dump file to a standard Linux core dump file. Note that by default this tool writes its result to stdout, which can break some terminals; the output should always be redirected to a file. This file can then be opened with GDB using the command gdb <executable> --core <core_file>. GDB may also need to be pointed at the various symbol files for the process. Please see the GDB manual for how to find and load symbol files if needed.

  • use the minidump-stackwalk tool to attempt to retrieve a stack backtrace for each thread listed in the crash dump file. This will produce a lot of output so it is best to redirect it to a file. This can provide some basic information about where the crash occurred and can give at least an idea of a starting point for an investigation.
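The first method above can be sketched as the following commands (illustrative only; the tool location, dump file name, and executable name are placeholders, and these workflows are, as noted, not officially supported yet):

```sh
# Convert the Breakpad crash dump into a standard Linux core file.
# minidump-2-core writes to stdout by default, so redirect to a file.
./utils/minidump-2-core crash.dmp > crash.core

# Open the core file in GDB alongside the executable that crashed.
gdb ./my_app --core crash.core
```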

Uploading Crash Dumps

NVIDIA provides a default URL to send crash reports to: https://services.nvidia.com/submit. At this location, crash dumps and metadata are accepted via HTTP POST requests. The expected format of the POST is a multipart form that provides key/value pairs for each of the metadata items, followed by the binary data for the crash dump file itself. The crash dump files are processed at this location and stored for later investigation. This default location can always be overridden using the /crashreporter/url setting. The new URL will still be expected to accept POSTed forms in the same format. This URL does not currently support accepting compressed crash dump files.
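A multipart form upload of this shape could be sketched with curl as below. This is illustrative only: the exact form field names expected by the server are assumptions, not documented here, and the file path is a placeholder.

```sh
# Hypothetical sketch of a multipart POST: metadata key/value pairs
# followed by the binary crash dump file. Field names are assumptions.
curl https://services.nvidia.com/submit \
    -F "product=MyApp" \
    -F "version=1.2.3" \
    -F "upload_file_minidump=@/path/to/crash.dmp"
```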

Once a crash dump is created locally on a machine, the default behavior (if enabled) is to attempt to upload the crash dump and its associated metadata to the current upload URL. Multiple settings affect whether and how the upload occurs; see Configuration Options for more information on those specific settings. The upload is performed synchronously in the crashing thread. If the upload succeeds, the crash dump file and its metadata may be deleted locally (depending on the /crashreporter/preserveDump setting). If the upload is not successful for any reason, the crash dump and metadata files will be left locally to retry again later.

Should the upload fail for any reason on the first attempt (i.e., in the crashing process), it will be retried the next time the app is run. The original upload could fail for many reasons: network connection issues, another crash occurring during the original upload, or even the server side rejecting the upload. When retrying uploads in future runs of the app, old crash dump files will be uploaded sequentially with their original metadata. Should a retry also fail, a counter in the metadata will be incremented. If an upload attempt fails too many times (see /crashreporter/retryCount below), the crash dump file and its metadata file will be deleted anyway.

Public Interfaces and Utilities

Instead of being configured programmatically through an interface, the crash reporter is configured entirely through the carb::settings::ISettings settings registry. On load, the crash reporter plugin starts monitoring for changes in the /crashreporter/ branch of the settings registry. As soon as any value in that branch changes, the crash reporter is synchronously notified and updates its configuration.

While the crash reporter is intended to be a service that largely works on its own, there are still some operations a host app can perform on it. These are outlined in the documentation for the carb::crashreporter::ICrashReporter interface. They include starting a background task that tries to upload old crash dump files, registering callbacks to be invoked whenever a crash dump upload completes, resolving addresses to symbols (for debug purposes only), and adding volatile metadata for the process.

There are also some utility helper functions in the carb::crashreporter namespace that can simplify some operations, such as adding new static metadata values. The only functions there intended to be called directly are the carb::crashreporter::addCrashMetadata() helpers.

Configuration Options

The Carbonite crash reporter (carb.crashreporter-breakpad.plugin) has several configuration options that can be used to control its behavior. These are specified either in an app’s config file or on the command line. The following settings keys are defined:

  • "/crashreporter/url": The URL to use when uploading crash dump files. By default this is an empty string. The URL is expected to accept multipart form messages posted to it. Many Omniverse apps are automatically configured to use the default upload URL of https://services.nvidia.com/submit via this setting. This can then be overridden on the command line or in a config file if needed.

  • "/crashreporter/product": Sets the name of the product for which crash reports will be generated. This setting is required in order for any uploads of crash dumps to occur. This becomes the product name that is included with the crash dump’s metadata. Without this metadata value set, the NVIDIA URL will reject the dump file. This may be any string value, but should be descriptive enough of the name of the app that it can be distinguished from crash dumps for other products. This defaults to an empty string.

  • "/crashreporter/version": Sets the version information for the app. This setting is required in order for any uploads of crash dumps to occur. This becomes the version information that is included with the crash dump’s metadata. Without this metadata value set, the NVIDIA URL will reject the dump file. This may be any string value, but should be descriptive enough of the version information of the crashing app that an investigation can be done on it. This defaults to an empty string.

  • "/crashreporter/dumpDir": The full path to the location to write crash dump and metadata files to on the local machine. This will also be the location that old crash dumps are uploaded from (if they exist) on subsequent runs of the app. This directory must already exist and will not be created by the crash reporter itself. By default this is the current working directory.

  • "/crashreporter/enabled": Sets whether the crash reporter is enabled or not. By default, the crash reporter will be enabled on load of the plugin. This setting can change at any point during the process’ lifetime and it will be acted on immediately by the crash reporter. When the crash reporter is disabled, its exception/signal catching hooks will be removed. The plugin will remain loaded and functional, but no action will be taken if a crash does occur. When the crash reporter is enabled, the exception/signal catching hooks will be installed again. This defaults to true.

  • "/crashreporter/alwaysUpload": Sets whether crash dump files should be uploaded after they are created. This can be used to override the user’s performance consent setting for the purposes of uploading a crash report if needed. If this is false, the user’s performance consent setting will control whether uploads are attempted. Note that this setting is effectively ignored if no upload URL has been set in /crashreporter/url. This defaults to false.

  • "/crashreporter/skipOldDumpUpload": Indicates whether attempts to upload old crash dump files should be skipped. This is useful for situations such as test apps or launching child instances of an app so that they don’t potentially end up blocking during shutdown due to an upload in progress. This defaults to false.

  • "/crashreporter/log": When enabled, this indicates whether a stack trace of the crashing thread should be written out to the app log. This will attempt to resolve the symbols on the call stack as best it can with the debugging information that is available. This defaults to true.

  • "/crashreporter/preserveDump": When enabled, this indicates that crash dump files that were successfully uploaded should not be deleted. This is useful in situations such as CI/CD so that any crash dump files from a crashed process can be stored as job artifacts. This defaults to false.

  • "/crashreporter/data": Any non-array settings values created under this settings branch will be captured as metadata values for the process. These metadata values can be added using the carb::crashreporter::addCrashMetadata() helper function. This defaults to an empty settings branch.

  • "/crashreporter/uploadTimeoutMs": Windows only. Provides a timeout in milliseconds that, when exceeded, will consider the upload as failed. This does not limit the actual amount of time the upload may take, due to a bug in WinINet. Typically this value does not need to be changed. This defaults to 7,200,000ms (2 hours).

  • "/crashreporter/debuggerAttachTimeoutMs": Determines the time in milliseconds to wait for a debugger to attach after a crash occurs. If this is a non-zero value, the crash report processing and upload will proceed once a debugger successfully attaches to the process. This is useful when trying to debug post-crash functionality, since some debuggers don’t let the original exception go completely unhandled to the point where the crash reporter is allowed to handle it (i.e., if attached before the crash). This defaults to 0ms, meaning the wait is disabled.

  • "/crashreporter/dumpFlags": Flags to control which data is written to the minidump file (on Windows). These can be specified either as a single hex value covering all the flags to use (assuming the user knows what they are doing), or as MiniDump* flag names separated by commas (‘,’), colons (‘:’), bars (‘|’), or whitespace. There should be no whitespace between flags when specified on the command line. The ‘MiniDump’ prefix on each flag name may be omitted if desired. This defaults to an empty string (i.e., no extra flags). The flags specified here may either override the default flags or be added to them, depending on the value of /crashreporter/overrideDefaultDumpFlags. This setting is ignored on Linux. For more information on the flags and their values, look up MiniDumpNormal on MSDN or see the brief summary above in Loading a Crash Dump to Investigate.

  • "/crashreporter/overrideDefaultDumpFlags": Indicates whether the crash dump flags specified in /crashreporter/dumpFlags should replace the default crash dump flags (when true) or simply be added to the default flags (when false). This defaults to false. This setting is ignored on Linux.

  • "/crashreporter/compressDumpFiles": Indicates whether the crash dump files should be compressed as zip files before uploading to the server. The compressed crash dump files are typically ~10% the size of the original, so upload time should be greatly reduced. This feature must be supported on the server side as well to be useful for upload. However, if this setting is enabled, the crash dumps will still be compressed locally on disk and will occupy less space should the initial upload fail. This defaults to false.

  • "/crashreporter/retryCount": Determines the maximum number of times to try to upload any given crash dump to the server. The number of times the upload has been retried for a given crash dump is stored in its metadata. When the dump file is first created, the retry count is set to 0. Each time the upload fails, the retry count is incremented by one. When the count reaches this limit (or exceeds it, if the limit has been lowered from the default), the dump file and its metadata will be deleted whether the upload succeeds or not. This defaults to 10.
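As an example, a config fragment covering several of the settings above might look like the following. The TOML layout, the directory path, and the product/version values are illustrative assumptions; the actual config file format depends on how the host app loads its settings.

```toml
# Hypothetical app config fragment mapping onto the /crashreporter/ branch.
[crashreporter]
enabled = true
url = "https://services.nvidia.com/submit"
product = "MyApp"                  # required for uploads to be accepted
version = "1.2.3"                  # required for uploads to be accepted
dumpDir = "/var/tmp/myapp-dumps"   # must already exist; not created by the plugin
preserveDump = true                # keep dumps after a successful upload (e.g. CI/CD)
dumpFlags = "WithThreadInfo|WithHandleData"  # Windows only; see dump flags above

[crashreporter.data]
buildConfig = "release"            # example static metadata value
```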