
INDUSTRY AND OCCUPATION CODING


How NIOCCS Works

Here you will learn more technical details about the NIOCCS program.

NIOCCS Coding Engine

NIOCCS codes industry and occupation text based on the Census Industry and Occupation Classification system supplemented with special codes developed by CDC/NIOSH for non-paid workers, non-workers, and the military (see NIOSH I&O coding documentation for more information).

The NIOCCS Coding Engine combines several matching processes: phrase-based and word-based, exact match and proximity match, and weighted and unweighted. Each process is best suited to particular kinds of input, so combining them improves the overall coding ability.

A high-level view of the NIOCCS coding engine is illustrated in the diagram below.

	NIOCCS coding engine diagram

The NIOCCS Knowledgebase (KB) is designed to handle common industry and occupation combinations and common misspellings. It is the first process in the coding engine. Input records that have an exact match in the KB will be automatically coded and will not need to be processed through further coding algorithms. The NIOCCS KB was developed using one million records coded by the Bureau of Census on Census surveys and 260,000 death certificate records coded by NIOSH. These records were reviewed by our expert I&O coders to include in the KB. The initial NIOCCS KB has approximately 40,000 records.
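
The exact-match step described above can be pictured as a dictionary lookup that runs before any other coding process. The following is a minimal sketch, not the NIOCCS implementation: the `KB` entries, the placeholder codes, and the `normalize` step are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical miniature knowledgebase: (industry text, occupation text)
# mapped to (industry code, occupation code). The codes are placeholders,
# not real Census codes.
KB: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("hospital", "registered nurse"): ("IND-A", "OCC-A"),
    ("hospital", "registerd nurse"): ("IND-A", "OCC-A"),  # common misspelling
    ("elementary school", "teacher"): ("IND-B", "OCC-B"),
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed preprocessing step)."""
    return " ".join(text.lower().split())


def kb_lookup(industry: str, occupation: str) -> Optional[Tuple[str, str]]:
    """Return the stored codes on an exact KB match, else None."""
    return KB.get((normalize(industry), normalize(occupation)))


def code_record(industry: str, occupation: str) -> str:
    """Code from the KB when possible; otherwise defer to later processes."""
    codes = kb_lookup(industry, occupation)
    if codes is not None:
        return f"coded by KB: {codes}"
    return "no KB match: pass to later coding processes"
```

Storing common misspellings as their own entries keeps the lookup exact, which matches the description of the KB as handling both common combinations and common misspellings.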

NIOCCS uses Confidence Levels (CL) to decide the coding path, i.e., autocoding or computer-assisted coding. Records that meet the user-specified autocode confidence level setting are automatically coded. Records that fall below the confidence level setting are made available in the computer-assisted coding module.

Confidence Level (CL) Setting options

High: If records are processed using the HIGH confidence level setting, then only matched candidates where NIOCCS has 90% or greater confidence of accuracy will be automatically coded.

Medium: If records are processed using the MEDIUM confidence level setting, then only matched candidates where NIOCCS has 70% or greater confidence of accuracy will be automatically coded.

NOTE: A higher confidence level (CL) setting will normally result in higher accuracy of the coded results; however, it may reduce the number of records automatically coded. See Chapter 5 in the NIOCCS User Manual for more information about the NIOCCS Autocoding Confidence Levels.
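
The routing rule can be sketched in a few lines. The 90% and 70% thresholds come from the setting descriptions above; the function name `route` and the representation of confidence as a number between 0 and 1 are assumptions for this example.

```python
# Minimum confidence required for automatic coding under each setting.
THRESHOLDS = {"HIGH": 0.90, "MEDIUM": 0.70}


def route(confidence: float, setting: str) -> str:
    """Return 'autocode' or 'computer-assisted' for one matched candidate."""
    threshold = THRESHOLDS[setting.upper()]
    return "autocode" if confidence >= threshold else "computer-assisted"
```

For example, a candidate with a confidence of 0.85 would be autocoded under the MEDIUM setting but sent to computer-assisted coding under HIGH, which is why the higher setting autocodes fewer records.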

The I&O Restriction Filter is an inter-dependency arbitrator. The industry code and the occupation code are sometimes inter-dependent: one industry title may map to more than one industry code, and the most accurate one can be chosen only by considering the occupation information. Likewise, one occupation title may map to more than one occupation code, and only the industry code can narrow them down to the most appropriate one. NIOCCS therefore assigns the industry code first and then the occupation code, because in most cases the occupation codes are restricted by the industry codes. If more than one set of industry and occupation codes remains and cannot be screened further, all possible candidates are output together with their confidence levels. See Chapter 6.5.2.4 in the NIOCCS User Manual for more information on industry restriction rules.
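
A rough sketch of this industry-first restriction follows. It is an illustration under stated assumptions, not the actual filter: the `ALLOWED` table, the placeholder codes, and the choice to combine confidences by multiplication are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical restriction table: industry code -> occupation codes it permits.
ALLOWED: Dict[str, Set[str]] = {
    "IND-SCHOOL": {"OCC-TEACHER", "OCC-JANITOR"},
    "IND-HOSPITAL": {"OCC-NURSE", "OCC-JANITOR"},
}


def restrict(
    industries: List[Tuple[str, float]],
    occupations: List[Tuple[str, float]],
) -> List[Tuple[str, str, float]]:
    """Pair each industry candidate with the occupation candidates it permits.

    Returns (industry code, occupation code, combined confidence) tuples,
    highest confidence first. Multiplying the two confidences is an
    assumption made for this sketch.
    """
    pairs = []
    for ind_code, ind_conf in industries:        # industry is considered first
        allowed = ALLOWED.get(ind_code, set())
        for occ_code, occ_conf in occupations:   # occupation is then restricted
            if occ_code in allowed:
                pairs.append((ind_code, occ_code, ind_conf * occ_conf))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

When the result contains more than one pair, all of them are returned with their confidence values, mirroring the behavior described for candidates that cannot be screened further.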

Autocoding Results

Benchmarks for NIOCCS autocoding are based on the accuracy rates of the data autocoded by the system. Accuracy is tested using large sets of records that have been coded and verified by NIOSH-trained I&O coders. The benchmark goals for NIOCCS are:

High Confidence Level: 10% or less error rate found in autocoded data

Medium Confidence Level: 25% or less error rate found in autocoded data

Production rates are determined by calculating the percent of records coded automatically by NIOCCS. NOTE: The quality of the input data can result in very different autocoding production rates. Using the benchmarks set for coding accuracy, NIOCCS has demonstrated the following average autocoding production rates:

HIGH Confidence Level, Both Industry and Occupation Autocoded

Data Type                    Year 2013   Year 2014   Year 2015
Death Certificates           64%         64%         70%
Surveys                      49%         50%         51%
Average of All Data Types    51%         56%         60%

Medium Confidence Level settings will typically result in a 10-15% increase in the number of records autocoded depending on the quality of the data.
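
The two measures used in this section can be written out explicitly. This is a minimal sketch: the function names are assumptions, and only the 10% and 25% error-rate goals are taken from the benchmark goals stated above.

```python
def production_rate(autocoded: int, total: int) -> float:
    """Percent of submitted records that were coded automatically."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * autocoded / total


def error_rate(incorrect: int, autocoded: int) -> float:
    """Percent of autocoded records whose codes disagree with expert coding."""
    if autocoded <= 0:
        raise ValueError("autocoded must be positive")
    return 100.0 * incorrect / autocoded


# Benchmark goals: maximum error rate (percent) in autocoded data per setting.
MAX_ERROR = {"HIGH": 10.0, "MEDIUM": 25.0}


def meets_benchmark(incorrect: int, autocoded: int, setting: str) -> bool:
    """True when the observed error rate is within the goal for the setting."""
    return error_rate(incorrect, autocoded) <= MAX_ERROR[setting.upper()]
```

Note that the error rate is measured against the autocoded records only, while the production rate is measured against all submitted records, so the two can move in opposite directions when the confidence level setting changes.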


Continual Improvement

The NIOCCS project team continually works to identify adjustments that can be made to the system to improve autocoding and accuracy rates. User feedback is welcome and is used to identify and prioritize improvements to be made to the system. NIOCCS system architecture was developed to enable the following types of ongoing system improvements:

Knowledgebase (KB)

The NIOCCS KB will be continually evaluated as NIOSH coding and IT staff analyze more coded data to identify refinements to the knowledgebase that could improve accuracy and efficiency.

Coding Engine

As more data are processed and studied, the internal parameters (such as the weight of each process and the weight of keywords) will be adjusted toward optimal values to increase accuracy and production.

Special Coding Rules

Specific rules for unique industry or occupation titles will be added or modified as needed to improve coding accuracy. Each rule will be tested and approved by expert coders before being added to the system, and will be periodically validated so that invalid or obsolete rules are removed.

Data Quality

Coding results vary with the overall quality of the source data. Different data sources may yield significantly different accuracy and production rates. Structured and detailed data sources will have higher accuracy and production rates than data sources with free-form text, insufficient information, or numbers or symbols included in the text.

NIOCCS uses only the industry and occupation text to assign codes. Records that contain employer name and/or job duties will not code at the same rate of accuracy as records containing only industry and occupation. This is because the additional pieces of information (employer and job duties) can conflict with the industry and occupation text and/or provide more detailed information that could alter the I&O codes assigned. Including this information can be helpful, however, when using the computer-assisted coding module to ensure that appropriate codes are assigned manually.

Limitations

Performance

Internet bandwidth will significantly affect the interactivity of the computer-assisted coding.

The autocoding process may take a significant amount of time when the volume of data is large. The turnaround time for autocoding may also depend on the traffic in the queue of coding jobs.

File Size Limitations

Upload file size is currently (September 2014) limited to 2.5 MB. The number of records this equates to will vary depending on how many of the optional fields in the input file format are used. For files that use the expanded file format, the limit equates to approximately 10,000 to 20,000 records. For files that use the slim file format, it equates to approximately 20,000 to 25,000 records.
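
A file can be checked against the limit before upload. The sketch below assumes the limit means 2.5 megabytes and interprets that as 2,500,000 bytes; the exact byte count NIOCCS enforces is not stated here, and the function names are hypothetical.

```python
import os

# Assumption: the 2.5 megabyte limit is interpreted as 2,500,000 bytes.
MAX_UPLOAD_BYTES = 2_500_000


def fits_upload_limit(size_bytes: int, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """True when a file of the given size is within the upload limit."""
    return size_bytes <= limit


def chunks_needed(size_bytes: int, limit: int = MAX_UPLOAD_BYTES) -> int:
    """Minimum number of files needed if a larger file is split by size."""
    return max(1, -(-size_bytes // limit))  # ceiling division


def file_fits(path: str) -> bool:
    """Check a file on disk against the limit."""
    return fits_upload_limit(os.path.getsize(path))
```

`chunks_needed` gives only a lower bound on the number of files, because a real split has to fall on record boundaries rather than on exact byte offsets.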

Coding directly to NAICS and SOC

NIOCCS coding is based on the Bureau of Census I&O Classification schemes. NAICS and SOC codes can be obtained through NIOCCS; however, they are limited to the detail provided in the Census Alphabetic Indexes. Users cannot code directly to NAICS and SOC codes.
