National Profile on Alternate Assessments Based on Alternate Achievement Standards

NCSER 2009-3014
August 2009

Appendix A: Methodology

Document Analyses and Verification Activities

The primary source of data for the NSAA State Profiles and NSAA National Profile is the documentation that states submitted to the U.S. Department of Education's (ED) Office of Elementary and Secondary Education in response to the Standards and Assessment Peer Review (Peer Review) process. The Peer Review is an ongoing process for evaluating whether states' assessment systems meet the requirements of the No Child Left Behind Act of 2001 (NCLB). The Standards and Assessment Peer Review Guidance provided by the Office of Elementary and Secondary Education framed the data collection activities for the NSAA State and National Profiles, as recommended by a panel of experts. States' submissions to the Peer Review included the following seven sections, each containing elements that defined how states' various assessments met established professional standards and ED requirements:

  • challenging academic content standards;
  • challenging academic achievement standards;
  • a single statewide system of annual high-quality assessments;
  • high technical quality;
  • alignment of academic content standards, academic achievement standards, and assessments;
  • inclusion of all students in the assessment system; and
  • assessment reporting.

The Peer Review submissions had several advantages for the purposes of the NSAA. First, they provided a common framework to which all states responded. Second, the responses included much of the evidence in a single location. Third, the responses and evidence provided were likely to be reliable for the 2005–06 and 2006–07 school years. Fourth, the Peer Review sections and elements addressed issues related to states' alternate assessment systems in light of the states' overall assessment systems. Fifth, the submissions provided an opportunity to observe how states responded to issues raised by peer reviewers.

The SRI study team and its partners used two data collection methods to investigate the status and direction of state alternate assessments for children with significant cognitive disabilities between summer 2006 and fall 2007. First, the team reviewed in depth the state document submissions to the Peer Review process and information pertaining to the alternate assessments on state websites. Second, the study team conducted structured telephone interviews with knowledgeable informants in each of the 50 states and the District of Columbia. The purpose of these interviews was to verify findings from the document review and to obtain additional information about assessments that could not be gleaned from states' submissions.

To document its review and analysis of the state documents, SRI developed a data collection instrument and a web-based database and data collection system for compiling states' responses to the elements in the Standards and Assessment Peer Review Guidance, additional information gathered from state websites, and states' responses to the subsequent interviews. The seven Peer Review components and corresponding elements provided in the guidance document became the basis for the data collection instrument. Instrument items were phrased as questions, similar to the phrasing in the Peer Review guidance; this provided a standard way of asking a state respondent for information that was not contained in the document review. A panel of experts reviewed the initial data collection instrument items to ensure that the items accurately reflected the intent of the Peer Review elements. A few additional items (e.g., number of content standards assessed on the alternate assessment) were recommended by ED and the same panel of experts as important for documenting alternate assessment systems and were included in the instrument. The full instrument was administered to State Department of Education officials from four states as part of the piloting process. These individuals provided feedback on the clarity of the items and assisted NSAA in determining the feasibility of the document analysis and verification process and procedures. The officials did not suggest any changes to the items; they did, however, suggest approaches to facilitate data collection and reduce states' burden and time commitment. It was not necessary to administer all instrument items to state respondents; respondents were asked only about items that could not be completed from the document review.

During the initial data collection phase, the evolving nature of alternate assessments based on alternate achievement standards became evident. As states received feedback through the Peer Review process, the majority found it necessary to revise and, in some cases, to discontinue use of their alternate assessments administered in 2005–06. Thus, while the original plan had been to focus on the 2005–06 school year, as a result of changes taking place in states' alternate assessment systems, the study team decided, in consultation with ED and a panel of experts, to collect data for both the 2005–06 and 2006–07 school years.

In June 2006, research team members were trained to use a systematic set of procedures for analyzing state documents and websites and for entering the data into the database. Two analysts were assigned to each state so that two researchers would review each state's extensive documentation and become highly familiar with the state's information. Procedures for data collection and the roles and responsibilities of each researcher were clearly defined, based on the following steps.

Step 1. One researcher, identified as R1, reviewed the state's submission narrative and the Peer Review Consensus notes, and conducted the initial web-based search for policy documents, state academic content standards (including "extended" academic content standards for students with significant cognitive disabilities), alternate assessment training manuals and technical manuals, and alternate assessment blueprints or test specifications that were available on state department of education websites. R1 entered the information into the database. When R1 had completed the review of electronic and online materials as fully as possible, he or she turned the findings over to the second assigned researcher (R2). During this "handover," R1 briefed R2 on missing information and where this information might be located in the Peer Review submission materials housed at ED.

Step 2. The R2 researcher then went to the ED headquarters building, where the complete set of Peer Review submissions was stored, to locate documents pertinent to the state's alternate assessment. Examples of documents sought and reviewed included technical reports, results of internal and/or external validity and reliability studies, Board of Education minutes, notes from state Technical Advisory Committees, minutes from sessions for setting alternate achievement standards, results of alignment studies, and state timelines for meeting Peer Review requirements. R2 then entered the findings of this review into the study database. Throughout, both R1 and R2 reconfirmed the accuracy of their respective reviews and identified areas that needed further clarification or resolution by a third supervising researcher (R3). R3 reviewed the information collected on each state to promote consistency in responses and to reconcile any differences between the R1 and R2 findings.

Step 3. When the data collection for each state was completed, the data were downloaded into a data verification version of the NSAA State Data Summary that presented the information collected for each state's alternate assessment system and provided a mechanism for states to verify the researchers' findings. The verification instrument (see figure A-1) included check boxes that allowed states to indicate whether the information collected was

  • accurate and complete for 2005–06; or
  • not accurate and/or not complete for 2005–06; and whether
  • information had changed for 2006–07.

The verification instrument was piloted in four states during December 2006 and January 2007. On the basis of feedback from these states, and in consultation with ED, the NSAA State Data Summary verification process was further streamlined and included the items presented in this report.

In March 2007, the study team sent a letter from the Commissioner of the National Center for Special Education Research to the state director of assessment and accountability and the state superintendent of public instruction in each state and the District of Columbia. This letter described the purpose of the study, introduced SRI and its partners, and asked states to identify the persons most appropriate to review the NSAA State Data Summary and participate in a telephone interview.

The NSAA State Data Summary was sent to each state between March 27 and May 8, 2007, with detailed instructions to the state informant(s) on completing and returning the summary to SRI within 2 weeks.

Step 4. In March 2007, the research team was trained on the procedures to be followed in conducting the telephone interviews with state administrators. A lead researcher and a support researcher were identified for each state; these were usually the same individuals who had conducted the document analysis. SRI developed a website to record the results of the state interviews; the site included a call log, a data entry screen for interview responses, and a mechanism for combining the notes taken during the interview by the two participating researchers.

Step 5. When states returned their completed NSAA State Data Summary reviews (April to September 2007), a research team member entered states' responses into the NSAA database. The lead researcher for each state arranged a convenient interview time with the state informant(s). The lead researcher asked each state to provide information only about items informants indicated were not accurate or not complete for 2005–06 and those for which information had changed from 2005–06 to 2006–07. During the interview, two researchers recorded interview responses and comments about the data into the NSAA database. At this time, the researchers also developed a list of documents not previously available to the study team that states agreed to send to NSAA as they became available. These documents included, for example, new training manuals and technical reports about the alternate assessments' reliability, validity, or alignment with state standards for the 2006–07 school year.

Step 6. Following completion of the telephone interview, the two interviewers updated and edited the state information on the NSAA database to include data collected prior to the interview that had been verified as correct by the state informant(s), data from the interview, and data from any additional documents cited during the interview and subsequently provided by the state to NSAA for review. This process was completed for all states and the District of Columbia by the end of September 2007.


Data Processing and Coding Activities

The NSAA data collected through the document review and telephone interview process resulted in four types of data formats: yes/no items, multiple-choice items, closed-ended text items (such as the name of the assessment and number of content standards addressed for a specific subject), and open-ended response items.

Coding of open-ended items. In September 2007, senior NSAA researchers met for 3 days to develop procedures for coding open-ended items. They used the following inductive analytic procedures for systematically analyzing qualitative data, as defined by Glaser and Strauss (1967) and Strauss and Corbin (1990).

Step 1. For each open-ended item, the researchers worked in pairs to read and understand the state responses and to create initial coding categories. Each researcher in the pair then independently coded approximately 10 randomly selected state responses by reading line by line and assigning coding categories. The two researchers then discussed their proposed coding categories in detail, explaining the reasoning behind each code and its definition, reconciling differences, and refining existing codes or adding others as needed. The researchers then independently coded another 10 items to test the proposed coding scheme.

Step 2. Over a 2-week period, codes for all sections were revised and examined to prepare for a "Coding Institute." State responses to most of the items could be differentiated into a relatively small number of easily coded categories. When necessary, redundant or overlapping items were collapsed and coded together. A small number of items elicited little or no information because the items applied to few or no states' alternate assessment systems. For example, no state had multiple alternate assessment forms, so coding categories were not developed for the items referring to multiple test forms.

Step 3. The Coding Institute was a week-long meeting that included training and practice for researchers on using the codes, followed by coding the open-ended items for the 50 states and the District of Columbia. A pair of researchers coded each item, first coding the item independently, then comparing their codes, and finally reconciling any disagreements for a final code (see Interrater Reliability section below). In some instances, the data provided by the state were ambiguous and could not be coded. These items were further researched and then were subjected to the same process of having two researchers independently code them, followed by a comparison of codes and a reconciliation of any disagreements for a final code. Codes were recorded on hard-copy coding sheets.

Step 4. The data from the hard-copy coding sheets were double entered (to ensure accuracy) into Excel files, from which a data file for each state was generated that included all items and their response codes for final review and verification by the state's lead researcher/interviewer.
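
To illustrate the kind of check that double entry supports, the following minimal Python sketch compares two independently keyed versions of a state's coding sheet and flags any cells that differ so they can be resolved against the hard copy. The file names and column names are hypothetical; this appendix does not describe the actual files or tools the study used.

    import csv

    def load_codes(path):
        """Read one entry pass of a coding sheet as {(item_id, response_category): code}."""
        with open(path, newline="") as f:
            return {(row["item_id"], row["response_category"]): row["code"]
                    for row in csv.DictReader(f)}

    # Hypothetical file names for the two independent entry passes for one state.
    entry_1 = load_codes("state_entry_pass1.csv")
    entry_2 = load_codes("state_entry_pass2.csv")

    # Any cell present in only one pass, or keyed with different codes, needs review.
    for key in sorted(entry_1.keys() | entry_2.keys()):
        if entry_1.get(key) != entry_2.get(key):
            item_id, category = key
            print(f"Check item {item_id}, category {category}: "
                  f"pass 1 = {entry_1.get(key)!r}, pass 2 = {entry_2.get(key)!r}")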

Step 5. Coding was verified for all 50 states and the District of Columbia. The lead researcher/interviewer reviewed all responses for consistency, based on his or her knowledge and understanding of the state. Updates and revisions were made to a few of the items, and those changes were documented in the final data set.

Processing of uncoded items. Data from the yes/no, multiple-choice, and closed-ended items did not require coding as described above, and were verified and recorded in spreadsheets.


State Profile Data Verification

State profiles were created from the collected data. After consultation with ED staff, a decision was made to focus the profiles only on the 2006–07 school year rather than both the 2005–06 and 2006–07 school years, for which data had been collected. The decision allowed the most up-to-date data to be reported clearly and without redundancy. Because the state profiles displayed data in a different format than states had previously reviewed, the state profile was sent to each state for a final review in May and June 2008, using the following procedure:

Step 1. Each state profile was sent to the state assessment director or previous respondent for review. A cover letter explained that changes to the profile were possible only if they could be supported by state documentation.

Step 2. Each state was contacted by phone and e-mail to discuss the profile. Information and documentation were collected for items that might be considered for update or revision.

Step 3. The research team reviewed any information and documentation submitted by each state. A profile was updated to add missing information or correct inaccuracies if sufficient documentation was provided by the state.


Interrater Reliability

Open-Ended Data Coding Activities
Two sets of comparisons were calculated for the interrater reliability of the open-ended items. The first compared the codes from R1's and R2's individual results from the Coding Institute (step 3 above). The second compared the reconciled final codes determined by R1 and R2 at the Coding Institute (step 3) with the codes determined during the final verification by the lead researcher/interviewer (step 5). Each comparison was calculated by percent agreement and Cohen's Kappa analyses. The percent agreement calculation used the number of agreements1 divided by the sum of agreements and disagreements. The Cohen's Kappa calculation used the number of agreements observed minus the number of agreements expected by chance, divided by the number of items minus the number of agreements expected by chance.
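
For readers who prefer a computational statement of these two formulas, the Python sketch below implements them as described: percent agreement as agreements over agreements plus disagreements, and Cohen's Kappa as observed agreements minus chance-expected agreements, divided by the number of items minus chance-expected agreements. This is a minimal sketch only; the study's actual tabulations involved additional aggregation across items and response categories (see footnote 1), and the example codes are illustrative values, not study data.

    from collections import Counter

    def percent_agreement(codes_1, codes_2):
        """Agreements divided by the sum of agreements and disagreements, as a percentage."""
        agreements = sum(a == b for a, b in zip(codes_1, codes_2))
        return 100.0 * agreements / len(codes_1)

    def cohens_kappa(codes_1, codes_2):
        """(Observed agreements - chance-expected agreements) / (items - chance-expected agreements)."""
        n = len(codes_1)
        observed = sum(a == b for a, b in zip(codes_1, codes_2))
        marginals_1, marginals_2 = Counter(codes_1), Counter(codes_2)
        # Agreements expected by chance, based on each coder's marginal category frequencies.
        expected = sum(marginals_1[c] * marginals_2[c] for c in marginals_1) / n
        return (observed - expected) / (n - expected)

    # Illustrative codes for one item across five states (not study data).
    r1 = ["teacher", "teacher", "team", "teacher", "team"]
    r2 = ["teacher", "team", "team", "teacher", "team"]
    print(percent_agreement(r1, r2))  # 80.0
    print(cohens_kappa(r1, r2))       # approximately 0.62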

Percent agreement analyses. The following interrater reliability findings apply only to the 2006–07 school year because only those data are reported in the profiles. In the first set of comparisons, the overall interrater agreement level between the two coders was 92.3 percent. In addition, calculations were conducted on an item-by-item basis and by response category for each item. The interrater agreement by item ranged from 84.1 percent to 99.1 percent. The interrater agreement by response category for each item ranged from 75.9 percent to 100 percent (figure A-2).

In the second set of comparisons, between the reconciled final codes from the Coding Institute (step 3) and the codes determined during the final verification by the lead researcher/interviewer (step 5), the same approach to calculating the percentage of agreement between coders was used. The overall interrater agreement level was 93.7 percent. In addition, as with the first set of comparisons, the interrater reliability was calculated for each item and for each response category. The interrater agreement by item ranged from 78.4 percent to 97.5 percent. The interrater agreement by response category ranged from 55.8 percent (a single outlier) to 100 percent.

Cohen's Kappa analyses. The overall Cohen's Kappa between the two coders was .80. The interrater agreement by item ranged from .59 to .96. The interrater agreement by response category for each item ranged from -.03 to 1.0 (figure A-2).

For the second set of comparisons, between the reconciled final codes from the Coding Institute (step 3) and the codes determined during the final verification by the lead researcher/interviewer (step 5), the overall Cohen's Kappa was .84. In addition, as with the first set of comparisons, the interrater reliability was calculated for each item and for each response category. The interrater agreement by item ranged from .58 to .95, and the interrater reliability coefficients by response category ranged from 0 to 1.0.

The low values for kappa tended to occur on items where the raters agreed for almost all states but almost all of the ratings fell in one of the two possible rating categories. For example, for item D5 the coders marked "student's special education teacher" for most of the states and their agreement was high, but because the vast majority of ratings were "student's special education teacher," the few disagreements (relative to the total number of states) between coders 1 and 2 resulted in a low kappa value. We noted that when there was substantial imbalance in the percentage of observations in the two rating categories, kappa could give counterintuitive results. For example, suppose that each rater has a 90 percent chance of correctly rating a state but that all states belong to only one of the two rating categories. Then we would expect 81 percent agreement between the raters in the correct rating category, 1 percent agreement between the raters in the wrong rating category, and 18 percent disagreement between the raters (split between the two off-diagonal cells in a 2 x 2 rating table). This results in an expected kappa of 0, even though each rater rates correctly 90 percent of the time. If, on the other hand, half of the states are in each of the two rating categories, then the expected kappa is .64.
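
The arithmetic in the preceding example can be reproduced with a short calculation. The sketch below, with a function name and parameters of our own choosing rather than anything taken from the report, computes the expected kappa for two independent raters who each rate a state's true category correctly with a given probability, under the two category distributions described above.

    def expected_kappa(p_correct, share_in_first_category):
        """Expected kappa for two independent raters in a 2 x 2 rating table."""
        p, q = p_correct, 1.0 - p_correct
        # Observed agreement: both raters correct or both raters wrong.
        p_observed = p * p + q * q
        # Each rater's marginal probability of marking the first category.
        share = share_in_first_category
        p_first = share * p + (1.0 - share) * q
        # Chance agreement from the two raters' (identical) marginals.
        p_chance = p_first * p_first + (1.0 - p_first) * (1.0 - p_first)
        return (p_observed - p_chance) / (1.0 - p_chance)

    print(expected_kappa(0.90, 1.0))  # all states in one category: kappa of 0
    print(expected_kappa(0.90, 0.5))  # half the states in each category: kappa of .64 (up to rounding)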

State Profile Verification Activities
For the state profile verification activities, states reviewed all items, both open-ended and uncoded (yes/no items, multiple-choice items, and closed-ended text items). One comparison was calculated for the interrater reliability during the state profile verification activities: the interrater reliability between the final verification by the lead researcher/interviewer (step 5) and the final review by the state of all items. As with the open-ended items, the comparison was calculated by percent agreement and Cohen's Kappa analyses, and each analysis was examined separately for the open-ended and uncoded items. The yes/no items, multiple-choice items, and closed-ended text items were originally reported by the state during the interview phase; if these items were changed during the review process, the change was due to the state review.

Percent agreement analyses. The overall interrater agreement level between the final verification by the research team and the final review by the state for open-ended items was 90.7 percent. In addition, calculations were conducted on an item-by-item basis and by response category for each item. The interrater agreement by item ranged from 75.0 percent to 100 percent. The interrater agreement by response category for each item ranged from 53.9 percent (a single outlier) to 100 percent (figure A-3).

For uncoded items, the overall interrater agreement level was 91.8 percent (figure A-3). The interrater agreement by item ranged from 84.8 percent to 100 percent. The interrater agreement by response category for each item ranged from 50.0 percent (a single outlier) to 100 percent.

Cohen's Kappa analyses. For open-ended items, the overall Cohen's Kappa between the final verification by the research team and the final review by the state was .80 (figure A-3). The interrater agreement by item ranged from .33 to 1.0. The interrater agreement by response category for each item ranged from -.03 to 1.0.

For uncoded items, the overall Cohen's Kappa was .84. The interrater agreement by item ranged from .69 to 1.0. The interrater agreement by response category for each item ranged from 0 to 1.0 (figure A-3).

Two items in the state profile (the number of content standards assessed by alternate assessment and the number of general content standards) were not included in the Cohen's Kappa calculation because of the structure of the items.

As with the open-ended coding activities, the low values for kappa tended to occur on items where the raters agreed for almost all states but almost all of the ratings fell in one of the two possible rating categories. We noted that when there was substantial imbalance in the percentage of observations in the two rating categories, kappa could give counterintuitive results.


1 Agreements and disagreements were determined by comparing the codes for each response category. Agreements were assigned the value of 100; disagreements were assigned the value of 0. Averages of all agreements and disagreements for the 50 states and the District of Columbia for all the response categories across all the items are reported as the overall percentage of agreement for each year. Averages of all items and of response categories for each item are reported at the item level.