Automated Essay Scoring

From Penn Center for Learning Analytics Wiki


Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring page]
 
* Automated scoring models for evaluating English essays (e-rater)
* E-rater gave significantly higher scores than human raters for 11th grade essays written by Hispanic students and Asian American students
* E-rater gave significantly higher scores than human raters for TOEFL essays (independent task) written by speakers of Chinese and Korean
* E-rater scores for GRE essays (both issue and argument prompts) written by Chinese speakers correlated poorly with human raters' scores and were higher than the human scores
* E-rater performed comparably well for male and female students when assessing their 11th grade essays and TOEFL and GRE writing (a sketch of this kind of per-group comparison follows this list)
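
A minimal sketch of the kind of subgroup comparison reported above, not the authors' actual procedure: given a table of paired human and automated scores (the file name and the columns group, human_score, and machine_score are hypothetical), compute the mean machine-minus-human difference per group, expressed in human-score standard deviation units.

<pre>
# Minimal sketch (not the authors' pipeline): per-group mean difference between
# automated and human essay scores, in human-score standard-deviation units.
# The file name and column names (group, human_score, machine_score) are hypothetical.
import pandas as pd

def group_score_gaps(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Positive values mean the automated score is higher than the human score.
    df["diff"] = df["machine_score"] - df["human_score"]
    sd_human = df["human_score"].std(ddof=1)
    out = df.groupby("group")["diff"].agg(mean_diff="mean", n="size")
    out["std_mean_diff"] = out["mean_diff"] / sd_human
    return out.sort_values("std_mean_diff")

if __name__ == "__main__":
    print(group_score_gaps("scores.csv"))  # hypothetical file
</pre>

Groups with positive std_mean_diff would be those the automated engine scores above human raters, mirroring the pattern reported above for Hispanic, Asian American, Chinese-speaking, and Korean-speaking writers.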
 
 
 
Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true pdf]
 
* A later version of the automated scoring models for evaluating English essays (e-rater)
* E-rater gave significantly lower scores than human raters when assessing African American students' written responses to the GRE issue prompt
* E-rater gave higher scores than human raters to Chinese-speaking test-takers (Mainland China, Taiwan, Hong Kong) and Korean speakers when assessing the TOEFL independent-prompt essay
* E-rater gave lower scores than human raters to Arabic, Hindi, and Spanish speakers when assessing their written responses to the TOEFL independent prompt
* E-rater correlated comparably well with human raters when assessing TOEFL and GRE essays written by male and female students (see the agreement sketch after this list)
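
Several of these findings concern agreement rather than mean differences. A hedged sketch of how per-group agreement could be summarized (this is not the ETS procedure; the data layout is an assumption), using Pearson correlation and quadratic weighted kappa:

<pre>
# Sketch under an assumed data layout (columns: group, human_score, machine_score):
# per-group agreement between automated and human scores, summarized with
# Pearson correlation and quadratic weighted kappa on the integer score scale.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def agreement_by_group(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for group, g in df.groupby("group"):
        rows.append({
            "group": group,
            "n": len(g),
            "pearson_r": g["machine_score"].corr(g["human_score"]),
            "qwk": cohen_kappa_score(g["machine_score"].round().astype(int),
                                     g["human_score"].round().astype(int),
                                     weights="quadratic"),
        })
    return pd.DataFrame(rows)
</pre>

Comparable pearson_r and qwk values across groups would correspond to "correlated comparably well"; divergent values would flag a group-specific agreement gap.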
 
 
 
 
Ramineni & Williamson (2018) [https://onlinelibrary.wiley.com/doi/10.1002/ets2.12192 pdf]
 
* Revised automated scoring engine (e-rater) for assessing GRE essays
* E-rater gave African American test-takers significantly lower scores than human raters when assessing their written responses to argument prompts
* Shorter essays written by African American test-takers were more likely to receive lower scores for showing weakness in content and organization
 
 
 
 
Wang et al. (2018) [https://www.researchgate.net/publication/336009443_Monitoring_the_performance_of_human_and_automated_scores_for_spoken_responses pdf]
* Automated scoring model (SpeechRater) for evaluating spoken English responses
* SpeechRater gave significantly lower scores than human raters for German students
* SpeechRater scored students from China higher than human raters, with H1 (first human rater) scores higher than the mean
 
 
Litman et al. (2021) [https://link.springer.com/chapter/10.1007/978-3-030-78292-4_21 html]
* Automated essay scoring models that infer text evidence usage
* For all algorithms studied, less than 1% of scoring error was explained by whether the student is female or male, whether the student is Black, or whether the student receives free/reduced-price lunch (a minimal sketch of this check follows)
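
A minimal sketch of one way to compute such a figure, assuming (hypothetically) a data frame with model and human scores plus binary indicators for gender, Black, and free/reduced-price lunch status; this is not necessarily the authors' exact analysis:

<pre>
# Sketch (assumed analysis; column names are hypothetical): regress the scoring
# error on demographic indicators and read off R^2, the share of error variance
# they explain. Values below 0.01 correspond to "less than 1% of error explained".
import pandas as pd
import statsmodels.api as sm

def error_variance_explained(df: pd.DataFrame) -> float:
    error = df["predicted_score"] - df["human_score"]
    predictors = df[["female", "black", "free_reduced_lunch"]].astype(float)
    model = sm.OLS(error, sm.add_constant(predictors)).fit()
    return model.rsquared
</pre>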
