= At-risk/Dropout/Stopout/Graduation Prediction =

Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]
* Models predicting student retention in an online college program, with Kappa and AUC compared across groups (see the sketch below)
* J48 decision trees achieved much lower Kappa and AUC for Black students than White students
* J48 decision trees achieved significantly lower Kappa but higher AUC for male students than female students
* JRip decision rules achieved almost identical Kappa and AUC for Black students and White students
* JRip decision rules achieved much lower Kappa and AUC for male students than female students
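
Most entries on this page compare a metric such as Cohen's Kappa or AUC across demographic groups. A minimal sketch of that per-group evaluation, using invented data and group labels rather than anything from Kai et al., might look like:

<syntaxhighlight lang="python">
# Illustrative only: compute Kappa and AUC separately for each group.
import numpy as np
from sklearn.metrics import cohen_kappa_score, roc_auc_score

def metrics_by_group(y_true, y_score, groups, threshold=0.5):
    """Return {group: (kappa, auc)} for a binary classifier's scores."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    out = {}
    for g in np.unique(groups):
        m = groups == g
        y_pred = (y_score[m] >= threshold).astype(int)
        out[g] = (cohen_kappa_score(y_true[m], y_pred),
                  roc_auc_score(y_true[m], y_score[m]))
    return out

# Toy usage: retention labels, model scores, and a gender attribute.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
scores = np.clip(0.3 * y + rng.normal(0.4, 0.25, 200), 0, 1)
gender = rng.choice(["female", "male"], 200)
for group, (kappa, auc) in metrics_by_group(y, scores, gender).items():
    print(f"{group}: Kappa={kappa:.2f}, AUC={auc:.2f}")
</syntaxhighlight>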

Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]
* Models predicting whether a college student will fail a course
* The multiple cooperative classifier model (MCCM) was the best at reducing bias (discrimination against African-American students), while other models (particularly Logistic Regression and Rawlsian Fairness) performed far worse
* The level of bias was inconsistent across courses, with MCCM prediction showing the least bias for Psychology and the greatest bias for Computer Science
* MCCM was also the best at reducing bias against male students, performing particularly well for the Psychology course
* Other models (Logistic Regression and Rawlsian Fairness) performed far worse for male students, particularly in Computer Science and Electrical Engineering

Anderson et al. (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM2019_paper56.pdf pdf]
* Models predicting six-year college graduation
* False negative rates were greater for Latino students when Decision Tree and Random Forest models were used
* White students had higher false positive rates across all models: Decision Tree, SVM, Logistic Regression, Random Forest, and SGD
* False negative rates were greater for male students than female students when SVM, Logistic Regression, and SGD were used

Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]
* Models predicting high school dropout
* The decision trees showed little difference in AUC among White, Black, Hispanic, Asian, American Indian and Alaska Native, and Native Hawaiian and Pacific Islander students
* The decision trees showed very minor differences in AUC between female and male students

Gardner, Brooks and Baker (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/LAK_PAPER97_CAMERA.pdf pdf]
* Model predicting MOOC dropout, evaluated specifically through slicing analysis
* Some algorithms performed worse for female students than male students, particularly in courses with 45% or less male presence

Baker et al. (2020) [https://www.upenn.edu/learninganalytics/ryanbaker/BakerBerningGowda.pdf pdf]
* Model predicting student graduation and SAT scores for military-connected students
* For prediction of graduation, applying algorithms across populations resulted in an AUC of 0.60, degrading from their original AUC of 0.70 or 0.71 substantially toward chance
* For prediction of SAT scores, applying algorithms across populations resulted in Spearman's ρ of 0.42 and 0.44, degrading by about a third from their original performance toward chance

Kai et al. (2017) [https://files.eric.ed.gov/fulltext/ED596601.pdf pdf]
* Models predicting student retention in an online college program
* J48 decision trees achieved much higher Kappa and AUC for students whose parents did not attend college than for those whose parents did
* JRip decision rules achieved much higher Kappa and AUC for students whose parents did not attend college than for those whose parents did

Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]
* Models predicting college dropout for students in residential and fully online programs
* The model showed better recall for students who are under-represented minority (URM; not White or Asian), male, first-generation, or with greater financial needs
* Whether or not socio-demographic information was included, the model showed worse accuracy and true negative rates for residential students who are under-represented minority (URM; not White or Asian), male, first-generation, or with greater financial needs
* Both accuracy and true negative rates were better for students who are first-generation or have greater financial needs

Verdugo et al. (2022) [https://dl.acm.org/doi/abs/10.1145/3506860.3506902 pdf]
* An algorithm predicting dropout from university after the first year
* Several algorithms achieved better AUC and F1 for students who attended public high schools than for students who attended private high schools
* Several algorithms achieved better AUC for male students than female students; F1 scores were more balanced

Sha et al. (2022) [https://ieeexplore.ieee.org/abstract/document/9849852]
* Predicting dropout on the XuetangX platform using a neural network
* A range of over-sampling methods was tested
* Regardless of the over-sampling method used, dropout prediction performance was slightly better for male students

Queiroga et al. (2022) [https://www.mdpi.com/2078-2489/13/9/401 pdf]
* Models predicting secondary school students at risk of failure or dropping out
* Model was unable to predict student success (F1 score = 0.0) for students not in a social welfare program (higher socioeconomic status)
* Model had slightly lower AUC ROC (0.52 instead of 0.56) for students not in a social welfare program (higher socioeconomic status)

Perdomo et al. (2023) [https://www.researchgate.net/publication/370001437_Difficult_Lessons_on_Social_Prediction_from_Wisconsin_Public_Schools pdf]
* Paper discusses a system that predicts probabilities of on-time graduation
* Prediction is less accurate for White students than for other students
* Prediction is more accurate for students with disabilities than for students without disabilities
* Prediction is more accurate for low-income students than for non-low-income students
* Prediction is comparable for male and female students
= Socioeconomic Status =

Yudelson et al. (2014) [https://www.yudelson.info/pdf/EDM2014_YudelsonFRBNJ.pdf pdf]
* Models discovering generalizable sub-populations of students across different schools to predict students' learning with Carnegie Learning's Cognitive Tutor (CLCT)
* Models trained on schools with a high proportion of low-SES students performed worse than those trained on schools with a medium or low proportion
* Models trained on schools with low or medium proportions of low-SES students performed similarly well for schools with high proportions of low-SES students

Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]
* Models predicting undergraduate course grades and average GPA
* Students from low-income households were inaccurately predicted to perform worse on both short-term (final course grade) and long-term (GPA) outcomes
* Fairness of the model improved if it included only clickstream and survey data

Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]
* Models predicting college dropout for students in residential and fully online programs
* Whether or not socio-demographic information was included, the model showed worse accuracy and true negative rates for residential students with greater financial needs
* The model showed better recall for students with greater financial needs, especially for those studying in person

Kung & Yu (2020) [https://dl.acm.org/doi/pdf/10.1145/3386527.3406755 pdf]
* Predicting course grades and later GPA at a public U.S. university
* Equal performance for low-income and upper-income students in course grade prediction for several algorithms and metrics
* Worse performance on independence for low-income students than high-income students in later GPA prediction for four of five algorithms; one algorithm had worse separation and two algorithms had worse sufficiency

Litman et al. (2021) [https://link.springer.com/chapter/10.1007/978-3-030-78292-4_21 html]
* Automated essay scoring models inferring text evidence usage
* For all algorithms studied, less than 1% of error is explained by whether a student receives free/reduced-price lunch

Queiroga et al. (2022) [https://www.mdpi.com/2078-2489/13/9/401 pdf]
* Models predicting secondary school students at risk of failure or dropping out
* Model was unable to predict student success (F1 score = 0.0) for students not in a social welfare program (higher socioeconomic status)
* Model had slightly lower AUC ROC (0.52 instead of 0.56) for students not in a social welfare program (higher socioeconomic status)

= Course Grade and GPA Prediction =

Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]
* Models predicting college success (defined as median grade or above)
* Random forest algorithms performed significantly worse for underrepresented minority students (URM; American Indian, Black, Hawaiian or Pacific Islander, Hispanic, and Multicultural) than for non-URM students (White and Asian)
* Random forest algorithms performed significantly worse for male students than female students
* The fairness of the model (demographic parity and equality of opportunity), as well as its accuracy, improved after correcting the threshold values from 0.5 to group-specific values (see the sketch below)
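
A minimal sketch of the group-specific threshold idea referenced above, under the assumption that each group's cutoff is chosen so its predicted-positive rate hits a common target rate (one route to demographic parity). This is a toy with invented names and data, not Lee and Kizilcec's exact procedure:

<syntaxhighlight lang="python">
# Toy sketch: replace a global 0.5 cutoff with per-group cutoffs chosen so
# each group is predicted positive at the same target rate.
import numpy as np

def group_thresholds(scores, groups, target_rate):
    """Return {group: cutoff} so ~target_rate of each group scores above it."""
    return {g: np.quantile(scores[groups == g], 1 - target_rate)
            for g in np.unique(groups)}

rng = np.random.default_rng(1)
scores = rng.uniform(size=300)                # model scores in [0, 1]
groups = rng.choice(["URM", "non-URM"], 300)  # hypothetical group attribute
cuts = group_thresholds(scores, groups, target_rate=0.6)
preds = np.array([s >= cuts[g] for s, g in zip(scores, groups)])
for g, c in cuts.items():
    print(f"{g}: cutoff={c:.2f}, positive rate={preds[groups == g].mean():.2f}")
</syntaxhighlight>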

Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]
* Models predicting undergraduate course grades and average GPA
* Students who are international, first-generation, or from low-income households were inaccurately predicted to get lower course grades and average GPAs than their peers; fairness of models improved with the inclusion of clickstream and survey data
* Female students were inaccurately predicted to achieve greater short-term and long-term success than male students; fairness of models improved when a combination of institutional and click data was used

Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]
* Models predicting course outcome of students in a virtual learning environment (VLE)
* More male students were predicted to pass the course than female students, but this overestimation was fairly small and not consistent across different algorithms
* Among the algorithms, Naive Bayes had the lowest normalized mutual information value and the highest ABROCA value (the area between the two groups' ROC curves; see the sketch below)
* Students with self-declared disabilities were predicted to pass the course more often
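
ABROCA, used in several entries on this page, integrates the absolute gap between two groups' ROC curves over the false positive rate axis. A rough sketch with invented data (the grid size and toy scores are assumptions for illustration):

<syntaxhighlight lang="python">
# Rough sketch of ABROCA: the area between two groups' ROC curves.
import numpy as np
from sklearn.metrics import roc_curve

def abroca(y_true, y_score, groups, group_a, group_b, n_grid=1001):
    fpr_grid = np.linspace(0, 1, n_grid)
    tprs = []
    for g in (group_a, group_b):
        m = groups == g
        fpr, tpr, _ = roc_curve(y_true[m], y_score[m])
        tprs.append(np.interp(fpr_grid, fpr, tpr))  # ROC on a common grid
    # Integrate the absolute difference between the two curves over FPR.
    return np.trapz(np.abs(tprs[0] - tprs[1]), fpr_grid)

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 400)
groups = rng.choice(["female", "male"], 400)
scores = np.clip(0.5 * y + rng.normal(0.3, 0.3, 400), 0, 1)
print(f"ABROCA = {abroca(y, scores, groups, 'female', 'male'):.3f}")
</syntaxhighlight>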

Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]
* Predicting university course grades using LSTM
* Roughly equal accuracy across racial groups
* Slightly better accuracy (~1%) across racial groups when including race in the model

Kung & Yu (2020) [https://dl.acm.org/doi/pdf/10.1145/3386527.3406755 pdf]
* Predicting course grades and later GPA at a public U.S. university
* Five algorithms and three metrics (independence, separation, sufficiency) analyzed; the three criteria are sketched below
* Poorer performance for Latinx students on course grade prediction for all three metrics; poorer performance for Latinx students on GPA prediction in terms of independence and sufficiency, but not separation
* Poorer performance for first-generation students on course grade prediction for independence and separation, and for some algorithms for GPA prediction as well
* Poorer performance for low-income students in several cases, about 1/3 of cases checked
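
Independence, separation, and sufficiency are standard group-fairness criteria. The sketch below tabulates them for a binary predictor in one simple way; Kung & Yu's exact operationalizations may differ, and the data is invented:

<syntaxhighlight lang="python">
# Per group, for a binary predictor:
#   independence: P(pred=1 | group)
#   separation:   TPR and FPR given the true label
#   sufficiency:  P(label=1 | pred=1, group)
import numpy as np

def fairness_report(y_true, y_pred, groups):
    for g in np.unique(groups):
        m = groups == g
        yt, yp = y_true[m], y_pred[m]
        print(f"{g}: independence={yp.mean():.2f} "
              f"TPR={yp[yt == 1].mean():.2f} FPR={yp[yt == 0].mean():.2f} "
              f"sufficiency={yt[yp == 1].mean():.2f}")

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 500)                              # true outcomes
pred = ((y + rng.integers(0, 2, 500)) >= 1).astype(int)  # noisy predictions
grp = rng.choice(["low-income", "high-income"], 500)
fairness_report(y, pred, grp)
</syntaxhighlight>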

Jeong et al. (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Jeong_Racial_Bias_ML_Algs.pdf pdf]
* Predicting 9th grade math score from academic performance, surveys, and demographic information
* Despite comparable accuracy, the model tends to overpredict Asian and White students' performance, and underpredict Black, Hispanic, and Native American students' performance
* Several fairness correction methods equalize false positive and false negative rates across groups

Sha et al. (2022) [https://ieeexplore.ieee.org/abstract/document/9849852]
* Predicting course pass/fail with random forest in Open University data
* A range of over-sampling methods was tested (the simplest variant is sketched below)
* Regardless of the over-sampling method used, course pass/fail prediction was moderately better for male students
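
Over-sampling rebalances the training set before a model is fit. The sketch below implements only the naive random variant on invented data; Sha et al. tested a broader range of methods:

<syntaxhighlight lang="python">
# Naive random over-sampling: resample each class's rows with replacement
# until every class matches the majority class size.
import numpy as np

def random_oversample(X, y, rng):
    classes = np.unique(y)
    n_max = max(np.sum(y == c) for c in classes)
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), n_max, replace=True)
                          for c in classes])
    return X[idx], y[idx]

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))                  # toy features
y = (rng.uniform(size=100) < 0.2).astype(int)  # imbalanced pass/fail labels
X_bal, y_bal = random_oversample(X, y, rng)
print("class counts:", np.bincount(y), "->", np.bincount(y_bal))
</syntaxhighlight>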

Deho et al. (2023) [https://files.osf.io/v1/resources/5am9z/providers/osfstorage/63eaf170a3fade041fe7c9db?format=pdf&action=download&direct&version=1 pdf]
* Predicting whether course grade will be above or below 0.5
* Better prediction for female students in some courses, better prediction for male students in other courses
* Generally worse prediction for international students

= International Students =

Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]
* Model predicting undergraduate course grades and average GPA
* International students were inaccurately predicted to get lower course grades and average GPAs than their peers when personal background was included
* Fairness of the model improved if it included both clickstream and survey data

Deho et al. (2023) [https://files.osf.io/v1/resources/5am9z/providers/osfstorage/63eaf170a3fade041fe7c9db?format=pdf&action=download&direct&version=1 pdf]
* Predicting whether course grade will be above or below 0.5
* Generally worse prediction for international students

= Gender: Male/Female =

Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]
* Models predicting student retention in an online college program
* J48 decision trees achieved significantly lower Kappa but higher AUC for male students than female students
* JRip decision rules achieved much lower Kappa and AUC for male students than female students

Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]
* Models predicting high school dropout
* The decision trees showed very minor differences in AUC between female and male students

Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]
* Models predicting whether a college student will fail a course
* The multiple cooperative classifier model (MCCM) was the best at reducing bias against male students, performing particularly well for the Psychology course
* Other models (Logistic Regression and Rawlsian Fairness) performed far worse for male students, particularly in Computer Science and Electrical Engineering

Anderson et al. (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM2019_paper56.pdf pdf]
* Models predicting six-year college graduation
* False negative rates were greater for male students than female students when SVM, Logistic Regression, and SGD were used

Gardner, Brooks and Baker (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/LAK_PAPER97_CAMERA.pdf pdf]
* Model predicting MOOC dropout, evaluated specifically through slicing analysis
* Some algorithms studied performed worse for female students than male students, particularly in courses with 45% or less male presence

Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]
* Models predicting course outcome of students in a virtual learning environment (VLE)
* More male students were predicted to pass the course than female students, but this overestimation was fairly small and inconsistent in direction across algorithms
* Among the algorithms, Naive Bayes had the lowest normalized mutual information value and the highest ABROCA value

Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]
* Models predicting college success (defined as median grade or above)
* Random forest algorithms performed significantly worse for male students than female students
* The fairness of the model (demographic parity and equality of opportunity), as well as its accuracy, improved after correcting the threshold values from 0.5 to group-specific values

Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]
* Model predicting undergraduate short-term (course grades) and long-term (average GPA) success
* Female students were inaccurately predicted to achieve greater short-term and long-term success than male students
* The fairness of models improved when a combination of institutional and click data was used

Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]
* Models predicting college dropout for students in residential and fully online programs
* Whether or not socio-demographic information was included, the model showed worse true negative rates and worse accuracy for male students
* The model showed better recall for male students, especially for those studying in person
* The differences in recall and true negative rates were smaller, and thus fairer, for male students studying online when socio-demographic information was not included in the model

Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]
* Automated scoring models for evaluating English essays (e-rater)
* The e-rater system was comparably accurate for male and female students when assessing their 11th grade essays

Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true pdf]
* A later version of the e-rater automated essay scoring system
* The e-rater system correlated comparably well with human raters when assessing TOEFL and GRE essays written by male and female students

Verdugo et al. (2022) [https://dl.acm.org/doi/abs/10.1145/3506860.3506902 pdf]
* An algorithm predicting dropout from university after the first year
* Several algorithms achieved better AUC for male than female students; results were mixed for F1

Zhang et al. (in press)
* Detecting student use of self-regulated learning (SRL) in the mathematical problem-solving process
* For each SRL-related detector, relatively small differences in AUC were observed across gender groups
* No gender group consistently had the best-performing detectors

Rzepka et al. (2022) [https://www.insticc.org/node/TechnicalProgram/CSEDU/2022/presentationDetails/109621 pdf]
* Models predicting whether a student will quit a spelling learning activity without completing it
* Multiple algorithms had slightly better false positive rates and AUC ROC for male students than female students, but equivalent performance on multiple other metrics

Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869 pdf]
* Models predicting whether two students will communicate on an online discussion forum
* Multiple fairness approaches led to ABROCA of under 0.01 for female versus male students

Sha et al. (2021) [https://angusglchen.github.io/files/AIED2021_Lele_Assessing.pdf pdf]
* Models predicting whether a MOOC discussion forum post is content-relevant or content-irrelevant
* Some algorithms achieved ABROCA under 0.01 for female students versus male students, but other algorithms (Naive Bayes) had ABROCA as high as 0.06
* Balancing the size of each group in the training set reduced ABROCA

Litman et al. (2021) [https://link.springer.com/chapter/10.1007/978-3-030-78292-4_21 html]
* Automated essay scoring models inferring text evidence usage
* For all algorithms studied, less than 1% of error is explained by whether a student is female or male

Sha et al. (2022) [https://ieeexplore.ieee.org/abstract/document/9849852]
* Three data sets and algorithms: predicting course pass/fail (random forest), dropout (neural network), and forum post relevance (neural network)
* A range of over-sampling methods was tested
* Regardless of the over-sampling method used, course pass/fail prediction was moderately better for male students, dropout prediction was slightly better for male students, and forum post relevance prediction was moderately better for female students

Deho et al. (2023) [https://files.osf.io/v1/resources/5am9z/providers/osfstorage/63eaf170a3fade041fe7c9db?format=pdf&action=download&direct&version=1 pdf]
* Predicting whether course grade will be above or below 0.5
* Better prediction for female students in some courses, better prediction for male students in other courses

= Algorithmic Bias in Education =

== Empirical Evidence for Algorithmic Bias in Education: The Wiki ==

This Wiki summarizes the current peer-reviewed published evidence surrounding Algorithmic Bias in Education: which groups are impacted, and in which contexts.

For a relatively recent review on this topic, see Baker, R.S., Hawn, M.A. (in press) Algorithmic Bias in Education. To appear in ''International Journal of Artificial Intelligence and Education'' ([https://www.upenn.edu/learninganalytics/ryanbaker/AlgorithmicBiasInEducation_rsb3.7.pdf pdf])

Note: Within this Wiki, we recommend that page editors use the group labels originally used within the publications being cited, to best represent the articles included here. We also request that each page center the members of the group the page is about.

This wiki can be cited as: Penn Center for Learning Analytics (*current year*) Empirical Evidence for Algorithmic Bias in Education: The Wiki. Philadelphia, PA: Penn Center for Learning Analytics. Retrieved *current date* from https://www.pcla.wiki/index.php/Algorithmic_Bias_in_Education

== By Group Impacted ==
* Race and Ethnicity
** [[Black/African-American Learners in North America]]
** [[Latino/Latina/Latinx/Hispanic Learners in North America]]
** [[Asian/Asian-American Learners in North America]]
** [[White Learners in North America]]
** [[Indigenous Learners in North America]]
** [[Research on Race and Ethnicity Conducted Outside of North America]]
* [[Gender: Male/Female]]
* [[Gender: Non-Binary and Transgender Learners]]
* [[Sexual Orientation]]
* [[National Origin or National Location]]
* [[International Students]]
* [[Native Language and Dialect]]
* [[Learners with Disabilities]]
* [[Age]]
* [[Urbanicity]]
* [[Parental Educational Background]]
* [[Socioeconomic Status]]
* [[Military-Connected Status]]
* [[Children of Migrant Workers]]
* [[Religion and Religious Background]]
* [[Public or Private K-12 School]]
* [[Intersectional Research]]

== By Algorithm Application ==
* [[At-risk/Dropout/Stopout/Graduation Prediction]]
* [[Course Grade and GPA Prediction]]
* [[National and International Examination]]
* [[Short-term Performance and Learning Gains Prediction]]
* [[Automated Essay Scoring]]
* [[Speech Recognition for Education]]
* [[Other NLP Applications of Algorithms in Education]]
* [[Student Knowledge Modeling]]
* [[Engagement and Affect Detection]]
* [[Self-regulated Learning]]
* [[Task/Activity Quit Prediction]]
* [[Social Network Link Prediction]]

= National and International Examination =

Baker et al. (2020) [https://www.upenn.edu/learninganalytics/ryanbaker/BakerBerningGowda.pdf pdf]
* Model predicting student graduation and SAT scores for military-connected students
* For prediction of graduation, applying algorithms across populations resulted in an AUC of 0.60, degrading from their original AUC of 0.70 or 0.71 substantially toward chance
* For prediction of SAT scores, applying algorithms across populations resulted in Spearman's ρ of 0.42 and 0.44, degrading by about a third from their original performance toward chance

Li et al. (2021) [https://arxiv.org/pdf/2103.15212.pdf pdf]
* Model predicting student achievement on the standardized examination PISA
* Inaccuracy of the U.S.-trained model was greater for students from countries with lower scores of national development (e.g. Indonesia, Vietnam, Moldova)

Sulaiman & Roy (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Sulaiman_Transformers.pdf pdf]
* Models predicting whether a law student will pass the bar exam (to practice law)
* Compared White and non-White students
* Models not applying fairness constraints performed significantly worse for White students in terms of ABROCA
* Models applying fairness constraints performed equivalently for White and non-White students

= Student Knowledge Modeling =

Yudelson et al. (2014) [https://www.yudelson.info/pdf/EDM2014_YudelsonFRBNJ.pdf pdf]
* Models discovering generalizable sub-populations of students across different schools to predict students' learning with Carnegie Learning's Cognitive Tutor (CLCT)
* Models trained on schools with a high proportion of low-SES students performed worse than those trained on schools with a medium or low proportion
* Models trained on schools with low or medium proportions of low-SES students performed similarly well for schools with high proportions of low-SES students

= Learners with Disabilities =

Loukina & Buzick (2017) [https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/ets2.12170 pdf]
* A model (the SpeechRater) automatically scoring open-ended spoken responses for speakers with documented or suspected speech impairments
* SpeechRater was less accurate for test takers who were deferred for signs of speech impairment (ρ<sup>2</sup> = .57) than for test takers who were given accommodations for documented disabilities (ρ<sup>2</sup> = .73)

Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]
* Models predicting course outcome of students in a virtual learning environment (VLE)
* Disparate impact was found for students with self-declared disabilities, with systematic inaccuracies in predictions for learners in this group

= Other NLP Applications of Algorithms in Education =

Naismith et al. (2018) [http://d-scholarship.pitt.edu/40665/1/EDM2018_paper_37.pdf pdf]
* A model measuring L2 learners' lexical sophistication using frequency lists based on native-speaker corpora
* Arabic-speaking learners are rated systematically lower across all levels of English proficiency than speakers of Chinese, Japanese, Korean, and Spanish
* Level 5 Arabic-speaking learners are unfairly evaluated as having a similar level of lexical sophistication to Level 4 learners from China, Japan, Korea, and Spain
* When used on the ETS corpus, "high"-labeled essays by Japanese-speaking learners are rated significantly lower in lexical sophistication than those of their Arabic-, Chinese-, Korean-, and Spanish-speaking peers

Samei et al. (2015) [https://files.eric.ed.gov/fulltext/ED560879.pdf pdf]
* Models predicting classroom discourse properties (e.g. authenticity and uptake)
* Model trained on urban students (authenticity: 0.62, uptake: 0.60) performed with similar accuracy when tested on non-urban students (authenticity: 0.62, uptake: 0.62)
* Model trained on non-urban students (authenticity: 0.61, uptake: 0.59) performed with similar accuracy when tested on urban students (authenticity: 0.60, uptake: 0.63)

Sha et al. (2021) [https://angusglchen.github.io/files/AIED2021_Lele_Assessing.pdf pdf]
* Models predicting whether a MOOC discussion forum post is content-relevant or content-irrelevant
* MOOCs taught in English
* Some algorithms achieved ABROCA under 0.01 for female students versus male students, but other algorithms (Naive Bayes) had ABROCA as high as 0.06
* ABROCA varied from 0.03 to 0.08 for non-native speakers of English versus native speakers
* Balancing the size of each group in the training set reduced ABROCA values

Sha et al. (2022) [https://ieeexplore.ieee.org/abstract/document/9849852]
* Predicting forum post relevance to course in Moodle data (neural network)
* A range of over-sampling methods was tested
* Regardless of the over-sampling method used, forum post relevance prediction was moderately better for female students
<hr />
<div>Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved much lower Kappa and AUC for Black students than White students<br />
* J48 decision trees achieved significantly lower Kappa but higher AUC for male students than female students<br />
* JRip decision rules achieved almost identical Kappa and AUC for Black students and White students<br />
* JRip decision trees achieved much lower Kappa and AUC for male students than female students<br />
<br />
<br />
Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]<br />
* Models predicting if a college student will fail in a course<br />
* Multiple cooperative classifier model (MCCM) model was the best at reducing bias, or discrimination against African-American students, while other models (particularly Logistic Regression and Rawlsian Fairness) performed far worse<br />
* The level of bias was inconsistent across courses, with MCCM prediction showing the least bias for Psychology and the greatest bias for Computer Science<br />
* Multiple cooperative classifier model (MCCM) model was the best at reducing bias, or discrimination against male students, performing particularly better for Psychology course.<br />
* Other models (Logistic Regression and Rawlsian Fairness) performed far worse for male students, performing particularly worse in Computer Science and Electrical Engineering.<br />
<br />
<br />
Anderson et al. (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM2019_paper56.pdf pdf]<br />
* Models predicting six-year college graduation<br />
* False negatives rates were greater for Latino students when Decision Tree and Random Forest yielded was used<br />
* White students had higher false positive rates across all models, Decision Tree, SVM, Logistic Regression, Random Forest, and SGD<br />
* False negatives rates were greater for male students than female students when SVM, Logistic Regression, and SGD were used<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting student's high school dropout<br />
* The decision trees showed little difference in AUC among White, Black, Hispanic, Asian, American Indian and Alaska Native, and Native Hawaiian and Pacific Islander.<br />
* The decision trees showed very minor differences in AUC between female and male students<br />
<br />
<br />
Gardner, Brooks and Baker (2019) [[https://www.upenn.edu/learninganalytics/ryanbaker/LAK_PAPER97_CAMERA.pdf pdf]]<br />
* Model predicting MOOC dropout, specifically through slicing analysis<br />
* Some algorithms performed worse for female students than male students, particularly in courses with 45% or less male presence<br />
<br />
<br />
Baker et al. (2020) [[https://www.upenn.edu/learninganalytics/ryanbaker/BakerBerningGowda.pdf pdf]]<br />
* Model predicting student graduation and SAT scores for military-connected students<br />
* For prediction of graduation, algorithms applying across population resulted an AUC of 0.60, degrading from their original performance of 70% or 71% to chance.<br />
* For prediction of SAT scores, algorithms applying across population resulted in a Spearman's ρ of 0.42 and 0.44, degrading a third from their original performance to chance.<br />
<br />
<br />
Kai et al. (2017) [https://files.eric.ed.gov/fulltext/ED596601.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J-48 decision trees achieved much higher Kappa and AUC for students whose parents did not attend college than those whose parents did<br />
* J-Rip decision rules achieved much higher Kappa and AUC for students whose parents did not attended college than those whose parents did<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online program<br />
* The model showed better recall for students who are under-represented minority (URM; not White or Asian), male, first-generation, or with greater financial needs<br />
* Whether the socio-demographic information was included or not, the model showed worse accuracy and true negative rates for residential students who are under-represented minority (URM; not White or Asian), male, first-generation, or with greater financial needs<br />
* Both accuracy and true negative rates were better for students who are first-generation, or with greater financial needs<br />
<br />
Verdugo et al. (2022) [https://dl.acm.org/doi/abs/10.1145/3506860.3506902 pdf]<br />
* An algorithm predicting dropout from university after the first year<br />
* Several algorithms achieved better AUC and F1 for students who attended public high schools than for students who attended private high schools.<br />
* Several algorithms predicted better AUC for male students than female students; F1 scores were more balanced.<br />
<br />
Sha et al. (2022) [https://ieeexplore.ieee.org/abstract/document/9849852]<br />
* Predicting dropout in XuetangX platform using neural network<br />
* A range of over-sampling methods tested<br />
* Regardless of over-sampling method used, dropout performance was slightly better for males.</div>Ryanhttps://www.pcla.wiki/index.php?title=Course_Grade_and_GPA_Prediction&diff=425Course Grade and GPA Prediction2022-08-31T20:23:27Z<p>Ryan: added Sha et al 2022</p>
<hr />
<div>Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
<br />
* Models predicting college success (or median grade or above)<br />
*Random forest algorithms performed significantly worse for underrepresented minority students (URM; American Indian, Black, Hawaiian or Pacific Islander, Hispanic, and Multicultural) than non-URM students (White and Asian), for male students than female students<br />
*Random forest algorithms performed significantly worse for male students than female students<br />
* The fairness of the model, namely demographic parity and equality of opportunity, as well as its accuracy, improved after correcting the threshold values from 0.5 to group-specific values<br /><br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
<br />
* Models predicting undergraduate course grades and average GPA<br />
<br />
* Students who are international, first-generation, or from low-income households were inaccurately predicted to get lower course grade and average GPA than their peer, and fairness of models improved with the inclusion of clickstream and survey data<br />
*Female students were inaccurately predicted to achieve greater short-term and long-term success than male students, and fairness of models improved when a combination of institutional and click data was used in the model<br />
<br />
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
<br />
* Models predicting course outcome of students in a virtual learning environment (VLE)<br />
* More male students were predicted to pass the course than female students, but this overestimation was fairly small and not consistent across different algorithms<br />
* Among the algorithms, Naive Bayes had the lowest normalized mutual information value and the highest ABROCA value (the Absolute Between-ROC Area, i.e., the area between the groups' ROC curves; see the sketch below)<br />
* Students with self-declared disability were predicted to pass the course more often<br />
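A sketch of how an ABROCA value can be computed: each group's ROC curve is interpolated onto a common false-positive-rate grid, and the absolute gap between the curves is integrated. The function and variable names are illustrative:<br />
<syntaxhighlight lang="python">
import numpy as np
from sklearn.metrics import roc_curve

def abroca(y_true, y_prob, group, g1, g2, grid_size=1000):
    """Absolute Between-ROC Area: area between two groups' ROC curves."""
    fpr_grid = np.linspace(0, 1, grid_size)
    tprs = []
    for g in (g1, g2):
        m = group == g
        fpr, tpr, _ = roc_curve(y_true[m], y_prob[m])
        tprs.append(np.interp(fpr_grid, fpr, tpr))
    return np.trapz(np.abs(tprs[0] - tprs[1]), fpr_grid)

# Usage (illustrative): abroca(y_true, y_prob, gender, "female", "male")
</syntaxhighlight>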
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups<br />
* Slightly better accuracy (~1%) across racial groups when including race in model<br />
<br />
<br />
Kung & Yu (2020)<br />
[https://dl.acm.org/doi/pdf/10.1145/3386527.3406755 pdf]<br />
* Predicting course grades and later GPA at public U.S. university<br />
* Five algorithms and three fairness metrics (independence, separation, sufficiency) analyzed; a sketch of these criteria follows below this entry<br />
* Poorer performance for Latinx students on course grade prediction for all three metrics; poorer performance for Latinx students on GPA prediction in terms of independence and sufficiency, but not separation<br />
* Poorer performance for first-generation students on course grade prediction for independence and separation, and for some algorithms for GPA prediction as well<br />
* Poorer performance for low-income students in several cases, about 1/3 of cases checked<br />
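A sketch of the three criteria as simple gap measures for a binary predictor: independence compares positive-prediction rates, separation compares error rates given the true outcome, and sufficiency compares precision given a positive prediction. All names and data are illustrative, and each subgroup is assumed to contain both outcomes and some positive predictions:<br />
<syntaxhighlight lang="python">
import numpy as np

def fairness_gaps(y_true, y_pred, group, g1, g2):
    m1, m2 = group == g1, group == g2
    pos_rate = lambda m: y_pred[m].mean()              # P(pred = 1) in subgroup
    independence = abs(pos_rate(m1) - pos_rate(m2))
    separation = max(
        abs(pos_rate(m1 & (y_true == 1)) - pos_rate(m2 & (y_true == 1))),  # TPR gap
        abs(pos_rate(m1 & (y_true == 0)) - pos_rate(m2 & (y_true == 0))),  # FPR gap
    )
    prec = lambda m: y_true[m & (y_pred == 1)].mean()  # P(y = 1 | pred = 1)
    sufficiency = abs(prec(m1) - prec(m2))
    return independence, separation, sufficiency

y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # illustrative outcomes
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 1])  # illustrative predictions
group = np.array(["A"] * 4 + ["B"] * 4)
print(fairness_gaps(y_true, y_pred, group, "A", "B"))
</syntaxhighlight>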
<br />
<br />
Jeong et al. (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Jeong_Racial_Bias_ML_Algs.pdf]<br />
* Predicting 9th grade math score from academic performance, surveys, and demographic information<br />
* Despite comparable accuracy, model tends to overpredict Asian and White students' performance, and underpredict Black, Hispanic, and Native American students' performance<br />
* Several fairness correction methods equalize false positive and false negative rates across groups (a post-processing sketch follows below).<br />
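A sketch, assuming the fairlearn package, of one such post-processing correction: ThresholdOptimizer with an equalized-odds constraint adjusts decision thresholds so that false positive and true positive rates are approximately equal across groups. The data and group labels are synthetic:<br />
<syntaxhighlight lang="python">
import numpy as np
from fairlearn.postprocessing import ThresholdOptimizer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                         # synthetic features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)  # synthetic outcome
race = rng.choice(["group A", "group B"], size=200)   # synthetic attribute

clf = LogisticRegression().fit(X, y)
fair_clf = ThresholdOptimizer(estimator=clf,
                              constraints="equalized_odds", prefit=True)
fair_clf.fit(X, y, sensitive_features=race)
y_fair = fair_clf.predict(X, sensitive_features=race)
</syntaxhighlight>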
<br />
<br />
Sha et al. (2022) [https://ieeexplore.ieee.org/abstract/document/9849852]<br />
* Predicting course pass/fail with a random forest on Open University data<br />
* A range of over-sampling methods was tested<br />
* Regardless of the over-sampling method used, course pass/fail performance was moderately better for males</div>Ryanhttps://www.pcla.wiki/index.php?title=Gender:_Male/Female&diff=424Gender: Male/Female2022-08-31T20:22:22Z<p>Ryan: added Sha et al 2022</p>
<hr />
<div>Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved significantly lower Kappa but higher AUC for male students than female students<br />
* JRip decision rules achieved much lower Kappa and AUC for male students than female students<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting high school dropout<br />
* The decision trees showed very minor differences in AUC between female and male students<br />
<br />
<br />
Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]<br />
* Models predicting if a college student will fail in a course<br />
* The multiple cooperative classifier model (MCCM) was the best at reducing bias, or discrimination against male students, performing particularly well for the Psychology course.<br />
* Other models (Logistic Regression and Rawlsian Fairness) performed far worse for male students, particularly in Computer Science and Electrical Engineering.<br />
<br />
<br />
Anderson et al. (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM2019_paper56.pdf pdf]<br />
* Models predicting six-year college graduation<br />
* False negative rates were greater for male students than female students when SVM, Logistic Regression, and SGD were used<br />
<br />
<br />
Gardner, Brooks and Baker (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/LAK_PAPER97_CAMERA.pdf pdf]<br />
* Model predicting MOOC dropout, specifically through slicing analysis<br />
* Some algorithms studied performed worse for female students than male students, particularly in courses with 45% or less male presence<br />
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
* Model predicting course outcome<br />
* Marginal differences were found between groups in prediction quality and in the overall proportion of students predicted to pass<br />
* Differences were inconsistent in direction across algorithms.<br />
<br />
<br />
Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
* Models predicting college success (or median grade or above)<br />
* Random forest algorithms performed significantly worse for male students than female students<br />
* The fairness of the model, namely demographic parity and equality of opportunity, as well as its accuracy, improved after correcting the threshold values from 0.5 to group-specific values<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Model predicting undergraduate short-term (course grades) and long-term (average GPA) success<br />
* Female students were inaccurately predicted to achieve greater short-term and long-term success than male students.<br />
* The fairness of models improved when a combination of institutional and click data was used in the model<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online programs<br />
* Whether or not socio-demographic information was included, the model showed worse true negative rates and worse accuracy for male students<br />
* The model showed better recall for male students, especially for those studying in person<br />
* The differences in recall and true negative rates were smaller, and thus fairer, for male students studying online if their socio-demographic information was not included in the model<br />
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
* Models predicting course outcome of students in a virtual learning environment (VLE)<br />
* More male students were predicted to pass the course than female students, but this overestimation was fairly small and not consistent across different algorithms<br />
* Among the algorithms, Naive Bayes had the lowest normalized mutual information value and the highest ABROCA value<br />
<br />
<br />
Bridgeman et al. (2009) <br />
[https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
<br />
* Automated scoring models for evaluating English essays, or e-rater<br />
* E-Rater system performed comparably accurately for male and female students when assessing their 11th grade essays<br />
<br />
<br />
<br />
Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true pdf]<br />
* A later version of automated scoring models for evaluating English essays, or e-rater<br />
* E-Rater system correlated comparably well with human rater when assessing TOEFL and GRE essays written by male and female students<br />
<br />
<br />
Verdugo et al. (2022) [https://dl.acm.org/doi/abs/10.1145/3506860.3506902 pdf]<br />
* Algorithms predicting dropout from university after the first year<br />
* Several algorithms achieved better AUC for male than female students; results were mixed for F1.<br />
<br />
<br />
Zhang et al. (in press)<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across gender groups. <br />
* No gender group consistently had best-performing detectors<br />
<br />
<br />
Rzepka et al. (2022) [https://www.insticc.org/node/TechnicalProgram/CSEDU/2022/presentationDetails/109621 pdf]<br />
* Models predicting whether student will quit spelling learning activity without completing<br />
* Multiple algorithms had slightly better false positive rates and ROC AUC for male students than female students, but equivalent performance on multiple other metrics.<br />
<br />
<br />
Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for female versus male students<br />
<br />
<br />
Sha et al. (2021) [https://angusglchen.github.io/files/AIED2021_Lele_Assessing.pdf pdf]<br />
* Models predicting whether a MOOC discussion forum post is content-relevant or content-irrelevant<br />
* Some algorithms achieved ABROCA under 0.01 for female students versus male students, but other algorithms (Naive Bayes) had ABROCA as high as 0.06<br />
* Balancing the size of each group in the training set reduced ABROCA (a sketch of this balancing step follows below)<br />
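A sketch of the group-balancing step with plain numpy: rows from the smaller gender group are randomly re-sampled (with replacement) until both groups are equally represented in the training data. All data are illustrative:<br />
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                       # illustrative features
y = rng.integers(0, 2, size=100)                    # illustrative labels
group = np.array(["female"] * 20 + ["male"] * 80)   # imbalanced groups

minority = group == "female"
n_extra = (~minority).sum() - minority.sum()        # rows needed to balance
extra = rng.choice(np.where(minority)[0], size=n_extra, replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
group_bal = np.concatenate([group, group[extra]])
</syntaxhighlight>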
<br />
<br />
Litman et al. (2021) [https://link.springer.com/chapter/10.1007/978-3-030-78292-4_21 html]<br />
* Automated essay scoring models inferring text evidence usage<br />
* All algorithms studied have less than 1% of error explained by whether the student is female or male<br />
<br />
Sha et al. (2022) [https://ieeexplore.ieee.org/abstract/document/9849852]<br />
* Three data sets and algorithms: predicting course pass/fail (random forest), dropout (neural network), and forum post relevance (neural network)<br />
* A range of over-sampling methods was tested<br />
* Regardless of the over-sampling method used, course pass/fail performance was moderately better for males, dropout performance was slightly better for males, and forum post relevance performance was moderately better for females.</div>Ryanhttps://www.pcla.wiki/index.php?title=Algorithmic_Bias_in_Education&diff=423Algorithmic Bias in Education2022-08-05T11:42:19Z<p>Ryan: </p>
<hr />
<div>== Empirical Evidence for Algorithmic Bias in Education: The Wiki ==<br />
<br />
This Wiki summarizes the current peer-reviewed published evidence surrounding Algorithmic Bias in Education:<br />
which groups are impacted, and in which contexts.<br />
<br />
For a relatively recent review on this topic, see <br />
Baker, R.S., Hawn, M.A. (in press) Algorithmic Bias in Education. To appear in <em>International Journal of Artificial Intelligence and Education</em><br />
([https://www.upenn.edu/learninganalytics/ryanbaker/AlgorithmicBiasInEducation_rsb3.7.pdf pdf])<br />
<br />
Note: Within this Wiki, we recommend that page editors use the group labels originally<br />
used within the publications being cited, to best represent the articles included here. We also request that each page center the members of the group the page is about.<br />
<br />
This wiki can be cited as<br />
Penn Center for Learning Analytics (*current year*) Empirical Evidence for Algorithmic Bias in Education: The Wiki. <br />
Philadelphia, PA: Penn Center for Learning Analytics. <br />
Retrieved *current date* from https://www.pcla.wiki/index.php/Algorithmic_Bias_in_Education <br />
<br />
== By Group Impacted ==<br />
* Race and Ethnicity<br />
** [[Black/African-American Learners in North America]]<br />
** [[Latino/Latina/Latinx/Hispanic Learners in North America]]<br />
** [[Asian/Asian-American Learners in North America]]<br />
** [[White Learners in North America]]<br />
** [[Indigenous Learners in North America]]<br />
** [[Research on Race and Ethnicity Conducted Outside of North America]] <br />
* [[Gender: Male/Female]]<br />
* [[Gender: Non-Binary and Transgender Learners]]<br />
* [[Sexual Orientation]]<br />
* [[Linguistic Origin]]<br />
* [[National Origin or National Location]]<br />
* [[International Students]]<br />
* [[Native Language and Dialect]]<br />
* [[Learners with Disabilities]]<br />
* [[Age]]<br />
* [[Urbanicity]]<br />
* [[Parental Educational Background]]<br />
* [[Socioeconomic Status]]<br />
* [[Military-Connected Status]]<br />
* [[Children of Migrant Workers]]<br />
* [[Religion and Religious Background]]<br />
* [[Public or Private K-12 School]]<br />
* [[Intersectional Research]]<br />
<br />
== By Algorithm Application == <br />
* [[At-risk/Dropout/Stopout/Graduation Prediction]]<br />
* [[Course Grade and GPA Prediction]]<br />
*[[National and International Examination]]<br />
* [[Short-term Performance and Learning Gains Prediction]]<br />
* [[Automated Essay Scoring]]<br />
* [[Speech Recognition for Education]]<br />
* [[Other NLP Applications of Algorithms in Education]]<br />
* [[Student Knowledge Modeling]]<br />
* [[Engagement and Affect Detection]]<br />
*[[Self-regulated Learning]]<br />
* [[Task/Activity Quit Prediction]]<br />
* [[Social Network Link Prediction]]</div>Ryanhttps://www.pcla.wiki/index.php?title=Course_Grade_and_GPA_Prediction&diff=422Course Grade and GPA Prediction2022-08-04T20:06:32Z<p>Ryan: Added Jeong et al (2022)</p>
<hr />
<div>Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
<br />
* Models predicting college success (or median grade or above)<br />
* Random forest algorithms performed significantly worse for underrepresented minority students (URM; American Indian, Black, Hawaiian or Pacific Islander, Hispanic, and Multicultural) than non-URM students (White and Asian)<br />
* Random forest algorithms performed significantly worse for male students than female students<br />
* The fairness of the model, namely demographic parity and equality of opportunity, as well as its accuracy, improved after correcting the threshold values from 0.5 to group-specific values<br /><br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
<br />
* Models predicting undergraduate course grades and average GPA<br />
<br />
* Students who are international, first-generation, or from low-income households were inaccurately predicted to get lower course grades and average GPA than their peers; fairness of the models improved with the inclusion of clickstream and survey data<br />
* Female students were inaccurately predicted to achieve greater short-term and long-term success than male students; fairness of the models improved when a combination of institutional and click data was used in the model<br />
<br />
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
<br />
* Models predicting course outcome of students in a virtual learning environment (VLE)<br />
* More male students were predicted to pass the course than female students, but this overestimation was fairly small and not consistent across different algorithms<br />
* Among the algorithms, Naive Bayes had the lowest normalized mutual information value and the highest ABROCA value (the Absolute Between-ROC Area, i.e., the area between the groups' ROC curves)<br />
* Students with self-declared disability were predicted to pass the course more often<br />
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups<br />
* Slightly better accuracy (~1%) across racial groups when including race in model<br />
<br />
<br />
Kung & Yu (2020)<br />
[https://dl.acm.org/doi/pdf/10.1145/3386527.3406755 pdf]<br />
* Predicting course grades and later GPA at public U.S. university<br />
* Five algorithms and three metrics (independence, separation, sufficiency) analyzed<br />
* Poorer performance for Latinx students on course grade prediction for all three metrics; poorer performance for Latinx students on GPA prediction in terms of independence and sufficiency, but not separation<br />
* Poorer performance for first-generation students on course grade prediction for independence and separation, and for some algorithms for GPA prediction as well<br />
* Poorer performance for low-income students in several cases, about 1/3 of cases checked<br />
<br />
<br />
Jeong et al. (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Jeong_Racial_Bias_ML_Algs.pdf]<br />
* Predicting 9th grade math score from academic performance, surveys, and demographic information<br />
* Despite comparable accuracy, model tends to overpredict Asian and White students' performance, and underpredict Black, Hispanic, and Native American students' performance<br />
* Several fairness correction methods equalize false positive and false negative rates across groups.</div>Ryanhttps://www.pcla.wiki/index.php?title=Indigenous_Learners_in_North_America&diff=421Indigenous Learners in North America2022-08-04T20:05:54Z<p>Ryan: Added Jeong et al (2022)</p>
<hr />
<div>Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
*Models predicting college success (or median grade or above)<br />
*Random forest algorithms performed significantly worse for underrepresented minority students (URM; American Indian, Black, Hawaiian or Pacific Islander, Hispanic, and Multicultural) than non-URM students (White and Asian)<br />
*The fairness of the model, namely demographic parity and equality of opportunity, as well as its accuracy, improved after correcting the threshold values from 0.5 to group-specific values<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
*Models predicting high school dropout<br />
*The decision trees showed little difference in AUC among American Indian and Alaska Native, White, Black, Hispanic, Asian, and Native Hawaiian and Pacific Islander.<br />
*The decision trees showed very minor differences in AUC between female and male students<br />
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups (including Native American and Pacific Islander students)<br />
* Slightly better accuracy (~1%) across racial groups when including race in model<br />
<br />
<br />
Jeong et al. (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Jeong_Racial_Bias_ML_Algs.pdf]<br />
* Predicting 9th grade math score from academic performance, surveys, and demographic information<br />
* Despite comparable accuracy, model tends to underpredict Native American students' performance<br />
* Several fairness correction methods equalize false positive and false negative rates across groups.</div>Ryanhttps://www.pcla.wiki/index.php?title=White_Learners_in_North_America&diff=420White Learners in North America2022-08-04T20:05:27Z<p>Ryan: Added Jeong et al (2022)</p>
<hr />
<div>Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
* Automated scoring models for evaluating English essays, or e-rater <br />
* The score difference between human rater and e-rater was significantly smaller for 11th grade essays written by White and African American students than for other groups<br />
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups<br />
* Slightly better accuracy (~1%) across racial groups when including race in model<br />
<br />
Zhang et al. (in press) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM22_paper_35.pdf]<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across racial/ethnic groups. <br />
* No racial/ethnic group consistently had best-performing detectors<br />
<br />
<br />
Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Compared members of overrepresented racial groups to members of underrepresented racial groups (overrepresented group approximately 90% White)<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for overrepresented versus underrepresented students<br />
<br />
<br />
Sulaiman & Roy (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Sulaiman_Transformers.pdf]<br />
* Models predicting whether a law student will pass the bar exam (to practice law)<br />
* Compared White and non-White students<br />
* Models not applying fairness constraints performed significantly worse for White students in terms of ABROCA<br />
* Models applying fairness constraints performed equivalently for White and non-White students<br />
<br />
<br />
Jeong et al. (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Jeong_Racial_Bias_ML_Algs.pdf]<br />
* Predicting 9th grade math score from academic performance, surveys, and demographic information<br />
* Despite comparable accuracy, model tends to overpredict White students' performance<br />
* Several fairness correction methods equalize false positive and false negative rates across groups.</div>Ryanhttps://www.pcla.wiki/index.php?title=Latino/Latina/Latinx/Hispanic_Learners_in_North_America&diff=419Latino/Latina/Latinx/Hispanic Learners in North America2022-08-04T20:05:05Z<p>Ryan: Added Jeong et al (2022)</p>
<hr />
<div>Anderson et al. (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM2019_paper56.pdf pdf]<br />
* Models predicting six-year college graduation<br />
* False negative rates were greater for Latino students when Decision Tree and Random Forest models were used<br />
* White students had higher false positive rates across all models: Decision Tree, SVM, Logistic Regression, Random Forest, and SGD<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting high school dropout<br />
* The decision trees showed little difference in AUC among Hispanic, White, Black, Asian, American Indian and Alaska Native, and Native Hawaiian and Pacific Islander.<br />
<br />
<br />
Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
* Models predicting college success (or median grade or above)<br />
* Random forest algorithms performed significantly worse for underrepresented minority students (URM; Hispanic, American Indian, Black, Hawaiian or Pacific Islander, and Multicultural) than non-URM students (White and Asian)<br />
* The fairness of the model, namely demographic parity and equality of opportunity, as well as its accuracy, improved after correcting the threshold values from 0.5 to group-specific values<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Model predicting undergraduate short-term (course grades) and long-term (average GPA) success<br />
* Hispanic students were inaccurately predicted to perform worse in both the short term and long term<br />
* The fairness of models improved when either click or a combination of click and survey data, and not institutional data, was included in the model<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online program<br />
* Whether or not socio-demographic information was included, the model showed worse true negative rates for students who are underrepresented minorities (URM; not White or Asian), and worse accuracy for URM students studying in person<br />
* The model showed better recall for URM students, whether they were in residential or online program<br />
<br />
<br />
Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring page]<br />
* Automated scoring models for evaluating English essays, or e-rater<br />
* E-Rater gave significantly better scores than human rater for 11th grade essays written by Hispanic students and Asian-American students<br />
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups<br />
* Slightly better accuracy (~1%) across racial groups when including race in model<br />
<br />
<br />
Zhang et al. (in press) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM22_paper_35.pdf pdf]<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across racial/ethnic groups. <br />
* No racial/ethnic group consistently had best-performing detectors<br />
<br />
<br />
Kung & Yu (2020)<br />
[https://dl.acm.org/doi/pdf/10.1145/3386527.3406755 pdf]<br />
* Predicting course grades and later GPA at public U.S. university<br />
* Poorer independence, separation, and sufficiency for Latinx students than White students for five different classic machine learning algorithms<br />
<br />
<br />
Jeong et al. (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Jeong_Racial_Bias_ML_Algs.pdf]<br />
* Predicting 9th grade math score from academic performance, surveys, and demographic information<br />
* Despite comparable accuracy, model tends to underpredict Hispanic students' performance<br />
* Several fairness correction methods equalize false positive and false negative rates across groups.</div>Ryanhttps://www.pcla.wiki/index.php?title=Black/African-American_Learners_in_North_America&diff=418Black/African-American Learners in North America2022-08-04T20:04:39Z<p>Ryan: Added Jeong et al (2022)</p>
<hr />
<div>Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved much lower Kappa and AUC for Black students than White students<br />
* JRip decision rules achieved almost identical Kappa and AUC for Black students and White students<br />
<br />
<br />
Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]<br />
* Models predicting if a college student will fail in a course<br />
* Multiple cooperative classifier model (MCCM) model was the best at reducing bias, or discrimination against African-American students, while other models (particularly Logistic Regression and Rawlsian Fairness) performed far worse<br />
* The level of bias was inconsistent across courses, with MCCM prediction showing the least bias for Psychology and the greatest bias for Computer Science<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting high school dropout<br />
* The decision trees showed little difference in AUC among Black, White, Hispanic, Asian, American Indian and Alaska Native, and Native Hawaiian and Pacific Islander.<br />
<br />
<br />
Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
* Models predicting college success (or median grade or above)<br />
* Random forest algorithms performed significantly worse for underrepresented minority students (URM; Black, American Indian, Hawaiian or Pacific Islander, Hispanic, and Multicultural) than non-URM students (White and Asian)<br />
* The fairness of the model, namely demographic parity and equality of opportunity, as well as its accuracy, improved after correcting the threshold values from 0.5 to group-specific values<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Model predicting undergraduate short-term (course grades) and long-term (average GPA) success<br />
* Black students were inaccurately predicted to perform worse in both the short term and long term<br />
* The fairness of models improved when either click or a combination of click and survey data, and not institutional data, was included in the model<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online program<br />
* Whether or not socio-demographic information was included, the model showed worse true negative rates for students who are underrepresented minorities (URM; not White or Asian), and worse accuracy for URM students studying in person<br />
* The model showed better recall for URM students, whether they were in residential or online program<br />
<br />
<br />
Ramineni & Williamson (2018) [https://files.eric.ed.gov/fulltext/EJ1202928.pdf pdf]<br />
* Revised automated scoring engine for assessing GRE essay<br />
* E-rater gave African American test-takers significantly lower scores than human raters when assessing their written responses to argument prompts<br />
* The shorter essays written by African American test-takers were more likely to receive lower scores as showing weakness in content and organization<br />
<br />
<br />
<br />
Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
* Automated scoring models for evaluating English essays, or e-rater <br />
* The score difference between human rater and e-rater was significantly smaller for 11th grade essays written by African American and White students<br />
<br />
<br />
<br />
Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502 pdf]<br />
* A later version of automated scoring models for evaluating English essays, or e-rater<br />
* E-rater gave a significantly lower score than human raters when assessing African-American students' written responses to the issue prompt in the GRE<br />
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups<br />
* Slightly better accuracy (~1%) across racial groups when including race in model<br />
<br />
<br />
Zhang et al. (in press) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM22_paper_35.pdf pdf]<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across racial/ethnic groups. <br />
* No racial/ethnic group consistently had best-performing detectors<br />
<br />
<br />
Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Compared members of overrepresented racial groups to members of underrepresented racial groups (over 2/3 Black/African American)<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for overrepresented versus underrepresented students<br />
<br />
<br />
Litman et al. (2021) [https://link.springer.com/chapter/10.1007/978-3-030-78292-4_21 html]<br />
* Automated essay scoring models inferring text evidence usage<br />
* All algorithms studied have less than 1% of error explained by whether student is Black<br />
<br />
<br />
Jeong et al. (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Jeong_Racial_Bias_ML_Algs.pdf]<br />
* Predicting 9th grade math score from academic performance, surveys, and demographic information<br />
* Despite comparable accuracy, model tends to underpredict Black students' performance<br />
* Several fairness correction methods equalize false positive and false negative rates across groups.</div>Ryanhttps://www.pcla.wiki/index.php?title=Asian/Asian-American_Learners_in_North_America&diff=417Asian/Asian-American Learners in North America2022-08-04T20:03:24Z<p>Ryan: Added Jeong et al (2022)</p>
<hr />
<div>Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting high school dropout<br />
* The decision trees showed little difference in AUC among Asian, White, Black, Hispanic, American Indian and Alaska Native, and Native Hawaiian and Pacific Islander.<br />
<br />
<br />
Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring page]<br />
* Automated scoring models for evaluating English essays, or e-rater <br />
* E-Rater gave significantly better scores than human rater for 11th grade essays written by Hispanic students and Asian-American students<br />
<br />
<br />
<br />
Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
* Models predicting college success (or median grade or above)<br />
* Random forest algorithms performed significantly better for non-URM students (Asian and White) than for underrepresented minority students (URM; American Indian, Black, Hawaiian or Pacific Islander, Hispanic, and Multicultural)<br />
* The fairness of the model, namely demographic parity and equality of opportunity, as well as its accuracy, improved after correcting the threshold values from 0.5 to group-specific values<br />
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups<br />
* Slightly better accuracy (~1%) across racial groups when including race in model<br />
<br />
<br />
Jeong et al. (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Jeong_Racial_Bias_ML_Algs.pdf]<br />
* Predicting 9th grade math score from academic performance, surveys, and demographic information<br />
* Despite comparable accuracy, model tends to overpredict Asian students' performance<br />
* Several fairness correction methods equalize false positive and false negative rates across groups.</div>Ryanhttps://www.pcla.wiki/index.php?title=National_and_International_Examination&diff=416National and International Examination2022-08-04T19:58:07Z<p>Ryan: Added Sulaiman & Roy</p>
<hr />
<div>Baker et al. (2020) [https://www.upenn.edu/learninganalytics/ryanbaker/BakerBerningGowda.pdf pdf]<br />
<br />
* Model predicting student graduation and SAT scores for military-connected students<br />
* For prediction of graduation, algorithms applied across populations resulted in an AUC of 0.60, degrading from their original performance of 0.70 or 0.71 toward chance.<br />
* For prediction of SAT scores, algorithms applied across populations resulted in Spearman's ρ values of 0.42 and 0.44, degrading by roughly a third from their original performance toward chance.<br />
<br />
<br />
Li et al. (2021) [https://arxiv.org/pdf/2103.15212.pdf pdf]<br />
*Model predicting student achievement on the standardized examination PISA<br />
*Inaccuracy of the U.S.-trained model was greater for students from countries with lower national development scores (e.g., Indonesia, Vietnam, Moldova)<br />
<br />
<br />
Sulaiman & Roy (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Sulaiman_Transformers.pdf]<br />
* Models predicting whether a law student will pass the bar exam (to practice law)<br />
* Compared White and non-White students<br />
* Models not applying fairness constraints performed significantly worse for White students in terms of ABROCA<br />
* Models applying fairness constraints performed equivalently for White and non-White students</div>Ryanhttps://www.pcla.wiki/index.php?title=White_Learners_in_North_America&diff=415White Learners in North America2022-08-04T19:57:29Z<p>Ryan: Added Sulaiman & Roy</p>
<hr />
<div>Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
* Automated scoring models for evaluating English essays, or e-rater <br />
* The score difference between human rater and e-rater was significantly smaller for 11th grade essays written by White and African American students than for other groups<br />
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups<br />
* Slightly better accuracy (~1%) across racial groups when including race in model<br />
<br />
Zhang et al. (in press) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM22_paper_35.pdf]<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across racial/ethnic groups. <br />
* No racial/ethnic group consistently had best-performing detectors<br />
<br />
<br />
Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Compared members of overrepresented racial groups to members of underrepresented racial groups (overrepresented group approximately 90% White)<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for overrepresented versus underrepresented students<br />
<br />
<br />
Sulaiman & Roy (2022) [https://fated2022.github.io/assets/pdf/FATED-2022_paper_Sulaiman_Transformers.pdf]<br />
* Models predicting whether a law student will pass the bar exam (to practice law)<br />
* Compared White and non-White students<br />
* Models not applying fairness constraints performed significantly worse for White students in terms of ABROCA<br />
* Models applying fairness constraints performed equivalently for White and non-White students</div>Ryanhttps://www.pcla.wiki/index.php?title=MORF:Data_Studies&diff=412MORF:Data Studies2022-07-19T21:26:16Z<p>Ryan: added details</p>
<hr />
<div>This page lists all known MORF based data studies since 2020.<br />
<br />
== Published Studies ==<br />
<br />
====== Hutt et al. (2022)<ref>Hutt, S., Baker, R. S., Ashenafi, M. M., Andres‐Bray, J. M., & Brooks, C. (2022). Controlled outputs, full data: A privacy‐protecting infrastructure for MOOC data. ''British Journal of Educational Technology''.</ref> ======<br />
Title - Controlled outputs, full data: A privacy-protecting infrastructure for MOOC data.<br />
<br />
====== Andres-Bray (2021)<ref>Andres-Bray, J. M. L. (2021). ''Replication in Massive Open Online Course Research Using the MOOC Replication Framework'' (Doctoral dissertation, University of Pennsylvania).</ref> ======<br />
Title - Replication in Massive Open Online Course Research Using the MOOC Replication Framework (Doctoral dissertation, University of Pennsylvania).<br />
<br />
====== Zhao, Wang, & Sahebi (2020)<ref>Zhao, S., Wang, C., & Sahebi, S. (2020). Modeling knowledge acquisition from multiple learning resource types. ''arXiv preprint arXiv:2006.13390''.</ref> ======<br />
Title - Modeling knowledge acquisition from multiple learning resource types.<br />
<br />
====== Wang et al. (2021)<ref>Wang, C., Sahebi, S., Zhao, S., Brusilovsky, P., & Moraes, L. O. (2021, June). Knowledge Tracing for Complex Problem Solving: Granular Rank-Based Tensor Factorization. In ''Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization'' (pp. 179-188).</ref> ======<br />
Title - Knowledge Tracing for Complex Problem Solving: Granular Rank-Based Tensor Factorization.<br />
<br />
== Ongoing Studies ==<br />
<br />
* Investigating algorithmic bias in predicting dropout from MOOCs for intersectional identities (led by Shamya Karumbaiah, CMU and Haripriya Valayaputtar, UPenn)<br />
* Detecting which MOOC forum posts should be responded to by course staff <br />
* Applying foundation models to MOOC Data (led by Anthony Botelho, U. Florida and Seth Adjei, Northern Kentucky University)<br />
* Other projects by researchers at SUNY Albany, University of Pennsylvania<br />
<br />
== References ==</div>Ryanhttps://www.pcla.wiki/index.php?title=MORF&diff=411MORF2022-07-19T21:23:38Z<p>Ryan: updated unis</p>
<hr />
<div>The MOOC Replication Framework (MORF) is a framework that facilitates the replication of previously published findings across multiple data sets. It supports the construction and evaluation of end-to-end pipelines from raw data to evaluation. MORF is designed to ensure the seamless integration of new findings as new research is conducted or new hypotheses are generated, and to support the generation of novel research in the learning sciences.<br />
<br />
MORF is a joint project between multiple research laboratories, with primary implementation in recent years occurring at the University of Pennsylvania Center for Learning Analytics, the etc lab at the University of Michigan School of Information, and the Human-Computer Interaction Institute at Carnegie Mellon University. Other universities also have instances of MORF.<br />
<br />
MORF has now partnered with the ASSISTments E-TRIALS infrastructure to create RAILKaM, an integration where researchers will be able to link MOOC and intelligent tutor data.<ref>https://educational-technology-collective.github.io/morf/about/</ref><br />
<br />
<br />
[[MORF:Studies|Studies]]<br />
<br />
[[MORF:Data Studies|Data Studies]]<br />
<br />
== References ==<br />
<br />
=== Citations ===<br />
<references /></div>Ryanhttps://www.pcla.wiki/index.php?title=Algorithmic_Bias_in_Education&diff=401Algorithmic Bias in Education2022-07-07T07:52:48Z<p>Ryan: changed wiki title</p>
<hr />
<div>== Empirical Evidence for Algorithmic Bias in Education: The Wiki ==<br />
<br />
This Wiki summarizes the current peer-reviewed published evidence surrounding Algorithmic Bias in Education:<br />
which groups are impacted, and in which contexts.<br />
<br />
For a relatively recent review on this topic, see <br />
Baker, R.S., Hawn, M.A. (in press) Algorithmic Bias in Education. To appear in <em>International Journal of Artificial Intelligence and Education</em><br />
([https://www.upenn.edu/learninganalytics/ryanbaker/AlgorithmicBiasInEducation_rsb3.7.pdf pdf])<br />
<br />
Note: Within this Wiki, we recommend that page editors use the group labels originally<br />
used within the publications being cited, to best represent the articles included here. We also request that each page center the members of the group the page is about.<br />
<br />
This wiki can be cited as<br />
Penn Center for Learning Analytics (*current year*) Empirical Evidence for Algorithmic Bias in Education: The Wiki. <br />
Philadelphia, PA: Penn Center for Learning Analytics. <br />
Retrieved *current date* from https://www.pcla.wiki/index.php/Algorithmic_Bias_in_Education <br />
<br />
== By Group Impacted ==<br />
* Race and Ethnicity<br />
** [[Black/African-American Learners in North America]]<br />
** [[Latino/Latina/Latinx/Hispanic Learners in North America]]<br />
** [[Asian/Asian-American Learners in North America]]<br />
** [[White Learners in North America]]<br />
** [[Indigenous Learners in North America]]<br />
** [[Research on Race and Ethnicity Conducted Outside of North America]] <br />
* [[Gender: Male/Female]]<br />
* [[Gender: Non-Binary and Transgender Learners]]<br />
* [[Sexual Orientation]]<br />
* [[Linguistic Origin]]<br />
* [[National Origin or National Location]]<br />
* [[International Students]]<br />
* [[Native Language and Dialect]]<br />
* [[Learners with Disabilities]]<br />
* [[Age]]<br />
* [[Urbanicity]]<br />
* [[Parental Educational Background]]<br />
* [[Socioeconomic Status]]<br />
* [[Military-Connected Status]]<br />
* [[Children of Migrant Workers]]<br />
* [[Religion and Religious Background]]<br />
* [[Public or Private K-12 School]]<br />
* [[Intersectional Research]]<br />
<br />
== By Algorithm Application == <br />
* [[At-risk/Dropout/Stopout/Graduation Prediction]]<br />
* [[Course Grade and GPA Prediction]]<br />
*[[National and International Examination]]<br />
* [[Short-term Performance and Learning Gains Prediction]]<br />
* [[Automated Essay Scoring]]<br />
* [[Speech Recognition for Education]]<br />
* [[Other NLP Applications of Algorithms in Education]]<br />
* [[Student Knowledge Modeling]]<br />
* [[Engagement and Affect Detection]]<br />
*[[Self-regulated Learning]]<br />
* [[Task/Activity Quit Prediction]]<br />
* [[Social Network Link Prediction]]</div>Ryanhttps://www.pcla.wiki/index.php?title=Algorithmic_Bias_in_Education&diff=400Algorithmic Bias in Education2022-07-06T20:08:49Z<p>Ryan: Added citation information</p>
<hr />
<div>== Algorithmic Bias in Education: The Wiki ==<br />
<br />
This Wiki summarizes the current peer-reviewed published evidence surrounding Algorithmic Bias in Education:<br />
which groups are impacted, and in which contexts.<br />
<br />
For a relatively recent review on this topic, see <br />
Baker, R.S., Hawn, M.A. (in press) Algorithmic Bias in Education. To appear in <em>International Journal of Artificial Intelligence and Education</em><br />
([https://www.upenn.edu/learninganalytics/ryanbaker/AlgorithmicBiasInEducation_rsb3.7.pdf pdf])<br />
<br />
Note: Within this Wiki, we recommend that page editors use the group labels originally<br />
used within the publications being cited, to best represent the articles included here. We also request that each page center the members of the group the page is about.<br />
<br />
This wiki can be cited as<br />
Penn Center for Learning Analytics (*current year*) Algorithmic Bias in Education: The Wiki. <br />
Philadelphia, PA: Penn Center for Learning Analytics. <br />
Retrieved *current date* from https://www.pcla.wiki/index.php/Algorithmic_Bias_in_Education <br />
<br />
== By Group Impacted ==<br />
* Race and Ethnicity<br />
** [[Black/African-American Learners in North America]]<br />
** [[Latino/Latina/Latinx/Hispanic Learners in North America]]<br />
** [[Asian/Asian-American Learners in North America]]<br />
** [[White Learners in North America]]<br />
** [[Indigenous Learners in North America]]<br />
** [[Research on Race and Ethnicity Conducted Outside of North America]] <br />
* [[Gender: Male/Female]]<br />
* [[Gender: Non-Binary and Transgender Learners]]<br />
* [[Sexual Orientation]]<br />
* [[Linguistic Origin]]<br />
* [[National Origin or National Location]]<br />
* [[International Students]]<br />
* [[Native Language and Dialect]]<br />
* [[Learners with Disabilities]]<br />
* [[Age]]<br />
* [[Urbanicity]]<br />
* [[Parental Educational Background]]<br />
* [[Socioeconomic Status]]<br />
* [[Military-Connected Status]]<br />
* [[Children of Migrant Workers]]<br />
* [[Religion and Religious Background]]<br />
* [[Public or Private K-12 School]]<br />
* [[Intersectional Research]]<br />
<br />
== By Algorithm Application == <br />
* [[At-risk/Dropout/Stopout/Graduation Prediction]]<br />
* [[Course Grade and GPA Prediction]]<br />
*[[National and International Examination]]<br />
* [[Short-term Performance and Learning Gains Prediction]]<br />
* [[Automated Essay Scoring]]<br />
* [[Speech Recognition for Education]]<br />
* [[Other NLP Applications of Algorithms in Education]]<br />
* [[Student Knowledge Modeling]]<br />
* [[Engagement and Affect Detection]]<br />
*[[Self-regulated Learning]]<br />
* [[Task/Activity Quit Prediction]]<br />
* [[Social Network Link Prediction]]</div>Ryanhttps://www.pcla.wiki/index.php?title=Automated_Essay_Scoring&diff=399Automated Essay Scoring2022-07-04T16:33:17Z<p>Ryan: Added Litman et al. (2021)</p>
<hr />
<div>Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring page]<br />
<br />
* Automated scoring models for evaluating English essays, or e-rater<br />
* E-Rater gave significantly better scores than human rater for 11th grade essays written by Hispanic students and Asian-American students<br />
* E-Rater gave significantly better scores than human rater for TOEFL essays (independent task) written by speakers of Chinese and Korean<br />
* E-Rater correlated poorly with human rater and gave better scores than human rater for GRE essays (both issue and argument prompts) written by Chinese speakers<br />
* E-Rater system performed comparably accurately for male and female students when assessing their 11th grade essays, TOEFL, and GRE writings<br />
<br />
<br />
<br />
Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true pdf]<br />
<br />
* A later version of automated scoring models for evaluating English essays, or e-rater<br />
* E-rater gave a significantly lower score than human raters when assessing African-American students' written responses to the issue prompt in the GRE<br />
* E-rater gave better scores than human raters for Chinese speakers (Mainland China, Taiwan, Hong Kong) and Korean speakers when assessing the TOEFL (independent prompt) essay<br />
* E-rater gave lower scores than human raters for Arabic, Hindi, and Spanish speakers when assessing their written responses to the independent prompt in the TOEFL<br />
* The e-rater system correlated comparably well with human raters when assessing TOEFL and GRE essays written by male and female students (a sketch of such human-machine comparisons follows below)<br />
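A sketch of the two checks behind comparisons like these: the mean machine-minus-human score difference, and the human/machine score correlation, computed per language group. All scores and group labels are illustrative:<br />
<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import pearsonr

human = np.array([4.0, 3.5, 5.0, 2.5, 4.5, 3.0])    # illustrative human scores
machine = np.array([4.5, 4.0, 5.0, 3.0, 4.0, 3.5])  # illustrative machine scores
lang = np.array(["Chinese"] * 3 + ["Arabic"] * 3)

for g in np.unique(lang):
    m = lang == g
    diff = (machine[m] - human[m]).mean()
    r, _ = pearsonr(human[m], machine[m])
    print(f"{g}: mean(machine - human) = {diff:+.2f}, r = {r:.2f}")
</syntaxhighlight>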
<br />
<br />
<br />
<br />
Ramineni & Williamson (2018) [https://onlinelibrary.wiley.com/doi/10.1002/ets2.12192 pdf]<br />
<br />
* Revised automated scoring engine for assessing the GRE essay<br />
<br />
* E-rater gave African American test-takers significantly lower scores than human raters when assessing their written responses to argument prompts<br />
* The shorter essays written by African American test-takers were more likely to receive lower scores as showing weakness in content and organization<br />
<br />
<br />
<br />
<br />
Wang et al. (2018) [https://www.researchgate.net/publication/336009443_Monitoring_the_performance_of_human_and_automated_scores_for_spoken_responses pdf]<br />
*Automated scoring model for evaluating English spoken responses<br />
*SpeechRater gave a significantly lower score than human raters for German students<br />
*SpeechRater scored students from China higher than human raters, with H1-rater scores higher than mean<br />
<br />
<br />
Litman et al. (2021) [https://link.springer.com/chapter/10.1007/978-3-030-78292-4_21 html]<br />
* Automated essay scoring models inferring text evidence usage<br />
* All algorithms studied have less than 1% of error explained by whether the student is female or male, whether the student is Black, or whether the student receives free/reduced price lunch</div>Ryanhttps://www.pcla.wiki/index.php?title=Socioeconomic_Status&diff=398Socioeconomic Status2022-07-04T16:32:30Z<p>Ryan: Added Litman et al. (2021)</p>
<hr />
<div>Yudelson et al. (2014) [https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.659.872&rep=rep1&type=pdf pdf]<br />
<br />
* Models discovering generalizable sub-populations of students across different schools to predict students' learning with Carnegie Learning’s Cognitive Tutor (CLCT)<br />
<br />
* Models trained on schools with a high proportion of low-SES students performed worse than those trained on schools with a medium or low proportion<br />
* Models trained on schools with low or medium proportions of low-SES students performed similarly well for schools with high proportions of low-SES students<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
<br />
* Models predicting undergraduate course grades and average GPA<br />
<br />
* Students from low-income households were inaccurately predicted to perform worse for both short-term (final course grade) and long-term (GPA)<br />
* Fairness of model improved if it included only clickstream and survey data<br />
<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
*Models predicting college dropout for students in residential and fully online program<br />
*Whether the socio-demographic information was included or not, the model showed worse accuracy and true negative rates for residential students with greater financial needs<br />
*The model showed better recall for students with greater financial needs, especially for those studying in person<br />
<br />
<br />
Kung & Yu (2020)<br />
[https://dl.acm.org/doi/pdf/10.1145/3386527.3406755 pdf]<br />
* Predicting course grades and later GPA at public U.S. university<br />
* Equal performance for low-income and upper-income students in course grade prediction for several algorithms and metrics<br />
* Worse performance on independence for low-income students than high-income students in later GPA prediction for four of five algorithms; one algorithm had worse separation and two algorithms had worse sufficiency<br />
<br />
<br />
Litman et al. (2021) [https://link.springer.com/chapter/10.1007/978-3-030-78292-4_21 html]<br />
* Automated essay scoring models inferring text evidence usage<br />
* All algorithms studied have less than 1% of error explained by whether student receives free/reduced price lunch</div>Ryanhttps://www.pcla.wiki/index.php?title=Black/African-American_Learners_in_North_America&diff=397Black/African-American Learners in North America2022-07-04T16:30:57Z<p>Ryan: Added Litman et al. (2021)</p>
<hr />
<div>Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved much lower Kappa and AUC for Black students than White students<br />
* JRip decision rules achieved almost identical Kappa and AUC for Black students and White students<br />
<br />
<br />
Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]<br />
* Models predicting if a college student will fail in a course<br />
* Multiple cooperative classifier model (MCCM) model was the best at reducing bias, or discrimination against African-American students, while other models (particularly Logistic Regression and Rawlsian Fairness) performed far worse<br />
* The level of bias was inconsistent across courses, with MCCM prediction showing the least bias for Psychology and the greatest bias for Computer Science<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting high school dropout<br />
* The decision trees showed little difference in AUC among Black, White, Hispanic, Asian, American Indian and Alaska Native, and Native Hawaiian and Pacific Islander.<br />
<br />
<br />
Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
* Models predicting college success (or median grade or above)<br />
* Random forest algorithms performed significantly worse for underrepresented minority students (URM; Black, American Indian, Hawaiian or Pacific Islander, Hispanic, and Multicultural) than non-URM students (White and Asian)<br />
* The fairness of the model, namely demographic parity and equality of opportunity, as well as its accuracy, improved after correcting the threshold values from 0.5 to group-specific values<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Model predicting undergraduate short-term (course grades) and long-term (average GPA) success<br />
* Black students were inaccurately predicted to perform worse in both the short term and long term<br />
* The fairness of models improved when either click or a combination of click and survey data, and not institutional data, was included in the model<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online program<br />
* Whether or not socio-demographic information was included, the model showed worse true negative rates for students who are underrepresented minorities (URM; not White or Asian), and worse accuracy for URM students studying in person<br />
* The model showed better recall for URM students, whether they were in residential or online program<br />
<br />
<br />
Ramineni & Williamson (2018) [https://files.eric.ed.gov/fulltext/EJ1202928.pdf pdf]<br />
* Revised automated scoring engine for assessing GRE essays<br />
* E-rater gave African American test-takers significantly lower scores than human raters when assessing their written responses to argument prompts<br />
* The shorter essays written by African American test-takers were more likely to receive lower scores for weakness in content and organization<br />
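Human-machine discrepancies of the kind reported for e-rater are commonly summarized as a standardized mean difference between machine and human scores within each group; a hedged sketch of that summary (not ETS's actual analysis code):<br />
<pre>
import numpy as np

def human_machine_gap(human, machine, group):
    """Standardized mean of (machine - human) per group; negative values
    mean the engine scores that group lower than human raters do."""
    diffs = np.asarray(machine, float) - np.asarray(human, float)
    sd = diffs.std(ddof=1)
    return {g: diffs[group == g].mean() / sd for g in np.unique(group)}
</pre>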
<br />
<br />
Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
* Automated scoring models for evaluating English essays, or e-rater <br />
* The score difference between human raters and e-rater was significantly smaller for 11th grade essays written by African American and White students than for other groups<br />
<br />
<br />
<br />
Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502 pdf]<br />
* A later version of automated scoring models for evaluating English essays, or e-rater<br />
* E-rater gave significantly lower scores than human raters when assessing African-American students’ written responses to the issue prompt in the GRE<br />
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups<br />
* Slightly better accuracy (~1%) across racial groups when including race in the model<br />
<br />
<br />
Zhang et al. (in press) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM22_paper_35.pdf pdf]<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across racial/ethnic groups. <br />
* No racial/ethnic group consistently had best-performing detectors<br />
<br />
<br />
Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Compared members of overrepresented racial groups to members of underrepresented racial groups (over 2/3 Black/African American)<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for overrepresented versus underrepresented students<br />
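ABROCA (absolute between-ROC area), reported throughout these pages, is the area between two groups' ROC curves; a minimal sketch that interpolates both curves onto a common false-positive-rate grid (a numerical approximation of the published definition):<br />
<pre>
import numpy as np
from sklearn.metrics import roc_curve

def abroca(y_true, y_prob, group):
    """Area between the ROC curves of two groups (group: boolean array)."""
    grid = np.linspace(0, 1, 1001)
    tprs = []
    for mask in (group, ~group):
        fpr, tpr, _ = roc_curve(y_true[mask], y_prob[mask])
        tprs.append(np.interp(grid, fpr, tpr))
    # uniform grid on [0, 1], so the mean of |TPR_a - TPR_b|
    # approximates the integral over the false positive rate axis
    return float(np.mean(np.abs(tprs[0] - tprs[1])))
</pre>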
<br />
<br />
Litman et al. (2021) [https://link.springer.com/chapter/10.1007/978-3-030-78292-4_21 html]<br />
* Automated essay scoring models inferring text evidence usage<br />
* For all algorithms studied, less than 1% of error was explained by whether a student is Black</div>Ryanhttps://www.pcla.wiki/index.php?title=Gender:_Male/Female&diff=396Gender: Male/Female2022-07-04T16:30:02Z<p>Ryan: Added Litman et al. (2021)</p>
<hr />
<div>Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved significantly lower Kappa but higher AUC for male students than female students<br />
* JRip decision rules achieved much lower Kappa and AUC for male students than female students<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting students' high school dropout<br />
* The decision trees showed very minor differences in AUC between female and male students<br />
<br />
<br />
Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]<br />
* Models predicting if a college student will fail in a course<br />
* The multiple cooperative classifier model (MCCM) was the best at reducing bias (discrimination against male students), performing particularly well for the Psychology course<br />
* Other models (Logistic Regression and Rawlsian Fairness) performed far worse for male students, particularly in Computer Science and Electrical Engineering<br />
<br />
<br />
Anderson et al. (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM2019_paper56.pdf pdf]<br />
* Models predicting six-year college graduation<br />
* False negative rates were greater for male students than female students when SVM, Logistic Regression, and SGD were used<br />
<br />
<br />
Gardner, Brooks and Baker (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/LAK_PAPER97_CAMERA.pdf pdf]<br />
* Model predicting MOOC dropout, specifically through slicing analysis<br />
* Some algorithms studied performed worse for female students than male students, particularly in courses with 45% or less male presence<br />
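Slicing analysis evaluates one trained model separately on each slice of the data, here per course, set against that course's gender composition. A hedged sketch of the idea (column names are assumptions):<br />
<pre>
import pandas as pd
from sklearn.metrics import roc_auc_score

def slice_by_course(df):
    """Per-course AUC for female and male students, with the course's male share.
    df columns (hypothetical): course, gender, y_true, y_prob."""
    rows = []
    for course, g in df.groupby("course"):
        f, m = g[g["gender"] == "female"], g[g["gender"] == "male"]
        rows.append({"course": course,
                     "male_share": len(m) / len(g),
                     "auc_female": roc_auc_score(f["y_true"], f["y_prob"]),
                     "auc_male": roc_auc_score(m["y_true"], m["y_prob"])})
    return pd.DataFrame(rows)
</pre>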
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
* Model predicting course outcome<br />
* Marginal differences were found between groups in prediction quality and in the overall proportion of students predicted to pass<br />
* These differences were inconsistent in direction across algorithms<br />
<br />
<br />
Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
* Models predicting college success (defined as earning the median grade or above)<br />
* Random forest algorithms performed significantly worse for male students than female students<br />
* The model's fairness (namely demographic parity and equality of opportunity), as well as its accuracy, improved after adjusting decision thresholds from 0.5 to group-specific values<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Model predicting undergraduate short-term (course grades) and long-term (average GPA) success<br />
* Female students were inaccurately predicted to achieve greater short-term and long-term success than male students.<br />
* The fairness of models improved when a combination of institutional and click data was used in the model<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online programs<br />
* Whether or not socio-demographic information was included, the model showed worse true negative rates and worse accuracy for male students<br />
* The model showed better recall for male students, especially for those studying in person<br />
* The differences in recall and true negative rates were smaller, and thus fairer, for male students studying online when socio-demographic information was not included in the model<br />
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
* Models predicting course outcome of students in a virtual learning environment (VLE)<br />
* More male students were predicted to pass the course than female students, but this overestimation was fairly small and not consistent across different algorithms<br />
* Among the algorithms, Naive Bayes had the lowest normalized mutual information value and the highest ABROCA value<br />
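Normalized mutual information (NMI) scores how much two labelings share information on a 0 to 1 scale; the summary above does not say which pair of variables Riazy et al. compare, so the sketch below simply shows the metric applied to predicted versus actual outcomes (our assumption):<br />
<pre>
from sklearn.metrics import normalized_mutual_info_score

def outcome_nmi(y_actual, y_predicted):
    """NMI is symmetric and lies in [0, 1]: 1 means the two labelings
    determine each other, 0 means they are statistically independent."""
    return normalized_mutual_info_score(y_actual, y_predicted)
</pre>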
<br />
<br />
Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
* Automated scoring models for evaluating English essays, or e-rater<br />
* The e-rater system performed comparably accurately for male and female students when assessing their 11th grade essays<br />
<br />
<br />
<br />
Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true pdf]<br />
* A later version of automated scoring models for evaluating English essays, or e-rater<br />
* The e-rater system correlated comparably well with human raters when assessing TOEFL and GRE essays written by male and female students<br />
<br />
<br />
Verdugo et al. (2022) [https://dl.acm.org/doi/abs/10.1145/3506860.3506902 pdf]<br />
* Algorithms predicting dropout from university after the first year<br />
* Several algorithms achieved better AUC for male than female students; results were mixed for F1.<br />
<br />
<br />
Zhang et al. (in press) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM22_paper_35.pdf pdf]<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across gender groups. <br />
* No gender group consistently had best-performing detectors<br />
<br />
<br />
Rzepka et al. (2022) [https://www.insticc.org/node/TechnicalProgram/CSEDU/2022/presentationDetails/109621 pdf]<br />
* Models predicting whether a student will quit a spelling learning activity without completing it<br />
* Multiple algorithms have slightly better false positive rates and AUC ROC for male students than female students, but equivalent performance on multiple other metrics.<br />
<br />
<br />
Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for female versus male students<br />
<br />
<br />
Sha et al. (2021) [https://angusglchen.github.io/files/AIED2021_Lele_Assessing.pdf pdf]<br />
* Models predicting whether a MOOC discussion forum post is content-relevant or content-irrelevant<br />
* Some algorithms achieved ABROCA under 0.01 for female students versus male students, but other algorithms (Naive Bayes) had ABROCA as high as 0.06<br />
* Balancing the size of each group in the training set reduced ABROCA<br />
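Balancing group sizes in the training set, as in the last point, can be done by downsampling the larger group before fitting; a minimal sketch of one such scheme (Sha et al. may balance differently):<br />
<pre>
import numpy as np

def balance_groups(X, y, group, seed=0):
    """Downsample the larger of two groups so both are equally represented."""
    rng = np.random.default_rng(seed)
    idx_a, idx_b = np.flatnonzero(group), np.flatnonzero(~group)
    n = min(len(idx_a), len(idx_b))
    keep = np.concatenate([rng.choice(idx_a, n, replace=False),
                           rng.choice(idx_b, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep], group[keep]
</pre>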
<br />
<br />
Litman et al. (2021) [https://link.springer.com/chapter/10.1007/978-3-030-78292-4_21 html]<br />
* Automated essay scoring models inferring text evidence usage<br />
* For all algorithms studied, less than 1% of error was explained by whether a student is female or male</div>Ryanhttps://www.pcla.wiki/index.php?title=Other_NLP_Applications_of_Algorithms_in_Education&diff=395Other NLP Applications of Algorithms in Education2022-07-04T16:22:42Z<p>Ryan: </p>
<hr />
<div>Naismith et al. (2018) [http://d-scholarship.pitt.edu/40665/1/EDM2018_paper_37.pdf pdf]<br />
<br />
* A model that measures L2 learners’ lexical sophistication using frequency lists based on native-speaker corpora<br />
* Arabic-speaking learners are rated systematically lower across all levels of English proficiency than speakers of Chinese, Japanese, Korean, and Spanish.<br />
* Level 5 Arabic-speaking learners are unfairly evaluated to have a similar level of lexical sophistication as Level 4 learners from China, Japan, Korea, and Spain.<br />
* When used on the ETS corpus, “high”-labeled essays by Japanese-speaking learners are rated significantly lower in lexical sophistication than those of Arabic, Japanese, Korean, and Spanish peers.<br />
<br />
<br />
Samei et al. (2015) [https://files.eric.ed.gov/fulltext/ED560879.pdf pdf]<br />
<br />
* Models predicting classroom discourse properties (e.g. authenticity and uptake)<br />
* Model trained on urban students (authenticity: 0.62, uptake: 0.60) performed with similar accuracy when tested on non-urban students (authenticity: 0.62, uptake: 0.62)<br />
* Model trained on non-urban students (authenticity: 0.61, uptake: 0.59) performed with similar accuracy when tested on urban students (authenticity: 0.60, uptake: 0.63)<br />
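The Samei et al. design (train on one population, test on the other, in both directions) is a generic cross-group generalization check. A hedged sketch; the classifier and array layout are our assumptions, not the original study's setup:<br />
<pre>
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def cross_group_transfer(X, y, is_urban):
    """Fit on one group, score on the other, in both directions."""
    results = {}
    for train_g, test_g in [(True, False), (False, True)]:
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[is_urban == train_g], y[is_urban == train_g])
        pred = clf.predict(X[is_urban == test_g])
        results[(train_g, test_g)] = accuracy_score(y[is_urban == test_g], pred)
    return results  # keyed by (trained-on-urban?, tested-on-urban?)
</pre>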
<br />
<br />
Sha et al. (2021) [https://angusglchen.github.io/files/AIED2021_Lele_Assessing.pdf pdf]<br />
* Models predicting whether a MOOC discussion forum post is content-relevant or content-irrelevant<br />
* MOOCs taught in English<br />
* Some algorithms achieved ABROCA under 0.01 for female students versus male students, but other algorithms (Naive Bayes) had ABROCA as high as 0.06<br />
* ABROCA varied from 0.03 to 0.08 for non-native speakers of English versus native speakers<br />
* Balancing the size of each group in the training set reduced ABROCA values</div>Ryanhttps://www.pcla.wiki/index.php?title=Other_NLP_Applications_of_Algorithms_in_Education&diff=394Other NLP Applications of Algorithms in Education2022-07-04T16:22:21Z<p>Ryan: Added Sha et al (2021)</p>
<hr />
<div>Naismith et al. (2018) [http://d-scholarship.pitt.edu/40665/1/EDM2018_paper_37.pdf pdf]<br />
<br />
* A model that measures L2 learners’ lexical sophistication using frequency lists based on native-speaker corpora<br />
* Arabic-speaking learners are rated systematically lower across all levels of English proficiency than speakers of Chinese, Japanese, Korean, and Spanish.<br />
* Level 5 Arabic-speaking learners are unfairly evaluated to have a similar level of lexical sophistication as Level 4 learners from China, Japan, Korea, and Spain.<br />
* When used on the ETS corpus, “high”-labeled essays by Japanese-speaking learners are rated significantly lower in lexical sophistication than those of Arabic, Japanese, Korean, and Spanish peers.<br />
<br />
<br />
Samei et al. (2015) [https://files.eric.ed.gov/fulltext/ED560879.pdf pdf]<br />
<br />
* Models predicting classroom discourse properties (e.g. authenticity and uptake)<br />
* Model trained on urban students (authenticity: 0.62, uptake: 0.60) performed with similar accuracy when tested on non-urban students (authenticity: 0.62, uptake: 0.62)<br />
* Model trained on non-urban students (authenticity: 0.61, uptake: 0.59) performed with similar accuracy when tested on urban students (authenticity: 0.60, uptake: 0.63)<br />
<br />
<br />
Sha et al. (2021) [https://angusglchen.github.io/files/AIED2021_Lele_Assessing.pdf pdf]<br />
* Models predicting whether a MOOC discussion forum post is content-relevant or content-irrelevant<br />
* MOOCs taught in English<br />
* Some algorithms achieved ABROCA under 0.01 for female students versus male students, but other algorithms (Naive Bayes) had ABROCA as high as 0.06<br />
* ABROCA varied from 0.03 to 0.08 for non-native speakers of English versus native speakers<br />
* Balancing the size of each group in the training set reduced ABROCA values</div>Ryanhttps://www.pcla.wiki/index.php?title=Native_Language_and_Dialect&diff=393Native Language and Dialect2022-07-04T16:01:04Z<p>Ryan: Added Sha et al (2021)</p>
<hr />
<div>Naismith et al. (2018) [http://d-scholarship.pitt.edu/40665/1/EDM2018_paper_37.pdf pdf]<br />
* Model that measures L2 learners’ lexical sophistication using frequency lists based on native-speaker corpora<br />
* Arabic-speaking learners are rated systematically lower across all levels of human-assessed English proficiency than speakers of Chinese, Japanese, Korean, and Spanish<br />
* Level 5 Arabic-speaking learners are inaccurately evaluated to have a similar level of lexical sophistication as Level 4 learners from China, Japan, Korea, and Spain<br />
* When used on the ETS corpus, essays by Japanese-speaking learners with higher human-rated lexical sophistication are rated significantly lower in lexical sophistication than those of Arabic, Japanese, Korean, and Spanish peers<br />
<br />
<br />
<br />
Loukina et al. (2019) [https://aclanthology.org/W19-4401.pdf pdf]<br />
<br />
* Models providing automated speech scores on English language proficiency assessment<br />
* L1-specific model trained on the speaker’s native language was the least fair, especially for Chinese, Japanese, and Korean speakers, but not for German speakers<br />
* All models (Baseline, Fair feature subset, L1-specific) performed worse for Japanese speakers<br />
<br />
<br />
Rzepka et al. (2022) [https://www.insticc.org/node/TechnicalProgram/CSEDU/2022/presentationDetails/109621 pdf]<br />
* Models predicting whether a student will quit a spelling learning activity without completing it<br />
* Multiple algorithms have slightly better false positive rates for second-language speakers than native speakers, but equivalent performance on multiple other metrics.<br />
<br />
<br />
Sha et al. (2021) [https://angusglchen.github.io/files/AIED2021_Lele_Assessing.pdf pdf]<br />
* Models predicting whether a MOOC discussion forum post is content-relevant or content-irrelevant<br />
* MOOCs taught in English<br />
* ABROCA varied from 0.03 to 0.08 for non-native speakers of English versus native speakers<br />
* Balancing the size of each group in the training set reduced ABROCA</div>Ryanhttps://www.pcla.wiki/index.php?title=Gender:_Male/Female&diff=392Gender: Male/Female2022-07-04T15:55:44Z<p>Ryan: Added Sha et al (2021)</p>
<hr />
<div>Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved significantly lower Kappa but higher AUC for male students than female students<br />
* JRip decision rules achieved much lower Kappa and AUC for male students than female students<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting students' high school dropout<br />
* The decision trees showed very minor differences in AUC between female and male students<br />
<br />
<br />
Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]<br />
* Models predicting if a college student will fail in a course<br />
* The multiple cooperative classifier model (MCCM) was the best at reducing bias (discrimination against male students), performing particularly well for the Psychology course<br />
* Other models (Logistic Regression and Rawlsian Fairness) performed far worse for male students, particularly in Computer Science and Electrical Engineering<br />
<br />
<br />
Anderson et al. (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM2019_paper56.pdf pdf]<br />
* Models predicting six-year college graduation<br />
* False negative rates were greater for male students than female students when SVM, Logistic Regression, and SGD were used<br />
<br />
<br />
Gardner, Brooks and Baker (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/LAK_PAPER97_CAMERA.pdf pdf]<br />
* Model predicting MOOC dropout, specifically through slicing analysis<br />
* Some algorithms studied performed worse for female students than male students, particularly in courses with 45% or less male presence<br />
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
* Model predicting course outcome<br />
* Marginal differences were found between groups in prediction quality and in the overall proportion of students predicted to pass<br />
* These differences were inconsistent in direction across algorithms<br />
<br />
<br />
Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
* Models predicting college success (defined as earning the median grade or above)<br />
* Random forest algorithms performed significantly worse for male students than female students<br />
* The model's fairness (namely demographic parity and equality of opportunity), as well as its accuracy, improved after adjusting decision thresholds from 0.5 to group-specific values<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Model predicting undergraduate short-term (course grades) and long-term (average GPA) success<br />
* Female students were inaccurately predicted to achieve greater short-term and long-term success than male students.<br />
* The fairness of models improved when a combination of institutional and click data was used in the model<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online programs<br />
* Whether or not socio-demographic information was included, the model showed worse true negative rates and worse accuracy for male students<br />
* The model showed better recall for male students, especially for those studying in person<br />
* The differences in recall and true negative rates were smaller, and thus fairer, for male students studying online when socio-demographic information was not included in the model<br />
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
* Models predicting course outcome of students in a virtual learning environment (VLE)<br />
* More male students were predicted to pass the course than female students, but this overestimation was fairly small and not consistent across different algorithms<br />
* Among the algorithms, Naive Bayes had the lowest normalized mutual information value and the highest ABROCA value<br />
<br />
<br />
Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
* Automated scoring models for evaluating English essays, or e-rater<br />
* The e-rater system performed comparably accurately for male and female students when assessing their 11th grade essays<br />
<br />
<br />
<br />
Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true pdf]<br />
* A later version of automated scoring models for evaluating English essays, or e-rater<br />
* The e-rater system correlated comparably well with human raters when assessing TOEFL and GRE essays written by male and female students<br />
<br />
<br />
Verdugo et al. (2022) [https://dl.acm.org/doi/abs/10.1145/3506860.3506902 pdf]<br />
* Algorithms predicting dropout from university after the first year<br />
* Several algorithms achieved better AUC for male than female students; results were mixed for F1.<br />
<br />
<br />
Zhang et al. (in press) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM22_paper_35.pdf pdf]<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across gender groups. <br />
* No gender group consistently had best-performing detectors<br />
<br />
<br />
Rzepka et al. (2022) [https://www.insticc.org/node/TechnicalProgram/CSEDU/2022/presentationDetails/109621 pdf]<br />
* Models predicting whether student will quit spelling learning activity without completing<br />
* Multiple algorithms have slightly better false positive rates and AUC ROC for male students than female students, but equivalent performance on multiple other metrics.<br />
<br />
<br />
Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for female versus male students<br />
<br />
<br />
Sha et al. (2021) [https://angusglchen.github.io/files/AIED2021_Lele_Assessing.pdf pdf]<br />
* Models predicting whether a MOOC discussion forum post is content-relevant or content-irrelevant<br />
* Some algorithms achieved ABROCA under 0.01 for female students versus male students, but other algorithms (Naive Bayes) had ABROCA as high as 0.06<br />
* Balancing the size of each group in the training set reduced ABROCA</div>Ryanhttps://www.pcla.wiki/index.php?title=Social_Network_Link_Prediction&diff=391Social Network Link Prediction2022-07-04T13:22:05Z<p>Ryan: Created page with "Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf] * Models predicting whether two students will communicate on an online discussion forum * Multiple fairness approaches lead to ABROCA of under 0.01 for female versus male students * Compared members of overrepresented racial groups to members of underrepresented racial groups ** Underrepr..."</p>
<hr />
<div>Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for female versus male students<br />
* Compared members of overrepresented racial groups to members of underrepresented racial groups <br />
** Underrepresented group over 2/3 Black/African American<br />
** Overrepresented group approximately 90% White<br />
** Multiple fairness approaches lead to ABROCA of under 0.01 for overrepresented versus underrepresented students</div>Ryanhttps://www.pcla.wiki/index.php?title=Algorithmic_Bias_in_Education&diff=390Algorithmic Bias in Education2022-07-04T13:21:12Z<p>Ryan: /* By Algorithm Application */</p>
<hr />
<div>== Algorithmic Bias in Education ==<br />
<br />
This Wiki summarizes the current peer-reviewed published evidence surrounding Algorithmic Bias in Education:<br />
which groups are impacted, and in which contexts.<br />
<br />
For a relatively recent review on this topic, see <br />
Baker, R.S., Hawn, M.A. (in press) Algorithmic Bias in Education. To appear in <em>International Journal of Artificial Intelligence and Education</em><br />
([https://www.upenn.edu/learninganalytics/ryanbaker/AlgorithmicBiasInEducation_rsb3.7.pdf pdf])<br />
<br />
Note: Within this Wiki, we recommend that page editors use the group labels originally<br />
used within the publications being cited, to best represent the articles included here. We also request that each page center the members of the group the page is about.<br />
<br />
== By Group Impacted ==<br />
* Race and Ethnicity<br />
** [[Black/African-American Learners in North America]]<br />
** [[Latino/Latina/Latinx/Hispanic Learners in North America]]<br />
** [[Asian/Asian-American Learners in North America]]<br />
** [[White Learners in North America]]<br />
** [[Indigenous Learners in North America]]<br />
** [[Research on Race and Ethnicity Conducted Outside of North America]] <br />
* [[Gender: Male/Female]]<br />
* [[Gender: Non-Binary and Transgender Learners]]<br />
* [[Sexual Orientation]]<br />
* [[Linguistic Origin]]<br />
* [[National Origin or National Location]]<br />
* [[International Students]]<br />
* [[Native Language and Dialect]]<br />
* [[Learners with Disabilities]]<br />
* [[Age]]<br />
* [[Urbanicity]]<br />
* [[Parental Educational Background]]<br />
* [[Socioeconomic Status]]<br />
* [[Military-Connected Status]]<br />
* [[Children of Migrant Workers]]<br />
* [[Religion and Religious Background]]<br />
* [[Public or Private K-12 School]]<br />
* [[Intersectional Research]]<br />
<br />
== By Algorithm Application == <br />
* [[At-risk/Dropout/Stopout/Graduation Prediction]]<br />
* [[Course Grade and GPA Prediction]]<br />
* [[National and International Examination]]<br />
* [[Short-term Performance and Learning Gains Prediction]]<br />
* [[Automated Essay Scoring]]<br />
* [[Speech Recognition for Education]]<br />
* [[Other NLP Applications of Algorithms in Education]]<br />
* [[Student Knowledge Modeling]]<br />
* [[Engagement and Affect Detection]]<br />
* [[Self-regulated Learning]]<br />
* [[Task/Activity Quit Prediction]]<br />
* [[Social Network Link Prediction]]</div>Ryanhttps://www.pcla.wiki/index.php?title=Black/African-American_Learners_in_North_America&diff=389Black/African-American Learners in North America2022-07-04T13:20:30Z<p>Ryan: </p>
<hr />
<div>Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved much lower Kappa and AUC for Black students than White students<br />
* JRip decision rules achieved almost identical Kappa and AUC for Black students and White students<br />
<br />
<br />
Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]<br />
* Models predicting if a college student will fail in a course<br />
* The multiple cooperative classifier model (MCCM) was the best at reducing bias (discrimination against African-American students), while other models (particularly Logistic Regression and Rawlsian Fairness) performed far worse<br />
* The level of bias was inconsistent across courses, with MCCM prediction showing the least bias for Psychology and the greatest bias for Computer Science<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting students' high school dropout<br />
* The decision trees showed little difference in AUC among Black, White, Hispanic, Asian, American Indian and Alaska Native, and Native Hawaiian and Pacific Islander.<br />
<br />
<br />
Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
* Models predicting college success (defined as earning the median grade or above)<br />
* Random forest algorithms performed significantly worse for underrepresented minority students (URM; Black, American Indian, Hawaiian or Pacific Islander, Hispanic, and Multicultural) than non-URM students (White and Asian)<br />
* The model's fairness (namely demographic parity and equality of opportunity), as well as its accuracy, improved after adjusting decision thresholds from 0.5 to group-specific values<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Model predicting undergraduate short-term (course grades) and long-term (average GPA) success<br />
* Black students were inaccurately predicted to perform worse in both the short term and the long term<br />
* Model fairness improved when click data, alone or combined with survey data, was included in place of institutional data<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online programs<br />
* Whether or not socio-demographic information was included, the model showed worse true negative rates for underrepresented minority (URM; not White or Asian) students, and worse accuracy for URM students studying in person<br />
* The model showed better recall for URM students, whether they were in the residential or the online program<br />
<br />
<br />
Ramineni & Williamson (2018) [https://files.eric.ed.gov/fulltext/EJ1202928.pdf pdf]<br />
* Revised automated scoring engine for assessing GRE essays<br />
* E-rater gave African American test-takers significantly lower scores than human raters when assessing their written responses to argument prompts<br />
* The shorter essays written by African American test-takers were more likely to receive lower scores for weakness in content and organization<br />
<br />
<br />
Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
* Automated scoring models for evaluating English essays, or e-rater <br />
* The score difference between human raters and e-rater was significantly smaller for 11th grade essays written by African American and White students than for other groups<br />
<br />
<br />
<br />
Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502 pdf]<br />
* A later version of automated scoring models for evaluating English essays, or e-rater<br />
* E-rater gave significantly lower scores than human raters when assessing African-American students’ written responses to the issue prompt in the GRE<br />
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups<br />
* Slightly better accuracy (~1%) across racial groups when including race in the model<br />
<br />
<br />
Zhang et al. (in press) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM22_paper_35.pdf pdf]<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across racial/ethnic groups. <br />
* No racial/ethnic group consistently had best-performing detectors<br />
<br />
<br />
Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Compared members of overrepresented racial groups to members of underrepresented racial groups (over 2/3 Black/African American)<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for overrepresented versus underrepresented students</div>Ryanhttps://www.pcla.wiki/index.php?title=White_Learners_in_North_America&diff=388White Learners in North America2022-07-04T13:20:02Z<p>Ryan: Added Li, Xing, & Leite (2022)</p>
<hr />
<div>Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
* Automated scoring models for evaluating English essays, or e-rater <br />
* The score difference between human raters and e-rater was significantly smaller for 11th grade essays written by White and African American students than for other groups<br />
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups<br />
* Slightly better accuracy (~1%) across racial groups when including race in the model<br />
<br />
Zhang et al. (in press) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM22_paper_35.pdf pdf]<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across racial/ethnic groups. <br />
* No racial/ethnic group consistently had best-performing detectors<br />
<br />
<br />
Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Compared members of overrepresented racial groups to members of underrepresented racial groups (overrepresented group approximately 90% White)<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for overrepresented versus underrepresented students</div>Ryanhttps://www.pcla.wiki/index.php?title=Black/African-American_Learners_in_North_America&diff=387Black/African-American Learners in North America2022-07-04T13:17:47Z<p>Ryan: Added Li, Xing, & Leite (2022)</p>
<hr />
<div>Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved much lower Kappa and AUC for Black students than White students<br />
* JRip decision rules achieved almost identical Kappa and AUC for Black students and White students<br />
<br />
<br />
Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]<br />
* Models predicting if a college student will fail in a course<br />
* The multiple cooperative classifier model (MCCM) was the best at reducing bias (discrimination against African-American students), while other models (particularly Logistic Regression and Rawlsian Fairness) performed far worse<br />
* The level of bias was inconsistent across courses, with MCCM prediction showing the least bias for Psychology and the greatest bias for Computer Science<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting students' high school dropout<br />
* The decision trees showed little difference in AUC among Black, White, Hispanic, Asian, American Indian and Alaska Native, and Native Hawaiian and Pacific Islander.<br />
<br />
<br />
Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
* Models predicting college success (defined as earning the median grade or above)<br />
* Random forest algorithms performed significantly worse for underrepresented minority students (URM; Black, American Indian, Hawaiian or Pacific Islander, Hispanic, and Multicultural) than non-URM students (White and Asian)<br />
* The model's fairness (namely demographic parity and equality of opportunity), as well as its accuracy, improved after adjusting decision thresholds from 0.5 to group-specific values<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Model predicting undergraduate short-term (course grades) and long-term (average GPA) success<br />
* Black students were inaccurately predicted to perform worse in both the short term and the long term<br />
* Model fairness improved when click data, alone or combined with survey data, was included in place of institutional data<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online programs<br />
* Whether or not socio-demographic information was included, the model showed worse true negative rates for underrepresented minority (URM; not White or Asian) students, and worse accuracy for URM students studying in person<br />
* The model showed better recall for URM students, whether they were in the residential or the online program<br />
<br />
<br />
Ramineni & Williamson (2018) [https://files.eric.ed.gov/fulltext/EJ1202928.pdf pdf]<br />
* Revised automated scoring engine for assessing GRE essays<br />
* E-rater gave African American test-takers significantly lower scores than human raters when assessing their written responses to argument prompts<br />
* The shorter essays written by African American test-takers were more likely to receive lower scores for weakness in content and organization<br />
<br />
<br />
Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
* Automated scoring models for evaluating English essays, or e-rater <br />
* The score difference between human raters and e-rater was significantly smaller for 11th grade essays written by African American and White students than for other groups<br />
<br />
<br />
<br />
Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502 pdf]<br />
* A later version of automated scoring models for evaluating English essays, or e-rater<br />
* E-rater gave significantly lower scores than human raters when assessing African-American students’ written responses to the issue prompt in the GRE<br />
<br />
<br />
Jiang & Pardos (2021) [https://dl.acm.org/doi/pdf/10.1145/3461702.3462623 pdf]<br />
* Predicting university course grades using LSTM<br />
* Roughly equal accuracy across racial groups<br />
* Slightly better accuracy (~1%) across racial groups when including race in the model<br />
<br />
<br />
Zhang et al. (in press) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM22_paper_35.pdf pdf]<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across racial/ethnic groups. <br />
* No racial/ethnic group consistently had best-performing detectors<br />
<br />
<br />
Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Compared members of overrepresented racial groups to members of underrepresented racial groups (over 2/3 Black/African American)<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for overrepresented versus underrepresented students</div>Ryanhttps://www.pcla.wiki/index.php?title=Gender:_Male/Female&diff=386Gender: Male/Female2022-07-04T13:15:13Z<p>Ryan: Added Li, Xing, & Leite (2022)</p>
<hr />
<div>Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved significantly lower Kappa but higher AUC for male students than female students<br />
* JRip decision rules achieved much lower Kappa and AUC for male students than female students<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting students' high school dropout<br />
* The decision trees showed very minor differences in AUC between female and male students<br />
<br />
<br />
Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]<br />
* Models predicting if a college student will fail in a course<br />
* The multiple cooperative classifier model (MCCM) was the best at reducing bias (discrimination against male students), performing particularly well for the Psychology course<br />
* Other models (Logistic Regression and Rawlsian Fairness) performed far worse for male students, particularly in Computer Science and Electrical Engineering<br />
<br />
<br />
Anderson et al. (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM2019_paper56.pdf pdf]<br />
* Models predicting six-year college graduation<br />
* False negative rates were greater for male students than female students when SVM, Logistic Regression, and SGD were used<br />
<br />
<br />
Gardner, Brooks and Baker (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/LAK_PAPER97_CAMERA.pdf pdf]<br />
* Model predicting MOOC dropout, specifically through slicing analysis<br />
* Some algorithms studied performed worse for female students than male students, particularly in courses with 45% or less male presence<br />
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
* Model predicting course outcome<br />
* Marginal differences were found between groups in prediction quality and in the overall proportion of students predicted to pass<br />
* These differences were inconsistent in direction across algorithms<br />
<br />
<br />
Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
* Models predicting college success (defined as earning the median grade or above)<br />
* Random forest algorithms performed significantly worse for male students than female students<br />
* The model's fairness (namely demographic parity and equality of opportunity), as well as its accuracy, improved after adjusting decision thresholds from 0.5 to group-specific values<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Model predicting undergraduate short-term (course grades) and long-term (average GPA) success<br />
* Female students were inaccurately predicted to achieve greater short-term and long-term success than male students.<br />
* The fairness of models improved when a combination of institutional and click data was used in the model<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online programs<br />
* Whether or not socio-demographic information was included, the model showed worse true negative rates and worse accuracy for male students<br />
* The model showed better recall for male students, especially for those studying in person<br />
* The differences in recall and true negative rates were smaller, and thus fairer, for male students studying online when socio-demographic information was not included in the model<br />
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
* Models predicting course outcome of students in a virtual learning environment (VLE)<br />
* More male students were predicted to pass the course than female students, but this overestimation was fairly small and not consistent across different algorithms<br />
* Among the algorithms, Naive Bayes had the lowest normalized mutual information value and the highest ABROCA value<br />
<br />
<br />
Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
* Automated scoring models for evaluating English essays, or e-rater<br />
* The e-rater system performed comparably accurately for male and female students when assessing their 11th grade essays<br />
<br />
<br />
<br />
Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true pdf]<br />
* A later version of automated scoring models for evaluating English essays, or e-rater<br />
* The e-rater system correlated comparably well with human raters when assessing TOEFL and GRE essays written by male and female students<br />
<br />
<br />
Verdugo et al. (2022) [https://dl.acm.org/doi/abs/10.1145/3506860.3506902 pdf]<br />
* Algorithms predicting dropout from university after the first year<br />
* Several algorithms achieved better AUC for male than female students; results were mixed for F1.<br />
<br />
<br />
Zhang et al. (in press) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM22_paper_35.pdf pdf]<br />
* Detecting student use of self-regulated learning (SRL) in mathematical problem-solving process<br />
* For each SRL-related detector, relatively small differences in AUC were observed across gender groups. <br />
* No gender group consistently had best-performing detectors<br />
<br />
<br />
Rzepka et al. (2022) [https://www.insticc.org/node/TechnicalProgram/CSEDU/2022/presentationDetails/109621 pdf]<br />
* Models predicting whether a student will quit a spelling learning activity without completing it<br />
* Multiple algorithms have slightly better false positive rates and AUC ROC for male students than female students, but equivalent performance on multiple other metrics.<br />
<br />
<br />
Li, Xing, & Leite (2022) [https://dl.acm.org/doi/pdf/10.1145/3506860.3506869?casa_token=OZmlaKB9XacAAAAA:2Bm5XYi8wh4riSmEigbHW_1bWJg0zeYqcGHkvfXyrrx_h1YUdnsLE2qOoj4aQRRBrE4VZjPrGw pdf]<br />
* Models predicting whether two students will communicate on an online discussion forum<br />
* Multiple fairness approaches lead to ABROCA of under 0.01 for female versus male students</div>Ryanhttps://www.pcla.wiki/index.php?title=Algorithmic_Bias_in_Education&diff=385Algorithmic Bias in Education2022-06-27T01:36:05Z<p>Ryan: /* Algorithmic Bias in Education */</p>
<hr />
<div>== Algorithmic Bias in Education ==<br />
<br />
This Wiki summarizes the current peer-reviewed published evidence surrounding Algorithmic Bias in Education:<br />
which groups are impacted, and in which contexts.<br />
<br />
For a relatively recent review on this topic, see <br />
Baker, R.S., Hawn, M.A. (in press) Algorithmic Bias in Education. To appear in <em>International Journal of Artificial Intelligence and Education</em><br />
([https://www.upenn.edu/learninganalytics/ryanbaker/AlgorithmicBiasInEducation_rsb3.7.pdf pdf])<br />
<br />
Note: Within this Wiki, we recommend that page editors use the group labels originally<br />
used within the publications being cited, to best represent the articles included here. We also request that each page center the members of the group the page is about.<br />
<br />
== By Group Impacted ==<br />
* Race and Ethnicity<br />
** [[Black/African-American Learners in North America]]<br />
** [[Latino/Latina/Latinx/Hispanic Learners in North America]]<br />
** [[Asian/Asian-American Learners in North America]]<br />
** [[White Learners in North America]]<br />
** [[Indigenous Learners in North America]]<br />
** [[Research on Race and Ethnicity Conducted Outside of North America]] <br />
* [[Gender: Male/Female]]<br />
* [[Gender: Non-Binary and Transgender Learners]]<br />
* [[Sexual Orientation]]<br />
* [[Linguistic Origin]]<br />
* [[National Origin or National Location]]<br />
* [[International Students]]<br />
* [[Native Language and Dialect]]<br />
* [[Learners with Disabilities]]<br />
* [[Age]]<br />
* [[Urbanicity]]<br />
* [[Parental Educational Background]]<br />
* [[Socioeconomic Status]]<br />
* [[Military-Connected Status]]<br />
* [[Children of Migrant Workers]]<br />
* [[Religion and Religious Background]]<br />
* [[Public or Private K-12 School]]<br />
* [[Intersectional Research]]<br />
<br />
== By Algorithm Application == <br />
* [[At-risk/Dropout/Stopout/Graduation Prediction]]<br />
* [[Course Grade and GPA Prediction]]<br />
* [[National and International Examination]]<br />
* [[Short-term Performance and Learning Gains Prediction]]<br />
* [[Automated Essay Scoring]]<br />
* [[Speech Recognition for Education]]<br />
* [[Other NLP Applications of Algorithms in Education]]<br />
* [[Student Knowledge Modeling]]<br />
* [[Engagement and Affect Detection]]<br />
* [[Self-regulated Learning]]<br />
* [[Task/Activity Quit Prediction]]</div>Ryanhttps://www.pcla.wiki/index.php?title=Algorithmic_Bias_in_Education&diff=384Algorithmic Bias in Education2022-06-27T01:34:43Z<p>Ryan: /* Algorithmic Bias in Education */</p>
<hr />
<div>== Algorithmic Bias in Education ==<br />
<br />
This Wiki summarizes the current published evidence surrounding Algorithmic Bias in Education:<br />
which groups are impacted, and in which contexts.<br />
<br />
For a relatively recent review on this topic, see <br />
Baker, R.S., Hawn, M.A. (in press) Algorithmic Bias in Education. To appear in <em>International Journal of Artificial Intelligence and Education</em><br />
([https://www.upenn.edu/learninganalytics/ryanbaker/AlgorithmicBiasInEducation_rsb3.7.pdf pdf])<br />
<br />
Note: Within this Wiki, we recommend that page editors use the group labels originally<br />
used within the publications being cited, to best represent the articles included here. We also request that each page center the members of the group the page is about.<br />
<br />
== By Group Impacted ==<br />
* Race and Ethnicity<br />
** [[Black/African-American Learners in North America]]<br />
** [[Latino/Latina/Latinx/Hispanic Learners in North America]]<br />
** [[Asian/Asian-American Learners in North America]]<br />
** [[White Learners in North America]]<br />
** [[Indigenous Learners in North America]]<br />
** [[Research on Race and Ethnicity Conducted Outside of North America]] <br />
* [[Gender: Male/Female]]<br />
* [[Gender: Non-Binary and Transgender Learners]]<br />
* [[Sexual Orientation]]<br />
* [[Linguistic Origin]]<br />
* [[National Origin or National Location]]<br />
* [[International Students]]<br />
* [[Native Language and Dialect]]<br />
* [[Learners with Disabilities]]<br />
* [[Age]]<br />
* [[Urbanicity]]<br />
* [[Parental Educational Background]]<br />
* [[Socioeconomic Status]]<br />
* [[Military-Connected Status]]<br />
* [[Children of Migrant Workers]]<br />
* [[Religion and Religious Background]]<br />
* [[Public or Private K-12 School]]<br />
* [[Intersectional Research]]<br />
<br />
== By Algorithm Application == <br />
* [[At-risk/Dropout/Stopout/Graduation Prediction]]<br />
* [[Course Grade and GPA Prediction]]<br />
* [[National and International Examination]]<br />
* [[Short-term Performance and Learning Gains Prediction]]<br />
* [[Automated Essay Scoring]]<br />
* [[Speech Recognition for Education]]<br />
* [[Other NLP Applications of Algorithms in Education]]<br />
* [[Student Knowledge Modeling]]<br />
* [[Engagement and Affect Detection]]<br />
* [[Self-regulated Learning]]<br />
* [[Task/Activity Quit Prediction]]</div>Ryanhttps://www.pcla.wiki/index.php?title=Gender:_Male/Female&diff=383Gender: Male/Female2022-06-21T03:03:29Z<p>Ryan: added Rzepka</p>
<hr />
<div>Kai et al. (2017) [https://www.upenn.edu/learninganalytics/ryanbaker/DLRN-eVersity.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved significantly lower Kappa but higher AUC for male students than female students<br />
* JRip decision rules achieved much lower Kappa and AUC for male students than female students<br />
<br />
<br />
Christie et al. (2019) [https://files.eric.ed.gov/fulltext/ED599217.pdf pdf]<br />
* Models predicting students' high school dropout<br />
* The decision trees showed very minor differences in AUC between female and male students<br />
<br />
<br />
Hu and Rangwala (2020) [https://files.eric.ed.gov/fulltext/ED608050.pdf pdf]<br />
* Models predicting if a college student will fail in a course<br />
* The multiple cooperative classifier model (MCCM) was the best at reducing bias (discrimination against male students), performing particularly well for the Psychology course<br />
* Other models (Logistic Regression and Rawlsian Fairness) performed far worse for male students, particularly in Computer Science and Electrical Engineering<br />
<br />
<br />
Anderson et al. (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/EDM2019_paper56.pdf pdf]<br />
* Models predicting six-year college graduation<br />
* False negative rates were greater for male students than female students when SVM, Logistic Regression, and SGD were used<br />
<br />
<br />
Gardner, Brooks and Baker (2019) [https://www.upenn.edu/learninganalytics/ryanbaker/LAK_PAPER97_CAMERA.pdf pdf]<br />
* Model predicting MOOC dropout, specifically through slicing analysis<br />
* Some algorithms studied performed worse for female students than male students, particularly in courses with 45% or less male presence<br />
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
* Model predicting course outcome<br />
* Marginal differences were found between groups in prediction quality and in the overall proportion of students predicted to pass<br />
* These differences were inconsistent in direction across algorithms<br />
<br />
<br />
Lee and Kizilcec (2020) [https://arxiv.org/pdf/2007.00088.pdf pdf]<br />
* Models predicting college success (defined as earning the median grade or above)<br />
* Random forest algorithms performed significantly worse for male students than female students<br />
* The model's fairness (namely demographic parity and equality of opportunity), as well as its accuracy, improved after adjusting decision thresholds from 0.5 to group-specific values<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Model predicting undergraduate short-term (course grades) and long-term (average GPA) success<br />
* Female students were inaccurately predicted to achieve greater short-term and long-term success than male students.<br />
* The fairness of models improved when a combination of institutional and click data was used in the model<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online programs<br />
* Whether or not socio-demographic information was included, the model showed worse true negative rates and worse accuracy for male students<br />
* The model showed better recall for male students, especially for those studying in person<br />
* The differences in recall and true negative rates were smaller, and thus fairer, for male students studying online when socio-demographic information was not included in the model<br />
<br />
<br />
Riazy et al. (2020) [https://www.scitepress.org/Papers/2020/93241/93241.pdf pdf]<br />
* Models predicting course outcome of students in a virtual learning environment (VLE)<br />
* More male students were predicted to pass the course than female students, but this overestimation was fairly small and not consistent across different algorithms<br />
* Among the algorithms, Naive Bayes had the lowest normalized mutual information value and the highest ABROCA value<br />
<br />
<br />
Bridgeman et al. (2009) [https://www.researchgate.net/publication/242203403_Considering_Fairness_and_Validity_in_Evaluating_Automated_Scoring pdf]<br />
* Automated scoring models for evaluating English essays, or e-rater<br />
* The e-rater system performed comparably accurately for male and female students when assessing their 11th grade essays<br />
<br />
<br />
<br />
Bridgeman et al. (2012) [https://www.tandfonline.com/doi/pdf/10.1080/08957347.2012.635502?needAccess=true pdf]<br />
* A later version of the e-rater automated essay scoring models<br />
* The e-rater system correlated comparably well with human raters when scoring TOEFL and GRE essays written by male and female students<br />
<br />
<br />
Verdugo et al. (2022) [https://dl.acm.org/doi/abs/10.1145/3506860.3506902 pdf]<br />
* Models predicting dropout from university after the first year<br />
* Several algorithms achieved better AUC for male than female students; results were mixed for F1.<br />
<br />
<br />
Zhang et al. (in press)<br />
* Detectors of student use of self-regulated learning (SRL) during mathematical problem solving<br />
* For each SRL-related detector, relatively small differences in AUC were observed across gender groups<br />
* No gender group consistently had the best-performing detectors<br />
<br />
<br />
Rzepka et al. (2022) [https://www.insticc.org/node/TechnicalProgram/CSEDU/2022/presentationDetails/109621 pdf]<br />
* Models predicting whether a student will quit a spelling learning activity without completing it<br />
* Multiple algorithms had slightly better false positive rates and ROC AUC for male students than for female students, but equivalent performance on multiple other metrics.</div>Ryanhttps://www.pcla.wiki/index.php?title=Parental_Educational_Background&diff=382Parental Educational Background2022-06-21T03:02:32Z<p>Ryan: </p>
<hr />
<div>Kai et al. (2017) [https://files.eric.ed.gov/fulltext/ED596601.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved much higher Kappa and AUC for students whose parents did not attend college than for those whose parents did<br />
* JRip decision rules achieved much higher Kappa and AUC for students whose parents did not attend college than for those whose parents did<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Models predicting undergraduate course grades and average GPA<br />
* First-generation college students were inaccurately predicted to earn lower course grades and average GPA<br />
* Fairness of models improved with the inclusion of clickstream and survey data<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online programs<br />
* Whether or not socio-demographic information was included, the models showed worse accuracy and true negative rates for first-generation students studying in person<br />
* The models showed better recall for first-generation students, especially for those studying in person<br />
<br />
<br />
Kung & Yu (2020) [https://dl.acm.org/doi/pdf/10.1145/3386527.3406755 pdf]<br />
* Predicting course grades and later GPA at a public U.S. university<br />
* Worse performance on independence for first-generation students in course grade prediction on 5 of 5 classic machine learning algorithms; worse performance on separation for 3 of 5 algorithms; comparable performance on sufficiency for 5 of 5 algorithms<br />
* Worse performance on independence for first-generation students in later GPA prediction for 3 of 5 algorithms; 2 algorithms had worse separation and 1 algorithm had worse sufficiency (the three fairness criteria are formalized below)<br />
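In the group-fairness literature, these three criteria are conventionally stated as (conditional) independence conditions on the prediction <math>\hat{Y}</math>, the true outcome <math>Y</math>, and the group attribute <math>A</math>; Kung & Yu's operationalization may differ in detail:<br />
<math>
\text{Independence: } \hat{Y} \perp A, \qquad
\text{Separation: } \hat{Y} \perp A \mid Y, \qquad
\text{Sufficiency: } Y \perp A \mid \hat{Y}
</math><br />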
<br />
<br />
Rzepka et al. (2022) [https://www.insticc.org/node/TechnicalProgram/CSEDU/2022/presentationDetails/109621 pdf]<br />
* Models predicting whether a student will quit a spelling learning activity without completing it<br />
* Multiple algorithms had slightly better false positive rates and ROC AUC for students with at least one parent who graduated high school, but equivalent performance on multiple other metrics.</div>Ryanhttps://www.pcla.wiki/index.php?title=Parental_Educational_Background&diff=381Parental Educational Background2022-06-21T03:02:11Z<p>Ryan: added Rzepka</p>
<hr />
<div>Kai et al. (2017) [https://files.eric.ed.gov/fulltext/ED596601.pdf pdf]<br />
* Models predicting student retention in an online college program<br />
* J48 decision trees achieved much higher Kappa and AUC for students whose parents did not attend college than for those whose parents did<br />
* JRip decision rules achieved much higher Kappa and AUC for students whose parents did not attend college than for those whose parents did<br />
<br />
<br />
Yu et al. (2020) [https://files.eric.ed.gov/fulltext/ED608066.pdf pdf]<br />
* Models predicting undergraduate course grades and average GPA<br />
* First-generation college students were inaccurately predicted to earn lower course grades and average GPA<br />
* Fairness of models improved with the inclusion of clickstream and survey data<br />
<br />
<br />
Yu et al. (2021) [https://dl.acm.org/doi/pdf/10.1145/3430895.3460139 pdf]<br />
* Models predicting college dropout for students in residential and fully online programs<br />
* Whether or not socio-demographic information was included, the models showed worse accuracy and true negative rates for first-generation students studying in person<br />
* The models showed better recall for first-generation students, especially for those studying in person<br />
<br />
<br />
Kung & Yu (2020) [https://dl.acm.org/doi/pdf/10.1145/3386527.3406755 pdf]<br />
* Predicting course grades and later GPA at a public U.S. university<br />
* Worse performance on independence for first-generation students in course grade prediction on 5 of 5 classic machine learning algorithms; worse performance on separation for 3 of 5 algorithms; comparable performance on sufficiency for 5 of 5 algorithms<br />
* Worse performance on independence for first-generation students in later GPA prediction for 3 of 5 algorithms; 2 algorithms had worse separation and 1 algorithm had worse sufficiency<br />
<br />
<br />
Rzepka et al. (2022) [https://www.insticc.org/node/TechnicalProgram/CSEDU/2022/presentationDetails/109621 pdf]<br />
* Models predicting whether a student will quit a spelling learning activity without completing it<br />
* Multiple algorithms had slightly better false positive rates for second-language speakers than for native speakers, but equivalent performance on multiple other metrics.<br />
* Multiple algorithms had slightly better false positive rates and ROC AUC for students with at least one parent who graduated high school, but equivalent performance on multiple other metrics.<br />
* Multiple algorithms had slightly better false positive rates and ROC AUC for male students than for female students, but equivalent performance on multiple other metrics.</div>Ryanhttps://www.pcla.wiki/index.php?title=Native_Language_and_Dialect&diff=380Native Language and Dialect2022-06-21T03:01:36Z<p>Ryan: added Rzepka</p>
<hr />
<div>Naismith et al. (2018) [http://d-scholarship.pitt.edu/40665/1/EDM2018_paper_37.pdf pdf]<br />
* A model measuring L2 learners' lexical sophistication using a frequency list derived from native-speaker corpora (see the frequency-list sketch below)<br />
* Arabic-speaking learners were rated systematically lower than speakers of Chinese, Japanese, Korean, and Spanish across all levels of human-assessed English proficiency<br />
* Level 5 Arabic-speaking learners were inaccurately evaluated as having a level of lexical sophistication similar to that of Level 4 learners from China, Japan, Korea, and Spain<br />
* When used on the ETS corpus, essays by Japanese-speaking learners with higher human-rated lexical sophistication were rated significantly lower in lexical sophistication than those of Arabic-, Japanese-, Korean-, and Spanish-speaking peers<br />
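Frequency-list measures of lexical sophistication typically score a text by how rare its words are in a reference corpus, e.g. the mean negative log relative frequency of its tokens. A minimal sketch of that general idea; the frequency list below is made up, and this is not Naismith et al.'s actual feature set.<br />
<syntaxhighlight lang="python">
# Frequency-list lexical sophistication sketch: rarer words score higher.
import math

# Made-up occurrences per million tokens in a native-speaker corpus.
freq_per_million = {"the": 50000, "good": 900, "analyze": 120,
                    "ubiquitous": 8, "ephemeral": 3}

def sophistication(tokens, freq, default=1.0):
    """Mean negative log10 relative frequency of the tokens."""
    vals = [-math.log10(freq.get(t, default) / 1_000_000) for t in tokens]
    return sum(vals) / len(vals)

print(sophistication(["the", "good", "analyze"], freq_per_million))              # lower
print(sophistication(["ubiquitous", "ephemeral", "analyze"], freq_per_million))  # higher
</syntaxhighlight>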
<br />
<br />
Loukina et al. (2019) [https://aclanthology.org/W19-4401.pdf pdf]<br />
* Models providing automated speech scores on an English language proficiency assessment<br />
* The L1-specific model, trained on the speaker's native language, was the least fair, especially for Chinese, Japanese, and Korean speakers, but not for German speakers<br />
* All models (Baseline, Fair feature subset, L1-specific) performed worse for Japanese speakers<br />
<br />
<br />
Rzepka et al. (2022) [https://www.insticc.org/node/TechnicalProgram/CSEDU/2022/presentationDetails/109621 pdf]<br />
* Models predicting whether a student will quit a spelling learning activity without completing it<br />
* Multiple algorithms had slightly better false positive rates for second-language speakers than for native speakers, but equivalent performance on multiple other metrics.</div>Ryan