
Outreachy report #53: January 2024

This month, I finished my Data Mining/Data Analysis course. It was offered under the “Advanced Information Systems Topics” umbrella, a set of optional courses for Information Systems undergraduates. I learned a little about using machine learning and natural language processing to analyze unstructured data, with libraries such as Pandas, Transformers, PyTorch, Gensim, and the Natural Language Toolkit. I experimented with unstructured data extracted from our 2019 longitudinal study, focusing on the following open-ended questions to evaluate alums’ overall sentiment about the program:

question_list = [
    'How did your mentor impact your career or life?',
    'How did your Outreachy project impact your career or life?',
    'What other ways has Outreachy impacted your career or life?',
    'After your Outreachy internship, did you have another internship? Which company or internship program?',
    'After your Outreachy internship, did you win any awards?',
    'After your Outreachy internship, did you take on any leadership roles?',
    'Tell us more about your successes after Outreachy!',
    'How would you like to volunteer to help Outreachy?',
    'Do you have any other feedback for Outreachy organizers?'
]

I used this model for my first experiments with sentiment analysis. Reading samples and their outputs, I realized the Neutral classification didn’t provide much insight into the answers we had. I decided to add Neutral-Positive answers to the Positive answers pool, and Neutral-Negative answers to the Negative answers pool. Here are some of the outputs:

  • Number of positive answers: 931
  • Number of negative answers: 280
  • Participants with mostly positive answers: 198
  • Participants with mostly negative answers: 17
  • Participants with mostly positive answers who continued contributing to open source projects after the internship: 84%
  • Participants with mostly negative answers who continued contributing to open source projects after the internship: 53%
  • Participants with positive mentor experiences: 173
  • Participants with negative mentor experiences: 62
  • Participants with positive mentor experiences AND achievements: 57%
  • Participants with negative mentor experiences AND achievements: 21%
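The pooling step described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code from the course: it assumes the classifier returns one score per class (negative, neutral, positive), that a "Neutral-Positive" answer is one where neutral scores highest and positive beats negative, and that a participant is "mostly positive" when positive answers outnumber negative ones. The function names, participant IDs, and scores are all made up.

```python
from collections import Counter, defaultdict


def fold_label(scores):
    """Collapse three-way scores into 'positive' or 'negative'.

    Assumption: a neutral answer is pooled with whichever of the
    other two classes has the higher score.
    """
    top = max(scores, key=scores.get)
    if top != "neutral":
        return top
    return "positive" if scores["positive"] >= scores["negative"] else "negative"


def dominant_sentiment(labels):
    """Summarize one participant's answers by simple majority."""
    counts = Counter(labels)
    if counts["positive"] > counts["negative"]:
        return "mostly positive"
    if counts["negative"] > counts["positive"]:
        return "mostly negative"
    return "tied"


# Hypothetical (participant, scores) pairs; not real survey data.
answers = [
    ("p1", {"negative": 0.05, "neutral": 0.15, "positive": 0.80}),
    ("p1", {"negative": 0.10, "neutral": 0.60, "positive": 0.30}),
    ("p2", {"negative": 0.35, "neutral": 0.55, "positive": 0.10}),
    ("p2", {"negative": 0.70, "neutral": 0.20, "positive": 0.10}),
    ("p2", {"negative": 0.02, "neutral": 0.08, "positive": 0.90}),
]

# Group the folded labels by participant, then summarize each group.
labels_by_participant = defaultdict(list)
for participant, scores in answers:
    labels_by_participant[participant].append(fold_label(scores))

summary = {p: dominant_sentiment(ls) for p, ls in labels_by_participant.items()}
print(summary)
```

Keeping the folding rule in its own function makes it easy to try a different rule (for example, dropping neutral answers entirely) and compare the resulting counts.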

I found this experiment so interesting that one of my main goals for this year is to continue the work I started during this course (which had limited scope and time) and to develop my data analysis skills. One of my first tasks is writing a report on how we can improve our data collection (creating the survey itself, understanding the output our tools provide us). One thing that became clear to me very quickly was how poorly designed the multiple choice questions were. Some of them produced data that is quite difficult to process and not that useful once it’s finally processed.

Unexpectedly, I found another useful tool for our future data analysis as I finished my final Algorithm Design assignment. I had to write a report on NP-complete problems, more specifically on the Clique Problem. Clique is originally a social science concept, created to describe a group of people who all have connections with each and every other person in that group. Psychometrics, sociometrics, and later on computer science have all refined that concept into a mathematical one: a complete subgraph of an undirected graph. It has given us interesting insights into fields like bioinformatics, and Skiena in particular mentions applications in detecting tax fraud. I can see us using it to explore relationships between sponsors and communities, internship cohorts, etc.
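To make the idea concrete, here is a minimal sketch of clique enumeration using the basic Bron-Kerbosch algorithm (without pivoting). It is an illustration rather than anything from the assignment: the helper names and the tiny graph with nodes "A" to "E" are made up, and the running time can grow exponentially with graph size, as expected for a problem this hard.

```python
def build_graph(edges):
    """Build a symmetric adjacency map from undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph):
    """Yield every maximal clique (basic Bron-Kerbosch, no pivoting)."""

    def expand(clique, candidates, excluded):
        # Nothing left to add and nothing already covered: clique is maximal.
        if not candidates and not excluded:
            yield clique
            return
        for node in list(candidates):
            neighbours = graph[node]
            yield from expand(
                clique | {node}, candidates & neighbours, excluded & neighbours
            )
            # Mark the node as processed so the same clique is not reported twice.
            candidates = candidates - {node}
            excluded = excluded | {node}

    yield from expand(set(), set(graph), set())


# Hypothetical connections; every node in a clique is linked to all the others.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
graph = build_graph(edges)

cliques = list(maximal_cliques(graph))
largest = max(cliques, key=len)
print(sorted(largest))
```

For larger graphs, a library implementation or the pivoting variant of the algorithm would likely be a better choice than this plain version.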

Overall, this academic term showed me how good a decision it was to re-enter university in 2019. I’m approaching my final undergraduate years (undergraduate degrees last 4.5 to 6 years in Brazil, which is why so many countries may consider them the equivalent of an undergraduate degree plus a master’s), and I’ve been finding it more and more useful to merge both worlds (work and studies). Here’s to more opportunities to learn and create!