I am a software development engineer at Amazon. I am interested in machine learning research, specifically natural language processing (NLP) for the Arabic language and for source-code-related problems. In parallel, I work on several small open-source projects to learn more about software engineering. I am also a (small) YouTuber: on my channel, YAGs, I teach problem solving, basic machine learning, and other topics.
BSc in Computer Science, 2019
Jordan University of Science and Technology
A research project to advance Arabic NLP by building a system for the automatic diacritization of Arabic text using deep learning techniques.
A web crawler that collects programming contests from many online judges and schedules them in one place.
A simple, easy-to-use tool that exports CodeForces contests and problems into PDF files in a readable, user-friendly format.
Authorship identification is essential for detecting the misuse of others’ content and for exposing the owners of anonymous malicious content. While it is widely studied for natural languages, it is rarely considered for programming languages. Accordingly, a PAN@FIRE task, named Authorship Identification of SOurce COde (AI-SOCO), is proposed with a focus on identifying the authors of source code. The dataset consists of crawled source codes submitted by the top 1,000 human users with 100 or more correct C++ submissions on the CodeForces online judge platform. Participating systems are asked to predict the author of a given source code from a predefined list of code authors. In total, 60 teams registered on the task’s CodaLab page; 14 of them submitted 94 runs. The results are surprisingly high, with many teams and baselines breaking the 90% accuracy barrier. These systems used a wide range of models and techniques, from pretrained word embeddings (especially ones tweaked to handle source code) to stylometric features.
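As a minimal sketch of the kind of baseline such systems build on (illustrative only; not the task's official baseline, and the snippets and author names below are made up), source-code authorship can be framed as text classification over character n-grams, which capture stylistic habits such as spacing and loop idioms:

```python
# Illustrative authorship-identification baseline: character
# n-gram TF-IDF features with a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the AI-SOCO data: a few C++ snippets per author.
snippets = [
    "for(int i=0;i<n;i++) cin>>a[i];",
    "for(int i=0;i<n;++i){cin>>a[i];}",
    'while (scanf("%d", &x) == 1) { total += x; }',
    'while (scanf("%d", &v) == 1) total += v;',
]
authors = ["alice", "alice", "bob", "bob"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(snippets, authors)
print(model.predict(["for(int j=0;j<m;j++) cin>>b[j];"])[0])
```

Character n-grams work well here because coding style (bracket placement, I/O idioms, identifier habits) leaves a fingerprint even when variable names change.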
In this paper, we describe our team’s (JUSTers) effort in the Commonsense Validation and Explanation (ComVE) task, which is part of SemEval-2020. We evaluate five pre-trained Transformer-based language models of various sizes against the three proposed subtasks. For the first two subtasks, the best accuracy levels achieved by our models are 92.90% and 92.30%, placing our team in 12th and 9th place, respectively. As for the last subtask, our models reach a 16.10 BLEU score and a 1.94 human evaluation score, placing our team in 5th and 3rd place according to these two metrics, respectively. The latter is only 0.16 away from the first-place human evaluation score.
In this work, we provide a genetic algorithm that quickly finds a placement for a set of objects within a given layout such that access to these objects is optimized. The layout describes the free locations for the objects and the object handles, and access is measured over a corpus of object requests. The proposed algorithm optimizes the placement of the objects by searching through only a small fraction of the search space. As a case study, we use the algorithm to find a better placement for the keyboard characters than the QWERTY and Dvorak Simplified layouts. The algorithm finds a placement that is better than QWERTY and Dvorak Simplified by 32.68% and 15.79%, respectively, on the training set, and by 32.71% and 15.84%, respectively, on the testing set. This result is achieved after searching through only 500K candidate solutions, which is only about 1.23 × 10⁻¹⁹ percent of the total search space. Both training and testing sets are extracted randomly from the TED2013 v1.1 English corpus. Moreover, we release the dataset, code, and experimental results on our GitHub repository.
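A minimal genetic-algorithm sketch of the placement idea, under simplifying assumptions (the characters, slot costs, and frequencies below are illustrative, not the paper's actual cost model or corpus): place characters into slots so that frequent characters land on cheap slots, while evaluating only a tiny fraction of all permutations.

```python
# Toy placement problem: assign characters to slots to minimize
# total access cost, searched with a simple genetic algorithm.
import random

CHARS = list("etaoinsrh")                    # objects to place
SLOT_COST = [1, 1, 2, 2, 3, 3, 4, 4, 5]     # access cost per slot
FREQ = dict(zip(CHARS, [12, 9, 8, 8, 7, 7, 6, 6, 5]))  # request counts

def fitness(layout):
    # Total access cost of a layout: lower is better.
    return sum(FREQ[c] * SLOT_COST[i] for i, c in enumerate(layout))

def mutate(layout):
    # Swap two random slots.
    a, b = random.sample(range(len(layout)), 2)
    layout = layout[:]
    layout[a], layout[b] = layout[b], layout[a]
    return layout

def crossover(p1, p2):
    # Order crossover: keep a slice of p1, fill the rest from p2.
    i, j = sorted(random.sample(range(len(p1)), 2))
    middle = p1[i:j]
    rest = [c for c in p2 if c not in middle]
    return rest[:i] + middle + rest[i:]

def evolve(generations=200, pop_size=30):
    pop = [random.sample(CHARS, len(CHARS)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]     # elitist selection
        children = [mutate(crossover(*random.sample(survivors, 2)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=fitness)

random.seed(0)
best = evolve()
print(best, fitness(best))
```

The GA examines at most `generations × pop_size` layouts, a vanishing fraction of the 9! = 362,880 permutations here, mirroring the paper's point that only ~500K of an astronomically large space needed to be searched.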
In this paper, we describe our team’s effort on the semantic text question similarity task of NSURL 2019. Our top-performing system utilizes several innovative data augmentation techniques to enlarge the training data. It then takes ELMo pre-trained contextual embeddings of the data and feeds them into an ON-LSTM network with self-attention. This yields sequence representation vectors that are used to predict the relation between the question pairs. The model is ranked 1st with a 96.499 F1-score (tied with the second-place F1-score) on the public leaderboard and 2nd with a 94.848 F1-score (1.076 F1-score behind first place) on the private leaderboard.
In this work, we present several deep learning models for the automatic diacritization of Arabic text. Our models are built using two main approaches, viz. Feed-Forward Neural Networks (FFNN) and Recurrent Neural Networks (RNN), with several enhancements such as 100-hot encoding, embeddings, Conditional Random Fields (CRF), and Block-Normalized Gradient (BNG). The models are tested on the only freely available benchmark dataset, and the results show that our models are either better than or on par with other models, even though those models, unlike ours, require language-dependent post-processing steps. Moreover, we show that diacritics in Arabic can be used to enhance models for NLP tasks such as Machine Translation (MT) by proposing the Translation over Diacritization (ToD) approach.
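A small sketch of how diacritization can be framed as character-level sequence labeling, the framing that FFNN/RNN diacritizers build on (the helper below is illustrative preprocessing, not the paper's code): strip the diacritics from diacritized text to obtain the input character sequence, keeping each removed diacritic as that character's target label.

```python
# Turn diacritized Arabic text into (characters, labels) pairs for
# sequence labeling: each letter's label is its diacritic, or "" if none.
import unicodedata

def to_labeled_sequence(diacritized):
    chars, labels = [], []
    for ch in diacritized:
        if unicodedata.category(ch) == "Mn":   # combining mark = diacritic
            if chars:
                labels[-1] += ch               # attach to preceding letter
        else:
            chars.append(ch)
            labels.append("")                  # "" = no diacritic
    return chars, labels

# "كَتَبَ" (he wrote): each consonant carries a fatha (U+064E).
word = "\u0643\u064e\u062a\u064e\u0628\u064e"
chars, labels = to_labeled_sequence(word)
print(chars, labels)
```

A model then predicts one diacritic class per input character, which is why no language-dependent post-processing is needed to reassemble the diacritized output.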