The Kaggle Wine Reviews dataset contains ~150,000 unique wine reviews with the following fields: country, description, designation, points, price, province, region_1, region_2, variety, and winery. Country, province, region_1, and region_2 describe the wine’s origin at increasing levels of specificity. Winery is the producer of the wine, and designation is the vineyard within the winery from which the grapes were picked. The description field holds a sommelier’s notes on the wine’s taste, smell, look, feel, etc. Finally, the variety field is the type of grape used to produce the wine. Using these features, we hope to predict price and points, the latter serving as a proxy for quality.
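As a rough illustration of this layout, the sketch below parses one record and separates the descriptive fields from the two prediction targets. The sample row is made up for illustration, the helper name `split_row` is hypothetical, and treating a blank price as missing is an assumption about the data rather than something stated above.

```python
import csv
import io

# Descriptive fields used as inputs, and the two fields we hope to predict.
FEATURE_FIELDS = ["country", "description", "designation", "province",
                  "region_1", "region_2", "variety", "winery"]
TARGET_FIELDS = ["points", "price"]


def split_row(row):
    """Split one review record into (features, targets). Hypothetical helper."""
    features = {name: row[name] for name in FEATURE_FIELDS}
    targets = {
        "points": int(row["points"]),
        # Assumption: a blank price means the value is missing.
        "price": float(row["price"]) if row["price"] else None,
    }
    return features, targets


# Made-up sample record in the field order listed above (not real data).
SAMPLE_CSV = (
    "country,description,designation,points,price,province,"
    "region_1,region_2,variety,winery\n"
    'US,"Ripe blackberry and cedar, with firm tannins.",Example Vineyard,'
    "90,35.0,California,Napa Valley,Napa,Cabernet Sauvignon,Example Winery\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
features, targets = split_row(rows[0])
print(features["variety"], targets)
```

In practice the rows would come from the downloaded CSV file rather than an in-memory string; the parsing logic would be the same.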
Very quaffable and great fun: Applying NLP to wine reviews, Hendrickx et al.
Using textual reviews gathered from WineMag, Hendrickx et al. use lexical and semantic information to predict the color, grape variety, price, and country of origin of various wines. Because reviewers use similar descriptors, such as ‘fruity’, ‘notes of blackberry’, and ‘elegant’, wine reviews are consistent enough to support conclusions about the wine itself. The researchers represented the text by combining a bag-of-words model with 100 topics generated by Latent Dirichlet Allocation and 100 clusters based on Word2Vec word embeddings. The experiment yielded high F-scores for each predicted category. Price was predicted categorically, using ‘expensive’ and ‘cheap’ as the categories; however, the researchers note that in future work it would be better to predict price as a regression task. We therefore hope to extend their work by predicting price and quality as regression tasks and by addressing the aforementioned questions.
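To make the contrast between the categorical and regression framings concrete, here is a minimal sketch using only the Python standard library. It covers just the bag-of-words part of the representation; the LDA topics and Word2Vec clusters are omitted. The tokenizer, the function names, the sample reviews, and the price threshold of 20.0 are all illustrative assumptions, not details taken from Hendrickx et al.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(descriptions):
    """Map every distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for d in descriptions for tok in tokenize(d)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(description, vocabulary):
    """Return a count vector with one entry per vocabulary token."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(description)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


def price_as_category(price, threshold=20.0):
    """Categorical framing. The threshold is a hypothetical cut-off."""
    return "expensive" if price >= threshold else "cheap"


# Made-up (description, price) pairs, for illustration only.
reviews = [
    ("Fruity with notes of blackberry.", 10.0),
    ("Elegant, fruity, and elegant again.", 50.0),
]

vocabulary = build_vocabulary(d for d, _ in reviews)
X = [bag_of_words(d, vocabulary) for d, _ in reviews]
y_categorical = [price_as_category(p) for _, p in reviews]  # classification target
y_regression = [p for _, p in reviews]                      # regression target
```

The feature matrix `X` is identical in both framings; only the target changes. A classifier would be fit on `y_categorical`, while a regressor would be fit on `y_regression` and could distinguish, say, a 25-dollar wine from a 250-dollar one.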