You may remember that a couple of weeks ago we compiled a list of tricks for image segmentation problems.
This time we’ve gone through the five most recent Kaggle text classification competitions, extracted the best insights from the discussions and winning solutions, and collected them in this article.
It took some work, but we grouped them into the following topics:
- Dealing with large datasets
- Small datasets and external data
- Data exploration for NLP
- Data cleaning
- Text representation
- Evaluation and cross-validation
- Runtime tricks
- Model ensembling
What do you think should be added to this list?
Do you have any additional tips from your own experience with text classification problems, in research or industry, that you could share?