Research Engineer with 4 years of machine learning and NLP experience and 10 years of computational algorithm development. Winner of programming and math competitions (Google Code Jam: 11th place overall; IMC: 1st prize).

At Grammarly, we have long used Amazon EMR with Hadoop and Pig to support our big data processing needs. However, we were excited about the improvements that the maturing Apache Spark offers over Hadoop and Pig, so we set about getting Spark to work with our petabyte-scale text data set. This talk describes the challenges we encountered along the way and the scalable, working Spark setup we arrived at as a result.