Job Description
Looking for a few individuals to help clean and normalize a large text dataset for a research project. The data is stored in Parquet files. Sample here:
Primary Task:
Clean description text fields by removing noise that interferes with downstream NLP and embedding work, including:
- HTML tags and artifacts
- Malformed UTF-8 characters
- EOF markers and control characters
- Boilerplate headers, footers, and repeated templates
- Other non-content noise
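As a rough illustration of the kind of cleaning listed above, here is a minimal sketch using only the Python standard library. The function name `clean_description` and the exact rules are assumptions made for illustration, not the project's actual pipeline; boilerplate and template removal would need dataset-specific patterns and is not covered here.

```python
import html
import re

# Naive tag matcher; enough for stray markup, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")
# Control characters except tab, newline, and carriage return.
# Includes \x1a (Ctrl-Z), which some legacy systems use as an EOF marker.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_description(text: str) -> str:
    """Strip common non-content noise from one description string."""
    # Remove tags first, so entities decoded in the next step are kept as text.
    text = TAG_RE.sub(" ", text)
    # Decode HTML entities such as &amp; and &nbsp;.
    text = html.unescape(text)
    # Drop U+FFFD, the replacement character left behind by bad UTF-8 decoding.
    text = text.replace("\ufffd", "")
    text = CONTROL_RE.sub("", text)
    # Collapse runs of whitespace (including non-breaking spaces) to one space.
    return WHITESPACE_RE.sub(" ", text).strip()
```

For example, `clean_description("<p>Hello&nbsp;world</p>")` returns `"Hello world"`.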
Tech Stack:
- Python (Pandas, regex, multiprocessing)
- Parquet file format
- Data hosted on DigitalOcean Spaces
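A possible shape for batch processing with this stack is sketched below, one Parquet file per worker process. It assumes the files have already been downloaded from Spaces to a local directory, that a Parquet engine such as pyarrow is installed, and that the text column is named `description`; the helper names (`strip_tags`, `clean_frame`, `clean_file`, `clean_all`) are hypothetical, and `strip_tags` stands in for a fuller cleaning step.

```python
import re
from multiprocessing import Pool
from pathlib import Path

import pandas as pd

TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Placeholder cleaning step; non-string values (e.g. missing) pass through."""
    if not isinstance(text, str):
        return text
    return " ".join(TAG_RE.sub(" ", text).split())


def clean_frame(df, column="description"):
    """Return a copy of df with the (assumed) text column cleaned."""
    out = df.copy()
    out[column] = out[column].map(strip_tags)
    return out


def clean_file(job):
    """Read one Parquet file, clean it, and write the result."""
    src, dst = job
    clean_frame(pd.read_parquet(src)).to_parquet(dst, index=False)
    return dst


def clean_all(src_dir, dst_dir, workers=4):
    """Clean every *.parquet file in src_dir in parallel, writing to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    jobs = [(str(p), str(dst / p.name)) for p in sorted(Path(src_dir).glob("*.parquet"))]
    with Pool(workers) as pool:
        return pool.map(clean_file, jobs)
```

Calling `clean_all("raw/", "clean/")` from under an `if __name__ == "__main__":` guard keeps the worker processes from re-running the script on platforms that spawn rather than fork.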
Ideal Candidate:
- Proficient in Python and Pandas
- Experience processing large-scale text data efficiently
- Strong regex skills
- Comfortable working with Parquet files and batch processing
- Bonus: NLP or text preprocessing background
Additional Info:
- After the contract starts, we will use Slack for communication and task assignment
- Additional tasks may...