For the new analytical system, the Customer’s architects selected the following frameworks:
• Apache Hadoop – for data storage
• Apache Hive – for data aggregation, query and analysis
• Apache spark – for data processing
Amazon Web Services and Microsoft Azure were selected as cloud computing platforms. Upon the Customer’s request, during the migration, the old system and the new one were operating in parallel.
The system has been supplied with raw data taken from multiple sources, such as TV views, mobile devices browsing history, website visits data and surveys. To enable the system to process more than 1,000 different types of raw data (archives, XLS, TXT, etc.), data preparation included the following stages coded in Python:
• Data transformation
• Data parsing
• Data merging
• Data loading into the system