LETTER | I would like to point out some problems with the Hotspot Identification for Dynamic Engagement (Hide) system and the list of "potential hotspots" generated using it.

Data analytics has been widely used by big tech companies such as Google and Facebook. Predictive data analytics involves collecting, cleaning, analysing and validating data, from which a predictive model is built. This cycle is often repeated thousands or millions of times by computers to arrive at a predictive model.

Big tech companies can use this effectively because their entire business ecosystem is online. Big tech has almost no physical business or products. The cycle from data collection to validation can be run millions of times virtually from a workstation in California.

When it comes to real-world applications of big data, it becomes a lot more complex. There is no crystal ball that can predict what will happen in the real world. Even the most advanced algorithms cannot predict the stock market, as the stock market is tied to transactions that happen in the real world and not the virtual world.

Using MySejahtera data to predict hotspots is prone to sampling error and bias. These problems are well known to the scientific community but not well understood by the public. Any data scientist or researcher knows that rubbish in equals rubbish out. The conclusions drawn from a prediction depend on the quality of the data collected.

Looking at the list produced, nearly all the potential hotspots identified are shopping malls. The vast majority are in the Klang Valley. There are only three potential hotspots in Kelantan despite the huge outbreak in the state.

Shopping malls in the Klang Valley are stricter with their standard operating procedure (SOP), and compliance is enforced by mall security. These businesses have a direct incentive to ensure they comply with the SOP.

On the other hand, some places have no incentive to comply with the SOP. They see the SOP as an obstacle to their work.

MySejahtera also does not collect accurate data from homes, nursing homes, factories, prisons, hospitals or public areas such as parks. Certain groups, such as the elderly, young children and undocumented workers, are also unrepresented in MySejahtera data. It appears that time spent shopping for groceries is over-represented.

The average person spends most of their time at home or at work, but this information is not represented. There may be fewer people at home or at work than at a shopping mall, but the time spent with them is longer and masks are not always worn.
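
To make this concrete, here is a rough sketch in Python using made-up numbers. The two settings, the head counts and the hours are assumptions chosen purely for illustration and are not MySejahtera figures. Person-hours of contact is also only a crude stand-in for exposure. The sketch ranks the two places first by check-ins and then by person-hours.

```python
from dataclasses import dataclass


@dataclass
class Setting:
    """A place where people mix. All values used below are hypothetical."""
    name: str
    check_ins: int           # visits recorded by a check-in app (0 if nobody scans in)
    people_present: int      # people actually sharing the space in a day
    hours_per_person: float  # typical time each person spends there

    def person_hours(self) -> float:
        # Crude proxy for exposure: number of people multiplied by time spent together.
        return self.people_present * self.hours_per_person


# Assumed figures: a busy mall with short visits, and a factory where nobody scans in.
settings = [
    Setting("shopping mall", check_ins=1000, people_present=1000, hours_per_person=0.5),
    Setting("factory floor", check_ins=0, people_present=200, hours_per_person=8.0),
]

# Ranking by check-ins only sees places where people scan in.
by_check_ins = max(settings, key=lambda s: s.check_ins)

# Ranking by person-hours accounts for how long people stay together.
by_exposure = max(settings, key=Setting.person_hours)

for s in settings:
    print(f"{s.name}: {s.check_ins} check-ins, {s.person_hours():.0f} person-hours")
print("Top by check-ins:", by_check_ins.name)
print("Top by person-hours:", by_exposure.name)
```

With these assumed figures the mall tops the check-in ranking, yet the factory accounts for more than three times the person-hours of contact and never appears in the check-in data at all.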

Some of the known factors in the spread of the Covid-19 virus are the distance from the infected person, the time spent in close proximity, air quality, the infectivity of the individual and whether those involved were wearing masks. This information is not captured by the so-called "big data" from MySejahtera.

This leads to several ethical questions.

No doubt the more people scan in at a location, the higher the risk of Covid-19 there. But is it fair to name certain businesses simply because they are using MySejahtera?

In the early days of the pandemic, an assurance was given by the honourable minister that the data would be kept confidential. Is that still true now? Businesses are now openly named and identified.

Is it ethical for the state to release information without mentioning the appropriate caveats, biases and limitations of the data? How would the average layperson be able to know the truth? Is this good or bad scientific practice?

Would this not instead discourage the usage of MySejahtera in the future?

The views expressed here are those of the author/contributor and do not necessarily represent the views of Malaysiakini.