The importance of official statistics

lutes
Offline
Joined: 09/04/2020

The day the algorithms are such that they point to this when people do random searches on popular video platforms, I might start using those platforms again.

https://invidious.tube/watch?v=l3nqqebTXrw&ckattempt=1

chaosmonk
Offline
Joined: 07/07/2017

Have you read Algorithms of Oppression by Safiya Umoja Noble?

https://en.wikipedia.org/wiki/Algorithms_of_Oppression

lutes
Offline
Joined: 09/04/2020

I am asked to sign in with Google or Facebook in order to download the introduction.

This must be a bad dream.

I will try to borrow it somewhere; the inter-library loan service is quite efficient.

lutes
Offline
Joined: 09/04/2020

I have located it; I only need to get it shipped to my local library. I do not know why they do not have the PDF version available for download. That will be a good occasion to ask.

lutes
Offline
Joined: 09/04/2020

I will certainly read it; I think it is worth getting to the roots of the problem.

It is not only about any kind of prejudice, it is about setting the world ablaze. One has to know in depth who is adding fuel, and how.

lutes
Offline
Joined: 09/04/2020

Meanwhile, I am reading this exquisite food for thought:

https://www.tandfonline.com/doi/full/10.1080/1369118X.2018.1477967

While skimming through it I got that uncanny feeling that Big Brother, HAL, and the Matrix had merged while we were busy philosophizing about freedom.

"Within code, algorithms are usually woven together with hundreds of other algorithms to create algorithmic systems. It is the workings of these algorithmic systems that critical inquirers are mostly interested in, not the specific algorithms (many of which are quite benign and procedural). However, disentangling these ecologies often proves nigh impossible due to their topological complexity."

chaosmonk
Offline
Joined: 07/07/2017

I bought a hard copy on eBay, so I'm not sure where it can be found online. The library works too (the replacement for freedom-hostile software doesn't always have to be software).

> It is not only about any kind of prejudice, it is about setting the world ablaze. One has to know in depth who is adding fuel, and how.

Right. Since the code is proprietary, we can't study the algorithms directly, but we can compare different combinations of inputs to and outputs from the black box to gain some understanding of what's going on internally. It's reverse engineering, in a way. The author uses racial bias as an entry point into studying how Google's algorithms steer people in certain directions. By chapter 3 it becomes clear why this is a useful lens to look at the problem through (that's the part of the book your post reminded me of).

gaseousness
Offline
Joined: 08/25/2020

In my opinion, Google is definitely rigged. I recall trying to use "refined" searches to filter out the authoritative sources they want to push on certain controversial issues, to no avail.

https://support.google.com/websearch/answer/2466433?hl=en

lutes
Offline
Joined: 09/04/2020

Google is rotten.

Companies usually keep their motto for life. What urge did Google have to drop its "Don't be evil" motto?

If I remember correctly, in the first days of the large-scale internet we were using "Netscape" and more or less thematic "portals" as entry points, while search and meta-search engines were plowing their way across the glebe to index stuff. Google killed that original model in an easily predictable winner-take-all competition, which public authorities totally failed to regulate, and was consequently able to build its data empire upon its uncontested dominant position.

There are still many entry points through which ordinary users can bypass the monopoly behemoth. I try to keep bookmarking as many of them as needed instead of "googling" my way online. Maybe that's the subtle nuance between "to search" and "to google".

chaosmonk
Offline
Joined: 07/07/2017

> If I remember correctly, in the first days of the large-scale internet we were using "Netscape" and more or less thematic "portals" as entry points, while search and meta-search engines were plowing their way across the glebe to index stuff. Google killed that original model in an easily predictable winner-take-all competition, which public authorities totally failed to regulate, and was consequently able to build its data empire upon its uncontested dominant position.

I think the way to begin competing with Google would be to tackle one use case at a time, striving to provide high-quality search results for each use case with separable search engines, rather than trying to build an "everything" search engine.

For example, news search might be a good use case to start with. We don't need to index every article by every news source going back to the beginning of time to be somewhat useful. We could start by indexing recent articles from a variety of reasonably reliable sources, plus any older articles linked back to by those articles (if a recent article links to an older one, that implies something has happened recently that makes the older article a newly relevant bit of history).

I think that should make for a reasonable starting point. Then we'd look at the quality of the results and see what can be done to improve them. The cutoff point for recentness could be pushed back as resources become available.
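
Roughly, the indexing policy I have in mind could look like this. A minimal sketch: the `published` and `source` fields, the `should_index` name, and the 30-day cutoff are all made-up placeholders, not a real crawler.

```python
from datetime import datetime, timedelta, timezone

# Illustrative cutoff only; it would be pushed back as resources allow.
RECENCY_CUTOFF = timedelta(days=30)

def should_index(article, linked_from_recent):
    """Index an article if it is recent, or if a recently indexed
    article links back to it (which makes it newly relevant history).

    `article` is assumed to expose a timezone-aware `published`
    datetime; `linked_from_recent` is computed by the crawler.
    """
    age = datetime.now(timezone.utc) - article.published
    return age <= RECENCY_CUTOFF or linked_from_recent
```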

In the short term, everything other than news could be proxied from another search engine via Searx. Maybe news search could even be supplemented with results from another engine until it's good enough to stand on its own. Maybe a browser could have different search providers for different kinds of search use cases, rather than a single search provider for everything (there's a sketch of this routing below). Once news search is high quality enough to be actually useful to people, move on to other use cases:

For a generic information search, I suspect that indexing Wikidata, every Wikipedia article, every source cited by Wikipedia, and perhaps other documents by the same sources on a particular topic (meaning that if a source is frequently cited on Wikipedia in reference to a particular topic, then we index other documents by that source that are about the same topic) might make for a decent starting point.
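
As a sketch of how the cited sources could be collected, the MediaWiki API already exposes each article's external links; something like this could seed the index (continuation handling omitted, so only the first batch is fetched):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def cited_urls(title):
    """Yield the external links cited by one Wikipedia article."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "extlinks",
        "ellimit": "max",   # up to 500 links; real code would paginate
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    for page in data["query"]["pages"].values():
        for link in page.get("extlinks", []):
            yield link["*"]

for url in cited_urls("Search engine"):
    print(url)
```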

We can ease the burden on what the generic search engine has to be capable of by creating smaller search engines for different niches (academic articles, sheet music, etc.). These search engines don't all have to run on the same server. It's hard to decentralize "everything search" because it is not efficient to make requests to a bunch of different servers for each search query, but if each server is dedicated to a specific use case and focuses on providing high quality results for that use case, we can achieve a little bit of decentralization while still making only one request per query. There's no reason that my news searches have to go to the same server as my music searches. There's also no reason that my music searches have to go to the same server as your music searches; maybe our listening habits will determine which music search engine we prefer.
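
The dispatch this implies is almost trivial on the client side. A sketch with made-up endpoints, where anything without a dedicated server falls back to a general meta-search instance (e.g. Searx), as mentioned above:

```python
# Hypothetical endpoints: one server per use case, plus a general
# meta-search fallback for everything not yet covered.
ENGINES = {
    "news": "https://news.example.org/search",
    "music": "https://music.example.org/search",
}
FALLBACK = "https://searx.example.org/search"

def engine_for(use_case):
    """One server per query: each search remains a single request."""
    return ENGINES.get(use_case, FALLBACK)

print(engine_for("news"))    # dedicated news index
print(engine_for("images"))  # proxied via the meta-search fallback
```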

Google obviously has an insurmountable advantage when it comes to scale. One shouldn't even try to compete on scale, and should instead focus on doing a good job serving different use cases and niches. However, Google has one *disadvantage* that we should try to exploit:

Google obviously tries to provide useful, relevant, and reliable results, and it sometimes does, but its other interests, including its business model, will always come first. For example, if I search "best way to learn spanish", every single result on the first page is trying to sell me something, because Google is advertiser-centric. What I would prefer to see is a layman-friendly summary of some academic research into language acquisition, but between advertisers paying Google for placement in search results, and companies hiring SEO firms to game the system, anything potentially useful is buried in traditional ads or ads disguised as blog posts. A non-advertiser-centric search engine would not need to have paid results, and if we have smaller, use-case-specific search engines rather than "everything search" then we can index what's known to be often useful rather than index everything, making it hard to game the system with SEO.

lutes
Offline
Joined: 09/04/2020

Now I'll need to ruminate on this; your suggestions have striking similarities with some material I had been trying to put together after I quit participating in a decentralized search engine project (which eventually morphed into a sort of indexing engine used for SEO consulting). I could not find anyone interested in the topic at the time, so I let it sleep. That must have been around 2008-2010.

Unfortunately, I cannot put my hands on any of the notes I might have written down at the time. I only found a 2015 .odt file named "Priorities", containing the following lines:

"1. focused, learning search engine
2. resource-sparing technologies
3. secure, serverless (decentralized) communications"

That's not much.

andyprough
Offline
Joined: 02/12/2015

> For example, if I search "best way to learn spanish", every single result on the first page is trying to sell me something, because Google is advertiser-centric. What I would prefer to see is a layman-friendly summary of some academic research into language acquisition, but between advertisers paying Google for placement in search results, and companies hiring SEO firms to game the system, anything potentially useful is buried in traditional ads or ads disguised as blog posts.

When I was taking Hebrew last year at the university, I ran into the same problem. Searching invidio.us for "Hebrew alphabet" or "Hebrew nouns" pulled up mostly videos advertising paid services for the first 10-20 hits, because invidio.us gets its search results from YouTube, of course. I've gotten so used to skipping past advertiser search hits that it's second nature to me now; it's like I don't even see them.

lutes
Offline
Joined: 09/04/2020

Overall, Google search results seem to reflect what people are looking for, which would be fine if it were happening at the individual level, although even then there would arguably be some unwelcome belief-reinforcement effect. At the aggregate level, it has the perverse effect of inflating both commercial and political propaganda. Note that this is perverse enough that the two will normally not get mixed: the algorithm is able to determine whether the user is looking to buy products or to buy into political crap. Users who want both will surely get both.

In an ecosystem based on resource sharing and focused on education and knowledge, I would bet that the very same algorithms would give dramatically more socially satisfying results. Similarly, the prejudice evidenced by the search results for "black on white crime", for instance, is not fabricated by the algorithm; it stems from the prejudice of the very users searching for that phrase. This is a typical case of belief reinforcement, very similar to what most media are deliberately feeding.

I definitely need to read that book. I just discovered that my library card has expired, but I will manage to put my hands on it no matter what. I do not find the Wikipedia summary completely convincing, but that might be because complex arguments cannot always be summed up, so it is in fact making me want to read the full text all the more.

--

Some thoughts about separable search engines:

- general public scientific publication digests (indexing abstracts?) would greatly help the public get a first-hand idea of what science is actually saying, and possibly also help professionals keep up.

- figures and statistics: the current trend of reducing funding for official statistics in many places (see above) could be mitigated by diversifying and/or aggregating sources. How to search these I am not sure; I usually need to manually browse a few institutional websites before I can locate the graph I am looking for. Sometimes I find the correct data buried inside commercial blurb that completely misinterprets it before I can locate the original source.

- referencing Wikipedia: how to mitigate the inherently problematic content of hot-topic articles? Only reference articles with fewer than a given number of edits in a given time span? Wikipedia should not be used for news; its model does not seem fit for that purpose.

- searching news. Now that's the hot potato. I do not find any current source of "news" reasonably reliable (from which the problem of basing news search on Wikipedia derives, I guess). I have found that I can usually get a reasonably accurate picture (based on what happens next and on what is eventually published on the topic in the aftermath) by crossing multiple polarized sources, but this is time-consuming and even then sometimes gives a totally unreadable output. In general, I find in-depth analysis sources to be much more reliable than instant news sources. There must be accurate, non-partisan (or partisan, but explicitly enough to let the reader apply the correct filter) sources for "raw" news, but I have yet to find them. Also, they might only be available in languages other than English, depending on the topic.

I am trying to figure out what I meant by a "learning" search engine. It was obviously not about sneakily learning anything about the user, but about improving the quality of the results; the question is on what criteria. Maybe the idea was that the user could rate the quality of the results locally, and a client-based algorithm would then learn from that.
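
If that was the idea, the client-side learner would not need to be complicated. Here is a minimal sketch of what I may have had in mind, with all names invented: ratings never leave the machine and only nudge per-source weights used for local re-ranking.

```python
from collections import defaultdict

class LocalReranker:
    """Re-rank results using per-source weights learned locally."""

    def __init__(self, learning_rate=0.1):
        self.weights = defaultdict(float)  # source -> learned preference
        self.lr = learning_rate

    def rate(self, source, rating):
        """Record a user rating in [-1.0, 1.0]; it stays on this machine."""
        self.weights[source] += self.lr * (rating - self.weights[source])

    def rerank(self, results):
        """Sort (source, score) pairs by score plus learned preference."""
        return sorted(results,
                      key=lambda r: r[1] + self.weights[r[0]],
                      reverse=True)
```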

[to be continued, weather permitting]

chaosmonk
Offline
Joined: 07/07/2017

> Overall, Google search results seem to reflect what people are looking for, which would be fine if it were happening at the individual level, although even then there would arguably be some unwelcome belief-reinforcement effect.

That's how people often assume it works, but it's not quite that simple. There are multiple factors involved, and what people search for is not even the most important one. I think you'll like the book.

> general public scientific publication digests (indexing abstracts?)

That's not a bad idea. Academic articles are often behind a paywall, but abstracts are usually freely accessible. I wonder if there's already a search engine devoted to open access journals.

> figures and statistics

A "data search" for finding credible data and statistics could be neat.

> referencing Wikipedia: how to mitigate the inherently problematic content of hot-topic articles? Only reference articles with fewer than a given number of edits in a given time span? Wikipedia should not be used for news; its model does not seem fit for that purpose.

I was suggesting Wikipedia and Wikidata more for searching for general overviews of topics, rather than news.

It may be possible to index Wikipedia's version control system, rather than the web frontend, and ignore revisions less than a certain age (determined by how quickly vandalism tends to get corrected, which differs between articles depending on how popular and controversial they are) for the purpose of choosing which cited sources to index.
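
A sketch of that against the MediaWiki revisions API; the seven-day window is a placeholder for whatever the vandalism-correction data would suggest for a given article:

```python
from datetime import datetime, timedelta, timezone
import requests

API = "https://en.wikipedia.org/w/api.php"

def stable_revision(title, min_age=timedelta(days=7)):
    """Return the id of the newest revision at least `min_age` old."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "ids|timestamp",
        "rvlimit": 50,  # newest revisions come first by default
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    cutoff = datetime.now(timezone.utc) - min_age
    for rev in page["revisions"]:
        ts = datetime.fromisoformat(rev["timestamp"].replace("Z", "+00:00"))
        if ts <= cutoff:
            return rev["revid"]
    return None  # too hot: every recent revision is younger than the window
```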

> news. Now that's the hot potato. I do not find any current source of "news" reasonably reliable

I think you can assess sources as generally reliable or unreliable, if you separate reliability from bias and treat biased sources differently. Ideally there'd be a mix of unbiased, reliable sources (e.g. a source like the AP, which is generally accurate and not particularly slanted) and biased, reliable sources (e.g. sources like The Intercept and The Wall Street Journal, which are biased in opposite directions when it comes to what they cover and how they cover it, but the information in both is generally accurate). If you exclude biased sources, you tend to get basic reporting on facts without much depth, analysis, or context; you just don't want all sources to be biased in the same direction. Unreliable sources, whether they are biased (e.g. Infowars) or unbiased (e.g. most tabloids) should probably be avoided, unless one of their articles becomes the subject of discussion by other sources (e.g. if sources are reporting on what Alex Jones said yesterday, an Infowars video may be a relevant primary source).
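
To make that policy concrete, here is a sketch; the sources and bias numbers are placeholder illustrations of the descriptions above, not measurements from any rating system:

```python
# Placeholder metadata: reliability and bias are kept separate.
# Bias runs from -1 (left) to +1 (right); values are illustrative only.
SOURCES = {
    "AP": {"reliable": True, "bias": 0.0},
    "The Intercept": {"reliable": True, "bias": -0.7},
    "The Wall Street Journal": {"reliable": True, "bias": 0.6},
    "Infowars": {"reliable": False, "bias": 0.9},
}

def include(source, is_subject_of_coverage=False):
    """Keep reliable sources of any bias; keep unreliable ones only
    when other sources are reporting on them (as primary sources)."""
    return SOURCES[source]["reliable"] or is_subject_of_coverage

def healthy_mix(selected):
    """True unless every selected source leans the same way."""
    biases = [SOURCES[s]["bias"] for s in selected]
    return not (all(b > 0 for b in biases) or all(b < 0 for b in biases))
```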

The reason I'm interested to see what happens if you index sources cited by Wikipedia has to do with something I read in Algorithms of Oppression. Google ranks pages by how many other pages link to them, an idea drawn from citation analysis. Noble writes of citation analysis,

"In the process of citing work in a publication, all citations are given equal weight in the bibliography, although their relative importance to the development of thought may not be equal at all. Additionally, no relative weight is given to whether a reference is validated, rejected, employed, or engaged--complicating the ability to know what a citation actually *means* in a document. Authors who have become so mainstream as not to be cited, such as not attributing modern discussions of class or power dynamics to Karl Marx or the notion of "the Individual" to the scholar of the Italian Renaissance Jacob Burckhardt, mean that these intellectual contributions may undergird the framework of an argument but move though works without being cited any longer. Concepts that may widely be understood and accepted ways of knowing are rarely cited in mainstream scholarship, an important dynamic that Linda Smith... argues is part of the flawed system of citation analysis that deserves greater attention if bibliometrics are to serve as a legitimating force for valuing knowledge production."

In the context of backlink-based webpage rankings, the fact that a page is linked to does not necessarily mean that the page is a good source of information (a source may link to an article by another source and then go on to thoroughly debunk it), and another page may be highly relevant but not linked to because the connection is obvious to a human reader.
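
For reference, the backlink idea being critiqued fits in a few lines. Here is a simplified PageRank, in which every link transfers equal weight whether it endorses or debunks its target:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a dict of page -> outgoing links.

    Every page must appear as a key (use an empty list for no
    outlinks). Every link carries equal weight: a link that debunks
    its target counts exactly as much as one that endorses it.
    """
    pages = set(links) | {t for ts in links.values() for t in ts}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if targets:
                for t in targets:
                    new[t] += damping * rank[page] / len(targets)
            else:  # dangling page: spread its rank evenly
                for t in pages:
                    new[t] += damping * rank[page] / len(pages)
        rank = new
    return rank

# "c" is linked to once and links to nothing; "a" and "b" link to each other.
print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))
```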

Wikipedia is an interesting case, in that its policy is to avoid primary sources and stick to reliable secondary sources. The lack of primary sources means that it is less likely to link to a controversial article and more likely to link to other sources discussing the merits of that article. A reader may then discover the original article via those sources, but it will have been placed in context rather than appearing first in search results. Also, if a topic is relevant to another topic, its Wikipedia page is likely to be linked from the other topic's page, which might make it possible to catch relevant sources that are not mentioned in the other article. For this reason, I suspect that backlinks in Wikipedia citations are on average more meaningful than an arbitrary backlink on the web. There may be other sites or kinds of sites for which this could be the case. Indexing backlinks from a small set of such sites would result in a far smaller index than Google's, but one which may be quite useful for some kinds of searches.

Of course there's the non-trivial task of determining how reliable and how biased a source is. Wikipedia's approach is to do this through crowdsourced debate and consensus, which can lead to bias when the demographics of volunteers skew in one direction[1]. There are also some organizations, like [2], which claim to evaluate reliability and bias systematically, though I haven't thoroughly looked into or thought very hard about their methodology yet.

[1] https://en.wikipedia.org/wiki/Gender_bias_on_Wikipedia

[2] https://www.adfontesmedia.com/

I am about to begin winding down my forum activity, followed by a temporary break, as it's becoming too much of a no-longer-very-rewarding distraction at a time when I need to focus on wrapping up my PhD. I'll check back here, though, in case you have any other thoughts or ideas to share on this topic. If you get around to reading AoO, feel free to shoot me a message[3] letting me know what you think of it.

[3] https://trisquel.info/en/users/chaosmonk/contact

lutes
Offline
Joined: 09/04/2020

> I need to focus on wrapping up my PhD.

I am sure you will wrap it up masterfully. I also have no doubt that requires focus.

I am thankful for your help, your detailed and illuminating comments and the constructive discussions we managed to have here in spite of the inherent limitations of forums - and in spite of my coarse-grained and highly idiosyncratic English. Until our electronic paths meet again, I guess.

(Of course, I have also read the remainder of your post; thanks for that too.)

lutes
Offline
Joined: 09/04/2020

The video on the importance of official statistics led me to watch this one about hopes and fears from a semi-retired digital archivist:

https://invidious.tube/watch?v=gErZHDVP-Mk

After which I felt I had to check the kind of music he was referring to, and eventually found this introduction to the baroque theorbo:

https://invidious.tube/watch?v=eVabz8LneI4

lanun
Offline
Joined: 04/01/2021

I just found this on the topic of search engines:

https://drewdevault.com/2020/11/17/Better-than-DuckDuckGo.html

Interestingly, it was also posted on Gemini:

https://proxy.vulpes.one/gemini/drewdevault.com/2020/11/17/Better-than-DuckDuckGo.gmi