Lmst

#HackerNewsAnalytics

@PixelJones What I'd really like to do is to get away from the per-item / per-use payment model, and instead think of the infostream as a distribution utility for which access rather than use is the principle consideration, and in which payment ability (wealth & income) rather than content value is the basis on which payments are made.

The questions of both what content is made available and how that content is compensated I'm leaving somewhat vague, though in general we have systems which work for this, and which have worked for nearly a century now based on broadcast & cable media, audit-based measurement (Nielson, Aribitron, etc.), distributor-based negotiations (with individual broadcast stations or networks), and something closely approaching a common-carrier model for the actual access providers (that is, ISPs).

The points @dangillmor raised are valid: a gatekeeper monopoly is a critical hazard, and is worth addressing from a competitiveness standpoint, independent of this proposal.

Why "all you can eat"? Two principle reasons:

1: Need for information is strongly independent of capacity to pay, and often inversely associated.

2: There are entirely novel capabilities afforded by access at scale which a usage-based payment model largely forecloses on. Aaron Swartz's work which lead to his prosecution and suicide based on wholesale downloading of JSTOR scientific papers is a key case in point. It's possible to look through, over, and among a corpus to find relationships not otherwise manifest. (I'm doing something along these lines with my #HackerNewsAnalytics series posted here on the Fediverse.)

The notion of an individual or household account, associated with personal mobile devices and/or household Internet service, from which pro-rata payments are then allocated amongst various providers is one option for compensation, though even that might well not be ideal. That imposes a huge surveillance component itself (who is reading, listening to, or watching what), and could well disproportionately benefit or starve less substantial or more substantial works. More critically on that last: works which are far more expensive to produce at quality, such as investigative journalism or scientific research.

Some sense of local, regional, national, and global providers / publishers, within genres, funded with a specific budget and for a minimum guaranteed time period, would provide the institutional stability to provide certain classes of work: news, education, business and government publications, academic research, and of course, entertainment.

And, again, multiple revenue streams, including premium subscriptions, patronage, advertising, etc., could well be additional components. But an access-based automatic and universally-billed tier really does seem to be a possibility that's rarely mentioned or advocated.

@cobalt

#UniversalContentSyndication #PayingForJournalism

On general discussion forums and "paying for media"

One frequent dispute online is over paywalled links, and the general advisability on various grounds of sharing workarounds. I happen to have data for Hacker News (HN), so that's what I discuss here.

As I'm sitting on a trove of ~190k front page stories and the sites linked by them, I can bring some insight to this debate. As of 21 June 2023, there were 52,642 distinct sites which have made just the front page (30 items/day). That's roughly 3% of all submitted posts, which would be a rather larger site tally.

How many of those 52,642 sites should HN members subscribe to?

If we restrict that to only the sites with 100+ front-page submissions, that number falls to 149. Still, arguably, excessive.

Of the sites I've identified as "general news" (all sites w/ >= 17 appearances, plus a few others), that list is 146.

Those constitute 8.47% of all HN front-page posts, the second-largest overall category following blogs.

I would suggest that expecting the 600k+ active HN participants, let alone the 5 million or so total monthly users, to individually subscribe to more than a very small handful of such sites is entirely unrealistic.

Subscriptions are a concept which worked reasonably well for local newspapers serving limited areas for which some fraction of households might subscribe to one, and far fewer multiple dailies. The majority of expenses were covered by advertising, however.

Whatever business model people are going to suggest for online media, it's going to have to address the fact that individual people cannot and will not register many thousands, or even dozens, of subscriptions.

(Adapted from an earlier HN comment: https://news.ycombinator.com/item?id=36832354)

Edits: Rephrasing.

#HackerNews #HackerNewsAnalytics #Paywalls #Subscriptions #Journalism

HackerNews changed how it dealt with highly-active discussions around January 2009, based on evidence I see (far fewer spicy threads after that date).

I'm also seeing that spicy stories actually tend to rank slightly higher on the page (a lower "storypos", that is, story position, value), which is counter to my expectation. This may of course be due to selection bias --- moderators specifically lift limit on overheated stories, so that those stories that do survive are more appropriate to HN.

I'd like to look at semantic / sentiment elements here as well, words or phrases which seem more prevalent on high-ratio stories. Here my analytic methods work against me as the HN title of a post is often quite short and not especially descriptive, though with some examples (as with the mental health study mentioned earlier).

#HackerNews #HackerNewsAnalytics

Hacker News "Ratio": political commentary sites

Continuing my look at the comments/votes ratio, a look at sites which tend to focus on political commentary and their "spiciness". These tend to be well above mean (0.63), median (0.52), and tend to be a standard deviation or more from the mean (1 sd: 0.78, 2 sd: 0.92, 3 sd: 1.06).

Stories Vote    Comm   Ratio  Site         
   2      18      57   3.167  heritage.org
   4     143     224   1.566  hoover.org
   9     473     603   1.275  breitbart.com
   8    1724    1873   1.086  cityobservatory.org
   9     364     379   1.041  mises.org
   1      56      55   0.982  adamsmith.org
   7    2488    2372   0.953  city-journal.org
   1      92      85   0.924  manhattan-institute.org
  70   13143   11614   0.884  reason.com
   5     854     722   0.845  jacobinmag.com
   1     204     153   0.750  theblaze.com
  13    1607    1202   0.748  bostonreview.net
   5    1682    1252   0.744  tribunemag.co.uk
   4     629     465   0.739  nationaljournal.com
   5    1907    1400   0.734  americanaffairsjournal.or
  12    2164    1584   0.732  alternet.org
  10    1302     871   0.669  cato.org
   5     738     493   0.668  dailycaller.com
   9    1387     844   0.609  dailykos.com
   5     759     450   0.593  rawstory.com
  10    2538    1455   0.573  rootsofprogress.org
   2     552     275   0.498  theroot.com
  30    7881    3850   0.489  rt.com
   2    1256     467   0.372  wsws.org

Note that general news tends somewhat toward spicy, though not as much as the explicitly political sites. Of the 147 sites I'd identified as "general news", ratio statistics are:

n: 147, sum: 94.415, min: 0.092, max: comms,, mean: 0.642279, median: 0.605, sd: 0.433165

%-ile:

5: 0.234, 10: 0.341, 15: 0.4515,
20: 0.491, 25: 0.51, 30: 0.5305,
35: 0.5415, 40: 0.566, 45: 0.581,
55: 0.614, 60: 0.6285, 65: 0.654,
70: 0.68, 75: 0.716, 80: 0.734,
85: 0.7875, 90: 0.8715, 95: 1.1925

(As with other toots in this series, Markdown formatting is used, toot.cat may be better than your own instance's presentation.)

#HackerNews #HackerNewsAnalytics

The 20 "spiciest" sites seem to be (using a cut-off of 20+ stories):

apnews.com                     36      14674      17512     1.193
sfchronicle.com                25       5771       6174     1.070
variety.com                    24       5479       4992     0.911
mattmaroon.com                 73       3332       3023     0.907
axios.com                      92      38075      34150     0.897
bizjournals.com                20       2183       1959     0.897
cnbc.com                      174      59983      53056     0.885
apple.com                     241      99945      88396     0.884
reason.com                     70      13143      11614     0.884
nypost.com                     28       5851       5088     0.870
markevanstech.com              22        290        251     0.866
macrumors.com                  62      18700      16162     0.864
nikkei.com                     56      17568      15174     0.864
economist.com                 829     119205     102702     0.862
thewalrus.ca                   30       6194       5199     0.839
techradar.com                  30       7227       6053     0.838
backreaction.blogspot.com      33       7209       5968     0.828
strongtowns.org                27       8279       6857     0.828
mondaynote.com                 45       7581       6268     0.827
coindesk.com                   22      10236       8355     0.816

And the 20 least spicy sites are:

particletree.com               37        997        227     0.228
brendangregg.com               40      11135       2512     0.226
intruders.tv                   28        324         73     0.225
aphyr.com                      34       8514       1910     0.224
andrewchen.typepad.com         51        757        168     0.222
michaelnielsen.org             31       3335        723     0.217
igvita.com                     38       3626        767     0.212
startuplessonslearned.blo      24       1101        232     0.211
citusdata.com                  51       8361       1717     0.205
ferd.ca                        21       5883       1132     0.192
ocks.org                       27       6036       1120     0.186
tensorflow.org                 22       5612       1020     0.182
aosabook.org                   21       3899        669     0.172
ocw.mit.edu                    41       8793       1500     0.171
david.weebly.com               20       1364        226     0.166
jslogan.com                    24         97         16     0.165
burningdoor.com                23        149         23     0.154
linusakesson.net               26       4531        684     0.151
github.com/0xax                22       2168        121     0.056

#HackerNews #HackerNewsAnalytics

The Hacker News Ratio

One concept Hacker News uses to moderate discussions is a "flamewar detector", which based on moderator comments over the years is triggered when a discussion has > 40 comments AND there are more comments than votes on the article.

That had long struck me as questionable, but it's now something I can look at and ... it seems reasonably accurate. I've calculated ratios of all 178,882 HN Front Page stories (as of 2023-6-31), and ... do I have some ratios.

Basic stats:
n: 178882, sum: 89796.9, min: 0.00, max: 21.00, mean: 0.501990, median: 0.4, sd: 0.432899

Percentiles:
%-ile: 5: 0.08, 10: 0.13, 15: 0.17, 20: 0.21, 25: 0.24, 30: 0.27, 35: 0.3, 40: 0.33, 45: 0.37, 55: 0.44, 60: 0.48, 65: 0.53, 70: 0.58, 75: 0.64, 80: 0.72, 85: 0.82, 90: 0.96, 95: 1.22

Because of how I've parsed and processed data, it's not entirely straightforward to pull up the specific posts, though I can find those by the date and story position (ranked 1--30 on the page).

And ... yeah, the stories that tend to rate high based on this metric do tend to be sort of flamey.

The most ratioed post of all time was "juwo beta is released (at last!) Please use it and help improve it!", from 18 April 2007, at 21.0:

https://news.ycombinator.com/item?id=14253

Sometime around 2009--2010 the flamewar detector seems to have been implemented and ratios tend to be much lower, though there are still some pretty spicy discussions. One from the National Institutes on Health titled "Mental illness, mass shooting,s and the politics of American firearms", posted on 26 May 2022 (for a story originally dating from 2015) is the highest-ratioed post after the flamewar detector came into use, at 5.99:

https://news.ycombinator.com/item?id=31511274

I find it interesting how being able to query my archive affords insights on HN which aren't available through the standard search tools. It's possible to look for specific keywords, or submissions or comments from a specific account, but searching for contentious posts isn't really A Thing.

I'm doing some further digging to see what patterns might emerge by site, though finding a good minimum number of front-page appearances is one question I'm looking at.

#HackerNews #HackerNewsAnalytics

More on "UNCLASSIFIED": there are 36,520 of those sites right now. (Despite knowing better I keep diving in and classifying more of them.)

It's not practical to list all of them. But we can randomly sample. And large-sample statistics start to apply at about n=30, so let's just grab 30 of those sites at random using sort -R | head -30:

   1  sfg.io
   1  extroverteddeveloper.com
   2  letmego.com
   1  thestrad.com
   2  bombmagazine.org
   1  domlaut.com
   1  bootstrap.io
   1  jumpdriveair.com
   2  desmos.com
   1  leo32345.com
   1  echopen.org
   1  schd.ws
   1  web3us.com
   7  akkartik.name
   1  bcardarella.com
   1  cancerletter.com
   1  platinumgames.com
   1  industrytap.com
   2  worldoftea.org
   1  motion.ai
   1  vectorly.io
   2  enterprise.google.com
   1  lift-heavy.com
   1  davidpeter.me
   1  panoye.com
   3  thestrategybridge.org
   2  fontsquirrel.com
   1  kettunen.io
   1  moogfoundation.org
   2  elekslabs.com

That's a few foundations, a few blogs, a corporate site (enterprise.google.com), and something about tea, all with a small number of posts (1--7).

I'm looking at some slightly larger samples (60--100) here on my own system, and can actually make some comparisons across samples (to see how much variance there is) which can give some more information on tuning what I would expect to find under the "UNCLASSIFIED" sites.

Which is one way of using #StatisticalMethods to make estimates where direct measurement or assessment is impractical.

#HackerNewsAnalytics #HackerNews #MediaAnalysis #RandomSampling #Statistics

So ... I'm starting to get the reporting by site classification across years down and ... it is interesting.

Preliminary and buggy code yet. Also this is highly dependent on how I've actually classified sites.

I've got a few classifications I'd wanted to keep an eye on:

Programming-specific sites. A lot of this is github and gitlab, basically, software projects with code. I'm distinguishing software (which is mostly about use) and programming which involves, or at least anticipates, actual development.
"Political commentary". I used this as a description for ... highly political sites (spot-checking to see what stories actually hit the front page, though I should be more robust in that). The list: reason.com, rt.com, bostonreview.net, alternet.org, cato.org, rootsofprogress.org, breitbart.com, dailykos.com, mises.org, dailycaller.com, jacobinmag.com, rawstory.com, tribunemag.co.uk, hoover.org, heritage.org, theroot.com, wsws.org, adamsmith.org, manhattan-institute.org, theblaze.com.

And there's "academic / science" which is mostly university and academic press / journal sites.

Anywho....

... at least from initial takes, the trending on these does not suggest a trending toward sensationalistic topics and/or sites, but the opposite. Much more programming FP stories in recent years, fewer political commentary, and more academic/science items.

Presuming this holds up as I code further.

This is one of the fun things about data analysis: stuff jumps out at you, sometimes confirming hunches, but often radically violating preconceptions.

I want to look more closely at what happens in the lead-up and follow-on to the 2016 US elections cycle in particular....

Hrm. What does spike is cryptocurrency-specific sites in 2014. Though that falls off again. (I suspect as that discussion enters more mainstream sources.)

And "general info" and "general interest" sites seem to rise in recent years.

#HackerNewsAnalytics #HackerNews #MediaAnalysis

OK, current stats are 63.5% of posts classified, with 29.8% of sites classified, a/k/a the old 65/30 rule. The mean posts per unclassified site is 1.765, so my returns for further classification will be ... small.

Full breakdown:

A ... large number of sites w/ <= 20 posts are actually classified, mostly by regexp rules & patterns. Oh, hey, I can dump that breakdown as well:

I could pick just under 4% more posts by classifying another 564 sites but ... that sounds a bit too much like work at the moment. Compromises and trade-offs.

Now to try to turn this into an analysis over time.

I've been working with a summary of activity by site, so running analysis has been pretty quick (52k records and gawk running over that).

To do full date analysis requires reading nearly 180k records, and ... hopefully not having to loop through 52k sites for each of those. Gawk's runtimes start to asplode when running tens of millions of loop iterations, especially if regexes are involved.

#HackerNewsAnalytics #HackerNews #gawk #awk #DataAnalysis #MediaAnalysis

Oh, and something that would be really useful would be a quick way of looking up a website and getting a rough classification as to what type of content it presents.

Wikipedia can offer some of this, occasionally sources such as Crunchbase, though the first is hard to parse.

The Alexa Crawl (Amazon, originally by Brewster Kahle of the Internet Archive) used to offer this as well, though I think that's no longer active.

If anyone knows of other / better sources, I'd love to know.

#DearMastomind #DearHivemind #HackerNewsAnalytics

I've got this to about 60% of posts classified (by submitted site). I can continue winnowing this down, though there's obviously diminishing returns.

I've also revised my analysis code so that anything that's not classified defaults to "UNCLASSIFIED", without having to explicitly code that in the sites file.

I'm thinking of how I might crossref / correlate the site-based findings with title-based analysis. I'm also thinking of looking at average comments / votes by classification, as well as looking at the ratio of comments to votes (HN uses this as a very rough "flamewar" heuristic, though on somewhat shaky grounds IMO).

My sense is that many of the less-frequently-posted sites will turn out to be blogs of some form. I'm thinking of how I might assess this without having to key all of them.

<stage_whisper> random sampling <\stage_whisper>

One issue issue for less-frequently-occuring sites is that it's easy to code those which match a pattern (twitter, blogspot, livejournal, medium, substack, etc.) than those which are idiosyncratic. Note that a lot of Medium blogs don't appear on Medium domains, as well.

#HackerNewsAnalytics #HackerNews #MediaAnalyhsis

I'm continuing to play with this, and have classified a whole mess more sites (reminder to self: update that count) (response to self: 13,150 sites classified).

So that's about 25% of all sites that are classified. Looking by story count ... it's about 55% of all FP stories. (Power laws are your friend here...)

Looking at my current breakdowns (and again, this is all VERY ROUGH):

     1   15770  8.82%  blog
     2   15034  8.40%  general news
     3   13899  7.77%  software
     4   12889  7.21%  tech news
     5    7960  4.45%  academic / science
     6    7294  4.08%  n/a
     7    6025  3.37%  corporate comm.
     8    4859  2.72%  business news
     9    2120  1.19%  social media
    10    2031  1.14%  general interest
    11    1557  0.87%  general magazine
    12    1397  0.78%  general information
    13    1239  0.69%  technology
    14    1099  0.61%  videos
    15     975  0.55%  government
    16     607  0.34%  ???
    17     559  0.31%  tech discussion
    18     505  0.28%  tech law
    19     497  0.28%  misc documents
    20     420  0.23%  science news
    21     316  0.18%  mailing list
    22     251  0.14%  tech publications
    23     171  0.10%  tech blog
    24     149  0.08%  literature
    25     136  0.08%  business education
    26     133  0.07%  cryptocurrency
    27     126  0.07%  law
    28     118  0.07%  webcomic
    29     109  0.06%  entertainment news
    30     103  0.06%  health news
    31     103  0.06%  video
    32      96  0.05%  general discussion
    33      80  0.04%  misc
    34      71  0.04%  technology / security
    35      49  0.03%  translation
    36      47  0.03%  images
    37      46  0.03%  podcast
    38      42  0.02%  journalism
    39      30  0.02%  propaganda
    40      29  0.02%  healthcare / medicine
    41      18  0.01%  medicine
    42       7  0.00%  legal news

Classified:    98966
Unclassified:  79916
Total:        178882
Ratio:             0.553

My classifications are rough and I may revisit these. "blog" covers a lot of sins, though most are tech blogs (which makes "technology blog" redundant).

What I'd really like to do is to look at how trends vary over the years. Perhaps also by day of week / month of year. Finally answer that age-old question of whether HN is turning into Reddit....

As noted above, this is based on classifying the site rather than interpreting the title or reading the source article, so it's all a bit wobbly.

(This post formats better on toot.cat or on sites that render Markdown.)

#HackerNewsAnalytics #HackerNews #MediaAnalysis

gagejustins's HN analysis has inspired me to take a crack at typifying Hacker News front page stories by type.

Whilst he'd manually assessed each front-page story, I'm classifying by site, so that an NY Times article on, say, quantum computing would still be described as "general news".

I've classified 10,200 of 52,642 domains, the first 300 or so manually, much of the rest using regexes and imputation (e.g., ".edu", ".gov", and sites on Blogspot, Substack, Medium, etc.).

Results by story count:

     1  13782  general news
     2  13398  software
     3  10473  tech news
     4   8677  blog
     5   7651  academic / science
     6   7294  n/a
     7   4750  ???
     8   4600  business news
     9   3546  corporate comm.
    10   1504  general magazine
    11   1291  general information
    12   1162  general interest
    13   1132  technology
    14   1099  videos
    15   1073  social media
    16    975  government
    17    568  corporate comm
    18    559  tech discussion
    19    505  tech law
    20    251  tech publications
    21    171  tech blog
    22    170  science news
    23    136  business education
    24    104  corporate comm. 
    25    103  video
    26     99  corporate commm.
    27     96  general discussion
    28     80  misc
    29     71  technology / security
    30     61  law 
    31     59  webcomic
    32     49  translation
    33     48  health news
    34     47  images
    35     46  podcast
    36     32  law
    37      7  legal news

  Unclassified: 93213

"n/a" indicates no site, e.g., an Ask, Tell, or Show HN post.

'???' indicates I couldn't (quickly) assess a domain.  Examples:  37signals.com, readwriteweb.com, thenextweb.com, archive.org, anandtech.com, avc.com, docs.google.com, righto.com, slideshare.net, infoq.com, hackaday.com, gamasutra.com, marco.org, smashingmagazine.com, highscalability.com, catonmat.net, centernetworks.com, jvns.ca, scribd.com, about.gitlab.com, cloud.google.com, alleyinsider.com, msn.com, firstround.com, axios.com, openculture.com, onstartups.com, ejohn.org, dadgum.com, shkspr.mobi, mixergy.com, geek.com, gmane.org, foundread.com.

"cproorate commm." is an obvious typo.  This is very rough code & classification.

#HackerNewsAnalytics #MediaAnalysis #HackerNews

I have Found My People: "What gets to the front page of Hacker News? A data project"

Some marketing dude is also looking at the HN front page. We're comparing notes ...

https://randomshit.dev/posts/what-gets-to-the-front-page-of-hacker-news

https://news.ycombinator.com/item?id=36521887

#HackerNewsAnalytics

With my HN FP archive updated through yesterday, as one does, updated occurrences of "Reddit" in front-page story titles:

And what's the occurrence by month in 2023, you ask? Why, I'll tell you:

And those 22 stories in the first half of June are ... not positive:

Teddit – An alternative Reddit front-end focused on privacy
[dupe] Third-party Reddit apps are being crushed by price increases
Demo: Fully P2P and open source Reddit alternative
Reddit’s plan to kill third-party apps sparks widespread protests
Reddit's Recently Announced API Changes, and the future of /r/blind
Redditor creates working anime QR codes using Stable Diffusion
ArchiveTeam has saved over 11.2B Reddit links
Archive your Reddit data before it's too late
Reddit Strike Has Started
Thousands of subreddits pledge to go dark after the Reddit CEO’s recent remarks
Show HN: Non.io, a Reddit-like platform Ive been working on for the last 4 years
Did Reddit just destroy mobile browser access?
Reddit.com appears to be having an outage
Show HN: Zsync, a Reddit Alternative with the Goal to Reward Quality Comments
Apollo’s Christian Selig explains his fight with Reddit – and why users revolted
The Reddit blackout will continue
The Reddit blackout has left Google barren and full of holes
Reddit’s blackout protest is set to continue indefinitely
Reddit Threatens to Remove Moderators from Subreddits Continuing Blackouts
Reddit is removing moderators that protest by taking their communities private
Louis Rossmann calls community to leave Reddit
Reddit App – Suspicious high number of recent 5 star, one word reviews

#HackerNews #HackerNewsAnalytics #Reddit #RedditStrike #RedditBlackout

Given the #RedditStrike / #RedditBlackout, question popped up on Hacker News as to whether or not stories critical of Reddit were being overwhelmingly flagged.

So I updated my Front Page archive through 2023-06-13, and looked at the numbers.

There've been 16 front-page stories since 31 May 2023 when the first story on API pricing broke.

That compares against total mentions of Reddit since 2007:

Note that we're only 45% of the way through 2023, so at the rate of stories-to-date for the year (and ignoring the blow-up in the past two weeks which itself is well-above trend), 2023 is on track for 46 FP stories, which ties the high-water mark set in 2012.

#HackerNews #HackerNewsAnalytics #MediaAnalysis

So ... I'm playing with a report showing how often F500 companies are mentioned in HN submission titles.

As I've noted, most of my scripting is in awk (gawk), and it's ... usually pretty good.

I'm toying with a couple of loops where I read all 178k titles, and all 500 company names, into arrays, then check to see if the one appears in the other.

The first iteration of that was based on the index() function, which is a simple string match. Problem is that there are substring matches, for example "Lear" (the company) will match on "Learn", "Learning", etc., and so is strongly overrepresented.

So I swapped in match(), which is a regular-expression match, and added \W as word-boundaries.

The index-based search ran in about 20 seconds. That's a brief wait, but doable.

The match (regex) based search ... just finished as I'm writing this. 13 minutes 40 seconds.

Regexes are useful, but can be awfully slow.

Which means that my first go at this --- still using gawk but having it generate grep searches and printing the match count only ... is much faster whilst being accurate. That runs in just under a minute here. I'd looked for another solution as awk is "dumb" re the actually output: it doesn't read or capture the actual counts, so I'll either have to tweak that program or feed its output to an additional parser. Neither of which is a big deal, mind.

Oh, and Apple seems to be the most-mentioned company, though the F500 list omits Google (or YouTube, or Android), listing only Alphabet, which probably results in a severe undercount.

Top 10 using the F100 list:

     1  Apple:  2447
     2  Microsoft:  1517
     3  Amazon:  1457
     4  Intel:  554
     5  Tesla:  404
     6  Netflix:  322
     7  IBM:  309
     8  Adobe:  180
     9  Oracle:  167
    10  AT&T:  143

Add to those:

$ egrep -wc '(Google|Alphabet|You[Tt]ube|Android)' hn-titles
7163
egrep -wc '(Apple|iPhone|iPad|iPod|Mac[Bb]ook)' hn-titles
3656
 egrep -wc '(Facebook|Instagram)' hn-titles
2512

Note I didn't even try "Meta", though let's take a quick look ... yeah, that's a mess.

Up until 2021-10-28, "Meta" is a concept, with 33 entries. That was the day Facebook announced its name change. 82 total matches (so low overall compared to the earlier numbers above), 49 post-announcement, of which two are not related to Facebook a/k/a Meta. Several of the titles mention both FB & Meta ... looks like that's four of 'em.

So "Meta" boosts FB's count by 45.

There are another 296 mentions of Steve Jobs and Tim Cook which don't also include "Apple".

And "Alphabet" has 54 matches, six of which don't relate to the company.

Of the MFAANG companies:

Google: 5796
Apple: 2447
Facebook: 2371
Microsoft: 1517
Amazon: 1457
Netflix: 322

(Based on grep.)

#DataAnalysis #awk #grep #bash #HackerNewsAnalytics

In fact-checking my own comment, I found that my success rate in reaching the HN front page is not the roughly 10% I'd thought.

It's pretty much spang on 3%, which is the overall site average.

That's based on my archive's count of my own FP submissions (60) and Algolia search's results for all my article submission, whether or not they hit the front page (1,974).

So I guess I'm just about average.

This gives me the idea of checking against the HN Leaders list to see if anyone's markedly above 3% for FP placements.

#HackerNewsAnalytics #HackerNews

I was able to draw on my HN FP archive to respond in part to concerns over topic suppression by an HN member:

https://news.ycombinator.com/item?id=36191005

This is an interesting superpower ...

Not an awesome, superpower, mind, but an interesting one.

#HackerNews #HackerNewsAnalytics

Hacker News characteristics --- banned sites (2009)

I've been crawling through some of the early discussions about HN's design, intent, and characteristics.

One interesting item is a list of 2,096 banned sites from 2009:

https://news.ycombinator.com/item?id=498910

There's also Paul Graham's "What I've Learned from Hacker News" (2009):
https://www.paulgraham.com/hackernews.html

Edit: *Markdown*

#HackerNews #HackerNewsAnalytics

#HackerNewsAnalytics

Client Info