Identifying the entity type of Wikipedia pages

Aug 18 2020

My question is related to a similar question/comment that unfortunately never received an answer.

Given a list of several Wikipedia pages, e.g.:

  • https://en.wikipedia.org/wiki/Donald_Trump
  • https://en.wikipedia.org/wiki/The_Matrix
  • https://en.wikipedia.org/wiki/Tiger
  • ...

how can I find out which type of entity each article refers to? Ideally, I would like something at a higher level, e.g. person, movie, animal, etc.

My best guess so far is the Wikidata API, using SPARQL to walk back up the instance_of or subclass tree. However, this did not yield meaningful results.

SELECT ?lemma ?item ?itemLabel ?itemDescription ?instance ?instanceLabel ?subclassLabel WHERE {
  VALUES ?lemma {
    "Donald Trump"@en
    "The Matrix"@en
    "Tiger"@en
  }
  ?sitelink schema:about ?item;
    schema:isPartOf <https://en.wikipedia.org/>;
    schema:name ?lemma.
  ?item wdt:P31* ?instance.
  ?item wdt:P279* ?subclass.
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "en,da,sv".}
}

The results can be seen here: https://w.wiki/ZmQ

One option would of course also be to look at the itemDescription, but I'm afraid that this is too granular to build meaningful groups from larger lists and count frequencies later on. Does anyone have a hint/idea on how to get more general entity categories? Maybe also from the mediawiki API?

Any input would be highly appreciated!

Answers

1 MatthiasWinkelmann Aug 18 2020 at 09:18

Here are three possibilities, side-by-side:

SELECT ?lemma ?item
  (GROUP_CONCAT(DISTINCT ?instanceLabel; SEPARATOR = " ") AS ?a)
  (GROUP_CONCAT(DISTINCT ?subclassLabel; SEPARATOR = " ") AS ?b)
  (GROUP_CONCAT(DISTINCT ?isaLabel; SEPARATOR = " ") AS ?c)
WHERE {
  VALUES ?lemma {
    "Donald Trump"@en
    "The Matrix"@en
    "Tiger"@en
  }
  ?sitelink schema:about ?item;
    schema:isPartOf <https://en.wikipedia.org/>;
    schema:name ?lemma.
  OPTIONAL { ?item (wdt:P31/(wdt:P279*)) ?instance. }
  OPTIONAL { ?item wdt:P279 ?subclass. }
  OPTIONAL { ?item wdt:P31 ?isa. }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en,da,sv".
    ?instance rdfs:label ?instanceLabel.
    ?subclass rdfs:label ?subclassLabel.
    ?isa rdfs:label ?isaLabel.
  }
  # Here, you could add, e.g.: FILTER(?instanceLabel IN ("mammal"@en, "movie"@en, "musical"@en)), and so on.
}
GROUP BY ?lemma ?item

Live here.

If you only care about a small set of labels such as "film" and "mammal", i.e. a couple dozen at most, you could list them explicitly in order of preference and then use the first one that occurs for each item.
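The "first preferred label that occurs" idea could be done client-side after fetching the results. Below is a minimal Python sketch, not a definitive implementation. The preference list, the function name `pick_category`, and the example label sets are all made up for illustration. It also assumes you already have each item's labels as a list of strings, which with multi-word labels would mean using a separator other than a space in GROUP_CONCAT.

```python
from typing import Iterable, Optional, Sequence

# Hypothetical preference list, most preferred first; adjust to your own data.
PREFERRED_LABELS: Sequence[str] = ("human", "film", "mammal", "animal", "taxon")


def pick_category(labels: Iterable[str],
                  preferred: Sequence[str] = PREFERRED_LABELS) -> Optional[str]:
    """Return the first entry of `preferred` that occurs in `labels`, else None."""
    # Normalise whitespace and case so " Film " matches "film".
    present = {label.strip().lower() for label in labels}
    for candidate in preferred:
        if candidate.lower() in present:
            return candidate
    return None


if __name__ == "__main__":
    # Made-up label sets for illustration only, not actual query output.
    examples = {
        "The Matrix": ["film", "work", "creative work"],
        "Tiger": ["taxon"],
        "Unknown": ["something else"],
    }
    for lemma, labels in examples.items():
        print(lemma, "->", pick_category(labels))
```

Items for which nothing in the list matches come back as `None`, which makes it easy to spot which labels are still missing from the preference list before counting frequencies.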

Note that you may be running into this bug: https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial#wikibase:Label_and_aggregations_bug