Unicode Over 60 Percent of the Web

Computers store every piece of text using a “character encoding,” which gives a number to each character. For example, the byte 61 (hexadecimal) stands for ‘a’ and 62 stands for ‘b’ in the ASCII encoding, which was launched in 1963. Before the web, computer systems were siloed, and there were hundreds of different encodings. Depending on the encoding, C1 could mean any of ¡, Ё, Ą, Ħ, ‘, ”, or parts of thousands of characters, from æ to 品. If you brought a file from one computer to another, it could come out as gobbledygook.
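To make the ambiguity concrete, here is a minimal sketch that decodes the single byte C1 under a handful of legacy encodings. The codec names are Python standard-library aliases picked for illustration; they are an assumption of this example and not necessarily the encodings behind the characters listed above.

```python
import unicodedata

# One raw byte (hexadecimal C1) with no encoding label attached.
RAW = bytes([0xC1])

# Illustrative legacy codecs from Python's standard library (an assumption
# of this sketch, not a list taken from the article).
ENCODINGS = ["latin-1", "cp1251", "mac_roman", "iso8859_7", "koi8_r"]


def decode_everywhere(raw, encodings):
    """Return {encoding name: decoded text} for the same raw bytes."""
    return {name: raw.decode(name) for name in encodings}


results = decode_everywhere(RAW, ENCODINGS)
for name, text in results.items():
    # Print the code point and its Unicode name so the output stays ASCII-safe.
    print(f"{name:10} -> U+{ord(text):04X} {unicodedata.name(text)}")

# In ASCII, hexadecimal 61 and 62 are 'a' and 'b'.
ascii_text = bytes([0x61, 0x62]).decode("ascii")
print(ascii_text)
```

The same byte comes back as a different character under each codec, which is why a file moved between systems without its encoding label could turn into gobbledygook.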

Unicode was invented to solve that problem: to encode all human languages, from Chinese (中文) to Russian (русский) to Arabic (العربية), and even emoji symbols; it encodes nearly 75,000 Chinese ideographs alone. In the ASCII encoding, there wasn’t even enough room for all the English punctuation (like curly quotes), while Unicode has room for over a million characters. Unicode was first published in 1991, coincidentally the year the World Wide Web debuted. Little did anyone realize at the time that they would be so important for each other. Today, people can easily share documents on the web, no matter what their language.

Every January, we look at the percentage of the webpages in our index that are in different encodings. Here’s what our data looks like with the latest figures*:

*Your mileage may vary: these figures may vary somewhat from what other search engines find. The graph lumps together encodings by script. We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example. Thanks once more to Erik van der Poel for collecting the data.

As you can see, Unicode has experienced an 800 percent increase in “market share” since 2006. Note that we separate out ASCII (16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8). The more documents that are in Unicode, the less likely you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.
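Both points in this paragraph can be sketched in a few lines: how mojibake arises when UTF-8 bytes are read with the wrong decoder, and why pure-ASCII pages can be counted together with UTF-8. The choice of Latin-1 as the "wrong" decoder and the sample strings are assumptions made for illustration.

```python
# Text as an author wrote it, saved as UTF-8.
original = "русский"
utf8_bytes = original.encode("utf-8")

# A reader that wrongly assumes Latin-1 raises no error but yields mojibake.
mojibake = utf8_bytes.decode("latin-1")

# Latin-1 maps every byte to a character, so here the damage is reversible.
repaired = mojibake.encode("latin-1").decode("utf-8")

# Pure-ASCII text encodes to identical bytes in ASCII, UTF-8, and these
# legacy encodings, which is what "subset" means in practice.
ascii_sample = "hello"
same_bytes = {
    ascii_sample.encode(name)
    for name in ("ascii", "utf-8", "latin-1", "cp1251", "koi8_r")
}

print(ascii(mojibake))        # escaped form, printable on any terminal
print(repaired == original)
print(len(same_bytes))
```

The round trip only works because nothing was lost along the way; once a wrong decoder substitutes or drops bytes, the original text generally cannot be recovered.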

We’ve long used Unicode as the internal format for all the text Google searches and processes: any other encoding is first converted to Unicode. Version 6.1 was just released with over 110,000 characters; soon we’ll be updating to that version and to Unicode’s locale data from CLDR 21 (both via ICU). The continued rise in the use of Unicode makes it even easier to do the processing for the many languages that we cover. Without it, our unified index would be nearly impossible. It’d be a bit like not being able to convert between the hundreds of currencies in the world; commerce would be, well, difficult. Thanks to Unicode, Google is able to help people find information in almost any language.
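The "convert everything to Unicode first" idea can be sketched as follows. This is not Google's pipeline: `to_unicode`, the sample pages, and the fallback policy are all hypothetical, and a real system would also have to detect the encoding rather than be told it.

```python
def to_unicode(raw, declared_encoding):
    """Hypothetical helper: decode bytes from a declared encoding into a str.

    Falls back to UTF-8 with U+FFFD replacement characters if the declared
    encoding is unknown or does not match the bytes (an assumed policy).
    """
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        return raw.decode("utf-8", errors="replace")


# Illustrative "pages", each stored in a different legacy encoding.
pages = [
    ("русский".encode("koi8_r"), "koi8_r"),
    ("中文".encode("gb2312"), "gb2312"),
    ("العربية".encode("cp1256"), "cp1256"),
]

# After conversion, every page is an ordinary Unicode string and can be
# handled by the same code regardless of where it came from.
unified = [to_unicode(raw, encoding) for raw, encoding in pages]

# One common encoding (UTF-8 here) is enough to store all of them.
stored = [text.encode("utf-8") for text in unified]
print([len(text) for text in unified])
```

Decoding at the boundary keeps encoding-specific logic in one place, so the rest of the processing only ever sees Unicode text.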


Source: http://googleblog.blogspot.com